Can I search ISSN, ISBN and DOI in a web-page, Not only URL? #280

Open
hwiorn opened this issue Mar 15, 2022 · 5 comments
Labels
cannon (Related to URL normalisation), frontend (Related to browser extension), source (Related to specific sources/modules/indexers)

Comments

@hwiorn

hwiorn commented Mar 15, 2022

I'm writing an indexer for org-roam and BibTeX that links org-roam notes to pages in the web browser.

Some org files have citation syntax like the one below.

:PROPERTIES:
:ID:       120cf393-9ec3-40b8-a486-d903036236f8
:ROAM_REFS: cite:Dong2018
:END:
#+TITLE: Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
#+CREATED: [2021-07-18 Sun 15:38]
#+filetags: :Literature:

- tags ::
- keywords ::
- author(s) :: Dong, Linhao and Xu, Shuang and Xu, Bo

The matching entry in the bib file looks like this.

@InProceedings{Dong2018,
  author     = {Dong, Linhao and Xu, Shuang and Xu, Bo},
  booktitle  = {2018 {IEEE} {International} {Conference} on {Acoustics}, {Speech} and {Signal} {Processing} ({ICASSP})},
  title      = {Speech-{Transformer}: {A} {No}-{Recurrence} {Sequence}-to-{Sequence} {Model} for {Speech} {Recognition}},
  year       = {2018},
  month      = apr,
  note       = {ZSCC: 0000311 ISSN: 2379-190X},
  pages      = {5884--5888},
  abstract   = {Recurrent sequence-to-sequence models using encoder-decoder architecture have made great progress in speech recognition task. However, they suffer from the drawback of slow training speed because the internal recurrence limits the training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model entirely relies on attention mechanisms to learn the positional dependencies, which can be trained faster with more efficiency. We also propose a 2D-Attention mechanism, which can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves competitive word error rate (WER) of 10.9\%, while the whole training process only takes 1.2 days on 1 GPU, significantly faster than the published results of recurrent sequence-to-sequence models.},
  doi        = {10.1109/ICASSP.2018.8462506},
  file       = {:Dong2018 - Speech Transformer_ a No Recurrence Sequence to Sequence Model for Speech Recognition.html:URL;:dong2018.pdf:PDF},
  issn       = {2379-190X},
  keywords   = {Hidden Markov models, Encoding, Training, Decoding, Speech recognition, Time-frequency analysis, Spectrogram, Speech Recognition, Sequence-to-Sequence, Attention, Transformer},
  shorttitle = {Speech-{Transformer}},
}

A BibTeX entry can have an ISBN, ISSN, DOI, or URL.

The indexer parses the BibTeX files first and links each URL to the ROAM_REFS and CUSTOM_ID of the Org file.
I think this works quite well.
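As a rough illustration of the BibTeX step, here is a minimal Python sketch that pulls the citation key and identifier fields out of one entry. The helper name `bibtex_identifiers` is hypothetical, and the regex only handles simple `field = {value}` lines without nested braces; a real indexer would more likely use a proper BibTeX parser.

```python
import re

# Assumption: fields look like `name = {value}` on a single line, with no nested braces.
FIELD_RE = re.compile(
    r'^\s*(doi|isbn|issn|url)\s*=\s*\{([^{}]*)\}',
    re.IGNORECASE | re.MULTILINE,
)
# Citation key, e.g. `Dong2018` in `@InProceedings{Dong2018,`.
KEY_RE = re.compile(r'@\w+\{\s*([^,\s]+)\s*,')


def bibtex_identifiers(entry: str) -> dict:
    """Hypothetical helper: return the citation key and identifier fields of one entry."""
    key_match = KEY_RE.search(entry)
    ids = {name.lower(): value.strip() for name, value in FIELD_RE.findall(entry)}
    return {"key": key_match.group(1) if key_match else None, "ids": ids}


# Shortened version of the entry shown above.
ENTRY = """@InProceedings{Dong2018,
  year       = {2018},
  doi        = {10.1109/ICASSP.2018.8462506},
  issn       = {2379-190X},
}"""

result = bibtex_identifiers(ENTRY)
```

The returned key is what would be matched against `cite:Dong2018` in ROAM_REFS, while the `ids` mapping carries whatever identifiers the entry has.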

However, some entries are books that have only an ISBN.
I think the Promnesia extension would need to scrape identifiers (ISBN, DOI) from the web page to link it to org-roam files.
Book sites other than Amazon Kindle provide the ISBN in the Open Graph meta tags of their pages.
But I'm not sure this is a good idea: it means either the Promnesia extension needs identifier parsers, or the indexer needs extra scraping.

Can I add identifier scraping for web pages to Promnesia? Would that be a good idea?

@karlicoss karlicoss added frontend Related to browser extension source Related to specific sources/modules/indexers labels Mar 17, 2022
@karlicoss
Owner

Hi! It's an interesting idea, definitely in the spirit of Promnesia!

For the backend it should be relatively easy, although it will require some rethinking because currently it's aimed mainly at URLs. But hopefully extracting ISBN/DOI is much easier than extracting URLs and should be a simple regex.

Possible problems I can think of are mainly on the frontend:

  • it's very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it's just a regex it should be pretty quick?
  • for regular URLs there is a natural DOM element (e.g. the <a> box around it) to attach the visited marks etc. The DOI would normally just be within the text, so it might require very hacky page modifications. On the other hand, maybe even without attaching any marks to the page, just having reference DOIs in the sidebar would already give people the most benefit.

But DOI detection could be opt-in to start with so I don't find these too concerning :)
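To make the "simple regex" idea concrete, here is a minimal Python sketch of plain-text detection. The DOI pattern is a commonly used approximation rather than a full grammar, the ISBN pattern is deliberately loose and relies on checksum validation to reject false positives, and the names `find_identifiers` and `isbn_is_valid` are hypothetical, not part of Promnesia.

```python
import re

# Approximate DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')
# Loose ISBN-10/ISBN-13 candidate pattern with optional hyphens or spaces.
ISBN_RE = re.compile(r'\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b')


def isbn_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 (mod 11) or ISBN-13 (mod 10) check digit."""
    digits = re.sub(r'[- ]', '', isbn).upper()
    if len(digits) == 10:
        if not re.fullmatch(r'\d{9}[\dX]', digits):
            return False
        total = sum((10 - i) * (10 if ch == 'X' else int(ch))
                    for i, ch in enumerate(digits))
        return total % 11 == 0
    if len(digits) == 13:
        if not digits.isdigit():
            return False
        total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                    for i, ch in enumerate(digits))
        return total % 10 == 0
    return False


def find_identifiers(text: str):
    """Hypothetical helper: return (dois, isbns) found in plain text."""
    # Trailing sentence punctuation is stripped because the DOI suffix class is permissive.
    dois = [m.group(0).rstrip('.,;)') for m in DOI_RE.finditer(text)]
    isbns = [m.group(0) for m in ISBN_RE.finditer(text) if isbn_is_valid(m.group(0))]
    return dois, isbns


TEXT = "See doi:10.1109/ICASSP.2018.8462506. ISBN 978-0-19-177626-7, not 978-0-19-177626-8."
dois, isbns = find_identifiers(TEXT)
```

The checksum step matters more than the pattern itself: any long digit run on a page looks like an ISBN candidate, and validation is what keeps the result set small.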

Let me know if you want any guidance, there might be some rough edges, especially with all the extension shenanigans.

And by the way you'll be very welcome in https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular and you might get some input from other people as well (you can login with github -- so won't need to create a new account!)

also related: #271

@sopoforic
Contributor

it's very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it's just a regex it should be pretty quick?

Depending on the site, these are very often in a <meta> tag. For example, this book has:

<meta content="9780191776267" property="book:isbn"/>
<meta content="10.1093/actrade/9780192840943.001.0001" name="dc.identifier"/>

An article from ACM similarly has:

<meta name="dc.Identifier" scheme="doi" content="10.1145/953051.801372">

This is typical for journal publishers' sites. It's less convenient if you're looking at other pages, but e.g. Abebooks has <meta itemprop="isbn" content="9781435127739" /> and Amazon has the ASIN scattered all over, including stuff like <input type="hidden" id="ASIN" name="ASIN" value="0385015836">, which shouldn't be hard to get at reliably.
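A minimal Python sketch of reading these `<meta>` tags with the standard library's `html.parser` follows. The label sets only cover the attribute values quoted in the examples above, the class and function names are hypothetical, and the Amazon ASIN `<input>` is not handled.

```python
from html.parser import HTMLParser

# Assumption: these are the only labels of interest; real sites use many more.
ISBN_LABELS = {"book:isbn", "isbn"}
DOI_LABELS = {"dc.identifier"}


class MetaIdentifierParser(HTMLParser):
    """Collect (kind, value) pairs from <meta> tags that carry an ISBN or DOI."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        content = attr.get("content", "").strip()
        if not content:
            return
        # The label may sit in property, name, or itemprop; compare case-insensitively.
        labels = {attr.get(n, "").lower() for n in ("property", "name", "itemprop")}
        if labels & ISBN_LABELS:
            self.found.append(("isbn", content))
        elif labels & DOI_LABELS and content.startswith("10."):
            self.found.append(("doi", content))


def extract_meta_identifiers(html: str):
    """Hypothetical helper: parse an HTML string and return the identifiers found."""
    parser = MetaIdentifierParser()
    parser.feed(html)
    return parser.found


HTML = """
<meta content="9780191776267" property="book:isbn"/>
<meta content="10.1093/actrade/9780192840943.001.0001" name="dc.identifier"/>
<meta name="dc.Identifier" scheme="doi" content="10.1145/953051.801372">
<meta itemprop="isbn" content="9781435127739" />
"""
found = extract_meta_identifiers(HTML)
```

In the browser extension the equivalent would be a DOM query over `meta` elements; the Python version is only meant to show how little logic is involved.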

@sopoforic
Contributor

sopoforic commented Mar 18, 2022

For the backend it should be relatively easy, although it will require some rethinking because currently it's aimed mainly at URLs.

However, I do get orig_urls from hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678 or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.
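A minimal Python sketch of what emitting "something sensible" could look like is below. The function `identifier_to_uri` is hypothetical and is not Promnesia's canonifier; the `urn:isbn:` and `doi:` forms follow the examples in this comment, and lowercasing relies on DOI names being case-insensitive.

```python
import re


def identifier_to_uri(kind: str, value: str) -> str:
    """Hypothetical normaliser: map an ISBN or DOI to a stable URI-like key."""
    kind = kind.lower()
    if kind == "isbn":
        # Drop hyphens and spaces so differently formatted ISBNs compare equal.
        return "urn:isbn:" + re.sub(r"[\s-]", "", value).upper()
    if kind == "doi":
        # Strip a resolver URL or an existing "doi:" prefix, then lowercase.
        doi = re.sub(
            r"^(?:https?://(?:dx\.)?doi\.org/|doi:)",
            "",
            value.strip(),
            flags=re.IGNORECASE,
        )
        return "doi:" + doi.lower()
    raise ValueError(f"unsupported identifier kind: {kind}")


isbn_uri = identifier_to_uri("isbn", "978-0-19-177626-7")
doi_uri = identifier_to_uri("doi", "https://doi.org/10.1109/ICASSP.2018.8462506")
```

The sketch treats ISBN-10 and ISBN-13 forms of the same book as different keys; converting between them would be a further normalisation step.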

@karlicoss
Owner

Right -- I guess this is because the URL extractor is on the relaxed side: we'd rather detect some non-URLs than not detect some URLs, since extra broken URLs only result in minor database bloat.
So if there is a separate DOI/ISBN extractor and it works, we should be fine without having to mess with URL extractor. Or we could just detect DOIs first and then subtract them from the URL set.
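The "detect DOIs first, then subtract" option could be sketched in Python as follows. Both regexes are simplified stand-ins (the relaxed URL pattern here is an assumption, not Promnesia's extractor), and `split_identifiers` is a hypothetical name.

```python
import re

# Simplified stand-ins for the real extractors.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')
URL_RE = re.compile(r'https?://[^\s<>"]+')


def split_identifiers(text: str):
    """Hypothetical helper: return (dois, urls), with DOI links removed from urls."""
    dois = {m.group(0).rstrip('.,;)') for m in DOI_RE.finditer(text)}
    urls = {m.group(0).rstrip('.,;)') for m in URL_RE.finditer(text)}
    # A URL that ends with a detected DOI (e.g. a doi.org resolver link) is
    # treated as that DOI and dropped from the URL set.
    doi_urls = {u for u in urls if any(u.endswith(d) for d in dois)}
    return dois, urls - doi_urls


TEXT = "Paper: https://doi.org/10.1145/953051.801372 and site https://example.com/page."
dois, urls = split_identifiers(TEXT)
```

Keeping the two extractors separate, as suggested above, means the URL pattern can stay relaxed while DOI handling evolves independently.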

@hwiorn
Author

hwiorn commented Mar 25, 2022

Let me know if you want any guidance, there might be some rough edges, especially with all the extension shenanigans.

And by the way you'll be very welcome in https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular and you might get some input from other people as well (you can login with github -- so won't need to create a new account!)

Actually, I'm already in the memex chat. But I don't have enough time to work on an implementation right now because of work.
The web page metadata (tags) I mentioned is what @sopoforic described. I don't think it is a good idea to parse the whole DOM and HTML with regexes, since such an extractor can easily bloat. But there will always be exceptions, so site-specific parsers (extractors) may sometimes be needed.

However, I do get orig_urls from hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678 or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.

Using urn:isbn:0123456789 or urn:doi:10.1234/5678 is a good idea.

@karlicoss karlicoss added the cannon Related to URL normalisation label Jan 25, 2023
3 participants