
Introduce Web-scraping inside JabRef #11093

Open
koppor opened this issue Mar 25, 2024 · 4 comments

@koppor
Member

koppor commented Mar 25, 2024

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have two kinds of fetchers: those using an API key and those using screen scraping. Most of the screen scrapers no longer work, mostly because of Cloudflare. We should switch to browser-based screen scraping.

JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.


The Java Chromium Embedded Framework (JCEF) might help here. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how it is used.


PR #7075 attempted to display the Google Scholar captchas in JabRef, but it was not completed. This issue asks to rewrite the fetchers so that they use JCEF instead of URLDownload.

Note that this is different from #11093, which asks for a new UI.

Here, the fetchers should be able to run stand-alone, without user interaction.
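One way to read the stand-alone requirement is that the parsing part of a fetcher should not care how the HTML was obtained. The following is only a sketch under that assumption: `PageSource` and `findPdfLink` are hypothetical names, not existing JabRef or JCEF API, and the regular expression is a deliberately crude stand-in for real HTML parsing.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageSourceSketch {

    /** Hypothetical abstraction: returns the HTML of a page, however it was obtained. */
    interface PageSource {
        String html(String url) throws Exception;
    }

    /** Hypothetical parsing step: looks for the first link ending in ".pdf". */
    static Optional<String> findPdfLink(PageSource source, String url) throws Exception {
        Matcher matcher = Pattern.compile("href=\"([^\"]+\\.pdf)\"").matcher(source.html(url));
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for a real page source (plain download or embedded browser).
        Map<String, String> pages = Map.of(
                "https://example.org/paper", "<a href=\"/files/paper.pdf\">PDF</a>");
        PageSource stub = url -> pages.getOrDefault(url, "");
        System.out.println(findPdfLink(stub, "https://example.org/paper"));
    }
}
```

With such a split, a URLDownload-based source and a browser-based source could be swapped without touching the parsing code, and tests could keep using an in-memory stub.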


Affected fetchers:

  • ACS: org.jabref.logic.importer.fetcher.ACS
  • Google Scholar: org.jabref.logic.importer.fetcher.GoogleScholar
  • IACR: org.jabref.logic.importer.fetcher.IacrEprintFetcher
  • JSTOR: org.jabref.logic.importer.fetcher.JstorFetcher
  • ResearchGate: org.jabref.logic.importer.fetcher.ResearchGate
  • ScienceDirect: org.jabref.logic.importer.fetcher.ScienceDirect
  • SpringerLink: org.jabref.logic.importer.fetcher.SpringerLink

Some of these fetchers use an API. In that case, findFullText is the only method that handles HTML.

@Siedlerchr
Member

Works now, was probably a temporary glitch

@Siedlerchr
Member

I checked the BibDesk code:
They basically use a Safari-based view controller and a simple XPath query to check for matching links in the document's DOM. The parsing itself is very similar to our existing fetcher infrastructure.
I experimented a bit with JavaFX's WebView. While it can display websites and even captchas, e.g. on Google Scholar,
I was not yet able to get the correct DOM after clicking on a link on the page. This would require some further testing.
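To illustrate the XPath idea, here is a minimal sketch using only the JDK's built-in XML and XPath classes. The class name, the method name, and the query are assumptions for illustration; this is not BibDesk's actual query. It also only works on well-formed XHTML, so real-world pages would need a tolerant HTML parser or the DOM handed over by the browser component.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CitationLinkFinder {

    // Hypothetical query: the href of every <a> element whose target contains ".bib".
    static final String BIB_LINKS = "//a[contains(@href, '.bib')]/@href";

    /** Parses well-formed XHTML and returns the values of all nodes matched by the XPath. */
    static List<String> findLinks(String xhtml, String xpathExpression) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList hits = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpathExpression, dom, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            links.add(hits.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"/paper/42.bib\">BibTeX</a>"
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(findLinks(page, BIB_LINKS));
    }
}
```

The matching links would then be handed to the existing parsing code, in line with the observation that the parsing is very similar to our current fetchers.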

@koppor
Member Author

koppor commented May 27, 2024

@ThiloteE
Member

ThiloteE commented May 27, 2024

When it comes to scraping, I have seen jsoup mentioned a lot: https://jsoup.org/
See also https://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program

Projects
Status: Free to take
Status: Normal priority