Introduce Web-scraping inside JabRef #11093

koppor · 2024-03-25T12:21:51Z

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words: we have fetchers that use an API key and fetchers that rely on screen scraping. The screen scrapers mostly don't work, mainly because of CloudFlare. We should switch to browser-based screen scraping.

JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.


Maybe the Java Chromium Embedded Framework (JCEF) helps. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how to use it.

PR #7075 attempted to display the Google Scholar captchas in JabRef, but it was not completed. This issue proposes rewriting the fetchers to use JCEF instead of URLDownload.

Note that this is different from #11093. There, a new UI is requested.

Here, the fetchers should be able to run stand-alone without user interaction.

Affected fetchers:

ACS: org.jabref.logic.importer.fetcher.ACS
Google Scholar: org.jabref.logic.importer.fetcher.GoogleScholar
IACR: org.jabref.logic.importer.fetcher.IacrEprintFetcher
JSTOR: org.jabref.logic.importer.fetcher.JstorFetcher
ResearchGate: org.jabref.logic.importer.fetcher.ResearchGate
ScienceDirect: org.jabref.logic.importer.fetcher.ScienceDirect
SpringerLink: org.jabref.logic.importer.fetcher.SpringerLink

Some of these fetchers use the API. In that case, findFullText is the only method that handles HTML.
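One possible way to swap URLDownload for a browser-based loader without touching the parsing code is to hide page retrieval behind a small interface. The following is only a sketch under stated assumptions: `PageLoader`, `PdfLinkFetcher`, and the simplified `findFullText(URI)` signature are hypothetical names made up for illustration and are not JabRef's actual API. The regular expression is a crude stand-in for real HTML parsing.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoaderSketch {

    /** Hypothetical abstraction: anything that can turn a URL into HTML. */
    @FunctionalInterface
    public interface PageLoader {
        String load(URI uri);
    }

    /** Hypothetical fetcher that depends only on PageLoader, not on the transport. */
    public static class PdfLinkFetcher {
        // Crude pattern for illustration; matches double-quoted href values ending in .pdf
        private static final Pattern PDF_LINK = Pattern.compile("href=\"([^\"]+\\.pdf)\"");
        private final PageLoader loader;

        public PdfLinkFetcher(PageLoader loader) {
            this.loader = loader;
        }

        /** Returns the first PDF link on the landing page, resolved against the page URL. */
        public Optional<URI> findFullText(URI landingPage) {
            String html = loader.load(landingPage);
            Matcher matcher = PDF_LINK.matcher(html);
            return matcher.find()
                    ? Optional.of(landingPage.resolve(matcher.group(1)))
                    : Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A canned loader stands in for a plain HTTP download or an embedded browser.
        URI article = URI.create("https://example.org/article/1");
        Map<URI, String> pages = Map.of(article, "<a href=\"/files/paper.pdf\">PDF</a>");
        PageLoader canned = uri -> pages.getOrDefault(uri, "");
        System.out.println(new PdfLinkFetcher(canned).findFullText(article));
    }
}
```

With such a split, a JCEF-backed or WebView-backed loader could be plugged in later, and the fetchers could still be tested stand-alone with canned HTML.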


Siedlerchr · 2024-03-25T12:49:41Z

Works now, was probably a temporary glitch

Siedlerchr · 2024-05-17T22:56:14Z

I checked the BibDesk code:
They basically use a Safari-based view control and a simple XPath query to check for matching links in the document's DOM. The parsing itself is very similar to our existing fetcher infrastructure.
I experimented a bit with JavaFX's WebView. It can display websites and even captchas, e.g. on Google Scholar,
but I was not yet able to get the correct DOM after clicking on some page. This would require some further testing.
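To illustrate the XPath idea, here is a minimal sketch using only the JDK's `javax.xml.xpath` API. It assumes a DOM is already available; in this example it is parsed from a small well-formed XHTML string, whereas in a WebView the `org.w3c.dom.Document` would come from `WebEngine.getDocument()`. The class name `CitationLinkFinder` and the `scholar.bib` marker are assumptions chosen for illustration, not taken from BibDesk or JabRef.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CitationLinkFinder {

    /** Returns the href of every anchor whose href contains the given marker. */
    public static List<String> findLinks(String xhtml, String marker) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // The marker is concatenated into the query, so this sketch
            // only supports trusted markers without single quotes.
            String query = "//a[contains(@href, '" + marker + "')]/@href";
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(query, dom, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                links.add(hits.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath query", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href='/scholar.bib?q=info:abc'>Import into BibTeX</a>"
                + "<a href='/about'>About</a>"
                + "</body></html>";
        System.out.println(findLinks(page, "scholar.bib"));
    }
}
```

Real pages are rarely well-formed XML, so the string parsing step here only works for the toy input; the XPath part applies to any DOM the browser component hands over.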

koppor · 2024-05-27T21:10:48Z

Related work: https://github.com/HtmlUnit/htmlunit?tab=readme-ov-file#getting-started

ThiloteE · 2024-05-27T21:17:51Z

When it comes to scraping, I have seen JSoup being mentioned a lot: https://jsoup.org/
See also https://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program

koppor added ui fetcher labels Mar 25, 2024

koppor mentioned this issue Mar 25, 2024

Make fetchers web-based koppor/jabref#683

Closed

This was referenced Jun 13, 2024

feature request: add new websites to web search (Google Scholar, Nature, Science) #10263

Open

"Download linked file" option creates an html file instead of downloading the pdf on Windows 10. #10149

Closed

Siedlerchr mentioned this issue Aug 27, 2024

ACS Jsoup fetch runs into 403: Forbidden #11682

Open

2 tasks

koppor mentioned this issue Sep 7, 2024

Enable JCEF koppor/jabref#695

Draft
