BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters #67

GerHobbelt · 2019-09-04T00:27:11Z

Related: #54, #56, etc.

Related commits:

SHA-1: f737ebe

WARNING: a PDF URI does not have to include a PDF extension!

Case in point:
```
    https://pubs.acs.org/doi/pdf/10.1021/ed1010618?rand=zf7t0csx
```
is an example of such a URI: this URI references a PDF but DOES NOT contain the string ".pdf" itself!
SHA-1: ae42133

AddNewDocumentToLibraryFromInternet_*() APIs: some nasty/ill-configured servers don't produce a legal Content-Type header, or don't provide that header at all -- which made Qiqqa barf a hairball instead of properly attempting to import the downloaded PDF.

Also don't yak about images which are downloaded as part of Google search pages, etc.: these content-types now make it through part of the PDF import code as we cannot rely on the Content-Type header being valid or present, hence we need to be very lenient about what we accept as "potentially a PDF document" to inspect before importing.

Fixes: Qiqqa: crashes/fails to import PDF from ill-configured servers #63
SHA-1: 26b581f

remaining work for BibTeX Sniffer: clicking on a PDF which returns a HTTP 302 Redirect response is not picked up #56 / BibTeX Sniffer: Sometimes clicking on a PDF does show the PDF but DOES NOT import it in your library #54 -- catch some nasty PDF URIs which weren't recognized as such before. Right now we're pretty aggressive as we fetch almost everything that crosses our path; once fetched we check if it's actually a valid PDF file after all. CiteSeerX and other sites now deliver once again...
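
The "fetch first, verify afterwards" idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not Qiqqa's actual C# code: the function name `looks_like_pdf` and the `scan_limit` parameter are assumptions made up for this example. It ignores the URL and the Content-Type header entirely and looks for the `%PDF-` signature near the start of the downloaded bytes.

```python
# Hypothetical helper, not part of Qiqqa: decide whether a downloaded
# payload is "potentially a PDF" by looking at its content only.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, scan_limit: int = 1024) -> bool:
    """Return True when the PDF signature appears near the start of the payload.

    The signature normally sits at offset 0. Scanning a small prefix
    (scan_limit is an assumed, arbitrary bound) tolerates servers that
    prepend a few stray bytes such as blank lines.
    """
    return PDF_MAGIC in data[:scan_limit]
```

With a check like this, an image or HTML page that slipped through the lenient Content-Type handling is simply rejected after download instead of being imported.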
SHA-1: b3f1f2d

fixes for BibTeX Sniffer: clicking on a PDF which returns a HTTP 302 Redirect response is not picked up #56 ; also ensuring every document that's fetched off the Internet is opened in Qiqqa for review/editing (some PDF documents were silently downloaded and then dumped into the Guest Library for no apparent reason, and you'd have to go around and check whether the stuff actually arrived in a library of yours). :'-(
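
For readers unfamiliar with the redirect problem from #56, here is a minimal Python sketch of following a 302 chain by hand. It is an assumption-laden illustration, not the Qiqqa fix: `fetch` is a hypothetical callable supplied by the caller that returns `(status, headers, body)`, and the header key is assumed to be spelled `Location`.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(url, fetch, max_hops=10):
    """Follow HTTP redirects until a non-redirect response arrives.

    fetch(url) is a hypothetical injected callable returning
    (status, headers, body); headers is a dict with a 'Location' key
    on redirect responses.
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, body
    raise RuntimeError("too many redirects")
```

The returned URL is the final one, which is the address that should be recorded as the document's source.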
SHA-1: 5bdd2ea

HTTP/HTTPS web grab of PDF files: we don't care which TLS/SSL protocol is required, we should just grab the PDF and not bother. Some websites require TLS2 while today I ran into a website which requires old SSL (not TLS): make sure they're all turned ON.
SHA-1: e877a36

fix crash in "grab/download all PDF files which are available on this page" webbrowser toolbar button functionality: the code crashed on relative URIs being fed into new Uri(url) code lines. Now the code copes correctly with both absolute and relative URIs and also corrupt/invalid URIs don't crash the grab-extractor code any more. Also improved the check for any URI found in the page being a PDF file a little: check for ".pdf" rather than "pdf": this will prevent us from trying not-a-pdf-file URIs such as "http://www.example.com/blog-about-pdf".
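
The link-grabbing behaviour described in this commit message can be sketched as below. This is a hedged Python illustration only (the real code is C# and uses `new Uri(url)`); `pdf_link_candidates` is a name invented for this example. It resolves relative links against the page URL, skips links that cannot be parsed, and keeps only those containing ".pdf".

```python
from urllib.parse import urljoin


def pdf_link_candidates(page_url, hrefs):
    """Return absolute URLs from hrefs that look like PDF links.

    Hypothetical helper: relative links are resolved against page_url,
    unparsable links are skipped instead of raising, and the '.pdf'
    substring test mirrors the heuristic described in the commit message.
    """
    candidates = []
    for href in hrefs:
        try:
            absolute = urljoin(page_url, href.strip())
        except ValueError:
            # e.g. a malformed host such as "http://[broken"
            continue
        if ".pdf" in absolute.lower():
            candidates.append(absolute)
    return candidates
```

Note that this substring heuristic would still miss a URL like the pubs.acs.org example at the top of this issue, which is why the fetch-and-verify path remains necessary for clicked links.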

SHA-1: d8cf09f & SHA-1: 7935018

further fiddling with the weird download issue reported in commit SHA-1: 7935018 --> had a look if and how Chrome browser does it. It succeeds, with these headers:

General:

Request URL: https://ora.ox.ac.uk/objects/uuid:49e0183f-277e-486c-87bc-17097cbef0b3/download_file?file_format=pdf&safe_filename=fmcad2012.pdf&type_of_work=Conference+item
Request Method: GET
Status Code: 200 OK
Remote Address: 127.0.0.1:8118
Referrer Policy: no-referrer-when-downgrade

Response Headers:

Cache-Control: private
Content-Disposition: inline; filename="fmcad2012.pdf"
Content-Transfer-Encoding: binary
Content-Type: application/pdf
Date: Sat, 17 Aug 2019 23:06:36 GMT
Referrer-Policy: strict-origin-when-cross-origin
Server: Apache/2.4.34 (Red Hat)
Status: 200 OK
Strict-Transport-Security: max-age=15768000; includeSubDomains
Transfer-Encoding: chunked
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Frame-Options: SAMEORIGIN
X-Permitted-Cross-Domain-Policies: none
X-Powered-By: Phusion Passenger 6.0.2
X-Request-Id: 4f9a59d9-dcc9-49d1-8216-970461db251d
X-Runtime: 0.063538
X-XSS-Protection: 1; mode=block

Request Headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,nl;q=0.8,de;q=0.7
Cache-Control: no-cache
Connection: keep-alive
Host: ora.ox.ac.uk
Pragma: no-cache
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36

Query Strings:

file_format: pdf
safe_filename: fmcad2012.pdf
type_of_work: Conference item

Trying to tackle a very weird PDF download problem, which doesn't go away.

HTR (how to reproduce):

Sniffer search on Scholar: "Deciding Floating point logic with systematic abstraction"
now click the first PDF link (ora.ox.ac.uk) to download the PDF.

This will silently FAIL! (same error as mentioned here: https://stackoverflow.com/questions/21728773/the-underlying-connection-was-closed-an-unexpected-error-occurred-on-a-receiv and here https://stackoverflow.com/questions/21481682/httpwebrequest-the-underlying-connection-was-closed-the-connection-was-closed )

Link: https://ora.ox.ac.uk/objects/uuid:49e0183f-277e-486c-87bc-17097cbef0b3/download_file?file_format=pdf&safe_filename=fmcad2012.pdf&type_of_work=Conference+item
open the same search in the "Google.com" tab: this will show a PDF entry as a result further down.
download that one without trouble.

Link: http://www.cs.ox.ac.uk/people/leopold.haller/papers/fmcad2012.pdf

SHA-1: c28eb11

fix BibTeX Sniffer: Sometimes clicking on a PDF does show the PDF but DOES NOT import it in your library #54 in GoogleBibTexSnifferControl

Gecko these days crashes on ContentDispositionXXXX member accesses: Exception thrown: 'System.Runtime.InteropServices.COMException' in Geckofx-Core.dll

I'm not sure why; the only change I know of is an update of MSVS2019. :-S
SHA-1: 7bf0c72

re-added the 'Add This PDF to Library' button in the browser; TODO: make it work akin to the <embed> handling to prevent confusion: when the browser shows a single PDF, it MAY be an <embed> web page and we should account for that!
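
To illustrate the `<embed>` case mentioned in the TODO, the following Python sketch pulls the embedded document's address out of an HTML wrapper page. It is not taken from Qiqqa (which drives an embedded Gecko browser); `EmbedFinder` and `embedded_sources` are hypothetical names, and the sketch assumes the wrapper page uses a plain `<embed src="...">` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbedFinder(HTMLParser):
    """Collect the src of every <embed> element, resolved to an absolute URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "embed":
            return
        src = dict(attrs).get("src")
        if src:
            self.sources.append(urljoin(self.page_url, src))


def embedded_sources(page_url, html):
    """Return the absolute URLs of all <embed> sources found in html."""
    finder = EmbedFinder(page_url)
    finder.feed(html)
    return finder.sources
```

The URLs found this way would then be downloaded and verified like any other candidate, rather than importing the wrapper page itself.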
SHA-1: 4c3b1ed
- fix crash in PDF import when website/webserver does not provide a Content-Disposition HTTP response header
- add ability to cope with <embed> PDF links, e.g. when an HTML page is shown with the PDF embedded instead of the PDF itself
- detect PDF files in URLs which have query parameters: '.pdf' is not always at the end of the download URL
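
One way to cope with a missing Content-Disposition header and with '.pdf' hiding inside query parameters is sketched below. This is an assumed Python illustration rather than the actual fix: `filename_for_download` and its fallback order (header, then query values, then last path segment) are choices made for this example.

```python
from email.message import Message
from posixpath import basename
from urllib.parse import parse_qs, unquote, urlsplit


def filename_for_download(url, content_disposition=None):
    """Pick a file name for a downloaded PDF (hypothetical helper).

    Order of preference: the Content-Disposition filename when the header
    is present, then any query parameter value ending in '.pdf', then the
    last path segment of the URL with '.pdf' appended if needed.
    """
    if content_disposition:
        header = Message()
        header["Content-Disposition"] = content_disposition
        name = header.get_filename()
        if name:
            return name
    parts = urlsplit(url)
    for values in parse_qs(parts.query).values():
        for value in values:
            if value.lower().endswith(".pdf"):
                return value
    name = basename(unquote(parts.path)) or "download"
    return name if name.lower().endswith(".pdf") else name + ".pdf"
```

Applied to the two URLs quoted in this issue, the header-less fallback yields `fmcad2012.pdf` for the ora.ox.ac.uk link and `ed1010618.pdf` for the pubs.acs.org link.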

GerHobbelt · 2019-09-05T17:07:56Z

Closing (and decluttering the issue list so it stays workable for me): fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.

GerHobbelt mentioned this issue Sep 4, 2019

Qiqqa: crashes/fails to import PDF from ill-configured servers #63

Closed

GerHobbelt changed the title ~~BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters~~ ✅BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters Sep 4, 2019

GerHobbelt closed this as completed Sep 5, 2019

GerHobbelt added the 🐛bug and 🦸‍♀️enhancement🦸‍♂️ labels Oct 4, 2019

GerHobbelt changed the title ~~✅BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters~~ BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters Oct 4, 2019

GerHobbelt added this to the v82 milestone Oct 4, 2019

GerHobbelt mentioned this issue Oct 5, 2019

v82pre: some PDFs are downloaded twice from Sniffer #83

Closed

GerHobbelt mentioned this issue Dec 10, 2019

upgrade the embedded browser (xulrunner) to the latest version #2

Open

GerHobbelt mentioned this issue May 26, 2021

Unable to flush xulrunner cache - sniffer #330

Open
