BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters #67

Closed
GerHobbelt opened this issue Sep 4, 2019 · 1 comment

@GerHobbelt
Collaborator

GerHobbelt commented Sep 4, 2019

Related: #54, #56, etc.

Related commits:

  • SHA-1: f737ebe

    WARNING: a PDF URI does not have to include a PDF extension!

    Case in point:

        https://pubs.acs.org/doi/pdf/10.1021/ed1010618?rand=zf7t0csx
    

    is an example of such a URI: this URI references a PDF but DOES NOT contain the string ".pdf" itself!

  • SHA-1: ae42133

    AddNewDocumentToLibraryFromInternet_*() APIs: some nasty/ill-configured servers don't produce a legal Content-Type header, or don't provide that header at all -- which made Qiqqa barf a hairball instead of properly attempting to import the downloaded PDF.

    Also don't yak about images which are downloaded as part of Google search pages, etc.: these content-types now make it through part of the PDF import code as we cannot rely on the Content-Type header being valid or present, hence we need to be very lenient about what we accept as "potentially a PDF document" to inspect before importing.

    Fixes: Qiqqa: crashes/fails to import PDF from ill-configured servers  #63
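A minimal sketch of tolerant Content-Type handling, written in Python rather than taken from Qiqqa's own code. The function names `media_type` and `classify_content_type` are hypothetical. The point it illustrates is that a missing or illegal header yields "unknown" instead of an exception, and that no classification is used to reject a download outright, since the body still has to be inspected.

```python
from typing import Optional

# Assumption: these two media types are treated as "the server claims PDF".
PDF_MEDIA_TYPES = {"application/pdf", "application/x-pdf"}


def media_type(content_type: Optional[str]) -> Optional[str]:
    """Return the lower-cased media type, or None if the header is absent or malformed."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8".
    value = content_type.split(";", 1)[0].strip().lower()
    # A legal media type looks like "type/subtype" and contains no whitespace.
    parts = value.split("/")
    if len(parts) != 2 or not all(parts) or any(c.isspace() for c in value):
        return None
    return value


def classify_content_type(content_type: Optional[str]) -> str:
    """Classify the header as 'pdf', 'other' or 'unknown'; never raises."""
    mt = media_type(content_type)
    if mt is None:
        return "unknown"  # header missing or illegal: the body decides
    if mt in PDF_MEDIA_TYPES:
        return "pdf"
    return "other"  # e.g. image/png: only a hint, the body is still inspected
```

Under this sketch the header only ever serves as a hint for logging or ordering; the decision to import rests on the downloaded bytes.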

  • SHA-1: 26b581f

    remaining work for BibTeX Sniffer: clicking on a PDF which returns a HTTP 302 Redirect response is not picked up #56 / BibTeX Sniffer: Sometimes clicking on a PDF does show the PDF but DOES NOT import it in your library #54 -- catch some nasty PDF URIs which weren't recognized as such before. Right now we're pretty aggressive as we fetch almost everything that crosses our path; once fetched we check if it's actually a valid PDF file after all. CiteSeerX and other sites now deliver once again...
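The "fetch first, validate afterwards" approach can be sketched as a check on the downloaded bytes. This is an illustration in Python, not Qiqqa's actual validation: the name `looks_like_pdf` is hypothetical, and the 1024-byte search window is an assumption modelled on lenient PDF readers that accept some leading junk before the `%PDF-` marker.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Return True if the PDF header marker occurs within the first `window` bytes.

    This is a cheap sanity check only; it does not prove the file parses as a PDF.
    """
    return PDF_MAGIC in data[:window]
```

With a check like this, a URL such as https://pubs.acs.org/doi/pdf/10.1021/ed1010618?rand=zf7t0csx needs no ".pdf" in it: whatever comes back is judged by its contents, and HTML pages or images that were fetched along the way are simply discarded.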

  • SHA-1: b3f1f2d

    fixes for BibTeX Sniffer: clicking on a PDF which returns a HTTP 302 Redirect response is not picked up #56 ; also ensuring every document that's fetched off the Internet is opened in Qiqqa for review/editing (some PDF documents were silently downloaded and then dumped into the Guest Library just because, and you'd have to go around and check to see the stuff actually arrived in a library of yours). :'-(

  • SHA-1: 5bdd2ea

    HTTP/HTTPS web grab of PDF files: we don't care which TLS/SSL protocol is required, we should just grab the PDF and not bother. Some websites require TLS 1.2 while today I ran into a website which requires old SSL (not TLS): make sure they're all turned ON.

  • SHA-1: e877a36

    fix crash in the "grab/download all PDF files which are available on this page" web browser toolbar button functionality: the code crashed when relative URIs were fed into new Uri(url) calls. Now the code copes correctly with both absolute and relative URIs, and corrupt/invalid URIs don't crash the grab-extractor code any more. Also slightly improved the check for whether a URI found in the page is a PDF file: check for ".pdf" rather than "pdf", which prevents us from trying not-a-pdf-file URIs such as "http://www.example.com/blog-about-pdf".
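The link handling described in this commit can be sketched as follows. This is a Python illustration under assumptions, not the C# code from the commit: `resolve_link` and `mentions_pdf_extension` are hypothetical names, and restricting results to http/https is an added assumption. Relative links are resolved against the page URL, and links that cannot be parsed are skipped instead of raising.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url: str, href: str) -> Optional[str]:
    """Resolve `href` against `page_url`; return None for unusable links."""
    try:
        absolute = urljoin(page_url, href.strip())
        parts = urlsplit(absolute)
    except ValueError:
        # urllib raises ValueError for some corrupt URIs, e.g. unbalanced "[".
        return None
    # Assumption: only http(s) links are worth fetching.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return absolute


def mentions_pdf_extension(url: str) -> bool:
    """Check for ".pdf" rather than "pdf", so ".../blog-about-pdf" is not a match."""
    return ".pdf" in url.lower()
```

A caller would run every link found in the page through `resolve_link` first and drop the `None` results before applying `mentions_pdf_extension`.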

  • SHA-1: d8cf09f & SHA-1: 7935018

    further fiddling with the weird download issue reported in commit SHA-1: 7935018 --> had a look if and how Chrome browser does it. It succeeds, with these headers:

    General:
    
    Request URL: https://ora.ox.ac.uk/objects/uuid:49e0183f-277e-486c-87bc-17097cbef0b3/download_file?file_format=pdf&safe_filename=fmcad2012.pdf&type_of_work=Conference+item
    Request Method: GET
    Status Code: 200 OK
    Remote Address: 127.0.0.1:8118
    Referrer Policy: no-referrer-when-downgrade
    
    Response Headers:
    
    Cache-Control: private
    Content-Disposition: inline; filename="fmcad2012.pdf"
    Content-Transfer-Encoding: binary
    Content-Type: application/pdf
    Date: Sat, 17 Aug 2019 23:06:36 GMT
    Referrer-Policy: strict-origin-when-cross-origin
    Server: Apache/2.4.34 (Red Hat)
    Status: 200 OK
    Strict-Transport-Security: max-age=15768000; includeSubDomains
    Transfer-Encoding: chunked
    X-Content-Type-Options: nosniff
    X-Download-Options: noopen
    X-Frame-Options: SAMEORIGIN
    X-Permitted-Cross-Domain-Policies: none
    X-Powered-By: Phusion Passenger 6.0.2
    X-Request-Id: 4f9a59d9-dcc9-49d1-8216-970461db251d
    X-Runtime: 0.063538
    X-XSS-Protection: 1; mode=block
    
    Request Headers:
    
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.9,nl;q=0.8,de;q=0.7
    Cache-Control: no-cache
    Connection: keep-alive
    Host: ora.ox.ac.uk
    Pragma: no-cache
    Sec-Fetch-Mode: navigate
    Sec-Fetch-Site: none
    Sec-Fetch-User: ?1
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
    
    Query Strings:
    
    file_format: pdf
    safe_filename: fmcad2012.pdf
    type_of_work: Conference item
    
    

    • Trying to tackle a very weird PDF download problem, which doesn't go away.

    HTR:

  • SHA-1: c28eb11

    fix BibTeX Sniffer: Sometimes clicking on a PDF does show the PDF but DOES NOT import it in your library #54 in GoogleBibTexSnifferControl

    Gecko these days crashes on ContentDispositionXXXX member accesses: Exception thrown: 'System.Runtime.InteropServices.COMException' in Geckofx-Core.dll

    I'm not sure why; the only change I know of is an update of MSVS2019. :-S

  • SHA-1: 7bf0c72

    re-added the 'Add This PDF to Library' button in the browser; TODO: make it work akin to the <embed> handling to prevent confusion: when the browser shows a single PDF, it MAY be an <embed> web page and we should account for that!

  • SHA-1: 4c3b1ed

    • fix crash in PDF import when website/webserver does not provide a Content-Disposition HTTP response header
    • add ability to cope with <embed> PDF links, e.g. when an HTML page is shown with the PDF embedded instead of the PDF itself
    • detect PDF files in URLs which have query parameters: '.pdf' is not always at the end of the URL of the file to download
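The last two points of the list above can be illustrated with a short Python sketch. It is not Qiqqa's implementation: `has_pdf_name`, `EmbedCollector` and `embedded_pdf_urls` are hypothetical names, and accepting an `<embed>` either by its `type` attribute or by its `src` is an assumption. The URL check looks at the path and at every query parameter value, so a ".pdf" hidden behind `?safe_filename=...` is still found.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import parse_qsl, urljoin, urlsplit


def has_pdf_name(url: str) -> bool:
    """True if the URL path or any query parameter value ends in ".pdf"."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # corrupt URL: treat as "no hint"
    if parts.path.lower().endswith(".pdf"):
        return True
    return any(value.lower().endswith(".pdf") for _, value in parse_qsl(parts.query))


class EmbedCollector(HTMLParser):
    """Collect the src of every <embed> that plausibly points at a PDF."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "embed":
            return
        attr = dict(attrs)
        src = attr.get("src")
        mime = (attr.get("type") or "").lower()
        if src and (mime == "application/pdf" or has_pdf_name(src)):
            self.sources.append(src)


def embedded_pdf_urls(page_url: str, html: str) -> List[str]:
    """Return absolute URLs of PDFs embedded in `html`, resolved against `page_url`."""
    collector = EmbedCollector()
    collector.feed(html)
    urls = []
    for src in collector.sources:
        try:
            urls.append(urljoin(page_url, src))
        except ValueError:
            continue  # skip sources that cannot be resolved
    return urls
```

Note that `has_pdf_name` still returns False for https://pubs.acs.org/doi/pdf/10.1021/ed1010618?rand=zf7t0csx, which is why a URL check like this can only be a hint and never the sole test.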
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Sep 4, 2019
WARNING: a PDF URI does *not* have to include a PDF extension!

Case in point:

    https://pubs.acs.org/doi/pdf/10.1021/ed1010618?rand=zf7t0csx

is an example of such a URI: this URI references a PDF but DOES NOT contain the string ".pdf" itself!
@GerHobbelt GerHobbelt changed the title BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters ✅BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters Sep 4, 2019
@GerHobbelt
Collaborator Author

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 2, 2019
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 3, 2019
@GerHobbelt GerHobbelt added 🐛bug Something isn't working 🦸‍♀️enhancement🦸‍♂️ New feature or request labels Oct 4, 2019
@GerHobbelt GerHobbelt changed the title ✅BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters BibTeX Sniffer: PDFs are not downloaded into the Qiqqa library when their URL does not include the "pdf" characters Oct 4, 2019
@GerHobbelt GerHobbelt added this to the v82 milestone Oct 4, 2019
GerHobbelt added a commit that referenced this issue Nov 5, 2019