Provide a SHA256 hash of file content #4195

Closed
mitar opened this issue Jan 26, 2014 · 6 comments

@mitar
Contributor

mitar commented Jan 26, 2014

We are creating an open source cloud service (http://peerlibrary.org) where users can import PDFs and have them displayed with PDF.js. Because users can request to load external PDFs, we need to know whether a PDF loaded from an external URL is the same as the one that was initially imported, to verify that the file has not changed. Currently we load the file and compute its SHA256 hash first, and open it with PDF.js only if it passes. It would be much better if we could use PDF.js directly for this, so that we can reuse all of its worker and PDF transmission capabilities. From what I see this would be easy to add (simply add another API call that sends a message to the worker, which then uses GetData to get the data and compute the hash). Is this something core PDF.js would be interested in having if I make a pull request?

We tested many implementations of the SHA256 hash function, and digest.js seems to be the fastest because it uses typed arrays. Would use of that library be OK for PDF.js? Or would it limit too much the set of browsers PDF.js wants to support?

Additionally, library uses GPLv3 which is incompatible with Apache 2 license of PDF.js. I can try to obtain permission from the author for inclusion under Apache 2 license (only one author has contributed all the code).

If not, is there currently an easy way to extend PDF.js with an additional API method from outside the code, without modifying the code (but maybe just by extending prototypes)? Is there some plugin architecture in place?

We would like to use a secure hash function to have assurance that the file has not been changed in any malicious way after the initial import, so we would not like to use an MD5 hash. Our service might be used for sensitive content someday.
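
The verify-then-open flow described in this comment can be sketched as follows. This is a minimal sketch, assuming Node.js and its built-in `crypto` module rather than digest.js; the helper names `sha256Hex` and `matchesImportedHash` are hypothetical and are not part of PDF.js. In a browser, the Web Crypto API (`crypto.subtle.digest`) would be one alternative.

```javascript
const { createHash } = require('crypto');

// Hex-encoded SHA-256 digest of a Uint8Array or Buffer.
function sha256Hex(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// Hypothetical helper: true only when the bytes hash to the digest
// that was recorded when the PDF was first imported.
function matchesImportedHash(bytes, expectedHex) {
  return sha256Hex(bytes) === expectedHex.toLowerCase();
}

// Illustration with stand-in data instead of a real PDF.
const importedHash = sha256Hex(Buffer.from('abc'));
console.log(matchesImportedHash(Buffer.from('abc'), importedHash)); // true
console.log(matchesImportedHash(Buffer.from('abd'), importedHash)); // false
```

Only bytes that pass the check would then be handed to PDF.js for rendering; a mismatch means the external file is no longer the one that was imported.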

@yurydelendik
Contributor

SHA256 can be embedded in PDF.js only if it is part of signature verification (see #1076). Implementing it as you specified above sounds like a custom solution and does not match the PDF32000 specification.

> library uses GPLv3 which is incompatible with Apache 2 license of PDF.js.

The overall solution would then have to be released under the GPL license, and we cannot do that (so it is kind of compatible, but only for GPL folks).

> I can try to obtain permission from the author for inclusion under Apache 2 license (only one author has contributed all the code).

It would be best if the author contributed to the project under the Apache 2 license. (Not sure we need all the algorithms, though.) PDF32000 lists only SHA1, SHA256, SHA384, SHA512 and RIPEMD160 (and MD5).

@mitar
Contributor Author

mitar commented Jan 27, 2014

Yes, this is a custom solution to know whether a file opened in PDF.js is exactly the same (byte for byte) as when it was opened at some time before. As you are already providing a custom checksum function, I do not see a reason not to implement that as well.

Maybe the custom checksum function could be extended to take an optional parameter that tells it which algorithm to use: SHA256, your current one (the default), or something else.

@yurydelendik
Contributor

> you are already providing a custom checksum function?

What custom checksum function? If you mean https://github.com/mozilla/pdf.js/blob/master/src/core/core.js#L488 , then it's a pseudo-unique PDF identifier (for corrupted PDF documents). The requirement for this function was to not download the entire document. What you are asking for is to download the whole data, which is unacceptable for our use case.
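
The difference between the two kinds of identifier can be illustrated with a small sketch. It assumes Node.js and its built-in `crypto` module; the 1024-byte prefix, the use of MD5 for the prefix, and the function names are assumptions made for illustration and are not taken from core.js. A fingerprint over a prefix needs only the start of the file, but it cannot detect changes further in, whereas a digest over every byte can.

```javascript
const { createHash } = require('crypto');

const PREFIX_BYTES = 1024; // assumed prefix length, for illustration only

// Fingerprint over the first PREFIX_BYTES only; no full download needed.
function prefixFingerprint(bytes) {
  return createHash('md5').update(bytes.subarray(0, PREFIX_BYTES)).digest('hex');
}

// Digest over every byte; requires the whole file.
function fullDigest(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// Stand-in data: two buffers that differ in one byte beyond the prefix.
const original = new Uint8Array(4096).fill(0x20);
const tampered = Uint8Array.from(original);
tampered[4000] = 0x21;

console.log(prefixFingerprint(original) === prefixFingerprint(tampered)); // true
console.log(fullDigest(original) === fullDigest(tampered));               // false
```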

@mitar
Contributor Author

mitar commented Jan 27, 2014

I am saying that by default fingerprint could compute what it computes now, without downloading the whole data. But it could take an optional argument that would allow computing other types of fingerprint, for example SHA256 over the whole data.

@mitar
Contributor Author

mitar commented Jan 27, 2014

Or, what about a way to plug in custom fingerprint functions from outside? This could be a very simple API where you assign a function to some API object, like PDFJS.Fingerprints.SHA256 = function () {...}, and it would be called when you call fingerprint('SHA256'). The default would be the same as now, calling the default fingerprint function currently implemented. Is this something I could make a pull request for?
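
A pluggable registry along these lines might look like the following. This is only a sketch of the proposal, assuming Node.js and its built-in `crypto` module: the `Fingerprints` object and the `fingerprint(bytes, algorithm)` signature are hypothetical and do not exist in PDF.js, and the `default` entry is a stand-in rather than the fingerprint PDF.js actually computes.

```javascript
const { createHash } = require('crypto');

// Hypothetical registry modelled on the PDFJS.Fingerprints idea; not a real PDF.js API.
// Object.create(null) avoids picking up inherited names such as "toString".
const Fingerprints = Object.create(null);

// Stand-in default (assumption): hashes only the first 1024 bytes.
Fingerprints.default = (bytes) =>
  createHash('md5').update(bytes.subarray(0, 1024)).digest('hex');

// Looks up the requested algorithm and falls back to the default when none is given.
function fingerprint(bytes, algorithm = 'default') {
  const fn = Fingerprints[algorithm];
  if (typeof fn !== 'function') {
    throw new Error(`Unknown fingerprint algorithm: ${algorithm}`);
  }
  return fn(bytes);
}

// Registered from outside, as proposed: SHA-256 over the whole data.
Fingerprints.SHA256 = (bytes) =>
  createHash('sha256').update(bytes).digest('hex');

const data = Buffer.from('abc');
console.log(fingerprint(data));           // default fingerprint
console.log(fingerprint(data, 'SHA256')); // whole-data SHA-256
```

With this shape, callers that never pass an algorithm keep the existing behaviour, and only callers that opt in to `'SHA256'` pay the cost of reading the whole file.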

@yurydelendik
Contributor

That is outside the scope of this project. You can easily incorporate the solution above into your build process, e.g. by applying a simple patch to core.js.

As mentioned in #1076, we will be glad to accept a pull request that will implement (at least partially) a digital signature verification. Closing as won't fix for now.
