Provide a SHA256 hash of file content #4195

mitar · 2014-01-26T00:58:18Z

We are creating an open source cloud service (http://peerlibrary.org) where users can import PDFs and have them displayed with PDF.js. Because users can request loading of external PDFs, we need to know whether the PDF loaded from an external URL is the same as the one that was initially imported, to verify that the file has not changed. So currently we load the file and compute its SHA256 hash first to verify it, and then open it with PDF.js if it passes. It would be much better if we could use PDF.js directly for this, so that we can reuse all the worker and PDF transmission capabilities. From what I see this would be easy to add (simply add another API call that sends a message to the worker, which then uses GetData to get the data and compute the hash). Is this something core PDF.js would be interested in having if I make a pull request?
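For illustration, the verify-then-open check described above could look roughly like the sketch below. It is not PeerLibrary's or PDF.js's code: Node's built-in `crypto` module stands in for whichever SHA256 implementation is actually used, and the helper names `sha256Hex` and `matchesImportedHash` are hypothetical.

```javascript
// Sketch only: Node's built-in crypto module stands in for whatever
// SHA-256 implementation the application actually uses.
const { createHash } = require('node:crypto');

// Return the SHA-256 digest of a byte array as a lowercase hex string.
function sha256Hex(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// Hypothetical helper: true only if the fetched bytes hash to the digest
// that was recorded when the PDF was first imported.
function matchesImportedHash(bytes, expectedHex) {
  return sha256Hex(bytes) === expectedHex.toLowerCase();
}
```

The application would call `matchesImportedHash` on the downloaded bytes and hand them to PDF.js only when it returns `true`.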

We tested many implementations of the SHA256 hash function and digest.js seems to be the fastest, because it uses typed arrays. Would use of that library be OK for PDF.js? Or does it limit too much the set of browsers that PDF.js wants to support?

Additionally, library uses GPLv3 which is incompatible with Apache 2 license of PDF.js. I can try to obtain permission from the author for inclusion under Apache 2 license (only one author has contributed all the code).

If not, is there currently an easy way to extend PDF.js with an additional API method from outside, without modifying the code (maybe just by extending prototypes)? Is there some plugin architecture in place?

We would like to use a secure hash function to have assurance that the file has not been changed in any malicious way after the initial import, so we would not like to use an MD5 hash. Our service might be used for sensitive content someday.


yurydelendik · 2014-01-27T14:35:04Z

SHA256 can be embedded in PDF.js only if it is part of signature verification (see #1076). Implementing it as you describe above sounds like a custom solution and does not match the PDF32000 specification.

library uses GPLv3 which is incompatible with Apache 2 license of PDF.js.

The overall solution would then have to be released under the GPL license, and we cannot do that (so it is kind of compatible, but only for GPL folks).

I can try to obtain permission from the author for inclusion under Apache 2 license (only one author has contributed all the code).

It would be best if the author contributed to the project under the Apache 2 license. (Not sure we need all the algorithms, though.) PDF32000 lists only SHA1, SHA256, SHA384, SHA512 and RIPEMD160 (and MD5).

mitar · 2014-01-27T17:55:48Z

Yes, this is a custom solution to know whether a file opened in PDF.js is exactly the same (byte for byte) as it was when opened at some earlier time. As you are already providing a custom checksum function, I do not see a reason not to implement this as well.

Maybe the custom checksum function could be extended to take an optional parameter that tells it which algorithm to use: SHA256, your current one (the default) or something else.

yurydelendik · 2014-01-27T21:58:38Z

you are already providing a custom checksum function?

What custom checksum function? If you mean https://github.com/mozilla/pdf.js/blob/master/src/core/core.js#L488 , then it's a pseudo-unique PDF identifier (for corrupted PDF documents). The requirement for this function was to not download the entire document. What you are asking for is to download the whole data, which is unacceptable for our use case.

mitar · 2014-01-27T22:52:03Z

I am saying that by default fingerprint could compute what it computes now, without downloading the whole data. But it could take an optional argument that would allow computing other types of fingerprint, for example computing SHA256 over the whole data.

mitar · 2014-01-27T23:50:23Z

Or, what about a way to plug in custom fingerprint functions from outside? This could be a very simple API where you assign a function to some API object, like PDFJS.Fingerprints.SHA256 = function () {...}, and it would be called when you call fingerprint('SHA256'). The default would be the same as now, calling the default fingerprint function currently implemented. Is this something I could make a pull request for?
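To make the proposal concrete, here is a minimal sketch of such a registry. Everything in it is an assumption for illustration: the `Fingerprints` object and the `fingerprint(bytes, name)` signature are hypothetical, not existing PDF.js API, and the default entry is only a placeholder that hashes a leading prefix of the data (prefix size chosen arbitrarily), not the fingerprint PDF.js actually computes.

```javascript
const { createHash } = require('node:crypto');

// Assumed prefix size for the placeholder default; not taken from PDF.js.
const DEFAULT_PREFIX_BYTES = 1024;

// Hypothetical registry of named fingerprint functions. Object.create(null)
// avoids accidentally matching inherited names such as "toString".
const Fingerprints = Object.create(null);

// Placeholder default: looks only at the start of the data, so it does not
// need the whole file.
Fingerprints.default = (bytes) =>
  createHash('md5').update(bytes.subarray(0, DEFAULT_PREFIX_BYTES)).digest('hex');

// Dispatch to the named fingerprint function, or to the default one.
function fingerprint(bytes, name = 'default') {
  const fn = Fingerprints[name];
  if (typeof fn !== 'function') {
    throw new Error('Unknown fingerprint algorithm: ' + name);
  }
  return fn(bytes);
}

// An application plugs in its own algorithm from outside.
Fingerprints.SHA256 = (bytes) =>
  createHash('sha256').update(bytes).digest('hex');
```

With this shape, `fingerprint(bytes)` keeps the cheap default behaviour, while `fingerprint(bytes, 'SHA256')` hashes the whole data and therefore requires the complete file.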

yurydelendik · 2014-01-28T00:12:15Z

That's outside the scope of this project. You can easily incorporate the solution above into your build process, e.g. by applying a simple patch to core.js.

As mentioned in #1076, we will be glad to accept a pull request that implements (at least partially) digital signature verification. Closing as won't fix for now.

mitar mentioned this issue Jan 26, 2014

Use web workers to compute PDF hash in client peerlibrary/peerlibrary#186

Closed

yurydelendik closed this as completed Jan 28, 2014
