Add support for password protected documents #1916

Open
lengoyvaerts opened this issue Aug 13, 2024 · 0 comments
Labels
feature_request for feature request

Comments

@lengoyvaerts

This feature request follows from an Elastic Discuss thread, which mentions #229.

When I send a password-protected document to the FSCrawler REST API, I get the following exception:

08:24:49,504 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [document-with-password.docx]
org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:274) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:205) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
    at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) ~[?:?]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.5.jar:?]
    at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.1.jar:4.0.1]
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.1.jar:4.0.1]
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.1.jar:4.0.1]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
08:24:49,505 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation

Would it be possible to add support for crawling password-protected documents? When the crawler scans a directory, the password could come from something like a .password file, as suggested in the Discuss thread. For the REST API, it could be an extra parameter in the request, e.g. -F "password=my-password" (possibly Base64 encoded).

I would think Tika supports this, given the documentation for the PasswordProvider class.
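To make the two suggestions above more concrete, here is a minimal sketch of how a password might be resolved in each case. Everything in it is an assumption for illustration: the `PasswordResolver` class, its method names, the "first non-empty line of a `.password` file in the same directory" convention, and the Base64 flag are hypothetical and do not exist in FSCrawler today. It uses only the JDK; the resolved value would then presumably be handed to Tika through a `PasswordProvider` registered in the `ParseContext`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: none of these names exist in FSCrawler. */
final class PasswordResolver {

    /** Assumed sidecar file name, taken from the ".password" idea in this issue. */
    static final String SIDECAR_NAME = ".password";

    private PasswordResolver() {
    }

    /**
     * Crawler case: look for a ".password" file in the same directory as the
     * document and return its first non-empty line, if there is one.
     */
    static Optional<String> fromSidecarFile(Path document) throws IOException {
        Path parent = document.toAbsolutePath().getParent();
        if (parent == null) {
            return Optional.empty();
        }
        Path sidecar = parent.resolve(SIDECAR_NAME);
        if (!Files.isRegularFile(sidecar)) {
            return Optional.empty();
        }
        List<String> lines = Files.readAllLines(sidecar, StandardCharsets.UTF_8);
        // Lines are not trimmed, so passwords containing spaces survive as written.
        return lines.stream().filter(line -> !line.isEmpty()).findFirst();
    }

    /**
     * REST case: interpret the value of a "password" form field, which may be
     * sent as plain text or Base64 encoded (assumed to be signalled by the caller).
     * Invalid Base64 input raises IllegalArgumentException.
     */
    static String fromRestParameter(String value, boolean base64Encoded) {
        if (!base64Encoded) {
            return value;
        }
        byte[] decoded = Base64.getDecoder().decode(value);
        return new String(decoded, StandardCharsets.UTF_8);
    }
}
```

With this shape, the crawler would call `PasswordResolver.fromSidecarFile(path)` before parsing each file, and the REST endpoint would call `PasswordResolver.fromRestParameter(...)` on the extra form field. A single password per directory is the simplest reading of the `.password` suggestion; a per-file mapping would need a richer file format.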

@lengoyvaerts lengoyvaerts added the feature_request for feature request label Aug 13, 2024