When re-indexing a large library, Qiqqa is unresponsive for a VERY long time (too long to wait: 1+ hours) #17

Closed
GerHobbelt opened this issue Aug 2, 2019 · 3 comments
@GerHobbelt
Collaborator

20K PDF library. Coming from a v79 commercial install, this library suffered badly from #16 in the past. A recompiled Qiqqa (with #11 fixed and #13 partly fixed) now (#14) finally attempts to recreate the Lucene-backed search index, only to end up 'Not Responding...' while writing several MBytes of logfile output carrying a zillion lines like these:

20190802.104554 INFO [Daemon.Maintainable:BackgroundWorkerDaemon.DoMaintenance_Infrequent] Indexing document E6B963888DF9A4CCD5E2CD7647BFE94F692DF1

20190802.104554 INFO [PDFTextExtractor] PDFOCR:297 page(s) to textify and 1254 page(s) to OCR. (1/1551)

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Aug 2, 2019
…* batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a *large* library, e.g. 20K+ PDF files. The key here is to make the '**infrequent background task**' produce *some* result quickly (like a working, yet incomplete, Lucene search index DB!) and then *updating*/*augmenting* that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Aug 2, 2019
…ose to either import a large number of PDF files at once via the Watch Folder feature *or* have just reset the Watch Directory before exiting Qiqqa, you'll otherwise end up with a long running process where many/all files in the Watched Directories are inspected and possibly imported: this is undesirable when the user has decided Qiqqa should terminate (by clicking close-window or Alt-F4 keyboard shortcut).
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Aug 5, 2019
…here is very similar to the code done previously for jimmejardine#17; we just want to add a tiny batch of PDF files from the Watch folder, irrespective of the amount of files waiting there to be added.
@GerHobbelt
Collaborator Author

Done as per #33.

See also #20. Do note that this work does not stand alone and is closely related to #18 et al.


Commits:

Revision: d58bd7a
revert debug code that was part of commit SHA-1: 89307ed -- some invalid BibTeX was crashing the Lucene indexer (AddDocumentMetadata_BibTex() would b0rk on a NULL Key)

That problem was fixed in that commit at a higher level (in PDFDocument)

Revision: da3f853
corrected Folder Watch loop + checks for #20: the intent here is very similar to the code done previously for #17; we just want to add a tiny batch of PDF files from the Watch folder, irrespective of the amount of files waiting there to be added.
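
To illustrate the "tiny batch, irrespective of how many files are waiting" idea, here is a minimal sketch in Python. It is not Qiqqa's code (Qiqqa is a .NET application); the function name `next_watch_batch` and the batch size are assumptions made up for this example. The point is that the directory is enumerated lazily and enumeration stops as soon as the batch is full, so the cost of one pass does not grow with the size of the Watch folder.

```python
import os
from itertools import islice


def next_watch_batch(folder, batch_size=5):
    """Return at most `batch_size` PDF paths from `folder` (hypothetical helper).

    os.scandir() yields directory entries lazily, and islice() stops pulling
    entries once the batch is full, so a folder holding thousands of files
    is not walked in its entirety on every pass.
    """
    with os.scandir(folder) as entries:
        pdfs = (
            entry.path
            for entry in entries
            if entry.is_file() and entry.name.lower().endswith(".pdf")
        )
        # Consume the generator while the scandir handle is still open.
        return list(islice(pdfs, batch_size))
```

Note that `os.scandir()` returns entries in arbitrary order, and that a real watcher would have to remember which files it has already imported so that the next pass picks up a different batch.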

Revision: 7bd3ee6
more work regarding #10 and #17: when you choose to either import a large number of PDF files at once via the Watch Folder feature or have just reset the Watch Directory before exiting Qiqqa, you'll otherwise end up with a long running process where many/all files in the Watched Directories are inspected and possibly imported: this is undesirable when the user has decided Qiqqa should terminate (by clicking close-window or Alt-F4 keyboard shortcut).
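
A rough sketch of the behaviour described here, again in Python rather than Qiqqa's own language: the import loop checks a shutdown flag before each file and stops early once the user has asked the application to exit. The names `import_watched_files`, `import_one` and `shutdown_requested` are hypothetical and only serve the example.

```python
import threading


def import_watched_files(paths, import_one, shutdown_requested):
    """Import files one by one, stopping as soon as shutdown is requested.

    `import_one` is a caller-supplied callable doing the actual work;
    `shutdown_requested` is a threading.Event that the UI thread would set
    when the user closes the window. Both are assumptions for this sketch.
    """
    imported = []
    for path in paths:
        # Check before each file so the loop never starts new work
        # after the user has decided to quit.
        if shutdown_requested.is_set():
            break
        import_one(path)
        imported.append(path)
    return imported
```

The file that is in progress when the flag is set still finishes; only the remaining files are skipped, which keeps the exit delay bounded by the cost of a single import.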

Revision: 8a1d766
Fix #17 by processing PDFs in any Qiqqa library in small batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a large library, e.g. 20K+ PDF files. The key here is to make the 'infrequent background task' produce some result quickly (like a working, yet incomplete, Lucene search index DB!) and then updating/augmenting that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
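
The batching approach can be sketched as follows. This is an illustrative Python model, not the actual commit: the class `BatchedIndexer`, its in-memory dictionary "index" and the `extract_text` callback are all assumptions standing in for Qiqqa's background daemon, the Lucene index and the PDF text extractor. Each maintenance tick handles a bounded number of documents and returns, and the partial index is searchable in between ticks.

```python
from collections import deque


class BatchedIndexer:
    """Hypothetical stand-in for the 'infrequent background task'."""

    def __init__(self, fingerprints, batch_size=10):
        self.pending = deque(fingerprints)  # documents still to be indexed
        self.index = {}                     # fingerprint -> extracted text
        self.batch_size = batch_size

    def do_maintenance(self, extract_text):
        """Index at most `batch_size` pending documents, then return.

        Returns the number of documents handled in this tick, so the
        caller can tell when the backlog has been worked off (0).
        """
        done = 0
        while self.pending and done < self.batch_size:
            fingerprint = self.pending.popleft()
            self.index[fingerprint] = extract_text(fingerprint)
            done += 1
        return done

    def search(self, word):
        """Search whatever has been indexed so far (possibly incomplete)."""
        return sorted(fp for fp, text in self.index.items() if word in text)
```

With five documents and a batch size of two, the first tick leaves a usable index of two documents, and later ticks add to it until `do_maintenance` returns 0.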

Revision: b359039
update existing Syncfusion files from v14 to v17, which helps resolve #11

Warning: I got those files by copying a Syncfusion install directory into qiqqa::/libs/ and overwriting existing files. v17 has a few more files, but those seem not to be required/used by Qiqqa, as only overwriting what was already there in the Qiqqa install directory seems to deliver a working Qiqqa tool. :phew:

@GerHobbelt
Collaborator Author

Related: #55

@GerHobbelt
Collaborator Author

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.

@GerHobbelt added the 🐛bug (Something isn't working) and ⛷performance (Anything that's related to UX: speed of response; I/O speed, etc.) labels Oct 4, 2019
@GerHobbelt GerHobbelt added this to the v82 milestone Oct 4, 2019
@GerHobbelt changed the title from "✅ When re-indexing a large library, Qiqqa is unresponsive for a VERY long time (too long to wait: 1+ hours)" to "When re-indexing a large library, Qiqqa is unresponsive for a VERY long time (too long to wait: 1+ hours)" Oct 4, 2019