TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files #13

Closed
GerHobbelt opened this issue Aug 2, 2019 · 2 comments
Labels
🐛bug Something isn't working 🦸‍♀️enhancement🦸‍♂️ New feature or request 🧑‍🤝‍🧑help wanted🧑‍🤝‍🧑 Extra attention is needed.

@GerHobbelt
Collaborator

GerHobbelt commented Aug 2, 2019

Now that I have access to the Qiqqa source code and have been able to rebuild the binary and extend its logging, I find that quite a lot of my troubles over the past years were due to Qiqqa not coping well with all kinds of broken/b0rked PDF files in the Qiqqa libraries:

  1. #10 (several PDFs caused Qiqqa to run indefinitely after closing it): every time I had to open the Windows Task Manager and KILL the Qiqqa process or process tree to make it stop. If I didn't do that, Qiqqa would report it's already running when you restart it, necessitating a reboot. Instead, I've executed the Windows equivalent of kill -9 every time I exit/stop Qiqqa.

  2. #16 (search index stops working and re-indexing doesn't recreate the Lucene search db): Qiqqa failed on several occasions with my large PDF collection, causing a permanent and total failure in its search feature, i.e. the Lucene database got nuked/b0rked. All subsequent searches in Qiqqa would deliver ZERO results, quickly.

    • Reindexing via the Qiqqa Tools panel would have no effect.

      Tools > Qiqqa Configuration > Troubleshooting > Rebuild Library Search Indices
      
    • Manually deleting all the Lucene DB files in base/Guest/index/ would also be to no avail.

    • Reconstructing the Library by importing the PDF files in tiny batches via the Directory Watch feature of Qiqqa would result in 'semi-random behaviour': it now turns out to be highly dependent on which PDF files got loaded first. As soon as an offending PDF (to be uploaded later) got included in the library, the Lucene-backed search facility would break down and stop functioning.

    Note: Pending investigation suspects at least #11 (SyncFusion 14 locks up (hangs) when reading some PDF files); at the time of this writing #11 has been fixed, and this was a required first step towards making the Lucene-backed search feature work and (re)generate a working search index once again.

  3. When using the sniffer (Yay! 😄 Superb Feature!) to fetch additional documents (PDFs), sometimes you'll observe a load failure, where

    • the document appears as a pure-white multi-page document with no content at all, or
    • the document would render as a single pure-white page document with no content at all, or
    • the PDF download/fetch operation would lock up and you'd have to kill -9 Qiqqa to stop it. Waiting on http://website/path.../file.pdf would be shown forever in the status line at the bottom of the main window. Depending on the alignment of the planets, you'll then be able to restart Qiqqa with a functioning or broken 'search' feature.
  4. There's no way to dig out these b0rked PDFs from the library and 'select all' the discovered culprits to apply some chosen user activity (delete PDF + library entry, export/dump to diagnostics directory, ...?what you want?...)
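
To illustrate item 4, here is a minimal sketch of how obviously damaged PDFs could be dug out of a library directory. It is written in Python for brevity (Qiqqa itself is C#), and the names `looks_like_damaged_pdf` and `find_suspect_pdfs` are hypothetical, not part of Qiqqa. It only checks the basic file framing (a `%PDF-` header near the start, an `%%EOF` marker near the end), so it would catch truncated downloads and HTML error pages saved as .pdf, but not well-framed files that make a PDF renderer hang.

```python
from pathlib import Path
from typing import List


def looks_like_damaged_pdf(path: Path, window: int = 1024) -> bool:
    """Cheap heuristic: True when the file lacks the basic PDF framing."""
    size = path.stat().st_size
    if size == 0:
        return True  # empty file, e.g. a download that never started
    with path.open("rb") as fh:
        head = fh.read(window)            # header should appear near the start
        fh.seek(max(0, size - window))
        tail = fh.read()                  # end-of-file marker should appear near the end
    return b"%PDF-" not in head or b"%%EOF" not in tail


def find_suspect_pdfs(library_dir: Path) -> List[Path]:
    """Collect every *.pdf under library_dir that fails the framing check."""
    return sorted(p for p in library_dir.rglob("*.pdf") if looks_like_damaged_pdf(p))
```

The returned list is the 'select all' set: it could be fed to whatever follow-up action is chosen (delete, export to a diagnostics directory, and so on).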

@GerHobbelt
Collaborator Author

Done as per #33.

Lots of commits related to this issue. This set surely won't cover everything, as I've had crashes in lots of places while testing a 20K+ library which has collected its own amount of cruft from the Internet and years of Qiqqa fails (Sniffer lockups, download b0rks due to connection failure and what-not, you-name-it 🤡 ):

Revision: dc740d7
fix/tweak FolderWatcher background task: make sure we AT LEAST process ONE(1) tiny batch of PDF files when there are any to process.

Revision: d59d6f0
fix crash in chat code when Qiqqa is shutting down (+ code review to uncover more spots where this might be happening)

20190804.204351 INFO  [Main] Stopping MaintainableManager
Exception thrown: 'System.NullReferenceException' in Qiqqa.exe
20190804.204351 WARN  [9] There was a problem communicating with chat.
System.NullReferenceException: Object reference not set to an instance of an object.
   at Qiqqa.Chat.ChatControl.ProcessDisplayResponse(MemoryStream ms) in W:\lib\tooling\qiqqa\Qiqqa\Chat\ChatControl.xaml.cs:line 221
   at Qiqqa.Chat.ChatControl.PerformRequest(String url) in W:\lib\tooling\qiqqa\Qiqqa\Chat\ChatControl.xaml.cs:line 127
20190804.204351 WARN  [9] Chat: detected Qiqqa shutting down.

Revision: bab0499
code stability: Do not crash/fail when the historical progress file is damaged
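
The general idea behind this kind of fix can be sketched as follows: treat a missing or unreadable state file as "no history" instead of letting the exception take the application down. This is a Python illustration only; the on-disk format of Qiqqa's progress file is not described here, so JSON is assumed purely for the example, and `load_progress` is a hypothetical name.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("progress")


def load_progress(path: Path) -> dict:
    """Return the saved progress, or an empty record if the file is missing or damaged."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # OSError: missing/unreadable file; ValueError: bad encoding or malformed JSON.
        log.warning("Ignoring unreadable progress file %s: %s", path, exc)
        return {}
    if not isinstance(data, dict):
        # Parsed fine, but not the shape we expect: treat it as damaged too.
        log.warning("Ignoring progress file %s: unexpected content", path)
        return {}
    return data
```

Starting over with an empty record loses the history, but that is usually preferable to a crash on every start-up.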

Revision: da3f853
corrected Folder Watch loop + checks for #20: the intent here is very similar to the code done previously for #17; we just want to add a tiny batch of PDF files from the Watch folder, irrespective of the number of files waiting there to be added.

Revision: 7bd3ee6
more work regarding #10 and #17: when you choose to either import a large number of PDF files at once via the Watch Folder feature or have just reset the Watch Directory before exiting Qiqqa, you'll otherwise end up with a long-running process where many/all files in the Watched Directories are inspected and possibly imported. This is undesirable when the user has decided Qiqqa should terminate (by clicking close-window or using the Alt-F4 keyboard shortcut).
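
A minimal sketch of the pattern, in Python rather than Qiqqa's C# and with hypothetical names throughout (`import_watched_files`, `import_one`, `shutdown`): the import loop checks a shared shutdown flag before each file, so a close request is honoured after at most one more file instead of after the whole backlog.

```python
import threading
from typing import Callable, Iterable


def import_watched_files(
    files: Iterable[str],
    import_one: Callable[[str], None],
    shutdown: threading.Event,
) -> int:
    """Import files one by one, stopping as soon as shutdown is requested."""
    imported = 0
    for name in files:
        if shutdown.is_set():
            break  # the user asked to quit: leave the rest for the next run
        import_one(name)
        imported += 1
    return imported
```

The window-close handler would call `shutdown.set()`; files that were skipped simply remain in the watch folder and get picked up on the next start.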

Revision: 53f2ca8
code cleanup activity (which happened while going through the code for thread safety locks inspection)

Revision: 5dcda97
#18 work :: code review part 1, looking for thread safety locks being applied correctly and completely: for example, a few places did not follow best practices and used the discouraged lock(this){...} idiom (https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/lock-statement)

Revision: 8a1d766
Fix #17 by processing PDFs in any Qiqqa library in small batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a large library, e.g. 20K+ PDF files. The key here is to make the 'infrequent background task' produce some result quickly (like a working, yet incomplete, Lucene search index DB!) and then update/augment that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
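
The batching idea can be sketched like this. It is a Python illustration under stated assumptions, not the Qiqqa implementation: `batches`, `index_in_batches`, `extract_text` and the plain dict standing in for the Lucene index are all hypothetical. The point is that after every small batch the index is in a usable, if incomplete, state.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_in_batches(
    pending: Iterable[str],
    index: Dict[str, str],
    extract_text: Callable[[str], str],
    size: int = 10,
) -> Iterator[int]:
    """Add documents to `index` batch by batch; yield the index size after each batch."""
    for batch in batches(pending, size):
        for doc in batch:
            index[doc] = extract_text(doc)
        yield len(index)  # the index is searchable here, even though work remains
```

Because `index_in_batches` is a generator, the caller decides when to run the next batch, which leaves room to service the UI or check for shutdown in between.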

Revision: 72b8d25
dialing up the debug/info logging to help me find the most annoying bugs, first of them: #10, then #13

Revision: b359039
update existing Syncfusion files from v14 to v17, which helps resolve #11

Warning: I got those files by copying a Syncfusion install directory into qiqqa::/libs/ and overwriting existing files. v17 has a few more files, but those seem not to be required/used by Qiqqa, as only overwriting what was already there in the Qiqqa install directory seems to deliver a working Qiqqa tool. :phew:

@GerHobbelt GerHobbelt changed the title TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files ✅TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files Aug 8, 2019
@GerHobbelt
Collaborator Author

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.

@GerHobbelt GerHobbelt added 🐛bug Something isn't working 🦸‍♀️enhancement🦸‍♂️ New feature or request 🧑‍🤝‍🧑help wanted🧑‍🤝‍🧑 Extra attention is needed. labels Oct 4, 2019
@GerHobbelt GerHobbelt added this to the v82 milestone Oct 4, 2019
@GerHobbelt GerHobbelt changed the title ✅TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files Oct 4, 2019