'still to go' numbers are HUGE: the lucene indexer seems to be stuck in a land where even at 1 task, the OCR jobs take 100% CPU #129

Open
GerHobbelt opened this issue Nov 2, 2019 · 2 comments
Labels
🕵investigate Needs further analysis to find the root cause. 🤔question Further information is requested or this is a support question

@GerHobbelt
Collaborator

GerHobbelt commented Nov 2, 2019

The 'still to go' numbers are HUGE: the Lucene indexer seems to be stuck in a state where, even with only 1 task running, the OCR jobs take 100% CPU. See if we can get Lucene to work a little harder for us; maybe it's a task priority thing? --> IncrementalBuildIndex() is executed only once every minute. That's not going to fly when a lot of pages are being textified.

Cave canem: is Lucene/indexing the real problem? Profiling Qiqqa has been a bit of a nightmare, so the data is as yet inconclusive (DevStudio breakage and extreme delays in processing profiling traces).
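
To make the once-per-minute concern concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Qiqqa code: the function name, the per-pass page limit and all the numbers are hypothetical, and it assumes (unverified) that each indexing pass handles only a bounded number of pages. It only illustrates why a fixed-interval pass falls behind once OCR produces pages faster than one pass can absorb.

```python
def simulate_backlog(pages_per_minute, pages_per_run, run_interval_minutes, total_minutes):
    """Return the 'still to go' count after total_minutes.

    Hypothetical model (not taken from the Qiqqa source):
    - OCR adds pages_per_minute freshly textified pages every minute;
    - an indexing pass runs once every run_interval_minutes and indexes
      at most pages_per_run pages per pass.
    """
    backlog = 0
    for minute in range(1, total_minutes + 1):
        # New pages become available for indexing during this minute.
        backlog += pages_per_minute
        # The indexing pass only fires on its fixed schedule.
        if minute % run_interval_minutes == 0:
            backlog -= min(backlog, pages_per_run)
    return backlog


# OCR outpaces the indexer: the backlog grows by 30 pages every minute.
print(simulate_backlog(50, 20, 1, 60))   # 1800
# The indexer keeps up: the backlog stays empty.
print(simulate_backlog(50, 100, 1, 60))  # 0
```

If the real bottleneck turns out to be elsewhere (see the caveat above), this model does not apply; it only shows what the 'still to go' counter would look like if the indexing cadence were the limiting factor.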

@GerHobbelt GerHobbelt created this issue from a note in TODO list (To do) Nov 2, 2019
@GerHobbelt GerHobbelt added 🕵investigate Needs further analysis to find the root cause. 🤔question Further information is requested or this is a support question labels Nov 2, 2019
@GerHobbelt GerHobbelt added this to the v82 milestone Nov 2, 2019
@GerHobbelt GerHobbelt moved this from To do to In progress in TODO list Nov 2, 2019
@GerHobbelt
Collaborator Author

Logging was augmented in v82pre4 to help investigate this issue. Moving forward on this bugger still depends on the robust STDERR+STDOUT streaming that comes as a side effect of #95.

@GerHobbelt
Collaborator Author

GerHobbelt commented Nov 5, 2019

Related: #127. Two problems with the same (suspected) root cause.

@GerHobbelt GerHobbelt moved this from To do to In progress in v82release Nov 7, 2019
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Mar 23, 2020
… SINGLE don't deliver due to, for example, encrypted PDF source. This is a temporary hack to ensure Qiqqa doesn't repeat OCR activities ad nauseam (jimmejardine#129 , jimmejardine#135 , jimmejardine#73 , etc.)

- the previously added extra sanity checks on OCR text files (zero-sized word areas, etc.) seem to pay off. At least, we've observed quite a few OCR files/pages being retriggered for OCR as Qiqqa uncovers these zero-sized word areas while refreshing for Expeditions
- added a few more UI-thread-or-not Assertions.
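
A sanity check of the kind mentioned in the first bullet could look roughly like the Python sketch below. This is an illustration only: the `Word` record and its fields are hypothetical and do not reflect Qiqqa's actual OCR text file format or its C# implementation.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical OCR word record: text plus its bounding box."""
    text: str
    left: float
    top: float
    width: float
    height: float


def has_zero_sized_word(words: List[Word]) -> bool:
    """Return True if any word has a bounding box without any area.

    A page flagged this way would be a candidate for a fresh OCR run.
    """
    return any(w.width <= 0 or w.height <= 0 for w in words)


page = [Word("Lucene", 0.10, 0.20, 0.08, 0.02), Word("index", 0.20, 0.20, 0.0, 0.02)]
print(has_zero_sized_word(page))
```

An empty word list is not flagged here; whether an empty page should count as suspicious is a separate decision this sketch leaves open.
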
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Mar 23, 2020