Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster indexing via multi-threading and fifos #92

Merged
merged 12 commits into from
Feb 25, 2021

Conversation

seanmacavaney
Copy link
Collaborator

@seanmacavaney seanmacavaney commented Feb 13, 2021

Not ready for review yet, just documenting what's left to do:

  • Support memory index. I guess it won't use multi-threading because org.terrier.applications.ThreadedBatchIndexing didn't either, but it can fall back on a simpler implementation.
  • Better exception handing in java code
  • update tests to run on both the fifo and non-fifo versions
  • update documentation
  • Defaults okay? -- DETERMINISTIC (rather than FAST) and 80% of CPUs (when available)
  • Is there an easy way to make it DETERMINISTIC fifo and non-fifo versions? Seems like this will be difficult given the fact we're working with an arbitrary-length iterable...

@seanmacavaney
Copy link
Collaborator Author

seanmacavaney commented Feb 13, 2021

ope, and I guess I need to add:

  • fix failing build

@cmacdonald
Copy link
Contributor

  • DETERMINISTIC is good. 80% CPUs I also agree with.
  • Yes, builds are/were failing recently. I dont think it is our fault.
  • changes to terrier-python-helper arent picked up by the Github actions. Perhaps we should mvn install it in push.yml.

@seanmacavaney
Copy link
Collaborator Author

How do you feel about the UNIX and Windows versions producing different results under the DETERMINISTIC setting? The easiest way to fix this would be to force DETERMINISTIC to only use 1 thread, but that eliminates most of the speed benefits (there are still some from using a FIFO for transferring data to the Java thread). In that case, we could just eliminate the "mode" parameter and specify in the documentation that to produce deterministic results, you need to use 1 thread.

An alternative approach would be to provide a way to specify the internal document IDs when indexing. But that looks like this would be a much bigger change.

@cmacdonald
Copy link
Contributor

I havent reviewed the code yet, but have you considered round-robin of documents among the available threads. If merging takes places aligned with the docid boundaries, the end result should be the same as deterministic.

@seanmacavaney
Copy link
Collaborator Author

seanmacavaney commented Feb 15, 2021

Yeah, so DETERMINISTIC allocates documents to threads in a round-robin fashion. So for 10 documents and 3 threads, the allocation is:

Thread 1  D1->0 D4->1 D7->2 D10->3
Thread 2  D2->0 D5->1 D8->2
Thread 3  D3->0 D6->1 D9->2

(FAST simply skips over threads that are not yet ready for data. So on different runs, a different thread may end up with a given document.)

From what I gather from StructureMerger, the resulting docno->docid mapping after merging will be:

D1->0
D4->1
D7->2
D10->3
D2->4
D5->5
D8->6
D3->7
D6->8
D9->9

Whereas for the Windows version, D1->0 D2->1, D3->2, .... Term IDs will also differ, but these do not matter for tie-breaking the way that docids do.

Is there a different merger I should use?

@cmacdonald
Copy link
Contributor

Ok, yes, you are correct - StructureMerger does not interlace, it appends one to the other.

1 thread means deterministic, more is non-deterministic
Improved test to also check multi-threaded indexing
@seanmacavaney
Copy link
Collaborator Author

This should be ready for another review.

@cmacdonald
Copy link
Contributor

I have reviewed this, and it look very good. It requires a new terrier-python-helper release, which I'm reticent to do for now.

@cmacdonald cmacdonald marked this pull request as ready for review February 25, 2021 09:39
@cmacdonald cmacdonald merged commit 0b7f7d3 into terrier-org:master Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants