Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IterDictIndexer can index pre-tokenised documents #328

Merged
merged 25 commits into from
Oct 27, 2022
Merged

Conversation

cmacdonald
Copy link
Contributor

@cmacdonald cmacdonald commented Sep 21, 2022

This PR adds support for pretokenised indices for Terrier, and is now ready for review.

To run the tests, the following command can be used:
To use this, the pretoks branch of Terrier must be installed:

git clone https://github.com/terrier-org/terrier-core.git
cd terrier-core
git checkout --track origin/pretoks
mvn -DskipTests install

Further, the terrier-python-helper in pyterrier must be updated:

cd pyterrier/terrier-python-helper
mvn install

To run the tests, the following invocation is used
TERRIER_VERSION=5.7-SNAPSHOT TERRIER_HELPER_VERSION=0.0.7 pytest -s tests/test_iterdictindex_pretok.py

Usage looks like:

indexer = pt.IterDictIndexer("./index", toks=True)
indexer.index( [
  {'docno' : 'd1', 'toks' : {'a' : 1, 'b' : 2}
])

NB: There is no fields support for this indexer.

NBB: The Github Actions will not pass without the upstream changes in Terrier being merged. The corresponding PR for the Terrier changes is terrier-org/terrier-core#204

@cmacdonald cmacdonald added this to the 0.9 milestone Sep 21, 2022
@cmacdonald
Copy link
Contributor Author

We also need to ensure that /all/ test pass, and this codebase continues to work with Terrier 5.6.

Copy link
Collaborator

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I fixed a syntax error, removed an unused argument, and added code that forces all values to ints. It's worth taking a peek at those changes.

@cmacdonald
Copy link
Contributor Author

Good spots. Are we happy with pretokenised=True as being the appropriate kwarg? I guess some documentation also needed, including examples e.g. BERTTokenizer

@seanmacavaney
Copy link
Collaborator

I think I’d still rather just have it detect whether the indexed fields are dicts and act accordingly. But IIRC this proposal was previously rejected.

I’m not sure what argument name would be most memorable. pretokenised seems alright, but could trip folks up with z vs s. I guess toks solved that, but is far less descriptive.

Yeah, having some example tokenisers is a good idea. Perhaps an example using Counter to count the WP’s too?

@cmacdonald
Copy link
Contributor Author

All tests pass, but pretokenised tests are skipped until the new helper has been released to Maven.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants