IterDictIndexer can index pre-tokenised documents #328

cmacdonald · 2022-09-21T18:36:58Z

This PR adds support for pretokenised indices for Terrier, and is now ready for review.

To use this, the pretoks branch of Terrier must be installed:

git clone https://github.com/terrier-org/terrier-core.git
cd terrier-core
git checkout --track origin/pretoks
mvn -DskipTests install

Further, the terrier-python-helper in pyterrier must be updated:

cd pyterrier/terrier-python-helper
mvn install

To run the tests, the following invocation is used:
TERRIER_VERSION=5.7-SNAPSHOT TERRIER_HELPER_VERSION=0.0.7 pytest -s tests/test_iterdictindex_pretok.py

Usage looks like:

import pyterrier as pt

indexer = pt.IterDictIndexer("./index", toks=True)
indexer.index([
  {'docno': 'd1', 'toks': {'a': 1, 'b': 2}}
])

NB: There is no fields support for this indexer.

NBB: The GitHub Actions checks will not pass until the upstream changes in Terrier are merged. The corresponding PR for the Terrier changes is terrier-org/terrier-core#204

cmacdonald · 2022-09-21T18:41:59Z

We also need to ensure that /all/ tests pass, and that this codebase continues to work with Terrier 5.6.

seanmacavaney

Looks good to me.

I fixed a syntax error, removed an unused argument, and added code that forces all values to ints. It's worth taking a peek at those changes.
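
The int-coercion change mentioned above could look roughly like the following. This is a minimal sketch, not the code from the commit: the helper name `coerce_toks` is hypothetical, and it assumes only that each value is something `int()` accepts.

```python
def coerce_toks(toks):
    """Return a copy of a term -> frequency mapping with every value cast to int.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    return {term: int(freq) for term, freq in toks.items()}


# Values arriving as floats or numeric strings are normalised to ints.
doc = {'docno': 'd1', 'toks': {'a': 1.0, 'b': '2'}}
doc['toks'] = coerce_toks(doc['toks'])
```

A value that `int()` cannot convert (for example the string `'2.5'`) would raise a `ValueError` in this sketch rather than being silently dropped.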

cmacdonald · 2022-10-24T10:27:24Z

Good spots. Are we happy with pretokenised=True as the appropriate kwarg? I guess some documentation is also needed, including examples, e.g. BERTTokenizer.

seanmacavaney · 2022-10-24T10:35:15Z

I think I’d still rather just have it detect whether the indexed fields are dicts and act accordingly. But IIRC this proposal was previously rejected.

I’m not sure what argument name would be most memorable. pretokenised seems alright, but could trip folks up with z vs s. I guess toks solved that, but is far less descriptive.

Yeah, having some example tokenisers is a good idea. Perhaps an example using Counter to count the WP’s too?
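
A sketch of the Counter idea, for illustration only. `simple_tokenise` is a hypothetical stand-in for a real tokeniser (a WordPiece tokeniser's output could be substituted), and `to_pretok_doc` is likewise a made-up helper name. The output follows the `docno`/`toks` layout from the usage example in the PR description.

```python
from collections import Counter


def simple_tokenise(text):
    # Hypothetical stand-in tokeniser: lowercase, then split on whitespace.
    return text.lower().split()


def to_pretok_doc(docno, text, tokenise=simple_tokenise):
    # Counter maps each token to the number of times it occurs in the text.
    return {'docno': docno, 'toks': dict(Counter(tokenise(text)))}


docs = [
    to_pretok_doc('d1', 'a b b'),
    to_pretok_doc('d2', 'The cat sat on the mat'),
]
```

A list like `docs` has the same shape as the argument passed to `indexer.index(...)` in the usage example.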


cmacdonald · 2022-10-27T14:29:59Z

All tests pass, but pretokenised tests are skipped until the new helper has been released to Maven.
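
Version-gated skipping of this kind can be sketched as below. This is an illustration under stated assumptions, not the code in the test suite: `version_tuple` and `supports_pretok` are hypothetical helpers, the `'5.6'` default is made up for the sketch, and only the Terrier version (not the helper version) is checked.

```python
import os
import unittest


def version_tuple(version):
    # "5.7-SNAPSHOT" -> (5, 7): drop any qualifier after '-' and compare numerically.
    return tuple(int(part) for part in version.split('-')[0].split('.'))


def supports_pretok(version):
    # Assumption for this sketch: pretokenised indexing needs Terrier 5.7 or later.
    return version_tuple(version) >= (5, 7)


# TERRIER_VERSION is the environment variable from the test invocation in the
# PR description; the '5.6' fallback is an assumption for this sketch.
TERRIER_VERSION = os.environ.get('TERRIER_VERSION', '5.6')


class TestPretok(unittest.TestCase):
    @unittest.skipUnless(supports_pretok(TERRIER_VERSION), 'requires Terrier >= 5.7')
    def test_pretok_index(self):
        # Placeholder body; a real test would build and inspect an index here.
        pass
```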

cmacdonald added 10 commits July 27, 2022 13:34

WIP: pretokenised indexing

5394125

dont use deprecated Collection[] API

ad95f85

bump for terrier 5.7 snapshot

a6c64eb

test with new helper

2b8b2c0

WIP: pretokenised indexing

2e7f299

reduce java compiler warnings

2c54999

reset fields config for pretoks

f9b3089

separate pretoks testing into different test file

f28d6c2

rename test, print java stacktraces

e3962a6

this fixes some test failures, which occur due to unused threads

01f0eff

cmacdonald added the upstream label Sep 21, 2022

cmacdonald added this to the 0.9 milestone Sep 21, 2022

cmacdonald requested a review from seanmacavaney September 21, 2022 18:36

seanmacavaney added 3 commits October 23, 2022 21:32

fixed syntax error

e374ed7

removed unused parameter

8f41fb2

allow other types for toks values (as long as they can be int()'d)

33f1f3c

seanmacavaney approved these changes Oct 23, 2022


cmacdonald added 10 commits October 26, 2022 11:15

fix typo

c6f539a

disable stemming etc for pretokenised

cfcb2a1

add example for using HuggingFace tokenizers

58ab69a

remove mistaken import

87f8ecd

lazy loading the new class

359db79

skip tests for tr < 5.7

efcef7b

change 5.6 handling

9dd1e51

allow testing for helper version

e832896

better exception handling for failing test

fe5c52e

change import stage

af98fd7

cmacdonald added 2 commits October 27, 2022 13:54

move MemoryIndexer into terrier-core for release Terrier 5.7 - see terrier-core#212

2545608

typo

6f76cf3

cmacdonald merged commit 9c1d9ff into master Oct 27, 2022

cmacdonald deleted the pretoks branch October 27, 2022 14:33

This was referenced Oct 27, 2022

Craft an Index with previously given Tokens #281

Closed

Support for pre-tokenised queries and documents in Terrier backend #243

Closed
