Faster indexing for learned sparse retrieval #2080

thongnt99 · 2023-03-23T14:44:25Z

Related to #1890
Ongoing work: using FeatureField to index terms and weights directly

Indexing works and returns the same metrics as the token-repeating method, but three tests (written for the repeating method) are currently failing. Please let me know whether I should fix those tests or create new ones.
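For readers unfamiliar with the two approaches, here is a minimal Python sketch of the idea. It is not the PR's Java code. The function names and the integer weights are hypothetical, chosen only to show why the repeating method does more work as weights grow, while storing each weight directly does a constant amount of work per term.

```python
from collections import Counter

def repeat_tokens(weights):
    """Token repeating: emit each term `weight` times, so term frequency encodes the weight."""
    tokens = []
    for term, weight in weights.items():
        tokens.extend([term] * weight)
    return tokens

def direct_features(weights):
    """Direct indexing: one (term, weight) pair per term, whatever the weight magnitude."""
    return list(weights.items())

# Hypothetical quantized term weights for one document (illustration only).
doc = {"retrieval": 120, "sparse": 85, "index": 3}

repeated = repeat_tokens(doc)
direct = direct_features(doc)

# Both representations carry the same term -> weight information.
assert Counter(repeated) == Counter(doc)
assert dict(direct) == doc

print(len(repeated), len(direct))  # 208 tokens to analyze vs. 3 features to store
```

In this sketch the work done by the repeating method grows with the sum of the weights, whereas the direct method grows only with the number of distinct terms.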

Indexing:

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonTermWeightCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator TermWeightDocumentGenerator \
-threads 60 -impact -pretokenized

Retrieval:

./anserini-lsr/target/appassembler/bin/SearchCollection \
-index path_to_index \
-topics path_to_topic \
-topicreader TsvString \
-output path_to_output_file \
-impact -pretokenized -hits 1000 -parallelism 60

lintool · 2023-03-24T18:39:42Z

Hi @thongnt99 very interesting and thanks for the PR!

Can you provide a sense of the performance improvement?

thongnt99 · 2023-03-25T20:23:36Z

Hi @lintool,

These are some comparison points I collected from our recent reproduction attempt with LSR methods.
The speedup depends on the magnitude of the term weights, but the new method is at least twice as fast as the term-repeating method. We saw, for example, a huge improvement when indexing EPIC: EPIC does not use sparse regularizers during training and therefore generally produces larger weights, so the term-repeating method has to repeat each term more often.

LSR method	Old: token repeating (h:mm:ss)	New: FeatureField (h:mm:ss)
QMLP_DMLM	0:10:25	0:04:09
EPIC (top_k=400)	1:23:53	0:04:02
Splade (0.01, 0.08)	0:17:41	0:03:52
uniCOIL	0:05:11	0:02:18
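
The speedup factors implied by these timings can be checked with a few lines of Python. This is only arithmetic on the numbers reported above; the helper name `to_seconds` is made up for the example.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# (old, new) indexing times copied from the table above.
timings = {
    "QMLP_DMLM": ("0:10:25", "0:04:09"),
    "EPIC (top_k=400)": ("1:23:53", "0:04:02"),
    "Splade (0.01, 0.08)": ("0:17:41", "0:03:52"),
    "uniCOIL": ("0:05:11", "0:02:18"),
}

speedups = {
    name: to_seconds(old) / to_seconds(new)
    for name, (old, new) in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# QMLP_DMLM: 2.5x
# EPIC (top_k=400): 20.8x
# Splade (0.01, 0.08): 4.6x
# uniCOIL: 2.3x
```

Every method comes out above 2x, and EPIC, the method with the largest weights, benefits the most.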

MXueguang · 2023-03-25T20:30:47Z

@thongnt99 this is cool!

lintool

Initial comments.

src/main/java/io/anserini/collection/JsonTermWeightCollection.java

src/main/java/io/anserini/index/generator/TermWeightDocumentGenerator.java

lintool · 2023-03-25T20:40:03Z

Instead of TermWeightDocument... why not just call it VectorDocument? Vector as Map<String,Float> seems pretty intuitive?

thongnt99 · 2023-03-25T21:03:27Z

Yes, I also think that TermWeightDocument isn't an ideal name. Perhaps SparseVectorDocument is more suitable than VectorDocument? The former makes it clear that we store indices/terms and values (similar to the SparseMatrix vs. DenseMatrix formats).
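
As a rough illustration of the sparse vs. dense distinction behind the name, here is a small Python sketch. The vocabulary, terms, and weights are hypothetical, and the dict merely stands in for the Map<String,Float> mentioned above; it is not taken from the PR's code.

```python
# Hypothetical vocabulary and document weights (illustration only).
vocab = ["index", "lucene", "retrieval", "sparse", "weight"]
sparse_doc = {"sparse": 1.7, "retrieval": 0.9}  # only non-zero entries are stored

def to_dense(sparse, vocabulary):
    """Dense form: one slot per vocabulary term, zeros included."""
    return [sparse.get(term, 0.0) for term in vocabulary]

def to_sparse(dense, vocabulary):
    """Sparse form: keep only the (term, value) pairs with non-zero values."""
    return {term: value for term, value in zip(vocabulary, dense) if value != 0.0}

dense_doc = to_dense(sparse_doc, vocab)

# Round-tripping loses nothing; the sparse form simply skips the zeros.
assert to_sparse(dense_doc, vocab) == sparse_doc
print(len(dense_doc), len(sparse_doc))  # 5 slots vs. 2 stored entries
```

With a real vocabulary of tens of thousands of terms, almost all dense slots would be zero, which is why a name that signals "indices/terms plus values" fits.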

lintool · 2023-03-25T21:17:04Z

I like SparseVectorDocument!

thongnt99 · 2023-03-27T01:00:52Z

@lintool
I renamed the classes and addressed the issues raised in your previous comments.

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized

lintool

How about some tests?

src/main/java/io/anserini/search/SearchCollection.java

src/main/java/io/anserini/search/query/FeatureGenerator.java

thongnt99 · 2023-04-03T03:00:44Z

@lintool I am gonna add the tests after ECIR.

lsr index with FeatureField

006b934

thongnt99 mentioned this pull request Mar 23, 2023

Linking to Anserini "FakeWords" Issue thongnt99/learned-sparse-retrieval#4

Open

lintool reviewed Mar 25, 2023

Thong Nguyen added 3 commits March 27, 2023 01:45

re-naminng classes

eb3f197

renaming class name

fe47e83

update implementation

20fab3c

lintool reviewed Mar 27, 2023

code formatting

3bc3664
