JsonVectorCollection weights are not obeyed for long terms #1843

JMMackenzie · 2022-04-14T00:45:27Z

@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking.

The bug comes from the combination of two design choices.

Anserini "clones" each term as many times as its weight (pseudo-document generation), so that the actual indexing can be offloaded to Lucene without tinkering with its internals.
Inside Lucene, the default maximum term length is 255 chars (see https://lucene.apache.org/core/8_0_0/core/constant-values.html#org.apache.lucene.analysis.standard.StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH).

So, getting down to the messy bits.

Assume you have a term coming into your vector with 256 characters and a weight of 200.

The term is split after the 255th character, leaving the final character dangling as its own term, which is also repeated 200 times. If that leftover already occurs as a term in the vector, the two sets of clones are merged and the stored impact is wrong.

A toy example:

{"id": "problem", "contents": "", "vector": {"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX" : 200, "X" : 200}}

This will result in an index with "X" having an impact of 400 (!!!!!) instead of 200.

Clearly this then flows on to downstream indexing/querying tasks.

One solution we found was to override the default value of 255 in the constructor for the WhitespaceAnalyzer (see https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L768). We set it to the maximum permissible value of 1048576, which solves the problem.
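To make the arithmetic concrete, here is a rough stdlib-only simulation of the behaviour described above. It does not call Lucene or Anserini; the class `FakeWordsDemo` and the method `impacts` are hypothetical names, and the chunking rule (cut every `maxTokenLength` chars) is an assumption that matches the split reported in this issue.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class FakeWordsDemo {
    /**
     * Simulates "repeat each term `weight` times, then tokenize with a length cap".
     * Hypothetical helper: a sketch of the observed behaviour, not Anserini's code.
     */
    public static Map<String, Integer> impacts(Map<String, Integer> vector, int maxTokenLength) {
        Map<String, Integer> termFreq = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            String term = entry.getKey();
            int weight = entry.getValue();
            // All `weight` copies of a term split identically, so every chunk
            // gains `weight` occurrences in the pseudo-document.
            for (int start = 0; start < term.length(); start += maxTokenLength) {
                int end = Math.min(term.length(), start + maxTokenLength);
                termFreq.merge(term.substring(start, end), weight, Integer::sum);
            }
        }
        return termFreq;
    }

    public static void main(String[] args) {
        // 255 'a' characters followed by 'X': 256 chars in total.
        String longTerm = String.join("", Collections.nCopies(255, "a")) + "X";
        Map<String, Integer> vector = new LinkedHashMap<>();
        vector.put(longTerm, 200);
        vector.put("X", 200);

        // Cap of 255: the dangling "X" collides with the real "X".
        System.out.println("X with cap 255:     " + impacts(vector, 255).get("X"));
        // Cap of 1048576: the long term survives intact and "X" keeps its weight.
        System.out.println("X with cap 1048576: " + impacts(vector, 1048576).get("X"));
    }
}
```

With the 255 cap the simulated impact of "X" is 400; with the raised cap it stays at 200 and the 256-character term keeps its own entry.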


lintool · 2022-04-14T01:48:51Z

wow, what an obscure bug!

How about we just drop all terms longer than 255 chars? They are unlikely to be meaningful anyway?

JMMackenzie · 2022-04-14T02:22:01Z

Is it possible to log output if so? But yeah, this would at least be better than silently mutating those terms I think...
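A minimal sketch of what "drop and log" could look like, assuming the filter runs on the term-to-weight map before the terms are cloned. `LongTermFilter` and `dropLongTerms` are hypothetical names, not existing Anserini code, and length is measured in Java `char`s, which is also an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

public class LongTermFilter {
    private static final Logger LOG = Logger.getLogger(LongTermFilter.class.getName());

    /**
     * Returns a copy of `vector` without terms longer than `maxTermLength`,
     * logging one warning per dropped term instead of mutating it silently.
     */
    public static Map<String, Integer> dropLongTerms(
            String docId, Map<String, Integer> vector, int maxTermLength) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            String term = entry.getKey();
            if (term.length() > maxTermLength) {
                LOG.warning(String.format(
                        "doc %s: dropping term of length %d (limit %d) with weight %d",
                        docId, term.length(), maxTermLength, entry.getValue()));
                continue;
            }
            kept.put(term, entry.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        String longTerm = String.join("", Collections.nCopies(255, "a")) + "X";
        Map<String, Integer> vector = new LinkedHashMap<>();
        vector.put(longTerm, 200);
        vector.put("X", 200);

        // The 256-char term is dropped with a warning; "X" keeps its weight of 200.
        System.out.println(dropLongTerms("problem", vector, 255));
    }
}
```

Terms of exactly 255 chars are kept here, since only longer terms would be split.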

lintool · 2022-04-14T02:23:36Z

sure!

Issue noted and PR welcome - but this is lowish on our priority list to fix...

JMMackenzie mentioned this issue Jun 1, 2022

Better implementation of JsonVectorCollection than the "fake words" approach #1890

Open

JMMackenzie mentioned this issue Mar 22, 2023

Linking to Anserini "FakeWords" Issue thongnt99/learned-sparse-retrieval#4

Open
