JsonVectorCollection weights are not obeyed for long terms #1843

Open
JMMackenzie opened this issue Apr 14, 2022 · 3 comments

Comments

@JMMackenzie
Contributor

JMMackenzie commented Apr 14, 2022

@mpetri, @amallia, and I have come across a weird bug where long terms in an input JsonVectorCollection end up with broken weights, possibly impacting downstream ranking.

The bug arises from the combination of two design choices.

  1. Anserini "clones" a term with weight w by repeating it w times (pseudo-document generation), which offloads the actual indexing to Lucene without tinkering with its internals.

  2. Inside Lucene, the default maximum term length is 255 chars (see https://lucene.apache.org/core/8_0_0/core/constant-values.html#org.apache.lucene.analysis.standard.StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH).

So, getting down to the messy bits.

Assume you have a term coming into your vector with 256 characters and a weight of 200.

The term is split after the 255th character, leaving the final character dangling as its own term. That leftover term is then indexed like any other, so it can corrupt the impact of a real term with the same text.

A toy example:

{"id": "problem", "contents": "", "vector": {"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX" : 200, "X" : 200}}

This will result in an index with "X" having an impact of 400 (!!!!!) instead of 200: 200 from the real "X" entry plus 200 from the single-character fragments split off the long term.

Clearly this then flows on to downstream indexing/querying tasks.

One solution we found was overriding the default value of 255 in the constructor for the WhitespaceAnalyzer (see https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L768). We set it to the maximum permissible value of 1048576, which solves the problem.
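A minimal sketch of the effect, for anyone who wants to reproduce the arithmetic without building an index. It does not call Lucene or Anserini: `LongTermSplitDemo`, `pseudoDocument`, `tokenize`, and `termFrequencies` are hypothetical names, and `tokenize` is only a simplified stand-in that assumes an over-long token is cut into chunks of at most the configured length.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; a simulation only, not Anserini or Lucene code.
class LongTermSplitDemo {

    // Pseudo-document generation: repeat each term `weight` times, whitespace-separated.
    static String pseudoDocument(Map<String, Integer> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }

    // Assumed behaviour: split on whitespace, then cut any token longer than
    // maxTokenLength into consecutive chunks of at most maxTokenLength chars.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            for (int start = 0; start < raw.length(); start += maxTokenLength) {
                int end = Math.min(raw.length(), start + maxTokenLength);
                tokens.add(raw.substring(start, end));
            }
        }
        return tokens;
    }

    // Count occurrences of each token; this plays the role of the term's impact.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Helper: a string made of `n` copies of `c`.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The toy vector: a 256-char term (255 'a' plus 'X') and the term "X".
        Map<String, Integer> vector = new LinkedHashMap<>();
        vector.put(repeat('a', 255) + "X", 200);
        vector.put("X", 200);
        String doc = pseudoDocument(vector);

        // With a 255-char cap, the dangling "X" fragments merge with the real "X".
        System.out.println(termFrequencies(tokenize(doc, 255)).get("X"));     // prints 400
        // With the cap raised to 1048576, the long term survives intact.
        System.out.println(termFrequencies(tokenize(doc, 1048576)).get("X")); // prints 200
    }
}
```

Under these assumptions, the cap of 255 yields 400 for "X", while raising the cap leaves both terms at 200.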

@lintool
Member

lintool commented Apr 14, 2022

wow, what an obscure bug!

How about we just drop all terms longer than 255 chars? They are unlikely to be meaningful anyway?

@JMMackenzie
Contributor Author

If so, is it possible to log the dropped terms? But yeah, this would at least be better than silently mutating those terms I think...
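A rough sketch of what "drop and log" could look like, applied to the vector before pseudo-document generation. This is not existing Anserini code: `LongTermFilter` and `dropLongTerms` are hypothetical names, the 255 limit is passed in by the caller, and length is measured with `String.length()`, which is an assumption about how the limit should be counted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical helper; not part of Anserini.
class LongTermFilter {
    private static final Logger LOG = Logger.getLogger(LongTermFilter.class.getName());

    // Return a copy of the vector without terms longer than maxTermLength,
    // logging a warning for each dropped term instead of silently mutating it.
    static Map<String, Integer> dropLongTerms(Map<String, Integer> vector, int maxTermLength) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            String term = e.getKey();
            if (term.length() > maxTermLength) {
                LOG.warning(String.format(
                        "Dropping term of length %d (limit %d) with weight %d",
                        term.length(), maxTermLength, e.getValue()));
            } else {
                kept.put(term, e.getValue());
            }
        }
        return kept;
    }
}
```

Calling `LongTermFilter.dropLongTerms(vector, 255)` on the toy example would keep only "X" with weight 200 and emit one warning for the 256-character term.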

@lintool
Member

lintool commented Apr 14, 2022

sure!

Issue noted and PR welcome - but this is lowish on our priority list to fix...
