Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve inverted index performance #2574

Merged
merged 7 commits into from
Jul 13, 2024

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Jul 8, 2024

  • add stopword filter to avoid words that occur everywhere causing very bad performance
  • store tokens, inverted list and docs in HashMap
  • use docs in LargeStringArray cause single doc could be large then the total length could be over i32::MAX
  • more effecient updating for inverted list
    110.3%+ faster
invert(1000000)         time:   [20.542 µs 20.844 µs 21.229 µs]
                        change: [-53.850% -52.380% -51.289%] (p = 0.00 < 0.05)
                        Performance has improved.

- add stopword filter to avoid words that occur everywhere causing very bad performance
- store tokens, inverted list and docs in `HashMap`
- use docs in `LargeStringArray` cause single doc could be large then the total length could be over `i32::MAX`
- more effecient updating for inverted list

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 98.54015% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.95%. Comparing base (7a2f828) to head (c646e3d).
Report is 3 commits behind head on main.

Files Patch % Lines
rust/lance-index/src/scalar/inverted.rs 98.54% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2574      +/-   ##
==========================================
+ Coverage   79.91%   79.95%   +0.04%     
==========================================
  Files         212      212              
  Lines       61639    61658      +19     
  Branches    61639    61658      +19     
==========================================
+ Hits        49256    49298      +42     
+ Misses       9448     9436      -12     
+ Partials     2935     2924      -11     
Flag Coverage Δ
unittests 79.95% <98.54%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
let freq = *freq as f32;
let bm25 = bm25.entry(row_id).or_insert(0.0);
*bm25 += self.idf(row_freq.len()) * freq * (K1 + 1.0)
/ (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this formula correct? I would think - B + B is zero, so this looks suspicious. What is this based on? (perhaps you should have a link to this in the method?)

Suggested change
/ (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));
/ (freq + K1 * (1.0 * self.docs.num_tokens(row_id) as f32 / avgdl));

Copy link
Contributor Author

@BubbleCal BubbleCal Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, just added reference link for this method.
it's correct, it's 1.0 - B + (B * nq as f32 / avgdl) but the operator * is with higher priority so I just ignored the parentheses

Comment on lines 248 to 249
let mut token_id_builder = UInt32Builder::with_capacity(self.tokens.len());
let mut frequency_builder = UInt64Builder::with_capacity(self.tokens.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these aren't nullable, I think the faster approach would be to use a Vec, and then convert from a vec into the appropriate array at the end. This could say some cycles involved in handling the null buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can arrow reuse the vector's data?
here with the array builder I think it doesn't need to copy the data from builder to array again.
but for vector it does need right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays can re-use the vectors data without copying.

Comment on lines 289 to 294
.zip(token_id_col.iter())
.zip(frequency_col.iter())
{
let token = token.unwrap();
let token_id = token_id.unwrap();
let frequency = frequency.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the null checks / unwraps by iterator over the values buffer:

Suggested change
.zip(token_id_col.iter())
.zip(frequency_col.iter())
{
let token = token.unwrap();
let token_id = token_id.unwrap();
let frequency = frequency.unwrap();
.zip(token_id_col.values().iter())
.zip(frequency_col.values().iter())
{
let token = token.unwrap();

Comment on lines 396 to 401
for ((token_id, row_ids), frequencies) in token_col
.iter()
.zip(row_ids_col.iter())
.zip(frequencies_col.iter())
{
let token_id = token_id.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, if it's non-null:

Suggested change
for ((token_id, row_ids), frequencies) in token_col
.iter()
.zip(row_ids_col.iter())
.zip(frequencies_col.iter())
{
let token_id = token_id.unwrap();
for ((token_id, row_ids), frequencies) in token_col
.values()
.iter()
.zip(row_ids_col.iter())
.zip(frequencies_col.iter())
{

Comment on lines 481 to 483
for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
let row_id = row_id.unwrap();
let num_tokens = num_tokens.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
let row_id = row_id.unwrap();
let num_tokens = num_tokens.unwrap();
for (row_id, num_tokens) in row_id_col.values().iter().zip(num_tokens_col.values().iter()) {

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal requested a review from wjones127 July 9, 2024 04:33
Comment on lines +182 to 187
let stopword_filter =
tantivy::tokenizer::StopWordFilter::new(tantivy::tokenizer::Language::English).unwrap();
let mut tokenizer = tantivy::tokenizer::TextAnalyzer::builder(
tantivy::tokenizer::SimpleTokenizer::default(),
stopword_filter.transform(tantivy::tokenizer::SimpleTokenizer::default()),
)
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we'll make this all configurable later, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just make it work now

@BubbleCal BubbleCal merged commit b092f00 into lancedb:main Jul 13, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants