perf: improve inverted index performance #2574

BubbleCal · 2024-07-08T11:10:53Z

add a stopword filter so that words occurring in almost every document no longer cause very bad performance
store tokens, inverted lists and docs in HashMap
store docs in a LargeStringArray, because a single doc can be large, so the total length could exceed i32::MAX
more efficient updating of inverted lists
110.3%+ faster

invert(1000000)         time:   [20.542 µs 20.844 µs 21.229 µs]
                        change: [-53.850% -52.380% -51.289%] (p = 0.00 < 0.05)
                        Performance has improved.
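
The "faster" percentage and Criterion's reported change in run time describe the same measurement from two directions. As a rough sketch of the conversion (the function name `speedup_percent` is illustrative and not part of this PR), a relative time change `c` corresponds to a throughput gain of `1 / (1 + c) - 1`:

```rust
/// Converts a relative change in run time (e.g. -0.5238 for "-52.38%")
/// into a speedup percentage: how much more work fits in the same time.
fn speedup_percent(time_change: f64) -> f64 {
    // New run time expressed as a fraction of the old run time.
    let time_ratio = 1.0 + time_change;
    // Throughput scales inversely with run time.
    (1.0 / time_ratio - 1.0) * 100.0
}
```

With the median estimate above, `speedup_percent(-0.52380)` comes out at about 110%, and the reported bounds of -53.850% and -51.289% map to roughly 117% and 105%.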

codecov-commenter · 2024-07-08T11:27:00Z

Codecov Report

Attention: Patch coverage is 98.54015% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.95%. Comparing base (7a2f828) to head (c646e3d).
Report is 3 commits behind head on main.

Files	Patch %	Lines
rust/lance-index/src/scalar/inverted.rs	98.54%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2574      +/-   ##
==========================================
+ Coverage   79.91%   79.95%   +0.04%     
==========================================
  Files         212      212              
  Lines       61639    61658      +19     
  Branches    61639    61658      +19     
==========================================
+ Hits        49256    49298      +42     
+ Misses       9448     9436      -12     
+ Partials     2935     2924      -11

Flag	Coverage Δ
unittests	`79.95% <98.54%> (+0.04%)`	⬆️

wjones127 · 2024-07-08T18:06:46Z

rust/lance-index/src/scalar/inverted.rs

+                    let freq = *freq as f32;
+                    let bm25 = bm25.entry(row_id).or_insert(0.0);
+                    *bm25 += self.idf(row_freq.len()) * freq * (K1 + 1.0)
+                        / (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));


Is this formula correct? I would think - B + B is zero, so this looks suspicious. What is this based on? (perhaps you should have a link to this in the method?)

Suggested change

-                        / (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));
+                        / (freq + K1 * (1.0 * self.docs.num_tokens(row_id) as f32 / avgdl));

BubbleCal · 2024-07-09

Sure, I just added a reference link for this method.
The formula is correct: it is 1.0 - B + (B * nq as f32 / avgdl). The * operator has higher precedence than + and -, so I omitted the parentheses.
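
To make the precedence point concrete, here is a minimal sketch of the per-term BM25 score. The constants `K1 = 1.2` and `B = 0.75` are common textbook defaults and an assumption here (this excerpt does not show the values used in the PR), and `idf` is passed in rather than computed:

```rust
// Assumed textbook defaults; the PR's actual values are not shown in this thread.
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Score contribution of one query term for one document (Okapi BM25).
/// `idf` is computed elsewhere; `doc_len` and `avgdl` are token counts.
fn bm25_term_score(idf: f32, freq: f32, doc_len: f32, avgdl: f32) -> f32 {
    // `*` and `/` bind tighter than `+` and `-`, so this parses as
    // (1.0 - B) + ((B * doc_len) / avgdl). The `- B` and `+ B` never cancel.
    let length_norm = 1.0 - B + B * doc_len / avgdl;
    idf * freq * (K1 + 1.0) / (freq + K1 * length_norm)
}

/// The same length normalisation with every grouping written out.
fn length_norm_explicit(doc_len: f32, avgdl: f32) -> f32 {
    (1.0 - B) + ((B * doc_len) / avgdl)
}
```

For a document of exactly average length the normalisation term is 1, so `B` drops out; longer documents get a larger denominator and therefore a lower score for the same term frequency.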

wjones127 · 2024-07-08T18:11:42Z

rust/lance-index/src/scalar/inverted.rs

+        let mut token_id_builder = UInt32Builder::with_capacity(self.tokens.len());
+        let mut frequency_builder = UInt64Builder::with_capacity(self.tokens.len());


Since these aren't nullable, I think the faster approach would be to use a Vec and then convert it into the appropriate array at the end. This could save some cycles otherwise spent handling the null buffer.

BubbleCal · 2024-07-09

Can Arrow reuse the vector's data?
With the array builder, I think the data doesn't need to be copied from the builder into the array again.
But with a vector it does need to be copied, right?

wjones127 · 2024-07-11

Arrays can reuse the vector's data without copying.
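
The reason no copy is needed can be illustrated with the standard library alone. In the sketch below, `ValuesBuffer` is a hypothetical stand-in for an Arrow values buffer, not arrow-rs's actual type. Handing a `Vec` to a new owner moves only the pointer, length and capacity, while the elements stay in the same heap allocation:

```rust
/// Hypothetical stand-in for an Arrow values buffer: it owns the allocation
/// it is given and carries no null bitmap.
struct ValuesBuffer {
    data: Vec<u32>,
}

impl ValuesBuffer {
    /// Takes ownership of `values`. Only the (pointer, length, capacity)
    /// triple moves; the elements are not copied.
    fn from_vec(values: Vec<u32>) -> Self {
        Self { data: values }
    }
}

/// Collects token ids into a plain Vec first, then hands it over once.
fn build_token_ids(n: u32) -> ValuesBuffer {
    let mut token_ids = Vec::with_capacity(n as usize);
    for id in 0..n {
        token_ids.push(id); // no per-element validity bookkeeping
    }
    ValuesBuffer::from_vec(token_ids)
}
```

Comparing `as_ptr()` before and after the hand-over shows the same address, which is the property the reviewer is relying on.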

wjones127 · 2024-07-08T18:17:47Z

rust/lance-index/src/scalar/inverted.rs

+                .zip(token_id_col.iter())
+                .zip(frequency_col.iter())
+            {
+                let token = token.unwrap();
+                let token_id = token_id.unwrap();
+                let frequency = frequency.unwrap();


You can avoid the null checks / unwraps by iterating over the values buffer:

Suggested change

-                .zip(token_id_col.iter())
-                .zip(frequency_col.iter())
-            {
-                let token = token.unwrap();
-                let token_id = token_id.unwrap();
-                let frequency = frequency.unwrap();
+                .zip(token_id_col.values().iter())
+                .zip(frequency_col.values().iter())
+            {
+                let token = token.unwrap();
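
The difference between the two iteration styles can be sketched without Arrow. `Column` below is a hypothetical simplification of a primitive array (dense values plus an optional validity mask), not the arrow-rs type. It shows why reading the values slice directly skips a per-element check when the column is known to be non-null:

```rust
/// Hypothetical stand-in for a primitive Arrow column: a dense values
/// vector plus an optional validity mask (None means "no nulls").
struct Column {
    values: Vec<u64>,
    validity: Option<Vec<bool>>,
}

impl Column {
    /// Null-aware iteration: every item is wrapped in an Option.
    fn iter(&self) -> impl Iterator<Item = Option<u64>> + '_ {
        self.values.iter().enumerate().map(move |(i, v)| match &self.validity {
            Some(mask) if !mask[i] => None,
            _ => Some(*v),
        })
    }

    /// Direct access to the dense values, ignoring validity.
    fn values(&self) -> &[u64] {
        &self.values
    }
}

/// Sums frequencies the null-aware way, unwrapping each item.
fn sum_checked(col: &Column) -> u64 {
    col.iter().map(|v| v.unwrap()).sum()
}

/// Sums frequencies straight from the values slice. Only valid when the
/// column is known to contain no nulls.
fn sum_values(col: &Column) -> u64 {
    col.values().iter().sum()
}
```

Both functions return the same result for a null-free column; the second simply avoids building and unwrapping an `Option` per element.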

wjones127 · 2024-07-08T18:21:50Z

rust/lance-index/src/scalar/inverted.rs

+            for ((token_id, row_ids), frequencies) in token_col
+                .iter()
+                .zip(row_ids_col.iter())
+                .zip(frequencies_col.iter())
+            {
+                let token_id = token_id.unwrap();


Same thing here, if it's non-null:

Suggested change

-            for ((token_id, row_ids), frequencies) in token_col
-                .iter()
-                .zip(row_ids_col.iter())
-                .zip(frequencies_col.iter())
-            {
-                let token_id = token_id.unwrap();
+            for ((token_id, row_ids), frequencies) in token_col
+                .values()
+                .iter()
+                .zip(row_ids_col.iter())
+                .zip(frequencies_col.iter())
+            {

wjones127 · 2024-07-08T18:23:05Z

rust/lance-index/src/scalar/inverted.rs

+            for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
+                let row_id = row_id.unwrap();
+                let num_tokens = num_tokens.unwrap();


Suggested change

-            for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
-                let row_id = row_id.unwrap();
-                let num_tokens = num_tokens.unwrap();
+            for (row_id, num_tokens) in row_id_col.values().iter().zip(num_tokens_col.values().iter()) {

wjones127 · 2024-07-12T15:55:36Z

rust/lance-index/src/scalar/inverted.rs

+        let stopword_filter =
+            tantivy::tokenizer::StopWordFilter::new(tantivy::tokenizer::Language::English).unwrap();
        let mut tokenizer = tantivy::tokenizer::TextAnalyzer::builder(
-            tantivy::tokenizer::SimpleTokenizer::default(),
+            stopword_filter.transform(tantivy::tokenizer::SimpleTokenizer::default()),
        )
        .build();


I assume we'll make this all configurable later, right?

BubbleCal · 2024-07-13

Yes, this just makes it work for now.
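
For readers unfamiliar with the effect of a stop-word filter on an inverted index, the following is a minimal std-only sketch. It does not use tantivy's `StopWordFilter`; the tokenizer, the stop-word list passed by the caller, and the `token -> (row_id -> frequency)` layout are simplifying assumptions for illustration:

```rust
use std::collections::{HashMap, HashSet};

/// Lowercases, splits on non-alphanumeric characters and drops stop words.
fn tokenize(text: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !stop_words.contains(t.as_str()))
        .collect()
}

/// Builds token -> (row_id -> frequency) posting lists in HashMaps.
fn build_index(
    docs: &[(u64, &str)],
    stop_words: &HashSet<&str>,
) -> HashMap<String, HashMap<u64, u32>> {
    let mut index: HashMap<String, HashMap<u64, u32>> = HashMap::new();
    for (row_id, text) in docs {
        for token in tokenize(text, stop_words) {
            // Stop words never reach this point, so no posting list grows
            // to cover nearly every row in the dataset.
            *index.entry(token).or_default().entry(*row_id).or_insert(0) += 1;
        }
    }
    index
}
```

Without the filter, a word such as "the" would get a posting list containing almost every row, and any query mentioning it would have to score all of them.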

github-actions bot added the performance label Jul 8, 2024

BubbleCal added 2 commits July 8, 2024 19:53

fmt

15b40b0

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

smaller dataset

8686d58

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested review from eddyxu and westonpace July 8, 2024 13:13

wjones127 reviewed Jul 8, 2024

fix comments

ff61924

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested a review from wjones127 July 9, 2024 04:33

BubbleCal added 3 commits July 9, 2024 16:59

Merge branch 'main' of https://github.com/lancedb/lance into optimize…

61c8528

…-inverted

Merge branch 'main' of https://github.com/lancedb/lance into optimize…

4f98179

…-inverted

fix

c646e3d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

wjones127 approved these changes Jul 12, 2024

BubbleCal merged commit b092f00 into lancedb:main Jul 13, 2024
22 checks passed
