
feat: stable row id support in queries #2452

Merged
merged 10 commits into lancedb:main
Jun 26, 2024

Conversation

wjones127
Contributor

@wjones127 wjones127 commented Jun 7, 2024

Part of #2307

  • Turns on unit tests to validate we can use ANN and scalar indices with move-stable row ids.
  • Changed pre-filter to support move-stable row ids
    • Major change is that the deletion mask is no longer always a block list. With address-style row ids it still is, but with move-stable row ids it becomes an allow list instead.
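The allow-list vs. block-list distinction above can be sketched minimally as follows, with hypothetical types (the PR's actual mask is a RowIdMask backed by tree maps, not hash sets):

```rust
use std::collections::HashSet;

// Hypothetical simplification of a deletion mask: either a block list
// (ids known to be deleted) or an allow list (ids known to survive).
enum DeletionMask {
    BlockList(HashSet<u64>),
    AllowList(HashSet<u64>),
}

impl DeletionMask {
    // A row id passes the prefilter if it has not been deleted.
    fn selected(&self, row_id: u64) -> bool {
        match self {
            DeletionMask::BlockList(deleted) => !deleted.contains(&row_id),
            DeletionMask::AllowList(alive) => alive.contains(&row_id),
        }
    }
}
```

Both variants answer the same question ("is this row still live?"), but an allow list must enumerate survivors rather than deletions, which is why it only appears when the block list can't be computed cheaply.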

@github-actions github-actions bot added enhancement New feature or request python labels Jun 7, 2024
@wjones127 wjones127 mentioned this pull request Jun 7, 2024
13 tasks
@codecov-commenter

codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 90.56974% with 48 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (8ccd191) to head (1141cc6).
Report is 1 commit behind head on main.

Files Patch % Lines
rust/lance-core/src/utils/mask.rs 83.44% 25 Missing ⚠️
rust/lance/src/dataset/rowids.rs 86.53% 1 Missing and 6 partials ⚠️
rust/lance/src/io/exec/scalar_index.rs 86.95% 5 Missing and 1 partial ⚠️
rust/lance/src/index/prefilter.rs 96.91% 0 Missing and 5 partials ⚠️
rust/lance-index/src/scalar/expression.rs 50.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset.rs 0.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/scanner.rs 85.71% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/take.rs 50.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/transaction.rs 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2452      +/-   ##
==========================================
+ Coverage   79.63%   79.78%   +0.14%     
==========================================
  Files         207      207              
  Lines       59179    59590     +411     
  Branches    59179    59590     +411     
==========================================
+ Hits        47129    47545     +416     
+ Misses       9260     9253       -7     
- Partials     2790     2792       +2     
Flag Coverage Δ
unittests 79.78% <90.56%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.

@wjones127 wjones127 added the experimental Features that are experimental label Jun 21, 2024
Contributor Author

These changes aren't necessarily related. I just noticed while debugging that the output schema was a lie.

@wjones127 wjones127 marked this pull request as ready for review June 21, 2024 18:43
Contributor

@westonpace westonpace left a comment

Looks great. I can't say I understand the internals well enough to know what the perf impact on prefiltering will be, but if there is any, I think we can address it later.

Comment on lines +55 to +67
fn without_column(&self, column_name: &str) -> Schema {
let fields: Vec<FieldRef> = self
.fields()
.iter()
.filter(|f| f.name() != column_name)
.cloned()
.collect();
Self::new_with_metadata(fields, self.metadata.clone())
}
Contributor

Should we document the contract with field ids (this method leaves them unchanged it appears)?

Contributor Author

This operates on Arrow schemas, so there are no implications for field ids.


/// Insert a range of values into the set
pub fn insert_range<R: RangeBounds<u64>>(&mut self, range: R) -> u64 {
let (mut start_high, mut start_low) = match range.start_bound() {
Contributor

I'm a little confused here. start_high / start_low are u32. Are they fragment IDs? Or row offsets?

Contributor Author

As we move to generic u64 row ids, the high and low bits no longer consistently have the semantics of fragment ids and row offsets, so I renamed them "high" and "low".
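The bit-splitting in question can be sketched with illustrative helpers (not the PR's actual code):

```rust
// Split a u64 row id into its high and low u32 halves. With address-style
// ids the high half happens to be a fragment id and the low half a row
// offset; with move-stable ids the halves carry no such meaning.
fn split(row_id: u64) -> (u32, u32) {
    ((row_id >> 32) as u32, row_id as u32)
}

// Reassemble a u64 row id from its halves.
fn join(high: u32, low: u32) -> u64 {
    ((high as u64) << 32) | low as u64
}
```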

Comment on lines +269 to +273
/// These row ids may either be stable-style (where they can be an incrementing
/// u64 sequence) or address style, where they are a fragment id and a row offset.
/// When address style, this supports setting entire fragments as selected,
/// without needing to enumerate all the ids in the fragment.
///
Contributor

Took me a minute to figure this out. Even if these are stable-style we store them in the same nested 32-bit map structure? (though we will have far fewer entries in the outer map)

Contributor Author

Yes. To be fair, the roaring crate has a RoaringTreemap that it uses to store u64 values, so this isn't all that different. The vast majority of tables will just use a single RoaringBitmap here, since I expect most won't have over 4 billion rows.
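A rough sketch of that nested layout, using a std BTreeSet as a stand-in for each inner RoaringBitmap (illustrative names, not Lance's actual structure):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in for the nested u64 set: the outer map is keyed by the high
// 32 bits, and each inner set (a RoaringBitmap in the real code) holds
// low 32-bit values. A table with stable ids and fewer than ~4 billion
// rows only ever touches the single outer entry at key 0.
struct U64Set {
    inner: BTreeMap<u32, BTreeSet<u32>>,
}

impl U64Set {
    fn new() -> Self {
        Self { inner: BTreeMap::new() }
    }

    fn insert(&mut self, value: u64) {
        self.inner
            .entry((value >> 32) as u32)
            .or_default()
            .insert(value as u32);
    }

    fn contains(&self, value: u64) -> bool {
        self.inner
            .get(&((value >> 32) as u32))
            .map_or(false, |lows| lows.contains(&(value as u32)))
    }
}
```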

Comment on lines +764 to +766
// We don't test removing from a full fragment, because that would take
// a lot of memory.
Contributor

I've been making large tests and just marking them as ignored. It's helpful during refactoring to manually make sure I didn't botch anything, but I don't know how maintainable it is.

@@ -269,6 +269,39 @@ impl RowIdSequence {
}
}

impl From<&RowIdSequence> for RowIdTreeMap {
Contributor

These utilities we are building are very cool.

@@ -551,7 +551,7 @@ impl Transaction {
manifest.tag.clone_from(&self.tag);

if config.auto_set_feature_flags {
apply_feature_flags(&mut manifest)?;
apply_feature_flags(&mut manifest, config.use_move_stable_row_ids)?;
Contributor

So what are the rules on this config option? Is it the same as the legacy format rules (only applies to new datasets) or can you adjust the row id style on demand?

Contributor Author

Right now, it should only be settable on new datasets. And once set, it cannot be unset.
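That "set once, never unset" rule could be sketched like this; the names and shapes are illustrative, not the PR's actual apply_feature_flags implementation:

```rust
// Hypothetical simplification: once the manifest records that a dataset
// uses move-stable row ids, a later commit cannot turn the flag off.
#[derive(Default)]
struct Manifest {
    uses_move_stable_row_ids: bool,
}

fn apply_feature_flags(
    manifest: &mut Manifest,
    use_move_stable_row_ids: bool,
) -> Result<(), String> {
    if manifest.uses_move_stable_row_ids && !use_move_stable_row_ids {
        return Err("move-stable row ids cannot be disabled once enabled".to_string());
    }
    // The flag can only move from false to true, never back.
    manifest.uses_move_stable_row_ids |= use_move_stable_row_ids;
    Ok(())
}
```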

Comment on lines +169 to +172
/// Sometimes this will be a block list of row ids that are deleted, based
/// on the deletion files in the fragments. If stable row ids are used and
/// there are missing fragments, this may instead be an allow list, since
/// we can't easily compute the block list.
Contributor

Hmm...if it is an allow list I think this may have some performance impact on prefiltering?

Contributor Author

This is possible. I will be doing further benchmarking to understand this later.

@@ -325,14 +334,14 @@ impl DisplayAs for MaterializeIndexExec {
}
}

struct FragIdIter {
src: Arc<Vec<Fragment>>,
struct FragIdIter<'a> {
Contributor

+1 for overcoming my early fear of lifetime variables 😆

retain_fragments(&mut allow_list, fragments, dataset).await?;

if let Some(allow_list_iter) = allow_list.row_ids() {
Ok(allow_list_iter.map(u64::from).collect::<Vec<_>>())
Contributor

Why is map(u64::from) needed here?

Contributor Author

.row_ids() returns RowAddress values, while the return type is u64. The conversion should be a no-op when compiled.
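An illustrative newtype shows why the map(u64::from) is a zero-cost conversion; this is not Lance's actual RowAddress definition, just the same shape:

```rust
// A newtype wrapping a u64, mirroring the RowAddress-to-u64 conversion.
// Since the wrapper holds exactly a u64, From compiles down to a no-op.
#[derive(Clone, Copy, PartialEq, Debug)]
struct RowAddress(u64);

impl From<RowAddress> for u64 {
    fn from(addr: RowAddress) -> u64 {
        addr.0
    }
}

// Collect an iterator of addresses into plain u64 row ids.
fn to_row_ids(addrs: Vec<RowAddress>) -> Vec<u64> {
    addrs.into_iter().map(u64::from).collect()
}
```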

Comment on lines +428 to +432
async fn row_ids_for_mask(
mask: RowIdMask,
dataset: &Dataset,
fragments: &[Fragment],
) -> Result<Vec<u64>> {
Contributor

We are potentially materializing a huge list here but I see now we always were. In the future this path should definitely be avoided when we have only a block list but I think we already have a TODO for that.

Contributor Author

That's a good point. I'll definitely be examining the performance here closely soon.

@wjones127 wjones127 merged commit 63227f4 into lancedb:main Jun 26, 2024
21 of 22 checks passed
Labels
enhancement New feature or request experimental Features that are experimental python