
feat: stable row id support in queries #2452

Merged
merged 10 commits into lancedb:main
Jun 26, 2024

Conversation

wjones127
Contributor

@wjones127 wjones127 commented Jun 7, 2024

Part of #2307

  • Turns on unit tests to validate we can use ANN and scalar indices with move-stable row ids.
  • Changed pre-filter to support move-stable row ids
    • Major change is that the deletion mask is no longer always a block list. With address-style row ids it still is, but with move-stable row ids it becomes an allow list instead.
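The allow-list vs. block-list distinction above can be sketched minimally as follows, with hypothetical types (the PR's actual mask is a RowIdMask backed by tree maps, not hash sets):

```rust
use std::collections::HashSet;

// Hypothetical simplification of a deletion mask: either a block list
// (ids known to be deleted) or an allow list (ids known to survive).
enum DeletionMask {
    BlockList(HashSet<u64>),
    AllowList(HashSet<u64>),
}

impl DeletionMask {
    // A row id passes the prefilter if it has not been deleted.
    fn selected(&self, row_id: u64) -> bool {
        match self {
            DeletionMask::BlockList(deleted) => !deleted.contains(&row_id),
            DeletionMask::AllowList(alive) => alive.contains(&row_id),
        }
    }
}
```

Both variants answer the same question ("is this row still live?"), but an allow list must enumerate survivors rather than deletions, which is why it only appears when the block list can't be computed cheaply.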

@github-actions github-actions bot added enhancement New feature or request python labels Jun 7, 2024
@wjones127 wjones127 mentioned this pull request Jun 7, 2024
13 tasks
@codecov-commenter

codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 90.56974% with 48 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (8ccd191) to head (1141cc6).
Report is 1 commit behind head on main.

Files Patch % Lines
rust/lance-core/src/utils/mask.rs 83.44% 25 Missing ⚠️
rust/lance/src/dataset/rowids.rs 86.53% 1 Missing and 6 partials ⚠️
rust/lance/src/io/exec/scalar_index.rs 86.95% 5 Missing and 1 partial ⚠️
rust/lance/src/index/prefilter.rs 96.91% 0 Missing and 5 partials ⚠️
rust/lance-index/src/scalar/expression.rs 50.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset.rs 0.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/scanner.rs 85.71% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/take.rs 50.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/transaction.rs 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2452      +/-   ##
==========================================
+ Coverage   79.63%   79.78%   +0.14%     
==========================================
  Files         207      207              
  Lines       59179    59590     +411     
  Branches    59179    59590     +411     
==========================================
+ Hits        47129    47545     +416     
+ Misses       9260     9253       -7     
- Partials     2790     2792       +2     
Flag Coverage Δ
unittests 79.78% <90.56%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.

@wjones127 wjones127 added the experimental Features that are experimental label Jun 21, 2024
Contributor Author

These changes aren't necessarily related. I just noticed while debugging that the output schema was a lie.

@wjones127 wjones127 marked this pull request as ready for review June 21, 2024 18:43
Contributor

@westonpace westonpace left a comment

Looks great. I can't say I understand the internals well enough to know what the perf impact on prefiltering will be, but if there is any, I think we can address it later.

Comment on lines +55 to +67
fn without_column(&self, column_name: &str) -> Schema {
let fields: Vec<FieldRef> = self
.fields()
.iter()
.filter(|f| f.name() != column_name)
.cloned()
.collect();
Self::new_with_metadata(fields, self.metadata.clone())
}
Contributor

Should we document the contract with field ids (this method leaves them unchanged it appears)?

Contributor Author

This operates on Arrow schemas, so there are no implications for field ids.


/// Insert a range of values into the set
pub fn insert_range<R: RangeBounds<u64>>(&mut self, range: R) -> u64 {
let (mut start_high, mut start_low) = match range.start_bound() {
Contributor

I'm a little confused here. start_high / start_low are u32. Are they fragment IDs? Or row offsets?

Contributor Author

As we move to generic u64 row ids, the high and low bits no longer consistently have the semantics of fragment ids and row offsets, so I renamed them "high" and "low".
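The bit-splitting in question can be sketched with illustrative helpers (not the PR's actual code):

```rust
// Split a u64 row id into its high and low u32 halves. With address-style
// ids the high half happens to be a fragment id and the low half a row
// offset; with move-stable ids the halves carry no such meaning.
fn split(row_id: u64) -> (u32, u32) {
    ((row_id >> 32) as u32, row_id as u32)
}

// Reassemble a u64 row id from its halves.
fn join(high: u32, low: u32) -> u64 {
    ((high as u64) << 32) | low as u64
}
```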

Comment on lines +269 to +273
/// These row ids may either be stable-style (where they can be an incrementing
/// u64 sequence) or address style, where they are a fragment id and a row offset.
/// When address style, this supports setting entire fragments as selected,
/// without needing to enumerate all the ids in the fragment.
///
Contributor

Took me a minute to figure this out. Even if these are stable-style we store them in the same nested 32-bit map structure? (though we will have far fewer entries in the outer map)

Contributor Author

Yes. To be fair, the roaring crate has a RoaringTreemap that it uses to store u64 values, so this isn't all that different. The vast majority of tables will just use a single RoaringBitmap here, since I expect most won't have over 4 billion rows.
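A rough sketch of that nested layout, using a std BTreeSet as a stand-in for each inner RoaringBitmap (illustrative names, not Lance's actual structure):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in for the nested u64 set: the outer map is keyed by the high
// 32 bits, and each inner set (a RoaringBitmap in the real code) holds
// low 32-bit values. A table with stable ids and fewer than ~4 billion
// rows only ever touches the single outer entry at key 0.
struct U64Set {
    inner: BTreeMap<u32, BTreeSet<u32>>,
}

impl U64Set {
    fn new() -> Self {
        Self { inner: BTreeMap::new() }
    }

    fn insert(&mut self, value: u64) {
        self.inner
            .entry((value >> 32) as u32)
            .or_default()
            .insert(value as u32);
    }

    fn contains(&self, value: u64) -> bool {
        self.inner
            .get(&((value >> 32) as u32))
            .map_or(false, |lows| lows.contains(&(value as u32)))
    }
}
```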

Comment on lines +764 to +766
// We don't test removing from a full fragment, because that would take
// a lot of memory.
Contributor

I've been making large tests and just marking them as ignored. It's helpful during refactoring to manually make sure I didn't botch anything, but I don't know how maintainable it is.

@@ -269,6 +269,39 @@ impl RowIdSequence {
}
}

impl From<&RowIdSequence> for RowIdTreeMap {
Contributor

These utilities we are building are very cool.

@@ -551,7 +551,7 @@ impl Transaction {
manifest.tag.clone_from(&self.tag);

if config.auto_set_feature_flags {
apply_feature_flags(&mut manifest)?;
apply_feature_flags(&mut manifest, config.use_move_stable_row_ids)?;
Contributor

So what are the rules on this config option? Is it the same as the legacy format rules (only applies to new datasets) or can you adjust the row id style on demand?

Contributor Author

Right now, it should only be settable on new datasets. And once set, it cannot be unset.
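That "set once, never unset" rule could be sketched like this; the names and shapes are illustrative, not the PR's actual apply_feature_flags implementation:

```rust
// Hypothetical simplification: once the manifest records that a dataset
// uses move-stable row ids, a later commit cannot turn the flag off.
#[derive(Default)]
struct Manifest {
    uses_move_stable_row_ids: bool,
}

fn apply_feature_flags(
    manifest: &mut Manifest,
    use_move_stable_row_ids: bool,
) -> Result<(), String> {
    if manifest.uses_move_stable_row_ids && !use_move_stable_row_ids {
        return Err("move-stable row ids cannot be disabled once enabled".to_string());
    }
    // The flag can only move from false to true, never back.
    manifest.uses_move_stable_row_ids |= use_move_stable_row_ids;
    Ok(())
}
```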

Comment on lines +169 to +172
/// Sometimes this will be a block list of row ids that are deleted, based
/// on the deletion files in the fragments. If stable row ids are used and
/// there are missing fragments, this may instead be an allow list, since
/// we can't easily compute the block list.
Contributor

Hmm...if it is an allow list I think this may have some performance impact on prefiltering?

Contributor Author

This is possible. I will be doing further benchmarking to understand this later.

@@ -325,14 +334,14 @@ impl DisplayAs for MaterializeIndexExec {
}
}

struct FragIdIter {
src: Arc<Vec<Fragment>>,
struct FragIdIter<'a> {
Contributor

+1 for overcoming my early fear of lifetime variables 😆

retain_fragments(&mut allow_list, fragments, dataset).await?;

if let Some(allow_list_iter) = allow_list.row_ids() {
Ok(allow_list_iter.map(u64::from).collect::<Vec<_>>())
Contributor

Why is map(u64::from) needed here?

Contributor Author

.row_ids() returns RowAddress values, while the return type is u64. The conversion should be a no-op when compiled.
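An illustrative newtype shows why the map(u64::from) is a zero-cost conversion; this is not Lance's actual RowAddress definition, just the same shape:

```rust
// A newtype wrapping a u64, mirroring the RowAddress-to-u64 conversion.
// Since the wrapper holds exactly a u64, From compiles down to a no-op.
#[derive(Clone, Copy, PartialEq, Debug)]
struct RowAddress(u64);

impl From<RowAddress> for u64 {
    fn from(addr: RowAddress) -> u64 {
        addr.0
    }
}

// Collect an iterator of addresses into plain u64 row ids.
fn to_row_ids(addrs: Vec<RowAddress>) -> Vec<u64> {
    addrs.into_iter().map(u64::from).collect()
}
```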

Comment on lines +428 to +432
async fn row_ids_for_mask(
mask: RowIdMask,
dataset: &Dataset,
fragments: &[Fragment],
) -> Result<Vec<u64>> {
Contributor

We are potentially materializing a huge list here but I see now we always were. In the future this path should definitely be avoided when we have only a block list but I think we already have a TODO for that.

Contributor Author

That's a good point. I'll definitely be examining the performance here closely soon.

@wjones127 wjones127 merged commit 63227f4 into lancedb:main Jun 26, 2024
21 of 22 checks passed
Labels
enhancement New feature or request experimental Features that are experimental python