perf: coalesce ids before executing take #2680

Merged
merged 3 commits into from
Aug 13, 2024

westonpace

Late materialization is a great benefit when executing a highly selective filter. However, a highly selective filter also means that each input batch will probably contain only a few matching rows. The current implementation executes take once per filtered batch. For example, instead of a single call take(500, 10000, 300000), we get three calls: take(500), take(10000), and take(300000). This means:

  • We can't coalesce reads across batches
  • More CPU overhead (many calls to take_ranges)
  • Very small output batches (user's batch size is not respected)
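
The idea of coalescing ids before a take can be illustrated with a minimal, standalone sketch. It is not the code from this PR (which wraps the input in DataFusion's `CoalesceBatchesExec`); the helper name `coalesce_row_ids` and the plain `Vec<u64>` representation of row ids are assumptions made for illustration only.

```rust
/// Hypothetical helper (not part of Lance): buffer the row ids that survive a
/// filter across many small batches and emit groups of at most `batch_size`
/// ids, so that each group can be served by a single take call.
fn coalesce_row_ids(filtered: &[Vec<u64>], batch_size: usize) -> Vec<Vec<u64>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut groups = Vec::new();
    let mut pending: Vec<u64> = Vec::with_capacity(batch_size);
    for batch in filtered {
        for &id in batch {
            pending.push(id);
            // Flush as soon as a full group has accumulated.
            if pending.len() == batch_size {
                groups.push(std::mem::replace(
                    &mut pending,
                    Vec::with_capacity(batch_size),
                ));
            }
        }
    }
    // Emit whatever is left as a final, possibly short, group.
    if !pending.is_empty() {
        groups.push(pending);
    }
    groups
}
```

With the three single-row batches from the example above and a batch size of 1024, this yields one group containing 500, 10000, and 300000, which corresponds to one take call instead of three.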

On cloud storage I see a more than 10x improvement in scan performance.

We have a benchmark for this (EDA search plot 4), which should help prevent regressions in the future: https://bencher.dev/console/projects/weston-lancedb/plots

@@ -1584,9 +1585,10 @@ impl Scanner {
projection: &Schema,
batch_readahead: usize,
) -> Result<Arc<dyn ExecutionPlan>> {
let coalesced = Arc::new(CoalesceBatchesExec::new(input, self.get_batch_size()));

Do we want to sort row IDs to offer a better chance we can do sequential reads?

We already do that internally.
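
As a rough illustration of why sorted row ids help with sequential reads, the sketch below sorts a set of ids and merges adjacent ones into contiguous ranges. It is only a sketch under stated assumptions: the function name `ids_to_ranges` is hypothetical, duplicates are dropped for simplicity, and it does not reproduce how Lance handles this internally.

```rust
use std::ops::Range;

/// Hypothetical helper (not Lance's internal code): sort row ids and merge
/// runs of consecutive ids into half-open ranges that can be read sequentially.
/// Assumption: duplicate ids are dropped, which a real take may not want.
fn ids_to_ranges(mut ids: Vec<u64>) -> Vec<Range<u64>> {
    ids.sort_unstable();
    ids.dedup();
    let mut ranges: Vec<Range<u64>> = Vec::new();
    for id in ids {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range when this id directly follows it.
            if last.end == id {
                last.end = id + 1;
                continue;
            }
        }
        // Otherwise start a new single-row range.
        ranges.push(id..id + 1);
    }
    ranges
}
```

For the ids 7, 3, 4, 5, 100, 8 this produces the ranges 3..6, 7..9, and 100..101, so six scattered row lookups become three sequential reads.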

codecov-commenter commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79.34%. Comparing base (30b3df7) to head (08cd611).

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-encoding/src/decoder.rs | 20.00% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2680      +/-   ##
==========================================
+ Coverage   79.32%   79.34%   +0.01%     
==========================================
  Files         226      226              
  Lines       66872    66886      +14     
  Branches    66872    66886      +14     
==========================================
+ Hits        53049    53069      +20     
- Misses      10720    10724       +4     
+ Partials     3103     3093      -10     

| Flag | Coverage Δ |
|---|---|
| unittests | 79.34% <80.95%> (+0.01%) ⬆️ |

@westonpace westonpace merged commit 711bad7 into lancedb:main Aug 13, 2024
22 checks passed