perf: coalesce ids before executing take #2680
Conversation
@@ -1584,9 +1585,10 @@ impl Scanner {
        projection: &Schema,
        batch_readahead: usize,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        let coalesced = Arc::new(CoalesceBatchesExec::new(input, self.get_batch_size()));
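As a rough illustration of what coalescing buys here (a hypothetical sketch, not lance's or DataFusion's actual code), small batches of row ids can be merged up to a target batch size before any `take` is issued, mirroring what `CoalesceBatchesExec` does for record batches:

```rust
// Hypothetical sketch (not lance's actual implementation): merge small
// batches of row ids into batches of at least `batch_size` before any
// `take` is issued.
fn coalesce_id_batches(batches: Vec<Vec<u64>>, batch_size: usize) -> Vec<Vec<u64>> {
    let mut out = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    for batch in batches {
        current.extend(batch);
        if current.len() >= batch_size {
            out.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}

fn main() {
    // Three highly selective filter results, one matching row each.
    let filtered = vec![vec![500], vec![10_000], vec![300_000]];
    let coalesced = coalesce_id_batches(filtered, 1024);
    // All three ids now land in a single take call.
    assert_eq!(coalesced, vec![vec![500, 10_000, 300_000]]);
}
```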
Do we want to sort row IDs to offer a better chance we can do sequential reads?
We already do that internally.
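For context on the sorting point: ordering the gathered row ids turns scattered point lookups into forward-only reads that an I/O layer can merge into sequential range requests. A trivial standalone sketch (assumed ids for illustration, not lance's internals):

```rust
fn sorted_unique_ids(mut ids: Vec<u64>) -> Vec<u64> {
    // Sort so the subsequent reads proceed in storage order,
    // then drop duplicate ids so no row is fetched twice.
    ids.sort_unstable();
    ids.dedup();
    ids
}

fn main() {
    // Hypothetical: ids arrive in filter/batch order, not storage order.
    let ids = sorted_unique_ids(vec![300_000, 500, 10_000, 500]);
    assert_eq!(ids, vec![500, 10_000, 300_000]);
}
```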
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##             main    #2680      +/-   ##
==========================================
+ Coverage   79.32%   79.34%   +0.01%
==========================================
  Files         226      226
  Lines       66872    66886      +14
  Branches    66872    66886      +14
==========================================
+ Hits        53049    53069      +20
- Misses      10720    10724       +4
+ Partials     3103     3093      -10
…the number of times we call `take`, which has a number of performance benefits.
Late materialization is a great benefit when executing a highly selective filter. However, if a filter is highly selective, each input batch will probably only have a few matching rows. The current implementation executes `take` for each filtered batch: instead of a single call `take(500, 10000, 300000)` we get three calls, `take(500)`, `take(10000)`, and `take(300000)`. On cloud storage I see a 10x-plus benefit in scan performance from coalescing the ids first.
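The round-trip arithmetic behind that speedup can be sketched as follows (the 50 ms latency figure is an assumption for illustration, not a measured number):

```rust
fn scan_latency_ms(num_take_calls: u64, round_trip_ms: u64) -> u64 {
    // Each take() pays at least one object-store round trip, so
    // coalescing k calls into one saves (k - 1) round trips.
    num_take_calls * round_trip_ms
}

fn main() {
    let rt = 50; // assumed cloud round-trip latency in ms (illustrative only)
    let before = scan_latency_ms(3, rt); // take(500), take(10000), take(300000)
    let after = scan_latency_ms(1, rt);  // take(500, 10000, 300000)
    assert!(after < before);
}
```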
We have a benchmark for this (EDA search plot 4) which should assist with preventing regression in the future: https://bencher.dev/console/projects/weston-lancedb/plots