perf: various fixes to improve shuffling performance at high scales #2710
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2710 +/- ##
==========================================
+ Coverage 79.13% 79.21% +0.07%
==========================================
Files 227 227
Lines 67398 67573 +175
Branches 67398 67573 +175
==========================================
+ Hits 53338 53529 +191
+ Misses 10956 10947 -9
+ Partials 3104 3097 -7
Flags with carried forward coverage won't be shown.
Avoid expensive concatenation by building partitions in-place
Parallelize shuffling
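To illustrate the idea behind the commit messages above, here is a minimal sketch in plain Python. It is not the code from this PR: the function names (`partition_with_concat`, `partition_in_place`, `partition_parallel`) and the `(partition_id, value)` row layout are hypothetical, chosen only for illustration. The baseline re-copies the accumulated rows on every batch, while the in-place variant counts rows first, allocates each partition buffer once, and writes rows directly into their slots. The parallel variant simply gives each worker a disjoint slice of the batches; in CPython the threads mostly show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_with_concat(batches, num_partitions):
    # Baseline (assumed, for contrast): each batch produces small per-partition
    # chunks that are concatenated onto the running result, so the rows
    # accumulated so far are copied again for every batch.
    parts = [[] for _ in range(num_partitions)]
    for batch in batches:
        for pid in range(num_partitions):
            chunk = [value for p, value in batch if p == pid]
            parts[pid] = parts[pid] + chunk  # allocates a new list each time
    return parts


def partition_in_place(batches, num_partitions):
    # Pass 1: count rows per partition so every buffer is allocated exactly once.
    counts = [0] * num_partitions
    for batch in batches:
        for pid, _ in batch:
            counts[pid] += 1
    buffers = [[None] * n for n in counts]

    # Pass 2: write each row straight into its final slot; no concatenation.
    cursors = [0] * num_partitions
    for batch in batches:
        for pid, value in batch:
            buffers[pid][cursors[pid]] = value
            cursors[pid] += 1
    return buffers


def partition_parallel(batches, num_partitions, workers=4):
    # Hypothetical parallel layout: each worker partitions a disjoint slice of
    # the batches and returns its own per-partition buffers.
    groups = [batches[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda g: partition_in_place(g, num_partitions), groups)
        )
```

Both single-threaded functions return the same partitions in the same row order; the parallel one returns one set of partition buffers per worker, which a later stage could write out or read independently.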
LGTM, just a NIT comment
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
With this fix, shuffling 1M rows goes from ~2.8 seconds to ~0.8 seconds (roughly 3.5x faster). At 1B rows we can shuffle in ~6 minutes.