-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HllSketch Merge Aggregator optimizations #15162
Conversation
.../java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchHolderObjectStrategy.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchHolderObjectStrategy.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchHolderObjectStrategy.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchHolderObjectStrategy.java
Outdated
Show resolved
Hide resolved
FYI I benchmarked this patch on a dataset with 11 groups and a few million sketches going into those groups (so, each Union would get hundreds-of-thousands of calls). It's about 37% faster.
I also tried the case where the number of groups and number of sketches is similar. In this case the patch is slightly slower. I think it's worth it on balance though. The speedup in the speedup case is a lot bigger than the slowdown in the slowdown case. |
@gianm Thanks for running perf tests on this PR! I've tried running a perf test on a sketch column with ~20M empty sketches and observed somewhat comparable results. Regarding the PR, while testing, I've noticed one issue with MSQ and estimation on an empty sketch, returning null instead of 0 after this change. I'll look into resolving it. |
Resolved the above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. Fwiw, the check we are doing to make sure that things are null should be safe, but it's also sad that we are needing to do it. It would probably be nice to file a ticket against the apache datasketches project pointing them to what we are doing to identify that the sketch is effectively empty and ask if they can give us a less invasive (and maybe even faster) way of doing the same thing.
...a/org/apache/druid/query/aggregation/datasketches/hll/HllSketchToEstimatePostAggregator.java
Outdated
Show resolved
Hide resolved
@imply-cheddar Sure, I'll follow up on that |
* Null byte serde for empty sketches * Cache for HllSketchMerge * Check for empty sketches * Address review comments * Revert changes to HllSketchHolder * Handle null sketch holders instead of null sketches * Add unit test for MSQ HllSketch * Add comments * Fix style
The prior code from apache#15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty.
The prior code from apache#15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty.
* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch. The prior code from #15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty.
…e#15860) * Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch. The prior code from apache#15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty.
…e#15860) * Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch. The prior code from apache#15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty.
… (#15861) * Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch. The prior code from #15162 was reading only the low-order byte of an int representing the size of a coupon set. As a result, it would erroneously believe that a coupon set with a multiple of 256 elements was empty. Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Description
This PR attempts to improve the performance of HllSketchMergeAggregators.
Improvements
This PR has: