Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HllSketch Merge Aggregator optimizations #15162

Merged
merged 11 commits into from
Nov 3, 2023

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented Oct 15, 2023

Description

This PR attempts to improve the performance of HllSketchMergeAggregators.

Improvements

  • Serialize to null instead of an empty sketch when converting to byte array. This will allow future ingestions to write null, which is faster to read and is slightly cheaper than storing an empty sketch of 8 bytes.
  • Directly check memory to see if the sketch is empty, based on certain flags. This will provide the benefits of the above without reingesting the data.
  • Add cache for hll union objects in HllSketchMergeAggregators

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@gianm
Copy link
Contributor

gianm commented Oct 26, 2023

FYI I benchmarked this patch on a dataset with 11 groups and a few million sketches going into those groups (so, each Union would get hundreds-of-thousands of calls). It's about 37% faster.

Query                   Status    Iter     Min     Avg  Median     Max   OK?
watch_byuser_hll_12_4      200      25     454     463     461     511     Y [patch]
watch_byuser_hll_12_4      200      25     622     636     636     665     Y [master]

I also tried the case where the number of groups and number of sketches is similar. In this case the patch is slightly slower. I think it's worth it on balance though. The speedup in the speedup case is a lot bigger than the slowdown in the slowdown case.

@adarshsanjeev
Copy link
Contributor Author

@gianm Thanks for running perf tests on this PR! I've tried running a perf test on a sketch column with ~20M empty sketches and observed somewhat comparable results.

Regarding the PR, while testing, I've noticed one issue with MSQ and estimation on an empty sketch, returning null instead of 0 after this change. I'll look into resolving it.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 30, 2023
@adarshsanjeev
Copy link
Contributor Author

Resolved the above

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Fwiw, the check we are doing to make sure that things are null should be safe, but it's also sad that we are needing to do it. It would probably be nice to file a ticket against the apache datasketches project pointing them to what we are doing to identify that the sketch is effectively empty and ask if they can give us a less invasive (and maybe even faster) way of doing the same thing.

@asdf2014 asdf2014 merged commit 9576fd3 into apache:master Nov 3, 2023
53 checks passed
@adarshsanjeev
Copy link
Contributor Author

@imply-cheddar Sure, I'll follow up on that

CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
* Null byte serde for empty sketches

* Cache for HllSketchMerge

* Check for empty sketches

* Address review comments

* Revert changes to HllSketchHolder

* Handle null sketch holders instead of null sketches

* Add unit test for MSQ HllSketch

* Add comments

* Fix style
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
gianm added a commit to gianm/druid that referenced this pull request Feb 7, 2024
The prior code from apache#15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.
gianm added a commit to gianm/druid that referenced this pull request Feb 7, 2024
The prior code from apache#15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.
cryptoe pushed a commit that referenced this pull request Feb 8, 2024
* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch.

The prior code from #15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.
adarshsanjeev pushed a commit to adarshsanjeev/druid that referenced this pull request Feb 8, 2024
…e#15860)

* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch.

The prior code from apache#15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Feb 8, 2024
…e#15860)

* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch.

The prior code from apache#15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.
LakshSingla pushed a commit that referenced this pull request Feb 8, 2024
… (#15861)

* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch.

The prior code from #15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Extension Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants