Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PercentileKLL aggregation function #10643

Merged
merged 7 commits into from
May 16, 2023

Conversation

cbalci
Copy link
Contributor

@cbalci cbalci commented Apr 19, 2023

Introducing a new approximate percentile calculation function PercentileKLL and its variations (MV & Raw), using Apache Datasketches libraries 'KLL'.

This is part of a proposal to improve Apache Datasketches support in Pinot:
(Google Docs Link) [Proposal] Improved Apache DataSketches Support in Pinot

Some advantages listed and discussed in the linked document:

  • Well defined error bound (comparison to t-Digest)
  • Faster updates, serialization/deserialization
  • Binary compatibility with external systems, hence the ability to use Pinot as sketch store
  • Ability to compute ‘Rank’ and ‘Histogram’ besides ‘Percentile’
  • Feature parity with Druid

Please leave design related comments on the linked document and code related comments in this PR.

Testing

  • Added unit tests to cover basic use cases that call PercentileKLL, PercentileKLLMV, PercentileRawKLL, PercentileRawKLLMV
  • Added tests to cover group by scenarios
  • Manually tested ingesting raw (externally generated) data sketches

feature performance

@codecov-commenter
Copy link

codecov-commenter commented Apr 19, 2023

Codecov Report

Merging #10643 (396e2c1) into master (579082c) will decrease coverage by 50.75%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #10643       +/-   ##
=============================================
- Coverage     64.40%   13.66%   -50.75%     
+ Complexity     6446      439     -6007     
=============================================
  Files          2098     2103        +5     
  Lines        113315   113521      +206     
  Branches      17204    17255       +51     
=============================================
- Hits          72984    15511    -57473     
- Misses        35082    96740    +61658     
+ Partials       5249     1270     -3979     
Flag Coverage Δ
unittests1 ?
unittests2 13.66% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/pinot/core/common/ObjectSerDeUtils.java 0.00% <0.00%> (-90.68%) ⬇️
...gregation/function/AggregationFunctionFactory.java 0.00% <0.00%> (-83.34%) ⬇️
...ion/function/PercentileKLLAggregationFunction.java 0.00% <0.00%> (ø)
...n/function/PercentileKLLMVAggregationFunction.java 0.00% <0.00%> (ø)
.../function/PercentileRawKLLAggregationFunction.java 0.00% <0.00%> (ø)
...unction/PercentileRawKLLMVAggregationFunction.java 0.00% <0.00%> (ø)
...inot/segment/local/customobject/SerializedKLL.java 0.00% <0.00%> (ø)
...che/pinot/segment/spi/AggregationFunctionType.java 0.00% <0.00%> (-91.59%) ⬇️

... and 1444 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more


_percentile = arguments.get(1).getLiteral().getDoubleValue();
if (numArguments == 3) {
_kValue = arguments.get(2).getLiteral().getIntValue();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar check should be done on the K value. There's could be some memory implications from setting the K to be too high: https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html

Copy link
Contributor Author

@cbalci cbalci Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library already returns a good exception (SketchesArgumentException: K must be >= 8 and <= 65535), so I didn't see much value in doing another check here.

For the memory implication, I think it should be up to the user to decide if the size of the sketch is going to be a problem.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Did not know about SketchesArgumentException. That should be good enough if bubbled up properly

@cbalci cbalci requested review from chenboat and chiquelo May 1, 2023 21:47
Copy link

@chiquelo chiquelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm


_percentile = arguments.get(1).getLiteral().getDoubleValue();
if (numArguments == 3) {
_kValue = arguments.get(2).getLiteral().getIntValue();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Did not know about SketchesArgumentException. That should be good enough if bubbled up properly

* Compares two sketches for a specific percentile value.
*/
public class SerializedKLL implements Comparable<SerializedKLL> {
private final double _quantile;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need _quantile inside this SerializedKLL?
if we want to return the KllDoublesSketch as byte[], the toString() is for that purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. That is used when we want to sort (ORDER BY) sketches.

@chenboat chenboat merged commit 41fbb32 into apache:master May 16, 2023
@navina
Copy link
Contributor

navina commented May 17, 2023

@cbalci / @chenboat : has this feature been documented ? is it available in the OSS docs?

@cbalci
Copy link
Contributor Author

cbalci commented May 17, 2023

@navina, not yet. I'll add it to OSS documentation soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants