Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hyperLogLogPlus aggregation function for distinct count #11346

Merged
merged 3 commits into from
Sep 21, 2023

Conversation

deemoliu
Copy link
Contributor

@deemoliu deemoliu commented Aug 14, 2023

feature: add DistinctCountHLLPlus function.

Context:
currently pinot is using Clearspring implementation of HLL algorithm. https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java
reference: HLL paper: https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

we have some customers observed ElasticSearch dataset to Pinot, Pinot is using HLL algorithm while ES is using HLL++, user reported that HLL++ has higher accuracy than HLL when cardinality of dimension is at 10k-100k.
reference: HLL++ paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf

We tried to solve the above issue by tuning the log2m parameters of HLL (distinctCountHLL) functions, then we observed the CPU usage increased and bring the cluster into unstable state.

This PR tried to bridge the gap between ES and Pinot by introducing HyperLogLogPlusPlus to Pinot. Clearspring already has support for the HLL++. https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java

Approximation Benchmark:

Test Query

SELECT
  DISTINCTCOUNTHLLPLUS(some_id, 12)
FROM
  test_pinot_table
where
  time_ms >= 1694380000000
  time_ms <= 1694382000000
  and filter1 > 0
  AND filter2 = 0
  AND filter3 > 0
  AND filter4 != TRUE

Approximation Benchmarking result

Result

  Query1 Query2 Query3
Distinct Count 144730 2797 204807
HLL (default log2m=8) 158486 3138 191046
HLL log2m=9 155441 2850 209481
HLL log2m=10 148871 2797 204871
HLL++ default p=14 143821 2787 204371
hLL++ p=13 145069 2774 206735
HLL++ p=12 145329 2770 209274
HLL++ p=11 145251 2720 210161
HLL++ p=10 147969 2679 214758
HLL++ p=9 147234 2767 201620
HLL++ p=8 141377 2653 184344

@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2023

Codecov Report

Merging #11346 (f0cf7f2) into master (2306298) will increase coverage by 0.02%.
Report is 4 commits behind head on master.
The diff coverage is 52.09%.

@@             Coverage Diff              @@
##             master   #11346      +/-   ##
============================================
+ Coverage     63.07%   63.10%   +0.02%     
- Complexity     1110     1121      +11     
============================================
  Files          2326     2342      +16     
  Lines        124918   125686     +768     
  Branches      19145    19306     +161     
============================================
+ Hits          78792    79311     +519     
- Misses        40510    40714     +204     
- Partials       5616     5661      +45     
Flag Coverage Δ
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 63.06% <52.09%> (+0.01%) ⬆️
java-17 62.89% <52.09%> (-0.04%) ⬇️
java-20 62.89% <52.09%> (+12.96%) ⬆️
temurin 63.10% <52.09%> (+0.02%) ⬆️
unittests 63.09% <52.09%> (+0.02%) ⬆️
unittests1 67.20% <52.09%> (-0.09%) ⬇️
unittests2 14.46% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...rg/apache/pinot/core/plan/AggregationPlanNode.java 72.46% <ø> (ø)
...ggregator/DistinctCountHLLPlusValueAggregator.java 0.00% <0.00%> (ø)
...gment/local/aggregator/ValueAggregatorFactory.java 81.25% <0.00%> (-5.42%) ⬇️
...inot/segment/local/utils/HyperLogLogPlusUtils.java 0.00% <0.00%> (ø)
...va/org/apache/pinot/spi/utils/CommonConstants.java 28.00% <ø> (ø)
...he/pinot/segment/local/utils/CustomSerDeUtils.java 37.97% <20.00%> (-2.61%) ⬇️
...ion/DistinctCountHLLPlusMVAggregationFunction.java 38.80% <38.80%> (ø)
...ction/DistinctCountHLLPlusAggregationFunction.java 47.64% <47.64%> (ø)
...perator/query/NonScanBasedAggregationOperator.java 77.87% <57.89%> (-4.04%) ⬇️
...nction/FrequentLongsSketchAggregationFunction.java 60.97% <60.97%> (ø)
... and 9 more

... and 42 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@deemoliu deemoliu force-pushed the wip-hll-fix-test-failure branch 3 times, most recently from c8948ae to 7cbea4a Compare September 8, 2023 00:48
@deemoliu deemoliu changed the title [WIP] Add hyperLogLogPlus aggregation function for distinct count Add hyperLogLogPlus aggregation function for distinct count Sep 11, 2023
@deemoliu
Copy link
Contributor Author

@Jackie-Jiang @chenboat @cbalci can you please review this PR?

@deemoliu
Copy link
Contributor Author

Test Query

SELECT
  DISTINCTCOUNTHLLPLUS(some_id, 12)
FROM
  test_pinot_table
where
  time_ms >= 1694380000000
  time_ms <= 1694382000000
  and filter1 > 0
  AND filter2 = 0
  AND filter3 > 0
  AND filter4 != TRUE

Approximation Benchmarking result

Result

  Query1 Query2 Query3
Distinct Count 144730 2797 204807
HLL (default log2m=8) 158486 3138 191046
HLL log2m=9 155441 2850 209481
HLL log2m=10 148871 2797 204871
HLL++ default p=14 143821 2787 204371
hLL++ p=13 145069 2774 206735
HLL++ p=12 145329 2770 209274
HLL++ p=11 145251 2720 210161
HLL++ p=10 147969 2679 214758
HLL++ p=9 147234 2767 201620
HLL++ p=8 141377 2653 184344

public DistinctCountHLLPlusAggregationFunction(List<ExpressionContext> arguments) {
super(arguments.get(0));
int numExpressions = arguments.size();
// This function expects 1 or 2 arguments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update javadoc, should we expect 2 or 3 instead based on the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank @chenboat for review. this function epxects 1 or 2 or 3 arguments.

@siddharthteotia
Copy link
Contributor

I'd like to spend some time reviewing this today / tomorrow.

@deemoliu
Copy link
Contributor Author

I'd like to spend some time reviewing this today / tomorrow.

@siddharthteotia thanks for reviewing. I will fix this conflict in this hour.

Copy link
Contributor

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Would you mind adding more info to this PR, like what functionalities are added, what's the difference between the raw one and the non-raw one, what the purpose of your benchmarking test is?

}
return hllplus;
} catch (Exception e) {
throw new RuntimeException("Caught exception while merging HyperLogLogs", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Caught exception while merging HyperLogLogsPlus

public int getMaxAggregatedValueByteSize() {
// NOTE: For aggregated metrics, initial aggregated value might have not been generated. Returns the byte size
// based on log2m.
return _maxByteSize > 0 ? _maxByteSize : 100000000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line doesn't seem to align with the comment above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jackjlli for pointing out. let me uncomment the line below and remove this line.

@deemoliu deemoliu force-pushed the wip-hll-fix-test-failure branch 2 times, most recently from 2a995c4 to 9308869 Compare September 14, 2023 22:25
@Jackie-Jiang Jackie-Jiang added feature documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Sep 17, 2023
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

}
return hllplus;
} catch (Exception e) {
throw new RuntimeException("Caught exception while merging HyperLogLogs", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Revise the message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @Jackie-Jiang i fixed the error message.

private static HyperLogLogPlus getDistinctCountHLLPlusResult(Dictionary dictionary,
DistinctCountHLLPlusAggregationFunction function) {
if (dictionary.getValueType() == FieldSpec.DataType.BYTES) {
// Treat BYTES value as serialized HyperLogLog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Revise the comment


@Override
public HyperLogLogPlus deserialize(ByteBuffer byteBuffer) {
// NOTE: The passed in byte buffer is always BIG ENDIAN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Remove this comment as it doesn't apply

Comment on lines +216 to +220
try {
return HyperLogLogPlus.Builder.build(bytes);
} catch (IOException e) {
throw new RuntimeException("Caught exception while de-serializing HyperLogLogPlus", e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try {
return HyperLogLogPlus.Builder.build(bytes);
} catch (IOException e) {
throw new RuntimeException("Caught exception while de-serializing HyperLogLogPlus", e);
}
return deserialize(bytes);

@deemoliu
Copy link
Contributor Author

thanks @Jackie-Jiang @jackjlli @siddharthteotia and @chenboat for review. I will address comment today.

Copy link
Contributor

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but LGTM. Thanks for adding the aggregation function!

try {
HyperLogLogPlus hyperLogLogPlus = aggregationResultHolder.getResult();
if (hyperLogLogPlus != null) {
for (int i = 0; i < length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these two for loop can be merged. We just need to check if (hyperLogLogPlus == null). If yes, assign the first one to this reference. The following logic should be the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for code review! addressed the code comment, please take a look

hyperLogLogPlus.addAll(ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[i]));
}
}
for (int i = 0; i < length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: int i = 1;

if (hyperLogLogPlus == null) {
hyperLogLogPlus = ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[0]);
aggregationResultHolder.setValue(hyperLogLogPlus);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the else block if the check is not qualified:

} else {
    hyperLogLogPlus.addAll(ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[0]));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got your point, updated, thanks!

@chenboat chenboat merged commit 3501b86 into apache:master Sep 21, 2023
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants