Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality #8189

Jackie-Jiang · 2022-02-10T22:32:36Z

Description

Adds DistinctCountSmartHLLAggregationFunction, which automatically converts the Set to a HyperLogLog when the set grows too large, to protect the servers from running out of memory. The conversion applies only to aggregation-only queries, not to group-by queries.
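
The conversion described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this PR: `SmartDistinctCounter`, its methods, the hash mixer, and the simplified HyperLogLog estimator are all hypothetical stand-ins, and the real function works on Pinot's own set and HyperLogLog types.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: an exact Set that switches to a minimal HyperLogLog. */
final class SmartDistinctCounter {
  private final int log2m;      // m = 2^log2m registers after conversion
  private final int threshold;  // non-positive means never convert
  private Set<Long> exact = new HashSet<>();
  private byte[] registers;     // stays null until the conversion happens

  SmartDistinctCounter(int log2m, int threshold) {
    if (log2m < 7 || log2m > 20) {
      // Limitation of this sketch: the alpha constant below assumes m >= 128.
      throw new IllegalArgumentException("this sketch supports 7 <= log2m <= 20");
    }
    this.log2m = log2m;
    this.threshold = threshold;
  }

  void add(long value) {
    if (registers != null) {
      offer(value);
      return;
    }
    exact.add(value);
    if (threshold > 0 && exact.size() > threshold) {
      // Convert: replay the exact values into the registers, then drop the set.
      registers = new byte[1 << log2m];
      for (long v : exact) {
        offer(v);
      }
      exact = null;
    }
  }

  boolean isConverted() {
    return registers != null;
  }

  long cardinality() {
    if (registers == null) {
      return exact.size(); // exact result below the threshold
    }
    int m = registers.length;
    double sum = 0.0;
    int zeros = 0;
    for (byte r : registers) {
      sum += Math.scalb(1.0, -r); // 2^-r
      if (r == 0) {
        zeros++;
      }
    }
    double estimate = 0.7213 / (1.0 + 1.079 / m) * m * m / sum;
    if (estimate <= 2.5 * m && zeros > 0) {
      estimate = m * Math.log((double) m / zeros); // small-range (linear counting) correction
    }
    return Math.round(estimate);
  }

  private void offer(long value) {
    long h = mix(value);
    int index = (int) (h >>> (64 - log2m)); // top log2m bits pick the register
    int rank = Math.min(Long.numberOfLeadingZeros(h << log2m), 64 - log2m) + 1;
    if (rank > registers[index]) {
      registers[index] = (byte) rank;
    }
  }

  // SplitMix64-style finalizer, used here only as a convenient 64-bit hash.
  private static long mix(long z) {
    z += 0x9E3779B97F4A7C15L;
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  public static void main(String[] args) {
    SmartDistinctCounter counter = new SmartDistinctCounter(12, 100_000);
    for (long v = 0; v < 500_000; v++) {
      counter.add(v);
    }
    System.out.println("converted=" + counter.isConverted() + " estimate=" + counter.cardinality());
  }
}
```

Below the threshold the sketch returns the exact set size. After conversion the result is an estimate whose memory cost is fixed by the register count rather than by the number of distinct values.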

By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:

hllLog2m: log2m of the converted HyperLogLog (default 12)
hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)

Example query:
SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable
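
As a rough illustration of how a `key=value;key=value` literal like the one in the example query could be interpreted, here is a small parser sketch. The class `SmartHllParams`, its fields, and the choice to reject unknown keys and match keys case-sensitively are assumptions made for the example; the PR's actual parsing code may differ.

```java
/** Illustrative sketch only: parses the second-argument literal into two settings. */
final class SmartHllParams {
  static final int DEFAULT_HLL_LOG2M = 12;                      // default from the description
  static final int DEFAULT_HLL_CONVERSION_THRESHOLD = 100_000;  // default from the description

  final int hllLog2m;
  final int hllConversionThreshold;

  private SmartHllParams(int hllLog2m, int hllConversionThreshold) {
    this.hllLog2m = hllLog2m;
    this.hllConversionThreshold = hllConversionThreshold;
  }

  static SmartHllParams parse(String literal) {
    int log2m = DEFAULT_HLL_LOG2M;
    int threshold = DEFAULT_HLL_CONVERSION_THRESHOLD;
    for (String pair : literal.split(";")) {
      String trimmed = pair.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate an empty literal or a trailing ';'
      }
      int eq = trimmed.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + trimmed);
      }
      String key = trimmed.substring(0, eq).trim();
      int value = Integer.parseInt(trimmed.substring(eq + 1).trim());
      switch (key) {
        case "hllLog2m":
          log2m = value;
          break;
        case "hllConversionThreshold":
          threshold = value; // non-positive means never convert
          break;
        default:
          // Assumption of this sketch: unknown keys are rejected rather than ignored.
          throw new IllegalArgumentException("Unknown parameter: " + key);
      }
    }
    return new SmartHllParams(log2m, threshold);
  }

  public static void main(String[] args) {
    SmartHllParams p = parse("hllLog2m=8;hllConversionThreshold=10");
    System.out.println("hllLog2m=" + p.hllLog2m + " hllConversionThreshold=" + p.hllConversionThreshold);
  }
}
```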

Release Notes

Adds DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

codecov-commenter · 2022-02-10T23:03:17Z

Codecov Report

Merging #8189 (e7c5165) into master (b12b7bb) will decrease coverage by 57.26%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #8189       +/-   ##
=============================================
- Coverage     71.34%   14.08%   -57.27%     
+ Complexity     4307       81     -4226     
=============================================
  Files          1623     1579       -44     
  Lines         84320    82940     -1380     
  Branches      12642    12569       -73     
=============================================
- Hits          60162    11683    -48479     
- Misses        20028    70388    +50360     
+ Partials       4130      869     -3261

Flag	Coverage Δ
integration1	`?`
integration2	`?`
unittests1	`?`
unittests2	`14.08% <0.00%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
...ator/query/DictionaryBasedAggregationOperator.java	`0.00% <0.00%> (-87.31%)`	⬇️
...rg/apache/pinot/core/plan/AggregationPlanNode.java	`0.00% <0.00%> (-91.00%)`	⬇️
...gregation/function/AggregationFunctionFactory.java	`0.00% <0.00%> (-83.34%)`	⬇️
...tion/DistinctCountSmartHLLAggregationFunction.java	`0.00% <0.00%> (ø)`
...che/pinot/segment/spi/AggregationFunctionType.java	`0.00% <0.00%> (-90.25%)`	⬇️
...ain/java/org/apache/pinot/core/data/table/Key.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../java/org/apache/pinot/core/data/table/Record.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../java/org/apache/pinot/core/util/GroupByUtils.java	`0.00% <0.00%> (-100.00%)`	⬇️
...ava/org/apache/pinot/spi/config/table/FSTType.java	`0.00% <0.00%> (-100.00%)`	⬇️
... and 1299 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

richardstartin · 2022-02-11T13:40:47Z

pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java

-          && functionType != AggregationFunctionType.MINMAXRANGE
-          && functionType != AggregationFunctionType.DISTINCTCOUNT
-          && functionType != AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT) {
+      if (!DICTIONARY_BASED_FUNCTIONS.contains(aggregationFunction.getType())) {


+1, I made the same change on a local branch 😁
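
For readers without the full diff, the refactor in this hunk replaces a chain of `!=` comparisons with a single set-membership test. A minimal sketch of that pattern follows. The `FunctionType` enum is a hypothetical stand-in for `AggregationFunctionType`, and the set's contents are an assumption beyond the three types visible in the removed lines.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only: membership in an EnumSet instead of chained comparisons. */
final class DictionaryBasedCheck {
  // Hypothetical stand-in for AggregationFunctionType.
  enum FunctionType { SUM, MINMAXRANGE, DISTINCTCOUNT, SEGMENTPARTITIONEDDISTINCTCOUNT, DISTINCTCOUNTSMARTHLL }

  // Assumed contents: the three types from the removed lines, plus the new function.
  static final Set<FunctionType> DICTIONARY_BASED_FUNCTIONS = EnumSet.of(
      FunctionType.MINMAXRANGE,
      FunctionType.DISTINCTCOUNT,
      FunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT,
      FunctionType.DISTINCTCOUNTSMARTHLL);

  /** Returns true only if every function can be answered from the dictionary. */
  static boolean allDictionaryBased(FunctionType... types) {
    for (FunctionType type : types) {
      if (!DICTIONARY_BASED_FUNCTIONS.contains(type)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(allDictionaryBased(FunctionType.DISTINCTCOUNT, FunctionType.MINMAXRANGE));
    System.out.println(allDictionaryBased(FunctionType.DISTINCTCOUNT, FunctionType.SUM));
  }
}
```

Adding a new dictionary-based function then means adding one entry to the set rather than extending the condition.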

richardstartin · 2022-02-11T13:42:06Z

...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java

+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;


Does this have to be a set? I think a RoaringBitmap might be better.

richardstartin · 2022-02-11T13:42:30Z

...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java

+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;


Could convert to int/long bits and store in a roaringbitmap

this appears to be beneficial:

RoaringBitmap bitmap = new RoaringBitmap();
FloatOpenHashSet set = new FloatOpenHashSet();
long bitmapBefore = GraphLayout.parseInstance(bitmap).totalSize();
long setBefore = GraphLayout.parseInstance(set).totalSize();
for (int i = 0; i < 1 << 20; i++) {
  float f = ThreadLocalRandom.current().nextFloat() * ThreadLocalRandom.current().nextLong();
  bitmap.add(Float.floatToIntBits(f));
  set.add(f);
}
System.err.println("bitmap: " + ((GraphLayout.parseInstance(bitmap).totalSize() - bitmapBefore) >>> 20) + "MB");
System.err.println(GraphLayout.parseInstance(bitmap).toFootprint());
System.err.println("set: " + ((GraphLayout.parseInstance(set).totalSize() - setBefore) >>> 20) + "MB");
System.err.println(GraphLayout.parseInstance(set).toFootprint());

bitmap: 2MB
org.roaringbitmap.RoaringBitmap@36a6bea6d footprint:
     COUNT       AVG       SUM   DESCRIPTION
      3618       706   2555408   [C
         1     15008     15008   [Lorg.roaringbitmap.Container;
      3617        24     86808   org.roaringbitmap.ArrayContainer
         1        24        24   org.roaringbitmap.RoaringArray
         1        16        16   org.roaringbitmap.RoaringBitmap
      7238             2657264   (total)

set: 7MB
it.unimi.dsi.fastutil.floats.FloatOpenHashSet@a62c7cdd footprint:
     COUNT       AVG       SUM   DESCRIPTION
         1   8388632   8388632   [F
         1        48        48   it.unimi.dsi.fastutil.floats.FloatOpenHashSet
         2             8388680   (total)

But this doesn't work well for double

Given that this is an approximate algorithm and floating point numbers are approximations, I don't see a downside in loss of precision by converting double to float and then storing the int bits in a RoaringBitmap

It looks like converting double to float would lead to a relative error of around 0.5% in the typical case, but it doesn't have an upper bound

Jackie-Jiang · 2022-02-11T18:01:37Z

@richardstartin Good suggestion on storing values in a bitmap for better performance and a lower memory footprint. Is my understanding correct that in the worst case, for 32-bit values, we will use up to 16 bits per value when storing them in a bitmap (not including metadata)? For 64-bit values, does a long bitmap give better performance for sparse values?

Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current DISTINCT_COUNT in certain environments (configurable)

richardstartin · 2022-02-11T18:59:35Z

The worst case depends on the size of the set. The absolute worst case is more than 32 bits per value; this would happen if you had 2^16 values with a gap of roughly 2^16 between each value in the set. For a set of more than 2^16 values, the worst case decreases monotonically as the set grows.
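
To make the worst case concrete: a RoaringBitmap groups 32-bit values by their high 16 bits, with one container per distinct high-16-bit key holding the low 16 bits. The sketch below counts how many containers a set of values would need. `ContainerCount` and its helpers are hypothetical names and use only the JDK, not the RoaringBitmap library itself.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: counts distinct high-16-bit keys, i.e. containers needed. */
final class ContainerCount {
  static int containersNeeded(int[] values) {
    Set<Integer> highKeys = new HashSet<>();
    for (int v : values) {
      highKeys.add(v >>> 16); // the high 16 bits select the container
    }
    return highKeys.size();
  }

  /** n consecutive values starting at 0. */
  static int[] dense(int n) {
    int[] values = new int[n];
    for (int i = 0; i < n; i++) {
      values[i] = i;
    }
    return values;
  }

  /** n values spaced 2^16 apart, so each lands in its own container. */
  static int[] spread(int n) {
    int[] values = new int[n];
    for (int i = 0; i < n; i++) {
      values[i] = i << 16;
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println("dense containers:  " + containersNeeded(dense(1 << 16)));
    System.out.println("spread containers: " + containersNeeded(spread(1 << 16)));
  }
}
```

With one value per container, every value pays for its own container object and array header on top of its 16-bit payload, which is how the per-value cost can exceed 32 bits.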

If we have to maintain absolute accuracy below the threshold, we can't truncate double to float, but hopefully users don't want to distinct count doubles anyway, and it's a meaningless operation given the nature of floating point numbers.
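
The accuracy concern can be shown with a few lines of plain Java. In this sketch (`NarrowingDemo` and its methods are hypothetical names), float values round-trip through their int bits without loss, while doubles that differ by less than a float can represent collapse into one value when narrowed.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: distinct counts before and after narrowing double to float. */
final class NarrowingDemo {
  static int exactDistinct(double[] values) {
    Set<Long> bits = new HashSet<>();
    for (double v : values) {
      bits.add(Double.doubleToLongBits(v));
    }
    return bits.size();
  }

  static int narrowedDistinct(double[] values) {
    Set<Integer> bits = new HashSet<>();
    for (double v : values) {
      bits.add(Float.floatToIntBits((float) v)); // the narrowing cast loses precision
    }
    return bits.size();
  }

  /** Float to int bits and back; lossless for non-NaN values. */
  static float roundTrip(float f) {
    return Float.intBitsToFloat(Float.floatToIntBits(f));
  }

  public static void main(String[] args) {
    double[] values = {1.0, 1.0 + 1e-10, 1.0 + 2e-10};
    System.out.println("exact distinct:    " + exactDistinct(values));
    System.out.println("narrowed distinct: " + narrowedDistinct(values));
  }
}
```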

richardstartin

I don't have any comments about the code. Maybe try to use less memory than an IntSet or LongSet when possible unless this complicates the code too much.

Jackie-Jiang · 2022-02-14T23:04:41Z

@richardstartin I'd go with the set-based solution for now, to keep the same behavior as distinct_count in the low-cardinality case. We can revisit both functions after collecting more information. In the meantime, if users know the bitmap-based solution performs better, they can use DistinctCountBitmap instead.

Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

e7c5165

Jackie-Jiang added the release-notes label Feb 10, 2022

Jackie-Jiang requested review from xiangfu0, richardstartin and snleee February 10, 2022 22:32

Jackie-Jiang mentioned this pull request Feb 10, 2022

For DISTINCT_COUNT, automatically convert Set to HyperLogLog when cardinality is too high #8074

Closed

richardstartin reviewed Feb 11, 2022

richardstartin approved these changes Feb 11, 2022

Jackie-Jiang merged commit 273e516 into apache:master Feb 14, 2022

Jackie-Jiang deleted the distinct_count_smart_hll branch February 14, 2022 23:05

Jackie-Jiang mentioned this pull request Jun 7, 2022

DistinctCount with low selectivity #7887

Closed
