Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality #8189

Merged
merged 1 commit into from
Feb 14, 2022

Conversation

Jackie-Jiang
Copy link
Contributor

Description

Adds DistinctCountSmartHLLAggregationFunction which can automatically convert the Set to HyperLogLog if the set size grows too big to protect the servers from running out of memory. This conversion only applies to aggregation only queries, but not the group-by queries.

By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:

  • hllLog2m: log2m of the converted HyperLogLog (default 12)
  • hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)

Example query:
SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable

Release Notes

Adds DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

…e distinct values in Set or HyperLogLog based on cardinality
@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Feb 10, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #8189 (e7c5165) into master (b12b7bb) will decrease coverage by 57.26%.
The diff coverage is 0.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #8189       +/-   ##
=============================================
- Coverage     71.34%   14.08%   -57.27%     
+ Complexity     4307       81     -4226     
=============================================
  Files          1623     1579       -44     
  Lines         84320    82940     -1380     
  Branches      12642    12569       -73     
=============================================
- Hits          60162    11683    -48479     
- Misses        20028    70388    +50360     
+ Partials       4130      869     -3261     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 ?
unittests2 14.08% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ator/query/DictionaryBasedAggregationOperator.java 0.00% <0.00%> (-87.31%) ⬇️
...rg/apache/pinot/core/plan/AggregationPlanNode.java 0.00% <0.00%> (-91.00%) ⬇️
...gregation/function/AggregationFunctionFactory.java 0.00% <0.00%> (-83.34%) ⬇️
...tion/DistinctCountSmartHLLAggregationFunction.java 0.00% <0.00%> (ø)
...che/pinot/segment/spi/AggregationFunctionType.java 0.00% <0.00%> (-90.25%) ⬇️
...ain/java/org/apache/pinot/core/data/table/Key.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/pinot/core/data/table/Record.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/pinot/core/util/GroupByUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/table/FSTType.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1299 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b12b7bb...e7c5165. Read the comment docs.

&& functionType != AggregationFunctionType.MINMAXRANGE
&& functionType != AggregationFunctionType.DISTINCTCOUNT
&& functionType != AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT) {
if (!DICTIONARY_BASED_FUNCTIONS.contains(aggregationFunction.getType())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I made the same change on a local branch 😁

Comment on lines +130 to +140
IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
for (int dictId = 0; dictId < dictionarySize; dictId++) {
intSet.add(dictionary.getIntValue(dictId));
}
return intSet;
case LONG:
LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
for (int dictId = 0; dictId < dictionarySize; dictId++) {
longSet.add(dictionary.getLongValue(dictId));
}
return longSet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be a set, I think a roaringbitmap might be better

Comment on lines +141 to +152
case FLOAT:
FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
for (int dictId = 0; dictId < dictionarySize; dictId++) {
floatSet.add(dictionary.getFloatValue(dictId));
}
return floatSet;
case DOUBLE:
DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
for (int dictId = 0; dictId < dictionarySize; dictId++) {
doubleSet.add(dictionary.getDoubleValue(dictId));
}
return doubleSet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could convert to int/long bits and store in a roaringbitmap

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to be beneficial:

      RoaringBitmap bitmap = new RoaringBitmap();
      FloatOpenHashSet set = new FloatOpenHashSet();
      long bitmapBefore = GraphLayout.parseInstance(bitmap).totalSize();
      long setBefore = GraphLayout.parseInstance(set).totalSize();
      for (int i = 0; i < 1 << 20; i++) {
        float f = ThreadLocalRandom.current().nextFloat() * ThreadLocalRandom.current().nextLong();
        bitmap.add(Float.floatToIntBits(f));
        set.add(f);
      }
      System.err.println("bitmap: " + ((GraphLayout.parseInstance(bitmap).totalSize() - bitmapBefore) >>> 20) + "MB");
      System.err.println(GraphLayout.parseInstance(bitmap).toFootprint());
      System.err.println("set: " + ((GraphLayout.parseInstance(set).totalSize() - setBefore) >>> 20) + "MB");
      System.err.println(GraphLayout.parseInstance(set).toFootprint());
bitmap: 2MB
org.roaringbitmap.RoaringBitmap@36a6bea6d footprint:
     COUNT       AVG       SUM   DESCRIPTION
      3618       706   2555408   [C
         1     15008     15008   [Lorg.roaringbitmap.Container;
      3617        24     86808   org.roaringbitmap.ArrayContainer
         1        24        24   org.roaringbitmap.RoaringArray
         1        16        16   org.roaringbitmap.RoaringBitmap
      7238             2657264   (total)


set: 7MB
it.unimi.dsi.fastutil.floats.FloatOpenHashSet@a62c7cdd footprint:
     COUNT       AVG       SUM   DESCRIPTION
         1   8388632   8388632   [F
         1        48        48   it.unimi.dsi.fastutil.floats.FloatOpenHashSet
         2             8388680   (total)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this doesn't work well for double

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is an approximate algorithm and floating point numbers are approximations, I don't see a downside in loss of precision by converting double to float and then storing the int bits in a RoaringBitmap

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like converting double to float would lead to a relative error of around 0.5% in the typical case, but it doesn't have an upper bound

@Jackie-Jiang
Copy link
Contributor Author

@richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap gives better performance for sparse values?

Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current DISTINCT_COUNT in certain environments (configurable)

@richardstartin
Copy link
Member

@richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap gives better performance for sparse values?

Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current DISTINCT_COUNT in certain environments (configurable)

The worst case depends on the size of the set. The absolute worst case is more than 32 bits per value, this would happen if you had 2^16 values with a gap of roughly 2^16 between each value in the set. The worst case for a set more than 2^16 values decreases monotonically.

If we have to maintain absolute accuracy below the threshold, we can't truncate double to float, but hopefully users don't want to distinct count doubles anyway, and it's a meaningless operation given the nature of floating point numbers.

Copy link
Member

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any comments about the code. Maybe try to use less memory than an IntSet or LongSet when possible unless this complicates the code too much.

@Jackie-Jiang
Copy link
Contributor Author

@richardstartin I'd go with the set based solution for now to have the same behavior with distinct_count for low cardinality case. We can revisit both functions after collecting more info in the future. In the meanwhile, if user knows bitmap based solution performs better, they can use DistinctCountBitmap instead.

@Jackie-Jiang Jackie-Jiang merged commit 273e516 into apache:master Feb 14, 2022
@Jackie-Jiang Jackie-Jiang deleted the distinct_count_smart_hll branch February 14, 2022 23:05
xiangfu0 pushed a commit to xiangfu0/pinot that referenced this pull request Feb 23, 2022
…e distinct values in Set or HyperLogLog based on cardinality (apache#8189)

Adds `DistinctCountSmartHLLAggregationFunction` which can automatically convert the `Set` to `HyperLogLog` if the set size grows too big to protect the servers from running out of memory. This conversion only applies to aggregation only queries, but not the group-by queries.

By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:
- `hllLog2m`: log2m of the converted HyperLogLog (default 12)
- `hllConversionThreshold`: set size threshold to trigger the conversion, non-positive means never convert (default 100K)

Example query:
`SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants