-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality #8189
Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality #8189
Conversation
…e distinct values in Set or HyperLogLog based on cardinality
Codecov Report
@@ Coverage Diff @@
## master #8189 +/- ##
=============================================
- Coverage 71.34% 14.08% -57.27%
+ Complexity 4307 81 -4226
=============================================
Files 1623 1579 -44
Lines 84320 82940 -1380
Branches 12642 12569 -73
=============================================
- Hits 60162 11683 -48479
- Misses 20028 70388 +50360
+ Partials 4130 869 -3261
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
&& functionType != AggregationFunctionType.MINMAXRANGE | ||
&& functionType != AggregationFunctionType.DISTINCTCOUNT | ||
&& functionType != AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT) { | ||
if (!DICTIONARY_BASED_FUNCTIONS.contains(aggregationFunction.getType())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I made the same change on a local branch 😁
IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize); | ||
for (int dictId = 0; dictId < dictionarySize; dictId++) { | ||
intSet.add(dictionary.getIntValue(dictId)); | ||
} | ||
return intSet; | ||
case LONG: | ||
LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize); | ||
for (int dictId = 0; dictId < dictionarySize; dictId++) { | ||
longSet.add(dictionary.getLongValue(dictId)); | ||
} | ||
return longSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be a set, I think a roaringbitmap might be better
case FLOAT: | ||
FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize); | ||
for (int dictId = 0; dictId < dictionarySize; dictId++) { | ||
floatSet.add(dictionary.getFloatValue(dictId)); | ||
} | ||
return floatSet; | ||
case DOUBLE: | ||
DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize); | ||
for (int dictId = 0; dictId < dictionarySize; dictId++) { | ||
doubleSet.add(dictionary.getDoubleValue(dictId)); | ||
} | ||
return doubleSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could convert to int/long bits and store in a roaringbitmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be beneficial:
RoaringBitmap bitmap = new RoaringBitmap();
FloatOpenHashSet set = new FloatOpenHashSet();
long bitmapBefore = GraphLayout.parseInstance(bitmap).totalSize();
long setBefore = GraphLayout.parseInstance(set).totalSize();
for (int i = 0; i < 1 << 20; i++) {
float f = ThreadLocalRandom.current().nextFloat() * ThreadLocalRandom.current().nextLong();
bitmap.add(Float.floatToIntBits(f));
set.add(f);
}
System.err.println("bitmap: " + ((GraphLayout.parseInstance(bitmap).totalSize() - bitmapBefore) >>> 20) + "MB");
System.err.println(GraphLayout.parseInstance(bitmap).toFootprint());
System.err.println("set: " + ((GraphLayout.parseInstance(set).totalSize() - setBefore) >>> 20) + "MB");
System.err.println(GraphLayout.parseInstance(set).toFootprint());
bitmap: 2MB
org.roaringbitmap.RoaringBitmap@36a6bea6d footprint:
COUNT AVG SUM DESCRIPTION
3618 706 2555408 [C
1 15008 15008 [Lorg.roaringbitmap.Container;
3617 24 86808 org.roaringbitmap.ArrayContainer
1 24 24 org.roaringbitmap.RoaringArray
1 16 16 org.roaringbitmap.RoaringBitmap
7238 2657264 (total)
set: 7MB
it.unimi.dsi.fastutil.floats.FloatOpenHashSet@a62c7cdd footprint:
COUNT AVG SUM DESCRIPTION
1 8388632 8388632 [F
1 48 48 it.unimi.dsi.fastutil.floats.FloatOpenHashSet
2 8388680 (total)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this doesn't work well for double
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is an approximate algorithm and floating point numbers are approximations, I don't see a downside in loss of precision by converting double
to float
and then storing the int
bits in a RoaringBitmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like converting double
to float
would lead to a relative error of around 0.5% in the typical case, but it doesn't have an upper bound
@richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap gives better performance for sparse values? Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current |
The worst case depends on the size of the set. The absolute worst case is more than 32 bits per value, this would happen if you had 2^16 values with a gap of roughly 2^16 between each value in the set. The worst case for a set more than 2^16 values decreases monotonically. If we have to maintain absolute accuracy below the threshold, we can't truncate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any comments about the code. Maybe try to use less memory than an IntSet
or LongSet
when possible unless this complicates the code too much.
@richardstartin I'd go with the set based solution for now to have the same behavior with |
…e distinct values in Set or HyperLogLog based on cardinality (apache#8189) Adds `DistinctCountSmartHLLAggregationFunction` which can automatically convert the `Set` to `HyperLogLog` if the set size grows too big to protect the servers from running out of memory. This conversion only applies to aggregation only queries, but not the group-by queries. By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12. The log2m and threshold can be configured using the second argument (literal) of the function: - `hllLog2m`: log2m of the converted HyperLogLog (default 12) - `hllConversionThreshold`: set size threshold to trigger the conversion, non-positive means never convert (default 100K) Example query: `SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable`
Description
Adds
DistinctCountSmartHLLAggregationFunction
which can automatically convert theSet
toHyperLogLog
if the set size grows too big to protect the servers from running out of memory. This conversion only applies to aggregation only queries, but not the group-by queries.By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:
hllLog2m
: log2m of the converted HyperLogLog (default 12)hllConversionThreshold
: set size threshold to trigger the conversion, non-positive means never convert (default 100K)Example query:
SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable
Release Notes
Adds
DistinctCountSmartHLLAggregationFunction
which automatically stores distinct values in Set or HyperLogLog based on cardinality