Add RareTerms aggregation #35718

Merged 33 commits into elastic:master on Jul 1, 2019

Conversation

polyfractal
Contributor

This adds a rare_terms aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts.

This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: MAX_LONG (or worse, ordering a terms agg by count ascending, which has unbounded error).

This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a bloom filter. If a future term is found in the bloom filter we assume it was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
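
In rough Java, the collection strategy is something like the sketch below. This is a minimal illustration only: the class and field names (RareTermsSketch, seenCommon) are made up for the example, a plain HashSet stands in for the probabilistic bloom filter, and the real aggregator works per values-source type with ES data structures rather than java.util collections.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the rare_terms collection idea described above (illustrative names only).
class RareTermsSketch {
    private final long maxDocCount;                          // e.g. 1 => only keep terms seen once
    private final Map<String, Long> candidateCounts = new HashMap<>();
    private final Set<String> seenCommon = new HashSet<>();  // real impl: approximate (bloom) filter

    RareTermsSketch(long maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    void collect(String term) {
        if (seenCommon.contains(term)) {
            return; // already marked "common"; a real bloom filter can false-positive here
        }
        long count = candidateCounts.merge(term, 1L, Long::sum);
        if (count > maxDocCount) {
            // Term crossed the threshold: evict it from the map and remember it as "common"
            candidateCounts.remove(term);
            seenCommon.add(term);
        }
    }

    Set<String> rareTerms() {
        return candidateCounts.keySet(); // whatever is left in the map is "rare"
    }
}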

Outstanding issues

  • Unclear how we should choose defaults for the bloom filter.
  • Do we expose the bloom filter params to the user, or try to pick a one-size-fits-all? I think we probably need to expose some settings, but right now nothing is configurable.
  • What's the max max_doc_count that we allow? Currently set to 10, but I think that's probably too low. It's mainly another safety mechanism; the max-buckets limit will still trigger too. It might not make sense to even have a max here, since it's pretty data-dependent.
  • No global ordinal support for strings. The PR was already huge, so I think this should be done in a followup?
  • A few misc questions in //TODO review comments.
  • I don't have any "big" integration tests, just the yaml tests. Should I add a QA test or something that tests this on a few thousand docs?

Closes #20586 (finally!)

@andyb-elastic @not-napoleon tagged you both as reviewers in case you're interested, but no pressure if not, or too busy :)

Also /cc @clintongormley

/**
* A bloom filter. Inspired by Guava bloom filter implementation though with some optimizations.
*/
public class BloomFilter implements Writeable, Releasable {
Contributor Author

This class was resurrected from the depths of git. This BloomFilter used to be used by ES elsewhere (doc IDs I think?), but I just realized none of the tests made it through the resurrection.

I'll start looking for those tests, or add my own.

This class had a lot of extra cruft that wasn't needed anymore (string configuration parsing, factories, multiple hashing versions, etc) so I tried to simplify it where possible.

@clintongormley

/cc @tsg

Contributor

@colings86 colings86 left a comment

@polyfractal I left some comments

WARNING: When aggregating on multiple indices the type of the aggregated field may not be the same in all indices.
Some types are compatible with each other (`integer` and `long` or `float` and `double`) but when the types are a mix
of decimal and non-decimal number the terms aggregation will promote the non-decimal numbers to decimal numbers.
This can result in a loss of precision in the bucket values.
Contributor

I think we should exclude float and double fields from this aggregation since the long-tail is likely to be far too long to practically use this aggregation.

Contributor Author

++ Seems reasonable to me. Would cut down some of the complexity of the agg too, which is a nice perk :)

*/
public class BloomFilter implements Writeable, Releasable {

// Some numbers:
Contributor

It's not really clear what these numbers are; could you add more explanation?

Contributor Author

@jpountz do you happen to know, or know who would? This class was taken from the old BloomFilter that I think was used for UUID lookups on segments.

These numbers used to correlate to the string that was passed in the config, and I think they are in the format

<expected insertions> = <false positive probability> : <bloom size>, <num hashes>

Contributor

I don't know for sure but would assume the same format indeed.

Contributor Author

👍 thanks. I'll reformat and tidy up the comment so it makes a bit more sense in the current code

private final Hashing hashing = Hashing.V1;

/**
* Creates a bloom filter based on the with the expected number
Contributor

I think there are some words missing here: "Creates a bloom filter based on the ???? with the expected number"

Contributor

Suggested change
* Creates a bloom filter based on the with the expected number
* Creates a bloom filter based on the expected number

Actually, based on the below constructor, maybe there are some extra words?

/*
* TODO(user): Put a warning in the javadoc about tiny fpp values,
* since the resulting size is proportional to -log(p), but there is not
* much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
Contributor

Suggested change
* much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
* much of a point after all, e.g. optimalNumOfBits(1000, 0.0000000000000001) = 76680
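
For reference, if these helpers follow the standard Guava-style sizing formulas (which the comment above suggests), they would look roughly like the sketch below; this is not necessarily this class's exact code, but it does reproduce the 76680 figure quoted above.

// Sketch of standard Guava-style bloom filter sizing (for reference only).
final class BloomSizingSketch {
    // m = -n * ln(p) / (ln 2)^2; for n = 1000, p = 1e-16 this yields 76680 bits
    static long optimalNumOfBits(long expectedInsertions, double fpp) {
        return (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded, with at least one hash function
    static int optimalNumOfHashFunctions(long expectedInsertions, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
    }
}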

data[i] = in.readLong();
}
this.numHashFunctions = in.readVInt();
this.bits = new BitArray(data);
Contributor

Nit: can we swap this to the line above so everything reading and building the BitArray is together?

newBucketOrd = newBucketOrds.add(oldKey);
} else {
// Make a note when one of the ords has been deleted
hasDeletedEntry = true;
Contributor

To make sure this GC is working correctly I wonder if it's worth having a counter here and then checking the counter value is the same as the numDeleted that we expect at the end of this for loop? Another option would be to initialise the variable to numDeleted and decrement it here ensuring it reaches 0.
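
Something along these lines, for example; this is only an illustration of the decrement-to-zero idea, with a boolean[] of "deleted" flags standing in for the aggregator's actual bucket-ord bookkeeping.

final class GcCheckSketch {
    // Hypothetical shape: walk the old ords, decrementing the expected deletion count,
    // and assert that every deleted entry was accounted for by the end of the pass.
    static void gcDeletedEntriesWithCheck(boolean[] deleted, long expectedNumDeleted) {
        long deletionsLeft = expectedNumDeleted;
        for (boolean isDeleted : deleted) {
            if (isDeleted) {
                deletionsLeft--;            // entry pruned because its term went "common"
            }
            // else: re-add the surviving ord to the new ord map (elided)
        }
        assert deletionsLeft == 0 : "GC pass did not account for every deleted entry";
    }
}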

ExecutionMode execution = ExecutionMode.MAP; //TODO global ords not implemented yet, only supports "map"

DocValueFormat format = config.format();
if ((includeExclude != null) && (includeExclude.isRegexBased()) && format != DocValueFormat.RAW) {
Contributor

I think the DocValueFormat.RAW check is being used to determine that the field used is a string field. But I see a few issues here (unless I'm misunderstanding what this is doing):

  • Users can apply custom formats to non-string fields
  • The valuesSource has already been checked above to be a ValuesSource.Bytes so this can only be a string field here?

Contributor Author

I shamefully c/p this from the Terms agg factory :)

Lemme see if we can fix this in the Terms agg itself (in a separate PR) and then I'll pull the change forward into this one.

private long numDeleted = 0;

@Override
public void collect(int docId, long bucket) throws IOException {
Contributor

Same comments apply as from LongRareTermsAggregator above. Also, since this logic is almost the same in three places, does it make sense to extract it to something common so we can fix it in one place and apply it to all implementations?

Contributor Author

I tried hard to refactor collect() and gcDeletedEntries() into one place... and it's just not possible. There are too many differences between longs and BytesRef. Map get/set, ordinals, hashing, doc values, etc. are all different and there aren't any shared types that allow it to be resolved easily :(

Contributor

@jpountz jpountz left a comment

I skimmed through the patch. The general idea of how this works makes sense to me, here are some questions:

  • Do we need a shard_size parameter? There could be millions of values that have a doc_count of 1 on each shard? And maybe a size parameter as well in case hundreds of shards are queried? I usually don't like adding parameters but I'm afraid that this aggregation might be hard to use without those?
  • Maybe we could try to be smarter with the bloom filter and start with a set that contains hashes that we only upgrade to a lossy bloom filter when it starts using more memory than the equivalent bloom filter.
  • We should somehow register memory that we allocate for the bloom filter and other data structures to the circuit breakers.
  • Do we need to support sub aggregations? It adds quite some complexity. Also, compared to terms aggs, a lot of terms might be pruned on the coordinating node because they exist on other shards as well, which might require increasing the shard size, which in turn makes sub aggregations even heavier.
  • I'm not convinced sharing the hierarchy with terms aggregations helps? It might even make it harder to do changes to the terms aggregation in the future?

}

// Note: We use this instead of java.util.BitSet because we need access to the long[] data field
static final class BitArray {
Contributor

What about Lucene's LongBitSet?

@polyfractal
Contributor Author

Thanks for the reviews @colings86 @jpountz. Will try to get to them this week.

Do we need a shard_size parameter? There could be millions of values that have a doc_count of 1 on each shard? And maybe a size parameter as well in case hundreds of shards are queried? I usually don't like adding parameters but I'm afraid that this aggregation might be hard to use without those?

I agree this is an issue... but doesn't adding size/shard_size open the agg back up to the type of sharding errors we're trying to avoid? The shard errors + bloom filter errors may be difficult for a user to understand, leading to results nearly as bad as a terms agg. We'd also have to add sorting back (at least by _term asc/desc) so that the user could choose which part of the list is truncated.

Perhaps we make it an all-or-nothing agg, and spell out the ramifications in the docs clearly? E.g. track as we add to the map of potentially-rare terms, and if we ever breach the max_buckets threshold we just terminate the aggregation? So if the user wants accurate rare terms (within the bounds of bloom error), they need to ensure they have configured their max_buckets appropriately?

Maybe we could try to be smarter with the bloom filter and start with a set that contains hashes that we only upgrade to a lossy bloom filter when it starts using more memory than the equivalent bloom filter.
We should somehow register memory that we allocate for the bloom filter and other data structures to the circuit breakers.
I'm not convinced sharing the hierarchy with terms aggregations helps? It might even make it harder to do changes to the terms aggregation in the future?

++ Will look into these. I think they make sense, and if we don't mind a bit of extra c/p, decoupling from the terms agg would simplify a few things elsewhere.

Do we need to support sub aggregations? It adds quite some complexity. Also compared to terms aggs a lot of terms might be pruned on the coordinating node because they exist on other shards as well, which might require to increase the shard size which in-turn makes sub aggregations even heavier.

I'm not sure... I feel like users may want to run sub-aggs on their rare terms. But not positive. @clintongormley @tsg do you have any thoughts on this?

@elastic elastic deleted a comment from elasticmachine Nov 27, 2018
@colings86
Contributor

@elastic/es-analytics-geo

@polyfractal
Contributor Author

@colings86 Pushed some updates to the documentation and tidied up some tests/comments. I think the new algo changes are ok to review. 🤞

Contributor

@colings86 colings86 left a comment

Did a really quick pass but I need to more thoroughly go through the CuckooFilters again

Contributor

@iverase iverase left a comment

I had a look into CuckooFilter and SetBackedScalingCuckooFilter, very cool. I left some comments on the merging logic.

if (isSetMode && other.isSetMode) {
// Both in sets, merge collections then see if we need to convert to cuckoo
hashes.addAll(other.hashes);
maybeConvert();
Contributor

In this case, if this filter is just under the threshold and the other one is as well, we will end up with hashes being almost twice over the threshold. Is that desired?

I wonder if we can compute the final size and decide if we want to convert already and then apply the values to the converted filter.

Contributor Author

Hmm, yeah, we can go about twice over the threshold. Tricky to estimate if we should convert to a filter first though. If both sets are duplicates of each other, the total size might not change (or change much). But we won't know that until we've merged them together.

I think it won't matter too much if we go twice over the threshold, since the threshold is set very low relative to the size of the filters. E.g. the current (hard coded) threshold is 10,000 hashes. So 20k longs would be ~ 160kb, compared to the initial filter size of ~1.7mb.
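
To make the trade-off concrete, the set-to-filter upgrade being discussed is roughly the sketch below. The names (ScalingFilterSketch, THRESHOLD, the elided conversion) are illustrative stand-ins, not the actual SetBackedScalingCuckooFilter code; the point is only that a post-merge overshoot of up to ~2x the threshold stays small relative to the filter itself.

import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the scaling behaviour; the approximate (cuckoo) filter side is elided.
class ScalingFilterSketch {
    private static final int THRESHOLD = 10_000;   // hard-coded threshold mentioned above
    private final Set<Long> hashes = new HashSet<>();
    private boolean isSetMode = true;

    void merge(ScalingFilterSketch other) {
        if (isSetMode && other.isSetMode) {
            hashes.addAll(other.hashes);           // may overshoot THRESHOLD by up to ~2x
            maybeConvert();
        }
        // ... set+filter and filter+filter merge paths elided ...
    }

    private void maybeConvert() {
        if (isSetMode && hashes.size() > THRESHOLD) {
            // Even at ~20k hashes (~160kb of longs) we are well under the ~1.7mb
            // initial filter size, so a temporary 2x overshoot is acceptable.
            isSetMode = false;                     // real impl: insert the hashes into a cuckoo filter
        }
    }
}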

Contributor

that works for me

@polyfractal
Contributor Author

@elasticmachine update branch

Contributor

@iverase iverase left a comment

LGTM

@polyfractal
Contributor Author

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@polyfractal
Contributor Author

@elasticmachine update branch

@polyfractal polyfractal dismissed colings86’s stale review June 28, 2019 20:00

Holiday, deferred review to Ignacio :)

@polyfractal polyfractal merged commit baf155d into elastic:master Jul 1, 2019
polyfractal added a commit that referenced this pull request Jul 1, 2019
This adds a `rare_terms` aggregation.  It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.

This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).

This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again.  If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter.  If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
pull bot pushed a commit to sadlil/elasticsearch that referenced this pull request Jul 2, 2019
Docs for rare_terms were added in elastic#35718, but neglected to
link it from the bucket index page
polyfractal added a commit that referenced this pull request Jul 3, 2019
Docs for rare_terms were added in #35718, but neglected to
link it from the bucket index page