Add RareTerms aggregation #35718

Merged 33 commits into elastic:master on Jul 1, 2019

Conversation

polyfractal
Contributor

This adds a rare_terms aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts.

This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: MAX_LONG (or worse, ordering a terms agg by count ascending, which has unbounded error).

This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a bloom filter. If a future term is found in the bloom filter we assume it was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
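
In rough Java, the collection strategy is something like the sketch below. This is a minimal illustration only: the class and field names (RareTermsSketch, seenCommon) are made up for the example, a plain HashSet stands in for the probabilistic bloom filter, and the real aggregator works per values-source type with ES data structures rather than java.util collections.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the rare_terms collection idea described above (illustrative names only).
class RareTermsSketch {
    private final long maxDocCount;                          // e.g. 1 => only keep terms seen once
    private final Map<String, Long> candidateCounts = new HashMap<>();
    private final Set<String> seenCommon = new HashSet<>();  // real impl: approximate (bloom) filter

    RareTermsSketch(long maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    void collect(String term) {
        if (seenCommon.contains(term)) {
            return; // already marked "common"; a real bloom filter can false-positive here
        }
        long count = candidateCounts.merge(term, 1L, Long::sum);
        if (count > maxDocCount) {
            // Term crossed the threshold: evict it from the map and remember it as "common"
            candidateCounts.remove(term);
            seenCommon.add(term);
        }
    }

    Set<String> rareTerms() {
        return candidateCounts.keySet(); // whatever is left in the map is "rare"
    }
}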

Outstanding issues

  • Unclear how we should choose defaults for the bloom filter.
  • Do we expose the bloom filter params to the user, or try to pick a one-size-fits-all? I think we probably need to expose some settings, but right now nothing is configurable.
  • What's the max max_doc_count that we allow? Currently set to 10, but I think that's probably too low. It's mainly another safety mechanism; the max-buckets limit will still trigger too. It might not make sense to even have a max here, since it's pretty data-dependent.
  • No global ordinal support for strings. The PR was already huge, so I think this should be done in a followup?
  • A few misc questions in //TODO review comments.
  • I don't have any "big" integration tests, just the yaml tests. Should I add a QA test or something that tests this on a few thousand docs?

Closes #20586 (finally!)

@andyb-elastic @not-napoleon tagged you both as reviewers in case you're interested, but no pressure if not, or too busy :)

Also /cc @clintongormley

/**
* A bloom filter. Inspired by Guava bloom filter implementation though with some optimizations.
*/
public class BloomFilter implements Writeable, Releasable {
Contributor Author

This class was resurrected from the depths of git. This BloomFilter used to be used by ES elsewhere (doc IDs I think?), but I just realized none of the tests made it through the resurrection.

I'll start looking for those tests, or add my own.

This class had a lot of extra cruft that wasn't needed anymore (string configuration parsing, factories, multiple hashing versions, etc) so I tried to simplify it where possible.

@clintongormley

/cc @tsg

Contributor

@colings86 colings86 left a comment

@polyfractal I left some comments

WARNING: When aggregating on multiple indices the type of the aggregated field may not be the same in all indices.
Some types are compatible with each other (`integer` and `long` or `float` and `double`) but when the types are a mix
of decimal and non-decimal number the terms aggregation will promote the non-decimal numbers to decimal numbers.
This can result in a loss of precision in the bucket values.
Contributor

I think we should exclude float and double fields from this aggregation since the long-tail is likely to be far too long to practically use this aggregation.

Contributor Author

++ Seems reasonable to me. Would cut down some of the complexity of the agg too, which is a nice perk :)

*/
public class BloomFilter implements Writeable, Releasable {

// Some numbers:
Contributor

It's not really clear what these numbers are; could you add more explanation?

Contributor Author

@jpountz do you happen to know, or know who would? This class was taken from the old BloomFilter that I think was used for UUID lookups on segments.

These numbers used to correlate to the string that was passed in the config, and I think they are in the format

<expected insertions> = <false positive probability> : <bloom size>, <num hashes>

Contributor

I don't know for sure but would assume the same format indeed.

Contributor Author

👍 thanks. I'll reformat and tidy up the comment so it makes a bit more sense in the current code

private final Hashing hashing = Hashing.V1;

/**
* Creates a bloom filter based on the with the expected number
Contributor

I think there are some words missing here: "Creates a bloom filter based on the ???? with the expected number"

Contributor

Suggested change
* Creates a bloom filter based on the with the expected number
* Creates a bloom filter based on the expected number

Actually, based on the below constructor, maybe there are some extra words?

/*
* TODO(user): Put a warning in the javadoc about tiny fpp values,
* since the resulting size is proportional to -log(p), but there is not
* much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
Contributor

Suggested change
* much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
* much of a point after all, e.g. optimalNumOfBits(1000, 0.0000000000000001) = 76680
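
For reference, if these helpers follow the standard Guava-style sizing formulas (which the comment above suggests), they would look roughly like the sketch below; this is not necessarily this class's exact code, but it does reproduce the 76680 figure quoted above.

// Sketch of standard Guava-style bloom filter sizing (for reference only).
final class BloomSizingSketch {
    // m = -n * ln(p) / (ln 2)^2; for n = 1000, p = 1e-16 this yields 76680 bits
    static long optimalNumOfBits(long expectedInsertions, double fpp) {
        return (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded, with at least one hash function
    static int optimalNumOfHashFunctions(long expectedInsertions, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
    }
}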

data[i] = in.readLong();
}
this.numHashFunctions = in.readVInt();
this.bits = new BitArray(data);
Contributor

Nit: can we swap this to the line above so everything reading and building the BitArray is together?

newBucketOrd = newBucketOrds.add(oldKey);
} else {
// Make a note when one of the ords has been deleted
hasDeletedEntry = true;
Contributor

To make sure this GC is working correctly I wonder if it's worth having a counter here and then checking the counter value is the same as the numDeleted that we expect at the end of this for loop? Another option would be to initialise the variable to numDeleted and decrement it here ensuring it reaches 0.
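
Something along these lines, for example; this is only an illustration of the decrement-to-zero idea, with a boolean[] of "deleted" flags standing in for the aggregator's actual bucket-ord bookkeeping.

final class GcCheckSketch {
    // Hypothetical shape: walk the old ords, decrementing the expected deletion count,
    // and assert that every deleted entry was accounted for by the end of the pass.
    static void gcDeletedEntriesWithCheck(boolean[] deleted, long expectedNumDeleted) {
        long deletionsLeft = expectedNumDeleted;
        for (boolean isDeleted : deleted) {
            if (isDeleted) {
                deletionsLeft--;            // entry pruned because its term went "common"
            }
            // else: re-add the surviving ord to the new ord map (elided)
        }
        assert deletionsLeft == 0 : "GC pass did not account for every deleted entry";
    }
}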

ExecutionMode execution = ExecutionMode.MAP; //TODO global ords not implemented yet, only supports "map"

DocValueFormat format = config.format();
if ((includeExclude != null) && (includeExclude.isRegexBased()) && format != DocValueFormat.RAW) {
Contributor

I think the DocValueFormat.RAW check is being used to determine that the field used is a string field. But I see a few issues here (unless I'm misunderstanding what this is doing):

  • Users can apply custom formats to non-string fields
  • The valuesSource has already been checked above to be a ValuesSource.Bytes so this can only be a string field here?

Contributor Author

I shamefully c/p this from the Terms agg factory :)

Lemme see if we can fix this in the Terms agg itself (in a separate PR) and then I'll pull the change forward into this one.

private long numDeleted = 0;

@Override
public void collect(int docId, long bucket) throws IOException {
Contributor

Same comments apply as from LongRareTermsAggregator above. Also, since this logic is almost the same in three places, does it make sense to extract it to something common so we can fix it in one place and apply it to all implementations?

Contributor Author

I tried hard to refactor collect() and gcDeletedEntries() into one place... and it's just not possible. There are too many differences between longs and BytesRef. Map get/set, ordinals, hashing, doc values, etc. are all different and there aren't any shared types that allow it to be resolved easily :(

Contributor

@jpountz jpountz left a comment

I skimmed through the patch. The general idea of how this works makes sense to me, here are some questions:

  • Do we need a shard_size parameter? There could be millions of values that have a doc_count of 1 on each shard? And maybe a size parameter as well in case hundreds of shards are queried? I usually don't like adding parameters but I'm afraid that this aggregation might be hard to use without those?
  • Maybe we could try to be smarter with the bloom filter and start with a set that contains hashes that we only upgrade to a lossy bloom filter when it starts using more memory than the equivalent bloom filter.
  • We should somehow register memory that we allocate for the bloom filter and other data structures to the circuit breakers.
  • Do we need to support sub aggregations? It adds quite some complexity. Also, compared to terms aggs, a lot of terms might be pruned on the coordinating node because they exist on other shards as well, which might require increasing the shard size, which in turn makes sub aggregations even heavier.
  • I'm not convinced sharing the hierarchy with terms aggregations helps? It might even make it harder to do changes to the terms aggregation in the future?

}

// Note: We use this instead of java.util.BitSet because we need access to the long[] data field
static final class BitArray {
Contributor

What about Lucene's LongBitSet?

@polyfractal
Contributor Author

Thanks for the reviews @colings86 @jpountz. Will try to get to them this week.

Do we need a shard_size parameter? There could be millions of values that have a doc_count of 1 on each shard? And maybe a size parameter as well in case hundreds of shards are queried? I usually don't like adding parameters but I'm afraid that this aggregation might be hard to use without those?

I agree this is an issue... but doesn't adding size/shard_size open the agg back up to the type of sharding errors we're trying to avoid? The shard errors + bloom filter errors may be difficult for a user to understand, leading to results nearly as bad as a terms agg. We'd also have to add sorting back (at least by _term asc/desc) so that the user could choose which part of the list is truncated.

Perhaps we make it an all-or-nothing agg, and spell out the ramifications in the docs clearly? E.g. track as we add to the map of potentially-rare terms, and if we ever breach the max_buckets threshold we just terminate the aggregation? So if the user wants accurate rare terms (within the bounds of bloom error), they need to ensure they have configured their max_buckets appropriately?

Maybe we could try to be smarter with the bloom filter and start with a set that contains hashes that we only upgrade to a lossy bloom filter when it starts using more memory than the equivalent bloom filter.
We should somehow register memory that we allocate for the bloom filter and other data structures to the circuit breakers.
I'm not convinced sharing the hierarchy with terms aggregations helps? It might even make it harder to do changes to the terms aggregation in the future?

++ Will look into these. I think they make sense, and if we don't mind a bit of extra c/p, decoupling from the terms agg would simplify a few things elsewhere.

Do we need to support sub aggregations? It adds quite some complexity. Also compared to terms aggs a lot of terms might be pruned on the coordinating node because they exist on other shards as well, which might require to increase the shard size which in-turn makes sub aggregations even heavier.

I'm not sure... I feel like users may want to run sub-aggs on their rare terms. But not positive. @clintongormley @tsg do you have any thoughts on this?

@elastic elastic deleted a comment from elasticmachine Nov 27, 2018
@colings86
Contributor

@elastic/es-analytics-geo

@polyfractal
Contributor Author

@colings86 Pushed some updates to the documentation and tidied up some tests/comments. I think the new algo changes are ok to review. 🤞

Contributor

@colings86 colings86 left a comment

Did a really quick pass but I need to more thoroughly go through the CuckooFilters again

Contributor

@iverase iverase left a comment

I had a look into CuckooFilter and SetBackedScalingCuckooFilter, very cool. I left some comments on the merging logic.

if (isSetMode && other.isSetMode) {
// Both in sets, merge collections then see if we need to convert to cuckoo
hashes.addAll(other.hashes);
maybeConvert();
Contributor

In this case, if this filter is just under the threshold and the other one is as well, we will end up with hashes being almost twice over the threshold. Is that desired?

I wonder if we can compute the final size and decide if we want to convert already and then apply the values to the converted filter.

Contributor Author

Hmm, yeah, we can go about twice over the threshold. Tricky to estimate if we should convert to a filter first though. If both sets are duplicates of each other, the total size might not change (or change much). But we won't know that until we've merged them together.

I think it won't matter too much if we go twice over the threshold, since the threshold is set very low relative to the size of the filters. E.g. the current (hard coded) threshold is 10,000 hashes. So 20k longs would be ~ 160kb, compared to the initial filter size of ~1.7mb.
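
To make the trade-off concrete, the set-to-filter upgrade being discussed is roughly the sketch below. The names (ScalingFilterSketch, THRESHOLD, the elided conversion) are illustrative stand-ins, not the actual SetBackedScalingCuckooFilter code; the point is only that a post-merge overshoot of up to ~2x the threshold stays small relative to the filter itself.

import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the scaling behaviour; the approximate (cuckoo) filter side is elided.
class ScalingFilterSketch {
    private static final int THRESHOLD = 10_000;   // hard-coded threshold mentioned above
    private final Set<Long> hashes = new HashSet<>();
    private boolean isSetMode = true;

    void merge(ScalingFilterSketch other) {
        if (isSetMode && other.isSetMode) {
            hashes.addAll(other.hashes);           // may overshoot THRESHOLD by up to ~2x
            maybeConvert();
        }
        // ... set+filter and filter+filter merge paths elided ...
    }

    private void maybeConvert() {
        if (isSetMode && hashes.size() > THRESHOLD) {
            // Even at ~20k hashes (~160kb of longs) we are well under the ~1.7mb
            // initial filter size, so a temporary 2x overshoot is acceptable.
            isSetMode = false;                     // real impl: insert the hashes into a cuckoo filter
        }
    }
}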

Contributor

that works for me

@polyfractal
Contributor Author

@elasticmachine update branch

Contributor

@iverase iverase left a comment

LGTM

@polyfractal
Contributor Author

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@polyfractal
Contributor Author

@elasticmachine update branch

@polyfractal polyfractal dismissed colings86’s stale review June 28, 2019 20:00

Holiday, deferred review to Ignacio :)

@polyfractal polyfractal merged commit baf155d into elastic:master Jul 1, 2019
polyfractal added a commit that referenced this pull request Jul 1, 2019
This adds a `rare_terms` aggregation.  It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.

This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).

This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again.  If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter.  If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
pull bot pushed a commit to sadlil/elasticsearch that referenced this pull request Jul 2, 2019
Docs for rare_terms were added in elastic#35718, but neglected to
link it from the bucket index page
polyfractal added a commit that referenced this pull request Jul 3, 2019
Docs for rare_terms were added in #35718, but neglected to
link it from the bucket index page