Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mixing set-based and regexp-based include and exclude #63325

Conversation

hchargois
Copy link
Contributor

This PR adds the feature discussed in #62246

It allows mixing set-based and regexp-based "include" and "exclude" parameters in Terms (and SignificantTerms, RareTerms) aggregations. Both ways are supported: a set include with a regexp exclude, or a regexp include with a set exclude.

@hchargois
Copy link
Contributor Author

There is one thing I'm a bit unsure of, around the stream serialization, and which version to set. I set the next major version (8.0.0) as default but I guess if this is merged in a 7.x branch this would need changing.

Also, let's say we wanted to backport this feature to our own build of ES v6, do you think it could be done?

@cbuescher cbuescher added the :Analytics/Aggregations Aggregations label Oct 6, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 6, 2020
@not-napoleon
Copy link
Member

@elasticmachine test this please

@nik9000
Copy link
Member

nik9000 commented Oct 7, 2020

@elasticmachine, ok to test

*/
@Override
public boolean accept(BytesRef value) {
return ((valids == null) || (valids.contains(value))) && ((invalids == null) || (!invalids.contains(value)));
if (valids != null && !valids.contains(value)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (valids != null && !valids.contains(value)) {
if (valids != null && (valids.contains(value) == false)) {

Elastic coding standard prefers this form for readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return false;
}

if (runAutomaton != null && !runAutomaton.run(value.bytes, value.offset, value.length)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (runAutomaton != null && !runAutomaton.run(value.bytes, value.offset, value.length)) {
if (runAutomaton != null && (runAutomaton.run(value.bytes, value.offset, value.length) == false)) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

public IncludeExclude(RegExp include, RegExp exclude, SortedSet<BytesRef> includeValues, SortedSet<BytesRef> excludeValues) {
if (include == null && exclude == null && includeValues == null && excludeValues == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intention here is that at most one of (include, includeValues) and at most one of (exclude, excludeValues) will be non-null. In other words, while you can mix set-based includes and regex excludes (or vice versa), you can't have both set-based and regex-based includes. That seems like a requirement of the precedence rules, among other things.

I think we should enforce that rule here. I know the parser doesn't currently allow for specifying both a regex and a set at the same time, but it's ultimately this class's contract that both not be set, and this class should check it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in IncludeExclude any combination of the 4 (include, includeValues, exclude, excludeValues) can work correctly. Meaning that we can have both kinds of includes and/or both kind of excludes and it would "do the right/logical thing", i.e. it would accept terms that are in any of the include(s) but not in any of the exclude(s). I don't think there is precedence between both kinds of includes, nor between both kinds of excludes, just between include(s) and exclude(s).

Also, I don't think restricting to a single include and a single exclude would be more efficient (e.g. allow optimizations in the accept functions).

That's why I didn't forbid having both types of includes or both types of excludes. But I can definitely add a check to forbid it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case I am worried about, if you have something like include = "foo.*" and includeValues = ["bar", "quux"], it will incorrectly reject the term "foo" by returning false on line 211, I think. I don't think we need to support that case, but we do need to explicitly reject it, by not allowing both include and includeValues to be set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I was wrong in my previous comment. I think the accept from the ordinals filter can work with both kinds of includes (or excludes), but indeed not the string filter, where I assumed there was only one kind of each.

I've added a check to forbid this

aggregationBuilder = new TermsAggregationBuilder("_name").userValueTypeHint(ValueType.STRING)
.executionHint(executionHint)
.includeExclude(new IncludeExclude("val00.+", null, null,
new String[]{"val001", "val002", "val003", "val004", "val005", "val006", "val007", "val008"}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate that you added tests to make sure that the precedence is preserved with the new possible configurations, but I think we should test this on IncludeExcludeTests as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored/added tests for all possible combinations of include/exclude in IncludeExcludeTests.

@not-napoleon
Copy link
Member

I think the BWC failure is related to us cutting a branch around the time this test ran, I didn't see anything that looked related to this change. When you make the requested changes, we'll run the whole suite again and if BWC passes then, that's fine.

@hchargois
Copy link
Contributor Author

I think the BWC failure is related to us cutting a branch around the time this test ran, I didn't see anything that looked related to this change. When you make the requested changes, we'll run the whole suite again and if BWC passes then, that's fine.

All right. Do you want me to merge master into my branch? Or rebase onto it?

@not-napoleon
Copy link
Member

This looks great, thanks. Yeah, if you could merge master into the branch, that should fix the BWC tests. I'm OOO tomorrow, but hopefully we can get this merged Monday or Tuesday.

@hchargois
Copy link
Contributor Author

hchargois commented Oct 19, 2020

I merged master. However, BWC is still failing on my machine, but actually it's even failing for the master branch. It looks like other recent PRs are also failing the same tests though. We'll see how it goes with the CI but I'm not very hopeful.

Update: OK so BWC passed in the CI but another task failed. It seems to me that it's not related to my changes. I tried reproducing the two failing tests on my machine with the command lines given in the log outputs, and they both pass successfully. I'm not sure what to do now

@not-napoleon
Copy link
Member

I agree that test failure doesn't look related. I'll look into it.

@not-napoleon
Copy link
Member

@elasticmachine run elasticsearch-ci/1

@not-napoleon
Copy link
Member

One of those is a known issue: #63719. I'm rerunning the suite to see if a different random seed fixes it. I'll dig in a bit more on the second failure and open an issue if necessary.

@not-napoleon
Copy link
Member

This should be all set. I'll get it merged and backported this week. Thanks for coding it up!

@hchargois
Copy link
Contributor Author

Awesome! Thank you for the review!

@not-napoleon not-napoleon merged commit ff736f0 into elastic:master Oct 21, 2020
not-napoleon pushed a commit to not-napoleon/elasticsearch that referenced this pull request Oct 21, 2020
…63325)

* Allow mixing set-based and regexp-based include and exclude

* Coding style

* Disallow having both set and regexp include (resp. exclude)

* Test correctness of every combination of include/exclude
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants