Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding multi_term aggregator support #2687

Merged
merged 9 commits into from
Apr 21, 2022

Conversation

penghuo
Copy link
Contributor

@penghuo penghuo commented Mar 31, 2022

Signed-off-by: Peng Huo penghuo@gmail.com

Description

Adding multi_terms aggregator support.

To Reviewers

Limitation

The current implementation focuses on adding new type aggregates. Performance (latency) is not good. This solution is slow, mainly because of simply encoding/decoding a list of values into bucket keys. A performance improvement PR will be released at a later date.

Difference between terms and multi_terms aggregation

  • in aggregation result, boolean field value is represent as false/true instead of "false"/"true"
  • format is configured per terms instead of aggregation.

Demo

GET test_00001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi_terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}

# Results
"aggregations": {
    "hot": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": [
            "dub",
            "h1"
          ],
          "key_as_string": "dub|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 90.0
          }
        },
        {
          "key": [
            "dub",
            "h2"
          ],
          "key_as_string": "dub|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 70.0
          }
        },
        {
          "key": [
            "iad",
            "h2"
          ],
          "key_as_string": "iad|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 50.0
          }
        },
        {
          "key": [
            "iad",
            "h1"
          ],
          "key_as_string": "iad|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 15.0
          }
        }
      ]
    }
  }

Correctness Test

UT

add UT for each new added class.

IT

  • MultiTermsIT
  • 370_multi_terms.yml

Performance Test

  • Slightly performance drop compared to script aggregation.
  • Significant performance drop compared to term aggregation.
  • Keyword fields perform worse than numeric fields.

Test Environment

  • OpenSearch 2.0, single node cluster.
  • Index: logs-201998 (esrally http_logs track)

OpenSearch multi_terms vs script

Goal
  • Getting to know the performance difference between multi_terms aggregation and terms script aggregation in OpenSearch.
queries
  multi_terms painless
numeric "aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "status"}, {"field": "size"} ] } }} "aggs": { "sterm": { "terms": { "script": { "source": "doc['status'].value + '|' + doc['size'].value", "lang": "painless" } } }}
mix (numeric + ip) "aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "clientip"}, {"field": "size"} ] } }} "aggs": { "sterm": { "terms": { "script": { "source": "doc['clientip'].value + '|' + doc['size'].value", "lang": "painless" } } }}
sort by avg(size) "aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }} "aggs": { "sterm": { "terms": { "script": { "source": "doc['clientip'].value + '|' + doc['status'].value", "lang": "painless" }, "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}

Test Result

Conclusion

  • Slightly performance drop compared to script aggregation.

Latency summary

Field/Agg painless multi_terms
numeric 1021.41 1901.96
mix 4654.4 4843.32
sort 4540.21 5336.51

multi_terms vs painless

vs Painless

OpenSearch multi_terms vs terms

Goal
  • Getting to know the performance difference between multi_terms aggregation and terms aggregation in OpenSearch.
Queries
  multi_terms terms
numeric "aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "status"}, {"field": "size"} ] } }} "aggs": { "sterm": { "terms": { "field": "status" } }}
mix (numeric + ip) "aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "clientip"}, {"field": "size"} ] } }} "aggs": { "sterm": { "terms": { "field": "clientip" } }}
sort by avg(size) "aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }} "aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}
Test Result

Conclusion

  • Significant performance drop compared to term aggregation.
  • Keyword fields perform worse than numeric fields.

Latency summary

Field/Agg terms(ms) multi_terms(ms) multi_terms/terms
numeric 75.5294 1901.96 26.18
mix 130.958 4843.32 36.98
sort 460.177 5336.51 11.6

terms vs multi_terms
vs Terms

Issues Resolved

#1629

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@penghuo penghuo requested a review from a team as a code owner March 31, 2022 18:48
@nknize nknize added feature New feature or request Search:Aggregations v2.1.0 Issues and PRs related to version 2.1.0 labels Mar 31, 2022
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure cfff6355d63d1e38c8ae5b8fe8fcc719022b46ce
Log 3984

Reports 3984

@tlfeng tlfeng added the backport 2.x Backport to 2.x branch label Mar 31, 2022
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 18bedad58ea629263acbbe56ca6c57fae0237d7d
Log 3991

Reports 3991

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 5720bc2cc46a19cf0a60cd8f2848be1eca4b0760
Log 4007

Reports 4007

@anirudha anirudha requested a review from nknize April 1, 2022 15:07
Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting this! It's looking good but it's HUGE!

I think we should slim down the code footprint a bit by deriving from common base classes and avoiding copy/paste.

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure dd72ad4347e3e7cbc2ca736edfb18ba7f7616c5c
Log 4597

Reports 4597

Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success a35ef76
Log 4606

Reports 4606

Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this took a while! It was a biggie. I like that we extended the base test case and MultiValuesSourceFieldConfig. I think it's cleaner. This LGTM!

Signed-off-by: Peng Huo <penghuo@gmail.com>
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 8003493
Log 4634

Reports 4634

@reta
Copy link
Collaborator

reta commented Apr 20, 2022

@penghuo LGTM, have a question, may be @nknize could also chime in: since this feature has significant performance impact (at the moment), should we guard it behind the setting (enable / disable) so the users would make the conscious decision of using it? Also, how large were the dataset which the feature were benchmarked against? (I don't see it anywhere).

@nknize
Copy link
Collaborator

nknize commented Apr 20, 2022

Need to run spotlessApply and push...

* What went wrong:
Execution failed for task ':server:spotlessJavaCheck'.
> The following files had format violations:
      src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationFactory.java
          @@ -60,7 +60,8 @@
           ············true
           ········);

Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @reta question re: performance let's add javadocs in MultiTermsAggregationBuilder clearly describing how to use the agg and the known performance issues.

@nknize
Copy link
Collaborator

nknize commented Apr 20, 2022

since this feature has significant performance impact (at the moment), should we guard it behind the setting (enable / disable)

Since this is a transient computation and nothing is persisted (e.g., not subject to on disk breaking changes) I'm not as concerned about a feature flag. If we plan to make a lot of changes to the public facing API then a feature flag would be more appropriate.

Signed-off-by: Peng Huo <penghuo@gmail.com>
@penghuo
Copy link
Contributor Author

penghuo commented Apr 20, 2022

@penghuo LGTM, have a question, may be @nknize could also chime in: since this feature has significant performance impact (at the moment), should we guard it behind the setting (enable / disable) so the users would make the conscious decision of using it? Also, how large were the dataset which the feature were benchmarked against? (I don't see it anywhere).

Add Test Env section. We use index logs-201998 (esrally http_logs track)

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure d9bfe98
Log 4647

Reports 4647

Signed-off-by: Peng Huo <penghuo@gmail.com>
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success ba8bfec
Log 4653

Reports 4653

Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Thank you @penghuo! LGTM

@nknize
Copy link
Collaborator

nknize commented Apr 21, 2022

I'd like to give @reta one more go at a review then we can merge to main and backport to 2.1.

@nknize nknize requested a review from reta April 21, 2022 03:35
@nknize nknize merged commit 03fbca3 into opensearch-project:main Apr 21, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 21, 2022
Adds a new multi_term aggregation. The current implementation focuses
on adding new type aggregates. Performance (latency) is suboptimal in this
iteration, mainly because of brute force encoding/decoding a list of values
into bucket keys. A performance improvement change will be made as a
follow on.

Signed-off-by: Peng Huo <penghuo@gmail.com>
(cherry picked from commit 03fbca3)
dblock pushed a commit that referenced this pull request Apr 21, 2022
Adds a new multi_term aggregation. The current implementation focuses
on adding new type aggregates. Performance (latency) is suboptimal in this
iteration, mainly because of brute force encoding/decoding a list of values
into bucket keys. A performance improvement change will be made as a
follow on.

Signed-off-by: Peng Huo <penghuo@gmail.com>
(cherry picked from commit 03fbca3)

Co-authored-by: Peng Huo <penghuo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch feature New feature or request Search:Aggregations v2.1.0 Issues and PRs related to version 2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants