Adding multi_term aggregator support #2687

penghuo · 2022-03-31T18:48:24Z

Signed-off-by: Peng Huo <penghuo@gmail.com>

Description

Adding multi_terms aggregator support.

To Reviewers

Limitation

The current implementation focuses on adding the new aggregation type. Performance (latency) is not good: the solution is slow mainly because it naively encodes and decodes a list of values into bucket keys. A performance improvement PR will follow at a later date.
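To make the limitation concrete, here is a minimal Python sketch of the general idea of packing a list of values into a single bucket key and unpacking it again. It is not the Java code in this PR; the function names (`encode_key`, `decode_key`, `multi_terms_counts`), the length-prefixed byte layout, and the sample documents are all assumptions chosen for illustration.

```python
import struct
from collections import Counter


def encode_key(values):
    """Pack a list of terms into one bytes key using length-prefixed UTF-8."""
    out = bytearray()
    for value in values:
        raw = str(value).encode("utf-8")
        out += struct.pack(">I", len(raw)) + raw
    return bytes(out)


def decode_key(key):
    """Inverse of encode_key: recover the list of terms from a bytes key."""
    values, pos = [], 0
    while pos < len(key):
        (length,) = struct.unpack_from(">I", key, pos)
        pos += 4
        values.append(key[pos:pos + length].decode("utf-8"))
        pos += length
    return values


def multi_terms_counts(docs, fields):
    """Count documents per combination of field values."""
    counts = Counter()
    for doc in docs:
        # One encode per document: this per-document work is the kind of
        # cost the limitation above refers to.
        counts[encode_key([doc[f] for f in fields])] += 1
    # One decode per bucket when building the result.
    return {tuple(decode_key(k)): c for k, c in counts.items()}


# Hypothetical documents, not taken from the benchmark index.
docs = [
    {"region": "dub", "host": "h1"},
    {"region": "dub", "host": "h1"},
    {"region": "iad", "host": "h2"},
]
result = multi_terms_counts(docs, ["region", "host"])
print(result)
```

Every document pays for one encode and every bucket for one decode, which is why a plain single-field terms aggregation can be expected to be cheaper than this approach.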

Difference between terms and multi_terms aggregation

In the aggregation result, a boolean field value is represented as false/true instead of "false"/"true".
The format is configured per term instead of per aggregation.

Demo

GET test_00001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi_terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}

# Results
"aggregations": {
    "hot": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": [
            "dub",
            "h1"
          ],
          "key_as_string": "dub|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 90.0
          }
        },
        {
          "key": [
            "dub",
            "h2"
          ],
          "key_as_string": "dub|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 70.0
          }
        },
        {
          "key": [
            "iad",
            "h2"
          ],
          "key_as_string": "iad|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 50.0
          }
        },
        {
          "key": [
            "iad",
            "h1"
          ],
          "key_as_string": "iad|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 15.0
          }
        }
      ]
    }
  }
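For anyone consuming this response on the client side, the sketch below flattens the buckets from the demo output into rows. It is plain Python over the JSON shown above and does not use any OpenSearch client; the helper name `flatten_buckets` is hypothetical. It also checks two things that are visible in the demo output: `key_as_string` is the key parts joined with `|`, and the buckets arrive ordered by `max-cpu` descending.

```python
# Buckets copied from the demo response above.
response = {
    "aggregations": {
        "hot": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": ["dub", "h1"], "key_as_string": "dub|h1",
                 "doc_count": 2, "max-cpu": {"value": 90.0}},
                {"key": ["dub", "h2"], "key_as_string": "dub|h2",
                 "doc_count": 2, "max-cpu": {"value": 70.0}},
                {"key": ["iad", "h2"], "key_as_string": "iad|h2",
                 "doc_count": 2, "max-cpu": {"value": 50.0}},
                {"key": ["iad", "h1"], "key_as_string": "iad|h1",
                 "doc_count": 2, "max-cpu": {"value": 15.0}},
            ],
        }
    }
}


def flatten_buckets(aggregation, metric):
    """Turn multi_terms buckets into (key tuple, doc_count, metric value) rows."""
    return [
        (tuple(bucket["key"]), bucket["doc_count"], bucket[metric]["value"])
        for bucket in aggregation["buckets"]
    ]


buckets = response["aggregations"]["hot"]["buckets"]
rows = flatten_buckets(response["aggregations"]["hot"], "max-cpu")

# key_as_string is the key parts joined with "|" in the demo output.
keys_match = all("|".join(b["key"]) == b["key_as_string"] for b in buckets)

# "order": {"max-cpu": "desc"} means the metric values never increase.
values = [row[2] for row in rows]
is_desc = all(a >= b for a, b in zip(values, values[1:]))

for key, doc_count, max_cpu in rows:
    print(key, doc_count, max_cpu)
```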

Correctness Test

UT

Added a UT for each newly added class.

IT

MultiTermsIT
370_multi_terms.yml

Performance Test

Slight performance drop compared to script aggregation.
Significant performance drop compared to terms aggregation.
Keyword fields perform worse than numeric fields.

Test Environment

OpenSearch 2.0, single node cluster.
Index: logs-201998 (esrally http_logs track)

OpenSearch multi_terms vs script

Goal

Measure the performance difference between the multi_terms aggregation and the terms script aggregation in OpenSearch.

Queries

Field type	multi_terms	painless
numeric	"aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "status"}, {"field": "size"} ] } }}	"aggs": { "sterm": { "terms": { "script": { "source": "doc['status'].value + '|' + doc['size'].value", "lang": "painless" } } }}
mix (numeric + ip)	"aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "clientip"}, {"field": "size"} ] } }}	"aggs": { "sterm": { "terms": { "script": { "source": "doc['clientip'].value + '|' + doc['size'].value", "lang": "painless" } } }}
sort by avg(size)	"aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}	"aggs": { "sterm": { "terms": { "script": { "source": "doc['clientip'].value + '|' + doc['status'].value", "lang": "painless" }, "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}
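The two query shapes in the table can be generated from the same field list, which makes it easier to keep the benchmark pairs in sync. The following is only a sketch: the helper names `multi_terms_body` and `painless_terms_body` are hypothetical, and the bodies are built as plain dictionaries rather than through any client library.

```python
import json


def multi_terms_body(name, fields):
    """Build a multi_terms aggregation body over the given fields."""
    return {"aggs": {name: {"multi_terms": {"terms": [{"field": f} for f in fields]}}}}


def painless_terms_body(name, fields):
    """Build the equivalent terms aggregation keyed by a painless script
    that concatenates the field values with '|'."""
    source = " + '|' + ".join(f"doc['{f}'].value" for f in fields)
    return {"aggs": {name: {"terms": {"script": {"source": source, "lang": "painless"}}}}}


numeric_multi = multi_terms_body("mterms", ["status", "size"])
numeric_script = painless_terms_body("sterm", ["status", "size"])

print(json.dumps(numeric_multi))
print(json.dumps(numeric_script))
```

With `["status", "size"]` this reproduces the "numeric" row; passing `["clientip", "size"]` gives the "mix" row.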

Test Result

Conclusion

Slight performance drop compared to script aggregation.

Latency summary

Field/Agg	painless(ms)	multi_terms(ms)
numeric	1021.41	1901.96
mix	4654.4	4843.32
sort	4540.21	5336.51
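As a quick way to read the table, the sketch below computes the multi_terms/painless latency ratio per row. The numbers are copied from the table; treating them as milliseconds is an assumption based on the units given in the terms comparison, where the same multi_terms figures appear.

```python
# (painless, multi_terms) latency per query type, copied from the table above.
# Units are assumed to be milliseconds.
latency = {
    "numeric": (1021.41, 1901.96),
    "mix": (4654.4, 4843.32),
    "sort": (4540.21, 5336.51),
}

# Ratio > 1 means multi_terms is slower than the painless script.
ratios = {
    name: round(multi_terms / painless, 2)
    for name, (painless, multi_terms) in latency.items()
}

for name, ratio in ratios.items():
    print(f"{name}: multi_terms/painless = {ratio}")
```

On these figures the gap is small for the mix and sort queries and close to 2x for the numeric query.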

[Chart: multi_terms vs painless]

OpenSearch multi_terms vs terms

Goal

Measure the performance difference between the multi_terms aggregation and the terms aggregation in OpenSearch.

Queries

Field type	multi_terms	terms
numeric	"aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "status"}, {"field": "size"} ] } }}	"aggs": { "sterm": { "terms": { "field": "status" } }}
mix (numeric + ip)	"aggs": { "mterms": { "multi_terms": { "terms": [ {"field": "clientip"}, {"field": "size"} ] } }}	"aggs": { "sterm": { "terms": { "field": "clientip" } }}
sort by avg(size)	"aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}	"aggs": { "sterm": { "terms": { "field": "clientip", "order": {"avg-size": "desc"} }, "aggs": { "avg-size": {"avg": {"field": "size"}} } }}

Test Result

Conclusion

Significant performance drop compared to terms aggregation.
Keyword fields perform worse than numeric fields.

Latency summary

Field/Agg	terms(ms)	multi_terms(ms)	multi_terms/terms
numeric	75.5294	1901.96	25.18
mix	130.958	4843.32	36.98
sort	460.177	5336.51	11.6

[Chart: terms vs multi_terms]

Issues Resolved

#1629

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

opensearch-ci-bot · 2022-03-31T19:12:49Z

❌ Gradle Check failure cfff6355d63d1e38c8ae5b8fe8fcc719022b46ce
Log 3984

Reports 3984

opensearch-ci-bot · 2022-03-31T23:01:08Z

❌ Gradle Check failure 18bedad58ea629263acbbe56ca6c57fae0237d7d
Log 3991

Reports 3991

opensearch-ci-bot · 2022-04-01T03:57:40Z

✅ Gradle Check success 5720bc2cc46a19cf0a60cd8f2848be1eca4b0760
Log 4007

Reports 4007

nknize

Thank you for submitting this! It's looking good but it's HUGE!

I think we should slim down the code footprint a bit by deriving from common base classes and avoiding copy/paste.

server/src/internalClusterTest/java/org/opensearch/search/aggregations/bucket/MultiTermsIT.java

...r/src/main/java/org/opensearch/search/aggregations/support/MultiTermsValuesSourceConfig.java

opensearch-ci-bot · 2022-04-18T21:12:03Z

❌ Gradle Check failure dd72ad4347e3e7cbc2ca736edfb18ba7f7616c5c
Log 4597

Reports 4597

Signed-off-by: Peng Huo <penghuo@gmail.com>

opensearch-ci-bot · 2022-04-19T04:54:50Z

✅ Gradle Check success a35ef76
Log 4606

Reports 4606

nknize

Sorry this took a while! It was a biggie. I like that we extended the base test case and MultiValuesSourceFieldConfig. I think it's cleaner. This LGTM!

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/ParsedTerms.java

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregator.java

.../main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationFactory.java

server/src/internalClusterTest/java/org/opensearch/search/aggregations/bucket/MultiTermsIT.java

Signed-off-by: Peng Huo <penghuo@gmail.com>

opensearch-ci-bot · 2022-04-20T04:28:35Z

❌ Gradle Check failure 8003493
Log 4634

Reports 4634

reta · 2022-04-20T12:58:23Z

@penghuo LGTM. I have a question, and maybe @nknize could also chime in: since this feature has a significant performance impact (at the moment), should we guard it behind a setting (enable/disable) so that users make a conscious decision to use it? Also, how large was the dataset the feature was benchmarked against? (I don't see it anywhere.)

nknize · 2022-04-20T14:06:37Z

Need to run spotlessApply and push...

* What went wrong:
Execution failed for task ':server:spotlessJavaCheck'.
> The following files had format violations:
      src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationFactory.java
          @@ -60,7 +60,8 @@
           ············true
           ········);

nknize

Per @reta's question re: performance, let's add javadocs in MultiTermsAggregationBuilder clearly describing how to use the agg and the known performance issues.

.../main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationBuilder.java

nknize · 2022-04-20T14:13:37Z

since this feature has significant performance impact (at the moment), should we guard it behind the setting (enable / disable)

Since this is a transient computation and nothing is persisted (e.g., not subject to on-disk breaking changes), I'm not as concerned about a feature flag. If we plan to make a lot of changes to the public-facing API, then a feature flag would be more appropriate.

Signed-off-by: Peng Huo <penghuo@gmail.com>

penghuo · 2022-04-20T19:14:38Z

@penghuo LGTM. I have a question, and maybe @nknize could also chime in: since this feature has a significant performance impact (at the moment), should we guard it behind a setting (enable/disable) so that users make a conscious decision to use it? Also, how large was the dataset the feature was benchmarked against? (I don't see it anywhere.)

Added a Test Environment section. We used the index logs-201998 (esrally http_logs track).

opensearch-ci-bot · 2022-04-20T19:37:44Z

❌ Gradle Check failure d9bfe98
Log 4647

Reports 4647

Signed-off-by: Peng Huo <penghuo@gmail.com>

opensearch-ci-bot · 2022-04-20T21:05:35Z

✅ Gradle Check success ba8bfec
Log 4653

Reports 4653

nknize

This is great. Thank you @penghuo! LGTM

nknize · 2022-04-21T03:35:03Z

I'd like to give @reta one more go at a review then we can merge to main and backport to 2.1.

Adds a new multi_term aggregation. The current implementation focuses on adding new type aggregates. Performance (latency) is suboptimal in this iteration, mainly because of brute force encoding/decoding a list of values into bucket keys. A performance improvement change will be made as a follow on. Signed-off-by: Peng Huo <penghuo@gmail.com> (cherry picked from commit 03fbca3)

Adds a new multi_term aggregation. The current implementation focuses on adding new type aggregates. Performance (latency) is suboptimal in this iteration, mainly because of brute force encoding/decoding a list of values into bucket keys. A performance improvement change will be made as a follow on. Signed-off-by: Peng Huo <penghuo@gmail.com> (cherry picked from commit 03fbca3) Co-authored-by: Peng Huo <penghuo@gmail.com>

penghuo requested a review from a team as a code owner March 31, 2022 18:48

nknize added feature New feature or request Search:Aggregations v2.1.0 Issues and PRs related to version 2.1.0 labels Mar 31, 2022

tlfeng added the backport 2.x Backport to 2.x branch label Mar 31, 2022

anirudha requested a review from nknize April 1, 2022 15:07

nknize requested changes Apr 12, 2022

server/src/internalClusterTest/java/org/opensearch/search/aggregations/bucket/MultiTermsIT.java

...r/src/main/java/org/opensearch/search/aggregations/support/MultiTermsValuesSourceConfig.java

penghuo added 6 commits April 18, 2022 16:06

Adding multi_term aggregator support

9ff45fd

Signed-off-by: Peng Huo <penghuo@gmail.com>

fix test failure

71c9a92

Signed-off-by: Peng Huo <penghuo@gmail.com>

fix TermsDocCountErrorIT

1218b5e

Signed-off-by: Peng Huo <penghuo@gmail.com>

add abstract BaseStringTermsTestCase

706b952

Signed-off-by: Peng Huo <penghuo@gmail.com>

add abstract BaseMultiValuesSourceFieldConfig

5b5e8f2

Signed-off-by: Peng Huo <penghuo@gmail.com>

update to version 3.0.0

a35ef76

Signed-off-by: Peng Huo <penghuo@gmail.com>

penghuo force-pushed the multi-terms-agg branch from dd72ad4 to a35ef76 Compare April 19, 2022 04:23

penghuo requested a review from reta as a code owner April 19, 2022 04:23

nknize approved these changes Apr 19, 2022

reta reviewed Apr 19, 2022

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/ParsedTerms.java

reta reviewed Apr 19, 2022

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregator.java

reta reviewed Apr 19, 2022

.../main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationFactory.java

reta reviewed Apr 19, 2022

server/src/internalClusterTest/java/org/opensearch/search/aggregations/bucket/MultiTermsIT.java

address comments

8003493

Signed-off-by: Peng Huo <penghuo@gmail.com>

nknize requested changes Apr 20, 2022

.../main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregationBuilder.java

Add Java Doc to explain usage and performance limitation

d9bfe98

Signed-off-by: Peng Huo <penghuo@gmail.com>

fix failure UT

ba8bfec

Signed-off-by: Peng Huo <penghuo@gmail.com>

nknize approved these changes Apr 21, 2022

nknize requested a review from reta April 21, 2022 03:35

reta approved these changes Apr 21, 2022

nknize merged commit 03fbca3 into opensearch-project:main Apr 21, 2022

opensearch-trigger-bot bot mentioned this pull request Apr 21, 2022

[Backport 2.x] Adding multi_term aggregator support #3022

Merged

penghuo mentioned this pull request Apr 27, 2022

Add multi terms aggregation feature #1629

Closed

anirudha mentioned this pull request Jul 7, 2022

Add 2.1.0 release notes opensearch-project/opensearch-build#2302

Merged

penghuo mentioned this pull request Jul 20, 2022

TimeSeries optimizations in OpenSearch #3734

Open

ketanv3 mentioned this pull request Jul 15, 2023

[Meta] Improve performance of multi-term aggregations #8710

Closed

ketanv3 mentioned this pull request Jul 24, 2023

Added benchmarks for multi-term aggregation opensearch-project/opensearch-benchmark-workloads#89

Merged

This was referenced Aug 17, 2023

Improve performance of encoding composite keys in multi-term aggregations #9412

Merged

Performance improvements for BytesRefHash #8788

Merged
