Optimize composite aggregation based on index sorting #48399

Merged
merged 26 commits into from
Dec 17, 2019
Changes from 25 commits
Commits
0fb183c
optimize composite aggregation by index sorting
howardhuanghua Oct 14, 2019
484f728
fix long ling issue
howardhuanghua Oct 14, 2019
ea43c88
enhance comment
howardhuanghua Oct 14, 2019
4a82e21
fix a reset issue
howardhuanghua Oct 15, 2019
1c1af51
enhance test case
howardhuanghua Oct 16, 2019
e279371
add reverse order test case
howardhuanghua Oct 22, 2019
d1f3091
remove extra test case
howardhuanghua Oct 22, 2019
618eeb8
Adapt optimization to handle more than one leading source when index …
jimczi Oct 23, 2019
a5e7c25
Merge branch 'master' into composite_sorted_index_optim
jimczi Oct 28, 2019
03141a5
address review
jimczi Oct 28, 2019
64d6f50
Merge branch 'master' into composite_sorted_index_optim
jimczi Oct 31, 2019
646f965
Add documentation for early termination
jimczi Oct 31, 2019
b43b72b
fix comment
jimczi Oct 31, 2019
ee0b122
fix docs
jimczi Oct 31, 2019
1fa763b
separate test to assert on early termination and update docs
jimczi Oct 31, 2019
e497141
unused imports
jimczi Oct 31, 2019
c38e7bc
Merge branch 'master' into composite_sorted_index_optim
jimczi Nov 1, 2019
b52d5c7
fix suppresscodecs
jimczi Nov 1, 2019
701142a
Merge branch 'master' into composite_sorted_index_optim
jimczi Nov 6, 2019
12f3857
set default coded when index sort is provided in tests
jimczi Nov 6, 2019
92e8d30
Merge branch 'master' into composite_sorted_index_optim
jimczi Nov 12, 2019
c0c68cf
address review and TODO
jimczi Nov 12, 2019
3035b07
fix comments
jimczi Nov 12, 2019
5b0c3ab
Merge branch 'master' into composite_sorted_index_optim
jimczi Dec 12, 2019
95820b1
address comments
jimczi Dec 12, 2019
d182c6a
Merge branch 'master' into composite_sorted_index_optim
jimczi Dec 16, 2019
125 changes: 124 additions & 1 deletion docs/reference/aggregations/bucket/composite-aggregation.asciidoc
@@ -116,6 +116,7 @@ Example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -134,6 +135,7 @@ Like the `terms` aggregation it is also possible to use a script to create the v
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -168,6 +170,7 @@ Example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -186,6 +189,7 @@ The values are built from a numeric field or a script that return numerical valu
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -218,6 +222,7 @@ is specified by date/time expression:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -247,6 +252,7 @@ the format specified with the format parameter:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -289,6 +295,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -311,6 +318,7 @@ in the composite buckets.
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -340,6 +348,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -366,6 +375,7 @@ It is possible to include them in the response by setting `missing_bucket` to
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -391,7 +401,7 @@ first 10 composite buckets created from the values source.
The response contains the values for each composite bucket in an array containing the values extracted
from each value source.

==== After
==== Pagination

If the number of composite buckets is too high (or unknown) to be returned in a single response
it is possible to split the retrieval in multiple requests.
@@ -405,6 +415,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -470,6 +481,7 @@ round of result can be retrieved with:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -487,6 +499,116 @@ GET /_search

<1> Should restrict the aggregation to buckets that sort **after** the provided values.
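
The paging loop implied by `after` can be sketched in Python as follows. This is only an illustration: the `search` callable stands in for whatever client sends the request (no real client API is used), and `fake_search`, `iter_composite_buckets` and the `product` field are hypothetical names that simulate a response carrying `buckets` and `after_key`.

```python
def iter_composite_buckets(search, sources, page_size=2):
    """Yield every composite bucket, following `after_key` page by page."""
    after = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after is not None:
            # Restrict this page to buckets that sort after the last key seen.
            composite["after"] = after
        body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
        result = search(body)["aggregations"]["my_buckets"]
        yield from result["buckets"]
        after = result.get("after_key")
        if after is None or not result["buckets"]:
            break


def fake_search(body):
    """Stand-in for a real search call: pages over three in-memory buckets."""
    keys = ["a", "b", "c"]
    composite = body["aggs"]["my_buckets"]["composite"]
    after = composite.get("after")
    remaining = [k for k in keys if after is None or k > after["product"]]
    page = remaining[: composite["size"]]
    result = {"buckets": [{"key": {"product": k}, "doc_count": 1} for k in page]}
    if page:
        result["after_key"] = {"product": page[-1]}
    return {"aggregations": {"my_buckets": result}}


sources = [{"product": {"terms": {"field": "product"}}}]
keys = [b["key"]["product"] for b in iter_composite_buckets(fake_search, sources)]
print(keys)
```

The loop stops when a page comes back empty or without an `after_key`, so the caller never has to know the number of buckets in advance.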

==== Early termination

For optimal performance, the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
part or all of the source order in the composite aggregation.
For instance, the following index sort:

[source,console]
--------------------------------------------------
PUT twitter
{
"settings" : {
"index" : {
"sort.field" : ["username", "timestamp"], <1>
"sort.order" : ["asc", "desc"] <2>
}
},
"mappings": {
"properties": {
"username": {
"type": "keyword",
"doc_values": true
},
"timestamp": {
"type": "date"
}
}
}
}
--------------------------------------------------

<1> This index is sorted by `username` first then by `timestamp`.
<2> ... in ascending order for the `username` field and in descending order for the `timestamp` field.

... could be used to optimize these composite aggregations:

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "username" } } } <1>
]
}
}
}
}
--------------------------------------------------

<1> `username` is a prefix of the index sort and the order matches (`asc`).

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "username" } } }, <1>
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
]
}
}
}
}
--------------------------------------------------

<1> `username` is a prefix of the index sort and the order matches (`asc`).
<2> `timestamp` also matches the prefix and the order matches (`desc`).
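
The prefix rule behind these two requests can be sketched as a small Python check. This illustrates the rule as documented here and is not the logic used inside the aggregator; the helper name `matching_sort_prefix` and the `(field, order)` tuple layout are assumptions, and only field names and orders are compared.

```python
def matching_sort_prefix(index_sort, sources):
    """Return how many leading composite sources match the index sort.

    index_sort: list of (field, order) pairs, as in `sort.field` / `sort.order`.
    sources:    list of (field, order) pairs, one per composite source.
    Hypothetical helper: only field names and orders are compared.
    """
    matched = 0
    for (sort_field, sort_order), (src_field, src_order) in zip(index_sort, sources):
        if sort_field != src_field or sort_order != src_order:
            break
        matched += 1
    return matched


index_sort = [("username", "asc"), ("timestamp", "desc")]

# Both sources follow the index sort, in the same order and direction.
full = matching_sort_prefix(index_sort, [("username", "asc"), ("timestamp", "desc")])
# Swapping the sources breaks the prefix at the first position.
swapped = matching_sort_prefix(index_sort, [("timestamp", "desc"), ("username", "asc")])
print(full, swapped)
```

A result of zero means no leading source lines up with the index sort, which is the case where the sort optimization described in this section would not apply.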

In order to optimize the early termination it is advised to set `track_total_hits` in the request
to `false`. The number of total hits that match the request can be retrieved on the first request
and it would be costly to compute this number on every page:

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"track_total_hits": false,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "username" } } },
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
]
}
}
}
}
--------------------------------------------------
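
One possible reading of this advice, sketched in Python: count total hits on the first page only and disable the count on every later page. The helper name `build_page_request` and the sample `after` value are hypothetical; the request keys (`size`, `track_total_hits`, `aggs`, `composite`, `sources`, `after`) are the ones used in the examples above.

```python
def build_page_request(sources, after=None):
    """Build a composite request body; hypothetical helper for illustration."""
    composite = {"sources": sources}
    if after is not None:
        composite["after"] = after
    return {
        "size": 0,
        # Assumption: totals are wanted once, so only the first page
        # (the one without an `after` key) asks for them.
        "track_total_hits": after is None,
        "aggs": {"my_buckets": {"composite": composite}},
    }


sources = [
    {"user_name": {"terms": {"field": "username"}}},
    {"date": {"date_histogram": {"field": "timestamp",
                                 "calendar_interval": "1d",
                                 "order": "desc"}}},
]

first_page = build_page_request(sources)
# The `after` value below is made up; in practice it comes from the previous response.
next_page = build_page_request(sources, after={"user_name": "kimchy", "date": 1576540800000})
print(first_page["track_total_hits"], next_page["track_total_hits"])
```

If the total is never needed, passing `False` on every page, as in the request above, is simpler.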

Note that the order of the sources is important: in the example above, switching the `user_name` source with the `timestamp` one
would deactivate the sort optimization, since this configuration wouldn't match the index sort specification.
If the order of the sources does not matter for your use case, you can follow these simple guidelines:

* Put the fields with the highest cardinality first.
* Make sure that the order of the field matches the order of the index sort.
* Put multi-valued fields last since they cannot be used for early termination.

WARNING: <<index-modules-index-sorting,index sort>> can slow down indexing, so it is very important to test index sorting
with your specific use case and dataset to ensure that it matches your requirements. If it doesn't, note that `composite`
aggregations will also try to terminate early on non-sorted indices if the query matches all documents (`match_all` query).

==== Sub-aggregations

Like any `multi-bucket` aggregations the `composite` aggregation can hold sub-aggregations.
@@ -499,6 +621,7 @@ per composite bucket:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
@@ -235,7 +235,8 @@ protected AggregatorFactory doBuild(QueryShardContext queryShardContext, Aggrega
} else {
afterKey = null;
}
return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size, configs, afterKey);
return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size,
configs, afterKey);
}

