Consider query when optimizing date rounding (backport of #63403) #63571

Merged
merged 5 commits into elastic:7.x on Oct 13, 2020

Conversation

@nik9000 (Member) commented on Oct 12, 2020

Before this change we inspected the index when optimizing
`date_histogram` aggregations, precalculating the divisions for the
buckets for the entire range of dates on the index so long as there
aren't a ton of these buckets. This works very well when you query all
of the dates in the index, which is quite common - after all, folks
frequently want to query a week of data and have daily indices.
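
To make the "before" concrete, here is a minimal sketch of that precalculation, assuming the `org.elasticsearch.common.Rounding` API from this era (`prepare` vs. `prepareForUnknown`); the bounds and timestamps are made up for illustration:

```java
import org.elasticsearch.common.Rounding;

public class PrepareOnIndexBounds {
    public static void main(String[] args) {
        // A day rounding, like a date_histogram with calendar_interval=day would use.
        Rounding rounding = Rounding.builder(Rounding.DateTimeUnit.DAY_OF_MONTH).build();

        // Illustrative shard-wide min/max of the date field (epoch millis).
        long indexMinDate = 1420070400000L; // 2015-01-01
        long indexMaxDate = 1483228800000L; // 2017-01-01

        // Precalculate the bucket divisions for the whole index range up front so
        // rounding each document is a cheap lookup instead of a calendar calculation.
        Rounding.Prepared prepared = rounding.prepare(indexMinDate, indexMaxDate);
        System.out.println(prepared.round(1451606400000L)); // a doc timestamp in 2016
    }
}
```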

But it doesn't work as well when the index is much larger than the
query. This is quite common when dumping data into ES just to
investigate it, but less common in the traditional time series use case.
Even there it still happens, it's just less impactful. Consider
the default query produced by Kibana's Discover app: a range of 15
minutes and an interval of 30 seconds. This optimization saves something
like 3 to 12 nanoseconds per document, so those 15 minutes would have to
contain hundreds of millions of documents for it to be impactful.
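
(For scale: taking 5 nanoseconds per document as a round number inside that range, saving
even one second works out to 1 s ÷ 5 ns ≈ 200 million documents in that 15 minute window.)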

Anyway, this commit takes the query into account when precalculating the
buckets. Mostly this is good when you have "dirty data". Imagine
loading 80 billion docs into an index to investigate them. Most of them
have dates around 2015 and 2016, but some have dates in 1970 and
others have dates in 2030. These outlier dates are "dirty" garbage.
Well, without this change a `date_histogram` across many of these docs
is significantly slowed down because we don't precalculate the range due
to the outliers. That's just rude! So this change takes the query into
account.
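
And a hand-wavy sketch of the "after": intersect the query's bounds with the index's bounds before preparing the rounding, so the outliers stop blowing up the bucket count. This is only the idea, not the actual implementation (the real change recovers the bounds from the top level query plumbed in through the new context described below); it assumes the same `Rounding` methods as above and the numbers are made up:

```java
import org.elasticsearch.common.Rounding;

public class PrepareOnQueryBounds {
    public static void main(String[] args) {
        Rounding rounding = Rounding.builder(Rounding.DateTimeUnit.DAY_OF_MONTH).build();

        // Index bounds polluted by outliers: 1970 through 2030.
        long indexMin = 0L;             // 1970-01-01
        long indexMax = 1893456000000L; // 2030-01-01

        // Bounds recovered from the top level query, e.g. a range filter on the date field.
        long queryMin = 1420070400000L; // 2015-01-01
        long queryMax = 1483228800000L; // 2017-01-01

        // Intersect the two ranges so the outliers no longer matter.
        long min = Math.max(indexMin, queryMin);
        long max = Math.min(indexMax, queryMax);

        Rounding.Prepared prepared = max >= min
            ? rounding.prepare(min, max)    // bounded range: precalculate the buckets
            : rounding.prepareForUnknown(); // no usable bounds: round each doc the slow way
        System.out.println(prepared.round(1451606400000L));
    }
}
```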

The bulk of the code change here is plumbing the query into place. It
turns out that it's a *ton* of plumbing, so instead of just adding a
`Query` member to hundreds of arguments, this change replaces
`QueryShardContext` with a new `AggregationContext` (roughly the shape
sketched below) which does two things:

  1. Has the top level `Query`.
  2. Exposes just the parts of `QueryShardContext` that we actually need
     to run aggregations. This lets us simplify a few tests now and will
     let us simplify many, many tests later.
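
Purely as an illustration of that shape (the method names below are assumptions, not the actual API of the new class), the new context looks roughly like:

```java
import org.apache.lucene.search.Query;
import org.elasticsearch.common.util.BigArrays;

/**
 * Illustrative sketch only: the real AggregationContext introduced in this PR
 * exposes a larger, differently named set of methods lifted from QueryShardContext.
 */
public abstract class AggregationContext {
    /** The top level query, so aggregators can use it when optimizing. */
    public abstract Query query();

    /** Just the pieces of QueryShardContext that aggregations actually need, e.g.: */
    public abstract BigArrays bigArrays();

    public abstract long nowInMillis();
}
```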

@nik9000 merged commit 5de6f10 into elastic:7.x on Oct 13, 2020