
Consider query when optimizing date rounding #63403

Merged
merged 11 commits on Oct 12, 2020

Conversation

@nik9000 nik9000 commented Oct 7, 2020

Before this change we inspected the index when optimizing
`date_histogram` aggregations, precalculating the divisions for the
buckets for the entire range of dates on the index so long as there
aren't a ton of these buckets. This works very well when you query all
of the dates in the index, which is quite common - after all, folks
frequently want to query a week of data and have daily indices.

But it doesn't work as well when the index is much larger than the
query. This is quite common when dumping data into ES just to
investigate it, but less common in the traditional time series use case.
But even there it still happens, it is just less impactful. Consider
the default query produced by Kibana's Discover app: a range of 15
minutes and an interval of 30 seconds. This optimization saves something
like 3 to 12 nanoseconds per document, so that 15 minutes would have to
contain hundreds of millions of documents for it to be impactful.

Anyway, this commit takes the query into account when precalculating the
buckets. Mostly this is good when you have "dirty data". Imagine
loading 80 billion docs into an index to investigate them. Most of them
have dates around 2015 and 2016, but some have dates in 1970 and
others have dates in 2030. These outlier dates are "dirty" garbage.
Well, without this change a `date_histogram` across many of these docs
is significantly slowed down because we don't precalculate the range due
to the outliers. That's just rude! So this change takes the query into
account.

The bulk of the code change here is plumbing the query into place. It
turns out that it's a *ton* of plumbing, so instead of just adding a
`Query` member in hundreds of args we replace `QueryShardContext` with a
new `AggregationContext` which does two things:

1. Has the top level `Query`.
2. Exposes just the parts of `QueryShardContext` that we actually need
   to run aggregations. This lets us simplify a few tests now and will
   let us simplify many, many tests later.
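
Here is a minimal sketch of the clamping idea described above, with hypothetical names - not the actual Elasticsearch implementation. The point is that the range we precalculate rounding points for is the index's min/max dates narrowed to the query's bounds, so a handful of garbage dates no longer forces us to give up on the optimization:

```java
class QueryAwareRoundingSketch {

    /** Hypothetical holder for a date range in epoch millis. */
    static final class Bounds {
        final long minMillis;
        final long maxMillis;

        Bounds(long minMillis, long maxMillis) {
            this.minMillis = minMillis;
            this.maxMillis = maxMillis;
        }
    }

    /** Narrow the index bounds by the query bounds when the query has any. */
    static Bounds effectiveBounds(Bounds index, Bounds query) {
        if (query == null) {
            return index; // nothing to narrow with - the old behaviour
        }
        return new Bounds(Math.max(index.minMillis, query.minMillis),
                          Math.min(index.maxMillis, query.maxMillis));
    }

    /** Precalculating only pays off when the range produces a small number of buckets. */
    static boolean worthPrecalculating(Bounds bounds, long intervalMillis, long maxPrecalculatedBuckets) {
        long buckets = (bounds.maxMillis - bounds.minMillis) / intervalMillis + 1;
        return buckets <= maxPrecalculatedBuckets;
    }

    public static void main(String[] args) {
        // The index spans 1970..2030 because of a few garbage dates.
        Bounds index = new Bounds(0L, 1_900_000_000_000L);
        // The query only asks for a 15 minute window (the Kibana Discover example).
        Bounds query = new Bounds(1_600_000_000_000L, 1_600_000_900_000L);
        long thirtySeconds = 30_000L;
        long limit = 128; // hypothetical cap; the exact number isn't the point

        System.out.println(worthPrecalculating(index, thirtySeconds, limit));                         // false - give up
        System.out.println(worthPrecalculating(effectiveBounds(index, query), thirtySeconds, limit)); // true
    }
}
```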

nik9000 commented Oct 7, 2020

Here are some performance results:

| Run              | Metric                        | Task                                  |   Value | Unit  |
|------------------|-------------------------------|---------------------------------------|--------:|-------|
| no optimization  | Min Throughput                | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| no optimization  | Median Throughput             | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| no optimization  | Max Throughput                | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| no optimization  | 50th percentile latency       | date_histogram_fixed_interval_with_tz | 16659.4 | ms    |
| no optimization  | 90th percentile latency       | date_histogram_fixed_interval_with_tz | 18515.2 | ms    |
| no optimization  | 100th percentile latency      | date_histogram_fixed_interval_with_tz | 18946.2 | ms    |
| no optimization  | 50th percentile service time  | date_histogram_fixed_interval_with_tz | 13087.2 | ms    |
| no optimization  | 90th percentile service time  | date_histogram_fixed_interval_with_tz | 13183.9 | ms    |
| no optimization  | 100th percentile service time | date_histogram_fixed_interval_with_tz |   13393 | ms    |
| no optimization  | error rate                    | date_histogram_fixed_interval_with_tz |       0 | %     |
| with this change | Min Throughput                | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| with this change | Median Throughput             | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| with this change | Max Throughput                | date_histogram_fixed_interval_with_tz |    0.08 | ops/s |
| with this change | 50th percentile latency       | date_histogram_fixed_interval_with_tz | 11981.5 | ms    |
| with this change | 90th percentile latency       | date_histogram_fixed_interval_with_tz | 12099.6 | ms    |
| with this change | 100th percentile latency      | date_histogram_fixed_interval_with_tz | 12221.6 | ms    |
| with this change | 50th percentile service time  | date_histogram_fixed_interval_with_tz | 11980.5 | ms    |
| with this change | 90th percentile service time  | date_histogram_fixed_interval_with_tz | 12098.6 | ms    |
| with this change | 100th percentile service time | date_histogram_fixed_interval_with_tz | 12220.5 | ms    |
| with this change | error rate                    | date_histogram_fixed_interval_with_tz |       0 | %     |

"no optimization" is without #63245. Without this change and with #63245 we perform worse on dirty data because we take the time to precalculate and give up. Without the optimizations my performance test machine couldn't barely not hit the target interval of 13 operations per second. Even the 50% percentile service time was above 13 seconds. Barely. With this optimization it is barely under twelve. So with dirty data this saves about a second or about 8%. Not bad but we can do better! And we will. Eventually. This unblocks that "doing better" on dirty data.

nik9000 commented Oct 7, 2020

All the BWC failures look like standard branch cut day fun.

import static org.hamcrest.Matchers.containsString;

public class ParentJoinFieldMapperTests extends ESSingleNodeTestCase {
Member Author

I simplified this when I bumped into it while working on this issue. It's not strictly related, but some changes were required to stay compatible with the rest of the changes. The key simplification is that we don't stand up a whole node any more - we just set up the mapper parsing infrastructure and some Lucene indices.


public AggregationUsageService getUsageService() {
return valuesSourceRegistry.getUsageService();
}
Member Author

We don't need this at all any more - the caller now gets it from the ValuesSourceRegistry.

try {
AggregatorFactories factories = source.aggregations().build(queryShardContext, null);
Member Author

This is the start of the actual plumbing.

@@ -70,7 +69,6 @@ static AutoDateHistogramAggregator build(
AggregatorFactories factories,
int targetBuckets,
RoundingInfo[] roundingInfos,
Function<Rounding, Rounding.Prepared> roundingPreparer,
Member Author

There was a TODO around moving this to the ctor, which I bumped into while fixing the calls next to it.

@@ -441,14 +441,14 @@ protected ValuesSourceAggregatorFactory innerBuild(QueryShardContext queryShardC
LongBounds roundedBounds = null;
if (this.extendedBounds != null) {
// parse any string bounds to longs and round
roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds" , queryShardContext, config.format())
roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds" , context::nowInMillis, config.format())
Member Author

I switched the parsing so we don't need to pass the whole query shard context in; now it's easier to test too!
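
As a small illustration of the testability point (hypothetical names, not the real `LongBounds` code): the only thing the bounds parsing needs from the context is "now", so a test can hand in a fixed clock instead of building a whole `QueryShardContext`.

```java
import java.util.function.LongSupplier;

class BoundsParsingSketch {

    /** Parse a bound that may be the literal "now"; anything else is an epoch-millis value. */
    static long parseBound(String raw, LongSupplier nowInMillis) {
        return "now".equals(raw) ? nowInMillis.getAsLong() : Long.parseLong(raw);
    }

    public static void main(String[] args) {
        LongSupplier fixedClock = () -> 1_600_000_000_000L; // deterministic "now" for a test
        System.out.println(parseBound("now", fixedClock)); // 1600000000000
        System.out.println(parseBound("42", fixedClock));  // 42
    }
}
```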

* search index (the first test) and the resolution which is
* on the DateFieldType.
*/
if (fieldContext.fieldType() instanceof DateFieldType == false) {
Member Author

This means that runtime fields can't do it. It makes me think that we're doing something wrong, but I think that is something to solve in a follow-up.

Member

Agreed. The instanceof check definitely smells wrong here, but I don't know what the right answer is.

import java.util.Map;
import java.util.function.Function;

import static org.hamcrest.Matchers.equalTo;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class AggregatorBaseTests extends ESSingleNodeTestCase {
Member Author

This one was so short I thought I could clean up lots of these ESSingleNodeTestCases in this PR. It turns out that no, no, I can't.


// TODO: This whole set of tests needs to be rethought.
public class ValuesSourceConfigTests extends MapperServiceTestCase {
Member Author

No more entire node!

return source("1", build, null);
}

protected final SourceToParse source(String id, CheckedConsumer<XContentBuilder, IOException> build, @Nullable String routing)
Member Author

The parent/child tests wanted this one.

@@ -222,7 +251,111 @@ protected final XContentBuilder fieldMapping(CheckedConsumer<XContentBuilder, IO
});
}

QueryShardContext createQueryShardContext(MapperService mapperService) {
private AggregationContext aggregationContext(MapperService mapperService, IndexSearcher searcher, Query query) {
Member Author

I decided not to go with Mockito here, partially because I wanted to suffer every time I added a new method to AggregationContext.

Member

I can't tell if you're joking or not.

Member Author

I'm really not. Suffering makes you think "should I really add this method? this class is already big. maybe there is a cleaner way."

@nik9000 nik9000 marked this pull request as ready for review October 7, 2020 21:55
ScriptedMetricAggContexts.CombineScript.CONTEXT);
Map<String, Object> combineScriptParams = combineScript.getParams();

return new ScriptedMetricAggregatorFactory(name, compiledMapScript, mapScriptParams, compiledInitScript,
initScriptParams, compiledCombineScript, combineScriptParams, reduceScript,
params, queryShardContext.lookup(), queryShardContext, parent, subfactoriesBuilder, metadata);
params, context, parent, subfactoriesBuilder, metadata);
Member

I like this. It annoys me when we pass in both an object and something derived from that object, like the old version did.

import java.util.List;
import java.util.Optional;

public abstract class AggregationContext {
Member

We should have some class level javadoc for this.

* top level {@link Query}.
*/
public static AggregationContext from(QueryShardContext context, Query query) {
return new AggregationContext() {
Member

What are we gaining by making AggregationContext abstract and building it via this anonymous closure thing? It seems to me we could just store a reference to a QueryShardContext in a concrete class and serve these same methods up directly. I think that would be more readable, but maybe there's another consideration I haven't thought of?
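
For illustration only (the method names below are hypothetical, not the real `AggregationContext` surface), the two shapes being weighed are roughly:

```java
import java.util.function.LongSupplier;

// Shape suggested in the review: a concrete class that stores the "real" context
// and delegates to it directly.
final class ConcreteAggregationContext {
    private final LongSupplier queryShardContextClock; // stand-in for a QueryShardContext dependency

    ConcreteAggregationContext(LongSupplier queryShardContextClock) {
        this.queryShardContextClock = queryShardContextClock;
    }

    long nowInMillis() {
        return queryShardContextClock.getAsLong();
    }
}

// Shape used in the PR: an abstract class whose "production" implementation is built
// from the real context, while tests can subclass it and stub only what they need.
abstract class AbstractAggregationContext {
    abstract long nowInMillis();

    static AbstractAggregationContext production(LongSupplier queryShardContextClock) {
        return new AbstractAggregationContext() {
            @Override
            long nowInMillis() {
                return queryShardContextClock.getAsLong();
            }
        };
    }
}
```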



nik9000 commented Oct 12, 2020

@not-napoleon I've pushed patches for all of your notes. I also explained my reasoning around making the class abstract with a "production" implementation. Maybe I'm just in a strange mood, but it makes me feel better about having such a large "holder" sort of class.

@not-napoleon not-napoleon left a comment

This looks good. I think clearly documenting what the production path looks like solves my concerns around making AggregationContext abstract. Thank you for addressing the nits too!

nik9000 commented Oct 12, 2020

@elasticmachine, retest this please

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics label (Meta label for analytical engine team (ESQL/Aggs/Geo)) on Oct 12, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Oct 12, 2020
nik9000 added a commit that referenced this pull request Oct 13, 2020
…3571)

Labels
:Analytics/Aggregations (Aggregations), >enhancement, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v7.11.0, v8.0.0-alpha1