Speed up interval rounding #63245

Merged on Oct 6, 2020 (1 commit)

nik9000
Member

@nik9000 nik9000 commented Oct 5, 2020

This speeds up `date_histogram` by precomputing the rounding points for
date intervals like `10d`. The rounding itself takes between 18% (UTC,
many buckets) and 65% (US Eastern Time, few buckets) less time per
operation. 43% seems like it'd be pretty common:

```
Benchmark   (count)  (interval)                   (range)           (zone)  Mode  Cnt          Score         Error  Units
before     10000000         10d  2000-10-28 to 2000-10-31              UTC  avgt   10  130822390.700 ±  177466.657  ns/op
before     10000000         10d  2000-10-28 to 2000-10-31 America/New_York  avgt   10  189236837.930 ± 7958933.566  ns/op
after      10000000         10d  2000-10-28 to 2000-10-31              UTC  avgt   10   66413746.325 ± 1578834.032  ns/op
after      10000000         10d  2000-10-28 to 2000-10-31 America/New_York  avgt   10   65656941.375 ±  291608.870  ns/op

before     10000000          2h  2000-10-28 to 2000-10-31              UTC  avgt   10  130854975.013 ±  369133.702  ns/op
before     10000000          2h  2000-10-28 to 2000-10-31 America/New_York  avgt   10  165831615.257 ±  139074.982  ns/op
after      10000000          2h  2000-10-28 to 2000-10-31              UTC  avgt   10  107832636.671 ± 3502704.198  ns/op
after      10000000          2h  2000-10-28 to 2000-10-31 America/New_York  avgt   10  107608802.940 ±  979286.160  ns/op
```

The end-to-end speedup for `date_histogram` is likely to vary based on how
much IO dominates the collection.
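
To make "precomputing the rounding points" concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes a fixed-length interval in UTC, and the class name `PrecomputedRounding` and its methods are hypothetical. All bucket starts for a known `[min, max]` range are computed once, and each value is then rounded with a binary search over that array. In this simplified UTC case the direct arithmetic is already cheap; the lookup presumably pays off most when computing a boundary is expensive, as with time zones.

```java
import java.time.Instant;
import java.util.Arrays;

// Hypothetical illustration only; not the Elasticsearch implementation.
final class PrecomputedRounding {
    private final long intervalMillis;
    private final long[] points; // sorted bucket start times in epoch millis

    /** Precompute every bucket start needed to round values in [min, max]. */
    PrecomputedRounding(long intervalMillis, long min, long max) {
        if (intervalMillis <= 0 || min > max) {
            throw new IllegalArgumentException("need a positive interval and min <= max");
        }
        this.intervalMillis = intervalMillis;
        long first = Math.floorDiv(min, intervalMillis) * intervalMillis;
        long last = Math.floorDiv(max, intervalMillis) * intervalMillis;
        int count = Math.toIntExact((last - first) / intervalMillis + 1);
        points = new long[count];
        for (int i = 0; i < count; i++) {
            points[i] = first + i * intervalMillis;
        }
    }

    /** Round down to the start of the containing bucket using only a lookup. */
    long round(long value) {
        long end = points[points.length - 1] + intervalMillis;
        if (value < points[0] || value >= end) {
            throw new IllegalArgumentException("value outside the precomputed range");
        }
        int idx = Arrays.binarySearch(points, value);
        // A negative result encodes the insertion point as -(insertion) - 1,
        // so the bucket start just below the value sits at -idx - 2.
        return idx >= 0 ? points[idx] : points[-idx - 2];
    }

    public static void main(String[] args) {
        long tenDays = 10L * 24 * 60 * 60 * 1000;
        long min = Instant.parse("2000-10-28T00:00:00Z").toEpochMilli();
        long max = Instant.parse("2000-10-31T00:00:00Z").toEpochMilli();
        PrecomputedRounding rounding = new PrecomputedRounding(tenDays, min, max);
        long value = Instant.parse("2000-10-29T12:34:56Z").toEpochMilli();
        System.out.println(Instant.ofEpochMilli(rounding.round(value)));
    }
}
```

The array stays small when the range covers few buckets, which is consistent with the larger gains reported above for the "few buckets" cases.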
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 5, 2020
@nik9000
Member Author

nik9000 commented Oct 5, 2020

I'm running performance tests locally to see what I see on some real data.

@nik9000
Member Author

nik9000 commented Oct 5, 2020

The data that I have to run performance tests for this actually doesn't work well because it contains outliers which disable this optimization: there are some docs with dates in 1970 and some in 2050, but the bulk of them are around 2015 or something. That kind of thing defeats this optimization. So we're going to have to rely on the lower level performance test here. I'll have a think about outliers and how we can stop them from defeating the optimization.
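
One plausible reading of why outliers hurt, sketched under assumptions: the number of points to precompute grows with the span between the smallest and largest date, so a few docs in 1970 and 2050 inflate it even when most docs cluster around 2015. The `RoundingPlanner` class and the `MAX_POINTS` cap below are hypothetical and are not taken from this PR; they only illustrate the arithmetic.

```java
import java.time.Instant;

// Hypothetical illustration only; the cap and names are assumptions.
final class RoundingPlanner {
    /** Assumed limit on how many rounding points are worth precomputing. */
    static final int MAX_POINTS = 128;

    /** Number of fixed-length buckets needed to cover [min, max] in epoch millis. */
    static long pointsNeeded(long intervalMillis, long min, long max) {
        return Math.floorDiv(max, intervalMillis) - Math.floorDiv(min, intervalMillis) + 1;
    }

    /** True when the whole range fits under the assumed cap. */
    static boolean canPrecompute(long intervalMillis, long min, long max) {
        return pointsNeeded(intervalMillis, min, max) <= MAX_POINTS;
    }

    public static void main(String[] args) {
        long tenDays = 10L * 24 * 60 * 60 * 1000;
        long y1970 = Instant.parse("1970-01-01T00:00:00Z").toEpochMilli();
        long y2050 = Instant.parse("2050-01-01T00:00:00Z").toEpochMilli();
        long start2015 = Instant.parse("2015-01-01T00:00:00Z").toEpochMilli();
        long end2015 = Instant.parse("2015-12-31T00:00:00Z").toEpochMilli();

        // Range stretched by outliers: thousands of 10d buckets.
        System.out.println(pointsNeeded(tenDays, y1970, y2050)
            + " points, precompute: " + canPrecompute(tenDays, y1970, y2050));
        // Range covering only the bulk of the data: a few dozen buckets.
        System.out.println(pointsNeeded(tenDays, start2015, end2015)
            + " points, precompute: " + canPrecompute(tenDays, start2015, end2015));
    }
}
```

Under this assumption, the min/max of the field alone decides whether the optimization applies, which is why a handful of stray dates is enough to turn it off.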

Member

@not-napoleon not-napoleon left a comment

LGTM

@nik9000 nik9000 merged commit 62a74d0 into elastic:master Oct 6, 2020
@nik9000
Member Author

nik9000 commented Oct 6, 2020

Thanks @not-napoleon !

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Oct 6, 2020
nik9000 added a commit that referenced this pull request Oct 7, 2020
@nik9000 nik9000 added v7.11.0 and removed v7.10.0 labels Oct 7, 2020
@nik9000
Member Author

nik9000 commented Oct 7, 2020

The backport of this didn't make the branch cut for 7.10 so it'll release with 7.11.

Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.11.0 v8.0.0-alpha1