
[Uptime] Improve snapshot timespan handling #58078

Closed
andrewvc wants to merge 9 commits into 7.6

Conversation

andrewvc
Contributor

@andrewvc commented Feb 20, 2020

Summary

Fixes #58079

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time; we can forward-port it to 7.x / master later.

This patch improves the handling of timespans in snapshot counts. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch, the count you get is the number of monitors that were down at any point over the past 5m, which is not very useful.

This patch improves the situation by gathering stats from the index and choosing a query strategy based on those stats:

  1. In low-cardinality situations, for instance when fewer than 1000 monitors are either up or down over the past 5m, we deliver an exact up/down count by using the monitor iterator to count the monitors in one state and subtracting that from the total. Even in slow situations this shouldn't take longer than ~1s.
  2. In high-cardinality situations we still follow the efficient snapshot count path, but filter it to the past 30s instead of the past 5m, which is more useful.

I've modified the API to report which method was used. We may eventually want a note in the UI explaining that counts can be more stale in the 30s case, but that might be overkill.
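
For clarity, here is a minimal TypeScript sketch of the strategy selection described above; the helper names, threshold wiring, and timespans are illustrative only, not the actual code in this PR.

```ts
// Hypothetical sketch only; getMonitorCardinality, countStatusesExactly, and
// timesliceSnapshotCount stand in for whatever helpers the real code uses.
interface SnapshotCount {
  up: number;
  down: number;
  total: number;
  method: 'exact' | 'timeslice';
}

const CARDINALITY_THRESHOLD = 1000;
const DEFAULT_TIMESPAN = '5m';
const TIMESLICE = '30s';

export async function getSnapshotCount(deps: {
  getMonitorCardinality(timespan: string): Promise<number>;
  countStatusesExactly(timespan: string): Promise<SnapshotCount>;
  timesliceSnapshotCount(timespan: string): Promise<SnapshotCount>;
}): Promise<SnapshotCount> {
  // Gather stats from the index first, then pick the query strategy.
  const cardinality = await deps.getMonitorCardinality(DEFAULT_TIMESPAN);
  if (cardinality < CARDINALITY_THRESHOLD) {
    // Low cardinality: iterate monitors for an exact up/down count.
    return deps.countStatusesExactly(DEFAULT_TIMESPAN);
  }
  // High cardinality: keep the efficient snapshot count, but over the last 30s.
  return deps.timesliceSnapshotCount(TIMESLICE);
}
```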

This whole approach may change when we add support for the 'stale' state.

This PR also improves our doc generators in tests so they don't refresh so often. Without this improvement we'd back up the Elasticsearch queues with needless refreshes; now we only refresh the index once we're done indexing.
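
As a rough illustration of that change (a sketch assuming an @elastic/elasticsearch client and a test index name, not the generator code itself):

```ts
import { Client } from '@elastic/elasticsearch';

// Index generated test docs without triggering a refresh per request,
// then refresh once at the end so everything becomes searchable together.
export async function indexTestDocs(client: Client, index: string, docs: object[]) {
  const body = docs.flatMap((doc) => [{ index: { _index: index } }, doc]);
  await client.bulk({ refresh: false, body });
  await client.indices.refresh({ index });
}
```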

Checklist


For maintainers

This patch improves the handling of timespans with snapshot counts. This
feature originally worked, but suffered a regression when we increased
the default timespan in the query context to 5m. This means that without
this patch the counts you get are the maximum total number of monitors
that were down over the past 5m, which is not really that useful.

This patch improves the situation in two ways:

1. In low-cardinality situations, for instance when < 1000 monitors are
either up or down, we deliver an *exact* count of up/down using the
monitor iterator to count the monitors in one state and subtracting
that from the total. Even in slow situations this shouldn't take
longer than ~1s.
2. In high-cardinality situations we still follow the efficient snapshot
count path, but filter it to the past 30s instead of the past 5m,
which is more useful.

I've modified the API to report which method was used. We may
eventually want a note in the UI explaining that counts can be more
stale in the 30s case, but that might be overkill.

There are no tests yet, I'd like for us to review the general approach
before adding those as they will be non-trivial.
@andrewvc added the bug, discuss, Team:Uptime, and v7.6.0 labels on Feb 20, 2020
@andrewvc self-assigned this Feb 20, 2020
@elasticmachine
Contributor

Pinging @elastic/uptime (Team:uptime)

@kibanamachine
Contributor

💔 Build Failed


Test Failures

Kibana Pipeline / kibana-xpack-agent / X-Pack API Integration Tests.x-pack/test/api_integration/apis/uptime/rest/snapshot·ts.apis uptime uptime REST endpoints with generated data snapshot count when data is present with low cardinality with timespans included will count all statuses correctly

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

[00:00:00]       │
[00:00:00]         └-: apis
[00:00:00]           └-> "before all" hook
[00:05:08]           └-: uptime
[00:05:08]             └-> "before all" hook
[00:05:08]             └-> "before all" hook
[00:05:18]             └-: uptime REST endpoints
[00:05:18]               └-> "before all" hook
[00:05:18]               └-: with generated data
[00:05:18]                 └-> "before all" hook
[00:05:18]                 └-> "before all" hook: load heartbeat data
[00:05:18]                   │ info [uptime/blank] Loading "mappings.json"
[00:05:18]                   │ info [uptime/blank] Loading "data.json"
[00:05:18]                   │ info [o.e.c.m.MetaDataCreateIndexService] [kibana-ci-immutable-debian-tests-xl-1582250529911838265] [heartbeat-8-test] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[00:05:18]                   │ info [uptime/blank] Created index "heartbeat-8-test"
[00:05:18]                   │ debg [uptime/blank] "heartbeat-8-test" settings {"index":{"mapping":{"total_fields":{"limit":"10000"}},"number_of_replicas":"1","number_of_shards":"1","query":{"default_field":["message","tags","agent.ephemeral_id","agent.id","agent.name","agent.type","agent.version","as.organization.name","client.address","client.as.organization.name","client.domain","client.geo.city_name","client.geo.continent_name","client.geo.country_iso_code","client.geo.country_name","client.geo.name","client.geo.region_iso_code","client.geo.region_name","client.mac","client.user.domain","client.user.email","client.user.full_name","client.user.group.id","client.user.group.name","client.user.hash","client.user.id","client.user.name","cloud.account.id","cloud.availability_zone","cloud.instance.id","cloud.instance.name","cloud.machine.type","cloud.provider","cloud.region","container.id","container.image.name","container.image.tag","container.name","container.runtime","destination.address","destination.as.organization.name","destination.domain","destination.geo.city_name","destination.geo.continent_name","destination.geo.country_iso_code","destination.geo.country_name","destination.geo.name","destination.geo.region_iso_code","destination.geo.region_name","destination.mac","destination.user.domain","destination.user.email","destination.user.full_name","destination.user.group.id","destination.user.group.name","destination.user.hash","destination.user.id","destination.user.name","dns.answers.class","dns.answers.data","dns.answers.name","dns.answers.type","dns.header_flags","dns.id","dns.op_code","dns.question.class","dns.question.name","dns.question.registered_domain","dns.question.type","dns.response_code","dns.type","ecs.version","error.code","error.id","error.message","event.action","event.category","event.code","event.dataset","event.hash","event.id","event.kind","event.module","event.original","event.outcome","event.provider","event.timezone","event.type","file.device","file.directory","file.extension","file.gid","file.group","file.hash.md5","file.hash.sha1","file.hash.sha256","file.hash.sha512","file.inode","file.mode","file.name","file.owner","file.path","file.target_path","file.type","file.uid","geo.city_name","geo.continent_name","geo.country_iso_code","geo.country_name","geo.name","geo.region_iso_code","geo.region_name","group.id","group.name","hash.md5","hash.sha1","hash.sha256","hash.sha512","host.architecture","host.geo.city_name","host.geo.continent_name","host.geo.country_iso_code","host.geo.country_name","host.geo.name","host.geo.region_iso_code","host.geo.region_name","host.hostname","host.id","host.mac","host.name","host.os.family","host.os.full","host.os.kernel","host.os.name","host.os.platform","host.os.version","host.type","host.user.domain","host.user.email","host.user.full_name","host.user.group.id","host.user.group.name","host.user.hash","host.user.id","host.user.name","http.request.body.content","http.request.method","http.request.referrer","http.response.body.content","http.version","log.level","log.logger","log.original","network.application","network.community_id","network.direction","network.iana_number","network.name","network.protocol","network.transport","network.type","observer.geo.city_name","observer.geo.continent_name","observer.geo.country_iso_code","observer.geo.country_name","observer.geo.name","observer.geo.region_iso_code","observer.geo.region_name","observer.hostname","observer.mac","observer.os.family","observer.os.full","observer.os.kernel","observer.
os.name","observer.os.platform","observer.os.version","observer.serial_number","observer.type","observer.vendor","observer.version","organization.id","organization.name","os.family","os.full","os.kernel","os.name","os.platform","os.version","process.args","process.executable","process.hash.md5","process.hash.sha1","process.hash.sha256","process.hash.sha512","process.name","process.thread.name","process.title","process.working_directory","server.address","server.as.organization.name","server.domain","server.geo.city_name","server.geo.continent_name","server.geo.country_iso_code","server.geo.country_name","server.geo.name","server.geo.region_iso_code","server.geo.region_name","server.mac","server.user.domain","server.user.email","server.user.full_name","server.user.group.id","server.user.group.name","server.user.hash","server.user.id","server.user.name","service.ephemeral_id","service.id","service.name","service.state","service.type","service.version","source.address","source.as.organization.name","source.domain","source.geo.city_name","source.geo.continent_name","source.geo.country_iso_code","source.geo.country_name","source.geo.name","source.geo.region_iso_code","source.geo.region_name","source.mac","source.user.domain","source.user.email","source.user.full_name","source.user.group.id","source.user.group.name","source.user.hash","source.user.id","source.user.name","tracing.trace.id","tracing.transaction.id","url.domain","url.fragment","url.full","url.original","url.password","url.path","url.query","url.scheme","url.username","user.domain","user.email","user.full_name","user.group.id","user.group.name","user.hash","user.id","user.name","user_agent.device.name","user_agent.name","user_agent.original","user_agent.os.family","user_agent.os.full","user_agent.os.kernel","user_agent.os.name","user_agent.os.platform","user_agent.os.version","user_agent.version","agent.hostname","error.type","timeseries.instance","cloud.project.id","cloud.image.id","host.os.build","host.os.codename","kubernetes.pod.name","kubernetes.pod.uid","kubernetes.namespace","kubernetes.node.name","kubernetes.replicaset.name","kubernetes.deployment.name","kubernetes.statefulset.name","kubernetes.container.name","kubernetes.container.image","jolokia.agent.version","jolokia.agent.id","jolokia.server.product","jolokia.server.version","jolokia.server.vendor","jolokia.url","monitor.type","monitor.name","monitor.id","monitor.status","monitor.check_group","http.response.body.hash","fields.*"]},"refresh_interval":"5s"}}
[00:05:18]                 └-: snapshot count
[00:05:18]                   └-> "before all" hook
[00:05:19]                   └-: when data is present
[00:05:19]                     └-> "before all" hook
[00:05:42]                     └-: with low cardinality
[00:05:42]                       └-> "before all" hook
[00:05:42]                       └-: with timespans included
[00:05:42]                         └-> "before all" hook
[00:05:42]                         └-> "before all" hook
[00:05:42]                         └-> will count all statuses correctly
[00:05:42]                           └-> "before each" hook: global before each
[00:05:42]                           └- ✖ fail: "apis uptime uptime REST endpoints with generated data snapshot count when data is present with low cardinality with timespans included will count all statuses correctly"
[00:05:42]                           │

Stack Trace

{ Error: expected { total: 3400,
  up: 2000,
  down: 1400,
  method: 'timeslice' } to sort of equal { up: 10, down: 7, total: 17 }
    at Assertion.assert (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:100:11)
    at Assertion.eql (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:244:8)
    at expectFixtureEql (test/api_integration/apis/uptime/graphql/helpers/expect_fixture_eql.ts:36:27)
    at Context.it (test/api_integration/apis/uptime/rest/snapshot.ts:85:17)
  actual:
   '{\n  "down": 1400\n  "method": "timeslice"\n  "total": 3400\n  "up": 2000\n}',
  expected: '{\n  "down": 7\n  "total": 17\n  "up": 10\n}',
  showDiff: true }

Kibana Pipeline / kibana-xpack-agent / X-Pack API Integration Tests.x-pack/test/api_integration/apis/uptime/rest/snapshot·ts.apis uptime uptime REST endpoints with generated data snapshot count when data is present with low cardinality with timespans included will count all statuses correctly

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

[00:00:00]       │
[00:00:00]         └-: apis
[00:00:00]           └-> "before all" hook
[00:05:46]           └-: uptime
[00:05:46]             └-> "before all" hook
[00:05:46]             └-> "before all" hook
[00:06:00]             └-: uptime REST endpoints
[00:06:00]               └-> "before all" hook
[00:06:00]               └-: with generated data
[00:06:00]                 └-> "before all" hook
[00:06:00]                 └-> "before all" hook: load heartbeat data
[00:06:00]                   │ info [uptime/blank] Loading "mappings.json"
[00:06:00]                   │ info [uptime/blank] Loading "data.json"
[00:06:00]                   │ info [o.e.c.m.MetaDataCreateIndexService] [kibana-ci-immutable-debian-tests-xl-1582250529911838265] [heartbeat-8-test] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[00:06:00]                   │ info [uptime/blank] Created index "heartbeat-8-test"
[00:06:00]                   │ debg [uptime/blank] "heartbeat-8-test" settings {"index":{"mapping":{"total_fields":{"limit":"10000"}},"number_of_replicas":"1","number_of_shards":"1","query":{"default_field":["message","tags","agent.ephemeral_id","agent.id","agent.name","agent.type","agent.version","as.organization.name","client.address","client.as.organization.name","client.domain","client.geo.city_name","client.geo.continent_name","client.geo.country_iso_code","client.geo.country_name","client.geo.name","client.geo.region_iso_code","client.geo.region_name","client.mac","client.user.domain","client.user.email","client.user.full_name","client.user.group.id","client.user.group.name","client.user.hash","client.user.id","client.user.name","cloud.account.id","cloud.availability_zone","cloud.instance.id","cloud.instance.name","cloud.machine.type","cloud.provider","cloud.region","container.id","container.image.name","container.image.tag","container.name","container.runtime","destination.address","destination.as.organization.name","destination.domain","destination.geo.city_name","destination.geo.continent_name","destination.geo.country_iso_code","destination.geo.country_name","destination.geo.name","destination.geo.region_iso_code","destination.geo.region_name","destination.mac","destination.user.domain","destination.user.email","destination.user.full_name","destination.user.group.id","destination.user.group.name","destination.user.hash","destination.user.id","destination.user.name","dns.answers.class","dns.answers.data","dns.answers.name","dns.answers.type","dns.header_flags","dns.id","dns.op_code","dns.question.class","dns.question.name","dns.question.registered_domain","dns.question.type","dns.response_code","dns.type","ecs.version","error.code","error.id","error.message","event.action","event.category","event.code","event.dataset","event.hash","event.id","event.kind","event.module","event.original","event.outcome","event.provider","event.timezone","event.type","file.device","file.directory","file.extension","file.gid","file.group","file.hash.md5","file.hash.sha1","file.hash.sha256","file.hash.sha512","file.inode","file.mode","file.name","file.owner","file.path","file.target_path","file.type","file.uid","geo.city_name","geo.continent_name","geo.country_iso_code","geo.country_name","geo.name","geo.region_iso_code","geo.region_name","group.id","group.name","hash.md5","hash.sha1","hash.sha256","hash.sha512","host.architecture","host.geo.city_name","host.geo.continent_name","host.geo.country_iso_code","host.geo.country_name","host.geo.name","host.geo.region_iso_code","host.geo.region_name","host.hostname","host.id","host.mac","host.name","host.os.family","host.os.full","host.os.kernel","host.os.name","host.os.platform","host.os.version","host.type","host.user.domain","host.user.email","host.user.full_name","host.user.group.id","host.user.group.name","host.user.hash","host.user.id","host.user.name","http.request.body.content","http.request.method","http.request.referrer","http.response.body.content","http.version","log.level","log.logger","log.original","network.application","network.community_id","network.direction","network.iana_number","network.name","network.protocol","network.transport","network.type","observer.geo.city_name","observer.geo.continent_name","observer.geo.country_iso_code","observer.geo.country_name","observer.geo.name","observer.geo.region_iso_code","observer.geo.region_name","observer.hostname","observer.mac","observer.os.family","observer.os.full","observer.os.kernel","observer.
os.name","observer.os.platform","observer.os.version","observer.serial_number","observer.type","observer.vendor","observer.version","organization.id","organization.name","os.family","os.full","os.kernel","os.name","os.platform","os.version","process.args","process.executable","process.hash.md5","process.hash.sha1","process.hash.sha256","process.hash.sha512","process.name","process.thread.name","process.title","process.working_directory","server.address","server.as.organization.name","server.domain","server.geo.city_name","server.geo.continent_name","server.geo.country_iso_code","server.geo.country_name","server.geo.name","server.geo.region_iso_code","server.geo.region_name","server.mac","server.user.domain","server.user.email","server.user.full_name","server.user.group.id","server.user.group.name","server.user.hash","server.user.id","server.user.name","service.ephemeral_id","service.id","service.name","service.state","service.type","service.version","source.address","source.as.organization.name","source.domain","source.geo.city_name","source.geo.continent_name","source.geo.country_iso_code","source.geo.country_name","source.geo.name","source.geo.region_iso_code","source.geo.region_name","source.mac","source.user.domain","source.user.email","source.user.full_name","source.user.group.id","source.user.group.name","source.user.hash","source.user.id","source.user.name","tracing.trace.id","tracing.transaction.id","url.domain","url.fragment","url.full","url.original","url.password","url.path","url.query","url.scheme","url.username","user.domain","user.email","user.full_name","user.group.id","user.group.name","user.hash","user.id","user.name","user_agent.device.name","user_agent.name","user_agent.original","user_agent.os.family","user_agent.os.full","user_agent.os.kernel","user_agent.os.name","user_agent.os.platform","user_agent.os.version","user_agent.version","agent.hostname","error.type","timeseries.instance","cloud.project.id","cloud.image.id","host.os.build","host.os.codename","kubernetes.pod.name","kubernetes.pod.uid","kubernetes.namespace","kubernetes.node.name","kubernetes.replicaset.name","kubernetes.deployment.name","kubernetes.statefulset.name","kubernetes.container.name","kubernetes.container.image","jolokia.agent.version","jolokia.agent.id","jolokia.server.product","jolokia.server.version","jolokia.server.vendor","jolokia.url","monitor.type","monitor.name","monitor.id","monitor.status","monitor.check_group","http.response.body.hash","fields.*"]},"refresh_interval":"5s"}}
[00:06:00]                 └-: snapshot count
[00:06:00]                   └-> "before all" hook
[00:06:00]                   └-: when data is present
[00:06:00]                     └-> "before all" hook
[00:06:31]                     └-: with low cardinality
[00:06:31]                       └-> "before all" hook
[00:06:31]                       └-: with timespans included
[00:06:31]                         └-> "before all" hook
[00:06:31]                         └-> "before all" hook
[00:06:31]                         └-> will count all statuses correctly
[00:06:31]                           └-> "before each" hook: global before each
[00:06:31]                           └- ✖ fail: "apis uptime uptime REST endpoints with generated data snapshot count when data is present with low cardinality with timespans included will count all statuses correctly"
[00:06:31]                           │

Stack Trace

{ Error: expected { total: 3400,
  up: 2000,
  down: 1400,
  method: 'timeslice' } to sort of equal { up: 10, down: 7, total: 17 }
    at Assertion.assert (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:100:11)
    at Assertion.eql (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:244:8)
    at expectFixtureEql (test/api_integration/apis/uptime/graphql/helpers/expect_fixture_eql.ts:36:27)
    at Context.it (test/api_integration/apis/uptime/rest/snapshot.ts:85:17)
  actual:
   '{\n  "down": 1400\n  "method": "timeslice"\n  "total": 3400\n  "up": 2000\n}',
  expected: '{\n  "down": 7\n  "total": 17\n  "up": 10\n}',
  showDiff: true }

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@andrewvc
Contributor Author

closing in favor of #58247

@andrewvc closed this Feb 21, 2020
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 21, 2020
When generating test data we refresh excessively; this can fill up the
ES queues and break the tests when we run massive tests. I originally
ran into this with elastic#58078, which I closed after finding a better
approach.

While none of our current tests have the scale to expose this problem,
we certainly will add tests that do later, so we should keep this
change.
andrewvc added a commit that referenced this pull request Feb 24, 2020
Fixes #58079

This is an improved version of #58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.
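
For context, a scripted_metric aggregation in the spirit of that approach could look roughly like the sketch below; the field names (monitor.id, summary.down) and Painless details are assumptions for illustration, not necessarily what the follow-up PR ships.

```ts
// Hedged sketch: count each monitor once across shards using maps of strings.
const snapshotCountAgg = {
  scripted_metric: {
    init_script: 'state.statuses = new HashMap()',
    map_script: `
      // Record a status string per monitor seen on this shard.
      def id = doc['monitor.id'].value;
      state.statuses[id] = doc['summary.down'].value > 0 ? 'down' : 'up';
    `,
    combine_script: 'return state.statuses',
    reduce_script: `
      def merged = new HashMap();
      for (s in states) { merged.putAll(s); }
      def up = 0; def down = 0;
      for (v in merged.values()) { if (v == 'down') { down++; } else { up++; } }
      return ['up': up, 'down': down, 'total': merged.size()];
    `,
  },
};
```

The single reduce pass over per-shard maps of strings keeps memory roughly proportional to the number of distinct monitors, which matches the "simple maps of strings" note above.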
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 24, 2020
Fixes elastic#58079

This is an improved version of elastic#58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.
andrewvc added a commit that referenced this pull request Feb 24, 2020
Fixes #58079

This is an improved version of #58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 24, 2020
…elastic#58389)

Fixes elastic#58079

This is an improved version of elastic#58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.
andrewvc added a commit that referenced this pull request Feb 25, 2020
When generating test data we refresh excessively; this can fill up the
ES queues and break the tests when we run massive tests. I originally
ran into this with #58078, which I closed after finding a better
approach.

While none of our current tests have the scale to expose this problem,
we certainly will add tests that do later, so we should keep this
change.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 25, 2020
…58285)

When generating test data we refresh excessively; this can fill up the
ES queues and break the tests when we run massive tests. I originally
ran into this with elastic#58078, which I closed after finding a better
approach.

While none of our current tests have the scale to expose this problem,
we certainly will add tests that do later, so we should keep this
change.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
andrewvc added a commit that referenced this pull request Feb 25, 2020
…58468)

When generating test data we refresh excessively; this can fill up the
ES queues and break the tests when we run massive tests. I originally
ran into this with #58078, which I closed after finding a better
approach.

While none of our current tests have the scale to expose this problem,
we certainly will add tests that do later, so we should keep this
change.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

elasticmachine added a commit to dhurley14/kibana that referenced this pull request Feb 25, 2020
…elastic#58389) (elastic#58415)

Fixes elastic#58079

This is an improved version of elastic#58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Labels: bug, discuss, release_note:fix, Team:Uptime, v7.6.1
3 participants