Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Alerting] Only load maintenance windows when there are alerts during rule execution and caching loaded maintenance windows #192573

Merged
merged 22 commits into from
Sep 26, 2024

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Sep 11, 2024

Resolves #184324

Summary

This PR moves the loading of maintenance windows further down in rule execution so maintenance windows are only loaded when a rule execution generates alerts. Also caches maintenance windows per space to reduce the number of requests.

To Verify

  1. Add some logging to x-pack/plugins/alerting/server/task_runner/maintenance_windows/maintenance_windows_service.ts to indicate when windows are being fetched and when they're returning from the cache.
  2. Create and run some rules in different spaces with and without alerts to see that the maintenance windows are only loaded when there are alerts and that the windows are returned from the cache when the cache has not expired.

@ymao1 ymao1 force-pushed the alerting/load-maintenance-windows-later branch from 3c4683b to a544490 Compare September 12, 2024 18:57
@ymao1 ymao1 changed the title Moving maintenance window calculations out of process alerts [Response Ops][Alerting] Only load maintenance windows when there are alerts during rule execution. Sep 13, 2024
getRulesClientWithRequest,
kibanaBaseUrl: this.kibanaBaseUrl,
logger,
maintenanceWindowsService: new MaintenanceWindowsService({
cacheInterval: this.config.rulesSettings.cacheInterval,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used just for reducing the cache interval for testing so I reused the setting from the rules settings service.

@ymao1 ymao1 changed the title [Response Ops][Alerting] Only load maintenance windows when there are alerts during rule execution. [Response Ops][Alerting] Only load maintenance windows when there are alerts during rule execution and caching loaded maintenance windows Sep 16, 2024
@ymao1 ymao1 self-assigned this Sep 16, 2024
@ymao1 ymao1 added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v9.0.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) v8.16.0 labels Sep 16, 2024
});

// Only look at maintenance windows for this rule category
const maintenanceWindows = activeMaintenanceWindows.filter(({ categoryIds }) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

active maintenance windows are cached but whether those active maintenance windows apply to this rule type based on category are checked every time this function is called.

const executeEvents = events.filter((event) => event?.event?.action === 'execute');

// the first execute event should not have any maintenance window ids because there were no alerts during the
// first execution
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a change from before where maintenance windows were loaded and set in the event log even if there were no alerts during the rule execution, so I believe this is more correct than before.

@ymao1 ymao1 marked this pull request as ready for review September 17, 2024 00:25
@ymao1 ymao1 requested review from a team as code owners September 17, 2024 00:25
@ymao1 ymao1 requested a review from rylnd September 17, 2024 00:25
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 19, 2024

@elasticmachine merge upstream

@botelastic botelastic bot added ci:project-deploy-observability Create an Observability project Team:obs-ux-management Observability Management User Experience Team labels Sep 19, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 19, 2024

@elasticmachine merge upstream

@PhilippeOberti PhilippeOberti removed the request for review from a team September 19, 2024 21:42
@ymao1
Copy link
Contributor Author

ymao1 commented Sep 23, 2024

@elasticmachine merge upstream

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM

Noted what is either a bug or me being confused in getActiveMaintenanceWindows(), which seems like it would never return any windows since it's searching on a unique date generated in that method.

}
}

public async loadMaintenanceWindows(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debugged the code to make sure we weren't somehow going down here in the maintance window UX itself, because then we'd have a (potential) problem - the cached one could be stale.

It doesn't.

However, perhaps we should work in "cache" on the method name here, somehow, in case someone uses this and doesn't realize it could be stale. loadCacheableMaintenanceWindows() (yikes!)

Actually, maybe easier to swap names w/ loadMaintenanceWindows() and the private getMaintenaceWindows().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched loadMaintenanceWindows and getMaintenanceWindows in 883d909

@@ -301,45 +310,25 @@ export class AlertsClient<
return this.legacyAlertsClient.checkLimitUsage();
}

public processAlerts(opts: ProcessAlertsOpts) {
this.legacyAlertsClient.processAlerts(opts);
public async processAlerts(opts: ProcessAlertsOpts) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete aside, but if the occaisonal event loop delays we seen in rules are caused by processAlerts(), making this async may help - chunks things up a bit, give other node events get a chance to run

const startDateWithCacheOffset = new Date(startDate.getTime() + cacheIntervalMs);
const startDateWithCacheOffsetISO = startDateWithCacheOffset.toISOString();
eventsKuery = nodeBuilder.or([
nodeBuilder.is('maintenance-window.attributes.events', startDateISO),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused by this. We're generating a new date in this method, and then searching for it explicitly in the SO's? I would think that would literally never work.

I see the change from using date ranges (which makes sense) to the is check was done here: #157112

Can we find out what's going on? Either it's a bug, or needs a comment :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused as well but I tested it and it does work. I think because events is mapped as a date_range it just works? cc @JiaweiWu do you know?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooooh! Yup! Thx, it is a date_range and you can do term queries against them. Either forgot that, or TIL

https://www.elastic.co/guide/en/elasticsearch/reference/current/range.html

@@ -308,6 +308,9 @@ export default function createGetAlertSummaryTests({ getService }: FtrProviderCo
true
);

// wait so cache expires
await setTimeoutAsync(10000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we should use a common constant instead of literal. Will be easier to fix later :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added const in 883d909

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 25, 2024

@elasticmachine run docs-build

Copy link
Contributor

@dominiqueclarke dominiqueclarke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obs-ux-management changes LGTM

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 26, 2024

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

kibana-ci commented Sep 26, 2024

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: 3e43f91
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-192573-3e43f919b8c2

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 93414a6 into elastic:main Sep 26, 2024
46 checks passed
@ymao1 ymao1 deleted the alerting/load-maintenance-windows-later branch September 26, 2024 12:59
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.x Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 192573

Questions ?

Please refer to the Backport tool documentation

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 26, 2024

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

ymao1 added a commit to ymao1/kibana that referenced this pull request Sep 26, 2024
… alerts during rule execution and caching loaded maintenance windows (elastic#192573)

Resolves elastic#184324

## Summary

This PR moves the loading of maintenance windows further down in rule
execution so maintenance windows are only loaded when a rule execution
generates alerts. Also caches maintenance windows per space to reduce
the number of requests.

## To Verify

1. Add some logging to
x-pack/plugins/alerting/server/task_runner/maintenance_windows/maintenance_windows_service.ts
to indicate when windows are being fetched and when they're returning
from the cache.
2. Create and run some rules in different spaces with and without alerts
to see that the maintenance windows are only loaded when there are
alerts and that the windows are returned from the cache when the cache
has not expired.

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 93414a6)
ymao1 added a commit that referenced this pull request Sep 26, 2024
…re are alerts during rule execution and caching loaded maintenance windows (#192573) (#194191)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Response Ops][Alerting] Only load maintenance windows when there are
alerts during rule execution and caching loaded maintenance windows
(#192573)](#192573)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ying
Mao","email":"ying.mao@elastic.co"},"sourceCommit":{"committedDate":"2024-09-26T12:59:36Z","message":"[Response
Ops][Alerting] Only load maintenance windows when there are alerts
during rule execution and caching loaded maintenance windows
(#192573)\n\nResolves
#184324
Summary\r\n\r\nThis PR moves the loading of maintenance windows further
down in rule\r\nexecution so maintenance windows are only loaded when a
rule execution\r\ngenerates alerts. Also caches maintenance windows per
space to reduce\r\nthe number of requests.\r\n\r\n## To Verify\r\n\r\n1.
Add some logging
to\r\nx-pack/plugins/alerting/server/task_runner/maintenance_windows/maintenance_windows_service.ts\r\nto
indicate when windows are being fetched and when they're
returning\r\nfrom the cache.\r\n2. Create and run some rules in
different spaces with and without alerts\r\nto see that the maintenance
windows are only loaded when there are\r\nalerts and that the windows
are returned from the cache when the cache\r\nhas not
expired.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>\r\nCo-authored-by:
kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"93414a672c2767b035110fa2d811cc040af57727","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Feature:Alerting","release_note:skip","Team:ResponseOps","v9.0.0","backport:prev-minor","ci:project-deploy-observability","Team:obs-ux-management","v8.16.0"],"number":192573,"url":"#192573
Ops][Alerting] Only load maintenance windows when there are alerts
during rule execution and caching loaded maintenance windows
(#192573)\n\nResolves
#184324
Summary\r\n\r\nThis PR moves the loading of maintenance windows further
down in rule\r\nexecution so maintenance windows are only loaded when a
rule execution\r\ngenerates alerts. Also caches maintenance windows per
space to reduce\r\nthe number of requests.\r\n\r\n## To Verify\r\n\r\n1.
Add some logging
to\r\nx-pack/plugins/alerting/server/task_runner/maintenance_windows/maintenance_windows_service.ts\r\nto
indicate when windows are being fetched and when they're
returning\r\nfrom the cache.\r\n2. Create and run some rules in
different spaces with and without alerts\r\nto see that the maintenance
windows are only loaded when there are\r\nalerts and that the windows
are returned from the cache when the cache\r\nhas not
expired.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>\r\nCo-authored-by:
kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"93414a672c2767b035110fa2d811cc040af57727"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","labelRegex":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"#192573
Ops][Alerting] Only load maintenance windows when there are alerts
during rule execution and caching loaded maintenance windows
(#192573)\n\nResolves
#184324
Summary\r\n\r\nThis PR moves the loading of maintenance windows further
down in rule\r\nexecution so maintenance windows are only loaded when a
rule execution\r\ngenerates alerts. Also caches maintenance windows per
space to reduce\r\nthe number of requests.\r\n\r\n## To Verify\r\n\r\n1.
Add some logging
to\r\nx-pack/plugins/alerting/server/task_runner/maintenance_windows/maintenance_windows_service.ts\r\nto
indicate when windows are being fetched and when they're
returning\r\nfrom the cache.\r\n2. Create and run some rules in
different spaces with and without alerts\r\nto see that the maintenance
windows are only loaded when there are\r\nalerts and that the windows
are returned from the cache when the cache\r\nhas not
expired.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>\r\nCo-authored-by:
kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"93414a672c2767b035110fa2d811cc040af57727"}},{"branch":"8.x","label":"v8.16.0","labelRegex":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) ci:project-deploy-observability Create an Observability project Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:obs-ux-management Observability Management User Experience Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache and load maintenance windows only when they are needed
7 participants