chore: add alert metric for Prometheus and add Grafana alert docs [RM-118] #9150

NicholasBlaskey · 2024-04-11T15:55:44Z

Ticket

Description

Add a new Prometheus metric that reports whether Determined is healthy.

Add docs explaining how to use the metric to set up alerts in Grafana.

Test Plan

Set up Grafana and Prometheus and follow the alert docs. Then, to test the failing case, rename the cluster_id table:

ALTER TABLE cluster_id RENAME TO cluster_id2;
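
While running the test plan, it can help to check the metric directly instead of waiting for the Grafana alert to change state. The following is a minimal sketch, not part of this PR, that reads a gauge value out of text in the Prometheus exposition format. The helper name `read_gauge` and the sample payload (including its HELP text) are hypothetical. How the text is fetched from the master is left out on purpose, because the scrape endpoint is not stated here.

```python
def read_gauge(exposition_text, metric_name):
    """Return the first sample value for metric_name, or None if it is absent.

    Handles the common `name value` and `name{labels} value [timestamp]`
    sample lines of the Prometheus text exposition format.
    """
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Label values may contain spaces, so split around the braces.
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name.strip() == metric_name and fields:
            return float(fields[0])
    return None


# Hypothetical payload; the HELP text is illustrative, not copied from the master.
SAMPLE = """\
# HELP determined_healthy illustrative help text
# TYPE determined_healthy gauge
determined_healthy 0
"""

if __name__ == "__main__":
    print(read_gauge(SAMPLE, "determined_healthy"))
```

After the table rename above, the expectation is that the value is either 0 or missing (`None` here) if the scrape itself fails, and that it goes back to 1 once the table name is restored.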

Checklist

Changes have been manually QA'd
User-facing API changes need the "User-facing API Change" label.
Release notes should be added as a separate file under docs/release-notes/.
See Release Note for details.
Licenses should be included for new code which was copied and/or modified from any external code.

netlify · 2024-04-11T15:56:00Z

✅ Deploy Preview for determined-ui canceled.

Name	Link
🔨 Latest commit	`2041c8e`
🔍 Latest deploy log	https://app.netlify.com/sites/determined-ui/deploys/661840477aba5a0009be234f

codecov · 2024-04-11T15:56:26Z

Codecov Report

Attention: Patch coverage is 0%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 46.19%. Comparing base (3f7a396) to head (2041c8e).
Report is 6 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9150      +/-   ##
==========================================
- Coverage   46.22%   46.19%   -0.03%     
==========================================
  Files        1175     1175              
  Lines      145341   145360      +19     
  Branches     2414     2414              
==========================================
- Hits        67180    67148      -32     
- Misses      77952    78003      +51     
  Partials      209      209

Flag	Coverage Δ
backend	`43.68% <0.00%> (-0.10%)`	⬇️
harness	`63.99% <ø> (ø)`
web	`36.77% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
master/internal/core.go	`4.23% <0.00%> (-0.10%)`	⬇️

... and 7 files with indirect coverage changes

kkunapuli

This is awesome! I especially like how the docs explain the conditions that cause Determined to report "unhealthy".

tara-det-ai · 2024-04-11T16:43:09Z

docs/integrations/prometheus/_index.rst

+``Alert state if no data or all values are null`` to ``Alerting``.
+
+Further information can be found about using Grafana alerts on the `Grafana docs
+<https://grafana.com/docs/grafana/latest/alerting/>`__.


all suggestions including typo fixes

Alerts

The ``det-master-api-server`` provides a metric, ``determined_healthy``, that can be used to set up
alerts. This metric will return 1 when Determined can access its major dependencies and 0 when it
cannot. On Kubernetes, inability to access the Kubernetes API server will cause this metric to
return 0. On Slurm, failure to access the launcher will also cause this metric to return 0. If the
database is down, it is possible Prometheus will be unable to scrape this metric.

To create an alert in Grafana, navigate to the Alert Rules page and use the
Prometheus data source configured earlier. You can use the following query to set up the alert.

1 - determined_healthy{job="det-master-api-server"}

.. image:: /assets/images/grafana-alert-config.png
   :width: 704px
   :align: center
   :alt: Grafana Alert Configuration

Since Prometheus may be unable to scrape Determined under certain circumstances, it is recommended
to set ``Alert state if no data or all values are null`` to ``Alerting``.

For more information on using Grafana alerts, visit the `Grafana documentation
<https://grafana.com/docs/grafana/latest/alerting/>`__.
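
To make the intent of the suggested text concrete, here is a rough sketch of the decision the alert rule encodes. It is not Grafana's implementation. The function name `alert_state`, the "greater than 0" threshold on the query result, and the state labels are assumptions made for illustration; the real condition is whatever is configured in the Alert Rules page.

```python
def alert_state(healthy_sample, no_data_state="Alerting"):
    """Approximate the alert outcome for the query `1 - determined_healthy`.

    healthy_sample is the scraped value of determined_healthy (1 or 0),
    or None when Prometheus could not scrape the metric at all.
    """
    if healthy_sample is None:
        # Mirrors the "Alert state if no data or all values are null" setting.
        return no_data_state
    # Assumed condition: fire when the query result is above 0.
    query_result = 1 - healthy_sample
    return "Alerting" if query_result > 0 else "Normal"


if __name__ == "__main__":
    for sample in (1, 0, None):
        print(sample, "->", alert_state(sample))
```

The `None` branch is why the no-data setting matters: if the scrape fails, for example when the database is down, there is no 0 sample to trigger the rule, so the alert would stay quiet unless missing data is treated as alerting.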

sorry i forgot about the toctree

tara-det-ai · 2024-04-11T16:44:00Z

docs/integrations/prometheus/_index.rst

+********
+
+The ``det-master-api-server`` provides a metric ``deterimed_healthy`` that can be used to set up
+alerts. This metric will return 1 when Determined can access its major dependencies and 0 when it is


chore: add alert metric for Prometheus and add Grafana alert docs

91083c6

cla-bot bot added the cla-signed label Apr 11, 2024

add docs

3f240b7

determined-ci added the documentation Improvements or additions to documentation label Apr 11, 2024

determined-ci requested a review from a team April 11, 2024 16:18

NicholasBlaskey requested a review from kkunapuli April 11, 2024 16:24

NicholasBlaskey marked this pull request as ready for review April 11, 2024 16:24

NicholasBlaskey requested a review from a team as a code owner April 11, 2024 16:24

kkunapuli approved these changes Apr 11, 2024


NicholasBlaskey requested a review from tara-det-ai April 11, 2024 16:38

tara-det-ai reviewed Apr 11, 2024


Feedback

2041c8e

determined-ci requested a review from a team April 11, 2024 19:55

NicholasBlaskey merged commit 746ba26 into main Apr 11, 2024
75 of 89 checks passed

NicholasBlaskey deleted the grafana_alerts branch April 11, 2024 20:29
