chore: add alert metric for Prometheus and add Grafana alert docs [RM-118] #9150
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9150 +/- ##
==========================================
- Coverage 46.22% 46.19% -0.03%
==========================================
Files 1175 1175
Lines 145341 145360 +19
Branches 2414 2414
==========================================
- Hits 67180 67148 -32
- Misses 77952 78003 +51
Partials 209 209
This is awesome! I especially like how the docs explain the conditions that cause Determined to report "unhealthy".
``Alert state if no data or all values are null`` to ``Alerting``.

Further information can be found about using Grafana alerts on the `Grafana docs
<https://grafana.com/docs/grafana/latest/alerting/>`__.
all suggestions including typo fixes
Alerts

The ``det-master-api-server`` provides a metric, ``determined_healthy``, that can be used to set up
alerts. This metric will return ``1`` when Determined can access its major dependencies and ``0``
when it cannot. On Kubernetes, inability to access the Kubernetes API server will cause this metric
to return ``0``. On Slurm, failure to access the launcher will also cause this metric to return
``0``. If the database is down, it is possible Prometheus will be unable to scrape this metric.

To create an alert in Grafana, navigate to the Alert Rules page and use the
Prometheus data source configured earlier. You can use the following query to set up the alert.

1 - determined_healthy{job="det-master-api-server"}

.. image:: /assets/images/grafana-alert-config.png
   :width: 704px
   :align: center
   :alt: Grafana Alert Configuration

Since Prometheus may be unable to scrape Determined under certain circumstances, it is recommended to set
``Alert state if no data or all values are null`` to ``Alerting``.

For more information on using Grafana alerts, visit the `Grafana documentation
<https://grafana.com/docs/grafana/latest/alerting/>`__.
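
For reviewers who want to sanity-check the alert semantics outside Grafana, here is a minimal Python sketch of the decision the docs describe: the query `1 - determined_healthy` is non-zero when Determined is unhealthy, and missing data is treated as alerting. The function names `parse_metric` and `should_alert` are hypothetical, and the `> 0` threshold is an assumption, since the actual condition only appears in the screenshot.

```python
from typing import Optional


def parse_metric(exposition: str, name: str) -> Optional[float]:
    """Return the first sample value for `name` in Prometheus text format, or None."""
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Sample with labels: name{label="value"} value [timestamp]
            head, _, tail = line.rpartition("}")
            metric_name = head.split("{", 1)[0]
        else:
            # Sample without labels: name value [timestamp]
            metric_name, _, tail = line.partition(" ")
        if metric_name != name:
            continue
        fields = tail.split()
        if fields:
            return float(fields[0])
    return None


def should_alert(exposition: Optional[str]) -> bool:
    """Mirror `1 - determined_healthy`, treating no data as alerting."""
    if exposition is None:
        return True  # scrape failed, e.g. the master could not respond
    value = parse_metric(exposition, "determined_healthy")
    if value is None:
        return True  # metric absent: same as "no data" in Grafana
    return (1 - value) > 0  # assumed threshold, not taken from the PR


if __name__ == "__main__":
    print(should_alert("determined_healthy 1\n"))
    print(should_alert("determined_healthy 0\n"))
    print(should_alert(None))
```

The `None` branch corresponds to setting ``Alert state if no data or all values are null`` to ``Alerting``: a failed scrape is reported the same way as an explicit `0`.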
sorry i forgot about the toctree
********

The ``det-master-api-server`` provides a metric ``deterimed_healthy`` that can be used to set up
alerts. This metric will return 1 when Determined can access its major dependencies and 0 when it is
determined
Ticket
Description
Add a new Prometheus metric that reports whether Determined is healthy.
Add docs on how to use the metric to set up an alert in Grafana.
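
As a rough illustration of what such a metric looks like on the wire, the sketch below renders a `determined_healthy` gauge in the Prometheus text exposition format. It is not the implementation in this PR: the function name `render_health_metric`, the dependency names, and the HELP text are all hypothetical.

```python
from typing import Mapping


def render_health_metric(dependency_ok: Mapping[str, bool]) -> str:
    """Render a 0/1 health gauge in Prometheus text exposition format."""
    # Healthy only if every tracked dependency is reachable.
    healthy = 1 if all(dependency_ok.values()) else 0
    return "\n".join(
        [
            # HELP text below is illustrative, not copied from the PR.
            "# HELP determined_healthy 1 if major dependencies are reachable, 0 otherwise.",
            "# TYPE determined_healthy gauge",
            f"determined_healthy {healthy}",
            "",  # exposition format ends with a trailing newline
        ]
    )


if __name__ == "__main__":
    # Hypothetical dependency names, for illustration only.
    print(render_health_metric({"database": True, "kubernetes_api": False}))
```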
Test Plan
Set up Grafana and Prometheus and follow the alert docs. Then, to test that the alert fires, rename the cluster_id table.
Checklist
docs/release-notes/. See Release Note for details.