chore: add alert metric for Prometheus and add Grafana alert docs [RM-118] #9150

Merged
merged 3 commits into from
Apr 11, 2024

@NicholasBlaskey NicholasBlaskey commented Apr 11, 2024

Ticket

Description

Add a new Prometheus metric that reports whether Determined is healthy.

Add docs on how to use the metric to set up an alert in Grafana.

Test Plan

Set up Grafana and Prometheus and follow the alert docs. Then, to test that the alert fires, rename the cluster_id table:

ALTER TABLE cluster_id RENAME TO cluster_id2;
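
One way to confirm that the rename flips the metric, independently of Grafana, is to read `determined_healthy` from a scrape of the master's metrics. The sketch below only parses text in the Prometheus exposition format. `read_gauge` is a hypothetical helper and not part of Determined, and fetching the text is left out because the scrape URL is not stated in this PR.

```python
from typing import Optional


def read_gauge(exposition: str, metric: str) -> Optional[float]:
    """Return the first sample value for `metric`, or None if it is absent.

    Hypothetical helper: handles `name value` and `name{labels} value` lines
    and ignores comments (# HELP / # TYPE) and optional timestamps.
    """
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line and "}" in line:
            # Labels may contain spaces, so split around the braces.
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            name, _, rest = line.partition(" ")
        if name != metric:
            continue
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


# Made-up scrape output for illustration only.
sample = "# TYPE determined_healthy gauge\ndetermined_healthy 0\n"
print(read_gauge(sample, "determined_healthy"))  # prints 0.0
```

A return value of `None` corresponds to the case where the metric could not be scraped at all, which matters for the no-data setting discussed in the docs.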

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Apr 11, 2024

netlify bot commented Apr 11, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 2041c8e
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/661840477aba5a0009be234f


codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 0%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 46.19%. Comparing base (3f7a396) to head (2041c8e).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9150      +/-   ##
==========================================
- Coverage   46.22%   46.19%   -0.03%     
==========================================
  Files        1175     1175              
  Lines      145341   145360      +19     
  Branches     2414     2414              
==========================================
- Hits        67180    67148      -32     
- Misses      77952    78003      +51     
  Partials      209      209              
Flag Coverage Δ
backend 43.68% <0.00%> (-0.10%) ⬇️
harness 63.99% <ø> (ø)
web 36.77% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/core.go 4.23% <0.00%> (-0.10%) ⬇️

... and 7 files with indirect coverage changes

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Apr 11, 2024
@determined-ci determined-ci requested a review from a team April 11, 2024 16:18
@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review April 11, 2024 16:24
@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner April 11, 2024 16:24

@kkunapuli kkunapuli left a comment

This is awesome! I especially like how the docs explain the conditions that cause Determined to report "unhealthy".

``Alert state if no data or all values are null`` to ``Alerting``.

Further information can be found about using Grafana alerts on the `Grafana docs
<https://grafana.com/docs/grafana/latest/alerting/>`__.

@tara-det-ai tara-det-ai Apr 11, 2024

all suggestions including typo fixes


Alerts


The ``det-master-api-server`` provides a metric, ``determined_healthy``, that can be used to set up
alerts. This metric returns 1 when Determined can access its major dependencies and 0 when it
cannot. On Kubernetes, inability to access the Kubernetes API server causes this metric to return 0.
On Slurm, failure to access the launcher also causes this metric to return 0. If the database is
down, Prometheus may be unable to scrape this metric at all.
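
The 1-or-0 behavior described above can be pictured as an "all dependencies reachable" gauge. The following is a minimal sketch under that reading, not the implementation in this PR (which lives in the Go master). `health_value`, `render_metric`, and the check names are hypothetical, and the HELP text is made up for illustration.

```python
from typing import Callable, Dict


def health_value(checks: Dict[str, Callable[[], bool]]) -> int:
    """Return 1 only if every dependency check succeeds, otherwise 0.

    A check that raises is treated the same as a check that returns False.
    """
    for check in checks.values():
        try:
            if not check():
                return 0
        except Exception:
            return 0
    return 1


def render_metric(value: int) -> str:
    """Render the gauge in the Prometheus text exposition format."""
    return (
        "# HELP determined_healthy 1 if major dependencies are reachable, else 0 (illustrative text)\n"
        "# TYPE determined_healthy gauge\n"
        f"determined_healthy {value}\n"
    )


# Hypothetical checks; a real deployment would probe its own resource manager.
checks = {
    "resource_manager": lambda: True,
    "other_dependency": lambda: False,
}
print(render_metric(health_value(checks)), end="")
```

Because a single failing check is enough to drive the gauge to 0, the metric says that something is wrong but not which dependency failed.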

To create an alert in Grafana, navigate to the Alert Rules page and use the
Prometheus data source configured earlier. You can use the following query to set up the alert.

1 - determined_healthy{job="det-master-api-server"}

.. image:: /assets/images/grafana-alert-config.png
   :width: 704px
   :align: center
   :alt: Grafana Alert Configuration

Since Prometheus may be unable to scrape Determined under certain circumstances, it is recommended
to set ``Alert state if no data or all values are null`` to ``Alerting``.

For more information on using Grafana alerts, visit the `Grafana documentation
<https://grafana.com/docs/grafana/latest/alerting/>`__.
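
To see why the query is written as ``1 - determined_healthy`` and why the no-data setting matters, the sketch below models the resulting alert state in plain Python. It is an illustration only: `alert_state` is a hypothetical function, the "above 0" threshold is an assumption about how the rule condition is configured, and `None` stands in for a scrape that returned no sample.

```python
from typing import Optional


def alert_state(healthy_sample: Optional[float], no_data_state: str = "Alerting") -> str:
    """Model the alert outcome for the query `1 - determined_healthy`.

    healthy_sample: latest value of determined_healthy, or None if the
        metric could not be scraped.
    no_data_state: state to report when there is no sample, mirroring the
        "Alert state if no data or all values are null" setting.
    """
    if healthy_sample is None:
        return no_data_state
    query_value = 1 - healthy_sample  # 0 when healthy, 1 when unhealthy
    # Assumed rule condition: fire when the query result is above 0.
    return "Alerting" if query_value > 0 else "Normal"


for sample in (1.0, 0.0, None):
    print(sample, "->", alert_state(sample))
```

With the no-data state left at a non-alerting value, the third case (metric missing entirely, for example when the database is down) would pass silently, which is the situation the recommendation is meant to avoid.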

sorry i forgot about the toctree

********

The ``det-master-api-server`` provides a metric ``deterimed_healthy`` that can be used to set up
alerts. This metric will return 1 when Determined can access its major dependencies and 0 when it is

determined

@determined-ci determined-ci requested a review from a team April 11, 2024 19:55
@NicholasBlaskey NicholasBlaskey merged commit 746ba26 into main Apr 11, 2024
75 of 89 checks passed
@NicholasBlaskey NicholasBlaskey deleted the grafana_alerts branch April 11, 2024 20:29
Labels
cla-signed documentation Improvements or additions to documentation
4 participants