Skip to content

Commit

Permalink
docs: cluster observability documentation and dashboard improvements (#…
Browse files Browse the repository at this point in the history
…9391)

Co-authored-by: Nick Blaskey <nick.blaskey@hpe.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
  • Loading branch information
3 people authored May 21, 2024
1 parent c3b3ae6 commit b84ee1f
Show file tree
Hide file tree
Showing 14 changed files with 4,402 additions and 7 deletions.
Binary file added docs/assets/images/resource-util-dash-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/images/resource-util-dash-2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/images/resource-util-dash-3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/images/resource-util-dash-4.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
340 changes: 340 additions & 0 deletions docs/integrations/observability/_index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,340 @@
.. _kubernetes-observability:

##########################
Kubernetes Observability
##########################

This guide provides recommendations and assists in setting up monitoring for a Determined
installation on Kubernetes.

***************
Prerequisites
***************

- Determined must be running within a Kubernetes cluster.
- The Helm value ``observability.enable_prometheus`` must be set to ``true`` (this is the default
setting).
- :ref:`CLI <cli-ug>` must be installed.
- ``Kubectl`` must be installed and configured appropriately.

*********************
Configuration Steps
*********************

Create a Namespace
==================

- Run the following command to create a namespace called ``det-monitoring``:

.. code:: bash
kubectl create ns det-monitoring
Change Directory
================

- Clone the repository and navigate to the ``tools/observability`` directory:

.. code:: bash
git clone https://github.com/determined-ai/determined.git && \
cd determined/tools/observability
Token Refresh
=============

The Determined Prometheus export endpoint is secured by authentication, requiring a refreshable
authentication token. Set up a cron job for token refresh as follows:

#. Create an account.

.. code:: bash
det -u admin user create tokenrefresher
2. Change the account password.

.. code:: bash
det -u admin user change-password tokenrefresher
3. Store the credentials.

.. code:: bash
kubectl -n det-monitoring create secret generic token-refresh-username-pass \
--from-literal="creds=tokenrefresher:testPassword1"
4. Deploy a cron job.

.. attention::

``tokenRefresher.yaml`` may not work work in every Kuberenetes setup. If you have a unique use
case, it might need modification. Hardcoding the Determined master IP can reduce assumptions made
by the script.

.. code:: bash
kubectl -n det-monitoring apply -f tokenRefresher.yaml
5. Verify "secret" creation. After a few minutes, check that the ``det-prom-token`` secret was
created and ``det-token`` is more than 0 bytes.

.. code:: bash
kubectl -n det-monitoring describe secret det-prom-token
Install DCGM Exporter
=====================

Depending on your environment, follow these steps for installing the DCGM exporter:

Steps for General Cloud Environments
------------------------------------

In general, to install DCGM in a cloud-based environment, follow the documentation for that
environment.

If you are not following the steps described here for GKE as a reference, you may need to change the
``additionalScrapeConfigs`` in the ``grafana-prom-values.yaml``.

If you are deploying on-prem, visit `Nvidia docs on installing the DCGM exporter
<https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html#setting-up-dcgm>`__.

Steps for GKE
-------------

#. Create a namespace for the exporter.

.. code:: bash
kubectl create ns gmp-public
2. Apply the exporter from `the GKE docs
<https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/nvidia-dcgm#install-exporter>`__.

.. code:: bash
kubectl apply -n gmp-public -f https://github.com/raw/GoogleCloudPlatform/prometheus-engine/main/examples/nvidia-dcgm/exporter.yaml
3. Create a service for the DCGM exporter.

.. code:: bash
kubectl apply -n gmp-public -f gkeDCGMExporterService.yaml
This differs from the GKE documentation because we deploy a Prometheus installation instead of using
Google Cloud's managed service. While it is still possible to use Google Cloud's managed service,
some features, such as GPU statistics by user, will not be available.

4. Verify the DCGM exporter is functioning by port forwarding the service and checking metrics.

.. code:: bash
kubectl -n gmp-public port-forward service/nvidia-dcgm-exporter 9400
5. In a new console window, verify the service.

.. code:: bash
curl 127.0.0.1:9400/metrics
Install Kube Prometheus Stack
=============================

Follow these instructions for installing a Kube Prometheus Stack. For additional information, you
can visit the `Kube Prometheus stack documentation
<https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack>`__.

#. Add the Helm repo and update.

.. code:: bash
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts && \
helm repo update
2. Install the Kube Prometheus Stack. You'll need to change the password in the following command.

.. code:: bash
helm -n det-monitoring install monitor prometheus-community/kube-prometheus-stack \
--set grafana.adminPassword=testPassword \
--values grafana-prom-values.yaml
Set Up a Monitoring Dashboard
=============================

This section guides you through the process of setting up monitoring dashboards for your Determined
installation on Kubernetes.

- Add an API monitoring dashboard.

.. code:: bash
kubectl -n det-monitoring create configmap det-api-dash --from-file api-dash.json && \
kubectl -n det-monitoring label configmap det-api-dash grafana_dashboard=1
- Add a resource utilization dashboard.

.. code:: bash
kubectl -n det-monitoring create configmap det-resource-utilization-dash --from-file resource-utilization-dash.json && \
kubectl -n det-monitoring label configmap det-resource-utilization-dash grafana_dashboard=1
- Check Prometheus operation by port forwarding.

.. code:: bash
kubectl -n det-monitoring port-forward service/monitor-kube-prometheus-st-prometheus 9090:9090
- Verify metric scraping.

- Go to `127.0.0.1:9090 <http://127.0.0.1:9090>`__ and check that the query has two or more
results with a ``1`` value.

.. code:: bash
up{job=~"det-master-api-server|gpu-metrics"}
- Access Grafana to view the dashboards.

.. code:: bash
kubectl -n det-monitoring port-forward svc/monitor-grafana 9000:80
- Navigate to `127.0.0.1:9000 <http://127.0.0.1:9000>`__. Sign in with the username ``admin`` and
the password you set above. You should see the ``Determined API Server Monitoring`` dashboard.

Dashboard Example
=================

After submitting experiments on the cluster, you should see populated panels in the imported Grafana
dashboard: **Grafana** -> **Dashboards**.

.. figure:: /assets/images/resource-util-dash-1.png
:alt: Resource Utilization Dashboard Headlines

Resource Utilization Dashboard Headlines

.. figure:: /assets/images/resource-util-dash-2.png
:alt: Resource Utilization Dashboard Cluster Overview

Resource Utilization Dashboard Cluster Overview

.. figure:: /assets/images/resource-util-dash-3.png
:alt: Resource Utilization Dashboard GPU Breakdown

Resource Utilization Dashboard GPU Breakdown

.. figure:: /assets/images/resource-util-dash-4.png
:alt: Resource Utilization Dashboard Recent Tasks

Resource Utilization Dashboard Recent Tasks

Each panel in the dashboard is powered by one or more Prometheus queries.

*********
Metrics
*********

Determined does not generate its own metrics; instead, it utilizes existing tools to report
information.

API Performance Metrics
=======================

Determined master reports API performance metrics using `grpc ecosystem
<https://github.com/grpc-ecosystem/go-grpc-prometheus?tab=readme-ov-file#metrics>`__.

Kubernetes and Container Metrics
================================

The kube-prometheus-stack enables ``kube-state-metrics`` and ``cAdvisor`` by default.

- `kube-state-metrics
<https://github.com/kubernetes/kube-state-metrics/tree/main/docs#exposed-metrics>`__ reports the
state of kubernetes objects, including those created by Determined.

- `cAdvisor <https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md>`__ reports
the resource usage and performance of running containers, including metrics such as memory and
CPU usage.

Nvidia DCGM Exporter
====================

Nvidia's Data Center GPU Manager (DCGM) collects data on Nvidia GPUs.

- By default, only the most useful `subset
<https://github.com/GoogleCloudPlatform/prometheus-engine/blob/8dd8a187486cccb5ede3132e5773ae786239dbc2/examples/nvidia-dcgm/exporter.yaml#L139-L169>`__
of metrics are scraped by Prometheus.

- The full list of metrics generated by DCGM exporter can be found `here
<https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv>`__.

Health Status
=============

Determined master reports a metric, ``determined_healthy``, with value of ``1`` when major
dependencies are reachable, and ``0`` otherwise. Visit :ref:_prometheus-grafana-alerts for
information on how to set up alerts.

Viewing Metrics
===============

The Determined Master assigns specific state values to the pods it creates. These pod labels can be
accessed via the ``kube_pod_labels`` metric from "kube-state-metrics". Label names are formatted as
``label_determined_ai_<label_name>``, such as ``label_determined_ai_container_id``.

Kubernetes restricts pod labels to alphanumeric characters, underscores, hyphens, and dots. Any
other characters in Determined resource names will be converted underscores ``(_)`` before being
added as a pod label. Names longer than 63 characters will be truncated.

+-----------------------------+---------------------------------------------------+
| Label Key | Label Value |
+=============================+===================================================+
| determined.ai/container_id | |
+-----------------------------+---------------------------------------------------+
| determined.ai/experiment_id | ``task_type=TRIAL`` only |
+-----------------------------+---------------------------------------------------+
| determined.ai/resource_pool | name of the resource pool, including ``default`` |
+-----------------------------+---------------------------------------------------+
| determined.ai/task_id | |
+-----------------------------+---------------------------------------------------+
| determined.ai/task_type | Determined task type, e.g. ``TRIAL``, |
| | ``NOTEBOOK``, ``TENSORBOARD`` |
+-----------------------------+---------------------------------------------------+
| determined.ai/trial_id | ``task_type=TRIAL`` only |
+-----------------------------+---------------------------------------------------+
| determined.ai/user | Determined username that initiated the request |
+-----------------------------+---------------------------------------------------+
| determined.ai/workspace | name of the workspace, including |
| | ``Uncategorized`` |
+-----------------------------+---------------------------------------------------+

PromQL Example Query
====================

Kubernetes resource metrics and GPU metrics can be broken down by Determined resources by joining
data metrics with the ``kube_pod_labels`` state metric. As an example, the following PromQL query
computes the average GPU Utilization by Determined experiment ID.

.. code:: bash
avg by (label_determined_ai_experiment_id)(
DCGM_FI_DEV_GPU_UTIL * on(pod) group_left(label_determined_ai_experiment_id)
kube_pod_labels{label_determined_ai_experiment_id!=""}
)
Additional Resources
====================

For more details on metric operations:

- Learn about `joining metrics
<https://github.com/kubernetes/kube-state-metrics/tree/main/docs#join-metrics>`__ from
kube-state-metrics.

- Discover how to perform `vector matching
<https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching>`__ in
Prometheus queries.
19 changes: 12 additions & 7 deletions docs/integrations/prometheus/_index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,13 @@
| Determined 0.17.6+ |
+--------------------+

Discover how to enable a Grafana dashboard to monitor Determined hardware and system metrics on a
cloud cluster, such as AWS or Kubernetes. Determined provides a Prometheus endpoint that contains
mappings between internal task, GPU, and container definitions, which are used by Prometheus to
collect relevant metrics on a cluster running Determined. The endpoint is not enabled by default but
can be enabled in the master configuration file.
Discover how to enable a Grafana dashboard to monitor Determined hardware and system metrics.
Determined provides a Prometheus endpoint that contains mappings between internal task, GPU, and
container definitions, which are used by Prometheus to collect relevant metrics on a cluster running
Determined. The endpoint is enabled by default but can be disabled in the master configuration file.

Visit :ref:_kubernetes-observability for instructions on enabling observability in Kubernetes
environments.

***********
Reference
Expand Down Expand Up @@ -60,8 +62,9 @@ other tools can differ, depending on the format and organization of the returned
Configure Determined
**********************

Install and run Determined on a cluster. When launching the master instance, enable the Prometheus
endpoints by adding a flag to the ``master.yaml`` configuration file:
Install and run Determined on a cluster. When launching the master instance, the Prometheus
endpoints are enabled by default in versions 0.32.0 and later. To manually enable the Prometheus
endpoints, add a flag to the ``master.yaml`` configuration file:

.. code:: yaml
Expand Down Expand Up @@ -175,6 +178,8 @@ Each panel in the dashboard is powered by one or more Prometheus queries and tra
metric on the cluster as a percentage of total capacity. Results can be further filtered using
``tags`` and ``resource pool`` and time range in Grafana.

.. _prometheus-grafana-alerts:

********
Alerts
********
Expand Down
10 changes: 10 additions & 0 deletions docs/release-notes/observability.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
:orphan:

**Improvements**

Kubernetes: Add Determined resource information such as `workspace` and `task ID` as pod labels.
This improvement facilitates better resource tracking and management within Kubernetes environments.

Configuration: Introduce a DCGM Helm chart and Prometheus configuration to the `tools/observability`
directory. Additionally, two new dashboards, "API Monitoring" and "Resource Utilization", have been
added to improve observability and operational insight.
1 change: 1 addition & 0 deletions helm/charts/determined/templates/master-service.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ metadata:
namespace: {{ .Release.Namespace }}
labels:
app: determined-master-{{ .Release.Name }}
determined.ai/master-service: "true"
release: {{ .Release.Name }}
spec:
ports:
Expand Down
Loading

0 comments on commit b84ee1f

Please sign in to comment.