Helm charts for AIS Cluster Monitoring

Getting started

  1. Install and configure helm and helmfile (including configuring kubectl context for your cluster).
  2. If using local storage for persistence, set up a storage class on your cluster that can handle dynamic persistent volumes. We use Rancher's local-path-provisioner by default.
  3. If using affinity, label the nodes that should run the monitoring pods: kubectl label node/your-node 'aistore.nvidia.com/role_monitoring=true'.
  4. Create an environment for your deployment based on the values files for either the default (everything) or the external deployment (no Grafana/Loki).
  5. Update the values for your deployment environment.
  6. Export any required environment variables (e.g. if bundling Grafana, export GRAFANA_PASSWORD=<password>).
  7. Run helmfile sync or helmfile --environment <your-env> sync.
  8. Access Grafana from an external machine.

With the proper values configured, all tools should automatically sync and provide data in the Grafana dashboards.

Most chart values are set in the source charts or in the values.yaml.gotmpl in each chart's directory. To configure a specific deployment, create an environment file and replace default.yaml in the helmfile or create a new environment.
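
For example, a new environment can be declared alongside the default one in the helmfile's environments block. The environment name and values file path below are placeholders; adjust them to your repository layout:

```yaml
environments:
  # Default environment, reading the stock values file
  default:
    values:
      - default.yaml
  # Hypothetical custom environment with its own values file
  my-cluster:
    values:
      - environments/my-cluster.yaml
```

You would then deploy it with helmfile --environment my-cluster sync.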

Environment variables

  • Grafana admin user login
    • export GRAFANA_PASSWORD=<password>

Security context

For setting the securityContext, specify details of a non-root user (typically UID >= 1000). To identify existing non-root users, use the following command:

awk -F: '$3 >= 1000 {print $1}' /etc/passwd

You can either use one of these existing non-root users or create a new one. To obtain a user's UID and group ID (GID), run:

id [username]

Then, update your deployment environment file with the user's UID and GID by setting the runAsUser, runAsGroup, and fsGroup fields under securityContext.
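
As a minimal sketch, assuming a user with UID and GID 1000 (placeholders; substitute the values returned by id), the entry in the environment file could look like:

```yaml
securityContext:
  runAsUser: 1000   # UID of the non-root user (placeholder)
  runAsGroup: 1000  # GID of that user's group (placeholder)
  fsGroup: 1000     # group ownership applied to mounted volumes (placeholder)
```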

Alerting

AlertManager supports various receivers, which you can configure as needed. We include a Slack alert in our config file in kube-prom/alertmanager_config, but more can be added. Refer to the Prometheus Alerting Configuration documentation for details on each receiver's config.
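
As a rough illustration of the receiver format (the webhook URL and channel below are placeholders, and the keys in kube-prom/alertmanager_config may be organized differently), a Slack receiver in an Alertmanager configuration generally looks like:

```yaml
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook URL
        channel: '#ais-alerts'                                  # placeholder channel
        send_resolved: true
```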

AIS Metrics

To monitor AIS, create PodMonitor definitions.

An AIS PodMonitor definition is provided in ais_podmonitors.yaml, which is automatically applied after syncing the kube-prometheus chart.

If using HTTPS for AIS, be sure to update the PodMonitor definition with the appropriate configs for scheme and TLS (an example is provided in the definition).

When applied, the monitors configure Prometheus to scrape metrics from AIStore's proxy and target pods individually every 30 seconds.
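
The definition shipped in ais_podmonitors.yaml is the source of truth; the sketch below only illustrates the general shape of a PodMonitor for AIS target pods (the name, namespaces, labels, and port are assumptions and will differ from the real file):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ais-target            # hypothetical name
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - ais                   # namespace running AIStore (assumption)
  selector:
    matchLabels:
      app.kubernetes.io/component: target   # assumed pod label
  podMetricsEndpoints:
    - port: metrics           # assumed metrics port name on the pod
      interval: 30s
      # For HTTPS deployments, also set the scheme and TLS options, e.g.:
      # scheme: https
      # tlsConfig:
      #   insecureSkipVerify: true
```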

Accessing internal services (Prometheus, Grafana)

The web services for Prometheus and Grafana are not directly accessible from outside the cluster. Options include changing the service type to NodePort or using port-forwarding. Use kube-prometheus-stack-prometheus for the Prometheus service and kube-prometheus-stack-grafana for Grafana. Below are instructions for Grafana.

  1. Configure access from the host into the pod using ONE of the following:
    1. Port-forward: kubectl port-forward --namespace monitoring service/kube-prometheus-stack-grafana 3000:80
    2. Patch the service to use NodePort: kubectl patch svc kube-prometheus-stack-grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'
    3. Create a separate NodePort or LoadBalancer service (see the Kubernetes docs and the sketch after this list)
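
For option 3, a minimal NodePort Service for Grafana might look like the following (the service name, selector label, and nodePort are assumptions; verify the labels on your Grafana pod before applying):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-external      # hypothetical name
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: grafana   # assumed Grafana pod label
  ports:
    - port: 80
      targetPort: 3000        # Grafana container port
      nodePort: 30080         # any free port in the NodePort range (placeholder)
```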

If needed, use an SSH tunnel to reach the Kubernetes host: ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name>, then browse to localhost:<port>.

For Grafana, log in with the admin user and the password set via the GRAFANA_PASSWORD environment variable.

Prometheus UI

Grafana Dashboard

Included charts:

  • Promtail
  • Kube-prometheus
  • Loki
  • Grafana