docs: cluster observability documentation and dashboard improvements #9391
✅ Deploy Preview for determined-ui ready!
docs/release-notes/observability.st
Outdated
Kubernetes: Add Determined resource information such as `workspace` and `task ID` as pod labels. This improvement facilitates better resource tracking and management within Kubernetes environments.
Configuration: Introduce a DCGM Helm chart and Prometheus configuration to the `tools/observability` directory. Additionally, two new dashboards, "API Monitoring" and "Resource Utilization", have been added to improve observability and operational insight.
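As a rough illustration of how the pod labels mentioned in the release note could be used for resource tracking, here is a minimal sketch that builds a Kubernetes equality-based label selector. The label keys and values (`determined.ai/workspace`, `determined.ai/task_id`, `my-workspace`, `task-123`) and the helper name `build_label_selector` are hypothetical and do not come from this PR; check the labels actually present on your pods before relying on them.

```python
def build_label_selector(labels: dict[str, str]) -> str:
    """Join key/value pairs into an equality-based selector, e.g. 'a=1,b=2'."""
    parts = []
    # Sort keys so the resulting selector string is deterministic.
    for key, value in sorted(labels.items()):
        if not key:
            raise ValueError("label keys must be non-empty")
        parts.append(f"{key}={value}")
    return ",".join(parts)


# Hypothetical label keys and values, for illustration only.
selector = build_label_selector(
    {
        "determined.ai/workspace": "my-workspace",
        "determined.ai/task_id": "task-123",
    }
)

# The selector can be passed to kubectl's -l (--selector) flag.
print(f"kubectl get pods -l {selector}")
```

Selecting on labels like this avoids parsing pod names, which is presumably the kind of tracking the release note has in mind.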
This is an "st" file; please change it to an *.rst file.
"docs/release-notes/observability.st" should be "docs/release-notes/observability.rst".
Good catch - thank you!
LGTM
Discover how to enable a Grafana dashboard to monitor Determined hardware and system metrics.
Determined provides a Prometheus endpoint that contains mappings between internal task, GPU, and
container definitions, which are used by Prometheus to collect relevant metrics on a cluster running
Determined. The endpoint is not enabled by default but can be enabled in the master configuration
Prometheus should be enabled by default now.
Nice catch! Thank you!
I inverted the sentence: "The endpoint is enabled by default but can be disabled in the master configuration file."
Added a commit with some very minor edits.
Co-authored-by: Tara Charter <tara.charter@hpe.com>
f16f30d to 9ec1196
Ticket
RM-293
Description
Merging Cluster Observability feature branch to main. New instructions for setting up observability in Kubernetes environments, along with updated dashboards.
Test Plan
None needed.
Checklist
docs/release-notes/. See Release Note for details.