docs: cluster observability documentation and dashboard improvements #9391

Merged May 21, 2024 (10 commits)

@kkunapuli kkunapuli commented May 20, 2024

Ticket

RM-293

Description

Merging Cluster Observability feature branch to main. New instructions for setting up observability in Kubernetes environments, along with updated dashboards.

Test Plan

None needed.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label May 20, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 20, 2024
@determined-ci determined-ci requested a review from a team May 20, 2024 15:30

netlify bot commented May 20, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 9ec1196
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/664cdddb391b64000809e170
😎 Deploy Preview https://deploy-preview-9391--determined-ui.netlify.app

@kkunapuli kkunapuli changed the title Observability feature branch docs: cluster observability documentation and dashboard improvements May 20, 2024
@kkunapuli kkunapuli marked this pull request as ready for review May 20, 2024 16:18

Kubernetes: Add Determined resource information such as `workspace` and `task ID` as pod labels. This improvement facilitates better resource tracking and management within Kubernetes environments.

Configuration: Introduce a DCGM Helm chart and Prometheus configuration to the `tools/observability` directory. Additionally, two new dashboards, "API Monitoring" and "Resource Utilization", have been added to improve observability and operational insight.
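
As a rough illustration of how the pod labels from the release note could be used for resource tracking, here is a minimal sketch that builds a Kubernetes equality-based label selector string. The label keys `determined.ai/workspace` and `determined.ai/task_id` are hypothetical placeholders; the release note says only that `workspace` and `task ID` are exposed as labels and does not give the exact keys. The helper name `build_label_selector` is likewise an assumption made for this example.

```python
import re

# Hypothetical label keys: the release note names `workspace` and `task ID`
# as label contents but does not state the exact keys used on the pods.
WORKSPACE_LABEL = "determined.ai/workspace"
TASK_ID_LABEL = "determined.ai/task_id"

# Kubernetes label values are at most 63 characters and are either empty or
# start and end with an alphanumeric, with '-', '_' and '.' allowed inside.
_LABEL_VALUE = re.compile(r"([A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?)?")


def build_label_selector(labels: dict[str, str]) -> str:
    """Return an equality-based selector such as 'k1=v1,k2=v2'."""
    parts = []
    # Sort by key so the output is deterministic.
    for key, value in sorted(labels.items()):
        if len(value) > 63 or not _LABEL_VALUE.fullmatch(value):
            raise ValueError(f"invalid label value for {key!r}: {value!r}")
        parts.append(f"{key}={value}")
    return ",".join(parts)


selector = build_label_selector(
    {WORKSPACE_LABEL: "vision-team", TASK_ID_LABEL: "abc123"}
)
print(selector)
```

The resulting string can be passed to `kubectl get pods -l "<selector>"` once the placeholder keys are replaced with the ones actually set on the pods.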
Member

this is an "st" file, please change it to an *.rst file

"docs/release-notes/observability.st" should be "docs/release-notes/observability.rst"

Contributor Author

Good catch - thank you!

@determined-ci determined-ci requested a review from a team May 20, 2024 20:10
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

LGTM

Discover how to enable a Grafana dashboard to monitor Determined hardware and system metrics.
Determined provides a Prometheus endpoint that contains mappings between internal task, GPU, and
container definitions, which are used by Prometheus to collect relevant metrics on a cluster running
Determined. The endpoint is not enabled by default but can be enabled in the master configuration
Contributor

prometheus should be enabled by default now

Contributor Author

Nice catch! Thank you!

I inverted the sentence: "The endpoint is enabled by default but can be disabled in the master configuration file."

Member

@tara-det-ai tara-det-ai left a comment

added a commit with some very minor edits

@determined-ci determined-ci requested a review from a team May 21, 2024 14:37
@kkunapuli kkunapuli force-pushed the observability_feature_branch branch from f16f30d to 9ec1196 Compare May 21, 2024 17:46
@kkunapuli kkunapuli merged commit b84ee1f into main May 21, 2024
79 of 94 checks passed
@kkunapuli kkunapuli deleted the observability_feature_branch branch May 21, 2024 18:32
4 participants