Add gpu count metrics including limit, request and total counts #214

Merged
merged 3 commits into amazon-contributing:aws-cwa-dev on May 29, 2024

Conversation

@movence commented May 20, 2024

**Revision**

  • Remove the logic to copy resource metric attributes down to data points

Description:
Add NVIDIA GPU count metrics including _limit, _request and _total at the pod, node and cluster levels. This change makes the leader agent pod collect the GPU count metrics rather than having individual agent pods collect them, so that limit and request metrics can be collected for pods that are still in Pending status due to a lack of available GPU devices. The leader agent still gathers pending pods' information for the _limit and _request metrics, but the _total metric only includes GPU counts from running pods.

Testing:
Tested with a cluster that has 2 g4dn.12xlarge instances with 4 GPU devices each. There are 4 running pods (8 allocated GPU devices in total) and 2 pending pods requesting 2 GPU devices each.
[Screenshot: metric values from the test cluster, 2024-05-20]
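Working through the rule above for this setup (metric names illustrative): the request and limit counts include pending pods, giving 4 × 2 + 2 × 2 = 12 GPUs at the cluster level, while the total count only includes GPUs allocated to running pods, giving 8, which matches the 8 devices available on the two instances.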

[Two review threads on internal/aws/containerinsight/utils.go, outdated and resolved]
@movence merged commit 2728c19 into amazon-contributing:aws-cwa-dev on May 29, 2024
111 of 122 checks passed