This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[alert-handler] auto-fix Nvidia GPU low performance issue #5383

Merged
merged 5 commits into from
Mar 31, 2021

Conversation

suiguoxin
Member

No description provided.

@coveralls

coveralls commented Mar 17, 2021

Coverage Status

Coverage remained the same at 34.02% when pulling 9f24d6e on suiguoxin:gpu-perf into fe18fd9 on microsoft:master.

@suiguoxin suiguoxin force-pushed the gpu-perf branch 5 times, most recently from d7f758a to 82c35c5 Compare March 18, 2021 10:42
@suiguoxin suiguoxin marked this pull request as ready for review March 19, 2021 04:34
@suiguoxin suiguoxin mentioned this pull request Mar 19, 2021
14 tasks
Contributor

@Binyang2014 Binyang2014 left a comment

Please run some tests on the INT bed. It seems the script doesn't work.

@suiguoxin suiguoxin merged commit f559e97 into microsoft:master Mar 31, 2021
@suiguoxin suiguoxin deleted the gpu-perf branch March 31, 2021 09:12
@suiguoxin
Member Author

Test cases:

  • enable this feature by adding the following customized route & receiver in services-configuration.yaml:
customized-routes:
  routes:
  - receiver: pai-email-admin-and-fix-nvidia-gpu-low-perf
    match:
      alertname: NodeGpuLowPerfState
customized-receivers: # receivers are combination of several actions
- name: "pai-email-admin-and-fix-nvidia-gpu-low-perf"
  actions:
    email-admin:
    fix-nvidia-gpu-low-perf:
  • manually send an alert by POSTing to https://xxx.openpai.org/alert-manager/api/v1/alerts with the following body:
[
        {
            "labels": {
                "alertname": "NodeGpuLowPerfState",
                "minor_number": "0",
                "node_name": "node6",
                "severity": "warn"
            },
            "generatorURL": "alert/script",
            "fingerprint": "6b8102e96c9e6b2a"
        }
]
  • check that a k8s job named nvidia-gpu-low-perf-fixer-xxx and its related pods are created, then check the logs; the GPU state & clocks should have changed
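
For anyone repeating the manual-alert step, here is a minimal Python sketch of the POST described above. It is an illustration only, not part of this PR: the helper names (`build_alert`, `build_request`, `send_alert`) are hypothetical, the host keeps the `xxx` placeholder from the URL above and must be replaced with the real cluster domain, and it assumes the endpoint accepts a JSON POST without extra authentication, which may not hold for every deployment.

```python
import json
import urllib.request

# Placeholder URL copied from the test steps; "xxx" must be replaced by the real cluster domain.
ALERT_URL = "https://xxx.openpai.org/alert-manager/api/v1/alerts"


def build_alert(node_name, minor_number, fingerprint="6b8102e96c9e6b2a"):
    """Return a one-element alert list shaped like the body in the test steps.

    The default fingerprint is the sample value from the test steps.
    """
    return [
        {
            "labels": {
                "alertname": "NodeGpuLowPerfState",
                "minor_number": str(minor_number),  # label values are strings
                "node_name": node_name,
                "severity": "warn",
            },
            "generatorURL": "alert/script",
            "fingerprint": fingerprint,
        }
    ]


def build_request(alerts, url=ALERT_URL):
    """Build the POST request without sending it (handy for inspection)."""
    data = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(alerts, url=ALERT_URL):
    """Send the alert and return the HTTP status code (assumes no auth is needed)."""
    with urllib.request.urlopen(build_request(alerts, url)) as resp:
        return resp.status


# Example call, left commented out because the URL above is a placeholder:
# send_alert(build_alert("node6", 0))
```

Building the request separately from sending it lets the body be inspected before it reaches the cluster. Once the alert is accepted, the job and pod check in the last step still applies.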
