This repository was archived by the owner on Jun 6, 2024. It is now read-only.

[alert-handler] auto-fix Nvidia GPU low performance issue #5383

Merged
suiguoxin merged 5 commits into microsoft:master from suiguoxin:gpu-perf
Mar 31, 2021

Conversation

@suiguoxin
Contributor

No description provided.

@coveralls

coveralls commented Mar 17, 2021

Coverage Status

Coverage remained the same at 34.02% when pulling 9f24d6e on suiguoxin:gpu-perf into fe18fd9 on microsoft:master.

@suiguoxin suiguoxin force-pushed the gpu-perf branch 5 times, most recently from d7f758a to 82c35c5 Compare March 18, 2021 10:42
@suiguoxin suiguoxin marked this pull request as ready for review March 19, 2021 04:34
@suiguoxin suiguoxin requested a review from Binyang2014 March 19, 2021 04:34
@suiguoxin suiguoxin mentioned this pull request Mar 19, 2021
14 tasks
@suiguoxin suiguoxin requested a review from Binyang2014 March 29, 2021 01:41
Contributor

@Binyang2014 Binyang2014 left a comment

Please run some tests on the INT bed. It seems the script doesn't work.

@suiguoxin suiguoxin merged commit f559e97 into microsoft:master Mar 31, 2021
@suiguoxin suiguoxin deleted the gpu-perf branch March 31, 2021 09:12
@suiguoxin
Contributor Author

Test cases:

  • enable this feature by adding the following customized route & receiver in services-configuration.yaml:
customized-routes:
  routes:
  - receiver: pai-email-admin-and-fix-nvidia-gpu-low-perf
    match:
      alertname: NodeGpuLowPerfState
customized-receivers: # receivers are combination of several actions
- name: "pai-email-admin-and-fix-nvidia-gpu-low-perf"
  actions:
    email-admin:
    fix-nvidia-gpu-low-perf:
  • manually send an alert by POST to https://xxx.openpai.org/alert-manager/api/v1/alerts with the following body:
[
        {
            "labels": {
                "alertname": "NodeGpuLowPerfState",
                "minor_number": "0",
                "node_name": "node6",
                "severity": "warn"
            },
            "generatorURL": "alert/script",
            "fingerprint": "6b8102e96c9e6b2a"
        }
]
  • check that a k8s job named nvidia-gpu-low-perf-fixer-xxx and its related pods are created, then check the logs; the GPU state & clocks should have changed
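
For anyone repeating the manual alert step, here is a minimal Python sketch that builds the same body and POSTs it. The helper names (`build_alert`, `build_request`, `send_alert`) are hypothetical and not part of this PR; the `xxx` host is the placeholder from the step above and has to be replaced with a real cluster domain, and the sketch assumes the endpoint accepts an unauthenticated JSON POST, which may not hold on every deployment.

```python
import json
import urllib.request

# Placeholder URL copied from the test steps; "xxx" must be replaced with the real cluster domain.
ALERT_URL = "https://xxx.openpai.org/alert-manager/api/v1/alerts"


def build_alert(node_name, minor_number):
    """Return an alert list mirroring the JSON body from the test steps (hypothetical helper)."""
    return [
        {
            "labels": {
                "alertname": "NodeGpuLowPerfState",
                "minor_number": str(minor_number),  # label values are strings in the body above
                "node_name": node_name,
                "severity": "warn",
            },
            "generatorURL": "alert/script",
            "fingerprint": "6b8102e96c9e6b2a",
        }
    ]


def build_request(alerts, url=ALERT_URL):
    """Wrap the alert list in a JSON POST request without sending it."""
    data = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(alerts, url=ALERT_URL):
    """Send the alerts and return the HTTP status code."""
    with urllib.request.urlopen(build_request(alerts, url)) as resp:
        return resp.status

# Usage (needs a reachable alert-manager, so it is left commented out):
# send_alert(build_alert("node6", 0))
```

Building the request separately from sending it makes it possible to inspect the payload before anything reaches the alert manager.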

3 participants