This repository was archived by the owner on Jun 6, 2024. It is now read-only.

Cordon node #4942

Merged
suiguoxin merged 73 commits into microsoft:master from suiguoxin:cordon-node
Oct 15, 2020

Conversation

@suiguoxin
Contributor

Cordon a node when the NvidiaSmiEccError alert fires:

  • collect node_name in job_exporter
  • set an alert in Prometheus
  • add a cordon-nodes action in alert-manager that cordons nodes through the k8s API (patchNode)

Refer to Issue #4789
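
For readers unfamiliar with what cordoning through patchNode amounts to, here is a minimal sketch in Python, not the code from this PR. Cordoning a node means setting `spec.unschedulable` to `true` on the Node object, which is also what `kubectl cordon` does. The function name `build_node_patch_request` and its parameters are hypothetical. The request is only built, not sent, so the sketch runs without a cluster.

```python
import json
import urllib.request


def build_node_patch_request(api_server: str, node_name: str, token: str,
                             unschedulable: bool = True) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that cordons or uncordons a node.

    Hypothetical helper for illustration only; the PR itself uses the k8s
    client's patchNode call inside alert-manager.
    """
    # Cordon = spec.unschedulable true, uncordon = spec.unschedulable false.
    body = json.dumps({"spec": {"unschedulable": unschedulable}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_server.rstrip('/')}/api/v1/nodes/{node_name}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            # Patch requests need a patch content type; strategic merge is one option.
            "Content-Type": "application/strategic-merge-patch+json",
        },
    )


# Example with placeholder values (assumed, not taken from the PR).
req = build_node_patch_request("https://kube-apiserver:6443", "node-1", "TOKEN")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `unschedulable=False` builds the matching uncordon request, which is the manual step discussed in the comments that follow.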

@coveralls

coveralls commented Sep 29, 2020

Coverage decreased (-0.1%) to 34.276% when pulling de9c433 on suiguoxin:cordon-node into cf4e6a8 on microsoft:master.

@fanyangCS
Contributor

Maybe you could also implement uncordon as well?

@Binyang2014
Contributor

> Maybe you could also implement uncordon as well?

I suggest uncordoning the node manually. I think the alert's resolved message may be a false positive. For example, job-exporter triggers the alert and then crashes. Since there are no metrics anymore, the alert rule is no longer satisfied and Prometheus may mark the alert as resolved, but the error is still there.

So I suggest that the admin uncordon the node manually once they have made sure everything is OK.
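
The policy argued for above (act on firing alerts, never act automatically on resolved ones) can be sketched as a small filter over an Alertmanager webhook payload. This is a hedged illustration, not the handler from this PR: the function name `nodes_to_cordon` is hypothetical, and it assumes the alert carries the `node_name` label collected by job_exporter.

```python
from typing import Dict, List


def nodes_to_cordon(payload: Dict, alert_name: str = "NvidiaSmiEccError") -> List[str]:
    """Return node names that should be cordoned for a webhook payload.

    Only alerts with status "firing" are considered. Alerts with status
    "resolved" are ignored on purpose: a resolve can simply mean the
    exporter stopped reporting, so uncordoning is left to the admin.
    """
    nodes: List[str] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # never auto-uncordon on "resolved"
        labels = alert.get("labels", {})
        if labels.get("alertname") != alert_name:
            continue
        node = labels.get("node_name")  # assumed label name
        if node and node not in nodes:
            nodes.append(node)
    return nodes


# Example payload with made-up node names.
example = {
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "NvidiaSmiEccError", "node_name": "node-a"}},
        {"status": "resolved",
         "labels": {"alertname": "NvidiaSmiEccError", "node_name": "node-b"}},
    ]
}
print(nodes_to_cordon(example))
```

With this shape, a crashed exporter that causes Prometheus to resolve the alert has no effect on the node's schedulable state.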

@suiguoxin
Contributor Author

> Maybe you could also implement uncordon as well?
>
> I suggest uncordoning the node manually. I think the alert's resolved message may be a false positive. For example, job-exporter triggers the alert and then crashes. Since there are no metrics anymore, the alert rule is no longer satisfied and Prometheus may mark the alert as resolved, but the error is still there.
>
> So I suggest that the admin uncordon the node manually once they have made sure everything is OK.

@fanyangCS @Binyang2014 I added brief instructions for admins about the alert and how to uncordon nodes here:
https://github.com/suiguoxin/pai/blob/cordon-node/docs/manual/cluster-admin/troubleshooting.md#nvidiasmieccerror

suiguoxin requested a review from Binyang2014 October 14, 2020 13:10
suiguoxin merged commit a1b716e into microsoft:master Oct 15, 2020
suiguoxin deleted the cordon-node branch October 15, 2020 08:10