f663587 to d951ffe
Maybe you could implement uncordon as well?
I suggest uncordoning the node manually. I think the alert-resolved message may be a false positive. For example, job-exporter triggers the alert and then crashes. Since there are no metrics anymore, the alert rule is no longer satisfied and Prometheus may mark the alert as resolved, even though the error is still there. So I suggest the admin uncordon the node manually once they have made sure everything is OK.
@fanyangCS @Binyang2014 Added brief instructions for admins about the alert and how to uncordon nodes here.
Cordon node when NvidiaSmiEccError:
cordon-nodes action in alert-manager, cordon nodes through k8s api patchNode
Refer to Issue #4789
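
For readers unfamiliar with what cordoning through the API involves, here is a minimal sketch, not the code in this PR. It assumes the official Kubernetes Python client, whose `CoreV1Api.patch_node` is the counterpart of `patchNode`; the helper names `set_unschedulable`, `cordon_node`, and `uncordon_node` are hypothetical. Cordoning amounts to patching `spec.unschedulable` on the node, and the manual uncordon discussed above is the same patch with the flag set back to false.

```python
from typing import Any, Protocol


class NodeApi(Protocol):
    """The one method this sketch needs; kubernetes.client.CoreV1Api provides it."""

    def patch_node(self, name: str, body: Any) -> Any: ...


def set_unschedulable(api: NodeApi, node_name: str, unschedulable: bool) -> Any:
    """Patch spec.unschedulable, the field that kubectl cordon/uncordon change."""
    body = {"spec": {"unschedulable": unschedulable}}
    return api.patch_node(node_name, body)


def cordon_node(api: NodeApi, node_name: str) -> Any:
    """Mark the node unschedulable so no new pods are placed on it."""
    return set_unschedulable(api, node_name, True)


def uncordon_node(api: NodeApi, node_name: str) -> Any:
    """Make the node schedulable again (the manual step left to the admin)."""
    return set_unschedulable(api, node_name, False)


# Usage against a real cluster (assumes the `kubernetes` package and a kubeconfig;
# "worker-1" is a placeholder node name):
#
#   from kubernetes import client, config
#   config.load_kube_config()
#   cordon_node(client.CoreV1Api(), "worker-1")
```

Taking the API object as a parameter keeps the helpers testable without a cluster. Note that cordoning only blocks new scheduling; pods already running on the node are not evicted.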