Howdy, we have a custom check that retrieves a metric value from Prometheus using curl.
Edit: we are using Slurm as our resource manager.
The check works great. However, I need to add code to it that prevents NHC from changing the state of the node (draining or un-draining it) if the curl command fails. Examples:
- The Prometheus server is not responding
- The query doesn't return any metric (could happen if node_exporter died on the node)
Is there a way to return from the check function so that NHC does not make any changes to the node?
- return 0 indicates no failure and triggers an un-drain if the node is already drained, so I can't use that.
- return 1 (or any other non-zero value) indicates failure and drains the node.
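
To make the cases concrete, here is a minimal sketch of how the check could tell them apart. It is not our actual check: the function name classify_prom_result is hypothetical, and it assumes the standard Prometheus /api/v1/query JSON response in its usual compact form (no whitespace), matched with plain string patterns rather than a real JSON parser.

```shell
# Hypothetical helper (not part of NHC): classify the outcome of a
# Prometheus instant query that was made with curl.
#   $1 = curl exit status
#   $2 = response body from /api/v1/query
# Prints one of: query_failed, no_data, ok
classify_prom_result() {
    curl_rc=$1
    body=$2

    # A non-zero curl exit status means the request itself failed,
    # for example because the Prometheus server is not responding.
    if [ "$curl_rc" -ne 0 ]; then
        echo query_failed
        return 0
    fi

    # Prometheus answers a successful query with "status":"success".
    # Anything else is treated as a failed query.
    case $body in
        *'"status":"success"'*) ;;
        *) echo query_failed; return 0 ;;
    esac

    # An empty result list means no series matched the query,
    # which could happen if node_exporter died on the node.
    case $body in
        *'"result":[]'*) echo no_data ;;
        *) echo ok ;;
    esac
}

# Example call (URL and query are placeholders):
#   body=$(curl -s -f --max-time 5 -G \
#          --data-urlencode "query=$QUERY" "$PROM_URL/api/v1/query")
#   outcome=$(classify_prom_result "$?" "$body")
```

Only the ok case should lead to a normal pass or fail. What the check should return to NHC for query_failed and no_data is the open question.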
Thanks,
Mike Hanby
UAB IT Research Computing