What steps did you take and what happened?
When a node becomes unreachable (e.g. the underlying instance is stopped), CAPI's machine drain gets stuck indefinitely. The drain uses the Kubernetes Eviction API, which respects PodDisruptionBudgets (PDBs). When pods on the unreachable node are covered by PDBs with minAvailable: 1 and currentHealthy: 0, the Eviction API returns 429 TooManyRequests and the drain retries forever (every 20 seconds).
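For context, the blocking condition can be sketched as follows (a minimal stdlib-only Go sketch, not actual CAPI or apiserver code; the function name is illustrative):

```go
package main

import "fmt"

// allowedDisruptions mirrors how the Eviction API gates evictions on a PDB:
// disruptionsAllowed = currentHealthy - minAvailable, floored at zero.
// When it is zero, the eviction request is rejected with 429 TooManyRequests.
func allowedDisruptions(currentHealthy, minAvailable int) int {
	d := currentHealthy - minAvailable
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	// With minAvailable: 1 and currentHealthy: 0 (the node is down),
	// no disruption is ever allowed, so every eviction attempt gets a 429
	// and the drain loop retries indefinitely.
	fmt.Println(allowedDisruptions(0, 1)) // prints 0
	fmt.Println(allowedDisruptions(2, 1)) // prints 1
}
```

Because currentHealthy can never recover while the instance is stopped, the 429 condition is permanent, not transient.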
The existing code in machine_controller.go correctly detects unreachable nodes and sets GracePeriodSeconds=1 and SkipWaitForDeleteTimeoutSeconds=1, with a comment noting that the taint manager should handle the PDB bypass via the NoExecute taint. However, when the instance is stopped or terminated the kubelet is not running, so:
- The taint manager marks pods for deletion, but the kubelet never carries out the deletion
- Pods either never receive a deletionTimestamp, or receive one but are never actually removed
- CAPI drain's Eviction API calls remain blocked by PDBs indefinitely
- With the default NodeDrainTimeout=0 (unlimited), the machine is stuck in deletion forever
What did you expect to happen?
When a node is unreachable and the infrastructure reports the instance is not ready (stopped/terminated), drain should bypass PDBs by using direct pod deletion instead of the Eviction API. The pods are not actually running, so PDB protection is not meaningful.
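The expected decision could look roughly like this (an illustrative Go sketch only; CAPI has no such type or helper today):

```go
package main

import "fmt"

// drainMethod is illustrative only.
type drainMethod string

const (
	useEvictionAPI  drainMethod = "eviction" // honors PDBs
	useDirectDelete drainMethod = "delete"   // bypasses PDBs
)

// chooseDrainMethod sketches the expected behavior: fall back to direct pod
// deletion only when the node is unreachable AND the infrastructure reports
// the instance as not ready, since PDB protection is meaningless for pods
// that cannot actually be running.
func chooseDrainMethod(nodeUnreachable, infraReady bool) drainMethod {
	if nodeUnreachable && !infraReady {
		return useDirectDelete
	}
	return useEvictionAPI
}

func main() {
	fmt.Println(chooseDrainMethod(true, false)) // delete
	fmt.Println(chooseDrainMethod(false, true)) // eviction
	fmt.Println(chooseDrainMethod(true, true))  // eviction
}
```

Gating the fallback on both conditions keeps the Eviction API (and PDB guarantees) in place for ordinary drains of healthy or merely network-partitioned nodes.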
Cluster API version
CAPI version: v1.9.x / main (observed on OCP 4.19.12 / ROSA HCP)
Kubernetes version
No response
Steps to reproduce:
- Create a cluster with workloads protected by PDBs (minAvailable: 1)
- Stop the underlying instance (e.g. via cloud console)
- Wait for MachineHealthCheck to mark the machine unhealthy and trigger deletion
- Observe drain stuck retrying evictions every 20s, blocked by PDBs
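A minimal PDB matching the reproduction setup above (names and labels are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example
```

With a single replica behind this PDB, stopping the instance drops currentHealthy to 0 and reproduces the permanent 429s.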
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug