What steps did you take and what happened?
When a node becomes unreachable (e.g. the underlying instance is stopped), CAPI's machine drain gets stuck indefinitely. The drain uses the Kubernetes Eviction API, which respects PodDisruptionBudgets (PDBs). When pods on the unreachable node are covered by PDBs with minAvailable: 1 and currentHealthy: 0, the Eviction API returns 429 TooManyRequests and the drain retries forever (every 20 seconds).
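For context, the blocking condition can be sketched as follows (a minimal stdlib-only Go sketch, not actual CAPI or apiserver code; the function name is illustrative):

```go
package main

import "fmt"

// allowedDisruptions mirrors how the Eviction API gates evictions on a PDB:
// disruptionsAllowed = currentHealthy - minAvailable, floored at zero.
// When it is zero, the eviction request is rejected with 429 TooManyRequests.
func allowedDisruptions(currentHealthy, minAvailable int) int {
	d := currentHealthy - minAvailable
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	// With minAvailable: 1 and currentHealthy: 0 (the node is down),
	// no disruption is ever allowed, so every eviction attempt gets a 429
	// and the drain loop retries indefinitely.
	fmt.Println(allowedDisruptions(0, 1)) // prints 0
	fmt.Println(allowedDisruptions(2, 1)) // prints 1
}
```

Because currentHealthy can never recover while the instance is stopped, the 429 condition is permanent, not transient.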
The existing code in machine_controller.go correctly detects unreachable nodes and sets GracePeriodSeconds=1 and SkipWaitForDeleteTimeoutSeconds=1, with a comment noting that the taint manager should handle the PDB bypass via the NoExecute taint. However, when the instance is stopped or terminated the kubelet is not running, so:
- The taint manager marks pods for deletion, but the kubelet never carries out the deletion
- Pods either never receive a deletionTimestamp, or receive one but are never actually removed
- CAPI drain's Eviction API calls remain blocked by PDBs indefinitely
- With the default NodeDrainTimeout=0 (unlimited), the machine is stuck in deletion forever
What did you expect to happen?
When a node is unreachable and the infrastructure reports the instance is not ready (stopped/terminated), drain should bypass PDBs by using direct pod deletion instead of the Eviction API. The pods are not actually running, so PDB protection is not meaningful.
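The expected decision could look roughly like this (an illustrative Go sketch only; CAPI has no such type or helper today):

```go
package main

import "fmt"

// drainMethod is illustrative only.
type drainMethod string

const (
	useEvictionAPI  drainMethod = "eviction" // honors PDBs
	useDirectDelete drainMethod = "delete"   // bypasses PDBs
)

// chooseDrainMethod sketches the expected behavior: fall back to direct pod
// deletion only when the node is unreachable AND the infrastructure reports
// the instance as not ready, since PDB protection is meaningless for pods
// that cannot actually be running.
func chooseDrainMethod(nodeUnreachable, infraReady bool) drainMethod {
	if nodeUnreachable && !infraReady {
		return useDirectDelete
	}
	return useEvictionAPI
}

func main() {
	fmt.Println(chooseDrainMethod(true, false)) // delete
	fmt.Println(chooseDrainMethod(false, true)) // eviction
	fmt.Println(chooseDrainMethod(true, true))  // eviction
}
```

Gating the fallback on both conditions keeps the Eviction API (and PDB guarantees) in place for ordinary drains of healthy or merely network-partitioned nodes.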
Cluster API version
CAPI version: v1.9.x / main (observed on OCP 4.19.12 / ROSA HCP)
Kubernetes version
No response
Steps to reproduce:
- Create a cluster with workloads protected by PDBs (minAvailable: 1)
- Stop the underlying instance (e.g. via cloud console)
- Wait for MachineHealthCheck to mark the machine unhealthy and trigger deletion
- Observe drain stuck retrying evictions every 20s, blocked by PDBs
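A minimal PDB matching the reproduction setup above (names and labels are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example
```

With a single replica behind this PDB, stopping the instance drops currentHealthy to 0 and reproduces the permanent 429s.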
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug