Fix service discovery bug in `kubernetes-extensions` by capistrant · Pull Request #19139 · apache/druid

capistrant · 2026-03-11T22:14:31Z

Description

Bug Report

The k8s service discovery does not remove discovered nodes whose pods still exist with service announcement labels but whose underlying services are actually unhealthy.

For example, if a broker container is killed but the pod that manages it remains in the namespace with announcement labels, all Druid services will keep this broker in their discovered services cache. This leads to queries being routed to a broker that cannot execute them. If the pod remains in an announced but unhealthy state for any meaningful period of time, cluster functionality can be severely compromised.

The desired behavior in the above example is for the broker to be removed from discovered services caches, at least until the pod's underlying container is restarted and the pod is healthy again.

Fix Details

My proposed fix starts using a pod's readiness flag in the discovery logic. If a pod is not ready, its underlying services will not be added to any service discovery cache that does not already contain them, and they will be removed from any cache that does. These services can be added back once they receive a MODIFIED or ADDED event and are ready again.
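To make the intended behavior concrete, here is a minimal, self-contained sketch of the readiness-aware handling described above. It is an illustration only, not the code in this PR: `ReadinessAwareDiscoveryCache`, `EventType`, and `apply` are hypothetical names, the cache is reduced to a set of pod names, and the watch event is reduced to a type plus a `ready` boolean. It also assumes that a MODIFIED event for a ready pod that is not in the cache is treated the same as ADDED.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: names and structure are illustrative, not the PR's classes.
final class ReadinessAwareDiscoveryCache {
  enum EventType { ADDED, MODIFIED, DELETED }

  // Pod names currently considered discovered.
  private final Set<String> discovered = new HashSet<>();

  /** Applies one watch event and returns true if the cache changed. */
  boolean apply(EventType type, String podName, boolean ready) {
    if (type == EventType.DELETED || !ready) {
      // Deleted or not-ready pods are dropped from the cache.
      // A not-ready pod that was never cached is a no-op.
      return discovered.remove(podName);
    }
    // ADDED or MODIFIED while ready: add the pod if it is not already cached.
    return discovered.add(podName);
  }

  boolean contains(String podName) {
    return discovered.contains(podName);
  }

  public static void main(String[] args) {
    ReadinessAwareDiscoveryCache cache = new ReadinessAwareDiscoveryCache();
    cache.apply(EventType.ADDED, "broker-0", true);
    // The broker container is killed; the pod stays but reports not ready.
    cache.apply(EventType.MODIFIED, "broker-0", false);
    System.out.println(cache.contains("broker-0"));
    // The container restarts and the pod becomes ready again.
    cache.apply(EventType.MODIFIED, "broker-0", true);
    System.out.println(cache.contains("broker-0"));
  }
}
```

In this sketch the broker from the bug report leaves the cache as soon as a not-ready event arrives and returns on the next ready event, without the pod ever being deleted.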

Fix Risks

The biggest risk I see is that this new reliance on the readiness probe introduces an expectation that the probe is accurate and stable. I try to call out in the documentation that this needs to be considered when defining a pod's readiness probe, as a way to mitigate unexpected changes for users. This could be included in a release note as well to tip off any users of the extension.

Release note

TBD

Key changed/added classes in this PR

DefaultK8sApiClient
BaseNodeRoleWatcher
WatchResult

This PR has:


capistrant · 2026-03-11T22:32:37Z

marking this as draft while I evaluate a competing approach that uses pod phase instead of readiness

capistrant added 8 commits March 11, 2026 16:16

Potential fix

dbe34a8

self review iteration

e539452

minor fixes

cb71279

clarify why remapping modified to added is safe

fcce511

Add more info about how readiness probes come into play for k8s discovery

474c33f

more comment and log cleanup

23af381

Fix doc spelling

1340130

refactor to make reasoning about new behavior more easy

23f557e

capistrant added the Bug label Mar 11, 2026

github-actions bot added the Area - Documentation label Mar 11, 2026

capistrant added the Kubernetes label Mar 11, 2026

capistrant marked this pull request as draft March 11, 2026 22:32

capistrant added 4 commits March 13, 2026 13:32

Merge branch 'master' into k8s-discovery-bug

27cefc1

Merge branch 'master' into k8s-discovery-bug

f8f057b

change example in docs to new readiness probe

106fe38

Fix tests file

5ca2069

capistrant wants to merge 12 commits into apache:master from capistrant:k8s-discovery-bug
