Handle post-failover pod replacement gracefully by ysqyang · Pull Request #86 · valkey-io/valkey-operator

ysqyang · 2026-02-16T22:31:18Z

Summary

When Valkey promotes a replica to primary during automatic failover and Kubernetes recreates the old node-index=0 pod, the reconciler previously tried to assign a new slot range to the replacement pod. This failed in a loop with "no slots range to assign" because all 16384 slots were already owned by the promoted replica, leaving the cluster stuck in Degraded/NodeAddFailed.

Fix: before assigning slots, assignSlotsToPendingPrimaries now checks the live Valkey topology (shardExistsInTopology) for any existing member in the same shard. If one exists (post-failover or mid-failover), the replacement pod joins as a replica instead. replicateToShardPrimary also gains a fallback (findShardPrimary) that scans all shard pods for the actual primary, since after failover node-index=0 is no longer the Valkey primary.
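The decision described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `Node`, `ClusterState`, and `roleForPendingPod` are hypothetical stand-ins, and the simplified `shardExistsInTopology` shown here does not reproduce the signature or the real `valkey.ClusterState` type used in the PR.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of one member of the live topology.
type Node struct {
	Address    string
	ShardIndex int
}

// ClusterState is a hypothetical stand-in for the operator's cluster state.
type ClusterState struct {
	Nodes []Node
}

// shardExistsInTopology reports whether any node already present in the
// live topology belongs to the given shard (simplified illustration).
func shardExistsInTopology(state ClusterState, shardIndex int) bool {
	for _, n := range state.Nodes {
		if n.ShardIndex == shardIndex {
			return true
		}
	}
	return false
}

// roleForPendingPod picks the role a pending pod should take: if the shard
// already has a member (post-failover or mid-failover), join as a replica;
// otherwise the pod may be assigned a slot range as a new primary.
func roleForPendingPod(state ClusterState, shardIndex int) string {
	if shardExistsInTopology(state, shardIndex) {
		return "replica"
	}
	return "primary"
}

func main() {
	state := ClusterState{Nodes: []Node{
		{Address: "10.0.0.2", ShardIndex: 0}, // promoted replica of shard 0
		{Address: "10.0.0.3", ShardIndex: 1},
	}}
	fmt.Println(roleForPendingPod(state, 0)) // shard 0 already has a member
	fmt.Println(roleForPendingPod(state, 2)) // shard 2 is not in the topology yet
}
```

The point of checking the live topology first is that slot assignment is only attempted for shards that have no member at all, which avoids the "no slots range to assign" loop.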

What changed

utils.go (two new helpers):
- shardHasLivePrimary: checks if another pod in the same shard-index group is a slot-bearing primary in the live topology
- findShardPrimary: scans all pods in a shard to find the actual Valkey primary regardless of node-index label

valkeycluster_controller.go:
- addValkeyNode step 2 now detects post-failover replacements and falls through to the replica path
- replicateToShardPrimary tries node-index=0 first (fast path), then falls back to findShardPrimary
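The fast path and fallback listed above might look roughly like the following. This is a sketch, not the PR's code: `Pod`, `TopologyNode`, and `primaryForShard` are hypothetical names, and the topology is modeled as a plain map keyed by pod IP rather than the operator's real state type.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a shard pod and its node-index label.
type Pod struct {
	Name      string
	IP        string
	NodeIndex int
}

// TopologyNode is a hypothetical view of a node as reported by the live cluster.
type TopologyNode struct {
	ID        string
	IP        string
	IsPrimary bool
}

// primaryForShard returns the node ID and IP of the shard's primary.
// Fast path: the pod labeled node-index=0, if it is still the primary.
// Fallback: scan every pod in the shard for whichever one is the primary now.
// Returns ("", "") if no primary is found.
func primaryForShard(pods []Pod, topo map[string]TopologyNode) (nodeID, ip string) {
	for _, p := range pods {
		if p.NodeIndex != 0 {
			continue
		}
		if n, ok := topo[p.IP]; ok && n.IsPrimary {
			return n.ID, n.IP
		}
	}
	for _, p := range pods {
		if n, ok := topo[p.IP]; ok && n.IsPrimary {
			return n.ID, n.IP
		}
	}
	return "", ""
}

func main() {
	// After failover: the recreated node-index=0 pod is not in the topology,
	// and the node-index=1 pod has been promoted.
	pods := []Pod{
		{Name: "valkeycluster-sample-0-0", IP: "10.0.0.9", NodeIndex: 0},
		{Name: "valkeycluster-sample-0-1", IP: "10.0.0.3", NodeIndex: 1},
	}
	topo := map[string]TopologyNode{
		"10.0.0.3": {ID: "node-b", IP: "10.0.0.3", IsPrimary: true},
	}
	id, ip := primaryForShard(pods, topo)
	fmt.Println(id, ip)
}
```

In the common case where no failover has happened, the first loop returns immediately, so the scan only costs anything when node-index=0 is no longer the primary.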

Test plan

1. Deploy a 3-shard, 1-replica cluster and wait for Ready

kubectl apply -f config/samples/v1alpha1_valkeycluster.yaml

2. Kill shard-0 primary

kubectl delete pod valkeycluster-sample-0-0

3. Verify valkeycluster-sample-0-1 is now the primary and valkeycluster-sample-0-0 is its replica

kubectl exec valkeycluster-sample-0-1 -- valkey-cli cluster nodes
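Step 3 can also be checked programmatically by parsing the `cluster nodes` output. The sketch below assumes the usual line layout (`<id> <ip:port@cport> <flags> <primary-id> ...`) and that the flags field still uses the `master`/`slave` keywords; the helper name `rolesByAddress` and the sample text are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// rolesByAddress maps each node's "ip:port" to "primary" or "replica",
// based on the flags column of CLUSTER NODES output.
func rolesByAddress(clusterNodes string) map[string]string {
	roles := make(map[string]string)
	for _, line := range strings.Split(strings.TrimSpace(clusterNodes), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // skip blank or malformed lines
		}
		// fields[1] looks like ip:port@cport, optionally followed by ",hostname".
		addr := fields[1]
		if i := strings.IndexAny(addr, "@,"); i >= 0 {
			addr = addr[:i]
		}
		// fields[2] is a comma-separated flag list, e.g. "myself,master".
		for _, flag := range strings.Split(fields[2], ",") {
			switch flag {
			case "master":
				roles[addr] = "primary"
			case "slave":
				roles[addr] = "replica"
			}
		}
	}
	return roles
}

func main() {
	// Invented sample output, not captured from a real cluster.
	sample := `aaa 10.0.0.3:6379@16379 myself,master - 0 0 2 connected 0-5460
bbb 10.0.0.9:6379@16379 slave aaa 0 0 2 connected`
	for addr, role := range rolesByAddress(sample) {
		fmt.Println(addr, role)
	}
}
```

Matching the returned addresses against the pod IPs of valkeycluster-sample-0-1 and valkeycluster-sample-0-0 would confirm the expected primary and replica roles.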

jdheyburn · 2026-02-18T13:41:10Z

On the whole the code looks sound, but I will let those with more Cluster expertise comment first.

bjosv · 2026-02-18T13:42:56Z

+// primary, regardless of its node-index label. This handles the post-failover
+// case where node-index=1 (or higher) was promoted by Valkey.
+// Returns ("", "") if no primary is found.
+func findShardPrimary(state *valkey.ClusterState, shardIndex int, selfAddress string, pods *corev1.PodList) (nodeID, ip string) {


Is the selfAddress argument really needed? It's an optimization to save a few CPU cycles, right?
The API gets cleaner if we remove it.
The same comment applies to shardExistsInTopology.

ysqyang · 2026-02-18

Good catch. I just realized that both functions are called with a pending node's address, which cannot appear in state.Shards because the node was not part of any shard when the state was captured.

bjosv · 2026-02-18T15:24:36Z

+
+		// This test was temporarily disabled in PR #54 because the operator
+		// could not recover from a primary deletion (issue #43). The failover
+		// fix (shardHasLivePrimary + findShardPrimary) now handles this: when


The comment needs to be updated here; shardHasLivePrimary does not exist in this PR.

Also update the PR description with the latest changes; shardHasLivePrimary is mentioned there too.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>

bjosv

LGTM. I have run the new test case multiple times (20+) without failure.
We'll see if we can catch it again.

bjosv · 2026-02-19T18:27:00Z

We should remove the pictures when merging to avoid a messy git history. Also remove the "What changed" section, since it mentions shardHasLivePrimary.
I'll merge if you are OK with this PR, @sandeepkunusoth?

ysqyang force-pushed the failover-fix branch from 4987fca to 9ee9cf9 Compare February 17, 2026 00:27

bjosv reviewed Feb 17, 2026

Comment thread test/e2e/valkeycluster_test.go

ysqyang force-pushed the failover-fix branch 2 times, most recently from a913d7c to 8029164 Compare February 17, 2026 18:12

ysqyang marked this pull request as ready for review February 17, 2026 18:12

ysqyang force-pushed the failover-fix branch from 8029164 to 0acf5eb Compare February 17, 2026 19:42

bjosv reviewed Feb 18, 2026

sandeepkunusoth reviewed Feb 18, 2026

Comment thread test/e2e/valkeycluster_test.go Outdated

handle post-failover pod replacement gracefully

2a9ef62

Signed-off-by: yang.qiu <yang.qiu@reddit.com>

ysqyang force-pushed the failover-fix branch 2 times, most recently from 40c1ea4 to f1431d1 Compare February 19, 2026 06:44

address comments

9ff764f

Signed-off-by: yang.qiu <yang.qiu@reddit.com>

ysqyang force-pushed the failover-fix branch from f1431d1 to 9ff764f Compare February 19, 2026 17:40

bjosv approved these changes Feb 19, 2026

bjosv merged commit d777c83 into valkey-io:main Feb 19, 2026
4 checks passed

ysqyang deleted the failover-fix branch February 19, 2026 18:52

bjosv merged 2 commits into valkey-io:main from ysqyang:failover-fix

4 participants
