Skip to content

Handle post-failover pod replacement gracefully#86

Merged
bjosv merged 2 commits into
valkey-io:mainfrom
ysqyang:failover-fix
Feb 19, 2026
Merged

Handle post-failover pod replacement gracefully#86
bjosv merged 2 commits into
valkey-io:mainfrom
ysqyang:failover-fix

Conversation

@ysqyang
Copy link
Copy Markdown
Contributor

@ysqyang ysqyang commented Feb 16, 2026

Summary

When Valkey promotes a replica to primary during automatic failover and Kubernetes recreates the old node-index=0 pod, the reconciler previously tried to assign a new slot range to the replacement pod. This failed in a loop with "no slots range to assign" because all 16384 slots were already owned by the promoted replica, leaving the cluster stuck in Degraded/NodeAddFailed.

Fix: before assigning slots, assignSlotsToPendingPrimaries now checks the live Valkey topology (shardExistsInTopology) for any existing member in the same shard. If one exists (post-failover or mid-failover), the replacement pod joins as a replica instead. replicateToShardPrimary also gains a fallback (findShardPrimary) that scans all shard pods for the actual primary, since after failover node-index=0 is no longer the Valkey primary.

What changed

  • utils.go — two new helpers:
    • shardHasLivePrimary: checks if another pod in the same shard-index group is a slot-bearing primary in the live topology
    • findShardPrimary: scans all pods in a shard to find the actual Valkey primary regardless of node-index label
  • valkeycluster_controller.go:
    • addValkeyNode step 2 now detects post-failover replacements and falls through to the replica path
    • replicateToShardPrimary tries node-index=0 first (fast path), then falls back to findShardPrimary

Test plan

1. Deploy a 3-shard, 1-replica cluster and wait for Ready

kubectl apply -f config/samples/v1alpha1_valkeycluster.yaml
Screenshot 2026-02-16 at 2 54 38 PM Screenshot 2026-02-16 at 2 57 59 PM

2. Kill shard-0 primary

kubectl delete pod valkeycluster-sample-0-0
Screenshot 2026-02-16 at 3 04 55 PM

3. Verify valkeycluster-sample-0-1 is now the primary and valkeycluster-sample-0-0 is its replica

kubectl exec valkeycluster-sample-0-1 -- valkey-cli cluster nodes
Screenshot 2026-02-16 at 3 05 47 PM

Comment thread test/e2e/valkeycluster_test.go
@ysqyang ysqyang force-pushed the failover-fix branch 2 times, most recently from a913d7c to 8029164 Compare February 17, 2026 18:12
@ysqyang ysqyang marked this pull request as ready for review February 17, 2026 18:12
@jdheyburn
Copy link
Copy Markdown
Collaborator

On the whole the code looks sound, but I will let the better Cluster experts comment first

Comment thread internal/controller/utils.go Outdated
// primary, regardless of its node-index label. This handles the post-failover
// case where node-index=1 (or higher) was promoted by Valkey.
// Returns ("", "") if no primary is found.
func findShardPrimary(state *valkey.ClusterState, shardIndex int, selfAddress string, pods *corev1.PodList) (nodeID, ip string) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the argument selfAddress really needed? It's an optimization to save a few cpu cycles right?
The api gets cleaner if we remove it.
Same comment for shardExistsInTopology

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I just realized that both functions are called with a pending node's address, which can't possibly appear in state.Shards (it wasn't part of any shard when the state was captured).

Comment thread test/e2e/valkeycluster_test.go Outdated

// This test was temporarily disabled in PR #54 because the operator
// could not recover from a primary deletion (issue #43). The failover
// fix (shardHasLivePrimary + findShardPrimary) now handles this: when
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment needs to be updated here, shardHasLivePrimary does not exist in this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..also update the PR description with the latest changes, shardHasLivePrimary is mentioned there too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread test/e2e/valkeycluster_test.go Outdated
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang ysqyang force-pushed the failover-fix branch 2 times, most recently from 40c1ea4 to f1431d1 Compare February 19, 2026 06:44
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Copy link
Copy Markdown
Collaborator

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I have run the new testcase multiple times (20+) without failure.
We'll see if we can catch it again.

@bjosv
Copy link
Copy Markdown
Collaborator

bjosv commented Feb 19, 2026

We should remove the pictures when merging to avoid messy git history. Also remove what changed since its mentions shardHasLivePrimary.
I'll merge if you are ok with this PR @sandeepkunusoth ?

@bjosv bjosv merged commit d777c83 into valkey-io:main Feb 19, 2026
4 checks passed
@ysqyang ysqyang deleted the failover-fix branch February 19, 2026 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants