Handle post-failover pod replacement gracefully #86
Conversation
On the whole the code looks sound, but I will let those with more Cluster expertise comment first.
// primary, regardless of its node-index label. This handles the post-failover
// case where node-index=1 (or higher) was promoted by Valkey.
// Returns ("", "") if no primary is found.
func findShardPrimary(state *valkey.ClusterState, shardIndex int, selfAddress string, pods *corev1.PodList) (nodeID, ip string) {
Is the argument selfAddress really needed? It's an optimization to save a few CPU cycles, right?
The API gets cleaner if we remove it.
Same comment for shardExistsInTopology.
Good catch. I just realized that both functions are called with a pending node's address, which can't possibly appear in state.Shards (it wasn't part of any shard when the state was captured).
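For reference, a minimal sketch of findShardPrimary with the selfAddress argument dropped might look like the following. The ClusterState/Shard/Node shapes and the "shard-index" label key are placeholder assumptions for illustration, not the operator's real types.

```go
package valkey

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// Assumed placeholder shapes for the live topology; the operator's real
// types in the valkey package may differ.
type Node struct {
	ID    string
	IP    string
	Role  string // "master" or "replica"
	Slots int    // number of slots this node owns
}

type Shard struct {
	Nodes []Node
}

type ClusterState struct {
	Shards []Shard
}

// findShardPrimary returns the node ID and IP of whichever pod in the given
// shard the live topology reports as a slot-bearing primary, regardless of
// its node-index label. Returns ("", "") if no primary is found.
func findShardPrimary(state *ClusterState, shardIndex int, pods *corev1.PodList) (nodeID, ip string) {
	for _, pod := range pods.Items {
		// Only consider pods that belong to the requested shard.
		// ("shard-index" is an assumed label key for this sketch.)
		if pod.Labels["shard-index"] != strconv.Itoa(shardIndex) {
			continue
		}
		// Match the pod's IP against the live topology and pick the
		// slot-bearing primary, whatever its node-index label says.
		for _, shard := range state.Shards {
			for _, node := range shard.Nodes {
				if node.IP == pod.Status.PodIP && node.Role == "master" && node.Slots > 0 {
					return node.ID, node.IP
				}
			}
		}
	}
	return "", ""
}
```

Dropping selfAddress keeps the signature minimal; as noted above, a pending pod's address can never match a topology entry, so no explicit self-exclusion is needed.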
// This test was temporarily disabled in PR #54 because the operator
// could not recover from a primary deletion (issue #43). The failover
// fix (shardHasLivePrimary + findShardPrimary) now handles this: when
The comment needs to be updated here; shardHasLivePrimary does not exist in this PR.
Also update the PR description with the latest changes; shardHasLivePrimary is mentioned there too.
bjosv left a comment
LGTM, I have run the new testcase multiple times (20+) without failure.
We'll see if we can catch it again.
We should remove the pictures when merging to avoid a messy git history. Also remove the "What changed" section, since it mentions shardHasLivePrimary.
Summary
When Valkey promotes a replica to primary during automatic failover and Kubernetes recreates the old node-index=0 pod, the reconciler previously tried to assign a new slot range to the replacement pod. This failed in a loop with "no slots range to assign" because all 16384 slots were already owned by the promoted replica, leaving the cluster stuck in Degraded/NodeAddFailed.

Fix: before assigning slots, assignSlotsToPendingPrimaries now checks the live Valkey topology (shardExistsInTopology) for any existing member in the same shard. If one exists (post-failover or mid-failover), the replacement pod joins as a replica instead. replicateToShardPrimary also gains a fallback (findShardPrimary) that scans all shard pods for the actual primary, since after failover node-index=0 is no longer the Valkey primary.
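As a rough, non-authoritative illustration of this check (not the actual controller code), a sketch under assumed types and callback signatures could look like this:

```go
package valkey

import "fmt"

// Assumed placeholder: the live topology keyed by shard index, listing the
// IPs of members that already belong to that shard.
type ClusterState struct {
	Shards map[int][]string
}

// shardExistsInTopology reports whether the shard already has a live member
// (for example the replica that Valkey promoted during failover).
func shardExistsInTopology(state *ClusterState, shardIndex int) bool {
	return len(state.Shards[shardIndex]) > 0
}

// assignSlotsToPendingPrimaries assigns slot ranges only to genuinely new
// shards; replacement pods for shards that already exist in the topology
// fall through to the replica path instead. The callback parameters are
// assumptions standing in for the operator's real join/assign logic.
func assignSlotsToPendingPrimaries(
	state *ClusterState,
	pending map[int]string, // shard index -> pending pod IP
	joinAsReplica func(shardIndex int, podIP string) error,
	assignSlots func(shardIndex int, podIP string) error,
) error {
	for shardIndex, podIP := range pending {
		if shardExistsInTopology(state, shardIndex) {
			// Post- or mid-failover: the promoted replica already owns this
			// shard's slots, so the replacement joins as a replica.
			if err := joinAsReplica(shardIndex, podIP); err != nil {
				return fmt.Errorf("shard %d: join as replica: %w", shardIndex, err)
			}
			continue
		}
		if err := assignSlots(shardIndex, podIP); err != nil {
			return fmt.Errorf("shard %d: assign slots: %w", shardIndex, err)
		}
	}
	return nil
}
```

The key design point is that slot assignment only happens when the shard is absent from the live topology, so a recreated node-index=0 pod never competes with the promoted replica for the slots it already owns.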
What changed

utils.go — two new helpers:
- shardHasLivePrimary: checks if another pod in the same shard-index group is a slot-bearing primary in the live topology
- findShardPrimary: scans all pods in a shard to find the actual Valkey primary regardless of node-index label

valkeycluster_controller.go:
- addValkeyNode step 2 now detects post-failover replacements and falls through to the replica path
- replicateToShardPrimary tries node-index=0 first (fast path), then falls back to findShardPrimary (see the sketch after this list)
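As a rough illustration of the fallback described in the last bullet, a minimal sketch could look like the following. The topology interface, primaryInfo helper, clusterName parameter, and pod naming are assumptions for this sketch, not the operator's actual API.

```go
package valkey

import (
	"errors"
	"fmt"
)

// topology stands in for the lookups the operator performs against the live
// CLUSTER NODES output; both methods are placeholders for this sketch.
type topology interface {
	// primaryInfo returns the node ID and IP if the named pod is currently a
	// slot-bearing primary, or ok=false otherwise.
	primaryInfo(podName string) (nodeID, ip string, ok bool)
	// findShardPrimary scans every pod in the shard for the live primary.
	findShardPrimary(shardIndex int) (nodeID, ip string)
}

// replicateToShardPrimary resolves which node a new pod should replicate.
// Pod naming follows the sample names in the test plan and is an assumption.
func replicateToShardPrimary(topo topology, clusterName string, shardIndex int) (nodeID, ip string, err error) {
	// Fast path: before any failover, node-index=0 is the shard primary.
	fastPath := fmt.Sprintf("%s-%d-0", clusterName, shardIndex)
	if id, addr, ok := topo.primaryInfo(fastPath); ok {
		return id, addr, nil
	}
	// Fallback: after a failover any pod in the shard may be the primary,
	// so scan all shard pods regardless of node-index label.
	if id, addr := topo.findShardPrimary(shardIndex); id != "" {
		return id, addr, nil
	}
	return "", "", errors.New("no live primary found in shard")
}
```

The fast path keeps the common, no-failover case cheap, while the scan covers whatever arrangement the failover left behind.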
Test plan

1. Deploy a 3-shard, 1-replica cluster and wait for Ready
2. Kill the shard-0 primary
3. Verify valkeycluster-sample-0-1 is now the primary and valkeycluster-sample-0-0 is its replica:
   kubectl exec valkeycluster-sample-0-1 -- valkey-cli cluster nodes