fix: allow MEET against isolated shard primaries during scale-up #147
Conversation
```go
// isolated (fresh bootstrap, first reconcile).
func findMeetTarget(state *valkey.ClusterState, isolated []*valkey.NodeState) *valkey.NodeState {
	for _, shard := range state.Shards {
		if p := shard.GetPrimaryNode(); p != nil && !p.IsIsolated() {
```
Admittedly it feels a bit weird removing this check, but a shard primary (in state.Shards) always has slots assigned, meaning it went through the full slot assignment flow and is a legitimate cluster member.
The only scenario where such a primary has `cluster_known_nodes <= 1` is a single-node cluster being scaled up. Network partitions don't cause this, because `cluster_known_nodes` counts remembered peers even if they're in a failed state, and fresh nodes can't get slots while isolated thanks to the guard in `assignSlotsToPendingPrimaries()`.
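To make the reasoning concrete, here is a minimal sketch of what the function looks like with the guard removed. Only the loop shape and the `isolated[0]` fallback come from the diff and the discussion above; everything else is assumed.

```go
// Sketch of findMeetTarget after the fix (assumed shape; only the
// removal of the IsIsolated() check is confirmed by the diff above).
func findMeetTarget(state *valkey.ClusterState, isolated []*valkey.NodeState) *valkey.NodeState {
	for _, shard := range state.Shards {
		// A primary in state.Shards owns slots, so it went through the
		// full slot assignment flow and is a valid MEET target even when
		// its cluster_known_nodes is still 1 (single-node scale-up).
		if p := shard.GetPrimaryNode(); p != nil {
			return p
		}
	}
	// Fresh bootstrap: no shard primaries exist yet, so fall back to
	// the first isolated node and let the others MEET it.
	if len(isolated) > 0 {
		return isolated[0]
	}
	return nil
}
```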
Thanks for raising this! I did a test locally, and it looks like there is a regression from the previous PR where 1 shard with 0 replicas doesn't produce a healthy cluster. When I added a replica or shard to it, that seemed to fix it. Do you have the same on your side?
`findMeetTarget()` skipped shard primaries with `cluster_known_nodes <= 1` (`IsIsolated`). When scaling a single-node cluster (1 shard, 0 replicas) to add a replica, the existing primary has `cluster_known_nodes:1` and was skipped. `findMeetTarget` fell through to `isolated[0]` (the new replica), which the caller then trimmed from the iteration list, causing MEET to iterate an empty slice. The replica was never introduced to the primary, and `CLUSTER REPLICATE` failed with "Unknown node" on every reconcile.

A shard primary that owns slots is always a valid MEET target regardless of `cluster_known_nodes`. Remove the `IsIsolated` guard for shard primaries.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
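A hypothetical reconstruction of the failure mode described above (the variable and helper names are illustrative, not taken from the repo):

```go
// With 1 shard / 0 replicas, the only shard primary is skipped as
// "isolated", so the target becomes the new replica itself (isolated[0]).
target := findMeetTarget(state, isolated)

// The caller trims the target out of the nodes it is about to MEET...
var toMeet []*valkey.NodeState
for _, n := range isolated {
	if n != target {
		toMeet = append(toMeet, n)
	}
}

// ...so toMeet is empty, no CLUSTER MEET is ever issued, and
// CLUSTER REPLICATE keeps failing with "Unknown node" every reconcile.
for _, n := range toMeet {
	meetNode(n, target) // hypothetical helper issuing CLUSTER MEET
}
```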
Force-pushed from 9f33994 to cd304d7
The 1s/0r creation issue you're seeing is #135, which has been merged. I've rebased this PR onto the latest main, which includes it. The initial creation should work now, and this PR adds the scale-up fix on top. I tested it locally on this branch; if you are still able to reproduce, please share the exact config 🙏
Ah yeah sorry, I assumed that #135 was included in this branch. I did a round of testing and it works now - thanks for looking into this!
## Description

Adds e2e coverage for scaling a 1-shard-0-replica cluster. This was missing and led to the bugs found in #135 / #147. Two tests (a sketch of the first follows below):

- Scale from 1 shard / 0 replicas to 1 shard / 1 replica (add replica)
- Scale from 1 shard / 0 replicas to 2 shards / 0 replicas (add shard)

Both verify the cluster reaches Ready with correct `cluster_known_nodes` and `cluster_size`.

## Testing

Tests compile. Pattern follows the existing `rebalances slots on scale out` e2e test. Validated locally by pointing context to kind cluster.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
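A rough sketch of what the add-replica case could look like, assuming a Ginkgo/Gomega e2e suite with a controller-runtime client; the `createCluster`, `waitForReady`, and `clusterInfo` helpers are hypothetical names, and the real tests follow the existing `rebalances slots on scale out` pattern:

```go
It("scales from 1 shard / 0 replicas to 1 shard / 1 replica", func() {
	// Start from the previously untested single-node topology.
	cluster := createCluster(ctx, 1 /* shards */, 0 /* replicas */)
	waitForReady(ctx, cluster)

	// Add a replica and push the spec update.
	cluster.Spec.Replicas = 1
	Expect(k8sClient.Update(ctx, cluster)).To(Succeed())

	// The cluster must converge back to Ready with both nodes knowing
	// each other and a single shard owning all slots.
	waitForReady(ctx, cluster)
	info := clusterInfo(ctx, cluster) // parsed CLUSTER INFO from any node
	Expect(info["cluster_known_nodes"]).To(Equal("2"))
	Expect(info["cluster_size"]).To(Equal("1"))
})
```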
fix: allow MEET against isolated shard primaries during scale-up (valkey-io#147)

**Description:**

Scaling a single-node cluster (1 shard, 0 replicas) to add a replica gets stuck in an infinite reconciliation loop. `findMeetTarget()` skips shard primaries with `cluster_known_nodes <= 1`, so the new replica is never `CLUSTER MEET`'d to the primary. `CLUSTER REPLICATE` fails with "Unknown node" on every reconciliation. The same bug also prevents adding a new shard (1 shard -> 2 shards), since the new shard's primary can't MEET the existing isolated primary either.

A shard primary that owns slots is always a valid `MEET` target regardless of `cluster_known_nodes`. This removes the `IsIsolated()` guard for shard primaries. Discovered while investigating feedback on valkey-io#135.

**Testing:**

Reproduced both cases on a local kind cluster: create a 1-shard-0-replica cluster, wait for Ready, then scale up.

**Case 1: add replica** (`replicas: 0` -> `replicas: 1`)

Before fix - cluster stuck at Reconciling, replica never `MEET`'d:

```
$ kubectl get valkeycluster
NAME         STATE         REASON
test-scale   Reconciling   Reconciling

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME             READY   ROLE
test-scale-0-0   true    primary
test-scale-0-1   true    primary

$ kubectl exec statefulset/valkey-test-scale-0-1 -c server -- valkey-cli CLUSTER INFO | grep cluster_known_nodes
cluster_known_nodes:1
```

Operator logs repeating every 2s:

```
DEBUG replica does not yet know primary (gossip pending); will retry replica=10.244.0.9 primaryId=3441aa9a...
DEBUG skipping replica; primary not ready yet
DEBUG missing replicas, requeue..
```

After fix - MEET succeeds, cluster reaches Ready:

```
$ kubectl get valkeycluster
NAME         STATE   REASON
test-scale   Ready   ClusterHealthy

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME             READY   ROLE
test-scale-0-0   true    primary
test-scale-0-1   true    replica

$ kubectl exec statefulset/valkey-test-scale-0-0 -c server -- valkey-cli CLUSTER NODES
3441aa9a... 10.244.0.8:6379@16379 myself,master - 0 0 0 connected 0-16383
bd334739... 10.244.0.9:6379@16379 slave 3441aa9a... 0 1776858716401 0 connected
```

Operator logs:

```
DEBUG meet node node=10.244.0.9 target=10.244.0.8
DEBUG events Introduced 1 isolated node(s) to the cluster
```

**Case 2: add shard** (`shards: 1` -> `shards: 2`)

After fix - new shard joins and slots are rebalanced:

```
$ kubectl get valkeycluster
NAME          STATE   REASON
test-scale2   Ready   ClusterHealthy

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME              READY   ROLE
test-scale2-0-0   true    primary
test-scale2-1-0   true    primary

$ kubectl exec statefulset/valkey-test-scale2-0-0 -c server -- valkey-cli CLUSTER NODES
a6e1a811... 10.244.0.13:6379@16379 master - 0 1776860233046 1 connected 0-8191
e9dae06c... 10.244.0.12:6379@16379 myself,master - 0 0 0 connected 8192-16383
```

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
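For readers less familiar with the cluster bus, the dependency chain the fix restores can be shown as a small standalone snippet. This uses go-redis/v9 purely as an illustration (it is not the operator's actual client code), with the addresses and node ID taken from the repro above:

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Connect to the new replica.
	rdb := redis.NewClient(&redis.Options{Addr: "10.244.0.9:6379"})

	// Step 1: introduce the replica to the primary. Without this, the
	// replica's cluster_known_nodes stays at 1 and step 2 can never succeed.
	if err := rdb.ClusterMeet(ctx, "10.244.0.8", "6379").Err(); err != nil {
		log.Fatalf("CLUSTER MEET failed: %v", err)
	}

	// Step 2: CLUSTER REPLICATE <id> only succeeds once the replica has
	// learned the primary's node ID via the MEET handshake/gossip;
	// otherwise the server answers "ERR Unknown node", which is exactly
	// the loop the operator was stuck in. In practice the operator
	// retries until the handshake completes (the "gossip pending" log).
	primaryID := "3441aa9a..." // placeholder; use the full ID from CLUSTER NODES
	if err := rdb.ClusterReplicate(ctx, primaryID).Err(); err != nil {
		log.Fatalf("CLUSTER REPLICATE failed: %v", err)
	}
}
```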