fix: allow slot assignment for single-node clusters (#131) #135
Conversation
Just a thought, would users want to have 1 shard (primary) with 1 replica, and expand out from there? What would we need to change in the code to support this case too, alongside 1 shard with 0 replicas?
In that case there would be 2 nodes, so that already works 👍
A `ValkeyCluster` with 1 shard and 0 replicas gets stuck in an infinite reconciliation loop because `assignSlotsToPendingPrimaries()` skips isolated nodes, and a single-node cluster is permanently isolated with no peer to MEET. Skip the isolation guard when the expected cluster is a single node.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
Force-pushed from ea405f7 to 5c8e308
@daanvinken I did some testing locally, and I went from 1 shard 0 replicas to 1 shard 1 replica, and the replica was not able to join. Also I went from 1 shard 0 replicas to 2 shards 0 replicas, and the second shard was not able to join. If these use cases are not yet supported, that's fine - maybe we can raise Issues for these to be able to support them in future?
Thanks for testing! Seemingly these are pre-existing scale-up issues. This PR only changes the initial slot assignment for a 1-shard-0-replica cluster. Both cases seem to get stuck in the gossip-waiting loops, before reaching any code this PR touches.

These would likely reproduce when scaling any cluster configuration, not just single-node origins. I will open issue(s) for both after some investigation 👍
I think we should add scaling E2E tests as well; I'll have a look.
@daanvinken Are those E2E tests to be included in this PR?
**Description:**

Scaling a single-node cluster (1 shard, 0 replicas) to add a replica gets stuck in an infinite reconciliation loop. `findMeetTarget()` skips shard primaries with `cluster_known_nodes <= 1`, so the new replica is never `CLUSTER MEET`'d to the primary. `CLUSTER REPLICATE` fails with "Unknown node" on every reconciliation.

The same bug also prevents adding a new shard (1 shard -> 2 shards), since the new shard's primary can't MEET the existing isolated primary either.

A shard primary that owns slots is always a valid `MEET` target regardless of `cluster_known_nodes`. This removes the `IsIsolated()` guard for shard primaries.

Discovered while investigating feedback on #135.

**Testing:**

Reproduced both cases on a local kind cluster: create a 1-shard-0-replica cluster, wait for Ready, then scale up.

**Case 1: add replica** (`replicas: 0` -> `replicas: 1`)

Before fix - cluster stuck at Reconciling, replica never `MEET`'d:

```
$ kubectl get valkeycluster
NAME         STATE         REASON
test-scale   Reconciling   Reconciling

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME             READY   ROLE
test-scale-0-0   true    primary
test-scale-0-1   true    primary

$ kubectl exec statefulset/valkey-test-scale-0-1 -c server -- valkey-cli CLUSTER INFO | grep cluster_known_nodes
cluster_known_nodes:1
```

Operator logs repeating every 2s:

```
DEBUG replica does not yet know primary (gossip pending); will retry replica=10.244.0.9 primaryId=3441aa9a...
DEBUG skipping replica; primary not ready yet
DEBUG missing replicas, requeue..
```

After fix - MEET succeeds, cluster reaches Ready:

```
$ kubectl get valkeycluster
NAME         STATE   REASON
test-scale   Ready   ClusterHealthy

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME             READY   ROLE
test-scale-0-0   true    primary
test-scale-0-1   true    replica

$ kubectl exec statefulset/valkey-test-scale-0-0 -c server -- valkey-cli CLUSTER NODES
3441aa9a... 10.244.0.8:6379@16379 myself,master - 0 0 0 connected 0-16383
bd334739... 10.244.0.9:6379@16379 slave 3441aa9a... 0 1776858716401 0 connected
```

Operator logs:

```
DEBUG meet node node=10.244.0.9 target=10.244.0.8
DEBUG events Introduced 1 isolated node(s) to the cluster
```

**Case 2: add shard** (`shards: 1` -> `shards: 2`)

After fix - new shard joins and slots are rebalanced:

```
$ kubectl get valkeycluster
NAME          STATE   REASON
test-scale2   Ready   ClusterHealthy

$ kubectl get valkeynodes -o custom-columns=NAME:.metadata.name,READY:.status.ready,ROLE:.status.role
NAME              READY   ROLE
test-scale2-0-0   true    primary
test-scale2-1-0   true    primary

$ kubectl exec statefulset/valkey-test-scale2-0-0 -c server -- valkey-cli CLUSTER NODES
a6e1a811... 10.244.0.13:6379@16379 master - 0 1776860233046 1 connected 0-8191
e9dae06c... 10.244.0.12:6379@16379 myself,master - 0 0 0 connected 8192-16383
```

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
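For illustration, the shape of the `findMeetTarget()` change is roughly the following Go sketch. The `Node` type and its helpers are assumptions made for this sketch; only `findMeetTarget`, `IsIsolated`, and `cluster_known_nodes` come from the change description above.

```go
// Illustrative sketch only: the Node type and its helpers are assumptions,
// not the operator's real API.
type Node struct {
	Role       string // "primary" or "replica"
	Slots      int    // number of slots this node owns
	KnownNodes int    // cluster_known_nodes from CLUSTER INFO
}

func (n *Node) IsPrimary() bool  { return n.Role == "primary" }
func (n *Node) OwnsSlots() bool  { return n.Slots > 0 }
func (n *Node) IsIsolated() bool { return n.KnownNodes <= 1 }

// findMeetTarget picks the node that new members are introduced to.
func findMeetTarget(nodes []*Node) *Node {
	for _, n := range nodes {
		if !n.IsPrimary() || !n.OwnsSlots() {
			continue
		}
		// Before the fix this also required !n.IsIsolated(), which
		// excluded the lone primary of a 1-shard-0-replica cluster
		// (cluster_known_nodes == 1) and left new nodes with nothing to
		// MEET. A slot-owning shard primary is always a valid MEET
		// target, so the isolation guard is dropped.
		return n
	}
	return nil
}
```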
## Description

Adds e2e coverage for scaling a 1-shard-0-replica cluster. This was missing and led to the bugs found in #135 / #147.

Two tests:

- Scale from 1 shard / 0 replicas to 1 shard / 1 replica (add replica)
- Scale from 1 shard / 0 replicas to 2 shards / 0 replicas (add shard)

Both verify the cluster reaches Ready with correct `cluster_known_nodes` and `cluster_size`.

## Testing

Tests compile. The pattern follows the existing `rebalances slots on scale out` e2e test. Validated locally by pointing the local kube context at a kind cluster.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
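For shape only, the add-replica case might look roughly like the Ginkgo-style sketch below. Ginkgo is an assumption inferred from the `rebalances slots on scale out` test name, and `createCluster`, `scaleCluster`, `eventuallyReady`, and `clusterInfo` are hypothetical stand-ins for the repo's actual e2e helpers.

```go
// Hypothetical Ginkgo-style sketch; every helper function here is an
// illustrative stand-in, not the repo's real e2e utility.
It("scales 1 shard / 0 replicas to 1 shard / 1 replica", func() {
	vc := createCluster("test-scale", 1, 0) // shards, replicas
	eventuallyReady(vc)

	// Trigger the add-replica path fixed in #147.
	scaleCluster(vc, 1, 1)
	eventuallyReady(vc)

	// Both members know each other; still a single slot-owning shard.
	info := clusterInfo(vc, "test-scale-0-0")
	Expect(info["cluster_known_nodes"]).To(Equal("2"))
	Expect(info["cluster_size"]).To(Equal("1"))
})
```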
This PR closes #131
## Summary

A `ValkeyCluster` with 1 shard and 0 replicas gets stuck in an infinite reconciliation loop because `assignSlotsToPendingPrimaries()` skips isolated nodes, and a single-node cluster is permanently isolated with no peer to MEET.
It seems there's a deadlock between `meetIsolatedNodes()` and `assignSlotsToPendingPrimaries()`:

- When a single node starts up with `cluster_known_nodes = 1`, it's considered "isolated".
- `meetIsolatedNodes()` picks a meet target via `findMeetTarget()`. Since there are no non-isolated nodes, it falls back to `isolated[0]`, i.e. the only node. Then, because `meetTarget == isolated[0]`, the node is removed from the list, so the loop iterates over an empty slice and no MEET is ever issued.
- `assignSlotsToPendingPrimaries()` skips isolated nodes (`cluster_known_nodes <= 1`). Since no MEET happened, `cluster_known_nodes` is still 1, the node is skipped, and no slots are assigned.
- The reconciler sees unassigned slots and requeues forever.
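A rough Go reconstruction of that dead end (only `meetIsolatedNodes()`, `findMeetTarget()`, and the `isolated[0]` fallback come from the description above; the `node` type and `clusterMeet` are illustrative):

```go
// Illustrative reconstruction of the single-node dead end; the real
// operator code differs in detail.
type node struct{ addr string }

func clusterMeet(n, target *node) { /* would issue CLUSTER MEET */ }

func meetIsolatedNodes(isolated []*node) {
	// findMeetTarget: with no non-isolated node to choose from, the
	// fallback picks isolated[0], i.e. the only node itself.
	target := isolated[0]

	// The meet target is excluded from the nodes still to introduce.
	// With a single node, target == isolated[0], so nothing remains...
	var toMeet []*node
	for _, n := range isolated {
		if n != target {
			toMeet = append(toMeet, n)
		}
	}

	// ...and this loop never executes: no CLUSTER MEET is issued,
	// cluster_known_nodes stays at 1, assignSlotsToPendingPrimaries()
	// keeps skipping the node as isolated, and the reconciler requeues
	// forever.
	for _, n := range toMeet {
		clusterMeet(n, target)
	}
}
```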
IIUC the isolation guard exists to prevent a pod in a multi-node cluster from getting slots before it's been introduced to its peers. But in a single-node cluster (1 shard, 0 replicas), isolation is the permanent correct state: there's no other node to MEET.
## Implementation
I considered fixing this in `meetIsolatedNodes()` itself, but that's a dead end: even if we made the single node attempt to MEET itself, `CLUSTER MEET` with your own address is a no-op, so that wouldn't help.

The simpler fix is in `assignSlotsToPendingPrimaries()`: if the whole cluster is one node, isolation is the correct permanent state and there's nothing to protect against.
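A minimal sketch of that fix, assuming a hypothetical `expectedNodes` parameter and helper names (only `assignSlotsToPendingPrimaries()` and the `cluster_known_nodes <= 1` guard come from the actual code):

```go
// Illustrative sketch; names other than assignSlotsToPendingPrimaries()
// and the cluster_known_nodes guard are assumptions.
type primary struct{ knownNodes int } // cluster_known_nodes from CLUSTER INFO

func (p *primary) isIsolated() bool { return p.knownNodes <= 1 }

func assignSlots(p *primary) error { /* would issue CLUSTER ADDSLOTS */ return nil }

func assignSlotsToPendingPrimaries(pending []*primary, expectedNodes int) error {
	for _, p := range pending {
		// The isolation guard protects multi-node clusters: a pod must
		// MEET its peers before it may own slots. When the whole cluster
		// is expected to be a single node, isolation is the permanent
		// correct state, so the guard must not apply.
		if expectedNodes > 1 && p.isIsolated() {
			continue
		}
		if err := assignSlots(p); err != nil {
			return err
		}
	}
	return nil
}
```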
## Testing