Skip to content

Fix slot-range parsing during migration and add rebalance planning logic#92

Merged
bjosv merged 2 commits into
valkey-io:mainfrom
ysqyang:cluster-state-classification
Feb 27, 2026
Merged

Fix slot-range parsing during migration and add rebalance planning logic#92
bjosv merged 2 commits into
valkey-io:mainfrom
ysqyang:cluster-state-classification

Conversation

@ysqyang
Copy link
Copy Markdown
Contributor

@ysqyang ysqyang commented Feb 24, 2026

Summary

  • Fix parseSlotsRanges to skip [slot->-nodeid] / [slot-<-nodeid] entries that Valkey appends to CLUSTER NODES output during active slot migrations, which previously caused parse errors.
  • Add BuildRebalanceMove — a deterministic, incremental slot rebalance planner that computes a single slot migration from an overloaded primary to an underloaded (or empty) one. This is the planning layer for scale-out support; execution is wired in a follow-up PR.

Test plan

  • Unit test for parseSlotsRanges: normal ranges, migrating/importing entries, empty input, migration-only input
  • Unit tests for PlanRebalanceMove: scale-out (2→3 shards), balanced cluster (no-op), mismatched shard count, extra empty shards, zero max-slots, slot extraction from ranges

@ysqyang ysqyang force-pushed the cluster-state-classification branch from 34325d0 to 78cb392 Compare February 24, 2026 22:13
@ysqyang ysqyang changed the title Refine node classification in GetClusterState for scale-out Add slot-parsing fix and rebalance planning logic Feb 24, 2026
@ysqyang ysqyang force-pushed the cluster-state-classification branch 4 times, most recently from ede2f24 to e728115 Compare February 25, 2026 07:40
@ysqyang ysqyang changed the title Add slot-parsing fix and rebalance planning logic Fix slot-range parsing during migration and add rebalance planning logic Feb 25, 2026
@ysqyang ysqyang marked this pull request as ready for review February 25, 2026 19:06
@ysqyang ysqyang force-pushed the cluster-state-classification branch 4 times, most recently from 170677e to ea75281 Compare February 26, 2026 06:14
Copy link
Copy Markdown
Collaborator

@jdheyburn jdheyburn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though I know nothing about rebalancing. I feel having clearer variable names would make it easier for a non-Cluster person like me to understand it more, which is where my comments are around.

If there are opportunities to extract a block of code into a function behind a descriptive name, that would be great too!

Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/valkey/rebalance.go Outdated
Comment thread internal/controller/valkeycluster_controller.go Outdated
Comment thread internal/controller/valkeycluster_controller.go Outdated
Comment thread internal/valkey/cluster_rebalance.go
@ysqyang ysqyang force-pushed the cluster-state-classification branch 4 times, most recently from 6621f41 to 8cb1309 Compare February 26, 2026 18:50
Copy link
Copy Markdown
Collaborator

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some comments.
Feels like the "Tolerate ±1-slot rounding differences" is a bit tricky.

Comment thread internal/valkey/cluster_rebalance.go
Comment thread internal/valkey/cluster_rebalance.go Outdated

// BuildRebalanceMove computes a single, deterministic slot move to improve balance.
// It returns nil when the cluster is already balanced or not ready for rebalancing.
func BuildRebalanceMove(shards []*ShardState, expectedShards int, maxSlots int) (*SlotMove, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We build a SlotMove in this function, is the name of this function misguiding?
Maybe its a PlanRebalanceMove?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, the shards []*ShardState, is that from the current ClusterState, or is it planned to be a subset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, the shards []*ShardState, is that from the current ClusterState, or is it planned to be a subset?

It's actually a superset of current ClusterState, which will become apparent in a follow-up PR.

type SlotMove struct {
Src *NodeState
Dst *NodeState
Slots []int
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR; but maybe we should have a defined type for a Slot in clusterstate.go (or alias).
That might avoid any confusion if an int means a slot, a number of slots, a length or size.

Comment thread internal/valkey/cluster_rebalance.go Outdated
Comment thread internal/valkey/cluster_rebalance.go Outdated
Comment thread internal/valkey/cluster_rebalance.go Outdated
@ysqyang ysqyang force-pushed the cluster-state-classification branch 2 times, most recently from 724ec1f to 665422a Compare February 27, 2026 16:01
Copy link
Copy Markdown
Collaborator

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are improvement possibilities, but I think we can address them in coming PRs.

@ysqyang ysqyang force-pushed the cluster-state-classification branch 3 times, most recently from 760cab6 to 8766ab2 Compare February 27, 2026 18:50
Two independent but related changes for scale-out support:

1. Skip migrating/importing entries (e.g. "[5461->-abc123]") in
   parseSlotsRanges to prevent parse errors when CLUSTER NODES output
   contains in-progress slot migrations. Adds unit tests.

2. Introduce PlanRebalanceMove which computes a single, deterministic
   slot move to incrementally rebalance a cluster after scale-out.
   Calculates per-primary slot targets (16384 / shards), identifies the
   most-loaded source and least-loaded destination, and returns a
   bounded SlotMove. Returns one move per call so each reconcile loop
   stays fast and restartable. Includes unit tests.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang ysqyang force-pushed the cluster-state-classification branch from 8766ab2 to e094421 Compare February 27, 2026 18:57
… logic

- Rename rebalance.go → cluster_rebalance.go (and test file)
- Rename BuildRebalanceMove → PlanRebalanceMove
- Improve variable names for clarity (primarySlots fields, loop vars)
- Move ±1 tolerance check into slot allocation loop to avoid masking real imbalances
- Remove unused error return from takeSlotsFromRanges
- Add remainder distribution comment in assignSlotsToPendingPrimaries

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang ysqyang force-pushed the cluster-state-classification branch from e094421 to 4d3314b Compare February 27, 2026 19:13
@bjosv bjosv merged commit 39fa3f5 into valkey-io:main Feb 27, 2026
4 checks passed
sandeepkunusoth pushed a commit to sandeepkunusoth/valkey-k8s-operator that referenced this pull request Feb 28, 2026
…gic (valkey-io#92)

- Fix `parseSlotsRanges` to skip `[slot->-nodeid]` / `[slot-<-nodeid]`
  entries that Valkey appends to `CLUSTER NODES` output during active slot
  migrations, which previously caused parse errors.
- Add `BuildRebalanceMove` — a deterministic, incremental slot rebalance
  planner that computes a single slot migration from an overloaded primary
  to an underloaded (or empty) one. This is the planning layer for
  scale-out support; execution is wired in a follow-up PR.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Co-authored-by: yang.qiu <yang.qiu@reddit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants