feat: Sequentially update workloads one-by-one by jdheyburn · Pull Request #116 · valkey-io/valkey-operator

jdheyburn · 2026-03-17T14:48:14Z

- ValkeyCluster controller now reconciles spec changes onto ValkeyNode CRs
  one at a time (replicas before primaries, descending node-index order),
  gating each step on the previous node's workload being fully rolled out
- ValkeyNode controller stamps status.observedGeneration on every reconcile
  so ValkeyCluster can detect unprocessed spec changes; tracks rollout
  completion via isWorkloadRolledOut (StatefulSet revision equality /
  Deployment updatedReplicas gate) using APIReader to bypass cache
- Fix upsertService and upsertConfigMap to use controllerutil.CreateOrUpdate,
  preventing update failures on second reconcile after operator restart
- Extract conditionsChanged helper to deduplicate status comparison logic
- Add integration tests for isWorkloadRolledOut, buildClusterValkeyNode
  propagation, condition ObservedGeneration tracking, and rolling-update
  sequencing in the ValkeyCluster controller
- Add e2e tests: StatefulSet resourceVersion stability, ObservedGeneration
  tracking, rolling update readiness gate, workloadType immutability, and
  ValkeyCluster rolling update end-to-end
- Updated status-conditions documentation
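
For readers unfamiliar with the rollout gate mentioned in the description, here is a minimal sketch of what a "StatefulSet revision equality / Deployment updatedReplicas" check could look like. It is not the operator's actual isWorkloadRolledOut: the struct types are hypothetical, trimmed-down stand-ins for the Kubernetes status objects, and the extra generation and readiness conditions are assumptions about what a conservative gate would include.

```go
package main

import "fmt"

// statefulSetView is a hypothetical stand-in for the StatefulSet fields
// a rollout check would read (metadata.generation, spec.replicas, status.*).
type statefulSetView struct {
	Generation         int64
	ObservedGeneration int64
	Replicas           int32
	ReadyReplicas      int32
	CurrentRevision    string
	UpdateRevision     string
}

// deploymentView is the equivalent hypothetical stand-in for a Deployment.
type deploymentView struct {
	Generation         int64
	ObservedGeneration int64
	Replicas           int32
	UpdatedReplicas    int32
	AvailableReplicas  int32
}

// statefulSetRolledOut reports true once the workload controller has seen
// the latest spec and every pod runs the update revision.
func statefulSetRolledOut(s statefulSetView) bool {
	return s.ObservedGeneration >= s.Generation &&
		s.UpdateRevision != "" &&
		s.CurrentRevision == s.UpdateRevision &&
		s.ReadyReplicas == s.Replicas
}

// deploymentRolledOut reports true once all replicas are updated and available.
func deploymentRolledOut(d deploymentView) bool {
	return d.ObservedGeneration >= d.Generation &&
		d.UpdatedReplicas == d.Replicas &&
		d.AvailableReplicas == d.Replicas
}

func main() {
	midRoll := statefulSetView{
		Generation: 2, ObservedGeneration: 2,
		Replicas: 1, ReadyReplicas: 1,
		CurrentRevision: "rev-1", UpdateRevision: "rev-2",
	}
	fmt.Println("mid-roll StatefulSet rolled out:", statefulSetRolledOut(midRoll))

	done := deploymentView{
		Generation: 3, ObservedGeneration: 3,
		Replicas: 1, UpdatedReplicas: 1, AvailableReplicas: 1,
	}
	fmt.Println("finished Deployment rolled out:", deploymentRolledOut(done))
}
```

Comparing observedGeneration against generation first guards against reading a status that predates the latest spec change, which is the same reason the description gives for bypassing the cache with APIReader.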

Note: node-index is currently used to decide whether a node is a replica. A follow-up PR will enhance this to select a live replica instead.

I also think the pods are being rolled too fast, so we might want to revisit the readiness checks, or introduce some additional cluster health checks when reconciling pods, to ensure we're not adding fuel to the fire.
Update: I've got a follow-up PR after this one that will stabilise this.

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn · 2026-03-18T09:54:05Z

+		// Iterate nodeIndex in reverse order (replicas before primary)
+		for nodeIndex := nodesPerShard - 1; nodeIndex >= 0; nodeIndex-- {


A follow-up PR will add logic to select a real replica instead of inferring the role from the nodeIndex.

Added an issue for this:

[Enhancement] Do not assume node role from node index during sequential rolls #123
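
To make the discussion above concrete, the sequencing could be sketched roughly as follows. This is an illustration, not the controller's code: nodeRef, nodeState, rollOrder, and nextNodeToUpdate are hypothetical names, and the sketch assumes that node index 0 is the primary and that the node-index loop is the outer loop, so every replica is visited before any primary.

```go
package main

import "fmt"

// nodeRef identifies one ValkeyNode by shard and node index (hypothetical type).
type nodeRef struct {
	Shard     int
	NodeIndex int
}

// nodeState is a hypothetical summary of what the cluster controller observes.
type nodeState struct {
	SpecCurrent bool // the node already carries the desired spec
	RolledOut   bool // the node's workload has finished rolling
}

// rollOrder walks nodeIndex in descending order so that index 0
// (assumed here to be the primary) is visited last.
func rollOrder(shards, nodesPerShard int) []nodeRef {
	order := make([]nodeRef, 0, shards*nodesPerShard)
	for nodeIndex := nodesPerShard - 1; nodeIndex >= 0; nodeIndex-- {
		for shard := 0; shard < shards; shard++ {
			order = append(order, nodeRef{Shard: shard, NodeIndex: nodeIndex})
		}
	}
	return order
}

// nextNodeToUpdate returns the single node to update in this reconcile.
// It returns false when an earlier node is still rolling or nothing is left.
func nextNodeToUpdate(order []nodeRef, state map[nodeRef]nodeState) (nodeRef, bool) {
	for _, n := range order {
		st := state[n]
		if !st.SpecCurrent {
			return n, true // first outdated node: update it, then stop
		}
		if !st.RolledOut {
			return nodeRef{}, false // gate: wait for the previous node
		}
	}
	return nodeRef{}, false
}

func main() {
	order := rollOrder(2, 2)
	fmt.Println("order:", order)

	state := map[nodeRef]nodeState{
		{Shard: 0, NodeIndex: 1}: {SpecCurrent: true, RolledOut: true},
	}
	next, ok := nextNodeToUpdate(order, state)
	fmt.Println("next:", next, "proceed:", ok)
}
```

Returning at most one node per pass is what keeps the roll sequential: the controller has to be reconciled again, and see the previous workload rolled out, before it touches the next node.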

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

bjosv

Looks good!

This PR enhances the readiness probe checks so that nodes must be in a ready state before the sequential roll introduced in valkey-io#116 progresses. It does this by checking node liveness via a Running status field. Once the node is live, the controller attempts to get the node to rejoin the cluster, regardless of whether a volume is set.

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn force-pushed the jdheyburn/feat/update-existing-workloads branch from 42149fb to c044062 Compare March 17, 2026 20:25

chore: extend e2e timeout

8acf166

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn commented Mar 18, 2026

jdheyburn marked this pull request as ready for review March 18, 2026 09:54

bjosv reviewed Mar 20, 2026

jdheyburn added 2 commits March 20, 2026 11:09

Review

28f9383

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

Update status condition docs

0944722

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn requested a review from bjosv March 20, 2026 11:57

bjosv approved these changes Mar 20, 2026

jdheyburn merged commit c92a1ef into valkey-io:main Mar 20, 2026
7 checks passed

jdheyburn mentioned this pull request Mar 30, 2026

feat: Safer cluster node rolls #120

Draft

jdheyburn merged 4 commits into valkey-io:main from jdheyburn:jdheyburn/feat/update-existing-workloads
