Skip to content

feat: Sequentially update workloads one-by-one#116

Merged
jdheyburn merged 4 commits into
valkey-io:mainfrom
jdheyburn:jdheyburn/feat/update-existing-workloads
Mar 20, 2026
Merged

feat: Sequentially update workloads one-by-one#116
jdheyburn merged 4 commits into
valkey-io:mainfrom
jdheyburn:jdheyburn/feat/update-existing-workloads

Conversation

@jdheyburn
Copy link
Copy Markdown
Collaborator

@jdheyburn jdheyburn commented Mar 17, 2026

  • ValkeyCluster controller now reconciles spec changes onto ValkeyNode CRs
    one at a time (replicas before primaries, descending node-index order),
    gating each step on the previous node's workload being fully rolled out
  • ValkeyNode controller stamps status.observedGeneration on every reconcile
    so ValkeyCluster can detect unprocessed spec changes; tracks rollout
    completion via isWorkloadRolledOut (StatefulSet revision equality /
    Deployment updatedReplicas gate) using APIReader to bypass cache
  • Fix upsertService and upsertConfigMap to use controllerutil.CreateOrUpdate,
    preventing update failures on second reconcile after operator restart
  • Extract conditionsChanged helper to deduplicate status comparison logic
  • Add integration tests for isWorkloadRolledOut, buildClusterValkeyNode
    propagation, condition ObservedGeneration tracking, and rolling-update
    sequencing in the ValkeyCluster controller
  • Add e2e tests: StatefulSet resourceVersion stability, ObservedGeneration
    tracking, rolling update readiness gate, workloadType immutability, and
    ValkeyCluster rolling update end-to-end
  • Updated status-conditions documentation

Note: node-index is used to determine if it is a replica or not. A follow up PR will enhance this to use a live replica instead.

I think also the pods are being rolled too fast, so we might want to revisit the readiness checks, or introduce some additional cluster health checks when reconciling pods to ensure we're not adding fuel to the fire.
Update: I've got a follow up PR after this one that will stabilise this

- ValkeyCluster controller now reconciles spec changes onto ValkeyNode CRs
  one at a time (replicas before primaries, descending node-index order),
  gating each step on the previous node's workload being fully rolled out
- ValkeyNode controller stamps status.observedGeneration on every reconcile
  so ValkeyCluster can detect unprocessed spec changes; tracks rollout
  completion via isWorkloadRolledOut (StatefulSet revision equality /
  Deployment updatedReplicas gate) using APIReader to bypass cache
- Fix upsertService and upsertConfigMap to use controllerutil.CreateOrUpdate,
  preventing update failures on second reconcile after operator restart
- Extract conditionsChanged helper to deduplicate status comparison logic
- Add integration tests for isWorkloadRolledOut, buildClusterValkeyNode
  propagation, condition ObservedGeneration tracking, and rolling-update
  sequencing in the ValkeyCluster controller
- Add e2e tests: StatefulSet resourceVersion stability, ObservedGeneration
  tracking, rolling update readiness gate, workloadType immutability, and
  ValkeyCluster rolling update end-to-end

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn jdheyburn force-pushed the jdheyburn/feat/update-existing-workloads branch from 42149fb to c044062 Compare March 17, 2026 20:25
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Comment on lines +424 to +425
// Iterate nodeIndex in reverse order (replicas before primary)
for nodeIndex := nodesPerShard - 1; nodeIndex >= 0; nodeIndex-- {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow up PR will include logic to select a real replica, instead of via the nodeIndex

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdheyburn jdheyburn marked this pull request as ready for review March 18, 2026 09:54
Comment thread internal/controller/valkeycluster_controller.go Outdated
Comment thread internal/controller/valkeycluster_controller.go
Comment thread internal/controller/valkeycluster_controller.go
Comment thread internal/controller/valkeycluster_controller.go
Comment thread internal/controller/valkeycluster_controller.go
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn jdheyburn requested a review from bjosv March 20, 2026 11:57
Copy link
Copy Markdown
Collaborator

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jdheyburn jdheyburn merged commit c92a1ef into valkey-io:main Mar 20, 2026
7 checks passed
jdheyburn added a commit to jdheyburn/valkey-operator that referenced this pull request Mar 30, 2026
This PR enhances the readiness probe checks such that
nodes must be in a ready state before progressing with
the sequential roll that was introduced in valkey-io#116.

It does this by introducing checks for the node liveness
via a Running status field. Once the node is live then
the controller is enhanced to attempt to get the node to
rejoin the cluster, regardless if a volume is set or not.

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants