test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall) by bdchatham · Pull Request #328 · sei-protocol/sei-k8s-controller

bdchatham · 2026-05-21T15:12:08Z

Summary

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence (sidecar wedged, StatefulSet status not advancing, pod replacement gated by slow scheduler, etc.). Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field.

Net: 1 file, +220 / 0.

Stuck-vs-Paused contrast matrix (the contract for the Paused feature)

Field	Stuck	Paused (target shape)
`Status.Phase`	Upgrading	Paused (or Ready with Cond)
`Status.Rollout`	`!= nil`	nil
`Status.Plan`	`!= nil`, Active	nil
`ConditionRolloutInProgress`	True	False or absent
`ConditionPlanInProgress`	True	False or absent
`ConditionPaused`	absent	True
`Status.ReadyReplicas`	`== replicas`	`== replicas`
`ConditionNodesReady`	True	True
`Status.Rollout.StartedAt`	stable (no rest)	n/a
`RolloutStarted` event	fired	never (no rollout begun)
`RolloutComplete` event	never fires	n/a
Child `Spec.Image`	new image	old image (no propagation)

The on-call's tell for stuck: Phase=Upgrading + RolloutInProgress=True + ReadyReplicas=replicas — existing pods are healthy, but the controller hasn't observed the rollout completing.

Approach

Pauses the StatusFaker (#325's primitive) to wedge ObserveImage and AwaitSpecUpdate behind it
Captures a stuckSnapshot struct of all asserted fields
Consistently(snapshotFn, "10s", "500ms").Should(SatisfyAll(HaveField(...)...)) — per-field diagnostic on flake
Resume faker → Eventually(...).WithTimeout(60s) for clean recovery (default 30s is thin under p99 CI load after a 10s stall + 5-hop reconcile drain)

Coral cross-review applied (k8s-specialist + sre-engineer, pre-PR)

Incorporated:

Struct-snapshot + SatisfyAll(HaveField(...)) for per-field diagnostic on Consistently flakes (vs. opaque "expected true, got false" from a bool predicate)
Status.Phase == Upgrading — kubectl get snd column feat: bump seictl to use S3 transfer manager for parallel snapshot downloads #3, the literal thing an on-call types first (SRE finding)
Status.ReadyReplicas == replicas + ConditionNodesReady=True — the diagnostic contradiction pattern operators look for (SRE finding)
Status.Rollout.StartedAt captured initially, asserted stable inside Consistently — load-bearing for any future "rollout stuck >1h" alert (SRE finding)
ConditionPlanInProgress — operators read conditions before embedded Status.Plan
Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2))) — a regression writing garbage would have passed before
Recovery uses Eventually(...).WithTimeout(60s) — recovery has to drain a 10s faker backlog plus 5+ reconcile hops; p99 under loaded CI can spike past the default 30s
Contrast matrix doc-comment anchoring the Paused feature contract

Deferred to follow-ups (call out, don't block):

Controller-still-reconciling liveness probe — harness pattern, separate test class. Currently nothing distinguishes "stuck on AwaitSpecUpdate as designed" from "controller panicked and gave up." Worth its own test.
Sibling failure-class tests — image-pull (no faker involvement; pods CrashLoopBackOff), PVC bind failure (different plan task), sidecar crash mid-task (different stub shape). Each has a distinct stuck-shape; landing them as separate tests preserves runbook-to-test mapping.
PromQL/alert-rule rotation — grep finds zero alert rules or PromQL keyed on SeiNodeDeployment under /Users/brandon/platform/clusters/{prod,harbor,dev}/monitoring/. A stuck prod rollout pages nobody today; discovery is "someone notices." Filing /issue against observability-platform-engineer with the query shape ((time() - max by (namespace,name)(sei_snd_rollout_started_seconds)) > 3600 and on(namespace,name) sei_snd_condition{type="RolloutInProgress",status="true"} == 1) and against opentelemetry-expert for whether the controller exposes those gauges yet.
Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate once sibling tests land.

Verifications

make test-integration green; 3 consecutive runs stable at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests). Stuck test individual ~17s (10s Consistently + recovery).
make verify-generated clean.
go build ./... clean.

🤖 Generated with Claude Code

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence. Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field. What's asserted (stuck shape, holds for 10s via SatisfyAll + HaveField under Consistently): - Status.Rollout != nil, RolloutInProgress=True - Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True - Status.Phase = Upgrading - Status.ReadyReplicas = replicas (existing pods stay healthy) - ConditionNodesReady = True - Status.Rollout.StartedAt stable across reconciles (alert-rule load-bearing field; a regression that re-stamps it would hide stalls from age-based alerting) - No RolloutComplete event fires during the stall - Children Spec.Image advanced to v2 (UpdateNodeSpecs ran) - Children Status.CurrentImage stays at v1 (ObserveImage stalled) Recovery (Eventually with 60s budget): resuming the StatefulSet status faker lets the rollout reach terminal state on its own, proving the stuck state is the stall, not a deeper controller failure. Harness: uses the existing StatusFaker.Pause/Resume primitive introduced in #325. No production code touched. Coral cross-review applied (k8s-specialist + sre-engineer): Incorporated: - Struct-snapshot pattern + SatisfyAll(HaveField(...)) for per-field diagnostic messages on Consistently flakes - Status.Phase == Upgrading (kubectl get snd column #3 — the literal thing an on-call types first) - ReadyReplicas + ConditionNodesReady simultaneously True (the diagnostic contradiction pattern operators look for) - StartedAt stability inside Consistently (load-bearing for any future "rollout stuck >1h" alert) - ConditionPlanInProgress (operators read conditions before reading embedded Status.Plan) - Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2)) — a regression writing garbage would have passed before) - Recovery uses Eventually with 60s budget, not the default 30s pollTimeout — recovery has to drain a 10s faker backlog plus 5+ reconcile hops, and p99 under loaded CI can spike past 30s Deferred to follow-ups (out of scope for this PR): - Controller-still-reconciling liveness probe (harness pattern, separate test class) - Sibling failure-class tests for image-pull, PVC bind, sidecar crash (each has a distinct envtest-stub shape) - PromQL/alert-rule rotation in /Users/brandon/platform/clusters/ monitoring stacks — file /issue against observability-platform- engineer; no SND-keyed PromQL exists today, so a stuck prod rollout pages nobody until someone manually checks - Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate alongside the sibling failure-class tests Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests); stuck test individual ~17s. verify-generated clean.

cursor · 2026-05-21T15:12:16Z

PR Summary

Medium Risk
Test-only change, but it adds a new envtest scenario that intentionally stalls reconciliation for 10s and relies on timing/polling, which could introduce CI runtime or flake risk.

Overview
Adds a new envtest (inplace_stuck_test.go) that pins the “stuck in-place rollout” status contract when child convergence can’t be observed (by pausing the status faker), asserting the rollout remains in progress with stable StartedAt, Phase=Upgrading, and ready replicas while children show Spec.Image advanced but Status.CurrentImage unchanged.

The test then resumes the faker and verifies recovery, ensuring the plan/rollout clear and ConditionRolloutInProgress transitions to False with reason RolloutComplete within an extended timeout.

^{Reviewed by Cursor Bugbot for commit 9b65cbd. Bugbot is set up for automated code reviews on this repo. Configure here.}

bdchatham merged commit f1ddd42 into main May 21, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall)#328

test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall)#328
bdchatham merged 1 commit into
mainfrom
chore/envtest-phase-4-stuck

bdchatham commented May 21, 2026

Uh oh!

cursor Bot commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented May 21, 2026

Summary

Stuck-vs-Paused contrast matrix (the contract for the Paused feature)

Approach

Coral cross-review applied (k8s-specialist + sre-engineer, pre-PR)

Verifications

Uh oh!

cursor Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented May 21, 2026 •

edited

Loading