test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall)#328
Conversation
Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence. Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field. What's asserted (stuck shape, holds for 10s via SatisfyAll + HaveField under Consistently): - Status.Rollout != nil, RolloutInProgress=True - Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True - Status.Phase = Upgrading - Status.ReadyReplicas = replicas (existing pods stay healthy) - ConditionNodesReady = True - Status.Rollout.StartedAt stable across reconciles (alert-rule load-bearing field; a regression that re-stamps it would hide stalls from age-based alerting) - No RolloutComplete event fires during the stall - Children Spec.Image advanced to v2 (UpdateNodeSpecs ran) - Children Status.CurrentImage stays at v1 (ObserveImage stalled) Recovery (Eventually with 60s budget): resuming the StatefulSet status faker lets the rollout reach terminal state on its own, proving the stuck state is the stall, not a deeper controller failure. Harness: uses the existing StatusFaker.Pause/Resume primitive introduced in #325. No production code touched. Coral cross-review applied (k8s-specialist + sre-engineer): Incorporated: - Struct-snapshot pattern + SatisfyAll(HaveField(...)) for per-field diagnostic messages on Consistently flakes - Status.Phase == Upgrading (kubectl get snd column #3 — the literal thing an on-call types first) - ReadyReplicas + ConditionNodesReady simultaneously True (the diagnostic contradiction pattern operators look for) - StartedAt stability inside Consistently (load-bearing for any future "rollout stuck >1h" alert) - ConditionPlanInProgress (operators read conditions before reading embedded Status.Plan) - Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2)) — a regression writing garbage would have passed before) - Recovery uses Eventually with 60s budget, not the default 30s pollTimeout — recovery has to drain a 10s faker backlog plus 5+ reconcile hops, and p99 under loaded CI can spike past 30s Deferred to follow-ups (out of scope for this PR): - Controller-still-reconciling liveness probe (harness pattern, separate test class) - Sibling failure-class tests for image-pull, PVC bind, sidecar crash (each has a distinct envtest-stub shape) - PromQL/alert-rule rotation in /Users/brandon/platform/clusters/ monitoring stacks — file /issue against observability-platform- engineer; no SND-keyed PromQL exists today, so a stuck prod rollout pages nobody until someone manually checks - Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate alongside the sibling failure-class tests Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests); stuck test individual ~17s. verify-generated clean.
PR SummaryMedium Risk Overview The test then resumes the faker and verifies recovery, ensuring the plan/rollout clear and Reviewed by Cursor Bugbot for commit 9b65cbd. Bugbot is set up for automated code reviews on this repo. Configure here. |
Summary
Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence (sidecar wedged, StatefulSet status not advancing, pod replacement gated by slow scheduler, etc.). Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field.
Net: 1 file, +220 / 0.
Stuck-vs-Paused contrast matrix (the contract for the Paused feature)
Status.PhaseStatus.Rollout!= nilStatus.Plan!= nil, ActiveConditionRolloutInProgressConditionPlanInProgressConditionPausedStatus.ReadyReplicas== replicas== replicasConditionNodesReadyStatus.Rollout.StartedAtRolloutStartedeventRolloutCompleteeventSpec.ImageThe on-call's tell for stuck:
Phase=Upgrading + RolloutInProgress=True + ReadyReplicas=replicas— existing pods are healthy, but the controller hasn't observed the rollout completing.Approach
#325's primitive) to wedge ObserveImage and AwaitSpecUpdate behind itstuckSnapshotstruct of all asserted fieldsConsistently(snapshotFn, "10s", "500ms").Should(SatisfyAll(HaveField(...)...))— per-field diagnostic on flakeEventually(...).WithTimeout(60s)for clean recovery (default 30s is thin under p99 CI load after a 10s stall + 5-hop reconcile drain)Coral cross-review applied (k8s-specialist + sre-engineer, pre-PR)
Incorporated:
SatisfyAll(HaveField(...))for per-field diagnostic on Consistently flakes (vs. opaque "expected true, got false" from a bool predicate)Status.Phase == Upgrading—kubectl get sndcolumn feat: bump seictl to use S3 transfer manager for parallel snapshot downloads #3, the literal thing an on-call types first (SRE finding)Status.ReadyReplicas == replicas+ConditionNodesReady=True— the diagnostic contradiction pattern operators look for (SRE finding)Status.Rollout.StartedAtcaptured initially, asserted stable inside Consistently — load-bearing for any future "rollout stuck >1h" alert (SRE finding)ConditionPlanInProgress— operators read conditions before embeddedStatus.PlanCurrentImageassertion (Equal(v1), not the weakNotTo(Equal(v2))) — a regression writing garbage would have passed beforeEventually(...).WithTimeout(60s)— recovery has to drain a 10s faker backlog plus 5+ reconcile hops; p99 under loaded CI can spike past the default 30sDeferred to follow-ups (call out, don't block):
/Users/brandon/platform/clusters/{prod,harbor,dev}/monitoring/. A stuck prod rollout pages nobody today; discovery is "someone notices." Filing/issueagainst observability-platform-engineer with the query shape ((time() - max by (namespace,name)(sei_snd_rollout_started_seconds)) > 3600 and on(namespace,name) sei_snd_condition{type="RolloutInProgress",status="true"} == 1) and against opentelemetry-expert for whether the controller exposes those gauges yet.TestInPlaceRollout_StuckOnAwaitSpecUpdateonce sibling tests land.Verifications
make test-integrationgreen; 3 consecutive runs stable at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests). Stuck test individual ~17s (10s Consistently + recovery).make verify-generatedclean.go build ./...clean.🤖 Generated with Claude Code