Skip to content

test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall)#328

Merged
bdchatham merged 1 commit into
mainfrom
chore/envtest-phase-4-stuck
May 21, 2026
Merged

test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall)#328
bdchatham merged 1 commit into
mainfrom
chore/envtest-phase-4-stuck

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence (sidecar wedged, StatefulSet status not advancing, pod replacement gated by slow scheduler, etc.). Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field.

Net: 1 file, +220 / 0.

Stuck-vs-Paused contrast matrix (the contract for the Paused feature)

Field Stuck Paused (target shape)
Status.Phase Upgrading Paused (or Ready with Cond)
Status.Rollout != nil nil
Status.Plan != nil, Active nil
ConditionRolloutInProgress True False or absent
ConditionPlanInProgress True False or absent
ConditionPaused absent True
Status.ReadyReplicas == replicas == replicas
ConditionNodesReady True True
Status.Rollout.StartedAt stable (no rest) n/a
RolloutStarted event fired never (no rollout begun)
RolloutComplete event never fires n/a
Child Spec.Image new image old image (no propagation)

The on-call's tell for stuck: Phase=Upgrading + RolloutInProgress=True + ReadyReplicas=replicas — existing pods are healthy, but the controller hasn't observed the rollout completing.

Approach

  • Pauses the StatusFaker (#325's primitive) to wedge ObserveImage and AwaitSpecUpdate behind it
  • Captures a stuckSnapshot struct of all asserted fields
  • Consistently(snapshotFn, "10s", "500ms").Should(SatisfyAll(HaveField(...)...)) — per-field diagnostic on flake
  • Resume faker → Eventually(...).WithTimeout(60s) for clean recovery (default 30s is thin under p99 CI load after a 10s stall + 5-hop reconcile drain)

Coral cross-review applied (k8s-specialist + sre-engineer, pre-PR)

Incorporated:

  • Struct-snapshot + SatisfyAll(HaveField(...)) for per-field diagnostic on Consistently flakes (vs. opaque "expected true, got false" from a bool predicate)
  • Status.Phase == Upgradingkubectl get snd column feat: bump seictl to use S3 transfer manager for parallel snapshot downloads #3, the literal thing an on-call types first (SRE finding)
  • Status.ReadyReplicas == replicas + ConditionNodesReady=True — the diagnostic contradiction pattern operators look for (SRE finding)
  • Status.Rollout.StartedAt captured initially, asserted stable inside Consistently — load-bearing for any future "rollout stuck >1h" alert (SRE finding)
  • ConditionPlanInProgress — operators read conditions before embedded Status.Plan
  • Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2))) — a regression writing garbage would have passed before
  • Recovery uses Eventually(...).WithTimeout(60s) — recovery has to drain a 10s faker backlog plus 5+ reconcile hops; p99 under loaded CI can spike past the default 30s
  • Contrast matrix doc-comment anchoring the Paused feature contract

Deferred to follow-ups (call out, don't block):

  • Controller-still-reconciling liveness probe — harness pattern, separate test class. Currently nothing distinguishes "stuck on AwaitSpecUpdate as designed" from "controller panicked and gave up." Worth its own test.
  • Sibling failure-class tests — image-pull (no faker involvement; pods CrashLoopBackOff), PVC bind failure (different plan task), sidecar crash mid-task (different stub shape). Each has a distinct stuck-shape; landing them as separate tests preserves runbook-to-test mapping.
  • PromQL/alert-rule rotation — grep finds zero alert rules or PromQL keyed on SeiNodeDeployment under /Users/brandon/platform/clusters/{prod,harbor,dev}/monitoring/. A stuck prod rollout pages nobody today; discovery is "someone notices." Filing /issue against observability-platform-engineer with the query shape ((time() - max by (namespace,name)(sei_snd_rollout_started_seconds)) > 3600 and on(namespace,name) sei_snd_condition{type="RolloutInProgress",status="true"} == 1) and against opentelemetry-expert for whether the controller exposes those gauges yet.
  • Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate once sibling tests land.

Verifications

  • make test-integration green; 3 consecutive runs stable at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests). Stuck test individual ~17s (10s Consistently + recovery).
  • make verify-generated clean.
  • go build ./... clean.

🤖 Generated with Claude Code

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate
cannot observe child convergence. Establishes the contrast baseline
the forthcoming Paused feature must remain observably distinguishable
from on every status field.

What's asserted (stuck shape, holds for 10s via SatisfyAll +
HaveField under Consistently):

- Status.Rollout != nil, RolloutInProgress=True
- Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True
- Status.Phase = Upgrading
- Status.ReadyReplicas = replicas (existing pods stay healthy)
- ConditionNodesReady = True
- Status.Rollout.StartedAt stable across reconciles (alert-rule
  load-bearing field; a regression that re-stamps it would hide
  stalls from age-based alerting)
- No RolloutComplete event fires during the stall
- Children Spec.Image advanced to v2 (UpdateNodeSpecs ran)
- Children Status.CurrentImage stays at v1 (ObserveImage stalled)

Recovery (Eventually with 60s budget): resuming the StatefulSet
status faker lets the rollout reach terminal state on its own,
proving the stuck state is the stall, not a deeper controller
failure.

Harness: uses the existing StatusFaker.Pause/Resume primitive
introduced in #325. No production code touched.

Coral cross-review applied (k8s-specialist + sre-engineer):

Incorporated:
- Struct-snapshot pattern + SatisfyAll(HaveField(...)) for
  per-field diagnostic messages on Consistently flakes
- Status.Phase == Upgrading (kubectl get snd column #3 — the
  literal thing an on-call types first)
- ReadyReplicas + ConditionNodesReady simultaneously True
  (the diagnostic contradiction pattern operators look for)
- StartedAt stability inside Consistently (load-bearing for any
  future "rollout stuck >1h" alert)
- ConditionPlanInProgress (operators read conditions before
  reading embedded Status.Plan)
- Typed CurrentImage assertion (Equal(v1), not the weak
  NotTo(Equal(v2)) — a regression writing garbage would have
  passed before)
- Recovery uses Eventually with 60s budget, not the default 30s
  pollTimeout — recovery has to drain a 10s faker backlog plus
  5+ reconcile hops, and p99 under loaded CI can spike past 30s

Deferred to follow-ups (out of scope for this PR):
- Controller-still-reconciling liveness probe (harness pattern,
  separate test class)
- Sibling failure-class tests for image-pull, PVC bind, sidecar
  crash (each has a distinct envtest-stub shape)
- PromQL/alert-rule rotation in /Users/brandon/platform/clusters/
  monitoring stacks — file /issue against observability-platform-
  engineer; no SND-keyed PromQL exists today, so a stuck prod
  rollout pages nobody until someone manually checks
- Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate
  alongside the sibling failure-class tests

Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall
(all 5 envtest tests); stuck test individual ~17s. verify-generated
clean.
@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

PR Summary

Medium Risk
Test-only change, but it adds a new envtest scenario that intentionally stalls reconciliation for 10s and relies on timing/polling, which could introduce CI runtime or flake risk.

Overview
Adds a new envtest (inplace_stuck_test.go) that pins the “stuck in-place rollout” status contract when child convergence can’t be observed (by pausing the status faker), asserting the rollout remains in progress with stable StartedAt, Phase=Upgrading, and ready replicas while children show Spec.Image advanced but Status.CurrentImage unchanged.

The test then resumes the faker and verifies recovery, ensuring the plan/rollout clear and ConditionRolloutInProgress transitions to False with reason RolloutComplete within an extended timeout.

Reviewed by Cursor Bugbot for commit 9b65cbd. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit f1ddd42 into main May 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant