Skip to content

test(envtest): Phase 2 — mid-flight supersession assertion#325

Merged
bdchatham merged 1 commit into
mainfrom
chore/envtest-phase-2-supersession
May 21, 2026
Merged

test(envtest): Phase 2 — mid-flight supersession assertion#325
bdchatham merged 1 commit into
mainfrom
chore/envtest-phase-2-supersession

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

Second envtest test on the Phase 1 harness. Exercises detectDeploymentNeeded's supersession branch (internal/controller/nodedeployment/nodes.go) — when an operator pushes v2 and then realizes they want v3 before v2 finishes rolling out, the in-flight plan must be replaced with one targeting v3, and a RolloutSuperseded event must be recorded.

Net: 4 files, +247 / -27.

Harness change

sts_status_faker.go refactored from a package-level StartStatefulSetStatusFaker(ctx, kc) context.CancelFunc to a *StatusFaker struct returned by StartStatefulSetStatusFaker. The struct exposes Pause(), Resume(), Stop(). The 50ms-tick goroutine reads a paused flag each iteration and skips writes while paused. No production code touched.

suite_test.go retains the faker handle in a package-level testFaker var. helpers_test.go gains listEventsForSND(t, snd, reason) for filtering core/v1 Events by InvolvedObject.UID + optional Reason.

New test: TestInPlaceRollout_Supersession

Step Assertion
1. Create 2-replica InPlace SND at v1 Initial deployment settles
2. Pause the faker (timing setup)
3. Patch v1 → v2 Status.Rollout populated with v2 hash, RolloutInProgress=True. Plan stalls (ObserveImage gates on STS status that never advances with faker paused).
4. Patch v2 → v3 (while v2 plan in flight) Status.Rollout.TargetHash retargets to v3 hash. RolloutSuperseded event recorded referencing the v2 hash.
5. Resume the faker (timing release)
6. Wait for terminal state Plan nil, Rollout nil, RolloutInProgress=False/RolloutComplete. Every child at v3.

What this catches as a regression

  • Removing the supersession branch (plan parks on v2 forever)
  • Changing the event reason or omitting the hash from the message
  • Breaking the fall-through where supersession nils the plan but doesn't write a fresh Rollout
  • Order-of-operation bugs between Plan/Rollout/condition writes

Verifications

  • make test-integration green, both tests pass. 3 consecutive runs stable: 17.2s / 17.4s / 18.1s wall.
  • make verify-generated clean.
  • go build ./... clean.

Phase 3+ candidates on the bench

  • Optimistic-lock conformance (MergeFromWithOptimisticLock{} contract from CLAUDE.md:30-46)
  • Failure-path: stuck rollout / stall-detection signals
  • Finalizer ordering test
  • Genesis ceremony plan execution

🤖 Generated with Claude Code

Adds a deterministic envtest for detectDeploymentNeeded's supersession
branch (nodes.go): when spec.template.spec.image changes a second time
before the first rollout finishes, the in-flight plan must be replaced
with one targeting the latest hash, and a RolloutSuperseded event must
be recorded against the SND.

Approach
- Refactor sts_status_faker.go to return a *StatusFaker struct with
  Pause() / Resume() methods. The 50ms-tick goroutine reads a paused
  flag each tick and skips writes when paused. No production code
  touched; pure test-harness change.
- Expose the faker handle as a package-level var in suite_test.go so
  individual tests can drive timing.
- Add listEventsForSND helper to helpers_test.go for filtering events
  by InvolvedObject UID + Reason.

New test: TestInPlaceRollout_Supersession (~150 LOC)
- Creates a 2-replica InPlace SND at v1
- Pauses the faker, patches to v2
- Asserts Status.Rollout populated with v2's TargetHash (rollout
  registered) but the plan stalls (can't complete with faker paused —
  ObserveImage gates on StatefulSet status that never advances)
- Patches to v3 while v2 is in flight
- Asserts Status.Rollout retargeted to v3's hash AND a
  RolloutSuperseded event was recorded referencing the v2 hash
- Resumes the faker
- Asserts v3 reaches terminal state cleanly: plan nil, rollout nil,
  RolloutInProgress=False/RolloutComplete, every child at v3

Verifications
- make test-integration: green, both tests pass. Three consecutive runs
  stable at 17.2s / 17.4s / 18.1s wall (Phase 1: 7.55s, Phase 2: 5.55s).
- make verify-generated: clean.
@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace.

To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard.

@bdchatham bdchatham merged commit b93c9bf into main May 21, 2026
4 checks passed
bdchatham added a commit that referenced this pull request May 21, 2026
…l) (#328)

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate
cannot observe child convergence. Establishes the contrast baseline
the forthcoming Paused feature must remain observably distinguishable
from on every status field.

What's asserted (stuck shape, holds for 10s via SatisfyAll +
HaveField under Consistently):

- Status.Rollout != nil, RolloutInProgress=True
- Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True
- Status.Phase = Upgrading
- Status.ReadyReplicas = replicas (existing pods stay healthy)
- ConditionNodesReady = True
- Status.Rollout.StartedAt stable across reconciles (alert-rule
  load-bearing field; a regression that re-stamps it would hide
  stalls from age-based alerting)
- No RolloutComplete event fires during the stall
- Children Spec.Image advanced to v2 (UpdateNodeSpecs ran)
- Children Status.CurrentImage stays at v1 (ObserveImage stalled)

Recovery (Eventually with 60s budget): resuming the StatefulSet
status faker lets the rollout reach terminal state on its own,
proving the stuck state is the stall, not a deeper controller
failure.

Harness: uses the existing StatusFaker.Pause/Resume primitive
introduced in #325. No production code touched.

Coral cross-review applied (k8s-specialist + sre-engineer):

Incorporated:
- Struct-snapshot pattern + SatisfyAll(HaveField(...)) for
  per-field diagnostic messages on Consistently flakes
- Status.Phase == Upgrading (kubectl get snd column #3 — the
  literal thing an on-call types first)
- ReadyReplicas + ConditionNodesReady simultaneously True
  (the diagnostic contradiction pattern operators look for)
- StartedAt stability inside Consistently (load-bearing for any
  future "rollout stuck >1h" alert)
- ConditionPlanInProgress (operators read conditions before
  reading embedded Status.Plan)
- Typed CurrentImage assertion (Equal(v1), not the weak
  NotTo(Equal(v2)) — a regression writing garbage would have
  passed before)
- Recovery uses Eventually with 60s budget, not the default 30s
  pollTimeout — recovery has to drain a 10s faker backlog plus
  5+ reconcile hops, and p99 under loaded CI can spike past 30s

Deferred to follow-ups (out of scope for this PR):
- Controller-still-reconciling liveness probe (harness pattern,
  separate test class)
- Sibling failure-class tests for image-pull, PVC bind, sidecar
  crash (each has a distinct envtest-stub shape)
- PromQL/alert-rule rotation in /Users/brandon/platform/clusters/
  monitoring stacks — file /issue against observability-platform-
  engineer; no SND-keyed PromQL exists today, so a stuck prod
  rollout pages nobody until someone manually checks
- Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate
  alongside the sibling failure-class tests

Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall
(all 5 envtest tests); stuck test individual ~17s. verify-generated
clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant