test(envtest): Phase 2 — mid-flight supersession assertion#325
Merged
Conversation
Adds a deterministic envtest for detectDeploymentNeeded's supersession branch (nodes.go): when spec.template.spec.image changes a second time before the first rollout finishes, the in-flight plan must be replaced with one targeting the latest hash, and a RolloutSuperseded event must be recorded against the SND. Approach - Refactor sts_status_faker.go to return a *StatusFaker struct with Pause() / Resume() methods. The 50ms-tick goroutine reads a paused flag each tick and skips writes when paused. No production code touched; pure test-harness change. - Expose the faker handle as a package-level var in suite_test.go so individual tests can drive timing. - Add listEventsForSND helper to helpers_test.go for filtering events by InvolvedObject UID + Reason. New test: TestInPlaceRollout_Supersession (~150 LOC) - Creates a 2-replica InPlace SND at v1 - Pauses the faker, patches to v2 - Asserts Status.Rollout populated with v2's TargetHash (rollout registered) but the plan stalls (can't complete with faker paused — ObserveImage gates on StatefulSet status that never advances) - Patches to v3 while v2 is in flight - Asserts Status.Rollout retargeted to v3's hash AND a RolloutSuperseded event was recorded referencing the v2 hash - Resumes the faker - Asserts v3 reaches terminal state cleanly: plan nil, rollout nil, RolloutInProgress=False/RolloutComplete, every child at v3 Verifications - make test-integration: green, both tests pass. Three consecutive runs stable at 17.2s / 17.4s / 18.1s wall (Phase 1: 7.55s, Phase 2: 5.55s). - make verify-generated: clean.
|
You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace. To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard. |
bdchatham
added a commit
that referenced
this pull request
May 21, 2026
…l) (#328) Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence. Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field. What's asserted (stuck shape, holds for 10s via SatisfyAll + HaveField under Consistently): - Status.Rollout != nil, RolloutInProgress=True - Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True - Status.Phase = Upgrading - Status.ReadyReplicas = replicas (existing pods stay healthy) - ConditionNodesReady = True - Status.Rollout.StartedAt stable across reconciles (alert-rule load-bearing field; a regression that re-stamps it would hide stalls from age-based alerting) - No RolloutComplete event fires during the stall - Children Spec.Image advanced to v2 (UpdateNodeSpecs ran) - Children Status.CurrentImage stays at v1 (ObserveImage stalled) Recovery (Eventually with 60s budget): resuming the StatefulSet status faker lets the rollout reach terminal state on its own, proving the stuck state is the stall, not a deeper controller failure. Harness: uses the existing StatusFaker.Pause/Resume primitive introduced in #325. No production code touched. Coral cross-review applied (k8s-specialist + sre-engineer): Incorporated: - Struct-snapshot pattern + SatisfyAll(HaveField(...)) for per-field diagnostic messages on Consistently flakes - Status.Phase == Upgrading (kubectl get snd column #3 — the literal thing an on-call types first) - ReadyReplicas + ConditionNodesReady simultaneously True (the diagnostic contradiction pattern operators look for) - StartedAt stability inside Consistently (load-bearing for any future "rollout stuck >1h" alert) - ConditionPlanInProgress (operators read conditions before reading embedded Status.Plan) - Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2)) — a regression writing garbage would have passed before) - Recovery uses Eventually with 60s budget, not the default 30s pollTimeout — recovery has to drain a 10s faker backlog plus 5+ reconcile hops, and p99 under loaded CI can spike past 30s Deferred to follow-ups (out of scope for this PR): - Controller-still-reconciling liveness probe (harness pattern, separate test class) - Sibling failure-class tests for image-pull, PVC bind, sidecar crash (each has a distinct envtest-stub shape) - PromQL/alert-rule rotation in /Users/brandon/platform/clusters/ monitoring stacks — file /issue against observability-platform- engineer; no SND-keyed PromQL exists today, so a stuck prod rollout pages nobody until someone manually checks - Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate alongside the sibling failure-class tests Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests); stuck test individual ~17s. verify-generated clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Second envtest test on the Phase 1 harness. Exercises
detectDeploymentNeeded's supersession branch (internal/controller/nodedeployment/nodes.go) — when an operator pushes v2 and then realizes they want v3 before v2 finishes rolling out, the in-flight plan must be replaced with one targeting v3, and aRolloutSupersededevent must be recorded.Net: 4 files, +247 / -27.
Harness change
sts_status_faker.gorefactored from a package-levelStartStatefulSetStatusFaker(ctx, kc) context.CancelFuncto a*StatusFakerstruct returned byStartStatefulSetStatusFaker. The struct exposesPause(),Resume(),Stop(). The 50ms-tick goroutine reads apausedflag each iteration and skips writes while paused. No production code touched.suite_test.goretains the faker handle in a package-leveltestFakervar.helpers_test.gogainslistEventsForSND(t, snd, reason)for filtering core/v1 Events byInvolvedObject.UID+ optionalReason.New test:
TestInPlaceRollout_SupersessionStatus.Rolloutpopulated with v2 hash,RolloutInProgress=True. Plan stalls (ObserveImage gates on STS status that never advances with faker paused).Status.Rollout.TargetHashretargets to v3 hash.RolloutSupersededevent recorded referencing the v2 hash.RolloutInProgress=False/RolloutComplete. Every child at v3.What this catches as a regression
Verifications
make test-integrationgreen, both tests pass. 3 consecutive runs stable: 17.2s / 17.4s / 18.1s wall.make verify-generatedclean.go build ./...clean.Phase 3+ candidates on the bench
MergeFromWithOptimisticLock{}contract fromCLAUDE.md:30-46)🤖 Generated with Claude Code