fix(controllers): enforce optimistic concurrency on status patches (closes #147) #149
Audits the codebase for status patches that could silently overwrite
fresher in-flight writes. Fixes 2 sites + documents the invariant in
CLAUDE.md.
Audit findings (10 status-patch-relevant call sites in internal/):
OK — explicit MergeFromWithOptimisticLock:
- internal/controller/node/controller.go:97-98 + :125
- internal/controller/node/controller.go:201-203 (handleNodeDeletion)
- internal/controller/nodedeployment/controller.go:81 + status.go:50
(statusBase reused via updateStatus)
NEEDS FIX (now patched):
- internal/controller/nodedeployment/controller.go:149 (handleDeletion
setting Phase=Terminating) — was MergeFrom; now MergeFromWithOptions
+ MergeFromWithOptimisticLock{}
- internal/task/deployment_switch.go:48 (rollout incumbent revision
write) — same fix
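To make the difference between the two patch bases concrete, here is a minimal stdlib-only sketch of the payloads involved. It is a simplified model, not controller-runtime's code: `buildStatusPatch` is a hypothetical helper, and the phase and resourceVersion values are made up. The point it illustrates is that the optimistic-lock option makes the merge patch carry the base object's `metadata.resourceVersion`, which the API server can then check before applying the change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildStatusPatch is a hypothetical helper that models the JSON merge patch
// sent for a status change. With optimisticLock set, the patch also carries
// the resourceVersion of the base object the caller read earlier.
func buildStatusPatch(phase, baseResourceVersion string, optimisticLock bool) ([]byte, error) {
	patch := map[string]interface{}{
		"status": map[string]interface{}{"phase": phase},
	}
	if optimisticLock {
		patch["metadata"] = map[string]interface{}{
			"resourceVersion": baseResourceVersion,
		}
	}
	// json.Marshal emits map keys in sorted order, so the output is stable.
	return json.Marshal(patch)
}

func main() {
	plain, _ := buildStatusPatch("Terminating", "41", false)
	locked, _ := buildStatusPatch("Terminating", "41", true)
	fmt.Println(string(plain))  // no precondition: applies over any newer write
	fmt.Println(string(locked)) // precondition: rejected if the object moved on
}
```

The plain payload has no precondition, which is why a stale writer can silently win; the locked payload lets the server answer with a conflict instead.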
Out of scope (not status patches; different invariants):
- SSA on owned children (Services, StatefulSets, Routes) via
client.Apply with field owner — field-manager isolation handles
cross-controller writes
- Finalizer patches (controller.go:139,180) — spec metadata
- Spec patches on owned children (genesis_peers.go) — different concern
Documents the invariant in CLAUDE.md under Code Standards. Code-review
checklist: every r.Status().Patch call site must use a base built with
MergeFromWithOptimisticLock{}.
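The invariant can be illustrated with a small in-memory model of two reconciles working from the same stale read. This is a sketch under stated assumptions: `Store`, `Object`, `PatchPlan`, and `raceTwoReconciles` are hypothetical names, an integer stands in for resourceVersion, and a nil base version stands in for a patch built without the optimistic-lock option.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict models the conflict a server returns for a stale resourceVersion.
var ErrConflict = errors.New("conflict: resourceVersion is stale")

// Object is a toy resource; Plan == "" stands in for status.plan == nil.
type Object struct {
	ResourceVersion int
	Plan            string
}

// Store is a toy, single-threaded stand-in for the API server.
type Store struct{ obj Object }

func (s *Store) Get() Object { return s.obj }

// PatchPlan writes the plan. A nil baseRV models a patch with no precondition;
// a non-nil baseRV must match the stored version or the write is rejected.
func (s *Store) PatchPlan(plan string, baseRV *int) error {
	if baseRV != nil && *baseRV != s.obj.ResourceVersion {
		return ErrConflict
	}
	s.obj.Plan = plan
	s.obj.ResourceVersion++
	return nil
}

// raceTwoReconciles has two reconciles read the same empty object, then write
// in turn. It returns the stored plan and the second writer's error.
func raceTwoReconciles(lock bool) (string, error) {
	s := &Store{obj: Object{ResourceVersion: 1}}
	a, b := s.Get(), s.Get() // both observe Plan == ""
	var rvA, rvB *int
	if lock {
		rvA, rvB = &a.ResourceVersion, &b.ResourceVersion
	}
	if err := s.PatchPlan("plan-A", rvA); err != nil {
		return "", err
	}
	err := s.PatchPlan("plan-B", rvB)
	return s.Get().Plan, err
}

func main() {
	plan, err := raceTwoReconciles(false)
	fmt.Println(plan, err) // second write replaced the first, no error
	plan, err = raceTwoReconciles(true)
	fmt.Println(plan, err) // first write kept, second writer told to retry
}
```

In the locked case the second reconcile gets an error it can act on, typically by requeueing and re-reading, at which point it would see that a plan already exists.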
Test infra deferral: Standing up envtest scaffolding to exercise
double-reconcile contention is non-trivial in this repo (no existing
envtest harness; all current tests are pure-Go). Per ValidationRun
LLD discussion, deferring to a follow-up issue alongside the admission-
test envtest follow-up — both should backfill at once if/when the
project wants integration-style admission/contention coverage.
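As a rough indication of what a contention test would assert, the sketch below runs several concurrent writers against an in-memory store. It is not envtest and does not replace it: `store`, `patchPlan`, and `contend` are hypothetical, and the mutex-guarded integer only approximates how an API server serializes writes by resourceVersion.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict: stale resourceVersion")

// store is a toy, mutex-guarded stand-in for a single API object.
type store struct {
	mu              sync.Mutex
	resourceVersion int
	plan            string
}

func (s *store) get() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.resourceVersion
}

// patchPlan applies the write only if baseRV still matches the stored version.
func (s *store) patchPlan(baseRV int, plan string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if baseRV != s.resourceVersion {
		return errConflict
	}
	s.plan = plan
	s.resourceVersion++
	return nil
}

// contend starts the given number of writers, lets all of them read before any
// of them writes, and reports how the version-checked writes turned out.
func contend(workers int) (successes, conflicts int, finalPlan string) {
	s := &store{resourceVersion: 1}
	results := make([]error, workers)
	var observed, done sync.WaitGroup
	observed.Add(workers)
	done.Add(workers)
	for i := 0; i < workers; i++ {
		go func(i int) {
			defer done.Done()
			rv := s.get()
			observed.Done()
			observed.Wait() // barrier: every writer holds the same stale read
			results[i] = s.patchPlan(rv, fmt.Sprintf("plan-%d", i))
		}(i)
	}
	done.Wait()
	for _, err := range results {
		if err == nil {
			successes++
		} else if errors.Is(err, errConflict) {
			conflicts++
		}
	}
	return successes, conflicts, s.plan
}

func main() {
	ok, rejected, plan := contend(8)
	fmt.Printf("successes=%d conflicts=%d plan=%q\n", ok, rejected, plan)
}
```

Which writer wins is not deterministic; the property worth asserting is that exactly one write lands and every other writer sees a conflict.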
Closes #147
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review (kubernetes-specialist)
Suggestions only: approve with two NITs
The audit is accurate, the two fixes are mechanically correct, and the CLAUDE.md framing is technically sound (verified against controller-runtime v0.23.1
Findings
APPROVED: audit completeness. Re-ran the grep. 10 hits in
APPROVED: fix correctness. Both
NIT:
NIT:
APPROVED: out-of-scope classifications. SSA, finalizers, spec-on-children correctly bucketed.
APPROVED: test deferral. Same shape as #145. Acceptable. The follow-up issue should land both this contention test and the admission test from #145 envtest at once, as separate PRs sharing one harness setup.
SHOULD: documentation surface.
Counter-suggestions
Files inspected
… patch
Coral review of #147 flagged that the new CLAUDE.md guidance ("status patches must use MergeFromWithOptimisticLock") could lead a future contributor to "reflexively fix" genesis_peers.go:89, which uses plain MergeFrom intentionally. That call site is a SPEC patch on a peer SeiNode, not a status patch, and is idempotent under conflict (the task converges spec.Peers toward the assembled peer list and is safe to re-run).
Adds an explanatory comment so the audit table's classification is visible at the call site itself, not just in the PR body.
Refs #147
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
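The "idempotent under conflict" argument can be sketched as follows. This is an illustration only, not the code in genesis_peers.go: `nodeSpec`, `convergePeers`, and the peer addresses are hypothetical. It shows why a write that always moves the field to the same assembled value is safe to re-run, even if another writer changed the field in between.

```go
package main

import "fmt"

// nodeSpec is a toy stand-in for the part of a node spec that holds peers.
type nodeSpec struct {
	Peers []string
}

// samePeers reports whether two peer lists are element-wise equal.
func samePeers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// convergePeers sets spec.Peers to the assembled list and reports whether a
// write was needed. The result depends only on the assembled list, so running
// it again after a lost or overwritten write reaches the same end state.
func convergePeers(spec *nodeSpec, assembled []string) bool {
	if samePeers(spec.Peers, assembled) {
		return false
	}
	spec.Peers = append([]string(nil), assembled...)
	return true
}

func main() {
	assembled := []string{"node-a:26656", "node-b:26656"}
	spec := &nodeSpec{Peers: []string{"stale:26656"}}

	fmt.Println(convergePeers(spec, assembled)) // first run writes
	spec.Peers = []string{"other-writer:26656"} // simulate an interleaved write
	fmt.Println(convergePeers(spec, assembled)) // re-run converges again
	fmt.Println(convergePeers(spec, assembled)) // nothing left to do
	fmt.Println(spec.Peers)
}
```

A status field such as a freshly built plan does not have this property, because each writer may compute a different value; that is the distinction the audit classification relies on.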
Addressed Coral's NIT #1 in fixup commit
Coral's other findings remain as separate follow-ups (networking.go:434
Summary
Audits the codebase for status patches that could silently overwrite fresher in-flight writes. Fixes 2 sites and documents the invariant in CLAUDE.md.
The ValidationRun LLD (PR #143) names plan-creation idempotency under fast double-reconcile as an implementation invariant: status patches must use optimistic concurrency (resourceVersion-checked) so two near-simultaneous reconciles can't both observe status.plan == nil, both build a plan, and have the second silently overwrite the first.
Audit findings: 10 status-patch-relevant call sites in internal/
OK: explicit client.MergeFromWithOptimisticLock{} (no change)
- internal/controller/node/controller.go:97-98 + :125
- internal/controller/node/controller.go:201-203 (handleNodeDeletion, Phase=Terminating)
- internal/controller/nodedeployment/controller.go:81 + status.go:50 (updateStatus)
NEEDS FIX: MergeFrom without optimistic-lock option (now patched)
- internal/controller/nodedeployment/controller.go:149 (handleDeletion setting Phase=Terminating)
  client.MergeFrom(group.DeepCopy()) → client.MergeFromWithOptions(group.DeepCopy(), client.MergeFromWithOptimisticLock{})
- internal/task/deployment_switch.go:48 (rollout incumbent revision write)
  client.MergeFrom(group.DeepCopy()) → client.MergeFromWithOptions(group.DeepCopy(), client.MergeFromWithOptimisticLock{})
Out of scope: not status patches
- SSA on owned children (client.Apply with field owner): nodedeployment/{internal_service,networking,monitoring}.go, task/apply_{service,statefulset}.go
- Finalizer patches: nodedeployment/controller.go:139,180
- Spec patches on owned children: task/genesis_peers.go:89 (Spec.Peers on SeiNode)
Documentation
Adds a new "Status patches" subsection under CLAUDE.md § Code Standards documenting the invariant, the use/don't-use patterns, and a code-review checklist item. CLAUDE.md is the de-facto contributor doc for this repo (no CONTRIBUTING.md or .github/PULL_REQUEST_TEMPLATE.md exists).
Test infra deferral
Standing up envtest scaffolding to exercise double-reconcile contention is non-trivial in this repo — no existing envtest harness, all current tests are pure-Go. Per the test-infra discussion that surfaced during the #145 review (which deferred admission-test envtest the same way), this PR defers integration-style contention testing to a separate follow-up issue. The audit + fix + documentation all land here; the integration test harness is a separate scope-bounded follow-up that should backfill admission tests + contention tests at once if the project wants integration-style coverage.
Will file the follow-up issue once this PR is reviewed; it will reference the same envtest follow-up that #145 deferred.
Test plan
- make lint: 0 issues
- make test: all packages pass
- Status().Patch / Status().Update call sites in internal/ (grep + read)
Files changed
- CLAUDE.md
- internal/controller/nodedeployment/controller.go: MergeFrom → MergeFromWithOptions(..., MergeFromWithOptimisticLock{})
- internal/task/deployment_switch.go
References
🤖 Generated with Claude Code