Skip to content

fix(controllers): enforce optimistic concurrency on status patches (closes #147)#149

Merged
bdchatham merged 3 commits into
mainfrom
fix/status-patch-optimistic-concurrency-audit
Apr 30, 2026
Merged

fix(controllers): enforce optimistic concurrency on status patches (closes #147)#149
bdchatham merged 3 commits into
mainfrom
fix/status-patch-optimistic-concurrency-audit

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented Apr 30, 2026

Summary

Audits the codebase for status patches that could silently overwrite fresher in-flight writes. Fixes 2 sites and documents the invariant in CLAUDE.md.

The ValidationRun LLD (PR #143) names plan-creation idempotency under fast double-reconcile as an implementation invariant: status patches must use optimistic concurrency (resourceVersion-checked) so two near-simultaneous reconciles can't both observe status.plan == nil, both build a plan, and have the second silently overwrite the first.

Audit findings — 10 status-patch-relevant call sites in internal/

OK — explicit client.MergeFromWithOptimisticLock{} (no change)

Site Description
internal/controller/node/controller.go:97-98 + :125 Main reconcile statusBase + flush
internal/controller/node/controller.go:201-203 handleNodeDeletion Phase=Terminating
internal/controller/nodedeployment/controller.go:81 + status.go:50 Main reconcile statusBase + flush via updateStatus

NEEDS FIX — MergeFrom without optimistic-lock option (now patched)

Site Was Now
internal/controller/nodedeployment/controller.go:149 (handleDeletion setting Phase=Terminating) client.MergeFrom(group.DeepCopy()) client.MergeFromWithOptions(group.DeepCopy(), client.MergeFromWithOptimisticLock{})
internal/task/deployment_switch.go:48 (rollout incumbent revision write) client.MergeFrom(group.DeepCopy()) client.MergeFromWithOptions(group.DeepCopy(), client.MergeFromWithOptimisticLock{})

Out of scope — not status patches

Pattern Sites Why out of scope
SSA on owned children (client.Apply with field owner) nodedeployment/{internal_service,networking,monitoring}.go, task/apply_{service,statefulset}.go Field-manager isolation handles cross-controller writes; different invariant
Finalizer patches nodedeployment/controller.go:139,180 Spec metadata patches
Spec patches on owned children task/genesis_peers.go:89 (Spec.Peers on SeiNode) Different invariant; planner-driven cross-resource spec mutation

Documentation

Adds a new "Status patches" subsection under CLAUDE.md § Code Standards documenting the invariant, the use/don't-use patterns, and a code-review checklist item. CLAUDE.md is the de-facto contributor doc for this repo (no CONTRIBUTING.md or .github/PULL_REQUEST_TEMPLATE.md exists).

Test infra deferral

Standing up envtest scaffolding to exercise double-reconcile contention is non-trivial in this repo — no existing envtest harness, all current tests are pure-Go. Per the test-infra discussion that surfaced during the #145 review (which deferred admission-test envtest the same way), this PR defers integration-style contention testing to a separate follow-up issue. The audit + fix + documentation all land here; the integration test harness is a separate scope-bounded follow-up that should backfill admission tests + contention tests at once if the project wants integration-style coverage.

Will file the follow-up issue once this PR is reviewed; references the same envtest follow-up #145 deferred.

Test plan

  • make lint — 0 issues
  • make test — all packages pass
  • Manual audit of all Status().Patch / Status().Update call sites in internal/ (grep + read)
  • Confirmed both fixed sites produce identical Status mutations as before; only the patch base option changes
  • (Deferred to follow-up) envtest harness exercising double-reconcile contention against ValidationRun + SeiNode + SND

Files changed

File LoC Purpose
CLAUDE.md +19 Documents the invariant + code-review checklist item
internal/controller/nodedeployment/controller.go ±1 MergeFromMergeFromWithOptions(..., MergeFromWithOptimisticLock{})
internal/task/deployment_switch.go ±1 Same fix for rollout revision write

References

🤖 Generated with Claude Code

Audits the codebase for status patches that could silently overwrite
fresher in-flight writes. Fixes 2 sites + documents the invariant in
CLAUDE.md.

Audit findings (10 status-patch-relevant call sites in internal/):

OK — explicit MergeFromWithOptimisticLock:
  - internal/controller/node/controller.go:97-98 + :125
  - internal/controller/node/controller.go:201-203 (handleNodeDeletion)
  - internal/controller/nodedeployment/controller.go:81 + status.go:50
    (statusBase reused via updateStatus)

NEEDS FIX (now patched):
  - internal/controller/nodedeployment/controller.go:149 (handleDeletion
    setting Phase=Terminating) — was MergeFrom; now MergeFromWithOptions
    + MergeFromWithOptimisticLock{}
  - internal/task/deployment_switch.go:48 (rollout incumbent revision
    write) — same fix

Out of scope (not status patches; different invariants):
  - SSA on owned children (Services, StatefulSets, Routes) via
    client.Apply with field owner — field-manager isolation handles
    cross-controller writes
  - Finalizer patches (controller.go:139,180) — spec metadata
  - Spec patches on owned children (genesis_peers.go) — different concern

Documents the invariant in CLAUDE.md under Code Standards. Code-review
checklist: every r.Status().Patch call site must use a base built with
MergeFromWithOptimisticLock{}.

Test infra deferral: Standing up envtest scaffolding to exercise
double-reconcile contention is non-trivial in this repo (no existing
envtest harness; all current tests are pure-Go). Per ValidationRun
LLD discussion, deferring to a follow-up issue alongside the admission-
test envtest follow-up — both should backfill at once if/when the
project wants integration-style admission/contention coverage.

Closes #147

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Collaborator Author

Review (kubernetes-specialist)

Suggestions only — approve with two NITs

The audit is accurate, the two fixes are mechanically correct, and the CLAUDE.md framing is technically sound (verified against controller-runtime v0.23.1 pkg/client/patch.go: default MergeFrom produces a JSON merge patch with no metadata.resourceVersion in the body, and the kube API server's Patch verb does not enforce optimistic concurrency unless that field is present. The PR's "stale writes succeed silently" claim is correct.).

Findings

APPROVED — audit completeness. Re-ran the grep. 10 hits in internal/ against Status().Patch|Status().Update|client.MergeFrom. Every site is accounted for in the PR table. No Status().Update callers exist (the four Update hits in nodes.go:214, node/controller.go:191,214, deployment_update.go:58 are spec/finalizer updates — correctly out of scope). Phase-flushing through updateStatus (nodedeployment/status.go:50) reuses the optimistic-lock base built at controller.go:81, so it inherits the protection — good.

APPROVED — fix correctness. Both MergeFromWithOptions(group.DeepCopy(), MergeFromWithOptimisticLock{}) calls compose correctly. In controller.go:149 and deployment_switch.go:48 the DeepCopy() happens before any in-memory group.Status mutation, so the snapshotted resourceVersion matches what was last fetched. No corruption window. Pattern matches the existing OK sites at node/controller.go:98,201 and nodedeployment/controller.go:81.

NIT — genesis_peers.go:89 plain MergeFrom is defensible but worth a comment. Spec patches on owned children write a single field (Spec.Peers) idempotently; the planner re-runs are convergent (compute peers, write peers, no read-modify-write of the existing slice). Out of scope for this PR is the right call. Suggest a one-line comment at the call site explaining why optimistic lock is intentionally omitted, so the next contributor doesn't trip the new CLAUDE.md checklist and "fix" it without thinking.

NIT — networking.go:434 (removeOwnerRef) read-modify-writes OwnerReferences with plain MergeFrom. This is a read-modify-write on a shared list — if two reconciles race during deletion-policy=Retain orphaning, the second observes pre-filter refs, and the patch could re-remove an already-removed UID (idempotent — fine) or, in adversarial ordering, miss a concurrent finalizer addition. Not in scope per the PR's classification (it's not a status patch), but worth a follow-up issue: orphan-path metadata patches on owned children probably want optimistic locking too.

APPROVED — out-of-scope classifications. SSA, finalizers, spec-on-children correctly bucketed.

APPROVED — test deferral. Same shape as #145. Acceptable. The follow-up issue should land both this contention test and the admission test from #145 envtest at once — separate PRs sharing one harness setup.

SHOULD — documentation surface. CLAUDE.md is fine for now (the PR notes correctly that no CONTRIBUTING.md exists), but the invariant is human-reviewer-relevant, not just agent-relevant. When the repo gains a CONTRIBUTING.md or a docs/contributing/ (likely as the contributor base widens for ValidationRun work), mirror this section there. Not blocking.

Counter-suggestions

  1. Add a one-line comment at genesis_peers.go:89 documenting the intentional omission — same drive-by polish as Admission validation: reject SeiNodeDeployment with genesis on full-node-role deployments #145.
  2. File the contention-test follow-up issue before this merges, link from PR body, so the deferral has a tracked landing place.
  3. Consider one follow-up tightening: removeOwnerRef should probably use optimistic lock on the metadata patch.

Files inspected

  • /Users/brandon/sei-k8s-controller/internal/controller/nodedeployment/controller.go
  • /Users/brandon/sei-k8s-controller/internal/controller/nodedeployment/networking.go
  • /Users/brandon/sei-k8s-controller/internal/controller/nodedeployment/status.go (referenced)
  • /Users/brandon/sei-k8s-controller/internal/controller/node/controller.go
  • /Users/brandon/sei-k8s-controller/internal/task/deployment_switch.go
  • /Users/brandon/sei-k8s-controller/internal/task/genesis_peers.go
  • /Users/brandon/sei-k8s-controller/internal/task/deployment_update.go
  • /Users/brandon/sei-k8s-controller/CLAUDE.md (diff)
  • /Users/brandon/sei-k8s-controller/docs/design/validation-run-lld.md (Implementation invariants)
  • controller-runtime v0.23.1 pkg/client/patch.go (verified merge-patch semantics)

… patch

Coral review of #147 flagged that the new CLAUDE.md guidance ("status
patches must use MergeFromWithOptimisticLock") could lead a future
contributor to "reflexively fix" genesis_peers.go:89, which uses plain
MergeFrom intentionally — that call site is a SPEC patch on a peer
SeiNode, not a status patch, and is idempotent under conflict (the task
converges spec.Peers toward the assembled peer list and is safe to
re-run).

Adds an explanatory comment so the audit table's classification is
visible at the call site itself, not just in the PR body.

Refs #147

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Collaborator Author

Addressed Coral's NIT #1 in fixup commit 669769e — added an explanatory comment at genesis_peers.go:89 noting the intentional plain MergeFrom (spec patch, not status; idempotent under conflict). The audit table's classification is now visible at the call site so future contributors don't reflexively "fix" it under the new CLAUDE.md guidance.

Coral's other findings remain as separate follow-ups (networking.go:434 removeOwnerRef race-class adjacency for a separate issue; CONTRIBUTING.md mirroring deferred until that file lands).

Refs #147

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit a80d53d into main Apr 30, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Verify all controller status patches use optimistic concurrency, not blind SSA

1 participant