Skip to content

Verify all controller status patches use optimistic concurrency, not blind SSA #147

@bdchatham

Description

@bdchatham

Problem

Plan-creation idempotency under fast double-reconcile depends on the executor's single-patch model + status-conflict retry. If status patches are blind SSA (no resourceVersion check), two near-simultaneous reconciles can both observe status.plan == nil, both build a plan, and the second's Patch silently overwrites the first — losing the first plan's ID and any in-flight state.

The ValidationRun LLD (PR #143) names this as an implementation invariant for the new ValidationOrchestrationReconciler and ValidationLoadGenerationReconciler controllers. Existing controllers (SeiNodeReconciler, SeiNodeDeploymentReconciler) appear to use optimistic concurrency, but this hasn't been audited as a code-review checklist item — it's a property the codebase relies on without enforcement.

Impact

Primary use case

Any controller in this repo that builds plans via ResolvePlan-style logic. Today: SeiNode, SND. Tomorrow: ValidationRun (orchestration + load generation).

Cost of not addressing

A single contributor mistakenly using client.Status().Update(ctx, obj) (or Patch without resourceVersion) instead of client.Status().Patch(ctx, obj, client.MergeFrom(original)) introduces a race that's hard to surface in tests and easy to miss in code review. The race is rare but real, and corrupts plan state when it triggers.

Relevant experts

  • kubernetes-specialist — controller-runtime status-patch idioms, optimistic concurrency

Proposed approach

Three-part fix:

1. Audit existing controllers. Verify all status patches in internal/controller/{node,nodedeployment,...} use optimistic concurrency. Specifically:

  • client.Status().Patch(ctx, obj, client.MergeFrom(original)) — OK
  • client.Status().Update(ctx, obj) — verify resourceVersion on obj matches a recent fetch
  • Blind SSA without resourceVersion — fix

2. Add a code-review checklist item. Document in CONTRIBUTING.md (or repo equivalent) that all status patches must be optimistic-concurrency checked.

3. Add an integration test. Exercise double-reconcile contention on a single resource — fire two Reconcile calls in close succession, verify only one plan creation succeeds and the other observes-and-skips.

Acceptance criteria

  • Audit completed for SeiNodeReconciler, SeiNodeDeploymentReconciler, and any other planner consumer
  • CONTRIBUTING.md (or equivalent) documents the optimistic-concurrency invariant
  • Integration test landing in internal/controller/.../*_test.go that exercises double-reconcile contention; passes with optimistic-concurrency, fails with blind SSA
  • Findings from the audit fixed if any blind-SSA patches surface

Out of scope

  • ValidationRun controllers (covered by PR docs: LLD for ValidationRun CRD (for #139) #143's implementation; this issue is the cross-cutting hygiene)
  • Optimistic concurrency on spec patches (admission webhook + controller-runtime defaults already handle)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions