🤖 feat: harden CoderControlPlane controller (leader election, readiness, CRD defaults, tests) #41

Merged
ThomasK33 merged 6 commits into main from coder-control-plane-j3y3
Feb 11, 2026

Conversation

@ThomasK33
Member

Summary

Close operational and test gaps in the CoderControlPlane controller, making it safer for HA deployments, more observable, and harder to regress.

Background

The CoderControlPlane controller audit identified several best-practice gaps:

  • No leader election: unsafe when scaled to >1 replica — multiple controllers would race on the same resources.
  • Readiness check is a no-op ping: Kubernetes could route traffic before caches are synced.
  • Defaults hidden in reconciler logic: users can't see effective defaults via kubectl get -o yaml.
  • Limited test coverage: only basic create + nil-check tests; no status, update, ownership, or phase-transition tests.

Implementation

1. Leader election (controllerapp.go)

  • Enabled LeaderElection: true with stable ID coder-k8s-controller.coder.com.
  • Set LeaderElectionReleaseOnCancel: true for faster handoff on shutdown.
  • Added RBAC markers for coordination.k8s.io/leases and events.
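
A sketch of the corresponding kubebuilder markers (the exact verb lists are an assumption; the regenerated config/rbac/role.yaml is authoritative):

// Picked up by controller-gen when regenerating RBAC manifests.
// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch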

2. Cache-sync readiness check (controllerapp.go)

  • Replaced the /readyz check: healthz.Ping is swapped for a cache-sync check that wraps mgr.GetCache().WaitForCacheSync with a 2-second timeout.
  • /healthz remains healthz.Ping (liveness should be cheap).

3. CRD default markers (api/v1alpha1/)

  • Added +kubebuilder:default markers on:
    • CoderControlPlaneSpec.Image → "ghcr.io/coder/coder:latest"
    • CoderControlPlaneSpec.Replicas → 1
    • ServiceSpec.Type → "ClusterIP"
    • ServiceSpec.Port → 80
  • Reconciler-level defaults preserved for backward compatibility with pre-marker CRDs.
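
A sketch of the ServiceSpec markers listed above, assuming the field layout in api/v1alpha1/types_shared.go (the field type and JSON tags are assumptions):

// +kubebuilder:default="ClusterIP"
Type corev1.ServiceType `json:"type,omitempty"`

// +kubebuilder:default=80
Port int32 `json:"port,omitempty"`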

4. Regenerated manifests

  • CRDs updated with default values (config/crd/bases/).
  • RBAC role updated with lease + event permissions (config/rbac/role.yaml).

5. Expanded envtest coverage (5 new tests)

  • TestReconcile_StatusPersistence: asserts ObservedGeneration, URL, ReadyReplicas, Phase after reconcile.
  • TestReconcile_OwnerReferences: verifies Deployment/Service have controller owner refs (see the sketch after this list).
  • TestReconcile_SpecUpdatePropagates: updates spec fields, reconciles, asserts child resources updated.
  • TestReconcile_PhaseTransitionToReady: manually sets deployment ready replicas, asserts phase transition to Ready.
  • TestReconcile_DefaultsApplied: reconciles minimal spec, asserts defaults in child resources.
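
A sketch of the owner-reference case, assuming the existing envtest wiring (k8sClient, testify-style assertions) and that the child Deployment shares the CP's name and namespace; these are assumptions about the suite, not guarantees:

func TestReconcile_OwnerReferences(t *testing.T) {
    ctx := context.Background()

    // Create a minimal CoderControlPlane and reconcile it once.
    cp := &v1alpha1.CoderControlPlane{
        ObjectMeta: metav1.ObjectMeta{Name: "owner-ref-cp", Namespace: "default"},
    }
    require.NoError(t, k8sClient.Create(ctx, cp))

    r := &CoderControlPlaneReconciler{Client: k8sClient, Scheme: k8sClient.Scheme()}
    _, err := r.Reconcile(ctx, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(cp)})
    require.NoError(t, err)

    // The child Deployment must point back at the CP with controller=true.
    var deploy appsv1.Deployment
    require.NoError(t, k8sClient.Get(ctx, client.ObjectKeyFromObject(cp), &deploy))
    ref := metav1.GetControllerOf(&deploy)
    require.NotNil(t, ref)
    require.Equal(t, cp.Name, ref.Name)
}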

Validation

  • make verify-vendor
  • make build
  • make lint
  • make test ✅ (all packages pass, including 5 new envtest tests)

Risks

  • Leader election: existing single-replica deployments are unaffected. Multi-replica setups now work correctly instead of silently racing. RBAC for leases is required — the regenerated role includes it.
  • Readiness check: the 2-second timeout is conservative; pods will show NotReady until caches sync, which is the desired behavior.
  • CRD defaults: additive change — existing resources with explicit values are unaffected. New resources or omitted fields get visible defaults.

📋 Implementation Plan

Plan: Close CoderControlPlane controller best-practice gaps

Context / Why

The CoderControlPlane controller works, but the audit identified operational and test gaps that make it harder to run safely in production (HA), harder to debug, and easier to regress:

  • Controller manager does not use leader election (unsafe if scaled to >1 replica).
  • /readyz and /healthz are both healthz.Ping (readiness doesn’t reflect cache/API readiness).
  • CoderControlPlane tests cover creation and defensive nil checks, but miss spec updates, status assertions, and key error paths.
  • Defaults are applied imperatively in reconcilers; there’s no admission-time/CRD defaulting.

Goal: harden the controller app + reconciler and expand tests so the behavior is safer and the contract is verified.

Evidence (what was inspected)

  • internal/app/controllerapp/controllerapp.go
    • Manager options only set Scheme + HealthProbeBindAddress (no LeaderElection).
    • Both health + ready checks use healthz.Ping.
  • internal/controller/codercontrolplane_controller.go
    • Defaults for image/replicas/service type+port are applied inside reconcile.
    • Status update sets ObservedGeneration, ReadyReplicas, URL, Phase.
  • internal/controller/codercontrolplane_controller_test.go
    • Tests: NotFound, create children, nil client/scheme.
    • Missing: spec update, status persistence, owner refs, error paths.
  • internal/controller/suite_test.go
    • Uses controller-runtime envtest with CRDs loaded from config/crd/bases.
  • Reference pattern: internal/controller/workspaceproxy_controller_test.go contains richer assertions (including status).

Implementation details

1) Enable leader election for the controller manager

Why: safe HA operation; prevents multiple controllers racing on the same resources.

Edits

  • internal/app/controllerapp/controllerapp.go
    • Enable leader election in ctrl.Options.
    • Set a stable LeaderElectionID.
    • Prefer LeaderElectionReleaseOnCancel: true for faster handoff on shutdown.

Suggested shape:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    HealthProbeBindAddress: HealthProbeBindAddress,

    LeaderElection: true,
    LeaderElectionID: "coder-k8s-controller.coder.com",
    LeaderElectionReleaseOnCancel: true,
})

RBAC (required if leader election is enabled)

  • Add kubebuilder RBAC marker somewhere included by controller-gen (e.g., internal/app/controllerapp/controllerapp.go):
// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch;delete
  • Regenerate manifests (make manifests) so config/rbac/role.yaml includes leases permissions.
  • Verify any install manifests (deploy/rbac.yaml, config/e2e/*) bind the controller SA to the regenerated role (or update them if they’re intended to be used).

2) Make readiness meaningful (cache sync instead of Ping)

Why: Kubernetes should only route traffic / treat the controller as ready once caches are started and synced.

Edits

  • internal/app/controllerapp/controllerapp.go
    • Keep liveness as healthz.Ping.
    • Replace ready check with a cache-sync check (wrap mgr.GetCache().WaitForCacheSync). Ensure it does not hang indefinitely by using a short timeout.

Suggested shape:

if err := mgr.AddReadyzCheck("cache-sync", healthz.Checker(func(req *http.Request) error {
    ctx, cancel := context.WithTimeout(req.Context(), 2*time.Second)
    defer cancel()

    if ok := mgr.GetCache().WaitForCacheSync(ctx); !ok {
        return fmt.Errorf("cache not synced")
    }
    return nil
})); err != nil {
    return fmt.Errorf("unable to set up ready check: %w", err)
}

3) Move defaults out of “hidden” controller logic (prefer CRD defaults)

Why: users and automation should be able to see the effective defaults via kubectl get -o yaml and avoid surprise behavior.

Preferred approach (low overhead): CRD defaults via kubebuilder markers

  • Add +kubebuilder:default markers to API types.
    • api/v1alpha1/codercontrolplane_types.go
      • CoderControlPlaneSpec.Image default ghcr.io/coder/coder:latest
      • CoderControlPlaneSpec.Replicas default 1
    • api/v1alpha1/types_shared.go
      • ServiceSpec.Type default ClusterIP
      • ServiceSpec.Port default 80

Example:

// +kubebuilder:default="ghcr.io/coder/coder:latest"
Image string `json:"image,omitempty"`

// +kubebuilder:default=1
Replicas *int32 `json:"replicas,omitempty"`

  • Run make manifests to update CRDs under config/crd/bases/.
  • Keep reconciler-level defaults for backward compatibility (old CRDs), but add a TODO to remove them once defaults are guaranteed by the CRDs.

Alternative: mutating webhook defaulting

If CRD defaults for nested structs (like spec.service) are insufficient or too awkward, a mutating webhook can set defaults reliably. This is higher operational overhead (webhook server, certs, deployment changes) and is likely not worth it unless defaults become complex.

4) Expand envtest coverage for CoderControlPlane

Why: prevent regressions in update behavior and status reporting.

Edits

  • internal/controller/codercontrolplane_controller_test.go

Add tests mirroring the thoroughness of workspaceproxy_controller_test.go:

  1. Status is persisted after reconcile

    • After reconcile, Get the CoderControlPlane and assert:
      • Status.ObservedGeneration == Generation
      • Status.URL == fmt.Sprintf("http://%s.%s.svc.cluster.local:%d", ...)
      • Status.ReadyReplicas == 0 (in envtest)
      • Status.Phase == Pending
  2. Owner references are set (in lieu of GC tests)

    • Assert Deployment.OwnerReferences and Service.OwnerReferences include the CP with controller=true.
  3. Spec updates propagate

    • Update CP spec fields (replicas, image, service port/annotations), call reconcile, assert Deployment/Service updated.
  4. Status transitions to Ready when deployment is "ready" (sketched after this list)

    • Manually set deployment.Status.ReadyReplicas = 1 via k8sClient.Status().Update.
    • Reconcile again and assert CP Status.Phase == Ready and ReadyReplicas == 1.
  5. Defaulting behavior is covered

    • Create a CP with minimal/empty spec, reconcile, and assert the defaults took effect (by checking the stored spec fields if CRD defaults are used, by checking the reconciled child resources, or both).
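
For item 4, the flow in envtest might look like this (a fragment; assumes cp has already been created and reconciled, deploy is its fetched child Deployment, and the Ready phase constant name is a guess):

// envtest runs no kubelet, so the Deployment never becomes ready on its own;
// write the status by hand through the status subresource.
deploy.Status.Replicas = 1
deploy.Status.ReadyReplicas = 1
require.NoError(t, k8sClient.Status().Update(ctx, &deploy))

// Reconcile again and expect the control plane status to flip to Ready.
_, err = r.Reconcile(ctx, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(cp)})
require.NoError(t, err)

require.NoError(t, k8sClient.Get(ctx, client.ObjectKeyFromObject(cp), cp))
require.Equal(t, int32(1), cp.Status.ReadyReplicas)
require.Equal(t, v1alpha1.PhaseReady, cp.Status.Phase) // constant name is an assumption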

5) Add targeted error-path tests (unit-style)

Why: ensure reconcile wraps and returns actionable errors (especially around status writes).

Approach:

  • Add a thin client.Client wrapper used only in tests to force failures on:
    • Status().Update (to cover update control plane status: ... path)
    • optionally Create for Deployments/Services (to cover reconcile child errors)

This avoids needing a full fake API server while still verifying error-handling behavior.
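
A sketch of such a wrapper (type and error names are made up for illustration; assumes a controller-runtime version where StatusWriter.Update takes SubResourceUpdateOption):

// failingStatusClient wraps a real client and injects an error on status
// updates, exercising the status-update error path in Reconcile.
type failingStatusClient struct {
    client.Client
}

func (c *failingStatusClient) Status() client.StatusWriter {
    return &failingStatusWriter{StatusWriter: c.Client.Status()}
}

type failingStatusWriter struct {
    client.StatusWriter
}

func (w *failingStatusWriter) Update(ctx context.Context, obj client.Object, opts ...client.SubResourceUpdateOption) error {
    return errors.New("injected status update failure")
}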

6) Validation / regeneration checklist

When implementing:

  • make manifests (CRD + RBAC generation)
  • make test
  • make lint
  • (optional) make build

Expected outcome

  • Controller manager supports HA safely (leader election + required RBAC).
  • Readiness indicates real operational readiness (cache synced).
  • Defaults are visible and consistent (CRD defaults preferred).
  • CoderControlPlane tests cover create/update/status/ownership/error paths, reducing regressions.

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $0.57

@ThomasK33
Member Author

@codex review


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e455d98ec7


Comment thread: internal/app/controllerapp/controllerapp.go
Detect pod namespace from the in-cluster service account file, falling
back to kube-system for out-of-cluster development runs. Without an
explicit namespace, controller-runtime fails at startup when not running
inside a pod.
@ThomasK33
Member Author

@codex review

Fixed the LeaderElectionNamespace issue. Added detectLeaderElectionNamespace() that reads the in-cluster namespace file (/var/run/secrets/kubernetes.io/serviceaccount/namespace), falling back to kube-system for out-of-cluster development runs.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9063e50cb


Comment thread: api/v1alpha1/types_shared.go
Comment thread: internal/app/controllerapp/controllerapp.go
- Add +kubebuilder:default={} on Service fields so nested defaults
  (type, port) apply even when spec.service is omitted entirely.
- Add POD_NAMESPACE env var as first-priority leader election namespace
  source, enabling explicit override for out-of-cluster dev runs.
@ThomasK33
Member Author

@codex review

Addressed both P2 comments:

  1. Added +kubebuilder:default={} on the Service fields of type ServiceSpec (in both CoderControlPlane and WorkspaceProxy) so nested type/port defaults materialize even when spec.service is omitted.
  2. Added POD_NAMESPACE env var as first-priority source for leader election namespace, allowing explicit override for out-of-cluster dev. Resolution order: POD_NAMESPACE env → in-cluster file → kube-system fallback.
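
A sketch of that resolution order (the function name detectLeaderElectionNamespace comes from the earlier comment; the body here is an assumption, not the merged code):

// detectLeaderElectionNamespace resolves the namespace for the leader election
// Lease: POD_NAMESPACE override first, then the in-cluster service account
// file, then a kube-system fallback for out-of-cluster development runs.
func detectLeaderElectionNamespace() string {
    if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
        return ns
    }
    if b, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace"); err == nil {
        if ns := strings.TrimSpace(string(b)); ns != "" {
            return ns
        }
    }
    return "kube-system"
}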

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?


@ThomasK33 added this pull request to the merge queue on Feb 11, 2026
@ThomasK33
Member Author

Merged via the queue into main with commit 145c99e on Feb 11, 2026
8 checks passed
@ThomasK33 deleted the coder-control-plane-j3y3 branch on February 11, 2026 at 09:04