🤖 feat: unify controller, aggregated API server, and MCP into --app=all default mode#53
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60b597c00e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Pushed a fix for the flaky |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 160a0e26fb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Addressed both review comments:
Also fixed a flaky |
|
@codex review The events RBAC comment has already been addressed in commit |
|
@codex review Fixed E2E smoke test — restored |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 110b428af7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Updated all documentation references to use the unified |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f5c16d313
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Plumbed |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Adds a new
--app=allmode (now the default) that runs the controller, aggregated API server, and MCP HTTP server in a single process with a shared controller-runtime cache/client. This eliminates the need for three separate Deployments and enables single-pod deployment.Background
Previously,
coder-k8srequired deploying three separate pods:--app=controller— the operator managingCoderControlPlane,WorkspaceProxy, andCoderProvisionerCRDs--app=aggregated-apiserver— the aggregated API server forCoderWorkspace/CoderTemplate(required manual--coder-url,--coder-session-token,--coder-namespaceflags)--app=mcp-http— the MCP server for AI tool integrationThis created operational overhead and prevented the MCP server from sharing the controller's cache-backed informers, leading to redundant API reads.
Implementation
Architecture: Shared cache composition root
Created
internal/app/allapp/allapp.goas a single composition root that:coder.com/v1alpha1+aggregation.coder.com/v1alpha1)controllerapp.SetupControllers(mgr), leader-election gatedclient.ClientDynamic control-plane discovery
The aggregated API server no longer requires manual
--coder-url/--coder-session-tokenflags inallmode. Instead, a newControlPlaneClientProvider(internal/aggregated/coder/controlplane_provider.go) dynamically discovers Coder instances:CoderControlPlaneobjects in the request namespace.)Refactored components for reuse
controllerapp: ExtractedNewManager(),SetupControllers(),SetupProbes()for reuse byallappmcpapp: AddedRunHTTPWithClients()to accept injected clients instead of constructing its ownsharedscheme: New package providing a unified scheme used by all componentsapiserverapp: AddedOptions.ClientProvideroverride soallappcan inject the dynamic providerUnified deployment manifests
Replaced three separate Deployments with a single
deploy/deployment.yaml:--apparg needed (defaults toall)8081(health),6443(aggregated API),8090(MCP)Backward compatibility
Existing explicit
--app=controller,--app=aggregated-apiserver, and--app=mcp-httpmodes remain fully functional for users who prefer split deployments.Validation
make verify-vendor✅make build✅make test✅ (all existing + new tests)make lint✅Risks
BadRequestwhen multiple eligibleCoderControlPlaneinstances exist in the same namespace. Multi-instance support (with control-plane-prefixed naming) is planned as a follow-up.📋 Implementation Plan
Plan: unify controller + aggregated API server + MCP into one
--app=all(shared cache)Context / Why
Today
coder-k8sruns as one of three separate app modes selected by--app:controller→ controller-runtime operator forcoder.com/v1alpha1(CoderControlPlane, etc.)aggregated-apiserver→ aggregated API server foraggregation.coder.com/v1alpha1(CoderWorkspace,CoderTemplate)mcp-http→ MCP server (HTTP transport) that interacts with both sets of APIsGoal:
--app=alland make it the default.allmode, run all three components in a single process and support deploying them as a single pod.Non-goals (for this change):
Evidence (what we verified)
From repo inspection:
app_dispatch.gocurrently requires--appand supports:controller,aggregated-apiserver,mcp-http.internal/app/controllerapp/controllerapp.gocreates a controller-runtimeManagerand uses the manager’s cache-backed client.api/v1alpha1/codercontrolplane_types.go+internal/controller/codercontrolplane_controller.goimplement operator access:spec.operatorAccess.disabledgates whether the operator/admin token is managedstatus.operatorTokenSecretRefpoints at the Secret/key containing the admin tokenstatus.operatorAccessReadyindicates token bootstrap successstatus.urlis the in-cluster coderd base URL used for service discoveryinternal/app/apiserverapp/apiserverapp.goservescoderworkspaces/codertemplatesvia codersdk-backed storage (internal/aggregated/storage/*).internal/aggregated/coder.ClientProvider.coder.NewStaticClientProviderconfigured via--coder-url,--coder-session-token, and--coder-namespace.internal/app/mcpapp/server.gobuilds its own non-cached controller-runtime client (client.New) + client-go clientset;mcpapp.NewServeralready supports injecting aclient.Client+clientset.internal/app/mcpapp/tools.gouses:CoderControlPlane,CoderWorkspace,CoderTemplate, Pods, Services, Deploymentscontroller-runtimecache supports starting new informers on-demand by default (ReaderFailOnMissingInformerdefaults tofalse).controller-runtimeManager.Add(...)behavior: any runnable that doesn’t implementLeaderElectionRunnableis treated as leader-election gated. To run MCP/APIServer on all replicas, they must implementNeedLeaderElection() bool { return false }.deploy/currently has three Deployments and separate ServiceAccounts/RBAC; a single pod requires one ServiceAccount with the union of permissions.Key files consulted:
app_dispatch.go,main_test.gointernal/app/controllerapp/controllerapp.gointernal/app/apiserverapp/apiserverapp.gointernal/app/mcpapp/http.go,internal/app/mcpapp/server.go,internal/app/mcpapp/tools.godeploy/*.yaml(controller/apiserver/mcp)vendor/sigs.k8s.io/controller-runtime/pkg/cache/*,vendor/.../pkg/manager/*Implementation details
1) Introduce an
all-mode “composition root” packageCreate a new package:
internal/app/allapp/allapp.goIt will be the single place where we wire together:
High-level shape:
Why a dedicated package?
main/dispatch simple.allmode really shares a single cache and config.Why not just run the existing 3 Run() functions concurrently?
Calling
controllerapp.Run(ctx)+mcpapp.RunHTTP(ctx)concurrently would not share controller-runtime cache/client instances, becausecontrollerapp.Runconstructs aManagerinternally andmcpapp.RunHTTPconstructs its own client/clientset.To share cache, we need a composition root that constructs the manager once and passes its cache-backed client into MCP.
2) Create a shared scheme builder used by controller + MCP
We need the manager’s scheme to include:
clientgoscheme.AddToSchemecoderv1alpha1.AddToSchemeaggregationv1alpha1.AddToScheme(MCP reads/patches these types)Options (pick one; recommended is A):
A) Create
internal/app/sharedscheme/sharedscheme.go:Then:
controllerapp.NewScheme()becomes a thin wrapper callingsharedscheme.New()(or remove/replace).mcpapp.newScheme()becomes a thin wrapper callingsharedscheme.New().allappusessharedscheme.New().B) Expand
controllerapp.NewScheme()to also registeraggregationv1alpha1and deletemcpapp.newScheme().(A avoids circular dependencies and keeps the scheme definition single-sourced.)
3) Refactor controller mode to allow reuse of the manager (minimal surgical extraction)
Currently
controllerapp.Runboth builds the manager and starts it.We want
allappto build the manager once and start it once, but reuse controller setup.Edits in
internal/app/controllerapp/controllerapp.go:Run(ctx)behavior by composing the above:Also:
detectLeaderElectionNamespace()is currently unexported; keep it in controllerapp and use it fromNewManager.4) Make MCP HTTP server runnable accept injected (shared) clients
We want MCP to use the manager’s cache-backed client:
k8sClient := mgr.GetClient()and a clientset built from the same config:
clientset := kubernetes.NewForConfig(mgr.GetConfig())Edits in
internal/app/mcpapp/http.go:RunHTTP(ctx)for standalone mode by delegating:This allows
allappto reuse the same cache-backed client without duplicating MCP’s HTTP lifecycle code.5) Aggregated API server in
allmode: dynamic discovery + operator-token authCurrent state (post-rebase):
internal/aggregated/storage/*).internal/aggregated/coder.ClientProvider(today: a namespace-pinned static provider built from--coder-url,--coder-session-token,--coder-namespaceinapiserverapp.buildClientProvider).Goal for
allmode:--coder-*wiring.CoderControlPlane(s), fetch the operator admin token from the Secret created by the controller, and use it for codersdk calls.spec.operatorAccess.disabled: true, that control plane must be treated as invisible: do not read its Secret, do not call its coderd, and do not return its resources.5.1) Implement a control-plane-backed ClientProvider
Add
internal/aggregated/coder/controlplane_provider.go(name flexible) that uses Kubernetes to build codersdk clients on demand.Recommended interface evolution (minimal but supports LIST across instances):
Provider construction inputs (from
allapp):client.ReaderforCoderControlPlanediscovery (cachedmgr.GetClient()is fine)client.Readerfor Secret reads (prefermgr.GetAPIReader()to avoid caching secret tokens in informers)--coder-request-timeout, default 30s)Eligibility filter (MUST run before any Secret read or coderd call):
cp.Spec.OperatorAccess.Disabled == true⇒ skipcp.Status.OperatorAccessReady != true⇒ skipcp.Status.OperatorTokenSecretRef == nil⇒ skipstrings.TrimSpace(cp.Status.URL) == ""⇒ skipToken fetch:
cp.Namespacecp.Status.OperatorTokenSecretRef.Namecp.Status.OperatorTokenSecretRef.Key(default tocoderv1alpha1.DefaultTokenSecretKeywhen empty)SDK construction:
cp.Status.URLas aurl.URLcoder.NewSDKClient(coder.Config{CoderURL: parsed, SessionToken: token, RequestTimeout: timeout})Error handling rules:
5.2) Disambiguate multiple control planes per namespace (instance selection)
To support “multiple deployments in the same namespace” without collisions, encode the
CoderControlPlanename into aggregated object names.Recommended naming scheme:
<control-plane>.<org>.<user>.<workspace><control-plane>.<org>.<template>Implementation:
internal/aggregated/coder/names.go:api/aggregation/v1alpha1/types.gocomments to match the new naming scheme.aggregation.coder.com/control-plane-name=<cp.Name>aggregation.coder.com/control-plane-namespace=<cp.Namespace>Note: control plane names containing '.'
The aggregated naming scheme uses '.' as a separator. Kubernetes object names may contain '.', so for the initial implementation we will treat any
CoderControlPlanewhose name contains '.' as ineligible for aggregation (skip it) and surface a clear warning/error explaining why it’s being skipped.5.3) Update converters + storage to use the dynamic provider
Update converters:
internal/aggregated/convert/workspace.goandtemplate.go:controlPlaneName stringUpdate storages:
internal/aggregated/storage/workspace.goandtemplate.go:provider.ClientsForNamespace(ctx, namespace)and merge results across all eligible control planesprovider.ClientsForNamespace(ctx, "")and merge across all namespacesnameClientForControlPlanehelper, or by storing the chosen control-plane inctxbefore callingClientForNamespace)Critically: when operator access is disabled, objects from that control plane must not appear in LIST results and must behave like NotFound/BadRequest for other verbs.
5.4) Wire dynamic provider only in
allmodeUpdate
internal/app/apiserverapp/apiserverapp.go:Optionswith a provider override, e.g.ClientProvider coder.ClientProvider.RunWithOptions, if override is non-nil, use it; otherwise keep the existing static-provider behavior for--app=aggregated-apiserver.In
internal/app/allapp/allapp.go:mgr.GetClient()+mgr.GetAPIReader().apiserverapp.RunWithOptions.6) Run aggregated apiserver + MCP as non-leader manager runnables
In
internal/app/allapp/allapp.go:scheme := sharedscheme.New()cfg := ctrl.GetConfigOrDie()mgr, err := controllerapp.NewManager(cfg, scheme)controllerapp.SetupControllers(mgr)controllerapp.SetupProbes(mgr)Because we want it to run on every pod replica (not gated by leader election), wrap it in a type implementing
LeaderElectionRunnable:Then (wait for cache sync and run with the dynamic control-plane provider):
This yields a single top-level blocking call and ensures the MCP server uses the exact same cache and config as the operator.
Note on cache size / informers
Using
mgr.GetClient()in MCP means reads/lists will go through the shared cache and may start informers for object types MCP touches (Pods, Services, Deployments, aggregated API types, etc.). This is desirable for “share caches” but may increase memory usage vs direct REST reads.If this becomes an issue, a follow-up optimization is:
Cache.ReaderFailOnMissingInformer = truemgr.GetAPIReader()for types we explicitly do not want to cache (e.g., Pods)That refinement can be postponed until profiling shows it’s needed.
7) Update app dispatch: add
--app=alland make it defaultEdits in
app_dispatch.go:internal/app/allappimport.supportedAppModesand help text to include:all, controller, aggregated-apiserver, mcp-http.--appdefault"all".case "all": return allapp.Run(setupSignalHandler()).case "": ... requiredbranch (--appis no longer required).--coder-url,--coder-session-token,--coder-namespace,--coder-request-timeout) and their validation logic for--app=aggregated-apiserver.allmode these should not be required; they can be ignored (or later repurposed as global codersdk tuning knobs).8) Deploy manifests: single Deployment/Pod, preserve Services
Add a new unified Deployment manifest and migrate existing Services/selectors.
Recommended approach:
deploy/deployment.yaml(or renamedeploy/controller-deployment.yaml→deploy/deployment.yaml).deploy/apiserver-deployment.yamlanddeploy/mcp-deployment.yamlfrom the “happy path” examples (either delete them or move them underdeploy/legacy/), to avoid users accidentally deploying 3 pods that each run all components.Unified Deployment details:
coder-k8s)ghcr.io/coder/coder-k8s:latestargs:entirely (no--app), relying on the new default--app=all8081(controller probes),6443(aggregated apiserver),8090(MCP)Update
deploy/apiserver-service.yamlselector to match the combined deployment label.Update
deploy/mcp-service.yamlselector to match the combined deployment label.Update
deploy/rbac.yaml:coder-k8s).system:auth-delegator+extension-apiserver-authentication-reader)This yields a single pod with the union of permissions.
:8081.controllerapp.SetupProbes(in all mode only, or generally) with additional readiness checks that confirm:localhost:6443or HTTP GEThttps://localhost:6443/readyzif available)localhost:8090or expose an atomic ready flag)(Implementation choice depends on whether we want to introduce a new composite endpoint vs extending the controller’s existing
/readyz.)coder-k8sto rely on the defaultallmode.config/e2e/deployment.yaml: removeargs: ["--app=controller"]so the container starts in defaultallmode.6443and8090so the manifest matches the unified runtime.9) Tests
Update and add tests to prevent regressions.
main_test.goTestRunRejectsEmptyModewithTestRunDefaultsToAllMode.TestRunDispatchesAllMode.TestRunRejectsUnknownMode.Create
internal/app/allapp/allapp_test.gowith dependency injection to avoid requiring a real Kubernetes cluster:allapp, define package-level vars for constructors used insideRun, e.g.:newManager(wrapscontrollerapp.NewManager)newClientset(wrapskubernetes.NewForConfig)runAggregatedAPIServer(wrapsapiserverapp.RunWithOptions)runMCPHTTPWithClients(wrapsmcpapp.RunHTTPWithClients)In tests, stub these to:
GetClient()pointerGetCache().WaitForCacheSyncreturning truek8sClientpassed intorunMCPHTTPWithClientsmgr.GetClient()This directly enforces the “shared cache client” requirement.
internal/aggregated/coder/*_test.go:spec.operatorAccess.disabled=truewithout reading Secrets or calling coderdstatus.operatorAccessReady=falseinternal/aggregated/storage/*_test.go:10) Validation (when implementing)
Run:
make testmake buildmake lintmake verify-vendorIf deploy manifests are considered part of release artifacts, also validate:
go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.10(only if workflow edits happen)Rollout / migration notes
--app=controller/--app=aggregated-apiserver/--app=mcp-httpremain valid.--appand will run all components.Suggested execution strategy (single agent vs. multiple agents)
Recommendation: implement with a single agent/engineer if possible.
Rationale:
allappwiring + RBAC/manifests) and benefits from tight iteration with end-to-end validation (make test/build/lint).controllerapp/controllerapp.go,app_dispatch.go, deployment YAMLs), creating merge churn.If you do want to parallelize, the safest split is along file ownership boundaries, with one “integration agent” responsible for final wiring + validation:
Agent A (runtime wiring / shared cache)
internal/app/allapp/*internal/app/sharedscheme/*internal/app/controllerapp/*refactor (NewManager,SetupControllers,SetupProbes)Agent B (MCP refactor)
internal/app/mcpapp/http.go(RunHTTPWithClients), keep standalone mode workingAgent C (aggregated API server dynamic provider + multi-instance support)
internal/app/apiserverapp/*(addOptions.ClientProvideroverride; keep static--coder-*mode working)internal/aggregated/coder/*(control-plane-backed provider, naming helpers)internal/aggregated/storage/*+internal/aggregated/convert/*(control-plane-prefixed names; LIST merge across instances; enforce disabled skip)api/aggregation/v1alpha1/types.go+ generated docs if naming comments changeAgent D (dispatch + unit tests + docs)
app_dispatch.go,main_test.go--app=...(if desired in this change)Agent E (manifests/RBAC)
deploy/*.yaml,config/e2e/deployment.yamlIntegration agent (final pass)
allGenerated with
mux• Model:anthropic:claude-opus-4-6• Thinking:xhigh• Cost:$1.51