Disaggregated serving on the unopinionated engine shape #142

Merged
negz merged 1 commit into main from dennis/disagg-unopinionated
Jun 15, 2026

@dennis-upbound dennis-upbound commented Jun 13, 2026

Collaborator

Towards #34.

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape.

A deployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). This follows the merged design: each engine carries its own phase, rather than serving naming the engines, so a phase can't dangle. The scheduler copies serving and the phases onto the ModelReplica. When disaggregated, the replica backend:

  • finds the prefill and decode engines by phase;
  • role-labels their serving pods (llm-d.ai/role);
  • fronts the decode engine with the pd-sidecar (the sidecar takes port 8000 and forwards to the engine on its own --port, with --secure-proxy=false for the HTTP gateway path);
  • replaces the unified Service with an InferencePool over both engines;
  • points the HTTPRoute at the pool behind a hardcoded endpoint picker.
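
For readers who want the opt-in rules spelled out, here is a minimal plain-Python restatement of the three checks listed above. The PR enforces them with CEL on the API definition, not with this code; the function name and the engine dict shape ({"name": ..., "phase": ...}) are assumptions made for illustration.

```python
from collections import Counter

PREFILL_DECODE = "PrefillDecode"


def validate_serving(mode: str, engines: list[dict]) -> list[str]:
    """Return rule violations; an empty list means the spec passes.

    Hypothetical helper: the real checks are CEL rules on the API definition.
    """
    phases = [e.get("phase") for e in engines]
    errors = []
    if mode != PREFILL_DECODE:
        # Rule: phase may be set only under PrefillDecode.
        if any(p is not None for p in phases):
            errors.append("phase may only be set when serving.mode is PrefillDecode")
        return errors
    # Rule: exactly two engines.
    if len(engines) != 2:
        errors.append("PrefillDecode requires exactly two engines")
    # Rule: one engine of each phase.
    counts = Counter(phases)
    if counts["Prefill"] != 1 or counts["Decode"] != 1:
        errors.append("PrefillDecode requires one Prefill and one Decode engine")
    return errors
```

Returning a list of messages rather than raising keeps all violations visible at once, which is roughly how an API server reports multiple failed validation rules.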

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the substrate #137 stripped: the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway delegating InferencePool resolution to the AI Gateway's ext-proc server. So compose-serving-stack installs the AI Gateway (CRDs and controller) and the GAIE InferencePool CRD, and restores Envoy Gateway's extensionManager. Envoy Gateway moves from v1.3.0 to v1.8.1, the released version the AI Gateway is tested against and the first whose chart bundles the ListenerSet CRD the newer gateway requires; v1.3.0 rejected InferencePool backendRefs outright (InvalidKind).
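
As a rough illustration of what "an HTTPRoute to an InferencePool backendRef" means, the sketch below builds such a route as a Python dict. The backendRef fields (group, kind, name) are the standard Gateway API shape; the function name and all resource names are hypothetical, and the InferencePool API group is left as a parameter because it depends on the installed GAIE version, which this description does not state.

```python
def route_to_pool(route_name: str, gateway_name: str,
                  pool_name: str, pool_group: str) -> dict:
    """Build an HTTPRoute manifest whose only backend is an InferencePool.

    Hypothetical helper; pool_group must match the API group of the
    InferencePool CRD actually installed on the serving cluster.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": route_name},
        "spec": {
            # The Gateway this route attaches to.
            "parentRefs": [{"name": gateway_name}],
            "rules": [
                {
                    # A non-core backend kind needs an explicit group and kind.
                    "backendRefs": [
                        {"group": pool_group, "kind": "InferencePool", "name": pool_name}
                    ]
                }
            ],
        },
    }
```

A gateway implementation that does not recognise the InferencePool kind reports the reference as unresolved, which matches the InvalidKind rejection described above for v1.3.0.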

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The routing layer is one module (backends/disagg.py) applied as a post-process over #137's per-engine workloads, so the unified path is untouched.

Validated end to end on a GKE cluster (L4s): the deployment schedules both engines through DRA; both vLLM engines serve and bring up NixlConnector (NIXL is available, kv_role=kv_both, from the user's flags); the endpoint picker reconciles the InferencePool over both endpoints; and a request through the gateway to the InferencePool backendRef (ResolvedRefs=True under v1.8.1) returns a completion from the disaggregated pair.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

@dennis-upbound dennis-upbound force-pushed the dennis/disagg-unopinionated branch from 6726e6c to bb2d9d6 Compare June 14, 2026 15:29
@dennis-upbound dennis-upbound changed the base branch from topological to main June 14, 2026 15:29
@dennis-upbound dennis-upbound force-pushed the dennis/disagg-unopinionated branch 3 times, most recently from 2a70c23 to ba01f3a Compare June 14, 2026 17:33
@dennis-upbound dennis-upbound changed the title WIP: Disaggregated serving on the unopinionated engine shape Disaggregated serving on the unopinionated engine shape Jun 14, 2026
@dennis-upbound dennis-upbound force-pushed the dennis/disagg-unopinionated branch from ba01f3a to 3f9ba7a Compare June 14, 2026 22:47
@dennis-upbound dennis-upbound marked this pull request as ready for review June 15, 2026 14:38
Comment thread apis/modeldeployments/definition.yaml Outdated
Comment thread apis/modeldeployments/definition.yaml Outdated
Comment thread functions/compose-model-deployment/tests/test_fn.py Outdated
@dennis-upbound dennis-upbound force-pushed the dennis/disagg-unopinionated branch 2 times, most recently from 275d569 to 7fd99ac Compare June 15, 2026 18:20

@negz negz left a comment

Collaborator

Looks good overall @dennis-upbound. Only blocker is the comment about the new apply method removing the base routing config. I'd prefer to start with no routing config then pick one to add.

Comment thread apis/modeldeployments/definition.yaml Outdated
Comment on lines +40 to +41
# GIE group the EPP binary registers (inference.networking.x-k8s.io/v1alpha1); an
# unregistered group crash-loops the picker.

Collaborator

This comment feels like the model narrating an edge case it hit. Not sure it's useful.

Collaborator Author

Done. Dropped the "crash-loops the picker" narration; the comment now just states the apiVersion is the GIE group the EPP registers.

Comment thread functions/compose-model-replica/function/backends/disagg.py Outdated
Comment thread functions/compose-model-replica/function/routing.py
Comment on lines +287 to +289
out = dict(composed)
out.pop(base.SERVICE_KEY, None)
out.pop(base.ROUTE_KEY, None)

Collaborator

I don't love that this removes the base's routing.

Could we handle the base/Unified routing concern the same way we handle it here - i.e. with an apply (apply_routing?) that adds it on. Then we wouldn't need to filter it here: we'd instead build the appropriate base backend then decorate it with the appropriate routing stack.

Collaborator Author

Done. The backends build engines only now; a routing layer (function/routing.py) decorates them with the surface serving.mode picks — apply adds a Service for Unified, or the InferencePool + endpoint picker for PrefillDecode. Nothing is stripped anymore.
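
A minimal sketch of the "build engines, then decorate with routing" shape this reply describes, to contrast it with the earlier pop-based filtering. It is not the code in function/routing.py: the resource keys, the stub manifests, and the helper names are all assumptions made for illustration.

```python
# Hypothetical resource keys; the real module defines its own.
SERVICE_KEY = "service"
ROUTE_KEY = "route"
POOL_KEY = "inference-pool"
PICKER_KEY = "endpoint-picker"


def _unified(name: str) -> dict[str, dict]:
    """Stub routing surface for Unified: a Service and a route targeting it."""
    return {
        SERVICE_KEY: {"kind": "Service", "name": name},
        ROUTE_KEY: {"kind": "HTTPRoute", "name": name, "backend": SERVICE_KEY},
    }


def _prefill_decode(name: str) -> dict[str, dict]:
    """Stub surface for PrefillDecode: a pool, a picker, and a route to the pool."""
    return {
        POOL_KEY: {"kind": "InferencePool", "name": name},
        PICKER_KEY: {"kind": "Deployment", "name": f"{name}-epp"},
        ROUTE_KEY: {"kind": "HTTPRoute", "name": name, "backend": POOL_KEY},
    }


_BUILDERS = {"Unified": _unified, "PrefillDecode": _prefill_decode}


def apply(mode: str, engines: dict[str, dict], name: str) -> dict[str, dict]:
    """Return the engines plus the routing resources for mode.

    Only adds keys: the engine resources pass through untouched, so there is
    nothing to strip afterwards.
    """
    if mode not in _BUILDERS:
        raise ValueError(f"unknown serving mode: {mode!r}")
    out = dict(engines)  # shallow copy so the caller's dict is not mutated
    out.update(_BUILDERS[mode](name))
    return out
```

Because each mode contributes its own resources to a routing-free base, adding a third mode would mean adding one builder rather than teaching an existing path what to remove.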

PR #137 reshaped spec.workers into named engine groups and composes each
engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated
prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: PrefillDecode and marks each
engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of
each phase, and phase set only under PrefillDecode). The scheduler copies serving
and the phases onto the ModelReplica.

The workload backends compose engines only; a routing layer then decorates them
with the serving surface serving.mode selects. Unified adds a Service and
HTTPRoute. PrefillDecode finds the prefill and decode engines by phase,
role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode
(the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the
HTTP gateway path works), and fronts both with a GAIE InferencePool and a
hardcoded endpoint picker that sequences prefill then decode.
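
The role-labelling and sidecar injection described above could look roughly like the following. This is a sketch, not the PR's code: the helper names are hypothetical, the sidecar image is supplied by the caller, and running the sidecar as a regular container is an assumption. Only the label key, port 8000, and --secure-proxy=false come from the text; the engine's own port (8001) is deliberately not set here because engine flags stay with the user.

```python
ROLE_LABEL = "llm-d.ai/role"  # label key from the commit message
SIDECAR_PORT = 8000           # port the pd-sidecar takes on the decode pod


def label_role(pod_template: dict, role: str) -> dict:
    """Return a copy of pod_template with the role label set.

    The label values a caller passes (e.g. "prefill", "decode") are an
    assumption; the text names only the label key.
    """
    meta = dict(pod_template.get("metadata", {}))
    labels = dict(meta.get("labels", {}))
    labels[ROLE_LABEL] = role
    meta["labels"] = labels
    return {**pod_template, "metadata": meta}


def add_pd_sidecar(pod_template: dict, image: str) -> dict:
    """Return a copy of pod_template with a pd-sidecar container appended.

    Existing containers, including the engine and its user-supplied flags,
    are left unchanged.
    """
    spec = dict(pod_template.get("spec", {}))
    containers = list(spec.get("containers", []))
    containers.append({
        "name": "pd-sidecar",
        "image": image,
        # Plain HTTP between sidecar and engine so the HTTP gateway path works.
        "args": ["--secure-proxy=false"],
        "ports": [{"containerPort": SIDECAR_PORT}],
    })
    spec["containers"] = containers
    return {**pod_template, "spec": spec}
```

Both helpers copy rather than mutate, which suits a routing layer that decorates workloads composed elsewhere.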

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to
run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy
Gateway configured to delegate InferencePool resolution to the AI Gateway's
ext-proc server; #137 had stripped these. The serving stack now installs the AI
Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures
Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released
version the AI Gateway is tested against, whose chart bundles the ListenerSet
CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's
per #137; this injects none. The image references, the EndpointPickerConfig
apiVersion, and the Envoy Gateway version are the ones validated on a live GKE
cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@dennis-upbound dennis-upbound force-pushed the dennis/disagg-unopinionated branch from 0d2a6cd to 891d51e Compare June 15, 2026 20:25