Disaggregated serving on the unopinionated engine shape by dennis-upbound · Pull Request #142 · modelplaneai/modelplane

dennis-upbound · 2026-06-13T21:06:10Z

Towards #34.

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape.

A deployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). This follows the merged design: the engine carries its own phase rather than serving naming engines, so a phase can't dangle. The scheduler copies serving and the phases onto the ModelReplica. When the replica is disaggregated, the replica backend finds the prefill and decode engines by phase and role-labels their serving pods (llm-d.ai/role). It fronts the decode engine with the pd-sidecar: the sidecar takes port 8000 and forwards to the engine on its own --port, with --secure-proxy=false for the HTTP gateway path. It then replaces the unified Service with an InferencePool over both engines and points the HTTPRoute at the pool behind a hardcoded endpoint picker.
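For readers who want the opt-in rules spelled out, here is a minimal Python sketch of the three constraints the CEL check enforces. It is an illustration only, not the CEL expression in the definition: the function name `validate_serving`, the name-to-phase mapping, and the error strings are all hypothetical.

```python
from typing import Optional


def validate_serving(mode: str, engine_phases: dict[str, Optional[str]]) -> list[str]:
    """Return validation errors; an empty list means the spec is acceptable.

    engine_phases maps a (hypothetical) engine name to its phase,
    or None when the engine sets no phase.
    """
    errors: list[str] = []
    set_phases = {name: p for name, p in engine_phases.items() if p is not None}

    if mode != "PrefillDecode":
        # Rule 3: phase may only be set under PrefillDecode.
        for name in sorted(set_phases):
            errors.append(
                f"engine {name!r}: phase is only allowed when serving.mode is PrefillDecode"
            )
        return errors

    # Rule 1: exactly two engines.
    if len(engine_phases) != 2:
        errors.append("PrefillDecode requires exactly two engines")
    # Rule 2: one engine of each phase.
    if sorted(set_phases.values()) != ["Decode", "Prefill"]:
        errors.append("PrefillDecode requires one Prefill engine and one Decode engine")
    return errors
```

Because each engine carries its own phase, the check never has to resolve a name reference, which is the "can't dangle" property described above.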

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the substrate #137 stripped: the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway delegating InferencePool resolution to the AI Gateway's ext-proc server. So compose-serving-stack installs the AI Gateway (CRDs and controller) and the GAIE InferencePool CRD, and restores Envoy Gateway's extensionManager. Envoy Gateway moves from v1.3.0 to v1.8.1, the released version the AI Gateway is tested against and the first whose chart bundles the ListenerSet CRD the newer gateway requires; v1.3.0 rejected InferencePool backendRefs outright (InvalidKind).
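To make the backendRef concrete, the sketch below builds an HTTPRoute whose backend is an InferencePool rather than a Service. The helper name `http_route_to_pool` and its parameters are hypothetical, and the default `pool_group` is an assumption: this PR names that group only for the EndpointPickerConfig, so the group the InferencePool CRD actually uses should be checked against the installed GAIE version.

```python
def http_route_to_pool(
    name: str,
    gateway: str,
    pool: str,
    pool_group: str = "inference.networking.x-k8s.io",  # assumption, see lead-in
) -> dict:
    """Build an HTTPRoute manifest that routes to an InferencePool."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            # The Gateway this route attaches to.
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    "backendRefs": [
                        {
                            # A non-core backend kind must name its API group.
                            "group": pool_group,
                            "kind": "InferencePool",
                            "name": pool,
                        }
                    ]
                }
            ],
        },
    }
```

A gateway that does not recognise the InferencePool kind cannot resolve this reference, which matches the InvalidKind rejection reported for v1.3.0.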

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The routing layer is one module (backends/disagg.py) applied as a post-process over #137's per-engine workloads, so the unified path is untouched.

Validated end to end on a GKE cluster (L4s): the deployment schedules both engines through DRA; both vLLM engines serve and bring up NixlConnector (NIXL is available, kv_role=kv_both, from the user's flags); the endpoint picker reconciles the InferencePool over both endpoints; and a request through the gateway to the InferencePool backendRef (ResolvedRefs=True under v1.8.1) returns a completion from the disaggregated pair.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

negz

Looks good overall @dennis-upbound. Only blocker is the comment about the new apply method removing the base routing config. I'd prefer to start with no routing config then pick one to add.

negz · 2026-06-15T19:07:33Z

+# GIE group the EPP binary registers (inference.networking.x-k8s.io/v1alpha1); an
+# unregistered group crash-loops the picker.


This comment feels like the model narrating an edge case it hit. Not sure it's useful.

Done. Dropped the "crash-loops the picker" narration; the comment now just states the apiVersion is the GIE group the EPP registers.

negz · 2026-06-15T19:18:05Z

+    out = dict(composed)
+    out.pop(base.SERVICE_KEY, None)
+    out.pop(base.ROUTE_KEY, None)


I don't love that this removes the base's routing.

Could we handle the base/Unified routing concern the same way we handle it here - i.e. with an apply (apply_routing?) that adds it on. Then we wouldn't need to filter it here: we'd instead build the appropriate base backend then decorate it with the appropriate routing stack.

Done. The backends build engines only now; a routing layer (function/routing.py) decorates them with the surface serving.mode picks — apply adds a Service for Unified, or the InferencePool + endpoint picker for PrefillDecode. Nothing is stripped anymore.
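The decorate-instead-of-strip shape can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in function/routing.py: the name `apply_routing`, the resource keys, and the stub resource bodies are hypothetical.

```python
def apply_routing(engines: dict[str, dict], mode: str) -> dict[str, dict]:
    """Return the composed engines plus the routing resources for `mode`.

    The backend's output is copied and only added to, never filtered.
    """
    out = dict(engines)  # shallow copy; the caller's mapping is left untouched

    if mode == "Unified":
        # Unified: a plain Service fronts the engines.
        out["service"] = {"kind": "Service"}
        backend = "service"
    elif mode == "PrefillDecode":
        # PrefillDecode: an InferencePool plus an endpoint picker.
        out["inference-pool"] = {"kind": "InferencePool"}
        out["endpoint-picker"] = {"kind": "Deployment"}
        backend = "inference-pool"
    else:
        raise ValueError(f"unknown serving mode: {mode!r}")

    # Both modes expose an HTTPRoute; only its backend differs.
    out["route"] = {"kind": "HTTPRoute", "backend": backend}
    return out
```

With this shape each mode contributes only the keys it owns, so no `pop` on the base backend's keys is needed.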

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica.

The workload backends compose engines only; a routing layer then decorates them with the serving surface serving.mode selects. Unified adds a Service and HTTPRoute. PrefillDecode finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), and fronts both with a GAIE InferencePool and a hardcoded endpoint picker that sequences prefill then decode.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none.

The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound force-pushed the dennis/disagg-unopinionated branch from 6726e6c to bb2d9d6 Compare June 14, 2026 15:29

dennis-upbound changed the base branch from topological to main June 14, 2026 15:29

dennis-upbound force-pushed the dennis/disagg-unopinionated branch 3 times, most recently from 2a70c23 to ba01f3a Compare June 14, 2026 17:33

dennis-upbound changed the title ~~WIP: Disaggregated serving on the unopinionated engine shape~~ Disaggregated serving on the unopinionated engine shape Jun 14, 2026

dennis-upbound force-pushed the dennis/disagg-unopinionated branch from ba01f3a to 3f9ba7a Compare June 14, 2026 22:47

dennis-upbound marked this pull request as ready for review June 15, 2026 14:38

negz reviewed Jun 15, 2026

Comment thread apis/modeldeployments/definition.yaml Outdated

negz reviewed Jun 15, 2026

Comment thread apis/modeldeployments/definition.yaml Outdated

negz reviewed Jun 15, 2026

Comment thread functions/compose-model-deployment/tests/test_fn.py Outdated

dennis-upbound force-pushed the dennis/disagg-unopinionated branch 2 times, most recently from 275d569 to 7fd99ac Compare June 15, 2026 18:20

negz reviewed Jun 15, 2026

negz mentioned this pull request Jun 15, 2026

Disaggregate prefill and decode: prefill block, role-aware scheduler, and InferencePool routing #124 (Closed, 4 tasks)

dennis-upbound force-pushed the dennis/disagg-unopinionated branch from 0d2a6cd to 891d51e Compare June 15, 2026 20:25

negz approved these changes Jun 15, 2026

negz merged commit 89f575c into main Jun 15, 2026
3 checks passed

negz deleted the dennis/disagg-unopinionated branch June 16, 2026 16:57
