Disaggregated serving on the unopinionated engine shape #142
negz
left a comment
Looks good overall @dennis-upbound. Only blocker is the comment about the new apply method removing the base routing config. I'd prefer to start with no routing config then pick one to add.
# GIE group the EPP binary registers (inference.networking.x-k8s.io/v1alpha1); an
# unregistered group crash-loops the picker.
This comment feels like the model narrating an edge case it hit. Not sure it's useful.
Done. Dropped the "crash-loops the picker" narration; the comment now just states the apiVersion is the GIE group the EPP registers.
out = dict(composed)
out.pop(base.SERVICE_KEY, None)
out.pop(base.ROUTE_KEY, None)
I don't love that this removes the base's routing.
Could we handle the base/Unified routing concern the same way we handle it here - i.e. with an apply (apply_routing?) that adds it on. Then we wouldn't need to filter it here: we'd instead build the appropriate base backend then decorate it with the appropriate routing stack.
Done. The backends build engines only now; a routing layer (function/routing.py) decorates them with the surface serving.mode picks — apply adds a Service for Unified, or the InferencePool + endpoint picker for PrefillDecode. Nothing is stripped anymore.
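The decorate-instead-of-strip shape discussed in this thread can be sketched roughly as follows. This is an illustration only, not the code in function/routing.py: the resource keys, the function names, and the placeholder resource bodies are all assumptions, and the HTTPRoute is left out for brevity.

```python
# Hypothetical resource keys; the real module may name these differently.
SERVICE_KEY = "service"
POOL_KEY = "inference-pool"
PICKER_KEY = "endpoint-picker"


def apply_unified(engines: dict, name: str) -> dict:
    """Return the engine resources plus a single Service in front of them."""
    out = dict(engines)  # copy, so the caller's engine-only dict is untouched
    out[SERVICE_KEY] = {"kind": "Service", "metadata": {"name": name}}
    return out


def apply_prefill_decode(engines: dict, name: str) -> dict:
    """Return the engine resources plus an InferencePool and endpoint picker."""
    out = dict(engines)
    out[POOL_KEY] = {"kind": "InferencePool", "metadata": {"name": name}}
    # Placeholder body: the picker's real kind and spec are not shown here.
    out[PICKER_KEY] = {"kind": "Deployment", "metadata": {"name": f"{name}-epp"}}
    return out


# serving.mode selects which routing surface decorates the engines.
ROUTING = {"Unified": apply_unified, "PrefillDecode": apply_prefill_decode}


def apply_routing(engines: dict, name: str, mode: str = "Unified") -> dict:
    """Add the routing surface for `mode`; nothing is ever removed."""
    try:
        decorate = ROUTING[mode]
    except KeyError:
        raise ValueError(f"unknown serving.mode: {mode!r}") from None
    return decorate(engines, name)
```

Because each routing function only adds keys to a copy, the backend output never needs filtering, which is the property the review asked for.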
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica. The workload backends compose engines only; a routing layer then decorates them with the serving surface serving.mode selects. Unified adds a Service and HTTPRoute. PrefillDecode finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), and fronts both with a GAIE InferencePool and a hardcoded endpoint picker that sequences prefill then decode.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
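The role-labelling and decode-side sidecar injection described in the commit message might look roughly like this. It is a sketch under assumptions: the helper names, the role value "decode", and the sidecar image are hypothetical, and only the llm-d.ai/role label key, port 8000, and --secure-proxy=false come from the text above. How the sidecar is told where the engine listens is not shown, because the commit message does not name that setting.

```python
import copy

ROLE_LABEL = "llm-d.ai/role"  # label key named in the commit message
SIDECAR_PORT = 8000           # the sidecar takes 8000; the engine moves to 8001
# Hypothetical placeholder; the real image reference is not given here.
SIDECAR_IMAGE = "example.invalid/pd-sidecar:placeholder"


def label_role(pod_template: dict, role: str) -> dict:
    """Return a copy of the pod template carrying the role label."""
    out = copy.deepcopy(pod_template)
    out.setdefault("metadata", {}).setdefault("labels", {})[ROLE_LABEL] = role
    return out


def inject_sidecar(pod_template: dict) -> dict:
    """Return a copy of the pod template with the pd-sidecar appended."""
    out = copy.deepcopy(pod_template)
    containers = out.setdefault("spec", {}).setdefault("containers", [])
    containers.append({
        "name": "pd-sidecar",
        "image": SIDECAR_IMAGE,
        # Only the flag the commit message names; other arguments are omitted.
        "args": ["--secure-proxy=false"],
        "ports": [{"containerPort": SIDECAR_PORT}],
    })
    return out


# Usage: decorate a decode engine's pod template (role value is an assumption).
decode_template = inject_sidecar(
    label_role({"spec": {"containers": [{"name": "vllm"}]}}, "decode")
)
```

Both helpers deep-copy their input, so the engine-only workload produced by the backend stays unchanged for the unified path.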
Towards #34.
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape.

A deployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). This follows the merged design: the engine carries its own phase rather than serving naming engines, so a phase can't dangle. The scheduler copies serving and the phases onto the ModelReplica. The replica backend, when disaggregated, finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), fronts the decode engine with the pd-sidecar (the sidecar takes port 8000 and forwards to the engine on its own --port, with --secure-proxy=false for the HTTP gateway path), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool behind a hardcoded endpoint picker.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the substrate #137 stripped: the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway delegating InferencePool resolution to the AI Gateway's ext-proc server. So compose-serving-stack installs the AI Gateway (CRDs and controller) and the GAIE InferencePool CRD, and restores Envoy Gateway's extensionManager. Envoy Gateway moves from v1.3.0 to v1.8.1, the released version the AI Gateway is tested against and the first whose chart bundles the ListenerSet CRD the newer gateway requires; v1.3.0 rejected InferencePool backendRefs outright (InvalidKind).

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The routing layer is one module (backends/disagg.py) applied as a post-process over #137's per-engine workloads, so the unified path is untouched.

Validated end to end on a GKE cluster (L4s): the deployment schedules both engines through DRA; both vLLM engines serve and bring up NixlConnector (NIXL is available, kv_role=kv_both, from the user's flags); the endpoint picker reconciles the InferencePool over both endpoints; and a request through the gateway to the InferencePool backendRef (ResolvedRefs=True under v1.8.1) returns a completion from the disaggregated pair.

I have:
nix flake check (or ./nix.sh flake check) and made sure it passes.
git commit -s.
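The phase constraints in the description above (exactly two engines, one of each phase, and phase set only under PrefillDecode) are enforced with CEL in the PR. As a plain-Python mirror of that rule, assuming each engine is a dict with an optional "phase" key, the check could be sketched as below; the function name and error strings are hypothetical.

```python
def validate_phases(mode: str, engines: list[dict]) -> list[str]:
    """Return human-readable violations of the phase rules (empty if valid)."""
    errors = []
    phases = [e.get("phase") for e in engines]

    if mode != "PrefillDecode":
        # Outside PrefillDecode no engine may carry a phase.
        if any(p is not None for p in phases):
            errors.append("phase may only be set when serving.mode is PrefillDecode")
        return errors

    if len(engines) != 2:
        errors.append("PrefillDecode requires exactly two engines")
    # Missing phases sort as "" so the comparison never sees None.
    if sorted(p or "" for p in phases) != ["Decode", "Prefill"]:
        errors.append("PrefillDecode requires one Prefill and one Decode engine")
    return errors
```

Keeping the phase on the engine itself means there is no separate name reference to validate, which matches the "a phase can't dangle" reasoning in the description.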