Disaggregate prefill and decode: prefill block, role-aware scheduler, and InferencePool routing #124
Closed
dennis-upbound wants to merge 11 commits into
LLM inference has two phases with opposite hardware profiles: prefill is compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets ITL. Run on one pod set they contend, and neither can be tuned independently.

This adds Phase 1 of prefill/decode disaggregation (design/disaggregation.md): an optional `prefill` block on ModelDeployment (self-contained workers, topology, template, nodeSelector) plus a routing.template, mirrored onto ModelReplica. The fleet scheduler treats a disaggregated replica as the existing decode placement plus one optional prefill placement, co-located on one InferenceCluster, choosing the (decode_pool, prefill_pool) pair jointly against a single capacity ledger rather than greedily per role; it charges both roles' pools and re-places a replica if either pool drifts.

The llm-d backend emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each pinned to its role's pool with its own ResourceClaimTemplate, distinguished by a modelplane.ai/pd-role label so the decode Service never selects prefill, both mounting the model cache and carrying VLLM_NIXL_SIDE_CHANNEL_HOST via the downward API so NixlConnector can establish the KV transfer. The unified (non-disaggregated) path is unchanged: it is the zero-prefill case of the same code.

Request-level prefill->decode sequencing (the GAIE InferencePool + EPP routing layer) is deferred to Phase 2: a feasibility spike (design/disaggregation-routing-spike.md) found core Envoy Gateway can't serve an InferencePool and the llm-d EPP fronts a decode-only pool with a sidecar handoff, so Phase 2 will move ServingStack to Envoy AI Gateway after confirming the EPP mechanism against upstream source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
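A minimal sketch of what choosing the pair jointly against one ledger can look like, as opposed to placing each role greedily. The function name `place_pair`, the GPU-count ledger shape, and the rule that both roles may share a pool are assumptions for illustration, not the scheduler's actual code.

```python
from itertools import product
from typing import Dict, Optional, Tuple


def place_pair(
    ledger: Dict[str, int], decode_gpus: int, prefill_gpus: int
) -> Optional[Tuple[str, str]]:
    """Pick a (decode_pool, prefill_pool) pair that fits as a whole.

    `ledger` maps pool name -> free GPUs (hypothetical shape). Both roles
    are checked and charged against the same ledger, so a choice for one
    role can never starve the other.
    """
    for decode_pool, prefill_pool in product(sorted(ledger), repeat=2):
        if decode_pool == prefill_pool:
            # Assumption: the two roles may share a pool if it fits both.
            fits = ledger[decode_pool] >= decode_gpus + prefill_gpus
        else:
            fits = (
                ledger[decode_pool] >= decode_gpus
                and ledger[prefill_pool] >= prefill_gpus
            )
        if fits:
            # Charge both roles' pools in one step.
            ledger[decode_pool] -= decode_gpus
            ledger[prefill_pool] -= prefill_gpus
            return decode_pool, prefill_pool
    return None


# A greedy first-fit would put decode on pool-a and leave no pool large
# enough for prefill; the joint search finds the feasible pairing.
ledger = {"pool-a": 4, "pool-b": 2}
pair = place_pair(ledger, decode_gpus=2, prefill_gpus=4)
```

In this example the only feasible assignment puts decode on the smaller pool, which a per-role greedy pass would miss.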
Before writing disaggregated routing code, three questions from the Phase 1 spike were unresolved or low confidence. This spike confirms all three from primary source.

Header: x-prefiller-host-port (defined in pkg/common/routing/common.go). The EPP's disagg-profile-handler injects it; the pd-sidecar on the decode pod reads it and forwards prefill to the named host:port. The standalone routing-sidecar repo is deprecated; code lives under cmd/pd-sidecar and pkg/sidecar in the inference-scheduler repo.

Pod discovery: a single InferencePool selects all pods (prefill + decode) by a shared label. The EPP partitions the set at scheduling time using prefill-filter and decode-filter plugins, both keyed on llm-d.ai/role. Prefill pods need llm-d.ai/role: prefill; decode pods need decode. The label-selector-filter plugin can read modelplane.ai/pd-role instead if we want to avoid adding llm-d-native labels.

Envoy AI Gateway: v0.7.0 shipped June 4, 2026 (v1.0 GA targets June 30). It supports InferencePool (inference.networking.k8s.io/v1) as an HTTPRoute backendRef. It runs on top of gateway-helm via extensionManager hooks, not as a replacement GatewayClass; controllerName stays gateway.envoyproxy.io/gatewayclass-controller. ServingStack needs three charts: gateway-helm (with AI Gateway extension values), ai-gateway-crds-helm, and ai-gateway-helm, plus the GAIE v1.0.1 manifests.

Towards #34.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
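To make the pod-discovery finding concrete, here is a rough Python model of one pool being split by role label at scheduling time. It is only an illustration of the idea: the real prefill-filter and decode-filter plugins live in the EPP, and the pod dictionaries and the `partition_by_role` helper are hypothetical.

```python
from typing import Dict, List, Tuple

ROLE_LABEL = "llm-d.ai/role"  # label key named in the spike notes


def partition_by_role(
    pods: List[Dict], role_label: str = ROLE_LABEL
) -> Tuple[List[str], List[str]]:
    """Split one pool's pods into (prefill, decode) names by a role label."""
    prefill: List[str] = []
    decode: List[str] = []
    for pod in pods:
        role = pod.get("labels", {}).get(role_label)
        if role == "prefill":
            prefill.append(pod["name"])
        elif role == "decode":
            decode.append(pod["name"])
        # Pods without a recognised role are ignored by both filters.
    return prefill, decode


pool = [
    {"name": "m-decode-0", "labels": {"llm-d.ai/role": "decode"}},
    {"name": "m-prefill-0", "labels": {"llm-d.ai/role": "prefill"}},
    {"name": "m-other-0", "labels": {}},
]
prefill_pods, decode_pods = partition_by_role(pool)
```

Passing `role_label="modelplane.ai/pd-role"` models the label-selector-filter alternative mentioned above.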
The design settled on no guessing for a disaggregated deployment: the prefill:decode ratio and the routing EPP are stated, not defaulted. Drop the schema default on workers.count and prefill.workers.count (symmetrically, so the prefill workers schema stays byte-identical to decode and codegen still deduplicates them) so the both-counts CEL rule actually rejects an omitted count, and add a rule requiring routing when prefill is set. A unified deployment is unaffected: the rules are guarded on has(prefill), and the composition function materializes an omitted count as 1 onto the ModelReplica, which keeps the replica self-describing rather than relying on a re-default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
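The rules read roughly as follows when written out in Python. This is a sketch of the intent only: the real enforcement is CEL on the CRD plus the composition function, and `validate_and_materialize` and the dict-shaped spec are assumptions.

```python
def validate_and_materialize(spec: dict) -> dict:
    """Mirror the validation described above on a plain-dict spec."""
    workers = dict(spec.get("workers", {}))
    prefill = spec.get("prefill")

    if prefill is None:
        # Unified deployment: an omitted count is materialized as 1.
        workers.setdefault("count", 1)
        return {**spec, "workers": workers}

    errors = []
    # Disaggregated: both counts must be stated, never defaulted.
    if "count" not in workers or "count" not in prefill.get("workers", {}):
        errors.append("workers.count and prefill.workers.count are required")
    # Disaggregated: routing must be present.
    if "routing" not in spec:
        errors.append("routing is required when prefill is set")
    if errors:
        raise ValueError("; ".join(errors))
    return spec


unified = validate_and_materialize({"workers": {}})
```

A disaggregated spec that omits either count or `routing` raises instead of silently picking a ratio.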
The GAIE EPP selects pods using llm-d-native labels that Modelplane was not emitting. For disaggregated replicas, stamp both the decode and prefill LeaderWorkerSet pod templates (leader and worker) with:

- app: <replica-name> (shared InferencePool selector)
- llm-d.ai/inference-serving: "true" (shared EPP selector)
- llm-d.ai/role: "decode"/"prefill" (per-role EPP filter)

The app label carries the decode replica name on both roles so a single InferencePool spec.selector matches both pod sets. Unified (non-disaggregated) replicas are untouched. The existing modelplane.ai/pd-role labels are preserved.

New constants LABEL_LLMD_ROLE and LABEL_LLMD_SERVING are added to base.py; the disagg-only decode_extra dict and the prefill_labels dict in llmd.py are extended to carry all three labels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
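As an illustration, the three labels could be assembled like this. The constant names come from the commit message; their string values are assumed to equal the label keys listed above, and the `disagg_labels` helper is hypothetical rather than the code in llmd.py.

```python
LABEL_LLMD_ROLE = "llm-d.ai/role"  # assumed value of the new constant
LABEL_LLMD_SERVING = "llm-d.ai/inference-serving"  # assumed value


def disagg_labels(decode_replica_name: str, role: str) -> dict:
    """Labels stamped on a disaggregated pod template for one role."""
    if role not in ("decode", "prefill"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        # Same value on both roles so one InferencePool selector matches both.
        "app": decode_replica_name,
        LABEL_LLMD_SERVING: "true",
        LABEL_LLMD_ROLE: role,
    }


decode_extra = disagg_labels("llama-0", "decode")
prefill_labels = disagg_labels("llama-0", "prefill")
```

The two dicts differ only in the role label, which is what lets the pool select both pod sets while the EPP still tells them apart.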
adf3e32 to 2fc3e35
Records what compose-model-replica must emit for a disaggregated replica (the ten EPP/InferencePool Objects, since Modelplane runs no per-model Helm install), the EPP container args/env/config, the gateway-helm values overrides the InferencePool backendRef needs, and the pinned pd-sidecar and endpoint-picker images. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
2fc3e35 to 811cea4
On a disaggregated replica the pd-sidecar (llm-d-inference-scheduler-disagg-sidecar:v0.8.0) intercepts port 8000 on the decode leader pod, forwards the prompt to the prefill instance for KV transfer, then proxies to local vLLM which has moved to port 8001. Service targetPort stays 8000 (the sidecar's listen port) so no routing change is needed. Unified and prefill pods are unchanged.

- base.py: add PD_SIDECAR_IMAGE, _DECODE_ENGINE_PORT=8001, pd_sidecar_container()
- llmd.py: extract _build_commands() helper; thread engine_serving_port (8001 when disagg, 8000 otherwise) through the container closure's ports+readinessProbe; inject --port=8001 into turnkey vLLM args; append pd_sidecar_container() to the decode leader pod only when disagg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
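The port arrangement can be sketched as below. `_DECODE_ENGINE_PORT` is named in the commit message; `SERVING_PORT`, `engine_serving_port()` as a function, and `vllm_port_args()` are hypothetical stand-ins for how llmd.py threads the value through.

```python
SERVING_PORT = 8000  # Service targetPort; the sidecar listens here when disagg
_DECODE_ENGINE_PORT = 8001  # where vLLM moves on a disaggregated decode pod


def engine_serving_port(role: str, disagg: bool) -> int:
    """Port vLLM itself binds to for a given role."""
    if disagg and role == "decode":
        return _DECODE_ENGINE_PORT
    # Unified and prefill pods are unchanged.
    return SERVING_PORT


def vllm_port_args(role: str, disagg: bool) -> list:
    """Extra vLLM args needed only when the engine leaves the default port."""
    port = engine_serving_port(role, disagg)
    return [f"--port={port}"] if port != SERVING_PORT else []


decode_args = vllm_port_args("decode", disagg=True)
```

Because the Service keeps targeting 8000, only the engine's own ports and readiness probe need to follow the moved port.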
For disaggregated replicas the HTTPRoute previously targeted the decode
Service directly, bypassing the EPP. Now:
- emit an `InferencePool` (inference.networking.k8s.io/v1) at key
`inference-pool`; its selector matches both prefill and decode pods via
the shared app:<name> + llm-d.ai/inference-serving:"true" labels, and
its endpointPickerRef names the EPP Service (<name>-epp:9002) with
failureMode:FailOpen so a transient EPP outage never black-holes traffic
- flip the disagg HTTPRoute's backendRefs from `{name:<name>,port:80}` to
`{group:inference.networking.k8s.io,kind:InferencePool,name:<name>-pool}`
so the GAIE EPP intercepts every request for KV-/prefix-aware routing
- unified replicas keep `HTTPRoute -> Service` untouched; the new logic is
gated on the existing `disagg` flag so there is no unified-path change
The decode Service is retained: the InferencePool/EPP and the pods
themselves still need it.
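The backendRef flip amounts to the following, shown as a small Python sketch. The two ref shapes are taken from the bullets above; the `route_backend_refs` function name is hypothetical.

```python
def route_backend_refs(name: str, disagg: bool) -> list:
    """backendRefs for a replica's HTTPRoute rule."""
    if disagg:
        # Disaggregated: target the InferencePool so the EPP sees every request.
        return [
            {
                "group": "inference.networking.k8s.io",
                "kind": "InferencePool",
                "name": f"{name}-pool",
            }
        ]
    # Unified: HTTPRoute -> Service, untouched.
    return [{"name": name, "port": 80}]


disagg_refs = route_backend_refs("llama-0", disagg=True)
unified_refs = route_backend_refs("llama-0", disagg=False)
```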
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
failureMode is a field of endpointPickerRef in the GAIE v1 InferencePool schema, not of the pool spec; emitting it at spec level would be rejected by the apiserver or silently drop to FailClose, black-holing decode traffic when the EPP is briefly unavailable. Move it inside endpointPickerRef and assert the placement. Also refresh the module docstring, which still claimed the backend only ever emits HTTPRoute->Service — it now fronts disaggregated pods with an InferencePool + EPP. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
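For reference, a sketch of an InferencePool body with `failureMode` in the corrected position. The apiVersion, the names, port 9002, and the selector labels come from the commit messages in this PR; the exact shapes of `selector.matchLabels`, `targetPorts`, and `port.number` are my reading of the v1 schema and should be checked against the GAIE CRD. The `inference_pool` helper is hypothetical.

```python
def inference_pool(name: str, namespace: str) -> dict:
    """Build an InferencePool manifest as a plain dict."""
    return {
        "apiVersion": "inference.networking.k8s.io/v1",
        "kind": "InferencePool",
        "metadata": {"name": f"{name}-pool", "namespace": namespace},
        "spec": {
            # Assumed shape: one selector matching prefill and decode pods.
            "selector": {
                "matchLabels": {
                    "app": name,
                    "llm-d.ai/inference-serving": "true",
                }
            },
            # Assumed shape: the port the pods serve on.
            "targetPorts": [{"number": 8000}],
            "endpointPickerRef": {
                "name": f"{name}-epp",
                "port": {"number": 9002},
                # Belongs here, not at spec level.
                "failureMode": "FailOpen",
            },
        },
    }


pool = inference_pool("llama-0", "default")
```

An assertion on the placement, like the one the commit adds, guards against the field drifting back to spec level.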
811cea4 to 8f61a37
Collaborator
Superseded by #142
Fixes #34.
LLM inference has two phases with opposite hardware profiles: prefill is compute-bound and sets TTFT, decode is memory-bandwidth-bound and sets ITL. Run on one pod set they contend, and neither can be tuned independently. Modelplane had no way to split them.
This serves the two phases as separate, co-located pod sets with the KV cache transferred over NIXL, and sequences each request prefill→decode through a GAIE `InferencePool` and an endpoint-picker. It implements the design from #116.

A deployment opts in by declaring a `prefill` block alongside `workers` (now the decode role), plus a `routing` block that supplies the endpoint-picker. For a disaggregated deployment both worker counts and `routing` must be explicit, enforced by CEL; a deployment with no `prefill` block is unified serving and routes exactly as before.

The scheduler (`compose-model-deployment`) treats a disaggregated replica as the existing decode placement plus one optional prefill placement, co-located on one InferenceCluster. It chooses the `(decode_pool, prefill_pool)` pair jointly against a single capacity ledger rather than greedily per role, charges both pools, and re-places the replica if either pool drifts.

The llm-d backend (`compose-model-replica`) emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each role-labeled with `llm-d.ai/role` and pinned to its pool with its own `ResourceClaimTemplate`, both carrying `VLLM_NIXL_SIDE_CHANNEL_HOST` via the downward API so NixlConnector can establish the transfer. It injects the pd-sidecar on decode pods (vLLM moves to 8001, the sidecar takes 8000), emits the `InferencePool` and the EPP stack (Deployment, Service, ConfigMap, RBAC) built from `routing.template`, and points the disaggregated `HTTPRoute` at the pool instead of the Service.

ServingStack installs Envoy AI Gateway (v0.7.0) and the GAIE CRDs, because core Envoy Gateway can't serve an `InferencePool`; the Phase 1 spike (design/disaggregation-routing-spike.md) established this and the follow-up spike confirmed the mechanism against upstream source.

The unified path is the zero-prefill case of the same code, unchanged throughout.
Validated on real GKE L4 GPUs, which caught four bugs unit tests couldn't (the expected manifests encoded the same wrong values), all fixed here:

- […] `ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0` and `ghcr.io/llm-d/llm-d-routing-sidecar:v0.8.0`; the EPP image now pulls and runs.
- `EndpointPickerConfig` used `apiVersion: llm-d.ai/v1alpha1`, which the EPP binary doesn't register (crash-loop). Corrected to `inference.networking.x-k8s.io/v1alpha1`; the EPP now parses its config, runs, and reconciles the `InferencePool` (tracking both the decode and prefill endpoints).
- […] `--kv-transfer-config`, so no prefill→decode KV handoff could occur. Both roles now pass `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'` (NixlConnector doesn't distinguish kv_role; the routing sidecar drives direction). Confirmed live: both engines come up with the connector configured and log `NIXL is available`.
- […] `--secure-proxy` defaults true), so the HTTP readiness probe and the HTTP gateway path were rejected and the decode pod never became Ready. Now passes `--secure-proxy=false` (the Modelplane serving path is HTTP throughout).

With these, both vLLM roles run on L4 with NixlConnector active and the model serves real completions, e.g. a chat request returned `Hello! How can I assist you today?`. The EPP is healthy and tracks both endpoints.

Not yet confirmed: a single request driven end to end through the gateway (`HTTPRoute → InferencePool → EPP → decode`, sidecar pulling KV from prefill). On the test cluster the `InferencePool` backendRef didn't program, but the substrate had been disturbed during debugging, so this is inconclusive rather than a confirmed limitation and needs a clean re-run. Envoy AI Gateway is pinned at v0.7.0 pending its v1.0.0 GA (~end of June).

Follow-ups, out of scope and pre-existing: (1) a product-provisioned GKE cluster brings GPU pools up on the device-plugin path while workloads claim GPUs only via the `gpu.nvidia.com` DRA driver, so GKE-side DRA provisioning (driver-root, device-plugin) needs validating (this affects unified serving too); (2) the LWS bootstrap runs `ray start` unconditionally, but stock `vllm/vllm-openai` ships no ray and single-node replicas don't need it.

I have:

- […] `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- […] `git commit -s`.