Disaggregate prefill and decode: prefill block, role-aware scheduler, and InferencePool routing by dennis-upbound · Pull Request #124 · modelplaneai/modelplane

dennis-upbound · 2026-06-11T00:57:10Z

Fixes #34.

LLM inference has two phases with opposite hardware profiles: prefill is compute-bound and sets time to first token (TTFT), while decode is memory-bandwidth-bound and sets inter-token latency (ITL). Run on one pod set, they contend for the same hardware and neither can be tuned independently. Modelplane had no way to split them.

This serves the two phases as separate, co-located pod sets with the KV cache transferred over NIXL, and sequences each request prefill→decode through a GAIE InferencePool and an endpoint-picker. It implements the design from #116.

A deployment opts in by declaring a prefill block alongside workers (now the decode role), plus a routing block that supplies the endpoint-picker. For a disaggregated deployment both worker counts and routing must be explicit, enforced by CEL; a deployment with no prefill block is unified serving and routes exactly as before.

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
spec:
  model: { ... }
  workers:                 # decode role
    count: 2
    topology: { tensor: 1 }
    template: { spec: { containers: [{ name: engine, image: ... }] } }
  prefill:                 # new: a second, independently-sized role
    workers:
      count: 1
      topology: { tensor: 1 }
      template: { spec: { containers: [{ name: engine, image: ... }] } }
    nodeSelector: { ... }
  routing:                 # new: the endpoint-picker
    template:
      spec:
        containers:
        - name: epp        # image/args optional; defaults to the pinned llm-d EPP

The scheduler (compose-model-deployment) treats a disaggregated replica as the existing decode placement plus one optional prefill placement, co-located on one InferenceCluster. It chooses the (decode_pool, prefill_pool) pair jointly against a single capacity ledger rather than greedily per role, charges both pools, and re-places the replica if either pool drifts.
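
To illustrate why the pair is chosen jointly rather than greedily per role, here is a minimal sketch. It is not the code in compose-model-deployment: `Pool`, `place_replica`, and the GPU-count capacity model are hypothetical names and simplifications, and letting one pool host both roles is an assumption.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Pool:
    """Hypothetical ledger entry: free capacity of one pool on one cluster."""
    name: str
    cluster: str
    free_gpus: int


def place_replica(
    pools: list[Pool], decode_gpus: int, prefill_gpus: int
) -> Optional[tuple[str, Optional[str]]]:
    """Pick a (decode_pool, prefill_pool) pair on one cluster, or None.

    prefill_gpus == 0 is the unified case: only a decode pool is chosen.
    """
    if prefill_gpus == 0:
        for pool in pools:
            if pool.free_gpus >= decode_gpus:
                return (pool.name, None)
        return None

    # Consider every ordered pair against the same ledger, so a pool that
    # would fit decode is not claimed before prefill has been considered.
    for decode, prefill in product(pools, pools):
        if decode.cluster != prefill.cluster:
            continue  # both roles must be co-located on one cluster
        if decode.name == prefill.name:
            # Assumption: one pool may host both roles if it covers both.
            fits = decode.free_gpus >= decode_gpus + prefill_gpus
        else:
            fits = (decode.free_gpus >= decode_gpus
                    and prefill.free_gpus >= prefill_gpus)
        if fits:
            return (decode.name, prefill.name)
    return None


pools = [Pool("big", "c1", 3), Pool("small", "c1", 1)]
placement = place_replica(pools, decode_gpus=1, prefill_gpus=3)
```

In this example a greedy per-role pass would put decode on `big` (the first pool that fits) and then find no pool for prefill; the joint search instead returns `("small", "big")`.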

The llm-d backend (compose-model-replica) emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each role-labeled with llm-d.ai/role and pinned to its pool with its own ResourceClaimTemplate, both carrying VLLM_NIXL_SIDE_CHANNEL_HOST via the downward API so NixlConnector can establish the transfer. It injects the pd-sidecar on decode pods (vLLM moves to 8001, the sidecar takes 8000), emits the InferencePool and the EPP stack (Deployment, Service, ConfigMap, RBAC) built from routing.template, and points the disaggregated HTTPRoute at the pool instead of the Service.

ServingStack installs Envoy AI Gateway (v0.7.0) and the GAIE CRDs, because core Envoy Gateway can't serve an InferencePool; the Phase 1 spike (design/disaggregation-routing-spike.md) established this and the follow-up spike confirmed the mechanism against upstream source.

The unified path is the zero-prefill case of the same code, unchanged throughout.

Validated on real GKE L4 GPUs, which caught four bugs that unit tests couldn't, because the expected manifests encoded the same wrong values. All four are fixed here:

The pinned EPP and pd-sidecar image refs didn't exist (403 from ghcr.io). Corrected to the published public images ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0 and ghcr.io/llm-d/llm-d-routing-sidecar:v0.8.0; the EPP image now pulls and runs.
The EndpointPickerConfig used apiVersion: llm-d.ai/v1alpha1, which the EPP binary doesn't register (crash-loop). Corrected to inference.networking.x-k8s.io/v1alpha1; the EPP now parses its config, runs, and reconciles the InferencePool (tracking both the decode and prefill endpoints).
The disaggregated engines never enabled vLLM's NixlConnector — the side-channel env and sidecar were emitted but no --kv-transfer-config, so no prefill→decode KV handoff could occur. Both roles now pass --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' (NixlConnector doesn't distinguish kv_role; the routing sidecar drives direction). Confirmed live: both engines come up with the connector configured and log NIXL is available.
The pd-sidecar served HTTPS by default (--secure-proxy defaults true), so the HTTP readiness probe and the HTTP gateway path were rejected and the decode pod never became Ready. Now passes --secure-proxy=false (the Modelplane serving path is HTTP throughout).
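
The two flag fixes above can be sketched as small argument builders. This is an illustration only: `kv_transfer_args` and `pd_sidecar_args` are hypothetical helper names, not functions in base.py or llmd.py; the flag names and values are the ones quoted above.

```python
import json


def kv_transfer_args() -> list[str]:
    """vLLM args that enable NixlConnector on both roles."""
    config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
    # Compact separators reproduce the JSON string quoted in the description.
    return ["--kv-transfer-config", json.dumps(config, separators=(",", ":"))]


def pd_sidecar_args() -> list[str]:
    """pd-sidecar args: serve plain HTTP so the HTTP readiness probe passes."""
    return ["--secure-proxy=false"]


engine_args = kv_transfer_args()
```

Building the JSON with `json.dumps` rather than a hand-written string avoids quoting mistakes when the args are passed as a list instead of through a shell.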

With these, both vLLM roles run on L4 with NixlConnector active and the model serves real completions: for example, a chat request returned "Hello! How can I assist you today?". The EPP is healthy and tracks both endpoints.

Not yet confirmed: a single request driven end to end through the gateway (HTTPRoute → InferencePool → EPP → decode, sidecar pulling KV from prefill). On the test cluster the InferencePool backendRef didn't program, but the substrate had been disturbed during debugging, so this is inconclusive rather than a confirmed limitation and needs a clean re-run. Envoy AI Gateway is pinned at v0.7.0 pending its v1.0.0 GA (~end of June).

Follow-ups, out of scope and pre-existing: (1) a product-provisioned GKE cluster brings GPU pools up on the device-plugin path while workloads claim GPUs only via the gpu.nvidia.com DRA driver, so GKE-side DRA provisioning (driver-root, device-plugin) needs validating — affects unified serving too; (2) the LWS bootstrap runs ray start unconditionally, but stock vllm/vllm-openai ships no ray and single-node replicas don't need it.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

LLM inference has two phases with opposite hardware profiles: prefill is compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets ITL. Run on one pod set they contend, and neither can be tuned independently.

This adds Phase 1 of prefill/decode disaggregation (design/disaggregation.md): an optional `prefill` block on ModelDeployment (self-contained workers, topology, template, nodeSelector) plus a routing.template, mirrored onto ModelReplica.

The fleet scheduler treats a disaggregated replica as the existing decode placement plus one optional prefill placement, co-located on one InferenceCluster, choosing the (decode_pool, prefill_pool) pair jointly against a single capacity ledger rather than greedily per role; it charges both roles' pools and re-places a replica if either pool drifts.

The llm-d backend emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each pinned to its role's pool with its own ResourceClaimTemplate, distinguished by a modelplane.ai/pd-role label so the decode Service never selects prefill, both mounting the model cache and carrying VLLM_NIXL_SIDE_CHANNEL_HOST via the downward API so NixlConnector can establish the KV transfer. The unified (non-disaggregated) path is unchanged: it is the zero-prefill case of the same code.

Request-level prefill->decode sequencing (the GAIE InferencePool + EPP routing layer) is deferred to Phase 2: a feasibility spike (design/disaggregation-routing-spike.md) found core Envoy Gateway can't serve an InferencePool and the llm-d EPP fronts a decode-only pool with a sidecar handoff, so Phase 2 will move ServingStack to Envoy AI Gateway after confirming the EPP mechanism against upstream source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Before writing disaggregated routing code, three questions from the Phase 1 spike were unresolved or low confidence. This spike confirms all three from primary source.

Header: x-prefiller-host-port (defined in pkg/common/routing/common.go). The EPP's disagg-profile-handler injects it; the pd-sidecar on the decode pod reads it and forwards prefill to the named host:port. The standalone routing-sidecar repo is deprecated; code lives under cmd/pd-sidecar and pkg/sidecar in the inference-scheduler repo.

Pod discovery: a single InferencePool selects all pods (prefill + decode) by a shared label. The EPP partitions the set at scheduling time using prefill-filter and decode-filter plugins, both keyed on llm-d.ai/role. Prefill pods need llm-d.ai/role: prefill; decode pods need decode. The label-selector-filter plugin can read modelplane.ai/pd-role instead if we want to avoid adding llm-d-native labels.

Envoy AI Gateway: v0.7.0 shipped June 4, 2026 (v1.0 GA targets June 30). It supports InferencePool (inference.networking.k8s.io/v1) as an HTTPRoute backendRef. It runs on top of gateway-helm via extensionManager hooks, not as a replacement GatewayClass; controllerName stays gateway.envoyproxy.io/gatewayclass-controller. ServingStack needs three charts: gateway-helm (with AI Gateway extension values), ai-gateway-crds-helm, and ai-gateway-helm, plus the GAIE v1.0.1 manifests.

Towards #34.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
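
The pod-discovery and header findings can be pictured with a small sketch. The real EPP is a Go binary and does this through its filter plugins; the Python below only models the behavior described here. The pod dictionaries, `partition_by_role`, and `handoff_header` are hypothetical.

```python
ROLE_LABEL = "llm-d.ai/role"
PREFILL_HEADER = "x-prefiller-host-port"


def partition_by_role(pods: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split one pool's pods into (prefill, decode) by the role label."""
    prefill = [p for p in pods if p["labels"].get(ROLE_LABEL) == "prefill"]
    decode = [p for p in pods if p["labels"].get(ROLE_LABEL) == "decode"]
    return prefill, decode


def handoff_header(prefill_pod: dict, port: int) -> dict[str, str]:
    """Header telling the decode pod's sidecar which prefill host:port to use."""
    return {PREFILL_HEADER: f"{prefill_pod['ip']}:{port}"}


# Hypothetical pods selected by a single InferencePool.
pods = [
    {"ip": "10.0.0.1", "labels": {ROLE_LABEL: "decode"}},
    {"ip": "10.0.0.2", "labels": {ROLE_LABEL: "prefill"}},
]
prefill_pods, decode_pods = partition_by_role(pods)
header = handoff_header(prefill_pods[0], 8000)
```

A pod without the role label lands in neither list, which matches the point that both pod sets must carry llm-d.ai/role to be scheduled.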

The design settled on no guessing for a disaggregated deployment: the prefill:decode ratio and the routing EPP are stated, not defaulted. Drop the schema default on workers.count and prefill.workers.count (symmetrically, so the prefill workers schema stays byte-identical to decode and codegen still deduplicates them) so the both-counts CEL rule actually rejects an omitted count, and add a rule requiring routing when prefill is set. A unified deployment is unaffected: the rules are guarded on has(prefill), and the composition function materializes an omitted count as 1 onto the ModelReplica, which keeps the replica self-describing rather than relying on a re-default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
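
The intent of these rules can be modeled in a few lines. This is a sketch of the logic only, written in Python rather than CEL; `validate_spec`, `materialize_counts`, and the error strings are hypothetical, and the real rules live in the ModelDeployment schema.

```python
import copy


def validate_spec(spec: dict) -> list[str]:
    """Mirror the has(prefill)-guarded rules: explicit counts and routing."""
    errors = []
    if "prefill" in spec:
        if "count" not in spec.get("workers", {}):
            errors.append("workers.count must be set when prefill is set")
        if "count" not in spec["prefill"].get("workers", {}):
            errors.append("prefill.workers.count must be set when prefill is set")
        if "routing" not in spec:
            errors.append("routing must be set when prefill is set")
    return errors


def materialize_counts(spec: dict) -> dict:
    """Return a copy with an omitted decode count written out as 1."""
    out = copy.deepcopy(spec)
    out.setdefault("workers", {}).setdefault("count", 1)
    return out


unified = {"workers": {}}
disagg_missing = {"workers": {}, "prefill": {"workers": {"count": 1}}}
```

Here `validate_spec(unified)` returns no errors and `materialize_counts(unified)` yields a count of 1, while `disagg_missing` is rejected for the omitted decode count and the missing routing block.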

The GAIE EPP selects pods using llm-d-native labels that Modelplane was not emitting. For disaggregated replicas, stamp both the decode and prefill LeaderWorkerSet pod templates (leader and worker) with:

- app: <replica-name>                  shared InferencePool selector
- llm-d.ai/inference-serving: "true"   shared EPP selector
- llm-d.ai/role: "decode"/"prefill"    per-role EPP filter

The app label carries the decode replica name on both roles so a single InferencePool spec.selector matches both pod sets. Unified (non-disaggregated) replicas are untouched. The existing modelplane.ai/pd-role labels are preserved.

New constants LABEL_LLMD_ROLE and LABEL_LLMD_SERVING are added to base.py; the disagg-only decode_extra dict and the prefill_labels dict in llmd.py are extended to carry all three labels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
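
A minimal sketch of the label scheme, assuming the two constants hold the label keys listed above. `disagg_pod_labels` and `pool_selector` are hypothetical helpers, not the decode_extra or prefill_labels dicts in llmd.py.

```python
LABEL_LLMD_ROLE = "llm-d.ai/role"  # assumed value of the base.py constant
LABEL_LLMD_SERVING = "llm-d.ai/inference-serving"  # assumed value


def disagg_pod_labels(replica_name: str, role: str) -> dict[str, str]:
    """Labels stamped on a disaggregated pod template for the given role."""
    if role not in ("decode", "prefill"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        "app": replica_name,  # decode replica name on both roles
        LABEL_LLMD_SERVING: "true",
        LABEL_LLMD_ROLE: role,
    }


def pool_selector(replica_name: str) -> dict[str, str]:
    """Shared labels a single InferencePool selector can match on."""
    return {"app": replica_name, LABEL_LLMD_SERVING: "true"}


decode_labels = disagg_pod_labels("demo", "decode")
prefill_labels = disagg_pod_labels("demo", "prefill")
selector = pool_selector("demo")
```

Because the selector is a subset of both label sets, one InferencePool matches both pod sets, and only llm-d.ai/role tells them apart.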

Records what compose-model-replica must emit for a disaggregated replica (the ten EPP/InferencePool Objects, since Modelplane runs no per-model Helm install), the EPP container args/env/config, the gateway-helm values overrides the InferencePool backendRef needs, and the pinned pd-sidecar and endpoint-picker images.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

On a disaggregated replica the pd-sidecar (llm-d-inference-scheduler-disagg-sidecar:v0.8.0) intercepts port 8000 on the decode leader pod, forwards the prompt to the prefill instance for KV transfer, then proxies to local vLLM which has moved to port 8001. Service targetPort stays 8000 (the sidecar's listen port) so no routing change is needed. Unified and prefill pods are unchanged.

- base.py: add PD_SIDECAR_IMAGE, _DECODE_ENGINE_PORT=8001, pd_sidecar_container()
- llmd.py: extract _build_commands() helper; thread engine_serving_port (8001 when disagg, 8000 otherwise) through the container closure's ports+readinessProbe; inject --port=8001 into turnkey vLLM args; append pd_sidecar_container() to the decode leader pod only when disagg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
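
The port arrangement can be summarized in a short sketch. `_DECODE_ENGINE_PORT` is named in the commit; `SERVING_PORT`, `engine_serving_port`, and `engine_port_args` are hypothetical stand-ins for the logic threaded through llmd.py.

```python
SERVING_PORT = 8000  # Service targetPort; the sidecar listens here when disagg
_DECODE_ENGINE_PORT = 8001  # where decode vLLM moves behind the sidecar


def engine_serving_port(disagg: bool, role: str) -> int:
    """Port the vLLM container listens on for the given pod role."""
    if disagg and role == "decode":
        return _DECODE_ENGINE_PORT
    return SERVING_PORT  # unified and prefill pods are unchanged


def engine_port_args(disagg: bool, role: str) -> list[str]:
    """Extra vLLM args needed only when the engine has moved off 8000."""
    port = engine_serving_port(disagg, role)
    return [f"--port={port}"] if port != SERVING_PORT else []


decode_args = engine_port_args(disagg=True, role="decode")
```

Keeping the Service on 8000 in both modes is what lets the sidecar be added without touching routing: only the engine's own port and readiness probe change.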

For disaggregated replicas the HTTPRoute previously targeted the decode Service directly, bypassing the EPP. Now:

- emit an `InferencePool` (inference.networking.k8s.io/v1) at key `inference-pool`; its selector matches both prefill and decode pods via the shared app:<name> + llm-d.ai/inference-serving:"true" labels, and its endpointPickerRef names the EPP Service (<name>-epp:9002) with failureMode:FailOpen so a transient EPP outage never black-holes traffic
- flip the disagg HTTPRoute's backendRefs from `{name:<name>,port:80}` to `{group:inference.networking.k8s.io,kind:InferencePool,name:<name>-pool}` so the GAIE EPP intercepts every request for KV-/prefix-aware routing
- unified replicas keep `HTTPRoute -> Service` untouched; the new logic is gated on the existing `disagg` flag so there is no unified-path change

The decode Service is retained: the InferencePool/EPP and the pods themselves still need it.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

failureMode is a field of endpointPickerRef in the GAIE v1 InferencePool schema, not of the pool spec; emitting it at spec level would be rejected by the apiserver or silently drop to FailClose, black-holing decode traffic when the EPP is briefly unavailable. Move it inside endpointPickerRef and assert the placement. Also refresh the module docstring, which still claimed the backend only ever emits HTTPRoute->Service; it now fronts disaggregated pods with an InferencePool + EPP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
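
A sketch of the corrected placement, with an assertion of the kind the commit describes. It is not a complete manifest: `inference_pool` and `disagg_backend_ref` are hypothetical names, the `matchLabels` and `port.number` nesting is an assumption about the v1 schema, and other fields the schema may require are omitted.

```python
def inference_pool(name: str) -> dict:
    """Partial InferencePool for a disaggregated replica (illustrative only)."""
    return {
        "apiVersion": "inference.networking.k8s.io/v1",
        "kind": "InferencePool",
        "metadata": {"name": f"{name}-pool"},
        "spec": {
            # Assumed nesting; the shared labels match both pod sets.
            "selector": {
                "matchLabels": {
                    "app": name,
                    "llm-d.ai/inference-serving": "true",
                }
            },
            "endpointPickerRef": {
                "name": f"{name}-epp",
                "port": {"number": 9002},  # assumed nesting of the port
                # failureMode belongs here, not directly under spec.
                "failureMode": "FailOpen",
            },
        },
    }


def disagg_backend_ref(name: str) -> dict:
    """HTTPRoute backendRef that targets the pool instead of the Service."""
    return {
        "group": "inference.networking.k8s.io",
        "kind": "InferencePool",
        "name": f"{name}-pool",
    }


pool = inference_pool("demo")
assert "failureMode" not in pool["spec"]
assert pool["spec"]["endpointPickerRef"]["failureMode"] == "FailOpen"
```

Asserting both the presence inside endpointPickerRef and the absence at spec level guards against the field drifting back to the wrong place.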

negz · 2026-06-15T20:12:14Z

Superseded by #142

dennis-upbound and others added 4 commits June 11, 2026 06:39

dennis-upbound force-pushed the dennis/disagg-impl branch from adf3e32 to 2fc3e35 Compare June 11, 2026 13:41

dennis-upbound force-pushed the dennis/disagg-impl branch from 2fc3e35 to 811cea4 Compare June 11, 2026 15:34

dennis-upbound and others added 6 commits June 11, 2026 09:34

Emit the llm-d endpoint picker for disaggregated replicas

981a757

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Install the Envoy AI Gateway stack and GAIE CRDs on serving clusters

68eb2f6

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Satisfy the formatter

df11eeb

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound force-pushed the dennis/disagg-impl branch from 811cea4 to 8f61a37 Compare June 11, 2026 16:35

dennis-upbound marked this pull request as ready for review June 11, 2026 16:35

negz closed this Jun 15, 2026

dennis-upbound deleted the dennis/disagg-impl branch June 19, 2026 17:26

dennis-upbound wants to merge 11 commits into main from dennis/disagg-impl
