Disaggregate prefill and decode: prefill block, role-aware scheduler, and InferencePool routing #124

Closed
dennis-upbound wants to merge 11 commits into main from dennis/disagg-impl

dennis-upbound (Collaborator) commented Jun 11, 2026

Fixes #34.

LLM inference has two phases with opposite hardware profiles: prefill is compute-bound and sets time to first token (TTFT), while decode is memory-bandwidth-bound and sets inter-token latency (ITL). Run on one pod set, the two phases contend for the same hardware, and neither can be tuned independently. Modelplane had no way to split them.

This PR serves the two phases as separate, co-located pod sets with the KV cache transferred over NIXL, and sequences each request prefill→decode through a GAIE InferencePool and an endpoint-picker (EPP). It implements the design from #116.

A deployment opts in by declaring a prefill block alongside workers (now the decode role), plus a routing block that supplies the endpoint-picker. For a disaggregated deployment both worker counts and routing must be explicit, enforced by CEL; a deployment with no prefill block is unified serving and routes exactly as before.

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
spec:
  model: { ... }
  workers:                 # decode role
    count: 2
    topology: { tensor: 1 }
    template: { spec: { containers: [{ name: engine, image: ... }] } }
  prefill:                 # new: a second, independently-sized role
    workers:
      count: 1
      topology: { tensor: 1 }
      template: { spec: { containers: [{ name: engine, image: ... }] } }
    nodeSelector: { ... }
  routing:                 # new: the endpoint-picker
    template:
      spec:
        containers:
        - name: epp        # image/args optional; defaults to the pinned llm-d EPP
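
The opt-in rules above are enforced by CEL on the CRD. As a rough illustration only, the same checks can be mirrored in Python. The function name `validate_disaggregated` and the error strings are hypothetical, and this is a sketch of the described behavior, not the actual CEL expressions:

```python
def validate_disaggregated(spec: dict) -> list[str]:
    """Return rule violations for a ModelDeployment spec; empty means valid.

    Hypothetical mirror of the CEL rules described in this PR: the checks
    only apply when a prefill block is present.
    """
    errors: list[str] = []
    prefill = spec.get("prefill")
    if prefill is None:
        # No prefill block: unified serving, nothing extra is required.
        return errors
    if "count" not in spec.get("workers", {}):
        errors.append("workers.count must be explicit when prefill is set")
    if "count" not in prefill.get("workers", {}):
        errors.append("prefill.workers.count must be explicit when prefill is set")
    if "routing" not in spec:
        errors.append("routing must be set when prefill is set")
    return errors


if __name__ == "__main__":
    unified = {"workers": {}}
    incomplete = {"workers": {}, "prefill": {"workers": {}}}
    print(validate_disaggregated(unified))     # no violations
    print(validate_disaggregated(incomplete))  # three violations
```

A unified spec passes with no counts at all, which matches the guard on `has(prefill)` described in the commit history.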

The scheduler (compose-model-deployment) treats a disaggregated replica as the existing decode placement plus one optional prefill placement, co-located on one InferenceCluster. It chooses the (decode_pool, prefill_pool) pair jointly against a single capacity ledger rather than greedily per role, charges both pools, and re-places the replica if either pool drifts.
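
To make the difference between joint and greedy placement concrete, here is a minimal sketch. The data model is an assumption for illustration: the ledger maps pool name to free GPUs, `cluster_of` maps pool name to its InferenceCluster, and `place_replica` is a hypothetical name. Whether the real scheduler lets both roles share one pool is not stated in this PR; the sketch allows it.

```python
from itertools import product
from typing import Optional

Placement = tuple[str, Optional[str]]  # (decode_pool, prefill_pool or None)


def place_replica(
    ledger: dict[str, int],
    cluster_of: dict[str, str],
    decode_need: int,
    prefill_need: int = 0,
) -> Optional[Placement]:
    """Pick pools for one replica and charge the shared ledger in place."""
    pools = sorted(ledger)
    if prefill_need == 0:
        # Unified serving is the zero-prefill case: place decode only.
        for pool in pools:
            if ledger[pool] >= decode_need:
                ledger[pool] -= decode_need
                return (pool, None)
        return None
    # Disaggregated: evaluate (decode_pool, prefill_pool) pairs together.
    for decode_pool, prefill_pool in product(pools, repeat=2):
        if cluster_of[decode_pool] != cluster_of[prefill_pool]:
            continue  # both roles must be co-located on one cluster
        if decode_pool == prefill_pool:
            fits = ledger[decode_pool] >= decode_need + prefill_need
        else:
            fits = (ledger[decode_pool] >= decode_need
                    and ledger[prefill_pool] >= prefill_need)
        if fits:
            ledger[decode_pool] -= decode_need
            ledger[prefill_pool] -= prefill_need
            return (decode_pool, prefill_pool)
    return None


if __name__ == "__main__":
    ledger = {"a": 2, "b": 2, "c": 1}
    cluster_of = {"a": "c1", "b": "c2", "c": "c2"}
    print(place_replica(ledger, cluster_of, decode_need=2, prefill_need=1))
    # ('b', 'c')
```

In this example a greedy decode-first pass would take pool `a` and then find no co-located capacity for prefill, while the joint search skips `a` and lands on `('b', 'c')`.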

The llm-d backend (compose-model-replica) emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each role-labeled with llm-d.ai/role and pinned to its pool with its own ResourceClaimTemplate, both carrying VLLM_NIXL_SIDE_CHANNEL_HOST via the downward API so NixlConnector can establish the transfer. It injects the pd-sidecar on decode pods (vLLM moves to 8001, the sidecar takes 8000), emits the InferencePool and the EPP stack (Deployment, Service, ConfigMap, RBAC) built from routing.template, and points the disaggregated HTTPRoute at the pool instead of the Service.
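
As a sketch of the per-role pod metadata described above, the fragment below builds the labels and the side-channel env entry as plain dicts. The helper name `role_pod_fragment` is hypothetical, and sourcing the host from `status.podIP` is an assumption: the PR only says the variable comes from the downward API.

```python
def role_pod_fragment(replica_name: str, role: str) -> dict:
    """Labels and env a decode or prefill pod template would carry."""
    if role not in ("decode", "prefill"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        "labels": {
            # Shared by both roles so one InferencePool selector matches both.
            "app": replica_name,
            "llm-d.ai/inference-serving": "true",
            # Per-role label the EPP filters on.
            "llm-d.ai/role": role,
        },
        "env": [
            {
                "name": "VLLM_NIXL_SIDE_CHANNEL_HOST",
                # Assumption: the pod IP is the field exposed here.
                "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
            }
        ],
    }


if __name__ == "__main__":
    decode = role_pod_fragment("my-replica", "decode")
    prefill = role_pod_fragment("my-replica", "prefill")
    print(decode["labels"]["llm-d.ai/role"], prefill["labels"]["llm-d.ai/role"])
```

Both roles carry the same `app` value, so the pool selects them together and the role label alone separates them.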

ServingStack installs Envoy AI Gateway (v0.7.0) and the GAIE CRDs, because core Envoy Gateway can't serve an InferencePool; the Phase 1 spike (design/disaggregation-routing-spike.md) established this and the follow-up spike confirmed the mechanism against upstream source.

The unified path is the zero-prefill case of the same code, unchanged throughout.

Validated on real GKE L4 GPUs, which caught four bugs that unit tests could not, because the expected manifests encoded the same wrong values. All four are fixed here:

  • The pinned EPP and pd-sidecar image refs didn't exist (403 from ghcr.io). Corrected to the published public images ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0 and ghcr.io/llm-d/llm-d-routing-sidecar:v0.8.0; the EPP image now pulls and runs.
  • The EndpointPickerConfig used apiVersion: llm-d.ai/v1alpha1, which the EPP binary doesn't register (crash-loop). Corrected to inference.networking.x-k8s.io/v1alpha1; the EPP now parses its config, runs, and reconciles the InferencePool (tracking both the decode and prefill endpoints).
  • The disaggregated engines never enabled vLLM's NixlConnector — the side-channel env and sidecar were emitted but no --kv-transfer-config, so no prefill→decode KV handoff could occur. Both roles now pass --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' (NixlConnector doesn't distinguish kv_role; the routing sidecar drives direction). Confirmed live: both engines come up with the connector configured and log NIXL is available.
  • The pd-sidecar served HTTPS by default (--secure-proxy defaults to true), so the HTTP readiness probe and the HTTP gateway path were rejected and the decode pod never became Ready. The backend now passes --secure-proxy=false (the Modelplane serving path is HTTP throughout).
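
The last two fixes come down to a few container arguments. A minimal sketch of how they could be assembled follows; `engine_args` and `sidecar_args` are hypothetical helpers, not the functions in llmd.py, and the sidecar list shows only the one flag named in this PR.

```python
import json

DECODE_ENGINE_PORT = 8001  # vLLM moves here on decode; the sidecar takes 8000


def engine_args(disagg: bool, role: str = "decode") -> list[str]:
    """Extra vLLM args for a replica; empty for unified serving."""
    args: list[str] = []
    if disagg:
        kv_config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
        # Compact JSON, matching the form quoted in the PR description.
        args += ["--kv-transfer-config",
                 json.dumps(kv_config, separators=(",", ":"))]
        if role == "decode":
            # Only decode pods run the sidecar, so only they change port.
            args.append(f"--port={DECODE_ENGINE_PORT}")
    return args


def sidecar_args() -> list[str]:
    """pd-sidecar flag from this PR: serve plain HTTP, not HTTPS."""
    return ["--secure-proxy=false"]


if __name__ == "__main__":
    print(engine_args(disagg=True, role="decode"))
    print(engine_args(disagg=True, role="prefill"))
    print(engine_args(disagg=False))
```

Passing the JSON as its own list element avoids shell quoting issues when the args go straight into a container spec.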

With these fixes, both vLLM roles run on L4 with NixlConnector active and the model serves real completions: for example, a chat request returned "Hello! How can I assist you today?". The EPP is healthy and tracks both endpoints.

Not yet confirmed: a single request driven end to end through the gateway (HTTPRoute → InferencePool → EPP → decode, sidecar pulling KV from prefill). On the test cluster the InferencePool backendRef didn't program, but the substrate had been disturbed during debugging, so this is inconclusive rather than a confirmed limitation and needs a clean re-run. Envoy AI Gateway is pinned at v0.7.0 pending its v1.0.0 GA (~end of June).

Follow-ups, out of scope and pre-existing: (1) a product-provisioned GKE cluster brings GPU pools up on the device-plugin path while workloads claim GPUs only via the gpu.nvidia.com DRA driver, so GKE-side DRA provisioning (driver-root, device-plugin) needs validating — affects unified serving too; (2) the LWS bootstrap runs ray start unconditionally, but stock vllm/vllm-openai ships no ray and single-node replicas don't need it.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

dennis-upbound and others added 4 commits June 11, 2026 06:39
LLM inference has two phases with opposite hardware profiles — prefill is
compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets ITL.
Run on one pod set they contend, and neither can be tuned independently.

This adds Phase 1 of prefill/decode disaggregation (design/disaggregation.md):
an optional `prefill` block on ModelDeployment (self-contained workers,
topology, template, nodeSelector) plus a routing.template, mirrored onto
ModelReplica. The fleet scheduler treats a disaggregated replica as the
existing decode placement plus one optional prefill placement, co-located on
one InferenceCluster, choosing the (decode_pool, prefill_pool) pair jointly
against a single capacity ledger rather than greedily per role; it charges both
roles' pools and re-places a replica if either pool drifts. The llm-d backend
emits a decode pod set (kv_consumer) and a prefill pod set (kv_producer), each
pinned to its role's pool with its own ResourceClaimTemplate, distinguished by
a modelplane.ai/pd-role label so the decode Service never selects prefill, both
mounting the model cache and carrying VLLM_NIXL_SIDE_CHANNEL_HOST via the
downward API so NixlConnector can establish the KV transfer.

The unified (non-disaggregated) path is unchanged — it is the zero-prefill case
of the same code. Request-level prefill->decode sequencing (the GAIE
InferencePool + EPP routing layer) is deferred to Phase 2: a feasibility spike
(design/disaggregation-routing-spike.md) found core Envoy Gateway can't serve
an InferencePool and the llm-d EPP fronts a decode-only pool with a sidecar
handoff, so Phase 2 will move ServingStack to Envoy AI Gateway after confirming
the EPP mechanism against upstream source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Before writing disaggregated routing code, three questions from the
Phase 1 spike were unresolved or low confidence. This spike confirms all
three from primary source.

Header: x-prefiller-host-port (defined in pkg/common/routing/common.go).
The EPP's disagg-profile-handler injects it; the pd-sidecar on the decode
pod reads it and forwards prefill to the named host:port. The standalone
routing-sidecar repo is deprecated; code lives under cmd/pd-sidecar and
pkg/sidecar in the inference-scheduler repo.

Pod discovery: a single InferencePool selects all pods (prefill + decode)
by a shared label. The EPP partitions the set at scheduling time using
prefill-filter and decode-filter plugins, both keyed on llm-d.ai/role.
Prefill pods need llm-d.ai/role: prefill; decode pods need decode. The
label-selector-filter plugin can read modelplane.ai/pd-role instead if
we want to avoid adding llm-d-native labels.

Envoy AI Gateway: v0.7.0 shipped June 4, 2026 (v1.0 GA targets June 30).
It supports InferencePool (inference.networking.k8s.io/v1) as an HTTPRoute
backendRef. It runs on top of gateway-helm via extensionManager hooks, not
as a replacement GatewayClass; controllerName stays
gateway.envoyproxy.io/gatewayclass-controller. ServingStack needs three
charts: gateway-helm (with AI Gateway extension values), ai-gateway-crds-helm,
and ai-gateway-helm, plus the GAIE v1.0.1 manifests.

Towards #34.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The design settled on no guessing for a disaggregated deployment: the
prefill:decode ratio and the routing EPP are stated, not defaulted. Drop the
schema default on workers.count and prefill.workers.count (symmetrically, so the
prefill workers schema stays byte-identical to decode and codegen still
deduplicates them) so the both-counts CEL rule actually rejects an omitted
count, and add a rule requiring routing when prefill is set. A unified
deployment is unaffected: the rules are guarded on has(prefill), and the
composition function materializes an omitted count as 1 onto the ModelReplica,
which keeps the replica self-describing rather than relying on a re-default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The GAIE EPP selects pods using llm-d-native labels that Modelplane was
not emitting. For disaggregated replicas, stamp both the decode and
prefill LeaderWorkerSet pod templates (leader and worker) with:

- app: <replica-name>        shared InferencePool selector
- llm-d.ai/inference-serving: "true"   shared EPP selector
- llm-d.ai/role: "decode"/"prefill"    per-role EPP filter

The app label carries the decode replica name on both roles so a single
InferencePool spec.selector matches both pod sets. Unified (non-
disaggregated) replicas are untouched. The existing
modelplane.ai/pd-role labels are preserved.

New constants LABEL_LLMD_ROLE and LABEL_LLMD_SERVING are added to
base.py; the disagg-only decode_extra dict and the prefill_labels dict
in llmd.py are extended to carry all three labels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Records what compose-model-replica must emit for a disaggregated replica (the
ten EPP/InferencePool Objects, since Modelplane runs no per-model Helm install),
the EPP container args/env/config, the gateway-helm values overrides the
InferencePool backendRef needs, and the pinned pd-sidecar and endpoint-picker
images.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound and others added 6 commits June 11, 2026 09:34
On a disaggregated replica the pd-sidecar (llm-d-inference-scheduler-disagg-sidecar:v0.8.0)
intercepts port 8000 on the decode leader pod, forwards the prompt to the prefill
instance for KV transfer, then proxies to local vLLM which has moved to port 8001.
Service targetPort stays 8000 (the sidecar's listen port) so no routing change is
needed.  Unified and prefill pods are unchanged.

- base.py: add PD_SIDECAR_IMAGE, _DECODE_ENGINE_PORT=8001, pd_sidecar_container()
- llmd.py: extract _build_commands() helper; thread engine_serving_port (8001 when
  disagg, 8000 otherwise) through the container closure's ports+readinessProbe;
  inject --port=8001 into turnkey vLLM args; append pd_sidecar_container() to the
  decode leader pod only when disagg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

For disaggregated replicas the HTTPRoute previously targeted the decode
Service directly, bypassing the EPP. Now:

- emit an `InferencePool` (inference.networking.k8s.io/v1) at key
  `inference-pool`; its selector matches both prefill and decode pods via
  the shared app:<name> + llm-d.ai/inference-serving:"true" labels, and
  its endpointPickerRef names the EPP Service (<name>-epp:9002) with
  failureMode:FailOpen so a transient EPP outage never black-holes traffic
- flip the disagg HTTPRoute's backendRefs from `{name:<name>,port:80}` to
  `{group:inference.networking.k8s.io,kind:InferencePool,name:<name>-pool}`
  so the GAIE EPP intercepts every request for KV-/prefix-aware routing
- unified replicas keep `HTTPRoute -> Service` untouched; the new logic is
  gated on the existing `disagg` flag so there is no unified-path change

The decode Service is retained: the InferencePool/EPP and the pods
themselves still need it.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

failureMode is a field of endpointPickerRef in the GAIE v1 InferencePool
schema, not of the pool spec; emitting it at spec level would be rejected by the
apiserver or silently drop to FailClose, black-holing decode traffic when the
EPP is briefly unavailable. Move it inside endpointPickerRef and assert the
placement. Also refresh the module docstring, which still claimed the backend
only ever emits HTTPRoute->Service — it now fronts disaggregated pods with an
InferencePool + EPP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
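
For reference, the InferencePool shape these commits describe can be sketched as Python dicts. The helper names are hypothetical, and the exact field layout (`selector.matchLabels`, `targetPorts`, `port.number`) is an assumption about the GAIE v1 schema that should be checked against the installed CRD. The names, ports, and labels come from the commit messages above.

```python
GAIE_GROUP = "inference.networking.k8s.io"


def inference_pool(name: str) -> dict:
    """InferencePool for a disaggregated replica (illustrative shape)."""
    return {
        "apiVersion": f"{GAIE_GROUP}/v1",
        "kind": "InferencePool",
        "metadata": {"name": f"{name}-pool"},
        "spec": {
            # Shared labels, so prefill and decode pods are both selected.
            "selector": {
                "matchLabels": {
                    "app": name,
                    "llm-d.ai/inference-serving": "true",
                }
            },
            # Assumption: the pool targets the sidecar's listen port.
            "targetPorts": [{"number": 8000}],
            "endpointPickerRef": {
                "name": f"{name}-epp",
                "port": {"number": 9002},
                # Lives inside endpointPickerRef, not at spec level.
                "failureMode": "FailOpen",
            },
        },
    }


def route_backend_ref(name: str) -> dict:
    """HTTPRoute backendRef that points at the pool instead of the Service."""
    return {"group": GAIE_GROUP, "kind": "InferencePool", "name": f"{name}-pool"}


if __name__ == "__main__":
    pool = inference_pool("my-replica")
    print(pool["spec"]["endpointPickerRef"]["failureMode"])
    print(route_backend_ref("my-replica"))
```

A test in the spirit of the last commit would assert that `failureMode` is absent from `spec` and present under `endpointPickerRef`.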
dennis-upbound marked this pull request as ready for review June 11, 2026 16:35
negz (Collaborator) commented Jun 15, 2026

Superseded by #142

negz closed this Jun 15, 2026
dennis-upbound deleted the dennis/disagg-impl branch June 19, 2026 17:26
Development

Successfully merging this pull request may close these issues.

Support prefill/decode disaggregation
