Make PrefillDecode actually disaggregate by dennis-upbound · Pull Request #175 · modelplaneai/modelplane

dennis-upbound · 2026-06-17T04:57:30Z

Problem

Across a full benchmark on a live GKE cluster, the vLLM prefill engine handled
0 requests, the decode engine served everything, remote_engine_id was
null, and no KV cache transferred. PrefillDecode was silently running
decode-only.

Root cause — three stacked defaults, all wrong on our EPP image

EPP image: llm-d-inference-scheduler:v0.8.0 (embeds gateway-api-inference-extension v1.5.0). Fixing any one alone leaves it broken:

1. prefix-based-pd-decider had no params → nonCachedTokens = 0, which the
   decider treats as disabled (always decode-only).
2. The decider reads a PrefixCacheMatchInfo attribute that nothing produced.
   GIE v1.5.0 split production out of prefix-cache-scorer into a separate
   plugin and made prepare-data default-on, so the prepareDataPlugins
   feature gate the v0.8.0 docs still tell you to set is unregistered and
   crashloops the EPP (feature gate 'prepareDataPlugins' is unknown or
   unregistered). The producer is now an explicit plugin:
   approx-prefix-cache-producer.
3. That producer defaults to autoTune: true, which leaves its block size 0
   and never populates the attribute. Must pin autoTune: false +
   blockSizeTokens.

(The v0.8.0 sample configs and docs/disaggregation.md are stale — they still
list the removed gate. This is effectively a packaging bug upstream; we work
around it with the correct v1.5.0 data-path config.)
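For orientation, the three load-bearing settings can be collected in one place and checked together. The following is a minimal sketch, not the code in this PR: the `plugins` / `type` / `parameters` layout is an assumed shape for the EPP plugin list, and the function names `epp_pd_fragment` and `disarmed_reasons` are hypothetical. Only the plugin names, the parameters (autoTune, blockSizeTokens, nonCachedTokens), and the featureGates entry come from the description above.

```python
def epp_pd_fragment(block_size_tokens: int = 16, non_cached_tokens: int = 16) -> dict:
    """Build the plugin entries that arm prefix-based P/D (assumed layout)."""
    return {
        "plugins": [
            {
                "type": "approx-prefix-cache-producer",
                # autoTune defaults to true, which leaves the block size at 0.
                "parameters": {"autoTune": False, "blockSizeTokens": block_size_tokens},
            },
            {
                "type": "prefix-based-pd-decider",
                # 0 is treated as "disabled", so this must be positive.
                "parameters": {"nonCachedTokens": non_cached_tokens},
            },
        ],
    }


def disarmed_reasons(config: dict) -> list:
    """Return the reasons a config would fail to disaggregate (empty if armed)."""
    reasons = []
    if "prepareDataPlugins" in config.get("featureGates", []):
        reasons.append("prepareDataPlugins gate is unregistered on GIE v1.5.0")
    plugins = {p["type"]: p.get("parameters") or {} for p in config.get("plugins", [])}
    producer = plugins.get("approx-prefix-cache-producer")
    if producer is None:
        reasons.append("no approx-prefix-cache-producer plugin")
    elif producer.get("autoTune", True) or producer.get("blockSizeTokens", 0) <= 0:
        reasons.append("producer must pin autoTune: false and blockSizeTokens")
    decider = plugins.get("prefix-based-pd-decider")
    if decider is None or decider.get("nonCachedTokens", 0) <= 0:
        reasons.append("nonCachedTokens is 0, so the decider is disabled")
    return reasons
```

A check of this kind makes the "fixing any one alone leaves it broken" point concrete: the config counts as armed only when the list of reasons is empty.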

Fix

Two pieces, both required for a request to actually disaggregate:

1. Arm the EPP decider. Add approx-prefix-cache-producer (autoTune: false,
   blockSizeTokens derived from the engine's --block-size/--page-size), set
   nonCachedTokens: 16, drop the feature gate, and wire the scorers into both
   profiles.
2. Inject the NIXL KV-transfer plumbing the schema can't express. Cross-pod
   NIXL needs VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (a fieldRef env)
   and a Memory-backed /dev/shm (a volume). Neither is expressible in the
   ModelDeployment engine template (env.valueFrom allows only secret/configMap
   refs; no volumes); the schema gap is tracked in #180 ("ModelDeployment
   engine template can't express volumes"). _inject_nixl_plumbing adds them to
   both phase engines, the same way the pd-sidecar is injected. Without the
   side-channel host the prefill advertises an unreachable address and the
   handshake fails (500/503, zero transfers). Also documents the engine-image
   prerequisite: the image must ship the NIXL runtime, which recent vanilla
   vllm/vllm-openai tags do.
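The second piece of the fix amounts to two additions to each engine pod: a fieldRef env var and a Memory-backed emptyDir mounted at /dev/shm. The sketch below shows that shape on a plain pod-spec dict. It is an illustration under assumptions, not the PR's `_inject_nixl_plumbing`: the function name, the volume name `dshm`, and the container name `engine` are hypothetical.

```python
import copy


def inject_nixl_plumbing_sketch(pod_spec: dict, container_name: str = "engine") -> dict:
    """Return a copy of pod_spec with the NIXL side-channel env and /dev/shm volume."""
    spec = copy.deepcopy(pod_spec)

    # Memory-backed emptyDir to replace the small default /dev/shm.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == "dshm" for v in volumes):
        volumes.append({"name": "dshm", "emptyDir": {"medium": "Memory"}})

    for container in spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        # Advertise the pod IP so peer engines can reach the NIXL side channel.
        env = container.setdefault("env", [])
        if not any(e.get("name") == "VLLM_NIXL_SIDE_CHANNEL_HOST" for e in env):
            env.append({
                "name": "VLLM_NIXL_SIDE_CHANNEL_HOST",
                "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
            })
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == "/dev/shm" for m in mounts):
            mounts.append({"name": "dshm", "mountPath": "/dev/shm"})
    return spec
```

The function is written to be idempotent, so applying it twice (for example on re-render) leaves a single env entry, volume, and mount.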

Verified live:

| prompts | prefill engine `request_prefill_time` | behavior |
| --- | --- | --- |
| 5× long unique | +5 | ✅ disaggregates (KV flows prefill→decode over NIXL) |
| 3× tiny (`"hi"`) | +0 (flat) | ✅ selectively skips prefill, decode-only |
| 3× long unique | +3 | ✅ disaggregates |

So disaggregation engages and is correctly selective (short/cache-hot prompts
skip the prefill hop and its KV-transfer cost).

Hard offload evidence (re-validated 2026-06-17, vanilla `vllm/vllm-openai:v0.19.1`)

Clean run: armed EPP + injected NIXL plumbing, 1 prefill (kv_producer) + 1 decode
(kv_consumer), five ~600-token prompts. vLLM's own NIXL counters:

| engine | `prompt_tokens` | `generation_tokens` | `nixl_xfer` count | `nixl_bytes_transferred` |
| --- | --- | --- | --- | --- |
| prefill (producer) | 6220 | 5 | n/a | n/a |
| decode (consumer) | (decode-side) | 160 | 5 | 367 MB (3.67e8) |

The prefill processes all prompt tokens and generates ~nothing (5 = handoff
tokens); the decode records exactly 5 NIXL transfers totaling 367 MB of KV
cache pulled from the prefill, then does the generation. That is the offload,
measured end-to-end — no custom kv-connector image, just vanilla vLLM that ships
NIXL.

Validated the failure modes too: the unarmed EPP config serves everything
decode-only (prefill prompt_tokens stays 0); missing VLLM_NIXL_SIDE_CHANNEL_HOST
returns 500/503 with zero transfers — i.e. both halves of this PR are load-bearing.
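The pass/fail reading of these counters can be written down as a small check over before/after snapshots. This is a sketch under assumptions: the dict keys reuse the abbreviated counter labels from the tables above (the full metric names exposed by the engines may differ), and `delta` and `classify_window` are hypothetical helper names.

```python
def delta(before: dict, after: dict, key: str) -> float:
    """Counter increase over the window; missing counters count as 0."""
    return after.get(key, 0) - before.get(key, 0)


def classify_window(prefill_before: dict, prefill_after: dict,
                    decode_before: dict, decode_after: dict) -> str:
    """Classify a benchmark window from prefill- and decode-engine counters."""
    prefill_requests = delta(prefill_before, prefill_after, "request_prefill_time")
    transfers = delta(decode_before, decode_after, "nixl_xfer")
    moved_bytes = delta(decode_before, decode_after, "nixl_bytes_transferred")

    if prefill_requests > 0 and transfers > 0 and moved_bytes > 0:
        return "disaggregated"
    if prefill_requests == 0 and transfers == 0:
        return "decode-only"
    # e.g. the prefill engine ran but no KV cache reached the decode engine.
    return "inconsistent"
```

With the numbers from the clean run (prefill +5, 5 transfers, 3.67e8 bytes) this returns "disaggregated"; with all-zero deltas, as in the unarmed config, it returns "decode-only".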

Alternatives considered

- always-disagg-pd-decider: also works on this image and is simpler, but
  disaggregates unconditionally (pays KV-transfer overhead even on tiny prompts).
  prefix-based is the better default; keep always-disagg in mind as a fallback if
  the producer wiring ever regresses.
- Downgrade to inference-scheduler v0.7.1 (GIE v1.4.0): there the original
  featureGates: [prepareDataPlugins] config is correct, but it gives up v0.8.0.
  Rejected; the v1.5.0 config above is config-only and keeps us current.

Future considerations

We may want to expose the EPP config as a specTemplate later, for full customization.

Test plan

- compose-model-replica unit tests pass (43 + 8 subtests), incl. test_epp_config_arms_the_pd_decider, test_kv_block_size, test_injects_nixl_plumbing
- ruff check / ruff format clean
- Verified live: selective disaggregation works (long → prefill engages, short → skips)
- Verified live: KV offload measured, 5 transfers / 367 MB prefill→decode over NIXL on vanilla vllm/vllm-openai:v0.19.1
- nix flake check (CI, incl. docs Vale)

serving.mode: PrefillDecode composes an EndpointPicker whose disaggregation is gated by a decider plugin. The composed config never actually disaggregated: across a full benchmark on a live GKE cluster the vLLM prefill engine handled zero requests, the decode engine served everything, and no KV cache transferred.

Three defaults all conspired against it on the EPP image we run (llm-d-inference-scheduler v0.8.0, embedding gateway-api-inference-extension v1.5.0), and fixing only one is not enough:

1. prefix-based-pd-decider was declared with no parameters, so nonCachedTokens took its int zero value, which the decider treats as "disabled": every request is served decode-only.
2. The decider reads a PrefixCacheMatchInfo attribute that prefix-cache-scorer no longer produces. GIE v1.5.0 split production into a separate plugin and made prepare-data default-on, so the prepareDataPlugins feature gate the v0.8.0 docs still tell you to set is unregistered and crashloops the EPP. The producer is now an explicit plugin, approx-prefix-cache-producer.
3. That producer defaults to autoTune: true, which leaves its block size 0 and never populates the attribute.

Add approx-prefix-cache-producer pinned to autoTune: false, set nonCachedTokens: 16, drop the feature gate, and wire the scorers into both profiles, matching the data path the v1.5.0 binary actually uses.

Verified live: long prompts now disaggregate (prefill engine's request_prefill_time counter increments and KV flows prefill->decode over NIXL) while short prompts correctly skip the prefill hop and serve decode-only.

Add a regression test pinning the three load-bearing settings.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block size the engine uses, or prefix-cache routing silently degrades (no error, just worse decisions). The config hardcoded blockSizeTokens: 16, which only works because it matches vLLM's default --block-size; a user who sets --block-size 32 (engine flags are the user's, per #137) would quietly get bad routing. Derive it best-effort from the decode engine's flags — vLLM's --block-size and SGLang's --page-size — falling back to 16 when absent or unparseable, and render it into the EPP config. Marked a HACK: peeking at user-owned engine args is the pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the serving block (#179). Signed-off-by: Dennis Ramdass <dennis@upbound.io>
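The best-effort derivation described in this commit can be sketched as follows. This is an illustration, not the PR's `_kv_block_size` or `_flag_value`: the names `flag_value` and `kv_block_size` and the exact fallback rules for malformed input are assumptions; the flags (--block-size, --page-size), the two accepted spellings, and the fallback of 16 come from the commit messages.

```python
from typing import List, Optional

DEFAULT_BLOCK_SIZE = 16  # matches vLLM's default --block-size, per the commit message


def flag_value(args: List[str], flag: str) -> Optional[str]:
    """Return the value given as `--flag value` or `--flag=value`, else None."""
    for i, arg in enumerate(args):
        if arg == flag:
            # A trailing flag with no value is treated as absent.
            return args[i + 1] if i + 1 < len(args) else None
        if arg.startswith(flag + "="):
            return arg[len(flag) + 1:]
    return None


def kv_block_size(engine_args: List[str]) -> int:
    """Derive the KV block size from engine flags, falling back to the default."""
    for flag in ("--block-size", "--page-size"):  # vLLM, then SGLang
        raw = flag_value(engine_args, flag)
        if raw is None:
            continue
        try:
            size = int(raw)
        except ValueError:
            continue  # unparseable: keep looking, then fall back
        if size > 0:
            return size
    return DEFAULT_BLOCK_SIZE
```

For example, `kv_block_size(["--model", "m", "--block-size=32"])` yields 32, while an empty or unparseable argument list yields 16.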

PrefillDecode silently fails when the engine image lacks the NIXL runtime: vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which the base vllm/vllm-openai image doesn't include, so disaggregated engines crashloop with "NIXL is not available". Engine images are the user's (#137), so Modelplane can't bundle it — but nothing told the user it was required. Document the prerequisite where it's relevant: the _disaggregated composition docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments design. The fix is to use a kv-connector-enabled image — build vLLM with INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as lmcache/vllm-openai. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

PrefillDecode engines need two things for cross-pod KV transfer that the ModelDeployment schema can't express: a Memory-backed /dev/shm (the container default 64Mi is too small for NIXL's shared-memory buffers) and VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (via fieldRef) so peer engines can reach this one's NIXL metadata channel. The engine template only allows valueFrom.secretKeyRef/configMapKeyRef (no fieldRef) and no volumes, so a user literally cannot supply them — and without them the decode engine can't fetch the prefill's KV and requests fail with a 500 and no error in the engine logs. Inject both onto every disaggregated engine, the same way the pd-sidecar is injected — infra-level and always-correct for PrefillDecode, no user input. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Disaggregated engines import the NIXL KV-transfer runtime through their connector (vLLM's NixlConnector, SGLang's transfer path). An image without NIXL crashes at startup with "NIXL is not available", which is easy to hit and hard to diagnose. Recent vanilla vllm/vllm-openai images ship NIXL, so the guidance is simply to pin a current tag. Note this prerequisite in the ModelDeployment guide and the design doc, and teach the docs vocabulary the NIXL/NixlConnector terms so Vale stops flagging them. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

_kv_block_size and _decode_port each open-coded the same "--flag value" / "--flag=value" engine-arg scan. Factor it into one _flag_value helper so both read the user's flags the same way. Also bring the _disaggregated NIXL-prerequisite docstring in line with the docs: recent vanilla vllm/vllm-openai images ship the NIXL runtime, so the guidance is to pin a current tag rather than build a kv-connector image. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Modelplane's pitch is that adopting an advanced inference-serving technique should be a deployment-level change, not a quarter-long project, but we have no end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't see how they'd evaluate something like prefill/decode disaggregation on their own workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under demo/prefill-decode/. The guide (README.md) opens with the latency problem (TTFT/ITL under load), explains why these techniques are powerful but operationally heavy, then walks the real adoption workflow with prefill/decode disaggregation as the worked example on one Modelplane cluster: a unified baseline, a PrefillDecode variant stood up on spare capacity, proving it disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of representative traffic, and promoting it behind one shared-label ModelService by shifting replica capacity (canary, promote, cut over, roll back are all just replicas, since the service load-balances evenly across healthy endpoints).

Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D deployments, the shared ModelService), a single run.sh (deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The disaggregation recipe matches what #175 validated live (vanilla vllm/vllm-openai:v0.19.1, which ships NIXL; kv_producer/kv_consumer; decode --port=8001; matched --block-size), with Modelplane composing the EPP, InferencePool, routing sidecar, and NIXL plumbing.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound force-pushed the dennis/fix-epp-disagg-config branch from 9807d2f to cadc2ee Compare June 17, 2026 05:18

dennis-upbound changed the title ~~Arm the prefill/decode decider in the composed EPP config~~ Use always-disagg-pd-decider so PrefillDecode actually disaggregates Jun 17, 2026

dennis-upbound force-pushed the dennis/fix-epp-disagg-config branch from cadc2ee to f56ccd4 Compare June 17, 2026 13:56

dennis-upbound changed the title ~~Use always-disagg-pd-decider so PrefillDecode actually disaggregates~~ Make PrefillDecode disaggregate via selective prefix-based PD Jun 17, 2026

dennis-upbound marked this pull request as ready for review June 17, 2026 14:03

haarchri mentioned this pull request Jun 17, 2026

Make the EndpointPicker (EPP) config user-configurable for PrefillDecode serving #179

Open

haarchri reviewed Jun 17, 2026


Comment thread functions/compose-model-replica/function/routing.py Outdated

dennis-upbound added 5 commits June 17, 2026 10:41

dennis-upbound mentioned this pull request Jun 17, 2026

ModelDeployment engine template can't express volumes #180

Closed

negz approved these changes Jun 17, 2026


dennis-upbound changed the title ~~Make PrefillDecode disaggregate via selective prefix-based PD~~ Make PrefillDecode actually disaggregate Jun 17, 2026

dennis-upbound merged commit e0cf861 into main Jun 17, 2026
5 checks passed

dennis-upbound mentioned this pull request Jun 17, 2026

WIP: Add a scenario guide for deploying advanced serving techniques #182

Closed

4 tasks

negz deleted the dennis/fix-epp-disagg-config branch June 18, 2026 03:22

dennis-upbound merged 6 commits into main from dennis/fix-epp-disagg-config
