Make PrefillDecode actually disaggregate#175
Merged
9807d2f to
cadc2ee
Compare
serving.mode: PrefillDecode composes an EndpointPicker whose disaggregation is
gated by a decider plugin. The composed config never actually disaggregated:
across a full benchmark on a live GKE cluster the vLLM prefill engine handled
zero requests, the decode engine served everything, and no KV cache transferred.
Three defaults all conspired against it on the EPP image we run
(llm-d-inference-scheduler v0.8.0, embedding gateway-api-inference-extension
v1.5.0), and fixing only one is not enough:
1. prefix-based-pd-decider was declared with no parameters, so nonCachedTokens
took its int zero value, which the decider treats as "disabled" — every
request decode-only.
2. The decider reads a PrefixCacheMatchInfo attribute that prefix-cache-scorer
no longer produces. GIE v1.5.0 split production into a separate plugin and
made prepare-data default-on, so the prepareDataPlugins feature gate the
v0.8.0 docs still tell you to set is unregistered and crashloops the EPP.
The producer is now an explicit plugin, approx-prefix-cache-producer.
3. That producer defaults to autoTune: true, which leaves its block size 0 and
never populates the attribute.
Add approx-prefix-cache-producer pinned to autoTune: false, set
nonCachedTokens: 16, drop the feature gate, and wire the scorers into both
profiles, matching the data path the v1.5.0 binary actually uses. Verified live:
long prompts now disaggregate (prefill engine's request_prefill_time counter
increments and KV flows prefill->decode over NIXL) while short prompts correctly
skip the prefill hop and serve decode-only. Add a regression test pinning the
three load-bearing settings.
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
cadc2ee to
f56ccd4
Compare
haarchri
reviewed
Jun 17, 2026
The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block size the engine uses, or prefix-cache routing silently degrades (no error, just worse decisions). The config hardcoded blockSizeTokens: 16, which only works because it matches vLLM's default --block-size; a user who sets --block-size 32 (engine flags are the user's, per #137) would quietly get bad routing. Derive it best-effort from the decode engine's flags — vLLM's --block-size and SGLang's --page-size — falling back to 16 when absent or unparseable, and render it into the EPP config. Marked a HACK: peeking at user-owned engine args is the pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the serving block (#179). Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PrefillDecode silently fails when the engine image lacks the NIXL runtime: vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which the base vllm/vllm-openai image doesn't include, so disaggregated engines crashloop with "NIXL is not available". Engine images are the user's (#137), so Modelplane can't bundle it — but nothing told the user it was required. Document the prerequisite where it's relevant: the _disaggregated composition docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments design. The fix is to use a kv-connector-enabled image — build vLLM with INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as lmcache/vllm-openai. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PrefillDecode engines need two things for cross-pod KV transfer that the ModelDeployment schema can't express: a Memory-backed /dev/shm (the container default 64Mi is too small for NIXL's shared-memory buffers) and VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (via fieldRef) so peer engines can reach this one's NIXL metadata channel. The engine template only allows valueFrom.secretKeyRef/configMapKeyRef (no fieldRef) and no volumes, so a user literally cannot supply them — and without them the decode engine can't fetch the prefill's KV and requests fail with a 500 and no error in the engine logs. Inject both onto every disaggregated engine, the same way the pd-sidecar is injected — infra-level and always-correct for PrefillDecode, no user input. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
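The injection described in this commit can be sketched as a small pod-spec transform. This is a minimal illustration, not the project's _inject_nixl_plumbing: the function name, the "engine" container name, and the "dshm" volume name are assumptions made for the example. The Kubernetes fields it uses (emptyDir with medium: Memory, and a fieldRef on status.podIP) are standard.

```python
import copy

SHM_VOLUME = "dshm"  # hypothetical volume name, chosen for this sketch


def inject_nixl_plumbing(pod_spec: dict, engine_container: str = "engine") -> dict:
    """Return a copy of pod_spec with the NIXL env var and /dev/shm volume added."""
    spec = copy.deepcopy(pod_spec)

    # A Memory-backed emptyDir replaces the small container-default /dev/shm.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == SHM_VOLUME for v in volumes):
        volumes.append({"name": SHM_VOLUME, "emptyDir": {"medium": "Memory"}})

    for container in spec.get("containers", []):
        if container.get("name") != engine_container:
            continue  # leave sidecars untouched

        # Peers reach this engine's NIXL metadata channel via the pod IP,
        # which is only known at runtime, hence the downward-API fieldRef.
        env = container.setdefault("env", [])
        if not any(e.get("name") == "VLLM_NIXL_SIDE_CHANNEL_HOST" for e in env):
            env.append({
                "name": "VLLM_NIXL_SIDE_CHANNEL_HOST",
                "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
            })

        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == "/dev/shm" for m in mounts):
            mounts.append({"name": SHM_VOLUME, "mountPath": "/dev/shm"})
    return spec
```

The sketch is idempotent and leaves its input unmodified, so applying it to an already-injected spec returns an equal spec.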
Disaggregated engines import the NIXL KV-transfer runtime through their connector (vLLM's NixlConnector, SGLang's transfer path). An image without NIXL crashes at startup with "NIXL is not available", which is easy to hit and hard to diagnose. Recent vanilla vllm/vllm-openai images ship NIXL, so the guidance is simply to pin a current tag. Note this prerequisite in the ModelDeployment guide and the design doc, and teach the docs vocabulary the NIXL/NixlConnector terms so Vale stops flagging them. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
_kv_block_size and _decode_port each open-coded the same "--flag value" / "--flag=value" engine-arg scan. Factor it into one _flag_value helper so both read the user's flags the same way. Also bring the _disaggregated NIXL-prerequisite docstring in line with the docs: recent vanilla vllm/vllm-openai images ship the NIXL runtime, so the guidance is to pin a current tag rather than build a kv-connector image. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
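A shared engine-arg scan of the kind this commit describes might look like the following. It is a sketch under assumptions, not the code from the PR: the function bodies, the last-occurrence-wins rule, and the treatment of non-positive values as unparseable are all choices made for the example. The flag names and the fallback of 16 come from the commit messages above.

```python
from __future__ import annotations

DEFAULT_BLOCK_SIZE = 16  # vLLM's default --block-size, per the commit message


def flag_value(args: list[str], flag: str) -> str | None:
    """Return the value of `flag` from an argv-style list, or None if absent.

    Handles both "--flag value" and "--flag=value". The last occurrence wins
    (an assumption for this sketch).
    """
    found = None
    for i, arg in enumerate(args):
        if arg == flag and i + 1 < len(args):
            found = args[i + 1]
        elif arg.startswith(flag + "="):
            found = arg[len(flag) + 1:]
    return found


def kv_block_size(args: list[str]) -> int:
    """Best-effort KV block size: vLLM --block-size, then SGLang --page-size."""
    for flag in ("--block-size", "--page-size"):
        raw = flag_value(args, flag)
        if raw is None:
            continue
        try:
            size = int(raw)
        except ValueError:
            continue  # unparseable: try the next flag, then fall back
        if size > 0:
            return size
    return DEFAULT_BLOCK_SIZE
```

For example, kv_block_size(["--block-size=32"]) yields 32, while an empty argument list falls back to 16.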
negz
approved these changes
Jun 17, 2026
4 tasks
dennis-upbound
added a commit
that referenced
this pull request
Jun 17, 2026
Modelplane's pitch is that adopting an advanced inference-serving technique should be a deployment-level change, not a quarter-long project — but we have no end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't see how they'd evaluate something like prefill/decode disaggregation on their own workload and roll it out to production safely. This adds a self-contained scenario guide and runnable kit under demo/prefill-decode/. The guide (README.md) opens with the latency problem (TTFT/ITL under load), explains why these techniques are powerful but operationally heavy, then walks the real adoption workflow with prefill/decode disaggregation as the worked example on one Modelplane cluster: a unified baseline, a PrefillDecode variant stood up on spare capacity, proving it disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of representative traffic, and promoting it behind one shared-label ModelService by shifting replica capacity (canary, promote, cut over, roll back are all just replicas, since the service load-balances evenly across healthy endpoints). Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D deployments, the shared ModelService), a single run.sh (deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The disaggregation recipe matches what #175 validated live — vanilla vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode --port=8001, matched --block-size — with Modelplane composing the EPP, InferencePool, routing sidecar, and NIXL plumbing. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Problem
Across a full benchmark on a live GKE cluster, the vLLM prefill engine handled
0 requests, the decode engine served everything, remote_engine_id was null, and
no KV cache transferred. PrefillDecode was silently running decode-only.
Root cause — three stacked defaults, all wrong on our EPP image
EPP image: llm-d-inference-scheduler:v0.8.0 (embeds
gateway-api-inference-extension v1.5.0). Fixing any one alone leaves it broken:
1. prefix-based-pd-decider had no params → nonCachedTokens = 0, which the
decider treats as disabled (always decode-only).
2. The decider reads a PrefixCacheMatchInfo attribute that nothing produced.
GIE v1.5.0 split production out of prefix-cache-scorer into a separate plugin
and made prepare-data default-on, so the prepareDataPlugins feature gate the
v0.8.0 docs still tell you to set is unregistered and crashloops the EPP
(feature gate 'prepareDataPlugins' is unknown or unregistered). The producer is
now an explicit plugin: approx-prefix-cache-producer.
3. That producer defaults to autoTune: true, which leaves its block size 0 and
never populates the attribute. Must pin autoTune: false + blockSizeTokens.
(The v0.8.0 sample configs and docs/disaggregation.md are stale: they still
list the removed gate. This is effectively a packaging bug upstream; we work
around it with the correct v1.5.0 data-path config.)
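The way the three defaults stack can be illustrated with a toy model. This is not the upstream plugin code: the Config fields, the round-down-to-whole-blocks estimate, and the >= comparison are assumptions made for the sketch. Only the observed behavior (a threshold of 0 disables the decider, and a missing or zero-block-size producer never yields match info) comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Config:
    non_cached_tokens: int = 0  # decider threshold; 0 is treated as "disabled"
    has_producer: bool = False  # is approx-prefix-cache-producer configured?
    block_size_tokens: int = 0  # 0 models autoTune leaving the size unset


def cache_match(cfg: Config, prompt_tokens: int, cached_tokens: int) -> int | None:
    """Model of the producer: a cached-token estimate, or None if never produced."""
    if not cfg.has_producer or cfg.block_size_tokens <= 0:
        return None
    # Assumption: the cached prefix is counted in whole blocks.
    blocks = min(cached_tokens, prompt_tokens) // cfg.block_size_tokens
    return blocks * cfg.block_size_tokens


def should_disaggregate(cfg: Config, prompt_tokens: int, cached_tokens: int = 0) -> bool:
    if cfg.non_cached_tokens <= 0:
        return False  # default 1: decider disabled
    matched = cache_match(cfg, prompt_tokens, cached_tokens)
    if matched is None:
        return False  # defaults 2 and 3: no match info to decide on
    return prompt_tokens - matched >= cfg.non_cached_tokens


armed = Config(non_cached_tokens=16, has_producer=True, block_size_tokens=16)
```

In this model a 600-token cold prompt disaggregates under the armed config, while a very short prompt, a mostly cached prompt, or any config with one of the three defaults left in place stays decode-only.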
Fix
Two pieces, both required for a request to actually disaggregate:
1. Add approx-prefix-cache-producer (autoTune: false, blockSizeTokens derived
from the engine's --block-size / --page-size), set nonCachedTokens: 16, drop
the feature gate, and wire the scorers into both profiles.
2. NIXL needs VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (a fieldRef env)
and a Memory-backed /dev/shm (a volume). Neither is expressible in the
ModelDeployment engine template (env.valueFrom allows only secret/configMap
refs; no volumes); the schema gap is tracked in "ModelDeployment engine
template can't express volumes" (#180). _inject_nixl_plumbing adds them to both
phase engines, the same way the pd-sidecar is injected. Without the
side-channel host the prefill advertises an unreachable address and the
handshake fails (500/503, zero transfers). Also documents the engine-image
prerequisite: the image must ship the NIXL runtime, which recent vanilla
vllm/vllm-openai tags do.
Verified live:
[table lost in extraction; surviving cells: request_prefill_time, "hi"]
So disaggregation engages and is correctly selective (short/cache-hot prompts
skip the prefill hop and its KV-transfer cost).
Hard offload evidence (re-validated 2026-06-17, vanilla vllm/vllm-openai:v0.19.1)
Clean run: armed EPP + injected NIXL plumbing, 1 prefill (kv_producer) + 1
decode (kv_consumer), five ~600-token prompts. vLLM's own NIXL counters:
[table values lost in extraction; surviving headers: prompt_tokens,
generation_tokens, nixl_xfer count, nixl_bytes_transferred]
The prefill processes all prompt tokens and generates ~nothing (5 = handoff
tokens); the decode records exactly 5 NIXL transfers totaling 367 MB of KV
cache pulled from the prefill, then does the generation. That is the offload,
measured end-to-end, with no custom kv-connector image: just vanilla vLLM that
ships NIXL.
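As a rough sanity check on those figures, the per-transfer and per-token KV volume can be worked out from the totals quoted above. The sketch assumes one transfer per prompt (five prompts, five transfers), decimal megabytes, and exactly 600 tokens per prompt where the run reports "~600", so the results are approximate.

```python
transfers = 5            # NIXL transfers recorded on the decode engine
total_mb = 367           # total KV cache pulled from the prefill, in MB
tokens_per_prompt = 600  # approximate; the run used ~600-token prompts

mb_per_transfer = total_mb / transfers
kb_per_token = mb_per_transfer * 1000 / tokens_per_prompt

print(f"{mb_per_transfer:.1f} MB per transfer, ~{kb_per_token:.0f} KB per token")
# prints "73.4 MB per transfer, ~122 KB per token"
```

A per-token figure like this can be compared against the expected KV-cache footprint of the model being served to confirm that whole prompts, not fragments, were transferred.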
Validated the failure modes too: the unarmed EPP config serves everything
decode-only (prefill prompt_tokens stays 0); missing
VLLM_NIXL_SIDE_CHANNEL_HOST returns 500/503 with zero transfers, i.e. both
halves of this PR are load-bearing.
Alternatives considered
- always-disagg-pd-decider: also works on this image and is simpler, but
disaggregates unconditionally (pays KV-transfer overhead even on tiny prompts).
prefix-based is the better default; keep always-disagg in mind as a fallback if
the producer wiring ever regresses.
- Downgrading to an EPP image where the documented
featureGates: [prepareDataPlugins] config is correct would also work, but it
gives up v0.8.0. Rejected; the v1.5.0 config above is config-only and keeps us
current.
Future considerations
Test plan
- compose-model-replica unit tests pass (43 + 8 subtests), incl. test_epp_config_arms_the_pd_decider, test_kv_block_size, test_injects_nixl_plumbing
- ruff check / ruff format clean
- Live validation on vanilla vllm/vllm-openai:v0.19.1
- nix flake check (CI, incl. docs Vale)
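A regression check in the spirit of test_epp_config_arms_the_pd_decider could look roughly like this. It is a sketch, not the test from the PR: the helper name and the dict shape (a plugins list with type and parameters, plus a featureGates list) are assumptions about how the rendered EPP config is represented. The plugin names and parameter values come from the PR description.

```python
from __future__ import annotations


def pd_config_problems(cfg: dict) -> list[str]:
    """Return the reasons a rendered EPP config would fail to disaggregate."""
    problems = []
    plugins = {p["type"]: p.get("parameters") or {} for p in cfg.get("plugins", [])}

    decider = plugins.get("prefix-based-pd-decider")
    if decider is None or int(decider.get("nonCachedTokens", 0)) <= 0:
        problems.append("decider disabled: nonCachedTokens must be > 0")

    producer = plugins.get("approx-prefix-cache-producer")
    if producer is None:
        problems.append("approx-prefix-cache-producer missing")
    elif producer.get("autoTune", True) or int(producer.get("blockSizeTokens", 0)) <= 0:
        problems.append("producer must pin autoTune: false and blockSizeTokens > 0")

    if "prepareDataPlugins" in cfg.get("featureGates", []):
        problems.append("prepareDataPlugins gate is unregistered on this EPP")
    return problems


armed_config = {
    "plugins": [
        {"type": "prefix-based-pd-decider", "parameters": {"nonCachedTokens": 16}},
        {"type": "approx-prefix-cache-producer",
         "parameters": {"autoTune": False, "blockSizeTokens": 16}},
    ],
}
assert pd_config_problems(armed_config) == []
```

Returning a list of problems rather than a single boolean keeps the failure message specific when only one of the three settings regresses.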