Make PrefillDecode actually disaggregate #175

Merged: dennis-upbound merged 6 commits into main from dennis/fix-epp-disagg-config on Jun 17, 2026

dennis-upbound (Collaborator) commented Jun 17, 2026

Problem

Across a full benchmark on a live GKE cluster, the vLLM prefill engine handled
0 requests, the decode engine served everything, remote_engine_id was
null, and no KV cache transferred. PrefillDecode was silently running
decode-only.

Root cause — three stacked defaults, all wrong on our EPP image

EPP image: llm-d-inference-scheduler:v0.8.0 (embeds gateway-api-inference-extension v1.5.0). Fixing any one alone leaves it broken:

  1. prefix-based-pd-decider had no params, so nonCachedTokens = 0, which the
    decider treats as disabled (always decode-only).
  2. The decider reads a PrefixCacheMatchInfo attribute that nothing produced.
    GIE v1.5.0 split production out of prefix-cache-scorer into a separate
    plugin and made prepare-data default-on, so the prepareDataPlugins
    feature gate the v0.8.0 docs still tell you to set is unregistered and
    crashloops the EPP (feature gate 'prepareDataPlugins' is unknown or
    unregistered). The producer is now an explicit plugin:
    approx-prefix-cache-producer.
  3. That producer defaults to autoTune: true, which leaves its block size 0
    and never populates the attribute. Must pin autoTune: false +
    blockSizeTokens.
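
To make the stacking concrete, here is a toy model of why each default on its own yields decode-only. It is not the EPP's code: the function name, the use of `None` for "attribute never produced", and the `>=` comparison are all assumptions made for illustration.

```python
from typing import Optional


def should_disaggregate(
    prompt_tokens: int,
    cached_prefix_tokens: Optional[int],
    non_cached_threshold: int,
) -> bool:
    """Toy model of a prefix-based prefill/decode decision (hypothetical)."""
    # Default 1: a zero threshold is treated as "disabled", so decode-only.
    if non_cached_threshold == 0:
        return False
    # Defaults 2 and 3: if nothing populated the prefix-match attribute,
    # there is no cache information to act on, so decode-only again.
    if cached_prefix_tokens is None:
        return False
    # ASSUMPTION: ">=" is illustrative; the real comparison may differ.
    non_cached = prompt_tokens - cached_prefix_tokens
    return non_cached >= non_cached_threshold
```

Under this model a long unique prompt disaggregates only when the threshold is non-zero and the attribute is present, which matches the observation that fixing any single default leaves the path broken.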

(The v0.8.0 sample configs and docs/disaggregation.md are stale — they still
list the removed gate. This is effectively a packaging bug upstream; we work
around it with the correct v1.5.0 data-path config.)

Fix

Two pieces, both required for a request to actually disaggregate:

  1. Arm the EPP decider. Add approx-prefix-cache-producer (autoTune: false,
    blockSizeTokens derived from the engine's --block-size/--page-size), set
    nonCachedTokens: 16, drop the feature gate, and wire the scorers into both
    profiles.
  2. Inject the NIXL KV-transfer plumbing the schema can't express. Cross-pod
    NIXL needs VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (a fieldRef env)
    and a Memory-backed /dev/shm (a volume). Neither is expressible in the
    ModelDeployment engine template (env.valueFrom allows only secret/configMap
    refs; no volumes); the schema gap is tracked in #180 ("ModelDeployment
    engine template can't express volumes"). _inject_nixl_plumbing adds them to
    both phase engines, the same way the pd-sidecar is injected. Without the
    side-channel host the prefill advertises an unreachable address and the
    handshake fails (500/503, zero transfers). Also documents the engine-image
    prerequisite: the image must ship the NIXL runtime, which recent vanilla
    vllm/vllm-openai tags do.
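
A rough sketch of what "armed" means for piece 1, written as a checker over a config dict. The plugin names, parameter names, and the featureGates key come from this PR; the `plugins: [{type, parameters}]` layout, the helper name `pd_arming_problems`, and the `ARMED` sample are assumptions, not the real test or the rendered config.

```python
def pd_arming_problems(config: dict) -> list[str]:
    """Return reasons the PD decider would stay unarmed (hypothetical helper)."""
    problems: list[str] = []
    # ASSUMPTION: plugins are declared as {"type": ..., "parameters": {...}}.
    plugins = {p["type"]: p.get("parameters") or {} for p in config.get("plugins", [])}

    # The removed gate crashloops the EPP on the v1.5.0 binary.
    if "prepareDataPlugins" in config.get("featureGates", []):
        problems.append("prepareDataPlugins feature gate is set")

    # A missing or zero nonCachedTokens disables the decider.
    decider = plugins.get("prefix-based-pd-decider")
    if decider is None or decider.get("nonCachedTokens", 0) <= 0:
        problems.append("decider missing or nonCachedTokens is 0")

    # The producer must exist, with autoTune off and a pinned block size.
    producer = plugins.get("approx-prefix-cache-producer")
    if producer is None:
        problems.append("approx-prefix-cache-producer is not declared")
    elif producer.get("autoTune", True) or producer.get("blockSizeTokens", 0) <= 0:
        problems.append("producer must pin autoTune: false and blockSizeTokens > 0")
    return problems


ARMED = {
    "plugins": [
        {"type": "approx-prefix-cache-producer",
         "parameters": {"autoTune": False, "blockSizeTokens": 16}},
        {"type": "prefix-based-pd-decider",
         "parameters": {"nonCachedTokens": 16}},
    ],
}
```

`pd_arming_problems(ARMED)` returns an empty list, while a config that still sets the gate and declares the decider without parameters reports all three problems.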

Verified live:

| prompts | prefill engine request_prefill_time | behavior |
|---|---|---|
| 5× long unique | +5 | ✅ disaggregates (KV flows prefill→decode over NIXL) |
| 3× tiny ("hi") | +0 (flat) | ✅ selectively skips prefill, decode-only |
| 3× long unique | +3 | ✅ disaggregates |

So disaggregation engages and is correctly selective (short/cache-hot prompts
skip the prefill hop and its KV-transfer cost).

Hard offload evidence (re-validated 2026-06-17, vanilla vllm/vllm-openai:v0.19.1)

Clean run: armed EPP + injected NIXL plumbing, 1 prefill (kv_producer) + 1 decode
(kv_consumer), five ~600-token prompts. vLLM's own NIXL counters:

| engine | prompt_tokens | generation_tokens | nixl_xfer count | nixl_bytes_transferred |
|---|---|---|---|---|
| prefill (producer) | 6220 | 5 | | |
| decode (consumer) | (decode-side) | 160 | 5 | 367 MB (3.67e8) |

The prefill processes all prompt tokens and generates ~nothing (5 = handoff
tokens); the decode records exactly 5 NIXL transfers totaling 367 MB of KV
cache pulled from the prefill, then does the generation. That is the offload,
measured end-to-end, with no custom kv-connector image, just vanilla vLLM that
ships NIXL.

Validated the failure modes too: the unarmed EPP config serves everything
decode-only (prefill prompt_tokens stays 0); missing VLLM_NIXL_SIDE_CHANNEL_HOST
returns 500/503 with zero transfers — i.e. both halves of this PR are load-bearing.
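
The three outcomes above can be told apart from two counter deltas. The sketch below is only an illustration of that reading: the function name and the outcome labels are made up here, and treating "prefill saw tokens but the decode recorded no transfers" as a failed transfer is an inference from the runs described above, not something the engines report directly.

```python
def classify_run(prefill_prompt_tokens: int, decode_nixl_transfers: int) -> str:
    """Classify a benchmark run from counter deltas (hypothetical helper)."""
    # Unarmed EPP: the prefill engine never sees a prompt.
    if prefill_prompt_tokens == 0:
        return "decode-only"
    # ASSUMPTION: prefill ran but no KV reached the decode engine,
    # as in the missing side-channel-host case.
    if decode_nixl_transfers == 0:
        return "transfer-failed"
    # Prefill processed prompts and the decode pulled KV over NIXL.
    return "disaggregated"
```

With the clean-run numbers from the table (6220 prefill prompt tokens, 5 transfers) this yields "disaggregated".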

Alternatives considered

  • always-disagg-pd-decider — also works on this image and is simpler, but
    disaggregates unconditionally (pays KV-transfer overhead even on tiny prompts).
    prefix-based is the better default; keep always-disagg in mind as a fallback if
    the producer wiring ever regresses.
  • Downgrade to inference-scheduler v0.7.1 (GIE v1.4.0) — there the original
    featureGates: [prepareDataPlugins] config is correct, but it gives up v0.8.0.
    Rejected; the v1.5.0 config above is config-only and keeps us current.

Future considerations

  • We may want to expose the EPP config as a specTemplate for full customization later on

Test plan

  • compose-model-replica unit tests pass (43 + 8 subtests) — incl. test_epp_config_arms_the_pd_decider, test_kv_block_size, test_injects_nixl_plumbing
  • ruff check / ruff format clean
  • Verified live: selective disaggregation works (long → prefill engages, short → skips)
  • Verified live: KV offload measured — 5 transfers / 367 MB prefill→decode over NIXL on vanilla vllm/vllm-openai:v0.19.1
  • nix flake check (CI, incl. docs Vale)

dennis-upbound force-pushed the dennis/fix-epp-disagg-config branch from 9807d2f to cadc2ee Compare June 17, 2026 05:18
dennis-upbound changed the title from "Arm the prefill/decode decider in the composed EPP config" to "Use always-disagg-pd-decider so PrefillDecode actually disaggregates" Jun 17, 2026

serving.mode: PrefillDecode composes an EndpointPicker whose disaggregation is
gated by a decider plugin. The composed config never actually disaggregated:
across a full benchmark on a live GKE cluster the vLLM prefill engine handled
zero requests, the decode engine served everything, and no KV cache transferred.

Three defaults all conspired against it on the EPP image we run
(llm-d-inference-scheduler v0.8.0, embedding gateway-api-inference-extension
v1.5.0), and fixing only one is not enough:

  1. prefix-based-pd-decider was declared with no parameters, so nonCachedTokens
     took its int zero value, which the decider treats as "disabled" — every
     request decode-only.
  2. The decider reads a PrefixCacheMatchInfo attribute that prefix-cache-scorer
     no longer produces. GIE v1.5.0 split production into a separate plugin and
     made prepare-data default-on, so the prepareDataPlugins feature gate the
     v0.8.0 docs still tell you to set is unregistered and crashloops the EPP.
     The producer is now an explicit plugin, approx-prefix-cache-producer.
  3. That producer defaults to autoTune: true, which leaves its block size 0 and
     never populates the attribute.

Add approx-prefix-cache-producer pinned to autoTune: false, set
nonCachedTokens: 16, drop the feature gate, and wire the scorers into both
profiles, matching the data path the v1.5.0 binary actually uses. Verified live:
long prompts now disaggregate (prefill engine's request_prefill_time counter
increments and KV flows prefill->decode over NIXL) while short prompts correctly
skip the prefill hop and serve decode-only. Add a regression test pinning the
three load-bearing settings.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound force-pushed the dennis/fix-epp-disagg-config branch from cadc2ee to f56ccd4 Compare June 17, 2026 13:56
dennis-upbound changed the title from "Use always-disagg-pd-decider so PrefillDecode actually disaggregates" to "Make PrefillDecode disaggregate via selective prefix-based PD" Jun 17, 2026
dennis-upbound marked this pull request as ready for review June 17, 2026 14:03
Comment thread functions/compose-model-replica/function/routing.py Outdated

The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block
size the engine uses, or prefix-cache routing silently degrades (no error, just
worse decisions). The config hardcoded blockSizeTokens: 16, which only works
because it matches vLLM's default --block-size; a user who sets --block-size 32
(engine flags are the user's, per #137) would quietly get bad routing.

Derive it best-effort from the decode engine's flags — vLLM's --block-size and
SGLang's --page-size — falling back to 16 when absent or unparseable, and render
it into the EPP config. Marked a HACK: peeking at user-owned engine args is the
pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the
serving block (#179).

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
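
A minimal sketch of the best-effort derivation described in the commit above, assuming the engine args arrive as a flat list of strings. The function name `kv_block_size` and its signature are illustrative; the real `_kv_block_size` in routing.py may look different.

```python
def kv_block_size(engine_args: list[str], default: int = 16) -> int:
    """Best-effort KV block size from user-owned engine flags (sketch)."""
    # vLLM uses --block-size, SGLang uses --page-size.
    flags = ("--block-size", "--page-size")
    for i, arg in enumerate(engine_args):
        for flag in flags:
            value = None
            if arg == flag and i + 1 < len(engine_args):
                value = engine_args[i + 1]        # "--flag value" form
            elif arg.startswith(flag + "="):
                value = arg[len(flag) + 1:]       # "--flag=value" form
            if value is None:
                continue
            try:
                size = int(value)
            except ValueError:
                return default                    # unparseable: fall back
            return size if size > 0 else default
    return default                                # absent: fall back
```

For example, `kv_block_size(["--block-size", "32"])` gives 32, and an empty arg list falls back to 16.
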
PrefillDecode silently fails when the engine image lacks the NIXL runtime:
vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which
the base vllm/vllm-openai image doesn't include, so disaggregated engines
crashloop with "NIXL is not available". Engine images are the user's (#137), so
Modelplane can't bundle it — but nothing told the user it was required.

Document the prerequisite where it's relevant: the _disaggregated composition
docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments
design. The fix is to use a kv-connector-enabled image — build vLLM with
INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as
lmcache/vllm-openai.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PrefillDecode engines need two things for cross-pod KV transfer that the
ModelDeployment schema can't express: a Memory-backed /dev/shm (the container
default 64Mi is too small for NIXL's shared-memory buffers) and
VLLM_NIXL_SIDE_CHANNEL_HOST set to the pod IP (via fieldRef) so peer engines can
reach this one's NIXL metadata channel. The engine template only allows
valueFrom.secretKeyRef/configMapKeyRef (no fieldRef) and no volumes, so a user
literally cannot supply them — and without them the decode engine can't fetch
the prefill's KV and requests fail with a 500 and no error in the engine logs.

Inject both onto every disaggregated engine, the same way the pd-sidecar is
injected — infra-level and always-correct for PrefillDecode, no user input.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
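
A sketch of the injection described in the commit above, operating on a plain pod-spec dict. The Kubernetes fields (`fieldRef` with `status.podIP`, `emptyDir` with `medium: Memory`) are standard; the function name, the `dshm` volume name, and the `engine` container name are assumptions for illustration and not necessarily what `_inject_nixl_plumbing` uses.

```python
import copy


def inject_nixl_plumbing(pod_spec: dict, container_name: str = "engine") -> dict:
    """Return a copy of pod_spec with NIXL env and /dev/shm added (sketch)."""
    spec = copy.deepcopy(pod_spec)  # never mutate the caller's spec

    # Memory-backed emptyDir to replace the small default /dev/shm.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == "dshm" for v in volumes):
        volumes.append({"name": "dshm", "emptyDir": {"medium": "Memory"}})

    for container in spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        # Advertise the pod IP so peer engines can reach the side channel.
        env = container.setdefault("env", [])
        if not any(e.get("name") == "VLLM_NIXL_SIDE_CHANNEL_HOST" for e in env):
            env.append({
                "name": "VLLM_NIXL_SIDE_CHANNEL_HOST",
                "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
            })
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == "/dev/shm" for m in mounts):
            mounts.append({"name": "dshm", "mountPath": "/dev/shm"})
    return spec
```

The existence checks make the sketch idempotent, so applying it twice to the same spec changes nothing the second time.
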
Disaggregated engines import the NIXL KV-transfer runtime through their
connector (vLLM's NixlConnector, SGLang's transfer path). An image without
NIXL crashes at startup with "NIXL is not available", which is easy to hit
and hard to diagnose. Recent vanilla vllm/vllm-openai images ship NIXL, so
the guidance is simply to pin a current tag.

Note this prerequisite in the ModelDeployment guide and the design doc, and
teach the docs vocabulary the NIXL/NixlConnector terms so Vale stops flagging
them.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
_kv_block_size and _decode_port each open-coded the same "--flag value" /
"--flag=value" engine-arg scan. Factor it into one _flag_value helper so both
read the user's flags the same way.
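
One way such a shared scan could look, as a sketch only: the name `flag_value` and the choice to return the raw string (or None) are assumptions, and the actual `_flag_value` helper may differ.

```python
from typing import Optional, Sequence


def flag_value(args: Sequence[str], flag: str) -> Optional[str]:
    """Return the value of `flag` from engine args, or None if absent (sketch)."""
    for i, arg in enumerate(args):
        if arg == flag:
            # "--flag value" form; a trailing flag with no value yields None.
            return args[i + 1] if i + 1 < len(args) else None
        if arg.startswith(flag + "="):
            # "--flag=value" form.
            return arg[len(flag) + 1:]
    return None
```

Both spellings resolve the same way, so `flag_value(["--port=8001"], "--port")` and `flag_value(["--port", "8001"], "--port")` each return the string "8001", leaving int parsing and defaults to the caller.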

Also bring the _disaggregated NIXL-prerequisite docstring in line with the
docs: recent vanilla vllm/vllm-openai images ship the NIXL runtime, so the
guidance is to pin a current tag rather than build a kv-connector image.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound changed the title from "Make PrefillDecode disaggregate via selective prefix-based PD" to "Make PrefillDecode actually disaggregate" Jun 17, 2026
dennis-upbound merged commit e0cf861 into main Jun 17, 2026
5 checks passed
dennis-upbound added a commit that referenced this pull request Jun 17, 2026
Modelplane's pitch is that adopting an advanced inference-serving technique
should be a deployment-level change, not a quarter-long project — but we have
no end-to-end, customer-facing walkthrough that shows it. A prospective ML team
can't see how they'd evaluate something like prefill/decode disaggregation on
their own workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under
demo/prefill-decode/. The guide (README.md) opens with the latency problem
(TTFT/ITL under load), explains why these techniques are powerful but
operationally heavy, then walks the real adoption workflow with prefill/decode
disaggregation as the worked example on one Modelplane cluster: a unified
baseline, a PrefillDecode variant stood up on spare capacity, proving it
disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of
representative traffic, and promoting it behind one shared-label ModelService by
shifting replica capacity (canary, promote, cut over, roll back are all just
replicas, since the service load-balances evenly across healthy endpoints).

Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D
deployments, the shared ModelService), a single run.sh
(deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The
disaggregation recipe matches what #175 validated live — vanilla
vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode
--port=8001, matched --block-size — with Modelplane composing the EPP,
InferencePool, routing sidecar, and NIXL plumbing.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@negz negz deleted the dennis/fix-epp-disagg-config branch June 18, 2026 03:22
dennis-upbound added a commit that referenced this pull request Jun 18, 2026
dennis-upbound added a commit that referenced this pull request Jun 18, 2026