
WIP: Add a scenario guide for deploying advanced serving techniques#182

Closed
dennis-upbound wants to merge 2 commits into main from dennis/prefill-decode-demo

Conversation

@dennis-upbound dennis-upbound commented Jun 17, 2026

Collaborator

Description of your changes

Modelplane's pitch is that adopting an advanced inference-serving technique should
be a deployment-level change, not a quarter-long project — but we have no
end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't
see how they'd evaluate something like prefill/decode disaggregation on their own
workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under
demo/prefill-decode/. The guide (README.md, "Deploying advanced serving
techniques") opens with the latency problem (TTFT/ITL under load), explains why
these techniques are powerful but operationally heavy to ship, then walks the real
adoption workflow with prefill/decode disaggregation as the worked example, on one
Modelplane cluster:

  • a unified baseline ModelDeployment → a PrefillDecode variant stood up
    as a second deployment on spare capacity → proving it genuinely disaggregates
    via vLLM's own NIXL KV-transfer counters → benchmarking a replay of
    representative traffic → promoting it behind a single shared-label
    ModelService by shifting replica capacity. Because the service load-balances
    evenly across healthy endpoints (one endpoint per replica), canary → promote →
    cut over → roll back are all just replica-count changes: no traffic weights,
    no platform ticket.
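
As a rough illustration of the replica-count arithmetic above, the sketch below computes the expected traffic split when every healthy endpoint receives an equal share of requests. The deployment names (`unified`, `pd`) and the replica counts per stage are hypothetical and are not taken from the manifests in this PR.

```python
def traffic_share(replicas: dict) -> dict:
    """Expected fraction of requests per deployment, assuming one endpoint
    per replica and an even spread across all healthy endpoints."""
    total = sum(replicas.values())
    if total == 0:
        raise ValueError("no healthy endpoints to route to")
    return {name: count / total for name, count in replicas.items()}


# Hypothetical replica counts for each stage of the rollout.
stages = {
    "canary": {"unified": 3, "pd": 1},
    "promote": {"unified": 1, "pd": 3},
    "cut over": {"unified": 0, "pd": 4},
    "roll back": {"unified": 4, "pd": 0},
}

for stage, replicas in stages.items():
    print(stage, traffic_share(replicas))
```

Under these assumptions a 3:1 canary sends about a quarter of requests to the P/D variant, and rolling back is the same operation with the counts reversed.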

It's backed by runnable artifacts, not just prose: manifests/ (cluster, cache,
unified + P/D deployments, the shared ModelService), a single
run.sh {deploy|prove|bench|promote|rollback}, and a replay-trace.jsonl. The
disaggregation recipe matches what #175 validated live — vanilla
vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode
--port=8001, matched --block-size — with Modelplane composing the EPP /
InferencePool / routing sidecar / NIXL plumbing.
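
For readers unfamiliar with the recipe, here is a minimal sketch of how the extra vLLM server arguments for the two roles might be assembled. It assumes the `--kv-transfer-config` JSON form with the `NixlConnector` connector. The helper name `pd_args` and the block size of 128 are hypothetical, and the actual manifests in this PR may pass these settings differently.

```python
import json
from typing import List, Optional


def pd_args(role: str, block_size: int, port: Optional[int] = None) -> List[str]:
    """Build the extra vLLM server arguments for one side of a P/D pair.

    Assumption: both sides use the NIXL connector and must agree on
    --block-size; only the role (and optionally the port) differs.
    """
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role}")
    kv_config = {"kv_connector": "NixlConnector", "kv_role": role}
    args = ["--kv-transfer-config", json.dumps(kv_config), f"--block-size={block_size}"]
    if port is not None:
        args.append(f"--port={port}")
    return args


BLOCK_SIZE = 128  # hypothetical value; it only needs to match on both sides

prefill = pd_args("kv_producer", BLOCK_SIZE)
decode = pd_args("kv_consumer", BLOCK_SIZE, port=8001)  # decode port from the recipe
print(prefill)
print(decode)
```

Deriving both argument lists from one block-size constant is one way to keep the "matched --block-size" requirement from drifting between the prefill and decode sides.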

Draft / WIP: the guide's benchmark curve and offload counters are placeholders
pending a live run on a fresh cluster. I'll fill the real numbers and confirm the
GuideLLM --data replay shape against the live endpoint, then mark ready.

Reviewers — the judgment calls are the narrative framing (does the problem →
technique → scenario arc land for an ML-team audience?) and whether the
promotion-by-replica-count story reads as realistic for a production rollout.

Builds on #175.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes. (docs, manifests, and scripts only — no composition function changes)
  • Signed off every commit with git commit -s.

@dennis-upbound dennis-upbound changed the title Add a scenario guide for deploying advanced serving techniques WIP: Add a scenario guide for deploying advanced serving techniques Jun 17, 2026
@dennis-upbound dennis-upbound force-pushed the dennis/prefill-decode-demo branch 2 times, most recently from d6e48f8 to d3ce655 Compare June 18, 2026 14:35
Modelplane's pitch is that adopting an advanced inference-serving technique
should be a deployment-level change, not a quarter-long project — but we have
no end-to-end, customer-facing walkthrough that shows it. A prospective ML team
can't see how they'd evaluate something like prefill/decode disaggregation on
their own workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under
demo/prefill-decode/. The guide (README.md) opens with the latency problem
(TTFT/ITL under load), explains why these techniques are powerful but
operationally heavy, then walks the real adoption workflow with prefill/decode
disaggregation as the worked example on one Modelplane cluster: a unified
baseline, a PrefillDecode variant stood up on spare capacity, proving it
disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of
representative traffic, and promoting it behind one shared-label ModelService by
shifting replica capacity (canary, promote, cut over, roll back are all just
replicas, since the service load-balances evenly across healthy endpoints).

Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D
deployments, the shared ModelService), a single run.sh
(deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The
disaggregation recipe matches what #175 validated live — vanilla
vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode
--port=8001, matched --block-size — with Modelplane composing the EPP,
InferencePool, routing sidecar, and NIXL plumbing.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@dennis-upbound dennis-upbound force-pushed the dennis/prefill-decode-demo branch from d3ce655 to b37cea1 Compare June 18, 2026 14:42
…ints

The promotion flow relied on a single ModelService selecting a shared
`model: qwen` label across both deployments, but compose-model-deployment only
stamps `modelplane.ai/deployment: <name>` on each ModelEndpoint — it doesn't
copy the deployment's own labels. So the shared-label selector matched nothing.

A ModelService's spec.endpoints is a list, so the correct way to front both
deployments from one endpoint is one entry per deployment, each selecting its
`modelplane.ai/deployment` label. The HTTPRoute then load-balances across all
matched endpoints, so the unified:P/D split still follows replica count and the
canary/promote/rollback flow is unchanged. Drop the now-unused shared label.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
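
The selector problem described in the commit message above can be sketched with a toy label matcher. This is a simplified model that assumes exact-equality label matching. The deployment names `qwen-unified` and `qwen-pd`, the endpoint names, and the helper functions are hypothetical. Only the `modelplane.ai/deployment` label key and the `model: qwen` label come from the commit message.

```python
def matches(selector: dict, labels: dict) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selectors: list, endpoints: list) -> list:
    """Names of endpoints matched by at least one selector entry."""
    return [
        ep["name"]
        for ep in endpoints
        if any(matches(sel, ep["labels"]) for sel in selectors)
    ]


# Each ModelEndpoint carries only the stamped deployment label,
# not the labels set on the deployment itself.
endpoints = [
    {"name": "unified-0", "labels": {"modelplane.ai/deployment": "qwen-unified"}},
    {"name": "unified-1", "labels": {"modelplane.ai/deployment": "qwen-unified"}},
    {"name": "pd-0", "labels": {"modelplane.ai/deployment": "qwen-pd"}},
]

shared_label = [{"model": "qwen"}]
per_deployment = [
    {"modelplane.ai/deployment": "qwen-unified"},
    {"modelplane.ai/deployment": "qwen-pd"},
]

print(select(shared_label, endpoints))    # shared label is never stamped on endpoints
print(select(per_deployment, endpoints))  # one entry per deployment
```

In this model the shared-label selector selects no endpoints, while one entry per deployment selects all three, so the traffic split still follows the replica counts.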
@dennis-upbound dennis-upbound deleted the dennis/prefill-decode-demo branch June 19, 2026 17:26