WIP: Add a scenario guide for deploying advanced serving techniques by dennis-upbound · Pull Request #182 · modelplaneai/modelplane

dennis-upbound · 2026-06-17T21:43:12Z

Description of your changes

Modelplane's pitch is that adopting an advanced inference-serving technique should
be a deployment-level change, not a quarter-long project — but we have no
end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't
see how they'd evaluate something like prefill/decode disaggregation on their own
workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under
demo/prefill-decode/. The guide (README.md — "Deploying advanced serving
techniques") opens with the latency problem (TTFT/ITL under load), explains why
these techniques are powerful but operationally heavy to ship, then walks the real
adoption workflow with prefill/decode disaggregation as the worked example, on one
Modelplane cluster:

a unified baseline ModelDeployment → a PrefillDecode variant stood up
as a second deployment on spare capacity → proving it genuinely disaggregates
via vLLM's own NIXL KV-transfer counters → benchmarking a replay of
representative traffic → promoting it behind a single shared-label
ModelService by shifting replica capacity. Because the service load-balances
evenly across healthy endpoints (one endpoint per replica), canary → promote →
cut over → roll back are all just replicas — no traffic weights, no platform
ticket.
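The claim that every rollout step is "just replicas" follows from simple arithmetic: if each replica contributes one endpoint and the service weights healthy endpoints equally, a deployment's expected share of traffic is its replica count divided by the total. A minimal sketch of that arithmetic follows. The deployment names and replica counts are hypothetical, and it assumes every replica is healthy.

```python
from fractions import Fraction


def traffic_share(replicas: dict[str, int]) -> dict[str, Fraction]:
    """Expected share of requests per deployment.

    Assumes one endpoint per replica and equal weighting across healthy
    endpoints, as the description above states. Names are illustrative.
    """
    total = sum(replicas.values())
    if total == 0:
        raise ValueError("no healthy endpoints to route to")
    return {name: Fraction(count, total) for name, count in replicas.items()}


# Hypothetical rollout stages on a fixed budget of four replicas.
stages = {
    "canary": {"unified": 3, "pd": 1},
    "promote": {"unified": 2, "pd": 2},
    "cut over": {"unified": 0, "pd": 4},
    "roll back": {"unified": 4, "pd": 0},
}

for stage, replicas in stages.items():
    shares = traffic_share(replicas)
    print(stage, {name: f"{float(share):.0%}" for name, share in shares.items()})
```

Under these assumptions a 3:1 replica split sends about a quarter of requests to the P/D variant, and rolling back is the same operation with the counts reversed.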

It's backed by runnable artifacts, not just prose: manifests/ (cluster, cache,
unified + P/D deployments, the shared ModelService), a single
run.sh {deploy|prove|bench|promote|rollback}, and a replay-trace.jsonl. The
disaggregation recipe matches what #175 validated live — vanilla
vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode
--port=8001, matched --block-size — with Modelplane composing the EPP /
InferencePool / routing sidecar / NIXL plumbing.

Draft / WIP: the guide's benchmark curve and offload counters are placeholders
pending a live run on a fresh cluster. I'll fill the real numbers and confirm the
GuideLLM --data replay shape against the live endpoint, then mark ready.

Reviewers — the judgment calls are the narrative framing (does the problem →
technique → scenario arc land for an ML-team audience?) and whether the
promotion-by-replica-count story reads as faithful for a production rollout.

Builds on #175.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
~~Added or updated tests covering any composition function changes.~~ (docs, manifests, and scripts only — no composition function changes)
Signed off every commit with git commit -s.

Modelplane's pitch is that adopting an advanced inference-serving technique should be a deployment-level change, not a quarter-long project, but we have no end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't see how they'd evaluate something like prefill/decode disaggregation on their own workload and roll it out to production safely.

This adds a self-contained scenario guide and runnable kit under demo/prefill-decode/. The guide (README.md) opens with the latency problem (TTFT/ITL under load), explains why these techniques are powerful but operationally heavy, then walks the real adoption workflow with prefill/decode disaggregation as the worked example on one Modelplane cluster: a unified baseline, a PrefillDecode variant stood up on spare capacity, proving it disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of representative traffic, and promoting it behind one shared-label ModelService by shifting replica capacity (canary, promote, cut over, roll back are all just replicas, since the service load-balances evenly across healthy endpoints).

Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D deployments, the shared ModelService), a single run.sh (deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The disaggregation recipe matches what #175 validated live (vanilla vllm/vllm-openai:v0.19.1, which ships NIXL; kv_producer/kv_consumer; decode --port=8001; matched --block-size), with Modelplane composing the EPP, InferencePool, routing sidecar, and NIXL plumbing.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

…ints

The promotion flow relied on a single ModelService selecting a shared `model: qwen` label across both deployments, but compose-model-deployment only stamps `modelplane.ai/deployment: <name>` on each ModelEndpoint; it doesn't copy the deployment's own labels. So the shared-label selector matched nothing.

A ModelService's spec.endpoints is a list, so the correct way to front both deployments from one endpoint is one entry per deployment, each selecting its `modelplane.ai/deployment` label. The HTTPRoute then load-balances across all matched endpoints, so the unified:P/D split still follows replica count and the canary/promote/rollback flow is unchanged.

Drop the now-unused shared label.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
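The selector bug described in this commit can be reproduced with plain label matching. The sketch below is an illustration only: the endpoint and deployment names (`qwen-unified`, `qwen-pd`) are hypothetical, and the selectors are modeled as simple dictionaries rather than the actual ModelService schema. It assumes equality-based label matching, with a list of selectors matching any endpoint that satisfies at least one entry.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selectors: list[dict[str, str]], endpoints: list[dict]) -> list[str]:
    """Names of endpoints matched by at least one selector entry."""
    return [
        endpoint["name"]
        for endpoint in endpoints
        if any(matches(selector, endpoint["labels"]) for selector in selectors)
    ]


# Endpoints carry only the stamped deployment label, per the commit message.
# Deployment names here are hypothetical.
endpoints = [
    {"name": "unified-0", "labels": {"modelplane.ai/deployment": "qwen-unified"}},
    {"name": "unified-1", "labels": {"modelplane.ai/deployment": "qwen-unified"}},
    {"name": "pd-0", "labels": {"modelplane.ai/deployment": "qwen-pd"}},
]

# Before: one selector on a shared label that the endpoints never receive.
shared_label = [{"model": "qwen"}]

# After: one selector entry per deployment, keyed on the stamped label.
per_deployment = [
    {"modelplane.ai/deployment": "qwen-unified"},
    {"modelplane.ai/deployment": "qwen-pd"},
]

print(select(shared_label, endpoints))
print(select(per_deployment, endpoints))
```

In this model the shared-label selector returns an empty list while the per-deployment list returns all three endpoints, which is why the replica-count split is preserved after the fix.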

dennis-upbound changed the title ~~Add a scenario guide for deploying advanced serving techniques~~ WIP: Add a scenario guide for deploying advanced serving techniques Jun 17, 2026

dennis-upbound force-pushed the dennis/prefill-decode-demo branch 2 times, most recently from d6e48f8 to d3ce655 Compare June 18, 2026 14:35

dennis-upbound force-pushed the dennis/prefill-decode-demo branch from d3ce655 to b37cea1 Compare June 18, 2026 14:42

dennis-upbound closed this Jun 19, 2026

dennis-upbound deleted the dennis/prefill-decode-demo branch June 19, 2026 17:26

dennis-upbound wants to merge 2 commits into main from dennis/prefill-decode-demo
