WIP: Add a scenario guide for deploying advanced serving techniques#182
Closed
dennis-upbound wants to merge 2 commits into
d6e48f8 to d3ce655
Modelplane's pitch is that adopting an advanced inference-serving technique should be a deployment-level change, not a quarter-long project — but we have no end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't see how they'd evaluate something like prefill/decode disaggregation on their own workload and roll it out to production safely. This adds a self-contained scenario guide and runnable kit under demo/prefill-decode/. The guide (README.md) opens with the latency problem (TTFT/ITL under load), explains why these techniques are powerful but operationally heavy, then walks the real adoption workflow with prefill/decode disaggregation as the worked example on one Modelplane cluster: a unified baseline, a PrefillDecode variant stood up on spare capacity, proving it disaggregates via vLLM's NIXL KV-transfer counters, benchmarking a replay of representative traffic, and promoting it behind one shared-label ModelService by shifting replica capacity (canary, promote, cut over, roll back are all just replicas, since the service load-balances evenly across healthy endpoints). Backed by runnable artifacts: manifests/ (cluster, cache, unified + P/D deployments, the shared ModelService), a single run.sh (deploy|prove|bench|promote|rollback), and a replay-trace.jsonl. The disaggregation recipe matches what #175 validated live — vanilla vllm/vllm-openai:v0.19.1 (ships NIXL), kv_producer/kv_consumer, decode --port=8001, matched --block-size — with Modelplane composing the EPP, InferencePool, routing sidecar, and NIXL plumbing. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
d3ce655 to b37cea1
…ints The promotion flow relied on a single ModelService selecting a shared `model: qwen` label across both deployments, but compose-model-deployment only stamps `modelplane.ai/deployment: <name>` on each ModelEndpoint — it doesn't copy the deployment's own labels. So the shared-label selector matched nothing. A ModelService's spec.endpoints is a list, so the correct way to front both deployments from one endpoint is one entry per deployment, each selecting its `modelplane.ai/deployment` label. The HTTPRoute then load-balances across all matched endpoints, so the unified:P/D split still follows replica count and the canary/promote/rollback flow is unchanged. Drop the now-unused shared label. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
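The selector mismatch described in this commit can be illustrated with a small sketch. It assumes selectors behave as plain equality matches on endpoint labels, and it uses hypothetical deployment names (`unified`, `pd`); it is not Modelplane's code, only a model of why the shared-label selector matched nothing while per-deployment entries match every endpoint.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """Equality-based match: every selector pair must be present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selectors: List[Labels], endpoints: List[Labels]) -> List[Labels]:
    """Return endpoints matched by at least one selector entry."""
    return [ep for ep in endpoints if any(matches(s, ep) for s in selectors)]


# Each endpoint carries only the stamped deployment label (names are hypothetical).
endpoints = [
    {"modelplane.ai/deployment": "unified"},
    {"modelplane.ai/deployment": "unified"},
    {"modelplane.ai/deployment": "pd"},
]

# Old approach: one shared label that is never copied onto the endpoints.
shared = select([{"model": "qwen"}], endpoints)

# New approach: one selector entry per deployment.
per_deployment = select(
    [
        {"modelplane.ai/deployment": "unified"},
        {"modelplane.ai/deployment": "pd"},
    ],
    endpoints,
)

print(len(shared), len(per_deployment))
```

Because the per-deployment entries still match one endpoint per replica, the unified:P/D split keeps following replica count.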
Description of your changes
Modelplane's pitch is that adopting an advanced inference-serving technique should
be a deployment-level change, not a quarter-long project — but we have no
end-to-end, customer-facing walkthrough that shows it. A prospective ML team can't
see how they'd evaluate something like prefill/decode disaggregation on their own
workload and roll it out to production safely.
This adds a self-contained scenario guide and runnable kit under
`demo/prefill-decode/`. The guide (`README.md`, "Deploying advanced serving
techniques") opens with the latency problem (TTFT/ITL under load), explains why
these techniques are powerful but operationally heavy to ship, then walks the real
adoption workflow with prefill/decode disaggregation as the worked example, on one
Modelplane cluster:
a unified baseline `ModelDeployment` → a `PrefillDecode` variant stood up
as a second deployment on spare capacity → proving it genuinely disaggregates
via vLLM's own NIXL KV-transfer counters → benchmarking a replay of
representative traffic → promoting it behind a single shared-label
`ModelService` by shifting replica capacity. Because the service load-balances
evenly across healthy endpoints (one endpoint per replica), canary → promote →
cut over → roll back are all just `replicas`: no traffic weights, no platform
ticket.
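The arithmetic behind promotion by replica count can be sketched as follows. It assumes the service really does spread requests evenly across healthy endpoints with one endpoint per replica, as described above; the replica counts per stage are illustrative only and are not taken from the guide.

```python
from fractions import Fraction


def pd_share(unified_replicas: int, pd_replicas: int) -> Fraction:
    """Fraction of requests reaching P/D under even per-endpoint balancing."""
    total = unified_replicas + pd_replicas
    if total == 0:
        raise ValueError("no healthy endpoints to route to")
    return Fraction(pd_replicas, total)


# Illustrative (unified, P/D) replica counts for each rollout stage.
stages = {
    "baseline": (4, 0),
    "canary": (3, 1),
    "promote": (2, 2),
    "cut over": (0, 4),
    "roll back": (4, 0),
}

for name, (unified, pd) in stages.items():
    share = pd_share(unified, pd)
    print(f"{name:>9}: unified={unified} pd={pd} -> P/D share {float(share):.0%}")
```

Under this assumption the traffic split is only as fine-grained as the total replica count allows, which is worth keeping in mind when sizing a canary.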
It's backed by runnable artifacts, not just prose: `manifests/` (cluster, cache,
unified + P/D deployments, the shared `ModelService`), a single
`run.sh {deploy|prove|bench|promote|rollback}`, and a `replay-trace.jsonl`. The
disaggregation recipe matches what #175 validated live (vanilla
`vllm/vllm-openai:v0.19.1`, which ships NIXL; `kv_producer`/`kv_consumer`; decode
`--port=8001`; matched `--block-size`), with Modelplane composing the EPP /
InferencePool / routing sidecar / NIXL plumbing.
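As a rough illustration of the recipe's shape, the sketch below builds the extra vLLM arguments for the prefill and decode roles. The roles, the decode port, and the matched block size come from the description above; the `--kv-transfer-config` JSON layout and the `NixlConnector` name are assumptions about vLLM's CLI rather than something this PR states, and the block-size value is a placeholder. The actual manifests in `manifests/` are the source of truth.

```python
import json
from typing import List, Optional

VALID_ROLES = {"kv_producer", "kv_consumer"}


def kv_transfer_args(role: str, block_size: int, port: Optional[int] = None) -> List[str]:
    """Build extra vLLM CLI args for one side of a prefill/decode pair."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown kv_role: {role!r}")
    # Assumed JSON shape for vLLM's KV-transfer configuration.
    config = {"kv_connector": "NixlConnector", "kv_role": role}
    args = ["--kv-transfer-config", json.dumps(config), f"--block-size={block_size}"]
    if port is not None:
        args.append(f"--port={port}")
    return args


BLOCK_SIZE = 16  # placeholder value; what matters is that both sides use the same one

prefill_args = kv_transfer_args("kv_producer", BLOCK_SIZE)
decode_args = kv_transfer_args("kv_consumer", BLOCK_SIZE, port=8001)

print("prefill:", " ".join(prefill_args))
print("decode: ", " ".join(decode_args))
```

Deriving both argument lists from one `BLOCK_SIZE` constant is a simple way to keep the two sides matched.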
Draft / WIP: the guide's benchmark curve and offload counters are placeholders
pending a live run on a fresh cluster. I'll fill the real numbers and confirm the
GuideLLM `--data` replay shape against the live endpoint, then mark ready.
Reviewers: the judgment calls are the narrative framing (does the problem →
technique → scenario arc land for an ML-team audience?) and whether the
promotion-by-replica-count story reads as faithful for a production rollout.
Builds on #175.
I have:
- Run `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- Added or updated tests covering any composition function changes. (docs, manifests, and scripts only; no composition function changes)
- Signed off my commits with `git commit -s`.