Prefill/decode disaggregation design #116
Merged
Issue #34's original sketch predates the KServe drop and describes disaggregation through KServe's LLMInferenceService. This documents it on the current architecture: a self-contained prefill block on the deployment, KV transfer via the engine's kv-transfer-config and NIXL, routing through the same swappable EPP on a GAIE InferencePool that unified serving uses, the correctness constraints to enforce as matching matures, and when disaggregation is worth it. It parallels design/modelcache.md and the decisions discussed on the issue. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The concepts guide covers unified and multi-node serving but not prefill/decode disaggregation, which the design doc now specifies. This adds a Disaggregated Serving section describing the prefill block, the compute-bound vs bandwidth-bound split, KV transfer over NIXL, the ModelCache and co-location requirements, and when disaggregation is worth it. It ties into the existing Multi-node Inference and ModelCache sections. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The ModelCache section still described spec.source as a discriminated union keyed on which field is set, with lowercase future types under "their own discriminator". The shipped API is a required source enum naming the kind (source: HuggingFace) with the matching object set alongside it, validated by CEL. This updates the bullet to that shape so the guide matches the resource users actually write. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
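As a rough illustration of the enum-plus-matching-object shape described above, here is a minimal sketch. Only `source: HuggingFace` comes from this PR; the `huggingFace` field name, its contents, and the CEL rule are assumptions made for illustration, not the shipped schema.

```yaml
# Instance fragment: the enum names the kind, and the matching object sits alongside it.
source: HuggingFace            # required enum (from the commit message)
huggingFace:                   # assumed name for the matching object
  repo: example-org/example-model   # hypothetical field
---
# Hypothetical CRD schema fragment showing how a CEL rule could tie the two together.
type: object
required: [source]
properties:
  source:
    type: string
    enum: [HuggingFace]
  huggingFace:
    type: object
x-kubernetes-validations:
  - rule: "self.source != 'HuggingFace' || has(self.huggingFace)"
    message: "huggingFace must be set when source is HuggingFace"
```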
The checklist-completed job runs require-checklist-action on every pull_request event, including while a PR is still a draft. A WIP draft that hasn't filled out the template checklist yet fails the check, which is noise: the checklist is something you complete before asking for review, not while iterating. This gates the job on the PR not being a draft and adds the ready_for_review trigger, so the check is skipped while the PR is a draft and runs when it's marked ready for review. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
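The gating described above might look roughly like the following workflow sketch. The workflow name, the action owner and version (`mheap/require-checklist-action@v2`), and the exact list of event types are assumptions; only the job name, the draft condition, and the `ready_for_review` trigger come from the commit message.

```yaml
# Hypothetical workflow sketch; names and versions are assumptions.
name: checklist
on:
  pull_request:
    # Listing types replaces the defaults, so the usual ones are restated
    # alongside ready_for_review.
    types: [opened, edited, synchronize, reopened, ready_for_review]
jobs:
  checklist-completed:
    # Skipped while the PR is a draft; runs once it is marked ready for review.
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
      - uses: mheap/require-checklist-action@v2   # assumed action reference
        with:
          requireChecklist: true
```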
negz
approved these changes
Jun 10, 2026
The example predated #101 merging, where nodeSelector became required: show it on both the decode and prefill roles, each with its own device selectors (illustrating distinct per-role hardware), and model the InfiniBand fabric as a Synthetic device. Fold the operator "when to use" guidance into the summary as background. Give the routing-discriminator alternative an API sketch so the discriminator-vs-template tradeoff is concrete, and add the "two ModelDeployments" alternative with why a single MD is better (co-location and that a prefill-only MD isn't conceptually a model deployment). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
A feasibility spike pinned down how the routing actually works, so the Routing section now describes it concretely rather than as an aspiration: one InferencePool fronts both roles and the EPP partitions them by an llm-d.ai/role label, picking a decode pod then a prefill pod and handing the decode pod the prefill address (x-prefiller-host-port) for a routing sidecar to forward the prompt over NIXL — correcting the earlier decode-only-pool sketch. It also records the gateway decision: InferencePool needs a GAIE-conformant gateway, which core Envoy Gateway is not, so serving clusters run Envoy AI Gateway (layered on the same data plane, leaving plain routes untouched); unified serving stays on its plain Service route for now. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The routing section named InferencePool without saying what it is: add a one-line definition and link to the GAIE docs. Frame Envoy AI Gateway as a deliberate choice among the conformant options (Envoy AI Gateway, Istio, kgateway), picked because it layers on the Envoy Gateway data plane ServingStack already runs rather than replacing the gateway. Note that Modelplane stamps the llm-d.ai/role label the EPP filters on, alongside its own internal modelplane.ai/pd-role label. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
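To make the label-based partitioning concrete, a sketch of the pod metadata for the two roles is shown below. The two label keys are the ones named in the commit messages; the label values and the shared `app` label are assumptions for illustration.

```yaml
# Decode pod metadata (fragment). Values and the shared "app" label are assumed.
metadata:
  labels:
    app: example-model             # hypothetical label the single InferencePool selects on
    llm-d.ai/role: decode          # label the EPP filters on
    modelplane.ai/pd-role: decode  # Modelplane's own internal label
---
# Prefill pod metadata (fragment): same pool, different role.
metadata:
  labels:
    app: example-model
    llm-d.ai/role: prefill
    modelplane.ai/pd-role: prefill
```

Because both roles carry the same pool-selecting label, one pool can front them while the EPP tells them apart by role.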
workers.count defaults to 1, so an omitted prefill/decode count runs 1:1 rather than being required — say so instead of claiming the ratio is always explicit, and reframe the rejected alternative as "requiring both counts explicit" (which we don't, since it would make count mandatory for unified too). Drop the example's decode nic count from 8 to 1: it selects the InfiniBand fabric, not a per-GPU device. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The earlier justification — Envoy AI Gateway is lowest-friction because it layers on what ServingStack already runs — is a non-argument for a greenfield product: there's no installed base to preserve, and it wrongly implied the gateway is fixed. Reframe it as the default Modelplane installs (chosen on merits: Envoy-based, LLM-purpose-built, InferencePool-conformant) while making the gateway a swappable seam like the EPP, so customers can plug in their own GAIE-conformant gateway — notably on BYO clusters that already run one. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
negz
reviewed
Jun 10, 2026
Comment on lines +252 to +258

    routing:
      template:
        spec:
          containers:
            - name: epp
              image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0 # default
              args: ["--config-file=/config/epp.yaml"] # override to tune scorers
Collaborator
So for prefill decode we have the usual top level stuff, and prefill, and routing?
I think this sketch belongs in the main body of the design, not in an alternative, and could use a little more detail. It'd be useful to show it in context of a full deployment (that'd answer my question above).
Some other thoughts:
- I'm a little wary of defaulting it if omitted. It's convenient but asymmetrical with the other blocks, where we don't e.g. guess an engine config if you omit it.
- Where does the EPP pod run? (Does it need its own node selector etc etc?)
Collaborator
Author
Good points, all addressed:
- Shape: yes, top-level `workers` (decode), `prefill`, and `routing`. Moved the sketch into the main body: the example up top is now a full deployment showing all three, and the Routing section describes `routing.template` directly. Trimmed the alternative to just the discriminator-vs-template rationale.
- Defaulting: agreed, dropped it. `routing` is now required when `prefill` is set (no guessed EPP), and both `workers.count` and `prefill.workers.count` are explicit (no default ratio), symmetrical with how the engine is specified, per your point.
- Where the EPP runs: added. It's a lightweight CPU Deployment (watches the pods, scores requests; no model, no GPU), on ordinary nodes in the serving namespace, with no special `nodeSelector`, one per `InferencePool`. Its pod shape comes from `routing.template`.
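For readers following the thread, the three-block shape described in this reply could be sketched as below. This is only an outline assembled from the field names mentioned in the conversation; the counts are illustrative, and everything elided (engine config, `nodeSelector`, device selectors) is left out rather than guessed.

```yaml
# Shape sketch only; counts are illustrative and other required fields are elided.
spec:
  workers:              # decode role
    count: 4            # explicit, no default ratio
  prefill:
    workers:
      count: 2          # explicit as well
  routing:              # required whenever prefill is set
    template:
      spec:
        containers:
          - name: epp
            image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0
            args: ["--config-file=/config/epp.yaml"]
```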
Move the routing/EPP API out of the alternatives and into the main body: the example now shows a full deployment with top-level workers, prefill, and routing, and the Routing section describes routing.template directly. Per review, nothing in a disaggregated deployment is guessed: routing is required (no default EPP) and both workers.count and prefill.workers.count are explicit (no default ratio), symmetrical with how the engine is specified. Note where the EPP runs — a lightweight CPU Deployment on ordinary nodes, no GPU nodeSelector, one per InferencePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Towards #34.
Issue #34 asks for prefill/decode disaggregation, but its original sketch predates the KServe drop and describes it through KServe's `LLMInferenceService.prefill` section. There's no current design or user-facing documentation for how disaggregation fits the backend-dispatcher architecture.
This adds two docs, no code:
- `design/disaggregation.md` specifies disaggregation on the current architecture: a self-contained `prefill` block on the deployment, KV transfer via the engine's `--kv-transfer-config` and NIXL, routing through the same swappable EPP on a GAIE `InferencePool` that unified serving uses, the correctness constraints to enforce as matching matures, and when disaggregation is worth it. It parallels `design/modelcache.md`, with an Alternatives considered section covering the KServe sketch, a bespoke proxy, a routing discriminator, Modelplane choosing per-role hardware, and an implicit prefill:decode ratio.
- `docs/concepts.md` gains a Disaggregated Serving section covering the prefill block, the compute-bound vs bandwidth-bound split, KV transfer, and the ModelCache and co-location requirements. It also updates the ModelCache source bullet to the shipped `source` enum shape.
It also carries one unrelated CI fix (Skip the checklist requirement on draft PRs), bundled here because it's what surfaced the failure on this draft. The `checklist-completed` job ran `require-checklist-action` on every `pull_request` event, so a WIP draft that hadn't filled out the template checklist failed the check. The commit gates the job on the PR not being a draft and adds the `ready_for_review` trigger, so it's skipped while a PR is a draft and runs once it's marked ready. It's a self-contained commit and can be split into its own PR if you'd rather not bundle it.
I have:
- Run `nix flake check` (or `./nix.sh flake check`) and made sure it passes. Docs and CI-config change; the sandboxed `check` job passes in CI on this PR.
- Added or updated tests covering any composition function changes. No composition function changes.
- `git commit -s`.