Prefill/decode disaggregation design #116

Merged
dennis-upbound merged 10 commits into main from dennis/disagg-design on Jun 11, 2026

dennis-upbound commented Jun 9, 2026

Collaborator

Towards #34.

Issue #34 asks for prefill/decode disaggregation, but its original sketch predates the KServe drop and describes it through KServe's LLMInferenceService.prefill section. There's no current design or user-facing documentation for how disaggregation fits the backend-dispatcher architecture.

This adds two docs, no code:

  • design/disaggregation.md specifies disaggregation on the current architecture: a self-contained prefill block on the deployment, KV transfer via the engine's --kv-transfer-config and NIXL, routing through the same swappable EPP on a GAIE InferencePool that unified serving uses, the correctness constraints to enforce as matching matures, and when disaggregation is worth it. It parallels design/modelcache.md, with an Alternatives considered section covering the KServe sketch, a bespoke proxy, a routing discriminator, Modelplane choosing per-role hardware, and an implicit prefill:decode ratio.
  • docs/concepts.md gains a Disaggregated Serving section covering the prefill block, the compute-bound vs bandwidth-bound split, KV transfer, and the ModelCache and co-location requirements. It also updates the ModelCache source bullet to the shipped source enum shape.
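As a rough illustration of the KV-transfer wiring mentioned in the first bullet, the fragment below shows how a serving container might pass the flag. It assumes the engine is vLLM, whose `--kv-transfer-config` option takes a JSON object and which ships a `NixlConnector`. The container name, image tag, and model are placeholders and do not come from the design doc.

```yaml
# Sketch only: assumes the engine is vLLM. Container name, image, and model are placeholders.
containers:
  - name: engine
    image: vllm/vllm-openai:latest
    args:
      - --model=example-org/example-model
      # The same connector is configured on prefill and decode pods.
      # kv_both lets a pod act as either producer or consumer of KV blocks.
      - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```

The JSON value is single-quoted so that YAML treats it as one string argument rather than a flow mapping.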

It also carries one unrelated CI fix (Skip the checklist requirement on draft PRs), bundled here because it's what surfaced the failure on this draft. The checklist-completed job ran require-checklist-action on every pull_request event, so a WIP draft that hadn't filled out the template checklist failed the check. The commit gates the job on the PR not being a draft and adds the ready_for_review trigger, so it's skipped while a PR is a draft and runs once it's marked ready. It's a self-contained commit and can be split into its own PR if you'd rather not bundle it.
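The gating described above might look roughly like the workflow below. This is a sketch, not the committed file: the workflow name, the full list of event types, the runner, and the action's owner and version (`mheap/require-checklist-action@v2`) are assumptions. Only the `checklist-completed` job name, the draft gate, and the `ready_for_review` trigger come from the description.

```yaml
# Sketch only: workflow name, event list, runner, and action version are assumptions.
name: checklist
on:
  pull_request:
    # ready_for_review fires when a draft is marked ready, so the check runs at that point.
    types: [opened, edited, synchronize, reopened, ready_for_review]
jobs:
  checklist-completed:
    # Skip the job entirely while the PR is still a draft.
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
      - uses: mheap/require-checklist-action@v2
        with:
          requireChecklist: true
```

A job skipped by its `if` condition reports as skipped rather than failed, which is what keeps draft PRs green.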

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes. Docs and CI-config change; the sandboxed check job passes in CI on this PR.
  • Added or updated tests covering any composition function changes. No composition function changes.
  • Signed off every commit with git commit -s.

dennis-upbound force-pushed the dennis/disagg-design branch from 1727021 to d7ef8b7 June 9, 2026 20:43
dennis-upbound marked this pull request as draft June 9, 2026 20:45
dennis-upbound changed the title from "Document prefill/decode disaggregation" to "WIP: Document prefill/decode disaggregation" Jun 9, 2026
dennis-upbound changed the title from "WIP: Document prefill/decode disaggregation" to "Prefill/decode disaggregation design" Jun 9, 2026
dennis-upbound marked this pull request as ready for review June 9, 2026 22:36

Issue #34's original sketch predates the KServe drop and describes
disaggregation through KServe's LLMInferenceService. This documents it on the
current architecture: a self-contained prefill block on the deployment, KV
transfer via the engine's kv-transfer-config and NIXL, routing through the same
swappable EPP on a GAIE InferencePool that unified serving uses, the
correctness constraints to enforce as matching matures, and when disaggregation
is worth it. It parallels design/modelcache.md and the decisions discussed on
the issue.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The concepts guide covers unified and multi-node serving but not prefill/decode
disaggregation, which the design doc now specifies. This adds a Disaggregated
Serving section describing the prefill block, the compute-bound vs
bandwidth-bound split, KV transfer over NIXL, the ModelCache and co-location
requirements, and when disaggregation is worth it. It ties into the existing
Multi-node Inference and ModelCache sections.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The ModelCache section still described spec.source as a discriminated union
keyed on which field is set, with lowercase future types under "their own
discriminator". The shipped API is a required source enum naming the kind
(source: HuggingFace) with the matching object set alongside it, validated by
CEL. This updates the bullet to that shape so the guide matches the resource
users actually write.
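
For readers who have not seen the resource, the enum shape described here could look something like the sketch below. Only `kind: ModelCache`, `spec.source`, and the `HuggingFace` value appear in this conversation. The `apiVersion`, the `huggingFace` object name, and its `repo` field are hypothetical stand-ins for whatever the shipped API defines.

```yaml
# Sketch only: apiVersion, the huggingFace object, and its repo field are hypothetical.
apiVersion: modelplane.example/v1alpha1
kind: ModelCache
metadata:
  name: example-cache
spec:
  source: HuggingFace          # required enum naming the source kind
  huggingFace:                 # object matching the enum value, set alongside it
    repo: example-org/example-model
# A CEL rule of roughly this shape could enforce the pairing on spec:
#   self.source != 'HuggingFace' || has(self.huggingFace)
```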

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The checklist-completed job runs require-checklist-action on every pull_request
event, including while a PR is still a draft. A WIP draft that hasn't filled out
the template checklist yet fails the check, which is noise: the checklist is
something you complete before asking for review, not while iterating.

This gates the job on the PR not being a draft and adds the ready_for_review
trigger, so the check is skipped while the PR is a draft and runs when it's
marked ready for review.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound and others added 4 commits June 10, 2026 15:55
The example predated #101 merging, where nodeSelector became required: show it
on both the decode and prefill roles, each with its own device selectors
(illustrating distinct per-role hardware), and model the InfiniBand fabric as a
Synthetic device. Fold the operator "when to use" guidance into the summary as
background. Give the routing-discriminator alternative an API sketch so the
discriminator-vs-template tradeoff is concrete, and add the "two
ModelDeployments" alternative with why a single MD is better (co-location and
that a prefill-only MD isn't conceptually a model deployment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

A feasibility spike pinned down how the routing actually works, so the Routing
section now describes it concretely rather than as an aspiration: one
InferencePool fronts both roles and the EPP partitions them by an
llm-d.ai/role label, picking a decode pod then a prefill pod and handing the
decode pod the prefill address (x-prefiller-host-port) for a routing sidecar to
forward the prompt over NIXL — correcting the earlier decode-only-pool sketch.
It also records the gateway decision: InferencePool needs a GAIE-conformant
gateway, which core Envoy Gateway is not, so serving clusters run Envoy AI
Gateway (layered on the same data plane, leaving plain routes untouched);
unified serving stays on its plain Service route for now.
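
The single pool fronting both roles might be declared along these lines. This is a sketch assuming the GAIE v1 `InferencePool` API (`inference.networking.k8s.io/v1`). The resource names, the `app` label, and both port numbers are hypothetical and are not taken from the design doc.

```yaml
# Sketch only: names, the app label, and ports are hypothetical; API version assumes GAIE v1.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: example-model
spec:
  # One selector matches both prefill and decode pods.
  # The EPP, not the pool, partitions them by the llm-d.ai/role label.
  selector:
    matchLabels:
      app: example-model
  targetPorts:
    - number: 8000
  # The endpoint picker (EPP) that scores and selects pods for each request.
  endpointPickerRef:
    name: example-model-epp
    port:
      number: 9002
```

Keeping role partitioning in the EPP rather than in two pools is what lets it pick a decode pod and a prefill pod for the same request.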

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The routing section named InferencePool without saying what it is: add a
one-line definition and link to the GAIE docs. Frame Envoy AI Gateway as a
deliberate choice among the conformant options (Envoy AI Gateway, Istio,
kgateway), picked because it layers on the Envoy Gateway data plane ServingStack
already runs rather than replacing the gateway. Note that Modelplane stamps the
llm-d.ai/role label the EPP filters on, alongside its own internal
modelplane.ai/pd-role label.
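
In pod-template terms, the stamped labels might look like the two fragments below. The `llm-d.ai/role` values follow llm-d's prefill/decode naming. The values shown for `modelplane.ai/pd-role` are a guess, since this conversation names the label key but not its values.

```yaml
# Sketch only: the modelplane.ai/pd-role values are hypothetical.
# Decode worker pod template metadata:
metadata:
  labels:
    llm-d.ai/role: decode          # the label the EPP filters on
    modelplane.ai/pd-role: decode  # Modelplane's own internal label
---
# Prefill worker pod template metadata:
metadata:
  labels:
    llm-d.ai/role: prefill
    modelplane.ai/pd-role: prefill
```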

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

workers.count defaults to 1, so an omitted prefill/decode count runs 1:1 rather
than being required — say so instead of claiming the ratio is always explicit,
and reframe the rejected alternative as "requiring both counts explicit" (which
we don't, since it would make count mandatory for unified too). Drop the
example's decode nic count from 8 to 1: it selects the InfiniBand fabric, not a
per-GPU device.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

vercel Bot commented Jun 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
modelplane-docs Error Error Jun 10, 2026 11:08pm

Request Review

The earlier justification — Envoy AI Gateway is lowest-friction because it
layers on what ServingStack already runs — is a non-argument for a greenfield
product: there's no installed base to preserve, and it wrongly implied the
gateway is fixed. Reframe it as the default Modelplane installs (chosen on
merits: Envoy-based, LLM-purpose-built, InferencePool-conformant) while making
the gateway a swappable seam like the EPP, so customers can plug in their own
GAIE-conformant gateway — notably on BYO clusters that already run one.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Comment thread design/disaggregation.md Outdated
Comment on lines +252 to +258
routing:
  template:
    spec:
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0 # default
        args: ["--config-file=/config/epp.yaml"] # override to tune scorers

Collaborator

So for prefill/decode we have the usual top-level stuff, plus prefill and routing?

I think this sketch belongs in the main body of the design, not in an alternative, and could use a little more detail. It'd be useful to show it in context of a full deployment (that'd answer my question above).

Some other thoughts:

  • I'm a little wary of defaulting it if omitted. It's convenient but asymmetrical with the other blocks, where we don't e.g. guess an engine config if you omit it.
  • Where does the EPP pod run? (Does it need its own node selector etc etc?)

Collaborator Author

Good points — all addressed:

  • Shape: yes — top-level workers (decode), prefill, and routing. Moved the sketch into the main body: the example up top is now a full deployment showing all three, and the Routing section describes routing.template directly. Trimmed the alternative to just the discriminator-vs-template rationale.
  • Defaulting: agreed, dropped it. routing is now required when prefill is set (no guessed EPP), and both workers.count and prefill.workers.count are explicit (no default ratio) — symmetrical with how the engine is specified, per your point.
  • Where the EPP runs: added — it’s a lightweight CPU Deployment (watches the pods, scores requests; no model, no GPU), on ordinary nodes in the serving namespace, no special nodeSelector, one per InferencePool. Its pod shape comes from routing.template.

Move the routing/EPP API out of the alternatives and into the main body: the
example now shows a full deployment with top-level workers, prefill, and
routing, and the Routing section describes routing.template directly. Per
review, nothing in a disaggregated deployment is guessed: routing is required
(no default EPP) and both workers.count and prefill.workers.count are explicit
(no default ratio), symmetrical with how the engine is specified. Note where the
EPP runs — a lightweight CPU Deployment on ordinary nodes, no GPU nodeSelector,
one per InferencePool.
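
Pulling the pieces named in this thread together, a full disaggregated deployment might be shaped roughly as follows. The field paths `workers.count`, `prefill.workers.count`, `nodeSelector`, and `routing.template` are named in this PR, and the EPP container is copied from the reviewed snippet. Everything else is an assumption: the `apiVersion`, the metadata, the counts, where `nodeSelector` nests, and its empty contents.

```yaml
# Sketch only: apiVersion, names, counts, and nodeSelector placement/contents are hypothetical.
apiVersion: modelplane.example/v1alpha1
kind: ModelDeployment
metadata:
  name: example-model
spec:
  workers:                  # top-level workers are the decode role
    count: 4                # explicit, no default ratio
    nodeSelector: {}        # required since #101; real selector shape not shown here
  prefill:                  # self-contained prefill block
    workers:
      count: 2              # explicit, no default ratio
      nodeSelector: {}
  routing:                  # required when prefill is set; no guessed EPP
    template:
      spec:
        containers:
          - name: epp
            image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0
            args: ["--config-file=/config/epp.yaml"]
```

The EPP pod carries no GPU nodeSelector in this sketch, matching the note that it runs as a lightweight CPU Deployment on ordinary nodes.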

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound merged commit 4e71ed9 into main Jun 11, 2026
3 checks passed
negz deleted the dennis/disagg-design branch June 16, 2026 16:57