Prefill/decode disaggregation design by dennis-upbound · Pull Request #116 · modelplaneai/modelplane

dennis-upbound · 2026-06-09T19:16:29Z

Towards #34.

Issue #34 asks for prefill/decode disaggregation, but its original sketch predates the KServe drop and describes it through KServe's LLMInferenceService.prefill section. There's no current design or user-facing documentation for how disaggregation fits the backend-dispatcher architecture.

This adds two docs, no code:

- design/disaggregation.md specifies disaggregation on the current architecture: a self-contained prefill block on the deployment, KV transfer via the engine's --kv-transfer-config and NIXL, routing through the same swappable EPP on a GAIE InferencePool that unified serving uses, the correctness constraints to enforce as matching matures, and when disaggregation is worth it. It parallels design/modelcache.md, with an Alternatives considered section covering the KServe sketch, a bespoke proxy, a routing discriminator, Modelplane choosing per-role hardware, and an implicit prefill:decode ratio.
- docs/concepts.md gains a Disaggregated Serving section covering the prefill block, the compute-bound vs bandwidth-bound split, KV transfer, and the ModelCache and co-location requirements. It also updates the ModelCache source bullet to the shipped source enum shape.
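
For orientation, the KV-transfer flag mentioned above could appear on an engine container roughly as follows. This is only a sketch and it assumes the engine is vLLM, whose `--kv-transfer-config` flag takes a JSON object; the connector name and role value are vLLM's and are not stated anywhere in this PR.

```yaml
# Sketch only: container args for an engine pod, assuming vLLM.
# The PR names the flag and NIXL but not the JSON payload shown here.
args:
- --kv-transfer-config
- '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```

In this sketch both roles would carry the same flag, with the routing layer rather than the engine deciding which pod prefills and which decodes.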

It also carries one unrelated CI fix (Skip the checklist requirement on draft PRs), bundled here because it's what surfaced the failure on this draft. The checklist-completed job ran require-checklist-action on every pull_request event, so a WIP draft that hadn't filled out the template checklist failed the check. The commit gates the job on the PR not being a draft and adds the ready_for_review trigger, so it's skipped while a PR is a draft and runs once it's marked ready. It's a self-contained commit and can be split into its own PR if you'd rather not bundle it.

I have:

- Read and followed Modelplane's contribution process.
- ~~Run nix flake check (or ./nix.sh flake check) and made sure it passes.~~ Docs and CI-config change; the sandboxed check job passes in CI on this PR.
- ~~Added or updated tests covering any composition function changes.~~ No composition function changes.
- Signed off every commit with git commit -s.

Issue #34's original sketch predates the KServe drop and describes disaggregation through KServe's LLMInferenceService. This documents it on the current architecture: a self-contained prefill block on the deployment, KV transfer via the engine's kv-transfer-config and NIXL, routing through the same swappable EPP on a GAIE InferencePool that unified serving uses, the correctness constraints to enforce as matching matures, and when disaggregation is worth it. It parallels design/modelcache.md and the decisions discussed on the issue. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The concepts guide covers unified and multi-node serving but not prefill/decode disaggregation, which the design doc now specifies. This adds a Disaggregated Serving section describing the prefill block, the compute-bound vs bandwidth-bound split, KV transfer over NIXL, the ModelCache and co-location requirements, and when disaggregation is worth it. It ties into the existing Multi-node Inference and ModelCache sections. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The ModelCache section still described spec.source as a discriminated union keyed on which field is set, with lowercase future types under "their own discriminator". The shipped API is a required source enum naming the kind (source: HuggingFace) with the matching object set alongside it, validated by CEL. This updates the bullet to that shape so the guide matches the resource users actually write. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
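
A rough illustration of the enum-plus-object shape described in this commit. Only `source: HuggingFace` comes from the PR; the `huggingFace` object name, its `repo` field, and the CEL rule text are hypothetical placeholders, not the shipped schema.

```yaml
# Hypothetical ModelCache fragment; only `source: HuggingFace` is from the PR.
spec:
  source: HuggingFace                  # required enum naming the kind
  huggingFace:                         # assumed name for the matching object
    repo: example-org/example-model    # assumed field, illustrative value
---
# Hypothetical CRD schema fragment showing how CEL could enforce the pairing.
x-kubernetes-validations:
- rule: "self.source != 'HuggingFace' || has(self.huggingFace)"
  message: "huggingFace must be set when source is HuggingFace"
```

The point of the shape is that the enum states the kind explicitly, so validation checks one named field instead of inferring the kind from whichever object happens to be set.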

The checklist-completed job runs require-checklist-action on every pull_request event, including while a PR is still a draft. A WIP draft that hasn't filled out the template checklist yet fails the check, which is noise: the checklist is something you complete before asking for review, not while iterating. This gates the job on the PR not being a draft and adds the ready_for_review trigger, so the check is skipped while the PR is a draft and runs when it's marked ready for review. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
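
The gating described in this commit could look roughly like the workflow fragment below. It is a sketch, not the file from this PR: the action owner and version (`mheap/require-checklist-action@v2`), the runner, and the exact list of event types are assumptions.

```yaml
# Sketch of the draft gating; not copied from the repository's workflow.
on:
  pull_request:
    # ready_for_review is not a default type, so it must be listed explicitly.
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  checklist-completed:
    # Skip the job entirely while the PR is still a draft.
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
    - uses: mheap/require-checklist-action@v2   # assumed owner and version
```

Without the `ready_for_review` type, marking a draft ready would not trigger a run, and the check would stay skipped until the next push.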

The example predated #101 merging, where nodeSelector became required: show it on both the decode and prefill roles, each with its own device selectors (illustrating distinct per-role hardware), and model the InfiniBand fabric as a Synthetic device. Fold the operator "when to use" guidance into the summary as background. Give the routing-discriminator alternative an API sketch so the discriminator-vs-template tradeoff is concrete, and add the "two ModelDeployments" alternative with why a single MD is better (co-location and that a prefill-only MD isn't conceptually a model deployment). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>

A feasibility spike pinned down how the routing actually works, so the Routing section now describes it concretely rather than as an aspiration: one InferencePool fronts both roles and the EPP partitions them by an llm-d.ai/role label, picking a decode pod then a prefill pod and handing the decode pod the prefill address (x-prefiller-host-port) for a routing sidecar to forward the prompt over NIXL — correcting the earlier decode-only-pool sketch. It also records the gateway decision: InferencePool needs a GAIE-conformant gateway, which core Envoy Gateway is not, so serving clusters run Envoy AI Gateway (layered on the same data plane, leaving plain routes untouched); unified serving stays on its plain Service route for now. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The routing section named InferencePool without saying what it is: add a one-line definition and link to the GAIE docs. Frame Envoy AI Gateway as a deliberate choice among the conformant options (Envoy AI Gateway, Istio, kgateway), picked because it layers on the Envoy Gateway data plane ServingStack already runs rather than replacing the gateway. Note that Modelplane stamps the llm-d.ai/role label the EPP filters on, alongside its own internal modelplane.ai/pd-role label. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
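
As an illustration of the labelling described in this commit, the pods of the two roles might be stamped as below. The label keys come from the commit message; the label values are assumptions.

```yaml
# Sketch only: pod metadata for the two roles. Keys are from the PR;
# the values "decode" and "prefill" are assumed.
metadata:
  labels:
    llm-d.ai/role: decode            # label the EPP partitions on
    modelplane.ai/pd-role: decode    # Modelplane's internal label
---
metadata:
  labels:
    llm-d.ai/role: prefill
    modelplane.ai/pd-role: prefill
```

Because one InferencePool fronts both roles, the pool's selector would match a label shared by both sets of pods, leaving the role label for the EPP to filter on.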

workers.count defaults to 1, so an omitted prefill/decode count runs 1:1 rather than being required — say so instead of claiming the ratio is always explicit, and reframe the rejected alternative as "requiring both counts explicit" (which we don't, since it would make count mandatory for unified too). Drop the example's decode nic count from 8 to 1: it selects the InfiniBand fabric, not a per-GPU device. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>

vercel · 2026-06-10T23:03:38Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
modelplane-docs	Error		Jun 10, 2026 11:08pm

The earlier justification — Envoy AI Gateway is lowest-friction because it layers on what ServingStack already runs — is a non-argument for a greenfield product: there's no installed base to preserve, and it wrongly implied the gateway is fixed. Reframe it as the default Modelplane installs (chosen on merits: Envoy-based, LLM-purpose-built, InferencePool-conformant) while making the gateway a swappable seam like the EPP, so customers can plug in their own GAIE-conformant gateway — notably on BYO clusters that already run one. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

negz · 2026-06-10T23:22:04Z

+  routing:
+    template:
+      spec:
+        containers:
+        - name: epp
+          image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0   # default
+          args: ["--config-file=/config/epp.yaml"]                # override to tune scorers


So for prefill decode we have the usual top level stuff, and prefill, and routing?

I think this sketch belongs in the main body of the design, not in an alternative, and could use a little more detail. It'd be useful to show it in context of a full deployment (that'd answer my question above).

Some other thoughts:

I'm a little wary of defaulting it if omitted. It's convenient but asymmetrical with the other blocks, where we don't e.g. guess an engine config if you omit it.

Where does the EPP pod run? (Does it need its own node selector etc etc?)

dennis-upbound · Jun 10, 2026

Good points, all addressed:

Shape: yes — top-level workers (decode), prefill, and routing. Moved the sketch into the main body: the example up top is now a full deployment showing all three, and the Routing section describes routing.template directly. Trimmed the alternative to just the discriminator-vs-template rationale.

Defaulting: agreed, dropped it. routing is now required when prefill is set (no guessed EPP), and both workers.count and prefill.workers.count are explicit (no default ratio) — symmetrical with how the engine is specified, per your point.

Where the EPP runs: added — it’s a lightweight CPU Deployment (watches the pods, scores requests; no model, no GPU), on ordinary nodes in the serving namespace, no special nodeSelector, one per InferencePool. Its pod shape comes from routing.template.

Move the routing/EPP API out of the alternatives and into the main body: the example now shows a full deployment with top-level workers, prefill, and routing, and the Routing section describes routing.template directly. Per review, nothing in a disaggregated deployment is guessed: routing is required (no default EPP) and both workers.count and prefill.workers.count are explicit (no default ratio), symmetrical with how the engine is specified. Note where the EPP runs — a lightweight CPU Deployment on ordinary nodes, no GPU nodeSelector, one per InferencePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
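
The three-part shape described in this commit can be sketched as the spec fragment below. The `routing.template` block is taken from the diff quoted in the review thread; the placement of `workers` and `prefill` under `spec`, the counts, and the omission of the required per-role `nodeSelector` are assumptions made to keep the sketch short.

```yaml
# Sketch of a disaggregated ModelDeployment spec; nesting and counts are assumed.
spec:
  workers:                 # top-level workers are the decode role
    count: 2               # explicit, illustrative value
    # nodeSelector: ...    # required per role; shape not shown in this PR page
  prefill:
    workers:
      count: 1             # explicit, illustrative value
      # nodeSelector: ...  # required per role; shape not shown in this PR page
  routing:                 # required when prefill is set
    template:
      spec:
        containers:
        - name: epp
          image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0
          args: ["--config-file=/config/epp.yaml"]   # override to tune scorers
```

With these illustrative counts the deployment would run two decode workers per prefill worker, and nothing about the ratio or the EPP is left for Modelplane to guess.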

dennis-upbound mentioned this pull request Jun 9, 2026

Support prefill/decode disaggregation #34

Closed

dennis-upbound force-pushed the dennis/disagg-design branch from 1727021 to d7ef8b7 Compare June 9, 2026 20:43

dennis-upbound marked this pull request as draft June 9, 2026 20:45

dennis-upbound changed the title ~~Document prefill/decode disaggregation~~ WIP: Document prefill/decode disaggregation Jun 9, 2026

dennis-upbound changed the title ~~WIP: Document prefill/decode disaggregation~~ Prefill/decode disaggregation design Jun 9, 2026

dennis-upbound marked this pull request as ready for review June 9, 2026 22:36

dennis-upbound added 4 commits June 9, 2026 15:43

dennis-upbound force-pushed the dennis/disagg-design branch from 2514709 to 2e87f15 Compare June 9, 2026 22:44

dennis-upbound requested review from negz and tr0njavolta June 9, 2026 22:44

negz approved these changes Jun 10, 2026

Comment thread design/disaggregation.md

Comment thread design/disaggregation.md Outdated

Comment thread design/disaggregation.md Outdated

Comment thread design/disaggregation.md Outdated

Comment thread design/disaggregation.md Outdated

dennis-upbound and others added 4 commits June 10, 2026 15:55

vercel Bot had a problem deploying to Preview June 10, 2026 23:03 Failure

vercel Bot had a problem deploying to Preview June 10, 2026 23:08 Failure

negz reviewed Jun 10, 2026

dennis-upbound mentioned this pull request Jun 11, 2026

Disaggregate prefill and decode: prefill block, role-aware scheduler, and InferencePool routing #124

Closed

4 tasks

dennis-upbound merged commit 4e71ed9 into main Jun 11, 2026
3 checks passed

negz deleted the dennis/disagg-design branch June 16, 2026 16:57

dennis-upbound merged 10 commits into main from dennis/disagg-design
