Propose and implement unopinionated ModelDeployment engines #137

Merged
negz merged 2 commits into main from topological on Jun 14, 2026
@negz negz commented Jun 11, 2026

Collaborator

Description of your changes

Fixes #52.

Today spec.workers.topology describes a deployment with parallelism axes (tensor, pipeline, data, dataLocal). Each axis derives and injects engine-specific flags - breaking the pass-through property Modelplane relies on everywhere else, and creating two sources of truth (the user writes the same flags in args).

I propose we describe a deployment by its shape instead. spec.engines is an array of inference engines, each a Standalone member or a Leader and one or more Worker members. A member carries its own nodeSelector and engine template; an engine may be stamped out a fixed number of times with copies.

A small model on a single GPU shows the API at its simplest:

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/tier: production
  engines:
  - name: qwen3-8b
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("16Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.11.0
            args:
            - --model=Qwen/Qwen2.5-7B-Instruct

This PR ports today's functionality onto the new shape. The scheduler co-schedules a replica's engines onto one cluster, each on a pool that satisfies its members' selectors. compose-model-replica composes a workload per engine - a Deployment for a Standalone, a LeaderWorkerSet for a gang - fronted by one Service and HTTPRoute. Modelplane injects no engine flags; the only injection is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address.
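
The per-engine choice of workload described above can be sketched in a few lines. This is only an illustration, not the code in compose-model-replica: the function name `workload_kind` and the plain-dict input are assumptions, while the role names and workload kinds come from the description.

```python
from typing import Dict, List


def workload_kind(engine: Dict) -> str:
    """Return the workload kind to compose for one spec.engines entry."""
    roles: List[str] = [member["role"] for member in engine["members"]]
    if roles == ["Standalone"]:
        return "Deployment"
    leaders = roles.count("Leader")
    workers = roles.count("Worker")
    # A gang is exactly one Leader plus one or more Workers, nothing else.
    if leaders == 1 and workers >= 1 and leaders + workers == len(roles):
        return "LeaderWorkerSet"
    raise ValueError(f"engine {engine.get('name')!r} has unsupported roles {roles}")
```

An engine with any other mix of roles is rejected rather than guessed at, which matches the "Standalone, or Leader plus Workers" shape the API allows.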

Prefill/decode disaggregation (spec.serving and PrefillDecode) is designed here but not implemented.

Both paths were validated end to end on EKS L4 GPUs: a Standalone engine on one GPU, and a Leader/Worker gang serving Qwen2.5-7B pipeline-parallel across two single-GPU nodes, with completions flowing through the control-plane gateway. The multi-node example is the validated manifest. The run also filed #139, #140, and #141 for pre-existing issues it surfaced.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

@dennis-upbound dennis-upbound left a comment

Collaborator

i like it!

Comment thread design/unopinionated-deployments.md Outdated
Comment thread examples/deployment/generic.yaml Outdated
@negz negz force-pushed the topological branch 6 times, most recently from c103b14 to 10d591d Compare June 12, 2026 06:40
@negz negz changed the title Add early draft worker topology design Propose unopinionated ModelDeployments Jun 12, 2026
@negz negz marked this pull request as ready for review June 12, 2026 06:49
Copilot AI review requested due to automatic review settings June 12, 2026 06:49

Copilot AI left a comment

Pull request overview

Adds a draft design proposal to make ModelDeployment less engine-opinionated by describing deployment shape (worker groups) and serving (unified vs disaggregated), plus an accompanying “generic” example manifest. Also updates the existing design doc to mark the current ModelDeployment section as superseded by the new proposal.

Changes:

  • Added a new design document describing the proposed unopinionated ModelDeployment worker-group + serving model.
  • Added a new example manifest for a disaggregated multi-node deployment shape.
  • Annotated design/design.md to point readers to the new proposal and note which parts are superseded.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| examples/deployment/generic.yaml | New example manifest demonstrating the proposed schema and disaggregated multi-node shape. |
| design/unopinionated-deployments.md | New draft design doc describing the proposed unopinionated ModelDeployment API shape and scheduling/serving model. |
| design/design.md | Marks the existing ModelDeployment spec description as superseded by the new proposal doc. |

Comment threads (all outdated): 5 on examples/deployment/generic.yaml, 5 on design/unopinionated-deployments.md
dennis-upbound commented Jun 12, 2026

Collaborator

Two things worth pinning while serving is still fluid, both about how far it stretches:

EPD is coming, and serving.disaggregation is 2-role-shaped. The block names exactly prefillGroupName + decodeGroupName, and the role-labeling/routing that hangs off it assumes two phases. But encode-prefill-decode (a third "encode" stage for multimodal — vision/audio encoders on their own GPUs) is already shipping: SGLang has elastic encoder disaggregation (LMSYS, Jan 2026), llm-d-inference-scheduler is extending P/D→EPD (#608), and the routing sidecar already takes an --ec-connector for the encoder→prefiller hop. workers[] is N groups so the shape holds, but the 2-named-group block would need a third role (or a phases list) to express EPD. Cheaper to leave the door open now than to version the block later.

(More #34 than #137, flagging anyway: KV transfer is shifting from direct prefill→decode to tiered KV stores — Mooncake/LMCache, now NIXL backends (Mooncake, LMCache). The connector's a user flag, but a shared store is a stateful component with no home in workers[].)

negz commented Jun 12, 2026

Collaborator Author

@dennis-upbound Good point, I did read a little about EPD while working on this. The good part is the encode stage should model cleanly under workers as a third group. Maybe we switch from Unified/Disaggregated to an enum that describes how it's disaggregated? So:

spec:
  serving:
    mode: DisaggregatedPD
    pd:
      # ...

spec:
  serving:
    mode: DisaggregatedEPD
    epd:
      # ...

Comment thread design/unopinionated-deployments.md Outdated
Comment thread design/unopinionated-deployments.md Outdated
@negz negz changed the title Propose unopinionated ModelDeployments Propose and implement unopinionated ModelDeployment workers Jun 12, 2026
@negz negz changed the title Propose and implement unopinionated ModelDeployment workers Propose and implement unopinionated ModelDeployment engines Jun 12, 2026
@negz negz force-pushed the topological branch 2 times, most recently from f6e5905 to 81a3dba Compare June 13, 2026 00:13
Comment thread docs/content/concepts.md Outdated
Comment thread examples/deployment/model-deployment-multinode.yaml Outdated
@negz negz force-pushed the topological branch 2 times, most recently from 17e0f7f to cfea134 Compare June 13, 2026 04:25
@negz negz force-pushed the topological branch 3 times, most recently from e3849a2 to c765093 Compare June 13, 2026 07:11
Comment thread apis/modeldeployments/definition.yaml Outdated
Comment thread docs/content/_index.md Outdated
Comment thread docs/content/concepts.md Outdated
Comment thread functions/compose-model-deployment/function/fn.py Outdated
negz added 2 commits June 13, 2026 00:44
This proposal began as an attempt to implement spec.workers.topology's
unimplemented data and dataLocal axes, so the API could express the
data-parallel and mixture-of-experts deployments frontier models like
Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the
topology abstraction itself.

topology's axes do two things each: they shape the workload (pods and
nodes) and they name an engine flag Modelplane derives and injects.
Everywhere else Modelplane passes the user's args through untouched, so a
new engine or flag needs no change to Modelplane. topology breaks that:
the flags it derives (--tensor-parallel-size and the rest) are
engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM,
so deriving them takes on the per-engine knowledge Modelplane was trying
to avoid. It also creates two sources of truth: the user writes the
parallelism flags in args, and topology derives them again, with nothing
reconciling the two.

design/unopinionated-deployments.md proposes describing a deployment by
its shape instead. spec.engines is an array of inference engines, each a
single Standalone member or a Leader and one or more Workers, stamped out
a fixed number of times by copies. A Worker member's worker.nodes says
how many nodes it spans - how big the engine is. spec.serving describes
how an InferenceCluster exposes those engines as an OpenAI-compatible
endpoint. Parallelism and the rest stay in the engine's own flags, which
Modelplane passes through - so the API stays unopinionated about the
engine and the parallelism topology.

This supersedes parts of the base design; design.md gains pointers to the
new doc pending it being folded in. Still a draft for discussion.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The ModelDeployment API modeled its workers as a topology block of
parallelism axes - tensor and pipeline - and derived engine flags
(--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap
from them. This coupled Modelplane to per-engine knowledge: the flags are
spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still
wrote the same flags in args, leaving two unreconciled sources of truth.
The shape also couldn't express data or expert parallelism.

This replaces topology with the shape the design in
design/unopinionated-deployments.md proposes. spec.engines is an array of
inference engines; an engine is one serving unit of either a single
Standalone member or a Leader and one or more Workers. The engine carries
one nodeSelector (moved down from the deployment level) describing what
each of its pods needs from its node: an engine's members are homogeneous
and always schedule to one node pool, so a per-member selector could only
express gangs the scheduler couldn't place. Each member carries its
engine template. Parallelism, quantization, and KV transfer now live
entirely in the members' engine commands and args, which pass through
verbatim; Modelplane injects no engine flags. The only thing it injects
is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang
leader's address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so
follower commands needn't hard-code the orchestrator's variable.
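
One way such an alias could be expressed is with Kubernetes' dependent environment variable syntax, `$(VAR)`. The sketch below is an assumption about the mechanism, not the code in this PR: the helper name `with_leader_alias` is hypothetical, and `$(VAR)` only expands variables defined earlier in the same container's env list, so it relies on LWS_LEADER_ADDRESS being present before the alias.

```python
import copy
from typing import Dict

ALIAS = "MODELPLANE_LEADER_ADDRESS"


def with_leader_alias(container: Dict, backend_var: str = "LWS_LEADER_ADDRESS") -> Dict:
    """Return a copy of a container spec with the backend-neutral alias appended."""
    out = copy.deepcopy(container)
    # Drop any stale alias so the function is idempotent.
    env = [e for e in out.get("env", []) if e.get("name") != ALIAS]
    # Assumption: the backend's variable precedes this entry at pod start.
    env.append({"name": ALIAS, "value": f"$({backend_var})"})
    out["env"] = env
    return out
```

The user's command and args are left untouched, so the pass-through property still holds.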

Sizing has three levels, with one scaling axis. spec.replicas scales the
model by adding whole ModelReplicas. An engine's copies stamps out a
fixed number of identical copies per replica - deliberately not named
replicas, so it doesn't read as a second scaling knob. A Worker member's
worker.nodes says how many nodes the member spans - how big the engine
is, not how many of it. Node cost is pods x copies, where pods is 1 for a
Standalone or 1 plus the Worker's nodes for a gang. worker nests under a
Worker-only object so CEL can reject it on other roles, and it carries no
schema default: the apiserver applies defaults before CEL validation, so
a default would inject worker onto every Standalone and Leader member and
trip the rule that forbids it, rejecting every ModelDeployment.
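
The node-cost arithmetic can be written out as a short sketch. The field paths (copies, members, role, worker.nodes) follow the text; the helper names, the default of 1 for copies, and summing over several Worker members are assumptions made for illustration.

```python
from typing import Dict, List


def engine_pods(members: List[Dict]) -> int:
    """Pods in one copy of an engine: 1 for a Standalone, 1 + worker nodes for a gang."""
    if [m["role"] for m in members] == ["Standalone"]:
        return 1
    worker_nodes = sum(m["worker"]["nodes"] for m in members if m["role"] == "Worker")
    return 1 + worker_nodes  # the Leader plus every node the Workers span


def engine_node_cost(engine: Dict) -> int:
    """Nodes one replica needs for this engine: pods x copies."""
    return engine_pods(engine["members"]) * engine.get("copies", 1)
```

For the validated multi-node case, a Leader plus a Worker with worker.nodes of 1 and a single copy costs two nodes per replica; spec.replicas then multiplies that by whole ModelReplicas.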

The scheduler co-schedules a replica's engines onto one cluster, each
engine on a pool that satisfies its nodeSelector, against a trial ledger
so two engines of a replica don't double-book a pool. The ModelReplica
spec mirrors the new shape with the scheduler's placement resolved on:
each engine carries its nodePoolName and resolved deviceRequests, shared
by all its members. compose-model-replica composes a workload per engine
- a Deployment for a Standalone, a LeaderWorkerSet for a gang - plus one
ResourceClaimTemplate per engine (a template stamps a fresh claim per
pod, so the whole gang can share one template), fronted by one Service
and HTTPRoute spanning every engine's serving pods. Only Standalone pods
and gang leaders carry the serving label the Service selects on; a gang's
worker followers don't serve, so they carry no serving label. A
multi-engine replica's Deployments select on a per-workload label to
avoid overlapping selectors.
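
The trial-ledger idea can be sketched as follows. This is a simplified first-fit illustration, not the scheduler's code: the function name, the `eligible_pools` and `nodes` keys, and counting capacity in whole nodes are all assumptions.

```python
from typing import Dict, List, Optional


def place_replica(engines: List[Dict], free_nodes: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Place every engine of one replica, or return None if any engine cannot fit."""
    trial = dict(free_nodes)  # scratch copy; the real ledger is never mutated
    placement: Dict[str, str] = {}
    for engine in engines:
        pool = next(
            (p for p in engine["eligible_pools"] if trial.get(p, 0) >= engine["nodes"]),
            None,
        )
        if pool is None:
            return None  # all-or-nothing: a replica's engines are co-scheduled
        trial[pool] -= engine["nodes"]  # later engines see this booking
        placement[engine["name"]] = pool
    return placement
```

Without the deduction on `trial`, two engines that each fit a pool on their own could both be assigned to it even when it cannot hold both.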

The replica name is reserved for the serving Service and HTTPRoute;
workloads are always named per engine. A LeaderWorkerSet's controller
creates a headless Service named after the LWS for gang pod DNS - the
address followers join - but only if no Service of that name exists. An
LWS sharing the serving Service's name leaves gang DNS unresolvable and
the gang deadlocked, with the leader waiting for followers that can never
find it.
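
A guard for this naming rule might look like the sketch below. The `<replica>-<engine>` scheme and both helper names are hypothetical; the text only says workloads are named per engine and that the bare replica name is reserved.

```python
from typing import List


def workload_name(replica: str, engine: str) -> str:
    """Hypothetical per-engine workload name; never the bare replica name."""
    return f"{replica}-{engine}"


def check_workload_names(replica: str, workloads: List[str]) -> None:
    """Reject names that would shadow the serving Service or each other."""
    if replica in workloads:
        # An LWS with this name would find the serving Service already there,
        # so its controller would not create the headless gang Service.
        raise ValueError(f"workload name {replica!r} is reserved for the serving Service")
    if len(set(workloads)) != len(workloads):
        raise ValueError("workload names must be unique within a replica")
```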

Both paths were validated end to end on EKS: a Standalone engine on one
L4 GPU, and a Leader/Worker gang serving pipeline-parallel across two
single-L4 nodes. The multi-node example is the validated manifest; it
uses vLLM's bundled multi-node-serving.sh launcher, which blocks the
leader's engine until the whole gang has joined Ray.

This ports the functionality the repo has today onto the new shape.
Prefill/decode disaggregation - spec.serving and the Disaggregated mode -
is left for a follow-up; unified serving is the only behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@dennis-upbound

Collaborator

_place_engine_split splits an engine's members across pools when no single pool fits them all, and the _place_engine comment calls that "silently degrade interconnect." For a multi-node gang I think it's worse than degrade.

A Leader+Worker engine doing TP/PP/EP talks over its pool's fabric — NVLink within a node, InfiniBand within a pool. Put the leader on one pool and the worker on another and that fabric isn't there: cross-pool tensor parallel never forms and the gang just sits NotReady. The split hands back a placement that can't run, not a slower one, with no signal saying why.

Would it be safer to fail closed? If an engine has more than one claiming member and no single pool fits it, reject with InsufficientCapacity instead of splitting. A false reject is recoverable — add capacity, or size the gang to a pool — but a gang split across fabrics hangs silently. The claimless-coordinator ride-along you describe is fine as-is (it stays on a sibling's pool, one gang one fabric); it's specifically splitting members that each claim their own GPUs across pools that I'd guard.
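
A minimal sketch of the fail-closed behavior, to make the suggestion concrete. It is not the existing _place_engine code: the member keys (`name`, `nodes`, `claims_devices`), the node-count capacity model, and treating claimless members as free riders are assumptions.

```python
from typing import Dict, List


class InsufficientCapacity(Exception):
    """No single pool can hold every device-claiming member of an engine."""


def place_engine(members: List[Dict], free_nodes: Dict[str, int]) -> Dict[str, str]:
    """Place all members on one pool, or raise instead of splitting the gang."""
    # Only members that claim devices count against a pool's capacity here.
    needed = sum(m["nodes"] for m in members if m["claims_devices"])
    for pool, free in sorted(free_nodes.items()):
        if free >= needed:
            # Claimless members ride along on the same pool: one gang, one fabric.
            return {m["name"]: pool for m in members}
    raise InsufficientCapacity(f"no single pool has {needed} free nodes")
```

With two pools of one free node each, a two-node gang is rejected here, where a split would have handed back a placement that can't form.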

@dennis-upbound

Collaborator

Looks great!

@negz negz merged commit 0f0c660 into main Jun 14, 2026
6 checks passed
dennis-upbound added a commit that referenced this pull request Jun 14, 2026
PR #137 reshapes spec.workers into named engine groups and composes each
engine's workload + a unified Service/HTTPRoute, but defers disaggregated
prefill/decode routing. This adds that routing as a post-process over #137's
per-engine workloads.

When serving.mode is Disaggregated, the two engines named by
serving.disaggregation are role-labeled (llm-d.ai/role: prefill|decode), the
decode engine's serving pod gets the pd-sidecar (engine moved to 8001, sidecar
on 8000, --secure-proxy=false so the HTTP gateway path works), and the unified
Service is replaced by an InferencePool over both engines + a hardcoded
endpoint picker + an HTTPRoute pointing at the pool. Engine flags including the
NixlConnector --kv-transfer-config stay the user's, per #137; this injects none.

The image refs (llm-d-inference-scheduler, llm-d-routing-sidecar) and the
EndpointPickerConfig apiVersion (inference.networking.x-k8s.io/v1alpha1) are the
ones validated pullable/parseable on a live GKE cluster.

Speculative: layers on #137 (unmerged). The serving.{mode,disaggregation} XRD
block + regen, the fn.py compose hook, and tests are the remaining wiring.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound added a commit that referenced this pull request Jun 14, 2026
PR #137 reshaped spec.workers into named engine groups and composes each
engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated
prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: Disaggregated and a
serving.disaggregation block naming the prefill and decode engines (CEL-checked:
exactly two engines, and the names must match engines). The scheduler copies
serving onto the ModelReplica. The replica backend, when disaggregated,
role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar
on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false
so the HTTP gateway path works), replaces the unified Service with an
InferencePool over both engines, and points the HTTPRoute at the pool fronted by
a hardcoded endpoint picker.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's
per #137; this injects none. The image references and the EndpointPickerConfig
apiVersion are the ones validated pullable and parseable on a live GKE cluster.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound added a commit that referenced this pull request Jun 14, 2026
PR #137 reshaped spec.workers into named engine groups and composes each
engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated
prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: Disaggregated and a
serving.disaggregation block naming the prefill and decode engines (CEL-checked:
exactly two engines, and the names must match engines). The scheduler copies
serving onto the ModelReplica. The replica backend, when disaggregated,
role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar
on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false
so the HTTP gateway path works), replaces the unified Service with an
InferencePool over both engines, and points the HTTPRoute at the pool fronted by
a hardcoded endpoint picker.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to
run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy
Gateway configured to delegate InferencePool resolution to the AI Gateway's
ext-proc server; #137 had stripped these. The serving stack now installs the AI
Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures
Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released
version the AI Gateway is tested against, whose chart bundles the ListenerSet
CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's
per #137; this injects none. The image references, the EndpointPickerConfig
apiVersion, and the Envoy Gateway version are the ones validated on a live GKE
cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound added a commit that referenced this pull request Jun 15, 2026
PR #137 reshaped spec.workers into named engine groups and composes each
engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated
prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: PrefillDecode and marks each
engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of
each phase, and phase set only under PrefillDecode). The scheduler copies serving
and the phases onto the ModelReplica. The replica backend, when disaggregated,
finds the prefill and decode engines by phase, role-labels their serving pods
(llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the
sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works),
replaces the unified Service with an InferencePool over both engines, and points
the HTTPRoute at the pool fronted by a hardcoded endpoint picker.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to
run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy
Gateway configured to delegate InferencePool resolution to the AI Gateway's
ext-proc server; #137 had stripped these. The serving stack now installs the AI
Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures
Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released
version the AI Gateway is tested against, whose chart bundles the ListenerSet
CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's
per #137; this injects none. The image references, the EndpointPickerConfig
apiVersion, and the Envoy Gateway version are the ones validated on a live GKE
cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound added a commit that referenced this pull request Jun 15, 2026
PR #137 reshaped spec.workers into named engine groups and composes each
engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated
prefill/decode routing. This adds it on that shape.

A ModelDeployment opts in with serving.mode: PrefillDecode and marks each
engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of
each phase, and phase set only under PrefillDecode). The scheduler copies serving
and the phases onto the ModelReplica.
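
The three CEL rules can be restated as a plain validator for readers who want to check a manifest by hand. This is a sketch in Python rather than the CEL in the XRD; the `phase` field name on each engine and the Unified default for an absent serving.mode are assumptions.

```python
from typing import Dict, List


def validate_serving(spec: Dict) -> List[str]:
    """Return human-readable violations of the PrefillDecode rules."""
    errors: List[str] = []
    mode = spec.get("serving", {}).get("mode", "Unified")
    phases = [engine.get("phase") for engine in spec.get("engines", [])]
    if mode == "PrefillDecode":
        if len(phases) != 2:
            errors.append("PrefillDecode needs exactly two engines")
        if sorted(p or "" for p in phases) != ["Decode", "Prefill"]:
            errors.append("PrefillDecode needs one Prefill and one Decode engine")
    elif any(p is not None for p in phases):
        errors.append("phase may only be set when serving.mode is PrefillDecode")
    return errors
```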

The workload backends compose engines only; a routing layer then decorates them
with the serving surface serving.mode selects. Unified adds a Service and
HTTPRoute. PrefillDecode finds the prefill and decode engines by phase,
role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode
(the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the
HTTP gateway path works), and fronts both with a GAIE InferencePool and a
hardcoded endpoint picker that sequences prefill then decode.

Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to
run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy
Gateway configured to delegate InferencePool resolution to the AI Gateway's
ext-proc server; #137 had stripped these. The serving stack now installs the AI
Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures
Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released
version the AI Gateway is tested against, whose chart bundles the ListenerSet
CRD the newer gateway needs.

Engine flags, including the NixlConnector --kv-transfer-config, stay the user's
per #137; this injects none. The image references, the EndpointPickerConfig
apiVersion, and the Envoy Gateway version are the ones validated on a live GKE
cluster, end to end through the gateway.

Towards #34.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@negz negz deleted the topological branch June 16, 2026 16:56
dennis-upbound added a commit that referenced this pull request Jun 17, 2026
The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block
size the engine uses, or prefix-cache routing silently degrades (no error, just
worse decisions). The config hardcoded blockSizeTokens: 16, which only works
because it matches vLLM's default --block-size; a user who sets --block-size 32
(engine flags are the user's, per #137) would quietly get bad routing.

Derive it best-effort from the decode engine's flags — vLLM's --block-size and
SGLang's --page-size — falling back to 16 when absent or unparseable, and render
it into the EPP config. Marked a HACK: peeking at user-owned engine args is the
pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the
serving block (#179).
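
The best-effort derivation could look roughly like this. It is a sketch under assumptions, not the committed code: the function name is hypothetical, and it handles both the `--flag value` and `--flag=value` spellings, which the commit message does not spell out.

```python
from typing import List

FLAGS = ("--block-size", "--page-size")  # vLLM and SGLang respectively


def derive_block_size(args: List[str], default: int = 16) -> int:
    """Read the KV block size from engine args, falling back to the default."""
    for i, arg in enumerate(args):
        for flag in FLAGS:
            value = None
            if arg == flag and i + 1 < len(args):
                value = args[i + 1]  # "--block-size 32"
            elif arg.startswith(flag + "="):
                value = arg[len(flag) + 1:]  # "--block-size=32"
            if value is not None:
                try:
                    size = int(value)
                except ValueError:
                    return default  # unparseable: keep the safe default
                return size if size > 0 else default
    return default
```

Because this peeks at user-owned args, it can only ever be a hint; an explicit knob on the serving block would remove the guesswork.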

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
dennis-upbound added a commit that referenced this pull request Jun 17, 2026
PrefillDecode silently fails when the engine image lacks the NIXL runtime:
vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which
the base vllm/vllm-openai image doesn't include, so disaggregated engines
crashloop with "NIXL is not available". Engine images are the user's (#137), so
Modelplane can't bundle it — but nothing told the user it was required.

Document the prerequisite where it's relevant: the _disaggregated composition
docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments
design. The fix is to use a kv-connector-enabled image — build vLLM with
INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as
lmcache/vllm-openai.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Development

Successfully merging this pull request may close these issues.

Replica topology and capability-based scheduling

3 participants