Propose and implement unopinionated ModelDeployment engines by negz · Pull Request #137 · modelplaneai/modelplane

negz · 2026-06-11T18:38:10Z

Description of your changes

Fixes #52.

Today spec.workers.topology describes a deployment with parallelism axes (tensor, pipeline, data, dataLocal). Each axis derives and injects engine-specific flags - breaking the pass-through property Modelplane relies on everywhere else, and creating two sources of truth (the user writes the same flags in args).

I propose we describe a deployment by its shape instead. spec.engines is an array of inference engines, each a Standalone member or a Leader and one or more Worker members. A member carries its own nodeSelector and engine template; an engine may be stamped out a fixed number of times with copies.

A small model on a single GPU shows the API at its simplest:

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/tier: production
  engines:
  - name: qwen3-8b
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("16Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.11.0
            args:
            - --model=Qwen/Qwen2.5-7B-Instruct

This PR ports today's functionality onto the new shape. The scheduler co-schedules a replica's engines onto one cluster, each on a pool that satisfies its members' selectors. compose-model-replica composes a workload per engine - a Deployment for a Standalone, a LeaderWorkerSet for a gang - fronted by one Service and HTTPRoute. Modelplane injects no engine flags; the only injection is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address.
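The per-engine composition rule above can be sketched roughly as follows. This is an illustrative sketch, not the code in compose-model-replica: `Member`, `workload_kind`, and `injected_env` are hypothetical names, and rendering the alias as a `$(LWS_LEADER_ADDRESS)` reference is an assumption about how the LeaderWorkerSet backend's variable would be aliased.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for one entry of an engine's members list."""
    role: str  # "Standalone", "Leader", or "Worker"


def workload_kind(members: list[Member]) -> str:
    """Pick the workload composed for one engine from its member roles."""
    roles = [m.role for m in members]
    if roles == ["Standalone"]:
        return "Deployment"
    is_gang = (
        roles.count("Leader") == 1
        and roles.count("Worker") >= 1
        and "Standalone" not in roles
    )
    if is_gang:
        return "LeaderWorkerSet"
    raise ValueError(f"unsupported member roles: {roles}")


def injected_env(kind: str) -> dict[str, str]:
    """Return the only variable injected; engine flags are never touched.

    Assumption: the alias is only meaningful for a gang, and it simply
    references the orchestrator's own leader-address variable.
    """
    if kind == "LeaderWorkerSet":
        return {"MODELPLANE_LEADER_ADDRESS": "$(LWS_LEADER_ADDRESS)"}
    return {}
```

Keeping the decision a pure function of the member roles means the user's args never need to be inspected, which is the pass-through property the description relies on.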

Prefill/decode disaggregation (spec.serving and PrefillDecode) is designed here but not implemented.

Both paths were validated end to end on EKS L4 GPUs: a Standalone engine on one GPU, and a Leader/Worker gang serving Qwen2.5-7B pipeline-parallel across two single-GPU nodes, with completions flowing through the control-plane gateway. The multi-node example is the validated manifest. The run also filed #139, #140, and #141 for pre-existing issues it surfaced.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

dennis-upbound

i like it!

Copilot

Pull request overview

Adds a draft design proposal to make ModelDeployment less engine-opinionated by describing deployment shape (worker groups) and serving (unified vs disaggregated), plus an accompanying “generic” example manifest. It also updates the existing design doc to mark the current ModelDeployment section as superseded by the new proposal.

Changes:

Added a new design document describing the proposed unopinionated ModelDeployment worker-group + serving model.
Added a new example manifest for a disaggregated multi-node deployment shape.
Annotated design/design.md to point readers to the new proposal and note which parts are superseded.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.

File	Description
examples/deployment/generic.yaml	New example manifest demonstrating the proposed schema and disaggregated multi-node shape.
design/unopinionated-deployments.md	New draft design doc describing the proposed unopinionated `ModelDeployment` API shape and scheduling/serving model.
design/design.md	Marks the existing `ModelDeployment` spec description as superseded by the new proposal doc.

dennis-upbound · 2026-06-12T18:39:49Z

Two things worth pinning while serving is still fluid, both about how far it stretches:

EPD is coming, and serving.disaggregation is 2-role-shaped. The block names exactly prefillGroupName + decodeGroupName, and the role-labeling/routing that hangs off it assumes two phases. But encode-prefill-decode (a third "encode" stage for multimodal — vision/audio encoders on their own GPUs) is already shipping: SGLang has elastic encoder disaggregation (LMSYS, Jan 2026), llm-d-inference-scheduler is extending P/D→EPD (#608), and the routing sidecar already takes an --ec-connector for the encoder→prefiller hop. workers[] is N groups so the shape holds, but the 2-named-group block would need a third role (or a phases list) to express EPD. Cheaper to leave the door open now than to version the block later.

(More #34 than #137, flagging anyway: KV transfer is shifting from direct prefill→decode to tiered KV stores such as Mooncake and LMCache, both of which are now NIXL backends. The connector's a user flag, but a shared store is a stateful component with no home in workers[].)

negz · 2026-06-12T19:03:19Z

@dennis-upbound Good point, I did read a little about EPD while working on this. The good part is the encode stage should model cleanly under workers as a third group. Maybe we switch from Unified/Disaggregated to an enum that describes how it's disaggregated? So:

spec:
  serving:
    mode: DisaggregatedPD
    pd:
      # ...

spec:
  serving:
    mode: DisaggregatedEPD
    epd:
      # ...

This proposal began as an attempt to implement spec.workers.topology's unimplemented data and dataLocal axes, so the API could express the data-parallel and mixture-of-experts deployments frontier models like Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the topology abstraction itself.

topology's axes do two things each: they shape the workload (pods and nodes) and they name an engine flag Modelplane derives and injects. Everywhere else Modelplane passes the user's args through untouched, so a new engine or flag needs no change to Modelplane. topology breaks that: the flags it derives (--tensor-parallel-size and the rest) are engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM, so deriving them takes on the per-engine knowledge Modelplane was trying to avoid. It also creates two sources of truth: the user writes the parallelism flags in args, and topology derives them again, with nothing reconciling the two.

design/unopinionated-deployments.md proposes describing a deployment by its shape instead. spec.engines is an array of inference engines, each a single Standalone member or a Leader and one or more Workers, stamped out a fixed number of times by copies. A Worker member's worker.nodes says how many nodes it spans - how big the engine is. spec.serving describes how an InferenceCluster exposes those engines as an OpenAI-compatible endpoint. Parallelism and the rest stay in the engine's own flags, which Modelplane passes through - so the API stays unopinionated about the engine and the parallelism topology.

This supersedes parts of the base design; design.md gains pointers to the new doc pending it being folded in. Still a draft for discussion.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The ModelDeployment API modeled its workers as a topology block of parallelism axes - tensor and pipeline - and derived engine flags (--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap from them. This coupled Modelplane to per-engine knowledge: the flags are spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still wrote the same flags in args, leaving two unreconciled sources of truth. The shape also couldn't express data or expert parallelism.

This replaces topology with the shape the design in design/unopinionated-deployments.md proposes. spec.engines is an array of inference engines; an engine is one serving unit of either a single Standalone member or a Leader and one or more Workers. The engine carries one nodeSelector (moved down from the deployment level) describing what each of its pods needs from its node: an engine's members are homogeneous and always schedule to one node pool, so a per-member selector could only express gangs the scheduler couldn't place. Each member carries its engine template.

Parallelism, quantization, and KV transfer now live entirely in the members' engine commands and args, which pass through verbatim; Modelplane injects no engine flags. The only thing it injects is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so follower commands needn't hard-code the orchestrator's variable.

Sizing has three levels, with one scaling axis. spec.replicas scales the model by adding whole ModelReplicas. An engine's copies stamps out a fixed number of identical copies per replica - deliberately not named replicas, so it doesn't read as a second scaling knob. A Worker member's worker.nodes says how many nodes the member spans - how big the engine is, not how many of it. Node cost is pods x copies, where pods is 1 for a Standalone or 1 plus the Worker's nodes for a gang.

worker nests under a Worker-only object so CEL can reject it on other roles, and it carries no schema default: the apiserver applies defaults before CEL validation, so a default would inject worker onto every Standalone and Leader member and trip the rule that forbids it, rejecting every ModelDeployment.

The scheduler co-schedules a replica's engines onto one cluster, each engine on a pool that satisfies its nodeSelector, against a trial ledger so two engines of a replica don't double-book a pool. The ModelReplica spec mirrors the new shape with the scheduler's placement resolved on: each engine carries its nodePoolName and resolved deviceRequests, shared by all its members. compose-model-replica composes a workload per engine - a Deployment for a Standalone, a LeaderWorkerSet for a gang - plus one ResourceClaimTemplate per engine (a template stamps a fresh claim per pod, so the whole gang shares it), fronted by one Service and HTTPRoute spanning every engine's serving pods. Only Standalone pods and gang leaders carry the serving label the Service selects on; a gang's worker followers don't serve, so they carry no serving label and a multi-engine replica's Deployments select on a per-workload label to avoid overlapping selectors.

The replica name is reserved for the serving Service and HTTPRoute; workloads are always named per engine. A LeaderWorkerSet's controller creates a headless Service named after the LWS for gang pod DNS - the address followers join - but only if no Service of that name exists. An LWS sharing the serving Service's name leaves gang DNS unresolvable and the gang deadlocked, with the leader waiting for followers that can never find it.

Both paths were validated end to end on EKS: a Standalone engine on one L4 GPU, and a Leader/Worker gang serving pipeline-parallel across two single-L4 nodes. The multi-node example is the validated manifest; it uses vLLM's bundled multi-node-serving.sh launcher, which blocks the leader's engine until the whole gang has joined Ray.

This ports the functionality the repo has today onto the new shape. Prefill/decode disaggregation - spec.serving and the Disaggregated mode - is left for a follow-up; unified serving is the only behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>
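The sizing arithmetic in this commit message (node cost is pods x copies) can be worked through with a small sketch. The function names are hypothetical, and multiplying by spec.replicas to get a whole-deployment total is an assumption that follows from replicas adding whole ModelReplicas; it is not stated as a formula in the source.

```python
from typing import Optional


def engine_node_cost(copies: int, worker_nodes: Optional[int] = None) -> int:
    """Nodes one engine consumes within a single replica.

    worker_nodes is None for a Standalone engine. For a gang it is the
    Worker member's worker.nodes, and the leader adds one more pod.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if worker_nodes is None:
        pods = 1  # a Standalone engine is a single pod
    else:
        if worker_nodes < 1:
            raise ValueError("worker.nodes must be at least 1")
        pods = 1 + worker_nodes  # leader plus the nodes the Worker spans
    return pods * copies


def deployment_node_cost(replicas: int, per_engine_costs: list[int]) -> int:
    """Assumed total: each replica carries every engine once."""
    return replicas * sum(per_engine_costs)


# The validated gang: one leader plus one worker node, one copy.
gang = engine_node_cost(copies=1, worker_nodes=1)
standalone = engine_node_cost(copies=1)
```

For the validated multi-node example this gives two nodes for the gang, which matches the two single-L4 nodes described above.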

dennis-upbound · 2026-06-13T19:51:16Z

_place_engine_split splits an engine's members across pools when no single pool fits them all, and the _place_engine comment calls that "silently degrade interconnect." For a multi-node gang I think it's worse than degrade.

A Leader+Worker engine doing TP/PP/EP talks over its pool's fabric — NVLink within a node, InfiniBand within a pool. Put the leader on one pool and the worker on another and that fabric isn't there: cross-pool tensor parallel never forms and the gang just sits NotReady. The split hands back a placement that can't run, not a slower one, with no signal saying why.

Would it be safer to fail closed? If an engine has more than one claiming member and no single pool fits it, reject with InsufficientCapacity instead of splitting. A false reject is recoverable — add capacity, or size the gang to a pool — but a gang split across fabrics hangs silently. The claimless-coordinator ride-along you describe is fine as-is (it stays on a sibling's pool, one gang one fabric); it's specifically splitting members that each claim their own GPUs across pools that I'd guard.
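A minimal sketch of the fail-closed behavior being suggested, under simplifying assumptions: capacity is counted in whole free nodes, selectors are ignored, and `Pool` and `place_engine` are hypothetical names rather than the scheduler's real `_place_engine`. Only the InsufficientCapacity reason comes from the comment above.

```python
from dataclasses import dataclass


class InsufficientCapacity(Exception):
    """Raised when no single pool can hold the whole engine."""


@dataclass
class Pool:
    name: str
    free_nodes: int


def place_engine(claiming_members: int, pools: list[Pool]) -> str:
    """Place every claiming member of an engine on one pool, or reject.

    The pool list doubles as a trial ledger: a successful placement
    deducts capacity so a sibling engine cannot double-book it.
    """
    for pool in pools:
        if pool.free_nodes >= claiming_members:
            pool.free_nodes -= claiming_members
            return pool.name
    # Fail closed: never split members that each claim GPUs across pools.
    raise InsufficientCapacity(
        f"no single pool fits {claiming_members} claiming members"
    )
```

With two pools of one free node each, a two-member gang is rejected rather than split, and neither pool's capacity is touched.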

dennis-upbound · 2026-06-13T19:59:08Z

Looks great!

PR #137 reshapes spec.workers into named engine groups and composes each engine's workload + a unified Service/HTTPRoute, but defers disaggregated prefill/decode routing. This adds that routing as a post-process over #137's per-engine workloads. When serving.mode is Disaggregated, the two engines named by serving.disaggregation are role-labeled (llm-d.ai/role: prefill|decode), the decode engine's serving pod gets the pd-sidecar (engine moved to 8001, sidecar on 8000, --secure-proxy=false so the HTTP gateway path works), and the unified Service is replaced by an InferencePool over both engines + a hardcoded endpoint picker + an HTTPRoute pointing at the pool. Engine flags including the NixlConnector --kv-transfer-config stay the user's, per #137; this injects none. The image refs (llm-d-inference-scheduler, llm-d-routing-sidecar) and the EndpointPickerConfig apiVersion (inference.networking.x-k8s.io/v1alpha1) are the ones validated pullable/parseable on a live GKE cluster. Speculative: layers on #137 (unmerged). The serving.{mode,disaggregation} XRD block + regen, the fn.py compose hook, and tests are the remaining wiring. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: Disaggregated and a serving.disaggregation block naming the prefill and decode engines (CEL-checked: exactly two engines, and the names must match engines). The scheduler copies serving onto the ModelReplica. The replica backend, when disaggregated, role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references and the EndpointPickerConfig apiVersion are the ones validated pullable and parseable on a live GKE cluster. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: Disaggregated and a serving.disaggregation block naming the prefill and decode engines (CEL-checked: exactly two engines, and the names must match engines). The scheduler copies serving onto the ModelReplica. The replica backend, when disaggregated, role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica. The replica backend, when disaggregated, finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica. The workload backends compose engines only; a routing layer then decorates them with the serving surface serving.mode selects. Unified adds a Service and HTTPRoute. PrefillDecode finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), and fronts both with a GAIE InferencePool and a hardcoded endpoint picker that sequences prefill then decode. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
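The CEL-checked rules in this message (exactly two engines, one of each phase, and phase set only under PrefillDecode) can be restated as a plain Python check. This is a sketch of the rules as described, not the CEL in the XRD; `validate_serving` and its error strings are hypothetical.

```python
from typing import Optional


def validate_serving(mode: str, phases: list[Optional[str]]) -> list[str]:
    """Return validation errors for a serving mode and per-engine phases.

    phases holds each engine's phase ("Prefill", "Decode") or None when
    the engine sets no phase.
    """
    errors: list[str] = []
    if mode == "PrefillDecode":
        if len(phases) != 2:
            errors.append("PrefillDecode requires exactly two engines")
        if sorted(p or "" for p in phases) != ["Decode", "Prefill"]:
            errors.append("PrefillDecode requires one Prefill and one Decode engine")
    elif any(p is not None for p in phases):
        errors.append("phase may only be set when serving.mode is PrefillDecode")
    return errors
```

Returning a list of messages rather than raising on the first failure is a design choice for the sketch only; it makes it easy to see which of the three rules a given spec violates.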

The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block size the engine uses, or prefix-cache routing silently degrades (no error, just worse decisions). The config hardcoded blockSizeTokens: 16, which only works because it matches vLLM's default --block-size; a user who sets --block-size 32 (engine flags are the user's, per #137) would quietly get bad routing. Derive it best-effort from the decode engine's flags — vLLM's --block-size and SGLang's --page-size — falling back to 16 when absent or unparseable, and render it into the EPP config. Marked a HACK: peeking at user-owned engine args is the pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the serving block (#179). Signed-off-by: Dennis Ramdass <dennis@upbound.io>
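The best-effort derivation described here might look something like the sketch below. It assumes the flags arrive as a flat args list in either `--flag value` or `--flag=value` form; `derive_block_size` is a hypothetical name, not necessarily the function in the composition code.

```python
from typing import Sequence

DEFAULT_BLOCK_SIZE = 16  # fallback from the message; matches vLLM's default
_BLOCK_FLAGS = ("--block-size", "--page-size")  # vLLM and SGLang spellings


def derive_block_size(args: Sequence[str]) -> int:
    """Peek at user-owned engine args for the KV block size.

    Falls back to DEFAULT_BLOCK_SIZE when the flag is absent or its
    value cannot be parsed as a positive integer.
    """
    for i, arg in enumerate(args):
        for flag in _BLOCK_FLAGS:
            value = None
            if arg == flag and i + 1 < len(args):
                value = args[i + 1]  # "--block-size 32"
            elif arg.startswith(flag + "="):
                value = arg[len(flag) + 1:]  # "--block-size=32"
            if value is None:
                continue
            try:
                size = int(value)
            except ValueError:
                return DEFAULT_BLOCK_SIZE
            return size if size > 0 else DEFAULT_BLOCK_SIZE
    return DEFAULT_BLOCK_SIZE
```

The derived value would then be rendered into the EPP config as blockSizeTokens; a typed knob on the serving block (#179) would replace this peek entirely.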

PrefillDecode silently fails when the engine image lacks the NIXL runtime: vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which the base vllm/vllm-openai image doesn't include, so disaggregated engines crashloop with "NIXL is not available". Engine images are the user's (#137), so Modelplane can't bundle it — but nothing told the user it was required. Document the prerequisite where it's relevant: the _disaggregated composition docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments design. The fix is to use a kv-connector-enabled image — build vLLM with INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as lmcache/vllm-openai. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

negz force-pushed the topological branch from d385f7c to 14bfebb Compare June 12, 2026 01:47

dennis-upbound approved these changes Jun 12, 2026

Comment thread design/unopinionated-deployments.md Outdated

dennis-upbound reviewed Jun 12, 2026

Comment thread examples/deployment/generic.yaml Outdated

negz force-pushed the topological branch 6 times, most recently from c103b14 to 10d591d Compare June 12, 2026 06:40

negz changed the title ~~Add early draft worker topology design~~ Propose unopinionated ModelDeployments Jun 12, 2026

negz marked this pull request as ready for review June 12, 2026 06:49

Copilot AI review requested due to automatic review settings June 12, 2026 06:49

Copilot started reviewing on behalf of negz June 12, 2026 06:50 View session

Copilot AI reviewed Jun 12, 2026

dennis-upbound approved these changes Jun 12, 2026

negz mentioned this pull request Jun 12, 2026

Implement unopinionated ModelDeployment workers #138

Closed

4 tasks

negz force-pushed the topological branch from 10d591d to f79b0df Compare June 12, 2026 18:56

negz commented Jun 12, 2026

Comment thread design/unopinionated-deployments.md Outdated

negz commented Jun 12, 2026

Comment thread design/unopinionated-deployments.md Outdated

negz changed the title ~~Propose unopinionated ModelDeployments~~ Propose and implement unopinionated ModelDeployment workers Jun 12, 2026

negz force-pushed the topological branch from 5402c00 to cb4fa92 Compare June 12, 2026 23:23

negz changed the title ~~Propose and implement unopinionated ModelDeployment workers~~ Propose and implement unopinionated ModelDeployment engines Jun 12, 2026

negz force-pushed the topological branch 2 times, most recently from f6e5905 to 81a3dba Compare June 13, 2026 00:13

negz commented Jun 13, 2026

Comment thread docs/content/concepts.md Outdated

negz commented Jun 13, 2026

Comment thread examples/deployment/model-deployment-multinode.yaml Outdated

negz force-pushed the topological branch 2 times, most recently from 17e0f7f to cfea134 Compare June 13, 2026 04:25

negz force-pushed the topological branch 3 times, most recently from e3849a2 to c765093 Compare June 13, 2026 07:11

negz commented Jun 13, 2026

Comment thread apis/modeldeployments/definition.yaml Outdated

Comment thread docs/content/_index.md Outdated

Comment thread docs/content/concepts.md Outdated

Comment thread functions/compose-model-deployment/function/fn.py Outdated

negz added 2 commits June 13, 2026 00:44

negz force-pushed the topological branch from c765093 to a64f48f Compare June 13, 2026 07:44

dennis-upbound mentioned this pull request Jun 13, 2026

Disaggregated serving on the unopinionated engine shape #142

Merged

4 tasks

negz merged commit 0f0c660 into main Jun 14, 2026
6 checks passed

negz mentioned this pull request Jun 15, 2026

Scheduler can split a gang across interconnect fabrics #149

Open

negz mentioned this pull request Jun 15, 2026

Place each engine on a single pool instead of splitting across pools #150

Merged

4 tasks

negz deleted the topological branch June 16, 2026 16:56

This was referenced Jun 17, 2026

Make the EndpointPicker (EPP) config user-configurable for PrefillDecode serving #179

Open

Make PrefillDecode actually disaggregate #175

Merged

3 participants
