Propose and implement unopinionated ModelDeployment engines #137
Pull request overview
Adds a draft design proposal that makes ModelDeployment less engine-opinionated by describing deployment shape (worker groups) and serving (unified vs disaggregated), plus an accompanying “generic” example manifest. It also updates the existing design doc to mark the current ModelDeployment section as superseded by the new proposal.
Changes:
- Added a new design document describing the proposed unopinionated `ModelDeployment` worker-group + serving model.
- Added a new example manifest for a disaggregated multi-node deployment shape.
- Annotated `design/design.md` to point readers to the new proposal and note which parts are superseded.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| examples/deployment/generic.yaml | New example manifest demonstrating the proposed schema and disaggregated multi-node shape. |
| design/unopinionated-deployments.md | New draft design doc describing the proposed unopinionated ModelDeployment API shape and scheduling/serving model. |
| design/design.md | Marks the existing ModelDeployment spec description as superseded by the new proposal doc. |
|
Two things worth pinning while EPD is coming, and (More #34 than #137, flagging anyway: KV transfer is shifting from direct prefill→decode to tiered KV stores — Mooncake/LMCache, now NIXL backends (Mooncake, LMCache). The connector's a user flag, but a shared store is a stateful component with no home in |
|
@dennis-upbound Good point, I did read a little about EPD while working on this. The good part is the encode stage should model cleanly under

    spec:
      serving:
        mode: DisaggregatedPD
        pd:
          # ...

    spec:
      serving:
        mode: DisaggregatedEPD
        epd:
          # ...

|
This proposal began as an attempt to implement spec.workers.topology's unimplemented data and dataLocal axes, so the API could express the data-parallel and mixture-of-experts deployments frontier models like Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the topology abstraction itself.
topology's axes do two things each: they shape the workload (pods and nodes) and they name an engine flag Modelplane derives and injects. Everywhere else Modelplane passes the user's args through untouched, so a new engine or flag needs no change to Modelplane. topology breaks that: the flags it derives (--tensor-parallel-size and the rest) are engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM, so deriving them takes on the per-engine knowledge Modelplane was trying to avoid. It also creates two sources of truth: the user writes the parallelism flags in args, and topology derives them again, with nothing reconciling the two.
design/unopinionated-deployments.md proposes describing a deployment by its shape instead. spec.engines is an array of inference engines, each a single Standalone member or a Leader and one or more Workers, stamped out a fixed number of times by copies. A Worker member's worker.nodes says how many nodes it spans - how big the engine is. spec.serving describes how an InferenceCluster exposes those engines as an OpenAI-compatible endpoint. Parallelism and the rest stay in the engine's own flags, which Modelplane passes through - so the API stays unopinionated about the engine and the parallelism topology.
This supersedes parts of the base design; design.md gains pointers to the new doc pending it being folded in. Still a draft for discussion.
Signed-off-by: Nic Cope <nicc@rk0n.org>
The ModelDeployment API modeled its workers as a topology block of parallelism axes - tensor and pipeline - and derived engine flags (--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap from them. This coupled Modelplane to per-engine knowledge: the flags are spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still wrote the same flags in args, leaving two unreconciled sources of truth. The shape also couldn't express data or expert parallelism.
This replaces topology with the shape the design in design/unopinionated-deployments.md proposes. spec.engines is an array of inference engines; an engine is one serving unit of either a single Standalone member or a Leader and one or more Workers. The engine carries one nodeSelector (moved down from the deployment level) describing what each of its pods needs from its node: an engine's members are homogeneous and always schedule to one node pool, so a per-member selector could only express gangs the scheduler couldn't place. Each member carries its engine template.
Parallelism, quantization, and KV transfer now live entirely in the members' engine commands and args, which pass through verbatim; Modelplane injects no engine flags. The only thing it injects is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so follower commands needn't hard-code the orchestrator's variable.
Sizing has three levels, with one scaling axis. spec.replicas scales the model by adding whole ModelReplicas. An engine's copies stamps out a fixed number of identical copies per replica - deliberately not named replicas, so it doesn't read as a second scaling knob. A Worker member's worker.nodes says how many nodes the member spans - how big the engine is, not how many of it. Node cost is pods x copies, where pods is 1 for a Standalone or 1 plus the Worker's nodes for a gang. worker nests under a Worker-only object so CEL can reject it on other roles, and it carries no schema default: the apiserver applies defaults before CEL validation, so a default would inject worker onto every Standalone and Leader member and trip the rule that forbids it, rejecting every ModelDeployment.
The scheduler co-schedules a replica's engines onto one cluster, each engine on a pool that satisfies its nodeSelector, against a trial ledger so two engines of a replica don't double-book a pool. The ModelReplica spec mirrors the new shape with the scheduler's placement resolved on: each engine carries its nodePoolName and resolved deviceRequests, shared by all its members.
compose-model-replica composes a workload per engine - a Deployment for a Standalone, a LeaderWorkerSet for a gang - plus one ResourceClaimTemplate per engine (a template stamps a fresh claim per pod, so the whole gang shares it), fronted by one Service and HTTPRoute spanning every engine's serving pods. Only Standalone pods and gang leaders carry the serving label the Service selects on; a gang's worker followers don't serve, so they carry no serving label and a multi-engine replica's Deployments select on a per-workload label to avoid overlapping selectors.
The replica name is reserved for the serving Service and HTTPRoute; workloads are always named per engine. A LeaderWorkerSet's controller creates a headless Service named after the LWS for gang pod DNS - the address followers join - but only if no Service of that name exists. An LWS sharing the serving Service's name leaves gang DNS unresolvable and the gang deadlocked, with the leader waiting for followers that can never find it.
Both paths were validated end to end on EKS: a Standalone engine on one L4 GPU, and a Leader/Worker gang serving pipeline-parallel across two single-L4 nodes. The multi-node example is the validated manifest; it uses vLLM's bundled multi-node-serving.sh launcher, which blocks the leader's engine until the whole gang has joined Ray.
This ports the functionality the repo has today onto the new shape. Prefill/decode disaggregation - spec.serving and the Disaggregated mode - is left for a follow-up; unified serving is the only behavior.
Signed-off-by: Nic Cope <nicc@rk0n.org>
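The node cost rule in the commit message above (pods x copies, with pods being 1 for a Standalone or 1 plus the Worker's nodes for a gang) can be sketched in a few lines of Python. This is only an illustration: the `Engine` dataclass and the helper names are hypothetical, not Modelplane's types, and multiplying by `spec.replicas` for a deployment-wide total is an assumption drawn from the statement that replicas add whole ModelReplicas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Engine:
    """Hypothetical stand-in for one entry of spec.engines."""
    name: str
    copies: int = 1
    # None models a Standalone engine; an int models a Leader plus a
    # Worker member whose worker.nodes is that value.
    worker_nodes: Optional[int] = None


def pods_per_copy(engine: Engine) -> int:
    # 1 for a Standalone, or 1 (the leader) plus the Worker's nodes for a gang.
    return 1 if engine.worker_nodes is None else 1 + engine.worker_nodes


def engine_node_cost(engine: Engine) -> int:
    # Node cost is pods x copies.
    return pods_per_copy(engine) * engine.copies


def deployment_node_cost(replicas: int, engines: List[Engine]) -> int:
    # Assumption: every replica repeats the same set of engines.
    return replicas * sum(engine_node_cost(e) for e in engines)


standalone = Engine(name="small")
gang = Engine(name="big", copies=2, worker_nodes=1)
```

Under these assumptions `copies` and `worker.nodes` multiply rather than add, which is why the message stresses that only `spec.replicas` is a scaling axis.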
|
A Leader+Worker engine doing TP/PP/EP talks over its pool's fabric — NVLink within a node, InfiniBand within a pool. Put the leader on one pool and the worker on another and that fabric isn't there: cross-pool tensor parallel never forms and the gang just sits NotReady. The split hands back a placement that can't run, not a slower one, with no signal saying why. Would it be safer to fail closed? If an engine has more than one claiming member and no single pool fits it, reject with InsufficientCapacity instead of splitting. A false reject is recoverable — add capacity, or size the gang to a pool — but a gang split across fabrics hangs silently. The claimless-coordinator ride-along you describe is fine as-is (it stays on a sibling's pool, one gang one fabric); it's specifically splitting members that each claim their own GPUs across pools that I'd guard. |
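A minimal sketch of the fail-closed behavior suggested in this comment, assuming a toy model where each pool is just a count of free GPUs. The function name, its parameters, and the pool dictionary are hypothetical; only the `InsufficientCapacity` reason comes from the comment itself.

```python
from typing import Dict


class InsufficientCapacity(Exception):
    """Raised instead of splitting an engine's claiming members across pools."""


def place_engine(claiming_members: int, gpus_per_member: int,
                 free_gpus_by_pool: Dict[str, int]) -> str:
    """Return the one pool that fits the whole engine, or fail closed."""
    needed = claiming_members * gpus_per_member
    # Iterate in name order so the choice is deterministic.
    for pool, free in sorted(free_gpus_by_pool.items()):
        if free >= needed:
            return pool
    # No single pool fits: reject rather than hand back a placement that
    # spans fabrics and would hang without a signal.
    raise InsufficientCapacity(
        f"no single pool has {needed} free GPUs for {claiming_members} members"
    )
```

With pools `{"a": 1, "b": 2}`, a two-member gang needing one GPU each lands on `b`; with `{"a": 1, "b": 1}` the call raises rather than splitting the gang.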
|
Looks great! |
PR #137 reshapes spec.workers into named engine groups and composes each engine's workload + a unified Service/HTTPRoute, but defers disaggregated prefill/decode routing. This adds that routing as a post-process over #137's per-engine workloads. When serving.mode is Disaggregated, the two engines named by serving.disaggregation are role-labeled (llm-d.ai/role: prefill|decode), the decode engine's serving pod gets the pd-sidecar (engine moved to 8001, sidecar on 8000, --secure-proxy=false so the HTTP gateway path works), and the unified Service is replaced by an InferencePool over both engines + a hardcoded endpoint picker + an HTTPRoute pointing at the pool. Engine flags including the NixlConnector --kv-transfer-config stay the user's, per #137; this injects none. The image refs (llm-d-inference-scheduler, llm-d-routing-sidecar) and the EndpointPickerConfig apiVersion (inference.networking.x-k8s.io/v1alpha1) are the ones validated pullable/parseable on a live GKE cluster. Speculative: layers on #137 (unmerged). The serving.{mode,disaggregation} XRD block + regen, the fn.py compose hook, and tests are the remaining wiring. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: Disaggregated and a serving.disaggregation block naming the prefill and decode engines (CEL-checked: exactly two engines, and the names must match engines). The scheduler copies serving onto the ModelReplica. The replica backend, when disaggregated, role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references and the EndpointPickerConfig apiVersion are the ones validated pullable and parseable on a live GKE cluster. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: Disaggregated and a serving.disaggregation block naming the prefill and decode engines (CEL-checked: exactly two engines, and the names must match engines). The scheduler copies serving onto the ModelReplica. The replica backend, when disaggregated, role-labels each engine's serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica. The replica backend, when disaggregated, finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine moves to 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), replaces the unified Service with an InferencePool over both engines, and points the HTTPRoute at the pool fronted by a hardcoded endpoint picker. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
PR #137 reshaped spec.workers into named engine groups and composes each engine's workload plus a unified Service/HTTPRoute, but deferred disaggregated prefill/decode routing. This adds it on that shape. A ModelDeployment opts in with serving.mode: PrefillDecode and marks each engine's phase as Prefill or Decode (CEL-checked: exactly two engines, one of each phase, and phase set only under PrefillDecode). The scheduler copies serving and the phases onto the ModelReplica. The workload backends compose engines only; a routing layer then decorates them with the serving surface serving.mode selects. Unified adds a Service and HTTPRoute. PrefillDecode finds the prefill and decode engines by phase, role-labels their serving pods (llm-d.ai/role), injects the pd-sidecar on decode (the engine listens on 8001, the sidecar takes 8000, --secure-proxy=false so the HTTP gateway path works), and fronts both with a GAIE InferencePool and a hardcoded endpoint picker that sequences prefill then decode. Routing an HTTPRoute to an InferencePool backendRef needs the serving cluster to run the Envoy AI Gateway and the Gateway API Inference Extension, with Envoy Gateway configured to delegate InferencePool resolution to the AI Gateway's ext-proc server; #137 had stripped these. The serving stack now installs the AI Gateway (CRDs plus controller) and the GAIE InferencePool CRD, and configures Envoy Gateway's extensionManager. Envoy Gateway moves to v1.8.1, the released version the AI Gateway is tested against, whose chart bundles the ListenerSet CRD the newer gateway needs. Engine flags, including the NixlConnector --kv-transfer-config, stay the user's per #137; this injects none. The image references, the EndpointPickerConfig apiVersion, and the Envoy Gateway version are the ones validated on a live GKE cluster, end to end through the gateway. Towards #34. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
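The CEL checks described above (exactly two engines, one of each phase, and phase set only under PrefillDecode) are enforced by the apiserver, but the same rules can be mirrored in plain Python to make them concrete. This is a sketch under assumptions: the function name and the dict shape of an engine are hypothetical, and the real rules live in the XRD rather than in code like this.

```python
from typing import Dict, List, Optional


def validate_serving(mode: str, engines: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return a list of violations; an empty list means the spec passes."""
    errors: List[str] = []
    phases = [engine.get("phase") for engine in engines]
    if mode == "PrefillDecode":
        if len(engines) != 2:
            errors.append("PrefillDecode requires exactly two engines")
        # One engine of each phase; a missing phase counts as a mismatch.
        if sorted(p or "" for p in phases) != ["Decode", "Prefill"]:
            errors.append("PrefillDecode requires one Prefill and one Decode engine")
    elif any(p is not None for p in phases):
        errors.append("phase may only be set when serving.mode is PrefillDecode")
    return errors


ok = validate_serving("PrefillDecode", [
    {"name": "prefill", "phase": "Prefill"},
    {"name": "decode", "phase": "Decode"},
])
```

Here `ok` is an empty list; setting `phase` on an engine under Unified serving, or marking both engines Prefill, each yields one violation.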
The EPP's approx-prefix-cache-producer must chunk prefixes at the same KV block size the engine uses, or prefix-cache routing silently degrades (no error, just worse decisions). The config hardcoded blockSizeTokens: 16, which only works because it matches vLLM's default --block-size; a user who sets --block-size 32 (engine flags are the user's, per #137) would quietly get bad routing. Derive it best-effort from the decode engine's flags — vLLM's --block-size and SGLang's --page-size — falling back to 16 when absent or unparseable, and render it into the EPP config. Marked a HACK: peeking at user-owned engine args is the pragmatic v0.1 unblock; the durable fix is a typed/overridable knob on the serving block (#179). Signed-off-by: Dennis Ramdass <dennis@upbound.io>
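The best-effort derivation described above could look roughly like the following. This is a sketch, not the code in the PR: the function name is hypothetical, and accepting both the `--flag value` and `--flag=value` spellings, as well as letting the first matching flag win, are assumptions. The two flag names and the fallback of 16 come from the commit message.

```python
from typing import List

# vLLM's --block-size and SGLang's --page-size, per the commit message.
BLOCK_SIZE_FLAGS = ("--block-size", "--page-size")
DEFAULT_BLOCK_SIZE = 16


def derive_block_size(args: List[str], default: int = DEFAULT_BLOCK_SIZE) -> int:
    """Peek at user-owned engine args and return the KV block size in tokens."""
    for i, arg in enumerate(args):
        for flag in BLOCK_SIZE_FLAGS:
            value = None
            if arg == flag and i + 1 < len(args):
                value = args[i + 1]            # "--block-size 32"
            elif arg.startswith(flag + "="):
                value = arg[len(flag) + 1:]    # "--block-size=32"
            if value is None:
                continue
            try:
                size = int(value)
            except ValueError:
                return default                 # unparseable: fall back
            return size if size > 0 else default
    return default                             # flag absent: fall back
```

For example, `derive_block_size(["--model", "m", "--block-size", "32"])` gives 32, while an empty or unparseable argument list falls back to 16.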
PrefillDecode silently fails when the engine image lacks the NIXL runtime: vLLM's NixlConnector (and SGLang's PD transfer) import the `nixl` package, which the base vllm/vllm-openai image doesn't include, so disaggregated engines crashloop with "NIXL is not available". Engine images are the user's (#137), so Modelplane can't bundle it — but nothing told the user it was required. Document the prerequisite where it's relevant: the _disaggregated composition docstring, the user-facing ModelDeployment doc, and the unopinionated-deployments design. The fix is to use a kv-connector-enabled image — build vLLM with INSTALL_KV_CONNECTORS=true (nixl + lmcache + mooncake) or a pre-built one such as lmcache/vllm-openai. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
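One way a user might confirm the prerequisite before deploying is to check, inside the engine image, whether the `nixl` package can be imported. The helper below is a hypothetical sketch rather than something Modelplane ships; it uses only the standard library's `importlib.util.find_spec`, and takes the module name `nixl` from the commit message.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Report whether a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def check_kv_connector_runtime() -> None:
    # Assumption: the disaggregated engines need the `nixl` module importable.
    if not module_available("nixl"):
        raise SystemExit(
            "nixl is not importable in this image; "
            "use a kv-connector-enabled engine image"
        )
```

Running `check_kv_connector_runtime()` with the image's own Python interpreter would exit with a readable message instead of leaving the engine to crashloop.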
Description of your changes
Fixes #52.
Today `spec.workers.topology` describes a deployment with parallelism axes (`tensor`, `pipeline`, `data`, `dataLocal`). Each axis derives and injects engine-specific flags - breaking the pass-through property Modelplane relies on everywhere else, and creating two sources of truth (the user writes the same flags in `args`).
I propose we describe a deployment by its shape instead. `spec.engines` is an array of inference engines, each a `Standalone` member or a `Leader` and one or more `Worker` members. A member carries its own `nodeSelector` and engine template; an engine may be stamped out a fixed number of times with `copies`.
A small model on a single GPU shows the API at its simplest:
This PR ports today's functionality onto the new shape. The scheduler co-schedules a replica's engines onto one cluster, each on a pool that satisfies its members' selectors. `compose-model-replica` composes a workload per engine - a Deployment for a Standalone, a LeaderWorkerSet for a gang - fronted by one Service and HTTPRoute. Modelplane injects no engine flags; the only injection is `MODELPLANE_LEADER_ADDRESS`, a backend-neutral alias for the gang leader's address.
Prefill/decode disaggregation (`spec.serving` and `PrefillDecode`) is designed here but not implemented.
Both paths were validated end to end on EKS L4 GPUs: a Standalone engine on one GPU, and a Leader/Worker gang serving Qwen2.5-7B pipeline-parallel across two single-GPU nodes, with completions flowing through the control-plane gateway. The multi-node example is the validated manifest. The run also filed #139, #140, and #141 for pre-existing issues it surfaced.
I have:
- Run `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- Signed off my commits with `git commit -s`.