Implement unopinionated ModelDeployment workers by negz · Pull Request #138 · modelplaneai/modelplane

negz · 2026-06-12T18:51:32Z

Description of your changes

Implements the design proposed in #137.

ModelDeployment.spec.workers modeled a topology of tensor/pipeline axes, from which Modelplane derived engine flags and a Ray bootstrap. That coupled it to per-engine knowledge, left the parallelism flags written in two places, and couldn't express data or expert parallelism.

This replaces topology with an array of worker groups, each a Standalone member or a Leader plus one or more Workers. The member carries its nodeSelector and engine template; parallelism and KV transfer live in the engine's commands and args, which pass through verbatim. The only injection is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address. The scheduler co-schedules a replica's groups onto one cluster, each on a satisfying pool, and compose-model-replica composes a Deployment or LeaderWorkerSet per group behind one Service and HTTPRoute.
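
As a rough illustration of the group shape described above, here is a minimal Python sketch. The class and field names (`Member`, `WorkerGroup`, `role`, `count`) are hypothetical stand-ins rather than the actual API schema, and allowing several Worker members per group is an assumption. The rule that only a Worker may set count, and the node-cost arithmetic (pods x replicas), are taken from the commit message in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    role: str                    # "Standalone", "Leader", or "Worker"
    count: Optional[int] = None  # only a Worker may set this; unset means 1


@dataclass
class WorkerGroup:
    members: List[Member]
    replicas: int = 1


def validate(group: WorkerGroup) -> None:
    """Raise ValueError unless the group is one Standalone, or a Leader plus Workers."""
    for m in group.members:
        if m.role != "Worker" and m.count is not None:
            raise ValueError("only a Worker member may set count")
    roles = [m.role for m in group.members]
    if roles == ["Standalone"]:
        return
    is_gang = (
        set(roles) == {"Leader", "Worker"}
        and roles.count("Leader") == 1
    )
    if not is_gang:
        raise ValueError("group must be one Standalone, or one Leader plus Workers")


def pods(group: WorkerGroup) -> int:
    """1 for a Standalone; 1 plus the Worker count for a gang."""
    workers = sum(
        1 if m.count is None else m.count
        for m in group.members
        if m.role == "Worker"
    )
    return 1 + workers


def node_cost(group: WorkerGroup) -> int:
    """Node cost is pods x replicas."""
    return pods(group) * group.replicas
```

Under these assumptions, a Leader with a Worker of count 3 at 2 replicas costs (1 + 3) x 2 = 8 nodes, while a Standalone at 3 replicas costs 3.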

This ports today's functionality onto the new shape. Prefill/decode disaggregation (spec.serving) is left for a follow-up; unified serving is the only behavior.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

This proposal began as an attempt to implement spec.workers.topology's unimplemented data and dataLocal axes, so the API could express the data-parallel and mixture-of-experts deployments frontier models like Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the topology abstraction itself.

topology's axes do two things each: they shape the workload (pods and nodes) and they name an engine flag Modelplane derives and injects. Everywhere else Modelplane passes the user's args through untouched, so a new engine or flag needs no change to Modelplane. topology breaks that: the flags it derives (--tensor-parallel-size and the rest) are engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM, so deriving them takes on the per-engine knowledge Modelplane was trying to avoid. It also creates two sources of truth: the user writes the parallelism flags in args, and topology derives them again, with nothing reconciling the two.

design/unopinionated-deployments.md proposes describing a deployment by its shape instead. spec.workers becomes an array of worker groups, each a Standalone member or a Leader and one or more Workers, replicated by group. spec.serving describes how an InferenceCluster exposes those groups as an OpenAI-compatible endpoint. Parallelism and the rest stay in the engine's own flags, which Modelplane passes through, so the API stays unopinionated about the engine and the parallelism topology.

This supersedes parts of the base design; design.md gains pointers to the new doc pending it being folded in. Still a draft for discussion.

Towards #52.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The ModelDeployment API modeled its workers as a topology block of parallelism axes - tensor and pipeline - and derived engine flags (--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap from them. This coupled Modelplane to per-engine knowledge: the flags are spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still wrote the same flags in args, leaving two unreconciled sources of truth. The shape also couldn't express data or expert parallelism.

This replaces topology with the shape the design in design/unopinionated-deployments.md proposes. spec.workers becomes an array of worker groups; a group is one serving unit of either a single Standalone member or a Leader and one or more Workers. Each member carries its own nodeSelector (moved down from the deployment level) and engine template, and a group may set replicas. Node cost is pods x replicas, where pods is 1 for a Standalone or 1 plus the Worker count for a gang.

Parallelism, quantization, and KV transfer now live entirely in the members' engine commands and args, which pass through verbatim; Modelplane injects no engine flags. The only thing it injects is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so follower commands needn't hard-code the orchestrator's variable.

A member's count is treated as 1 when unset, but the schema declares no default for it. The apiserver applies defaults before CEL validation, so a schema default would inject count onto every Standalone and Leader member and trip the rule that only a Worker may set it, rejecting every ModelDeployment.

The scheduler now co-schedules a replica's groups onto one cluster, each group on a pool that satisfies every member, against a trial ledger so two groups of a replica don't double-book a pool.

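
The co-scheduling step can be pictured roughly as follows. This is a sketch under assumptions, not Modelplane's scheduler: pools are modeled as a free-node count per pool name, `fits` is a hypothetical predicate standing in for "the pool satisfies every member", and first-fit is an arbitrary choice. What it illustrates is the trial ledger: capacity is debited on a copy, so a later group of the same replica sees what earlier groups already claimed, and the caller's view changes only if it chooses to commit a complete placement.

```python
from typing import Callable, Dict, List, Optional


def place_replica(
    groups: List[dict],
    free_nodes: Dict[str, int],
    fits: Callable[[dict, str], bool],
) -> Optional[Dict[str, str]]:
    """Place every group of one replica on a pool of a single cluster.

    Each group is a dict with hypothetical keys "name" and "nodes".
    Returns {group name: pool name}, or None if any group cannot be placed.
    """
    ledger = dict(free_nodes)  # trial copy; the caller's dict is untouched
    placement: Dict[str, str] = {}
    for group in groups:
        for pool, free in ledger.items():
            if fits(group, pool) and free >= group["nodes"]:
                ledger[pool] = free - group["nodes"]  # debit the trial ledger
                placement[group["name"]] = pool
                break
        else:
            # All-or-nothing: one unplaceable group fails the whole replica.
            return None
    return placement
```

With pools `{"p1": 3, "p2": 2}` and two groups needing 2 nodes each, the first group takes `p1` and leaves 1 node on the ledger, so the second group lands on `p2` instead of double-booking `p1`.
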
The ModelReplica spec mirrors the new shape with the scheduler's placement resolved onto it: each group carries its nodePoolName and each member its resolved deviceRequests. compose-model-replica composes a workload per group - a Deployment for a Standalone, a LeaderWorkerSet for a gang - fronted by one Service and HTTPRoute spanning every group's serving pods.

Only Standalone pods and gang leaders carry the serving label the Service selects on. A gang's worker followers don't serve, so they carry no serving label. A multi-group replica's Deployments select on a per-workload label to avoid overlapping selectors.

The replica name is reserved for the serving Service and HTTPRoute; workloads are always named per group. A LeaderWorkerSet's controller creates a headless Service named after the LWS for gang pod DNS - the address followers join - but only if no Service of that name exists. An LWS sharing the serving Service's name leaves gang DNS unresolvable and the gang deadlocked, with the leader waiting for followers that can never find it.

Both paths were validated end to end on EKS: a Standalone group on one L4 GPU, and a Leader/Worker gang serving pipeline-parallel across two single-L4 nodes. The multi-node example is the validated manifest; it uses vLLM's bundled multi-node-serving.sh launcher, which blocks the leader's engine until the whole gang has joined Ray.

This ports the functionality the repo has today onto the new shape. Prefill/decode disaggregation - spec.serving and the Disaggregated mode - is left for a follow-up; unified serving is the only behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>
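
The composition and naming rules above can be summarized in a small Python sketch. It is illustrative only: the `<replica>-<group>` naming scheme and the label key are assumptions (the source says only that workloads are named per group and that the replica name is reserved for the Service and HTTPRoute), and the real compose-model-replica function emits full Kubernetes objects rather than these stub dicts.

```python
from typing import Dict, List

SERVING_LABEL = "example.com/serving"  # hypothetical label key


def plan(replica: str, groups: List[dict]) -> List[Dict[str, str]]:
    """List the objects composed for one replica.

    Each group is a dict with hypothetical keys "name" and "kind"
    ("Standalone" or "Gang").
    """
    objects = []
    for g in groups:
        kind = "Deployment" if g["kind"] == "Standalone" else "LeaderWorkerSet"
        # Assumed naming scheme: never the bare replica name, so an LWS
        # cannot collide with the serving Service.
        objects.append({"kind": kind, "name": f"{replica}-{g['name']}"})
    # One Service and one HTTPRoute front every group's serving pods.
    objects.append({"kind": "Service", "name": replica})
    objects.append({"kind": "HTTPRoute", "name": replica})
    return objects


def pod_labels(role: str) -> Dict[str, str]:
    """Only Standalone pods and gang leaders carry the serving label."""
    if role in ("Standalone", "Leader"):
        return {SERVING_LABEL: "true"}
    return {}
```

Keeping workload names distinct from the replica name is what lets the LeaderWorkerSet controller create its own headless Service for gang DNS, since it only does so when no Service of that name already exists.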

negz · 2026-06-12T22:23:00Z

Rolled into #137 alongside the design it implements, as separate commits.

negz force-pushed the topologicality branch from 3d609d9 to dadaa0f on June 12, 2026 18:58

negz force-pushed the topologicality branch from dadaa0f to 5402c00 on June 12, 2026 21:52

negz closed this Jun 12, 2026

negz deleted the topologicality branch June 16, 2026 16:56

negz wants to merge 2 commits into main from topologicality
