Implement unopinionated ModelDeployment workers #138
Closed
negz wants to merge 2 commits into
This proposal began as an attempt to implement spec.workers.topology's unimplemented data and dataLocal axes, so the API could express the data-parallel and mixture-of-experts deployments frontier models like Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the topology abstraction itself.

topology's axes do two things each: they shape the workload (pods and nodes) and they name an engine flag Modelplane derives and injects. Everywhere else Modelplane passes the user's args through untouched, so a new engine or flag needs no change to Modelplane. topology breaks that: the flags it derives (--tensor-parallel-size and the rest) are engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM, so deriving them takes on the per-engine knowledge Modelplane was trying to avoid. It also creates two sources of truth: the user writes the parallelism flags in args, and topology derives them again, with nothing reconciling the two.

design/unopinionated-deployments.md proposes describing a deployment by its shape instead. spec.workers becomes an array of worker groups, each a Standalone member or a Leader and one or more Workers, replicated by group. spec.serving describes how an InferenceCluster exposes those groups as an OpenAI-compatible endpoint. Parallelism and the rest stay in the engine's own flags, which Modelplane passes through - so the API stays unopinionated about the engine and the parallelism topology.

This supersedes parts of the base design; design.md gains pointers to the new doc pending it being folded in. Still a draft for discussion.

Towards #52.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The ModelDeployment API modeled its workers as a topology block of parallelism axes - tensor and pipeline - and derived engine flags (--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap from them. This coupled Modelplane to per-engine knowledge: the flags are spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still wrote the same flags in args, leaving two unreconciled sources of truth. The shape also couldn't express data or expert parallelism.

This replaces topology with the shape the design in design/unopinionated-deployments.md proposes. spec.workers becomes an array of worker groups; a group is one serving unit of either a single Standalone member or a Leader and one or more Workers. Each member carries its own nodeSelector (moved down from the deployment level) and engine template, and a group may set replicas. Node cost is pods x replicas, where pods is 1 for a Standalone or 1 plus the Worker count for a gang.

Parallelism, quantization, and KV transfer now live entirely in the members' engine commands and args, which pass through verbatim; Modelplane injects no engine flags. The only thing it injects is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so follower commands needn't hard-code the orchestrator's variable.

A member's count is treated as 1 when unset but carries no schema default, because the apiserver applies defaults before CEL validation: a default would inject count onto every Standalone and Leader member and trip the rule that only a Worker may set it, rejecting every ModelDeployment.

The scheduler now co-schedules a replica's groups onto one cluster, each group on a pool that satisfies every member, against a trial ledger so two groups of a replica don't double-book a pool.
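As a rough illustration of the node-cost arithmetic and the count rule described in this commit message, here is a minimal Python sketch. The `Member` and `Group` classes and the `validate`, `pods`, and `node_cost` helpers are hypothetical names invented for this example; they are not Modelplane's types, and the real rule is enforced by CEL validation rather than by code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    role: str                    # "Standalone", "Leader", or "Worker"
    count: Optional[int] = None  # deliberately no default, mirroring the schema


@dataclass
class Group:
    members: list[Member]
    replicas: int = 1


def validate(group: Group) -> None:
    # Only a Worker may set count. A schema-level default would set it on
    # every member and make this check reject every group.
    for m in group.members:
        if m.count is not None and m.role != "Worker":
            raise ValueError(f"only a Worker may set count, got role {m.role!r}")


def pods(group: Group) -> int:
    # 1 for a Standalone; 1 (the Leader) plus the Worker count for a gang.
    # An unset Worker count is treated as 1.
    workers = sum(
        1 if m.count is None else m.count
        for m in group.members
        if m.role == "Worker"
    )
    return 1 + workers


def node_cost(group: Group) -> int:
    # Node cost is pods x replicas.
    return pods(group) * group.replicas
```

Under these assumptions, a gang of one Leader and three Workers with two replicas costs (1 + 3) x 2 = 8 nodes, and a Standalone group with three replicas costs 3.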
The ModelReplica spec mirrors the new shape with the scheduler's placement resolved onto it: each group carries its nodePoolName and each member its resolved deviceRequests. compose-model-replica composes a workload per group - a Deployment for a Standalone, a LeaderWorkerSet for a gang - fronted by one Service and HTTPRoute spanning every group's serving pods. Only Standalone pods and gang leaders carry the serving label the Service selects on; a gang's worker followers don't serve, so they carry no serving label. A multi-group replica's Deployments select on a per-workload label to avoid overlapping selectors.

The replica name is reserved for the serving Service and HTTPRoute; workloads are always named per group. A LeaderWorkerSet's controller creates a headless Service named after the LWS for gang pod DNS - the address followers join - but only if no Service of that name exists. An LWS sharing the serving Service's name leaves gang DNS unresolvable and the gang deadlocked, with the leader waiting for followers that can never find it.

Both paths were validated end to end on EKS: a Standalone group on one L4 GPU, and a Leader/Worker gang serving pipeline-parallel across two single-L4 nodes. The multi-node example is the validated manifest; it uses vLLM's bundled multi-node-serving.sh launcher, which blocks the leader's engine until the whole gang has joined Ray.

This ports the functionality the repo has today onto the new shape. Prefill/decode disaggregation - spec.serving and the Disaggregated mode - is left for a follow-up; unified serving is the only behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>
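The naming and labelling rules in this commit message can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the label keys, the `<replica>-<group>` naming scheme, and the `Workload` and `compose` names are hypothetical and are not taken from compose-model-replica.

```python
from dataclasses import dataclass

SERVING_LABEL = "modelplane.example/serving"    # hypothetical label key
WORKLOAD_LABEL = "modelplane.example/workload"  # hypothetical label key


@dataclass
class Workload:
    kind: str         # "Deployment" or "LeaderWorkerSet"
    name: str
    pod_labels: dict  # pod role -> labels on that role's pods


def compose(replica: str, groups: dict[str, str]) -> list[Workload]:
    """groups maps a group name to its shape, "Standalone" or "Gang"."""
    workloads = []
    for group, shape in groups.items():
        # Workloads are named per group so none takes the replica name,
        # which is reserved for the serving Service and HTTPRoute.
        name = f"{replica}-{group}"
        serving = {SERVING_LABEL: replica, WORKLOAD_LABEL: name}
        if shape == "Standalone":
            workloads.append(Workload("Deployment", name, {"standalone": serving}))
        else:
            # Only the leader serves; followers carry no serving label.
            followers = {WORKLOAD_LABEL: name}
            workloads.append(
                Workload("LeaderWorkerSet", name,
                         {"leader": serving, "worker": followers})
            )
    return workloads
```

With this scheme, a replica named `llama` with groups `small` and `big` yields workloads `llama-small` and `llama-big`, so a LeaderWorkerSet never shares the serving Service's name and its controller remains free to create the headless Service for gang DNS.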
This was referenced Jun 12, 2026
Rolled into #137 alongside the design it implements, as separate commits.
Description of your changes
Implements the design proposed in #137.
ModelDeployment.spec.workers modeled a topology of tensor/pipeline axes, from which Modelplane derived engine flags and a Ray bootstrap. That coupled it to per-engine knowledge, left the parallelism flags written in two places, and couldn't express data or expert parallelism.

This replaces topology with an array of worker groups, each a Standalone member or a Leader plus one or more Workers. The member carries its nodeSelector and engine template; parallelism and KV transfer live in the engine's commands and args, which pass through verbatim. The only injection is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address. The scheduler co-schedules a replica's groups onto one cluster, each on a satisfying pool, and compose-model-replica composes a Deployment or LeaderWorkerSet per group behind one Service and HTTPRoute.

This ports today's functionality onto the new shape. Prefill/decode disaggregation (spec.serving) is left for a follow-up; unified serving is the only behavior.

I have:
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Signed off with git commit -s.