Implement unopinionated ModelDeployment workers#138

Closed
negz wants to merge 2 commits into main from topologicality

Conversation

@negz negz commented Jun 12, 2026

Collaborator

Description of your changes

Implements the design proposed in #137.

ModelDeployment.spec.workers modeled a topology of tensor/pipeline axes, from which Modelplane derived engine flags and a Ray bootstrap. That coupled it to per-engine knowledge, left the parallelism flags written in two places, and couldn't express data or expert parallelism.

This replaces topology with an array of worker groups, each a Standalone member or a Leader plus one or more Workers. The member carries its nodeSelector and engine template; parallelism and KV transfer live in the engine's commands and args, which pass through verbatim. The only injection is MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's address. The scheduler co-schedules a replica's groups onto one cluster, each on a satisfying pool, and compose-model-replica composes a Deployment or LeaderWorkerSet per group behind one Service and HTTPRoute.

This ports today's functionality onto the new shape. Prefill/decode disaggregation (spec.serving) is left for a follow-up; unified serving is the only behavior.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

This proposal began as an attempt to implement spec.workers.topology's
unimplemented data and dataLocal axes, so the API could express the
data-parallel and mixture-of-experts deployments frontier models like
Kimi K2 and DeepSeek V3 need. Working on it surfaced a problem with the
topology abstraction itself.

topology's axes do two things each: they shape the workload (pods and
nodes) and they name an engine flag Modelplane derives and injects.
Everywhere else Modelplane passes the user's args through untouched, so a
new engine or flag needs no change to Modelplane. topology breaks that:
the flags it derives (--tensor-parallel-size and the rest) are
engine-specific, spelled differently by vLLM, SGLang, and TensorRT-LLM,
so deriving them takes on the per-engine knowledge Modelplane was trying
to avoid. It also creates two sources of truth: the user writes the
parallelism flags in args, and topology derives them again, with nothing
reconciling the two.

design/unopinionated-deployments.md proposes describing a deployment by
its shape instead. spec.workers becomes an array of worker groups, each a
Standalone member or a Leader and one or more Workers, replicated by
group. spec.serving describes how an InferenceCluster exposes those
groups as an OpenAI-compatible endpoint. Parallelism and the rest stay in
the engine's own flags, which Modelplane passes through - so the API
stays unopinionated about the engine and the parallelism topology.

This supersedes parts of the base design; design.md gains pointers to the
new doc until it is folded in. Still a draft for discussion.

Towards #52.

Signed-off-by: Nic Cope <nicc@rk0n.org>

negz force-pushed the topologicality branch from 3d609d9 to dadaa0f on June 12, 2026 18:58

The ModelDeployment API modeled its workers as a topology block of
parallelism axes - tensor and pipeline - and derived engine flags
(--tensor-parallel-size, --pipeline-parallel-size) and a Ray bootstrap
from them. This coupled Modelplane to per-engine knowledge: the flags are
spelled differently by vLLM, SGLang, and TensorRT-LLM, and the user still
wrote the same flags in args, leaving two unreconciled sources of truth.
The shape also couldn't express data or expert parallelism.

This replaces topology with the shape the design in
design/unopinionated-deployments.md proposes. spec.workers becomes an
array of worker groups; a group is one serving unit of either a single
Standalone member or a Leader and one or more Workers. Each member carries
its own nodeSelector (moved down from the deployment level) and engine
template, and a group may set replicas. Node cost is pods x replicas,
where pods is 1 for a Standalone or 1 plus the Worker count for a gang.
Parallelism, quantization, and KV transfer now live entirely in the
members' engine commands and args, which pass through verbatim; Modelplane
injects no engine flags. The only thing it injects is
MODELPLANE_LEADER_ADDRESS, a backend-neutral alias for the gang leader's
address (LWS_LEADER_ADDRESS for the LeaderWorkerSet backend), so follower
commands needn't hard-code the orchestrator's variable.
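
The node cost rule above can be sketched as follows. This is a minimal illustration, not Modelplane's code: the `Member` and `WorkerGroup` classes and their field names (`role`, `count`, `members`, `replicas`) are assumptions made for the example, and an unset Worker count is read as 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    # Hypothetical field names; the real API schema may differ.
    role: str                    # "Standalone", "Leader", or "Worker"
    count: Optional[int] = None  # only a Worker sets this; unset means 1


@dataclass
class WorkerGroup:
    name: str
    members: List[Member]
    replicas: int = 1


def pods_per_replica(group: WorkerGroup) -> int:
    """Return 1 for a Standalone, or 1 plus the Worker count for a gang."""
    if [m.role for m in group.members] == ["Standalone"]:
        return 1
    workers = sum((m.count or 1) for m in group.members if m.role == "Worker")
    return 1 + workers  # the leader plus its workers


def node_cost(group: WorkerGroup) -> int:
    """Node cost is pods x replicas."""
    return pods_per_replica(group) * group.replicas


standalone = WorkerGroup("small", [Member("Standalone")], replicas=3)
gang = WorkerGroup("big", [Member("Leader"), Member("Worker", count=3)], replicas=2)
```

Under these assumptions the Standalone group costs 3 nodes and the gang costs (1 + 3) x 2 = 8.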

A member's count carries no schema default, even though an unset count is
treated as 1. The apiserver applies defaults before CEL validation, so a
schema default would inject count onto every Standalone and Leader member
and trip the rule that only a Worker may set it, rejecting every
ModelDeployment.
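
The ordering problem can be simulated in a few lines. This is a sketch only: plain dicts stand in for members, `validate` stands in for the CEL rule, and all three function names are hypothetical.

```python
def validate(member: dict) -> None:
    """Stand-in for the rule that only a Worker may set count."""
    if "count" in member and member["role"] != "Worker":
        raise ValueError(f"{member['role']} member may not set count")


def apply_schema_default(member: dict) -> dict:
    """What a schema-level default of 1 on count would do before validation."""
    defaulted = dict(member)
    defaulted.setdefault("count", 1)
    return defaulted


def effective_count(member: dict) -> int:
    """Defaulting in code instead: an unset count is read as 1."""
    return member.get("count", 1)


leader = {"role": "Leader"}

validate(leader)  # passes: count is absent

try:
    validate(apply_schema_default(leader))  # default injected first
    rejected = False
except ValueError:
    rejected = True
```

With the default applied first, the Leader is rejected even though the user never set count. Reading the default in code, as `effective_count` does, avoids that.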

The scheduler now co-schedules a replica's groups onto one cluster, each
group on a pool that satisfies every member, against a trial ledger so two
groups of a replica don't double-book a pool. The ModelReplica spec mirrors
the new shape with the scheduler's placement resolved onto it: each group
carries its nodePoolName and each member its resolved deviceRequests.
compose-model-replica composes a workload per group - a Deployment for a
Standalone, a LeaderWorkerSet for a gang - fronted by one Service and
HTTPRoute spanning every group's serving pods. Only Standalone pods and
gang leaders carry the serving label the Service selects on; a gang's
worker followers don't serve, so they carry no serving label. A
multi-group replica's Deployments select on a per-workload label to avoid
overlapping selectors.
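
The trial-ledger idea can be sketched as a greedy first-fit over a copy of one cluster's free capacity. This is an illustration under assumptions, not the scheduler's actual algorithm: `place_replica`, the dict shapes, and the precomputed list of eligible pools per group are all hypothetical.

```python
from typing import Dict, List, Optional


def place_replica(
    groups: List[dict],          # each: {"name", "nodes", "pools": eligible pool names}
    free_nodes: Dict[str, int],  # pool name -> free nodes on one cluster
) -> Optional[Dict[str, str]]:
    """Place every group of a replica on one cluster, or return None."""
    trial = dict(free_nodes)  # trial ledger: debit a copy, not the real ledger
    placement: Dict[str, str] = {}
    for group in groups:
        # First eligible pool that still has room after earlier groups' debits.
        pool = next(
            (p for p in group["pools"] if trial.get(p, 0) >= group["nodes"]),
            None,
        )
        if pool is None:
            return None  # the replica does not fit on this cluster
        trial[pool] -= group["nodes"]
        placement[group["name"]] = pool
    return placement


free = {"a": 2, "b": 2}
groups = [
    {"name": "g1", "nodes": 2, "pools": ["a"]},
    {"name": "g2", "nodes": 2, "pools": ["a", "b"]},
]
placement = place_replica(groups, free)
```

Because g1's debit is recorded in the trial ledger, g2 lands on pool b instead of double-booking pool a. The caller's `free` mapping is left untouched until a placement is committed.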

The replica name is reserved for the serving Service and HTTPRoute;
workloads are always named per group. A LeaderWorkerSet's controller
creates a headless Service named after the LWS for gang pod DNS - the
address followers join - but only if no Service of that name exists. An
LWS sharing the serving Service's name leaves gang DNS unresolvable and
the gang deadlocked, with the leader waiting for followers that can never
find it.
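
A sketch of the naming rule, assuming a hypothetical `<replica>-<group>` scheme for workloads (the actual format is not stated here):

```python
from typing import Dict, List, Tuple


def names_for(replica: str, groups: List[str]) -> Tuple[str, Dict[str, str]]:
    """Return the serving name and a per-group workload name for a replica."""
    serving = replica  # reserved for the serving Service and HTTPRoute
    # Workloads are always named per group, even when there is only one group,
    # so a LeaderWorkerSet never shares the serving Service's name.
    workloads = {group: f"{replica}-{group}" for group in groups}
    return serving, workloads


serving, workloads = names_for("qwen", ["default"])
```

Here the single group's workload is named `qwen-default`, so the LWS controller is free to create its own headless Service under that name.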

Both paths were validated end to end on EKS: a Standalone group on one L4
GPU, and a Leader/Worker gang serving pipeline-parallel across two
single-L4 nodes. The multi-node example is the validated manifest; it uses
vLLM's bundled multi-node-serving.sh launcher, which blocks the leader's
engine until the whole gang has joined Ray.

This ports the functionality the repo has today onto the new shape.
Prefill/decode disaggregation - spec.serving and the Disaggregated mode -
is left for a follow-up; unified serving is the only behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>

negz commented Jun 12, 2026

Collaborator Author

Rolled into #137 alongside the design it implements, as separate commits.

@negz negz closed this Jun 12, 2026
@negz negz deleted the topologicality branch June 16, 2026 16:56