Place each engine on a single pool instead of splitting across pools by negz · Pull Request #150 · modelplaneai/modelplane

negz · 2026-06-15T18:57:40Z

Description of your changes

The scheduler placed an engine's members on one pool when it could, and split them across pools when no single pool fit. Splitting is unsafe, as @dennis-upbound pointed out reviewing #137: the scheduler can't reason about interconnect fabric. Pool identity is the finest grain it has, so when a gang's members land on different pools it can't tell whether those pools share a fabric.

A Leader/Worker gang running tensor or pipeline parallelism communicates over its pool's fabric: NVLink within a node, InfiniBand within a pool. Scatter its members across pools that don't share a fabric and the collective never forms: the gang sits NotReady with no signal saying why. Splitting doesn't degrade the interconnect; it removes it, handing back a placement that can't run rather than a slower one.

This change places every member of an engine on a single pool, or rejects the engine on that cluster. Rejection is safe and recoverable — the fill phase tries another cluster, and the deployment surfaces InsufficientCapacity if none fit — whereas a cross-fabric placement is a silent hang. Per-member nodeSelectors stay (a member can still claim different devices from the shared pool, and a claimless member rides along at zero node cost); they just no longer place members on different pools.
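The recovery path described above can be sketched roughly as follows. This is an illustrative sketch only, not the scheduler's actual code: `fill`, `place_on_cluster`, and the shape of the returned dict are hypothetical names chosen for the example. Only the InsufficientCapacity reason comes from the description.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: given a cluster name, return the name of the single
# pool the whole engine was placed on, or None if the cluster rejected it.
PlaceFn = Callable[[str], Optional[str]]


def fill(clusters: Sequence[str], place_on_cluster: PlaceFn) -> dict:
    """Try clusters in order and return the first whole-engine placement.

    Assumption: a rejection is just a None result, so the loop can move on
    to the next cluster instead of accepting a cross-pool placement.
    """
    for cluster in clusters:
        pool = place_on_cluster(cluster)
        if pool is not None:
            return {"cluster": cluster, "pool": pool}
        # Rejected on this cluster: recoverable, try the next one.
    # No cluster had a single pool that fits the whole engine.
    return {"reason": "InsufficientCapacity"}
```

The point of the shape is that every outcome is either a runnable placement or an explicit reason, never a placement that hangs silently.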

Co-scheduling a gang across pools that genuinely share a fabric — an operator exposing low-utilization nodes beside the GPU nodes as a separate pool, say — needs the scheduler to model fabric, which it doesn't yet. That's tracked in #149.

The change is concentrated in _place_engine, now reduced to "place the whole engine on the first single pool that fits, or reject." _place_engine_split and _place_member are gone. That's the part worth reviewer attention.
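For readers who have not opened the diff, the "first single pool that fits, or reject" rule might look something like the sketch below. It is not the real _place_engine: the `Pool` and `Member` types, the per-device-class capacity model, and the function name `place_engine` are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    free: dict[str, int]  # hypothetical model: free devices per device class


@dataclass
class Member:
    name: str
    device: Optional[str] = None  # None models a claimless member
    count: int = 0                # devices claimed; 0 for a claimless member


def place_engine(members: list[Member], pools: list[Pool]) -> Optional[str]:
    """Return the first pool that fits every member, or None to reject."""
    # Sum demand per device class across the whole gang. Claimless members
    # add nothing, so they ride along at zero cost.
    demand: Counter = Counter()
    for m in members:
        if m.device is not None:
            demand[m.device] += m.count

    for pool in pools:
        if all(pool.free.get(dev, 0) >= n for dev, n in demand.items()):
            return pool.name
    # No single pool fits. Reject rather than split across pools.
    return None
```

A usage example under the same assumptions: a gang needing 16 devices skips a pool with 8 free and lands on one with 16, and is rejected outright when only the smaller pool exists.

```python
pools = [Pool("small", {"h100": 8}), Pool("big", {"h100": 16})]
gang = [Member("leader", "h100", 8), Member("worker", "h100", 8), Member("router")]

print(place_engine(gang, pools))      # big
print(place_engine(gang, pools[:1]))  # None
```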

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

Signed-off-by: Nic Cope <nicc@rk0n.org>

Copilot

Pull request overview

Updates the compose-model-deployment scheduler to ensure all members of an engine are placed onto a single GPU pool within a cluster (or the engine is rejected for that cluster), avoiding cross-pool placements that can silently hang multi-node collectives when pools don’t share an interconnect fabric.

Changes:

Remove cross-pool “split placement” for engine members; place whole engines on the first single pool that satisfies all members and capacity.
Update scheduler documentation/comments to reflect the new single-pool constraint and its motivation (#149).
Expand scheduling tests to cover rejection when no single pool fits and successful placement onto an alternate cluster when one cluster can’t fit the whole engine.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
functions/compose-model-deployment/function/scheduling.py	Enforces single-pool-per-engine placement and removes split-member placement logic.
functions/compose-model-deployment/tests/test_scheduling.py	Updates/extends tests to validate rejection and cross-cluster fallback under the new placement rules.

dennis-upbound

lgtm

negz marked this pull request as ready for review June 15, 2026 19:25

Copilot AI review requested due to automatic review settings June 15, 2026 19:25

Copilot started reviewing on behalf of negz June 15, 2026 19:25

Copilot AI reviewed Jun 15, 2026

Comment thread functions/compose-model-deployment/function/scheduling.py

dennis-upbound approved these changes Jun 16, 2026

dennis-upbound merged commit 8d49848 into main Jun 16, 2026
5 checks passed

negz mentioned this pull request Jun 16, 2026

EKS has no autoscaler installed #166

Closed

negz deleted the splitting-hairs branch June 16, 2026 16:56
