Place each engine on a single pool instead of splitting across pools#150
Merged
Conversation
The scheduler placed an engine's members on one pool when it could, and split them across pools when no single pool fit. Splitting is unsafe: the scheduler can't reason about interconnect fabric. Pool identity is the finest grain it has, so when a gang's members land on different pools it can't tell whether those pools share a fabric. A Leader/Worker gang doing tensor or pipeline parallel talks over its pool's fabric - NVLink within a node, InfiniBand within a pool. Scatter its members across pools that don't share a fabric and the collective never forms: the gang sits NotReady with no signal saying why. Splitting doesn't degrade the interconnect, it removes it - handing back a placement that can't run rather than a slower one. This change places every member of an engine on a single pool, or rejects the engine on that cluster. Rejection is safe and recoverable - the fill phase tries another cluster, and the deployment surfaces InsufficientCapacity if none fit - whereas a cross-fabric placement is a silent hang. Per-member nodeSelectors stay (a member can still claim different devices from the shared pool, and a claimless member rides along at zero node cost), they just no longer place members on different pools. Co-scheduling a gang across pools that genuinely share a fabric - an operator exposing low-utilization nodes beside the GPU nodes as a separate pool, say - needs the scheduler to model fabric, which it doesn't yet. That's tracked in #149. Signed-off-by: Nic Cope <nicc@rk0n.org>
There was a problem hiding this comment.
Pull request overview
Updates the compose-model-deployment scheduler to ensure all members of an engine are placed onto a single GPU pool within a cluster (or the engine is rejected for that cluster), avoiding cross-pool placements that can silently hang multi-node collectives when pools don’t share an interconnect fabric.
Changes:
- Remove cross-pool “split placement” for engine members; place whole engines on the first single pool that satisfies all members and capacity.
- Update scheduler documentation/comments to reflect the new single-pool constraint and its motivation (#149).
- Expand scheduling tests to cover rejection when no single pool fits and successful placement onto an alternate cluster when one cluster can’t fit the whole engine.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| functions/compose-model-deployment/function/scheduling.py | Enforces single-pool-per-engine placement and removes split-member placement logic. |
| functions/compose-model-deployment/tests/test_scheduling.py | Updates/extends tests to validate rejection and cross-cluster fallback under the new placement rules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of your changes
The scheduler placed an engine's members on one pool when it could, and split them across pools when no single pool fit. Splitting is unsafe, as @dennis-upbound pointed out reviewing #137: the scheduler can't reason about interconnect fabric. Pool identity is the finest grain it has, so when a gang's members land on different pools it can't tell whether those pools share a fabric.
A Leader/Worker gang doing tensor or pipeline parallel talks over its pool's fabric — NVLink within a node, InfiniBand within a pool. Scatter its members across pools that don't share a fabric and the collective never forms: the gang sits NotReady with no signal saying why. Splitting doesn't degrade the interconnect, it removes it — handing back a placement that can't run rather than a slower one.
This change places every member of an engine on a single pool, or rejects the engine on that cluster. Rejection is safe and recoverable — the fill phase tries another cluster, and the deployment surfaces
InsufficientCapacityif none fit — whereas a cross-fabric placement is a silent hang. Per-membernodeSelectors stay (a member can still claim different devices from the shared pool, and a claimless member rides along at zero node cost); they just no longer place members on different pools.Co-scheduling a gang across pools that genuinely share a fabric — an operator exposing low-utilization nodes beside the GPU nodes as a separate pool, say — needs the scheduler to model fabric, which it doesn't yet. That's tracked in #149.
The change is concentrated in
_place_engine, now reduced to "place the whole engine on the first single pool that fits, or reject."_place_engine_splitand_place_memberare gone. That's the part worth reviewer attention.I have:
nix flake check(or./nix.sh flake check) and made sure it passes.git commit -s.