
Place each engine on a single pool instead of splitting across pools#150

Merged: dennis-upbound merged 1 commit into main from splitting-hairs on Jun 16, 2026
@negz negz commented Jun 15, 2026

Collaborator

Description of your changes

Previously, the scheduler placed an engine's members on one pool when it could, and split them across pools when no single pool fit. Splitting is unsafe, as @dennis-upbound pointed out while reviewing #137: the scheduler can't reason about interconnect fabric. Pool identity is the finest grain it has, so when a gang's members land on different pools it can't tell whether those pools share a fabric.

A Leader/Worker gang running tensor or pipeline parallelism communicates over its pool's fabric: NVLink within a node, InfiniBand within a pool. Scatter its members across pools that don't share a fabric and the collective never forms: the gang sits NotReady with no signal saying why. Splitting doesn't degrade the interconnect; it removes it, handing back a placement that can't run rather than a slower one.

This change places every member of an engine on a single pool, or rejects the engine on that cluster. Rejection is safe and recoverable — the fill phase tries another cluster, and the deployment surfaces InsufficientCapacity if none fit — whereas a cross-fabric placement is a silent hang. Per-member nodeSelectors stay (a member can still claim different devices from the shared pool, and a claimless member rides along at zero node cost); they just no longer place members on different pools.
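The fallback described above can be sketched roughly as follows. This is an illustration only, not the code in scheduling.py: the `fill` function, the representation of a cluster as a list of free GPU counts per pool, and the returned tuple shape are all assumptions. Only the `InsufficientCapacity` reason comes from the description.

```python
from typing import Optional

# Reason string taken from the PR description; everything else here is hypothetical.
INSUFFICIENT_CAPACITY = "InsufficientCapacity"


def fill(clusters: dict[str, list[int]], demand: int) -> tuple[Optional[str], Optional[str]]:
    """Try clusters in order and return (cluster, None) or (None, reason).

    Each cluster is modelled as a list of free GPU counts, one entry per pool.
    """
    for cluster, pools in clusters.items():
        # Single-pool rule: one pool must hold the whole engine.
        # Summing free capacity across pools is deliberately not allowed.
        if any(free >= demand for free in pools):
            return cluster, None
    # No cluster had a single pool large enough, so surface a reason
    # instead of returning a split placement.
    return None, INSUFFICIENT_CAPACITY


clusters = {"east": [4, 4], "west": [8]}
placed = fill(clusters, demand=8)
rejected = fill({"east": [4, 4]}, demand=8)
```

In this sketch "east" has eight free GPUs in total but no single pool with eight, so the engine is rejected there and lands on "west". With only "east" available, the result is a rejection with a reason rather than a placement that would hang.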

Co-scheduling a gang across pools that genuinely share a fabric — an operator exposing low-utilization nodes beside the GPU nodes as a separate pool, say — needs the scheduler to model fabric, which it doesn't yet. That's tracked in #149.

The change is concentrated in _place_engine, now reduced to "place the whole engine on the first single pool that fits, or reject." _place_engine_split and _place_member are gone. That's the part worth reviewer attention.
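As a rough illustration of "place the whole engine on the first single pool that fits, or reject," here is a minimal sketch. It is not the actual `_place_engine`: the `Member` and `Pool` types, the per-device-class `free` map, and the `place_engine` signature are assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Hypothetical gang member: the device class it claims and how many."""
    name: str
    device: Optional[str] = None  # None models a claimless member
    count: int = 0


@dataclass
class Pool:
    """Hypothetical pool: free device count per device class."""
    name: str
    free: dict[str, int]


def place_engine(pools: list[Pool], members: list[Member]) -> Optional[str]:
    """Return the first pool that fits every member, or None to reject."""
    # Aggregate demand per device class; claimless members add nothing.
    need: Counter = Counter()
    for m in members:
        if m.device is not None:
            need[m.device] += m.count

    for pool in pools:
        if all(pool.free.get(dev, 0) >= n for dev, n in need.items()):
            # Reserve capacity only once the whole engine is known to fit.
            for dev, n in need.items():
                pool.free[dev] -= n
            return pool.name
    return None  # no single pool fits: reject rather than split


pools = [Pool("a", {"h100": 4}), Pool("b", {"h100": 8, "l4": 1})]
gang = [Member("leader", "h100", 4), Member("worker", "h100", 4), Member("router")]
first = place_engine(pools, gang)
second = place_engine(pools, gang)
```

Here the first call lands the whole gang on pool "b". The second call is rejected even though pool "a" still has four free devices, because the sketch never places part of an engine.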

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz marked this pull request as ready for review June 15, 2026 19:25
Copilot AI review requested due to automatic review settings June 15, 2026 19:25

Copilot AI left a comment


Pull request overview

Updates the compose-model-deployment scheduler to ensure all members of an engine are placed onto a single GPU pool within a cluster (or the engine is rejected for that cluster), avoiding cross-pool placements that can silently hang multi-node collectives when pools don’t share an interconnect fabric.

Changes:

  • Remove cross-pool “split placement” for engine members; place whole engines on the first single pool that satisfies all members and capacity.
  • Update scheduler documentation/comments to reflect the new single-pool constraint and its motivation (#149).
  • Expand scheduling tests to cover rejection when no single pool fits and successful placement onto an alternate cluster when one cluster can’t fit the whole engine.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| functions/compose-model-deployment/function/scheduling.py | Enforces single-pool-per-engine placement and removes split-member placement logic. |
| functions/compose-model-deployment/tests/test_scheduling.py | Updates/extends tests to validate rejection and cross-cluster fallback under the new placement rules. |


Comment thread functions/compose-model-deployment/function/scheduling.py

@dennis-upbound dennis-upbound left a comment

Collaborator


lgtm

@dennis-upbound dennis-upbound merged commit 8d49848 into main Jun 16, 2026
5 checks passed
@negz negz deleted the splitting-hairs branch June 16, 2026 16:56