feat: continuous-batching scheduler (#449 M3 Stage 2a-ii) #468
Merged
The second half of Stage 2a: a minimal continuous-batching scheduler over the ragged decode graph (2a-i). B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership: every request's output matches its independent single-seq reference regardless of when it was admitted or which peers shared its batch. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

Shim (spike/iree-ffi/iree_gate.c): xla_llama_refresh_mirror pulls the live rank-5 KV of all active slots back to the host mirror (d2h). Mid-stream admit is refresh + prefill_slot(freed) + commit: refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, and commit re-uploads, so admitting one sequence does not disturb the others.

Driver (spike/iree-ffi/src/bin/llama_sched.rs): B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots in lockstep (inactive rows are masked and ignored, then overwritten on the next admit), evicts at the cap or an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 runs the scheduler and asserts each request's collected stream equals its reference.

Results, all reference-exact: CPU local-task B=3 N=6 (6 admits, 3 of them mid-stream); CUDA GB10 B=4 N=8 with caps spread 12-39 (8 admits, 4 mid-stream). The mid-stream admits each ran refresh_mirror while other slots were live and all sequences still matched, so the refresh and commit round-trip preserves active slots and a recycled slot starts clean.

The host-mirror admit does a full rank-5 KV round-trip (d2h refresh + h2d commit) per admit. That is correct but not the throughput path, so the tok/s here is a correctness result, not a throughput one (Stage 1 measured raw batched throughput). 2b will replace the round-trip with a device-side write of only the admitted slot's region.

Full result in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan is in spike/openxla/STAGE2_DESIGN.md. Next is 2b: productize an XlaBatchEngine in mlxcel-xla.

Refs #449
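The refresh + prefill_slot + commit sequence can be pictured with a toy model. The sketch below is not the shim's code: `Engine`, `KvStore`, and every method name are hypothetical stand-ins, each slot's rank-5 KV region is flattened to a small `Vec<f32>`, and `decode_step` just increments values to simulate the device state advancing. It only illustrates why the refresh has to come before the commit.

```rust
// Toy model of the host-mirror admit: refresh (d2h) + prefill_slot + commit (h2d).
// All names are hypothetical; the real shim operates on device buffers.

/// KV state for B slots; each Vec<f32> stands in for one slot's KV region.
#[derive(Clone, Debug, PartialEq)]
pub struct KvStore {
    pub slots: Vec<Vec<f32>>,
}

pub struct Engine {
    pub device: KvStore, // stands in for the live device-side KV
    pub mirror: KvStore, // host mirror
}

impl Engine {
    pub fn new(b: usize, slot_len: usize) -> Self {
        let kv = KvStore { slots: vec![vec![0.0; slot_len]; b] };
        Engine { device: kv.clone(), mirror: kv }
    }

    /// d2h: copy the live KV of the active slots back into the host mirror.
    pub fn refresh_mirror(&mut self, active: &[bool]) {
        for (i, &is_active) in active.iter().enumerate() {
            if is_active {
                self.mirror.slots[i].clone_from(&self.device.slots[i]);
            }
        }
    }

    /// Overwrite only the freed slot in the mirror with the new prefill KV.
    /// Assumes prefill_kv fits in the slot region (panics otherwise).
    pub fn prefill_slot(&mut self, slot: usize, prefill_kv: &[f32]) {
        let region = &mut self.mirror.slots[slot];
        region.iter_mut().for_each(|x| *x = 0.0); // recycled slot starts clean
        region[..prefill_kv.len()].copy_from_slice(prefill_kv);
    }

    /// h2d: re-upload the whole mirror (the full round-trip, not the fast path).
    pub fn commit(&mut self) {
        self.device = self.mirror.clone();
    }

    /// Stand-in for one decode step: advance the device KV of active slots only.
    pub fn decode_step(&mut self, active: &[bool]) {
        for (i, &is_active) in active.iter().enumerate() {
            if is_active {
                for x in self.device.slots[i].iter_mut() {
                    *x += 1.0;
                }
            }
        }
    }

    /// Mid-stream admit: refresh, overwrite the freed slot, commit.
    pub fn admit(&mut self, slot: usize, active: &mut [bool], prefill_kv: &[f32]) {
        self.refresh_mirror(active);
        self.prefill_slot(slot, prefill_kv);
        self.commit();
        active[slot] = true;
    }
}
```

In this model, skipping `refresh_mirror` would make `commit` upload a stale mirror and roll the active slots back to their state at the previous admit, which is the failure the refresh step guards against.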
This was referenced Jun 28, 2026
Merged
Summary
The second half of Stage 2a (#449 M3): a minimal continuous-batching scheduler over the ragged decode graph (PR #466).
B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

What is in it
- xla_llama_refresh_mirror: pulls the live rank-5 KV of all active slots back to the host mirror (d2h). Mid-stream admit is refresh + prefill_slot(freed) + commit, so admitting one sequence does not disturb the others (refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, commit re-uploads).
- src/bin/llama_sched.rs: B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots (inactive rows masked and ignored), evicts at the cap or an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 asserts each request's stream equals its reference.

Validation (reference-equivalence under recycling)
Because caps vary, requests evict at different steps and later requests admit mid-stream while others are still decoding, so a passing run proves refresh/commit preserves the active slots and a recycled slot starts clean.
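The two-phase check can be sketched as a small simulation. This is an illustrative model only, not llama_sched.rs: `Request`, `next_token`, `reference`, and `run_scheduler` are hypothetical names, and the "model" is a toy hash of each sequence's own history, so it cannot expose the KV-level bugs the real run guards against. It shows the shape of the loop (admit, lockstep decode, evict at cap or EOS, recycle) and of the per-request equality assertion. Caps are assumed to be at least 1.

```rust
use std::collections::VecDeque;

pub const EOS: u32 = 0;

#[derive(Clone)]
pub struct Request {
    pub prompt: Vec<u32>,
    pub cap: usize, // maximum number of generated tokens (assumed >= 1)
}

/// Hypothetical toy "model": the next token depends only on the
/// sequence's own history, never on its batch peers.
pub fn next_token(history: &[u32]) -> u32 {
    let h = history
        .iter()
        .fold(7u32, |acc, &t| acc.wrapping_mul(31).wrapping_add(t));
    h % 50
}

/// Phase 1: capped single-sequence reference.
pub fn reference(req: &Request) -> Vec<u32> {
    let mut hist = req.prompt.clone();
    let mut out = Vec::new();
    while out.len() < req.cap {
        let t = next_token(&hist);
        out.push(t);
        hist.push(t);
        if t == EOS {
            break;
        }
    }
    out
}

struct Slot {
    id: usize,
    hist: Vec<u32>,
    out: Vec<u32>,
    cap: usize,
}

/// Phase 2: B slots fed from a FIFO queue. Returns each request's stream
/// and the number of admits that happened after decoding had started.
pub fn run_scheduler(reqs: &[Request], b: usize) -> (Vec<Vec<u32>>, usize) {
    let mut queue: VecDeque<usize> = (0..reqs.len()).collect();
    let mut slots: Vec<Option<Slot>> = (0..b).map(|_| None).collect();
    let mut done = vec![Vec::new(); reqs.len()];
    let mut mid_stream = 0;
    let mut steps = 0usize;
    loop {
        // Admit queued requests into free slots (slot recycling).
        for s in slots.iter_mut() {
            if s.is_none() {
                if let Some(id) = queue.pop_front() {
                    if steps > 0 {
                        mid_stream += 1;
                    }
                    let r = &reqs[id];
                    *s = Some(Slot { id, hist: r.prompt.clone(), out: Vec::new(), cap: r.cap });
                }
            }
        }
        if slots.iter().all(|s| s.is_none()) {
            break; // queue drained and every slot evicted
        }
        // Decode all active slots in lockstep; empty slots are skipped (masked).
        for s in slots.iter_mut() {
            let finished = match s {
                Some(slot) => {
                    let t = next_token(&slot.hist);
                    slot.hist.push(t);
                    slot.out.push(t);
                    t == EOS || slot.out.len() >= slot.cap
                }
                None => false,
            };
            if finished {
                // Evict at the cap or on EOS and free the slot.
                let slot = s.take().unwrap();
                done[slot.id] = slot.out;
            }
        }
        steps += 1;
    }
    (done, mid_stream)
}
```

A caller would build N requests with varied caps, call `reference` on each, run `run_scheduler` with B < N, and assert that every returned stream equals its reference; with B=3 and N=6 the three requests beyond the initial fill are necessarily admitted mid-stream.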
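CPU local-task: B=3, N=6, 6 admits (3 mid-stream), reference-exact.
CUDA GB10: B=4, N=8, caps spread 12-39, 8 admits (4 mid-stream), reference-exact.

Findings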
Running with fewer than B active slots, with the inactive slots masked, works. A half-full batch produces the same per-request output as a full one.

Full result in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan is in spike/openxla/STAGE2_DESIGN.md.

Next
2b: productize an XlaBatchEngine in mlxcel-xla (contiguous KV, fixed B_max, a device-side slot write instead of the host-mirror round-trip, greedy, CLI/bench), then 2c: the common BatchEngine trait + server integration.

Refs #449