
feat: continuous-batching scheduler (#449 M3 Stage 2a-ii) #468

Merged
inureyes merged 1 commit into main from feature/issue-449-batch-scheduler
Jun 28, 2026
Conversation

@inureyes

Member

Summary

The second half of Stage 2a (#449 M3): a minimal continuous-batching scheduler over the ragged decode graph (PR #466). B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

What is in it

  • Shim xla_llama_refresh_mirror: pulls the live rank-5 KV of all active slots back to the host mirror (d2h). A mid-stream admit is refresh + prefill_slot(freed) + commit: refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, and commit re-uploads, so admitting one sequence does not disturb the others.
  • Driver src/bin/llama_sched.rs: B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots in lockstep (inactive rows are masked and ignored), evicts at the cap or an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 runs the scheduler and asserts each request's collected stream equals its reference.
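The admit / decode / evict / recycle loop and the two-phase check described above can be sketched roughly as follows. This is a toy illustration, not the code in llama_sched.rs: `Request`, `Slot`, `toy_step`, `prefill`, `reference`, `schedule`, and `demo_requests` are hypothetical names, the "model" is a deterministic hash step standing in for the ragged decode graph, and caps are assumed to be at least 1. Because the toy state is per-slot by construction, the equivalence holds trivially here; the sketch only shows the control flow that the real driver validates against a real KV cache.

```rust
use std::collections::VecDeque;

/// A queued request (hypothetical shape): a prompt and a cap on generated tokens.
#[derive(Clone)]
struct Request {
    id: usize, // assumed to be 0..N, used to index the result table
    prompt: Vec<u32>,
    cap: usize, // assumed >= 1
}

/// Per-slot decode state. The real driver keeps a KV region per slot;
/// this toy keeps a single running hash.
#[derive(Clone)]
struct Slot {
    req: Request,
    state: u64,
    out: Vec<u32>,
}

const EOS: u32 = 0;

/// Toy stand-in for prefill: fold the prompt into an initial state.
fn prefill(prompt: &[u32]) -> u64 {
    prompt
        .iter()
        .fold(0xcbf29ce484222325u64, |h, &t| (h ^ t as u64).wrapping_mul(0x100000001b3))
}

/// Toy stand-in for one decode step: the next token depends only on the
/// sequence's own state, never on its batch peers.
fn toy_step(state: u64) -> (u32, u64) {
    let next = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (((next >> 33) % 97) as u32, next)
}

/// Phase 1: capped single-sequence reference for one request.
fn reference(req: &Request) -> Vec<u32> {
    let mut state = prefill(&req.prompt);
    let mut out = Vec::new();
    while out.len() < req.cap {
        let (tok, next) = toy_step(state);
        state = next;
        out.push(tok);
        if tok == EOS {
            break;
        }
    }
    out
}

/// Phase 2: B slots serve a FIFO queue; freed slots are recycled mid-stream.
fn schedule(requests: &[Request], b: usize) -> Vec<Vec<u32>> {
    let mut queue: VecDeque<Request> = requests.iter().cloned().collect();
    let mut slots: Vec<Option<Slot>> = vec![None; b];
    let mut done = vec![Vec::new(); requests.len()];
    loop {
        // Admit: fill every free slot from the front of the queue.
        for slot in slots.iter_mut().filter(|s| s.is_none()) {
            match queue.pop_front() {
                Some(req) => {
                    let state = prefill(&req.prompt);
                    *slot = Some(Slot { req, state, out: Vec::new() });
                }
                None => break,
            }
        }
        if slots.iter().all(|s| s.is_none()) {
            break; // queue drained and no slot active
        }
        // Decode all active slots in lockstep; empty slots are skipped (masked).
        for slot in slots.iter_mut() {
            let finished = match slot {
                Some(s) => {
                    let (tok, next) = toy_step(s.state);
                    s.state = next;
                    s.out.push(tok);
                    s.out.len() >= s.req.cap || tok == EOS
                }
                None => false,
            };
            // Evict at the cap or on EOS; the slot is free for the next admit.
            if finished {
                let s = slot.take().unwrap();
                done[s.req.id] = s.out;
            }
        }
    }
    done
}

/// Six requests with varied prompt lengths and caps in 12..=14.
fn demo_requests() -> Vec<Request> {
    (0..6)
        .map(|id| Request {
            id,
            prompt: (1..=(3 + id as u32)).collect(),
            cap: 12 + id % 3,
        })
        .collect()
}
```

Calling `schedule(&demo_requests(), 3)` and comparing each entry against `reference(..)` for the same request mirrors the Phase 2 assertion: three requests are admitted initially and the remaining three join as slots free up.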

Validation (reference-equivalence under recycling)

Because caps vary, requests evict at different steps and later requests admit mid-stream while others are still decoding, so a passing run proves refresh/commit preserves the active slots and a recycled slot starts clean.

| Device | B | N | caps | admits | result |
| --- | --- | --- | --- | --- | --- |
| CPU local-task | 3 | 6 | 12-14 | 6 (3 initial + 3 mid-stream) | 6/6 reference-exact |
| CUDA GB10 | 4 | 8 | 12-39 | 8 (4 initial + 4 mid-stream) | 8/8 reference-exact |

Findings

  • Continuous batching is correct. Admit, evict, slot recycling, and mid-stream join all preserve per-request output: each sequence is identical to its standalone run.
  • The host-mirror admit is correct but not the throughput path. Each admit does a full rank-5 KV round-trip (d2h refresh + h2d commit), so the tok/s here is a correctness result, not a throughput one (Stage 1 measured raw batched throughput). 2b will replace it with a device-side write of only the admitted slot's region.
  • Static B with masked inactive slots works. A half-full batch produces the same per-request output as a full one.
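The host-mirror admit sequence (refresh, prefill_slot, commit) can be illustrated with a toy model. This is a sketch under stated assumptions, not the shim's actual interface: the `Kv` struct, its methods, and `admit` are hypothetical, each slot's KV is flattened to a single row of floats rather than a rank-5 tensor, and `decode` just adds 1.0 to every element of each active row.

```rust
/// Toy KV cache: one flat row per slot (the real cache is rank-5).
struct Kv {
    device: Vec<Vec<f32>>, // live state, advanced by decode
    mirror: Vec<Vec<f32>>, // host copy, stale between refreshes
}

impl Kv {
    fn new(slots: usize, row: usize) -> Self {
        Kv {
            device: vec![vec![0.0; row]; slots],
            mirror: vec![vec![0.0; row]; slots],
        }
    }

    /// Stand-in for a decode step: advance the device rows of active slots.
    fn decode(&mut self, active: &[bool]) {
        for (row, &is_active) in self.device.iter_mut().zip(active) {
            if is_active {
                for x in row.iter_mut() {
                    *x += 1.0;
                }
            }
        }
    }

    /// d2h: pull the live rows of active slots into the host mirror.
    fn refresh_mirror(&mut self, active: &[bool]) {
        for (i, &is_active) in active.iter().enumerate() {
            if is_active {
                self.mirror[i].copy_from_slice(&self.device[i]);
            }
        }
    }

    /// Overwrite only the freed slot's row in the host mirror.
    /// Assumes `prompt_kv` has the same length as a row.
    fn prefill_slot(&mut self, slot: usize, prompt_kv: &[f32]) {
        self.mirror[slot].copy_from_slice(prompt_kv);
    }

    /// h2d: re-upload the whole mirror to the device.
    fn commit(&mut self) {
        self.device = self.mirror.clone();
    }
}

/// Mid-stream admit: refresh + prefill_slot(freed) + commit.
fn admit(kv: &mut Kv, active: &mut [bool], slot: usize, prompt_kv: &[f32]) {
    kv.refresh_mirror(active); // without this, commit would upload stale rows
    kv.prefill_slot(slot, prompt_kv);
    kv.commit();
    active[slot] = true;
}
```

In this toy, decoding slots 0 and 1 twice and then admitting into slot 2 leaves rows 0 and 1 at their advanced values and row 2 holding only the new prompt's data. Skipping `refresh_mirror` would instead reset the active rows to whatever the mirror last held, which is the failure the refresh step guards against.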

Full result in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan in spike/openxla/STAGE2_DESIGN.md.

Next

2b: productize an XlaBatchEngine in mlxcel-xla (contiguous KV, fixed B_max, a device-side slot write instead of the host-mirror round-trip, greedy, CLI/bench), then 2c the common BatchEngine trait + server integration.

Refs #449

The second half of Stage 2a: a minimal continuous-batching scheduler over the ragged decode graph (2a-i). B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership: every request's output matches its independent single-seq reference regardless of when it was admitted or which peers shared its batch. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

Shim (spike/iree-ffi/iree_gate.c): xla_llama_refresh_mirror pulls the live rank-5 KV of all active slots back to the host mirror (d2h). Mid-stream admit is refresh + prefill_slot(freed) + commit: refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, commit re-uploads, so admitting one sequence does not disturb the others.

Driver spike/iree-ffi/src/bin/llama_sched.rs: B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots in lockstep (inactive rows are masked and ignored, overwritten on the next admit), evicts at the cap or an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 runs the scheduler and asserts each request's collected stream equals its reference.

Results, all reference-exact: CPU local-task B=3 N=6 (6 admits, 3 of them mid-stream); CUDA GB10 B=4 N=8 with caps spread 12-39 (8 admits, 4 mid-stream). The mid-stream admits each ran refresh_mirror while other slots were live and all sequences still matched, so the refresh and commit round-trip preserves active slots and a recycled slot starts clean.

The host-mirror admit does a full rank-5 KV round-trip (d2h refresh + h2d commit) per admit. That is correct but not the throughput path, so the tok/s here is a correctness result, not a throughput one (Stage 1 measured raw batched throughput). 2b will replace the round-trip with a device-side write of only the admitted slot's region.
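To make the cost argument concrete, here is a back-of-the-envelope sketch of bytes moved per admit. It rests on assumptions not stated in the PR: the function name `admit_traffic` is hypothetical, the full cache is treated as moving once in each direction (an upper bound, since refresh only pulls active slots), and each of the B slots is assumed to own an equal share of the cache.

```rust
/// Bytes moved per admit under two strategies (illustrative only).
/// Returns (host-mirror round trip, device-side single-slot write).
fn admit_traffic(total_kv_bytes: u64, b: u64) -> (u64, u64) {
    // d2h refresh + h2d commit: the whole cache crosses the bus twice.
    let round_trip = 2 * total_kv_bytes;
    // Device-side write: only the admitted slot's region is touched.
    let slot_write = total_kv_bytes / b;
    (round_trip, slot_write)
}
```

Under these assumptions the round trip moves roughly 2B times as many bytes as a single-slot write, so the gap grows with the batch size.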

Full result in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan is in spike/openxla/STAGE2_DESIGN.md. Next is 2b: productize an XlaBatchEngine in mlxcel-xla.

Refs #449
@inureyes inureyes added type:enhancement New features, capabilities, or significant additions status:review Under review priority:medium Medium priority area:architecture Architecture and code structure changes labels Jun 28, 2026
@inureyes inureyes merged commit 8edd9f6 into main Jun 28, 2026
5 checks passed
@inureyes inureyes deleted the feature/issue-449-batch-scheduler branch June 28, 2026 13:43
@inureyes inureyes added status:done Completed and removed status:review Under review labels Jun 28, 2026