feat: continuous-batching scheduler (#449 M3 Stage 2a-ii) by inureyes · Pull Request #468 · lablup/mlxcel

inureyes · 2026-06-28T13:42:24Z

Summary

The second half of Stage 2a (#449 M3): a minimal continuous-batching scheduler over the ragged decode graph (PR #466). B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

What is in it

Shim (spike/iree-ffi/iree_gate.c): xla_llama_refresh_mirror pulls the live rank-5 KV of all active slots back to the host mirror (d2h). A mid-stream admit is refresh + prefill_slot(freed) + commit: refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, and commit re-uploads, so admitting one sequence does not disturb the others.
Driver spike/iree-ffi/src/bin/llama_sched.rs: B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots in lockstep (inactive rows are masked and ignored), evicts a request when it reaches its cap or emits an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 runs the scheduler and asserts that each request's stream equals its reference.
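The admit / decode / evict / recycle loop and the two-phase check described above can be sketched as follows. This is a minimal toy model, not the code in llama_sched.rs: `next_token`, `Request`, `Slot`, `schedule`, `demo_requests`, and the `EOS` id are all hypothetical names, and the "model" is a deterministic hash of a sequence's own context. Because each toy slot owns its context, equivalence holds by construction here; the real driver exercises the same property against a shared batched KV buffer.

```rust
use std::collections::VecDeque;

const EOS: u32 = 0; // hypothetical end-of-sequence token id

/// A queued request: a prompt plus a cap (>= 1) on generated tokens.
#[derive(Clone)]
struct Request {
    id: usize,
    prompt: Vec<u32>,
    cap: usize,
}

/// Per-slot state; `ctx` stands in for the slot's KV cache.
struct Slot {
    req: Request,
    ctx: Vec<u32>,
    out: Vec<u32>,
}

/// Toy deterministic "model": the next token depends only on the sequence's own context.
fn next_token(ctx: &[u32]) -> u32 {
    let h = ctx.iter().fold(17u32, |h, &t| h.wrapping_mul(31).wrapping_add(t));
    h % 50 // may produce EOS (0), so both eviction paths can occur
}

/// Phase 1: capped single-sequence reference for one request.
fn reference(req: &Request) -> Vec<u32> {
    let mut ctx = req.prompt.clone();
    let mut out = Vec::new();
    while out.len() < req.cap {
        let t = next_token(&ctx);
        ctx.push(t);
        out.push(t);
        if t == EOS {
            break;
        }
    }
    out
}

/// Phase 2: `b` slots serve the FIFO queue; returns per-request streams
/// and the number of admits that happened after the first decode step.
fn schedule(b: usize, requests: &[Request]) -> (Vec<Vec<u32>>, usize) {
    let mut queue: VecDeque<Request> = requests.iter().cloned().collect();
    let mut slots: Vec<Option<Slot>> = (0..b).map(|_| None).collect();
    let mut streams: Vec<Vec<u32>> = vec![Vec::new(); requests.len()];
    let (mut mid_stream_admits, mut step) = (0usize, 0usize);
    loop {
        // Admit: fill every free slot from the head of the queue.
        for slot in slots.iter_mut() {
            if slot.is_none() {
                if let Some(req) = queue.pop_front() {
                    if step > 0 {
                        mid_stream_admits += 1;
                    }
                    let ctx = req.prompt.clone(); // "prefill" touches this slot only
                    *slot = Some(Slot { req, ctx, out: Vec::new() });
                }
            }
        }
        if slots.iter().all(|s| s.is_none()) {
            break; // queue drained and every slot evicted
        }
        // Decode one lockstep step; empty (inactive) slots are skipped.
        for slot in slots.iter_mut() {
            let done = match slot {
                None => false,
                Some(s) => {
                    let t = next_token(&s.ctx);
                    s.ctx.push(t);
                    s.out.push(t);
                    t == EOS || s.out.len() >= s.req.cap
                }
            };
            if done {
                // Evict and recycle: the slot becomes free for the next admit.
                let Slot { req, out, .. } = slot.take().unwrap();
                streams[req.id] = out;
            }
        }
        step += 1;
    }
    (streams, mid_stream_admits)
}

/// Illustrative workload: 6 requests, varied prompt lengths, caps 12..=14.
fn demo_requests() -> Vec<Request> {
    (0..6usize)
        .map(|id| Request {
            id,
            prompt: (0..(3 + id as u32)).map(|k| 2 + k * (id as u32 + 1)).collect(),
            cap: 12 + id % 3,
        })
        .collect()
}

fn main() {
    let reqs = demo_requests();
    let refs: Vec<Vec<u32>> = reqs.iter().map(reference).collect();
    let (streams, mid) = schedule(3, &reqs);
    for r in &reqs {
        assert_eq!(streams[r.id], refs[r.id], "request {} diverged", r.id);
    }
    println!("{} requests reference-exact, {} mid-stream admits", reqs.len(), mid);
}
```

Running `schedule` with more slots than requests (for example `schedule(8, &reqs)`) leaves some slots permanently empty, which is the toy analogue of a half-full batch with masked rows.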

Validation (reference-equivalence under recycling)

Because caps vary, requests evict at different steps and later requests admit mid-stream while others are still decoding, so a passing run proves refresh/commit preserves the active slots and a recycled slot starts clean.

| Device | B (slots) | N (requests) | Token caps | Admits | Result |
|---|---|---|---|---|---|
| CPU `local-task` | 3 | 6 | 12-14 | 6 (3 initial + 3 mid-stream) | 6/6 reference-exact |
| CUDA GB10 | 4 | 8 | 12-39 | 8 (4 initial + 4 mid-stream) | 8/8 reference-exact |

Findings

Continuous batching is correct. Admit, evict, slot recycling, and mid-stream join all preserve per-request output: each sequence is identical to its standalone run.
The host-mirror admit is correct but is not the throughput path. Each admit does a full rank-5 KV round-trip (d2h refresh + h2d commit), so the tok/s measured here reflects the correctness path and should not be read as a throughput figure (Stage 1 measured raw batched throughput). Stage 2b will replace the round-trip with a device-side write of only the admitted slot's region.
Static B with masked inactive slots works. A half-full batch produces the same per-request output as a full one.
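To make the cost of the host-mirror admit concrete, here is a small sketch of the refresh + prefill_slot + commit sequence next to a slot-only write. It is a toy model under stated assumptions, not the shim's API: `KvCache`, `admit_via_mirror`, `admit_slot_only`, and `decode_touch` are hypothetical, the rank-5 KV is flattened to `b` slots of `slot_len` elements, refresh and commit are modelled as whole-buffer copies, and `copied` counts elements moved across the host/device boundary.

```rust
use std::ops::Range;

/// Toy stand-in for the batched KV cache (the real tensor is rank-5).
struct KvCache {
    slot_len: usize,
    device: Vec<f32>, // stands in for device memory
    mirror: Vec<f32>, // host mirror
    copied: usize,    // elements moved between host and device
}

impl KvCache {
    fn new(b: usize, slot_len: usize) -> Self {
        KvCache {
            slot_len,
            device: vec![0.0; b * slot_len],
            mirror: vec![0.0; b * slot_len],
            copied: 0,
        }
    }

    fn range(&self, slot: usize) -> Range<usize> {
        slot * self.slot_len..(slot + 1) * self.slot_len
    }

    /// Device-side view of one slot.
    fn slot(&self, slot: usize) -> &[f32] {
        &self.device[self.range(slot)]
    }

    /// Stand-in for decode steps advancing one slot's KV on the device.
    fn decode_touch(&mut self, slot: usize, delta: f32) {
        let r = self.range(slot);
        for x in &mut self.device[r] {
            *x += delta;
        }
    }

    /// d2h: capture the live device state, including advanced active slots.
    fn refresh_mirror(&mut self) {
        self.mirror.copy_from_slice(&self.device);
        self.copied += self.device.len();
    }

    /// Host-side write of the freed slot only; `kv.len()` must equal `slot_len`.
    fn prefill_slot(&mut self, slot: usize, kv: &[f32]) {
        let r = self.range(slot);
        self.mirror[r].copy_from_slice(kv);
    }

    /// h2d: re-upload the whole mirror.
    fn commit(&mut self) {
        self.device.copy_from_slice(&self.mirror);
        self.copied += self.mirror.len();
    }

    /// Mid-stream admit as described above: refresh + prefill_slot + commit.
    fn admit_via_mirror(&mut self, slot: usize, kv: &[f32]) {
        self.refresh_mirror();
        self.prefill_slot(slot, kv);
        self.commit();
    }

    /// Hypothetical 2b-style admit: write only the admitted slot's region.
    fn admit_slot_only(&mut self, slot: usize, kv: &[f32]) {
        let r = self.range(slot);
        self.device[r].copy_from_slice(kv);
        self.copied += kv.len();
    }
}

fn main() {
    let mut a = KvCache::new(3, 4);
    a.decode_touch(0, 1.0); // slots 0 and 2 are active and have advanced
    a.decode_touch(2, 2.0);
    a.admit_via_mirror(1, &[9.0; 4]); // admit into freed slot 1

    let mut b = KvCache::new(3, 4);
    b.decode_touch(0, 1.0);
    b.decode_touch(2, 2.0);
    b.admit_slot_only(1, &[9.0; 4]);

    assert_eq!(a.device, b.device); // same final device state either way
    println!("mirror admit moved {} elements, slot-only moved {}", a.copied, b.copied);
}
```

In this model, skipping `refresh_mirror` would let `commit` upload a stale mirror and overwrite the advanced state of slots 0 and 2, which is why the refresh step precedes the prefill. The element counts scale with the whole buffer for the mirror path and with a single slot for the slot-only path.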

Full result is in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan is in spike/openxla/STAGE2_DESIGN.md.

The second half of Stage 2a: a minimal continuous-batching scheduler over the ragged decode graph (2a-i). B slots serve N > B requests that finish at staggered times, so queued requests are admitted mid-stream into freed slots (slot recycling). Validated by reference-equivalence under dynamic membership: every request's output matches its independent single-seq reference regardless of when it was admitted or which peers shared its batch. With 2a-i this completes Stage 2a. Spike-only, so CI is unaffected.

Shim (spike/iree-ffi/iree_gate.c): xla_llama_refresh_mirror pulls the live rank-5 KV of all active slots back to the host mirror (d2h). Mid-stream admit is refresh + prefill_slot(freed) + commit: refresh captures the active slots' advanced state, prefill_slot overwrites only the freed slot, commit re-uploads, so admitting one sequence does not disturb the others.

Driver spike/iree-ffi/src/bin/llama_sched.rs: B slots, a FIFO queue of N requests with varied prompt lengths and varied token caps. The loop admits queued requests into free slots, ragged-decodes all active slots in lockstep (inactive rows are masked and ignored, overwritten on the next admit), evicts at the cap or an EOS token, and recycles the slot. Phase 1 captures each request's capped single-seq reference; Phase 2 runs the scheduler and asserts each request's collected stream equals its reference.

Results, all reference-exact: CPU local-task B=3 N=6 (6 admits, 3 of them mid-stream); CUDA GB10 B=4 N=8 with caps spread 12-39 (8 admits, 4 mid-stream). The mid-stream admits each ran refresh_mirror while other slots were live and all sequences still matched, so the refresh and commit round-trip preserves active slots and a recycled slot starts clean.

The host-mirror admit does a full rank-5 KV round-trip (d2h refresh + h2d commit) per admit. That is correct but not the throughput path, so the tok/s here is a correctness result, not a throughput one (Stage 1 measured raw batched throughput). 2b will replace the round-trip with a device-side write of only the admitted slot's region.

Full result in spike/iree-ffi/FINDINGS_sched.md; the Stage 2 plan is in spike/openxla/STAGE2_DESIGN.md. Next is 2b: productize an XlaBatchEngine in mlxcel-xla.

Refs #449

inureyes added the labels type:enhancement, status:review, priority:medium, and area:architecture on Jun 28, 2026

inureyes merged commit 8edd9f6 into main Jun 28, 2026
5 checks passed

inureyes deleted the feature/issue-449-batch-scheduler branch June 28, 2026 13:43

inureyes added the status:done label and removed the status:review label on Jun 28, 2026
