feat: productize the continuous-batching engine (#449 M3 Stage 2b) by inureyes · Pull Request #469 · lablup/mlxcel

inureyes · 2026-06-28T15:04:37Z

Summary

Stage 2b of the OpenXLA/IREE throughput milestone (#449 M3): productize the Stage 2a spike into mlxcel-xla as XlaBatchEngine, a continuous-batching engine. B_max slots share one rank-5 KV cache and serve a request stream; every active slot advances one token per step through the ragged decode graph, and requests of different lengths join and leave the batch at different times.
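The stepping behaviour described above can be illustrated with a small device-free simulation. This is only a sketch under stated assumptions: `batched_steps` is a hypothetical helper, not part of mlxcel-xla, and it models each request purely by the number of tokens it still has to generate, ignoring prefill cost and the real ragged decode graph.

```rust
use std::collections::VecDeque;

/// Hypothetical model: count the decode steps needed to serve `lengths`
/// (tokens to generate per request) with `b_max` slots. Waiting requests
/// are admitted into free slots between steps.
fn batched_steps(lengths: &[usize], b_max: usize) -> usize {
    // Requests that generate nothing never occupy a slot.
    let mut waiting: VecDeque<usize> =
        lengths.iter().copied().filter(|&n| n > 0).collect();
    // Each slot holds the remaining token count of the request it serves.
    let mut slots: Vec<Option<usize>> = vec![None; b_max];
    let mut steps = 0;
    loop {
        // Admit: fill every free slot from the front of the queue.
        for slot in slots.iter_mut() {
            if slot.is_none() {
                *slot = waiting.pop_front();
            }
        }
        if slots.iter().all(Option::is_none) {
            return steps;
        }
        // One step: every active slot advances by exactly one token.
        for slot in slots.iter_mut() {
            *slot = match *slot {
                Some(1) => None, // request finishes; the slot is recycled
                Some(n) => Some(n - 1),
                None => None,
            };
        }
        steps += 1;
    }
}
```

With two slots, the requests `[3, 1, 2, 4, 2]` finish in 7 steps in this model, versus 12 when served one at a time, because short requests leave early and waiting ones take over their slots mid-stream.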

The headline change over the spike: admit is now a device-side slot write. The Stage 2a scheduler (PR #468) admitted a request with a full host-mirror round-trip (d2h the whole rank-5 KV, overwrite one slot, h2d back). This PR replaces that with iree_hal_device_transfer_d2d of only the admitted slot's region, so live slots are never moved off-device. It is both cheaper and simpler (no refresh/commit), and inherently non-disturbing because only one slot's bytes change.
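To make "only one slot's bytes change" concrete, here is a minimal host-side sketch of the slot-region arithmetic. It assumes a hypothetical layout with the slot axis outermost, `[b_max, layers, 2, max_seq, kv_dim]`, which may not match the actual rank-5 layout in the shim. `KvShape` and `admit_slot` are illustrative names, and a plain slice stands in for the device buffer that the PR updates with `iree_hal_device_transfer_d2d`.

```rust
/// Hypothetical KV shape, slot axis outermost:
/// [b_max, layers, 2 (K and V), max_seq, kv_dim]. The real layout may differ.
#[derive(Clone, Copy)]
struct KvShape {
    b_max: usize,
    layers: usize,
    max_seq: usize,
    kv_dim: usize,
}

impl KvShape {
    /// Number of elements owned by a single slot.
    fn slot_len(&self) -> usize {
        self.layers * 2 * self.max_seq * self.kv_dim
    }

    /// Element range of one slot's region inside the shared cache.
    fn slot_range(&self, slot: usize) -> std::ops::Range<usize> {
        assert!(slot < self.b_max, "slot index out of range");
        let start = slot * self.slot_len();
        start..start + self.slot_len()
    }

    /// Total number of elements in the shared cache.
    fn total_len(&self) -> usize {
        self.b_max * self.slot_len()
    }
}

/// Stand-in for the device-side admit: copy a freshly prefilled request's KV
/// into its slot's region only. Every other slot's elements stay untouched.
fn admit_slot(cache: &mut [f32], shape: KvShape, slot: usize, prompt_kv: &[f32]) {
    assert_eq!(cache.len(), shape.total_len());
    let range = shape.slot_range(slot);
    assert_eq!(prompt_kv.len(), range.len());
    cache[range].copy_from_slice(prompt_kv);
}
```

Under this assumed layout the admitted region is one contiguous range, so a single copy covers it. If the slot axis were not outermost, the same idea would need one copy per strided chunk.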

What's here

C shim (csrc/xla_iree.c): rank-5 KV state + xla_llama_ragged_reset / xla_llama_prefill_slot (device-side d2d slot write) / xla_llama_decode_ragged. The xla_llama_create signature is unchanged: the spike's vestigial vocab arg is dropped rather than carried over.
Assets: bundled ragged decode graphs for B_max ∈ {4, 8} (emitter decode-ragged-argmax, module @decode_step so the shim's decode_step.main resolves them). The graph is selected by b_max at load; supporting another slot count means regenerating an asset (see the assets README).
iree.rs: ragged FFI + IreeRaggedLlama owner; shared create_ctx / compile_prefill_and helpers (dedup with the single-seq path); IreeLlama::prefill_first for the reference path.
batch.rs: XlaBatchEngine (submit / pump / cancel, per-request EngineEvents, greedy), a backend-neutral Scheduler split out so its admit/evict/cancel bookkeeping is unit-tested without a device, and XlaReferenceEngine for validation.
examples/xla_batch_bench.rs (required-features = ["xla-iree"]): the reference-equivalence + throughput harness.
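The admit/evict/cancel bookkeeping mentioned for the backend-neutral Scheduler could look roughly like the sketch below. It is not the PR's implementation: `SlotScheduler`, `RequestId`, and the method signatures are hypothetical, chosen only to show why this logic can be unit-tested without a device.

```rust
use std::collections::VecDeque;

type RequestId = u64;

/// Hypothetical device-free bookkeeping: which request owns which slot,
/// and which requests are still waiting for one.
struct SlotScheduler {
    slots: Vec<Option<RequestId>>,
    waiting: VecDeque<RequestId>,
}

impl SlotScheduler {
    fn new(b_max: usize) -> Self {
        Self { slots: vec![None; b_max], waiting: VecDeque::new() }
    }

    /// Queue a request; it gets a slot on a later `admit`.
    fn submit(&mut self, id: RequestId) {
        self.waiting.push_back(id);
    }

    /// Move waiting requests into free slots. Returns the (slot, id) pairs
    /// that the caller would then prefill on the device.
    fn admit(&mut self) -> Vec<(usize, RequestId)> {
        let mut admitted = Vec::new();
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.is_none() {
                match self.waiting.pop_front() {
                    Some(id) => {
                        *slot = Some(id);
                        admitted.push((i, id));
                    }
                    None => break, // nothing left to admit
                }
            }
        }
        admitted
    }

    /// Free the slot of a finished request; returns the freed slot index.
    fn evict(&mut self, id: RequestId) -> Option<usize> {
        let i = self.slots.iter().position(|s| *s == Some(id))?;
        self.slots[i] = None;
        Some(i)
    }

    /// Cancel a request whether it is still queued or already in a slot.
    fn cancel(&mut self, id: RequestId) -> bool {
        if let Some(pos) = self.waiting.iter().position(|w| *w == id) {
            self.waiting.remove(pos);
            return true;
        }
        self.evict(id).is_some()
    }

    /// Number of slots currently serving a request.
    fn active(&self) -> usize {
        self.slots.iter().flatten().count()
    }
}
```

Because the sketch only tracks ids and slot indices, slot recycling and mid-stream admission can be asserted directly in a plain `cargo test`, with the device work left to whatever consumes the pairs returned by `admit`.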

Validation

Reference-equivalence gate (the Stage 2a gate, now over the productized engine + device-side admit): every request's batched stream must equal its independent single-sequence reference, regardless of when it was admitted or which peers shared its batch. All reference-exact, with mid-stream admit and slot recycling exercised in every run:

| Device | B_max | N | Result | Batched tok/s | Speedup vs sequential |
|---|---|---|---|---|---|
| CPU `local-task` | 4 | 8 | reference-exact | 6.25 | 3.8x |
| CPU `local-task` | 8 | 10 | reference-exact | 5.70 | 3.5x |
| CUDA GB10 | 4 | 8 | reference-exact | 17.38 | 4.0x |
| CUDA GB10 | 8 | 16 | reference-exact | 23.41 | 4.9x |

(tok/s is the batched throughput; the speedup is measured against the same bench's sequential single-seq baseline. Both bundled assets were validated on both devices.)
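The reference-equivalence gate reduces to an exact per-request comparison of token streams. A minimal sketch of such a check follows; the function name, the `RequestId` and `Token` aliases, and the map-based inputs are assumptions for illustration, not the harness's actual types.

```rust
use std::collections::HashMap;

type RequestId = u64;
type Token = u32;

/// Hypothetical gate: return the ids (sorted) whose batched stream is missing
/// or differs from the independent single-sequence reference. An empty result
/// means every request is reference-exact.
fn mismatched_requests(
    batched: &HashMap<RequestId, Vec<Token>>,
    reference: &HashMap<RequestId, Vec<Token>>,
) -> Vec<RequestId> {
    let mut bad: Vec<RequestId> = reference
        .iter()
        .filter(|(id, expected)| batched.get(*id) != Some(*expected))
        .map(|(id, _)| *id)
        .collect();
    // A batched stream with no reference at all also fails the gate.
    bad.extend(
        batched
            .keys()
            .filter(|id| !reference.contains_key(*id))
            .copied(),
    );
    bad.sort_unstable();
    bad
}
```

The comparison is deliberately exact rather than tolerance-based: with greedy decoding, a single differing token id would indicate that admission timing or batch peers influenced a request's output.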

Gates: cargo fmt is clean; cargo clippy -D warnings is clean (no-features, iree, and the example); cargo test -p mlxcel-xla passes 10 tests (including the Scheduler logic). Three build configs compile: no-features (CI-equivalent, engine absent), --features iree, and the xla-iree example.

Scope / non-goals

Backend-neutral at the request level (no server types), so the Stage 2c BatchEngine trait + server adapter wrap it unchanged. XlaBackend::supports_batched_serving() stays false until 2c wires it in.
Greedy only; fixed B_max from the bundled buckets; contiguous per-slot KV. Sampling, runtime bucket selection, paged KV, and chunked prefill are Stage 2c/2d.
Default and CI builds are unaffected: the engine is behind the iree feature and the example behind xla-iree.

Refs #449 (epic). Follows #462 (Stage 1), #466 (2a-i), #468 (2a-ii). Design: spike/openxla/STAGE2_DESIGN.md.

Productize the Stage 2a spike into mlxcel-xla as XlaBatchEngine: B_max slots share one rank-5 KV cache and serve a request stream, advancing every active slot one token per step through the ragged decode graph.

Replaces the spike's host-mirror admit (a full d2h+h2d KV round-trip) with a DEVICE-SIDE slot write: a new request's prompt KV is copied into just its slot's region via iree_hal_device_transfer_d2d, leaving live slots untouched.

- C shim (csrc/xla_iree.c): rank-5 KV state + xla_llama_ragged_reset / xla_llama_prefill_slot (device-side d2d) / xla_llama_decode_ragged; the create signature is unchanged.
- Assets: bundled ragged decode graphs for B_max in {4, 8} (emitter decode-ragged-argmax), selected by b_max at load.
- iree.rs: ragged FFI + IreeRaggedLlama owner; shared create_ctx / compile_prefill_and helpers; IreeLlama::prefill_first for references.
- batch.rs: XlaBatchEngine (submit / pump / cancel, per-request EngineEvents, greedy), a backend-neutral Scheduler split out for unit tests, and XlaReferenceEngine for validation.
- examples/xla_batch_bench.rs (required-features = xla-iree): the reference-equivalence + throughput harness.

Validated reference-exact (every request matches its independent single-seq reference) on CPU local-task (B=4 N=8, B=8 N=10) and CUDA GB10 (B=4 N=8, B=8 N=16), with mid-stream admit and slot recycling.

Backend-neutral at the request level, so the Stage 2c BatchEngine trait + server adapter wrap it unchanged; supports_batched_serving() stays false until 2c. Default and CI builds are unaffected: the engine is behind the iree feature and the example behind xla-iree.

inureyes added the area:architecture, priority:medium, type:enhancement, and status:done labels Jun 28, 2026

inureyes merged commit b8bb217 into main Jun 28, 2026
5 checks passed

inureyes deleted the feature/449-stage2b-xla-batch-engine branch June 28, 2026 15:11

This was referenced Jun 28, 2026

feat: serve the OpenXLA backend through the continuous-batching engine (#449 M3 Stage 2c) #470

Merged

feat: sample on the OpenXLA serve path (temperature/top-p/top-k) (#449 M3 Stage 2d) #471

Merged

inureyes merged 1 commit into main from feature/449-stage2b-xla-batch-engine
