feat: productize the continuous-batching engine (#449 M3 Stage 2b) #469

Merged: inureyes merged 1 commit into main from feature/449-stage2b-xla-batch-engine on Jun 28, 2026

@inureyes (Member)

Summary

Stage 2b of the OpenXLA/IREE throughput milestone (#449 M3): productize the Stage 2a spike into mlxcel-xla as XlaBatchEngine, a continuous-batching engine. B_max slots share one rank-5 KV cache and serve a request stream; every active slot advances one token per step through the ragged decode graph, and requests of different lengths join and leave the batch at different times.
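The per-step behaviour described above can be sketched in a few lines of Rust. This is a toy model only: `Slot`, its fields, and `step` are hypothetical names, not the types in `batch.rs`, and the real step runs the ragged decode graph on device instead of bumping counters. It assumes each admitted slot has at least one token left to generate.

```rust
/// Per-slot decode state (hypothetical; not the engine's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Slot {
    pub request_id: u64,
    /// Tokens already present in this slot's KV region (ragged: differs per slot).
    pub pos: usize,
    /// Tokens this request still has to generate.
    pub remaining: usize,
}

/// Advance every active slot by exactly one token.
/// Returns the ids of requests that finished on this step, in slot order,
/// and frees their slots so later requests can reuse them.
pub fn step(slots: &mut [Option<Slot>]) -> Vec<u64> {
    let mut finished = Vec::new();
    for entry in slots.iter_mut() {
        // Empty slots are skipped; active slots each gain one token.
        let done = match entry.as_mut() {
            Some(slot) => {
                slot.pos += 1;
                slot.remaining = slot.remaining.saturating_sub(1);
                slot.remaining == 0
            }
            None => false,
        };
        if done {
            if let Some(slot) = entry.take() {
                finished.push(slot.request_id);
            }
        }
    }
    finished
}
```

Because each slot carries its own `pos`, a short request can finish and free its slot while a longer peer keeps decoding, which is the join/leave behaviour the summary describes.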

The headline change over the spike: admit is now a device-side slot write. The Stage 2a scheduler (PR #468) admitted a request with a full host-mirror round-trip (d2h the whole rank-5 KV, overwrite one slot, h2d back). This PR replaces that with iree_hal_device_transfer_d2d of only the admitted slot's region, so live slots are never moved off-device. It is both cheaper and simpler (no refresh/commit), and inherently non-disturbing because only one slot's bytes change.
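To make the cost difference concrete, here is a rough sketch of the byte arithmetic. The PR does not state the rank-5 layout, so the shape `[layers, b_max, heads, max_seq, head_dim]` (row-major) is purely an assumption for illustration, and `KvShape` and its methods are hypothetical names. Under that layout one slot's region is one contiguous chunk per layer.

```rust
/// Assumed rank-5 KV shape: [layers, b_max, heads, max_seq, head_dim], row-major.
/// This layout is an illustration only; the real shim may order dimensions differently.
#[derive(Debug, Clone, Copy)]
pub struct KvShape {
    pub layers: usize,
    pub b_max: usize,
    pub heads: usize,
    pub max_seq: usize,
    pub head_dim: usize,
    pub elem_bytes: usize,
}

impl KvShape {
    /// Bytes of one slot within one layer (contiguous under the assumed layout).
    fn slot_chunk_bytes(&self) -> usize {
        self.heads * self.max_seq * self.head_dim * self.elem_bytes
    }

    /// Bytes of the whole KV buffer.
    pub fn total_bytes(&self) -> usize {
        self.layers * self.b_max * self.slot_chunk_bytes()
    }

    /// (byte offset, byte length) of every contiguous chunk belonging to `slot`.
    pub fn slot_regions(&self, slot: usize) -> Vec<(usize, usize)> {
        assert!(slot < self.b_max, "slot index out of range");
        let chunk = self.slot_chunk_bytes();
        (0..self.layers)
            .map(|layer| ((layer * self.b_max + slot) * chunk, chunk))
            .collect()
    }

    /// Traffic of a host-mirror admit: the whole buffer d2h, then the whole buffer h2d.
    pub fn host_mirror_bytes(&self) -> usize {
        2 * self.total_bytes()
    }

    /// Traffic of a device-side admit: only the admitted slot's chunks are written.
    pub fn slot_write_bytes(&self) -> usize {
        self.layers * self.slot_chunk_bytes()
    }
}
```

Under these assumptions the host-mirror admit moves `2 * b_max` times as many bytes as the slot write, and the slot write never touches the chunks of any other slot.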

What's here

  • C shim (csrc/xla_iree.c): rank-5 KV state + xla_llama_ragged_reset / xla_llama_prefill_slot (device-side d2d slot write) / xla_llama_decode_ragged. The xla_llama_create signature is unchanged (the spike's vestigial vocab arg is dropped).
  • Assets: bundled ragged decode graphs for B_max ∈ {4, 8} (emitter decode-ragged-argmax, module @decode_step so the shim's decode_step.main resolves them). Selected by b_max at load; more slot counts = regenerate an asset (assets README).
  • iree.rs: ragged FFI + IreeRaggedLlama owner; shared create_ctx / compile_prefill_and helpers (dedup with the single-seq path); IreeLlama::prefill_first for the reference path.
  • batch.rs: XlaBatchEngine (submit / pump / cancel, per-request EngineEvents, greedy), a backend-neutral Scheduler split out so its admit/evict/cancel bookkeeping is unit-tested without a device, and XlaReferenceEngine for validation.
  • examples/xla_batch_bench.rs (required-features = ["xla-iree"]): the reference-equivalence + throughput harness.
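As an illustration of why a backend-neutral scheduler can be tested without a device, the admit/evict/cancel bookkeeping might look roughly like the following. This is a minimal sketch, not the `Scheduler` in `batch.rs`: the struct layout, method names, and FIFO admission order are all assumptions.

```rust
use std::collections::VecDeque;

pub type RequestId = u64;

/// Hypothetical slot bookkeeping: which request owns which slot, plus a FIFO queue.
#[derive(Debug, Default)]
pub struct Scheduler {
    slots: Vec<Option<RequestId>>,
    pending: VecDeque<RequestId>,
}

impl Scheduler {
    pub fn new(b_max: usize) -> Self {
        Self { slots: vec![None; b_max], pending: VecDeque::new() }
    }

    /// Queue a request; it is not placed in a slot until `admit` runs.
    pub fn submit(&mut self, id: RequestId) {
        self.pending.push_back(id);
    }

    /// Move queued requests into free slots. Returns the (slot, request) pairs admitted.
    pub fn admit(&mut self) -> Vec<(usize, RequestId)> {
        let mut admitted = Vec::new();
        for (index, slot) in self.slots.iter_mut().enumerate() {
            if slot.is_none() {
                match self.pending.pop_front() {
                    Some(id) => {
                        *slot = Some(id);
                        admitted.push((index, id));
                    }
                    None => break, // nothing left to admit
                }
            }
        }
        admitted
    }

    /// Free a slot (for example when its request finishes), returning its owner.
    pub fn evict(&mut self, slot: usize) -> Option<RequestId> {
        self.slots.get_mut(slot)?.take()
    }

    /// Cancel a request whether it is still queued or already occupies a slot.
    pub fn cancel(&mut self, id: RequestId) -> bool {
        if let Some(pos) = self.pending.iter().position(|&p| p == id) {
            self.pending.remove(pos);
            return true;
        }
        for slot in self.slots.iter_mut() {
            if *slot == Some(id) {
                *slot = None;
                return true;
            }
        }
        false
    }

    /// Number of occupied slots.
    pub fn active(&self) -> usize {
        self.slots.iter().filter(|s| s.is_some()).count()
    }
}
```

Nothing here touches a device handle, so slot recycling and cancellation can be exercised in plain unit tests, while an engine would call the device-side prefill only for the pairs that `admit` returns.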

Validation

Reference-equivalence gate (the Stage 2a gate, now over the productized engine + device-side admit): every request's batched stream must equal its independent single-sequence reference, regardless of when it was admitted or which peers shared its batch. All reference-exact, with mid-stream admit and slot recycling exercised in every run:

| Device | B_max | N | Result | Batched tok/s | vs sequential |
| --- | --- | --- | --- | --- | --- |
| CPU local-task | 4 | 8 | reference-exact | 6.25 | 3.8x |
| CPU local-task | 8 | 10 | reference-exact | 5.70 | 3.5x |
| CUDA GB10 | 4 | 8 | reference-exact | 17.38 | 4.0x |
| CUDA GB10 | 8 | 16 | reference-exact | 23.41 | 4.9x |

("vs sequential" is the batched tok/s relative to the same bench's sequential single-seq baseline; both bundled assets were validated on both devices.)
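The gate itself amounts to a per-request stream comparison. A minimal sketch of such a check follows; the `Mismatch` type and function names are hypothetical and this is not the code in `examples/xla_batch_bench.rs`.

```rust
use std::collections::BTreeMap;

pub type RequestId = u64;
pub type Token = u32;

/// Why a batched run failed the reference-equivalence check (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Mismatch {
    MissingRequest(RequestId),
    TokenDiffers { id: RequestId, index: usize },
    LengthDiffers { id: RequestId, batched: usize, reference: usize },
}

/// Every request's batched token stream must equal its single-sequence reference,
/// independent of admission time or of which peers shared the batch.
pub fn check_reference_exact(
    batched: &BTreeMap<RequestId, Vec<Token>>,
    reference: &BTreeMap<RequestId, Vec<Token>>,
) -> Result<(), Mismatch> {
    for (&id, want) in reference {
        let got = batched.get(&id).ok_or(Mismatch::MissingRequest(id))?;
        // Report the first diverging position, if any.
        if let Some(index) = got.iter().zip(want).position(|(g, w)| g != w) {
            return Err(Mismatch::TokenDiffers { id, index });
        }
        if got.len() != want.len() {
            return Err(Mismatch::LengthDiffers {
                id,
                batched: got.len(),
                reference: want.len(),
            });
        }
    }
    Ok(())
}

/// Speedup of the batched run over the sequential baseline, both in tok/s.
pub fn speedup(batched_tok_per_s: f64, sequential_tok_per_s: f64) -> f64 {
    batched_tok_per_s / sequential_tok_per_s
}
```

With greedy decoding both paths are deterministic, so exact token equality is a meaningful pass criterion; a sampling engine would need a different gate.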

Gates: cargo fmt clean; cargo clippy -D warnings clean (no-features, iree, and the example); cargo test -p mlxcel-xla: 10 tests pass (incl. the Scheduler logic). Three build configs compile: no-features (CI-equivalent, engine absent), --features iree, and the xla-iree example.

Scope / non-goals

  • Backend-neutral at the request level (no server types), so the Stage 2c BatchEngine trait + server adapter wrap it unchanged. XlaBackend::supports_batched_serving() stays false until 2c wires it in.
  • Greedy only; fixed B_max from the bundled buckets; contiguous per-slot KV. Sampling, runtime bucket selection, paged KV, and chunked prefill are Stage 2c/2d.
  • Default and CI builds are unaffected: the engine is behind the iree feature and the example behind xla-iree.

Refs #449 (epic). Follows #462 (Stage 1), #466 (2a-i), #468 (2a-ii). Design: spike/openxla/STAGE2_DESIGN.md.

Productize the Stage 2a spike into mlxcel-xla as XlaBatchEngine: B_max slots
share one rank-5 KV cache and serve a request stream, advancing every active
slot one token per step through the ragged decode graph. Replaces the spike's
host-mirror admit (a full d2h+h2d KV round-trip) with a DEVICE-SIDE slot write:
a new request's prompt KV is copied into just its slot's region via
iree_hal_device_transfer_d2d, leaving live slots untouched.

- C shim (csrc/xla_iree.c): rank-5 KV state + xla_llama_ragged_reset /
  xla_llama_prefill_slot (device-side d2d) / xla_llama_decode_ragged; the
  create signature is unchanged.
- Assets: bundled ragged decode graphs for B_max in {4, 8} (emitter
  decode-ragged-argmax), selected by b_max at load.
- iree.rs: ragged FFI + IreeRaggedLlama owner; shared create_ctx /
  compile_prefill_and helpers; IreeLlama::prefill_first for references.
- batch.rs: XlaBatchEngine (submit / pump / cancel, per-request EngineEvents,
  greedy), a backend-neutral Scheduler split out for unit tests, and
  XlaReferenceEngine for validation.
- examples/xla_batch_bench.rs (required-features = xla-iree): the
  reference-equivalence + throughput harness.

Validated reference-exact (every request matches its independent single-seq
reference) on CPU local-task (B=4 N=8, B=8 N=10) and CUDA GB10 (B=4 N=8,
B=8 N=16), with mid-stream admit and slot recycling. Backend-neutral at the
request level, so the Stage 2c BatchEngine trait + server adapter wrap it
unchanged; supports_batched_serving() stays false until 2c.

Default and CI builds are unaffected: the engine is behind the iree feature
and the example behind xla-iree.
@inureyes added the labels area:architecture, priority:medium, type:enhancement, and status:done on Jun 28, 2026
@inureyes merged commit b8bb217 into main on Jun 28, 2026
5 checks passed
@inureyes deleted the feature/449-stage2b-xla-batch-engine branch on June 28, 2026 15:11