feat: serve the OpenXLA backend through the continuous-batching engine (#449 M3 Stage 2c) by inureyes · Pull Request #470 · lablup/mlxcel

inureyes · 2026-06-28T15:54:53Z

Summary

Stage 2c of the OpenXLA/IREE throughput milestone (#449 M3): wire the Stage 2b XlaBatchEngine (PR #469) into mlxcel-server so the OpenXLA backend is servable, not just CLI. With MLXCEL_BACKEND=xla on an xla-iree build, requests are served through the continuous-batching engine. The MLX serving path is unchanged.

Until now the server only ran MLX: it always built a BatchScheduler from an MLX LoadedModel, and the XLA backend's load_model() returns an error, so MLXCEL_BACKEND=xla never reached the server. This PR adds the first non-MLX serving worker.

What's here

BatchEngine trait (src/server/batch/mod.rs): the backend-neutral serve contract (the cross-backend batching seam ADR 0004 deferred). A worker owns its model + KV/scheduling, consumes ModelRequests, and streams GenerateEvents until Shutdown. The MLX BatchScheduler and the new XlaServeWorker both implement it; the MLX worker spawn sites now dispatch through serve(), which forwards to the scheduler's existing run loop (behavior unchanged).
XlaServeWorker (src/server/batch/xla_worker.rs, #[cfg(feature = "xla-iree")]): drains ModelRequests, tokenizes prompts, submits to the engine, pumps it one step at a time, and maps each per-request EngineEvent back to a GenerateEvent, reusing the server's StreamingDecodeState for byte-fallback-safe detokenization. Polls the cancel flag; blocks on recv when idle.
ModelProvider routing (model_provider.rs): MLXCEL_BACKEND=xla spawns the XLA worker (new_with_xla_worker / spawn_xla_model_worker, which build the engine + tokenizer on the worker thread, same async-load posture as the MLX path); --max-batch-size maps to the engine's bundled B_max (>= 8 -> 8, else 4); the HAL device comes from MLXCEL_XLA_DEVICE.
XlaBackend::supports_batched_serving() returns true under xla-iree.
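
To illustrate the contract described above, here is a minimal sketch rather than the code in this PR. It assumes a channel-based request queue; the enum fields, EchoWorker, spawn_worker, generate, and bundled_b_max are hypothetical names, and the real BatchEngine signature may differ. The toy worker "generates" by echoing the prompt's words one per step, which is enough to show the loop shape: block on recv when idle, poll between steps so new requests can join in flight, and emit one token per active request per step.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Hypothetical stand-ins for the server's event and request types.
pub enum GenerateEvent {
    Token(String),
    Done,
}

pub enum ModelRequest {
    Generate { prompt: String, max_tokens: usize, events: Sender<GenerateEvent> },
    Shutdown,
}

/// Backend-neutral serve contract: the worker owns its model and scheduling
/// and runs until it sees `Shutdown` or every request sender is dropped.
pub trait BatchEngine {
    fn serve(self: Box<Self>, requests: Receiver<ModelRequest>);
}

// One in-flight request: the tokens still to emit plus its reply channel.
struct Slot {
    pending: Vec<String>,
    events: Sender<GenerateEvent>,
}

/// Toy worker (assumption): echoes the prompt's words instead of running a model.
pub struct EchoWorker;

impl BatchEngine for EchoWorker {
    fn serve(self: Box<Self>, requests: Receiver<ModelRequest>) {
        let mut active: Vec<Slot> = Vec::new();
        loop {
            // Block when idle; poll when busy so requests can join mid-flight.
            let next = if active.is_empty() {
                match requests.recv() {
                    Ok(req) => Some(req),
                    Err(_) => return, // all senders dropped
                }
            } else {
                requests.try_recv().ok()
            };
            match next {
                Some(ModelRequest::Shutdown) => return,
                Some(ModelRequest::Generate { prompt, max_tokens, events }) => {
                    let mut pending: Vec<String> = prompt
                        .split_whitespace()
                        .take(max_tokens)
                        .map(str::to_owned)
                        .collect();
                    pending.reverse(); // so pop() yields words in order
                    active.push(Slot { pending, events });
                }
                None => {}
            }
            // One step: each active request emits one token, or finishes.
            active.retain_mut(|slot| match slot.pending.pop() {
                Some(tok) => slot.events.send(GenerateEvent::Token(tok)).is_ok(),
                None => {
                    let _ = slot.events.send(GenerateEvent::Done);
                    false
                }
            });
        }
    }
}

/// Spawn site: any backend's worker is started through `serve()`.
pub fn spawn_worker(
    worker: Box<dyn BatchEngine + Send>,
) -> (Sender<ModelRequest>, thread::JoinHandle<()>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || worker.serve(rx));
    (tx, handle)
}

/// Submits one prompt and collects the streamed tokens until `Done`.
pub fn generate(tx: &Sender<ModelRequest>, prompt: &str, max_tokens: usize) -> Vec<String> {
    let (etx, erx) = channel();
    let req = ModelRequest::Generate { prompt: prompt.to_owned(), max_tokens, events: etx };
    tx.send(req).expect("worker is running");
    let mut out = Vec::new();
    for ev in erx {
        match ev {
            GenerateEvent::Token(t) => out.push(t),
            GenerateEvent::Done => break,
        }
    }
    out
}

/// The --max-batch-size mapping stated above: >= 8 -> 8, else 4.
pub fn bundled_b_max(max_batch_size: usize) -> usize {
    if max_batch_size >= 8 { 8 } else { 4 }
}
```

Because the spawn site only sees Box<dyn BatchEngine + Send>, swapping the MLX scheduler for an XLA worker would not change the caller in this sketch, which is the point of a backend-neutral seam.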

Scope (the engine is greedy + text-only by nature)

Honors max_tokens and the model's EOS ids; serves greedily regardless of sampling parameters (logged once). Rejects what it cannot serve faithfully, with a clear error rather than wrong output: logprobs (no logit readback), structured / JSON-schema output (no constraint masking), and multimodal inputs (text-only). Stop strings are not enforced yet (EOS + max_tokens terminate). So this is a reference/experimental serving path, gated behind xla-iree.
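
The reject-or-ignore policy in this section could be sketched as a single pre-submission check. This is an illustration under assumptions, not the PR's code: RequestOptions, its fields, and check_servable are hypothetical, and only the logprobs error wording is taken from the validation output below; the other two messages are invented to match it.

```rust
use std::sync::Once;

// Hypothetical, simplified view of the per-request options.
#[derive(Default)]
pub struct RequestOptions {
    pub logprobs: bool,
    pub json_schema: Option<String>,
    pub has_images: bool,
    pub temperature: Option<f32>,
}

// Ensures the "serving greedily" notice is printed at most once per process.
static GREEDY_NOTICE: Once = Once::new();

/// Rejects options the greedy, text-only engine cannot serve faithfully.
/// Sampling parameters are accepted but ignored, with a one-time notice.
pub fn check_servable(opts: &RequestOptions) -> Result<(), String> {
    if opts.logprobs {
        return Err("the OpenXLA backend does not support logprobs".to_owned());
    }
    if opts.json_schema.is_some() {
        // Wording is an assumption, modeled on the logprobs message.
        return Err("the OpenXLA backend does not support structured output".to_owned());
    }
    if opts.has_images {
        // Wording is an assumption, modeled on the logprobs message.
        return Err("the OpenXLA backend does not support multimodal inputs".to_owned());
    }
    if opts.temperature.is_some() {
        GREEDY_NOTICE.call_once(|| {
            eprintln!("OpenXLA backend: sampling parameters ignored, serving greedily");
        });
    }
    Ok(())
}
```

Returning an error before the request reaches the engine keeps the failure visible to the client instead of producing output that silently ignores the requested feature.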

Validation (E2E on GB10, CUDA)

Server started with MLXCEL_BACKEND=xla MLXCEL_XLA_DEVICE=cuda + Llama-3.2-1B:

Token-exact: /v1/completions for "The capital of France is" returns " Paris. The Eiffel Tower is located in Paris. The Louvre Museum is also located in", identical to the --no-chat-template CLI reference (same engine, same greedy).
Continuous batching: 3 concurrent prompts each served correctly through B_max=4 ("Jupiter ... gas giant", "4", "small village ... Tuscany").
Streaming: SSE chunks concatenate to the same text.
Option rejection: a logprobs request returns error: ... the OpenXLA backend does not support logprobs ....
Usage / finish_reason correct (prompt_tokens, completion_tokens, length/stop).
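
The streaming check depends on chunks concatenating to the full text even when a byte-fallback token ends in the middle of a multi-byte character. The sketch below shows one way to get that property; it is not the server's StreamingDecodeState, and Utf8Stream is a hypothetical name. It buffers raw bytes and releases only complete UTF-8, holding back an unfinished trailing sequence until the next push.

```rust
/// Hypothetical incremental decoder: emits only complete UTF-8 text.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Appends `bytes` and returns whatever text is now safe to emit.
    pub fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let (out, consumed) = match std::str::from_utf8(&self.pending) {
            // Everything buffered is valid: emit it all.
            Ok(s) => (s.to_owned(), self.pending.len()),
            // Truncated trailing sequence: emit the valid prefix, keep the tail.
            Err(e) if e.error_len().is_none() => {
                let valid = e.valid_up_to();
                (String::from_utf8_lossy(&self.pending[..valid]).into_owned(), valid)
            }
            // Genuinely invalid bytes: substitute U+FFFD rather than stall.
            Err(_) => (
                String::from_utf8_lossy(&self.pending).into_owned(),
                self.pending.len(),
            ),
        };
        self.pending.drain(..consumed);
        out
    }
}
```

For example, pushing the bytes of "a" plus the first byte of "é" yields "a", and pushing the second byte of "é" then yields "é", so the chunks concatenate to "aé".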

MLX serving path unchanged: default build compiles with the XLA worker absent; serve() is a verbatim forward to the scheduler's run; cargo clippy -D warnings clean (default + xla-iree); scheduler tests green.

Refs #449 (epic). Follows #469 (Stage 2b), #468/#466 (2a), #462 (Stage 1). Next: Stage 2d (paged-KV, chunked prefill, sampling, stop strings).

Wire the Stage 2b XlaBatchEngine into mlxcel-server so the OpenXLA backend is servable, not just CLI: with MLXCEL_BACKEND=xla on an xla-iree build, requests are served through the continuous-batching engine. The MLX serving path is unchanged.

- BatchEngine trait (src/server/batch/mod.rs): the backend-neutral serve contract (the cross-backend batching seam ADR 0004 deferred). The MLX BatchScheduler and the new XlaServeWorker both implement it; the MLX worker spawn sites now dispatch through serve(), which forwards to the scheduler's existing run loop (behavior unchanged).
- XlaServeWorker (src/server/batch/xla_worker.rs, gated on xla-iree): drains ModelRequests, encodes prompts, submits to the engine, pumps it one step at a time, and maps each EngineEvent to a GenerateEvent reusing the server's StreamingDecodeState for byte-fallback-safe detokenization; polls the cancel flag; blocks on recv when idle.
- ModelProvider routing (model_provider.rs): MLXCEL_BACKEND=xla spawns the XLA worker (new_with_xla_worker / spawn_xla_model_worker, which build the engine + tokenizer on the worker thread); --max-batch-size maps to the engine's bundled B_max ({4, 8}); the HAL device comes from MLXCEL_XLA_DEVICE.
- XlaBackend::supports_batched_serving() returns true under xla-iree.

Scope: greedy and text-only (the engine's nature). Honors max_tokens and the model's EOS ids; serves greedily regardless of sampling parameters (logged once); rejects logprobs, structured / JSON-schema output, and multimodal inputs with a clear error rather than serving them wrong; stop strings are not enforced yet.

Validated end-to-end on GB10 (CUDA): /v1/completions is token-exact against the --no-chat-template CLI reference; three concurrent prompts are each served correctly (continuous batching, B_max=4); streaming concatenates; a logprobs request is rejected. The MLX serving path is unchanged (default build and scheduler tests green).

The engine is greedy, so this is a reference/experimental serving path gated behind xla-iree; default and CI builds are unaffected.

inureyes added the area:architecture, priority:medium, status:done, and type:enhancement labels Jun 28, 2026

inureyes merged commit 4a7ceff into main Jun 28, 2026
5 checks passed

inureyes deleted the feature/449-stage2c-xla-serve-worker branch June 28, 2026 15:57

This was referenced Jun 28, 2026

feat: sample on the OpenXLA serve path (temperature/top-p/top-k) (#449 M3 Stage 2d) #471

Merged

feat: enforce stop strings on the OpenXLA serve path (#449 M3 Stage 2d) #478

Merged
