
feat: serve the OpenXLA backend through the continuous-batching engine (#449 M3 Stage 2c) #470

Merged
inureyes merged 1 commit into main from feature/449-stage2c-xla-serve-worker
Jun 28, 2026



@inureyes

Member

Summary

Stage 2c of the OpenXLA/IREE throughput milestone (#449 M3): wire the Stage 2b XlaBatchEngine (PR #469) into mlxcel-server so the OpenXLA backend is servable, not just CLI. With MLXCEL_BACKEND=xla on an xla-iree build, requests are served through the continuous-batching engine. The MLX serving path is unchanged.

Until now the server only ran MLX: it always built a BatchScheduler from an MLX LoadedModel, and the XLA backend's load_model() returns an error, so MLXCEL_BACKEND=xla never reached the server. This PR adds the first non-MLX serving worker.

What's here

  • BatchEngine trait (src/server/batch/mod.rs): the backend-neutral serve contract (the cross-backend batching seam ADR 0004 deferred). A worker owns its model + KV/scheduling, consumes ModelRequests, and streams GenerateEvents until Shutdown. The MLX BatchScheduler and the new XlaServeWorker both implement it; the MLX worker spawn sites now dispatch through serve(), which forwards to the scheduler's existing run loop (behavior unchanged).
  • XlaServeWorker (src/server/batch/xla_worker.rs, #[cfg(feature = "xla-iree")]): drains ModelRequests, tokenizes prompts, submits to the engine, pumps it one step at a time, and maps each per-request EngineEvent back to a GenerateEvent, reusing the server's StreamingDecodeState for byte-fallback-safe detokenization. Polls the cancel flag; blocks on recv when idle.
  • ModelProvider routing (model_provider.rs): MLXCEL_BACKEND=xla spawns the XLA worker (new_with_xla_worker / spawn_xla_model_worker, which build the engine + tokenizer on the worker thread, same async-load posture as the MLX path); --max-batch-size maps to the engine's bundled B_max (>= 8 -> 8, else 4); the HAL device comes from MLXCEL_XLA_DEVICE.
  • XlaBackend::supports_batched_serving() returns true under xla-iree.
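The shape of the contract described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/server/batch/mod.rs: the request and event types are simplified stand-ins, and `EchoWorker`, `spawn_worker`, and `bundled_b_max` are hypothetical names. Only the trait name, the serve() entry point, the Shutdown request, and the ">= 8 -> 8, else 4" mapping come from the description.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::JoinHandle;

/// Simplified stand-in for the server's request type (assumption).
pub enum ModelRequest {
    Generate {
        prompt: String,
        max_tokens: usize,
        events: Sender<GenerateEvent>,
    },
    Shutdown,
}

/// Simplified stand-in for the server's streamed event type (assumption).
#[derive(Debug, PartialEq)]
pub enum GenerateEvent {
    Token(String),
    Done { finish_reason: &'static str },
}

/// Backend-neutral serve contract: the worker owns its model and scheduling
/// state and consumes requests until it sees Shutdown.
pub trait BatchEngine {
    fn serve(self: Box<Self>, requests: Receiver<ModelRequest>);
}

/// Map --max-batch-size onto a bundled batch width (>= 8 -> 8, else 4).
pub fn bundled_b_max(max_batch_size: usize) -> usize {
    if max_batch_size >= 8 {
        8
    } else {
        4
    }
}

/// Toy worker that "generates" by echoing the prompt's words back.
pub struct EchoWorker;

impl BatchEngine for EchoWorker {
    fn serve(self: Box<Self>, requests: Receiver<ModelRequest>) {
        // recv() blocks while idle; the loop ends on Shutdown or when every
        // sender has been dropped.
        while let Ok(request) = requests.recv() {
            match request {
                ModelRequest::Shutdown => break,
                ModelRequest::Generate { prompt, max_tokens, events } => {
                    let mut emitted = 0;
                    for word in prompt.split_whitespace().take(max_tokens) {
                        // A failed send means the client went away; ignore it.
                        let _ = events.send(GenerateEvent::Token(word.to_string()));
                        emitted += 1;
                    }
                    let finish_reason = if emitted == max_tokens { "length" } else { "stop" };
                    let _ = events.send(GenerateEvent::Done { finish_reason });
                }
            }
        }
    }
}

/// Spawn any engine on its own thread; callers never see the concrete type.
pub fn spawn_worker(
    engine: Box<dyn BatchEngine + Send>,
) -> (Sender<ModelRequest>, JoinHandle<()>) {
    let (tx, rx) = channel();
    let handle = std::thread::spawn(move || engine.serve(rx));
    (tx, handle)
}
```

Because the spawn site only holds a `Box<dyn BatchEngine + Send>`, a second backend can be added without touching the callers, which is the point of putting the seam at serve().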

Scope (the engine is greedy + text-only by nature)

Honors max_tokens and the model's EOS ids; serves greedily regardless of sampling parameters (logged once). Rejects what it cannot serve faithfully, with a clear error rather than wrong output: logprobs (no logit readback), structured / JSON-schema output (no constraint masking), and multimodal inputs (text-only). Stop strings are not enforced yet (EOS + max_tokens terminate). So this is a reference/experimental serving path, gated behind xla-iree.
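The accept/ignore/reject split above could be expressed as a small pre-flight check along these lines. This is a sketch under assumptions: `RequestOptions`, its field names, `check_servable`, and `ignored_options` are hypothetical, and only the logprobs message text echoes the error quoted in the validation notes.

```rust
/// Hypothetical subset of per-request options; field names are illustrative.
#[derive(Default)]
pub struct RequestOptions {
    pub max_tokens: usize,
    pub temperature: Option<f32>,
    pub stop: Vec<String>,
    pub logprobs: bool,
    pub json_schema: Option<String>,
    pub image_count: usize,
}

/// Reject options a greedy, text-only engine cannot serve faithfully.
pub fn check_servable(opts: &RequestOptions) -> Result<(), String> {
    if opts.logprobs {
        // No logit readback, so log-probabilities cannot be reported.
        return Err("the OpenXLA backend does not support logprobs".to_string());
    }
    if opts.json_schema.is_some() {
        // No constraint masking, so structured output cannot be guaranteed.
        return Err("the OpenXLA backend does not support structured output".to_string());
    }
    if opts.image_count > 0 {
        return Err("the OpenXLA backend is text-only".to_string());
    }
    Ok(())
}

/// Options that are accepted but have no effect on this path.
pub fn ignored_options(opts: &RequestOptions) -> Vec<&'static str> {
    let mut ignored = Vec::new();
    if opts.temperature.is_some() {
        ignored.push("sampling parameters (served greedily)");
    }
    if !opts.stop.is_empty() {
        ignored.push("stop strings (not enforced yet)");
    }
    ignored
}
```

Keeping rejection separate from the ignored list matches the stated policy: options that would silently change the output are errors, while options that only lose an effect are served and can be logged.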

Validation (E2E on GB10, CUDA)

Server started with MLXCEL_BACKEND=xla MLXCEL_XLA_DEVICE=cuda + Llama-3.2-1B:

  • Token-exact: /v1/completions for "The capital of France is" returns " Paris. The Eiffel Tower is located in Paris. The Louvre Museum is also located in", identical to the --no-chat-template CLI reference (same engine, same greedy).
  • Continuous batching: 3 concurrent prompts each served correctly through B_max=4 ("Jupiter ... gas giant", "4", "small village ... Tuscany").
  • Streaming: SSE chunks concatenate to the same text.
  • Option rejection: a logprobs request returns error: ... the OpenXLA backend does not support logprobs ....
  • Usage / finish_reason correct (prompt_tokens, completion_tokens, length/stop).
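The streaming check depends on detokenization that never emits half of a multi-byte character, since byte-fallback tokens can split one UTF-8 sequence across decode steps. The server's StreamingDecodeState is not shown in this PR, so the following is only a generic sketch of that technique; `Utf8Stream` and its methods are hypothetical names.

```rust
/// Buffers raw token bytes and releases only complete UTF-8 text.
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Append the bytes of one decoded token and return whatever text is now
    /// safe to send to the client (possibly empty).
    pub fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            let err = match std::str::from_utf8(&self.pending) {
                Ok(text) => {
                    out.push_str(text);
                    None
                }
                Err(e) => Some(e),
            };
            let e = match err {
                None => {
                    // Everything buffered was valid and has been emitted.
                    self.pending.clear();
                    return out;
                }
                Some(e) => e,
            };
            let valid = e.valid_up_to();
            out.push_str(&String::from_utf8_lossy(&self.pending[..valid]));
            match e.error_len() {
                // Incomplete trailing sequence: keep it until more bytes arrive.
                None => {
                    self.pending.drain(..valid);
                    return out;
                }
                // Genuinely invalid bytes: substitute U+FFFD and keep scanning.
                Some(bad) => {
                    out.push('\u{FFFD}');
                    self.pending.drain(..valid + bad);
                }
            }
        }
    }
}
```

With a decoder of this kind, concatenating the streamed chunks yields the same string as decoding all bytes at once, which is the property the SSE check above exercises.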

MLX serving path unchanged: default build compiles with the XLA worker absent; serve() is a verbatim forward to the scheduler's run; cargo clippy -D warnings clean (default + xla-iree); scheduler tests green.

Refs #449 (epic). Follows #469 (Stage 2b), #468/#466 (2a), #462 (Stage 1). Next: Stage 2d (paged-KV, chunked prefill, sampling, stop strings).

feat: serve the OpenXLA backend through the continuous-batching engine (#449 M3 Stage 2c)

Wire the Stage 2b XlaBatchEngine into mlxcel-server so the OpenXLA backend is
servable, not just CLI: with MLXCEL_BACKEND=xla on an xla-iree build, requests
are served through the continuous-batching engine. The MLX serving path is
unchanged.

- BatchEngine trait (src/server/batch/mod.rs): the backend-neutral serve contract
  (the cross-backend batching seam ADR 0004 deferred). The MLX BatchScheduler and
  the new XlaServeWorker both implement it; the MLX worker spawn sites now dispatch
  through serve(), which forwards to the scheduler's existing run loop (behavior
  unchanged).
- XlaServeWorker (src/server/batch/xla_worker.rs, gated on xla-iree): drains
  ModelRequests, encodes prompts, submits to the engine, pumps it one step at a
  time, and maps each EngineEvent to a GenerateEvent reusing the server's
  StreamingDecodeState for byte-fallback-safe detokenization; polls the cancel
  flag; blocks on recv when idle.
- ModelProvider routing (model_provider.rs): MLXCEL_BACKEND=xla spawns the XLA
  worker (new_with_xla_worker / spawn_xla_model_worker, which build the engine +
  tokenizer on the worker thread); --max-batch-size maps to the engine's bundled
  B_max ({4, 8}); the HAL device comes from MLXCEL_XLA_DEVICE.
- XlaBackend::supports_batched_serving() returns true under xla-iree.

Scope: greedy and text-only (the engine's nature). Honors max_tokens and the
model's EOS ids; serves greedily regardless of sampling parameters (logged once);
rejects logprobs, structured / JSON-schema output, and multimodal inputs with a
clear error rather than serving them wrong; stop strings are not enforced yet.

Validated end-to-end on GB10 (CUDA): /v1/completions is token-exact against the
--no-chat-template CLI reference; three concurrent prompts are each served
correctly (continuous batching, B_max=4); streaming concatenates; a logprobs
request is rejected. The MLX serving path is unchanged (default build and
scheduler tests green). The engine is greedy, so this is a reference/experimental
serving path gated behind xla-iree; default and CI builds are unaffected.
@inureyes added the labels area:architecture, priority:medium, status:done, and type:enhancement on Jun 28, 2026
@inureyes merged commit 4a7ceff into main on Jun 28, 2026
5 checks passed
@inureyes deleted the feature/449-stage2c-xla-serve-worker branch on June 28, 2026 15:57