feat: run the OpenXLA backend on CUDA (GB10) via a source-built IREE runtime (#449) #461

Merged: inureyes merged 1 commit into main from feat/449-cuda-productize on Jun 27, 2026

@inureyes

Summary

The OpenXLA backend (issue #449) now runs on the GB10 GPU. MLXCEL_BACKEND=xla MLXCEL_XLA_DEVICE=cuda mlxcel generate drives Llama-3.2-1B through IREE on CUDA. Output is token-exact (48/48) against HF at temperature 0, and generation runs at ~5 tok/s (~2.6x the CPU path).

Why a source build

  • The prebuilt IREE dist is CPU/Vulkan only: no cuda driver, and its iree-compile has no cuda codegen.
  • Vulkan via the dist does not work on the GB10: IREE's Vulkan allocator cannot allocate against NVIDIA's Grace-Blackwell unified memory.
  • So CUDA is the GPU path, using a source-built cuda-enabled IREE runtime plus a cuda-capable iree-compile, version-matched. It is a separate, mutually-exclusive build mode keyed on IREE_CUDA_HOME, so the merged CPU (IREE_DIST) path and CI (which build neither) are unchanged.
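The mode selection described above could be sketched as follows. This is a minimal illustration, not the code in mlxcel-xla/build.rs: the `IreeMode` enum and `select_mode` function are hypothetical names, and rejecting the case where both variables are set is an assumption about how "mutually exclusive" is enforced (the real build script may instead give one variable precedence).

```rust
use std::path::PathBuf;

/// Which IREE runtime the build links against (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum IreeMode {
    /// Source-built, cuda-enabled runtime rooted at IREE_CUDA_HOME.
    Cuda(PathBuf),
    /// Prebuilt CPU dist rooted at IREE_DIST.
    Dist(PathBuf),
    /// Neither variable is set (the CI case): build without IREE.
    None,
}

/// Pure selection logic, kept separate from the environment so it can be tested.
/// Assumption: setting both variables is treated as a configuration error.
fn select_mode(cuda_home: Option<&str>, dist: Option<&str>) -> Result<IreeMode, String> {
    match (cuda_home, dist) {
        (Some(_), Some(_)) => {
            Err("IREE_CUDA_HOME and IREE_DIST are mutually exclusive".to_string())
        }
        (Some(home), None) => Ok(IreeMode::Cuda(PathBuf::from(home))),
        (None, Some(d)) => Ok(IreeMode::Dist(PathBuf::from(d))),
        (None, None) => Ok(IreeMode::None),
    }
}
```

In a build script the two arguments would typically come from `std::env::var("IREE_CUDA_HOME").ok()` and `std::env::var("IREE_DIST").ok()`, each passed through `as_deref()`.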

What changed

  • mlxcel-xla/build.rs: cuda mode (IREE_CUDA_HOME) compiles the shim against the source runtime headers with XLA_GATE_CUDA and bakes the cuda iree-compile path (IREE_CUDA_COMPILE, overridable at runtime via MLXCEL_XLA_IREE_COMPILE).
  • csrc/xla_iree.c: registers the cuda driver explicitly (guarded by XLA_GATE_CUDA; the unified runtime bundles only the local-task registration).
  • root build.rs: the cuda runtime link recipe. The source-built unified archive already bundles the cuda driver impl, so it is whole-archived alone, with the cuda registration wrapper, IREE's vendored printf, and flatcc in a group. The cuda driver dlopens libcuda at runtime, so there is no link-time -lcuda.
  • src/iree.rs: cuda target flags, cuda iree-compile sourcing, and the compiler path in the vmfb cache key.
  • runtime.rs: suppress the MLX CPU-fallback footgun warning when the XLA backend is selected (it drives inference through IREE, not MLX, so the message is misleading).
  • spike/iree-ffi: the same cuda mode that proved the path token-exact before it was productized.
  • README: the source-runtime build recipe.
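The override rule for the compiler path (a path baked at build time from IREE_CUDA_COMPILE, overridable at runtime via MLXCEL_XLA_IREE_COMPILE) might look roughly like this. The function name `resolve_iree_compile` is hypothetical, and treating an empty override as unset is an assumption, not something the PR states.

```rust
use std::path::PathBuf;

/// Pick the iree-compile binary: a non-empty runtime override wins,
/// otherwise fall back to the path baked in at build time.
/// Returns None when neither is available.
fn resolve_iree_compile(runtime_override: Option<&str>, baked: Option<&str>) -> Option<PathBuf> {
    runtime_override
        // Assumption: an empty MLXCEL_XLA_IREE_COMPILE counts as "not set".
        .filter(|s| !s.is_empty())
        .or(baked)
        .map(PathBuf::from)
}
```

A caller would pass `std::env::var("MLXCEL_XLA_IREE_COMPILE").ok().as_deref()` as the override. How the baked value reaches the crate (for example a `cargo:rustc-env` line emitted by build.rs and read with `option_env!`) is likewise an assumption here.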

How to run (local)

# build the cuda-enabled IREE runtime from source (runtime only, no LLVM) — see the crate README
export IREE_CUDA_HOME=/path/to/iree   IREE_CUDA_COMPILE=/path/to/cuda/iree-compile
cargo build --release --features xla-iree
MLXCEL_BACKEND=xla MLXCEL_XLA_DEVICE=cuda ./target/release/mlxcel generate -m <llama-3.2-1b> -p "..." -n 48

Scope

  • Still Llama-3.2-1B only, with prompts capped at the 256-token bucket, greedy decoding, and a single sequence (batch-1). Higher throughput needs batched graphs and a multi-sequence session, which is left to a follow-up.
  • The cuda runtime build is a local artifact (not committed); the recipe is in the README.

Validation

  • dist (CPU) mode still builds (no regression); cuda mode builds, links, and runs end to end on the GB10 at ~5 tok/s.
  • default and xla-backend builds, cargo fmt --check, and cargo clippy (default -D warnings + xla-backend) are clean; cargo test --features xla-backend --lib backend:: passes 6/6.

Refs #449.

…runtime (#449)

The OpenXLA backend (issue #449) now runs on the GB10 GPU. `MLXCEL_BACKEND=xla MLXCEL_XLA_DEVICE=cuda mlxcel generate` drives Llama-3.2-1B through IREE on CUDA, token-exact (48/48) vs HF temp-0 at about 5 tok/s, roughly 2.6x the CPU path.

The prebuilt IREE dist is CPU/Vulkan only (no cuda driver, and its iree-compile has no cuda codegen), and Vulkan through that dist does not work on the GB10 (IREE's Vulkan allocator cannot allocate against NVIDIA's Grace-Blackwell unified memory). CUDA is therefore the GPU path, using a source-built cuda-enabled IREE runtime plus a cuda-capable iree-compile, version-matched to each other. This is a separate, mutually-exclusive build mode keyed on IREE_CUDA_HOME, so the merged CPU (IREE_DIST) path is unchanged and CI, which builds neither, stays green.

mlxcel-xla/build.rs gains a cuda mode (IREE_CUDA_HOME): it compiles the shim against the source runtime headers with XLA_GATE_CUDA defined and bakes the cuda iree-compile path (IREE_CUDA_COMPILE, overridable at runtime via MLXCEL_XLA_IREE_COMPILE). The C shim registers the cuda driver explicitly, since the unified runtime bundles only the local-task registration. The runtime link recipe lives in the root build.rs (a dependency's link-args do not propagate to the binary): the source-built unified archive already bundles the cuda driver impl, so it is whole-archived alone, with the cuda registration wrapper, IREE's vendored printf (the unified printf.c.o needs vsnprintf_), and flatcc in a group. The cuda driver dlopens libcuda at runtime, so no link-time -lcuda. src/iree.rs adds the cuda target flags, sources iree-compile from MLXCEL_XLA_IREE_COMPILE in cuda mode, and includes the compiler path in the vmfb cache key so a cuda vmfb is never reused for a cpu build. The source-runtime build recipe is documented in the crate README.
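To illustrate why the compiler path belongs in the vmfb cache key, here is a minimal sketch under stated assumptions. `vmfb_cache_key` and its inputs are hypothetical, the flag strings in the usage note are placeholders rather than real iree-compile flags, and the hashing scheme is not taken from src/iree.rs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a cache file name from everything that affects the compiled artifact.
/// Including the compiler path means a vmfb produced by the cuda iree-compile
/// gets a different key than one produced by the dist iree-compile.
fn vmfb_cache_key(mlir: &str, target_flags: &[&str], compiler_path: &str) -> String {
    let mut h = DefaultHasher::new();
    mlir.hash(&mut h);
    target_flags.hash(&mut h);
    compiler_path.hash(&mut h);
    format!("{:016x}.vmfb", h.finish())
}
```

For example, `vmfb_cache_key(src, &["--flag-a"], "/dist/bin/iree-compile")` and the same call with `"/cuda/bin/iree-compile"` yield different file names. `DefaultHasher` keeps the sketch dependency-free, but its algorithm is not guaranteed stable across Rust releases, so an on-disk cache would more likely use a fixed hash function.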

The MLX CPU-fallback footgun warning is suppressed when the OpenXLA backend is selected, because it drives inference through IREE, not MLX, and can run on the GPU; the message would otherwise be misleading.

spike/iree-ffi gains the same cuda mode (IREE_CUDA_HOME) that proved the path token-exact before it was productized.

Validation: the dist (CPU) mode still builds (no regression); the cuda mode builds, links, and runs end to end on the GB10 at about 5 tok/s; default and xla-backend builds, fmt, and clippy are clean; backend tests pass.

Refs #449.
inureyes added the labels type:enhancement (new features, capabilities, or significant additions), priority:medium (medium priority), and area:architecture (architecture and code structure changes) on Jun 27, 2026
inureyes merged commit a942cb1 into main on Jun 27, 2026
5 checks passed
inureyes deleted the feat/449-cuda-productize branch on June 27, 2026 15:32