feat: introduce a ComputeBackend seam to abstract the forward-execution engine by inureyes · Pull Request #446 · lablup/mlxcel

inureyes · 2026-06-25T23:05:59Z

Summary

Introduce a ComputeBackend seam so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This PR lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

Seam design

ComputeBackend (new src/backend/mod.rs) abstracts who executes forward(), not individual ops. It sits at the model-load boundary (called once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile stay intact and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points (load_model, load_model_with_adapter, load_model_with_tensor_parallel) so the MLX path keeps its concrete hot types exposed: paged KV, prompt-cache detach and adopt, and cache tensors are never type-erased behind Box<dyn>. A non-MLX engine implements the same LanguageModel forward contract behind a different backend.
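A minimal sketch of what a load-boundary seam of this shape could look like. The three method names come from the description above; every signature, the `String` error type, the `ToyBackend` type, and the `run_tokens` helper are assumptions made for illustration and are not taken from the mlxcel source. The point it shows is that the trait is crossed once at load time, after which `forward()` is called on a concrete type.

```rust
use std::path::Path;

/// Hypothetical stand-in for the concrete model type returned by a load.
pub struct LoadedModel {
    pub source: String,
}

impl LoadedModel {
    /// Stand-in for the per-token forward pass. It is an inherent method on
    /// the concrete type, so calling it goes through no trait object.
    pub fn forward(&self, token: u32) -> u32 {
        token.wrapping_add(1)
    }
}

/// Sketch of a trait drawn at the load entry points only (assumed signatures).
pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
    fn load_model_with_adapter(&self, path: &Path, adapter: &Path)
        -> Result<LoadedModel, String>;
    fn load_model_with_tensor_parallel(&self, path: &Path, world_size: usize)
        -> Result<LoadedModel, String>;
}

/// Toy backend used only to exercise the sketch.
pub struct ToyBackend;

impl ComputeBackend for ToyBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        Ok(LoadedModel { source: path.display().to_string() })
    }

    fn load_model_with_adapter(&self, path: &Path, adapter: &Path)
        -> Result<LoadedModel, String> {
        Ok(LoadedModel {
            source: format!("{}+{}", path.display(), adapter.display()),
        })
    }

    fn load_model_with_tensor_parallel(&self, path: &Path, world_size: usize)
        -> Result<LoadedModel, String> {
        if world_size == 0 {
            return Err("world_size must be at least 1".to_string());
        }
        Ok(LoadedModel { source: format!("{}@tp{}", path.display(), world_size) })
    }
}

/// The seam is crossed once (the load); the loop below calls the concrete
/// `forward` directly.
pub fn run_tokens(
    backend: &impl ComputeBackend,
    path: &Path,
    tokens: &[u32],
) -> Result<Vec<u32>, String> {
    let model = backend.load_model(path)?;
    Ok(tokens.iter().map(|&t| model.forward(t)).collect())
}
```

Because the backend only appears in the load call, the inner loop in this sketch is the same code it would be with no seam at all.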

What moved behind MlxBackend

MlxBackend (src/backend/mlx.rs) is a zero-sized type that implements ComputeBackend by delegating unchanged to the existing crate::loading entry points. No loading logic is reimplemented; the same code runs whether reached directly or through the seam. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available (tests and bench harnesses still use them).
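The delegation pattern described here can be sketched as follows. The `loading` module below is a toy stand-in for `crate::loading`, and its signature and return type are assumptions; only the names `MlxBackend`, `ComputeBackend`, and `load_model` come from the PR text.

```rust
use std::path::Path;

/// Toy stand-in for the existing loader module (hypothetical signature).
mod loading {
    use std::path::Path;

    pub fn load_model(path: &Path) -> Result<String, String> {
        if path.is_dir() {
            Ok(format!("loaded {}", path.display()))
        } else {
            Err(format!("model directory not found: {}", path.display()))
        }
    }
}

pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<String, String>;
}

/// Zero-sized: the type carries no state, it only names the engine.
#[derive(Clone, Copy, Debug, Default)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    /// Pure delegation: no loading logic lives in the backend itself.
    #[inline]
    fn load_model(&self, path: &Path) -> Result<String, String> {
        loading::load_model(path)
    }
}
```

In this sketch, calling `MlxBackend.load_model(p)` and calling `loading::load_model(p)` directly run the same function, so a missing directory produces the loader's own error rather than one invented by the backend layer.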

Feature flag

A default-off experimental-backend Cargo feature gates the optional non-MLX path. The src/backend/experimental.rs scaffold module and the Backend::Experimental enum variant are cfg-gated behind it, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code. The scaffold reports "not implemented" rather than pretending to load; a real engine (and any hardware-feasibility gate) is future work. The feature-on select_backend() is the only place that reads a runtime backend switch (MLXCEL_BACKEND), and it is compiled only when the feature is enabled.
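One way the gating described above might be arranged, shown as a sketch. The feature name and the `MLXCEL_BACKEND` variable come from the PR text; the accepted value `"experimental"`, the unit-like variants, the `&str` path argument, and the error wording are assumptions. Built without the feature, only the `Backend::Mlx` items are compiled.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Mlx,
    /// Present only when the feature is enabled.
    #[cfg(feature = "experimental-backend")]
    Experimental,
}

/// Default build: a constant constructor with no environment read.
#[cfg(not(feature = "experimental-backend"))]
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx
}

/// Feature-on build: the only place that consults the runtime switch.
/// The value "experimental" is a hypothetical choice for this sketch.
#[cfg(feature = "experimental-backend")]
#[must_use]
pub fn select_backend() -> Backend {
    match std::env::var("MLXCEL_BACKEND").as_deref() {
        Ok("experimental") => Backend::Experimental,
        _ => Backend::Mlx,
    }
}

impl Backend {
    pub fn load_model(&self, path: &str) -> Result<String, String> {
        match self {
            // Stand-in for delegating to the real MLX loader.
            Backend::Mlx => Ok(format!("mlx:{path}")),
            // The scaffold reports an error instead of pretending to load.
            #[cfg(feature = "experimental-backend")]
            Backend::Experimental => {
                Err("experimental backend: not implemented".to_string())
            }
        }
    }
}
```

Putting the `cfg` attribute on the variant, on the match arm, and on the env-reading function keeps all three out of a default build together, which is what lets the default path stay free of any selection logic.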

How codegen equivalence / no dispatch is guaranteed when the feature is off

Backend is an enum whose only variant under default features is Backend::Mlx, and MlxBackend is zero-sized, so Backend is itself zero-sized with no discriminant. select_backend() (#[inline], #[must_use]) is a constant constructor that always returns that one variant with no environment read and no branch. Every Backend method is a single-arm match marked #[inline]. After inlining (release builds use fat LTO and codegen-units = 1), select_backend().load_model(p) lowers to a direct call to the existing MLX loader, identical to the pre-seam build. The env-reading selection path and the second enum variant only exist under #[cfg(feature = "experimental-backend")], so they are absent from default codegen entirely.
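The zero-size claim can be checked with a small standalone sketch. The type and function names follow the PR text; the `name` method is hypothetical and only stands in for a delegating method. Rust does not formally specify the layout of default-representation enums, so treat the size check as an observation about current rustc rather than a language guarantee.

```rust
use std::mem::size_of;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MlxBackend;

/// One variant wrapping a zero-sized payload: nothing to store, no tag needed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Mlx(MlxBackend),
}

/// Constant constructor: no environment read, no branch.
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}

impl Backend {
    /// Single-arm match: with one variant there is no discriminant to test.
    #[inline]
    pub fn name(&self) -> &'static str {
        match self {
            Backend::Mlx(_) => "mlx",
        }
    }
}

/// Size in bytes of the selector enum, as seen by the compiler in use.
pub fn backend_size() -> usize {
    size_of::<Backend>()
}
```

With a zero-sized receiver and a single-arm match, an inlined `select_backend().name()` has no runtime state to read, which is the property the codegen-equivalence argument relies on.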

How behavior preservation is ensured

Behavior is preserved by construction: the control-plane load call sites now go through the seam, and the seam methods delegate to the unchanged loaders, so the same loading and forward code runs. Rerouted call sites: src/commands/generate.rs (primary load: tensor-parallel / adapter / plain, plus the offline draft-model load), src/commands/chat.rs (REPL load), and src/server/model_worker.rs (both the batched and the legacy sequential worker loops, each covering tensor-parallel / adapter / plain). The pipeline-parallel branch keeps its own distributed loader and does not go through the seam.

What changed

src/backend/mod.rs (new): ComputeBackend trait, Backend enum, select_backend().
src/backend/mlx.rs (new): MlxBackend, delegates to crate::loading.
src/backend/experimental.rs (new, cfg-gated): scaffold plug-in slot for a non-MLX engine.
src/backend/tests.rs (new): scoped seam tests.
src/lib.rs: pub mod backend; and re-export of Backend, ComputeBackend, MlxBackend, select_backend.
Cargo.toml: default-off experimental-backend feature.
src/commands/generate.rs, src/commands/chat.rs, src/server/model_worker.rs: route loads through select_backend().

Test plan

cargo check --lib --features metal,accelerate
cargo check --bins --features metal,accelerate
cargo check --lib --features metal,accelerate,experimental-backend (gated module compiles)
cargo clippy --lib --tests --features metal,accelerate -- -D warnings
cargo clippy --lib --features metal,accelerate,experimental-backend -- -D warnings
cargo test --lib backend:: --features metal,accelerate (3 tests pass)
cargo fmt --check
Real-checkpoint temp-0 byte-identical token parity and throughput on the MLX path are owned by the maintainer's release-build parity gate.

Closes #338

…on engine

Add a backend boundary so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This change lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

The new src/backend module defines a ComputeBackend trait that abstracts who executes forward(), not individual ops. The seam sits at the model-load boundary (once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile are untouched and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points so the MLX path keeps its concrete hot types (paged KV, prompt-cache detach and adopt, cache tensors) exposed. MlxBackend (src/backend/mlx.rs) implements the trait by delegating unchanged to the existing crate::loading entry points, so the same loading and forward code runs whether reached directly or through the seam.

Selection folds away under default features. Backend is an enum whose only variant is Backend::Mlx, and MlxBackend is a zero-sized type, so the enum is zero-sized with no discriminant. select_backend() always returns that one variant with no environment read and no branch, and every Backend method is a single-arm match marked #[inline]. After inlining, select_backend().load_model(p) lowers to a direct call to the existing MLX loader, with codegen identical to the pre-seam build.

The optional non-MLX path lives behind the default-off experimental-backend Cargo feature: the experimental module and the Backend::Experimental variant are cfg-gated, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code, and the feature-on select_backend() (the only place that reads an env switch) is compiled only when the feature is enabled.

Behavior is preserved by construction: the control-plane load call sites (CLI generate and chat, both server model-worker loops, and the offline draft-model load) now call select_backend().load_model* instead of crate::load_model*, and those methods delegate to the unchanged loaders. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available.

Focused tests assert that selection resolves to MLX under default features, that the seam reaches the real MLX loader (it errors on a missing directory rather than returning a backend shim), and, at compile time, that MlxBackend implements ComputeBackend and LoadedModel implements LanguageModel.

…ype tradeoff (#338)

Redraw the compute-backend seam from PR #446 into the inference-session contract ADR 0004 settled on. PR #446 drew the boundary at model load and returned the concrete MLX LoadedModel; that altitude cannot host a graph-compiler backend (FuriosaAI, Tenstorrent, OpenXLA) that never produces an MlxArray. This adds the session layer and moves the MLX CLI path behind it, byte-identical by construction, while leaving the server batched path exactly as it was.

Layered contract: the core single-sequence InferenceSession (mlxcel-core/src/session.rs) is an object-safe, engine-neutral trait that exposes capability advertisement plus the conceptual token-level primitives prefill / decode_step, the shape a future non-MLX backend (issue #449, a separate default-off crate) fills in. The MLX implementation MlxInferenceSession wraps the existing CxxGenerator and delegates every generation method (generate, generate_streaming, generate_streaming_with_embeddings, generate_with_stats, generate_with_stats_and_embeddings, evaluate_loglikelihoods) verbatim, so the exact same decode loop, KV optimizations, and sampling run. The decode loop and CxxGenerator internals are unchanged, so CLI output stays byte-identical. On MLX the prefill / decode_step primitives return a reserved-contract message because the CLI drives the fused generate_* entry points; they are the contract for the compiler-family backend, not the MLX fast path.

ComputeBackend gains create_session (returns a Session) and supports_batched_serving. The MLX backend builds Session::Mlx(MlxInferenceSession) with the same KV mode and token bias the CLI used before; the cfg-gated experimental scaffold returns the same not_implemented error from create_session and reports no batched serving. The retained load_model -> (LoadedModel, MlxcelTokenizer) entry is untouched, so src/server/model_worker.rs and the BatchScheduler keep owning LoadedModel directly. Batched serving is advertised as an MLX backend capability the single-sequence session does not cover yet (the deferred KV / batching abstraction per ADR 0004).

Dispatch folds away under default features: Session is a single-variant enum (Session::Mlx), select_backend folds to the single Backend::Mlx variant, and every Session and Backend method is a single-arm match marked inline, so backend.create_session(...).generate(...) lowers to a direct call into the wrapped CxxGenerator with no runtime indirection added on the hot path. The per-token forward stays inside the session method and KVCache is never type-erased. The experimental-backend module and enum variant stay cfg-gated off, so shipping Apple-Silicon and CUDA builds compile no extra backend code.

CLI call sites rerouted: src/commands/generate.rs (generate_standard and generate_with_embeddings now obtain a Session from the backend instead of constructing CxxGenerator directly) and src/commands/chat.rs (run_chat builds one session for the REPL; stream_turn and the /clear reset drive the session). Same KV mode, same sampling config, same arguments. The offline draft-model load already routes through select_backend().load_model.

Tests: mlxcel-core session_tests assert capability advertisement, token-bias wiring, the object-safe trait bound, and that the MLX step primitives report they are the reserved compiler-backend contract; backend tests assert the MLX backend resolves a session, advertises batched serving, threads the token bias through, and (under the experimental-backend feature) that the scaffold's session creation errors.
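The layered contract in this commit message can be illustrated with a rough sketch. The names `InferenceSession`, `MlxInferenceSession`, `Session`, `prefill`, `decode_step`, and `generate` come from the text; every signature, the `SessionCapabilities` struct and its fields, the `RESERVED` message, and the toy token arithmetic are assumptions, and the real session wraps CxxGenerator rather than the placeholder logic shown here.

```rust
/// Hypothetical capability record; the real fields are not shown in the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SessionCapabilities {
    pub fused_generate: bool,
    pub step_primitives: bool,
}

/// Engine-neutral contract. No generic methods and no by-value `Self`,
/// so it can be used as `dyn InferenceSession`.
pub trait InferenceSession {
    fn capabilities(&self) -> SessionCapabilities;
    fn prefill(&mut self, prompt: &[u32]) -> Result<u32, String>;
    fn decode_step(&mut self, last_token: u32) -> Result<u32, String>;
}

/// Assumed wording for the reserved-contract message.
const RESERVED: &str =
    "reserved contract for compiler-family backends; MLX uses the fused generate path";

/// Stand-in for the MLX session.
pub struct MlxInferenceSession;

impl MlxInferenceSession {
    /// Toy fused loop: returns `max_new` tokens counting up from the last
    /// prompt token (or from 0 for an empty prompt).
    pub fn generate(&mut self, prompt: &[u32], max_new: usize) -> Vec<u32> {
        let start = prompt.last().copied().unwrap_or(0);
        (1..=max_new as u32).map(|i| start.wrapping_add(i)).collect()
    }
}

impl InferenceSession for MlxInferenceSession {
    fn capabilities(&self) -> SessionCapabilities {
        SessionCapabilities { fused_generate: true, step_primitives: false }
    }

    /// On MLX the step primitives only report that they are reserved.
    fn prefill(&mut self, _prompt: &[u32]) -> Result<u32, String> {
        Err(RESERVED.to_string())
    }

    fn decode_step(&mut self, _last_token: u32) -> Result<u32, String> {
        Err(RESERVED.to_string())
    }
}

/// Single-variant enum under default features, so dispatch has one arm.
pub enum Session {
    Mlx(MlxInferenceSession),
}

impl Session {
    #[inline]
    pub fn generate(&mut self, prompt: &[u32], max_new: usize) -> Vec<u32> {
        match self {
            Session::Mlx(s) => s.generate(prompt, max_new),
        }
    }
}
```

In this arrangement the fused `generate` path is reached through the concrete enum, while the trait stays available as the shape a step-driven engine would implement.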

inureyes added type:enhancement New features, capabilities, or significant additions priority:medium Medium priority area:architecture Architecture and code structure changes status:review Under review labels Jun 25, 2026

docs: document the ComputeBackend seam and note the concrete-return-t…

6e7bfa6

…ype tradeoff (#338)

inureyes added status:done Completed and removed status:review Under review labels Jun 25, 2026

inureyes merged commit 6ac38ec into main Jun 25, 2026
5 checks passed

inureyes deleted the feat/issue-338-compute-backend-seam branch June 25, 2026 23:42

inureyes merged 2 commits into main from feat/issue-338-compute-backend-seam
