feat: introduce a ComputeBackend seam to abstract the forward-execution engine #446
Merged
…on engine

Add a backend boundary so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This change lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

The new src/backend module defines a ComputeBackend trait that abstracts who executes forward(), not individual ops. The seam sits at the model-load boundary (once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile are untouched and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points so the MLX path keeps its concrete hot types (paged KV, prompt-cache detach and adopt, cache tensors) exposed. MlxBackend (src/backend/mlx.rs) implements the trait by delegating unchanged to the existing crate::loading entry points, so the same loading and forward code runs whether reached directly or through the seam.

Selection folds away under default features. Backend is an enum whose only variant is Backend::Mlx, and MlxBackend is a zero-sized type, so the enum is zero-sized with no discriminant. select_backend() always returns that one variant with no environment read and no branch, and every Backend method is a single-arm match marked #[inline]. After inlining, select_backend().load_model(p) lowers to a direct call to the existing MLX loader, identical codegen to the pre-seam build. The optional non-MLX path lives behind the default-off experimental-backend Cargo feature: the experimental module and the Backend::Experimental variant are cfg-gated, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code, and the feature-on select_backend() (the only place that reads an env switch) is compiled only when the feature is enabled.
Behavior is preserved by construction: the control-plane load call sites (CLI generate and chat, both server model-worker loops, and the offline draft-model load) now call select_backend().load_model* instead of crate::load_model*, and those methods delegate to the unchanged loaders. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available. Focused tests assert that selection resolves to MLX under default features, that the seam reaches the real MLX loader (the call errors on a missing directory instead of stopping at a backend shim), and, at compile time, that MlxBackend implements ComputeBackend and LoadedModel implements LanguageModel.
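The zero-sized dispatch described above can be illustrated with a minimal, self-contained sketch. Only the names ComputeBackend, MlxBackend, Backend, and select_backend come from this PR; the LoadedModel stand-in, the existing_load_model function, the error type, and the method signature are assumptions made for illustration and are not the crate's real API.

```rust
use std::mem::size_of;
use std::path::Path;

/// Hypothetical stand-in for the crate's loaded-model type.
#[derive(Debug)]
pub struct LoadedModel {
    pub source: String,
}

/// Hypothetical stand-in for the existing loader; the body is invented.
fn existing_load_model(path: &Path) -> Result<LoadedModel, String> {
    if path.is_dir() {
        Ok(LoadedModel { source: path.display().to_string() })
    } else {
        Err(format!("model directory not found: {}", path.display()))
    }
}

/// The seam: abstracts who loads (and later executes) the model.
/// The signature is an assumption, not the real trait definition.
pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
}

/// Zero-sized backend that only delegates to the existing loader.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    #[inline]
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        existing_load_model(path)
    }
}

/// Single-variant enum; a feature-gated second variant is omitted here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Mlx(MlxBackend),
}

impl Backend {
    /// Single-arm match: there is nothing to branch on.
    #[inline]
    pub fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        match self {
            Backend::Mlx(backend) => backend.load_model(path),
        }
    }
}

/// Constant constructor: no environment read, no branch.
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}

fn main() {
    // One variant wrapping a zero-sized type needs no discriminant.
    assert_eq!(size_of::<MlxBackend>(), 0);
    assert_eq!(size_of::<Backend>(), 0);

    // The seam reaches the underlying loader, which rejects a missing directory.
    let result = select_backend().load_model(Path::new("/definitely/not/a/model/dir"));
    assert!(result.is_err());
}
```

The size assertions show why there is nothing left to dispatch on at run time; whether the optimizer then emits exactly the pre-seam call depends on the build profile, and the sketch does not verify that.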
This was referenced Jun 26, 2026
inureyes added a commit that referenced this pull request Jun 26, 2026
Redraw the compute-backend seam from PR #446 into the inference-session contract ADR 0004 settled on. PR #446 drew the boundary at model load and returned the concrete MLX LoadedModel; that altitude cannot host a graph-compiler backend (FuriosaAI, Tenstorrent, OpenXLA) that never produces an MlxArray. This adds the session layer and moves the MLX CLI path behind it, byte-identical by construction, while leaving the server batched path exactly as it was.

Layered contract: the core single-sequence InferenceSession (mlxcel-core/src/session.rs) is an object-safe, engine-neutral trait that exposes capability advertisement plus the conceptual token-level primitives prefill / decode_step, the shape a future non-MLX backend (issue #449, a separate default-off crate) fills in. The MLX implementation MlxInferenceSession wraps the existing CxxGenerator and delegates every generation method (generate, generate_streaming, generate_streaming_with_embeddings, generate_with_stats, generate_with_stats_and_embeddings, evaluate_loglikelihoods) verbatim, so the exact same decode loop, KV optimizations, and sampling run. The decode loop and CxxGenerator internals are unchanged, so CLI output stays byte-identical. On MLX the prefill / decode_step primitives return a reserved-contract message because the CLI drives the fused generate_* entry points; they are the contract for the compiler-family backend, not the MLX fast path.

ComputeBackend gains create_session (returns a Session) and supports_batched_serving. The MLX backend builds Session::Mlx(MlxInferenceSession) with the same KV mode and token bias the CLI used before; the cfg-gated experimental scaffold returns the same not_implemented error from create_session and reports no batched serving. The retained load_model -> (LoadedModel, MlxcelTokenizer) entry is untouched, so src/server/model_worker.rs and the BatchScheduler keep owning LoadedModel directly. Batched serving is advertised as an MLX backend capability the single-sequence session does not cover yet (the deferred KV / batching abstraction per ADR 0004).

Dispatch folds away under default features: Session is a single-variant enum (Session::Mlx), select_backend folds to the single Backend::Mlx variant, and every Session and Backend method is a single-arm match marked inline, so backend.create_session(...).generate(...) lowers to a direct call into the wrapped CxxGenerator with no runtime indirection added on the hot path. The per-token forward stays inside the session method and KVCache is never type-erased. The experimental-backend module and enum variant stay cfg-gated off, so shipping Apple-Silicon and CUDA builds compile no extra backend code.

CLI call sites rerouted: src/commands/generate.rs (generate_standard and generate_with_embeddings now obtain a Session from the backend instead of constructing CxxGenerator directly) and src/commands/chat.rs (run_chat builds one session for the REPL; stream_turn and the /clear reset drive the session). Same KV mode, same sampling config, same arguments. The offline draft-model load already routes through select_backend().load_model.

Tests: mlxcel-core session_tests assert capability advertisement, token-bias wiring, the object-safe trait bound, and that the MLX step primitives report they are the reserved compiler-backend contract; backend tests assert the MLX backend resolves a session, advertises batched serving, threads the token bias through, and (under the experimental-backend feature) that the scaffold's session creation errors.
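The layered session contract in this commit message can be sketched roughly as follows. The names InferenceSession, MlxInferenceSession, Session, prefill, decode_step, and generate come from the message; everything else (the Capabilities fields, the token and error types, the method signatures, and the ToyGenerator that stands in for CxxGenerator) is hypothetical and chosen only to keep the example self-contained.

```rust
/// Hypothetical capability record; the real fields are not given in the source.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Capabilities {
    pub token_level_steps: bool,
}

/// Hypothetical wording for the reserved-contract message.
const RESERVED: &str = "prefill/decode_step are reserved for compiler-family backends";

/// Object-safe, engine-neutral session trait (signatures are assumptions).
pub trait InferenceSession {
    fn capabilities(&self) -> Capabilities;
    fn prefill(&mut self, prompt: &[u32]) -> Result<(), String>;
    fn decode_step(&mut self) -> Result<u32, String>;
    fn generate(&mut self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32>;
}

/// Toy stand-in for the fused generator: it just counts up from the last prompt token.
struct ToyGenerator;

impl ToyGenerator {
    fn generate(&mut self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
        let mut last = prompt.last().copied().unwrap_or(0);
        (0..max_new_tokens)
            .map(|_| {
                last = last.wrapping_add(1);
                last
            })
            .collect()
    }
}

/// MLX-side session: delegates generation, reserves the step primitives.
pub struct MlxInferenceSession {
    generator: ToyGenerator,
}

impl MlxInferenceSession {
    pub fn new() -> Self {
        Self { generator: ToyGenerator }
    }
}

impl InferenceSession for MlxInferenceSession {
    fn capabilities(&self) -> Capabilities {
        Capabilities { token_level_steps: false }
    }
    fn prefill(&mut self, _prompt: &[u32]) -> Result<(), String> {
        Err(RESERVED.to_string())
    }
    fn decode_step(&mut self) -> Result<u32, String> {
        Err(RESERVED.to_string())
    }
    #[inline]
    fn generate(&mut self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
        self.generator.generate(prompt, max_new_tokens)
    }
}

/// Single-variant enum, so calls resolve statically rather than through a vtable.
pub enum Session {
    Mlx(MlxInferenceSession),
}

impl Session {
    #[inline]
    pub fn generate(&mut self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
        match self {
            Session::Mlx(session) => session.generate(prompt, max_new_tokens),
        }
    }
}

fn main() {
    let mut session = Session::Mlx(MlxInferenceSession::new());
    assert_eq!(session.generate(&[1, 2, 3], 2), vec![4, 5]);

    // Object safety: the trait can still be used behind `dyn` by another engine.
    let mut mlx = MlxInferenceSession::new();
    let erased: &mut dyn InferenceSession = &mut mlx;
    assert!(erased.decode_step().is_err());
}
```

In this sketch the enum keeps the MLX path statically dispatched, while the trait stays usable as a trait object for an engine that would implement prefill and decode_step for real.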
Summary
Introduce a `ComputeBackend` seam so a future non-MLX execution engine can host `forward()` without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose `furiosa-opt` toolchain compiles to a virtual ISA and cannot use MLX at all. This PR lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

Seam design
`ComputeBackend` (new `src/backend/mod.rs`) abstracts who executes `forward()`, not individual ops. It sits at the model-load boundary (called once per load), not at the per-token `forward()` call, so MLX graph fusion and `mx.compile` stay intact and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points (`load_model`, `load_model_with_adapter`, `load_model_with_tensor_parallel`) so the MLX path keeps its concrete hot types exposed: paged KV, prompt-cache detach and adopt, and cache tensors are never type-erased behind `Box<dyn>`. A non-MLX engine implements the same `LanguageModel` forward contract behind a different backend.

What moved behind MlxBackend
`MlxBackend` (`src/backend/mlx.rs`) is a zero-sized type that implements `ComputeBackend` by delegating unchanged to the existing `crate::loading` entry points. No loading logic is reimplemented; the same code runs whether reached directly or through the seam. The public `load_model` / `load_model_with_adapter` / `load_model_with_tensor_parallel` functions remain available (tests and bench harnesses still use them).

Feature flag
A default-off `experimental-backend` Cargo feature gates the optional non-MLX path. The `src/backend/experimental.rs` scaffold module and the `Backend::Experimental` enum variant are `cfg`-gated behind it, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code. The scaffold reports "not implemented" rather than pretending to load; a real engine (and any hardware-feasibility gate) is future work. The feature-on `select_backend()` is the only place that reads a runtime backend switch (`MLXCEL_BACKEND`), and it is compiled only when the feature is enabled.

How codegen equivalence / no dispatch is guaranteed when the feature is off
`Backend` is an enum whose only variant under default features is `Backend::Mlx`, and `MlxBackend` is zero-sized, so `Backend` is itself zero-sized with no discriminant. `select_backend()` (`#[inline]`, `#[must_use]`) is a constant constructor that always returns that one variant with no environment read and no branch. Every `Backend` method is a single-arm `match` marked `#[inline]`. After inlining (release builds use fat LTO and `codegen-units = 1`), `select_backend().load_model(p)` lowers to a direct call to the existing MLX loader, identical to the pre-seam build. The env-reading selection path and the second enum variant only exist under `#[cfg(feature = "experimental-backend")]`, so they are absent from default codegen entirely.

How behavior preservation is ensured
Behavior is preserved by construction: the control-plane load call sites now go through the seam, and the seam methods delegate to the unchanged loaders, so the same loading and forward code runs. Rerouted call sites: `src/commands/generate.rs` (primary load: tensor-parallel / adapter / plain, plus the offline draft-model load), `src/commands/chat.rs` (REPL load), and `src/server/model_worker.rs` (both the batched and the legacy sequential worker loops, each covering tensor-parallel / adapter / plain). The pipeline-parallel branch keeps its own distributed loader and does not go through the seam.

What changed
- `src/backend/mod.rs` (new): `ComputeBackend` trait, `Backend` enum, `select_backend()`.
- `src/backend/mlx.rs` (new): `MlxBackend`, delegates to `crate::loading`.
- `src/backend/experimental.rs` (new, `cfg`-gated): scaffold plug-in slot for a non-MLX engine.
- `src/backend/tests.rs` (new): scoped seam tests.
- `src/lib.rs`: `pub mod backend;` and re-export of `Backend`, `ComputeBackend`, `MlxBackend`, `select_backend`.
- `Cargo.toml`: default-off `experimental-backend` feature.
- `src/commands/generate.rs`, `src/commands/chat.rs`, `src/server/model_worker.rs`: route loads through `select_backend()`.

Test plan
- `cargo check --lib --features metal,accelerate`
- `cargo check --bins --features metal,accelerate`
- `cargo check --lib --features metal,accelerate,experimental-backend` (gated module compiles)
- `cargo clippy --lib --tests --features metal,accelerate -- -D warnings`
- `cargo clippy --lib --features metal,accelerate,experimental-backend -- -D warnings`
- `cargo test --lib backend:: --features metal,accelerate` (3 tests pass)
- `cargo fmt --check`

Closes #338