
Comparing changes

base repository: pytorch/executorch
base: c4dbf62
head repository: pytorch/executorch
compare: 7199184
  • 11 commits
  • 65 files changed
  • 7 contributors

Commits on Apr 21, 2026

  1. Fix float32 indices in test_batched_export_with_backprop

    Differential Revision: D101547370
    
    Pull Request resolved: #18992
    jeanschmidt authored Apr 21, 2026
    78a6689
  2. Add instruction execution limit to prevent infinite loops (#18679)

    JumpFalseCall instructions can set destination_instruction to
    themselves, creating an infinite loop that hangs the runtime. This adds
    a configurable instruction counter (default 10M, overridable via
    -DET_MAX_INSTRUCTIONS) to Method::execute() that returns
    Error::InvalidState if exceeded.
    
    This PR was authored with the assistance of Claude.
    
    
    ### Test plan
    Existing tests
    ```bash
    cmake -B build -DEXECUTORCH_BUILD_TESTS=ON
    cmake --build build --target method_test
    ctest --test-dir build -R method_test --output-on-failure
    ```
    
    Co-authored-by: Github Executorch <github_executorch@arm.com>
    lucylq and Github Executorch authored Apr 21, 2026
    c4983ce
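
The guard is simple to model. Below is an illustrative Python sketch of the control flow only — the actual change is in the C++ `Method::execute()`, and the instruction encoding and names here (`execute`, `MAX_INSTRUCTIONS`) are hypothetical:

```python
# Sketch of an instruction-count guard (assumption: the real code is C++
# in Method::execute(); names and the instruction format are hypothetical).
MAX_INSTRUCTIONS = 10_000_000  # mirrors the 10M default from the PR

def execute(instructions, max_instructions=MAX_INSTRUCTIONS):
    pc = 0        # program counter
    executed = 0  # total instructions executed so far
    while pc < len(instructions):
        executed += 1
        if executed > max_instructions:
            # Limit exceeded: almost certainly an infinite loop, so fail
            # instead of hanging the runtime.
            return "Error::InvalidState"
        op, arg = instructions[pc]
        if op == "jump":   # a jump whose destination is itself loops forever
            pc = arg
        else:
            pc += 1
    return "Ok"

# A JumpFalseCall-style instruction targeting itself:
looping = [("jump", 0)]
print(execute(looping, max_instructions=1000))  # -> Error::InvalidState
```

Without the counter, the `looping` program above never terminates; with it, the runtime returns a recoverable error.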
  3. Fix unchecked map access in xnnpack (#19008)

    Introduce a remapId function that checks for a missing key and returns
    an error, instead of calling std::unordered_map::at(), which throws
    std::out_of_range; a throw inside a noexcept function calls
    std::terminate(). Applied across all ~30 node-definition functions in
    XNNCompiler.
    
    Retake of #18804, for Windows compatibility.
    
    Co-authored-by: Github Executorch <github_executorch@arm.com>
    lucylq and Github Executorch authored Apr 21, 2026
    e281726
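
The shape of the fix is easy to sketch. The real change is C++, where the point is that `at()` throwing inside a `noexcept` function terminates the process; this Python analogue only illustrates the check-and-return-error pattern, and `remap_id` here is a hypothetical stand-in, not the actual function:

```python
def remap_id_unchecked(remap, external_id):
    # Analogue of std::unordered_map::at(): raises on a missing key.
    # In the C++ noexcept context this is fatal (std::terminate()).
    return remap[external_id]

def remap_id(remap, external_id):
    # Analogue of the checked remapId: return an error value instead of
    # raising, so callers can propagate a recoverable error.
    if external_id not in remap:
        return None  # caller treats None as an error and bails out
    return remap[external_id]

remap = {7: 0, 9: 1}
assert remap_id(remap, 9) == 1
assert remap_id(remap, 42) is None  # unknown id: an error, not a crash
```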
  4. Add labeler for ARM path changes (#18606)

    Apply module: arm and ciflow/trunk labels when backends/arm or
    examples/arm files are modified.
    
    Signed-off-by: Sebastian Larsson <sebastian.larsson@arm.com>
    Sebastian-Larsson authored Apr 21, 2026
    ccaf17e
  5. Metal backend: Add gather_qmv kernel for MoE expert-indexed quantized matmul (#18877)
    
    Adds gather_qmv Metal kernel for Mixture-of-Experts: performs per-expert
    quantized matrix-vector multiply y[i] = W[expert_idx[i]] @ x[i]. Extends
    the existing qmv kernels in op_linear_4bit.mm with expert
    index-based pointer offsets, following the same pattern as MLX's
    affine_gather_qmv_fast.
    
    Two dispatch paths (matching op_linear_4bit.mm):
     - gather_qmv_fast: optimized path for K%512==0 and N%8==0
     - gather_qmv_impl: generic fallback for any K and N
    
    Uses the same affine INT4 dequantization format as op_linear_4bit.mm
    (scale * accum + sum * bias). Instantiated for 4-bit with group sizes
    {32, 64, 128} and dtypes {float, bfloat16}.
    
    Includes: Metal shader + C++ host dispatch, Python custom op definition
    (metal::gather_qmv) with reference CPU impl and Meta impl, C shim dict,
    fallback kernel registration, CMakeLists entry, and test module.
    manuelcandales authored Apr 21, 2026
    6875814
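
Setting quantization aside, the expert-indexed matvec the kernel computes can be written as a small dequantized reference (a sketch only; `gather_mv` and its shapes are assumptions, not the custom op's actual signature):

```python
def gather_mv(W, expert_idx, X):
    # y[i] = W[expert_idx[i]] @ x[i]: each token i selects its expert's
    # weight matrix (N x K) and multiplies it with its own activation (K,).
    out = []
    for i, x in enumerate(X):
        w = W[expert_idx[i]]  # (N, K) weights for this token's expert
        out.append([sum(w_row[k] * x[k] for k in range(len(x)))
                    for w_row in w])
    return out

W = [
    [[1, 0], [0, 1]],   # expert 0: identity
    [[2, 0], [0, 2]],   # expert 1: doubles its input
]
X = [[3, 4], [3, 4]]
print(gather_mv(W, [0, 1], X))  # -> [[3, 4], [6, 8]]
```

The Metal kernel does the same per-token gather, but over INT4-quantized weights with the affine dequantization described above.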
  6. Relax SQNR check for mv2 (#19030)

    Summary: The check has been failing, and other quantized models use a
    threshold of 15, so that seems fine.
    
    Differential Revision: D101837680
    JacobSzwejbka authored Apr 21, 2026
    2fce946
  7. Metal backend: Add gated delta rule kernel for linear attention (#18878)

    Adds a Metal kernel for the gated delta rule recurrence used by Qwen
    3.5 MoE's GatedDeltaNet linear attention layers. Ported from the Metal
    shader in the MLX delegate PR (#18785). The kernel processes the full
    sequence sequentially within a single GPU dispatch, keeping recurrent
    state in per-thread registers.
    
    Grid: [32, Dv, B*Hv], Threadgroup: [32, 4, 1]. Each simdgroup of 32
    threads handles Dk/32 elements of the key dimension with SIMD reduction
    for dot products.
    
    The op mutates the recurrent state buffer in-place (mutates_args).
    Instantiated for both real model (Dk=128, Dv=128, Hk=32, Hv=32) and
    tiny test (Dk=64, Dv=64, Hk=4, Hv=4) dimensions.
    
    Includes: Metal shader + C++ host dispatch, Python custom op definition
    (metal::gated_delta_rule) with reference CPU impl and Meta impl, C shim
    dict, fallback kernel registration, CMakeLists entry, and test module.
    manuelcandales authored Apr 21, 2026
    d408a10
  8. Qwen 3.5 MoE: Add Metal source transformations (#18879)

    Adds metal_source_transformations.py with module replacements for Metal:
    - FusedMoEExperts -> MetalMoEExperts (two metal::gather_qmv calls with
      SiLU gating, replacing torch.ops.triton.fused_moe)
    - GatedDeltaNet -> metal::gated_delta_rule custom op (replaces both the
      T=1 native path and T>1 Triton kernel)
    - FullAttention -> removes turboquant codepath, keeps standard SDPA
    - SparseMoE -> removes .float() cast on expert_weights
    
    Also includes quantize_experts_metal() which quantizes expert weights to
    MLX affine INT4 format (unsigned uint4 with scale + bias per group),
    compatible with the Metal gather_qmv kernel.
    manuelcandales authored Apr 21, 2026
    9600f63
  9. Qwen 3.5 MoE: Add --backend metal export path (#18880)

    Adds Metal backend support to export.py via --backend metal flag:
    - _prepare_and_quantize_metal: applies source transforms, quantizes
      experts to MLX affine INT4, quantizes non-expert layers with fpa4w
      (skips shared_expert_gate with N<4 for prefill compatibility)
    - _export_metal: exports decode + prefill methods via MetalBackend/
      MetalPartitioner
    
    CUDA and MLX paths are unchanged.
    manuelcandales authored Apr 21, 2026
    799bf5a
  10. Move permute optimization passes to shared transforms location (#19002)

    Summary:
    Pull Request resolved: #19002
    
    Move 6 permute optimization passes and their shared infrastructure from
    executorch/backends/cadence/aot/ to executorch/backends/transforms/ so
    they can be shared between the Cadence and Arm backends without a
    cross-backend dependency.
    
    New files:
    - permute_pass_utils.py: base classes (HierarchicalInplacePassInterface,
      RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and
      utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims,
      get_shape, get_edge_overload_packet)
    - fuse_cascaded_transpose_or_permute_ops.py
    - fuse_cascaded_view_ops.py
    - fuse_transpose_or_permute_op_pairs_pass.py
    - remove_permutes_around_elementwise_ops.py
    - postpone_permute_below_squeeze_view.py
    - replace_nop_transpose_or_permute_with_view.py
    
    The shared versions omit register_cadence_pass decorators and
    cadence-specific ops from default op sets. Cadence files will subclass
    these and re-add the decorators and ops.
    
    Added OSS tests (test_permute_optimization_passes.py) for the 4 passes
    that can be imported without quantized op registration:
    FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps,
    PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and
    ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via
    pytest and are discovered automatically through pytest.ini testpaths.
    
    Differential Revision: D101459577
    
    Reviewed By: ethansfng
    mcremon-meta authored and meta-codesync[bot] committed Apr 21, 2026
    738ac7e
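
The algebra behind FuseCascadedTransposeOrPermuteOps is that two back-to-back permutes compose into one. A minimal shape-level sketch of the composition rule (`fuse_permutes` is a hypothetical name, not the pass's API):

```python
def permute(shape, dims):
    # Shape-level permute: output axis i takes input axis dims[i].
    return [shape[d] for d in dims]

def fuse_permutes(first, second):
    # permute(permute(x, first), second) == permute(x, fused), where
    # output axis i of the fused permute reads input axis first[second[i]].
    return [first[d] for d in second]

shape = [2, 3, 4, 5]
a = permute(permute(shape, [1, 0, 2, 3]), [0, 1, 3, 2])  # two permutes
b = permute(shape, fuse_permutes([1, 0, 2, 3], [0, 1, 3, 2]))  # one fused
assert a == b == [3, 2, 5, 4]
```

When the fused dims come out as the identity, the pair cancels entirely, which is what makes the explicit NCHW/NHWC permute insertion in the Arm change below cheap in practice.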
  11. Replace tosa_dim_order with explicit NCHW↔NHWC permutes

    Summary:
    Replace implicit `tosa_dim_order`-based layout handling with explicit
    `permute_copy` ops around TOSA operators that require NHWC layout.
    
    ### Rewrite passes insert explicit NCHW↔NHWC permutes
    
    `RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass`
    now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op,
    NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for
    layout conversion. This makes layout transitions visible in the graph.
    
    ### Grouped conv decomposition in NHWC
    
    `RewriteConvPass` decomposes grouped convolutions (non-depthwise) into
    per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single
    input/output permute pair wrapping the whole group. Supports INT8,
    INT16 (with and without bias) quantisation paths, including the full
    INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
    RESCALE(INT32→INT16).
    
    ### `ToTosaMemoryFormatPass` scoped down
    
    Now only assigns non-identity dim_order to parameter/buffer
    placeholders (for weight serialisation) and graph I/O. Inserts
    `permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already
    carry a matching permute (inserted by the rewrite passes).
    
    ### TOSA dialect op metas expect NHWC
    
    All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`,
    `AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC
    input layout and produce NHWC output shapes.
    
    ### Removed `tosa_dim_order` shape remapping
    
    `tosa_shape()` no longer reorders dimensions—just resolves symints.
    `_get_matching_fake_tensor()` returns `node.meta["val"]` directly.
    Serialisation mapping always uses identity dim_order.
    
    ### Operator serialisation simplified
    
    `op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute`
    no longer remap reduction/concat axes through `dim_order` since
    tensors are already in the layout expected by TOSA.
    
    ### Permute optimisation passes added
    
    Six shared passes from `executorch/backends/transforms/` are now run
    after TOSA lowering to fuse, cancel, and simplify the permutes
    introduced above:
    - `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
    - `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
    - `ReplaceNopTransposeOrPermuteWithViewPass`
    - `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
    - `FuseCascadedTransposeOrPermuteOps`
    - `FuseCascadedViewOps`
    
    ### Removed passes
    
    `DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass`
    are removed—their logic is now handled inline by `RewriteConvPass`.
    `RewriteSlicePass` is repositioned after the permute optimisations.
    
    ### Ethos-U55 partitioner simplified
    
    The dual NCHW/NHWC permute constraint check is removed since tensors
    are always in the expected layout at partition time.
    
    Differential Revision: D100712787
    mcremon-meta authored and meta-codesync[bot] committed Apr 21, 2026
    7199184
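
The permute pairs the rewrite passes insert are plain data movements that the later optimization passes can fuse or cancel. A sketch of `permute_copy` semantics on nested lists (`permute4` is a hypothetical helper, not ExecuTorch code), showing that the NCHW→NHWC (dims 0,2,3,1) then NHWC→NCHW (dims 0,3,1,2) round trip is the identity:

```python
from itertools import product

def permute4(x, dims):
    # permute_copy for a 4-D nested list: output axis k takes input axis
    # dims[k], so the input index along axis dims[k] equals the output
    # index along axis k.
    s = [len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])]
    out_s = [s[d] for d in dims]
    out = [[[[None] * out_s[3] for _ in range(out_s[2])]
            for _ in range(out_s[1])] for _ in range(out_s[0])]
    for o in product(*(range(n) for n in out_s)):
        i = [0] * 4
        for k in range(4):
            i[dims[k]] = o[k]
        out[o[0]][o[1]][o[2]][o[3]] = x[i[0]][i[1]][i[2]][i[3]]
    return out

# Tiny NCHW tensor: N=1, C=2, H=2, W=2
x = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
nhwc = permute4(x, (0, 2, 3, 1))            # NCHW -> NHWC
assert permute4(nhwc, (0, 3, 1, 2)) == x    # NHWC -> NCHW: round trip
```

Adjacent pairs like this compose to the identity permutation, which is exactly what FuseCascadedTransposeOrPermuteOps and ReplaceNopTransposeOrPermuteWithViewPass eliminate after lowering.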