Comparing changes

base repository: pytorch/executorch, base: c4dbf62
head repository: pytorch/executorch, compare: 7199184

- 11 commits
- 65 files changed
- 7 contributors
Commits on Apr 21, 2026
- Fix float32 indices in test_batched_export_with_backprop (commit 78a6689)
Differential Revision: D101547370. Pull Request resolved: #18992.
- Add instruction execution limit to prevent infinite loops (#18679) (commit c4983ce)
JumpFalseCall instructions can set destination_instruction to themselves, creating an infinite loop that hangs the runtime. This adds a configurable instruction counter (default 10M, overridable via -DET_MAX_INSTRUCTIONS) to Method::execute() that returns Error::InvalidState if the limit is exceeded. This PR was authored with the assistance of Claude.

Test plan: existing tests.

```bash
cmake -B build -DEXECUTORCH_BUILD_TESTS=ON
cmake --build build --target method_test
ctest --test-dir build -R method_test --output-on-failure
```

Co-authored-by: Github Executorch <github_executorch@arm.com>
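The guard is a bounded dispatch loop. A minimal Python sketch of the pattern for illustration only (the actual change is in the C++ Method::execute(); the class, instruction representation, and return values here are hypothetical):

```python
class Method:  # hypothetical stand-in for the C++ Method class
    MAX_INSTRUCTIONS = 10_000_000  # mirrors the -DET_MAX_INSTRUCTIONS default

    def execute(self, instructions):
        """Count dispatched instructions and bail out with an error instead
        of hanging when a jump (e.g. a self-targeting JumpFalseCall) loops."""
        pc, executed = 0, 0
        while pc < len(instructions):
            executed += 1
            if executed > self.MAX_INSTRUCTIONS:
                return "Error::InvalidState"  # runaway program detected
            pc = instructions[pc](pc)         # each instruction returns the next pc
        return "Error::Ok"
```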
- Fix unchecked map access in xnnpack (#19008) (commit e281726)
Introduces a remapId function that returns an error on a missing key, replacing std::unordered_map::at(), which throws std::out_of_range; in noexcept functions the uncaught exception invokes std::terminate(). Applied across all ~30 node-definition functions in XNNCompiler. Retake of #18804, for Windows compatibility.

Co-authored-by: Github Executorch <github_executorch@arm.com>
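The fix follows the check-don't-throw lookup pattern. A hedged Python analogue (the real remapId is C++ and returns an ExecuTorch Error; the names and error value below are hypothetical):

```python
def remap_id(remap: dict, old_id: int):
    """Look up without throwing, so callers in contexts that cannot
    propagate exceptions receive an error value instead of crashing."""
    new_id = remap.get(old_id)
    if new_id is None:
        return None, "Error::InvalidArgument"  # hypothetical error code
    return new_id, None
```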
- Add labeler for ARM path changes (#18606) (commit ccaf17e)
Applies the module: arm and ciflow/trunk labels when files under backends/arm or examples/arm are modified.

Signed-off-by: Sebastian Larsson <sebastian.larsson@arm.com>
- Metal backend: Add gather_qmv kernel for MoE expert-indexed quantized matmul (#18877) (commit 6875814)
Adds a gather_qmv Metal kernel for Mixture-of-Experts: performs a per-expert quantized matrix-vector multiply, y[i] = W[expert_idx[i]] @ x[i]. Extends the existing qmv kernels in op_linear_4bit.mm with expert index-based pointer offsets, following the same pattern as MLX's affine_gather_qmv_fast.

Two dispatch paths (matching op_linear_4bit.mm):
- gather_qmv_fast: optimized path for K % 512 == 0 and N % 8 == 0
- gather_qmv_impl: generic fallback for any K and N

Uses the same affine INT4 dequantization format as op_linear_4bit.mm (scale * accum + sum * bias). Instantiated for 4-bit with group sizes {32, 64, 128} and dtypes {float, bfloat16}.

Includes: Metal shader + C++ host dispatch, Python custom op definition (metal::gather_qmv) with reference CPU impl and Meta impl, C shim dict, fallback kernel registration, CMakeLists entry, and test module.
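The commit ships its own Python reference CPU impl; independently of that, here is a hedged sketch of the per-expert math, where the tensor layouts, names, and helper are assumptions rather than taken from the source:

```python
import torch

def gather_qmv_sketch(W_q, scales, biases, x, expert_idx, group_size=64):
    """y[i] = W[expert_idx[i]] @ x[i] with per-group affine INT4 weights.

    W_q: [E, N, K] codes in [0, 15]; scales, biases: [E, N, K // group_size];
    x: [T, K]; expert_idx: [T] long. Per group, sum((scale*q + bias) * x)
    equals scale * accum + bias * sum(x), the identity the kernel exploits.
    """
    E, N, K = W_q.shape
    # Affine dequantization, applied per group along K: w = scale * q + bias.
    q = W_q.float().reshape(E, N, K // group_size, group_size)
    W = (q * scales.unsqueeze(-1) + biases.unsqueeze(-1)).reshape(E, N, K)
    # Gather each token's expert weight matrix, then batched matrix-vector.
    return torch.einsum("tnk,tk->tn", W[expert_idx], x)
```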
- Relax SQNR check for mv2 (#19030) (commit 2fce946)
Summary: The check has been failing, and other quantized models use a threshold of 15, so this seems fine.

Differential Revision: D101837680
- Metal backend: Add gated delta rule kernel for linear attention (#18878) (commit d408a10)
Adds a Metal kernel for the gated delta rule recurrence used by Qwen 3.5 MoE's GatedDeltaNet linear attention layers, ported from the MLX delegate PR (#18785) Metal shader.

The kernel processes the full sequence sequentially within a single GPU dispatch, keeping recurrent state in per-thread registers. Grid: [32, Dv, B*Hv]; threadgroup: [32, 4, 1]. Each simdgroup of 32 threads handles Dk/32 elements of the key dimension, with SIMD reduction for dot products. The op mutates the recurrent state buffer in place (mutates_args). Instantiated for both the real model (Dk=128, Dv=128, Hk=32, Hv=32) and tiny test (Dk=64, Dv=64, Hk=4, Hv=4) dimensions.

Includes: Metal shader + C++ host dispatch, Python custom op definition (metal::gated_delta_rule) with reference CPU impl and Meta impl, C shim dict, fallback kernel registration, CMakeLists entry, and test module.
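For orientation, a hedged Python sketch of one common gated-delta-rule formulation from the linear-attention literature, S_t = a_t * S_{t-1} (I - b_t k_t k_t^T) + b_t v_t k_t^T with o_t = S_t q_t; the actual metal::gated_delta_rule op may differ in normalization, gating shapes, and argument order, and all names below are assumptions:

```python
import torch

def gated_delta_rule_sketch(q, k, v, alpha, beta, state):
    """Sequential recurrence over time, mutating `state` in place to mirror
    the kernel's mutates_args behavior.

    q, k: [T, Dk]; v: [T, Dv]; alpha, beta: [T]; state: [Dv, Dk].
    """
    outputs = []
    for t in range(q.shape[0]):
        kt = k[t]                                    # [Dk]
        # Gated decay of the state, then the delta-rule correction:
        # S <- alpha*S + beta * (v - (alpha*S) k) k^T
        state.mul_(alpha[t])
        state.add_(beta[t] * torch.outer(v[t] - state @ kt, kt))
        outputs.append(state @ q[t])                 # o_t = S_t q_t, shape [Dv]
    return torch.stack(outputs)                      # [T, Dv]
```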
- Qwen 3.5 MoE: Add Metal source transformations (#18879) (commit 9600f63)
Adds metal_source_transformations.py with module replacements for Metal:
- FusedMoEExperts -> MetalMoEExperts (two metal::gather_qmv calls with SiLU gating, replacing torch.ops.triton.fused_moe)
- GatedDeltaNet -> the metal::gated_delta_rule custom op (replaces both the T=1 native path and the T>1 Triton kernel)
- FullAttention -> removes the turboquant codepath, keeps standard SDPA
- SparseMoE -> removes the .float() cast on expert_weights

Also includes quantize_experts_metal(), which quantizes expert weights to MLX affine INT4 format (unsigned uint4 with scale + bias per group), compatible with the Metal gather_qmv kernel.
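As an aside on the storage format mentioned above, here is a hedged sketch of per-group affine INT4 quantization into (codes, scale, bias); quantize_experts_metal's actual rounding, packing, and layout may differ, and the helper name is hypothetical:

```python
import torch

def quantize_affine_int4_sketch(w, group_size=64):
    """Quantize [N, K] float weights per group of K into unsigned 4-bit
    codes with an affine (scale, bias) per group: w ~= scale * q + bias."""
    N, K = w.shape
    g = w.reshape(N, K // group_size, group_size)
    lo = g.amin(dim=-1, keepdim=True)
    hi = g.amax(dim=-1, keepdim=True)
    scale = (hi - lo) / 15.0                       # 4-bit: codes in [0, 15]
    bias = lo                                      # affine offset per group
    q = torch.clamp(torch.round((g - bias) / scale.clamp(min=1e-8)), 0, 15)
    return q.to(torch.uint8).reshape(N, K), scale.squeeze(-1), bias.squeeze(-1)
```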
- Qwen 3.5 MoE: Add --backend metal export path (#18880) (commit 799bf5a)
Adds Metal backend support to export.py via the --backend metal flag:
- _prepare_and_quantize_metal: applies source transforms, quantizes experts to MLX affine INT4, and quantizes non-expert layers with fpa4w (skipping shared_expert_gate with N < 4 for prefill compatibility)
- _export_metal: exports decode + prefill methods via MetalBackend/MetalPartitioner

CUDA and MLX paths are unchanged.
- Move permute optimization passes to shared transforms location (#19002) (commit 738ac7e)
Summary: Pull Request resolved: #19002. Moves six permute optimization passes and their shared infrastructure from executorch/backends/cadence/aot/ to executorch/backends/transforms/ so they can be shared between the Cadence and Arm backends without a cross-backend dependency.

New files:
- permute_pass_utils.py: base classes (HierarchicalInplacePassInterface, RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims, get_shape, get_edge_overload_packet)
- fuse_cascaded_transpose_or_permute_ops.py
- fuse_cascaded_view_ops.py
- fuse_transpose_or_permute_op_pairs_pass.py
- remove_permutes_around_elementwise_ops.py
- postpone_permute_below_squeeze_view.py
- replace_nop_transpose_or_permute_with_view.py

The shared versions omit the register_cadence_pass decorators and Cadence-specific ops from the default op sets; the Cadence files will subclass these and re-add the decorators and ops.

Added OSS tests (test_permute_optimization_passes.py) for the four passes that can be imported without quantized op registration: FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps, PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via pytest and are discovered automatically through pytest.ini testpaths.

Differential Revision: D101459577. Reviewed By: ethansfng
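The core idea behind fusing cascaded permutes is plain permutation composition. A small sketch (the helper name is hypothetical; the real pass rewrites FX graph nodes rather than computing on lists):

```python
def fuse_cascaded_permutes(perm1, perm2):
    """permute(permute(x, perm1), perm2) == permute(x, composed), because
    the second permute indexes the dims of the first: composed[i] = perm1[perm2[i]]."""
    return [perm1[i] for i in perm2]

# Example: NCHW -> NHWC followed by NHWC -> NCHW composes to the identity,
# so the pair can be removed (or replaced with a view).
nchw_to_nhwc = [0, 2, 3, 1]
nhwc_to_nchw = [0, 3, 1, 2]
assert fuse_cascaded_permutes(nchw_to_nhwc, nhwc_to_nchw) == [0, 1, 2, 3]
```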
- Replace tosa_dim_order with explicit NCHW↔NHWC permutes (commit 7199184)
Summary: Replace implicit tosa_dim_order-based layout handling with explicit permute_copy ops around TOSA operators that require NHWC layout.

Rewrite passes insert explicit NCHW↔NHWC permutes: RewriteConvPass, RewriteAvgPool2dPass, and RewriteMaxPool2dPass now insert aten.permute_copy nodes (NCHW→NHWC before the TOSA op, NHWC→NCHW after) instead of relying on ToTosaMemoryFormatPass for layout conversion. This makes layout transitions visible in the graph.

Grouped conv decomposition in NHWC: RewriteConvPass decomposes grouped convolutions (non-depthwise) into per-group TOSA.CONV2D ops operating entirely in NHWC, with a single input/output permute pair wrapping the whole group. Supports INT8 and INT16 (with and without bias) quantization paths, including the full INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) → RESCALE(INT32→INT16).

ToTosaMemoryFormatPass scoped down: it now only assigns non-identity dim_order to parameter/buffer placeholders (for weight serialization) and graph I/O, inserts permute_copy instead of tosa.TRANSPOSE, and skips users that already carry a matching permute (inserted by the rewrite passes).

TOSA dialect op metas expect NHWC: all TOSA op meta functions (CONV2D, CONV3D, DEPTHWISE_CONV2D, AVG_POOL2D, MAX_POOL2D, TRANSPOSE_CONV2D) now assume NHWC input layout and produce NHWC output shapes.

Removed tosa_dim_order shape remapping: tosa_shape() no longer reorders dimensions; it only resolves symints. _get_matching_fake_tensor() returns node.meta["val"] directly. The serialization mapping always uses identity dim_order.

Operator serialization simplified: op_amax, op_amin, op_any, op_cat, op_sum, and op_permute no longer remap reduction/concat axes through dim_order, since tensors are already in the layout expected by TOSA.

Permute optimization passes added: six shared passes from executorch/backends/transforms/ now run after TOSA lowering to fuse, cancel, and simplify the permutes introduced above:
- RemovePermutesAroundElementwiseOps (extended for RESCALE)
- FuseTransposeOrPermuteOpPairsPass (extended for RESCALE)
- ReplaceNopTransposeOrPermuteWithViewPass
- PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView
- FuseCascadedTransposeOrPermuteOps
- FuseCascadedViewOps

Removed passes: DecomposeConvWithInt16ActivationPass and DecomposeGroupedConvPass are removed; their logic is now handled inline by RewriteConvPass. RewriteSlicePass is repositioned after the permute optimizations.

Ethos-U55 partitioner simplified: the dual NCHW/NHWC permute constraint check is removed, since tensors are always in the expected layout at partition time.

Differential Revision: D100712787
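To illustrate why the inserted permute pairs are semantics-preserving, a hedged sketch using max_pool2d as a stand-in for an NHWC-only TOSA op (the real passes insert aten.permute_copy nodes into the FX graph rather than calling tensor ops directly):

```python
import torch

def nhwc_max_pool2d(x_nhwc):
    # Stand-in for an NHWC-native op like TOSA.MAX_POOL2D; for the sketch we
    # round-trip through aten's NCHW implementation to get correct values.
    return torch.nn.functional.max_pool2d(
        x_nhwc.permute(0, 3, 1, 2), kernel_size=2
    ).permute(0, 2, 3, 1)

x = torch.randn(1, 3, 8, 8)                        # NCHW input
# The rewrite-pass pattern: permute(NCHW->NHWC) -> NHWC op -> permute(NHWC->NCHW).
y = nhwc_max_pool2d(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
assert torch.equal(y, torch.nn.functional.max_pool2d(x, kernel_size=2))
```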
This comparison is too large for GitHub to render. To view it locally:

git diff c4dbf62...7199184