
Comparing changes

base repository: pytorch/executorch
base: c4dbf62
head repository: pytorch/executorch
compare: 7199184
  • 11 commits
  • 65 files changed
  • 7 contributors

Commits on Apr 21, 2026

  1. Fix float32 indices in test_batched_export_with_backprop

    Differential Revision: D101547370
    
    Pull Request resolved: #18992
    jeanschmidt authored Apr 21, 2026
    78a6689
  2. Add instruction execution limit to prevent infinite loops (#18679)

    JumpFalseCall instructions can set destination_instruction to
    themselves, creating an infinite loop that hangs the runtime. This adds
    a configurable instruction counter (default 10M, overridable via
    -DET_MAX_INSTRUCTIONS) to Method::execute() that returns
    Error::InvalidState if exceeded.
    
    This PR was authored with the assistance of Claude.
    
    
    ### Test plan
    Existing tests
    ```bash
    cmake -B build -DEXECUTORCH_BUILD_TESTS=ON
    cmake --build build --target method_test
    ctest --test-dir build -R method_test --output-on-failure
    ```
    
    Co-authored-by: Github Executorch <github_executorch@arm.com>
    lucylq and Github Executorch authored Apr 21, 2026
    c4983ce
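
The guard is simple to model. Below is an illustrative Python sketch of the control flow only — the actual change is in the C++ `Method::execute()`, and the instruction encoding and names here (`execute`, `MAX_INSTRUCTIONS`) are hypothetical:

```python
# Sketch of an instruction-count guard (assumption: the real code is C++
# in Method::execute(); names and the instruction format are hypothetical).
MAX_INSTRUCTIONS = 10_000_000  # mirrors the 10M default from the PR

def execute(instructions, max_instructions=MAX_INSTRUCTIONS):
    pc = 0        # program counter
    executed = 0  # total instructions executed so far
    while pc < len(instructions):
        executed += 1
        if executed > max_instructions:
            # Limit exceeded: almost certainly an infinite loop, so fail
            # instead of hanging the runtime.
            return "Error::InvalidState"
        op, arg = instructions[pc]
        if op == "jump":   # a jump whose destination is itself loops forever
            pc = arg
        else:
            pc += 1
    return "Ok"

# A JumpFalseCall-style instruction targeting itself:
looping = [("jump", 0)]
print(execute(looping, max_instructions=1000))  # -> Error::InvalidState
```

Without the counter, the `looping` program above never terminates; with it, the runtime returns a recoverable error.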
  3. Fix unchecked map access in xnnpack (#19008)

    Introduce a remapId function that checks for a missing key and returns
    an error, instead of calling std::unordered_map::at(), which throws
    std::out_of_range; a throw inside a noexcept function calls
    std::terminate(). Applied across all ~30 node-definition functions in
    XNNCompiler.
    
    Retake of #18804, for Windows compatibility.
    
    Co-authored-by: Github Executorch <github_executorch@arm.com>
    lucylq and Github Executorch authored Apr 21, 2026
    e281726
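
The shape of the fix is easy to sketch. The real change is C++, where the point is that `at()` throwing inside a `noexcept` function terminates the process; this Python analogue only illustrates the check-and-return-error pattern, and `remap_id` here is a hypothetical stand-in, not the actual function:

```python
def remap_id_unchecked(remap, external_id):
    # Analogue of std::unordered_map::at(): raises on a missing key.
    # In the C++ noexcept context this is fatal (std::terminate()).
    return remap[external_id]

def remap_id(remap, external_id):
    # Analogue of the checked remapId: return an error value instead of
    # raising, so callers can propagate a recoverable error.
    if external_id not in remap:
        return None  # caller treats None as an error and bails out
    return remap[external_id]

remap = {7: 0, 9: 1}
assert remap_id(remap, 9) == 1
assert remap_id(remap, 42) is None  # unknown id: an error, not a crash
```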
  4. Add labeler for ARM path changes (#18606)

    Apply module: arm and ciflow/trunk labels when backends/arm or
    examples/arm files are modified.
    
    Signed-off-by: Sebastian Larsson <sebastian.larsson@arm.com>
    Sebastian-Larsson authored Apr 21, 2026
    ccaf17e
  5. Metal backend: Add gather_qmv kernel for MoE expert-indexed quantized matmul (#18877)
    
    Adds gather_qmv Metal kernel for Mixture-of-Experts: performs per-expert
    quantized matrix-vector multiply y[i] = W[expert_idx[i]] @ x[i]. Extends
    the existing qmv kernels in op_linear_4bit.mm with expert
    index-based pointer offsets, following the same pattern as MLX's
    affine_gather_qmv_fast.
    
    Two dispatch paths (matching op_linear_4bit.mm):
     - gather_qmv_fast: optimized path for K%512==0 and N%8==0
     - gather_qmv_impl: generic fallback for any K and N
    
    Uses the same affine INT4 dequantization format as op_linear_4bit.mm
    (scale * accum + sum * bias). Instantiated for 4-bit with group sizes
    {32, 64, 128} and dtypes {float, bfloat16}.
    
    Includes: Metal shader + C++ host dispatch, Python custom op definition
    (metal::gather_qmv) with reference CPU impl and Meta impl, C shim dict,
    fallback kernel registration, CMakeLists entry, and test module.
    manuelcandales authored Apr 21, 2026
    6875814
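
Setting quantization aside, the expert-indexed matvec the kernel computes can be written as a small dequantized reference (a sketch only; `gather_mv` and its shapes are assumptions, not the custom op's actual signature):

```python
def gather_mv(W, expert_idx, X):
    # y[i] = W[expert_idx[i]] @ x[i]: each token i selects its expert's
    # weight matrix (N x K) and multiplies it with its own activation (K,).
    out = []
    for i, x in enumerate(X):
        w = W[expert_idx[i]]  # (N, K) weights for this token's expert
        out.append([sum(w_row[k] * x[k] for k in range(len(x)))
                    for w_row in w])
    return out

W = [
    [[1, 0], [0, 1]],   # expert 0: identity
    [[2, 0], [0, 2]],   # expert 1: doubles its input
]
X = [[3, 4], [3, 4]]
print(gather_mv(W, [0, 1], X))  # -> [[3, 4], [6, 8]]
```

The Metal kernel does the same per-token gather, but over INT4-quantized weights with the affine dequantization described above.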
  6. Relax SQNR check for mv2 (#19030)

    Summary: The check has been failing, and other quantized models use a
    threshold of 15, so that seems fine.
    
    Differential Revision: D101837680
    JacobSzwejbka authored Apr 21, 2026
    2fce946
  7. Metal backend: Add gated delta rule kernel for linear attention (#18878)

    Adds a Metal kernel for the gated delta rule recurrence used by Qwen
    3.5 MoE's GatedDeltaNet linear attention layers. Ported from the Metal
    shader in the MLX delegate PR (#18785). The kernel processes the full
    sequence sequentially within a single GPU dispatch, keeping recurrent
    state in per-thread registers.
    
    Grid: [32, Dv, B*Hv], Threadgroup: [32, 4, 1]. Each simdgroup of 32
    threads handles Dk/32 elements of the key dimension with SIMD reduction
    for dot products.
    
    The op mutates the recurrent state buffer in-place (mutates_args).
    Instantiated for both real model (Dk=128, Dv=128, Hk=32, Hv=32) and
    tiny test (Dk=64, Dv=64, Hk=4, Hv=4) dimensions.
    
    Includes: Metal shader + C++ host dispatch, Python custom op definition
    (metal::gated_delta_rule) with reference CPU impl and Meta impl, C shim
    dict, fallback kernel registration, CMakeLists entry, and test module.
    manuelcandales authored Apr 21, 2026
    d408a10
  8. Qwen 3.5 MoE: Add Metal source transformations (#18879)

    Adds metal_source_transformations.py with module replacements for Metal:
    - FusedMoEExperts -> MetalMoEExperts (two metal::gather_qmv calls with
      SiLU gating, replacing torch.ops.triton.fused_moe)
    - GatedDeltaNet -> metal::gated_delta_rule custom op (replaces both the
      T=1 native path and T>1 Triton kernel)
    - FullAttention -> removes turboquant codepath, keeps standard SDPA
    - SparseMoE -> removes .float() cast on expert_weights
    
    Also includes quantize_experts_metal() which quantizes expert weights to
    MLX affine INT4 format (unsigned uint4 with scale + bias per group),
    compatible with the Metal gather_qmv kernel.
    manuelcandales authored Apr 21, 2026
    9600f63
  9. Qwen 3.5 MoE: Add --backend metal export path (#18880)

    Adds Metal backend support to export.py via --backend metal flag:
    - _prepare_and_quantize_metal: applies source transforms, quantizes
      experts to MLX affine INT4, quantizes non-expert layers with fpa4w
      (skips shared_expert_gate with N<4 for prefill compatibility)
    - _export_metal: exports decode + prefill methods via MetalBackend/
      MetalPartitioner
    
    CUDA and MLX paths are unchanged.
    manuelcandales authored Apr 21, 2026
    799bf5a
  10. Move permute optimization passes to shared transforms location (#19002)

    Summary:
    Pull Request resolved: #19002
    
    Move 6 permute optimization passes and their shared infrastructure from
    executorch/backends/cadence/aot/ to executorch/backends/transforms/ so
    they can be shared between the Cadence and Arm backends without a
    cross-backend dependency.
    
    New files:
    - permute_pass_utils.py: base classes (HierarchicalInplacePassInterface,
      RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and
      utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims,
      get_shape, get_edge_overload_packet)
    - fuse_cascaded_transpose_or_permute_ops.py
    - fuse_cascaded_view_ops.py
    - fuse_transpose_or_permute_op_pairs_pass.py
    - remove_permutes_around_elementwise_ops.py
    - postpone_permute_below_squeeze_view.py
    - replace_nop_transpose_or_permute_with_view.py
    
    The shared versions omit register_cadence_pass decorators and
    cadence-specific ops from default op sets. Cadence files will subclass
    these and re-add the decorators and ops.
    
    Added OSS tests (test_permute_optimization_passes.py) for the 4 passes
    that can be imported without quantized op registration:
    FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps,
    PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and
    ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via
    pytest and are discovered automatically through pytest.ini testpaths.
    
    Differential Revision: D101459577
    
    Reviewed By: ethansfng
    mcremon-meta authored and meta-codesync[bot] committed Apr 21, 2026
    738ac7e
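
The algebra behind FuseCascadedTransposeOrPermuteOps is that two back-to-back permutes compose into one. A minimal shape-level sketch of the composition rule (`fuse_permutes` is a hypothetical name, not the pass's API):

```python
def permute(shape, dims):
    # Shape-level permute: output axis i takes input axis dims[i].
    return [shape[d] for d in dims]

def fuse_permutes(first, second):
    # permute(permute(x, first), second) == permute(x, fused), where
    # output axis i of the fused permute reads input axis first[second[i]].
    return [first[d] for d in second]

shape = [2, 3, 4, 5]
a = permute(permute(shape, [1, 0, 2, 3]), [0, 1, 3, 2])  # two permutes
b = permute(shape, fuse_permutes([1, 0, 2, 3], [0, 1, 3, 2]))  # one fused
assert a == b == [3, 2, 5, 4]
```

When the fused dims come out as the identity, the pair cancels entirely, which is what makes the explicit NCHW/NHWC permute insertion in the Arm change below cheap in practice.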
  11. Replace tosa_dim_order with explicit NCHW↔NHWC permutes

    Summary:
    Replace implicit `tosa_dim_order`-based layout handling with explicit
    `permute_copy` ops around TOSA operators that require NHWC layout.
    
    ### Rewrite passes insert explicit NCHW↔NHWC permutes
    
    `RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass`
    now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op,
    NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for
    layout conversion. This makes layout transitions visible in the graph.
    
    ### Grouped conv decomposition in NHWC
    
    `RewriteConvPass` decomposes grouped convolutions (non-depthwise) into
    per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single
    input/output permute pair wrapping the whole group. Supports INT8,
    INT16 (with and without bias) quantisation paths, including the full
    INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
    RESCALE(INT32→INT16).
    
    ### `ToTosaMemoryFormatPass` scoped down
    
    Now only assigns non-identity dim_order to parameter/buffer
    placeholders (for weight serialisation) and graph I/O. Inserts
    `permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already
    carry a matching permute (inserted by the rewrite passes).
    
    ### TOSA dialect op metas expect NHWC
    
    All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`,
    `AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC
    input layout and produce NHWC output shapes.
    
    ### Removed `tosa_dim_order` shape remapping
    
    `tosa_shape()` no longer reorders dimensions—just resolves symints.
    `_get_matching_fake_tensor()` returns `node.meta["val"]` directly.
    Serialisation mapping always uses identity dim_order.
    
    ### Operator serialisation simplified
    
    `op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute`
    no longer remap reduction/concat axes through `dim_order` since
    tensors are already in the layout expected by TOSA.
    
    ### Permute optimisation passes added
    
    Six shared passes from `executorch/backends/transforms/` are now run
    after TOSA lowering to fuse, cancel, and simplify the permutes
    introduced above:
    - `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
    - `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
    - `ReplaceNopTransposeOrPermuteWithViewPass`
    - `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
    - `FuseCascadedTransposeOrPermuteOps`
    - `FuseCascadedViewOps`
    
    ### Removed passes
    
    `DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass`
    are removed—their logic is now handled inline by `RewriteConvPass`.
    `RewriteSlicePass` is repositioned after the permute optimisations.
    
    ### Ethos-U55 partitioner simplified
    
    The dual NCHW/NHWC permute constraint check is removed since tensors
    are always in the expected layout at partition time.
    
    Differential Revision: D100712787
    mcremon-meta authored and meta-codesync[bot] committed Apr 21, 2026
    7199184
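
The permute pairs the rewrite passes insert are plain data movements that the later optimization passes can fuse or cancel. A sketch of `permute_copy` semantics on nested lists (`permute4` is a hypothetical helper, not ExecuTorch code), showing that the NCHW→NHWC (dims 0,2,3,1) then NHWC→NCHW (dims 0,3,1,2) round trip is the identity:

```python
from itertools import product

def permute4(x, dims):
    # permute_copy for a 4-D nested list: output axis k takes input axis
    # dims[k], so the input index along axis dims[k] equals the output
    # index along axis k.
    s = [len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])]
    out_s = [s[d] for d in dims]
    out = [[[[None] * out_s[3] for _ in range(out_s[2])]
            for _ in range(out_s[1])] for _ in range(out_s[0])]
    for o in product(*(range(n) for n in out_s)):
        i = [0] * 4
        for k in range(4):
            i[dims[k]] = o[k]
        out[o[0]][o[1]][o[2]][o[3]] = x[i[0]][i[1]][i[2]][i[3]]
    return out

# Tiny NCHW tensor: N=1, C=2, H=2, W=2
x = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
nhwc = permute4(x, (0, 2, 3, 1))            # NCHW -> NHWC
assert permute4(nhwc, (0, 3, 1, 2)) == x    # NHWC -> NCHW: round trip
```

Adjacent pairs like this compose to the identity permutation, which is exactly what FuseCascadedTransposeOrPermuteOps and ReplaceNopTransposeOrPermuteWithViewPass eliminate after lowering.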