
[WebGPU] Add gating logic for subgroup shuffle primitives #1

Open

ksgr5566 wants to merge 844 commits into CharlieFRuan:pr-0302-webgpu-shuffle from ksgr5566:webgpu-subgroup-test

Conversation


ksgr5566 commented Feb 24, 2026

Summary

This adds gating logic on top of apache#17699 to support optional subgroup shuffle
primitives based on a compile-time flag.

Problem

The PR apache#17699 always generates subgroup shuffle ops when targeting WebGPU.
However, not all WebGPU devices support subgroups. We need a way to:

  • Default to shared memory reductions (universally compatible)
  • Optionally enable subgroup shuffles for devices that support them

Solution

Implement gating via TVM target parameter:

  • Default thread_warp_size=1 disables warp reductions (uses shared memory + barriers)
  • Add target parser UpdateWebGPUAttrs() that sets thread_warp_size=32 when supports_subgroups=true
  • Add --enable-subgroups CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (IsWarpReduction() in
lower_thread_allreduce.cc), ensuring subgroup ops are never generated unless explicitly enabled.
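
For illustration, a minimal Python sketch of how the gating is expected to behave, assuming the attribute names described above (`supports_subgroups`, `thread_warp_size`) land as stated; exact spellings may differ in the final patch:

```python
import tvm

# Default WebGPU target: thread_warp_size is expected to stay at 1, so
# lowering keeps the shared-memory + barrier reduction path.
default_target = tvm.target.Target("webgpu")
print(default_target.thread_warp_size)

# Opt-in target: the UpdateWebGPUAttrs parser described above should set
# thread_warp_size=32, letting IsWarpReduction() pick subgroup shuffles.
subgroup_target = tvm.target.Target({"kind": "webgpu", "supports_subgroups": True})
print(subgroup_target.thread_warp_size)
```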

Changes

Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions;
with the flag, it generates subgroupShuffle* ops.
Both generated WGSL outputs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f

guan404ming and others added 30 commits January 6, 2026 00:35
…trided_slice (apache#18633)

## Why

The dynamic_strided_slice operator was missing FRelaxInferLayout and
TMixedPrecisionPolicy attributes, preventing it from participating in
layout transformations and mixed precision optimizations.

## How
- Add `TMixedPrecisionPolicy` attribute with `kFollow` policy
- Add `InferLayoutDynStridedSlice` function that falls back to initial
layout (since begin/end/strides are dynamic tensors that cannot be
transformed at compile time)
## Why

The flip operator lacked layout inference support, preventing it from
participating in layout transformations during the ConvertLayout pass.

## How

- Add InferLayoutFlip function that transforms the axis attribute
according to the input layout
- Register FRelaxInferLayout attribute for relax.flip operator
- Add test case for conv2d followed by flip with layout conversion
…8638)

## Why

The scatter_elements operator was missing FRelaxInferLayout support,
which prevented proper layout transformation when used with operators
like conv2d that require layout conversion.

## How

- Implement InferLayoutScatterElements function that handles layout
inference for scatter_elements
- Transform axis attribute according to the inferred layout using
FindAxis
- Handle sub-indexed layout fallback to initial layout
- Add test case for conv2d + scatter_elements layout conversion
…ython 3.14+ (apache#18639)

This PR fixes the type annotation checker in
`tvm.tir.schedule._type_checker` to correctly handle subscripted
generics (e.g., `Union[str, int]`, `List[str]`, `Tuple[str, int]`) in
Python 3.14+.

## Background
In Python 3.14, the internal representation of generic types has
changed:
- `Union[str, int]` is now of type `typing.Union` instead of
`typing._GenericAlias` or `typing._SpecialGenericAlias`
- These types now have `__origin__` attribute directly on the type
object
- The existing type checker failed to recognize these new
representations, causing the dispatcher to fall through to "atomic"
instead of correctly identifying them as "union", "list", etc.

## Changes
Added a check for `__origin__` attribute at the beginning of the method
to handle Python 3.14's new generic type representations. This is fully
backward compatible since the new `__origin__` check is only applied
when the attribute exists.
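
For illustration, a simplified standalone sketch of the `__origin__`-based dispatch described above (not the actual `_type_checker` code):

```python
import typing

def dispatch_kind(annotation):
    # Subscripted generics expose __origin__ (typing.Union, list, dict,
    # tuple, ...), which is what the added check keys on.
    origin = getattr(annotation, "__origin__", None)
    if origin is typing.Union:
        return "union"
    if origin is list:
        return "list"
    if origin is dict:
        return "dict"
    if origin is tuple:
        return "tuple"
    return "atomic"

assert dispatch_kind(typing.Union[str, int]) == "union"
assert dispatch_kind(typing.List[str]) == "list"
assert dispatch_kind(typing.Dict[str, int]) == "dict"
assert dispatch_kind(typing.Tuple[str, int]) == "tuple"
```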

## Tests
Added parametrized tests to verify the dispatcher correctly handles
subscripted generics:
- `Union[str, int]` → identified as "union"
- `List[str]` → identified as "list"
- `Dict[str, int]` → identified as "dict"
- `Tuple[str, int]` → identified as "tuple"
- `Union[List[str], Dict[str, int]]` → identified as "union" with nested
generics
…true (apache#18641)

### Summary
Fixed incorrect output shape of Pool op when ceil_mode = true

### Steps to Reproduce
Example: Create Pool Operator from PyTorch
```
class PoolModule(nn.Module):
    def forward(self, x):
        return torch.nn.AvgPool2d(2, 2, 1, True)(x)
```
```
class Module:
    def main(x: R.Tensor((1, 3, 17, 17), dtype="float32")) -> R.Tuple(R.Tensor((1, 3, 10, 10), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((1, 3, 10, 10), dtype="float32") = R.nn.avg_pool2d(x, pool_size=[2, 2], strides=[2, 2], dilation=[1, 1], padding=[1, 1, 1, 1], ceil_mode=True, count_include_pad=True, layout="NCHW", out_layout="NCHW")
            gv: R.Tuple(R.Tensor((1, 3, 10, 10), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```

### Expected
```
class Module:
    def main(x: R.Tensor((1, 3, 17, 17), dtype="float32")) -> R.Tuple(R.Tensor((1, 3, 9, 9), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((1, 3, 9, 9), dtype="float32") = R.nn.avg_pool2d(x, pool_size=[2, 2], strides=[2, 2], dilation=[1, 1], padding=[1, 1, 1, 1], ceil_mode=True, count_include_pad=True, layout="NCHW", out_layout="NCHW")
            gv: R.Tuple(R.Tensor((1, 3, 9, 9), dtype="float32")) = (lv,)
            R.output(gv)
        return gv

```
### Resolve
- Citation:
https://docs.pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html
<img width="500" height="200" alt="PR1"
src="https://github.com/user-attachments/assets/52a27448-006f-409e-b8b4-65f49e908d5f"
/>

- Fixed: apache#18594
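
For reference, the ceil_mode sizing rule from the PyTorch docs cited above, worked through for this example (a standalone sketch, not the TVM code):

```python
import math

# Input 17, kernel 2, stride 2, padding 1, ceil_mode=True.
in_size, kernel, stride, pad = 17, 2, 2, 1
out = math.ceil((in_size + 2 * pad - kernel) / stride) + 1  # 10
# PyTorch ignores sliding windows that would start in the right padded
# region, which trims the output from 10 to 9.
if (out - 1) * stride >= in_size + pad:
    out -= 1
print(out)  # 9, matching the expected (1, 3, 9, 9) shape above
```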
## Why

The scatter_nd operator was missing FRelaxInferLayout attribute, which
is needed for proper layout transformation during model optimization.

### How

- Added InferLayoutScatterND function that uses data tensor's layout for
output since scatter_nd maintains input shape
- Registered FRelaxInferLayout attribute
)

## Why

The gather_elements operator lacked layout inference support, preventing
it from participating in layout transformations during the ConvertLayout
pass.

## How

- Add InferLayoutGatherElements function that transforms the axis
attribute according to the input layout
- Register FRelaxInferLayout attribute

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR supports NVRTC as an alternative to NVCC for faster, device-side
JIT compilation of CUDA kernels, in favor of the PR
[https://github.com/apache/tvm-ffi/pull/283](https://github.com/apache/tvm-ffi/pull/283).

It enhances the CUDA compilation backend by:
- Adding Python NVRTC support using cuda-python bindings
- Removing legacy C++ NVRTC fallback in favor of a Python-first approach
- Keeping nvcc as the default compiler with fatbin output (no behavior
change for existing users)

Users can choose the compilation backend using an environment variable
`TVM_CUDA_COMPILE_MODE`, choosing from "nvcc" and "nvrtc". For example,

`TVM_CUDA_COMPILE_MODE=nvrtc python3 your_program.py`

Here is a short benchmark of the compilation speed of kernels in
`test_target_codegen_cuda.py`.

### NVCC vs NVRTC Compilation Time Comparison (Python-side Call)

| Test Case | Code Size | NVCC Time (ms) | NVRTC Time (ms) | Speedup |
| :--- | :--- | :--- | :--- | :--- |
| `test_crossthread_reduction1` | 1945 B | 241.27 | 51.23 | **4.7x** |
| `test_cuda_bf16_vectorize_add` | 3760 B | 342.72 | 44.50 | **7.7x** |
| `test_cuda_const_float_to_half` | 12394 B | 272.85 | 31.99 | **8.5x** |
| `test_cuda_device_func_call` | 975 B | 215.58 | 21.47 | **10.0x** |
| `test_cuda_float_const_hex_format` | 685 B | 217.39 | 20.52 | **10.6x** |
| `test_cuda_floordiv_with_vectorization` | 1050 B | 213.88 | 23.32 | **9.2x** |
| `test_cuda_inf_nan` | 673 B | 214.33 | 24.94 | **8.6x** |
| `test_cuda_tensormap` | 755 B | 213.91 | 20.74 | **10.3x** |
| `test_cuda_thread_sync_inside_condition` | 1007 B | 213.43 | 28.29 | **7.5x** |
| `test_cuda_vectorize_add` | 908 B | 226.81 | 40.39 | **5.6x** |
| `test_cuda_vectorize_load` | 734 B | 217.25 | 24.02 | **9.0x** |
| `test_device_host_call_same_func` | 924 B | 216.03 | 21.21 | **10.2x** |
| `test_vectorized_intrin1` | 847 B | 226.15 | 26.34 | **8.6x** |

### NVSHMEM Support

Currently, NVSHMEM is **not** supported via NVRTC.
- Fallback Behavior: When NVSHMEM is required, the compilation pipeline
will automatically fall back to NVCC, even if `TVM_CUDA_COMPILE_MODE` is
set to nvrtc.
- Future Roadmap: Support for NVRTC with NVSHMEM is planned for
follow-up PRs.
Introduces the below features over texture annotation

- Lowering, codegen and runtime for texture.
- image2d_array_t support - the added depth dimension allows more
allocations to use textures instead of falling back to buffers when
texture limits are exceeded.
- A comprehensive set of schedules for Adreno textures.
- Texture packing of arbitrary types up to 128 bit (FP16-NCHW8c,
INT8-NCHW16c ...etc.).
- A clBufferDescriptor debug dump controlled by cmake options.
- Pipeline definition for adreno target.


While covering these features, the below interfaces or passes are enhanced
and need review.

- alloc_tensor: VDevice information is passed across these APIs. For texture
allocation, `alloc_storage` allocates buffer/image objects as requested, and
`alloc_tensor` is then a view of any scope. This takes care of optimally
utilizing the backing memory across different image objects or scopes.
- Constants Saving: Handled by adding a memory scope section in the
executable. This introduces a new header magic to retain backward
compatibility.
- Static Memory Planning: Mostly a port of the Relay static memory planner
with a mixed-mode allocator.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Sanjay <sanjs@qti.qualcomm.com>
- Remove TODO about `ctx->ReportWarning()` since LOG(WARNING) is the
standard pattern
- Remove TODO about skipping type params/constraints since the current
FuncType implementation only contains arg_types and ret_type fields.
`type_params` and `type_constraints` don't exist in the codebase.
- Fixed typo in warning message
This change is part of tile-ai#21.

On the Darwin platform, when trying to compile a `.c` file as Objective-C,
`-x objective-c++` needs to come before the source files in the command-line
arguments. Without this PR, it's not straightforward to do so.
…ope shapes (apache#18658)

This PR supports handling the slope and axis arguments of the PReLU op with
different slope shapes: (1xCx1x1), (S,), (1,1), etc.

### Description
- Handle the slope and axis arguments of the PReLU op (to pass into the
relax.op.nn.prelu function)
- If slope shape = (1xCx1x1), get axis = 1 and reshape slope to (C,)
- else if slope shape = (S,) or (1, 1), get axis = len(x_shape) - 1
(take the last axis of the input x)
(https://onnx.ai/onnx/repo-docs/Broadcasting.html)
 - else raise error
 
### Resolved
- Fixed 1: apache#18596
- Fixed 2: apache#18598
- Fixed 3: apache#18606
- Fixed 4: apache#18607
This PR bumps tvm-ffi to latest version
### Motivation

The ONNX Resize operator supports resizing N-D tensors per the ONNX
specification.
However, the current Relax ONNX frontend only supports 4D inputs and
raises an
assertion error for valid non-4D models.

This PR extends the Relax ONNX Resize converter beyond the 4D-only
restriction,
aligning behavior with the ONNX specification and ONNX Runtime for
supported ranks.

### Changes

- Remove the 4D-only assertion in the Relax ONNX Resize converter
- Preserve the existing 4D resize path without behavior changes
- Support non-4D Resize by lowering to existing resize implementations
for supported ranks
- Ensure Resize attributes are handled correctly for non-4D cases

### Testing

- Verified outputs against ONNX Runtime
- Added ONNX Resize tests covering non-4D input cases
(`test_resize_nd_sizes`)

Fixes: [[Bug] Resize N-D import failure: TVM only supports 4D resize2d,
but ONNX Resize supports N-D tensors](https://github.com/apache/tvm/issues/18608)
…pache#18663)

This PR fixes an issue in StaticPlanBlockMemory where dynamic shapes
were incorrectly planned as static memory when only a lower bound was
provided for TIR variables.

Repro:
<details>
<summary>repro_dynamic_memory_plan.py</summary>

```python
import tvm
from tvm import relax, testing
from tvm.relax.frontend.torch import from_exported_program
from torch.export import Dim, export
import torch


class SimpleConv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


def main():
    model = SimpleConv().eval()

    example = torch.randn(2, 3, 32, 32)
    batch = Dim("batch")  # No max= specified, so upper bound is unknown
    exported = export(model, (example,), dynamic_shapes={"x": {0: batch}})

    mod = from_exported_program(exported)
    mod = relax.transform.DecomposeOpsForInference()(mod)

    target = tvm.target.Target("llvm")
    exe = tvm.compile(mod, target=target)

    vm = relax.VirtualMachine(exe, tvm.cpu())
    inp = tvm.runtime.from_dlpack(example)
    out = vm["main"](inp)

    expected = model(example).detach().numpy()
    actual = out[0].numpy()
    testing.assert_allclose(actual, expected, rtol=1e-4, atol=1e-4)


if __name__ == "__main__":
    main()
```

</details>

This will fail with the following error.

<details>
<summary>output</summary>

```
$ uv run python repro_dynamic_memory_plan.py 
/home/ubuntu/data/project/tvm-example/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/home/ubuntu/data/project/tvm-example/frontend/repro_dynamic_memory_plan.py", line 40, in <module>
    main()
  File "/home/ubuntu/data/project/tvm-example/frontend/repro_dynamic_memory_plan.py", line 32, in main
    out = vm["main"](inp)
          ^^^^^^^^^^^^^^^
  File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 549, in tvm::runtime::vm::VirtualMachineImpl::InvokeClosurePacked(tvm::ffi::ObjectRef const&, tvm::ffi::PackedArgs, tvm::ffi::Any*)
    clo->impl.CallPacked(ffi::PackedArgs(packed_args.data(), packed_args.size()), rv);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 622, in operator()
    *rv = static_cast<VirtualMachineImpl*>(ctx_ptr)->InvokeBytecode(gf_idx, inputs);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 693, in tvm::runtime::vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::ffi::Any, std::allocator<tvm::ffi::Any> > const&)
    RunLoop();
  
  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 816, in tvm::runtime::vm::VirtualMachineImpl::RunLoop()
    this->RunInstrCall(curr_frame, instr);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 767, in tvm::runtime::vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::vm::VMFrame*, tvm::runtime::vm::Instruction)
    this->InvokeClosurePacked(func_pool_[instr.func_idx].cast<ObjectRef>(), args, &ret);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/builtin.cc", line 405, in operator()
    *rv = sobj->AllocTensor(offset, shape, dtype);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/memory/memory_manager.cc", line 98, in tvm::runtime::memory::StorageObj::AllocTensor(long, tvm::ffi::Shape, DLDataType)
    ICHECK(offset + needed_size <= this->buffer.size)
  
  File "/home/ubuntu/data/project/tvm-example/tvm/include/tvm/runtime/logging.h", line 321, in tvm::runtime::detail::LogFatal::~LogFatal()
    GetEntry().Finalize();

  File "/home/ubuntu/data/project/tvm-example/tvm/include/tvm/runtime/logging.h", line 337, in tvm::runtime::detail::LogFatal::Entry::Finalize()
    InternalError error(file_, lineno_, stream_.str());

tvm.error.InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 524288 at offset 0 in region that is 262144bytes
```

</details>
## Why

TODO comment indicated that topi.take should be replaced with
relax.op.take once it has better support. The relax take operator is now
mature enough to replace the TOPI external call.

## How

- Replace bb.emit_te(topi.take, ...) with relax.op.take(..., axis=0) in
EmbedLayerNormalization converter
## Why

The ONNX Unique operator supports four optional outputs (unique values,
indices, inverse_indices, and counts), but the TVM ONNX frontend only
returned the unique values output.

## How

- Updated `Unique._impl_v11` to check the number of expected outputs via
`attr["tvm_custom"]["num_outputs"]`
- Pass `return_index`, `return_inverse`, and `return_counts` parameters
to `relax.op.unique`
- Return a `relax.Tuple` containing all requested outputs
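
For reference, the four outputs mirror NumPy's `np.unique` (with its default sorted ordering):

```python
import numpy as np

x = np.array([2, 1, 1, 3, 4, 3])
values, indices, inverse, counts = np.unique(
    x, return_index=True, return_inverse=True, return_counts=True
)
# values  -> [1 2 3 4]       unique values
# indices -> [1 0 3 4]       first occurrence of each unique value
# inverse -> [1 0 0 2 3 2]   maps each input element to its unique value
# counts  -> [2 1 2 1]       occurrences of each unique value
```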
The line below is not valid MSVC syntax

```
#pragma disagnostic push
```

instead, we should always use:

```
#pragma warning(push)
```

in MSVC.
This allows a user-defined callback to specify layouts dynamically based
on the call description.
Helpful for altering layouts based on operator shapes or attributes.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Why

ONNX models use the Size operator to get total element count of a
tensor. Relax didn't have a native equivalent.

## How

- Adds R.size(tensor) operator that returns the total number of elements
in a tensor as a scalar int64
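
A hypothetical usage sketch, assuming the operator is exposed as `relax.op.size` to mirror the `R.size` spelling above (the exact Python path may differ):

```python
from tvm import relax

x = relax.Var("x", relax.TensorStructInfo((2, 3, 4), "float32"))
call = relax.op.size(x)  # assumed API; builds a relax.Call node
# After build and execution, the result is a scalar int64 tensor holding 24.
```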
This PR supports the Local Response Normalization operator for ONNX.

### Description
Implement and Test Local Response Normalization operator for ONNX
frontend.

### Implement
- Using avg_pool operator to compute LRN
- Pseudocode:
```
def local_response_norm(input, size, alpha, beta, k):
    dim = input.dim()
    check_only_support_3D_4D()
    div = input.mul(input)
    div = expand_dim(div, 1)
    pad_len = size // 2
    if dim == 3:
        div = avg_pool2d(div,
                        (size, 1),
                        stride=1,
                        padding=(pad_len, 0, pad_len, 0))
    else:
        div = avg_pool3d(div,
                        (size, 1, 1),
                        stride=1,
                        padding=(pad_len, 0, 0, pad_len, 0, 0))

    div = squeeze_dim(div, 1)
    div = div.mul(alpha).add(k).pow(beta)
    return input / div
```

### Reference
Implemented the same way as PyTorch:
https://discuss.pytorch.org/t/why-use-avgpool2d-and-avgpool3d-in-local-response-norm/97236
…#18664)

## Why

The LegalizeOps transform was using string matching to detect
data-dependent operators by checking if "dynamic" appears in the
operator name. This approach is fragile and doesn't scale well as new
data-dependent operators are added.

## How

- Add FDataDependent operator attribute to properly mark data-dependent
operators
- Set FDataDependent=true for relax.dynamic_strided_slice operator
- Update LegalizeOps transform to check the FDataDependent attribute
instead of string matching
## Why
NMS operator returns fixed-size output with trailing garbage data,
wasting memory and requiring manual trimming for ONNX
compatibility.
## How
- Add dynamic_strided_slice to trim NMS output to valid detections only
- Build slice parameters using TE compute to avoid legalization issues
## Why

The code was using the TypeScript type name
`FTVMFFIErrorSetRaisedFromCStr` instead of the actual C function export
name `TVMFFIErrorSetRaisedFromCStr`. The F prefix is a naming convention
for TypeScript function types, not part of the actual export name; this
caused RPC to error in my local test.
## How

`lib.exports.FTVMFFIErrorSetRaisedFromCStr` -->
`lib.exports.TVMFFIErrorSetRaisedFromCStr` to match the C function name
defined in ctypes.ts.
…#18683)

## Why
Recent FFI refactoring added new type indices kTVMFFIStr (65) and
kTVMFFIBytes (66). The RPC server didn't
handle these, causing "cannot support type index 65/66" errors when
Python sends string/bytes arguments via
RPC.
## How
`WriteFFIAny` in C++ writes the `type_index` twice for these types. The
fix adds handlers that read and discard the duplicate type_index before
reading the actual data.
## Why

When a client connects without session_constructor_args, the RPC
protocol sends nargs=0 (LocalSession request). The WASM RPC server
requires ["rpc.WasmSession", wasm_binary] instead.
             
## How
Adds handling to skip LocalSession init packets and wait for the proper
WasmSession init, with a log message for debugging.
…18671)

### Motivation
InjectPTXLDG32 rewrites BufferStore when encountering if_then_else, but
it only
initializes temporary buffers when an Allocate node exists. For
functions without
Allocate, this leads to uninitialized buffers and a hard segfault during
compilation.
In addition, the PTX-only pass can run on CPU/LLVM targets when
tir.ptx_ldg32=1,
injecting PTX intrinsics that are invalid for non-CUDA codegen.

This PR ensures temporary buffers are created even when no Allocate
exists, and
skips InjectPTXLDG32 on non-CUDA targets, preventing segfaults and
invalid PTX
intrinsics on CPU.

### Changes
- Ensure temp buffers are created when the rewrite path is taken without
Allocate
- Insert allocations at the function level when needed
- Guard InjectPTXLDG32 so it only runs on CUDA targets
- Add tests for CUDA (insertion) and CPU (skip) behavior

### Testing
test_tir_transform_inject_ptx_ldg32.py

### Fixes
- [apache#18612](apache#18612)
- [apache#18617](apache#18617)
- [apache#18599](apache#18599)
Aharrypotter and others added 30 commits March 26, 2026 22:38
…rt (apache#18933)

## Summary

Add `relax.image.affine_grid` operator for Spatial Transformer Networks,
along with PyTorch and ONNX frontend integration.

TOPI compute (`topi.image.affine_grid`) already exists. This PR
completes the Relax-level registration and frontend support, following
the existing `resize2d` / `grid_sample` pattern.

## Changes

**Relax op registration:**
- C++ op function, FFI registration, and struct info inference
(`resize.h`, `resize.cc`)
- Python wrapper with flexible size parameter handling (`image.py`)
- Legalization to `topi.image.affine_grid` with `PrimExpr` → `int`
conversion
- Op-level tests (struct info inference + e2e numerical correctness) and
legalization test

**PyTorch frontend:**
- Converter for `aten.affine_grid_generator.default`
- Layout conversion from TVM `[N,2,H,W]` to PyTorch `[N,H,W,2]` via
`permute_dims`
- Single-kernel path is 5.6x faster than the decomposed path (30+ ops)
- Structural IR test and numerical correctness test

**ONNX frontend:**
- `AffineGrid` converter with `_impl_v20` (opset 20, when the op was
first introduced)
- Support for constant size tensor `[N,C,H,W]`
- Layout conversion from TVM `[N,2,H,W]` to ONNX `[N,H,W,2]`
- End-to-end correctness test against ONNX Runtime

## Limitations

- Only `align_corners=True` is supported (matches current TOPI
implementation)
- Only 2D affine grid is supported

## Validation

```bash
pytest tests/python/relax/test_op_image.py -k "affine_grid" -v           # 8 passed
pytest tests/python/relax/test_transform_legalize_ops_image.py -k "affine_grid" -v  # 1 passed
pytest tests/python/relax/test_frontend_from_exported_program.py -k "affine_grid" -v  # 2 passed
pytest tests/python/relax/test_frontend_onnx.py -k "affine_grid" -v     # 1 passed
```

All 12 tests passed.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
… tirx rename (apache#18941)

- Remove duplicate Apache license blocks in
`docs/get_started/overview.rst` and `docs/README.md` introduced by
apache#18913, which render as visible garbage text in built documentation
- Fix incorrect `tirx/schedule` and `tirx/tensor_intrin` paths in
`docs/arch/index.rst` — these modules are in `s_tir/`, not `tirx/`
DeviceInfoCollector did not track Bind statements, so when CSE (or
any other pass) inserted a Bind before a thread_extent AttrStmt, the
collected extent referenced a locally-bound variable instead of
function parameters.  LowerDeviceKernelLaunch then produced dangling
references in the host function.

Fix: record Bind definitions in DeviceInfoCollector and inline them
when extracting thread_extent values and dynamic shared memory sizes.
The file `tvm/python/update_version.py` does not exist, so remove the
comment.
…box decode (apache#18942)

Introduce relax.vision.multibox_transform_loc with
MultiboxTransformLocAttrs: decode center-size offsets against ltrb
priors, softmax on class logits, and optional clip, threshold masking,
and background score zeroing. Register the C++ op with FInferStructInfo
checks for shapes and dtypes (including batch and 4*N consistency).
Legalize to topi.vision.multibox_transform_loc.

Add tests for struct inference, invalid inputs, Legalize+e2e on LLVM,
attribute branches, and TVMScript roundtrip. Add a standalone numpy
reference under topi/testing (not exported from tvm.topi.testing to
avoid pulling scipy).

Update TFLite frontend NotImplementedError text for
DETECTION_POSTPROCESS and NON_MAX_SUPPRESSION_V5 to note multibox is
available and link tracking issue apache#18928.
## Summary

Bump tvm-ffi submodule from c85fd42 (apache#471) to 63224e3 (apache#512), spanning
41 commits with 7 breaking changes. Fix regressions introduced by the
bump:

### Fixes

1. **Duplicate field declarations in C++ types** — New tvm-ffi
auto-wires `__init__` from C++ reflection by walking the parent type
chain. Child types that re-declared parent fields
(`RXPlaceholderOpNode`, `FunctionFrameNode`) caused duplicate parameter
errors. Fixed by removing duplicate field registrations from child
types.

2. **Repr format regression** (7 tests) — New tvm-ffi `CObject.__repr__`
uses dataclass repr. Added `Node.__repr__` in `python/tvm/ir/base.py` to
use TVMScript printer for IR nodes.

3. **Host/device function split** (3 tests) — `str(target.kind)` now
returns full dataclass repr instead of kind name. Changed to
`target.kind.name` in `python/tvm/tirx/build.py`.

4. **`__slots__` enforcement** — New tvm-ffi enforces `__slots__=()` on
Object subclasses. Added `__slots__ = ("__dict__",)` to classes that
need instance attributes: `Pass`, `BlockBuilder`, `TVMDerivedObject`.

### Changes
- `3rdparty/tvm-ffi` — submodule bump c85fd42 → 63224e3
- `python/tvm/ir/base.py` — `Node.__repr__` using TVMScript printer
- `python/tvm/ir/transform.py` — `Pass.__slots__ = ("__dict__",)`
- `python/tvm/tirx/build.py` — `target.kind.name` instead of
`str(target.kind)`
- `python/tvm/relax/block_builder.py` — `BlockBuilder.__slots__ =
("__dict__",)`
- `python/tvm/runtime/support.py` — `TVMDerivedObject.__slots__ =
("__dict__", "__weakref__")`
- `python/tvm/s_tir/meta_schedule/utils.py` —
`TVMDerivedObject.__slots__ = ("__dict__",)`
- `include/tvm/script/ir_builder/relax/frame.h` — remove duplicate field
registrations
- `src/relax/ir/emit_te.h` — remove duplicate field registrations

## Test plan
- [x] tirx-base: 251 passed, 23 skipped
- [x] relax import + build: verified
- [ ] Full CI
Include `ffi/extra/json_parser.cc` and `ffi/extra/json_writer.cc` to
maintain compatibility after FFI JSON refactor
## Summary

Add `relax.vision.get_valid_counts` and classic
`relax.vision.non_max_suppression` for object-detection post-processing
pipelines.

`get_valid_counts` performs score-based bounding box filtering and
compacts valid boxes to the front of each batch. Classic
`non_max_suppression` performs flexible IoU-based suppression on
filtered boxes, complementing existing `all_class_non_max_suppression`
for custom post-processing workflows.

This PR implements the Relax-level registration, legalization, TOPI
compute, and test coverage for both operators.

## Changes

**Relax op registration and legalization:**
- C++ op functions, FFI registration, and struct info inference for both
operators (`vision.h`, `vision.cc`)
- Python wrappers with Relax docstrings (`vision.py`)
- Legalization to `topi.vision.get_valid_counts` and
`topi.vision.non_max_suppression`
- Additional struct-info validation for `score_index`, `id_index`, and
`coord_start` when `elem_length` is statically known

**TOPI and testing:**
- Full TOPI implementation for `get_valid_counts`
- Reimplementation of classic `non_max_suppression` in TOPI
- NumPy reference implementations in `tvm.topi.testing` for both
operators
- Op-level tests for struct info inference, legalization, invalid
attribute ranges, and e2e numerical correctness
- Stronger legalization tests that verify both `relax.call_tir`
introduction and removal of the original Relax vision op

## Limitations

- Attribute range validation for `score_index`, `id_index`, and
`coord_start` is only enforced when the input `elem_length` is
statically known during struct-info inference.
- Classic `non_max_suppression` follows the existing Relax / TOPI API
shape and is intended for single-class or class-aware custom
post-processing flows, distinct from `all_class_non_max_suppression`.

## Validation

```bash
pytest tests/python/relax/test_op_vision.py -k "get_valid_counts" -v
pytest tests/python/relax/test_op_vision.py -k "test_nms_" -v
```
All related tests passed.
…he#18946)

### Summary

This PR implements the ONNX `If` operator in the Relax ONNX frontend.
The `If` operator enables conditional branching in ONNX models, where a
boolean condition selects between two subgraph branches (`then_branch`
and `else_branch`) at runtime. This is required for any model with
runtime-dependent execution paths.

Closes apache#18945 (Tier 1 — `If` operator)

### Implementation Notes

- The main challenge is that `relax.If` cannot be emitted inside a
dataflow block, which is how the ONNX frontend normally builds the
entire graph. To handle this, when the graph contains an `If` node, the
function body is built as a regular binding block instead — matching the
approach used by the PyTorch Relax frontend for `torch.cond`.

- Each branch is an ONNX subgraph that can reference values from the
outer graph. A new `_convert_subgraph` method handles converting these
subgraphs into Relax expressions, making outer-scope values available to
the branch while ensuring branch-local bindings don't leak back to the
parent graph.

### Why `relax.If` cannot live inside a dataflow block

Dataflow blocks in Relax carry a semantic guarantee: every operation
inside them must be pure and side-effect-free with no control flow. This
allows the compiler to treat the entire block as a static computational
graph for optimizations like operator fusion and constant folding. An
`If` node breaks this guarantee by introducing runtime-dependent
branching, so Relax's well-formedness checker explicitly forbids it. I
discovered this when the checker raised:
```
This IR is not well-formed: If nodes are not allowed to appear in dataflow blocks.
```

The fix — skipping the dataflow block when the graph contains an `If`
node — mirrors exactly how the PyTorch Relax frontend handles
`torch.cond`.
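
For illustration, a minimal TVMScript sketch (names are illustrative) of a function with runtime branching once it lives outside a dataflow block:

```python
from tvm.script import relax as R

@R.function
def main(cond: R.Tensor((), "bool"), x: R.Tensor((4,), "float32")) -> R.Tensor((4,), "float32"):
    # Plain binding block, no R.dataflow(): the If is allowed here.
    if cond:
        y = R.add(x, x)
    else:
        y = R.multiply(x, x)
    return y
```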

### Known Limitations

**Dataflow block**: Models whose top-level graph contains an `If` node
are built without a dataflow block, which may affect downstream
optimisation passes that rely on dataflow block structure.

### Tests

Four new tests covering: scalar and tensor conditions, condition
computed from another op, and multiple branch outputs. All verified
against onnxruntime via `check_correctness`.

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>
…e#18948)

Introduce relax.nn.conv3d_transpose (attrs, C++ inference/layout, Python
API) and lower it to TOPI group_conv3d_transpose_ncdhw when using
NCDHW/IODHW with dilation 1, matching the conv2d_transpose legalization
policy.

Wire the Relax ONNX frontend to emit conv3d_transpose for 5D inputs.
Extend tests for ONNX, struct info, LegalizeOps, and TVMScript
round-trip; fix ConvTranspose test output spatial size to include
output_padding. apache#18945
…#18952)

## Summary

Add Relax `roi_pool` support and wire it through the ONNX frontend for
`MaxRoiPool`.

## Changes

- add `relax.vision.roi_pool`, including attrs, Python wrapper, struct
info inference, and legalization
- add TOPI `roi_pool` compute for NCHW layout
- support ONNX `MaxRoiPool` in the Relax ONNX frontend
- handle empty / out-of-bound pooled bins according to ONNX/reference
semantics, returning `0` instead of propagating invalid reductions
- add regression tests for Relax op inference, legalization, and ONNX
frontend import
- add out-of-bound ROI coverage to make sure fully invalid pooled bins
still match ONNX Runtime

## Validation

- `pytest tests/python/relax/test_op_vision.py -k roi_pool`
- `pytest tests/python/relax/test_frontend_onnx.py -k 'max_roi_pool'`


This PR completes the `MaxRoiPool` portion of the Relax ONNX frontend
operator work tracked in apache#18945.
…ache#18951)

### Summary

Implements the `MatMulInteger` operator (opset 10) in the Relax ONNX
frontend — INT8 matrix multiplication. Required for quantized model
inference (e.g. ONNX QDQ models).

Closes apache#18945 (Tier 1 — MatMulInteger operator)

### Tests

- All 4 `int8`/`uint8` dtype combinations, with and without scalar zero
points
- 3-D and 4-D batched matmul

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>
…dule (apache#18947)

This PR adds a new tutorial `mix_python_and_tvm_with_pymodule.py`
demonstrating how to use `BasePyModule` to mix Python/PyTorch functions
with TIR and Relax in a single IRModule.
## Tutorial Contents (7 steps)                                  
- **Step 1**: `I.pyfunc` + `call_tir` basics, DLPack zero-copy
conversion, `show()`
- **Step 2**: Debugging with `print` in pyfuncs — inspect intermediate
tensors without compiling
- **Step 3**: Realistic pipeline combining `call_tir`,
`call_dps_packed`, and Python/PyTorch in one forward pass
- **Step 4**: Dynamic function registration via `add_python_function`
- **Step 5**: `RelaxToPyFuncConverter` — convert Relax IR to PyTorch at
different compilation stages (before and after passes) to verify
numerical correctness
- **Step 6**: `R.call_py_func` — cross-level calls between compiled
Relax VM and Python functions
- **Step 7**: Symbolic shapes for dynamic batch sizes
This PR also fixes a bug in `BasePyModule._compile_functions` where
modules without Relax functions would incorrectly attempt Relax VM
compilation, producing spurious warnings like `Failed to compile Relax
VM: 'NoneType' object has no attribute 'kind'`.
…he#18950)

## Summary

This PR adds Relax ONNX frontend support for:
- `Optional`
- `OptionalHasElement`
- `OptionalGetElement`
- `MatMulInteger16` from the `com.microsoft` domain

The implementation follows existing TVM ONNX frontend patterns and keeps
Optional handling explicit through an empty-Optional sentinel during
import.

## Changes

- add ONNX frontend converters for `Optional`, `OptionalHasElement`, and
`OptionalGetElement`
- add ONNX frontend converter for `MatMulInteger16`
- extend ONNX attribute parsing to handle `TYPE_PROTO`
- preserve empty Optional values during import and unwrap them
consistently
- register Optional-related ops and `MatMulInteger16` in the ONNX
converter map
- handle Optional outputs correctly in importer output counting and
normalization
- tighten converter docstrings and input validation for better
consistency with nearby TVM code

## Tests

Added or updated tests in `tests/python/relax/test_frontend_onnx.py` to
cover:
- numerical correctness for `MatMulInteger16`
- structural IR checks for `MatMulInteger16`
- invalid dtype rejection for `MatMulInteger16`
- tensor and sequence Optional round-trips
- empty Optional behavior for `OptionalHasElement`
- structural IR checks ensuring Optional ops are erased as expected
- missing `type` attribute rejection for empty `Optional`
- empty `OptionalGetElement` rejection

## Validation

Validated with:
- `python -m ruff check python/tvm/relax/frontend/onnx/onnx_frontend.py
tests/python/relax/test_frontend_onnx.py`
- `python -m pytest -n 1 tests/python/relax/test_frontend_onnx.py -k
"optional or matmulinteger16" -v`

Result:
- `13 passed`

This PR completes the ONNX `MatMulInteger16` and `Optional` work tracked
in apache#18945.
…pache#18956)

## Summary

Complete `Reshape` handling for shape values in the Relax ONNX frontend.

## Changes

- keep `ShapeExpr -> Reshape([-1])` on the shape-specialized path
- materialize `ShapeExpr` to an `int64` tensor for other reshape targets
and apply regular tensor reshape semantics
- add frontend coverage for `Shape -> Reshape([-1])`
- add frontend coverage for reshaping shape outputs to non-`[-1]`
targets such as `[1, 3]` and `[3, 1]`
- extend symbolic shape deduction coverage to include the common `Shape
-> Reshape([-1]) -> Gather -> Unsqueeze` shape-construction pattern

## Validation

- `pytest -k 'test_symbolic_shape_deduction or test_reshape_shape_output
or test_reshape'`

This PR completes the `Reshape` limitation in the Relax ONNX frontend
operator work tracked in apache#18945.
)

Since we've split the old `tir` namespace into `tirx` (core IR /
lowering) and `s_tir` (schedule primitives / auto-tuning), some outdated
documentation needs to be updated. The global rename still leaves a few
concept-level references using "tirx" in prose (for example, "Relax and
tirx programs"). Since "tirx" now refers only to one part of the old
TensorIR stack, these higher-level references should use "TensorIR"
instead, so they correctly cover both `tirx` and `s_tir`.

In this PR, we
- Add tirx / s_tir module descriptions to
`docs/deep_dive/tensor_ir/index.rst` and `docs/arch/index.rst` (new
`tvm/s_tir` section, updated `tvm/tirx` section).
- Fix concept-level prose in `docs/arch/index.rst` and
`docs/arch/pass_infra.rst` to use "TensorIR" instead of "tirx" where
referring to the concept rather than the namespace.
- Fix `docs/arch/runtimes/vulkan.rst` to use "TIR" instead of "tirx" in
debug shader description.
- Correct `tvm/dlight` → `tvm/s_tir/dlight` section path and "tirx
schedules" → "s_tir schedules" in `docs/arch/index.rst`.
- Revert unintended label changes in `abstraction.rst` and
`tir_creation.py` (labels kept as `_tir-abstraction`, `_tir-creation`).
- Revert unintended title change in `tir_transformation.py` (kept as
"Transformation").
- Revert `exclude-members` change in `tirx/tirx.rst` (kept original
list).
…` and `Slice` (apache#18955)

## Summary

Relates to apache#18945.

This PR improves ONNX frontend handling for dynamic
`Unsqueeze`/`Squeeze`/`Slice`, tightens validation paths, and adds
targeted structural/negative regression tests.

- Refactor constant-path `Unsqueeze` lowering to use a single `reshape`
based on computed target shape.
- Remove scalar-specific branching and repeated `expand_dims` in the
constant path.
- Add/keep structural helper usage in ONNX frontend tests for Relax
call-op checks.
- Add regression coverage for scalar-input `Unsqueeze`.

## Changes

- Add dynamic-axes conversion paths for `Unsqueeze` and `Squeeze`:
  - infer output shape via runtime shape-tensor construction
- lower to `relax.reshape` with validated shape rank/length assumptions
- Improve `Slice` conversion robustness:
  - support dynamic parameter forms with stricter rank/length validation
  - reject invalid zero-step inputs when statically known
  - fix docstring wording (`Splice` -> `Slice`)
- Strengthen ONNX frontend tests:
  - negative test for duplicate `Unsqueeze` axes
- structural IR check for dynamic `Slice` (`relax.dynamic_strided_slice`
present, `relax.strided_slice` absent)
  - negative test for zero-step `Slice`
- Refactor constant-path `Unsqueeze` scalar handling:
- replace scalar special-casing + repeated `expand_dims` with one
target-shape `reshape`
  - add scalar-input regression test
- Restore shared test helper used by structural Relax call-op checks.

## Validation

- `ruff check`: passed
- `pre-commit --files`: passed
- `pytest`: 8 passed
The ONNX Resize converter previously rejected non-constant ROI inputs,
which blocked models where ROI is provided at runtime. This change adds
a dynamic-ROI path lowered through TOPI resize kernels while preserving
the existing relax.image.resize* path for static ROI.

Specifically:
- add reusable helper to convert ONNX full ROI ([starts..., ends...])
into spatial ROI vector
- add reusable helper to emit topi.image.resize1d/2d/3d for dynamic ROI
- keep static ROI fast path for relax.image.resize2d/resize3d
- normalize dynamic ROI expr before emit_te to ensure struct_info is
populated
- handle optional Resize inputs (roi/scales/sizes) more defensively
- add frontend test coverage with graph-input ROI:
test_resize_dynamic_roi_tf_crop_and_resize

Ref: apache#18945
…PI error (apache#18957)

- The intermediate variable `ceil_log2` in `gpu_2d_continuous_cumsum`
created a `LetStmt`-bound `Var` in the TIR function
- When `MakePackedAPI` processed the function, it reported `ceil_log2`
as an undefined variable not passed as an API argument
- Inline the expression directly into `total_rounds` to avoid the
intermediate `Var` — the computation is identical

## Test plan
- Compile a model that uses GPU sampling (e.g. any LLM with top-p
sampling on Metal) and verify compilation succeeds
- The error this fixes: `Check failed: undefined.size() == 0: In
PrimFunc gpu_2d_continuous_cumsum variables [ceil_log2] are used, but
are not passed in as API arguments`

Co-authored-by: Akaash Parthasarathy <43900735+akaashrp@users.noreply.github.com>
- Fix a few typos
- Unify Android naming
- Fix HTTPS link
…ons (apache#18960)

Compiling Qwen3.5 yielded WGSL of the following form: `var<storage,
read_write> storage : array<f32>;
`. This led to a 'cyclic dependency' error due to the identifier
collision. This PR reserves keywords such as storage to avoid parsing
errors.
…ts (apache#18959)

Replace `str(target.kind)` with `target.kind.name` for `Target` objects
since `target.kind` is a `TargetKind` object while `target.kind.name`
yields a string describing the target
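
A quick illustration of the difference:

```python
import tvm

target = tvm.target.Target("llvm")
print(target.kind.name)  # "llvm" -- the plain kind name these checks expect
print(str(target.kind))  # full TargetKind repr after the tvm-ffi bump, not just the name
```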
- Fix rollup errors by upgrading rollup version and updating rollup
config
- Bump tvmjs version to 0.24.0-dev3
…che#18969)

### Summary

This PR implements the `select_last_index` attribute (introduced in
opset 12) for the `ArgMax` and `ArgMin` ONNX operators.

Previously, setting `select_last_index=1` raised
`OpAttributeUnImplemented`. This closes the limitation tracked in the
ONNX frontend issue.

### Implementation

When `select_last_index=1`, the input tensor is reversed along the
reduction axis using `relax.op.flip`, argmax/argmin is computed on the
flipped copy, and the result is remapped back to the original index
space via `last_idx = (axis_size - 1) - flipped_idx`
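
A tiny NumPy check of that remapping:

```python
import numpy as np

x = np.array([3, 7, 1, 7, 2])
flipped_idx = np.argmax(x[::-1])            # index of the max in the flipped copy
last_idx = (x.shape[0] - 1) - flipped_idx   # 3: the last occurrence of the max
assert last_idx == 3 and x[last_idx] == x.max()
```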

Closes part of apache#18945

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>