
[WebGPU] Add gating logic for subgroup shuffle primitives #1

Open

ksgr5566 wants to merge 844 commits into CharlieFRuan:pr-0302-webgpu-shuffle from ksgr5566:webgpu-subgroup-test

Conversation


ksgr5566 commented Feb 24, 2026

Summary

This adds gating logic on top of apache#17699 to support optional subgroup shuffle
primitives based on a compile-time flag.

Problem

The PR apache#17699 always generates subgroup shuffle ops when targeting WebGPU.
However, not all WebGPU devices support subgroups. We need a way to:

  • Default to shared memory reductions (universally compatible)
  • Optionally enable subgroup shuffles for devices that support them

Solution

Implement gating via TVM target parameter:

  • Default thread_warp_size=1 disables warp reductions (uses shared memory + barriers)
  • Add target parser UpdateWebGPUAttrs() that sets thread_warp_size=32 when supports_subgroups=true
  • Add --enable-subgroups CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (IsWarpReduction() in
lower_thread_allreduce.cc), ensuring subgroup ops are never generated unless explicitly enabled.
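
For illustration, a minimal Python sketch of how the gating is expected to behave, assuming the attribute names described above (`supports_subgroups`, `thread_warp_size`) land as stated; exact spellings may differ in the final patch:

```python
import tvm

# Default WebGPU target: thread_warp_size is expected to stay at 1, so
# lowering keeps the shared-memory + barrier reduction path.
default_target = tvm.target.Target("webgpu")
print(default_target.thread_warp_size)

# Opt-in target: the UpdateWebGPUAttrs parser described above should set
# thread_warp_size=32, letting IsWarpReduction() pick subgroup shuffles.
subgroup_target = tvm.target.Target({"kind": "webgpu", "supports_subgroups": True})
print(subgroup_target.thread_warp_size)
```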

Changes

Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions;
with the flag, it generates subgroupShuffle* ops.
Both generated WGSL outputs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f

guan404ming and others added 30 commits January 6, 2026 00:35
…trided_slice (apache#18633)

## Why

The dynamic_strided_slice operator was missing FRelaxInferLayout and
TMixedPrecisionPolicy attributes, preventing it from participating in
layout transformations and mixed precision optimizations.

## How
- Add `TMixedPrecisionPolicy` attribute with `kFollow` policy
- Add `InferLayoutDynStridedSlice` function that falls back to initial
layout (since begin/end/strides are dynamic tensors that cannot be
transformed at compile time)
## Why

The flip operator lacked layout inference support, preventing it from
participating in layout transformations during the ConvertLayout pass.

## How

- Add InferLayoutFlip function that transforms the axis attribute
according to the input layout
- Register FRelaxInferLayout attribute for relax.flip operator
- Add test case for conv2d followed by flip with layout conversion
…8638)

## Why

The scatter_elements operator was missing FRelaxInferLayout support,
which prevented proper layout transformation when used with operators
like conv2d that require layout conversion.

## How

- Implement InferLayoutScatterElements function that handles layout
inference for scatter_elements
- Transform axis attribute according to the inferred layout using
FindAxis
- Handle sub-indexed layout fallback to initial layout
- Add test case for conv2d + scatter_elements layout conversion
…ython 3.14+ (apache#18639)

This PR fixes the type annotation checker in
`tvm.tir.schedule._type_checker` to correctly handle subscripted
generics (e.g., `Union[str, int]`, `List[str]`, `Tuple[str, int]`) in
Python 3.14+.

## Background
In Python 3.14, the internal representation of generic types has
changed:
- `Union[str, int]` is now of type `typing.Union` instead of
`typing._GenericAlias` or `typing._SpecialGenericAlias`
- These types now have `__origin__` attribute directly on the type
object
- The existing type checker failed to recognize these new
representations, causing the dispatcher to fall through to "atomic"
instead of correctly identifying them as "union", "list", etc.

## Changes
Added a check for `__origin__` attribute at the beginning of the method
to handle Python 3.14's new generic type representations. This is fully
backward compatible since the new `__origin__` check is only applied
when the attribute exists.
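
For illustration, a simplified standalone sketch of the `__origin__`-based dispatch described above (not the actual `_type_checker` code):

```python
import typing

def dispatch_kind(annotation):
    # Subscripted generics expose __origin__ (typing.Union, list, dict,
    # tuple, ...), which is what the added check keys on.
    origin = getattr(annotation, "__origin__", None)
    if origin is typing.Union:
        return "union"
    if origin is list:
        return "list"
    if origin is dict:
        return "dict"
    if origin is tuple:
        return "tuple"
    return "atomic"

assert dispatch_kind(typing.Union[str, int]) == "union"
assert dispatch_kind(typing.List[str]) == "list"
assert dispatch_kind(typing.Dict[str, int]) == "dict"
assert dispatch_kind(typing.Tuple[str, int]) == "tuple"
```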

## Tests
Added parametrized tests to verify the dispatcher correctly handles
subscripted generics:
- `Union[str, int]` → identified as "union"
- `List[str]` → identified as "list"
- `Dict[str, int]` → identified as "dict"
- `Tuple[str, int]` → identified as "tuple"
- `Union[List[str], Dict[str, int]]` → identified as "union" with nested
generics
…true (apache#18641)

### Summary
Fixed incorrect output shape of Pool op when ceil_mode = true

### Steps to Reproduce
Example: Create Pool Operator from PyTorch
```
class PoolModule(nn.Module):
    def forward(self, x):
        return torch.nn.AvgPool2d(2, 2, 1, True)(x)
```
```
class Module:
    def main(x: R.Tensor((1, 3, 17, 17), dtype="float32")) -> R.Tuple(R.Tensor((1, 3, 10, 10), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((1, 3, 10, 10), dtype="float32") = R.nn.avg_pool2d(x, pool_size=[2, 2], strides=[2, 2], dilation=[1, 1], padding=[1, 1, 1, 1], ceil_mode=True, count_include_pad=True, layout="NCHW", out_layout="NCHW")
            gv: R.Tuple(R.Tensor((1, 3, 10, 10), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```

### Expected
```
class Module:
    def main(x: R.Tensor((1, 3, 17, 17), dtype="float32")) -> R.Tuple(R.Tensor((1, 3, 9, 9), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((1, 3, 9, 9), dtype="float32") = R.nn.avg_pool2d(x, pool_size=[2, 2], strides=[2, 2], dilation=[1, 1], padding=[1, 1, 1, 1], ceil_mode=True, count_include_pad=True, layout="NCHW", out_layout="NCHW")
            gv: R.Tuple(R.Tensor((1, 3, 9, 9), dtype="float32")) = (lv,)
            R.output(gv)
        return gv

```
### Resolve
- Citation:
https://docs.pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html
<img width="500" height="200" alt="PR1"
src="https://github.com/user-attachments/assets/52a27448-006f-409e-b8b4-65f49e908d5f"
/>

- Fixed: apache#18594
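
For reference, the ceil_mode sizing rule from the PyTorch docs cited above, worked through for this example (a standalone sketch, not the TVM code):

```python
import math

# Input 17, kernel 2, stride 2, padding 1, ceil_mode=True.
in_size, kernel, stride, pad = 17, 2, 2, 1
out = math.ceil((in_size + 2 * pad - kernel) / stride) + 1  # 10
# PyTorch ignores sliding windows that would start in the right padded
# region, which trims the output from 10 to 9.
if (out - 1) * stride >= in_size + pad:
    out -= 1
print(out)  # 9, matching the expected (1, 3, 9, 9) shape above
```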
## Why

The scatter_nd operator was missing FRelaxInferLayout attribute, which
is needed for proper layout transformation during model optimization.

### How

- Added InferLayoutScatterND function that uses data tensor's layout for
output since scatter_nd maintains input shape
- Registered FRelaxInferLayout attribute
)

## Why

The gather_elements operator lacked layout inference support, preventing
it from participating in layout transformations during the ConvertLayout
pass.

## How

- Add InferLayoutGatherElements function that transforms the axis
attribute according to the input layout
- Register FRelaxInferLayout attribute

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR supports NVRTC as an alternative to NVCC for faster, device-side
JIT compilation of CUDA kernels, in favor of the PR
[https://github.com/apache/tvm-ffi/pull/283](https://github.com/apache/tvm-ffi/pull/283).

It enhances the CUDA compilation backend by:
- Adding Python NVRTC support using cuda-python bindings
- Removing legacy C++ NVRTC fallback in favor of a Python-first approach
- Keeping nvcc as the default compiler with fatbin output (no behavior
change for existing users)

Users can choose the compilation backend using an environment variable
`TVM_CUDA_COMPILE_MODE`, choosing from "nvcc" and "nvrtc". For example,

`TVM_CUDA_COMPILE_MODE=nvrtc python3 your_program.py`

Here is a short benchmark of the compilation speed of kernels in
`test_target_codegen_cuda.py`.

### NVCC vs NVRTC Compilation Time Comparison (Python-side Call)

| Test Case | Code Size | NVCC Time (ms) | NVRTC Time (ms) | Speedup |
| :--- | :--- | :--- | :--- | :--- |
| `test_crossthread_reduction1` | 1945 B | 241.27 | 51.23 | **4.7x** |
| `test_cuda_bf16_vectorize_add` | 3760 B | 342.72 | 44.50 | **7.7x** |
| `test_cuda_const_float_to_half` | 12394 B | 272.85 | 31.99 | **8.5x** |
| `test_cuda_device_func_call` | 975 B | 215.58 | 21.47 | **10.0x** |
| `test_cuda_float_const_hex_format` | 685 B | 217.39 | 20.52 | **10.6x** |
| `test_cuda_floordiv_with_vectorization` | 1050 B | 213.88 | 23.32 | **9.2x** |
| `test_cuda_inf_nan` | 673 B | 214.33 | 24.94 | **8.6x** |
| `test_cuda_tensormap` | 755 B | 213.91 | 20.74 | **10.3x** |
| `test_cuda_thread_sync_inside_condition` | 1007 B | 213.43 | 28.29 | **7.5x** |
| `test_cuda_vectorize_add` | 908 B | 226.81 | 40.39 | **5.6x** |
| `test_cuda_vectorize_load` | 734 B | 217.25 | 24.02 | **9.0x** |
| `test_device_host_call_same_func` | 924 B | 216.03 | 21.21 | **10.2x** |
| `test_vectorized_intrin1` | 847 B | 226.15 | 26.34 | **8.6x** |

### NVSHMEM Support

Currently, NVSHMEM is **not** supported via NVRTC.
- Fallback Behavior: When NVSHMEM is required, the compilation pipeline
will automatically fall back to NVCC, even if `TVM_CUDA_COMPILE_MODE` is
set to nvrtc.
- Future Roadmap: Support for NVRTC with NVSHMEM is planned for
follow-up PRs.
Introduces the below features over texture annotation

- Lowering, codegen and runtime for texture.
- image2d_array_t support - the added depth dimension allows more
allocations to use textures instead of falling back to buffers when
texture limits are exceeded.
- A comprehensive set of schedules for Adreno textures.
- Texture packing of arbitrary types up to 128 bit (FP16-NCHW8c,
INT8-NCHW16c ...etc.).
- A clBufferDescriptor debug dump controlled by cmake options.
- Pipeline definition for adreno target.


While covering these features, the below interfaces or passes are enhanced
and need review.

- alloc_tensor: VDevice information is passed across these APIs. For texture
allocation, `alloc_storage` allocates buffer/image objects as requested, and
`alloc_tensor` is then a view of any scope. This takes care of optimally
utilizing the backing memory across different image objects or scopes.
- Constants Saving: Handled by adding a memory scope section in the
executable. This introduces a new header magic to retain backward
compatibility.
- Static Memory Planning: Mostly a port of the Relay static memory planner
with a mixed-mode allocator.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Sanjay <sanjs@qti.qualcomm.com>
- Remove TODO about `ctx->ReportWarning()` since LOG(WARNING) is the
standard pattern
- Remove TODO about skipping type params/constraints since the current
FuncType implementation only contains arg_types and ret_type fields.
`type_params` and `type_constraints` don't exist in the codebase.
- Fixed typo in warning message
This change is part of tile-ai#21.

On the Darwin platform, when trying to compile a `.c` file as Objective-C,
`-x objective-c++` needs to come before the source files in the command-line
arguments. Without this PR, it's not straightforward to do so.
…ope shapes (apache#18658)

This PR supports handling the slope and axis arguments of the PReLU op with
different slope shapes: (1xCx1x1), (S,), (1,1), etc.

### Description
- Handle the slope and axis arguments of the PReLU op (to pass into the
relax.op.nn.prelu function)
- If slope shape = (1xCx1x1), get axis = 1 and reshape slope to (C,)
- else if slope shape = (S,) or (1, 1), get axis = len(x_shape) - 1
(take the last axis of the input x)
(https://onnx.ai/onnx/repo-docs/Broadcasting.html)
 - else raise error
 
### Resolved
- Fixed 1: apache#18596
- Fixed 2: apache#18598
- Fixed 3: apache#18606
- Fixed 4: apache#18607
This PR bumps tvm-ffi to latest version
### Motivation

The ONNX Resize operator supports resizing N-D tensors per the ONNX
specification.
However, the current Relax ONNX frontend only supports 4D inputs and
raises an
assertion error for valid non-4D models.

This PR extends the Relax ONNX Resize converter beyond the 4D-only
restriction,
aligning behavior with the ONNX specification and ONNX Runtime for
supported ranks.

### Changes

- Remove the 4D-only assertion in the Relax ONNX Resize converter
- Preserve the existing 4D resize path without behavior changes
- Support non-4D Resize by lowering to existing resize implementations
for supported ranks
- Ensure Resize attributes are handled correctly for non-4D cases

### Testing

- Verified outputs against ONNX Runtime
- Added ONNX Resize tests covering non-4D input cases
(`test_resize_nd_sizes`)

Fixes: [[Bug] Resize N-D import failure: TVM only supports 4D resize2d,
but ONNX Resize supports N-D tensors](https://github.com/apache/tvm/issues/18608)
…pache#18663)

This PR fixes an issue in StaticPlanBlockMemory where dynamic shapes
were incorrectly planned as static memory when only a lower bound was
provided for TIR variables.

Repro:
<details>
<summary>repro_dynamic_memory_plan.py</summary>

```python
import tvm
from tvm import relax, testing
from tvm.relax.frontend.torch import from_exported_program
from torch.export import Dim, export
import torch


class SimpleConv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


def main():
    model = SimpleConv().eval()

    example = torch.randn(2, 3, 32, 32)
    batch = Dim("batch")  # No max= specified, so upper bound is unknown
    exported = export(model, (example,), dynamic_shapes={"x": {0: batch}})

    mod = from_exported_program(exported)
    mod = relax.transform.DecomposeOpsForInference()(mod)

    target = tvm.target.Target("llvm")
    exe = tvm.compile(mod, target=target)

    vm = relax.VirtualMachine(exe, tvm.cpu())
    inp = tvm.runtime.from_dlpack(example)
    out = vm["main"](inp)

    expected = model(example).detach().numpy()
    actual = out[0].numpy()
    testing.assert_allclose(actual, expected, rtol=1e-4, atol=1e-4)


if __name__ == "__main__":
    main()
```

</details>

This will fail with the following error.

<details>
<summary>output</summary>

```
$ uv run python repro_dynamic_memory_plan.py 
/home/ubuntu/data/project/tvm-example/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/home/ubuntu/data/project/tvm-example/frontend/repro_dynamic_memory_plan.py", line 40, in <module>
    main()
  File "/home/ubuntu/data/project/tvm-example/frontend/repro_dynamic_memory_plan.py", line 32, in main
    out = vm["main"](inp)
          ^^^^^^^^^^^^^^^
  File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 549, in tvm::runtime::vm::VirtualMachineImpl::InvokeClosurePacked(tvm::ffi::ObjectRef const&, tvm::ffi::PackedArgs, tvm::ffi::Any*)
    clo->impl.CallPacked(ffi::PackedArgs(packed_args.data(), packed_args.size()), rv);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 622, in operator()
    *rv = static_cast<VirtualMachineImpl*>(ctx_ptr)->InvokeBytecode(gf_idx, inputs);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 693, in tvm::runtime::vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::ffi::Any, std::allocator<tvm::ffi::Any> > const&)
    RunLoop();
  
  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 816, in tvm::runtime::vm::VirtualMachineImpl::RunLoop()
    this->RunInstrCall(curr_frame, instr);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/vm.cc", line 767, in tvm::runtime::vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::vm::VMFrame*, tvm::runtime::vm::Instruction)
    this->InvokeClosurePacked(func_pool_[instr.func_idx].cast<ObjectRef>(), args, &ret);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/vm/builtin.cc", line 405, in operator()
    *rv = sobj->AllocTensor(offset, shape, dtype);

  File "/home/ubuntu/data/project/tvm-example/tvm/src/runtime/memory/memory_manager.cc", line 98, in tvm::runtime::memory::StorageObj::AllocTensor(long, tvm::ffi::Shape, DLDataType)
    ICHECK(offset + needed_size <= this->buffer.size)
  
  File "/home/ubuntu/data/project/tvm-example/tvm/include/tvm/runtime/logging.h", line 321, in tvm::runtime::detail::LogFatal::~LogFatal()
    GetEntry().Finalize();

  File "/home/ubuntu/data/project/tvm-example/tvm/include/tvm/runtime/logging.h", line 337, in tvm::runtime::detail::LogFatal::Entry::Finalize()
    InternalError error(file_, lineno_, stream_.str());

tvm.error.InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 524288 at offset 0 in region that is 262144bytes
```

</details>
## Why

TODO comment indicated that topi.take should be replaced with
relax.op.take once it has better support. The relax take operator is now
mature enough to replace the TOPI external call.

## How

- Replace bb.emit_te(topi.take, ...) with relax.op.take(..., axis=0) in
EmbedLayerNormalization converter
## Why

The ONNX Unique operator supports four optional outputs (unique values,
indices, inverse_indices, and counts), but the TVM ONNX frontend only
returned the unique values output.

## How

- Updated `Unique._impl_v11` to check the number of expected outputs via
`attr["tvm_custom"]["num_outputs"]`
- Pass `return_index`, `return_inverse`, and `return_counts` parameters
to `relax.op.unique`
- Return a `relax.Tuple` containing all requested outputs
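
For reference, the four outputs mirror NumPy's `np.unique` (with its default sorted ordering):

```python
import numpy as np

x = np.array([2, 1, 1, 3, 4, 3])
values, indices, inverse, counts = np.unique(
    x, return_index=True, return_inverse=True, return_counts=True
)
# values  -> [1 2 3 4]       unique values
# indices -> [1 0 3 4]       first occurrence of each unique value
# inverse -> [1 0 0 2 3 2]   maps each input element to its unique value
# counts  -> [2 1 2 1]       occurrences of each unique value
```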
The line below is not valid MSVC syntax

```
#pragma disagnostic push
```

instead, we should always use:

```
#pragma warning(push)
```

in MSVC.
This allows a user-defined callback to specify layouts dynamically based
on the call description.
Helpful for altering layouts based on operator shapes or attributes.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Why

ONNX models use the Size operator to get total element count of a
tensor. Relax didn't have a native equivalent.

## How

- Adds R.size(tensor) operator that returns the total number of elements
in a tensor as a scalar int64
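
A hypothetical usage sketch, assuming the operator is exposed as `relax.op.size` to mirror the `R.size` spelling above (the exact Python path may differ):

```python
from tvm import relax

x = relax.Var("x", relax.TensorStructInfo((2, 3, 4), "float32"))
call = relax.op.size(x)  # assumed API; builds a relax.Call node
# After build and execution, the result is a scalar int64 tensor holding 24.
```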
This PR supports the Local Response Normalization operator for ONNX.

### Description
Implement and Test Local Response Normalization operator for ONNX
frontend.

### Implement
- Using avg_pool operator to compute LRN
- Pseudocode:
```
def local_response_norm(input, size, alpha, beta, k):
    dim = input.dim()
    check_only_support_3D_4D()
    div = input.mul(input)
    div = expand_dim(div, 1)
    pad_len = size // 2
    if dim == 3:
        div = avg_pool2d(div,
                        (size, 1),
                        stride=1,
                        padding=(pad_len, 0, pad_len, 0))
    else:
        div = avg_pool3d(div,
                        (size, 1, 1),
                        stride=1,
                        padding=(pad_len, 0, 0, pad_len, 0, 0))

    div = squeeze_dim(div, 1)
    div = div.mul(alpha).add(k).pow(beta)
    return input / div
```

### Reference
Implemented the same way as PyTorch:
https://discuss.pytorch.org/t/why-use-avgpool2d-and-avgpool3d-in-local-response-norm/97236
…#18664)

## Why

The LegalizeOps transform was using string matching to detect
data-dependent operators by checking if "dynamic" appears in the
operator name. This approach is fragile and doesn't scale well as new
data-dependent operators are added.

## How

- Add FDataDependent operator attribute to properly mark data-dependent
operators
- Set FDataDependent=true for relax.dynamic_strided_slice operator
- Update LegalizeOps transform to check the FDataDependent attribute
instead of string matching
## Why
NMS operator returns fixed-size output with trailing garbage data,
wasting memory and requiring manual trimming for ONNX
compatibility.
## How
- Add dynamic_strided_slice to trim NMS output to valid detections only
- Build slice parameters using TE compute to avoid legalization issues
## Why

The code was using the TypeScript type name
`FTVMFFIErrorSetRaisedFromCStr` instead of the actual C function export
name `TVMFFIErrorSetRaisedFromCStr`. The F prefix is a naming convention
for TypeScript function types, not part of the actual export name; this
caused RPC to error in my local test.
## How

`lib.exports.FTVMFFIErrorSetRaisedFromCStr` -->
`lib.exports.TVMFFIErrorSetRaisedFromCStr` to match the C function name
defined in ctypes.ts.
…#18683)

## Why
Recent FFI refactoring added new type indices kTVMFFIStr (65) and
kTVMFFIBytes (66). The RPC server didn't
handle these, causing "cannot support type index 65/66" errors when
Python sends string/bytes arguments via
RPC.
## How
`WriteFFIAny` in C++ writes the `type_index` twice for these types. The
fix adds handlers that read and discard the duplicate type_index before
reading the actual data.
## Why

When a client connects without session_constructor_args, the RPC
protocol sends nargs=0 (LocalSession request). The WASM RPC server
requires ["rpc.WasmSession", wasm_binary] instead.
             
## How
Adds handling to skip LocalSession init packets and wait for the proper
WasmSession init, with a log message for debugging.
…18671)

### Motivation
InjectPTXLDG32 rewrites BufferStore when encountering if_then_else, but
it only
initializes temporary buffers when an Allocate node exists. For
functions without
Allocate, this leads to uninitialized buffers and a hard segfault during
compilation.
In addition, the PTX-only pass can run on CPU/LLVM targets when
tir.ptx_ldg32=1,
injecting PTX intrinsics that are invalid for non-CUDA codegen.

This PR ensures temporary buffers are created even when no Allocate
exists, and
skips InjectPTXLDG32 on non-CUDA targets, preventing segfaults and
invalid PTX
intrinsics on CPU.

### Changes
- Ensure temp buffers are created when the rewrite path is taken without
Allocate
- Insert allocations at the function level when needed
- Guard InjectPTXLDG32 so it only runs on CUDA targets
- Add tests for CUDA (insertion) and CPU (skip) behavior

### Testing
test_tir_transform_inject_ptx_ldg32.py

### Fixes
- [apache#18612](apache#18612)
- [apache#18617](apache#18617)
- [apache#18599](apache#18599)
Aharrypotter and others added 30 commits March 26, 2026 22:38
…rt (apache#18933)

## Summary

Add `relax.image.affine_grid` operator for Spatial Transformer Networks,
along with PyTorch and ONNX frontend integration.

TOPI compute (`topi.image.affine_grid`) already exists. This PR
completes the Relax-level registration and frontend support, following
the existing `resize2d` / `grid_sample` pattern.

## Changes

**Relax op registration:**
- C++ op function, FFI registration, and struct info inference
(`resize.h`, `resize.cc`)
- Python wrapper with flexible size parameter handling (`image.py`)
- Legalization to `topi.image.affine_grid` with `PrimExpr` → `int`
conversion
- Op-level tests (struct info inference + e2e numerical correctness) and
legalization test

**PyTorch frontend:**
- Converter for `aten.affine_grid_generator.default`
- Layout conversion from TVM `[N,2,H,W]` to PyTorch `[N,H,W,2]` via
`permute_dims`
- Single-kernel path is 5.6x faster than the decomposed path (30+ ops)
- Structural IR test and numerical correctness test

**ONNX frontend:**
- `AffineGrid` converter with `_impl_v20` (opset 20, when the op was
first introduced)
- Support for constant size tensor `[N,C,H,W]`
- Layout conversion from TVM `[N,2,H,W]` to ONNX `[N,H,W,2]`
- End-to-end correctness test against ONNX Runtime

## Limitations

- Only `align_corners=True` is supported (matches current TOPI
implementation)
- Only 2D affine grid is supported

## Validation

```bash
pytest tests/python/relax/test_op_image.py -k "affine_grid" -v           # 8 passed
pytest tests/python/relax/test_transform_legalize_ops_image.py -k "affine_grid" -v  # 1 passed
pytest tests/python/relax/test_frontend_from_exported_program.py -k "affine_grid" -v  # 2 passed
pytest tests/python/relax/test_frontend_onnx.py -k "affine_grid" -v     # 1 passed
```

All 12 tests passed.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
… tirx rename (apache#18941)

- Remove duplicate Apache license blocks in
`docs/get_started/overview.rst` and `docs/README.md` introduced by
apache#18913, which render as visible garbage text in built documentation
- Fix incorrect `tirx/schedule` and `tirx/tensor_intrin` paths in
`docs/arch/index.rst` — these modules are in `s_tir/`, not `tirx/`
DeviceInfoCollector did not track Bind statements, so when CSE (or
any other pass) inserted a Bind before a thread_extent AttrStmt, the
collected extent referenced a locally-bound variable instead of
function parameters.  LowerDeviceKernelLaunch then produced dangling
references in the host function.

Fix: record Bind definitions in DeviceInfoCollector and inline them
when extracting thread_extent values and dynamic shared memory sizes.
The file `tvm/python/update_version.py` does not exist, so remove the
comment.
…box decode (apache#18942)

Introduce relax.vision.multibox_transform_loc with
MultiboxTransformLocAttrs: decode center-size offsets against ltrb
priors, softmax on class logits, and optional clip, threshold masking,
and background score zeroing. Register the C++ op with FInferStructInfo
checks for shapes and dtypes (including batch and 4*N consistency).
Legalize to topi.vision.multibox_transform_loc.

Add tests for struct inference, invalid inputs, Legalize+e2e on LLVM,
attribute branches, and TVMScript roundtrip. Add a standalone numpy
reference under topi/testing (not exported from tvm.topi.testing to
avoid pulling scipy).

Update TFLite frontend NotImplementedError text for
DETECTION_POSTPROCESS and NON_MAX_SUPPRESSION_V5 to note multibox is
available and link tracking issue apache#18928.
## Summary

Bump tvm-ffi submodule from c85fd42 (apache#471) to 63224e3 (apache#512), spanning
41 commits with 7 breaking changes. Fix regressions introduced by the
bump:

### Fixes

1. **Duplicate field declarations in C++ types** — New tvm-ffi
auto-wires `__init__` from C++ reflection by walking the parent type
chain. Child types that re-declared parent fields
(`RXPlaceholderOpNode`, `FunctionFrameNode`) caused duplicate parameter
errors. Fixed by removing duplicate field registrations from child
types.

2. **Repr format regression** (7 tests) — New tvm-ffi `CObject.__repr__`
uses dataclass repr. Added `Node.__repr__` in `python/tvm/ir/base.py` to
use TVMScript printer for IR nodes.

3. **Host/device function split** (3 tests) — `str(target.kind)` now
returns full dataclass repr instead of kind name. Changed to
`target.kind.name` in `python/tvm/tirx/build.py`.

4. **`__slots__` enforcement** — New tvm-ffi enforces `__slots__=()` on
Object subclasses. Added `__slots__ = ("__dict__",)` to classes that
need instance attributes: `Pass`, `BlockBuilder`, `TVMDerivedObject`.

### Changes
- `3rdparty/tvm-ffi` — submodule bump c85fd42 → 63224e3
- `python/tvm/ir/base.py` — `Node.__repr__` using TVMScript printer
- `python/tvm/ir/transform.py` — `Pass.__slots__ = ("__dict__",)`
- `python/tvm/tirx/build.py` — `target.kind.name` instead of
`str(target.kind)`
- `python/tvm/relax/block_builder.py` — `BlockBuilder.__slots__ =
("__dict__",)`
- `python/tvm/runtime/support.py` — `TVMDerivedObject.__slots__ =
("__dict__", "__weakref__")`
- `python/tvm/s_tir/meta_schedule/utils.py` —
`TVMDerivedObject.__slots__ = ("__dict__",)`
- `include/tvm/script/ir_builder/relax/frame.h` — remove duplicate field
registrations
- `src/relax/ir/emit_te.h` — remove duplicate field registrations

## Test plan
- [x] tirx-base: 251 passed, 23 skipped
- [x] relax import + build: verified
- [ ] Full CI
Include `ffi/extra/json_parser.cc` and `ffi/extra/json_writer.cc` to
maintain compatibility after FFI JSON refactor
## Summary

Add `relax.vision.get_valid_counts` and classic
`relax.vision.non_max_suppression` for object-detection post-processing
pipelines.

`get_valid_counts` performs score-based bounding box filtering and
compacts valid boxes to the front of each batch. Classic
`non_max_suppression` performs flexible IoU-based suppression on
filtered boxes, complementing existing `all_class_non_max_suppression`
for custom post-processing workflows.

This PR implements the Relax-level registration, legalization, TOPI
compute, and test coverage for both operators.

## Changes

**Relax op registration and legalization:**
- C++ op functions, FFI registration, and struct info inference for both
operators (`vision.h`, `vision.cc`)
- Python wrappers with Relax docstrings (`vision.py`)
- Legalization to `topi.vision.get_valid_counts` and
`topi.vision.non_max_suppression`
- Additional struct-info validation for `score_index`, `id_index`, and
`coord_start` when `elem_length` is statically known

**TOPI and testing:**
- Full TOPI implementation for `get_valid_counts`
- Reimplementation of classic `non_max_suppression` in TOPI
- NumPy reference implementations in `tvm.topi.testing` for both
operators
- Op-level tests for struct info inference, legalization, invalid
attribute ranges, and e2e numerical correctness
- Stronger legalization tests that verify both `relax.call_tir`
introduction and removal of the original Relax vision op

## Limitations

- Attribute range validation for `score_index`, `id_index`, and
`coord_start` is only enforced when the input `elem_length` is
statically known during struct-info inference.
- Classic `non_max_suppression` follows the existing Relax / TOPI API
shape and is intended for single-class or class-aware custom
post-processing flows, distinct from `all_class_non_max_suppression`.

## Validation

```bash
pytest tests/python/relax/test_op_vision.py -k "get_valid_counts" -v
pytest tests/python/relax/test_op_vision.py -k "test_nms_" -v
```
All related tests passed.
…he#18946)

### Summary

This PR implements the ONNX `If` operator in the Relax ONNX frontend.
The `If` operator enables conditional branching in ONNX models, where a
boolean condition selects between two subgraph branches (`then_branch`
and `else_branch`) at runtime. This is required for any model with
runtime-dependent execution paths.

Closes apache#18945 (Tier 1 — `If` operator)

### Implementation Notes

- The main challenge is that `relax.If` cannot be emitted inside a
dataflow block, which is how the ONNX frontend normally builds the
entire graph. To handle this, when the graph contains an `If` node, the
function body is built as a regular binding block instead — matching the
approach used by the PyTorch Relax frontend for `torch.cond`.

- Each branch is an ONNX subgraph that can reference values from the
outer graph. A new `_convert_subgraph` method handles converting these
subgraphs into Relax expressions, making outer-scope values available to
the branch while ensuring branch-local bindings don't leak back to the
parent graph.

### Why `relax.If` cannot live inside a dataflow block

Dataflow blocks in Relax carry a semantic guarantee: every operation
inside them must be pure and side-effect-free with no control flow. This
allows the compiler to treat the entire block as a static computational
graph for optimizations like operator fusion and constant folding. An
`If` node breaks this guarantee by introducing runtime-dependent
branching, so Relax's well-formedness checker explicitly forbids it. I
discovered this when the checker raised:
```
This IR is not well-formed: If nodes are not allowed to appear in dataflow blocks.
```

The fix — skipping the dataflow block when the graph contains an `If`
node — mirrors exactly how the PyTorch Relax frontend handles
`torch.cond`.
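
For illustration, a minimal TVMScript sketch (names are illustrative) of a function with runtime branching once it lives outside a dataflow block:

```python
from tvm.script import relax as R

@R.function
def main(cond: R.Tensor((), "bool"), x: R.Tensor((4,), "float32")) -> R.Tensor((4,), "float32"):
    # Plain binding block, no R.dataflow(): the If is allowed here.
    if cond:
        y = R.add(x, x)
    else:
        y = R.multiply(x, x)
    return y
```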

### Known Limitations

**Dataflow block**: Models whose top-level graph contains an `If` node
are built without a dataflow block, which may affect downstream
optimisation passes that rely on dataflow block structure.

### Tests

Four new tests covering: scalar and tensor conditions, condition
computed from another op, and multiple branch outputs. All verified
against onnxruntime via `check_correctness`.

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>
…e#18948)

Introduce relax.nn.conv3d_transpose (attrs, C++ inference/layout, Python
API) and lower it to TOPI group_conv3d_transpose_ncdhw when using
NCDHW/IODHW with dilation 1, matching the conv2d_transpose legalization
policy.

Wire the Relax ONNX frontend to emit conv3d_transpose for 5D inputs.
Extend tests for ONNX, struct info, LegalizeOps, and TVMScript
round-trip; fix ConvTranspose test output spatial size to include
output_padding. apache#18945
…#18952)

## Summary

Add Relax `roi_pool` support and wire it through the ONNX frontend for
`MaxRoiPool`.

## Changes

- add `relax.vision.roi_pool`, including attrs, Python wrapper, struct
info inference, and legalization
- add TOPI `roi_pool` compute for NCHW layout
- support ONNX `MaxRoiPool` in the Relax ONNX frontend
- handle empty / out-of-bound pooled bins according to ONNX/reference
semantics, returning `0` instead of propagating invalid reductions
- add regression tests for Relax op inference, legalization, and ONNX
frontend import
- add out-of-bound ROI coverage to make sure fully invalid pooled bins
still match ONNX Runtime

## Validation

- `pytest tests/python/relax/test_op_vision.py -k roi_pool`
- `pytest tests/python/relax/test_frontend_onnx.py -k 'max_roi_pool'`


This PR completes the `MaxRoiPool` portion of the Relax ONNX frontend
operator work tracked in apache#18945.
…ache#18951)

### Summary

Implements the `MatMulInteger` operator (opset 10) in the Relax ONNX
frontend — INT8 matrix multiplication. Required for quantized model
inference (e.g. ONNX QDQ models).

Closes apache#18945 (Tier 1 — MatMulInteger operator)

### Tests

- All 4 `int8`/`uint8` dtype combinations, with and without scalar zero
points
- 3-D and 4-D batched matmul

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>
…dule (apache#18947)

This PR adds a new tutorial `mix_python_and_tvm_with_pymodule.py`
demonstrating how to use `BasePyModule` to mix Python/PyTorch functions
with TIR and Relax in a single IRModule.
## Tutorial Contents (7 steps)                                  
- **Step 1**: `I.pyfunc` + `call_tir` basics, DLPack zero-copy
conversion, `show()`
- **Step 2**: Debugging with `print` in pyfuncs — inspect intermediate
tensors without compiling
- **Step 3**: Realistic pipeline combining `call_tir`,
`call_dps_packed`, and Python/PyTorch in one forward pass
- **Step 4**: Dynamic function registration via `add_python_function`
- **Step 5**: `RelaxToPyFuncConverter` — convert Relax IR to PyTorch at
different compilation stages (before and after passes) to verify
numerical correctness
- **Step 6**: `R.call_py_func` — cross-level calls between compiled
Relax VM and Python functions
- **Step 7**: Symbolic shapes for dynamic batch sizes
This PR also fixes a bug in `BasePyModule._compile_functions` where
modules without Relax functions would incorrectly attempt Relax VM
compilation, producing spurious warnings like `Failed to compile Relax
VM: 'NoneType' object has no attribute 'kind'`.
…he#18950)

## Summary

This PR adds Relax ONNX frontend support for:
- `Optional`
- `OptionalHasElement`
- `OptionalGetElement`
- `MatMulInteger16` from the `com.microsoft` domain

The implementation follows existing TVM ONNX frontend patterns and keeps
Optional handling explicit through an empty-Optional sentinel during
import.

## Changes

- add ONNX frontend converters for `Optional`, `OptionalHasElement`, and
`OptionalGetElement`
- add ONNX frontend converter for `MatMulInteger16`
- extend ONNX attribute parsing to handle `TYPE_PROTO`
- preserve empty Optional values during import and unwrap them
consistently
- register Optional-related ops and `MatMulInteger16` in the ONNX
converter map
- handle Optional outputs correctly in importer output counting and
normalization
- tighten converter docstrings and input validation for better
consistency with nearby TVM code

## Tests

Added or updated tests in `tests/python/relax/test_frontend_onnx.py` to
cover:
- numerical correctness for `MatMulInteger16`
- structural IR checks for `MatMulInteger16`
- invalid dtype rejection for `MatMulInteger16`
- tensor and sequence Optional round-trips
- empty Optional behavior for `OptionalHasElement`
- structural IR checks ensuring Optional ops are erased as expected
- missing `type` attribute rejection for empty `Optional`
- empty `OptionalGetElement` rejection

## Validation

Validated with:
- `python -m ruff check python/tvm/relax/frontend/onnx/onnx_frontend.py
tests/python/relax/test_frontend_onnx.py`
- `python -m pytest -n 1 tests/python/relax/test_frontend_onnx.py -k
"optional or matmulinteger16" -v`

Result:
- `13 passed`

This PR completes the ONNX `MatMulInteger16` and `Optional` work tracked
in apache#18945.
…pache#18956)

## Summary

Complete `Reshape` handling for shape values in the Relax ONNX frontend.

## Changes

- keep `ShapeExpr -> Reshape([-1])` on the shape-specialized path
- materialize `ShapeExpr` to an `int64` tensor for other reshape targets
and apply regular tensor reshape semantics
- add frontend coverage for `Shape -> Reshape([-1])`
- add frontend coverage for reshaping shape outputs to non-`[-1]`
targets such as `[1, 3]` and `[3, 1]`
- extend symbolic shape deduction coverage to include the common `Shape
-> Reshape([-1]) -> Gather -> Unsqueeze` shape-construction pattern

## Validation

- `pytest -k 'test_symbolic_shape_deduction or test_reshape_shape_output
or test_reshape'`

This PR completes the `Reshape` limitation in the Relax ONNX frontend
operator work tracked in apache#18945.
)

Since we've split the old `tir` namespace into `tirx` (core IR /
lowering) and `s_tir` (schedule primitives / auto-tuning), some outdated
documentation needs to be updated. The global rename still leaves a few
concept-level references using "tirx" in prose (for example, "Relax and
tirx programs"). Since "tirx" now refers only to one part of the old
TensorIR stack, these higher-level references should use "TensorIR"
instead, so they correctly cover both `tirx` and `s_tir`.

In this PR, we
- Add tirx / s_tir module descriptions to
`docs/deep_dive/tensor_ir/index.rst` and `docs/arch/index.rst` (new
`tvm/s_tir` section, updated `tvm/tirx` section).
- Fix concept-level prose in `docs/arch/index.rst` and
`docs/arch/pass_infra.rst` to use "TensorIR" instead of "tirx" where
referring to the concept rather than the namespace.
- Fix `docs/arch/runtimes/vulkan.rst` to use "TIR" instead of "tirx" in
debug shader description.
- Correct `tvm/dlight` → `tvm/s_tir/dlight` section path and "tirx
schedules" → "s_tir schedules" in `docs/arch/index.rst`.
- Revert unintended label changes in `abstraction.rst` and
`tir_creation.py` (labels kept as `_tir-abstraction`, `_tir-creation`).
- Revert unintended title change in `tir_transformation.py` (kept as
"Transformation").
- Revert `exclude-members` change in `tirx/tirx.rst` (kept original
list).
…` and `Slice` (apache#18955)

## Summary

Relates to apache#18945.

This PR improves ONNX frontend handling for dynamic
`Unsqueeze`/`Squeeze`/`Slice`, tightens validation paths, and adds
targeted structural/negative regression tests.

- Refactor constant-path `Unsqueeze` lowering to use a single `reshape`
based on computed target shape.
- Remove scalar-specific branching and repeated `expand_dims` in the
constant path.
- Add/keep structural helper usage in ONNX frontend tests for Relax
call-op checks.
- Add regression coverage for scalar-input `Unsqueeze`.

## Changes

- Add dynamic-axes conversion paths for `Unsqueeze` and `Squeeze`:
  - infer output shape via runtime shape-tensor construction
- lower to `relax.reshape` with validated shape rank/length assumptions
- Improve `Slice` conversion robustness:
  - support dynamic parameter forms with stricter rank/length validation
  - reject invalid zero-step inputs when statically known
  - fix docstring wording (`Splice` -> `Slice`)
- Strengthen ONNX frontend tests:
  - negative test for duplicate `Unsqueeze` axes
- structural IR check for dynamic `Slice` (`relax.dynamic_strided_slice`
present, `relax.strided_slice` absent)
  - negative test for zero-step `Slice`
- Refactor constant-path `Unsqueeze` scalar handling:
- replace scalar special-casing + repeated `expand_dims` with one
target-shape `reshape`
  - add scalar-input regression test
- Restore shared test helper used by structural Relax call-op checks.

## Validation

- `ruff check`: passed
- `pre-commit --files`: passed
- `pytest`: 8 passed
The ONNX Resize converter previously rejected non-constant ROI inputs,
which blocked models where ROI is provided at runtime. This change adds
a dynamic-ROI path lowered through TOPI resize kernels while preserving
the existing relax.image.resize* path for static ROI.

Specifically:
- add reusable helper to convert ONNX full ROI ([starts..., ends...])
into spatial ROI vector
- add reusable helper to emit topi.image.resize1d/2d/3d for dynamic ROI
- keep static ROI fast path for relax.image.resize2d/resize3d
- normalize dynamic ROI expr before emit_te to ensure struct_info is
populated
- handle optional Resize inputs (roi/scales/sizes) more defensively
- add frontend test coverage with graph-input ROI:
test_resize_dynamic_roi_tf_crop_and_resize

Ref: apache#18945
…PI error (apache#18957)

- The intermediate variable `ceil_log2` in `gpu_2d_continuous_cumsum`
created a `LetStmt`-bound `Var` in the TIR function
- When `MakePackedAPI` processed the function, it reported `ceil_log2`
as an undefined variable not passed as an API argument
- Inline the expression directly into `total_rounds` to avoid the
intermediate `Var` — the computation is identical

## Test plan
- Compile a model that uses GPU sampling (e.g. any LLM with top-p
sampling on Metal) and verify compilation succeeds
- The error this fixes: `Check failed: undefined.size() == 0: In
PrimFunc gpu_2d_continuous_cumsum variables [ceil_log2] are used, but
are not passed in as API arguments`

Co-authored-by: Akaash Parthasarathy <43900735+akaashrp@users.noreply.github.com>
- Fix a few typos
- Unify Android naming
- Fix HTTPS link
…ons (apache#18960)

Compiling Qwen3.5 yielded WGSL of the following form: `var<storage,
read_write> storage : array<f32>;
`. This led to a 'cyclic dependency' error due to the identifier
collision. This PR reserves keywords such as storage to avoid parsing
errors.
…ts (apache#18959)

Replace `str(target.kind)` with `target.kind.name` for `Target` objects
since `target.kind` is a `TargetKind` object while `target.kind.name`
yields a string describing the target
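
A quick illustration of the difference:

```python
import tvm

target = tvm.target.Target("llvm")
print(target.kind.name)  # "llvm" -- the plain kind name these checks expect
print(str(target.kind))  # full TargetKind repr after the tvm-ffi bump, not just the name
```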
- Fix rollup errors by upgrading rollup version and updating rollup
config
- Bump tvmjs version to 0.24.0-dev3
…che#18969)

### Summary

This PR implements the `select_last_index` attribute (introduced in
opset 12) for the `ArgMax` and `ArgMin` ONNX operators.

Previously, setting `select_last_index=1` raised
`OpAttributeUnImplemented`. This closes the limitation tracked in the
ONNX frontend issue.

### Implementation

When `select_last_index=1`, the input tensor is reversed along the
reduction axis using `relax.op.flip`, argmax/argmin is computed on the
flipped copy, and the result is remapped back to the original index
space via `last_idx = (axis_size - 1) - flipped_idx`
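
A tiny NumPy check of that remapping:

```python
import numpy as np

x = np.array([3, 7, 1, 7, 2])
flipped_idx = np.argmax(x[::-1])            # index of the max in the flipped copy
last_idx = (x.shape[0] - 1) - flipped_idx   # 3: the last occurrence of the max
assert last_idx == 3 and x[last_idx] == x.max()
```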

Closes part of apache#18945

---------

Signed-off-by: OmarAzizi <oalazizi75@gmail.com>