[WebGPU] Support warp-level shuffle primitives with subgroup #17699
Draft
CharlieFRuan wants to merge 1 commit into apache:main from
Conversation
```typescript
const requiredFeatures: GPUFeatureName[] = [];
// TODO(Charlie): cannot type annotate because @webgpu/types
```
Contributor
@webgpu/types 0.1.55 should work now. See gpuweb/types#167
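For context, a minimal sketch (not the PR's actual code) of how a `requiredFeatures` list like the one above can gate on adapter support. `buildRequiredFeatures` is a hypothetical helper introduced here for illustration; it models `GPUAdapter.features` as a plain `Set<string>` so it stays self-contained.

```typescript
// Request the "subgroups" feature only when the adapter reports it.
// With @webgpu/types >= 0.1.55, "subgroups" is a valid GPUFeatureName,
// so no type cast is needed when pushing it in real WebGPU code.
function buildRequiredFeatures(adapterFeatures: Set<string>): string[] {
  const requiredFeatures: string[] = [];
  if (adapterFeatures.has("shader-f16")) {
    requiredFeatures.push("shader-f16");
  }
  if (adapterFeatures.has("subgroups")) {
    requiredFeatures.push("subgroups");
  }
  return requiredFeatures;
}

// Browser-only usage sketch:
//   const adapter = await navigator.gpu.requestAdapter();
//   const device = await adapter!.requestDevice({
//     requiredFeatures: buildRequiredFeatures(
//       new Set(adapter!.features)) as GPUFeatureName[],
//   });
```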
This was referenced Feb 24, 2026
MasterJH5574 pushed a commit that referenced this pull request on Apr 6, 2026
## Summary

This adds gating logic on top of #17699 to support optional subgroup shuffle primitives based on a compile-time flag.

## Problem

PR #17699 always generates subgroup shuffle ops when targeting WebGPU. However, not all WebGPU devices support subgroups. We need a way to:

- Default to shared memory reductions (universally compatible)
- Optionally enable subgroup shuffles for devices that support them

## Solution

Implement gating via a TVM target parameter:

- Default `thread_warp_size=1` disables warp reductions (uses shared memory + barriers)
- Add target parser `UpdateWebGPUAttrs()` that sets `thread_warp_size=32` when `supports_subgroups=true`
- Add `--enable-subgroups` CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (`IsWarpReduction()` in `lower_thread_allreduce.cc`), ensuring subgroup ops are never generated unless explicitly enabled.

## Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions; with the flag, it generates subgroupShuffle* ops. Both generated WGSLs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f

Benchmarking: https://gist.github.com/ksgr5566/c9bd5bc5aadba999ec2f2c38eb0c49b3
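The gating decision described above can be sketched as a single switch on the target's warp size. Note the real logic lives in C++ (`IsWarpReduction()` in `lower_thread_allreduce.cc`); `chooseReduction` here is a hypothetical stand-in for illustration only.

```typescript
type ReductionPath = "warp-shuffle" | "shared-memory";

// thread_warp_size == 1 (the WebGPU default) means no warp-level
// primitives are assumed: fall back to shared memory + barriers.
// A larger warp size (set via supports_subgroups=true) enables the
// subgroup shuffle path.
function chooseReduction(threadWarpSize: number): ReductionPath {
  return threadWarpSize > 1 ? "warp-shuffle" : "shared-memory";
}
```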
Aharrypotter pushed a commit to Aharrypotter/tvm that referenced this pull request on Apr 10, 2026
Overview

This PR supports warp-level shuffle primitives using the newly introduced subgroups feature in WebGPU, and uses them in the implementation of allreduce lowering. The introduced primitives are:

- subgroupShuffle()
- subgroupShuffleUp()
- subgroupShuffleDown()

This PR largely follows the Metal counterpart:
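To illustrate how the shuffle primitives support allreduce lowering, here is a host-side simulation (a sketch, not WGSL and not the PR's code) of the classic shuffle-down tree reduction: each lane repeatedly adds the value held by the lane `delta` positions above it, halving `delta` each step, which is what `subgroupShuffleDown()` enables on-GPU.

```typescript
// Simulate a subgroup-wide sum reduction using shuffle-down semantics.
// Assumes a power-of-two lane count (e.g. a subgroup size of 32).
// Out-of-range shuffles keep the lane's own value here; on real hardware
// that case is avoided or its result is unused.
function subgroupAllReduceSum(lanes: number[]): number[] {
  const vals = lanes.slice();
  const n = vals.length;
  for (let delta = n >> 1; delta > 0; delta >>= 1) {
    // subgroupShuffleDown(vals, delta): lane i reads lane i + delta.
    const shuffled = vals.map((_, lane) =>
      lane + delta < n ? vals[lane + delta] : vals[lane]);
    for (let lane = 0; lane < n; lane++) vals[lane] += shuffled[lane];
  }
  return vals; // lane 0 holds the full sum
}
```

After log2(n) steps, lane 0 holds the sum of all lanes with no shared-memory round trips or barriers, which is the advantage over the shared-memory reduction path.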
Tested with Llama3.2-1B-q4f16_1 and Llama3.1-8B-q4f16_1 E2E with WebLLM. The dumped WebGPU kernel indeed contains subgroup shuffle primitives: https://gist.github.com/CharlieFRuan/cb54a8db0513ecbbc16c5de8df5ab845

Remaining TODOs

- GPUFeatureName's inclusion of subgroups in @webgpu/types

Resources