[WebGPU] Support warp-level shuffle primitives with subgroup #17699

Draft
CharlieFRuan wants to merge 1 commit into apache:main from CharlieFRuan:pr-0302-webgpu-shuffle

Conversation

CharlieFRuan (Member) commented Mar 3, 2025

Overview

This PR adds support for warp-level shuffle primitives using the newly introduced subgroups feature in WebGPU, and then uses them when lowering allreduce. A short sketch follows the list of primitives below.

The introduced primitives are:

  • subgroupShuffle()
  • subgroupShuffleUp()
  • subgroupShuffleDown()
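
As a concrete illustration, a minimal hand-written WGSL allreduce over one subgroup might look like the sketch below. This is not the TVM-generated code (the buffer name `data`, the 64-thread workgroup, and the kernel shape are made up for the example; the real dumped kernels are in the gist linked under the testing note):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(subgroup_size) sg_size: u32) {
  var sum = data[gid.x];
  // Shuffle-down tree: after the loop, lane 0 of each subgroup holds
  // the sum over the whole subgroup.
  for (var offset = sg_size / 2u; offset > 0u; offset = offset / 2u) {
    sum = sum + subgroupShuffleDown(sum, offset);
  }
  // subgroupShuffle reads from an arbitrary lane; broadcasting lane 0's
  // value back to every lane completes the allreduce. subgroupShuffleUp
  // is the mirror of subgroupShuffleDown (it reads from lane - delta).
  sum = subgroupShuffle(sum, 0u);
  data[gid.x] = sum;
}
```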

This PR largely follows the Metal counterpart.

Tested end-to-end with WebLLM on Llama3.2-1B-q4f16_1 and Llama3.1-8B-q4f16_1. The dumped WebGPU kernels indeed contain subgroup shuffle primitives: https://gist.github.com/CharlieFRuan/cb54a8db0513ecbbc16c5de8df5ab845

Remaining TODOs

  • Benchmark speedup
  • Be able to parameterize whether to use subgroups when targeting WebGPU, since not all devices support them
  • Check GPUFeatureName's inclusion of subgroups in @webgpu/types
  • Some WebGPU devices support more than 256 max threads per block; be able to target these different limits (see the sketch after this list)
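
On the last point, one possible mechanism (an assumption on my part, not necessarily what this PR will end up doing) is a pipeline-overridable workgroup size that the host picks per device, e.g. from the adapter's reported maxComputeInvocationsPerWorkgroup limit:

```wgsl
// Hypothetical sketch: BLOCK_SIZE is a pipeline-overridable constant;
// the host sets it via the `constants` field of createComputePipeline()
// after inspecting the device's limits.
override BLOCK_SIZE: u32 = 256u;

@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(BLOCK_SIZE)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  data[gid.x] = data[gid.x] * 2.0;  // placeholder body
}
```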


Comment thread on web/src/webgpu.ts
}

const requiredFeatures: GPUFeatureName[] = [];
// TODO(Charlie): cannot type-annotate "subgroups" here because
// @webgpu/types does not yet include it in GPUFeatureName.

A contributor replied:

@webgpu/types 0.1.55 should work now. See gpuweb/types#167

CharlieFRuan (Member, Author) replied:

Great, thanks!

MasterJH5574 pushed a commit that referenced this pull request Apr 6, 2026
## Summary
This adds gating logic on top of #17699 to support optional subgroup shuffle primitives based on a compile-time flag.

## Problem
The PR #17699 always generates subgroup shuffle ops when targeting WebGPU. However, not all WebGPU devices support subgroups. We need a way to:
- Default to shared memory reductions (universally compatible)
- Optionally enable subgroup shuffles for devices that support them

## Solution
Implement gating via a TVM target parameter:
- Default `thread_warp_size=1` disables warp reductions (uses shared memory + barriers)
- Add target parser `UpdateWebGPUAttrs()` that sets `thread_warp_size=32` when `supports_subgroups=true`
- Add an `--enable-subgroups` CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction-path selection level (`IsWarpReduction()` in `lower_thread_allreduce.cc`), ensuring subgroup ops are never generated unless explicitly enabled.

## Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared-memory reductions; with the flag, it generates subgroupShuffle* ops.
Both generated WGSLs are here:
https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f
Benchmarking:
https://gist.github.com/ksgr5566/c9bd5bc5aadba999ec2f2c38eb0c49b3
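
For contrast with the subgroup path, the shared-memory reduction that the default (`thread_warp_size=1`) path falls back to looks roughly like the following. This is an illustrative WGSL sketch with a made-up buffer and a fixed 64-thread workgroup, not the generated kernel (see the gist above for the real output):

```wgsl
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

var<workgroup> scratch: array<f32, 64>;

@compute @workgroup_size(64)
fn reduce_shared(@builtin(local_invocation_index) tid: u32,
                 @builtin(global_invocation_id) gid: vec3<u32>) {
  scratch[tid] = data[gid.x];
  workgroupBarrier();
  // Tree reduction in shared memory; every step needs a barrier,
  // which is what the subgroup shuffle path avoids.
  for (var stride = 32u; stride > 0u; stride = stride / 2u) {
    if (tid < stride) {
      scratch[tid] = scratch[tid] + scratch[tid + stride];
    }
    workgroupBarrier();
  }
  if (tid == 0u) {
    data[gid.x] = scratch[0u];
  }
}
```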
Aharrypotter pushed a commit to Aharrypotter/tvm that referenced this pull request Apr 10, 2026 (same change as above).