Optimizations for MXFP8/NVFP4 dequantize kernels #2865

YigongQin wants to merge 14 commits into NVIDIA:main
Conversation
The following relevant unit tests passed on SM100 (with the drop …)
After this PR, the forward pass is around 3-4% faster for DeepSeek-shape MoE.
Greptile Summary

This PR extends the MXFP8 and NVFP4 dequantize kernels to support GEMM-swizzled scale layouts by templating both kernels on WITH_GEMM_SWIZZLED_SCALES.

Confidence Score: 5/5

Safe to merge; no P0/P1 issues found, and all findings are P2 suggestions. The kernel index math for both rowwise and colwise swizzled paths was verified to be within bounds. The empty-tensor guard is correctly placed at the dispatch level. The Python workaround removals are consistent with the new dequantize capability. One P2 asymmetry exists in basic_linear.py, where the removed condition was broader than in the analogous sibling modules, but it does not affect current behavior: transformer_engine/pytorch/ops/basic/basic_linear.py removed a broader condition (backward_override is not None) than linear.py and layernorm_linear.py (== "dequantized").
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[nvte_dequantize] --> B[dequantize_helper dispatch]
B --> C{numel == 0?}
C -->|yes| D[early return — CUDA graph safe]
C -->|no| E{scaling_mode}
E -->|NVTE_MXFP8_1D_SCALING| F[mxfp8::dequantize]
E -->|NVTE_NVFP4_1D_SCALING| G[nvfp4::dequantize]
F --> H{with_gemm_swizzled_scales?}
G --> I{with_gemm_swizzled_scales?}
H -->|true| J[dequantize_mxfp8_kernel WITH_GEMM_SWIZZLED_SCALES=true]
H -->|false| K[dequantize_mxfp8_kernel WITH_GEMM_SWIZZLED_SCALES=false]
I -->|true| L[dequantize_fp4_kernel WITH_GEMM_SWIZZLED_SCALES=true]
I -->|false| M[dequantize_fp4_kernel WITH_GEMM_SWIZZLED_SCALES=false]
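
As a rough illustration of the dispatch flow above, here is a minimal Python sketch. The function names and the runtime swizzled flag are hypothetical stand-ins for the C++ entry points; in the real code, WITH_GEMM_SWIZZLED_SCALES is a compile-time template parameter rather than a runtime branch.

import torch

# Hypothetical Python stand-ins for the templated CUDA kernels. The real
# kernels are C++ and are instantiated with WITH_GEMM_SWIZZLED_SCALES as a
# template parameter, so the layout branch costs nothing at runtime.
def _mxfp8_dequantize(t: torch.Tensor, swizzled_scales: bool) -> torch.Tensor:
    return t.float()  # placeholder: the real kernel applies MXFP8 block scales

def _nvfp4_dequantize(t: torch.Tensor, swizzled_scales: bool) -> torch.Tensor:
    return t.float()  # placeholder: the real kernel applies NVFP4 block scales

def dequantize(t, scaling_mode, with_gemm_swizzled_scales=False):
    if t.numel() == 0:
        return t.float()  # early return at the dispatch level: CUDA-graph safe
    if scaling_mode == "NVTE_MXFP8_1D_SCALING":
        return _mxfp8_dequantize(t, with_gemm_swizzled_scales)
    if scaling_mode == "NVTE_NVFP4_1D_SCALING":
        return _nvfp4_dequantize(t, with_gemm_swizzled_scales)
    raise ValueError(f"unsupported scaling mode: {scaling_mode}")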
Reviews (14). Last reviewed commit: "Remove unnecessary scale from NVFP4 C++ ..."
std::vector<std::pair<size_t, size_t>> nvfp4_tensor_dims = {
There is one edge case:

For MXFP8, when the input shape is 64x64, it will produce a scaling-factor shape of 64x2, which is then zero-padded to 128x4. We should be able to inject some very large random values into the padded region during allocation (because we allocate with torch.empty rather than torch.zeros) and check whether the dequantize result is affected. If things work as expected, this line will be triggered and the dequantize numerics won't be affected.

For NVFP4, I think optimize-for-GEMM (or swizzle fusion) is actually not enabled, and the same applies to the zero-out edge-case handling logic? So there shouldn't be any unswizzle logic needed here?
For NVFP4, I believe currently only device-init grouped quantize with RHT has the swizzle fusion feature, so the scaling factor zero-out is the job of the dedicated swizzle kernel. So if we dequantize + unswizzle for NVFP4, the unswizzle logic might not be correct.
For both MXFP8 and NVFP4, the unit test logic is:
1. Generate compact scales (or obtain them from quantization).
2. Call nvte_swizzle_scaling_factors to swizzle the compact scales.
3. Compare the results of nvte_dequantize with compact scales against the results with swizzled scales.
Quantize with swizzle fusion is never enabled for either MXFP8 or NVFP4.
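
As a concrete illustration of the padded-garbage check discussed above, here is a self-contained toy sketch in PyTorch. It is not the actual TE test (which drives nvte_swizzle_scaling_factors and nvte_dequantize); toy_dequantize and the 32-element block size are illustrative assumptions.

import torch

# Toy dequantize: each scale covers a 32-element block along the last dim
# (MXFP8-style). A correct kernel indexes only the valid scale region.
def toy_dequantize(data, scales):
    return data * scales.repeat_interleave(32, dim=1)

data = torch.randn(64, 64)
compact = torch.rand(64, 2) + 0.5     # compact scales for a 64x64 input

padded = torch.empty(128, 4)          # padded buffer, as with torch.empty
padded.fill_(1e30)                    # inject huge values everywhere...
padded[:64, :2] = compact             # ...then write only the valid region

ref = toy_dequantize(data, compact)
out = toy_dequantize(data, padded[:64, :2])  # reads only the valid region
torch.testing.assert_close(out, ref)         # numerics unaffected by padding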
fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
if fp8_recipe.backward_override == "dequantized" and (
    fp8_recipe.mxfp8() or fp8_recipe.nvfp4()
):
    input_quantizer.optimize_for_gemm = False
    if grad_output_quantizer is not None:
        grad_output_quantizer.optimize_for_gemm = False
I'm of two minds about this:
- Logically, GEMM-optimized data is not guaranteed to support anything except GEMMs. Even if MXFP8 and NVFP4 dequant happens to support it, these are custom optimizations. Future recipes cannot be expected to support dequantizing GEMM-optimized data by default.
- It's a little pedantic to have edge-case logic that won't be triggered by any of our existing use-cases. Given how subtle this is, I worry about it becoming stale and distracting.
I think for now, this change is fine. However, if we encounter problems in a future recipe, we should reimplement it properly:
# LOGICALLY WRONG!
# Fails if we add a new recipe
if recipe.backward_override == "dequantized" and recipe.future_recipe():
    input_quantizer.optimize_for_gemm = False

# LOGICALLY RIGHT!
# Automatically handles new recipes
if recipe.backward_override == "dequantized" and not (
    recipe.float8_per_tensor_scaling()
    or recipe.float8_block_scaling()
    or recipe.mxfp8()
    or recipe.nvfp4()
):
    input_quantizer.optimize_for_gemm = False
/te-ci pytorch L1

/te-ci core
timmoon10 left a comment:

LGTM, pending CI. These kernels will be very useful.
/te-ci
Description
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: