[Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization by b8zhong · Pull Request #15514 · sgl-project/sglang

b8zhong · 2025-12-20T05:54:38Z

Motivation

After flashinfer-ai/flashinfer#2131 landed in Flashinfer, we can use SwapAB, which swaps the order of the two GEMM inputs. This helps when the M dimension is < 32 (e.g. when BS < 32 during decoding). When M is larger, there is no benefit.
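To make the operand swap concrete, here is a minimal pure-Python sketch of the algebraic identity behind it: computing A·Bᵀ directly gives the same result as computing B·Aᵀ and transposing. The function names are hypothetical and this is not the Flashinfer kernel. My understanding is that on Hopper the tensor-core (WGMMA) instruction fixes its M tile at 64 while N can be as small as 8, so placing the small batch dimension on the N side wastes less of each tile; that rationale is an assumption here, not something stated in this PR.

```python
def matmul_nt(a, b):
    """Return a @ b^T, where a is (M, K) and b is (N, K), as nested lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def gemm_swap_ab(a, b):
    """Compute a @ b^T by swapping the operands and transposing the result."""
    # b @ a^T has shape (N, M); its transpose is the (M, N) result we want.
    return transpose(matmul_nt(b, a))


# Decode-like shape: small M (batch), larger N (output features).
activations = [[1, 2, 3], [4, 5, 6]]                    # M=2, K=3
weights = [[1, 0, 1], [0, 1, 0], [2, 2, 2], [1, 1, 1]]  # N=4, K=3

# Both orderings produce the same (M, N) output.
assert gemm_swap_ab(activations, weights) == matmul_nt(activations, weights)
```

In a real kernel the final transpose would typically be folded into how the output tile is written, rather than done as a separate pass.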

Modifications

(Requires Flashinfer nightly; the backend currently supports only SM90.)
Note that Flashinfer compiles its own DeepGEMM, so it is separate from the DeepGEMM built into the Docker container.

Accuracy Tests

Benchmarking and Profiling

for ((N=1; N<=128; N*=2)); do
  python3 -m sglang.bench_serving \
    --backend sglang \
    --flush-cache \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 1.0 \
    --num-prompts $((6*N)) \
    --max-concurrency $N \
    --output-file res.jsonl
done

When the M dimension is small, we see an E2E benefit of around 5-8%.
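A sweep like the one above can be compared against a baseline run with a short script. The sketch below assumes each line of the output file is a JSON object with `max_concurrency` and `output_throughput` fields. Those field names are my assumption about the `bench_serving` output schema, and the numbers are made up for illustration, not measurements from this PR.

```python
import json


def throughput_by_concurrency(lines):
    """Map max_concurrency -> output_throughput from JSONL result lines."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Field names are assumptions about the bench_serving output schema.
        results[record["max_concurrency"]] = record["output_throughput"]
    return results


def relative_gain(baseline, candidate):
    """Return percent throughput change per concurrency level present in both."""
    return {
        n: 100.0 * (candidate[n] - baseline[n]) / baseline[n]
        for n in sorted(baseline.keys() & candidate.keys())
    }


# Made-up numbers, for illustration only (not measurements from this PR).
baseline_lines = [
    '{"max_concurrency": 1, "output_throughput": 100.0}',
    '{"max_concurrency": 64, "output_throughput": 2000.0}',
]
candidate_lines = [
    '{"max_concurrency": 1, "output_throughput": 106.0}',
    '{"max_concurrency": 64, "output_throughput": 2000.0}',
]

gains = relative_gain(
    throughput_by_concurrency(baseline_lines),
    throughput_by_concurrency(candidate_lines),
)
for n, pct in gains.items():
    print(f"concurrency={n:>3}  gain={pct:+.1f}%")
```

With real runs, the two line lists would come from opening one result file per backend, for example by giving each sweep its own `--output-file` name.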

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

Fridge003 · 2026-01-04T10:24:45Z

@b8zhong Will the warmup process be handled by flashinfer for this case?
We know that the original deepgemm kernels need a lot of time to warm up...

b8zhong · 2026-01-06T01:03:24Z

@Fridge003 I think it uses the same DeepGEMM compiler under the hood. E.g. during warmup you can see this process and a few similar ones. That said, I don't have the full context, so this may not be entirely correct.

root       41757  0.0  0.0   2808  1112 pts/1    S    00:55   0:00 \
/bin/sh -c \
/usr/local/cuda/bin/nvcc \
  --generate-dependencies-with-compile \
  --dependency-output fp8_blockscale_gemm_90/fp8_blockscale_gemm.cuda.o.d \
  -DPy_LIMITED_API=0x03090000 \
  -D_GLIBCXX_USE_CXX11_ABI=1 \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/cutlass_extensions/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels \
  -isystem /usr/include/python3.12 \
  -isystem /usr/local/cuda/include \
  -isystem /usr/local/cuda/include/cccl \
  -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include \
  --compiler-options=-fPIC \
  --expt-relaxed-constexpr \
  -static-global-template-stub=false \
  -std=c++17 \
  --threads=1 \
  -use_fast_math \
  -DNDEBUG \
  -O3 \
  -gencode=arch=compute_90a,code=sm_90a \
  -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED \
  -DCOMPILE_HOPPER_TMA_GEMMS \
  -DENABLE_BF16 \
  -DENABLE_FP8 \
  -DENABLE_FP8_BLOCK_SCALE \
  -DFLASHINFER_ENABLE_F16 \
  -DFLASHINFER_ENABLE_BF16 \
  -DFLASHINFER_ENABLE_FP8_E4M3 \
  -DFLASHINFER_ENABLE_FP8_E5M2 \
  -DFLASHINFER_ENABLE_FP8_E8M0 \
  -DFLASHINFER_ENABLE_FP4_E2M1 \
  -c /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm.cu \
  -o fp8_blockscale_gemm_90/fp8_blockscale_gemm.cuda.o

For comparison, this is the regular --fp8-gemm-backend=deep_gemm:

root       62105  0.0  0.0   2804  1072 pts/1    S+   01:08   0:00 sh -c -- /usr/local/cuda/bin/nvcc /root/.cache/deep_gemm/cache/kernel.sm90_fp8_gemm_1d2d.6f90a6ff4eb7dbd1eb1a6966f918905c/kernel.cu -o /root/.cache/deep_gemm/tmp/57456-d236c320-2b64367a-55587891 -std=c++20 --diag-suppress=39,161,174,177,186,940 --ptxas-options=--register-usage-level=10 -I/usr/local/lib/python3.12/dist-packages/deep_gemm/include --gpu-architecture=sm_90a --compiler-options=-fPIC,-O3,-fconcepts,-Wno-deprecated-declarations,-Wno-abi -cubin -O3 --expt-relaxed-constexpr --expt-extended-lambda 2>&1
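The two listings differ mainly in where the sources and caches live, which gives a rough way to tell the two JIT pipelines apart while watching warmup. The sketch below classifies an `nvcc` command line by path fragments copied from the listings above. The function name and the matching rules are hypothetical, and other installs may lay out paths differently.

```python
def classify_jit_compile(cmdline):
    """Guess which JIT pipeline launched an nvcc command, based on its paths."""
    if "nvcc" not in cmdline:
        return "not-nvcc"
    # Path fragments taken from the two process listings above.
    if "flashinfer/data/csrc" in cmdline:
        return "flashinfer"
    if ".cache/deep_gemm" in cmdline or "deep_gemm/include" in cmdline:
        return "deep_gemm"
    return "unknown"


# Shortened command lines modelled on the listings above.
flashinfer_cmd = (
    "/usr/local/cuda/bin/nvcc -c /usr/local/lib/python3.12/dist-packages/"
    "flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/"
    "fp8_blockscale_gemm/fp8_blockscale_gemm.cu"
)
deep_gemm_cmd = (
    "/usr/local/cuda/bin/nvcc /root/.cache/deep_gemm/cache/"
    "kernel.sm90_fp8_gemm_1d2d.6f90a6ff4eb7dbd1eb1a6966f918905c/kernel.cu"
)

print(classify_jit_compile(flashinfer_cmd))
print(classify_jit_compile(deep_gemm_cmd))
```

Counting how many compiles of each kind run during startup is one informal way to compare warmup cost between the two backends.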

b8zhong · 2026-01-06T01:10:10Z

@Fridge003 That said, it does seem somewhat faster, actually (maybe 2x faster?)

b8zhong

Added accuracy numbers too~

b8zhong · 2026-01-18T16:48:49Z

/tag-and-rerun-ci

Fridge003 · 2026-01-26T14:21:57Z

Can we add a test for this new fp8 gemm kernel?

b8zhong · 2026-01-28T12:50:46Z

Done @Fridge003

python/sglang/srt/layers/quantization/fp8_utils.py

Fridge003 · 2026-01-31T16:46:29Z

/rerun-failed-ci

…ect#15514) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

b8zhong requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg and ch-wan as code owners December 20, 2025 05:54

Fridge003 mentioned this pull request Dec 21, 2025

Update flashinfer to 0.6.1 #15551

Merged

6 tasks

b8zhong force-pushed the brayden/add-swapab-sm90 branch from efad674 to d735872 Compare January 2, 2026 01:04

b8zhong changed the title ~~Add Flashinfer DeepGEMM SM90 for SwapAB Optimization~~ [Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization Jan 2, 2026

b8zhong force-pushed the brayden/add-swapab-sm90 branch from d735872 to ae254a2 Compare January 5, 2026 16:04

b8zhong force-pushed the brayden/add-swapab-sm90 branch from ae254a2 to d12cb83 Compare January 18, 2026 16:48

github-actions bot added the run-ci label Jan 18, 2026

b8zhong requested a review from Fridge003 January 18, 2026 17:26

b8zhong force-pushed the brayden/add-swapab-sm90 branch from d12cb83 to 87e195c Compare January 22, 2026 23:01

b8zhong force-pushed the brayden/add-swapab-sm90 branch from 87e195c to 0d1dfc4 Compare January 28, 2026 12:40

b8zhong added 2 commits January 28, 2026 07:41

more

6668293

update test_fp8_blockwise_gemm.py

35b6375

b8zhong force-pushed the brayden/add-swapab-sm90 branch from 0d1dfc4 to 35b6375 Compare January 28, 2026 12:41

Fridge003 approved these changes Jan 31, 2026

python/sglang/srt/layers/quantization/fp8_utils.py (outdated)

tiny

7661e5f

b8zhong added run-ci and removed run-ci labels Jan 31, 2026

Fridge003 merged commit 398d13a into sgl-project:main Feb 1, 2026
78 of 87 checks passed

b8zhong deleted the brayden/add-swapab-sm90 branch February 1, 2026 03:44

charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 2, 2026

[Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization (sgl-proj…

4f5c46d

…ect#15514) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026

[Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization (sgl-proj…

6a8b09a

…ect#15514) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization (sgl-proj…

8fad4e0

…ect#15514) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

Fridge003 merged 3 commits into sgl-project:main from bzhng-development:brayden/add-swapab-sm90

2 participants