
[Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization #15514

Merged
Fridge003 merged 3 commits into sgl-project:main from
bzhng-development:brayden/add-swapab-sm90
Feb 1, 2026

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented Dec 20, 2025

Motivation

After flashinfer-ai/flashinfer#2131 in Flashinfer, we can benefit from SwapAB, which swaps the order of the two GEMM inputs. This helps when the M dimension is < 32 (e.g. when BS < 32 in decoding). When M is larger, there is no benefit.
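
To illustrate the idea only (this is not the Flashinfer kernel), here is a minimal pure-Python sketch of the identity SwapAB relies on: `A @ B^T` equals `(B @ A^T)^T`, so a small M can be moved into the other operand's position and the result transposed back. The helper names and the `SWAP_AB_M_THRESHOLD` constant are hypothetical; the threshold of 32 is simply taken from the description above.

```python
import random

# Assumption: threshold taken from the PR description (benefit when M < 32).
SWAP_AB_M_THRESHOLD = 32


def matmul_nt(a, b):
    """Return a @ b^T for row-major nested lists: a is (M, K), b is (N, K)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def transpose(m):
    """Return the transpose of a row-major nested list."""
    return [list(col) for col in zip(*m)]


def gemm_with_optional_swap(a, b):
    """Compute a @ b^T, swapping the operands when M is small.

    Returns (result, swapped) so the caller can see which path was taken.
    """
    m = len(a)
    if m < SWAP_AB_M_THRESHOLD:
        # Swapped problem: (N, K) x (M, K)^T gives (N, M); transpose back to (M, N).
        return transpose(matmul_nt(b, a)), True
    return matmul_nt(a, b), False


if __name__ == "__main__":
    rng = random.Random(0)
    k, n = 8, 16
    weights = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(n)]
    for m in (4, 64):  # a decode-sized batch and a larger one
        acts = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(m)]
        out, swapped = gemm_with_optional_swap(acts, weights)
        # Both paths must agree with the plain product.
        assert out == matmul_nt(acts, weights)
        print(f"M={m}: swapped={swapped}")
```

Integer inputs are used so the two paths can be compared exactly. In a real FP8 kernel the swap changes which operand maps to which tile dimension rather than adding a separate transpose pass, and that detail is outside what this sketch shows.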

Modifications

(Requires Flashinfer nightly, and the backend currently only supports SM90.)
Note that Flashinfer compiles its own DeepGEMM, so it is separate from the DeepGEMM built in the Docker container.

Accuracy Tests

Benchmarking and Profiling

for ((N=1; N<=128; N*=2)); do
  python3 -m sglang.bench_serving \
    --backend sglang \
    --flush-cache \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 1.0 \
    --num-prompts $((6*N)) \
    --max-concurrency $N \
    --output-file res.jsonl
done

We can see that when the M dimension is small, there is an E2E benefit of around 5-8%.

[Images: two benchmark result screenshots]

Checklist


@Fridge003 Fridge003 mentioned this pull request Dec 21, 2025
6 tasks
@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from efad674 to d735872 Compare January 2, 2026 01:04
@b8zhong b8zhong changed the title Add Flashinfer DeepGEMM SM90 for SwapAB Optimization [Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization Jan 2, 2026
@Fridge003
Collaborator

@b8zhong Will the warmup process be handled by flashinfer for this case?
We know that original deepgemm kernels need a lot of time for warm up...

@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from d735872 to ae254a2 Compare January 5, 2026 16:04
@b8zhong
Collaborator Author

b8zhong commented Jan 6, 2026

@Fridge003 I think it uses the same DeepGEMM compiler under the hood. For example, during warmup you can see this process and a few similar ones. I don't have the full context, though, so this may not be entirely correct.

root       41757  0.0  0.0   2808  1112 pts/1    S    00:55   0:00 \
/bin/sh -c \
/usr/local/cuda/bin/nvcc \
  --generate-dependencies-with-compile \
  --dependency-output fp8_blockscale_gemm_90/fp8_blockscale_gemm.cuda.o.d \
  -DPy_LIMITED_API=0x03090000 \
  -D_GLIBCXX_USE_CXX11_ABI=1 \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/cutlass_extensions/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include \
  -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels \
  -isystem /usr/include/python3.12 \
  -isystem /usr/local/cuda/include \
  -isystem /usr/local/cuda/include/cccl \
  -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include \
  -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include \
  --compiler-options=-fPIC \
  --expt-relaxed-constexpr \
  -static-global-template-stub=false \
  -std=c++17 \
  --threads=1 \
  -use_fast_math \
  -DNDEBUG \
  -O3 \
  -gencode=arch=compute_90a,code=sm_90a \
  -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED \
  -DCOMPILE_HOPPER_TMA_GEMMS \
  -DENABLE_BF16 \
  -DENABLE_FP8 \
  -DENABLE_FP8_BLOCK_SCALE \
  -DFLASHINFER_ENABLE_F16 \
  -DFLASHINFER_ENABLE_BF16 \
  -DFLASHINFER_ENABLE_FP8_E4M3 \
  -DFLASHINFER_ENABLE_FP8_E5M2 \
  -DFLASHINFER_ENABLE_FP8_E8M0 \
  -DFLASHINFER_ENABLE_FP4_E2M1 \
  -c /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm.cu \
  -o fp8_blockscale_gemm_90/fp8_blockscale_gemm.cuda.o

For comparison, this is the process for the regular --fp8-gemm-backend=deep_gemm:

root       62105  0.0  0.0   2804  1072 pts/1    S+   01:08   0:00 sh -c -- /usr/local/cuda/bin/nvcc /root/.cache/deep_gemm/cache/kernel.sm90_fp8_gemm_1d2d.6f90a6ff4eb7dbd1eb1a6966f918905c/kernel.cu -o /root/.cache/deep_gemm/tmp/57456-d236c320-2b64367a-55587891 -std=c++20 --diag-suppress=39,161,174,177,186,940 --ptxas-options=--register-usage-level=10 -I/usr/local/lib/python3.12/dist-packages/deep_gemm/include --gpu-architecture=sm_90a --compiler-options=-fPIC,-O3,-fconcepts,-Wno-deprecated-declarations,-Wno-abi -cubin -O3 --expt-relaxed-constexpr --expt-extended-lambda 2>&1

@b8zhong
Collaborator Author

b8zhong commented Jan 6, 2026

@Fridge003 That said, it does actually seem somewhat faster (maybe 2x?).

Collaborator Author

@b8zhong b8zhong left a comment

Added accuracy numbers too~

@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from ae254a2 to d12cb83 Compare January 18, 2026 16:48
@b8zhong
Collaborator Author

b8zhong commented Jan 18, 2026

/tag-and-rerun-ci

@b8zhong b8zhong requested a review from Fridge003 January 18, 2026 17:26
@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from d12cb83 to 87e195c Compare January 22, 2026 23:01
@Fridge003
Collaborator

Can we add a test for this new FP8 GEMM kernel?

@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from 87e195c to 0d1dfc4 Compare January 28, 2026 12:40
@b8zhong b8zhong force-pushed the brayden/add-swapab-sm90 branch from 0d1dfc4 to 35b6375 Compare January 28, 2026 12:41
@b8zhong
Collaborator Author

b8zhong commented Jan 28, 2026

Done @Fridge003

@Fridge003
Collaborator

/rerun-failed-ci

@b8zhong b8zhong added run-ci and removed run-ci labels Jan 31, 2026
@Fridge003 Fridge003 merged commit 398d13a into sgl-project:main Feb 1, 2026
78 of 87 checks passed
@b8zhong b8zhong deleted the brayden/add-swapab-sm90 branch February 1, 2026 03:44
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 2, 2026
…ect#15514)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…ect#15514)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…ect#15514)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>