[Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` by b8zhong · Pull Request #16534 · sgl-project/sglang

b8zhong · 2026-01-06T03:28:12Z

Motivation

After #13773, it seems useful to also add a --fp4-gemm-backend argument, to avoid env vars like SGLANG_FLASHINFER_FP4_GEMM_BACKEND. Currently, its usefulness for reducing complexity is limited, because many backends are already specified through the FlashInfer mm_fp4 backend argument, but it should become more useful in the future.

Modifications

Refactor the code (similar to the FP8 one). Also, add the cuDNN backend to the list, which has been an available choice for a while.
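For illustration, a minimal sketch of how such a flag could be declared with `argparse`. The choice list and the `build_parser` helper are assumptions made for this example (the PR only confirms that cuDNN is one of the options and that `auto` is a valid value); the actual definition in SGLang's server arguments may differ.

```python
import argparse

# Hypothetical choice list for illustration only. The PR confirms "auto" and
# cuDNN; the other names are assumed to mirror FlashInfer's mm_fp4 backends.
FP4_GEMM_BACKEND_CHOICES = ["auto", "cudnn", "cutlass", "trtllm"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a --fp4-gemm-backend option (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fp4-gemm-backend",
        type=str,
        default="auto",
        choices=FP4_GEMM_BACKEND_CHOICES,
        help="Backend used for FP4 GEMM operations.",
    )
    return parser


if __name__ == "__main__":
    # argparse maps the dashed flag name to the attribute fp4_gemm_backend.
    args = build_parser().parse_args(["--fp4-gemm-backend", "cudnn"])
    print(args.fp4_gemm_backend)
```

Restricting the value with `choices` means an unsupported backend name is rejected at startup rather than when the first GEMM runs.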

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

gemini-code-assist · 2026-01-06T03:28:15Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

b8zhong · 2026-01-06T03:29:26Z

/tag-and-rerun-ci

Fridge003 · 2026-01-12T16:20:54Z

python/sglang/srt/layers/quantization/fp4_utils.py

+    # Handle deprecated env var for backward compatibility
+    # TODO: Remove this in a future version
+    if backend == "auto":
+        env_backend = envs.SGLANG_FLASHINFER_FP4_GEMM_BACKEND.get()


Deduplicate lines 48-53 here

Consolidated 👍
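The fallback discussed in this thread can be sketched roughly as follows. This is a standalone approximation, not the code from fp4_utils.py: the function name `resolve_fp4_gemm_backend`, the `environ` parameter, and the use of `DeprecationWarning` are assumptions for the example, and it reads the variable with `os.environ` instead of SGLang's `envs` helper.

```python
import os
import warnings

# Name of the deprecated environment variable, as given in the PR title.
ENV_VAR = "SGLANG_FLASHINFER_FP4_GEMM_BACKEND"


def resolve_fp4_gemm_backend(cli_backend="auto", environ=None):
    """Pick the FP4 GEMM backend, honoring the deprecated env var (sketch).

    An explicit CLI value always wins. The env var is consulted only when
    the CLI value is still "auto", mirroring the diff hunk above.
    """
    env = os.environ if environ is None else environ
    if cli_backend != "auto":
        return cli_backend
    env_backend = env.get(ENV_VAR)
    if env_backend:
        warnings.warn(
            f"{ENV_VAR} is deprecated; use --fp4-gemm-backend instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env_backend
    return "auto"


if __name__ == "__main__":
    # The env var only takes effect when the flag is left at "auto".
    print(resolve_fp4_gemm_backend("auto", {ENV_VAR: "cutlass"}))
    print(resolve_fp4_gemm_backend("cudnn", {ENV_VAR: "cutlass"}))
```

Keeping the fallback in a single function is one way to avoid the duplicated branches mentioned in the review comment.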

Fridge003 · 2026-01-12T16:36:45Z

Can we add an fp4-gemm B200 test?

b8zhong · 2026-01-13T01:15:08Z

@Fridge003 Do you want a UT or an E2E test? Since the logic of the different FP4 backends is mostly in scale preparation or something similar, I suppose it can just be a short UT.

Fridge003 · 2026-01-13T14:31:42Z

> @Fridge003 Do you want a UT or an E2E test? Since the logic of the different FP4 backends is mostly in scale preparation or something similar, I suppose it can just be a short UT.

We need an E2E test to make sure the kernels are integrated correctly.
Something similar to test_llama_fp4.py (or we can change this file to an fp4 gemm test).

b8zhong · 2026-01-14T17:47:13Z

The auto argument to mm_fp4 was only added in FlashInfer 0.6.0, so we'd better hold off until the upgrade.

Fridge003 · 2026-01-18T05:25:29Z

/tag-and-rerun-ci

Fridge003 · 2026-01-18T15:17:09Z

fp4 gemm tests passed
https://github.com/sgl-project/sglang/actions/runs/21097600663/job/60716682822?pr=16534


b8zhong requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, Ying1123, ch-wan, hnyls2002, merrymercy and xiezhq-hermann as code owners January 6, 2026 03:28

github-actions bot added the quant (LLM Quantization) and blackwell (SM100/SM120) labels Jan 6, 2026

github-actions bot added the run-ci label Jan 6, 2026

Fridge003 reviewed Jan 12, 2026


b8zhong force-pushed the brayden/refactor-fp4-backend branch from 9e5d738 to fa454ff Compare January 14, 2026 17:36

b8zhong removed the run-ci label Jan 14, 2026

This was referenced Jan 16, 2026

Update flashinfer to 0.6.1 #15551

Merged

[depend on flashinfer 0.6.0) change mm_fp4 to auto for cu13 #16232

Closed

b8zhong added 3 commits January 17, 2026 08:44

more

c803d8d

more

603f3f1

upd

c8f07e3

b8zhong force-pushed the brayden/refactor-fp4-backend branch from 6073da0 to c8f07e3 Compare January 17, 2026 16:44

Fridge003 approved these changes Jan 18, 2026


github-actions bot added the run-ci label Jan 18, 2026

Fridge003 merged commit 4df74eb into sgl-project:main Jan 18, 2026
399 of 451 checks passed

b8zhong deleted the brayden/refactor-fp4-backend branch January 18, 2026 16:49

Fridge003 merged 3 commits into sgl-project:main from bzhng-development:brayden/refactor-fp4-backend
