Commit 69a07f3

mmangkad authored and Johnsonms committed
[Doc] Fix outdated --fp4-gemm-backend documentation (sgl-project#18350)
1 parent 0209e7c commit 69a07f3

3 files changed: +5 -9 lines changed

docs/advanced_features/server_arguments.md

Lines changed: 1 addition & 1 deletion
@@ -267,7 +267,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter` |
 | `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
 | `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `cutlass`, `triton`, `aiter` |
-| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'auto' (default, auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, optimal on CUDA 12), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `auto` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
+| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
 | `--disable-flashinfer-autotune` | Flashinfer autotune is enabled by default. Set this flag to disable the autotune. | `False` | bool flag (set to enable) |
 
 ## Speculative decoding

python/sglang/srt/layers/quantization/modelopt_quant.py

Lines changed: 1 addition & 5 deletions
@@ -133,11 +133,7 @@ def fp4_gemm(
     fp4_backend = get_fp4_gemm_runner_backend()
     if enable_flashinfer_fp4_gemm:
         # Use the remapping logic to convert SGLang backend names to FlashInfer API names
-        backend = (
-            fp4_backend.get_flashinfer_backend()
-            if not fp4_backend.is_auto()
-            else "cutlass"
-        )
+        backend = fp4_backend.get_flashinfer_backend()
         return flashinfer_fp4_gemm(
             input, weight, input_sf, weight_sf, alpha, out_dtype, backend=backend
         )
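
The hunk above drops the special case that forced `auto` to `"cutlass"` and always defers to `get_flashinfer_backend()`. The sketch below compares the two selection paths in isolation. `Fp4Backend` is a hypothetical stand-in for SGLang's real backend type, and its prefix-stripping mapping is an assumption, since the body of `get_flashinfer_backend()` does not appear in this diff.

```python
from enum import Enum


class Fp4Backend(Enum):
    """Hypothetical stand-in for SGLang's FP4 GEMM backend type."""

    AUTO = "auto"
    FLASHINFER_CUDNN = "flashinfer_cudnn"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"

    def is_auto(self) -> bool:
        return self is Fp4Backend.AUTO

    def get_flashinfer_backend(self) -> str:
        # Assumed mapping: drop the "flashinfer_" prefix; "auto" passes through.
        # Requires Python 3.9+ for str.removeprefix.
        return self.value.removeprefix("flashinfer_")


def select_before(backend: Fp4Backend) -> str:
    # Behavior of the removed lines: "auto" was forced to "cutlass".
    if backend.is_auto():
        return "cutlass"
    return backend.get_flashinfer_backend()


def select_after(backend: Fp4Backend) -> str:
    # Behavior of the added line: always defer to the remapping method.
    return backend.get_flashinfer_backend()
```

Under these assumptions, the two functions differ only for `Fp4Backend.AUTO`. Every explicitly named backend maps to the same FlashInfer name before and after the change.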

python/sglang/srt/server_args.py

Lines changed: 3 additions & 3 deletions
@@ -448,7 +448,7 @@ class ServerArgs:
     grammar_backend: Optional[str] = None
     mm_attention_backend: Optional[str] = None
     fp8_gemm_runner_backend: str = "auto"
-    fp4_gemm_runner_backend: str = "auto"
+    fp4_gemm_runner_backend: str = "flashinfer_cutlass"
     nsa_prefill_backend: Optional[str] = (
         None  # None = auto-detect based on hardware/kv_cache_dtype
     )
@@ -3789,9 +3789,9 @@ def add_cli_args(parser: argparse.ArgumentParser):
             default=ServerArgs.fp4_gemm_runner_backend,
             dest="fp4_gemm_runner_backend",
             help="Choose the runner backend for NVFP4 GEMM operations. "
-            "Options: 'auto' (default, selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), "
+            "Options: 'flashinfer_cutlass' (default), "
+            "'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), "
             "'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), "
-            "'flashinfer_cutlass' (FlashInfer CUTLASS backend, optimal on CUDA 12), "
             "'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). "
             "NOTE: This replaces the deprecated environment variable "
             "SGLANG_FLASHINFER_FP4_GEMM_BACKEND.",

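Because the CLI flag takes its default from the dataclass field, changing the field is enough to change what `--fp4-gemm-backend` resolves to when omitted. The sketch below reproduces that wiring in miniature. `MiniServerArgs` and `build_parser` are hypothetical names used only for illustration; the choices list is copied from the documentation table in this commit, and passing it as `choices=` is an assumption about how the real parser validates the flag.

```python
import argparse
from dataclasses import dataclass

# Allowed values as listed in the updated documentation table.
FP4_GEMM_BACKEND_CHOICES = [
    "auto",
    "flashinfer_cudnn",
    "flashinfer_cutlass",
    "flashinfer_trtllm",
]


@dataclass
class MiniServerArgs:
    """Hypothetical, trimmed-down stand-in for ServerArgs."""

    fp4_gemm_runner_backend: str = "flashinfer_cutlass"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fp4-gemm-backend",
        type=str,
        choices=FP4_GEMM_BACKEND_CHOICES,
        # The default is read from the dataclass field, as in the diff.
        default=MiniServerArgs.fp4_gemm_runner_backend,
        dest="fp4_gemm_runner_backend",
        help="Choose the runner backend for NVFP4 GEMM operations.",
    )
    return parser


# Omitting the flag yields the new default; passing it overrides the default.
default_args = build_parser().parse_args([])
auto_args = build_parser().parse_args(["--fp4-gemm-backend", "auto"])
```

With this wiring, users who relied on the previous `auto` default would need to pass `--fp4-gemm-backend auto` explicitly to keep that behavior.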