
add weightless qk norm to RMSNorm interface for Llama 4 #12813

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from bzhng-development:add-weightless-qk-norm on Jan 30, 2026

Conversation

@b8zhong (Collaborator) commented Nov 7, 2025

Llama 4 uses a weightless qk_norm, so without this change we get these false warnings:

[2025-11-07 04:20:16 TP7] Some weights are not initialized from checkpoints {'language_model.model.layers.16.self_attn.qk_norm.weight', 'language_model.model.layers.32.self_attn.qk_norm.weight', 'language_model.model.layers.2.self_attn.qk_norm.weight', 'language_model.model.layers.10.self_attn.qk_norm.weight', 'language_model.model.layers.5.self_attn.qk_norm.weight', 'language_model.model.layers.36.self_attn.qk_norm.weight', 'language_model.model.layers.14.self_attn.qk_norm.weight', 'language_model.model.layers.17.self_attn.qk_norm.weight', 'language_model.model.layers.26.self_attn.qk_norm.weight', 'language_model.model.layers.42.self_attn.qk_norm.weight', 'language_model.model.layers.21.self_attn.qk_norm.weight', 'language_model.model.layers.25.self_attn.qk_norm.weight', 'language_model.model.layers.46.self_attn.qk_norm.weight', 'language_model.model.layers.6.self_attn.qk_norm.weight', 'language_model.model.layers.34.self_attn.qk_norm.weight', 'language_model.model.layers.22.self_attn.qk_norm.weight', 'language_model.model.layers.1.self_attn.qk_norm.weight', 'language_model.model.layers.24.self_attn.qk_norm.weight', 'language_model.model.layers.20.self_attn.qk_norm.weight', 'language_model.model.layers.41.self_attn.qk_norm.weight', 'language_model.model.layers.33.self_attn.qk_norm.weight', 'language_model.model.layers.28.self_attn.qk_norm.weight', 'language_model.model.layers.38.self_attn.qk_norm.weight', 'language_model.model.layers.4.self_attn.qk_norm.weight', 'language_model.model.layers.37.self_attn.qk_norm.weight', 'language_model.model.layers.40.self_attn.qk_norm.weight', 'language_model.model.layers.45.self_attn.qk_norm.weight', 'language_model.model.layers.29.self_attn.qk_norm.weight', 'language_model.model.layers.0.self_attn.qk_norm.weight', 'language_model.model.layers.13.self_attn.qk_norm.weight', 'language_model.model.layers.18.self_attn.qk_norm.weight', 'language_model.model.layers.8.self_attn.qk_norm.weight', 
'language_model.model.layers.44.self_attn.qk_norm.weight', 'language_model.model.layers.9.self_attn.qk_norm.weight', 'language_model.model.layers.12.self_attn.qk_norm.weight', 'language_model.model.layers.30.self_attn.qk_norm.weight'}
python3 -m sglang.launch_server \
  --model-path=/opt/dlami/nvme/models/Llama-4-Scout-17B-16E-Instruct/ \
  --tp=8 \
  --trust-remote-code \
  --mem-fraction-static=0.7 \
  --context-length=131072 \
  --kv-cache-dtype=fp8_e4m3 \
  --attention-backend=fa3 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

After:

Multi-thread loading shards:  68% Completed | 34/50 [01:59<00:55,  3.45s/it]
Multi-thread loading shards:  70% Completed | 35/50 [02:03<00:52,  3.52s/it]
Multi-thread loading shards:  72% Completed | 36/50 [02:06<00:48,  3.47s/it]
Multi-thread loading shards:  74% Completed | 37/50 [02:10<00:45,  3.48s/it]
Multi-thread loading shards:  76% Completed | 38/50 [02:13<00:41,  3.44s/it]
Multi-thread loading shards:  78% Completed | 39/50 [02:17<00:39,  3.61s/it]
Multi-thread loading shards:  80% Completed | 40/50 [02:21<00:36,  3.61s/it]
Multi-thread loading shards:  82% Completed | 41/50 [02:24<00:32,  3.58s/it]
Multi-thread loading shards:  84% Completed | 42/50 [02:28<00:28,  3.55s/it]
Multi-thread loading shards:  88% Completed | 44/50 [02:31<00:16,  2.68s/it]
Multi-thread loading shards:  90% Completed | 45/50 [02:36<00:15,  3.13s/it]
Multi-thread loading shards:  92% Completed | 46/50 [02:39<00:12,  3.21s/it]
Multi-thread loading shards:  94% Completed | 47/50 [02:43<00:10,  3.49s/it]
[2025-11-07 04:36:50 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2025-11-07 04:36:50 TP3] Setting sliding_window_size to be attention_chunk_size: 8192
[2025-11-07 04:36:50 TP3] Load weight end. type=Llama4ForConditionalGeneration, dtype=torch.bfloat16, avail mem=107.74 GB, mem usage=30.27 GB.
Multi-thread loading shards:  96% Completed | 48/50 [02:47<00:07,  3.55s/it]
Multi-thread loading shards:  98% Completed | 49/50 [02:51<00:03,  3.78s/it]
Multi-thread loading shards: 100% Completed | 50/50 [02:51<00:00,  3.44s/it]
root@ip-10-40-0-228:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 500
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:25<00:00, 52.57it/s]
Accuracy: 0.926
Invalid: 0.000
Latency: 25.235 s
Output throughput: 5361.274 token/s

@gemini-code-assist (Contributor) commented:

Summary of Changes

This pull request enhances the RMSNorm layer to support configurations where the normalization weight is not a learnable parameter. This change is crucial for correctly handling Llama 4 models, which employ 'weightless' query-key normalization layers. By preventing the system from attempting to load non-existent learnable weights, it eliminates spurious initialization warnings and ensures seamless model loading and operation for Llama 4 architectures.

Highlights

  • RMSNorm Flexibility: The RMSNorm class now includes a has_weight parameter in its constructor, allowing it to be initialized without a learnable nn.Parameter for its weight. If has_weight is False, the weight is initialized as a static torch.Tensor of ones.
  • Llama 4 qk_norm Integration: The Llama4Attention module has been updated to utilize this new RMSNorm functionality, specifically initializing its qk_norm (query-key normalization) with has_weight=False.
  • Warning Resolution: This modification effectively resolves false warnings about uninitialized weights that previously occurred when loading Llama 4 models that feature weightless qk_norm layers, as demonstrated by the provided 'before' and 'after' logs.
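The change described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the actual sglang implementation: when has_weight is False, the scale is registered as a non-persistent buffer of ones rather than an nn.Parameter, so the weight loader sees no parameter to fill (silencing the warning) while the tensor still tracks the module's device and dtype.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """RMSNorm with an optional learnable weight (sketch, assuming this interface)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6, has_weight: bool = True):
        super().__init__()
        self.eps = eps
        if has_weight:
            # Learnable scale, populated from the checkpoint.
            self.weight = nn.Parameter(torch.ones(hidden_size))
        else:
            # No checkpoint entry exists; a constant buffer of ones still
            # follows the module across .to(device)/.to(dtype) calls.
            self.register_buffer("weight", torch.ones(hidden_size), persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return x * self.weight
```

With has_weight=False the module exposes no parameters at all, which is exactly why the "Some weights are not initialized from checkpoints" warning no longer fires.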
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a review comment

Code Review

This pull request adds support for weightless RMSNorm, which is necessary for Llama 4's qk_norm. The approach of adding a has_weight flag is correct. However, the implementation for the weightless case has a critical issue with device placement. When has_weight=False, the weight tensor is not registered as a parameter or buffer, so it won't be moved to the correct device with the rest of the model, leading to runtime errors. I've provided a suggestion to fix this by registering it as a buffer.
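The device-placement issue the reviewer raises can be demonstrated in isolation. The class names below (Plain, Buffered) are hypothetical; the point is that a tensor stored as a plain attribute is invisible to nn.Module's apply machinery, while a registered buffer is converted along with the module (shown here with a dtype cast, which exercises the same code path as a device move).

```python
import torch
from torch import nn


class Plain(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain attribute: NOT tracked by .to()/.cuda()/.half().
        self.weight = torch.ones(4)


class Buffered(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: moved/converted together with the module.
        self.register_buffer("weight", torch.ones(4))


p = Plain().to(torch.float16)
b = Buffered().to(torch.float16)
print(p.weight.dtype)  # stays torch.float32
print(b.weight.dtype)  # becomes torch.float16
```

This is why the reviewer's suggestion to use register_buffer (rather than a bare attribute) avoids runtime device-mismatch errors when the model is moved to GPU.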

@b8zhong b8zhong force-pushed the add-weightless-qk-norm branch from 374a362 to 0b08823 Compare January 6, 2026 19:29
@b8zhong b8zhong enabled auto-merge (squash) January 6, 2026 19:29
@Kangyan-Zhou Kangyan-Zhou disabled auto-merge January 30, 2026 03:09
@Kangyan-Zhou Kangyan-Zhou merged commit 22df62d into sgl-project:main Jan 30, 2026
68 of 102 checks passed
@b8zhong b8zhong deleted the add-weightless-qk-norm branch January 30, 2026 03:38
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

3 participants