
add weightless qk norm to RMSNorm interface for Llama 4 #12813

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from bzhng-development:add-weightless-qk-norm on Jan 30, 2026

Conversation

@b8zhong (Collaborator) commented Nov 7, 2025

Llama 4 uses a weightless qk_norm, so without this change we get these false warnings:

[2025-11-07 04:20:16 TP7] Some weights are not initialized from checkpoints {'language_model.model.layers.16.self_attn.qk_norm.weight', 'language_model.model.layers.32.self_attn.qk_norm.weight', 'language_model.model.layers.2.self_attn.qk_norm.weight', 'language_model.model.layers.10.self_attn.qk_norm.weight', 'language_model.model.layers.5.self_attn.qk_norm.weight', 'language_model.model.layers.36.self_attn.qk_norm.weight', 'language_model.model.layers.14.self_attn.qk_norm.weight', 'language_model.model.layers.17.self_attn.qk_norm.weight', 'language_model.model.layers.26.self_attn.qk_norm.weight', 'language_model.model.layers.42.self_attn.qk_norm.weight', 'language_model.model.layers.21.self_attn.qk_norm.weight', 'language_model.model.layers.25.self_attn.qk_norm.weight', 'language_model.model.layers.46.self_attn.qk_norm.weight', 'language_model.model.layers.6.self_attn.qk_norm.weight', 'language_model.model.layers.34.self_attn.qk_norm.weight', 'language_model.model.layers.22.self_attn.qk_norm.weight', 'language_model.model.layers.1.self_attn.qk_norm.weight', 'language_model.model.layers.24.self_attn.qk_norm.weight', 'language_model.model.layers.20.self_attn.qk_norm.weight', 'language_model.model.layers.41.self_attn.qk_norm.weight', 'language_model.model.layers.33.self_attn.qk_norm.weight', 'language_model.model.layers.28.self_attn.qk_norm.weight', 'language_model.model.layers.38.self_attn.qk_norm.weight', 'language_model.model.layers.4.self_attn.qk_norm.weight', 'language_model.model.layers.37.self_attn.qk_norm.weight', 'language_model.model.layers.40.self_attn.qk_norm.weight', 'language_model.model.layers.45.self_attn.qk_norm.weight', 'language_model.model.layers.29.self_attn.qk_norm.weight', 'language_model.model.layers.0.self_attn.qk_norm.weight', 'language_model.model.layers.13.self_attn.qk_norm.weight', 'language_model.model.layers.18.self_attn.qk_norm.weight', 'language_model.model.layers.8.self_attn.qk_norm.weight', 
'language_model.model.layers.44.self_attn.qk_norm.weight', 'language_model.model.layers.9.self_attn.qk_norm.weight', 'language_model.model.layers.12.self_attn.qk_norm.weight', 'language_model.model.layers.30.self_attn.qk_norm.weight'}
python3 -m sglang.launch_server \
  --model-path=/opt/dlami/nvme/models/Llama-4-Scout-17B-16E-Instruct/ \
  --tp=8 \
  --trust-remote-code \
  --mem-fraction-static=0.7 \
  --context-length=131072 \
  --kv-cache-dtype=fp8_e4m3 \
  --attention-backend=fa3 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

After:

Multi-thread loading shards:  68% Completed | 34/50 [01:59<00:55,  3.45s/it]
Multi-thread loading shards:  70% Completed | 35/50 [02:03<00:52,  3.52s/it]
Multi-thread loading shards:  72% Completed | 36/50 [02:06<00:48,  3.47s/it]
Multi-thread loading shards:  74% Completed | 37/50 [02:10<00:45,  3.48s/it]
Multi-thread loading shards:  76% Completed | 38/50 [02:13<00:41,  3.44s/it]
Multi-thread loading shards:  78% Completed | 39/50 [02:17<00:39,  3.61s/it]
Multi-thread loading shards:  80% Completed | 40/50 [02:21<00:36,  3.61s/it]
Multi-thread loading shards:  82% Completed | 41/50 [02:24<00:32,  3.58s/it]
Multi-thread loading shards:  84% Completed | 42/50 [02:28<00:28,  3.55s/it]
Multi-thread loading shards:  88% Completed | 44/50 [02:31<00:16,  2.68s/it]
Multi-thread loading shards:  90% Completed | 45/50 [02:36<00:15,  3.13s/it]
Multi-thread loading shards:  92% Completed | 46/50 [02:39<00:12,  3.21s/it]
Multi-thread loading shards:  94% Completed | 47/50 [02:43<00:10,  3.49s/it]
[2025-11-07 04:36:50 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2025-11-07 04:36:50 TP3] Setting sliding_window_size to be attention_chunk_size: 8192
[2025-11-07 04:36:50 TP3] Load weight end. type=Llama4ForConditionalGeneration, dtype=torch.bfloat16, avail mem=107.74 GB, mem usage=30.27 GB.
Multi-thread loading shards:  96% Completed | 48/50 [02:47<00:07,  3.55s/it]
Multi-thread loading shards:  98% Completed | 49/50 [02:51<00:03,  3.78s/it]
Multi-thread loading shards: 100% Completed | 50/50 [02:51<00:00,  3.44s/it]
root@ip-10-40-0-228:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 500
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:25<00:00, 52.57it/s]
Accuracy: 0.926
Invalid: 0.000
Latency: 25.235 s
Output throughput: 5361.274 token/s

@gemini-code-assist (Contributor) commented:

Summary of Changes

This pull request enhances the RMSNorm layer to support configurations where the normalization weight is not a learnable parameter. This change is crucial for correctly handling Llama 4 models, which employ 'weightless' query-key normalization layers. By preventing the system from attempting to load non-existent learnable weights, it eliminates spurious initialization warnings and ensures seamless model loading and operation for Llama 4 architectures.

Highlights

  • RMSNorm Flexibility: The RMSNorm class now includes a has_weight parameter in its constructor, allowing it to be initialized without a learnable nn.Parameter for its weight. If has_weight is False, the weight is initialized as a static torch.Tensor of ones.
  • Llama 4 qk_norm Integration: The Llama4Attention module has been updated to utilize this new RMSNorm functionality, specifically initializing its qk_norm (query-key normalization) with has_weight=False.
  • Warning Resolution: This modification effectively resolves false warnings about uninitialized weights that previously occurred when loading Llama 4 models that feature weightless qk_norm layers, as demonstrated by the provided 'before' and 'after' logs.
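The change described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the actual sglang implementation: when has_weight is False, the scale is registered as a non-persistent buffer of ones rather than an nn.Parameter, so the weight loader sees no parameter to fill (silencing the warning) while the tensor still tracks the module's device and dtype.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """RMSNorm with an optional learnable weight (sketch, assuming this interface)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6, has_weight: bool = True):
        super().__init__()
        self.eps = eps
        if has_weight:
            # Learnable scale, populated from the checkpoint.
            self.weight = nn.Parameter(torch.ones(hidden_size))
        else:
            # No checkpoint entry exists; a constant buffer of ones still
            # follows the module across .to(device)/.to(dtype) calls.
            self.register_buffer("weight", torch.ones(hidden_size), persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return x * self.weight
```

With has_weight=False the module exposes no parameters at all, which is exactly why the "Some weights are not initialized from checkpoints" warning no longer fires.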
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a review comment

Code Review

This pull request adds support for weightless RMSNorm, which is necessary for Llama 4's qk_norm. The approach of adding a has_weight flag is correct. However, the implementation for the weightless case has a critical issue with device placement. When has_weight=False, the weight tensor is not registered as a parameter or buffer, so it won't be moved to the correct device with the rest of the model, leading to runtime errors. I've provided a suggestion to fix this by registering it as a buffer.
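The device-placement issue the reviewer raises can be demonstrated in isolation. The class names below (Plain, Buffered) are hypothetical; the point is that a tensor stored as a plain attribute is invisible to nn.Module's apply machinery, while a registered buffer is converted along with the module (shown here with a dtype cast, which exercises the same code path as a device move).

```python
import torch
from torch import nn


class Plain(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain attribute: NOT tracked by .to()/.cuda()/.half().
        self.weight = torch.ones(4)


class Buffered(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: moved/converted together with the module.
        self.register_buffer("weight", torch.ones(4))


p = Plain().to(torch.float16)
b = Buffered().to(torch.float16)
print(p.weight.dtype)  # stays torch.float32
print(b.weight.dtype)  # becomes torch.float16
```

This is why the reviewer's suggestion to use register_buffer (rather than a bare attribute) avoids runtime device-mismatch errors when the model is moved to GPU.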

@b8zhong b8zhong force-pushed the add-weightless-qk-norm branch from 374a362 to 0b08823 Compare January 6, 2026 19:29
@b8zhong b8zhong enabled auto-merge (squash) January 6, 2026 19:29
@Kangyan-Zhou Kangyan-Zhou disabled auto-merge January 30, 2026 03:09
@Kangyan-Zhou Kangyan-Zhou merged commit 22df62d into sgl-project:main Jan 30, 2026
68 of 102 checks passed
@b8zhong b8zhong deleted the add-weightless-qk-norm branch January 30, 2026 03:38
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…12813)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

3 participants