[Performance] Optimize Mllama LayerNorm -> Upd by vincentzed · Pull Request #9725 · sgl-project/sglang

vincentzed · 2025-08-28T00:00:11Z

Motivation

Modifications

This change has little effect on performance and does not affect correctness, but it lets us remove the old function and use the RMSNorm from our own layernorm collection.

Accuracy Tests

Benchmarking and Profiling

On main

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model unsloth/Llama-3.2-11B-Vision-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 1000 \
  --request-rate 100 \
  --max-concurrency 256
benchmark_args=Namespace(backend='sglang', base_url=None, host='127.0.0.1', port=30000, dataset_name='random', dataset_path='', model='unsloth/Llama-3.2-11B-Vision-Instruct', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.5, random_image_num_images=1, random_image_resolution='1080p', request_rate=100.0, max_concurrency=256, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, prompt_suffix='', pd_separated=False, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)
Namespace(backend='sglang', base_url=None, host='127.0.0.1', port=30000, dataset_name='random', dataset_path='', model='unsloth/Llama-3.2-11B-Vision-Instruct', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.5, random_image_num_images=1, random_image_resolution='1080p', request_rate=100.0, max_concurrency=256, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, prompt_suffix='', pd_separated=False, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)

#Input tokens: 766657
#Output tokens: 768060
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [04:00<00:00,  4.16it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    100.0     
Max request concurrency:                 256       
Successful requests:                     1000      
Benchmark duration (s):                  240.14    
Total input tokens:                      766657    
Total generated tokens:                  768060    
Total generated tokens (retokenized):    767513    
Request throughput (req/s):              4.16      
Input token throughput (tok/s):          3192.58   
Output token throughput (tok/s):         3198.42   
Total token throughput (tok/s):          6391.00   
Concurrency:                             236.33    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   56751.07  
Median E2E Latency (ms):                 55566.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          6923.81   
Median TTFT (ms):                        5463.59   
P99 TTFT (ms):                           39203.79  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           64.96     
Median ITL (ms):                         51.57     
P95 ITL (ms):                            134.38    
P99 ITL (ms):                            196.86    
Max ITL (ms):                            10824.08  
==================================================

On new branch

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model unsloth/Llama-3.2-11B-Vision-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 1000 \
  --request-rate 100 \
  --max-concurrency 256
benchmark_args=Namespace(backend='sglang', base_url=None, host='127.0.0.1', port=30000, dataset_name='random', dataset_path='', model='unsloth/Llama-3.2-11B-Vision-Instruct', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.5, random_image_num_images=1, random_image_resolution='1080p', request_rate=100.0, max_concurrency=256, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, prompt_suffix='', pd_separated=False, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)
Namespace(backend='sglang', base_url=None, host='127.0.0.1', port=30000, dataset_name='random', dataset_path='', model='unsloth/Llama-3.2-11B-Vision-Instruct', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.5, random_image_num_images=1, random_image_resolution='1080p', request_rate=100.0, max_concurrency=256, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, prompt_suffix='', pd_separated=False, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)

#Input tokens: 766657
#Output tokens: 768060
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:59<00:00,  4.18it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    100.0     
Max request concurrency:                 256       
Successful requests:                     1000      
Benchmark duration (s):                  239.29    
Total input tokens:                      766657    
Total generated tokens:                  768060    
Total generated tokens (retokenized):    767513    
Request throughput (req/s):              4.18      
Input token throughput (tok/s):          3203.88   
Output token throughput (tok/s):         3209.74   
Total token throughput (tok/s):          6413.62   
Concurrency:                             236.01    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   56473.94  
Median E2E Latency (ms):                 55398.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          6886.42   
Median TTFT (ms):                        5504.80   
P99 TTFT (ms):                           38626.01  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           64.65     
Median ITL (ms):                         51.13     
P95 ITL (ms):                            134.28    
P99 ITL (ms):                            198.65    
Max ITL (ms):                            10808.11  
==================================================
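To make the comparison between the two runs easier to read, the following sketch computes the relative change for a handful of metrics. The numbers are copied from the two result tables above; the dictionary names and the `percent_change` helper are illustrative and not part of `sglang.bench_serving`.

```python
# Figures copied from the "On main" and "On new branch" result tables.
main = {
    "Benchmark duration (s)": 240.14,
    "Total token throughput (tok/s)": 6391.00,
    "Mean E2E Latency (ms)": 56751.07,
    "Mean TTFT (ms)": 6923.81,
    "Median TTFT (ms)": 5463.59,
    "Mean ITL (ms)": 64.96,
    "P99 ITL (ms)": 196.86,
}
branch = {
    "Benchmark duration (s)": 239.29,
    "Total token throughput (tok/s)": 6413.62,
    "Mean E2E Latency (ms)": 56473.94,
    "Mean TTFT (ms)": 6886.42,
    "Median TTFT (ms)": 5504.80,
    "Mean ITL (ms)": 64.65,
    "P99 ITL (ms)": 198.65,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


changes = {name: percent_change(main[name], branch[name]) for name in main}

for name, pct in changes.items():
    print(f"{name:<32} {main[name]:>10.2f} -> {branch[name]:>10.2f}  ({pct:+.2f}%)")
```

Every metric listed here moves by less than 1% between the two runs, and not all in the same direction (median TTFT and P99 ITL are slightly higher on the branch). With a single run per configuration, differences of this size may well be within run-to-run variation.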

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".

gemini-code-assist

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the layer normalization implementation within the Mllama model by replacing a custom RMSNorm class with an existing, shared RMSNorm from the project's collection. The primary goal is code consolidation and improved maintainability, without introducing significant changes to performance or correctness. The changes involve removing the redundant custom class and updating the MllamaTextCrossAttention module to use the standardized RMSNorm, along with necessary adjustments to tensor handling.

Highlights

Removed Custom Layer Normalization Class: The custom MllamaTextRMSNorm class, which provided a specific implementation of RMS normalization, has been removed from the codebase.
Adopted Shared RMSNorm Implementation: The MllamaTextCrossAttention module now utilizes an existing, shared RMSNorm implementation instead of the previously custom-defined MllamaTextRMSNorm for its query (q_norm) and key (k_norm) normalization layers.
Adjusted Tensor Operations for New Norm: The forward pass logic within MllamaTextCrossAttention was updated to correctly handle tensor reshaping before and after applying the new RMSNorm, ensuring compatibility and proper operation.
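The reshaping described in the last highlight can be sketched without any dependencies. This is a plain-Python reference that assumes the norm is applied independently to each head's `head_dim` entries, using the usual RMSNorm formula `x / sqrt(mean(x^2) + eps) * weight`. The function names are hypothetical, and the shared sglang RMSNorm operates on tensors with its own kernels rather than on Python lists.

```python
import math


def rms_norm_row(row, weight, eps=1e-6):
    """RMS-normalize one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(row, weight)]


def norm_per_head(tokens, num_heads, head_dim, weight, eps=1e-6):
    """tokens: list of flat vectors, each of length num_heads * head_dim."""
    # Flatten to rows of length head_dim, like k.reshape(-1, head_dim).
    rows = [
        tok[h * head_dim:(h + 1) * head_dim]
        for tok in tokens
        for h in range(num_heads)
    ]
    # Normalize every row independently over its head_dim entries.
    normed = [rms_norm_row(r, weight, eps) for r in rows]
    # Regroup to (tokens, heads, head_dim), like .reshape(-1, num_heads, head_dim).
    return [normed[t * num_heads:(t + 1) * num_heads] for t in range(len(tokens))]


# Two tokens, two heads, head_dim of 2 (toy values chosen for illustration).
k = [[3.0, 4.0, 0.0, 5.0], [1.0, 1.0, 2.0, -2.0]]
out = norm_per_head(k, num_heads=2, head_dim=2, weight=[1.0, 1.0])
print(out)
```

With unit weights, each normalized head ends up with a mean square of approximately 1, which is a quick sanity check when swapping one RMSNorm implementation for another.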

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a good refactoring that replaces the model-specific MllamaTextRMSNorm with the shared RMSNorm implementation from the common layers. This improves code reuse and maintainability. The necessary adjustments in MllamaTextCrossAttention to handle the generic RMSNorm layer appear correct. The benchmark results confirm that this change has a negligible impact on performance, which is great.

JustinTong0323

Thanks Vincent!

ispobock · 2025-09-02T14:10:45Z

python/sglang/srt/models/mllama.py

+            k = self.k_norm(k.reshape(-1, self.head_dim)).reshape(
+                -1, self.num_local_key_value_heads, self.head_dim
+            )
+        q = q.reshape(-1, self.num_local_heads, self.head_dim)


reshape will cause an additional memory copy and kernel launch (this can be seen in the torch profile). Could you check whether it's necessary?

vincentzed · Sep 3, 2025

I will check.

vincentzed · Sep 4, 2025

I updated it: I changed two of the reshapes to views. The norm ones (both) still need reshape.
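The difference between `reshape` and `view` discussed in this thread can be sketched in isolation. In PyTorch, `view` never copies and raises an error when the requested shape is incompatible with the tensor's strides, while `reshape` returns a view when it can and silently copies otherwise. The shapes and tensors below are toy values for illustration. Which case applies in `mllama.py` depends on how `q` and `k` are produced, which this thread does not show.

```python
import torch

num_heads, head_dim = 4, 8

# A contiguous tensor laid out as (tokens, num_heads * head_dim).
q = torch.randn(3, num_heads * head_dim)

# On a contiguous tensor, view and reshape both alias the same storage: no copy.
q_view = q.view(-1, num_heads, head_dim)
q_reshaped = q.reshape(-1, num_heads, head_dim)
aliases = q_view.data_ptr() == q.data_ptr() == q_reshaped.data_ptr()

# A transposed tensor is not contiguous, so flattening its leading dims
# cannot be expressed with strides alone.
k = torch.randn(3, num_heads, head_dim).transpose(0, 1)
k_flat = k.reshape(-1, head_dim)                 # falls back to a copy
copied = k_flat.data_ptr() != k.data_ptr()

try:
    k.view(-1, head_dim)                         # view refuses to copy
    view_ok = True
except RuntimeError:
    view_ok = False

print(f"contiguous: aliases={aliases}")
print(f"non-contiguous: reshape copied={copied}, view allowed={view_ok}")
```

Using `view` where it is valid makes the no-copy guarantee explicit, because it fails loudly instead of inserting a hidden copy. A `reshape` that remains is only costly if its input is actually non-contiguous.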

vincentzed · 2025-09-20T02:38:51Z

It is ready

gemini-code-assist bot reviewed Aug 28, 2025

vincentzed changed the title ~~Optimize Mllama LayerNorm -> Upd~~ [Performance] Optimize Mllama LayerNorm -> Upd Aug 28, 2025

JustinTong0323 approved these changes Aug 28, 2025

ispobock reviewed Sep 2, 2025

vincentzed force-pushed the mllama-rms branch from 9995662 to 9317eff Compare September 3, 2025 23:41

vincentzed requested a review from ispobock September 4, 2025 00:20

vincentzed added 4 commits September 19, 2025 22:38

upd

07b9d05

upd

e04096a

upd

fd47c41

If we actually need this reshape

9b39a67

vincentzed force-pushed the mllama-rms branch from 9c0c814 to 9b39a67 Compare September 20, 2025 02:38

vincentzed added 2 commits September 27, 2025 13:16

Merge branch 'main' into mllama-rms

8f258e3

Merge branch 'main' into mllama-rms

631e3ba

JustinTong0323 added the run-ci label Oct 4, 2025

JustinTong0323 enabled auto-merge (squash) October 4, 2025 20:53

Merge branch 'main' into mllama-rms

e32c3aa

Kangyan-Zhou disabled auto-merge February 1, 2026 00:02

Kangyan-Zhou merged commit c2ab371 into sgl-project:main Feb 1, 2026
64 of 71 checks passed

charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 2, 2026

[Performance] Optimize Mllama LayerNorm -> Upd (sgl-project#9725)

3b8b736

sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026

[Performance] Optimize Mllama LayerNorm -> Upd (sgl-project#9725)

503f75b

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[Performance] Optimize Mllama LayerNorm -> Upd (sgl-project#9725)

7508efc

[Performance] Optimize Mllama LayerNorm -> Upd #9725
Kangyan-Zhou merged 7 commits into sgl-project:main from bzhng-development:mllama-rms

