
[SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models#18154

Merged
BBuf merged 6 commits into sgl-project:main from haojin2:offline_bench on Mar 4, 2026

Conversation

@haojin2
Contributor

@haojin2 haojin2 commented Feb 3, 2026

Motivation

Address part of step 1 for #18077

Modifications

  • Added `bench_offline_throughput.py` under `multimodal_gen`, similar to its LLM counterpart

Accuracy Tests

N/A

Benchmarking and Profiling

  • Requires all diffusion dependencies: `pip install imageio cache_dit remote-pdb accelerate addict`

  • Requires source installs of `transformers` and `diffusers`:

    • `pip install git+https://github.com/huggingface/transformers`

    • `pip install git+https://github.com/huggingface/diffusers`

  • Sample single-GPU (RTX 6000 Pro) run with `GLM-Image` + `sglang` backend + `torch.compile`: `python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend sglang --enable-torch-compile --num-prompts 20 --batch-size 1`, with the resulting report:

==================================== Offline Throughput Benchmark Result =====================================
Model:                                        zai-org/GLM-Image             
Dataset:                                      random                        
Resolution:                                   512x512x1                     
Num Inference Steps:                          20                            
---------------------------------------------------------------------------
Total Requests:                               20                            
Successful Requests:                          20                            
Failed Requests:                              0                             
Total Duration (seconds):                     233.38                        
---------------------------------------------------------------------------
Frames Generated:                             20                            
Megapixels Generated:                         5.24                          
---------------------------------------------------------------------------
Frame Throughput (frames/sec):                0.0857                        
MP Throughput (MP/sec):                       0.0225                        
Requests Per Second:                          0.0857                        
Latency Per Request (sec):                    11.6688                       
Peak Memory (MB):                             0                             
==============================================================================================================
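For reference, the throughput figures in these reports follow directly from the raw counts. A minimal sketch of the arithmetic (the helper name is illustrative, not the script's actual API):

```python
def throughput_metrics(num_frames, width, height, duration_s):
    """Derive report-style throughput figures from raw counts."""
    megapixels = num_frames * width * height / 1e6
    return {
        "frames_per_sec": num_frames / duration_s,
        "megapixels": megapixels,
        "mp_per_sec": megapixels / duration_s,
        "latency_per_request_s": duration_s / num_frames,
    }

# Numbers from the sglang-backend run above: 20 frames at 512x512 in 233.38 s
m = throughput_metrics(20, 512, 512, 233.38)
# frames/sec ~ 0.0857, megapixels ~ 5.24, MP/sec ~ 0.0225
```

Note that "Requests Per Second" equals "Frame Throughput" here because each request produces exactly one frame at `--batch-size 1`.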
  • Sample single-GPU (RTX 6000 Pro) run with `GLM-Image` + `diffusers` backend: `python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend diffusers --num-prompts 20 --batch-size 1`, with the resulting report:
==================================== Offline Throughput Benchmark Result =====================================
Model:                                        zai-org/GLM-Image             
Dataset:                                      random                        
Resolution:                                   512x512x1                     
Num Inference Steps:                          20                            
---------------------------------------------------------------------------
Total Requests:                               20                            
Successful Requests:                          20                            
Failed Requests:                              0                             
Total Duration (seconds):                     246.26                        
---------------------------------------------------------------------------
Frames Generated:                             20                            
Megapixels Generated:                         5.24                          
---------------------------------------------------------------------------
Frame Throughput (frames/sec):                0.0812                        
MP Throughput (MP/sec):                       0.0213                        
Requests Per Second:                          0.0812                        
Latency Per Request (sec):                    12.3132                       
Peak Memory (MB):                             0                             
==============================================================================================================
  • Verification of the refactored `bench_serving.py` (on RTX 6000 Pro) with `GLM-Image`:

Server: `sglang serve --model-path zai-org/GLM-Image --backend sglang`
bench_serving: `python3 -m sglang.multimodal_gen.benchmarks.bench_serving --dataset random --num-prompts 10 --width 512 --height 512 --model zai-org/GLM-Image`

================= Serving Benchmark Result =================
Task:                                    text-to-image  
Model:                                   zai-org/GLM-Image
Dataset:                                 random         
--------------------------------------------------
Benchmark duration (s):                  131.30         
Request rate:                            inf            
Max request concurrency:                 1              
Successful requests:                     10/10             
--------------------------------------------------
Request throughput (req/s):              0.08           
Latency Mean (s):                        13.1293        
Latency Median (s):                      12.9035        
Latency P99 (s):                         14.9457        
--------------------------------------------------
Peak Memory Max (MB):                    35387.64       
Peak Memory Mean (MB):                   35387.45       
Peak Memory Median (MB):                 35387.64       
============================================================
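The latency statistics in the serving report can be reproduced from per-request latencies. A minimal sketch, assuming a nearest-rank P99 convention (the actual script may interpolate differently; the helper name and the sample latencies are illustrative, not the run's real data):

```python
import statistics

def latency_summary(latencies_s):
    """Mean, median, and nearest-rank P99 over per-request latencies."""
    ordered = sorted(latencies_s)
    rank = max(1, round(0.99 * len(ordered)))  # nearest-rank percentile
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }

# Hypothetical per-request latencies, for illustration only
summary = latency_summary([12.9, 13.1, 12.8, 14.9, 13.0])
```

With only 10 requests, P99 is effectively the slowest request, which is why it sits well above the median in the report.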
  • TODO: verify runnability on all currently-supported models under multimodal_gen

Checklist

  • [x] Format your code according to the Format code with pre-commit guide.
  • [x] Add unit tests according to the Run and add unit tests guide.
  • [ ] Update documentation according to Write documentations.
  • [x] Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed guides.
  • [x] Follow the SGLang code style guidance.

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@github-actions github-actions bot added the diffusion SGLang Diffusion label Feb 3, 2026
@haojin2 haojin2 force-pushed the offline_bench branch 2 times, most recently from 466c89f to b81f932 Compare February 3, 2026 06:25
@mickqian
Collaborator

mickqian commented Feb 3, 2026

Also, could you clean up the code a bit?

@haojin2
Contributor Author

haojin2 commented Feb 5, 2026

cc @zhaochenyang20: refactored as requested.
Also tested the new bench_serving script.

@zhaochenyang20
Collaborator

Could you also update your PR description? I think it's two days out of date.

@zhaochenyang20
Collaborator

We can merge this PR first. I also think profiling the diffusion router could be interesting:

radixark/miles#544 (comment)

Collaborator

@zhaochenyang20 zhaochenyang20 left a comment


  1. Refactor the duplicated report-printing lines in LLM and Diffusion. I think you can put a helper function in https://github.com/sgl-project/sglang/blob/main/python/sglang/test/test_utils.py

  2. Debug the bench_offline launch commands for running the engine over multiple GPUs.
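A shared report-printing helper along the lines of point 1 might look like this. This is only a sketch: the function name and its placement in `test_utils.py` are illustrative, not an existing SGLang API.

```python
def format_benchmark_report(title, rows, width=110):
    """Render a banner-framed report like the ones shown in this PR."""
    lines = [f" {title} ".center(width, "=")]
    # Left-align labels in a fixed-width column, matching the report layout
    lines += [f"{label + ':':<46}{value}" for label, value in rows]
    lines.append("=" * width)
    return "\n".join(lines)

report = format_benchmark_report(
    "Offline Throughput Benchmark Result",
    [("Model", "zai-org/GLM-Image"), ("Total Requests", 20)],
)
print(report)
```

Both the LLM and diffusion benchmark scripts could then call the same helper with their own rows.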


@zhaochenyang20 zhaochenyang20 left a comment


I have a strong suggestion regarding the architecture of our benchmark tools. Instead of maintaining two separate scripts (bench_offline_throughput.py and bench_serving.py), we should merge them into a single, unified entry point (e.g., bench_throughput.py).

Both scenarios share identical logic for Argument Parsing, Dataset Loading, and Result Reporting/Metrics Calculation. The only distinct logic is the inference backend execution.

  1. Unified Argument Parsing: add a --backend argument (e.g., choices=["engine", "server"]) to switch modes.

  2. Shared Data Loading: reuse the datasets.py logic for both modes.

  3. Backend Abstraction:
    • If backend == "engine": initialize and launch the GPUWorker.
    • If backend == "server": check the health of the endpoint.

  4. Execution Loop: send requests via the selected backend interface.

  5. Unified Reporting: calculate and print metrics using shared logic to ensure a fair comparison between offline and online performance.

This refactoring would maximize code reuse and improve maintainability. What do you think?
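The proposal above can be sketched as a single entry point with a backend switch. All class and function names here are illustrative placeholders, not the actual SGLang APIs; the backends are stubbed out where the real code would launch the GPU worker or contact the server.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Unified throughput benchmark (sketch)")
    parser.add_argument("--backend", choices=["engine", "server"], default="engine")
    parser.add_argument("--num-prompts", type=int, default=20)
    return parser

class EngineBackend:
    def run(self, prompts):
        # Real code would initialize and launch the in-process GPU worker
        return [f"generated:{p}" for p in prompts]

class ServerBackend:
    def run(self, prompts):
        # Real code would health-check the endpoint and POST each request
        return [f"served:{p}" for p in prompts]

def select_backend(name):
    return EngineBackend() if name == "engine" else ServerBackend()

# Shared data loading, execution loop, and reporting would wrap this core
args = build_parser().parse_args(["--backend", "engine", "--num-prompts", "2"])
outputs = select_backend(args.backend).run([f"prompt-{i}" for i in range(args.num_prompts)])
```

Because argument parsing, dataset loading, and reporting sit outside the backend classes, both modes would exercise identical measurement code paths.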

@haojin2 haojin2 requested a review from ping1jing2 as a code owner February 20, 2026 05:10
@haojin2 haojin2 force-pushed the offline_bench branch 4 times, most recently from a797cb0 to e752743 Compare February 20, 2026 06:02
@yhyang201
Collaborator

He has already switched to DiffGenerator. Could you please take a look? Thanks. @mickqian

Add default value `eps=1e-5` to `register_fake` implementations of
`fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift`
custom ops, matching the default in the actual custom_op signatures.

Made-with: Cursor
@zhaochenyang20
Collaborator

zhaochenyang20 commented Mar 3, 2026

My testing for this is:

uv pip install -e ".[diffusion]"  
# This is for GLM image
pip install --upgrade transformers
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3
==================================== Offline Throughput Benchmark Result =====================================
Model:                                        zai-org/GLM-Image             
Dataset:                                      random                        
Resolution:                                   512x512x1                     
Num Inference Steps:                          3                             
---------------------------------------------------------------------------
Total Requests:                               3                             
Successful Requests:                          3                             
Failed Requests:                              0                             
Total Duration (seconds):                     31.16                         
---------------------------------------------------------------------------
Frames Generated:                             3                             
Megapixels Generated:                         0.79                          
---------------------------------------------------------------------------
Frame Throughput (frames/sec):                0.10                          
MP Throughput (MP/sec):                       0.03                          
Requests Per Second:                          0.10                          
Latency Per Request (sec):                    10.39                         
Peak Memory (MB):                             35610.00                      
==============================================================================================================
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --enable-torch-compile \
    --num-prompts 3
==================================== Offline Throughput Benchmark Result =====================================
Model:                                        zai-org/GLM-Image             
Dataset:                                      random                        
Resolution:                                   512x512x1                     
Num Inference Steps:                          3                             
---------------------------------------------------------------------------
Total Requests:                               3                             
Successful Requests:                          3                             
Failed Requests:                              0                             
Total Duration (seconds):                     31.47                         
---------------------------------------------------------------------------
Frames Generated:                             3                             
Megapixels Generated:                         0.79                          
---------------------------------------------------------------------------
Frame Throughput (frames/sec):                0.10                          
MP Throughput (MP/sec):                       0.02                          
Requests Per Second:                          0.10                          
Latency Per Request (sec):                    10.49                         
Peak Memory (MB):                             35634.00                      
==============================================================================================================
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3 \
    --skip-warmup \
    --output-file /tmp/bench_result.json

cat /tmp/bench_result.json
==================================== Offline Throughput Benchmark Result =====================================
Model:                                        zai-org/GLM-Image             
Dataset:                                      random                        
Resolution:                                   512x512x1                     
Num Inference Steps:                          3                             
---------------------------------------------------------------------------
Total Requests:                               3                             
Successful Requests:                          3                             
Failed Requests:                              0                             
Total Duration (seconds):                     39.99                         
---------------------------------------------------------------------------
Frames Generated:                             3                             
Megapixels Generated:                         0.79                          
---------------------------------------------------------------------------
Frame Throughput (frames/sec):                0.08                          
MP Throughput (MP/sec):                       0.02                          
Requests Per Second:                          0.08                          
Latency Per Request (sec):                    13.33                         
Peak Memory (MB):                             35610.00                      
==============================================================================================================

@BBuf
Collaborator

BBuf commented Mar 3, 2026

Why is peak memory reported as 0 MiB?


@BBuf
Collaborator

BBuf commented Mar 3, 2026


updated in #18154 (comment)

Collaborator

@BBuf BBuf left a comment


LGTM.

@BBuf
Collaborator

BBuf commented Mar 3, 2026

/tag-and-rerun-ci

@BBuf BBuf merged commit a69b943 into sgl-project:main Mar 4, 2026
48 checks passed
Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
…modal models (sgl-project#18154)

Co-authored-by: Hao Jin <Hao Jin>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
qeternity pushed a commit to qeternity/sglang that referenced this pull request Mar 6, 2026
…modal models (sgl-project#18154)

Co-authored-by: Hao Jin <Hao Jin>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

Labels

diffusion SGLang Diffusion run-ci
