[SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models #18154

BBuf merged 6 commits into sgl-project:main
Conversation
Also, could you clean the code a bit?
cc @zhaochenyang20 Refactored as requested.
We can have this PR merged first. I also think profiling of the diffusion router could be interesting:
zhaochenyang20 left a comment:
- Refactor the print lines in LLM and Diffusion. I think you can put a helper function in https://github.com/sgl-project/sglang/blob/main/python/sglang/test/test_utils.py
- Debug the bench_offline launch commands for the engine over multiple GPUs.
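The shared print helper suggested above could look something like the following minimal sketch; the function name, signature, and report layout are hypothetical, not the actual `sglang.test.test_utils` API:

```python
def format_benchmark_report(title: str, fields: dict) -> str:
    """Render a benchmark report as the boxed table both benchmarks print.

    A hypothetical shared helper: both the LLM and Diffusion benchmark
    scripts could delegate their result printing here instead of each
    hand-rolling the separator lines.
    """
    width = 78
    lines = [f" {title} ".center(width, "=")]
    for name, value in fields.items():
        lines.append(f"{name}: {value}")
    lines.append("=" * width)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_benchmark_report(
        "Offline Throughput Benchmark Result",
        {"Total Requests": 3, "Frame Throughput (frames/sec)": 0.10},
    ))
```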
Also, could you modify this document: https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/profiling.md
zhaochenyang20 left a comment:
I have a strong suggestion regarding the architecture of our benchmark tools. Instead of maintaining two separate scripts (`bench_offline_throughput.py` and `bench_serving.py`), we should merge them into a single, unified entry point (e.g., `bench_throughput.py`).

Both scenarios share identical logic for argument parsing, dataset loading, and result reporting/metrics calculation. The only distinct logic is the inference backend execution.

- Unified argument parsing: add a `--backend` argument (e.g., `choices=["engine", "server"]`) to switch modes.
- Shared data loading: reuse the `datasets.py` logic for both modes.
- Backend abstraction: if `backend == "engine"`, initialize and launch the GPUWorker; if `backend == "server"`, check the health of the endpoint.
- Execution loop: send requests via the selected backend interface.
- Unified reporting: calculate and print metrics using shared logic to ensure a fair comparison between offline and online performance.

This refactoring would significantly increase code reuse and improve maintainability. What do you think?
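The proposed dispatch could be sketched roughly as follows; the GPUWorker launch and the health check are placeholders, not real sglang calls, and the script name `bench_throughput.py` is only the suggestion above:

```python
import argparse


def parse_args(argv=None):
    # Unified argument parsing shared by the offline and serving modes.
    parser = argparse.ArgumentParser(
        description="Unified diffusion throughput benchmark (sketch)")
    parser.add_argument("--backend", choices=["engine", "server"],
                        default="engine")
    parser.add_argument("--base-url", default="http://localhost:30000")
    return parser.parse_args(argv)


def make_backend(args):
    # Backend abstraction: both branches would return an object exposing the
    # same send_requests()/collect_metrics() interface; strings stand in here.
    if args.backend == "engine":
        return "engine: launch GPUWorker in-process"
    return f"server: health-check {args.base_url}"


if __name__ == "__main__":
    args = parse_args(["--backend", "server"])
    print(make_backend(args))
```

Everything downstream of `make_backend` (dataset loading, the execution loop, and the report) would then be a single shared code path.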
He has already switched to DiffGenerator. Could you please take a look? Thanks. @mickqian
Add default value `eps=1e-5` to `register_fake` implementations of `fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift` custom ops, matching the default in the actual custom_op signatures. Made-with: Cursor
My testing for this is:

```shell
uv pip install -e ".[diffusion]"
# This is for GLM image
pip install --upgrade transformers

python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3
```

```
==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 31.16
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.10
MP Throughput (MP/sec): 0.03
Requests Per Second: 0.10
Latency Per Request (sec): 10.39
Peak Memory (MB): 35610.00
==============================================================================================================
```

With `--enable-torch-compile`:

```shell
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --enable-torch-compile \
    --num-prompts 3
```

```
==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 31.47
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.10
MP Throughput (MP/sec): 0.02
Requests Per Second: 0.10
Latency Per Request (sec): 10.49
Peak Memory (MB): 35634.00
==============================================================================================================
```

With `--skip-warmup` and `--output-file`:

```shell
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3 \
    --skip-warmup \
    --output-file /tmp/bench_result.json
cat /tmp/bench_result.json
```

```
==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 39.99
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.08
MP Throughput (MP/sec): 0.02
Requests Per Second: 0.08
Latency Per Request (sec): 13.33
Peak Memory (MB): 35610.00
==============================================================================================================
```
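The reported metrics follow directly from the raw counts; a quick sketch of the (assumed) formulas, reproducing the numbers from the first report above (3 requests at 512x512, 1 frame each, 31.16 s total):

```python
# Throughput arithmetic behind the first benchmark report (assumed formulas).
num_requests = 3
height, width, frames_per_request = 512, 512, 1
duration_s = 31.16

frames = num_requests * frames_per_request          # 3 frames generated
megapixels = frames * height * width / 1e6          # 0.79 MP generated
frame_throughput = frames / duration_s              # ~0.10 frames/sec
mp_throughput = megapixels / duration_s             # ~0.03 MP/sec
latency_per_request = duration_s / num_requests     # ~10.39 s/request

print(round(megapixels, 2), round(frame_throughput, 2),
      round(mp_throughput, 2), round(latency_per_request, 2))
```

The same arithmetic explains the second run's 0.02 MP/sec: 0.786 MP over 31.47 s rounds down to 0.02.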
Why is the peak memory 0 MiB?
Updated in #18154 (comment)
/tag-and-rerun-ci
…modal models (sgl-project#18154) Co-authored-by: Hao Jin <Hao Jin> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>


Motivation
Address part of step 1 for #18077
Modifications
Accuracy Tests
N/A
Benchmarking and Profiling

Need all diffusion dependencies:

```shell
pip install imageio cache_dit remote-pdb accelerate addict
```

Need to install the source versions of `transformers` and `diffusers`:

```shell
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/diffusers
```

Sample single-GPU (RTX 6000 Pro) run with GLM-Image, the `sglang` backend, and `torch.compile`:

```shell
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend sglang --enable-torch-compile --num-prompts 20 --batch-size 1
```

with resulting report:

GLM-Image with the `diffusers` backend:

```shell
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend diffusers --num-prompts 20 --batch-size 1
```

with resulting report:

Server:

```shell
sglang serve --model-path zai-org/GLM-Image --backend sglang
```

bench_serving:

```shell
python3 -m sglang.multimodal_gen.benchmarks.bench_serving --dataset random --num-prompts 10 --width 512 --height 512 --model zai-org/GLM-Image
```

Checklist
Review Process

`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`