Register tensors with symmetric memory for qwen #18643

Merged
ispobock merged 2 commits into sgl-project:main from nvcastet:register_symm_mem_qwen
Feb 20, 2026

Conversation

@nvcastet
Collaborator

Motivation

Perform the add op in place so that the MoE output still belongs to the symmetric-memory pool (when symmetric memory is enabled).
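A minimal sketch of the idea (hypothetical tensors, not the actual sglang change): an in-place `add_` writes the sum into the existing storage, so a buffer that was allocated from the symmetric-memory pool keeps its registration, whereas an out-of-place add would allocate a fresh tensor outside the pool.

```python
import torch

# Sketch only: `moe_output` stands in for a buffer registered with the
# symmetric memory pool; the real registration happens inside sglang.
moe_output = torch.zeros(4, 8)
shared_output = torch.ones(4, 8)

ptr_before = moe_output.data_ptr()
moe_output.add_(shared_output)          # in place: result stays in the same storage
assert moe_output.data_ptr() == ptr_before

# By contrast, `moe_output + shared_output` would allocate a new tensor,
# and the result would no longer belong to the symmetric memory pool.
```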

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@hlu1 hlu1 requested a review from Qiaolin-Yu February 17, 2026 21:45
@hlu1
Collaborator

hlu1 commented Feb 17, 2026

Please add perf benchmark numbers.

@nvcastet
Collaborator Author

Symmetric Memory Benchmark: Qwen3-Next-80B-A3B-Thinking-FP8 on 4x GB200

TP4 Configuration

Server (symm-mem enabled):

```
python -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --tp 4 --max-running-requests 1024 --chunked-prefill-size 8192 --mem-fraction-static 0.8 --disable-radix-cache --mamba-ssm-dtype float32 --fp8-gemm-backend flashinfer_trtllm --attention-backend trtllm_mha --cuda-graph-max-bs=1024 --enable-symm-mem
```

Server (symm-mem disabled): same as above without --enable-symm-mem

Client:

```
python3 -m sglang.bench_one_batch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --dataset-name random --input-len 1024 --output-len 1024 --batch-size 1024 512 128 --base-url http://127.0.0.1:30000
```

| Batch Size | Metric | Symm-Mem Disabled | Symm-Mem Enabled | Gain |
|-----------:|--------|------------------:|-----------------:|-----:|
| 1024 | Prefill (tok/s) | 58,472.54 | 63,989.35 | +9.4% |
| 1024 | Decode (tok/s) | 31,369.96 | 32,934.25 | +5.0% |
| 512 | Prefill (tok/s) | 62,012.44 | 63,895.43 | +3.0% |
| 512 | Decode (tok/s) | 24,305.91 | 24,859.08 | +2.3% |
| 128 | Prefill (tok/s) | 61,414.28 | 62,912.60 | +2.4% |
| 128 | Decode (tok/s) | 11,316.08 | 11,375.96 | +0.5% |

DEP4 Configuration

Server (symm-mem enabled):

```
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --disable-radix-cache --attention-backend trtllm_mha --tp-size 4 --ep 4 --cuda-graph-max-bs 512 --enable-dp-attention --dp 4 --stream-interval 10 --mem-fraction-static 0.9 --max-running-requests 2048 --enable-dp-lm-head --mamba-ssm-dtype float32 --fp8-gemm-backend flashinfer_trtllm --chunked-prefill-size 8192 --enable-symm-mem
```

Server (symm-mem disabled): same as above without --enable-symm-mem

Client:

```
python3 -m sglang.bench_one_batch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --dataset-name random --input-len 1024 --output-len 1024 --batch-size 2048 --base-url http://127.0.0.1:30000
```

| Batch Size | Metric | Symm-Mem Disabled | Symm-Mem Enabled | Gain |
|-----------:|--------|------------------:|-----------------:|-----:|
| 2048 | Prefill (tok/s) | 43,284.91 | 60,495.37 | +39.8% |
| 2048 | Decode (tok/s) | 11,903.58 | 12,384.14 | +4.0% |
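As a sanity check on the Gain columns above, the gain is the relative throughput improvement, (enabled / disabled − 1) × 100; the DEP4 numbers reproduce the reported +39.8% and +4.0%:

```python
# Verify the Gain column: gain (%) = (enabled / disabled - 1) * 100.
def gain_pct(disabled: float, enabled: float) -> float:
    return (enabled / disabled - 1.0) * 100.0

# DEP4 numbers from the table above
prefill_gain = round(gain_pct(43284.91, 60495.37), 1)  # -> 39.8
decode_gain = round(gain_pct(11903.58, 12384.14), 1)   # -> 4.0
```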

@nvcastet nvcastet force-pushed the register_symm_mem_qwen branch from d55edfb to 5352aa1 Compare February 18, 2026 21:47
@hlu1
Collaborator

hlu1 commented Feb 18, 2026

/tag-and-rerun-ci

@hlu1 hlu1 requested a review from ispobock February 18, 2026 22:52
@ispobock ispobock merged commit 99df920 into sgl-project:main Feb 20, 2026
143 of 161 checks passed