Register tensors with symmetric memory for qwen #18643

Merged
ispobock merged 2 commits into sgl-project:main from nvcastet:register_symm_mem_qwen
Feb 20, 2026

Conversation

@nvcastet
Collaborator

Motivation

Perform the add op in place so that the MoE output still belongs to the symmetric-memory pool (when symmetric memory is enabled).
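A minimal sketch of the idea (hypothetical tensors, not the actual sglang change): an in-place `add_` writes the sum into the existing storage, so a buffer that was allocated from the symmetric-memory pool keeps its registration, whereas an out-of-place add would allocate a fresh tensor outside the pool.

```python
import torch

# Sketch only: `moe_output` stands in for a buffer registered with the
# symmetric memory pool; the real registration happens inside sglang.
moe_output = torch.zeros(4, 8)
shared_output = torch.ones(4, 8)

ptr_before = moe_output.data_ptr()
moe_output.add_(shared_output)          # in place: result stays in the same storage
assert moe_output.data_ptr() == ptr_before

# By contrast, `moe_output + shared_output` would allocate a new tensor,
# and the result would no longer belong to the symmetric memory pool.
```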

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@hlu1 hlu1 requested a review from Qiaolin-Yu February 17, 2026 21:45
@hlu1
Collaborator

hlu1 commented Feb 17, 2026

Please add perf benchmark numbers.

@nvcastet
Collaborator Author

Symmetric Memory Benchmark: Qwen3-Next-80B-A3B-Thinking-FP8 on 4x GB200

TP4 Configuration

Server (symm-mem enabled):

```
python -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --tp 4 --max-running-requests 1024 --chunked-prefill-size 8192 --mem-fraction-static 0.8 --disable-radix-cache --mamba-ssm-dtype float32 --fp8-gemm-backend flashinfer_trtllm --attention-backend trtllm_mha --cuda-graph-max-bs=1024 --enable-symm-mem
```

Server (symm-mem disabled): same as above without --enable-symm-mem

Client:

```
python3 -m sglang.bench_one_batch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --dataset-name random --input-len 1024 --output-len 1024 --batch-size 1024 512 128 --base-url http://127.0.0.1:30000
```

| Batch Size | Metric | Symm-Mem Disabled | Symm-Mem Enabled | Gain |
|-----------:|--------|------------------:|-----------------:|-----:|
| 1024 | Prefill (tok/s) | 58,472.54 | 63,989.35 | +9.4% |
| 1024 | Decode (tok/s) | 31,369.96 | 32,934.25 | +5.0% |
| 512 | Prefill (tok/s) | 62,012.44 | 63,895.43 | +3.0% |
| 512 | Decode (tok/s) | 24,305.91 | 24,859.08 | +2.3% |
| 128 | Prefill (tok/s) | 61,414.28 | 62,912.60 | +2.4% |
| 128 | Decode (tok/s) | 11,316.08 | 11,375.96 | +0.5% |

DEP4 Configuration

Server (symm-mem enabled):

```
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --disable-radix-cache --attention-backend trtllm_mha --tp-size 4 --ep 4 --cuda-graph-max-bs 512 --enable-dp-attention --dp 4 --stream-interval 10 --mem-fraction-static 0.9 --max-running-requests 2048 --enable-dp-lm-head --mamba-ssm-dtype float32 --fp8-gemm-backend flashinfer_trtllm --chunked-prefill-size 8192 --enable-symm-mem
```

Server (symm-mem disabled): same as above without --enable-symm-mem

Client:

```
python3 -m sglang.bench_one_batch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 --dataset-name random --input-len 1024 --output-len 1024 --batch-size 2048 --base-url http://127.0.0.1:30000
```

| Batch Size | Metric | Symm-Mem Disabled | Symm-Mem Enabled | Gain |
|-----------:|--------|------------------:|-----------------:|-----:|
| 2048 | Prefill (tok/s) | 43,284.91 | 60,495.37 | +39.8% |
| 2048 | Decode (tok/s) | 11,903.58 | 12,384.14 | +4.0% |
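As a sanity check on the Gain columns above, the gain is the relative throughput improvement, (enabled / disabled − 1) × 100; the DEP4 numbers reproduce the reported +39.8% and +4.0%:

```python
# Verify the Gain column: gain (%) = (enabled / disabled - 1) * 100.
def gain_pct(disabled: float, enabled: float) -> float:
    return (enabled / disabled - 1.0) * 100.0

# DEP4 numbers from the table above
prefill_gain = round(gain_pct(43284.91, 60495.37), 1)  # -> 39.8
decode_gain = round(gain_pct(11903.58, 12384.14), 1)   # -> 4.0
```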

@nvcastet nvcastet force-pushed the register_symm_mem_qwen branch from d55edfb to 5352aa1 Compare February 18, 2026 21:47
@hlu1
Collaborator

hlu1 commented Feb 18, 2026

/tag-and-rerun-ci

@hlu1 hlu1 requested a review from ispobock February 18, 2026 22:52
@ispobock ispobock merged commit 99df920 into sgl-project:main Feb 20, 2026
143 of 161 checks passed