[Refactor] Split out deepseek v2 weight loader function into mixin by xyjixyjixyji · Pull Request #16649 · sgl-project/sglang

xyjixyjixyji · 2026-01-07T10:45:05Z

Motivation

The DeepseekV2 code has been developed quickly and has accumulated a lot of historical code that could be better organized, including the weight loading part.

Issue related: #16291

Modifications

This PR only moves the weight loader function into a mixin and adds some documentation.

Further refactors that split up the weight loading internals will come after this PR is merged.

Accuracy Tests

Server start with CI model: python3 -m sglang.launch_server --model lmzheng/sglang-ci-dsv3-test --tp 1 --trust-remote-code --base-gpu-id 5 --port 12345

Client accuracy test:

root@innomatrix-us-adc-ndh100-0004:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --data-path /tmp/test.jsonl --port 12345
100%|███████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:19<00:00, 69.38it/s]
Accuracy: 0.665
Invalid: 0.001
Latency: 19.946 s
Output throughput: 6475.981 token/s

Benchmarking and Profiling

N/A

On H100:

root@innomatrix-us-adc-ndh100-0004:/sgl-workspace/sglang# python3 -m sglang.test.send_one --port 12345
......
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    1.572    |  252   |   1.000    |     160.35      |
+-------------+--------+------------+-----------------+

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-07T10:45:24Z

Summary of Changes

Hello @xyjixyjixyji, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the modularity and organization of the DeepseekV2 model's weight loading mechanism. By moving the intricate weight loading and post-processing logic into a reusable mixin, the main model class becomes cleaner and easier to manage. This change sets the stage for future refactors and ensures a more structured approach to handling model weights, especially concerning various quantization and parallelism configurations.

Highlights

Code Refactoring: The core weight loading logic for Deepseek V2/V3 models has been extracted into a dedicated mixin class, DeepseekV2WeightLoaderMixin, to improve code organization and maintainability.
Modularization: The DeepseekV2WeightLoaderMixin now encapsulates complex weight loading functionalities, including handling tensor/pipeline parallelism, various quantization formats (FP8, INT8, AWQ), MoE expert weights, NextN speculative decoding weights, and shared expert fusion optimizations.
Dependency Management: A new utility file, deepseek_common/utils.py, has been introduced to house common functions like awq_dequantize_func and enable_nextn_moe_bf16_cast_to_fp8, centralizing Deepseek-specific utilities.
Simplified Model Class: The DeepseekV2ForCausalLM class has been significantly simplified by inheriting from the new mixin, reducing its internal complexity and making its load_weights method a direct call to the mixin's implementation.
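
The mixin arrangement described in the highlights can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from this PR: `WeightLoaderMixin`, `ToyCausalLM`, `do_load_weights`, and `post_load_weights` are hypothetical names, and plain Python lists stand in for tensors.

```python
from typing import Dict, Iterable, List, Tuple


class WeightLoaderMixin:
    """Hypothetical stand-in for a weight-loader mixin.

    The mixin owns the loading logic but no state of its own; it expects
    the host class to provide `self.params`.
    """

    params: Dict[str, List[float]]

    def do_load_weights(
        self, weights: Iterable[Tuple[str, List[float]]]
    ) -> List[str]:
        loaded = []
        for name, tensor in weights:
            if name not in self.params:
                continue  # skip checkpoint entries the model does not own
            self.params[name] = list(tensor)
            loaded.append(name)
        self.post_load_weights()
        return loaded

    def post_load_weights(self) -> None:
        """Post-processing hook; a no-op in this sketch."""


class ToyCausalLM(WeightLoaderMixin):
    """Hypothetical model class that inherits the loading logic."""

    def __init__(self) -> None:
        self.params = {"embed.weight": [], "lm_head.weight": []}

    def load_weights(self, weights):
        # The model keeps a thin entry point and delegates to the mixin.
        return self.do_load_weights(weights)


model = ToyCausalLM()
loaded = model.load_weights(
    [
        ("embed.weight", [0.1, 0.2]),
        ("unknown.weight", [9.9]),
        ("lm_head.weight", [0.3]),
    ]
)
print(loaded)
```

The point of the arrangement is that the model class stays small while the loading code can be tested and refactored on its own.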

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request is a good refactoring that moves the Deepseek V2 weight loading logic into a dedicated mixin, DeepseekV2WeightLoaderMixin. This significantly cleans up the DeepseekV2ForCausalLM class and improves code organization by centralizing weight loading logic. The introduction of awq_dequantize_func in a new utility file is also a nice improvement for handling device-specific implementations. However, I've identified a critical bug in the newly added mixin that would cause a runtime error during weight loading due to incorrect argument passing.
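
One way a utility function can hide device-specific implementations is a small lookup keyed by backend name. The sketch below is an assumption for illustration only and does not reproduce `awq_dequantize_func`: the registry, the `get_dequantize_func` helper, and the simplified per-element dequantization formula are all hypothetical (real AWQ dequantization works on packed, grouped weights).

```python
from typing import Callable, Dict, List


def _dequantize_reference(
    qweight: List[int], scale: float, zero: int
) -> List[float]:
    # Simplified affine dequantization: (q - zero) * scale per element.
    return [(q - zero) * scale for q in qweight]


# Hypothetical registry mapping a backend name to its implementation.
_BACKENDS: Dict[str, Callable[[List[int], float, int], List[float]]] = {
    "cpu": _dequantize_reference,
}


def get_dequantize_func(
    device: str,
) -> Callable[[List[int], float, int], List[float]]:
    """Return the implementation registered for `device`, or fail loudly."""
    try:
        return _BACKENDS[device]
    except KeyError:
        raise NotImplementedError(
            f"no dequantize implementation registered for {device!r}"
        ) from None


dequantize = get_dequantize_func("cpu")
result = dequantize([3, 5], 0.5, 1)
print(result)
```

Callers then ask for the function once and stay unaware of which backend supplied it.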

python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py

Qiaolin-Yu · 2026-01-07T23:17:44Z

/tag-and-rerun-ci

Fridge003

We can do the low-level refactors in follow-up PRs.

Qiaolin-Yu · 2026-01-17T09:41:26Z

/tag-and-rerun-ci

Fridge003 · 2026-01-18T16:41:13Z

The failed tests are unrelated to this PR:
https://github.com/sgl-project/sglang/actions/runs/21101509440/job/60716014440?pr=16649#logs

* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256)
* [diffusion] feat: support default 4-step inference for Flux2-Klein distilled models (sgl-project#17225)
  Signed-off-by: Lancer <maruixiang6688@gmail.com>
* Add runner utilization report workflow (sgl-project#17234)
* cli: support sglang version (sgl-project#17250)
* Use swa radix cache and memory pool for gpt-oss model (sgl-project#17261)
* [VLM][Reland] Refactor load_mm_data to improve performance (sgl-project#16152)
  Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
* [Tiny] Improve docs (sgl-project#17264)
* [diffusion] fix: set guidance_scale default to None (sgl-project#17182)
* Tiny fix comment typo (sgl-project#17287)
* [SPEC_V2] Enable cudagraph draft_extend for trtllm_mla_backend and Acclen Fix for DP under cudagraph mode (sgl-project#16974)
* Add kl test for swa radix cache (sgl-project#17281)
* fix: Handle multiple named chat templates in HuggingFace tokenizers (sgl-project#17236)
  Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
* Move radix cache related tests (sgl-project#17295)
* [Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (sgl-project#16534)
  Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
* [Bugfix] Fix PD accuracy when MTP is not configured on the prefill node (sgl-project#17212)
  Co-authored-by: Shangming Cai <csmthu@gmail.com>
* [Diffusion] Apply jit qk_norm to flux1 (sgl-project#17296)
* [Refactor] Split out deepseek v2 weight loader function into mixin (sgl-project#16649)
* [NPU]Support GPT-OSS for NPU (sgl-project#14197)
* [jit-kernel] Add CuTe DSL GDN Decode Kernel (sgl-project#15631)
  Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
* [GLM 4.7] Add RTX 6000 Pro aka sm120 (sgl-project#17235)
  Co-authored-by: root <root@ubuntu-nvidia.localdomain>
* Update CODEOWNERS for multimodal_gen (sgl-project#17308)
  Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
* [Feature] overlap LoRA weight loading with compute (sgl-project#15512)
* [PD] Optimize MHA models pp util calculation logic (sgl-project#17306)
* [Minor] Correct sglang version when installing from source (sgl-project#17315)
* Use dsv3 optimized routing `fused_topk_deepseek` instead of `moe_fused_gate` (sgl-project#15347)
* [DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation (sgl-project#16961)
* Update code sync scripts (sgl-project#17319)
* [Auto Sync] Update tokenizer_manager.py (20260119) (sgl-project#17317)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* support new qwen3_coder_detector (sgl-project#16744)
  Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>
* Fix kernel selection in biased_grouped_topk_gpu (sgl-project#17325)
* KV Cache Events with Attention DP bug fix (sgl-project#16030) (sgl-project#16412)
* [Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)
  Co-authored-by: Minglei Zhu <zminglei@linkedin.com>
* [CI] Add partition to stage-b-test-large-1-gpu (11->12) (sgl-project#17245)
* fix(ci): rate limit and permission errors in trace publishing (sgl-project#17238)
* Revert "[Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)" (sgl-project#17332)
* Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177)
  Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
* Inclusion of nvfp4 blockscale in EPLB Rebalance (sgl-project#17158)
* [Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (sgl-project#17309)
* [Diffusion] Apply qknorm to flux2 and apply lightx2v rms_norm_one_pass kernel(without residual) (sgl-project#17305)
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Fix v32 continue_final_message not work (sgl-project#16567)
* Evict swa kv cache during decoding (sgl-project#17220)
* [RadixTree][1/N Refactor]: Support unified match_prefix params (sgl-project#17142)
  Co-authored-by: yizhang2077 <1109276519@qq.com>
  Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
* [AMD CI] Migrate and Add More Testcases (sgl-project#17116)
  Co-authored-by: yctseng0211 <yctseng@amd.com>
* [AMD] CI - add partitions for stage-b-test-small-1-gpu-amd (sgl-project#17345)
* Restore deepseek_v2.py to main's code, except the utils
* Ran `pre-commit`

---------

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Hudson Xing <1277646412@qq.com>
Co-authored-by: Lancer <402430575@qq.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
Co-authored-by: Changyi Yang <112288487+ChangyiYang@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Ch3ngY1 <91232537+Ch3ngY1@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Jerry Ji <jerryjilol@gmail.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Koushik Dutta <koush@koushikdutta.com>
Co-authored-by: root <root@ubuntu-nvidia.localdomain>
Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaoji Liu <34803073+attack204@users.noreply.github.com>
Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Kartik Ramesh <kartikx2000@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Minglei Zhu <zminglei@linkedin.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: zhangheng <hzh0425@apache.org>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>

xyjixyjixyji requested review from Fridge003, ch-wan, fzyzcjy, ispobock, merrymercy and zhyncs as code owners January 7, 2026 10:45

github-actions bot added the deepseek label Jan 7, 2026

gemini-code-assist bot reviewed Jan 7, 2026

github-actions bot added the run-ci label Jan 7, 2026

xyjixyjixyji force-pushed the refactor_weight_loader branch from 5821631 to 763e129 Compare January 8, 2026 06:32

Fridge003 approved these changes Jan 8, 2026

xyjixyjixyji added 8 commits January 17, 2026 07:37

move model loader away

1269487

update

16a2b6f

make it a mixin

7293e86

add doc

58d46f2

format

8669d55

update docstring

3b0a078

format

71bd451

fix ci

aa428c1

xyjixyjixyji force-pushed the refactor_weight_loader branch from 763e129 to aa428c1 Compare January 17, 2026 07:42

xyjixyjixyji added 2 commits January 17, 2026 09:20

/

c71e04e

Revert "/"

f10a594

This reverts commit c71e04e.

Qiaolin-Yu and others added 2 commits January 17, 2026 01:50

Merge branch 'main' into refactor_weight_loader

3c57b6e

Merge branch 'main' into refactor_weight_loader

01cf49c

Fridge003 approved these changes Jan 18, 2026

Fridge003 merged commit 9343372 into sgl-project:main Jan 18, 2026
360 of 387 checks passed

Fridge003 merged 12 commits into sgl-project:main from xyjixyjixyji:refactor_weight_loader