[ModelOPT] Support Qwen 3 Next Coder NVFP4#18224
Conversation
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}
Is it possible to do this outside of the model?
You do it in Qwen3-Next and Qwen3 (#18189). What about other models? Is this framework-specific or model-specific?
It's because qkv_proj and o_proj are NVFP4 in these two recipes.
Sorry, I don't get your point. Can we do it outside of the model? I believe it should be in the quantization part, python/sglang/srt/layers/quantization/.
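To illustrate the reviewer's suggestion of keeping this logic in the quantization layer rather than in per-model code, here is a hedged sketch. The class and method names (`ModelOptFp4Config`, `is_layer_excluded`) are assumptions for illustration and may not match SGLang's actual quantization config API:

```python
# Hypothetical sketch: NVFP4 include/exclude decisions kept in the
# quantization layer instead of each model file. Names are illustrative.
import fnmatch

class ModelOptFp4Config:
    def __init__(self, exclude_modules: list[str]):
        # Glob-style patterns for modules that stay in high precision.
        self.exclude_modules = exclude_modules

    def is_layer_excluded(self, prefix: str) -> bool:
        """Return True if the module at `prefix` should skip NVFP4."""
        return any(fnmatch.fnmatch(prefix, pat) for pat in self.exclude_modules)

cfg = ModelOptFp4Config(exclude_modules=["*lm_head*", "*mlp.gate"])
cfg.is_layer_excluded("model.layers.0.self_attn.qkv_proj")  # False -> quantize to NVFP4
cfg.is_layer_excluded("lm_head")                            # True  -> keep unquantized
```

With this shape, model files would only ask the quant config whether a layer is excluded, and the recipe-specific lists would live under python/sglang/srt/layers/quantization/.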
/tag-and-rerun-ci
ssshinigami left a comment
It doesn't look correct to change the models for this fix. This is a quantization-specific concern and should live in the quantization part.
/rerun-failed-ci
Hi @ssshinigami, we'll try to clean up the code soon; unfortunately there is still a lot of related quantization code in the weight-loading path.
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
* www/pr/ks: (265 commits)
  - [BugFix][PD] Fix metadata_buffer_index leak when aborted in PD (sgl-project#17483)
  - Refactoring Mooncake TE as a shared distributed component (sgl-project#17810)
  - [ModelOPT] Support Qwen 3 Next Coder NVFP4 (sgl-project#18224)
  - Update author information in pyproject.toml (sgl-project#18453)
  - [Kimi-K2.5] Fix missing `quant_config` in `KimiK25` (sgl-project#18440)
  - Add tensor parallelism support to LFM2 ShortConv layers (sgl-project#17777)
  - [diffusion] chore: revise process title (sgl-project#18446)
  - Fix TRT-LLM MLA backend applying k_scale to BF16 KV cache in BMM1 (sgl-project#18396)
  - [diffusion] refactor: group component loaders under the component_loaders/ directory (sgl-project#18438)
  - [ModelOpt] Fix broken Qwen3-235B-A22B-Instruct-2507-NVFP4 launch (sgl-project#18189)
  - [diffusion] feat: support efficient sequence shard (sgl-project#18161)
  - [CI] fix: notebook ci may not working (sgl-project#18417)
  - fix: sync server_args.kv_cache_dtype when detecting FP8 KV cache (sgl-project#18394)
  - [Fix] Fix backend selection after flashinfer version update (sgl-project#18364)
  - [diffusion] platform: support WAN/FLUX/Qwen-Image/Qwen-Image-edit on Ascend (sgl-project#13662)
  - fix: fix NVFP4 Kimi-K2.5 weight mapping and exclude list (sgl-project#18370)
  - [diffusion] feat: support saving videos directly on the server to avoid the overhead of tensor transfer (sgl-project#18253)
  - [diffusion] fix: respect dist_timeout option (sgl-project#18386)
  - [Doc] Fix outdated `--fp4-gemm-backend` documentation (sgl-project#18350)
  - [diffusion] fix: remove unnecessary norm_type argument from GLM-Image dits (sgl-project#18382)
  - ...
Motivation
This branch includes an important bugfix for Qwen 3 Coder Next NVFP4.
B300
sglang serve --model vincentzed-hf/Qwen3-Coder-Next-NVFP4 --quantization modelopt_fp4

We provide the command to reproduce the same checkpoint with ModelOpt in the model card:
https://huggingface.co/vincentzed-hf/Qwen3-Coder-Next-NVFP4
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci