fix: sync server_args.kv_cache_dtype when detecting FP8 KV cache#18394
Fridge003 merged 1 commit into sgl-project:main
Conversation
b8zhong
left a comment
Thanks! I think this is the right solution compared to the other ones. Could you maybe try to use a util (e.g. a mapping from torch dtype to str for the KV cache here)?
/tag-and-rerun-ci
Force-pushed aba3ec7 to a47f0ee
Added a mapping dict!
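Such a utility might look roughly like the sketch below. The dict contents and function name are illustrative, not the PR's actual code; the sketch is keyed on the dtype's string form (`str(torch_dtype)`) so it stays dependency-free, whereas the real mapping likely keys on `torch.dtype` objects directly.

```python
# Hypothetical mapping from a torch dtype (in string form) to the string
# value that server_args.kv_cache_dtype expects. The actual dict added in
# the PR may use different keys/values.
KV_CACHE_DTYPE_STR = {
    "torch.float8_e4m3fn": "fp8_e4m3",
    "torch.float8_e5m2": "fp8_e5m2",
}


def kv_cache_dtype_to_str(dtype_repr: str, default: str = "auto") -> str:
    """Translate a detected KV cache dtype into its server-args string.

    Falls back to `default` ("auto") for dtypes with no special handling.
    """
    return KV_CACHE_DTYPE_STR.get(dtype_repr, default)
```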
Hi, if you could add a test for this ModelOpt model on SM90, that would be really great. @zack041 Thanks a lot!
Sure! I'll add the test in a follow-up PR. |
Motivation
fixes #18290 #12298
Source of error:
In sglang/srt/model_executor/model_runner.py
While `configure_kv_cache_dtype` correctly updates `self.kv_cache_dtype` for the ModelOpt case when `server_args.kv_cache_dtype` is `"auto"`, it does not update `server_args.kv_cache_dtype` itself. That field is what `flashattention_backend.py` reads to determine the data type for the KV cache: `self.kv_cache_dtype_str` is assigned from `server_args.kv_cache_dtype`.
Modifications
To fix this, `server_args.kv_cache_dtype` is also updated to FP8 along with `kv_cache_dtype`.
Accuracy Tests
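A minimal sketch of the shape of this fix, under stated assumptions: the real `configure_kv_cache_dtype` in `model_runner.py` has more branches, uses torch dtypes rather than plain strings, and detects FP8 from the model's quantization config; the class and field names below are simplified from the description above.

```python
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # Mirrors the server_args.kv_cache_dtype field described above.
    kv_cache_dtype: str = "auto"


class ModelRunner:
    def __init__(self, server_args: ServerArgs, model_has_fp8_kv_cache: bool):
        self.server_args = server_args
        # Stand-in for detecting an FP8 KV cache from a ModelOpt checkpoint.
        self.model_has_fp8_kv_cache = model_has_fp8_kv_cache
        self.kv_cache_dtype = "auto"

    def configure_kv_cache_dtype(self):
        if self.server_args.kv_cache_dtype == "auto" and self.model_has_fp8_kv_cache:
            # Before the fix: only self.kv_cache_dtype was updated, so
            # flashattention_backend.py (which reads server_args.kv_cache_dtype
            # into self.kv_cache_dtype_str) still saw "auto".
            self.kv_cache_dtype = "fp8_e4m3"
            # The fix: keep server_args in sync so downstream attention
            # backends see the detected FP8 dtype as well.
            self.server_args.kv_cache_dtype = "fp8_e4m3"


args = ServerArgs()
runner = ModelRunner(args, model_has_fp8_kv_cache=True)
runner.configure_kv_cache_dtype()
```

After `configure_kv_cache_dtype()` runs, both `runner.kv_cache_dtype` and `args.kv_cache_dtype` read `"fp8_e4m3"`, so a backend that consults either field picks the same KV cache dtype.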
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci