Make compressed-tensors MoEs support ignored layers#17828
AniZpZ merged 5 commits into sgl-project:main
Conversation
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
Summary of Changes (Gemini Code Assist): This pull request improves the handling of quantization for Mixture-of-Experts (MoE) layers, particularly when only a subset of layers is intended for quantization. It refactors the core logic for retrieving quantization schemes, making it more robust and unified across different layer types.
Code Review
This pull request introduces support for ignoring layers during quantization for Mixture-of-Experts (MoE) layers when using compressed-tensors. The changes are well-structured, involving a refactoring to centralize the logic for retrieving quantization schemes into a new get_scheme_dict method. This new method correctly handles ignored layers. The MoE quantization logic is updated to use this new method, ensuring that projections within an MoE layer have consistent quantization schemes and correctly fall back to an unquantized method when ignored. Overall, this is a solid improvement that enhances the flexibility of the quantization framework for MoE models. I have one minor suggestion to improve code clarity.
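The centralized lookup described above can be illustrated with a minimal sketch. This is not sglang's actual code: the class name, constructor, and the name-suffix matching are illustrative assumptions; the key idea is a single resolver that returns `None` for ignored layers so callers fall back to an unquantized method.

```python
# Illustrative sketch (not sglang's actual implementation) of a unified
# scheme lookup: ignored layers resolve to None, everything else matches
# by explicit layer name first, then by module type (e.g. "FusedMoE").
from typing import Optional


class CompressedTensorsConfigSketch:
    def __init__(self, target_scheme_map: dict, ignore: list):
        self.target_scheme_map = target_scheme_map
        self.ignore = ignore

    def get_scheme_dict(self, layer_name: str, layer_type: str) -> Optional[dict]:
        # Ignored layers get no scheme -> caller falls back to an
        # unquantized linear / MoE method.
        if any(layer_name.endswith(p) for p in self.ignore):
            return None
        # Match by layer name first, then by module type.
        return self.target_scheme_map.get(layer_name) or self.target_scheme_map.get(layer_type)


cfg = CompressedTensorsConfigSketch(
    target_scheme_map={"Linear": {"weights": "int8"}, "FusedMoE": {"weights": "fp8"}},
    ignore=["mlp.gate"],
)
print(cfg.get_scheme_dict("model.layers.0.mlp.gate", "Linear"))      # None -> unquantized fallback
print(cfg.get_scheme_dict("model.layers.0.mlp.experts", "FusedMoE"))  # {'weights': 'fp8'}
```

A single entry point like this keeps `Linear` and `FusedMoE` layers on the same code path, which is the consistency property the review highlights.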
python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
Great!
/rerun-failed-ci
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com> Co-authored-by: Peng Zhang <aniz1905@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
When the MoE layer is not fully quantized, certain layers must be ignored during weight loading.
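Deciding which layers to skip typically means matching layer names against an ignore list from the quantization config. The sketch below is a hypothetical helper (the function name and pattern conventions are assumptions, loosely modeled on the `ignore` field in compressed-tensors configs, which supports plain names and `re:`-prefixed regexes), not sglang's actual matching code.

```python
# Hypothetical sketch: match a layer name against an ignore list so that
# matching layers keep unquantized weights during loading. Supports plain
# glob patterns and "re:"-prefixed regex patterns (an assumed convention).
import fnmatch
import re


def should_ignore(layer_name: str, ignore_patterns: list) -> bool:
    """Return True if the layer matches any ignore pattern."""
    for pattern in ignore_patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], layer_name):
                return True
        elif fnmatch.fnmatch(layer_name, pattern):
            return True
    return False


ignore = ["lm_head", "re:.*mlp.gate$"]
print(should_ignore("model.layers.0.mlp.gate", ignore))         # True: router gate stays unquantized
print(should_ignore("model.layers.0.mlp.experts.0.w1", ignore))  # False: expert weight is quantized
```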
Modifications
- Added `get_scheme_dict` to provide a unified interface for both `Linear` and `FusedMoE` layers, determining whether a fallback is needed and matching the appropriate target to return the corresponding `scheme_dict`.
- Added `FusedMoE` to the `target_scheme_map`, then match normally by either layer name or module type.
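On the MoE side, the gate/up/down projections inside one fused layer must resolve to the same scheme, and a `None` result triggers the unquantized fallback. The following is a hedged sketch under assumed names (`select_moe_method`, the `w1`/`w2`/`w3` projection names, and the returned method labels are illustrative, not sglang's API):

```python
# Hedged sketch: a FusedMoE quant method consults a unified scheme lookup
# for each expert projection, requires the schemes to agree, and falls
# back to an unquantized MoE method when the layer is ignored (None).
def select_moe_method(get_scheme_dict, prefix: str) -> str:
    # All projections inside one MoE layer must share one scheme.
    schemes = {name: get_scheme_dict(f"{prefix}.{name}", "FusedMoE")
               for name in ("w1", "w2", "w3")}
    if len(set(map(repr, schemes.values()))) != 1:
        raise ValueError(f"Inconsistent quantization schemes in MoE layer: {schemes}")
    scheme = next(iter(schemes.values()))
    return "unquantized_moe" if scheme is None else f"quantized_moe[{scheme}]"


# Toy lookup: layer 0 is ignored (returns None), other layers are fp8-quantized.
lookup = lambda name, _type: None if "layers.0" in name else {"weights": "fp8"}
print(select_moe_method(lookup, "model.layers.0.mlp.experts"))  # unquantized_moe
print(select_moe_method(lookup, "model.layers.1.mlp.experts"))
```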
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci