Add tensor parallelism support to LFM2 ShortConv layers#17777
ispobock merged 1 commit into sgl-project:main
Conversation
- Use `MergedColumnParallelLinear` for in_proj to shard B, C, x separately
- Use `RowParallelLinear` with `input_is_parallel` for out_proj
- Shard `conv_weight`/`conv_bias` along the hidden dimension
- Fix cache shape calculation by passing `num_heads=tp_size` (the temporal state is empty)
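The sharding arithmetic behind these changes can be illustrated with a small pure-Python sketch (names and shapes are illustrative, not the actual SGLang implementation): each of the three in_proj outputs (B, C, x) is split evenly across ranks, the conv weight is sliced along the hidden dimension, and out_proj consumes the rank-local shard.

```python
def shard_shapes(hidden_size, tp_size, tp_rank):
    """Illustrative sketch of how LFM2 ShortConv tensors are sharded
    under tensor parallelism (hypothetical helper, not SGLang code)."""
    assert hidden_size % tp_size == 0, "hidden dim must be divisible by tp_size"
    shard = hidden_size // tp_size
    # in_proj: MergedColumnParallelLinear shards each of B, C, x separately,
    # so every rank holds 3 * shard output rows.
    in_proj_out = 3 * shard
    # conv weight (D, K) is sliced along D; each rank keeps rows
    # [tp_rank * shard, (tp_rank + 1) * shard).
    conv_rows = (tp_rank * shard, (tp_rank + 1) * shard)
    # out_proj: RowParallelLinear with input_is_parallel takes the
    # rank-local shard as input and all-reduces the full-size output.
    out_proj_in = shard
    return in_proj_out, conv_rows, out_proj_in
```

With `hidden_size=1024` and `tp_size=2`, each rank holds a 1536-row in_proj shard, 512 conv rows, and a 512-column out_proj input.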
Code Review
This pull request successfully integrates tensor parallelism into the LFM2 ShortConv layers, addressing the AssertionError related to divisibility checks when tp_size > 1. The changes correctly replace nn.Linear layers with MergedColumnParallelLinear and RowParallelLinear for input and output projections, respectively. Additionally, convolutional weights and biases are now sharded along the hidden dimension, and the Mamba2StateShape creation is adjusted to ensure compatibility with tensor parallelism. The removal of unused variables and simplified logic for attention layer IDs also contributes to code cleanliness.
```python
if ".conv.conv.weight" in name:
    name = name.replace(".conv.conv.weight", ".conv.conv_weight")
    loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
if ".conv.conv.bias" in name:
    name = name.replace(".conv.conv.bias", ".conv.conv_bias")
```
The change in weight naming from .conv.weight to .conv.conv.weight and .conv.bias to .conv.conv.bias suggests a discrepancy between the internal naming convention of the SGLang model and the HuggingFace checkpoint. While this fix addresses the loading issue, it would be beneficial to add a comment explaining this specific naming adaptation, especially if it's a common pattern for LFM2 models or a known quirk of the upstream checkpoint. This improves clarity for future maintainers.
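A minimal, self-contained sketch of the naming adaptation discussed here (the function name is hypothetical; the rename rules mirror the loader snippet above):

```python
def remap_conv_name(name):
    """Map HF LFM2 checkpoint conv parameter names to the names the
    SGLang module registers (sketch of the loader logic above)."""
    # HF checkpoints store the conv params under `.conv.conv.*`, while the
    # module exposes them as explicit `conv_weight` / `conv_bias` attributes.
    if ".conv.conv.weight" in name:
        return name.replace(".conv.conv.weight", ".conv.conv_weight")
    if ".conv.conv.bias" in name:
        return name.replace(".conv.conv.bias", ".conv.conv_bias")
    return name
```

Any name without the conv pattern passes through unchanged, so the remap is safe to apply to every parameter during loading.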
/tag-and-rerun-ci
@tugot17 Thanks for fixing!
I just realised that the LFM models do not support tensor parallelism, because ShortConv used plain `nn.Linear` layers for its in/out projections. This PR fixes that.
Changes
Note on `num_heads=tp_size`:

`Mamba2StateShape.create()` divides `num_heads` by `tp_world_size` for the temporal state shape. LFM2 ShortConv doesn't use temporal state (`state_size=0`), so the result is always empty regardless of the first dimension. We pass `num_heads=tp_size` to satisfy the divisibility check while keeping the fix local to LFM2.

Test:
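The Note's reasoning can be checked with a tiny sketch (a hypothetical helper mirroring only the divisibility and shape logic described above, not the real `Mamba2StateShape.create()`):

```python
def temporal_state_shape(num_heads, head_dim, state_size, tp_world_size):
    """Sketch of the temporal-state shape computation described above.

    num_heads must be divisible by the TP world size; with state_size=0
    the resulting state is empty no matter what the first dimension is.
    """
    assert num_heads % tp_world_size == 0, "num_heads not divisible by TP size"
    shape = (num_heads // tp_world_size, head_dim, state_size)
    numel = shape[0] * shape[1] * shape[2]
    return shape, numel
```

Passing `num_heads=tp_size` makes the division come out to 1 for any world size, and since `state_size=0` the state holds zero elements either way.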
Run server
Error on main:
Runs successfully on the branch:
Tests:
ONLY_RUN=LiquidAI/LFM2.5-1.2B-Instruct pytest test/registered/models/test_generation_models.py::TestGenerationModels::test_all_models -v -s

1 passed, 5 warnings in 47.65s

Passes; also passes if I manually change the test to run:
Benchmarks
Benchmarks TP1:
Benchmarks TP2:
There is some variance among the runs, but overall it works as expected (within the confidence interval). Run on 2xB200.