[Fix] Add lora tied lm head support (for Qwen2.5, Gemma, etc model need)#18634
Conversation
Pull request overview
This PR fixes LoRA adapter loading/application for models where tie_word_embeddings=True (so lm_head is the same module object as embed_tokens), and improves compatibility with PEFT configs that use shorthand target_modules values and/or rename lm_head to unembed_tokens.
Changes:
- Add logic to make tied `lm_head` appear as an independent module so LoRA can wrap it.
- Improve PEFT config compatibility: handle string shorthands for `target_modules` and remap `unembed_tokens` → `lm_head` during weight loading.
- Add a CUDA nightly regression test covering tied `lm_head` LoRA behavior and HF-vs-SGLang logprob consistency.
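The tied-head issue behind the first bullet, and the untying fix, can be illustrated with a toy module. This is a minimal sketch: a plain `nn.Linear` stands in for SGLang's `ParallelLMHead`, and `TinyTiedModel` is a hypothetical name, not code from this PR.

```python
import torch.nn as nn

class TinyTiedModel(nn.Module):
    def __init__(self, vocab=10, hidden=4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        # tie_word_embeddings=True: lm_head is the SAME Python object.
        self.lm_head = self.embed_tokens

model = TinyTiedModel()
names = [n for n, _ in model.named_modules()]
# named_modules() deduplicates by object identity, so "lm_head"
# never shows up and LoRA has nothing to wrap:
assert "lm_head" not in names

# The fix: replace lm_head with a fresh module object that shares the
# same weight tensor (no copy, zero extra memory).
new_head = nn.Linear(4, 10, bias=False)
new_head.weight = model.embed_tokens.weight  # shared storage
model.lm_head = new_head

names = [n for n, _ in model.named_modules()]
assert "lm_head" in names  # now a distinct module LoRA can wrap
assert model.lm_head.weight.data_ptr() == model.embed_tokens.weight.data_ptr()
```

Because only the Python wrapper object is new while the parameter storage is shared, the module graph changes but memory usage does not.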
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `python/sglang/srt/lora/lora_manager.py` | Handles PEFT shorthand `target_modules` and creates an untied `lm_head` module for LoRA wrapping when embeddings are tied. |
| `python/sglang/srt/lora/lora.py` | Remaps PEFT `unembed_tokens` weights to `lm_head` and loosens embedding weight filtering when normalized targets are empty. |
| `python/sglang/srt/lora/utils.py` | Accepts string `target_modules` inputs (PEFT shorthands) and adds `unembed_tokens` → `lm_head` normalization mapping. |
| `test/registered/lora/test_lora_tied_lm_head.py` | New regression test for tied `lm_head` LoRA, including NaN checks, base-vs-LoRA difference, and HF parity. |
Comments suppressed due to low confidence (1)
test/registered/lora/test_lora_tied_lm_head.py:325
- 'except' clause does nothing but pass and there is no explanatory comment.
```python
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="cpu",
```
Using device_map="cpu" in Transformers typically pulls in the Accelerate dependency; if it’s missing, this will error even though the test only needs a CPU load. Consider removing device_map (CPU is the default) or explicitly moving the model to CPU after load to avoid an unnecessary dependency in CI.
@yushengsu-thu Can you please post the local result of running the newly added test?
Motivation

Added test CI: 324 lines. Modified code in SGLang: 87 lines.

When `tie_word_embeddings=True` (e.g., Qwen2.5, Gemma), the model's `lm_head` is the same Python object as `embed_tokens`. PyTorch's `named_modules()` deduplicates by object identity, so `lm_head` never appears as a separate module and LoRA cannot wrap it. This causes LoRA adapters that target `lm_head` to silently fail or produce incorrect results.

Additionally, PEFT adapters may use shorthand strings like `"all-linear"` or `"all"` for `target_modules`, which SGLang previously did not handle, leading to crashes during adapter loading. PEFT also renames `lm_head` to `unembed_tokens` internally in some configurations, which was not recognized by SGLang's weight loader.

Modifications

`python/sglang/srt/lora/lora_manager.py`
- Untie `lm_head` for LoRA wrapping: when `lm_head` is the same object as `embed_tokens`, create a new `ParallelLMHead` that shares the base weight tensor (no extra GPU memory) so that `named_modules()` yields it as an independent module.
- Shorthand `target_modules`: gracefully handle `"all-linear"` and `"all"` strings by requiring the user to specify `--lora-target-modules` at server startup. Raise clear error messages for unrecognized string values.

`python/sglang/srt/lora/lora.py`
- Remap `unembed_tokens` to `lm_head`: PEFT internally renames `lm_head` to `unembed_tokens` in some adapter configs. The weight loader now remaps this key so the weight is loaded into the correct buffer.
- Loosen filtering when `normalized_target_modules` is empty: when `target_modules` is a shorthand like `"all-linear"`, the normalized set is empty. Allow `embed_tokens`/`lm_head` weights to be loaded in this case, deferring to `--lora-target-modules` for module selection.

`python/sglang/srt/lora/utils.py`
- Accept string `target_modules` in `get_normalized_target_modules()`: return an empty set for PEFT shorthands so callers can fall back to the CLI config.
- Add the `unembed_tokens` → `lm_head` mapping to `params_mapping`.

`test/registered/lora/test_lora_tied_lm_head.py` (new)
- Regression test for LoRA with `lm_head` in `target_modules` on a model with `tie_word_embeddings=True` (Qwen/Qwen2.5-0.5B).
- `test_tied_lm_head_lora_no_nan`: verifies SGLang does not produce NaN values.
- `test_tied_lm_head_lora_differs_from_base`: confirms LoRA output differs from the base model (i.e., `lm_head` LoRA is actually applied).
- `test_tied_lm_head_lora_hf_sgl_logprob_match`: compares prefill and decode logprobs between HuggingFace+PEFT and SGLang, ensuring numerical consistency within threshold.

Accuracy Tests
The new test `test_lora_tied_lm_head.py` validates the behaviors described above: no NaN outputs, LoRA output differing from the base model, and HF/SGLang logprob parity.

Tested with `Qwen/Qwen2.5-0.5B` (`tie_word_embeddings=True`) + the triton LoRA backend.

Benchmarking and Profiling

No impact on inference speed: the untied `lm_head` shares the same weight tensor as `embed_tokens`, adding zero GPU memory overhead. The change only affects the module graph structure during initialization.

Checklist