[Fix] Add lora tied lm head support (for Qwen2.5, Gemma, etc model need)#18634

Merged
Fridge003 merged 8 commits into sgl-project:main from yushengsu-thu:fix/lora-tied-lm-head-support on Feb 18, 2026
Conversation

@yushengsu-thu (Collaborator) commented on Feb 11, 2026:

Motivation

Added CI test: 324 lines
Modified SGLang code: 87 lines

When tie_word_embeddings=True (e.g., Qwen2.5, Gemma), the model's lm_head is the same Python object as embed_tokens. PyTorch's named_modules() deduplicates by object identity, so lm_head never appears as a separate module — LoRA cannot wrap it. This causes LoRA adapters that target lm_head to silently fail or produce incorrect results.

Additionally, PEFT adapters may use shorthand strings like "all-linear" or "all" for target_modules, which SGLang previously did not handle, leading to crashes during adapter loading. PEFT also renames lm_head to unembed_tokens internally in some configurations, which was not recognized by SGLang's weight loader.

Modifications

python/sglang/srt/lora/lora_manager.py

  • Untie lm_head for LoRA wrapping: When lm_head is the same object as embed_tokens, create a new ParallelLMHead that shares the base weight tensor (no extra GPU memory) so that named_modules() yields it as an independent module.
  • Handle PEFT shorthand target_modules: When target_modules is the string "all-linear" or "all", fall back to the --lora-target-modules flag specified at server startup; raise a clear error for unrecognized string values.
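The untying trick can be sketched in plain PyTorch (a minimal illustration using nn.Embedding/nn.Linear rather than SGLang's ParallelLMHead; TinyModel and untie_lm_head are hypothetical names for this sketch):

```python
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab=8, hidden=4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        self.lm_head = self.embed_tokens  # tied: same Python object

def untie_lm_head(model: nn.Module) -> None:
    """Replace a tied lm_head with a distinct module that shares the
    underlying weight Parameter, so named_modules() yields it."""
    embed = model.embed_tokens
    if model.lm_head is embed:
        vocab_size, hidden_size = embed.weight.shape
        new_head = nn.Linear(hidden_size, vocab_size, bias=False)
        new_head.weight = embed.weight  # share the Parameter: no extra memory
        model.lm_head = new_head

m = TinyModel()
# named_modules() deduplicates by object identity, so the tied
# lm_head is skipped and LoRA has nothing to wrap.
names_before = {n for n, _ in m.named_modules()}
untie_lm_head(m)
names_after = {n for n, _ in m.named_modules()}
```

After untying, lm_head is an independent module in the graph while m.lm_head.weight is still the very same tensor as m.embed_tokens.weight.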

python/sglang/srt/lora/lora.py

  • Remap unembed_tokens to lm_head: PEFT internally renames lm_head to unembed_tokens in some adapter configs. The weight loader now remaps this key so the weight is loaded into the correct buffer.
  • Allow loading when normalized_target_modules is empty: When target_modules is a shorthand like "all-linear", the normalized set is empty. Allow embed_tokens/lm_head weights to be loaded in this case, deferring to --lora-target-modules for module selection.
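The remap described above amounts to a key rewrite during weight loading; a hedged sketch (remap_lora_weight_name is an illustrative helper, not SGLang's actual loader code):

```python
def remap_lora_weight_name(name: str) -> str:
    # 'unembed_tokens' is the alias PEFT uses for a tied lm_head in some
    # adapter configs; redirect such keys so the weight lands in the
    # lm_head buffer. (Illustrative; the real mapping is params_mapping.)
    return name.replace("unembed_tokens", "lm_head")
```

For example, a checkpoint key such as base_model.model.unembed_tokens.lora_A.weight would be loaded as base_model.model.lm_head.lora_A.weight, while all other keys pass through unchanged.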

python/sglang/srt/lora/utils.py

  • Handle string target_modules in get_normalized_target_modules(): Return an empty set for PEFT shorthands so callers can fall back to CLI config.
  • Add unembed_tokens → lm_head mapping to params_mapping.
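The behavior of the two utils.py changes can be sketched together (a simplified stand-in for the real get_normalized_target_modules, which may differ in detail):

```python
from typing import Iterable, Set, Union

def get_normalized_target_modules(
    target_modules: Union[str, Iterable[str]],
) -> Set[str]:
    # PEFT shorthand strings yield an empty set so the caller can fall
    # back to --lora-target-modules; unknown strings raise a clear error
    # instead of crashing later during adapter loading.
    if isinstance(target_modules, str):
        if target_modules in ("all-linear", "all"):
            return set()
        raise ValueError(
            f"Unsupported target_modules value: {target_modules!r}"
        )
    # Keep only the final path component of each module name, remapping
    # PEFT's unembed_tokens alias back to lm_head.
    alias = {"unembed_tokens": "lm_head"}
    leaves = {m.split(".")[-1] for m in target_modules}
    return {alias.get(leaf, leaf) for leaf in leaves}
```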

test/registered/lora/test_lora_tied_lm_head.py (new)

  • Programmatically creates a LoRA adapter with lm_head in target_modules on a model with tie_word_embeddings=True (Qwen/Qwen2.5-0.5B).
  • test_tied_lm_head_lora_no_nan: Verifies SGLang does not produce NaN values.
  • test_tied_lm_head_lora_differs_from_base: Confirms LoRA output differs from the base model (i.e., lm_head LoRA is actually applied).
  • test_tied_lm_head_lora_hf_sgl_logprob_match: Compares prefill and decode logprobs between HuggingFace+PEFT and SGLang, ensuring numerical consistency within threshold.

Accuracy Tests

The new test test_lora_tied_lm_head.py validates:

  1. No NaN values in output logprobs
  2. LoRA adapter produces different output from the base model
  3. SGLang logprobs match HuggingFace+PEFT logprobs (max diff < 2e-1)

Tested with Qwen/Qwen2.5-0.5B (tie_word_embeddings=True) + triton LoRA backend.
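The parity criterion above can be sketched as follows (logprobs_match is an illustrative helper, not the test's actual code):

```python
import math

def logprobs_match(ref, test, threshold=2e-1):
    """Return True if every token logprob is finite and within
    `threshold` of the HuggingFace+PEFT reference value."""
    if len(ref) != len(test):
        return False
    return all(
        math.isfinite(t) and abs(r - t) < threshold
        for r, t in zip(ref, test)
    )
```

Any NaN in the SGLang output fails the isfinite check, and a per-token deviation of 2e-1 or more fails the threshold check.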

Benchmarking and Profiling

No impact on inference speed — the untied lm_head shares the same weight tensor as embed_tokens, adding zero GPU memory overhead. The change only affects the module graph structure during initialization.

Checklist

Copilot AI review requested due to automatic review settings February 11, 2026 19:11
@github-actions github-actions bot added the lora label Feb 11, 2026
@yushengsu-thu yushengsu-thu changed the title [Fix] Add lora tied lm head support [Fix] Add lora tied lm head support (for Qwen2.5, Gemma, etc model need) Feb 11, 2026
@yushengsu-thu (Collaborator, Author) commented:

/tag-and-rerun-ci

Copilot AI (Contributor) left a comment:
Pull request overview

This PR fixes LoRA adapter loading/application for models where tie_word_embeddings=True (so lm_head is the same module object as embed_tokens), and improves compatibility with PEFT configs that use shorthand target_modules values and/or rename lm_head to unembed_tokens.

Changes:

  • Add logic to make tied lm_head appear as an independent module so LoRA can wrap it.
  • Improve PEFT config compatibility: handle string shorthands for target_modules and remap unembed_tokens → lm_head during weight loading.
  • Add a CUDA nightly regression test covering tied lm_head LoRA behavior and HF-vs-SGLang logprob consistency.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Reviewed files:

  • python/sglang/srt/lora/lora_manager.py: Handles PEFT shorthand target_modules and creates an untied lm_head module for LoRA wrapping when embeddings are tied.
  • python/sglang/srt/lora/lora.py: Remaps PEFT unembed_tokens weights to lm_head and loosens embedding weight filtering when normalized targets are empty.
  • python/sglang/srt/lora/utils.py: Accepts string target_modules inputs (PEFT shorthands) and adds the unembed_tokens → lm_head normalization mapping.
  • test/registered/lora/test_lora_tied_lm_head.py: New regression test for tied lm_head LoRA, including NaN checks, base-vs-LoRA difference, and HF parity.
Comments suppressed due to low confidence (1)

test/registered/lora/test_lora_tied_lm_head.py:325

  • 'except' clause does nothing but pass and there is no explanatory comment.


On test/registered/lora/test_lora_tied_lm_head.py:

    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="cpu",
    )

Copilot AI commented on Feb 11, 2026:

Using device_map="cpu" in Transformers typically pulls in the Accelerate dependency; if it's missing, this will error even though the test only needs a CPU load. Consider removing device_map (CPU is the default) or explicitly moving the model to CPU after load to avoid an unnecessary dependency in CI.

Suggested change: drop the device_map="cpu" argument.
@yushengsu-thu (Collaborator, Author) commented:
/tag-and-rerun-ci

@Fridge003 (Collaborator) commented:

@yushengsu-thu Can you please post the local result of running the newly added test?

@Fridge003 Fridge003 merged commit 9c5aae4 into sgl-project:main Feb 18, 2026
85 of 91 checks passed
yushengsu-thu added a commit to yushengsu-thu/sglang that referenced this pull request Feb 25, 2026