
[megatron] fix: Qwen3.5 LoRA & MTP support (with Megatron-Bridge)#5599

Open
HollowMan6 wants to merge 3 commits into verl-project:main from HollowMan6:qwen3.5_lora

Conversation


@HollowMan6 HollowMan6 commented Mar 15, 2026

What does this PR do?

Needs the following PRs:

The previous weight sync path relied on sender-side name rewriting, including hard-coded .base_layer handling. That made the Megatron/vLLM boundary brittle for newer models, especially packed projections and fused MoE modules.

In addition, the async LoRA update path assumed each adapter arrived in a single IPC bucket, which is not guaranteed by the bucketed transport. That could produce incomplete add_lora requests when LoRA tensors were split across multiple buckets.

There were also a few compatibility issues around:

  • Qwen-style nested text_config MTP fields
  • newer Megatron-LM process_mtp_loss API
  • actor engine offload behavior when automatic reload is not enabled

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

1. Receiver-side base-layer normalization for vLLM weight sync

Instead of mutating exported Megatron parameter names on the sender side, the vLLM colocate worker now resolves incoming names against the live vLLM parameter namespace.

This includes:

  • generic add/remove resolution for .base_layer leaf params (weight / bias)
  • stripping Bridge-inserted .base_layer from non-leaf fused-MoE logical aliases
  • packed-owner lookup for aliases such as q_proj -> qkv_proj
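
The resolution logic above can be sketched roughly as follows. This is an illustrative sketch, not the actual verl API: `PACKED_ALIASES` and `resolve_weight_name` are hypothetical names standing in for the real logic in the vLLM colocate worker.

```python
# Hypothetical receiver-side normalization: resolve an exported Megatron
# parameter name against the live vLLM parameter namespace instead of
# rewriting it on the sender side.

PACKED_ALIASES = {
    "q_proj": "qkv_proj", "k_proj": "qkv_proj", "v_proj": "qkv_proj",
    "gate_proj": "gate_up_proj", "up_proj": "gate_up_proj",
}

def resolve_weight_name(name: str, vllm_params: set) -> str:
    """Return a name that actually exists in the target vLLM namespace."""
    if name in vllm_params:
        return name
    for suffix in (".weight", ".bias"):
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            # Leaf params: try adding, then stripping, `.base_layer`.
            for cand in (f"{stem}.base_layer{suffix}",
                         stem.replace(".base_layer", "") + suffix):
                if cand in vllm_params:
                    return cand
    # Packed-owner lookup for aliases such as q_proj -> qkv_proj.
    for alias, owner in PACKED_ALIASES.items():
        cand = name.replace(f".{alias}.", f".{owner}.")
        if cand != name and cand in vllm_params:
            return cand
    raise KeyError(f"cannot resolve {name!r} in the vLLM namespace")
```

Probing the live namespace this way means the receiver adapts automatically when a model packs or wraps projections differently, rather than relying on a hard-coded suffix list on the sender.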

2. Multi-bucket LoRA update support

This keeps the sender simpler and lets the receiver normalize names based on the actual loaded vLLM model structure.

BucketedWeightReceiver now forwards is_last to the callback.

The async rollout weight update path uses that signal to:

  • keep standard base-weight loading per bucket
  • accumulate LoRA tensors across buckets
  • call add_lora only once, after the final bucket is received

This aligns VERL's bucketed transport semantics with vLLM's expectation that one add_lora request contains a complete adapter tensor dict.
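
A minimal sketch of that flow, assuming a callback of the shape `on_bucket(model, named_tensors, is_last)`; the actual verl/vLLM interfaces differ, and `load_weights`/`add_lora` here stand in for the real vLLM-side operations:

```python
# Base weights load per bucket; LoRA tensors are buffered until the final
# bucket arrives, then applied as one complete adapter dict.

lora_buffer = {}

def on_bucket(model, named_tensors, is_last):
    base_weights = []
    for name, tensor in named_tensors:
        if ".lora_" in name:
            lora_buffer[name] = tensor        # accumulate across buckets
        else:
            base_weights.append((name, tensor))
    if base_weights:
        model.load_weights(base_weights)      # standard per-bucket loading
    if is_last and lora_buffer:
        model.add_lora(dict(lora_buffer))     # single, complete add_lora
        lora_buffer.clear()
```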

3. Megatron PEFT utility cleanup

The old hard-coded stacked-parameter suffix list was removed.

megatron_peft_utils now exposes generic helpers to:

  • add .base_layer
  • remove .base_layer
  • resolve the correct name by probing the target namespace
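
Approximate shapes of those helpers (illustrative only; the real signatures live in `verl/utils/megatron_peft_utils.py` and may differ):

```python
def add_base_layer(name: str) -> str:
    # Insert `.base_layer` before the leaf (weight/bias) component.
    stem, _, leaf = name.rpartition(".")
    return f"{stem}.base_layer.{leaf}" if stem else name

def remove_base_layer(name: str) -> str:
    return name.replace(".base_layer", "")

def resolve_against(name: str, namespace) -> str:
    # Probe the target namespace rather than guessing from suffix lists.
    for cand in (name, add_base_layer(name), remove_base_layer(name)):
        if cand in namespace:
            return cand
    raise KeyError(name)
```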

The Megatron-to-HF module mapping was also expanded for GDN modules, including:

  • in_proj -> [in_proj_qkv, in_proj_z, in_proj_b, in_proj_a]
  • out_proj -> [out_proj]

4. MTP compatibility updates

MTP checks and disabling logic now work with nested HF text configs as well as configs that use mtp_num_hidden_layers instead of only num_nextn_predict_layers.

The Megatron MTP patch also prefers the newer upstream process_mtp_loss helper when available, while preserving a fallback path for older Megatron-LM versions.
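
The config probing can be sketched like this; the helper name is hypothetical, and the attribute names mirror the fields mentioned above:

```python
def get_num_mtp_layers(hf_config) -> int:
    # Qwen-style VLM configs nest the language model under `text_config`.
    cfg = getattr(hf_config, "text_config", None) or hf_config
    # Accept either MTP field name, not only num_nextn_predict_layers.
    for field in ("num_nextn_predict_layers", "mtp_num_hidden_layers"):
        value = getattr(cfg, field, None)
        if value:
            return value
    return 0
```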

5. Worker/runtime fixes

  • ActorRolloutRefWorker.update_weights now only offloads the actor engine back to CPU when param offload is actually enabled.
  • vision_config is now preserved in Megatron checkpoint manager backup state.
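
The offload gating amounts to something like the following (names approximate the described behavior, not verl's actual worker code): offload back to CPU only when param offload is configured, since otherwise nothing reloads the engine automatically before the next step.

```python
def finish_update_weights(engine, param_offload: bool):
    if param_offload:
        engine.offload_to_cpu()  # safe: the configured offload path reloads
    # else: keep the engine resident on GPU
```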

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant improvements and fixes for Qwen3.5 LoRA support within the Megatron framework. The core change involves refactoring the weight synchronization logic by moving the .base_layer normalization from the sender (Megatron) to the receiver (vLLM). This makes the system more robust and less brittle, especially for newer models with packed projections and fused MoE modules. The PR also addresses a critical bug in asynchronous LoRA updates, ensuring that adapters split across multiple IPC buckets are applied correctly only after the final bucket is received. Additionally, it includes several compatibility fixes for MTP with nested Hugging Face text configurations and resolves an incorrect engine offload behavior. The changes are well-supported by new regression tests, enhancing the overall stability and maintainability of the codebase.


Copilot AI left a comment


Pull request overview

This PR hardens Megatron→vLLM weight synchronization (notably for Qwen3.5 + LoRA), fixes multi-bucket LoRA IPC updates, improves MTP config compatibility, and corrects actor engine offload behavior when param offload is disabled.

Changes:

  • Move weight-name normalization (including .base_layer and packed-module alias handling) into the vLLM receiver side and make it robust to fused MoE/packed projections.
  • Update bucketed IPC receiving to surface is_last and accumulate LoRA tensors across buckets before issuing a single add_lora.
  • Improve MTP compatibility for nested HF text_config and mtp_num_hidden_layers, avoid incorrect CPU offload when param offload is disabled, and add regression tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| `verl/workers/rollout/vllm_rollout/utils.py` | Adds receiver-side weight-name normalization logic and multi-bucket LoRA accumulation before `add_lora`. |
| `verl/workers/rollout/vllm_rollout/bucketed_weight_transfer.py` | Extends the per-bucket callback contract to include `is_last`. |
| `verl/workers/megatron_workers.py` | Updates MTP enable/disable logic to support nested `text_config` and newer MTP fields. |
| `verl/workers/fsdp_workers.py` | Mirrors the MTP-disable behavior for nested `text_config` and `mtp_num_hidden_layers`. |
| `verl/workers/engine_workers.py` | Offloads the actor engine to CPU only when param offload is enabled. |
| `verl/workers/engine/megatron/transformer_impl.py` | Removes sender-side `.base_layer` rewriting for non-merged LoRA sync. |
| `verl/utils/megatron_utils.py` | Updates MTP detection/disable logic for nested `text_config` and alternate MTP fields. |
| `verl/utils/megatron_peft_utils.py` | Replaces hard-coded stacked-param suffix logic with generic `.base_layer` name helpers. |
| `verl/utils/checkpoint/megatron_checkpoint_manager.py` | Preserves `vision_config` in the transformer config backup state. |
| `tests/workers/test_engine_workers_update_weights.py` | Adds regression tests for param-offload-gated CPU offload behavior. |
| `tests/utils/test_vllm_weight_name_normalization_on_cpu.py` | Adds targeted tests for vLLM receiver normalization and multi-bucket LoRA accumulation. |
| `tests/utils/test_megatron_peft_utils.py` | Tests the new `.base_layer` resolver utilities and GDN module mapping expansion. |
| `tests/utils/test_bucketed_weight_transfer.py` | Updates the test callback signature to match the new `is_last` contract. |


@HollowMan6 HollowMan6 changed the title [megatron] fix: Qwen3.5 LoRA support [megatron] fix: Qwen3.5 LoRA & MTP support (with Megatron-Bridge) Mar 15, 2026
- move `.base_layer` normalization out of the Megatron sender and into the vLLM receiver
- make vLLM weight-name normalization robust to packed modules and fused MoE logical aliases
- fix bucketed LoRA IPC updates so multi-bucket adapters are applied only after the final bucket arrives
- avoid incorrect engine offload behavior when param offload is disabled
- improve MTP compatibility for nested HF text configs and newer Megatron-LM APIs
- add targeted regression coverage for the new weight-sync and worker behaviors


Signed-off-by: Hollow Man <hollowman@opensuse.org>