fix(trainer): add torch.cuda.empty_cache() after FSDP update_actor #541
Merged
pan-x-c merged 1 commit on May 19, 2026
In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching allocator holds onto reserved GPU memory after backward passes without releasing it back to CUDA. This causes memory_reserved to grow monotonically across training steps, eventually starving vLLM during weight synchronization.

Observed on L20 (46 GiB) with Qwen3.5-0.8B:
- Without fix: memory_reserved grows from ~30 GiB to 38.2 GiB over 9 steps, causing OOM during vLLM weight sync
- With fix: memory_reserved stabilizes at 32.9 GiB from step 5 onward

This matches the existing pattern in megatron_workers.py, which calls aggressive_empty_cache(force_sync=True) at the end of update_actor. The FSDP path had no equivalent cache release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A collaborator commented:
/unittest-pattern-TestTrainerGSM8K_1
pan-x-c (Collaborator) approved these changes on May 19, 2026 and left a comment:
Thanks for pointing it out. LGTM
Summary
- Add torch.cuda.empty_cache() at the end of update_actor() in fsdp_workers.py to release GPU memory held by PyTorch's caching allocator after backward passes
- Prevent memory_reserved from growing monotonically across training steps, which eventually starves vLLM during weight synchronization in colocate mode

Problem
In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching allocator reserves GPU memory during backward passes and never voluntarily releases it back to CUDA. Over multiple training steps, memory_reserved grows monotonically: on an L20 (46 GiB) with Qwen3.5-0.8B, it rose from ~30 GiB to 38.2 GiB over 9 steps and ended in an OOM during vLLM weight sync.

This happens because the training path in fsdp_workers.py had zero empty_cache() calls, while:
- megatron_workers.py already calls aggressive_empty_cache(force_sync=True) at the end of update_actor (line 887) and in several other locations
- update_weight() already has empty_cache() for the inference side

The FSDP training path was the only backend missing this cleanup.
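One way to observe this kind of growth is to record the allocator counters once per training step. The sketch below is illustrative only: snapshot_gib, grows_monotonically, and has_plateaued are hypothetical helper names and are not part of this PR. The only real PyTorch calls are torch.cuda.memory_allocated(), torch.cuda.memory_reserved(), and torch.cuda.max_memory_reserved().

```python
from typing import Dict, List

GIB = 1024 ** 3


def snapshot_gib() -> Dict[str, float]:
    """Read the caching-allocator counters for the current CUDA device, in GiB."""
    import torch  # imported lazily so the pure-Python helpers work without a GPU

    return {
        "allocated": torch.cuda.memory_allocated() / GIB,        # held by live tensors
        "reserved": torch.cuda.memory_reserved() / GIB,          # held by the allocator
        "max_reserved": torch.cuda.max_memory_reserved() / GIB,  # peak reserved so far
    }


def grows_monotonically(reserved: List[float], min_step: float = 0.0) -> bool:
    """True if every reading exceeds the previous one by more than min_step GiB."""
    if len(reserved) < 2:
        return False
    return all(b - a > min_step for a, b in zip(reserved, reserved[1:]))


def has_plateaued(reserved: List[float], window: int = 3, tolerance: float = 0.1) -> bool:
    """True if the last `window` readings stay within `tolerance` GiB of each other."""
    tail = reserved[-window:]
    return len(tail) == window and max(tail) - min(tail) <= tolerance
```

In a training loop, one might append snapshot_gib()["reserved"] to a list after each update_actor() call and check that list with the two helpers. A series that keeps climbing suggests the cache is never released; a flat tail suggests it has stabilized.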
Fix
Add torch.cuda.empty_cache() at the end of update_actor(), after the offload section and before return output. This matches the existing pattern in megatron_workers.py.

Result with fix (Qwen3.5-0.8B, L20 46 GiB, colocate mode): memory_reserved stabilizes at 32.9 GiB from step 5 onward.
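The placement of the call can be sketched roughly as follows. This is not the actual diff: update_actor_sketch, its train_step and offload callables, and release_cached_gpu_memory are hypothetical stand-ins for the real method in fsdp_workers.py. Only torch.cuda.is_available() and torch.cuda.empty_cache() are real PyTorch calls.

```python
from typing import Any, Callable, Optional

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_cached_gpu_memory() -> bool:
    """Return unused cached blocks to the CUDA driver.

    Returns True if a release was attempted, False if no GPU is usable.
    """
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def update_actor_sketch(
    train_step: Callable[[], Any],
    offload: Optional[Callable[[], None]] = None,
) -> Any:
    """Hypothetical outline of update_actor(): train, offload, free cache, return."""
    output = train_step()        # stand-in for forward/backward/optimizer work
    if offload is not None:
        offload()                # stand-in for the offload section
    release_cached_gpu_memory()  # the added call: after offload, before return
    return output
```

torch.cuda.empty_cache() only releases cached blocks that no live tensor is using, so it is expected to lower memory_reserved without affecting memory_allocated.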
Testing
Validated on L20 (46 GiB) with Qwen3.5-0.8B in colocate mode:
- max_memory_reserved stabilized at 32.9 GiB (vs. 38.2 GiB without fix)
- max_memory_allocated stable at 28.1-28.4 GiB

Files Changed
- trinity/trainer/verl/fsdp_workers.py: Add torch.cuda.empty_cache() at end of update_actor() (+6 lines)

🤖 Generated with Claude Code