fix(trainer): add torch.cuda.empty_cache() after FSDP update_actor #541

Merged: pan-x-c merged 1 commit into agentscope-ai:main from Leon-Algo:fix/fsdp-empty-cache-after-update-actor on May 19, 2026

@Leon-Algo
Contributor

Summary

  • Add torch.cuda.empty_cache() at the end of update_actor() in fsdp_workers.py to release GPU memory held by PyTorch's caching allocator after backward passes
  • This prevents memory_reserved from growing monotonically across training steps, which eventually starves vLLM during weight synchronization in colocate mode

Problem

In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching allocator reserves GPU memory during backward passes and keeps freed blocks cached instead of returning them to CUDA. Over multiple training steps, memory_reserved grows monotonically:

Step 1: ~30.1 GiB
Step 5: ~34.0 GiB
Step 9:  38.2 GiB → OOM during vLLM weight sync (attempted 970 MiB, only 758 MiB available)

This happens because fsdp_workers.py's training path had zero empty_cache() calls, while:

  • megatron_workers.py already calls aggressive_empty_cache(force_sync=True) at the end of update_actor (line 887) and in several other locations
  • The vLLM worker's update_weight() already has empty_cache() for the inference side

The FSDP training path was the only backend missing this cleanup.
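The gap between reserved and allocated memory described above can be observed directly. The following is a minimal sketch, not code from this PR: `cuda_memory_snapshot` is a hypothetical helper name, and it assumes the standard `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` counters. It returns `None` when PyTorch or a CUDA device is unavailable.

```python
import importlib.util

GIB = 1024 ** 3


def cuda_memory_snapshot():
    """Hypothetical helper: report allocator counters for the current CUDA device in GiB.

    Returns None when torch is not installed or no CUDA device is available.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
    return {
        "allocated_gib": allocated / GIB,
        "reserved_gib": reserved / GIB,
        # Cached but currently unused: the portion empty_cache() can hand back.
        "cached_idle_gib": (reserved - allocated) / GIB,
    }


if __name__ == "__main__":
    print(cuda_memory_snapshot())
```

Logging this once per training step, before and after the weight sync, should show whether `reserved_gib` keeps climbing while `allocated_gib` stays flat.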

Fix

Add torch.cuda.empty_cache() at the end of update_actor(), after the offload section and before return output. This matches the existing pattern in megatron_workers.py.
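The placement described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual diff: `update_actor_sketch`, `train_step`, and `release_cached_gpu_memory` are hypothetical names, and the real `update_actor()` in fsdp_workers.py does considerably more. The only PyTorch call used is `torch.cuda.empty_cache()`.

```python
import importlib.util


def release_cached_gpu_memory() -> bool:
    """Hypothetical helper: return unused cached blocks to the CUDA driver.

    Returns True if torch.cuda.empty_cache() was called, False when torch
    or a CUDA device is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def update_actor_sketch(train_step):
    """Stand-in for update_actor(): train, release the cache, then return."""
    output = train_step()  # forward/backward/optimizer step (assumed callable)
    # In the real worker, the offload section would run here (assumption).
    release_cached_gpu_memory()  # last action before returning, as in the fix
    return output


if __name__ == "__main__":
    print(update_actor_sketch(lambda: {"loss": 0.5}))
```

Note that `empty_cache()` only releases blocks that no live tensor is using, so it lowers memory_reserved without affecting memory_allocated.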

Result with fix (Qwen3.5-0.8B, L20 46 GiB, colocate mode):

Step 1:  31.4 GiB
Step 2:  32.5 GiB
Step 5+: 32.9 GiB (stable, no further growth)
Step 10: 32.9 GiB ✓
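To make the comparison between the two runs concrete, the reported readings can be turned into growth between consecutive measurements. This is a small sketch using only the numbers quoted in this PR; `reserved_growth` is a hypothetical helper, the "~" values are treated as exact, and "Step 5+" is treated as step 5.

```python
def reserved_growth(readings):
    """Return the change in GiB between consecutive (step, reserved_gib) readings."""
    ordered = sorted(readings)
    return [round(later[1] - earlier[1], 2) for earlier, later in zip(ordered, ordered[1:])]


# memory_reserved readings quoted in this PR, as (step, GiB).
without_fix = [(1, 30.1), (5, 34.0), (9, 38.2)]
with_fix = [(1, 31.4), (2, 32.5), (5, 32.9), (10, 32.9)]

if __name__ == "__main__":
    print("without fix:", reserved_growth(without_fix))
    print("with fix:   ", reserved_growth(with_fix))
```

Without the fix, each interval adds several GiB. With the fix, the deltas shrink to zero, which is what "stable, no further growth" means here.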

Testing

Validated on L20 (46 GiB) with Qwen3.5-0.8B in colocate mode:

  • 6 explore steps + 10 train steps completed with zero OOM
  • max_memory_reserved stabilized at 32.9 GiB (vs. 38.2 GiB without fix)
  • max_memory_allocated stable at 28.1-28.4 GiB
  • GRPO advantages non-zero (max 0.92-1.50) throughout training

Files Changed

  • trinity/trainer/verl/fsdp_workers.py: Add torch.cuda.empty_cache() at end of update_actor() (+6 lines)

🤖 Generated with Claude Code

In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching
allocator holds onto reserved GPU memory after backward passes without
releasing it back to CUDA. This causes memory_reserved to grow
monotonically across training steps, eventually starving vLLM during
weight synchronization.

Observed on L20 (46 GiB) with Qwen3.5-0.8B:
- Without fix: memory_reserved grows from ~30 GiB to 38.2 GiB over
  9 steps, causing OOM during vLLM weight sync
- With fix: memory_reserved stabilizes at 32.9 GiB from step 5 onward

This matches the existing pattern in megatron_workers.py, which calls
aggressive_empty_cache(force_sync=True) at the end of update_actor.
The FSDP path had no equivalent cache release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pan-x-c
Collaborator

pan-x-c commented May 19, 2026

/unittest-pattern-TestTrainerGSM8K_1

@github-actions

Summary

| Tests 📝 | Passed ✅ | Failed ❌ | Skipped ⏭️ | Other ❓ | Flaky 🍂 | Duration ⏱️ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 | 0 | 5m 41s |

Tests

| Test Name | Status | Flaky | Duration |
| --- | --- | --- | --- |
| tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer | | | 5m 29s |

Github Test Reporter by CTRF 💚

Collaborator

@pan-x-c pan-x-c left a comment

Thanks for pointing it out. LGTM

@pan-x-c pan-x-c merged commit 904435a into agentscope-ai:main May 19, 2026
1 check passed