fix(trainer): add torch.cuda.empty_cache() after FSDP update_actor by Leon-Algo · Pull Request #541 · agentscope-ai/Trinity-RFT

Leon-Algo · 2026-05-19T09:13:52Z

Summary

Add torch.cuda.empty_cache() at the end of update_actor() in fsdp_workers.py to release GPU memory held by PyTorch's caching allocator after backward passes
This prevents memory_reserved from growing monotonically across training steps, which eventually starves vLLM during weight synchronization in colocate mode

Problem

In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching allocator reserves GPU memory during backward passes and never voluntarily releases it back to CUDA. Over multiple training steps, memory_reserved grows monotonically:

Step 1: ~30.1 GiB
Step 5: ~34.0 GiB
Step 9:  38.2 GiB → OOM during vLLM weight sync (attempted 970 MiB, only 758 MiB available)

This happens because fsdp_workers.py's training path had zero empty_cache() calls, while:

megatron_workers.py already calls aggressive_empty_cache(force_sync=True) at the end of update_actor (line 887) and in several other locations
The vLLM worker's update_weight() already has empty_cache() for the inference side

The FSDP training path was the only backend missing this cleanup.
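As a rough check on the figures above, the growth rate and the size of the failed allocation can be worked out directly. This is plain arithmetic on the reported readings (the step 1 and step 5 values are approximate in the report), and the helper name `growth_per_step` is made up for illustration.

```python
from typing import Dict

# memory_reserved readings in GiB, keyed by training step, copied from the
# report above. The step 1 and step 5 values are approximate ("~").
without_fix: Dict[int, float] = {1: 30.1, 5: 34.0, 9: 38.2}


def growth_per_step(readings: Dict[int, float]) -> float:
    """Average change in reserved memory per step, first reading to last."""
    first, last = min(readings), max(readings)
    return (readings[last] - readings[first]) / (last - first)


rate = growth_per_step(without_fix)

# The OOM at step 9: vLLM asked for 970 MiB with only 758 MiB available.
shortfall_mib = 970 - 758

print(f"growth: {rate:.2f} GiB per step")       # growth: 1.01 GiB per step
print(f"allocation short by {shortfall_mib} MiB")  # allocation short by 212 MiB
```

At roughly 1 GiB of extra reserved memory per step, the shortfall of about 212 MiB is smaller than a single step's growth.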

Fix

Add torch.cuda.empty_cache() at the end of update_actor(), after the offload section and before return output. This matches the existing pattern in megatron_workers.py.
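The placement described above can be sketched as follows. This is not the code from fsdp_workers.py, which is not shown on this page: `update_actor_sketch`, `train_step`, and `release_cuda_cache` are hypothetical stand-ins, and the import guard exists only so the sketch also runs on a machine without PyTorch or without a GPU. The only real PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.empty_cache()`.

```python
from typing import Any, Callable

try:
    import torch
except ImportError:  # allows the sketch to run where PyTorch is not installed
    torch = None


def release_cuda_cache() -> bool:
    """Return cached, unused blocks to the CUDA driver.

    Returns True if empty_cache() was called, False if there is no CUDA device.
    """
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def update_actor_sketch(batch: Any, train_step: Callable[[Any], Any]) -> Any:
    """Hypothetical stand-in for update_actor(): train, clean up, then return."""
    output = train_step(batch)  # forward/backward/optimizer step (assumed)
    # ... the offload section would run here in the real worker ...
    release_cuda_cache()        # last action before returning, as in the fix
    return output


# Usage with a dummy training step that just returns a metrics dict.
metrics = update_actor_sketch({"tokens": [1, 2, 3]}, lambda b: {"loss": 0.5})
```

Note that `empty_cache()` only returns blocks the allocator is caching but not using; memory still held by live tensors is unaffected, which is why it is called after the training step has finished.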

Result with fix (Qwen3.5-0.8B, L20 46 GiB, colocate mode):

Step 1:  31.4 GiB
Step 2:  32.5 GiB
Step 5+: 32.9 GiB (stable, no further growth)
Step 10: 32.9 GiB ✓

Testing

Validated on L20 (46 GiB) with Qwen3.5-0.8B in colocate mode:

6 explore steps + 10 train steps completed with zero OOM
max_memory_reserved stabilized at 32.9 GiB (vs. 38.2 GiB without fix)
max_memory_allocated stable at 28.1-28.4 GiB
GRPO advantages non-zero (max 0.92-1.50) throughout training
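The PR does not say how the two peak-memory figures were collected. One way to read comparable numbers, assuming they correspond to PyTorch's `torch.cuda.max_memory_reserved()` and `torch.cuda.max_memory_allocated()` counters, is sketched below; the helper name `peak_memory_gib` is hypothetical, and it returns `None` when no CUDA device is present.

```python
from typing import Dict, Optional

try:
    import torch
except ImportError:  # allows the sketch to run where PyTorch is not installed
    torch = None

GIB = 1024 ** 3  # bytes per GiB


def peak_memory_gib() -> Optional[Dict[str, float]]:
    """Peak reserved/allocated memory on the current CUDA device, in GiB."""
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "max_memory_reserved": torch.cuda.max_memory_reserved() / GIB,
        "max_memory_allocated": torch.cuda.max_memory_allocated() / GIB,
    }


stats = peak_memory_gib()
if stats is not None:
    for name, value in stats.items():
        print(f"{name}: {value:.1f} GiB")
```

The gap between the two values is memory the caching allocator has reserved but is not currently using, which is the portion `empty_cache()` can hand back.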

Files Changed

trinity/trainer/verl/fsdp_workers.py: Add torch.cuda.empty_cache() at end of update_actor() (+6 lines)

🤖 Generated with Claude Code

In colocate mode (vLLM + FSDP on the same GPU), PyTorch's caching allocator holds onto reserved GPU memory after backward passes without releasing it back to CUDA. This causes memory_reserved to grow monotonically across training steps, eventually starving vLLM during weight synchronization.

Observed on L20 (46 GiB) with Qwen3.5-0.8B:
- Without fix: memory_reserved grows from ~30 GiB to 38.2 GiB over 9 steps, causing OOM during vLLM weight sync
- With fix: memory_reserved stabilizes at 32.9 GiB from step 5 onward

This matches the existing pattern in megatron_workers.py, which calls aggressive_empty_cache(force_sync=True) at the end of update_actor. The FSDP path had no equivalent cache release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pan-x-c · 2026-05-19T09:22:01Z

/unittest-pattern-TestTrainerGSM8K_1

github-actions · 2026-05-19T09:30:15Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
1	1	0	0	0	0	5m 41s

Tests

Test Name	Status	Flaky	Duration
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer	✅		5m 29s

Github Test Reporter by CTRF 💚

pan-x-c

Thanks for pointing it out. LGTM

pan-x-c approved these changes May 19, 2026

pan-x-c merged commit 904435a into agentscope-ai:main May 19, 2026
1 check passed

Leon-Algo mentioned this pull request May 25, 2026

fix(trainer): add empty_cache() after compute_ref_log_prob to prevent OOM #548

Merged

pan-x-c merged 1 commit into agentscope-ai:main from Leon-Algo:fix/fsdp-empty-cache-after-update-actor
