fix(trainer): add empty_cache() after compute_ref_log_prob to prevent OOM #548

Merged: pan-x-c merged 1 commit into agentscope-ai:main from Leon-Algo:fix/fsdp-empty-cache-after-ref-log-prob on May 25, 2026

@Leon-Algo
Contributor

Summary

Add torch.cuda.empty_cache() after the reference model forward pass in compute_ref_log_prob().

This complements #541, which added empty_cache() after update_policy() for the actor path. The ref model path shows the same growth in reserved GPU memory.

Root Cause

In colocate mode, vLLM and the FSDP trainer share the same GPU. During compute_ref_log_prob(), the ref model creates large intermediate tensors (logits with vocab_size up to 248K). After output.to("cpu") moves the result to the CPU, PyTorch's caching allocator still keeps the GPU memory used by these intermediates reserved. This reserved memory grows monotonically across training steps and is never released back to CUDA, eventually causing OOM.
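To give a sense of scale, here is a rough size estimate for a single logits tensor. The batch size, sequence length, and fp32 dtype are hypothetical values chosen for illustration (the PR does not state them), and 248K is taken as 248,000.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Bytes needed for a dense (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


GIB = 1024 ** 3

# Assumed micro-batch: 4 sequences of 4096 tokens, fp32 logits (4 bytes each).
# Only the vocab size (~248K) comes from the PR description.
size = logits_bytes(batch_size=4, seq_len=4096, vocab_size=248_000, bytes_per_elem=4)
print(f"{size / GIB:.1f} GiB")  # prints "15.1 GiB"
```

Under these assumptions a single logits tensor is already a sizeable fraction of an 80 GB card, so blocks of this size staying cached across steps can plausibly account for the reserved-memory growth described above.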

What This PR Does

Adds torch.cuda.empty_cache() right after output = output.to("cpu") in compute_ref_log_prob(), releasing the caching allocator's unused reserved memory back to CUDA so it can be reused by vLLM and subsequent training steps.
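A minimal sketch of the pattern, not the actual trainer code: the function name compute_ref_log_prob_sketch, its arguments, and the log-prob computation are hypothetical stand-ins. Only the placement of torch.cuda.empty_cache() after the move to CPU mirrors this PR. The explicit del is an assumption of the sketch, added so that no local name still references the GPU intermediates when the cache is emptied.

```python
import torch


def compute_ref_log_prob_sketch(ref_model, input_ids):
    """Hypothetical stand-in for compute_ref_log_prob(); illustrative only."""
    with torch.no_grad():
        logits = ref_model(input_ids)  # (batch, seq, vocab)
        log_probs = torch.log_softmax(logits.float(), dim=-1)
        # Log-probability assigned to each actual next token.
        token_log_probs = log_probs[:, :-1].gather(
            -1, input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)

    output = token_log_probs.to("cpu")

    # Drop references to the large intermediates, then return the
    # allocator's unused cached blocks to CUDA.
    del logits, log_probs, token_log_probs
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return output
```

To check the effect on a real run, torch.cuda.memory_reserved() can be logged before and after the call. empty_cache() only returns cached blocks that are not backing a live tensor, so it does not reduce memory that is still in use.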

Verification

Tested on A100 80GB with Qwen3.5-0.8B in colocate mode:

| Metric | Without patch | With patch |
| --- | --- | --- |
| OOM at step 2 | Yes (77 GiB reserved) | No |
| Stable memory | N/A | ~40 GiB reserved |
| Training steps completed | 1 | 5+ |

The fix is a single 3-line addition (comment + empty_cache() call) with no behavioral changes to training logic.

Related

… OOM

Add torch.cuda.empty_cache() after the reference model forward pass
in compute_ref_log_prob(). Without this, PyTorch's caching allocator
retains GPU memory reserved during the ref-log-prob computation, and
memory_reserved grows monotonically across training steps, eventually
causing OOM in colocate mode where vLLM and FSDP trainer share the
same GPU.

This complements PR agentscope-ai#541 which added empty_cache() after update_policy()
for the actor path. The ref model path has the same issue: it creates
large intermediate tensors (logits with vocab_size up to 248K) that
remain reserved even after being moved to CPU.

Verified on A100 80GB with Qwen3.5-0.8B colocate training:
- Without patch: OOM at step 2 (77 GiB reserved)
- With patch: stable at ~40 GiB reserved across 5+ steps
Collaborator

@pan-x-c left a comment

LGTM

@pan-x-c pan-x-c merged commit 991eda5 into agentscope-ai:main May 25, 2026
1 check passed