fix(trainer): add empty_cache() after compute_ref_log_prob to prevent OOM #548
Merged
pan-x-c merged 1 commit on May 25, 2026
… OOM
Add torch.cuda.empty_cache() after the reference model forward pass in compute_ref_log_prob(). Without this, PyTorch's caching allocator retains GPU memory reserved during the ref-log-prob computation, and memory_reserved grows monotonically across training steps, eventually causing OOM in colocate mode, where vLLM and the FSDP trainer share the same GPU.
This complements PR agentscope-ai#541, which added empty_cache() after update_policy() for the actor path. The ref model path has the same issue: it creates large intermediate tensors (logits with vocab_size up to 248K) whose GPU memory remains reserved even after the result is moved to CPU.
Verified on A100 80GB with Qwen3.5-0.8B colocate training:
- Without patch: OOM at step 2 (77 GiB reserved)
- With patch: stable at ~40 GiB reserved across 5+ steps
Summary
Add torch.cuda.empty_cache() after the reference model forward pass in compute_ref_log_prob().
This complements #541, which added empty_cache() after update_policy() for the actor path. The ref model path has the same memory leak issue.
Root Cause
In colocate mode, vLLM and the FSDP trainer share the same GPU. During compute_ref_log_prob(), the ref model creates large intermediate tensors (logits with vocab_size up to 248K). After output.to("cpu") moves the result to CPU, PyTorch's caching allocator still reserves the GPU memory used by these intermediates. This reserved memory grows monotonically across training steps and is never released back to CUDA, eventually causing OOM.
What This PR Does
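A minimal sketch of the change this section describes. The project's actual worker code is not reproduced here: `compute_ref_log_prob_sketch` and the `ref_forward` callable are hypothetical stand-ins for the real method and the reference model's forward pass, and the `torch.cuda.is_available()` guard is an addition so the sketch also runs on CPU-only machines.

```python
from typing import Callable

import torch


def compute_ref_log_prob_sketch(
    ref_forward: Callable[[torch.Tensor], torch.Tensor],
    batch: torch.Tensor,
) -> torch.Tensor:
    """Hypothetical stand-in for compute_ref_log_prob(); not the project's code."""
    with torch.no_grad():
        # On a GPU, large intermediates (e.g. logits) are allocated here.
        output = ref_forward(batch)

    # Move the result off the GPU.
    output = output.to("cpu")

    # Return cached, currently unused blocks to the CUDA driver so other
    # GPU users (e.g. vLLM in colocate mode) can allocate them.
    # Guard added for this sketch only, so it runs without a GPU.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return output
```

Note that `torch.cuda.empty_cache()` only releases cached blocks that no live tensor occupies, which is why the sketch places the call after the result has been moved to CPU.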
Adds torch.cuda.empty_cache() right after output = output.to("cpu") in compute_ref_log_prob(), releasing the caching allocator's unused reserved memory back to CUDA so it can be reused by vLLM and subsequent training steps.
Verification
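Reserved-memory figures like those reported in this PR can be read from PyTorch's allocator counters. The helper below is a hypothetical sketch, not the script used to produce the reported numbers; `cuda_memory_gib` and `log_step_memory` are illustrative names, and the log format is an assumption.

```python
import torch

GIB = 1024 ** 3


def cuda_memory_gib(device: int = 0) -> dict:
    """Return allocated and reserved CUDA memory in GiB (zeros without CUDA)."""
    if not torch.cuda.is_available():
        return {"allocated_gib": 0.0, "reserved_gib": 0.0}
    return {
        # Memory occupied by live tensors.
        "allocated_gib": torch.cuda.memory_allocated(device) / GIB,
        # Memory held by the caching allocator, including free cached blocks.
        "reserved_gib": torch.cuda.memory_reserved(device) / GIB,
    }


def log_step_memory(step: int, device: int = 0) -> str:
    """Print and return one log line for a training step (illustrative format)."""
    stats = cuda_memory_gib(device)
    line = (
        f"step={step} allocated={stats['allocated_gib']:.2f}GiB "
        f"reserved={stats['reserved_gib']:.2f}GiB"
    )
    print(line)
    return line
```

Calling `log_step_memory(step)` at the end of each training step makes it easy to see whether the reserved figure keeps climbing while the allocated figure stays flat, which is the pattern described under Root Cause.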
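Tested on A100 80GB with Qwen3.5-0.8B in colocate mode:
- Without patch: OOM at step 2 (77 GiB reserved)
- With patch: stable at ~40 GiB reserved across 5+ steps
The fix is a single 3-line addition (comment + empty_cache() call) with no behavioral changes to training logic.
Related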
empty_cache() after update_policy() (already merged)