[core] reuse unused reserved cuda memory when loading models #37920

Merged
gante merged 4 commits into huggingface:main from gante:cuda_reuse_reserved_memory
May 5, 2025

gante commented May 1, 2025

What does this PR do?

TL;DR: checks for unused reserved CUDA memory before preallocating more memory or deciding to do CPU offload.
Missing: a benchmark of whether this affects the speed of from_pretrained, e.g. with TP (tensor parallelism).

Context

(first commit containing the issue: #36335)

Some model tests have been flaky in a way that is difficult to reproduce, and resetting CUDA memory was helping. E.g. if we remove the tearDown function that calls torch.cuda.empty_cache in CacheHardIntegrationTest, we might start getting failures (depending on the device).

Tracking down the issue, we can see that repeated from_pretrained calls may start offloading the model. More specifically, we can see that

  1. the reserved memory grows when we instantiate a second model, even when the first model is no longer actively allocating CUDA memory
  2. we're triggering CPU offload even though there is plenty of memory for the model

On main + RTX 4090 (24GB), if we pick a 4B model in BF16 (~33% of device memory), we observe the following (see output below -- notice the CPU offload on the 3rd call):

# How to reproduce the issue: pick a model/GPU/dtype combination such that the model takes >33% memory of the GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-4B"

def generate():
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
    inputs = tokenizer(["Here's everything I know about cats. Cats"], return_tensors="pt").to(model.device)
    _ = model.generate(**inputs, do_sample=True, max_new_tokens=1, return_dict_in_generate=True, output_scores=True)

print("generate 1")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 2")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 3")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 4")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

generate 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.94it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.224609375
generate 2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.89it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 16.443359375
generate 3
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.00it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
memory allocated (GB) 5.62136697769165
memory reserved (GB) 8.224609375
generate 4
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.17it/s]
memory allocated (GB) 5.62136697769165
memory reserved (GB) 16.443359375
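One way to read these numbers, as a back-of-the-envelope sketch rather than a trace of the actual loading code: if the memory budget only counts memory outside PyTorch's reserved pool, then the ~16.4 GiB reserved after two calls leaves less room than one model needs, even though almost none of it is allocated. The 24 GiB total is an assumed round figure for the RTX 4090, and `fits_without_offload` is a hypothetical helper, not a transformers function.

```python
# Back-of-the-envelope accounting; not the loader's real logic.
TOTAL_GIB = 24.0  # assumed round capacity of the RTX 4090
MODEL_GIB = 8.22  # reservation observed after one load in the output above


def fits_without_offload(
    reserved_gib: float, allocated_gib: float, count_unused_reserved: bool
) -> bool:
    """Return True if one more MODEL_GIB load fits in the computed budget."""
    free_outside_pool = TOTAL_GIB - reserved_gib
    unused_reserved = reserved_gib - allocated_gib
    budget = free_outside_pool
    if count_unused_reserved:
        budget += unused_reserved
    return budget >= MODEL_GIB


# State before the 3rd call: ~16.44 GiB reserved, ~0.01 GiB allocated.
print(fits_without_offload(16.44, 0.01, count_unused_reserved=False))  # False
print(fits_without_offload(16.44, 0.01, count_unused_reserved=True))   # True
```

Under these assumptions the budget before the 3rd call is about 7.56 GiB when unused reserved memory is ignored, which is below the ~8.22 GiB the model needs.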

Solution

The solution is quite simple: when warming up memory or deciding whether to do CPU offload, let's check the memory available in the GPU including unused reserved memory.
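A minimal sketch of that accounting, not the exact diff in this PR: the helper name `add_unused_reserved` and the injected `reserved`/`allocated` callables are hypothetical, chosen so the example runs without a GPU. It assumes, as in the reviewed snippet further down, that integer keys in `max_memory` denote GPU devices.

```python
from typing import Callable, Dict, Union

DeviceKey = Union[int, str]
GIB = 1024 ** 3


def add_unused_reserved(
    max_memory: Dict[DeviceKey, int],
    reserved: Callable[[int], int],
    allocated: Callable[[int], int],
) -> Dict[DeviceKey, int]:
    """Return a copy of `max_memory` where each GPU budget also counts
    bytes that are reserved but not currently allocated."""
    adjusted = dict(max_memory)
    for device, budget in max_memory.items():
        if isinstance(device, int):  # integer keys are GPU indices; "cpu" etc. are left alone
            adjusted[device] = budget + reserved(device) - allocated(device)
    return adjusted


# Stand-in numbers (bytes): 7 GiB free outside the pool, 16 GiB reserved, nothing allocated.
budgets = {0: 7 * GIB, "cpu": 64 * GIB}
adjusted = add_unused_reserved(
    budgets, reserved=lambda d: 16 * GIB, allocated=lambda d: 0
)
print(adjusted[0] // GIB)  # 23
```

On a CUDA machine, torch.cuda.memory_reserved and torch.cuda.memory_allocated could be passed as the two callables.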

After the fix in this PR, rerunning the reproduction script gives:

generate 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.77it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.78it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 3
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.92it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 4
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.98it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625

github-actions Bot marked this pull request as draft May 1, 2025 18:49

github-actions Bot commented May 1, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

gante marked this pull request as ready for review May 1, 2025 18:50
gante requested a review from Cyrilvallez May 1, 2025 18:51

gante commented May 1, 2025

@Cyrilvallez do you have a benchmark script at hand for from_pretrained? 🤗 I've seen you sharing benchmarks on other threads, e.g. here

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Cyrilvallez left a comment

Very very nice catch! This should help the CIs a lot! Indeed torch does not re-include the reserved memory if we try to allocate something bigger, as I had noticed here

As for a quick benchmark, you can use something like

import time
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "codellama/CodeLlama-34b-Instruct-hf"
device = torch.device("cuda:2")

# synchronize to make sure torch warmup is done
torch.cuda.synchronize(device)
t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device)
torch.cuda.synchronize(device)
dt = time.time() - t0
print(f"time to load the model: {dt:.2f}")

max_mem = torch.cuda.max_memory_allocated(device) / 1024**3
current_mem = torch.cuda.memory_allocated(device) / 1024**3
print(f"Max: {max_mem:.2f} GiB")
print(f"Current: {current_mem:.2f} GiB")

but I don't think it will have any impact on performance here (though it's always good to double-check)

Comment on lines +1287 to +1293

# CUDA: `max_memory` contains non-reserved memory. There may be *unused* reserved memory in the GPU, which we
# can use to allocate parameters.
for device_name in max_memory.keys():
    if isinstance(device_name, int):  # it's a GPU device
        unused_memory = torch.cuda.memory_reserved(device_name) - torch.cuda.memory_allocated(device_name)
        max_memory[device_name] += unused_memory
Member

Is this change still needed with the other one? Was this a bug from sooo many years ago? From https://pytorch.org/docs/stable/generated/torch.cuda.mem_get_info.html it is not crystal clear if get_max_memory will include reserved memory or not, but looks like you're right 👀

Member

This could be something to upstream directly to accelerate as well

Contributor Author

Yes, the root issue is indeed old! The recent changes regarding memory warmup have exposed it :D

Member

Happy to upstream this into accelerate if you prefer to have it there!

Comment thread src/transformers/modeling_utils.py Outdated

gante commented May 5, 2025

Double-checking performance with the script from this comment, on my machine:

Before (best of 5 runs)

time to load the model: 4.45 s
Max: 15.83 GiB
Current: 14.96 GiB

After (best of 5 runs)

time to load the model: 4.44 s
Max: 15.83 GiB
Current: 14.96 GiB

✅ there seems to be no regression
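For reference, a "best of N runs" measurement can be sketched as below. This is a generic timing helper under stated assumptions, not the procedure actually used for the numbers above: `best_of` is a hypothetical name, and for a real from_pretrained measurement each run would also need the torch.cuda.synchronize calls from the benchmark script so that pending GPU work is included in the timing.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], runs: int = 5) -> float:
    """Call `fn` `runs` times and return the fastest wall-clock duration in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace with the model-loading call being measured.
elapsed = best_of(lambda: sum(range(100_000)), runs=5)
print(f"best of 5: {elapsed:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of one-off noise such as disk cache misses on the first run.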

ArthurZucker left a comment

LGTM! Thanks for diving!

ydshieh left a comment

Just want to say it's 💯

gante merged commit 3b067a1 into huggingface:main May 5, 2025
20 checks passed
gante deleted the cuda_reuse_reserved_memory branch May 5, 2025 14:14
faaany mentioned this pull request May 6, 2025
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025