[core] reuse unused reserved cuda memory when loading models by gante · Pull Request #37920 · huggingface/transformers

gante · 2025-05-01T18:48:50Z

What does this PR do?

TL;DR: check for unused reserved CUDA memory before preallocating more memory or deciding whether to offload to CPU.
Missing: a benchmark of whether this has a speed impact on from_pretrained, e.g. with TP

Context

(first commit containing the issue: #36335)

Some model tests have been flaky in a way that is difficult to reproduce, and resetting CUDA memory helped. E.g. if we remove the tearDown function that calls torch.cuda.empty_cache in CacheHardIntegrationTest, we may start getting failures (depending on the device).

Tracing down the issue, we can see that repeated from_pretrained calls may start offloading the model. More specifically, we can see that:

- the reserved memory grows when we instantiate a second model, even when the first model is no longer actively allocating CUDA memory;
- CPU offload is triggered even though there is plenty of memory for the model.

On main + RTX 4090 (24 GB), if we pick a 4B model in BF16 (~33% of device memory), we observe the following (see the output below; notice the CPU offload on the 3rd call):

# How to reproduce the issue: pick a model/GPU/dtype combination such that the model takes >33% of the GPU's memory.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-4B"

def generate():
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
    inputs = tokenizer(["Here's everything I know about cats. Cats"], return_tensors="pt").to(model.device)
    _ = model.generate(**inputs, do_sample=True, max_new_tokens=1, return_dict_in_generate=True, output_scores=True)

print("generate 1")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 2")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 3")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

print("generate 4")
generate()
print("memory allocated (GB)", torch.cuda.memory_allocated(0) / 1024 ** 3)
print("memory reserved (GB)", torch.cuda.memory_reserved(0) / 1024 ** 3)

generate 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.94it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.224609375
generate 2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.89it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 16.443359375
generate 3
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.00it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
memory allocated (GB) 5.62136697769165
memory reserved (GB) 8.224609375
generate 4
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.17it/s]
memory allocated (GB) 5.62136697769165
memory reserved (GB) 16.443359375

Solution

The solution is quite simple: when warming up memory or deciding whether to do CPU offload, count unused reserved memory (reserved minus allocated) as part of the memory available on the GPU.
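
As a rough illustration of the check described above, the sketch below compares the offload decision with and without counting unused reserved memory. The helper names (`unused_reserved`, `needs_cpu_offload`) are hypothetical and this is not the PR's actual code. The numbers are assumptions for illustration: a nominal 24 GiB device with no other processes, a model footprint of about 8.22 GiB (approximated from the reserved memory after the first load), and the reserved/allocated values printed after "generate 2" above.

```python
GIB = 1024 ** 3


def unused_reserved(reserved: int, allocated: int) -> int:
    """Bytes held by the caching allocator that back no live tensor."""
    return max(reserved - allocated, 0)


def needs_cpu_offload(model_bytes, free_bytes, reserved, allocated, count_unused_reserved):
    """Hypothetical decision helper: True if the model does not fit on the GPU."""
    available = free_bytes
    if count_unused_reserved:
        available += unused_reserved(reserved, allocated)
    return model_bytes > available


# Assumed state right before the 3rd load (illustrative, nominal 24 GiB device).
total = 24 * GIB
reserved = int(16.443359375 * GIB)
allocated = int(0.0079345703125 * GIB)
free = total - reserved       # memory outside the allocator's reserved pool
model = int(8.22 * GIB)       # rough model footprint (assumption)

print(needs_cpu_offload(model, free, reserved, allocated, count_unused_reserved=False))  # True
print(needs_cpu_offload(model, free, reserved, allocated, count_unused_reserved=True))   # False
```

On a real device, the inputs would come from torch.cuda.mem_get_info, torch.cuda.memory_reserved and torch.cuda.memory_allocated rather than from hard-coded values.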

After the fix in this PR, rerunning the script above gives:

generate 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.77it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.78it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 3
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.92it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625
generate 4
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.98it/s]
memory allocated (GB) 0.0079345703125
memory reserved (GB) 8.22265625

github-actions · 2025-05-01T18:49:03Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

gante · 2025-05-01T19:06:32Z

@Cyrilvallez do you have a benchmark script at hand for from_pretrained? 🤗 I've seen you sharing benchmarks on other threads, e.g. here

HuggingFaceDocBuilderDev · 2025-05-01T19:08:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Cyrilvallez

Very very nice catch! This should help the CIs a lot! Indeed torch does not re-include the reserved memory if we try to allocate something bigger, as I had noticed here

As for a quick benchmark, you can use something like

import time
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "codellama/CodeLlama-34b-Instruct-hf"
device = torch.device("cuda:2")  # adjust the index to a GPU available on your machine

# synchronize to make sure torch warmup is done
torch.cuda.synchronize(device)
t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device)
torch.cuda.synchronize(device)
dt = time.time() - t0
print(f"time to load the model: {dt:.2f} s")

max_mem = torch.cuda.max_memory_allocated(device) / 1024**3
current_mem = torch.cuda.memory_allocated(device) / 1024**3
print(f"Max: {max_mem:.2f} GiB")
print(f"Current: {current_mem:.2f} GiB")

but here I don't think it will have any impact on performance (though it's always good to double-check)

Cyrilvallez · 2025-05-05T10:22:25Z

+
+        # CUDA: `max_memory` contains non-reserved memory. There may be *unused* reserved memory in the GPU, which we
+        # can use to allocate parameters.
+        for device_name in max_memory.keys():
+            if isinstance(device_name, int):  # it's a GPU device
+                unused_memory = torch.cuda.memory_reserved(device_name) - torch.cuda.memory_allocated(device_name)
+                max_memory[device_name] += unused_memory
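
The adjustment in the hunk above can be exercised without a GPU by injecting the memory statistics. The following is a minimal sketch under that assumption: `add_unused_reserved` and the `stats` callable are hypothetical names introduced here, and the byte counts are made up. In the diff itself, the values come from torch.cuda.memory_reserved and torch.cuda.memory_allocated.

```python
from typing import Callable, Dict, Tuple, Union

DeviceKey = Union[int, str]


def add_unused_reserved(
    max_memory: Dict[DeviceKey, int],
    stats: Callable[[int], Tuple[int, int]],
) -> Dict[DeviceKey, int]:
    """Return a copy of `max_memory` where each GPU entry (int key) is
    increased by its unused reserved memory (reserved - allocated).

    `stats(device_index)` is a hypothetical hook that returns
    (reserved_bytes, allocated_bytes) for that device.
    """
    adjusted = dict(max_memory)
    for device_name in adjusted:
        if isinstance(device_name, int):  # it's a GPU device
            reserved, allocated = stats(device_name)
            adjusted[device_name] += reserved - allocated
    return adjusted


# Made-up numbers for illustration only (bytes).
fake_stats = {0: (8_000, 100), 1: (0, 0)}
result = add_unused_reserved({0: 1_000, 1: 5_000, "cpu": 64_000}, lambda i: fake_stats[i])
print(result)
```

Non-integer keys such as "cpu" are left untouched, mirroring the isinstance check in the diff. Returning a copy is a choice of this sketch; the diff updates max_memory in place.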


Is this change still needed with the other one? Was this a bug from sooo many years ago? From https://pytorch.org/docs/stable/generated/torch.cuda.mem_get_info.html it is not crystal clear whether get_max_memory includes reserved memory or not, but it looks like you're right 👀

Cyrilvallez · 2025-05-05

This could be something to upstream directly to accelerate as well

gante · 2025-05-05

Yes, the root issue is indeed old! The recent changes regarding memory warmup have exposed it :D

gante · 2025-05-05

cc @SunMarc

SunMarc · 2025-05-06

Happy to upstream this into accelerate if you prefer to have it there!

gante · 2025-05-05T13:23:46Z

Double-checking performance with the script from this comment, on my machine:

Before (best of 5 runs)

time to load the model: 4.45 s
Max: 15.83 GiB
Current: 14.96 GiB

After (best of 5 runs)

time to load the model: 4.44 s
Max: 15.83 GiB
Current: 14.96 GiB

✅ there seems to be no regression
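
For reference, a "best of N runs" measurement like the one reported above can be taken with a small harness such as the sketch below. `best_of` is a hypothetical helper, not part of the benchmark script; `load` stands for whatever callable performs the from_pretrained call. For GPU work the callable should include the torch.cuda.synchronize calls from the script, and each run may need a fresh process so that allocator state from one run does not carry over into the next.

```python
import time
from typing import Callable


def best_of(load: Callable[[], object], runs: int = 5) -> float:
    """Call `load` `runs` times and return the shortest wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        load()
        timings.append(time.perf_counter() - t0)
    return min(timings)


# Stand-in workload so the sketch runs anywhere; replace with the real load.
calls = []
best = best_of(lambda: calls.append(sum(range(10_000))), runs=5)
print(f"best of {len(calls)} runs: {best:.6f} s")
```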

ArthurZucker

LGTM! Thanks for diving!

ydshieh

Just want to say it's 💯

reuse cuda memory

0cf1c54

github-actions Bot marked this pull request as draft May 1, 2025 18:49

gante marked this pull request as ready for review May 1, 2025 18:50

gante requested a review from Cyrilvallez May 1, 2025 18:51

gante added 2 commits May 1, 2025 18:53

better comment

16210d3

missing index

171d094

gante mentioned this pull request May 2, 2025

[tests] Smaller model in slow cache tests #37922

Merged

Cyrilvallez reviewed May 5, 2025

Cyrilvallez approved these changes May 5, 2025

ArthurZucker approved these changes May 5, 2025

PR review suggestion

0d471c3

ydshieh approved these changes May 5, 2025

gante merged commit 3b067a1 into huggingface:main May 5, 2025
20 checks passed

gante deleted the cuda_reuse_reserved_memory branch May 5, 2025 14:14

faaany mentioned this pull request May 6, 2025

add xpu memory check #37969

Merged

gante mentioned this pull request May 6, 2025

[offload] respect max_memory argument when factoring in unused reserved memory #37982

Merged

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025

[core] reuse unused reserved cuda memory when loading models (hugging…

6ebe973

…face#37920)
