[core] reuse unused reserved cuda memory when loading models#37920
Conversation
|
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the |
|
@Cyrilvallez do you have a benchmark script at hand for |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Very very nice catch! This should help the CIs a lot! Indeed torch does not re-include the reserved memory if we try to allocate something bigger, as I had noticed here
As for a quick benchmark, you can use something like
import time
import torch
from transformers import AutoModelForCausalLM
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "codellama/CodeLlama-34b-Instruct-hf"
device = torch.device(f"cuda:2")
# synchronize to make sure torch warmup is done
torch.cuda.synchronize(device)
t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device)
torch.cuda.synchronize(device)
dt = time.time() - t0
print(f"time to load the model: {dt:.2f}")
max_mem = torch.cuda.max_memory_allocated(device) / 1024**3
current_mem = torch.cuda.memory_allocated(device) / 1024**3
print(f"Max: {max_mem:.2f} GiB")
print(f"Current: {current_mem:.2f} GiB")but here I don't think it will have any impact on performances (though it's always good to double check indeed)
|
|
||
| # CUDA: `max_memory` contains non-reserved memory. There may be *unused* reserved memory in the GPU, which we | ||
| # can use to allocate parameters. | ||
| for device_name in max_memory.keys(): | ||
| if isinstance(device_name, int): # it's a GPU device | ||
| unused_memory = torch.cuda.memory_reserved(device_name) - torch.cuda.memory_allocated(device_name) | ||
| max_memory[device_name] += unused_memory |
There was a problem hiding this comment.
Is this change still needed with the other one? Was this a bug from sooo many years ago? From https://pytorch.org/docs/stable/generated/torch.cuda.mem_get_info.html it is not crystal clear if get_max_memory will include reserved memory or not, but looks like you're right 👀
There was a problem hiding this comment.
This could be something to upstream directly to accelerate as well
There was a problem hiding this comment.
Yes, the root issue is indeed old! The recent changes regarding memory warmup have exposed them :D
There was a problem hiding this comment.
Happy to upstream this into accelerate if you prefer to have it there !
|
Double-checking performance with the script from this comment, on my machine: Before (best of 5 runs) After (best of 5 runs) ✅ there seems to be no regression |
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM! THanks for diving!
ydshieh
left a comment
There was a problem hiding this comment.
Just want to say it's 💯
What does this PR do?
TL;DR checks for unused reserved CUDA memory before preallocating more memory or deciding to do CPU offload.
Missing: benchmark whether this has a speed impact in
from_pretrainedon e.g. TPContext
(first commit containing the issue: #36335)
There has been an issue with flaky model tests that is difficult to reproduce, and where resetting cuda memory was helping. E.g. if we remove the
tearDownfunction withtorch.cuda.empty_cacheinCacheHardIntegrationTest, we might start getting failures (depending on the device).Tracing down the issue, we can see that repeated
from_pretrainedcalls may start offloading the model. More specifically, we can see thatOn
main+ RTX 4090 (24GB), if we pick a 4B model in BF16 (~33% of device memory), we observe the following (see output below -- notice the CPU offload after the 3rd call):Solution
The solution is quite simple: when warming up memory or deciding whether to do CPU offload, let's check the memory available in the GPU including unused reserved memory.
After the fix in this PR, rerunning the script above we get