Conversation
Replace the old update_constants_from_blob + cross-method sharing with a unified per-weight caching approach. The first method to initialize loads its constants from the blob and caches them by FQN. Subsequent methods with matching FQNs reuse cached GPU tensors via update_user_managed_constant_buffer_pairs, skipping blob loading entirely. This eliminates duplicate GPU weight allocations for multi-method models (e.g., prefill/decode), reducing peak GPU memory from ~35 GB to ~17.6 GB for Qwen3.5 MoE. Also adds GPU peak memory reporting to the Qwen3.5 MoE runner and a CI check (< 20 GB) in test_model_e2e.sh.
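To make the flow concrete, here is a minimal sketch of the described cache logic. This is illustrative only: `load_from_blob_and_cache`, the handle and tensor types, and the exact `update_user_managed_constant_buffer_pairs` signature are assumptions; `shared_constant_tensors_`, `shared_constants_mutex_`, and the FQN loop mirror the diff quoted later in this thread.

```cpp
// Sketch, not the actual ExecuTorch API: DelegateHandle, GpuTensor, and
// load_from_blob_and_cache are hypothetical stand-ins. Assumed to be a
// member of the backend, alongside shared_constant_tensors_ and
// shared_constants_mutex_.
Error load_constants_with_cache(
    DelegateHandle* handle,  // hypothetical handle type
    const std::unordered_map<std::string, std::string>& fqn_to_name) {
  std::lock_guard<std::mutex> guard(shared_constants_mutex_);

  // Pair each cached GPU tensor with this method's internal constant name.
  std::vector<std::pair<const char*, GpuTensor*>> pairs;
  bool any_missing = false;
  for (const auto& [fqn, internal_name] : fqn_to_name) {
    auto it = shared_constant_tensors_.find(fqn);
    if (it != shared_constant_tensors_.end()) {
      pairs.push_back({internal_name.c_str(), it->second});  // cache hit
    } else {
      any_missing = true;  // at least one weight must come from the blob
    }
  }

  if (any_missing) {
    // First method to initialize: load from the blob, then record each new
    // GPU tensor in shared_constant_tensors_ keyed by FQN.
    Error err = load_from_blob_and_cache(handle);  // hypothetical helper
    if (err != Error::Ok) {
      return err;
    }
  }
  if (!pairs.empty()) {
    // Point this method's container at the shared GPU tensors instead of
    // allocating duplicates (signature assumed).
    update_user_managed_constant_buffer_pairs(handle, pairs);
  }
  return Error::Ok;
}
```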
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18901

Note: Links to docs will display an error until the docs builds have been completed.

As of commit 2f1a453 with merge base eef7921: ❌ 2 new failures, 1 cancelled job (please retry), and 4 unrelated failures (flaky, or already broken on the merge base). 👉 Rebase onto the `viable/strict` branch to avoid the trunk failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review this PR.
Claude finished @digantdesai's task in 1m 47s — Review of PR #18901 — Per-weight constant cache
Overall: Good improvement. Replacing the old per-method blob loading with a per-weight cache is the right design.

Issues

1. Dangling pointers in the cached tensors. Actually, on closer inspection this is fine.
2. Leftover unused code (e.g., `allocate_constant_on_gpu`) that is no longer needed.
3. First method still loads the full blob even when some constants are already cached. This is likely acceptable for now since the primary use case is full overlap, but worth a comment noting this limitation.
4. GPU memory measurement is not peak memory; the reported number is not a true allocator high-water mark.
5. Missing blank line between function and member variable:

```cpp
  return Error::Ok;
}
mutable std::mutex shared_constants_mutex_;
```

There should be a blank line between the closing brace and the member declaration.

6. Thread safety of the calls into the shared cache.
7. CI test uses `grep -oP`, a GNU extension that BSD grep (e.g., on macOS) does not support:

```bash
PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)
```
Minor / Nits
Looks Good
Force-pushed from 9aeea5f to e249c23
```cpp
// extract and cache the new constants.
// 4. For cached constants, call update_user_managed_constant_buffer_pairs
//    to point the container to the shared GPU tensors.
Error load_constants_with_cache(
```
yeah caching is a better design.
I would imagine no one would run decode and prefill in parallel, even though, unlike a single method, they could. Also, why do we need a mutex in this method?
The mutex here is essential because the CudaBackend class is a singleton across different backend handles, even if they are in different Modules.
One day, if we actually deploy our model onto users' computers, users running different chats in parallel across different windows is expected, and then the race will happen.
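To illustrate the race being described, here is a self-contained sketch of the find-or-create pattern the lock protects. Member names match the diff; `GpuTensor` and `get_or_create` are illustrative stand-ins, not the actual backend API.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative stand-in for the real cached GPU tensor type.
struct GpuTensor {};

class CudaBackendSketch {
 public:
  // Find-or-create a cached constant. Because the backend is a singleton,
  // two Modules (e.g., two chat windows) may call this concurrently for the
  // same FQN; the lock makes the find-and-insert atomic, so only one GPU
  // allocation ever happens per weight.
  GpuTensor* get_or_create(const std::string& fqn) {
    std::lock_guard<std::mutex> guard(shared_constants_mutex_);
    auto it = shared_constant_tensors_.find(fqn);
    if (it != shared_constant_tensors_.end()) {
      return it->second;  // cache hit: no second allocation
    }
    GpuTensor* t = new GpuTensor();  // stands in for blob load + GPU upload
    shared_constant_tensors_.emplace(fqn, t);
    return t;
  }

 private:
  mutable std::mutex shared_constants_mutex_;
  std::unordered_map<std::string, GpuTensor*> shared_constant_tensors_;
};
```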
```cpp
for (const auto& [fqn, internal_name] : fqn_to_name) {
  auto it = shared_constant_tensors_.find(fqn);
  if (it != shared_constant_tensors_.end()) {
    pairs.push_back({internal_name.c_str(), it->second});
```
I think I mentioned it earlier: since we are tracing two methods with different shapes, dynamic shapes, etc., is it possible that something with the same FQN has a different shape and we silently do buggy things? Should we check for shapes here as well?
Similarly for mutability.
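One possible guard, sketched against the loop quoted above. `sizes()`, `is_mutable()`, and `expected_sizes` are assumptions, not the actual cached-tensor API:

```cpp
// Sketch of the suggested check: refuse to alias a cached weight whose
// shape or mutability differs from what this method expects.
auto it = shared_constant_tensors_.find(fqn);
if (it != shared_constant_tensors_.end()) {
  if (it->second->sizes() != expected_sizes || it->second->is_mutable()) {
    ET_LOG(
        Error,
        "FQN %s: cached tensor shape/mutability mismatch",
        fqn.c_str());
    return Error::InvalidArgument;
  }
  pairs.push_back({internal_name.c_str(), it->second});
}
```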
Thanks @digantdesai for the review! Will address your concerns tomorrow. For the Claude review, we've already solved most of the issues:

- I have updated the code; we no longer have `allocate_constant_on_gpu` and the other unused code.
- I don't think Claude's point makes sense: when we load the first method, there shouldn't be any data cached yet.
- That's a known issue. When @larryliu0820 created the memory tracker we were aware of it and had some discussion: the current tracker is the best we can provide without touching the AOTI codebase (following Claude's suggestion would mean inserting `cudaMemPoolAttrUsedMemHigh` queries and other calls into the AOTI codebase; see the sketch below).
- That's a formatting question; I think we should follow lintrunner, not Claude, and lintrunner is now all green.

I will make an update tomorrow.

Claude said the error would occur on macOS, but macOS doesn't have a CUDA GPU, so the script will never run there. The issue is a false alarm.
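For reference, a sketch of the CUDA mempool high-water-mark query that Claude's suggestion would require wiring into the AOTI-generated code. This uses the standard CUDA runtime API; error handling is omitted, and the attribute only tracks `cudaMallocAsync` allocations made through the pool, which is why it cannot live in the runner alone.

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <cstdio>

// Report the high-water mark of the device's default async memory pool.
void report_pool_peak(int device) {
  cudaMemPool_t pool = nullptr;
  cudaDeviceGetDefaultMemPool(&pool, device);
  std::uint64_t peak = 0;  // cudaMemPoolAttrUsedMemHigh is a 64-bit counter
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemHigh, &peak);
  std::printf(
      "GPU pool peak: %.2f GiB\n",
      static_cast<double>(peak) / (1024.0 * 1024.0 * 1024.0));
}
```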
Force-pushed from 1d5e628 to 04c911a
Problem: Multi-method AOTI models (e.g., Qwen3.5 MoE with separate prefill/decode methods) load the full weight blob independently for each method, even when they share identical weights. This causes duplicate GPU allocations -- Qwen3.5 MoE peaked at ~35 GB, making it impossible to run on a single 24 GB GPU (e.g., 4090).
Solution: Introduce a per-weight, FQN-keyed constant cache in CudaBackend. The first method loads its constants from the blob and caches them. Subsequent methods with matching FQNs skip blob loading entirely and reuse cached GPU tensors via update_user_managed_constant_buffer_pairs. A legacy fallback path is preserved for older AOTI models without constant management APIs.

Results

Peak GPU memory: 35.4 GB → 17.6 GB (-50%)