Enable Qwen3.5-MoE AOTI export on memory-constrained GPUs #19205
Conversation
| triton.Config({"BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 16}, num_warps=4, num_stages=3), | ||
| triton.Config( | ||
| {"BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 8}, num_warps=4, num_stages=2 | ||
| {"BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8}, |
these are all lintrunner changes
## Summary
Enable end-to-end ExecuTorch export of large models (e.g. Qwen3.5-35B-A3B
HQQ-INT4, ~18 GB of weights) under tight GPU memory budgets such as the
24 GB cap of consumer cards (RTX 4090 / 3090 / etc.) using the CUDA AOTI
backend.
Out of the box, calling `torch._inductor.aot_compile` on this model on a
24 GB-capped GPU OOMs in two distinct places:
1. **`_unlift_graph` clones every mutated buffer onto the model's target
device.** After `move_to_device_pass(...)` that target is CUDA, so we
end up with a transient ~18 GB GPU clone of the model weights on top
of the live model — instant OOM.
2. **Inductor / Triton internal caches keep multi-GB worth of CUDA
tensors alive between method compilations.** When ExecuTorch lowers a
multi-method export (e.g. decode + prefill) those leftovers stack up,
so the second method's compile starts from a half-full GPU and OOMs
again under the 24 GB cap.
This diff works around both issues in `CudaBackend` only — no changes to
PyTorch core, no impact on Metal / other AOTI backends.
## What this diff does
`backends/cuda/cuda_backend.py`:
1. **`_compile_time_cpu_clones(target_device)`** — context manager that
wraps `torch._inductor.compile_fx.clone_preserve_strides` so the
buffer clones produced by `_unlift_graph` land on CPU instead of the
target device. The wrap is **frame-discriminated**
(`sys._getframe(1).f_code.co_name == "_unlift_graph"`) so it does
*not* affect `triton_heuristics.py:1101`, which re-imports the same
symbol for autotune benchmark inputs that legitimately must stay on
GPU. It also wraps `CppWrapperCpu.codegen_device` so the generated
`constants_info_[i].device_type` still points at the real model
target device (e.g. cuda), preventing a mixed-device runtime error
when the constants are loaded back at inference time.
2. **`get_extra_aoti_compile_context_manager()`** — chains the existing
SDPA-MATH manager with `_compile_time_cpu_clones` via an
`ExitStack`, so both fire around the `torch._inductor.aot_compile`
call in `AotiBackend.preprocess`.
3. **`preprocess_multimethod()`** — overrides the base implementation
with a CUDA-specific cleanup loop that runs after every method
compile. It walks `gc.get_objects()`, finds every live CUDA tensor,
and calls `untyped_storage().resize_(0)` on it. This is how we
release ~18 GB of stale Inductor / Triton cache leftovers between
`decode` and `prefill` compiles. We need the in-place
`resize_(0)` (rather than `del + gc.collect()`) because the cache
still holds Python references — only forcibly emptying the storage
reclaims the GPU memory. Other AOTI backends (Metal/MPS) inherit the
default no-cleanup base implementation and are unaffected.
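
A condensed sketch of the two memory mechanisms from items 1 and 3 is below. It is illustrative only — the actual code in `cuda_backend.py` may be structured differently, and the wrapping of `CppWrapperCpu.codegen_device` mentioned in item 1 is omitted here:

```python
import contextlib
import gc
import sys

import torch


@contextlib.contextmanager
def _compile_time_cpu_clones():
    """Sketch of item 1: route the buffer clones that _unlift_graph makes
    onto CPU, leaving every other caller of clone_preserve_strides alone."""
    import torch._inductor.compile_fx as compile_fx

    original_clone = compile_fx.clone_preserve_strides

    def cpu_staged_clone(x):
        # Only the call made from inside _unlift_graph is redirected;
        # autotune benchmark inputs (triton_heuristics) keep the original
        # behavior and stay on GPU.
        if sys._getframe(1).f_code.co_name == "_unlift_graph":
            clone = torch.empty_strided(
                x.size(), x.stride(), dtype=x.dtype, device="cpu"
            )
            clone.copy_(x)
            return clone
        return original_clone(x)

    compile_fx.clone_preserve_strides = cpu_staged_clone
    try:
        yield
    finally:
        compile_fx.clone_preserve_strides = original_clone


def _release_stale_cuda_tensors() -> int:
    """Sketch of item 3: after a method's aot_compile has produced its
    .so / .ptd, every CUDA tensor still alive is stale cache content, so
    forcibly empty its storage to give the memory back to the allocator."""
    freed_bytes = 0
    for obj in gc.get_objects():
        if isinstance(obj, torch.Tensor) and obj.is_cuda:
            freed_bytes += obj.untyped_storage().nbytes()
            obj.untyped_storage().resize_(0)
    torch.cuda.empty_cache()
    return freed_bytes
```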
## Verified behavior
- `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` succeeds end-to-end
with `torch.cuda.set_per_process_memory_fraction(0.3, 0)` on an
80 GB A100 (= 24 GB visible) — peak GPU usage during compile stays at
~19 GB.
- Both `[CLEANUP]` lines fire and report ~18.29 GB freed per method.
- `qwen3_5_moe_runner` inference produces coherent text and matches
the perf of an unconstrained-VRAM export within measurement noise
(1903 tok/s prefill, 160 tok/s decode on A100 with `--cuda_graph=true`,
571-token prompt + 128 generated, GPU peak 18 GB).
## What should eventually move upstream
The three workarounds here all paper over real PyTorch issues that
deserve a proper fix in core:
1. **`_unlift_graph` cloning on the target device.** Cloning lifted
buffers onto whatever device they happen to live on is not free —
for large models the clone alone OOMs. Inductor should either
stage the clone on CPU explicitly or expose an option to do so.
Today we have to monkey-patch `clone_preserve_strides` *and* the
wrapper's device codegen to compensate; both could be replaced by a
first-class API such as `aot_compile(..., clone_buffers_on="cpu")`
plus an internal "original device of constant" record so the C++
wrapper writes the right `constants_info_`.
2. **Inductor / Triton caches leak compile-time CUDA tensors.** After
`aot_compile` returns, the `.so` and `.ptd` are written, the result
bytes are in our hands, and every CUDA tensor still alive is by
definition stale. Today we have to walk `gc.get_objects()` and
manually release each storage. Inductor should drain its own caches
(`PyCodeCache`, `CompiledFxGraph`, `CachingAutotuner`, …) at the end
of an `aot_compile` call, or at least expose a
`torch._inductor.reset_compile_caches()` helper.
3. **`wrap_triton` has no `mutates_args` parameter.** This is the
underlying reason `identify_mutated_tensors` exists at all: the
inner HOP for a Triton kernel call has to re-derive mutations from
TTIR because the user can't declare them. A future
`wrap_triton(kernel, mutates_args={"C"})` would let kernel authors
short-circuit the TTIR analysis (and avoid the historical fallback
that marks every input as mutated, which still shows up in older
PyTorch / Triton combinations).
Once those land, the entire monkey-patch block in this file can be
deleted.
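
For concreteness, the proposed upstream surface might look roughly like the following. Every name here — the `clone_buffers_on` argument, `reset_compile_caches`, and the `mutates_args` keyword on `wrap_triton` — is hypothetical and does not exist in current PyTorch:

```python
# Hypothetical upstream APIs corresponding to the three items above; none
# of these exist in PyTorch today.

# 1. Stage _unlift_graph's buffer clones on CPU while still recording the
#    model's real target device for constants_info_:
#    so_path = torch._inductor.aot_compile(gm, args, clone_buffers_on="cpu")

# 2. Drain Inductor / Triton compile-time caches after an AOT compile:
#    torch._inductor.reset_compile_caches()

# 3. Let kernel authors declare mutations up front instead of re-deriving
#    them from TTIR:
#    wrapped = torch.library.wrap_triton(my_kernel, mutates_args={"C"})
```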
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized ... --backend cuda` with
`torch.cuda.set_per_process_memory_fraction(0.3, 0)` (24 GB cap on an
80 GB A100). Export succeeded; both `[CLEANUP]` lines fired; peak GPU
usage stayed under 24 GB.
- Manual: ran `qwen3_5_moe_runner` against the exported `.pte` /
`.ptd`. Inference produced coherent output, prefill 1903 tok/s,
decode 160 tok/s with `--cuda_graph=true`, GPU peak 18 GB.
- Unaffected backends: Metal / other AOTI backends inherit the default
`BackendDetails.preprocess_multimethod` (no cleanup) and are not
touched by this diff.
## Summary
Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).
## What this diff does
1. `examples/models/qwen3_5_moe/export.py`
- Reset CUDA peak memory stats at the start of the CUDA backend setup.
- At the end of `main()`, when running with `--backend cuda`, print a
stable, machine-parseable marker line:
`EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
This makes the actual peak GPU memory consumed by the entire
load + quantize + lower pipeline visible to both humans and CI.
2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
- Tee the export output to a temp log.
- Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
`EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
via env var).
- Fail the job with an explanatory error if the budget is exceeded,
so any future regression that reintroduces the ~18 GB unnecessary
GPU clone (or comparable leak) is caught at PR time rather than
silently breaking 24 GB-class GPUs.
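
A minimal sketch of the marker side of this, assuming the peak is read via `torch.cuda.max_memory_allocated` (the real `export.py` may read a different statistic, and the helper names here are illustrative):

```python
import torch


def begin_gpu_peak_tracking() -> None:
    # Reset at the start of CUDA backend setup so the reported peak covers
    # only the load + quantize + lower pipeline, not earlier warm-up.
    torch.cuda.reset_peak_memory_stats()


def report_gpu_peak() -> None:
    # Stable, machine-parseable marker line that the CI script greps and
    # compares against EXPORT_GPU_PEAK_MB_LIMIT.
    peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb:.0f}")
```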
## Notes
- Current measured peak with the CUDA backend memory fixes (see prior
commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
will fail loudly.
- The threshold is intentionally tighter than the 24 GB physical cap
to leave room for measurement noise and small allocator overhead.
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
confirmed both the success path and the failure path (forced
threshold below the actual peak) behave as expected.
## Goal
Make the Qwen3.5-35B-A3B HQQ-INT4 CUDA AOTI export viable on consumer-class
24 GB GPUs (RTX 4090 / 3090), so users without datacenter hardware
can run the model end-to-end.
## Why it's needed
Out of the box the export OOMs on anything smaller than ~40 GB because the
AOTI compile pipeline (a) clones mutated buffers onto the target device on
top of the live model, and (b) leaks multi-GB of CUDA tensors between
method compiles via Inductor / Triton internal caches.
## What this PR changes
All fixes are scoped to `executorch/backends/cuda/cuda_backend.py` — no
PyTorch core patches, no impact on Metal or other AOTI backends:
- Wrap `clone_preserve_strides` (frame-discriminated) so the buffer clones
  made by `_unlift_graph` land on CPU instead of the target GPU, and patch
  the C++ wrapper's device codegen so `constants_info_` still points at the
  real model device for the runtime.
- Chain the patches into `get_extra_aoti_compile_context_manager()` so they
  only apply during `aot_compile` and revert cleanly afterwards.
- Override `preprocess_multimethod` for CUDA only: release CUDA memory
  between method compiles.
- The export script prints the peak GPU memory used, and CI fails if it
  exceeds 20 GB.
## Verified
- Export succeeds end-to-end under a 24 GB cap
  (`set_per_process_memory_fraction(0.3, 0)` on an 80 GB A100); peak GPU
  usage during compile stays at ~19 GB.
- `qwen3_5_moe_runner` produces coherent text at 1903 tok/s prefill /
  160 tok/s decode with `--cuda_graph=true` (GPU peak 18 GB).
## Future upstream work
The fixes here are workarounds for three underlying Inductor / PyTorch
issues that deserve proper upstream solutions:
- An option for `aot_compile` to clone lifted buffers on CPU while still
  recording the model's target device for the runtime.
- Inductor draining its own compile-time caches at the end of an
  `aot_compile` call (or exposing a public reset API).
- A `mutates_args` parameter on `wrap_triton` so kernel authors can declare
  mutations up front instead of relying on TTIR analysis.

Once those land upstream, the entire workaround block in `cuda_backend.py`
can be removed.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell