Enable Qwen3.5-MoE AOTI export on memory-constrained GPUs #19205
Conversation
| triton.Config({"BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 16}, num_warps=4, num_stages=3), | ||
| triton.Config( | ||
| {"BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 8}, num_warps=4, num_stages=2 | ||
| {"BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8}, |
these are all lintrunner changes
## Summary
Enable end-to-end ExecuTorch export of large models (e.g. Qwen3.5-35B-A3B
HQQ-INT4, ~18 GB of weights) under tight GPU memory budgets such as the
24 GB cap of consumer cards (RTX 4090 / 3090 / etc.) using the CUDA AOTI
backend.
Out of the box, calling `torch._inductor.aot_compile` on this model on a
24 GB-capped GPU OOMs in two distinct places:
1. **`_unlift_graph` clones every mutated buffer onto the model's target
device.** After `move_to_device_pass(...)` that target is CUDA, so we
end up with a transient ~18 GB GPU clone of the model weights on top
of the live model — instant OOM.
2. **Inductor / Triton internal caches keep multi-GB worth of CUDA
tensors alive between method compilations.** When ExecuTorch lowers a
multi-method export (e.g. decode + prefill) those leftovers stack up,
so the second method's compile starts from a half-full GPU and OOMs
again under the 24 GB cap.
This diff works around both issues in `CudaBackend` only — no changes to
PyTorch core, no impact on Metal / other AOTI backends.
## What this diff does
`backends/cuda/cuda_backend.py`:
1. **`_compile_time_cpu_clones(target_device)`** — context manager that
wraps `torch._inductor.compile_fx.clone_preserve_strides` so the
buffer clones produced by `_unlift_graph` land on CPU instead of the
target device. The wrap is **frame-discriminated**
(`sys._getframe(1).f_code.co_name == "_unlift_graph"`) so it does
*not* affect `triton_heuristics.py:1101`, which re-imports the same
symbol for autotune benchmark inputs that legitimately must stay on
GPU. It also wraps `CppWrapperCpu.codegen_device` so the generated
`constants_info_[i].device_type` still points at the real model
target device (e.g. cuda), preventing a mixed-device runtime error
when the constants are loaded back at inference time.
2. **`get_extra_aoti_compile_context_manager()`** — chains the existing
SDPA-MATH manager with `_compile_time_cpu_clones` via an
`ExitStack`, so both fire around the `torch._inductor.aot_compile`
call in `AotiBackend.preprocess`.
3. **`preprocess_multimethod()`** — overrides the base implementation
with a CUDA-specific cleanup loop that runs after every method
compile. It walks `gc.get_objects()`, finds every live CUDA tensor,
and calls `untyped_storage().resize_(0)` on it. This is how we
release ~18 GB of stale Inductor / Triton cache leftovers between
`decode` and `prefill` compiles. We need the in-place
`resize_(0)` (rather than `del + gc.collect()`) because the cache
still holds Python references — only forcibly emptying the storage
reclaims the GPU memory. Other AOTI backends (Metal/MPS) inherit the
default no-cleanup base implementation and are unaffected.
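
A condensed sketch of the two memory mechanisms from items 1 and 3 is below. It is illustrative only — the actual code in `cuda_backend.py` may be structured differently, and the wrapping of `CppWrapperCpu.codegen_device` mentioned in item 1 is omitted here:

```python
import contextlib
import gc
import sys

import torch


@contextlib.contextmanager
def _compile_time_cpu_clones():
    """Sketch of item 1: route the buffer clones that _unlift_graph makes
    onto CPU, leaving every other caller of clone_preserve_strides alone."""
    import torch._inductor.compile_fx as compile_fx

    original_clone = compile_fx.clone_preserve_strides

    def cpu_staged_clone(x):
        # Only the call made from inside _unlift_graph is redirected;
        # autotune benchmark inputs (triton_heuristics) keep the original
        # behavior and stay on GPU.
        if sys._getframe(1).f_code.co_name == "_unlift_graph":
            clone = torch.empty_strided(
                x.size(), x.stride(), dtype=x.dtype, device="cpu"
            )
            clone.copy_(x)
            return clone
        return original_clone(x)

    compile_fx.clone_preserve_strides = cpu_staged_clone
    try:
        yield
    finally:
        compile_fx.clone_preserve_strides = original_clone


def _release_stale_cuda_tensors() -> int:
    """Sketch of item 3: after a method's aot_compile has produced its
    .so / .ptd, every CUDA tensor still alive is stale cache content, so
    forcibly empty its storage to give the memory back to the allocator."""
    freed_bytes = 0
    for obj in gc.get_objects():
        if isinstance(obj, torch.Tensor) and obj.is_cuda:
            freed_bytes += obj.untyped_storage().nbytes()
            obj.untyped_storage().resize_(0)
    torch.cuda.empty_cache()
    return freed_bytes
```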
## Verified behavior
- `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` succeeds end-to-end
with `torch.cuda.set_per_process_memory_fraction(0.3, 0)` on an
80 GB A100 (= 24 GB visible) — peak GPU usage during compile stays at
~19 GB.
- Both `[CLEANUP]` lines fire and report ~18.29 GB freed per method.
- `qwen3_5_moe_runner` inference produces coherent text and matches
the perf of an unconstrained-VRAM export within measurement noise
(1903 tok/s prefill, 160 tok/s decode on A100 with `--cuda_graph=true`,
571-token prompt + 128 generated, GPU peak 18 GB).
## What should eventually move upstream
The three workarounds here all paper over real PyTorch issues that
deserve a proper fix in core:
1. **`_unlift_graph` cloning on the target device.** Cloning lifted
buffers onto whatever device they happen to live on is not free —
for large models the clone alone OOMs. Inductor should either
stage the clone on CPU explicitly or expose an option to do so.
Today we have to monkey-patch `clone_preserve_strides` *and* the
wrapper's device codegen to compensate; both could be replaced by a
first-class API such as `aot_compile(..., clone_buffers_on="cpu")`
plus an internal "original device of constant" record so the C++
wrapper writes the right `constants_info_`.
2. **Inductor / Triton caches leak compile-time CUDA tensors.** After
`aot_compile` returns, the `.so` and `.ptd` are written, the result
bytes are in our hands, and every CUDA tensor still alive is by
definition stale. Today we have to walk `gc.get_objects()` and
manually release each storage. Inductor should drain its own caches
(`PyCodeCache`, `CompiledFxGraph`, `CachingAutotuner`, …) at the end
of an `aot_compile` call, or at least expose a
`torch._inductor.reset_compile_caches()` helper.
3. **`wrap_triton` has no `mutates_args` parameter.** This is the
underlying reason `identify_mutated_tensors` exists at all: the
inner HOP for a Triton kernel call has to re-derive mutations from
TTIR because the user can't declare them. A future
`wrap_triton(kernel, mutates_args={"C"})` would let kernel authors
short-circuit the TTIR analysis (and avoid the historical fallback
that marks every input as mutated, which still shows up in older
PyTorch / Triton combinations).
Once those land, the entire monkey-patch block in this file can be
deleted.
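
For concreteness, the proposed upstream surface might look roughly like the following. Every name here — the `clone_buffers_on` argument, `reset_compile_caches`, and the `mutates_args` keyword on `wrap_triton` — is hypothetical and does not exist in current PyTorch:

```python
# Hypothetical upstream APIs corresponding to the three items above; none
# of these exist in PyTorch today.

# 1. Stage _unlift_graph's buffer clones on CPU while still recording the
#    model's real target device for constants_info_:
#    so_path = torch._inductor.aot_compile(gm, args, clone_buffers_on="cpu")

# 2. Drain Inductor / Triton compile-time caches after an AOT compile:
#    torch._inductor.reset_compile_caches()

# 3. Let kernel authors declare mutations up front instead of re-deriving
#    them from TTIR:
#    wrapped = torch.library.wrap_triton(my_kernel, mutates_args={"C"})
```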
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized ... --backend cuda` with
`torch.cuda.set_per_process_memory_fraction(0.3, 0)` (24 GB cap on an
80 GB A100). Export succeeded; both `[CLEANUP]` lines fired; peak GPU
usage stayed under 24 GB.
- Manual: ran `qwen3_5_moe_runner` against the exported `.pte` /
`.ptd`. Inference produced coherent output, prefill 1903 tok/s,
decode 160 tok/s with `--cuda_graph=true`, GPU peak 18 GB.
- Unaffected backends: Metal / other AOTI backends inherit the default
`BackendDetails.preprocess_multimethod` (no cleanup) and are not
touched by this diff.
## Summary
Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).
## What this diff does
1. `examples/models/qwen3_5_moe/export.py`
- Reset CUDA peak memory stats at the start of the CUDA backend setup.
- At the end of `main()`, when running with `--backend cuda`, print a
stable, machine-parseable marker line:
`EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
This makes the actual peak GPU memory consumed by the entire
load + quantize + lower pipeline visible to both humans and CI.
2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
- Tee the export output to a temp log.
- Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
`EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
via env var).
- Fail the job with an explanatory error if the budget is exceeded,
so any future regression that reintroduces the ~18 GB unnecessary
GPU clone (or comparable leak) is caught at PR time rather than
silently breaking 24 GB-class GPUs.
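
A minimal sketch of the marker side of this, assuming the peak is read via `torch.cuda.max_memory_allocated` (the real `export.py` may read a different statistic, and the helper names here are illustrative):

```python
import torch


def begin_gpu_peak_tracking() -> None:
    # Reset at the start of CUDA backend setup so the reported peak covers
    # only the load + quantize + lower pipeline, not earlier warm-up.
    torch.cuda.reset_peak_memory_stats()


def report_gpu_peak() -> None:
    # Stable, machine-parseable marker line that the CI script greps and
    # compares against EXPORT_GPU_PEAK_MB_LIMIT.
    peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb:.0f}")
```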
## Notes
- Current measured peak with the CUDA backend memory fixes (see prior
commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
will fail loudly.
- The threshold is intentionally tighter than the 24 GB physical cap
to leave room for measurement noise and small allocator overhead.
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
confirmed both the success path and the failure path (forced
threshold below the actual peak) behave as expected.
## Goal
Make the Qwen3.5-35B-A3B HQQ-INT4 CUDA AOTI export viable on consumer-class
24 GB GPUs (RTX 4090 / 3090), so users without datacenter hardware
can run the model end-to-end.
## Why it's needed
Out of the box the export OOMs on anything smaller than ~40 GB because the
AOTI compile pipeline (a) clones mutated buffers onto the target device on
top of the live model, and (b) leaks multi-GB of CUDA tensors between
method compiles via Inductor / Triton internal caches.
## What this PR changes
All fixes are scoped to `executorch/backends/cuda/cuda_backend.py` — no
PyTorch core patches, no impact on Metal or other AOTI backends:
- Wrap `clone_preserve_strides` (frame-discriminated) so the buffer clones
  made by `_unlift_graph` land on CPU instead of the target GPU, and patch
  the C++ wrapper's device codegen so `constants_info_` still points at the
  real model device for the runtime.
- Chain the patches into `get_extra_aoti_compile_context_manager()` so they
  only apply during `aot_compile` and revert cleanly afterwards.
- Override `preprocess_multimethod` for CUDA only: release CUDA memory
  between method compiles.
- The export script prints the peak GPU memory used, and CI fails if it
  exceeds 20 GB.
## Verified
- Export succeeds end-to-end under a 24 GB cap
  (`set_per_process_memory_fraction(0.3, 0)` on an 80 GB A100); peak GPU
  usage during compile stays at ~19 GB.
- `qwen3_5_moe_runner` produces coherent text at 1903 tok/s prefill /
  160 tok/s decode with `--cuda_graph=true` (GPU peak 18 GB).
## Future upstream work
The fixes here are workarounds for three underlying Inductor / PyTorch
issues that deserve proper upstream solutions:
- An option for `aot_compile` to clone lifted buffers on CPU while still
  recording the model's target device for the runtime.
- Inductor draining its own compile-time caches at the end of an
  `aot_compile` call (or exposing a public reset API).
- A `mutates_args` parameter on `wrap_triton` so kernel authors can declare
  mutations up front instead of relying on TTIR analysis.

Once those land upstream, the entire workaround block in `cuda_backend.py`
can be removed.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell