Add Qwen 3.6 MoE model and switch CI to Qwen3.6-35B-A3B-HQQ-INT4 #18978
mergennachin wants to merge 1 commit into main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18978

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs, 3 Unrelated Failures

As of commit 0cd505c with merge base 99d1756:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
Adds initial Qwen 3.6 MoE example documentation and updates the existing Qwen 3.5 MoE tooling/CI to support and use a Qwen 3.6 prequantized checkpoint.
Changes:
- Add a stub `qwen3_6_moe` README pointing to the `qwen3_5_moe` runner/instructions.
- Introduce a `--sensitive` (mixed-precision) quantization mode for the Qwen 3.5 MoE export and prequantization scripts.
- Switch CUDA CI and model e2e/export scripts from the Qwen 3.5 prequantized checkpoint to Qwen 3.6, and update the Qwen 3.5 MoE README with prequantized quick-start info.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/qwen3_6_moe/README.md | New stub README pointing to Qwen3.5 MoE instructions + Qwen3.6 checkpoint link. |
| examples/models/qwen3_5_moe/quantize_and_save.py | Adds --sensitive path and copies LICENSE into prequantized output. |
| examples/models/qwen3_5_moe/export.py | Adds --sensitive quantization mode and routes CUDA quantization to it. |
| examples/models/qwen3_5_moe/README.md | Adds prequantized quick-start and updates terminology/options list. |
| .github/workflows/cuda.yml | Switch CI matrix/model selection to Qwen3.6-35B-A3B-HQQ-INT4. |
| .ci/scripts/test_model_e2e.sh | Update supported HF model ID to Qwen3.6 prequantized checkpoint. |
| .ci/scripts/export_model_artifact.sh | Update supported HF model ID to Qwen3.6 prequantized checkpoint. |
@claude Review this PR
Claude finished @mergennachin's task in 1m 42s. View job

**Review of PR #18978**

Overall this is a well-structured PR. The […]

**Issues**

1. The removal of […]
2. Import restructuring in […]. The […]
3. Qwen3-0.6B removed from […]. The help text for supported models at the top of […]

**Suggestions (non-blocking)**

4. Duplicate matrix exclusion blocks in […]. The […]
5. The function always uses […]. If you later want per-component overrides (e.g. INT4 shared experts), this would need refactoring, but for now the all-or-nothing approach is appropriate.
6. Test helper […]
7. Consider adding […]. The README's "Generating Prequantized Weights" section (lines 99-113) only shows […]

**What looks good**

[…]
@claude Review this PR
Claude finished @mergennachin's task in 1m 43s. View job

**Review of PR #18978**

This PR is well-structured. The […]

A few items to consider:

**Issues**

1. The removal of […]
2. The […]
3. Comment in […]. The exclude comments still read […]

**Suggestions (non-blocking)**

4. […]
5. The two test helper methods share nearly identical logic (materialize, move to CUDA, forward, save, load, compare). The only difference is […]
6. README quick-start uses […]. The earlier Copilot review flagged inconsistency with "bundle" terminology. The current text says […]
7. The wording "requires […]" […]

**What looks good**

[…]
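The deduplication suggested for the two test helpers usually comes down to one shared driver that takes the differing step as a parameter. A toy sketch of the pattern (pure Python, no torch/CUDA; all names here are invented for illustration, not taken from the actual test file):

```python
import json
import os
import tempfile

# Toy version of folding two near-identical save/load/compare test helpers
# into a single driver; only the differing save/load pair is injected.
def save_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)

def load_json(path):
    with open(path) as f:
        return json.load(f)

def roundtrip_matches(payload, save_fn, load_fn, path):
    save_fn(payload, path)            # the step that varied between helpers
    return load_fn(path) == payload   # shared: load back and compare

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(roundtrip_matches({"weights": [1, 2, 3]}, save_json, load_json, path))  # → True
```

In the real tests, `save_fn`/`load_fn` would be the two checkpoint formats, and the materialize/CUDA/forward steps would live in the shared driver.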
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
examples/models/qwen3_5_moe/quantize_and_save.py:5
- The docstring/CLI description still read as Qwen 3.5-only, but this script now supports `--sensitive` for models without QAT (called out as Qwen 3.6 in the README). Consider updating the wording to reflect that it can be used for both Qwen 3.5 and Qwen 3.6 (or more generally "Qwen MoE") to avoid confusion.
"""Quantize Qwen 3.5 MoE and save as a self-contained safetensors checkpoint.
Runs quantization once and saves the result so export.py can skip
re-quantizing via --prequantized. The output directory contains everything
needed to load the model — no reference to the original HF checkpoint required.
| """Validate CLI argument combinations.""" | ||
| if args.model_id: | ||
| if args.model_dir is not None: | ||
| raise ValueError("Cannot specify model_dir when model_id is provided.") |
In `_validate_args`, using `raise ValueError(...)` will surface as a stack trace rather than a consistent CLI validation error with usage text. Consider using `parser.error(...)` here (matching the other validation branches) so invalid combinations exit cleanly with a helpful message.
| raise ValueError("Cannot specify model_dir when model_id is provided.") | |
| parser.error("Cannot specify --model-dir when --model-id is provided.") |
```python
if args.model_id:
    if args.model_dir is not None:
        raise ValueError("Cannot specify model_dir when model_id is provided.")
    from huggingface_hub import snapshot_download

    args.model_dir = snapshot_download(repo_id=args.model_id)

if not args.prequantized and not args.model_dir and not args.tiny_test:
```
`_validate_args` will download a full HF snapshot when `--model-id` is set even if `--prequantized` is also provided (since the prequantized path never uses `model_dir`). Consider erroring on `--prequantized` + `--model-id` (or skipping the download when `args.prequantized` is set) to avoid an unnecessary multi-GB download in accidental/CI usage.
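One way the skip-the-download variant could look, sketched with a hypothetical helper mirroring the snippet above (not the actual repo code):

```python
def resolve_model_dir(args):
    """Return the local model directory, downloading only when needed."""
    if args.prequantized:
        # The prequantized path never reads model_dir, so skip the
        # multi-gigabyte snapshot entirely.
        return None
    if args.model_id:
        if args.model_dir is not None:
            raise ValueError("Cannot specify model_dir when model_id is provided.")
        # Deferred import: huggingface_hub is only needed on this branch.
        from huggingface_hub import snapshot_download
        return snapshot_download(repo_id=args.model_id)
    return args.model_dir
```

The guard makes `--prequantized` short-circuit before any network access, so an accidental `--model-id` in CI costs nothing.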
Qwen 3.6 MoE shares architecture and runner with Qwen 3.5 MoE. Add a stub README pointing to the existing qwen3_5_moe example. Update CI scripts and cuda.yml to use the Qwen 3.6 prequantized checkpoint. Improve qwen3_5_moe README: add quick-start section for prequantized weights, list available prequantized checkpoints, and clean up terminology.
Try #2. It uses a new quantization scheme that does not quantize all layers.
Using a new checkpoint that is not uniform INT4 but uses INT8 on certain layers:
https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4
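For reference, the non-uniform idea boils down to a per-layer bit-width rule: keep a small set of accuracy-sensitive layers at INT8 and quantize the rest to INT4. The name patterns and helper below are hypothetical illustrations, not the actual HQQ configuration of this checkpoint:

```python
# Hypothetical per-layer bit-width rule; the patterns are illustrative only,
# not taken from the Qwen3.6-35B-A3B-HQQ-INT4 config.
SENSITIVE_PATTERNS = ("embed", "lm_head", "router")

def bits_for_layer(name: str) -> int:
    """Choose a weight bit-width from the layer's name."""
    if any(p in name for p in SENSITIVE_PATTERNS):
        return 8  # accuracy-sensitive layer: stay at INT8
    return 4      # bulk of the MoE expert weights: INT4

for layer in ("model.embed_tokens",
              "model.layers.0.mlp.experts.0.w1",
              "lm_head"):
    print(layer, bits_for_layer(layer))
```

The bulk of a 35B-parameter MoE's weights sit in the experts, so they dominate the compressed size while the few INT8 layers protect quality.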