
Add Qwen 3.6 MoE model and switch CI to Qwen3.6-35B-A3B-HQQ-INT4 #18978

Open
mergennachin wants to merge 1 commit into main from qwen3_6

Conversation

mergennachin (Contributor) commented Apr 17, 2026

Try #2. This attempt uses a new quantization scheme that does not quantize all layers uniformly.

Qwen 3.6 MoE shares architecture and runner with Qwen 3.5 MoE.
Add a stub README pointing to the existing qwen3_5_moe example.
Update CI scripts and cuda.yml to use the Qwen 3.6 prequantized
checkpoint. Improve qwen3_5_moe README: add quick-start section
for prequantized weights, list available prequantized checkpoints,
and clean up terminology.

The new checkpoint is not uniformly INT4; it uses INT8 on certain layers:

https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4
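
For readers skimming the description, here is a minimal sketch of what an "INT4 experts, INT8 elsewhere" scheme can look like, assuming recent torchao config classes and its filter_fn hook. The helper name quantize_mixed, the name-matching heuristics, and the group size are illustrative assumptions, not the recipe used for the published checkpoint.

```python
import torch.nn as nn
from torchao.quantization import Int4WeightOnlyConfig, Int8WeightOnlyConfig, quantize_


def quantize_mixed(model: nn.Module) -> None:
    """Quantize expert linears to INT4 and other linears to INT8 (sketch)."""

    def is_expert(module: nn.Module, fqn: str) -> bool:
        return isinstance(module, nn.Linear) and ".experts." in fqn

    def is_sensitive(module: nn.Module, fqn: str) -> bool:
        # Attention projections, shared experts, lm_head, etc.; routing
        # gates are skipped so they stay in full precision.
        return (
            isinstance(module, nn.Linear)
            and ".experts." not in fqn
            and "gate" not in fqn
        )

    # Apply the coarse INT4 config to the experts, then INT8 to the rest.
    quantize_(model, Int4WeightOnlyConfig(group_size=32), filter_fn=is_expert)
    quantize_(model, Int8WeightOnlyConfig(), filter_fn=is_sensitive)
```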

mergennachin requested a review from lucylq as a code owner April 17, 2026 22:40
Copilot AI review requested due to automatic review settings April 17, 2026 22:40
pytorch-bot (Bot) commented Apr 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18978

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs, 3 Unrelated Failures

As of commit 0cd505c with merge base 99d1756 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla (Bot) added the CLA Signed label (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed) on Apr 17, 2026
github-actions (Bot) commented:

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mergennachin requested a review from Gasoonjia April 17, 2026 22:43
Copilot AI (Contributor) left a comment


Pull request overview

Adds initial Qwen 3.6 MoE example documentation and updates the existing Qwen 3.5 MoE tooling/CI to support and use a Qwen 3.6 prequantized checkpoint.

Changes:

  • Add a stub qwen3_6_moe README pointing to the qwen3_5_moe runner/instructions.
  • Introduce --sensitive (mixed-precision) quantization mode for Qwen3.5 MoE export + prequantization scripts.
  • Switch CUDA CI and model e2e/export scripts from the Qwen 3.5 prequantized checkpoint to Qwen 3.6, and update Qwen3.5 MoE README with prequantized quick-start info.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| examples/models/qwen3_6_moe/README.md | New stub README pointing to Qwen3.5 MoE instructions + Qwen3.6 checkpoint link. |
| examples/models/qwen3_5_moe/quantize_and_save.py | Adds --sensitive path and copies LICENSE into prequantized output. |
| examples/models/qwen3_5_moe/export.py | Adds --sensitive quantization mode and routes CUDA quantization to it. |
| examples/models/qwen3_5_moe/README.md | Adds prequantized quick-start and updates terminology/options list. |
| .github/workflows/cuda.yml | Switches CI matrix/model selection to Qwen3.6-35B-A3B-HQQ-INT4. |
| .ci/scripts/test_model_e2e.sh | Updates supported HF model ID to Qwen3.6 prequantized checkpoint. |
| .ci/scripts/export_model_artifact.sh | Updates supported HF model ID to Qwen3.6 prequantized checkpoint. |


Comment threads:

  • examples/models/qwen3_5_moe/README.md (outdated)
  • .github/workflows/cuda.yml (outdated)
  • examples/models/qwen3_6_moe/README.md (outdated)
  • examples/models/qwen3_5_moe/quantize_and_save.py
  • examples/models/qwen3_5_moe/export.py (outdated)
  • examples/models/qwen3_5_moe/export.py
  • examples/models/qwen3_5_moe/export.py
mergennachin (Contributor, Author) commented:

@claude Review this PR

claude (Bot) commented Apr 17, 2026

Claude finished @mergennachin's task in 1m 42s


Review of PR #18978

Overall this is a well-structured PR. The _quantize_sensitive implementation is clean, the argument validation is thorough, the test coverage for the new path is good, and the CI changes are a straightforward model ID swap. A few items worth addressing:


Issues

1. inv_freq removed from expected missing keys in load_prequantized_model (export.py:238)

The removal of ".inv_freq" from runtime_prefixes means that if a checkpoint is missing inv_freq keys, they'll now be flagged as unexpected weight mismatches instead of being silently accepted. This is correct if the new Qwen3.6 prequantized checkpoint includes inv_freq in its safetensors file, but it will break loading of older Qwen3.5 prequantized checkpoints that omit inv_freq. Worth confirming this is intentional.

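For context, here is a minimal sketch of the expected-missing-keys filtering described above; check_missing_keys is a hypothetical stand-in for the logic inside load_prequantized_model, and the review's runtime_prefixes are matched as key suffixes.

```python
def check_missing_keys(
    missing_keys: list[str],
    runtime_suffixes: tuple[str, ...] = (".inv_freq",),
) -> None:
    """Tolerate keys recomputed at load time (e.g. rotary inv_freq buffers);
    flag everything else as a genuine checkpoint mismatch."""
    unexpected = [k for k in missing_keys if not k.endswith(runtime_suffixes)]
    if unexpected:
        raise RuntimeError(f"Missing weight keys in checkpoint: {unexpected}")
```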

2. Import restructuring in _quantize_experts_int4 changes behavior (export.py:282-290)

The choose_qparams_affine, MappingType, and quantize_affine imports were previously conditional (only imported when use_hqq=False). They're now imported unconditionally. This is fine functionally, but note that these imports now run even in the HQQ path. Minor, not blocking.
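If the lazy import ever matters again (say, to keep the HQQ path importable without these helpers), the old shape is easy to restore; the signature below is assumed for illustration and the bodies are elided:

```python
def _quantize_experts_int4(weights, use_hqq: bool):
    # Signature assumed for illustration; bodies elided.
    if use_hqq:
        ...  # HQQ path: the affine-quantization helpers are never needed
    else:
        # Importing inside the branch restores the pre-change behavior:
        # only the non-HQQ path pays for these imports.
        from torchao.quantization.quant_primitives import (
            MappingType,
            choose_qparams_affine,
            quantize_affine,
        )

        ...
```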

3. Qwen3-0.6B removed from test_model_e2e.sh help text (test_model_e2e.sh:25)

The help text for supported models at the top of test_model_e2e.sh still lists Qwen/Qwen3-0.6B but doesn't list the new SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4. This is pre-existing (also missing Qwen3.5), so not something this PR introduced, but the error message on line 233 was updated correctly.


Suggestions (non-blocking)

4. Duplicate matrix exclusion blocks in cuda.yml

The export-model-cuda-artifact and test-model-cuda-e2e jobs have identical matrix + exclude blocks (lines 163-253 and 296-378). If these ever diverge it would be a subtle CI bug. Not actionable in this PR, but worth noting for future refactoring.

5. _quantize_sensitive hardcodes INT8 for all non-expert layers

The function always uses "8w" for attention projections, shared experts, lm_head, and embeddings. The args.qlinear / args.qembedding args are explicitly rejected when --sensitive is set. This is a deliberate design choice (the function owns the precision map), and the validation in _validate_args correctly enforces it. The docstring explains the rationale well.

If you later want per-component overrides (e.g. INT4 shared experts), this would need refactoring, but for now the all-or-nothing approach is appropriate.
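If that refactor ever happens, one shape it could take is a small per-component precision map whose defaults reproduce today's behavior; everything below, names included, is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class SensitivePrecision:
    """Hypothetical per-component overrides for _quantize_sensitive.

    The defaults reproduce the current behavior: every non-expert
    component is pinned to INT8 weight-only ("8w").
    """

    attention: str = "8w"
    shared_expert: str = "8w"
    lm_head: str = "8w"
    embedding: str = "8w"


# Example of a future override, e.g. INT4 shared experts:
# _quantize_sensitive(model, precision=SensitivePrecision(shared_expert="4w"))
```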

6. Test helper _make_sensitive_model duplicates model setup (test_quantize_roundtrip.py:88-106)

_make_sensitive_model and _make_quantized_model share the same model creation pattern (seed, construct, cast, init weights, eval). A small shared helper could reduce duplication, but it's minor and the current code is readable.

7. Consider adding --sensitive to the quantize_and_save.py usage examples in the Qwen3.5 README

The README's "Generating Prequantized Weights" section (lines 99-113) only shows --qlinear 4w --qembedding 8w --hqq. Since Qwen3.6 needs --sensitive --hqq, a second example there would help users who land on the qwen3_5_moe README from the qwen3_6_moe stub.


What looks good

  • Argument validation is extracted into _validate_args and properly rejects incompatible flag combinations (--sensitive + --qlinear, --hqq without --qlinear/--sensitive, --sensitive + --backend mlx); a sketch of this pattern follows the list.
  • The _quantize_sensitive function selectively quantizes submodules using wrapper ModuleDicts to scope quantize_model_ calls, avoiding accidental quantization of GDN internals or routing gates.
  • Test coverage: test_sensitive_gs32 and test_sensitive_gs32_hqq exercise the full roundtrip (quantize -> save -> load -> forward == quantize -> forward), and the existing corrupted checkpoint test still applies.
  • CI matrix changes are mechanical and consistent across both export-model-cuda-artifact and test-model-cuda-e2e jobs.
  • The LICENSE file is now copied into prequantized output directories.
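
As a reference for the validation pattern praised in the first bullet, here is a minimal sketch; the flag combinations come from the PR description, but the signature of _validate_args and the backend attribute are assumed:

```python
import argparse


def _validate_args(parser: argparse.ArgumentParser, args: argparse.Namespace) -> None:
    """Reject incompatible flag combinations up front, with usage text."""
    if args.sensitive and args.qlinear:
        parser.error("--sensitive owns the precision map; do not pass --qlinear")
    if args.hqq and not (args.qlinear or args.sensitive):
        parser.error("--hqq requires --qlinear or --sensitive")
    if args.sensitive and args.backend == "mlx":
        parser.error("--sensitive is not supported with --backend mlx")
```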

Copilot AI review requested due to automatic review settings April 17, 2026 23:19
mergennachin (Contributor, Author) commented:

@claude Review this PR

claude (Bot) commented Apr 17, 2026

Claude finished @mergennachin's task in 1m 43s


Review of PR #18978

  • Gather context and read changed files
  • Review export.py changes (core logic)
  • Review quantize_and_save.py changes
  • Review test changes
  • Review CI/script changes
  • Review documentation changes
  • Post review

This PR is well-structured. The _quantize_sensitive implementation is clean, the ModuleDict wrapper pattern for scoping quantize_model_ calls is the right approach, argument validation is thorough, and the test coverage for the new path is solid. The CI changes are a straightforward model ID swap with consistent matrix exclusions.

A few items to consider:


Issues

1. .inv_freq removed from expected missing keys (export.py:234-240)

The removal of ".inv_freq" from runtime_prefixes in load_prequantized_model means that if a prequantized checkpoint omits inv_freq keys, loading will now fail with a "missing weight keys" error. This is correct if the Qwen3.6 prequantized checkpoint includes inv_freq in its safetensors file, but it will break loading of any older Qwen3.5 prequantized checkpoints that omit inv_freq. Worth confirming this is intentional (and that the Qwen3.5 HuggingFace checkpoint at SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4 was regenerated or also includes inv_freq).
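
One quick way to confirm is to list the keys in the safetensors file directly; this sketch assumes a single-shard checkpoint (for sharded checkpoints, iterate over every shard):

```python
from safetensors import safe_open


def checkpoint_has_inv_freq(safetensors_path: str) -> bool:
    """Check whether a prequantized safetensors file ships inv_freq buffers."""
    with safe_open(safetensors_path, framework="pt") as f:
        return any(key.endswith(".inv_freq") for key in f.keys())
```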

2. Qwen3-0.6B removed from test-model-cuda-e2e matrix but Qwen/Qwen3-0.6B still in export-model-cuda-artifact (cuda.yml:163-253, 294-378)

The test-model-cuda-e2e matrix no longer includes Qwen/Qwen3-0.6B, but export-model-cuda-artifact still has it (line 177). This means the Qwen3-0.6B model is exported but never tested. This appears intentional (the exported artifact exists for other consumers like test-cuda-pybind), but the asymmetry is worth a comment in the YAML.

3. Comment in cuda.yml says "Qwen3.5 MoE" but now refers to Qwen3.6 (cuda.yml:194, 324)

The exclude comments still read # Qwen MoE uses a prequantized checkpoint, only tile-packed. The "Qwen MoE" wording is fine, but the original version said "Qwen3.5 MoE". After this change the model is Qwen3.6-35B-A3B-HQQ-INT4. This is minor but could cause confusion when debugging CI.


Suggestions (non-blocking)

4. _quantize_sensitive shared expert quantizes the gate linear too

shared_wrapper = nn.ModuleDict({"shared": layer.mlp.shared_expert}) wraps the entire shared_expert module and calls quantize_model_ with qlinear_config="8w" on it. If shared_expert has submodules beyond gate_up_proj and down_proj (e.g. the gate linear mentioned in the docstring as staying bf16), those would also get INT8-quantized. Looking at the architecture diagram, the shared_expert_gate is on layer.mlp (not inside shared_expert), so this is fine. But if the SwiGLU module itself has a gate linear internally, it would be caught. Worth double-checking the module structure.
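
For readers unfamiliar with the pattern, the wrapper trick works because quantization rewrites modules in place; quantize_model_ is the PR's helper, and its exact signature is assumed in this sketch:

```python
import torch.nn as nn


def quantize_only(quantize_model_, submodule: nn.Module, qlinear_config: str) -> None:
    """Scope a whole-model quantize call to a single submodule."""
    wrapper = nn.ModuleDict({"target": submodule})
    quantize_model_(wrapper, qlinear_config=qlinear_config)
    # The submodule is mutated in place, so the parent model sees the
    # quantized version; the throwaway wrapper is simply discarded.
```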

5. _test_sensitive_roundtrip duplicates the body of _test_roundtrip (test_quantize_roundtrip.py:203-237)

The two test helper methods share nearly identical logic (materialize, move to CUDA, forward, save, load, compare). The only difference is _make_quantized_model vs _make_sensitive_model. This is readable and fine for now, but if more quantization modes are added, extracting a shared _roundtrip_core(model) would reduce duplication.
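
A possible shared core, sketched under the assumption that the test class has save/load helpers taking a directory and that the model's forward returns a tensor; all non-torch names here are hypothetical:

```python
import torch


def _roundtrip_core(self, make_model):
    """Shared body: quantize -> save -> load -> forward must match
    quantize -> forward. `make_model` is either _make_quantized_model
    or _make_sensitive_model."""
    model = make_model().to("cuda").eval()
    tokens = torch.randint(0, 100, (1, 8), device="cuda")
    expected = model(tokens)
    self._save_checkpoint(model, self.tmp_dir)      # hypothetical helper
    reloaded = self._load_checkpoint(self.tmp_dir)  # hypothetical helper
    torch.testing.assert_close(reloaded(tokens), expected)
```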

6. README quick-start uses <path-to-checkpoint-dir> which is clear (README.md:43)

The earlier Copilot review flagged inconsistency with "bundle" terminology. The current text says <path-to-checkpoint-dir> which is consistent with the rest of the README. This looks resolved.

7. qwen3_6_moe/README.md says "requires --sensitive" and "--hqq is recommended" (README.md:10-12)

The wording "requires --sensitive for quantization" could be read as "you can't export without --sensitive", when really you just need --sensitive if you want the mixed-precision quantization path. Since exporting an unquantized model or using uniform --qlinear 4w is technically possible (just suboptimal), consider clarifying: "uses --sensitive for quantization" or "should use --sensitive".


What looks good

  • _validate_args extraction centralizes validation that was previously scattered in main(), and correctly rejects --sensitive + --qlinear, --hqq without --qlinear/--sensitive, and --sensitive + --backend mlx.
  • The ModuleDict wrapper pattern in _quantize_sensitive (lines 480-528) is a clean way to scope quantize_model_ calls to specific submodules without accidentally quantizing GDN internals or routing gates.
  • Test coverage: test_sensitive_gs32 and test_sensitive_gs32_hqq exercise the full roundtrip (quantize -> save -> load -> forward == quantize -> forward), and the corrupted checkpoint rejection test still applies.
  • CI matrix changes are consistent between export-model-cuda-artifact and test-model-cuda-e2e.
  • The quantize_and_save.py validation (lines 209-221) mirrors export.py's _validate_args correctly.
  • model.md rewrite is substantially more detailed and now includes concrete shapes and module names, which is genuinely useful as a developer reference.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

examples/models/qwen3_5_moe/quantize_and_save.py:5

  • The docstring/CLI description still reads as Qwen 3.5-only, but this script now supports --sensitive for models without QAT (called out as Qwen 3.6 in the README). Consider updating the wording to reflect that it can be used for both Qwen 3.5 and Qwen 3.6 (or, more generally, "Qwen MoE") to avoid confusion.
"""Quantize Qwen 3.5 MoE and save as a self-contained safetensors checkpoint.

Runs quantization once and saves the result so export.py can skip
re-quantizing via --prequantized. The output directory contains everything
needed to load the model — no reference to the original HF checkpoint required.


"""Validate CLI argument combinations."""
if args.model_id:
if args.model_dir is not None:
raise ValueError("Cannot specify model_dir when model_id is provided.")
Copilot AI Apr 17, 2026


In _validate_args, using raise ValueError(...) will surface as a stack trace rather than a consistent CLI validation error with usage text. Consider using parser.error(...) here (matching the other validation branches) so invalid combinations exit cleanly with a helpful message.

Suggested change:

```diff
-            raise ValueError("Cannot specify model_dir when model_id is provided.")
+            parser.error("Cannot specify --model-dir when --model-id is provided.")
```

Comment on lines +866 to +873:

```python
    if args.model_id:
        if args.model_dir is not None:
            raise ValueError("Cannot specify model_dir when model_id is provided.")
        from huggingface_hub import snapshot_download

        args.model_dir = snapshot_download(repo_id=args.model_id)

    if not args.prequantized and not args.model_dir and not args.tiny_test:
```
Copilot AI Apr 17, 2026


_validate_args will download a full HF snapshot when --model-id is set even if --prequantized is also provided (since the prequantized path never uses model_dir). Consider erroring on --prequantized + --model-id (or skipping the download when args.prequantized is set) to avoid an unnecessary multi-GB download in accidental/CI usage.

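One possible guard, reusing only the names visible in the quoted snippet and erring on the side of rejecting the combination outright:

```python
if args.model_id:
    if args.prequantized:
        parser.error("--model-id is unused with --prequantized; pass only one of them")
    if args.model_dir is not None:
        parser.error("Cannot specify --model-dir when --model-id is provided.")
    from huggingface_hub import snapshot_download

    args.model_dir = snapshot_download(repo_id=args.model_id)
```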
mergennachin temporarily deployed to upload-benchmark-results April 18, 2026 00:02 with GitHub Actions
