Add Quantization Aware Distillation (QAD) to Megatron-Bridge example by kevalmorabia97 · Pull Request #1600 · NVIDIA/Model-Optimizer

kevalmorabia97 · 2026-06-02T12:46:25Z

What does this PR do?

Type of change: new example

Note: This is part 2 of 4 (builds on #1589):

Part 1 (Add Megatron-Bridge PTQ quantize + export example scripts #1589): Megatron-Bridge quantize.py + export.py support and tests.

Part 2 (this PR): extend distill.py for quantization-aware distillation (QAD) — load a quantized Megatron checkpoint as the student.

Part 3: add NVFP4 + QAD-on-pruned-checkpoint experiments to the Nemotron-3-Nano-30B-A3B tutorial.

Part 4: repeat the NVFP4 + QAD experiments on a non-Nemotron model.

This PR is stacked on #1589 (base branch kmorabia/megatron-bridge-quantize-export) — review/merge that first.

Extends examples/megatron_bridge/distill.py to initialize the student from a Megatron checkpoint (a quantized checkpoint from quantize.py, or a pruned one) via --student_megatron_path, enabling Quantization Aware Distillation (QAD):

- --student_hf_path still builds the student architecture; --student_megatron_path supplies the (optionally quantized) weights.
- For a quantized checkpoint, the ModelOpt quantize mode + base weights are restored onto the plain student before the knowledge-distillation conversion (restore_sharded_modelopt_state is a no-op once a model is already converted), so the distilled checkpoint stays exportable as a quantized model with export.py.

Upstream dependency / workaround: DistillationProvider.provide() has no seam to transform the student before the KD conversion, so this PR patches provide() at the class level. The per-provider settings live in an id()-keyed registry, because the provider proxies instance-attribute assignment to its teacher once the teacher is set, so they cannot be stored on the provider itself. A companion Megatron-Bridge PR adds a first-class DistillationProvider.student_pre_conversion_hook; from nemo:26.06 onwards the workaround should be removed and replaced with that hook (a removal note in distill.py documents exactly how).
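
For readers following the thread, here is a minimal sketch of the pattern described above: a class-level patch of provide() that reads its settings from an id()-keyed registry. It does not reproduce distill.py. ToyProvider, build_student, _STUDENT_CKPT and the dict-based "student" are hypothetical stand-ins, and the real DistillationProvider API is not shown in this PR description.

```python
import functools
import types


class ToyProvider:
    """Hypothetical provider that forwards attribute writes to its teacher."""

    def __init__(self):
        object.__setattr__(self, "teacher", None)

    def __setattr__(self, name, value):
        teacher = self.__dict__.get("teacher")
        if teacher is not None and name != "teacher":
            setattr(teacher, name, value)  # the write lands on the teacher
        else:
            object.__setattr__(self, name, value)

    def build_student(self):
        return {"weights": "random-init", "kd_converted": False}

    def provide(self):
        student = self.build_student()
        student["kd_converted"] = True  # stand-in for the KD conversion
        return student


# Registry keyed by id(provider): instance attributes cannot be used because
# writes on the provider are proxied to the teacher (see __setattr__ above).
_STUDENT_CKPT = {}

_original_provide = ToyProvider.provide


@functools.wraps(_original_provide)
def _patched_provide(self):
    ckpt = _STUDENT_CKPT.get(id(self))
    if ckpt is None:
        return _original_provide(self)  # unregistered: vanilla behaviour
    student = self.build_student()
    student["weights"] = f"restored-from:{ckpt}"  # restore before conversion
    student["kd_converted"] = True
    return student


ToyProvider.provide = _patched_provide  # class-level patch

provider = ToyProvider()
provider.teacher = types.SimpleNamespace()
provider.ckpt = "/tmp/ignored"  # proxied: ends up on the teacher
_STUDENT_CKPT[id(provider)] = "/tmp/student"
result = provider.provide()
```

The registry entry is only valid while the registered provider object is alive and is the same object that provide() is later called on, which is the fragility the review comments further down discuss.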

Usage

# 1) PTQ -> quantized Megatron checkpoint (part 1)
torchrun --nproc_per_node 2 quantize.py \
    --hf_model_name_or_path Qwen/Qwen3-8B --quant_cfg fp8 --tp_size 2 \
    --export_megatron_path /tmp/Qwen3-8B-FP8-megatron

# 2) QAD: distill the quantized student from the unquantized teacher
torchrun --nproc_per_node 8 distill.py \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-8B \
    --student_megatron_path /tmp/Qwen3-8B-FP8-megatron \
    --data_paths 1.0 tokenized/data_text_document \
    --train_iters 1000 --output_dir /output/qwen3_8b_qad

# 3) export the distilled quantized checkpoint (part 1)
torchrun --nproc_per_node 1 export.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --megatron_path /output/qwen3_8b_qad/checkpoints \
    --export_unified_hf_path /tmp/qwen3_8b_qad_fp8_hf

Testing

tests/examples/megatron_bridge/test_qad.py (validated on a 2-GPU NeMo 26.04 container) runs the full flow: quantize a tiny Qwen3 at TP=2 → QAD distill with the quantized checkpoint as the student → export.py to a unified HF checkpoint. It asserts that hf_quant_config.json is written, which proves the quantize mode survived QAD. The test also includes a commented-out vLLM deployment check that was validated locally (the full flow passes; vLLM loads the export as quantization=modelopt). Existing normal/Puzzletron distillation tests still pass.
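
As an illustration of the final assertion, a sketch of an export-artifact check might look like the following. The helper name missing_export_artifacts is hypothetical and the real test may be structured differently; the file names are the ones listed for the exported HF checkpoint in this PR (config.json, hf_quant_config.json, *.safetensors).

```python
import tempfile
from pathlib import Path

# File names taken from the PR text; the helper itself is illustrative only.
REQUIRED_FILES = ("config.json", "hf_quant_config.json")


def missing_export_artifacts(export_dir):
    """Return the expected export artifacts that are absent from export_dir."""
    root = Path(export_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Demo: an export that lost its quantization config is reported as incomplete.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    demo_missing = missing_export_artifacts(tmp)
```

In a pytest-style test this would reduce to asserting that the returned list is empty.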

Before your PR is "Ready for review"

Is this change backward compatible?: N/A (new example feature; default behavior unchanged when --student_megatron_path is not set)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new dependencies)
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅
Did you get Claude approval on this PR?: ✅

Additional Information

Depends on a companion Megatron-Bridge PR adding DistillationProvider.student_pre_conversion_hook (the upstream replacement for the class-level provide() workaround). The Nemotron-3 tutorial NVFP4 + QAD experiments ship in part 3.

Summary by CodeRabbit

New Features
- Added Quantization Aware Distillation (QAD) workflow for Megatron-Bridge to improve quantized model accuracy and enable distillation from quantized checkpoints
- Added example scripts demonstrating quantize → QAD distill → export flow
Documentation
- Expanded QAD guidance, best practices for post-quantization accuracy, and runnable example commands
Tests
- Added end-to-end test validating the full quantize→QAD→export workflow and artifacts

coderabbitai · 2026-06-02T12:46:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f03b4cdb-fead-4055-8f53-c2c4322ad8e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed6c0c and f0d1988.

📒 Files selected for processing (4)

CHANGELOG.rst
examples/megatron_bridge/README.md
examples/megatron_bridge/distill.py
tests/examples/megatron_bridge/test_qad.py

✅ Files skipped from review due to trivial changes (1)

examples/megatron_bridge/README.md

🚧 Files skipped from review as they are similar to previous changes (3)

CHANGELOG.rst
tests/examples/megatron_bridge/test_qad.py
examples/megatron_bridge/distill.py

Walkthrough

Extends Megatron-Bridge distillation to support Quantization Aware Distillation (QAD) by allowing distill.py to initialize the student model from a quantized Megatron checkpoint. Adds a CLI argument, a monkeypatch-based restoration workflow, documentation, and an end-to-end test.

Changes

QAD Feature Implementation

| Layer / File(s) | Summary |
| --- | --- |
| QAD Documentation `CHANGELOG.rst`, `examples/megatron_bridge/README.md` | CHANGELOG updated to reference QAD capability via distill.py extension. README adds TIP recommending QAD for post-quantization accuracy recovery, extends distillation instructions with `--student_megatron_path` parameter, and introduces "Quantization Aware Distillation (QAD)" subsection with teacher/student roles and example command. |
| Megatron Student Loading Mechanism `examples/megatron_bridge/distill.py` | Implements _restore_megatron_student() to select latest checkpoint, restore ModelOpt state when present, and load weights. Creates _MEGATRON_STUDENT_CONFIG registry keyed by distill_provider id. Monkeypatches DistillationProvider.provide to restore student weights and state before KD conversion, avoiding teacher-proxy attribute leakage. |
| Distill Script CLI and Main Wiring `examples/megatron_bridge/distill.py` | Adds `--student_megatron_path` CLI argument. Modifies `_build_model_provider()` to accept `load_weights` flag. Main() conditionally skips HuggingFace weight loading when Megatron checkpoint provided, detects ModelOpt state presence, disables gradient_accumulation_fusion when quantization present, and registers distill_provider with checkpoint metadata for the monkeypatch to use. |
| End-to-End QAD Test `tests/examples/megatron_bridge/test_qad.py` | Orchestrates full QAD flow: generates tiny Qwen3 HF model, runs quantize.py (FP8 PTQ) producing Megatron checkpoint with modelopt_state, runs distill.py using quantized student and unquantized teacher, validates distilled checkpoint iteration marker and state preservation, then runs export.py and verifies exported HF checkpoint files (config.json, hf_quant_config.json, *.safetensors). |
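
The "select latest checkpoint" step in the summary above could be sketched roughly as follows. This assumes the checkpoint root contains one directory per saved iteration named iter_<number>, which is the layout commonly used by Megatron-style checkpoints; the actual logic in _restore_megatron_student() is not shown in this thread and may instead rely on a tracker file. The function name latest_iteration_dir is hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed directory naming: iter_0000100, iter_0000200, ...
_ITER_DIR = re.compile(r"^iter_(\d+)$")


def latest_iteration_dir(checkpoint_root):
    """Return the iter_<N> subdirectory with the largest N, or None."""
    candidates = []
    for child in Path(checkpoint_root).iterdir():
        match = _ITER_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by the parsed integer, not by string, so widths may differ.
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("iter_0000005", "iter_0000100", "notes"):
        (Path(tmp) / name).mkdir()
    demo_name = latest_iteration_dir(tmp).name
```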

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

ChenhanYu
jenchen13
yueshen2016

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding Quantization Aware Distillation (QAD) support to the Megatron-Bridge example scripts. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Code follows all SECURITY.md practices: no unsafe torch.load, numpy.load, trust_remote_code defaults to False, no eval/exec/nosec, no new dependencies. |


codecov · 2026-06-02T13:12:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.86%. Comparing base (f21977a) to head (f0d1988).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1600      +/-   ##
==========================================
+ Coverage   77.41%   77.86%   +0.45%     
==========================================
  Files         480      480              
  Lines       52506    52506              
==========================================
+ Hits        40645    40883     +238     
+ Misses      11861    11623     -238
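
The percentages in the diff above follow directly from the hit and miss counts. A quick check, using only the numbers reported by Codecov (the helper name coverage_pct is illustrative):

```python
def coverage_pct(hits, misses):
    """Line coverage as a percentage, rounded to two decimals."""
    return round(100.0 * hits / (hits + misses), 2)


base = coverage_pct(40645, 11861)  # main (f21977a)
head = coverage_pct(40883, 11623)  # this PR (f0d1988)
delta = round(head - base, 2)

print(base, head, delta)  # 77.41 77.86 0.45
```

Both sides total 52506 lines, so the +0.45% comes entirely from 238 lines moving from missed to hit.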

| Flag | Coverage Δ |
| --- | --- |
| examples | `42.95% <ø> (+2.14%)` ⬆️ |
| unit | `53.75% <ø> (+<0.01%)` ⬆️ |


kevalmorabia97 · 2026-06-02T17:37:20Z

/claude review

claude

Claude review passed — no blocking issues found. LGTM

Reviewed Part 2 of the QAD-on-Megatron-Bridge series (extends distill.py with --student_megatron_path to load a quantized Megatron checkpoint as the student before the KD conversion, plus an end-to-end QAD test).

Findings: 0 CRITICAL · 0 IMPORTANT · 3 SUGGESTION (all inline)

The class-level provide() monkey-patch is well-documented as a 26.04 workaround with a clear removal path once student_pre_conversion_hook lands in 26.06; the test exercises the full quantize → QAD-distill → unified-HF-export flow and confirms modelopt_state survives. Suggestions are about local clarity, not behavior:

- _restore_megatron_student: the strict=False rationale referencing "in-memory teacher weights" doesn't match the only call site (teacher isn't built yet at that point).
- student_is_quantized / quantized: bool actually mean "checkpoint has any ModelOpt mode state"; safe today (prune_minitron strips its state, only quantize.py emits any), but the QAD-specific log message and gradient_accumulation_fusion = False would fire incorrectly if any other mode starts shipping state.
- id(self)-keyed registry silently falls back to vanilla distillation if the framework ever wraps/copies the provider before provide() is called. Consider asserting the lookup hit when --student_megatron_path was set so the failure is loud rather than producing an uninitialized-student run.
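
The third suggestion can be illustrated with a small sketch. The names _REGISTRY, lookup_student_config and Provider are hypothetical and this is not the code in distill.py; it only shows how a copied provider misses an id()-keyed lookup and how a guard turns that miss into an explicit error.

```python
import copy

# Hypothetical registry: id(provider) -> student checkpoint settings.
_REGISTRY = {}


class Provider:
    """Placeholder for a provider object; carries no behaviour."""


def lookup_student_config(provider, student_path_was_set):
    """Fetch registered settings, failing loudly on an unexpected miss."""
    config = _REGISTRY.get(id(provider))
    if config is None and student_path_was_set:
        raise RuntimeError(
            "student checkpoint was requested but this provider instance "
            "is not registered (was it copied or wrapped?)"
        )
    return config


original = Provider()
_REGISTRY[id(original)] = {"path": "/tmp/student"}

# A copy is a different object with a different id, so the lookup misses.
clone = copy.copy(original)

hit = lookup_student_config(original, True)
try:
    lookup_student_config(clone, True)
    miss_error = None
except RuntimeError as exc:
    miss_error = str(exc)
```

Without the guard, the clone would quietly take the vanilla path, which is the silent fallback the review warns about.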

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/megatron_bridge/distill.py`:
- Line 104: The parameter name has_modelopt_state in the function
_restore_megatron_student shadows the imported function has_modelopt_state;
rename the parameter (e.g., to modelopt_present or has_modelopt_flag) in
_restore_megatron_student, update all references inside that function to the new
parameter name, and update all call sites of _restore_megatron_student to pass
the renamed parameter variable so the imported has_modelopt_state function
remains callable.
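
To make the shadowing concern concrete, here is a small standalone sketch. The module-level has_modelopt_state below is a hypothetical stand-in for the imported helper (its real signature is not shown in this thread), and restore_shadowed / restore_renamed are illustrative names, not functions from distill.py.

```python
def has_modelopt_state(path):
    """Hypothetical stand-in for the imported helper."""
    return path.endswith("modelopt_state")


def restore_shadowed(path, has_modelopt_state):
    # Inside this body the name refers to the bool argument, so the
    # module-level function can no longer be reached by that name.
    try:
        return has_modelopt_state(path)
    except TypeError as exc:
        return f"TypeError: {exc}"


def restore_renamed(path, modelopt_present):
    # With a distinct parameter name the helper stays callable.
    return modelopt_present and has_modelopt_state(path)


shadowed_result = restore_shadowed("ckpt/modelopt_state", True)
renamed_result = restore_renamed("ckpt/modelopt_state", True)
```

The shadowed variant only fails if the body actually calls the helper, which is why this kind of issue can sit unnoticed until a later edit.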


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7cf45685-2015-4140-bfc0-7c93ab8c17e4

📥 Commits

Reviewing files that changed from the base of the PR and between f21977a and 2ed6c0c.

📒 Files selected for processing (4)

CHANGELOG.rst
examples/megatron_bridge/README.md
examples/megatron_bridge/distill.py
tests/examples/megatron_bridge/test_qad.py

Extend examples/megatron_bridge/distill.py with --student_megatron_path to initialize the student from a Megatron checkpoint (a quantized checkpoint from quantize.py, or a pruned one) instead of HuggingFace weights; --student_hf_path still builds the architecture.

For a quantized checkpoint, the ModelOpt quantize mode + base weights are restored onto the plain student before the knowledge-distillation conversion (restore_sharded_modelopt_state is a no-op once a model is already converted), so the distilled checkpoint stays exportable as a quantized model with export.py.

Until nemo:26.06 (which adds DistillationProvider.student_pre_conversion_hook upstream), this is done by patching DistillationProvider.provide at the class level via an id()-keyed registry, since the provider proxies instance attribute assignment to its teacher once the teacher is set. A removal note documents the upstream-hook replacement.

Add tests/examples/megatron_bridge/test_qad.py covering quantize -> QAD -> export, asserting hf_quant_config.json is written so the distilled checkpoint stays exportable as a quantized model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

kevalmorabia97 requested a review from a team as a code owner June 2, 2026 12:46

kevalmorabia97 requested review from yueshen2016 and removed request for a team June 2, 2026 12:46

kevalmorabia97 requested review from AAnoosheh, ChenhanYu and jenchen13 June 2, 2026 12:47

kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch 2 times, most recently from 9ba6385 to 7787f98 Compare June 2, 2026 13:00

kevalmorabia97 mentioned this pull request Jun 2, 2026: Add NVFP4 + QAD to the Nemotron-3-Nano-30B-A3B tutorial #1601 (Draft)

AAnoosheh approved these changes Jun 2, 2026

kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 7787f98 to 4e22a59 Compare June 2, 2026 17:28

claude Bot approved these changes Jun 2, 2026

claude Bot reviewed Jun 2, 2026

Comment thread examples/megatron_bridge/distill.py Outdated

claude Bot reviewed Jun 2, 2026

Comment thread examples/megatron_bridge/distill.py Outdated

claude Bot reviewed Jun 2, 2026

Comment thread examples/megatron_bridge/distill.py

kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch 2 times, most recently from 70b1610 to 46056fa Compare June 2, 2026 18:35

coderabbitai Bot approved these changes Jun 2, 2026

Base automatically changed from kmorabia/megatron-bridge-quantize-export to main June 2, 2026 19:36

kevalmorabia97 requested a review from a team as a code owner June 2, 2026 19:36

kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 46056fa to 2ed6c0c Compare June 2, 2026 19:49

kevalmorabia97 removed the request for review from a team June 2, 2026 19:50

coderabbitai Bot reviewed Jun 2, 2026

Comment thread examples/megatron_bridge/distill.py Outdated

kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 2ed6c0c to f0d1988 Compare June 2, 2026 20:09

kevalmorabia97 wants to merge 1 commit into main from kmorabia/megatron-bridge-qad
