Add Quantization Aware Distillation (QAD) to Megatron-Bridge example #1600

Open
kevalmorabia97 wants to merge 1 commit into main from kmorabia/megatron-bridge-qad

@kevalmorabia97 kevalmorabia97 commented Jun 2, 2026

What does this PR do?

Type of change: new example

Note: This is part 2 of 4 (builds on #1589):

  • Part 1 (Add Megatron-Bridge PTQ quantize + export example scripts #1589): Megatron-Bridge quantize.py + export.py support and tests.
  • Part 2 (this PR): extend distill.py for quantization-aware distillation (QAD) — load a quantized Megatron checkpoint as the student.
  • Part 3: add NVFP4 + QAD-on-pruned-checkpoint experiments to the Nemotron-3-Nano-30B-A3B tutorial.
  • Part 4: repeat the NVFP4 + QAD experiments on a non-Nemotron model.

This PR is stacked on #1589 (base branch kmorabia/megatron-bridge-quantize-export) — review/merge that first.

Extends examples/megatron_bridge/distill.py to initialize the student from a Megatron checkpoint (a quantized checkpoint from quantize.py, or a pruned one) via --student_megatron_path, enabling Quantization Aware Distillation (QAD):

  • --student_hf_path still builds the student architecture; --student_megatron_path supplies the (optionally quantized) weights.
  • For a quantized checkpoint, the ModelOpt quantize mode + base weights are restored onto the plain student before the knowledge-distillation conversion (restore_sharded_modelopt_state is a no-op once a model is already converted), so the distilled checkpoint stays exportable as a quantized model with export.py.

Upstream dependency / workaround: DistillationProvider.provide() has no seam to transform the student before the KD conversion, so this patches provide() at the class level (via an id()-keyed registry, because the provider proxies instance-attribute assignment to its teacher once the teacher is set). A companion Megatron-Bridge PR adds a first-class DistillationProvider.student_pre_conversion_hook; from nemo:26.06 onwards the workaround should be removed and replaced with that hook (a removal note in distill.py documents exactly how).
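
The workaround described above can be illustrated with a toy sketch. Everything here is hypothetical: `ToyProvider`, `ToyTeacher`, `build_student`, `convert_for_kd`, and `_STUDENT_HOOKS` are stand-ins invented for illustration and are not the real `DistillationProvider` API or the code in distill.py. The sketch only shows why an instance-level hook would be lost once the teacher is set, and how a class-level patch plus an `id()`-keyed registry sidesteps that.

```python
from typing import Any, Callable, Dict


class ToyTeacher:
    """Hypothetical teacher object that receives proxied attribute writes."""


class ToyProvider:
    """Hypothetical stand-in, NOT the real DistillationProvider."""

    def __init__(self) -> None:
        object.__setattr__(self, "teacher", None)

    def __setattr__(self, name: str, value: Any) -> None:
        # Once a teacher is set, instance attribute writes are forwarded to it.
        teacher = self.__dict__.get("teacher")
        if teacher is not None and name != "teacher":
            setattr(teacher, name, value)
        else:
            object.__setattr__(self, name, value)

    def build_student(self) -> Dict[str, Any]:
        return {"weights": "hf-init", "kd_converted": False}

    def convert_for_kd(self, student: Dict[str, Any]) -> Dict[str, Any]:
        student["kd_converted"] = True
        return student

    def provide(self) -> Dict[str, Any]:
        return self.convert_for_kd(self.build_student())


# Registry keyed by id(provider): it lives outside the instance, so the
# attribute proxying above cannot redirect it to the teacher.
_STUDENT_HOOKS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {}
_original_provide = ToyProvider.provide


def _patched_provide(self: ToyProvider) -> Dict[str, Any]:
    hook = _STUDENT_HOOKS.get(id(self))
    if hook is None:
        # Unregistered providers keep the original behavior.
        return _original_provide(self)
    # Transform the plain student first, then run the KD conversion.
    return self.convert_for_kd(hook(self.build_student()))


# Class-level patch: assigning on the class bypasses the instance __setattr__.
ToyProvider.provide = _patched_provide


def load_quantized(student: Dict[str, Any]) -> Dict[str, Any]:
    assert not student["kd_converted"]  # must run before the KD conversion
    student["weights"] = "quantized-megatron"
    return student


provider = ToyProvider()
provider.teacher = ToyTeacher()
provider.student_hook = load_quantized  # silently lands on the teacher
_STUDENT_HOOKS[id(provider)] = load_quantized
result = provider.provide()
print(result)
```

One caveat of this design is that `id()` values are only unique among live objects, so the registered provider has to stay referenced until `provide()` has run.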

Usage

# 1) PTQ -> quantized Megatron checkpoint (part 1)
torchrun --nproc_per_node 2 quantize.py \
    --hf_model_name_or_path Qwen/Qwen3-8B --quant_cfg fp8 --tp_size 2 \
    --export_megatron_path /tmp/Qwen3-8B-FP8-megatron

# 2) QAD: distill the quantized student from the unquantized teacher
torchrun --nproc_per_node 8 distill.py \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-8B \
    --student_megatron_path /tmp/Qwen3-8B-FP8-megatron \
    --data_paths 1.0 tokenized/data_text_document \
    --train_iters 1000 --output_dir /output/qwen3_8b_qad

# 3) export the distilled quantized checkpoint (part 1)
torchrun --nproc_per_node 1 export.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --megatron_path /output/qwen3_8b_qad/checkpoints \
    --export_unified_hf_path /tmp/qwen3_8b_qad_fp8_hf
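
For scripted runs, the three commands above could be assembled from one set of paths so that the quantized checkpoint and output directories stay consistent between steps. This is a minimal sketch: `build_qad_commands` is a hypothetical helper, not part of this PR, and it only reuses the flags and values shown in the usage above.

```python
import shlex
from typing import List


def build_qad_commands(
    hf_model: str,
    quantized_ckpt: str,
    data_prefix: str,
    output_dir: str,
    export_dir: str,
) -> List[List[str]]:
    """Return the quantize, distill, and export argv lists in order."""
    quantize = [
        "torchrun", "--nproc_per_node", "2", "quantize.py",
        "--hf_model_name_or_path", hf_model,
        "--quant_cfg", "fp8", "--tp_size", "2",
        "--export_megatron_path", quantized_ckpt,
    ]
    distill = [
        "torchrun", "--nproc_per_node", "8", "distill.py",
        "--teacher_hf_path", hf_model,
        "--student_hf_path", hf_model,  # still builds the architecture
        "--student_megatron_path", quantized_ckpt,  # supplies the weights
        "--data_paths", "1.0", data_prefix,
        "--train_iters", "1000", "--output_dir", output_dir,
    ]
    export = [
        "torchrun", "--nproc_per_node", "1", "export.py",
        "--hf_model_name_or_path", hf_model,
        "--megatron_path", f"{output_dir}/checkpoints",
        "--export_unified_hf_path", export_dir,
    ]
    return [quantize, distill, export]


commands = build_qad_commands(
    hf_model="Qwen/Qwen3-8B",
    quantized_ckpt="/tmp/Qwen3-8B-FP8-megatron",
    data_prefix="tokenized/data_text_document",
    output_dir="/output/qwen3_8b_qad",
    export_dir="/tmp/qwen3_8b_qad_fp8_hf",
)
for cmd in commands:
    # Print only; each list could be passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```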

Testing

tests/examples/megatron_bridge/test_qad.py (validated on a 2-GPU NeMo 26.04 container): quantize a tiny Qwen3 at TP=2 → QAD distill from the quantized student → export.py to a unified HF checkpoint, asserting hf_quant_config.json is written (proves the quantize mode survived QAD). Includes a commented-out vLLM deployment check, validated locally (full flow passes; vLLM loads the export as quantization=modelopt). Existing normal/Puzzletron distillation tests still pass.
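
The export assertion can be pictured roughly as follows. This is a sketch, not the code in test_qad.py: `missing_export_artifacts` is a hypothetical helper, and the file list is taken from the walkthrough's description of the test (config.json, hf_quant_config.json, *.safetensors).

```python
import tempfile
from pathlib import Path
from typing import List

# hf_quant_config.json is the file whose presence indicates the quantize mode survived QAD.
REQUIRED_FILES = ("config.json", "hf_quant_config.json")


def missing_export_artifacts(export_dir: str) -> List[str]:
    """Return the names of expected export artifacts that are absent."""
    root = Path(export_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Self-check against a fake export directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text("{}")
    Path(tmp, "model.safetensors").write_bytes(b"")
    before = missing_export_artifacts(tmp)
    Path(tmp, "hf_quant_config.json").write_text("{}")
    after = missing_export_artifacts(tmp)

print(before, after)
```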

Before your PR is "Ready for review"

  • Is this change backward compatible?: N/A (new example feature; default behavior unchanged when --student_megatron_path is not set)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new dependencies)
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: ✅

Additional Information

Depends on a companion Megatron-Bridge PR adding DistillationProvider.student_pre_conversion_hook (the upstream replacement for the class-level provide() workaround). The Nemotron-3 tutorial NVFP4 + QAD experiments ship in part 3.

Summary by CodeRabbit

  • New Features

    • Added Quantization Aware Distillation (QAD) workflow for Megatron-Bridge to improve quantized model accuracy and enable distillation from quantized checkpoints
    • Added example scripts demonstrating quantize → QAD distill → export flow
  • Documentation

    • Expanded QAD guidance, best practices for post-quantization accuracy, and runnable example commands
  • Tests

    • Added end-to-end test validating the full quantize→QAD→export workflow and artifacts

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 2, 2026 12:46
@kevalmorabia97 kevalmorabia97 requested review from yueshen2016 and removed request for a team June 2, 2026 12:46
coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f03b4cdb-fead-4055-8f53-c2c4322ad8e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed6c0c and f0d1988.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py
  • tests/examples/megatron_bridge/test_qad.py
✅ Files skipped from review due to trivial changes (1)
  • examples/megatron_bridge/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • CHANGELOG.rst
  • tests/examples/megatron_bridge/test_qad.py
  • examples/megatron_bridge/distill.py

📝 Walkthrough

Extends Megatron-Bridge distillation to support Quantization Aware Distillation (QAD) by allowing distill.py to initialize the student model from a quantized Megatron checkpoint. Adds a CLI argument, a monkeypatch-based restoration workflow, documentation, and an end-to-end test.

Changes

QAD Feature Implementation

| Layer / File(s) | Summary |
|---|---|
| QAD Documentation (CHANGELOG.rst, examples/megatron_bridge/README.md) | CHANGELOG updated to reference QAD capability via distill.py extension. README adds TIP recommending QAD for post-quantization accuracy recovery, extends distillation instructions with --student_megatron_path parameter, and introduces "Quantization Aware Distillation (QAD)" subsection with teacher/student roles and example command. |
| Megatron Student Loading Mechanism (examples/megatron_bridge/distill.py) | Implements _restore_megatron_student() to select latest checkpoint, restore ModelOpt state when present, and load weights. Creates _MEGATRON_STUDENT_CONFIG registry keyed by distill_provider id. Monkeypatches DistillationProvider.provide to restore student weights and state before KD conversion, avoiding teacher-proxy attribute leakage. |
| Distill Script CLI and Main Wiring (examples/megatron_bridge/distill.py) | Adds --student_megatron_path CLI argument. Modifies _build_model_provider() to accept load_weights flag. Main() conditionally skips HuggingFace weight loading when Megatron checkpoint provided, detects ModelOpt state presence, disables gradient_accumulation_fusion when quantization present, and registers distill_provider with checkpoint metadata for the monkeypatch to use. |
| End-to-End QAD Test (tests/examples/megatron_bridge/test_qad.py) | Orchestrates full QAD flow: generates tiny Qwen3 HF model, runs quantize.py (FP8 PTQ) producing Megatron checkpoint with modelopt_state, runs distill.py using quantized student and unquantized teacher, validates distilled checkpoint iteration marker and state preservation, then runs export.py and verifies exported HF checkpoint files (config.json, hf_quant_config.json, *.safetensors). |
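
The "select latest checkpoint" step could look roughly like the sketch below. It assumes iteration directories named `iter_<number>` under the checkpoint root, which is an assumption about the on-disk layout and not something this PR states; `latest_iteration_dir` is a hypothetical name, not the helper in distill.py.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed layout: <root>/iter_0000010, <root>/iter_0000100, ...
_ITER_DIR = re.compile(r"^iter_(\d+)$")


def latest_iteration_dir(checkpoint_root: str) -> Optional[Path]:
    """Return the iter_* subdirectory with the highest iteration, if any."""
    best: Optional[Path] = None
    best_iter = -1
    for child in Path(checkpoint_root).iterdir():
        match = _ITER_DIR.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_iter:
            best, best_iter = child, int(match.group(1))
    return best


# Self-check against a fake checkpoint root.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = latest_iteration_dir(tmp)
    for name in ("iter_0000010", "iter_0000100", "iter_0000020", "notes"):
        Path(tmp, name).mkdir()
    latest_name = latest_iteration_dir(tmp).name

print(empty_result, latest_name)
```

Comparing the parsed integers rather than the directory names keeps the selection correct even if the zero padding is not uniform.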

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • ChenhanYu
  • jenchen13
  • yueshen2016
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding Quantization Aware Distillation (QAD) support to the Megatron-Bridge example scripts. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Code follows all SECURITY.md practices: no unsafe torch.load, numpy.load, trust_remote_code defaults to False, no eval/exec/nosec, no new dependencies. |


@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch 2 times, most recently from 9ba6385 to 7787f98 Compare June 2, 2026 13:00
codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.86%. Comparing base (f21977a) to head (f0d1988).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1600      +/-   ##
==========================================
+ Coverage   77.41%   77.86%   +0.45%     
==========================================
  Files         480      480              
  Lines       52506    52506              
==========================================
+ Hits        40645    40883     +238     
+ Misses      11861    11623     -238     

| Flag | Coverage Δ |
|---|---|
| examples | 42.95% <ø> (+2.14%) ⬆️ |
| unit | 53.75% <ø> (+<0.01%) ⬆️ |
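
As a quick sanity check, the project percentages in the coverage diff can be recomputed from the Hits and Lines rows. `coverage_pct` is just an illustrative helper; the numbers are copied from the report above.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals like the report."""
    return round(100 * hits / lines, 2)


lines = 52506
base = coverage_pct(40645, lines)  # base (f21977a)
head = coverage_pct(40883, lines)  # head (f0d1988)
delta = round(head - base, 2)

# Hits and misses should add up to the total line count on both sides.
assert 40645 + 11861 == lines
assert 40883 + 11623 == lines

print(base, head, delta)
```

The line count is unchanged, so the whole gain comes from 238 previously missed lines becoming hits.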

@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 7787f98 to 4e22a59 Compare June 2, 2026 17:28
@kevalmorabia97

/claude review

@claude claude Bot left a comment

Claude review passed — no blocking issues found. LGTM

Reviewed Part 2 of the QAD-on-Megatron-Bridge series (extends distill.py with --student_megatron_path to load a quantized Megatron checkpoint as the student before the KD conversion, plus an end-to-end QAD test).

Findings: 0 CRITICAL · 0 IMPORTANT · 3 SUGGESTION (all inline)

The class-level provide() monkey-patch is well-documented as a 26.04 workaround with a clear removal path once student_pre_conversion_hook lands in 26.06; the test exercises the full quantize → QAD-distill → unified-HF-export flow and confirms modelopt_state survives. Suggestions are about local clarity, not behavior:

  • _restore_megatron_student: the strict=False rationale referencing "in-memory teacher weights" doesn't match the only call site (teacher isn't built yet at that point).
  • student_is_quantized / quantized: bool actually mean "checkpoint has any ModelOpt mode state"; safe today (prune_minitron strips its state, only quantize.py emits any), but the QAD-specific log message and gradient_accumulation_fusion = False would fire incorrectly if any other mode starts shipping state.
  • id(self)-keyed registry silently falls back to vanilla distillation if the framework ever wraps/copies the provider before provide() is called — consider asserting the lookup hit when --student_megatron_path was set so the failure is loud rather than producing an uninitialized-student run.
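
The third suggestion can be sketched as follows. The names here (`_REGISTRY`, `register_student`, `lookup_student_config`, `FakeProvider`) are hypothetical and do not mirror distill.py; the `expect_registered` argument stands in for "`--student_megatron_path` was set", which the real script would have to track itself.

```python
import copy
from typing import Any, Dict, Optional

_REGISTRY: Dict[int, Dict[str, Any]] = {}


def register_student(provider: object, checkpoint_path: str) -> None:
    _REGISTRY[id(provider)] = {"checkpoint_path": checkpoint_path}


def lookup_student_config(
    provider: object, expect_registered: bool
) -> Optional[Dict[str, Any]]:
    """Look up the provider's config; fail loudly on an unexpected miss."""
    config = _REGISTRY.get(id(provider))
    if config is None and expect_registered:
        raise RuntimeError(
            "A Megatron student checkpoint was requested but this provider is "
            "not registered; was it copied or wrapped before provide()?"
        )
    return config


class FakeProvider:
    """Placeholder object standing in for the distillation provider."""


original = FakeProvider()
register_student(original, "/tmp/Qwen3-8B-FP8-megatron")
wrapped = copy.copy(original)  # a copy is a new object with a different id()

hit = lookup_student_config(original, expect_registered=True)
try:
    lookup_student_config(wrapped, expect_registered=True)
    miss_was_loud = False
except RuntimeError:
    miss_was_loud = True

print(hit, miss_was_loud)
```

Without the check, the copied provider would take the unpatched path and train a student that never received the checkpoint weights.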

Comment thread examples/megatron_bridge/distill.py Outdated
Comment thread examples/megatron_bridge/distill.py Outdated
Comment thread examples/megatron_bridge/distill.py
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch 2 times, most recently from 70b1610 to 46056fa Compare June 2, 2026 18:35
Base automatically changed from kmorabia/megatron-bridge-quantize-export to main June 2, 2026 19:36
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 2, 2026 19:36
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 46056fa to 2ed6c0c Compare June 2, 2026 19:49
@kevalmorabia97 kevalmorabia97 removed the request for review from a team June 2, 2026 19:50
@coderabbitai coderabbitai Bot left a comment

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/megatron_bridge/distill.py`:
- Line 104: The parameter name has_modelopt_state in the function
_restore_megatron_student shadows the imported function has_modelopt_state;
rename the parameter (e.g., to modelopt_present or has_modelopt_flag) in
_restore_megatron_student, update all references inside that function to the new
parameter name, and update all call sites of _restore_megatron_student to pass
the renamed parameter variable so the imported has_modelopt_state function
remains callable.
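
The shadowing problem in this comment can be reproduced in isolation. In the sketch below, `has_modelopt_state` is a local stand-in for the imported helper (its body is made up), and `restore_shadowed` / `restore_renamed` are hypothetical reductions of `_restore_megatron_student`, not the actual function.

```python
def has_modelopt_state(checkpoint_path: str) -> bool:
    """Stand-in for the imported helper; the real one inspects a checkpoint."""
    return checkpoint_path.endswith("-quantized")


def restore_shadowed(checkpoint_path: str, has_modelopt_state: bool) -> bool:
    # Inside this function the name now refers to the bool argument, so the
    # module-level helper cannot be called from here.
    return callable(has_modelopt_state)


def restore_renamed(checkpoint_path: str, modelopt_present: bool) -> bool:
    # With the parameter renamed, the helper stays reachable and callable.
    return has_modelopt_state(checkpoint_path) == modelopt_present


print(restore_shadowed("ckpt-quantized", True))
print(restore_renamed("ckpt-quantized", True))
```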
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7cf45685-2015-4140-bfc0-7c93ab8c17e4

📥 Commits

Reviewing files that changed from the base of the PR and between f21977a and 2ed6c0c.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py
  • tests/examples/megatron_bridge/test_qad.py

Comment thread examples/megatron_bridge/distill.py Outdated
Extend examples/megatron_bridge/distill.py with --student_megatron_path to
initialize the student from a Megatron checkpoint (a quantized checkpoint from
quantize.py, or a pruned one) instead of HuggingFace weights; --student_hf_path
still builds the architecture.

For a quantized checkpoint, the ModelOpt quantize mode + base weights are
restored onto the plain student before the knowledge-distillation conversion
(restore_sharded_modelopt_state is a no-op once a model is already converted),
so the distilled checkpoint stays exportable as a quantized model with export.py.

Until nemo:26.06 (which adds DistillationProvider.student_pre_conversion_hook
upstream), this is done by patching DistillationProvider.provide at the class
level via an id()-keyed registry, since the provider proxies instance attribute
assignment to its teacher once the teacher is set. A removal note documents the
upstream-hook replacement.

Add tests/examples/megatron_bridge/test_qad.py covering quantize -> QAD ->
export, asserting hf_quant_config.json is written so the distilled checkpoint
stays exportable as a quantized model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-qad branch from 2ed6c0c to f0d1988 Compare June 2, 2026 20:09