Add Megatron-Bridge PTQ quantize + export example scripts #1589

Merged
kevalmorabia97 merged 11 commits into main from kmorabia/megatron-bridge-quantize-export on Jun 2, 2026

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented Jun 1, 2026

What does this PR do?

Type of change: new example

Adds a two-step post-training quantization (PTQ) flow for Megatron-Bridge models under examples/megatron_bridge/, mirroring the Megatron-LM quantize.sh / export.sh split:

  • quantize.py — loads an HF model via Megatron-Bridge, applies ModelOpt PTQ (via a --quant_cfg alias / full config name, or a --recipe YAML), with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration, then saves a Megatron checkpoint (with ModelOpt state). Tensor / pipeline / expert parallelism are all supported, and the checkpoint can later be reloaded for further training (QAT / distillation).
  • export.py — loads the quantized Megatron checkpoint, re-shards to TP=1, and exports a HuggingFace (unified) checkpoint deployable with TensorRT-LLM / vLLM / SGLang.

Why the split? The unified HF exporter (export_mcore_gpt_to_hf) does not gather tensor-parallel-sharded weights — Megatron-LM likewise forces TP=1 during its export step. Saving a TP-sharded Megatron checkpoint first lets us calibrate at TP>1 (to fit large models) and then reload re-sharded to TP=1 for the HF export. A combined single-script flow silently produced corrupt HF checkpoints under TP>1 (collided per-rank shards), which this split avoids.
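
A toy illustration of the failure mode described above. It assumes that "collided per-rank shards" means every TP rank writes its local slice under the same tensor name, so the last writer wins. The weight, the key name, and the helper functions are hypothetical; this is not the exporter's code.

```python
# Toy model of a column-parallel weight split across TP ranks (pure Python).
FULL_WEIGHT = [[1, 2, 3, 4],
               [5, 6, 7, 8]]  # 2 rows x 4 columns


def shard_columns(weight, tp_size):
    """Split each row into tp_size equal column slices, one slice per rank."""
    cols = len(weight[0])
    assert cols % tp_size == 0, "columns must divide evenly across ranks"
    step = cols // tp_size
    return [[row[r * step:(r + 1) * step] for row in weight] for r in range(tp_size)]


def export_without_gather(shards, key="layer.weight"):
    """Each rank writes its local shard under the same key: the last writer wins."""
    state = {}
    for shard in shards:
        state[key] = shard
    return state


def export_with_gather(shards, key="layer.weight"):
    """Concatenate the column slices back into the full weight before writing."""
    rows = len(shards[0])
    full = [sum((shard[i] for shard in shards), []) for i in range(rows)]
    return {key: full}


shards = shard_columns(FULL_WEIGHT, tp_size=2)
broken = export_without_gather(shards)  # holds only the last rank's slice
fixed = export_with_gather(shards)      # matches FULL_WEIGHT again
```

Reloading the Megatron checkpoint at TP=1 sidesteps the problem because there is only one shard per tensor left to write.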

Note: This is part 1 of 4:

  • Part 1 (this PR): Megatron-Bridge quantize.py + export.py support and tests.
  • Part 2: extend distill.py for quantization-aware distillation (QAD) — load a quantized Megatron checkpoint as the student.
  • Part 3: add NVFP4 + QAD-on-pruned-checkpoint experiments to the Nemotron-3-Nano-30B-A3B tutorial.
  • Part 4: repeat the NVFP4 + QAD experiments on a non-Nemotron model.

Usage

# Step 1: quantize (TP/PP/EP supported) -> Megatron checkpoint
torchrun --nproc_per_node 2 quantize.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --quant_cfg fp8 \
    --tp_size 2 \
    --export_megatron_path /tmp/Qwen3-8B-FP8-megatron

# Step 2: export -> deployable HuggingFace (unified) checkpoint (re-shards to TP=1)
torchrun --nproc_per_node 1 export.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --megatron_path /tmp/Qwen3-8B-FP8-megatron \
    --export_unified_hf_path /tmp/Qwen3-8B-FP8-hf
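
For scripted runs, the two steps above can be chained from Python. This is a minimal sketch: the flags are the ones shown in the usage snippet, while `build_ptq_commands` is a hypothetical helper, and setting `--nproc_per_node` equal to the TP size assumes no pipeline or expert parallelism.

```python
import shlex


def build_ptq_commands(hf_model, quant_cfg, tp_size, megatron_dir, hf_dir):
    """Return (quantize_cmd, export_cmd) as argument lists for subprocess.run."""
    quantize_cmd = [
        "torchrun", "--nproc_per_node", str(tp_size), "quantize.py",
        "--hf_model_name_or_path", hf_model,
        "--quant_cfg", quant_cfg,
        "--tp_size", str(tp_size),
        "--export_megatron_path", megatron_dir,
    ]
    # The export step always runs on a single rank so the model is re-sharded to TP=1.
    export_cmd = [
        "torchrun", "--nproc_per_node", "1", "export.py",
        "--hf_model_name_or_path", hf_model,
        "--megatron_path", megatron_dir,
        "--export_unified_hf_path", hf_dir,
    ]
    return quantize_cmd, export_cmd


quantize_cmd, export_cmd = build_ptq_commands(
    "Qwen/Qwen3-8B", "fp8", 2,
    "/tmp/Qwen3-8B-FP8-megatron", "/tmp/Qwen3-8B-FP8-hf",
)
for cmd in (quantize_cmd, export_cmd):
    print(shlex.join(cmd))
    # To launch for real: subprocess.run(cmd, check=True)
```

The Megatron output path of step 1 is passed unchanged as the input path of step 2, which is the only coupling between the two scripts.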

Testing

tests/examples/megatron_bridge/test_quantize.py (validated on a 2-GPU NeMo 26.04 container):

  • test_quantize_export_and_vllm_deployment — quantize a tiny Qwen3 via a recipe at TP=2 → export.py re-shards to TP=1 → load + generate with vLLM (skipped if vLLM absent).
  • test_quantize_megatron_checkpoint_reload — quantize at TP=2 → reload the Megatron checkpoint via the bridge and assert ModelOpt quantizers were restored.

Before your PR is "Ready for review"

  • Is this change backward compatible?: N/A (new example)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new dependencies)
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: ✅

Additional Information

The Nemotron-3 tutorial update to use these scripts is intentionally not included here — it ships with the part 3 PR alongside the NVFP4 + QAD experiments.

Summary by CodeRabbit

  • Documentation

    • Expanded post-training quantization (PTQ) workflow documentation with detailed step-by-step examples and configuration guidance for the Megatron-Bridge framework.
  • New Features

    • Added quantization tool for applying PTQ to Megatron models with calibration support.
    • Added export tool for converting quantized models to a deployable format.
  • Tests

    • Added integration tests validating the complete quantization-export-deployment workflow, including inference validation.

kevalmorabia97 and others added 2 commits June 1, 2026 12:19
Add a two-step post-training quantization flow for Megatron-Bridge models
under examples/megatron_bridge/:

- quantize.py: load an HF model via Megatron-Bridge, apply ModelOpt PTQ
  (via a --quant_cfg alias / full config name, or a --recipe YAML), with
  optional KV-cache quant, weight-only, compression, and MoE expert-ratio
  calibration; save a Megatron checkpoint (with ModelOpt state). Tensor /
  pipeline / expert parallelism are all supported.
- export.py: load the quantized Megatron checkpoint, re-shard to TP=1, and
  export a HuggingFace (unified) checkpoint deployable with TensorRT-LLM /
  vLLM / SGLang. The unified exporter does not gather tensor-parallel shards
  (matching Megatron-LM, which forces TP=1 during export), hence the split.

Add tests covering quantize -> export -> vLLM deployment and quantize ->
Megatron reload with ModelOpt quantizers restored.

Follow-ups: part 2 extends distill.py for quantization-aware distillation
(QAD); part 3 adds NVFP4 + QAD-on-pruned-checkpoint experiments to the
Nemotron-3 tutorial.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 1, 2026 19:20
@kevalmorabia97 kevalmorabia97 requested a review from AAnoosheh June 1, 2026 19:20
@coderabbitai
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR implements an end-to-end post-training quantization (PTQ) and export workflow for Megatron-Bridge. Two new CLI scripts handle quantizing HuggingFace models to Megatron checkpoints and exporting them to unified HuggingFace checkpoints, with comprehensive documentation and integration tests validating the full pipeline.

Changes

Megatron-Bridge PTQ Workflow

| Layer / File(s) | Summary |
| --- | --- |
| Quantization script (quantize.py): examples/megatron_bridge/quantize.py | Loads HuggingFace models into Megatron-Bridge, resolves quantization configs from YAML recipes or CLI options (--quant_cfg, --kv_cache_quant, --weight_only), performs calibration when needed, applies quantization with optional weight compression, saves Megatron checkpoint including ModelOpt state, and optionally validates via generation sanity checks. Distributes work and cleans up with dist.setup()/dist.cleanup(). |
| Export script (export.py): examples/megatron_bridge/export.py | Loads quantized Megatron checkpoints, reshards from original MP to tensor_model_parallel_size=1 for export, constructs models via AutoBridge from HuggingFace source, synchronizes extra-module detection across distributed ranks, and calls modelopt.torch.export.export_mcore_gpt_to_hf to write unified HuggingFace checkpoints with configurable dtype and PP/EP re-sharding. Manages distributed state. |
| Integration test and vLLM helper: tests/examples/megatron_bridge/test_quantize_export.py, tests/examples/megatron_bridge/_vllm_generate.py | Creates tiny Qwen3 test model, runs quantize.py via torchrun to produce FP8 Megatron checkpoint with modelopt_state, runs export.py to generate unified HuggingFace export, verifies expected artifacts exist, and conditionally validates generation with vLLM via subprocess helper script using pytest.importorskip. |
| Documentation updates: CHANGELOG.rst, examples/megatron_bridge/README.md | CHANGELOG entry documenting the PTQ workflow; expanded README section describing the two-step quantize→export flow, supported configuration options, concrete torchrun examples for Qwen3-8B FP8 quantization, and export sharding guidance noting TP=1 constraint and PP/EP re-sharding. |

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main addition to the PR: two example scripts (quantize.py and export.py) implementing a post-training quantization workflow for Megatron-Bridge. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR contains no security anti-patterns: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code=True, no eval/exec on external input, no nosec comments, and no new dependencies. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@kevalmorabia97 kevalmorabia97 requested review from ChenhanYu, jenchen13 and yueshen2016 and removed request for AAnoosheh June 1, 2026 19:21
Make the output-path flags explicit about the checkpoint format, with a
shared 'export_' prefix matching prune_minitron.py's output_* convention:
- quantize.py: --export_megatron_path     (Megatron checkpoint output)
- export.py:   --export_unified_hf_path    (unified HuggingFace output)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-bridge-quantize-export branch from 3ebeedc to 1361339 Compare June 1, 2026 19:35
@kevalmorabia97
Collaborator Author

/claude review

Drop the standalone test_quantize_megatron_checkpoint_reload test and its
_load_megatron_quantized.py helper: the export+vLLM test already reloads the
quantized Megatron checkpoint (via export.py's load_megatron_model, at the
harder TP>1 -> TP=1 reshard) and asserts hf_quant_config.json, which is only
written when the reloaded model is quantized. Fold the modelopt_state-present
assertion into that test so quantize.py's resumable-checkpoint contract stays
covered without a second GPU job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
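
The folded-in assertions from the commit above could look roughly like the sketch below. `hf_quant_config.json` is the file named in the commit message. The rule used to find the ModelOpt state (any entry whose name contains `modelopt_state`) and the fake directory layout are assumptions for illustration, not the test's actual code.

```python
import tempfile
from pathlib import Path


def check_ptq_artifacts(megatron_dir, hf_dir):
    """Return a list of problems; an empty list means both artifacts were found."""
    problems = []
    # ASSUMPTION: the ModelOpt state lives in an entry whose name contains "modelopt_state".
    if not any("modelopt_state" in p.name for p in Path(megatron_dir).rglob("*")):
        problems.append("no modelopt_state entry in the Megatron checkpoint")
    # Per the commit message, this file is only written when the reloaded model is quantized.
    if not (Path(hf_dir) / "hf_quant_config.json").is_file():
        problems.append("hf_quant_config.json missing from the HF export")
    return problems


# Demo on a fake layout (directory names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    megatron_dir = Path(tmp) / "megatron"
    hf_dir = Path(tmp) / "hf"
    (megatron_dir / "fake_iter" / "modelopt_state").mkdir(parents=True)
    hf_dir.mkdir()
    missing_before = check_ptq_artifacts(megatron_dir, hf_dir)
    (hf_dir / "hf_quant_config.json").write_text("{}")
    missing_after = check_ptq_artifacts(megatron_dir, hf_dir)
```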
@codecov

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.64%. Comparing base (54fb87e) to head (339dd06).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1589      +/-   ##
==========================================
+ Coverage   73.22%   76.64%   +3.41%     
==========================================
  Files         478      480       +2     
  Lines       52421    53999    +1578     
==========================================
+ Hits        38387    41389    +3002     
+ Misses      14034    12610    -1424     
| Flag | Coverage Δ |
| --- | --- |
| examples | 42.89% <100.00%> (+2.08%) ⬆️ |
| gpu | 59.79% <28.57%> (+7.90%) ⬆️ |
| regression | 15.18% <28.57%> (+0.07%) ⬆️ |
| unit | 53.75% <28.57%> (+0.13%) ⬆️ |

@kevalmorabia97
Collaborator Author

/claude review

Comment thread examples/megatron_bridge/quantize.py
Comment thread examples/megatron_bridge/quantize.py Outdated
Comment thread examples/megatron_bridge/quantize.py Outdated
kevalmorabia97 and others added 2 commits June 1, 2026 13:51
- Reword the pack=False comment: it described 'backward compatibility',
  but this is a new file. Explain that calibration uses unpacked sequences
  (pack=True changes per-sample calibration statistics).
- Guard the calibration_mode toggle with try/finally so a failure mid-
  calibration doesn't leave the flag set for the later compress / save calls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
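
The try/finally guard from the commit above follows a common pattern, sketched here with a context manager. `ToyModel` and its `calibration_mode` attribute are stand-ins; how quantize.py actually stores the flag is not shown in this thread.

```python
from contextlib import contextmanager


class ToyModel:
    """Stand-in for a model object that carries a calibration flag."""

    def __init__(self):
        self.calibration_mode = False


@contextmanager
def calibration_mode(model):
    """Enable the flag for the duration of the block and always restore it."""
    previous = model.calibration_mode
    model.calibration_mode = True
    try:
        yield model
    finally:
        # Runs on success and on failure, so later compress/save calls see the old value.
        model.calibration_mode = previous


model = ToyModel()
try:
    with calibration_mode(model):
        raise RuntimeError("simulated failure mid-calibration")
except RuntimeError:
    pass
```

After the simulated failure, `model.calibration_mode` is back to `False`, which is the property the commit is protecting.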
The export+vLLM test was OOM-killed on CI (exit 137). CI has more GPU memory
than local (where it passes), so this is a host-RAM OOM: loading vLLM in-process
stacked its large footprint on top of the test process's torch / Megatron /
transformers imports. Move the vLLM load+generate into a standalone
_vllm_generate.py run as a subprocess (footprint reclaimed on exit) with a
minimal config (swap_space=0, max_num_seqs=1, enforce_eager, small KV cache).
Also rename test_quantize.py -> test_quantize_export.py since it now covers
export.py too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
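
The subprocess isolation described in the commit above can be sketched as follows. The child script here is a trivial stand-in for the real vLLM load and generate step, and `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for a helper script that would load the exported checkpoint and generate.
CHILD_SCRIPT = "print('generated: ok')"


def run_isolated(script, timeout=600):
    """Run a script in a fresh interpreter so its memory is released when it exits."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        # A negative return code means the child was killed by a signal (e.g. -9 for SIGKILL).
        raise RuntimeError(f"child exited with {result.returncode}: {result.stderr}")
    return result.stdout.strip()


output = run_isolated(CHILD_SCRIPT)
```

The parent test process never imports the heavy library itself, so its own footprint does not stack on top of the child's.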
@kevalmorabia97
Collaborator Author

/claude review changes in examples/megatron_bridge directory

Contributor

@coderabbitai coderabbitai Bot left a comment

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/examples/megatron_bridge/_vllm_generate.py`:
- Around line 43-44: The current smoke-test uses "text is not None" which can't
detect an empty string; update the assertion after llm.generate(...) to require
a non-empty truthy string from the vLLM completion: check outputs and then
assert that outputs[0].outputs[0].text is non-empty (e.g., truthy or .strip()
length > 0) so CompletionOutput.text from the llm.generate call and
SamplingParams invocation must be validated as a non-empty string rather than
merely not None.
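
A minimal sketch of the stricter check suggested above. `is_nonempty_text` is a hypothetical helper; in the real test the value would come from `outputs[0].outputs[0].text`.

```python
def is_nonempty_text(text):
    """True only for strings that contain at least one non-whitespace character."""
    return isinstance(text, str) and bool(text.strip())


empty_completion = ""
# The weaker check passes for an empty completion, which is the gap being flagged.
weak_check_passes = empty_completion is not None
strict_check_passes = is_nonempty_text(empty_completion)
```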
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c55dd76c-7b35-4b30-bfb6-97fb7137f39b

📥 Commits

Reviewing files that changed from the base of the PR and between a9a386b and 9d1f851.

📒 Files selected for processing (3)
  • examples/megatron_bridge/quantize.py
  • tests/examples/megatron_bridge/_vllm_generate.py
  • tests/examples/megatron_bridge/test_quantize_export.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/megatron_bridge/quantize.py

Comment thread tests/examples/megatron_bridge/_vllm_generate.py Outdated
The exported-checkpoint vLLM load+generate hangs on CI runners (the
quantize + export steps are already validated). Drop the out-of-process
_vllm_generate.py helper and keep the vLLM deployment check inline but
commented out, with a note on re-enabling it locally in the nemo
container. Verified the full test (incl. vLLM) passes locally (~93s).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97
Collaborator Author

/claude review


@claude claude Bot left a comment

Claude review passed — no blocking issues found. LGTM

Reviewed all 5 changed files (CHANGELOG.rst, README.md, export.py, quantize.py, test_quantize_export.py). The two new scripts cleanly extend the Megatron-Bridge example pattern (mirrors prune_minitron.py + load_mbridge_model_from_hf), the recipe / quant_cfg / KV-cache / weight-only / MoE-ratio plumbing matches the established hf_ptq.py logic, and ModelOpt-state-carrying Megatron checkpoints round-trip through save_megatron_model → load_megatron_model → export_mcore_gpt_to_hf as expected.

Findings: 0 CRITICAL, 0 IMPORTANT, 3 SUGGESTION (all inline, none blocking):

  1. export.py:134-141: broadcast source rank hardcodes world_size-1 for detecting the last PP stage. Works under Megatron-Core's default rank ordering with TP=1, but is fragile vs. the codebase's existing mpu.is_pipeline_last_stage() pattern.
  2. quantize.py:241-244: the --kv_cache_quant + --compress incompatibility check is bypassed on the recipe path. A recipe with KV-cache (e.g. the one used by the test) plus --compress silently proceeds.
  3. quantize.py:346-360: sanity-check generation runs unconditionally; the "fake-quantized model" comment no longer matches when --compress is set.

Risk: low. New example surface only, no public-API or modelopt-state schema changes; CHANGELOG entry is correct.
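
Finding 2 can be addressed by validating the resolved config rather than the CLI flag. The sketch below assumes the resolved quant_cfg is a plain dict mapping quantizer-name patterns to settings, and that KV-cache entries are the ones whose pattern names a `bmm_quantizer`. The dict contents and helper names are illustrative, not the script's code.

```python
def has_kv_cache_quant(quant_cfg):
    """True if any enabled entry targets a k/v bmm quantizer pattern."""
    for pattern, settings in quant_cfg.items():
        enabled = not isinstance(settings, dict) or settings.get("enable", True)
        if enabled and "bmm_quantizer" in pattern:
            return True
    return False


def check_compress_compat(quant_cfg, compress):
    """Reject --compress together with KV-cache quantization, whatever its source."""
    if compress and has_kv_cache_quant(quant_cfg):
        raise ValueError("--compress is not supported together with KV-cache quantization")


# Illustrative configs: one as a recipe might resolve it, one weight-only.
recipe_cfg = {
    "*weight_quantizer": {"num_bits": 8},
    "*[kv]_bmm_quantizer": {"num_bits": 8, "enable": True},
}
weight_only_cfg = {"*weight_quantizer": {"num_bits": 8}}

check_compress_compat(weight_only_cfg, compress=True)  # passes silently
```

Because the check inspects the final config, it fires the same way whether KV-cache quantization came from `--kv_cache_quant` or from a recipe.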

Comment thread examples/megatron_bridge/export.py Outdated
Comment thread examples/megatron_bridge/quantize.py Outdated
Comment thread examples/megatron_bridge/quantize.py Outdated
- export.py: detect extra modules (EAGLE/Medusa/MTP) via an all-reduce MAX over
  all ranks instead of broadcasting from a hard-coded source rank, so
  --export_extra_modules is correct under any pipeline placement / rank ordering.
- quantize.py: validate --compress vs KV-cache quantization on the resolved
  quant_cfg (detecting *[kv]_bmm_quantizer entries) so the incompatibility is
  caught for recipe-driven KV-cache configs too, not just --kv_cache_quant.
- quantize.py: skip the post-quantization generation sanity check when --compress
  is set, since the weights are then real low-bit and megatron_generate may not
  support compressed forward for every format.
- quantize.py: add a TODO for AutoQuantize (mtq.auto_quantize) support.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
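
Why an all-reduce MAX is more robust than a broadcast from a fixed rank can be shown with a single-process simulation. The per-rank flags below are made up. In a real multi-rank run the reduction would presumably go through `torch.distributed.all_reduce` with `ReduceOp.MAX`, which this sketch only imitates.

```python
def broadcast_from(flags, src_rank):
    """Every rank adopts the value held by src_rank."""
    return [flags[src_rank]] * len(flags)


def all_reduce_max(flags):
    """Every rank adopts the maximum over all ranks."""
    agreed = max(flags)
    return [agreed] * len(flags)


# Hypothetical placement: only rank 1 holds the extra module, not the last rank.
local_flags = [0, 1, 0, 0]

via_broadcast = broadcast_from(local_flags, src_rank=len(local_flags) - 1)
via_all_reduce = all_reduce_max(local_flags)
```

With the broadcast, every rank concludes that no extra module exists. With the MAX reduction, every rank agrees that one does, regardless of which rank holds it.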
…dge examples

Expert tensor parallelism is not supported for these flows (ModelOpt quantized
MoE experts don't support it and the HF exporter forces it to 1). Megatron
defaults expert_tensor_parallel_size to tensor_model_parallel_size when unset,
so it must be pinned explicitly. Remove the --etp_size CLI option from
quantize.py and distill.py and explicitly set expert_tensor_parallel_size=1 in
the model initialization of quantize.py, distill.py, and prune_minitron.py
(export.py already did).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
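
The defaulting behavior described in the commit above is why the value has to be pinned. `ToyParallelConfig` below is a hypothetical stand-in that mimics only that one rule; it is not Megatron's config class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyParallelConfig:
    tensor_model_parallel_size: int = 1
    expert_tensor_parallel_size: Optional[int] = None

    def __post_init__(self):
        # Mimics the rule from the commit message: unset ETP falls back to the TP size.
        if self.expert_tensor_parallel_size is None:
            self.expert_tensor_parallel_size = self.tensor_model_parallel_size


unset = ToyParallelConfig(tensor_model_parallel_size=2)
pinned = ToyParallelConfig(tensor_model_parallel_size=2, expert_tensor_parallel_size=1)
```

Leaving the field unset at TP=2 silently yields ETP=2, while the pinned config keeps ETP=1 as the quantize and export flows require.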
Comment thread examples/megatron_bridge/export.py Outdated
Comment thread examples/megatron_bridge/export.py Outdated
Comment thread examples/megatron_bridge/quantize.py
Comment thread tests/examples/megatron_bridge/_vllm_generate.py Outdated
Comment thread examples/megatron_bridge/README.md
…rint_args

- README: reorder sections to Post-Training Quantization -> Distillation ->
  Pruning (TOC + intro updated to match).
- export.py: expert parallelism is unsupported for HF export (the unified
  exporter doesn't gather expert-sharded weights), so remove --ep_size and pin
  expert_model_parallel_size=1; shard large models via pipeline parallelism.
- Add modelopt.torch.utils.print_args() to pretty-print an argparse.Namespace
  on the master process, and reuse it in quantize.py / distill.py / export.py /
  prune_minitron.py instead of duplicating the args-banner loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
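
A rough idea of what such an args banner does, as a self-contained sketch. `format_args` and its `is_master` parameter are hypothetical; the actual `modelopt.torch.utils.print_args()` signature and output format are not shown in this thread.

```python
import argparse


def format_args(args, is_master=True):
    """Build an aligned 'name = value' banner; non-master ranks get an empty string."""
    if not is_master:
        return ""
    items = sorted(vars(args).items())
    width = max((len(name) for name, _ in items), default=0)
    lines = ["Arguments:"]
    lines += [f"  {name:<{width}} = {value!r}" for name, value in items]
    return "\n".join(lines)


example_args = argparse.Namespace(quant_cfg="fp8", tp_size=2)
banner = format_args(example_args)
print(banner)
```

Centralizing the banner in one helper keeps the four example scripts consistent and makes the master-rank gating a single decision.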
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 2, 2026 18:26
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-02 19:36 UTC

Contributor

@jenchen13 jenchen13 left a comment

LGTM

Comment thread examples/megatron_bridge/README.md Outdated
Comment thread examples/megatron_bridge/README.md Outdated
Comment thread examples/megatron_bridge/export.py Outdated
- README: drop redundant "Examples of" prefix from the section table and fix
  "distillation a" -> "distilling a" grammar (jenchen13).
- export.py: add a comment explaining expert parallelism is not supported for
  HF export (jenchen13).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) June 2, 2026 18:40
@kevalmorabia97 kevalmorabia97 merged commit f21977a into main Jun 2, 2026
66 of 68 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/megatron-bridge-quantize-export branch June 2, 2026 19:36