feat(lora): save/restore LoRA config in checkpoint metadata by RexBearIU · Pull Request #4269 · AI-Hypercomputer/maxtext

RexBearIU · 2026-06-25T10:35:32Z

Description

This PR implements native serialization of LoRA configuration parameters (lora_rank, lora_alpha) in standard Orbax _CHECKPOINT_METADATA files, and automatically restores them during checkpoint-to-Hugging Face conversion.

Why is this change being made?

Previously, users had to manually supply matching lora.lora_rank and lora.lora_alpha parameters when converting MaxText checkpoints to Hugging Face format. Storing them in Orbax metadata makes the conversion seamless and error-free (resolves @igorts-git's request in #3970).

Key Implementation Details

Serialization: In save_checkpoint (checkpointing.py), we save the active config.lora block under the "lora" key in Orbax's custom_metadata when a LoRA rank is specified.
Restoration: In main (to_huggingface.py), sync_lora_metadata reads the custom metadata from lora_restore_path via ocp.StandardCheckpointer and overrides active config parameters during conversion.
Fail-Fast Safety: The override is scoped strictly to the conversion path, so SFT training paths remain strict and fail fast on any configuration mismatch.
Test Import Refactoring: Refactored hf_checkpoint_conversion_test.py to move dynamically loaded inline imports to top-level imports, and removed the json import since the JSON string is written directly.
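
For readers skimming the thread, here is a minimal sketch of the restoration idea, assuming the custom metadata has already been read into a plain dictionary. `LoraSettings` and `apply_lora_metadata` are hypothetical stand-ins for illustration only; they are not the PR's `sync_lora_metadata` or MaxText's config classes, and no Orbax calls are shown.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class LoraSettings:
  """Hypothetical stand-in for the LoRA block of a config."""

  lora_rank: int = 0
  lora_alpha: float = 0.0


def apply_lora_metadata(
    settings: LoraSettings, custom_metadata: Optional[Mapping[str, Any]]
) -> LoraSettings:
  """Returns settings with rank/alpha overridden by the saved "lora" entry, if any."""
  saved = (custom_metadata or {}).get("lora")
  if not saved:
    return settings  # Nothing was saved; keep the user-supplied values.
  # Only copy the two keys this sketch cares about; ignore anything else.
  overrides = {k: saved[k] for k in ("lora_rank", "lora_alpha") if k in saved}
  return replace(settings, **overrides)


user_settings = LoraSettings(lora_rank=4, lora_alpha=8.0)
saved_metadata = {"lora": {"lora_rank": 16, "lora_alpha": 32.0}}
synced = apply_lora_metadata(user_settings, saved_metadata)
```

In this sketch the saved values win over the user-supplied ones, which matches the "overrides active config parameters" wording above; a stricter variant could raise on mismatch instead.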

BUGS: #3970

Tests

We verified the implementation by running the full test suite and the individual unit tests:

Added/Updated Unit Tests:
- SyncLoRAMetadataTest in tests/unit/hf_checkpoint_conversion_test.py to verify the auto-resolving mechanism during Hugging Face conversion.
Command to run:
python tests/unit/hf_checkpoint_conversion_test.py
All tests pass successfully.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-25T10:39:20Z

Codecov Report

❌ Patch coverage is 75.00000% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/utils/lora_utils.py | 77.77% | 4 Missing and 2 partials ⚠️ |
| ...rc/maxtext/checkpoint_conversion/to_huggingface.py | 0.00% | 3 Missing ⚠️ |

shralex

Thanks Jackie! A significant thing missing in this PR is using the metadata file on checkpoint restore path.

RexBearIU · 2026-06-25T15:14:08Z

Hi @shralex, thank you for the feedback!

I have fully addressed your comments with the following changes:

Checkpoint Restore Auto-Sync: Implemented automatic LoRA rank and alpha syncing from the Orbax native _CHECKPOINT_METADATA file's custom_metadata on the training/SFT restore path (restore_lora_from_path in lora_utils.py). Now, training/SFT runs resuming or restoring from a LoRA checkpoint will automatically detect, sync, and apply the correct LoRA rank and alpha parameters from the saved checkpoint metadata.
Unified Native Orbax Metadata: Switched from creating and loading a custom lora_config.json to using Orbax's native custom_metadata dictionary inside _CHECKPOINT_METADATA. This conforms perfectly to standard checkpointing conventions without introducing any custom, out-of-band config files.
Path Resilience: Enhanced metadata resolution to support paths pointing to either the step directory directly (e.g., .../checkpoints/1000/) or to any nested parameter subfolders (e.g., .../checkpoints/1000/items/), resolving parent paths gracefully.
Expanded Unit Tests & Linting: Added and modified tests (SyncLoRAMetadataTest and SyncLoRAMetadataTrainingTest in both test suites) covering both conversion and training/SFT-side auto-restore flows. Verified everything compiles, passes all pre-commit formatting/styling, and is 100% green!
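
To illustrate the path-resilience point above, here is a rough sketch of walking up from a given path until the metadata file is found. It assumes a local filesystem and plain `pathlib`; the helper name `find_checkpoint_metadata` and the `max_levels` parameter are hypothetical, and the actual PR code may resolve paths differently (for example for remote storage).

```python
from pathlib import Path
from typing import Optional

METADATA_FILENAME = "_CHECKPOINT_METADATA"


def find_checkpoint_metadata(path: str, max_levels: int = 2) -> Optional[Path]:
  """Looks for the metadata file in `path`, then in up to `max_levels` parents."""
  current = Path(path)
  for _ in range(max_levels + 1):
    candidate = current / METADATA_FILENAME
    if candidate.is_file():
      return candidate
    if current.parent == current:
      break  # Reached the filesystem root without finding the file.
    current = current.parent
  return None
```

With this shape, both `.../checkpoints/1000/` and `.../checkpoints/1000/items/` resolve to the same file in the step directory, and a path with no metadata nearby yields `None` rather than an exception.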

Please let me know if you would like any other enhancements!

xibinliu · 2026-06-26T16:52:04Z

> Thanks Jackie! A significant thing missing in this PR is using the metadata file on checkpoint restore path.

Added the logic to reuse the metadata on checkpoint restore.

shralex · 2026-06-27T15:30:07Z

    )
    return trainer

+  sync_lora_metadata(mt_config)


Let's move this down, to after we have verified that LoRA is enabled.

Done! I have moved sync_lora_metadata(config) down in to_huggingface.py so that it is called after we verify that LoRA is indeed enabled in the model configuration.

shralex · 2026-06-27T15:31:30Z


 def restore_lora_from_path(trainer: Any, mt_config: pyconfig.HyperParameters) -> Any:
  """Restores LoRA parameter weights from an external Orbax checkpoint for a fresh run."""
  lora_restore_path = mt_config.lora.lora_restore_path


Can we add a check here:

  if not lora_restore_path:
    return trainer  # No restore requested; exit cleanly without error

(Otherwise we're relying on callers to call this function only when this path is set.)

Done! Added the guard check at the beginning of restore_lora_from_path so it returns the trainer early and exits cleanly if lora_restore_path is not set.

shralex

This version reverts Xibin's previous version where sync_lora_metadata was in lora_utils. We should move it back there and use it not just on checkpoint conversion but also before model creation.

shralex · 2026-06-29T15:29:25Z

      save_args_composite["iter"] = GrainCheckpointSave(item=grain_iters_to_save)

+  custom_metadata = None
+  if config and config.lora.lora_rank > 0:


Let's check that config contains "lora" before accessing config.lora.lora_rank:

  if config and hasattr(config, "lora") and config.lora:
    lora_rank = getattr(config.lora, "lora_rank", 0)
    if lora_rank > 0 and hasattr(config.lora, "model_dump"):
      custom_metadata = {"lora": config.lora.model_dump()}

Done! Added checks to ensure config has the lora attribute and is not None before attempting to access lora_rank or model_dump.
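
As a self-contained illustration of the guarded check discussed in this thread, the sketch below wraps the same logic in a helper and exercises it with fake objects. `build_lora_custom_metadata` and `FakeLora` are hypothetical names introduced here; `FakeLora.model_dump` only imitates the method the suggestion checks for and is not MaxText's config class.

```python
from types import SimpleNamespace
from typing import Any, Optional


class FakeLora:
  """Hypothetical stand-in for a LoRA config block exposing model_dump()."""

  def __init__(self, lora_rank: int, lora_alpha: float = 32.0):
    self.lora_rank = lora_rank
    self.lora_alpha = lora_alpha

  def model_dump(self) -> dict:
    return {"lora_rank": self.lora_rank, "lora_alpha": self.lora_alpha}


def build_lora_custom_metadata(config: Any) -> Optional[dict]:
  """Returns {"lora": ...} only when a LoRA block with a positive rank is present."""
  lora = getattr(config, "lora", None) if config else None
  if not lora:
    return None  # No config, or no LoRA block on it.
  lora_rank = getattr(lora, "lora_rank", 0)
  if lora_rank > 0 and hasattr(lora, "model_dump"):
    return {"lora": lora.model_dump()}
  return None


with_lora = build_lora_custom_metadata(SimpleNamespace(lora=FakeLora(8)))
without_lora = build_lora_custom_metadata(SimpleNamespace())
```

Returning `None` in every non-LoRA case keeps the caller simple: it can pass the result straight through without re-checking the config.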

shralex · 2026-06-29T15:33:09Z

      replicator_error_handler(config)
-      return checkpoint_manager.save(step, args=Composite(state=checkpoint_args), force=force)
+      return checkpoint_manager.save(
+          step, args=Composite(state=checkpoint_args), force=force, custom_metadata=custom_metadata


EmergencyCheckpointManager and EmergencyReplicatorCheckpointManager do not accept a custom_metadata argument. Let's leave this argument out here, and open a bug to add this support.

Done! Omitted passing the custom_metadata argument when calling .save() on EmergencyCheckpointManager or EmergencyReplicatorCheckpointManager.

I've filed bug b/529671188 asking the Orbax team to add custom_metadata support to EmergencyCheckpointManager and EmergencyReplicatorCheckpointManager.
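
One way to picture the resulting behavior is the sketch below, where the metadata is forwarded only to a manager whose save() accepts it. Both manager classes and `save_with_optional_metadata` are hypothetical stand-ins written for this example; they do not reproduce the Orbax classes or their real signatures.

```python
from typing import Any, Optional


class StandardManager:
  """Hypothetical manager whose save() accepts custom_metadata."""

  def save(self, step: int, *, force: bool = False, custom_metadata: Optional[dict] = None) -> dict:
    return {"step": step, "force": force, "custom_metadata": custom_metadata}


class EmergencyManager:
  """Hypothetical manager whose save() does not accept custom_metadata."""

  def save(self, step: int, *, force: bool = False) -> dict:
    return {"step": step, "force": force}


def save_with_optional_metadata(
    manager: Any, step: int, custom_metadata: Optional[dict] = None, force: bool = False
) -> dict:
  """Forwards custom_metadata only to managers known to support it."""
  kwargs: dict = {"force": force}
  if custom_metadata is not None and not isinstance(manager, EmergencyManager):
    kwargs["custom_metadata"] = custom_metadata
  return manager.save(step, **kwargs)


metadata = {"lora": {"lora_rank": 8}}
standard_result = save_with_optional_metadata(StandardManager(), 100, metadata)
emergency_result = save_with_optional_metadata(EmergencyManager(), 100, metadata)
```

The trade-off is that checkpoints written through the emergency path carry no LoRA metadata in this sketch, so a restore from them would still need the rank and alpha supplied by the user.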

RexBearIU · 2026-06-30T10:05:08Z

> This version reverts Xibin's previous version where sync_lora_metadata was in lora_utils. We should move it back there and use it not just on checkpoint conversion but also before model creation.

Done. Moved sync_lora_metadata back to lora_utils.py (running during checkpoint restore) with a clean, formatting-free diff.

Co-authored-by: Xibin Liu <xibin@google.com>

RexBearIU mentioned this pull request Jun 25, 2026

docs: QLoRA Documentation and Notebooks #3970

Merged

4 tasks

shralex requested changes Jun 25, 2026

RexBearIU changed the title ~~feat(lora): serialize and load lora_config.json sidecar metadata~~ feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata Jun 25, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 187905b to cd17578 Compare June 25, 2026 15:13

shralex reviewed Jun 25, 2026

Comment thread src/maxtext/checkpoint_conversion/to_huggingface.py Outdated

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from cd17578 to 1b15640 Compare June 25, 2026 16:02

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 1b15640 to ae44adc Compare June 25, 2026 16:11

igorts-git reviewed Jun 25, 2026

Comment thread tests/unit/hf_checkpoint_conversion_test.py Outdated

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from 69c78a7 to a701719 Compare June 26, 2026 02:50

RexBearIU changed the title ~~feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata~~ feat(lora): save/restore LoRA config in checkpoint metadata Jun 26, 2026

igorts-git approved these changes Jun 26, 2026

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch from a701719 to 07c5e19 Compare June 26, 2026 16:42

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from 5940e65 to 9bc253e Compare June 26, 2026 23:29

shralex reviewed Jun 27, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 9bc253e to 0f6248b Compare June 29, 2026 09:46

shralex requested changes Jun 29, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from d55b90d to ffe10de Compare June 30, 2026 08:33

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from ffe10de to 2649217 Compare June 30, 2026 10:07

feat(lora): save/restore LoRA config in checkpoint metadata

e58e177

Co-authored-by: Xibin Liu <xibin@google.com>

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 2649217 to e58e177 Compare June 30, 2026 10:12

RexBearIU mentioned this pull request Jun 30, 2026

feat(scan_layers): verify scan_layers compatibility from checkpoint metadata #4304

Draft

4 tasks
