
[V1]support resume training from checkpoint #10280

Open
frozenleaves wants to merge 10 commits into hiyouga:main from frozenleaves:main-0312

Conversation

@frozenleaves
Collaborator

What does this PR do?

Support resuming training from a checkpoint.

Example configuration:

model: Qwen/Qwen3-0.6B
model_class: llm
template: qwen3_nothink
dist_config:
  name: fsdp2
train_dataset: data/v1_sft_demo.yaml
output_dir: outputs/test3_fsdp2_full_resume
micro_batch_size: 1
cutoff_len: 512
learning_rate: 1.0e-4
max_steps: 30
save_steps: 5
save_total_limit: 2  # keep only the last 2 checkpoints
resume_from_checkpoint: auto
# resume_from_checkpoint: outputs/test3_fsdp2_full_resume/checkpoint-10

If resume_from_checkpoint is set to auto, training automatically resumes from the most recently saved complete checkpoint; alternatively, a specific checkpoint path can be given manually.
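The auto-resolution behavior can be sketched as follows. This is a simplified illustration, not the PR's actual code: the marker filename and helper names here are assumptions (the PR defines a `CHECKPOINT_COMPLETE_MARKER` constant and `_parse_checkpoint_step`/`find_latest_checkpoint` utilities, but their exact contents are not shown in this summary).

```python
import os
import re

# Hypothetical marker filename; the PR's checkpoint.py defines the real
# CHECKPOINT_COMPLETE_MARKER, whose value is not shown in this summary.
COMPLETE_MARKER = ".complete"


def parse_checkpoint_step(name: str):
    """Extract the step number from a 'checkpoint-<step>' directory name."""
    m = re.fullmatch(r"checkpoint-(\d+)", name)
    return int(m.group(1)) if m else None


def find_latest_checkpoint(output_dir: str):
    """Return the path of the complete checkpoint with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        step = parse_checkpoint_step(name)
        path = os.path.join(output_dir, name)
        if step is None or not os.path.isdir(path):
            continue
        # Skip checkpoints whose save was interrupted (no completion marker).
        if not os.path.exists(os.path.join(path, COMPLETE_MARKER)):
            continue
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

With `resume_from_checkpoint: auto`, a resolver like this would pick `checkpoint-10` over an incompletely saved `checkpoint-15`, which is what makes resumption safe after a crash mid-save.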

Currently verified cases:

  1. Saving a checkpoint and resuming training from it on a single GPU and with DDP.
  2. Saving a checkpoint and resuming full-parameter training from it with multi-GPU FSDP2/DeepSpeed.
  3. Saving a checkpoint and resuming multi-GPU LoRA training from it.
  4. Resuming multi-GPU LoRA training after an interruption that left the last checkpoint incompletely saved.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training stability and user experience by introducing robust checkpointing and training resumption capabilities. It allows users to seamlessly continue training from a previous state, preventing data loss from interruptions and enabling more flexible training workflows. The changes integrate comprehensive state saving and loading mechanisms across different distributed training paradigms, ensuring that all critical components of the training process are preserved and restored accurately.

Highlights

  • Training Resumption: Introduced the ability to resume training from a saved checkpoint, either by specifying a path or automatically detecting the latest valid checkpoint.
  • Checkpoint Management: Added new training arguments (save_steps, save_on_epoch_end, save_total_limit) to control when and how many checkpoints are saved, including automatic rotation to keep only the most recent ones.
  • Distributed Training Support: Implemented checkpointing and resumption logic compatible with various distributed training setups, including FSDP2, DeepSpeed, DDP, and single-GPU training, covering both full and LoRA model states.
  • State Preservation: Ensured that not only model and optimizer states but also learning rate scheduler, dataloader, and random number generator (RNG) states are correctly saved and restored for full reproducibility.
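The checkpoint rotation mentioned above (keeping only the most recent checkpoints) might look roughly like this. A simplified sketch, not the PR's implementation: the marker filename is an assumption, and the policy of deleting partially written checkpoints during rotation is inferred from the description of `rotate_checkpoints`.

```python
import os
import re
import shutil

# Hypothetical marker filename standing in for the PR's CHECKPOINT_COMPLETE_MARKER.
COMPLETE_MARKER = ".complete"


def rotate_checkpoints(output_dir: str, save_total_limit: int) -> None:
    """Remove incomplete checkpoints and all but the newest `save_total_limit` complete ones."""
    complete = []
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if m is None or not os.path.isdir(path):
            continue
        if os.path.exists(os.path.join(path, COMPLETE_MARKER)):
            complete.append((int(m.group(1)), path))
        else:
            # Partially written checkpoint: unusable, so clean it up.
            shutil.rmtree(path)
    complete.sort()  # ascending by step, so oldest come first
    if save_total_limit > 0:
        for _, path in complete[:-save_total_limit]:
            shutil.rmtree(path)
```

With `save_total_limit: 2` from the example config, after the save at step 15 this would delete `checkpoint-5` and keep `checkpoint-10` and `checkpoint-15`.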
Changelog
  • src/llamafactory/v1/config/training_args.py
    • Added resume_from_checkpoint argument to specify a checkpoint path or 'auto'.
    • Added save_steps argument to control checkpoint saving frequency by steps.
    • Added save_on_epoch_end argument to enable checkpoint saving at epoch boundaries.
    • Added save_total_limit argument to manage the maximum number of kept checkpoints.
  • src/llamafactory/v1/core/base_trainer.py
    • Imported os module and new checkpoint utility functions.
    • Initialized _resume_epoch and conditionally called _resume_from_checkpoint in the constructor.
    • Added _dist_name property to retrieve the distributed configuration name.
    • Implemented _resolve_checkpoint_path to find the latest valid checkpoint when 'auto' is specified.
    • Implemented _save_checkpoint to save model, optimizer, scheduler, dataloader, and RNG states, supporting FSDP2, DeepSpeed, and standard configurations, and handling checkpoint rotation.
    • Implemented _save_fsdp2_states for FSDP2-specific state saving using Distributed Checkpoint.
    • Implemented _save_standard_states for DDP/single-GPU state saving via save_pretrained.
    • Implemented _resume_from_checkpoint to load training states from a specified checkpoint directory.
    • Implemented _load_fsdp2_states for FSDP2-specific state loading.
    • Implemented _load_standard_states for DDP/single-GPU state loading, supporting both full model and adapter (LoRA) checkpoints.
    • Modified the fit method to start training from the _resume_epoch and to trigger checkpoint saves based on save_steps or save_on_epoch_end.
  • src/llamafactory/v1/core/utils/batching.py
    • Added _buffer_tokens attribute to BatchGenerator for state tracking.
    • Modified state_dict and load_state_dict methods to correctly serialize and deserialize the _buffer's state.
  • src/llamafactory/v1/core/utils/checkpoint.py
    • Added a new file containing stateless checkpoint utility functions.
    • Defined CHECKPOINT_COMPLETE_MARKER to indicate a fully saved checkpoint.
    • Implemented _parse_checkpoint_step to extract step numbers from checkpoint directory names.
    • Implemented find_latest_checkpoint to locate the most recent complete checkpoint in an output directory.
    • Implemented rotate_checkpoints to enforce a limit on the number of saved checkpoints and clean up incomplete or older ones.
    • Implemented save_metadata and load_metadata for storing and retrieving training progress metadata.
    • Implemented _get_accelerator_rng_state and _set_accelerator_rng_state to manage accelerator-specific random number generator states.
    • Implemented save_rng_state and load_rng_state to preserve and restore Python, NumPy, PyTorch, and accelerator RNG states for reproducibility.
    • Implemented mark_checkpoint_complete to create a marker file signifying a successful checkpoint save.
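The RNG save/restore described in the changelog can be illustrated with a minimal sketch. For brevity this covers only Python's built-in `random` module; per the changelog, the PR's `save_rng_state`/`load_rng_state` additionally capture NumPy, PyTorch, and accelerator (e.g. CUDA) generator states.

```python
import random


def save_rng_state() -> dict:
    """Capture RNG state so a resumed run replays the same random sequence."""
    # The PR's version also records NumPy, torch, and accelerator states.
    return {"python": random.getstate()}


def load_rng_state(state: dict) -> None:
    """Restore previously captured RNG state."""
    random.setstate(state["python"])


# Restoring the state makes subsequent draws identical, so data shuffling
# and dropout behave as if training had never been interrupted.
state = save_rng_state()
before = [random.random() for _ in range(3)]
load_rng_state(state)
after = [random.random() for _ in range(3)]
assert before == after
```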

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive feature to support resuming training from checkpoints. The implementation is well-structured, covering various distributed training strategies like DDP, FSDP2, and DeepSpeed, and correctly handles the state for the model, optimizer, scheduler, dataloader, and RNG. The new checkpoint utility functions for finding, rotating, and managing checkpoints are robust. I have identified a couple of areas for improvement: a potential issue that could cause a crash when resuming on different hardware, and some redundant code in the batching logic. Overall, this is a solid contribution.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@frozenleaves frozenleaves changed the title support resume training from checkpoint [V1]support resume training from checkpoint Mar 13, 2026
@frozenleaves frozenleaves marked this pull request as ready for review March 16, 2026 07:11
