
[V1]support resume training from checkpoint #10280

Open
frozenleaves wants to merge 10 commits into hiyouga:main from frozenleaves:main-0312

Conversation

@frozenleaves
Collaborator

What does this PR do?

Support resuming training from a checkpoint.

Example configuration:

model: Qwen/Qwen3-0.6B
model_class: llm
template: qwen3_nothink
dist_config:
  name: fsdp2
train_dataset: data/v1_sft_demo.yaml
output_dir: outputs/test3_fsdp2_full_resume
micro_batch_size: 1
cutoff_len: 512
learning_rate: 1.0e-4
max_steps: 30
save_steps: 5
save_total_limit: 2  # keep only the last 2 checkpoints
resume_from_checkpoint: auto
# resume_from_checkpoint: outputs/test3_fsdp2_full_resume/checkpoint-10

If resume_from_checkpoint is set to auto, training automatically resumes from the most recently saved complete checkpoint; alternatively, a specific checkpoint path can be given manually.
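The auto-resolution behavior can be sketched as follows. This is a simplified illustration, not the PR's actual code: the marker filename and helper names here are assumptions (the PR defines a `CHECKPOINT_COMPLETE_MARKER` constant and `_parse_checkpoint_step`/`find_latest_checkpoint` utilities, but their exact contents are not shown in this summary).

```python
import os
import re

# Hypothetical marker filename; the PR's checkpoint.py defines the real
# CHECKPOINT_COMPLETE_MARKER, whose value is not shown in this summary.
COMPLETE_MARKER = ".complete"


def parse_checkpoint_step(name: str):
    """Extract the step number from a 'checkpoint-<step>' directory name."""
    m = re.fullmatch(r"checkpoint-(\d+)", name)
    return int(m.group(1)) if m else None


def find_latest_checkpoint(output_dir: str):
    """Return the path of the complete checkpoint with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        step = parse_checkpoint_step(name)
        path = os.path.join(output_dir, name)
        if step is None or not os.path.isdir(path):
            continue
        # Skip checkpoints whose save was interrupted (no completion marker).
        if not os.path.exists(os.path.join(path, COMPLETE_MARKER)):
            continue
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

With `resume_from_checkpoint: auto`, a resolver like this would pick `checkpoint-10` over an incompletely saved `checkpoint-15`, which is what makes resumption safe after a crash mid-save.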

Currently verified cases:

  1. Saving a checkpoint and resuming training from it on a single GPU and with DDP.
  2. Saving a checkpoint and resuming full-parameter training from it with multi-GPU FSDP2/DeepSpeed.
  3. Saving a checkpoint and resuming multi-GPU LoRA training from it.
  4. Resuming multi-GPU LoRA training after an interruption that left the last checkpoint incompletely saved.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training stability and user experience by introducing robust checkpointing and training resumption capabilities. It allows users to seamlessly continue training from a previous state, preventing data loss from interruptions and enabling more flexible training workflows. The changes integrate comprehensive state saving and loading mechanisms across different distributed training paradigms, ensuring that all critical components of the training process are preserved and restored accurately.

Highlights

  • Training Resumption: Introduced the ability to resume training from a saved checkpoint, either by specifying a path or automatically detecting the latest valid checkpoint.
  • Checkpoint Management: Added new training arguments (save_steps, save_on_epoch_end, save_total_limit) to control when and how many checkpoints are saved, including automatic rotation to keep only the most recent ones.
  • Distributed Training Support: Implemented checkpointing and resumption logic compatible with various distributed training setups, including FSDP2, DeepSpeed, DDP, and single-GPU training, covering both full and LoRA model states.
  • State Preservation: Ensured that not only model and optimizer states but also learning rate scheduler, dataloader, and random number generator (RNG) states are correctly saved and restored for full reproducibility.
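The checkpoint rotation mentioned above (keeping only the most recent checkpoints) might look roughly like this. A simplified sketch, not the PR's implementation: the marker filename is an assumption, and the policy of deleting partially written checkpoints during rotation is inferred from the description of `rotate_checkpoints`.

```python
import os
import re
import shutil

# Hypothetical marker filename standing in for the PR's CHECKPOINT_COMPLETE_MARKER.
COMPLETE_MARKER = ".complete"


def rotate_checkpoints(output_dir: str, save_total_limit: int) -> None:
    """Remove incomplete checkpoints and all but the newest `save_total_limit` complete ones."""
    complete = []
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if m is None or not os.path.isdir(path):
            continue
        if os.path.exists(os.path.join(path, COMPLETE_MARKER)):
            complete.append((int(m.group(1)), path))
        else:
            # Partially written checkpoint: unusable, so clean it up.
            shutil.rmtree(path)
    complete.sort()  # ascending by step, so oldest come first
    if save_total_limit > 0:
        for _, path in complete[:-save_total_limit]:
            shutil.rmtree(path)
```

With `save_total_limit: 2` from the example config, after the save at step 15 this would delete `checkpoint-5` and keep `checkpoint-10` and `checkpoint-15`.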
Changelog
  • src/llamafactory/v1/config/training_args.py
    • Added resume_from_checkpoint argument to specify a checkpoint path or 'auto'.
    • Added save_steps argument to control checkpoint saving frequency by steps.
    • Added save_on_epoch_end argument to enable checkpoint saving at epoch boundaries.
    • Added save_total_limit argument to manage the maximum number of kept checkpoints.
  • src/llamafactory/v1/core/base_trainer.py
    • Imported os module and new checkpoint utility functions.
    • Initialized _resume_epoch and conditionally called _resume_from_checkpoint in the constructor.
    • Added _dist_name property to retrieve the distributed configuration name.
    • Implemented _resolve_checkpoint_path to find the latest valid checkpoint when 'auto' is specified.
    • Implemented _save_checkpoint to save model, optimizer, scheduler, dataloader, and RNG states, supporting FSDP2, DeepSpeed, and standard configurations, and handling checkpoint rotation.
    • Implemented _save_fsdp2_states for FSDP2-specific state saving using Distributed Checkpoint.
    • Implemented _save_standard_states for DDP/single-GPU state saving via save_pretrained.
    • Implemented _resume_from_checkpoint to load training states from a specified checkpoint directory.
    • Implemented _load_fsdp2_states for FSDP2-specific state loading.
    • Implemented _load_standard_states for DDP/single-GPU state loading, supporting both full model and adapter (LoRA) checkpoints.
    • Modified the fit method to start training from the _resume_epoch and to trigger checkpoint saves based on save_steps or save_on_epoch_end.
  • src/llamafactory/v1/core/utils/batching.py
    • Added _buffer_tokens attribute to BatchGenerator for state tracking.
    • Modified state_dict and load_state_dict methods to correctly serialize and deserialize the _buffer's state.
  • src/llamafactory/v1/core/utils/checkpoint.py
    • Added a new file containing stateless checkpoint utility functions.
    • Defined CHECKPOINT_COMPLETE_MARKER to indicate a fully saved checkpoint.
    • Implemented _parse_checkpoint_step to extract step numbers from checkpoint directory names.
    • Implemented find_latest_checkpoint to locate the most recent complete checkpoint in an output directory.
    • Implemented rotate_checkpoints to enforce a limit on the number of saved checkpoints and clean up incomplete or older ones.
    • Implemented save_metadata and load_metadata for storing and retrieving training progress metadata.
    • Implemented _get_accelerator_rng_state and _set_accelerator_rng_state to manage accelerator-specific random number generator states.
    • Implemented save_rng_state and load_rng_state to preserve and restore Python, NumPy, PyTorch, and accelerator RNG states for reproducibility.
    • Implemented mark_checkpoint_complete to create a marker file signifying a successful checkpoint save.
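The RNG save/restore described in the changelog can be illustrated with a minimal sketch. For brevity this covers only Python's built-in `random` module; per the changelog, the PR's `save_rng_state`/`load_rng_state` additionally capture NumPy, PyTorch, and accelerator (e.g. CUDA) generator states.

```python
import random


def save_rng_state() -> dict:
    """Capture RNG state so a resumed run replays the same random sequence."""
    # The PR's version also records NumPy, torch, and accelerator states.
    return {"python": random.getstate()}


def load_rng_state(state: dict) -> None:
    """Restore previously captured RNG state."""
    random.setstate(state["python"])


# Restoring the state makes subsequent draws identical, so data shuffling
# and dropout behave as if training had never been interrupted.
state = save_rng_state()
before = [random.random() for _ in range(3)]
load_rng_state(state)
after = [random.random() for _ in range(3)]
assert before == after
```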

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive feature to support resuming training from checkpoints. The implementation is well-structured, covering various distributed training strategies like DDP, FSDP2, and DeepSpeed, and correctly handles the state for the model, optimizer, scheduler, dataloader, and RNG. The new checkpoint utility functions for finding, rotating, and managing checkpoints are robust. I have identified a couple of areas for improvement: a potential issue that could cause a crash when resuming on different hardware, and some redundant code in the batching logic. Overall, this is a solid contribution.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@frozenleaves frozenleaves changed the title support resume training from checkpoint [V1]support resume training from checkpoint Mar 13, 2026
@frozenleaves frozenleaves marked this pull request as ready for review March 16, 2026 07:11
