[V1] support resume training from checkpoint #10280
frozenleaves wants to merge 10 commits into hiyouga:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances training stability and the user experience by introducing robust checkpointing and training-resumption capabilities. It allows users to seamlessly continue training from a previous state, preventing lost progress after interruptions and enabling more flexible training workflows. The changes integrate comprehensive state saving and loading across the supported distributed training paradigms, ensuring that all critical components of the training process are preserved and restored accurately.

Highlights
Changelog
Code Review
This pull request introduces a comprehensive feature to support resuming training from checkpoints. The implementation is well-structured, covering various distributed training strategies like DDP, FSDP2, and DeepSpeed, and correctly handles the state for the model, optimizer, scheduler, dataloader, and RNG. The new checkpoint utility functions for finding, rotating, and managing checkpoints are robust. I have identified a couple of areas for improvement: a potential issue that could cause a crash when resuming on different hardware, and some redundant code in the batching logic. Overall, this is a solid contribution.
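The checkpoint rotation the review mentions can be sketched roughly as follows. This is a minimal stdlib illustration under assumed conventions, not this PR's actual code: it assumes checkpoints live in `output_dir` as `checkpoint-<global_step>` directories and that a `save_total_limit`-style cap bounds how many are kept; the helper name is hypothetical.

```python
import os
import re
import shutil

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def rotate_checkpoints(output_dir: str, save_total_limit: int) -> None:
    """Delete the oldest checkpoint dirs so at most `save_total_limit` remain."""
    steps = sorted(
        (int(m.group(1)), os.path.join(output_dir, name))
        for name in os.listdir(output_dir)
        if (m := _CKPT_RE.match(name))
        and os.path.isdir(os.path.join(output_dir, name))
    )
    # Checkpoints with the lowest global step are the oldest; remove them first.
    for _, path in steps[: max(len(steps) - save_total_limit, 0)]:
        shutil.rmtree(path)
```

Sorting by the parsed step number (rather than by directory mtime) keeps the rotation deterministic even if files were copied and their timestamps lost.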
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
What does this PR do?
Support resuming training from a checkpoint.
Examples:

If `resume_from_checkpoint` is set to `auto`, the trainer automatically searches for the most recently saved complete checkpoint; you can also specify a checkpoint path manually.

Currently verified cases:
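As a rough illustration of the `auto` behavior described above (a hedged sketch, not the PR's actual helper: the function name, the `checkpoint-<global_step>` layout, and the use of `trainer_state.json` as a completion marker are all assumptions):

```python
import os
import re
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the newest *complete* checkpoint directory, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CKPT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Skip partial saves that lack the completion marker file.
        if match and os.path.isdir(path) and os.path.exists(
            os.path.join(path, "trainer_state.json")
        ):
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # The highest global step is the most recently completed checkpoint.
    return max(candidates)[1]
```

Requiring a marker file is what makes the search return a "complete" checkpoint: a save that was interrupted mid-write is silently skipped instead of being resumed from.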