How to resume training from checkpoint? #14

@Davidwhw

I added a resume_from_checkpoint parameter to TrainingArguments.
In the startup script, I pass the checkpoint path to resume_from_checkpoint.

from dataclasses import dataclass, field
from typing import List, Optional

import transformers


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default='adamw_torch')
    # Checkpoint directory supplied by the startup script.
    resume_from_checkpoint: Optional[str] = field(
        default=None,
        metadata={
            'help': 'Path to a checkpoint directory to resume training from (e.g., `output/checkpoint-1000/`)'
        }
    )
    max_length: int = field(
        default=4096,
        metadata={
            'help':
            'Maximum sequence length. Sequences will be right padded (and possibly truncated).'
        },
    )
    use_lora: bool = False
    fix_vit: bool = True
    fix_sampler: bool = False
    fix_llm: bool = True
    label_names: List[str] = field(default_factory=lambda: ['samples'])

However, ChartMoETrainer still starts training from scratch.
What settings do I need to resume training from a checkpoint?
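
One possible cause, assuming ChartMoETrainer subclasses transformers.Trainer and does not override train(): the stock Trainer does not read args.resume_from_checkpoint on its own, so the path has to be passed explicitly as trainer.train(resume_from_checkpoint=...). The sketch below shows that forwarding step. The helper names find_latest_checkpoint and train_with_resume are hypothetical and not part of this repository or of transformers, and the checkpoint-<step> directory layout is assumed from the help text above.

```python
import os
import re
from typing import Optional

# Assumed layout: <output_dir>/checkpoint-<global_step>/
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Only directories named checkpoint-<digits> count as checkpoints.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def train_with_resume(trainer,
                      resume_from_checkpoint: Optional[str] = None,
                      output_dir: Optional[str] = None):
    """Call trainer.train(), forwarding the checkpoint path explicitly.

    `trainer` is any object exposing train(resume_from_checkpoint=...),
    such as a transformers.Trainer (hypothetical wrapper, for illustration).
    """
    checkpoint = resume_from_checkpoint
    if checkpoint is None and output_dir is not None:
        # Fall back to the newest checkpoint found in the output directory.
        checkpoint = find_latest_checkpoint(output_dir)
    # With checkpoint=None, training starts from scratch.
    return trainer.train(resume_from_checkpoint=checkpoint)
```

In the training script this would be called as train_with_resume(trainer, training_args.resume_from_checkpoint, training_args.output_dir) in place of a bare trainer.train(). Whether the optimizer and scheduler state are restored as well depends on what the checkpoint directory actually contains.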
