[bug] Resuming experiment in distributed format with frozen weights #256

@RaymondLi0

🐞 Describe the Bug

Resuming an experiment from a checkpoint in distributed format fails with an IndexError when the set of frozen weights differs from the one used in the run that saved the checkpoint.

2025-05-07 19:22:37,146 [Rank 05] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 423, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 172, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 175, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 456, in _prepare_training_state
    self._load_checkpoint(self._config.training.checkpoint, last_iteration)
  File "/app/fast_llm/engine/training/trainer.py", line 525, in _load_checkpoint
    metadata = self._multi_stage.load_checkpoint(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
    metadata = converter.load(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/fast_llm/engine/checkpoint/distributed.py", line 122, in load
    self_fsdp.copy_shard_overlaps(
  File "/app/fast_llm/engine/multi_stage/fsdp.py", line 455, in copy_shard_overlaps
    shard[begin:end][overlap_mask] = loaded_shards[shard_name][overlap_index_map_masked]
                                     ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index is out of bounds for dimension with size 0
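
The carets in the last frame point at the indexing of loaded_shards[shard_name], so the loaded shard appears to have size 0 along the indexed dimension while the index map still refers to entries in it. One possible reading, which is an assumption and not confirmed by the traceback, is that the saved shard holds no data for parameters that were frozen in the first run. The pure-Python sketch below models that mismatch with lists; copy_overlap and its arguments are hypothetical names for illustration, not Fast-LLM code.

```python
# Hypothetical, simplified model of the failing copy; NOT the Fast-LLM implementation.

def copy_overlap(dest, loaded, index_map):
    """Copy loaded[index_map[i]] into dest[i]; a negative entry means no overlap."""
    for i, src in enumerate(index_map):
        if src < 0:
            continue  # this position has no counterpart in the loaded shard
        if src >= len(loaded):
            raise IndexError(
                f"index {src} is out of bounds for loaded shard of size {len(loaded)}"
            )
        dest[i] = loaded[src]
    return dest


# Assumption: the parameter was frozen in run 1, so the saved shard is empty.
loaded_shard = []
# Run 2 treats the parameter as trainable, so the index map expects saved data.
index_map = [0, 1, 2, 3]
dest = [0.0] * 4

try:
    copy_overlap(dest, loaded_shard, index_map)
    failed = False
    message = ""
except IndexError as error:
    failed = True
    message = str(error)
```

If this reading is right, the loader would need to detect the empty loaded shard and either skip the copy or fall back to another source for the newly unfrozen parameters. Which of these is appropriate depends on what the shard stores.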

🔄 Steps to Reproduce

Steps to reproduce the behavior:

fast-llm version: 286f9d3a2d5daec175a20bea11613405a5c53b71 (main)

1 - Run pretraining with mlp_lr_scale=0.0, so that the MLP weights are frozen.
2 - Load that pretrained model in distributed format, with mlp_lr_scale=1.0.

🎯 Expected Behavior

No crash: the checkpoint should load even though the set of frozen weights has changed.

📝 Additional Context

I originally observed the bug on this branch: #243
In a first run, I set the lr-scale of the embedding and output weights to zero. In a subsequent run, I unfroze the output weights. There was no crash, but the loss was very high at the beginning of training.
Resuming from the Hugging Face format instead of the distributed format worked fine.
The traceback above is what I got when trying to reproduce this issue on main.
