Skip to content

Some fixes for frozen weights#270

Merged
RaymondLi0 merged 4 commits into
mainfrom
raymond/fix-frozen-weight
May 20, 2025
Merged

Some fixes for frozen weights#270
RaymondLi0 merged 4 commits into
mainfrom
raymond/fix-frozen-weight

Conversation

@RaymondLi0
Copy link
Copy Markdown
Contributor

@RaymondLi0 RaymondLi0 commented May 15, 2025

✨ Description

Closes #265
Closes #256

1 - log gradients only for parameters that require_grad
2 - Fix shard validation when loading from distributed format
3 - In copy_shard_overlaps, skip when current shard is empty (exp_avgs, exp_avgs_sq for frozen weights), and fill with zero when loaded shard is empty.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

jlamypoirier

This comment was marked as duplicate.

Copy link
Copy Markdown
Collaborator

@jlamypoirier jlamypoirier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good, thanks for the fixes!

@RaymondLi0
Copy link
Copy Markdown
Contributor Author

RaymondLi0 commented May 16, 2025

Added another fix for cases where loaded and current config don't have the same set of frozen weights @jlamypoirier .
The test I added unfortunately still fails with large diffs on tensors ...

Comment thread tests/test_checkpoint.py
@RaymondLi0
Copy link
Copy Markdown
Contributor Author

RaymondLi0 commented May 16, 2025

Thanks for the review @jlamypoirier ! Let me know if the new changes are OK

Comment thread tests/test_checkpoint.py
],
compare=f"test_{TEST_MODEL}_checkpoint_and_eval",
prepare_fn=_prepare_resume_fn,
do_compare=False,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just removing compare?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the prepare_resume_fn uses compare to copy the checkpoints

@RaymondLi0 RaymondLi0 marked this pull request as ready for review May 20, 2025 18:54
@RaymondLi0 RaymondLi0 merged commit e22eda9 into main May 20, 2025
4 checks passed
@RaymondLi0 RaymondLi0 deleted the raymond/fix-frozen-weight branch May 20, 2025 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] test_checkpoint test not passing when any lr scale is set to 0 [bug] Resuming experiment in distributed format with frozen weights

2 participants