Better checkpoints by jlamypoirier · Pull Request #6 · ServiceNow/Fast-LLM

jlamypoirier · 2024-10-17T12:52:34Z

Add a FieldUpdate tool to simplify field overrides.
Standardize checkpoint saving and loading configs, using mixins so training checkpoint/export configs can use just the right fields (e.g., no path, since the trainer provides it).
Adjust some fields so they make sense for both saving and loading, and merge the ones for config selection into an enum (see below).
Move checkpoint save/load from Run to Trainer and simplify.
Make IntervalConfig into a proper class (use FieldUpdate for doc instead).
Make export checkpoint fully configurable, save separately instead of using a symlink.
Add keep_every for checkpoints to get the old behaviour, and optional callback script.

RENAME[BW COMPATIBLE] pretrained.imported_type -> pretrained.model_type
RENAME[BW COMPATIBLE] pretrained.load_weights -> pretrained.model_weights
RENAME[BW COMPATIBLE] pretrained.load_optimizer -> pretrained.optimizer_state
MERGE (pretrained.override_architecture, pretrained.load_full_base_model_config, pretrained.load_full_fast_llm_config) -> pretrained.load_config:enum
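
As an illustration of how the backward-compatible renames above could be handled, here is a minimal sketch. It is not the PR's implementation: the helper name `migrate_pretrained_keys` and the flat-dict representation of the `pretrained` section are assumptions, and only the old and new field names come from the list above.

```python
import warnings

# Old field name -> new field name, taken from the rename list above.
_RENAMED_FIELDS = {
    "imported_type": "model_type",
    "load_weights": "model_weights",
    "load_optimizer": "optimizer_state",
}


def migrate_pretrained_keys(config: dict) -> dict:
    """Return a copy of `config` with deprecated keys renamed (hypothetical helper)."""
    migrated = dict(config)
    for old, new in _RENAMED_FIELDS.items():
        if old in migrated:
            if new in migrated:
                # Refuse ambiguous configs instead of silently picking one value.
                raise ValueError(f"Both `{old}` and `{new}` are set; use only `{new}`.")
            warnings.warn(f"`{old}` has been renamed to `{new}`.")
            migrated[new] = migrated.pop(old)
    return migrated
```

Old configs keep working with a warning, while a config that sets both the old and the new name is rejected.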

tscholak

Great changes, thanks @jlamypoirier.
I have a few suggestions and questions for clarification. Can you please answer them?

tscholak · 2024-10-21T17:18:02Z

+    # Distributed checkpoint for fast checkpointing and resuming.
+    distributed = "distributed"
+    # Model state dict, for safe long-term storage in Fast-LLM format.
+    state_dict = "state_dict"


how about we call this long_term instead?

I'm ok with renaming (in a future PR), but I would prefer a name more related to Fast-LLM, since it's meant to be the "standard" Fast-LLM checkpoint format. (Maybe just "fast_llm"?)

tscholak · 2024-10-21T17:19:33Z

+        if default.get("format", None) == "huggingface":
+            warnings.warn(f"`huggingface` checkpoint format has been renamed to `external`.")
+            default["format"] = CheckpointFormat.external.value


will there be other external checkpoint formats or why this change?

Maybe. In theory the conversion mechanism can be used for any kind of checkpoint format. I had already renamed everything else, but not this one, because of backward compatibility.

I'm thinking of getting rid of it altogether in the next PR though, and just use the "model_type" as the format for external formats.
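
To make the rename concrete, the sketch below rebuilds the enum from the three members named in this thread and wraps the quoted compatibility branch in a function. The wrapper name `normalize_format` is an assumption, and the real enum may contain other members.

```python
import enum
import warnings


class CheckpointFormat(str, enum.Enum):
    # Members named in this thread; the real enum may differ.
    distributed = "distributed"
    state_dict = "state_dict"
    external = "external"


def normalize_format(default: dict) -> dict:
    """Map the legacy `huggingface` value onto `external` (illustrative wrapper)."""
    if default.get("format", None) == "huggingface":
        warnings.warn("`huggingface` checkpoint format has been renamed to `external`.")
        default["format"] = CheckpointFormat.external.value
    return default
```

Because the enum mixes in `str`, the plain string stored in the config compares equal to the member, and `CheckpointFormat("external")` returns `CheckpointFormat.external`.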

tscholak · 2024-10-21T17:23:21Z

@@ -0,0 +1,147 @@
+# TODO: Use packaging.version? (Safer but extra requirement)


not sure where to ask this question:
where will we be configuring the layers-per-step option?

That's a hack for the single-process converter. It doesn't apply here because we already have the whole model loaded. I'm not entirely sure about memory usage, but it should be OK because we reconstruct one layer at a time.

tscholak · 2024-10-21T17:23:53Z

+
+class CheckpointFormat(str, enum.Enum):
+    # Distributed checkpoint for fast checkpointing and resuming.
+    distributed = "distributed"


how about we call this fast or short_term?

Also an option, let's discuss it later.

tscholak · 2024-10-21T17:30:16Z

+
+        # TODO: Simplify branching.
+        if checkpoint_config.format == CheckpointFormat.external:
+            # TODO: Support optimizer?


can you confirm that Fast-LLM will set the checkpoint_config.optimizer_state to False somewhere (or reject the config at launch) before we end up here? Would be frustrating to crash thousands of steps into training because of this.

Right now there is no safety check, but that would be easy to add. Looking at it, right now the bigger problem is that the optimizer isn't saved by default so the bigger risk is not saving it when we want to...
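
A launch-time check of the kind requested here might look like the sketch below. It assumes that the `external` format cannot store optimizer state, as the `TODO: Support optimizer?` comment suggests. The function name and its plain arguments are hypothetical and are not part of Fast-LLM.

```python
def validate_checkpoint_config(format_name: str, optimizer_state: bool) -> None:
    """Fail at config validation rather than at the first save (hypothetical helper).

    Assumption: `external` checkpoints cannot hold optimizer state.
    """
    if format_name == "external" and optimizer_state:
        raise ValueError(
            "Saving the optimizer state is not supported for the `external` checkpoint format."
        )
```

Running such a check when the config is validated would surface the problem before training starts instead of thousands of steps in.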

tscholak · 2024-10-21T17:37:22Z

+    # Intervals are a common pattern, so we standardize them with this base class.
+    interval: int | None = Field(
+        default=None,
+        desc="The number of training iterations between each interval. Setting to None will disable.",


maybe the wrong place to do this, but can we make it so that there is at least a warning that checkpoint saving is disabled? I don't think people will appreciate training for hours (or days) only to find out that they forgot to set a saving interval ;)

That's an option, but would the warning actually be useful? A runtime warning is unlikely to be seen, and a validation one is only useful if the config is validated before launch.

Hm, I see. Do you have a better idea?

Not really....
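
For reference, the validation-time variant discussed above could be as small as the following sketch. The helper name is hypothetical, and, as noted in the thread, it only helps if the config is validated before the job is launched.

```python
import typing
import warnings


def checkpointing_enabled(interval: typing.Optional[int]) -> bool:
    """Return whether checkpoints will be saved, warning if not (hypothetical helper)."""
    if interval is None:
        warnings.warn("Checkpoint interval is None: no checkpoints will be saved during training.")
        return False
    return True
```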

tscholak · 2024-10-21T17:41:08Z

+    interval = FieldUpdate(
+        desc="The number of training iterations between each Wandb status post (alert)."
+        " Setting to None will disable iteration-based wandb alerts."
+        " Must be a sub-interval of the logging interval."


you mean that wandb posting can only happen at logging times? That would be super-interval, no? can you clarify what you mean here? I'm confused.

Yes. By sub-interval I meant posting iterations are a subset of logging ones, but that's probably not the right term.
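
The subset relation can be stated precisely with a short sketch. It assumes that an interval-based event fires on iterations that are multiples of its interval; the function name is hypothetical.

```python
def posting_is_subset_of_logging(post_interval: int, log_interval: int) -> bool:
    """True if every posting iteration is also a logging iteration.

    Assumption: an event with interval n fires on iterations n, 2n, 3n, ...
    Under that assumption the subset condition holds exactly when the
    posting interval is a multiple of the logging interval.
    """
    return post_interval % log_interval == 0


# Worked example: logging every 10 iterations.
logged = {i for i in range(1, 201) if i % 10 == 0}
posted_ok = {i for i in range(1, 201) if i % 50 == 0}   # every 50: all are logged
posted_bad = {i for i in range(1, 201) if i % 25 == 0}  # every 25: iteration 25 is not logged
```

With a logging interval of 10, a posting interval of 50 satisfies the constraint, while 25 does not, since iteration 25 is never a logging iteration.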

tscholak · 2024-10-21T17:42:34Z

+class CheckpointBaseConfig(IntervalConfig):
+    _abstract = True
+    save_name: typing.ClassVar[str] = "save"
+    directory_name: typing.ClassVar[str] = "save"


for what kind of checkpoint will we have save as the name of the output directory?

None, that's a placeholder (I could just remove?)

tscholak · 2024-10-21T17:44:44Z

+    save_name: typing.ClassVar[str] = "export"
+    directory_name = "export"
+    interval = FieldUpdate(
+        desc="The number of training iterations between each export." " Setting to None will disable exports."


Can this interval be incompatible with the checkpointing interval (i.e., not an exact multiple or divisor of the checkpointing interval)?

Yes, it doesn't matter anymore because checkpoints and exports are completely independent.
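
A small sketch of what independent intervals mean in practice follows. `IntervalSketch` is a hypothetical stand-in for the PR's `IntervalConfig`: only the `interval` field comes from the diff, and the rule that an event fires on multiples of the interval is an assumption.

```python
import dataclasses
import typing


@dataclasses.dataclass
class IntervalSketch:
    # Hypothetical stand-in for IntervalConfig; only `interval` comes from the diff.
    interval: typing.Optional[int] = None

    def enabled(self, iteration: int) -> bool:
        # Assumption: an event fires on iterations that are multiples of `interval`.
        return self.interval is not None and iteration % self.interval == 0


checkpoint = IntervalSketch(interval=1000)
export = IntervalSketch(interval=1500)  # Neither a multiple nor a divisor of 1000.

# Which events fire at each iteration, evaluated independently of each other.
events = {
    it: [name for name, cfg in (("checkpoint", checkpoint), ("export", export)) if cfg.enabled(it)]
    for it in range(500, 3001, 500)
}
```

Here checkpoints fire at 1000, 2000 and 3000, exports at 1500 and 3000, and neither schedule constrains the other.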

tscholak · 2024-10-21T17:45:31Z

+    interval = FieldUpdate(
+        desc="The number of training iterations between each automated shutdown."
+        " Setting to None will disable automated shutdowns."
+        " Must be a sub-interval of the checkpoint interval."


what's an automated shutdown? I am not familiar with this feature. Is this new?

It's always been there but never used.

tscholak

Thanks @jlamypoirier for answering my comments. Since most changes I suggested will be future work, this can be merged as is.

jlamypoirier added 5 commits October 17, 2024 08:52

Better checkpoints

cf63b8d

Simplify save checkpoint

ed45212

Simplify checkpoint loading

a6fe0e5

Fixed, misc, backward compatible

958889c

Merge remote-tracking branch 'origin/main' into better_checkpoints

ffa5630

jlamypoirier marked this pull request as ready for review October 18, 2024 14:11

jlamypoirier marked this pull request as draft October 18, 2024 15:42

jlamypoirier added 4 commits October 18, 2024 13:37

Simpler saving

5b508ee

cleanup

b3a97c0

fix

48cfd9b

Merge remote-tracking branch 'origin/main' into better_checkpoints

e1e3fde

jlamypoirier marked this pull request as ready for review October 21, 2024 16:58

jlamypoirier requested a review from tscholak October 21, 2024 16:59

tweak

3e5b33a

tscholak reviewed Oct 21, 2024

tscholak approved these changes Oct 21, 2024

jlamypoirier merged commit 6350dec into main Oct 21, 2024

jlamypoirier deleted the better_checkpoints branch October 21, 2024 19:34

jlamypoirier mentioned this pull request Oct 25, 2024

Roadmap #27

Closed

tscholak added this to the 0.2.0 milestone Oct 25, 2024
