Support updating grain data mixture during training by aireenmei · Pull Request #2697 · AI-Hypercomputer/maxtext

aireenmei · 2025-11-15T07:58:41Z

Description

See the newly added item 9 in data_input_grain.md for a description of this feature.

FIXES: b/454051801

Tests

Configured the new mixture in grain_mixture.json.
Ran the test script test_grain_mix.sh (see the training log).
Inspected the checkpoints under gs://aireenmei-multipod/test/grain_ckpt/gemini-test/grain-mixture-test-2025-11-15-06-44-13/checkpoints

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

github-actions · 2025-11-19T22:04:48Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

📋 Review Summary

This Pull Request introduces a new feature allowing dynamic updates to Grain data mixtures during training, which is well-documented and implemented. The associated changes involve refactoring data input validation and modifying checkpointing logic to support this new functionality.

🔍 General Feedback

The refactoring of data input validation into types.py is a good improvement for code organization and maintainability.
The new functions for merging iterator states in checkpointing.py correctly handle the complexity of updating data sources while resuming training.
The documentation for the new feature in data_input_grain.md is clear and provides good examples.
Consider refactoring duplicated code for applying dataset transformations in _grain_data_processing.py into a helper function to improve maintainability (as noted in an inline comment).

github-actions · 2025-11-19T22:05:55Z

@@ -64,15 +65,21 @@ def get_datasets(
      dataset_list = [


🟡 The logic for applying shuffle, repeat, and sharding is duplicated for both mixed and single datasets. Consider extracting this into a helper function to improve maintainability and reduce redundancy.

Suggested change

-      dataset_list = [
+def _apply_dataset_transforms(dataset, shuffle, shuffle_seed, num_epoch, dataloading_host_index, dataloading_host_count):
+  if shuffle:
+    dataset = dataset.shuffle(seed=shuffle_seed)
+  dataset = dataset.repeat(num_epoch)
+  dataset = dataset[dataloading_host_index::dataloading_host_count]
+  return dataset.to_iter_dataset()
+# In get_datasets, for mixed datasets:
+      for d, _ in enumerate(dataset_list):
+        dataset_list[d] = _apply_dataset_transforms(
+            dataset_list[d],
+            shuffle,
+            shuffle_seed,
+            num_epoch,
+            dataloading_host_index,
+            dataloading_host_count,
+        )
+      dataset = grain.IterDataset.mix(dataset_list, weights)
+    else:
+      data_files = find_data_files(data_file_pattern)
+      dataset = grain.MapDataset.source(grain.ArrayRecordDataSource(data_files))
+      dataset = _apply_dataset_transforms(
+          dataset,
+          shuffle,
+          shuffle_seed,
+          num_epoch,
+          dataloading_host_index,
+          dataloading_host_count,
+      )
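
To make the shuffle, repeat, shard sequence in the suggestion concrete, here is a minimal pure-Python sketch of the same order of operations. It uses plain lists as a stand-in for Grain datasets, so `apply_transforms_sketch` is a hypothetical name and none of this is the Grain API. In particular, the stand-in reuses one shuffle order for every epoch, which may differ from what Grain does.

```python
import random


def apply_transforms_sketch(records, shuffle, shuffle_seed, num_epoch, host_index, host_count):
  """Hypothetical list-based stand-in for the suggested helper (not Grain code)."""
  records = list(records)
  if shuffle:
    # Seeded shuffle so every host computes the same global order.
    random.Random(shuffle_seed).shuffle(records)
  # Repeat the (shuffled) records once per epoch.
  records = records * num_epoch
  # Strided slice: host i takes elements i, i + host_count, i + 2 * host_count, ...
  return records[host_index::host_count]


if __name__ == "__main__":
  data = list(range(6))
  for host in range(2):
    print(host, apply_transforms_sketch(data, True, 0, 2, host, 2))
```

Because every host applies the same seed before slicing, the per-host shards are disjoint positions of one shared sequence, which is the property the strided slice in the suggestion relies on.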

iindyk · 2025-11-21T00:19:44Z

+
+When you are ready to introduce the new dataset, create a JSON configuration file (e.g., `grain_mixture.json`). This file defines both the original mixture and the new one, along with the desired weights for blending them.
+
+**Important:** The `old_dataset` section in this file must exactly match the sources and weights you used in `grain_train_files` for the initial run.


This will not allow removing datasets, right? Is that OK?
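
For illustration, the matching requirement in the quoted documentation could be checked along these lines. This is a sketch under stated assumptions, not the code in this PR: only the key name `old_dataset` comes from the quoted text, while the layout of that section (a mapping from file pattern to weight), the `pattern,weight;pattern,weight` form of `grain_train_files`, the file patterns, and the helper names are all hypothetical.

```python
import json

# Hypothetical config layout; only the key names old_dataset, old_dataset_weight
# and new_dataset_weight appear in the PR. Everything else is assumed.
EXAMPLE_CONFIG = json.loads("""
{
  "old_dataset": {"/data/a*.array_record": 0.3, "/data/b*.array_record": 0.7},
  "new_dataset": {"/data/c*.array_record": 1.0},
  "old_dataset_weight": 0.8,
  "new_dataset_weight": 0.2
}
""")


def parse_train_files(spec):
  """Parse an assumed 'pattern,weight;pattern,weight' string into a dict."""
  mixture = {}
  for entry in spec.split(";"):
    pattern, weight = entry.rsplit(",", 1)
    mixture[pattern.strip()] = float(weight)
  return mixture


def check_old_dataset(mixture_config, grain_train_files):
  """Raise ValueError if old_dataset differs from the initial run's mixture."""
  expected = parse_train_files(grain_train_files)
  actual = {k: float(v) for k, v in mixture_config["old_dataset"].items()}
  if actual != expected:
    raise ValueError(f"old_dataset {actual} does not match grain_train_files {expected}")
```

A check like this only accepts an unchanged old mixture, which is why dropping one of the original sources would be rejected under this scheme.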

iindyk · 2025-11-21T00:26:10Z

+      )
+      old_weight = mixture_config["old_dataset_weight"]
+      new_weight = mixture_config["new_dataset_weight"]
+      train_ds = grain.IterDataset.mix([old_dataset, new_dataset], weights=[old_weight, new_weight])


This will work too, but now that you're using IterDataset.mix, the checkpoint will contain per-component state, so I think we could do some surgery there and recover each component separately. That would allow completely changing the weights (e.g. changing and removing old weights as well). Is that something you'd be interested in? If so, we can discuss further.
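
The limitation raised here can be seen by flattening the two-level mixture into per-source sampling probabilities. The sketch below is plain arithmetic and does not call Grain; it assumes weights at each level are normalized to probabilities, and `effective_weights` and the source names are hypothetical.

```python
def effective_weights(old_mixture, new_mixture, old_weight, new_weight):
  """Flatten mix([old, new], [old_weight, new_weight]) into per-source probabilities.

  Assumes each level's weights are normalized to sum to 1 before sampling.
  """
  outer_total = old_weight + new_weight
  result = {}
  for mixture, outer in ((old_mixture, old_weight), (new_mixture, new_weight)):
    inner_total = sum(mixture.values())
    for source, weight in mixture.items():
      share = (weight / inner_total) * (outer / outer_total)
      result[source] = result.get(source, 0.0) + share
  return result


if __name__ == "__main__":
  print(effective_weights({"a": 0.5, "b": 0.5}, {"c": 1.0}, 0.8, 0.2))
```

Under these assumptions the old sources always keep their original ratio to one another (here `a` and `b` stay 1:1), so a single old source cannot be down-weighted or removed on its own. Restoring each component separately, as proposed, would lift that restriction.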

aireenmei force-pushed the aireen/grain_mix branch from bc4ec1d to 7bd865d Compare November 15, 2025 08:14

aireenmei added 5 commits November 19, 2025 19:40

migrate from MapDataset.mix to IterDataset.mix

6a04af0

Support updating grain data mixture during training

a2e6446

update pyconfig_deprecated

889fab9

fix pre-commit

9405b51

fix types and pyconfig

f01855a

aireenmei force-pushed the aireen/grain_mix branch from e9c8d43 to 8cb5c7d Compare November 19, 2025 19:41

aireenmei marked this pull request as ready for review November 19, 2025 19:54

aireenmei requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, bvandermoon, gagika, gobbleturk, hengtaoguo, jacoguzo, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners November 19, 2025 19:54

fix pylint

f0835a3

aireenmei force-pushed the aireen/grain_mix branch from 8cb5c7d to f0835a3 Compare November 19, 2025 20:18

aireenmei added the gemini-review label Nov 19, 2025

github-actions Bot reviewed Nov 19, 2025


iindyk reviewed Nov 21, 2025


aireenmei closed this Nov 27, 2025


aireenmei wants to merge 6 commits into main from aireen/grain_mix


2 participants
