Add retry to dataset loading by gobbleturk · Pull Request #10 · AI-Hypercomputer/maxtext

gobbleturk · 2023-04-21T23:11:09Z

We have seen rare, flaky failures in data loading; a simple retry should resolve the issue.
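The diff itself is not reproduced on this page, so the following is only a minimal sketch of what a "simple retry" around a loading call might look like, not the code merged in this PR. The helper name `load_with_retry` and its parameters (`max_attempts`, `delay_seconds`, `retry_on`) are assumptions made for illustration.

```python
import time


def load_with_retry(load_fn, max_attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Call load_fn() and retry on failure, up to max_attempts calls in total.

    Hypothetical helper for illustration; not taken from the PR's diff.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: re-raise the last error unchanged.
                raise
            # Optional pause before the next attempt.
            time.sleep(delay_seconds)
```

A caller would wrap whatever builds the dataset, for example `load_with_retry(lambda: build_dataset(config))`, where `build_dataset` and `config` are placeholders. Narrowing `retry_on` to the specific transient error types seen in practice avoids masking genuine bugs behind repeated attempts.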

* Add GitHub Action to run all DAG scripts locally
  Change-Id: I7625c3ed953ce0da2e3ccfb5d4614eba7625b739
* fix requirements.txt path
  Change-Id: I81785543e9b2a77efe369bbd0396e7bef0e4c8e4
* Add BQ dep
  Change-Id: I2b50735c7d72c627e1fd38083b6c3c5b1c9feec3
* fix GHA name
  Change-Id: I347cc18fc0d39ac87fe81c467993e8353e94c5ad
* comment out packages
  Change-Id: Ic14969ffb3350492797bfe7e2b67dde641ee5465

Verified both scan modes on commit 055a4c2 after full env restore:
- scan_layers=false: 55.4 ms decode, 123.6 ms prefill (577.5 tok/s)
- scan_layers=true: 68.4 ms decode, 121.9 ms prefill (468 tok/s)

Updated:
- env_restore.md: add 2026-04-20 noscan results + summary table
- opt4 plan: add noscan row to benchmark table
- perf optimization: add opt AI-Hypercomputer#9 (reverted) and AI-Hypercomputer#10 rows, update both benchmark sections with 2026-04-20 results

…ffload (#10)

Both paths need Pathways/TPU-memory infra at runtime, so the external pieces (reshard_pytree via pathwaysutils; move_memory_to_device) are mocked and the test pins our changes:
- #9: scan_layers=False no longer raises and the unscanned policy params are pushed to the inference engine (guard removal).
- #10: optimizer_memory_host_offload runs the device_put/update plumbing and yields the same params as the no-offload step (memory placement, not math).

gobbleturk added 3 commits April 21, 2023 21:57

Add retry to data loading

e35ee7e

Add retry to data loading

04019fa

Add retry to data loading

78a0e70

gobbleturk requested a review from rwitten April 21, 2023 23:11

gobbleturk added 4 commits April 21, 2023 23:12

Add retry to data loading

fab0e62

Add retry to data loading

fd44426

Add retry to data loading

7208aff

Add retry to data loading

7eb69fe

rwitten approved these changes Apr 21, 2023

gobbleturk merged commit 3031610 into main Apr 22, 2023

gobbleturk deleted the tfd-retry branch April 22, 2023 00:45
