
Add retry to dataset loading #10

Merged
gobbleturk merged 7 commits into main from tfd-retry
Apr 22, 2023

@gobbleturk

Collaborator

We have found rare, flaky failures in data loading; a simple retry should solve the issue.
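The diff itself isn't reproduced on this page, so here is only a minimal sketch of the idea, not the merged code. `with_retries` and its parameters are hypothetical names: it calls a loader a bounded number of times, backs off exponentially with jitter between attempts, and re-raises the last error once attempts run out.

```python
import random
import time


def with_retries(fn, *, max_attempts=3, base_delay_s=1.0, retry_on=(Exception,)):
    """Hypothetical helper: call fn(), retrying on the listed exceptions."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            # Exponential backoff (base, 2*base, 4*base, ...) plus up to 50% jitter.
            delay = base_delay_s * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay / 2))
```

Usage would look like `ds = with_retries(lambda: load_my_dataset(config))`, where `load_my_dataset` is a placeholder for whatever loader flakes. Narrowing `retry_on` to transient error types (e.g. `(IOError,)`) keeps genuine bugs from being silently retried.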

@gobbleturk gobbleturk requested a review from rwitten April 21, 2023 23:11
@gobbleturk gobbleturk merged commit 3031610 into main Apr 22, 2023
@gobbleturk gobbleturk deleted the tfd-retry branch April 22, 2023 00:45
A9isha pushed a commit that referenced this pull request Apr 11, 2024
* Add GitHub Action to run all DAG scripts locally

Change-Id: I7625c3ed953ce0da2e3ccfb5d4614eba7625b739

* fix requirements.txt path

Change-Id: I81785543e9b2a77efe369bbd0396e7bef0e4c8e4

* Add BQ dep

Change-Id: I2b50735c7d72c627e1fd38083b6c3c5b1c9feec3

* fix GHA name

Change-Id: I347cc18fc0d39ac87fe81c467993e8353e94c5ad

* comment out packages

Change-Id: Ic14969ffb3350492797bfe7e2b67dde641ee5465
geeningwang pushed a commit to geeningwang/maxtext that referenced this pull request Apr 20, 2026
Verified both scan modes on commit 055a4c2 after full env restore:
  scan_layers=false: 55.4 ms decode, 123.6 ms prefill (577.5 tok/s)
  scan_layers=true:  68.4 ms decode, 121.9 ms prefill (468 tok/s)
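A quick arithmetic check on the two runs above, assuming the tok/s figure is decode throughput (tokens per decode step divided by decode step time); the batch size isn't stated, so the ~32 tokens/step below is an inference from the numbers, not a reported setting.

```python
# Reported numbers from the commit message (decode ms per step, tokens/s).
runs = {
    "scan_layers=false": {"decode_ms": 55.4, "tok_s": 577.5},
    "scan_layers=true": {"decode_ms": 68.4, "tok_s": 468.0},
}

for name, r in runs.items():
    # Tokens per decode step implied by throughput x step time.
    r["tokens_per_step"] = r["tok_s"] * r["decode_ms"] / 1000.0
    print(f"{name}: ~{r['tokens_per_step']:.1f} tokens/step")  # ~32.0 for both

slowdown = runs["scan_layers=true"]["decode_ms"] / runs["scan_layers=false"]["decode_ms"]
print(f"scanned decode is {100 * (slowdown - 1):.0f}% slower")  # 23% slower
```

Both runs land on about 32 tokens per step, so the throughput gap comes entirely from the slower scanned decode step.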

Updated:
- env_restore.md: add 2026-04-20 noscan results + summary table
- opt4 plan: add noscan row to benchmark table
- perf optimization: add opt #9 (reverted) and #10 rows, update
  both benchmark sections with 2026-04-20 results
ecnal-cienet added a commit that referenced this pull request Jun 24, 2026
…ffload (#10)

Both paths need Pathways/TPU-memory infra at runtime, so the external pieces
(reshard_pytree via pathwaysutils; move_memory_to_device) are mocked and the test
pins our changes:
- #9: scan_layers=False no longer raises and the unscanned policy params are pushed
  to the inference engine (guard removal).
- #10: optimizer_memory_host_offload runs the device_put/update plumbing and yields
  the same params as the no-offload step (memory placement, not math).
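A minimal sketch of the mocking pattern this message describes, using the standard library's `unittest.mock`. Everything here is hypothetical: `infra`, `sync_policy_params`, and `train_step` are stand-ins for the real Pathways/TPU-memory calls (`reshard_pytree`, `move_memory_to_device`) and the trainer code, whose real signatures aren't shown, so the actual test may be structured differently.

```python
import types
import unittest
from unittest import mock

# Hypothetical stand-in for the infra layer; the real reshard_pytree
# (pathwaysutils) and move_memory_to_device need Pathways/TPU memory at runtime.
infra = types.SimpleNamespace(
    reshard_pytree=lambda tree, sharding: tree,
    move_memory_to_device=lambda tree: tree,
)


def sync_policy_params(params, engine, scan_layers):
    """Hypothetical: push policy params to an inference engine.

    scan_layers no longer gates the push (the #9 guard removal), so
    unscanned params take the same path as scanned ones.
    """
    engine.update_params(infra.reshard_pytree(params, sharding=None))


def train_step(params, grads, lr, offload):
    """Hypothetical SGD step; offload changes memory placement, not the math."""
    if offload:
        params = infra.move_memory_to_device(params)
    return {k: v - lr * grads[k] for k, v in params.items()}


class OffloadAndNoScanTest(unittest.TestCase):
    def test_noscan_pushes_params(self):
        engine = mock.Mock()
        with mock.patch.object(infra, "reshard_pytree",
                               side_effect=lambda t, sharding: t) as reshard:
            sync_policy_params({"w": 1.0}, engine, scan_layers=False)
        reshard.assert_called_once()
        engine.update_params.assert_called_once_with({"w": 1.0})

    def test_offload_matches_no_offload(self):
        params, grads = {"w": 1.0, "b": 0.5}, {"w": 0.2, "b": -0.1}
        with mock.patch.object(infra, "move_memory_to_device",
                               side_effect=lambda t: t) as move:
            offloaded = train_step(params, grads, lr=0.1, offload=True)
        move.assert_called_once()  # the offload plumbing actually ran
        self.assertEqual(offloaded, train_step(params, grads, lr=0.1, offload=False))
```

Saved as a module, this runs with `python -m unittest`. Patching at the infra boundary lets the test exercise the changed control flow on CPU without the runtime the real calls require.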
hsuan-lun-chiang pushed a commit that referenced this pull request Jun 25, 2026
ecnal-cienet added a commit that referenced this pull request Jun 25, 2026
2 participants