Skip to content

move gsutil copy to condtional to avoid breakages#909

Merged
copybara-service[bot] merged 1 commit into
AI-Hypercomputer:mainfrom
kocchop:faysal/move-gsutil-cp-2-conditional
Sep 24, 2024
Merged

move gsutil copy to condtional to avoid breakages#909
copybara-service[bot] merged 1 commit into
AI-Hypercomputer:mainfrom
kocchop:faysal/move-gsutil-cp-2-conditional

Conversation

@kocchop

@kocchop kocchop commented Sep 23, 2024

Copy link
Copy Markdown
Collaborator

gsutil is not in the requirements list, hence not installed by default. The copy command is breaking for the containers which don't have it by default. Moved it under a conditonal

@jonb377 jonb377 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kocchop!

@gobbleturk

Copy link
Copy Markdown
Collaborator

Isn't it better to error out when a user requests profiling but the profiling fails? User can set profiler='' (default, no profiling) to avoid profiling instead of this check?

This error points to a real bug in maxtext - we should either ensure gsutil is installed or perhaps can use a different gcs API (upload_blob)

@kocchop

kocchop commented Sep 23, 2024

Copy link
Copy Markdown
Collaborator Author

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

@gobbleturk

Copy link
Copy Markdown
Collaborator

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

Don't we need to move the profile to GCS? How are we able to use the profile otherwise?

@kocchop

kocchop commented Sep 23, 2024

Copy link
Copy Markdown
Collaborator Author

@gobbleturk yes, on clusters using gcs, it is needed indeed. Hence, the nightly maxtext container made through this script installs gsutil etc. However, for other clusters it may not be necessary as there could be different file system mounts. Without any checks, it leads to breakages.

Another way could be to enforce the gsutil to be on the requirements list. Even with that, we might need to change the behavior for non-gcs scenarios.

@gobbleturk

Copy link
Copy Markdown
Collaborator

script

Ya this makes sense. For TPUs the profile will be sent to base_output_directory which is configurable (can be GCS or local), I think we should change the GPU path to also be configurable, as it won't always be GCS.

@gobbleturk gobbleturk left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will approve for now to unblock nsys profiling use cases, as we work on longer term fix

@gobbleturk

Copy link
Copy Markdown
Collaborator

Will approve for now to unblock nsys profiling use cases, as we work on longer term fix

Created #911 to track longer term fix

@copybara-service copybara-service Bot merged commit 9a0cfd4 into AI-Hypercomputer:main Sep 24, 2024
@kocchop kocchop deleted the faysal/move-gsutil-cp-2-conditional branch September 25, 2024 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants