Skip to content

Update Gemma 7b tests to use optionally internal GCS buckets for testing#583

Merged
copybara-service[bot] merged 1 commit into
mainfrom
anisha-run-multicluster-test
Apr 12, 2024
Merged

Update Gemma 7b tests to use optionally internal GCS buckets for testing#583
copybara-service[bot] merged 1 commit into
mainfrom
anisha-run-multicluster-test

Conversation

@A9isha

@A9isha A9isha commented Apr 10, 2024

Copy link
Copy Markdown
Collaborator

For models such as Gemma-7B which needs multihost TPUs for running train or decode, we can run the checkpoint conversion step of creating the MaxText compatible Orbax checkpoint in a CPU machine or TPU v4-8 (basically any single host machine with enough RAM) in end_to_end/gemma/7b/1_test_gemma.sh, and run the decode and train steps on a multi-host TPU in end_to_end/gemma/7b/2_test_gemma.sh.

When connecting the runs of these two separate scripts:
Run the two scripts as follows:

export BASE_OUTPUT_PATH=/path/to/GCS/bucket; bash end_to_end/tpu/gemma/7b/1_test_gemma.sh
export BASE_OUTPUT_PATH=/path/to/same/GCS/bucket; bash end_to_end/tpu/gemma/7b/2_test_gemma.sh

@A9isha A9isha force-pushed the anisha-run-multicluster-test branch from 15f9fa8 to ab74710 Compare April 11, 2024 00:36
@A9isha A9isha force-pushed the anisha-run-multicluster-test branch 3 times, most recently from 3cf81f0 to c147b1e Compare April 11, 2024 01:18
@A9isha A9isha requested review from jonb377 and khatwanimohit April 11, 2024 01:18
@A9isha A9isha marked this pull request as ready for review April 11, 2024 01:18
Comment thread end_to_end/gemma/7b/1_test_gemma.sh Outdated

@jonb377 jonb377 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Anisha!

Comment thread end_to_end/gemma/7b/2_test_gemma.sh Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming, idx has been absorbed into MODEL_BUCKET, so this will still be distinct for each run?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, since MODEL_BUCKET has the idx absorbed into it, we are getting CONVERTED_CHECKPOINT distinct in each run. We are also pulling out idx in line 21 from MODEL_BUCKET, and used in the different RUN_NAME vars in line 24, 47, 56, these RUN_NAME vars would make the outputs of end_to_end/gemma/7b/2_test_gemma.sh distinct for each run.

Is this what you meant?

@A9isha A9isha force-pushed the anisha-run-multicluster-test branch from c147b1e to 3fa6510 Compare April 11, 2024 17:00
Comment thread end_to_end/gemma/7b/1_test_gemma.sh Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this AWK doing? Seems unwieldy, maybe you can explain the problem (ideally in some message about the commit?)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rafi - I added the explanation in the description of this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that is super helpful.

Is there a way we could not have thing branching? Like do we really to preserve multiple options here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean we remove the two options of connecting the two testing scripts, and only have the option of this?
(we need this option of connecting the two scripts for our internal testing though )

export MODEL_BUCKET=/path/to/GCS/bucket/and/an/index; bash end_to_end/gemma/7b/1_test_gemma.sh
export MODEL_BUCKET=/path/to/GCS/bucket/and/an/index; bash end_to_end/gemma/7b/2_test_gemma.sh

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Or maybe focus on a version that doesn't require parsing so you apss two args.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely make passing MODEL_BUCKET as the only option

Could we pass both a GCS path as well as an additional index from airflow? @jonb377 and @RissyRan

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed up with @RissyRan - it is possible, it would need some modifications to our airflow+XLML tests code.

@A9isha A9isha Apr 12, 2024

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwitten I think the much simpler and cleaner solution will be to have one GCS path shared between the two scripts and it would be the BASE_OUTPUT_DIRECTORY, and write everything there: for e.g., scanned ckpt, unscanned ckpt, and all outputs of train and decode (in separate subfolders). This shared GCS path will be unique since it will have the datetime inside it and will be generated from airflow.

I updated the code, please take a look.

cc: @RissyRan @jonb377 @khatwanimohit

@rwitten rwitten removed their assignment Apr 11, 2024
@A9isha A9isha force-pushed the anisha-run-multicluster-test branch from 3fa6510 to f031d76 Compare April 11, 2024 18:51
@rwitten rwitten removed their assignment Apr 11, 2024

@khatwanimohit khatwanimohit left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a nit

Comment thread end_to_end/gemma/7b/1_test_gemma.sh Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:
In MaxText land, BASE_OUTPUT_DIRECTORY is usually just a gcs bucket, we attach the run_name to it and create a unique path for the run.
I think we should be more clear in the example usage here, saying that BASE_OUTPUT_DIRECTORY is the full path of user's unique run.

I think we should also consider renaming that variable to BASE_OUTPUT_PATH to avoid confusion.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mohit!

I added a comment in the scripts to clarify and also renamed to BASE_OUTPUT_PATH.

@A9isha A9isha force-pushed the anisha-run-multicluster-test branch 5 times, most recently from 4e29578 to 310b98f Compare April 12, 2024 20:06
@A9isha A9isha force-pushed the anisha-run-multicluster-test branch from 310b98f to 5b8a3c3 Compare April 12, 2024 20:26
@copybara-service copybara-service Bot merged commit 3bba7a6 into main Apr 12, 2024
@copybara-service copybara-service Bot deleted the anisha-run-multicluster-test branch April 12, 2024 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants