Update Gemma 7b tests to use optionally internal GCS buckets for testing #583
Just confirming, idx has been absorbed into MODEL_BUCKET, so this will still be distinct for each run?
Right. Since MODEL_BUCKET has the idx absorbed into it, CONVERTED_CHECKPOINT is distinct for each run. We also extract idx from MODEL_BUCKET on line 21 and use it in the RUN_NAME variables on lines 24, 47, and 56; these RUN_NAME variables make the outputs of end_to_end/gemma/7b/2_test_gemma.sh distinct for each run.
Is this what you meant?
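To make the idea concrete, here is a minimal sketch of pulling a trailing index out of a bucket path and reusing it in a run name. It is not the PR's actual code: it assumes idx is the last path component of MODEL_BUCKET, and the example bucket, the `converted` subfolder, and the `decode_` prefix are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical value; in practice the caller exports MODEL_BUCKET.
MODEL_BUCKET="gs://example-bucket/gemma-7b/42"

# Assumption: idx is the last path component of MODEL_BUCKET.
MODEL_BUCKET="${MODEL_BUCKET%/}"   # drop one trailing slash, if present
idx="${MODEL_BUCKET##*/}"          # keep everything after the final "/"

# Anything derived from MODEL_BUCKET or idx is distinct per run.
CONVERTED_CHECKPOINT="${MODEL_BUCKET}/converted"   # hypothetical subfolder name
RUN_NAME="decode_${idx}"                           # hypothetical naming scheme

echo "idx=${idx}"
echo "CONVERTED_CHECKPOINT=${CONVERTED_CHECKPOINT}"
echo "RUN_NAME=${RUN_NAME}"
```

Shell parameter expansion avoids spawning an external process, but it only works if the index really is the final path segment.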
What is this AWK doing? It seems unwieldy; maybe you can explain the problem (ideally in the commit message)?
Thanks Rafi - I added the explanation in the description of this PR.
Ok that is super helpful.
Is there a way we could avoid this branching? Do we really need to preserve multiple options here?
Do you mean we should remove the two options for connecting the two test scripts and keep only this option?
(We do need this option for connecting the two scripts in our internal testing, though.)
export MODEL_BUCKET=/path/to/GCS/bucket/and/an/index; bash end_to_end/gemma/7b/1_test_gemma.sh
export MODEL_BUCKET=/path/to/GCS/bucket/and/an/index; bash end_to_end/gemma/7b/2_test_gemma.sh
(Or maybe focus on a version that doesn't require parsing, so you pass two args.)
Followed up with @RissyRan - it is possible, but it would need some modifications to our Airflow+XLML test code.
@rwitten I think a much simpler and cleaner solution is to share one GCS path, BASE_OUTPUT_DIRECTORY, between the two scripts and write everything there in separate subfolders: e.g., the scanned ckpt, the unscanned ckpt, and all outputs of train and decode. This shared GCS path will be unique because it will be generated by Airflow and will contain the datetime.
I updated the code, please take a look.
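As a rough illustration of that layout (not the code in this PR), both scripts could derive every location from the single shared path. The subfolder names and the fallback value below are hypothetical; in the proposal the real value comes from Airflow and embeds a datetime.

```shell
#!/usr/bin/env bash
# Hypothetical default so the sketch runs standalone; normally exported by the caller.
BASE_OUTPUT_DIRECTORY="${BASE_OUTPUT_DIRECTORY:-gs://example-bucket/gemma-7b/2024-01-01-00-00-00}"
BASE_OUTPUT_DIRECTORY="${BASE_OUTPUT_DIRECTORY%/}"   # drop one trailing slash, if present

# Hypothetical subfolder names: one per artifact, all under the shared path.
SCANNED_CKPT_DIR="${BASE_OUTPUT_DIRECTORY}/scanned_ckpt"
UNSCANNED_CKPT_DIR="${BASE_OUTPUT_DIRECTORY}/unscanned_ckpt"
TRAIN_OUTPUT_DIR="${BASE_OUTPUT_DIRECTORY}/train"
DECODE_OUTPUT_DIR="${BASE_OUTPUT_DIRECTORY}/decode"

printf '%s\n' "$SCANNED_CKPT_DIR" "$UNSCANNED_CKPT_DIR" "$TRAIN_OUTPUT_DIR" "$DECODE_OUTPUT_DIR"
```

Because the second script recomputes the same subfolder names from the same variable, nothing has to be parsed out of the path to connect the two runs.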
Nit:
In MaxText land, BASE_OUTPUT_DIRECTORY is usually just a GCS bucket; we attach the run_name to it to create a unique path for the run.
I think we should be clearer in the example usage here and say that BASE_OUTPUT_DIRECTORY is the full path of the user's unique run.
I think we should also consider renaming that variable to BASE_OUTPUT_PATH to avoid confusion.
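The difference between the two conventions can be sketched as follows. The bucket and run names are made up, and the `is_run_path` guard is purely illustrative; nothing in this thread says the scripts perform such a check.

```shell
#!/usr/bin/env bash
# Convention 1 (usual MaxText usage, per the comment above): bucket plus run name.
base_output_directory="gs://example-bucket"   # hypothetical bucket
run_name="gemma-7b-run-001"                   # hypothetical run name
run_path="${base_output_directory}/${run_name}"

# Convention 2 (suggested for these scripts): the caller passes the full unique path.
BASE_OUTPUT_PATH="gs://example-bucket/gemma-7b-run-001"

# Illustrative guard: succeed only if the value has at least one path
# component after the bucket, i.e. it is not a bare bucket.
is_run_path() {
  local path="${1%/}"          # drop one trailing slash, if present
  case "$path" in
    gs://*/*) return 0 ;;      # bucket plus at least one more component
    *)        return 1 ;;
  esac
}

if is_run_path "$BASE_OUTPUT_PATH"; then
  echo "BASE_OUTPUT_PATH is a full run path: ${BASE_OUTPUT_PATH}"
else
  echo "BASE_OUTPUT_PATH looks like a bare bucket; expected a unique run path" >&2
fi
```

Both conventions end up at the same location here; the rename only makes it explicit which of the two the caller is expected to supply.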
Thanks Mohit!
I added a comment in the scripts to clarify and also renamed to BASE_OUTPUT_PATH.
For models such as Gemma-7B, which need multi-host TPUs for running train or decode, we can run the checkpoint conversion step, which creates the MaxText-compatible Orbax checkpoint, on a CPU machine or a TPUv4-8 (basically any single-host machine with enough RAM) in end_to_end/gemma/7b/1_test_gemma.sh, and run the decode and train steps on a multi-host TPU in end_to_end/gemma/7b/2_test_gemma.sh. When connecting the runs of these two separate scripts:
Run the two scripts as follows: