Description
As mentioned in other issues, there seem to be some compatibility problems when using generate_text.sh to run the pretrained model checkpoints produced by the examples. I trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files below:
checkpoint
└── gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true
    ├── global_step10000
    │   ├── expp_rank_0_mp_rank_00_optim_states.pt
    │   ├── expp_rank_1_mp_rank_00_optim_states.pt
    │   ├── expp_rank_2_mp_rank_00_optim_states.pt
    │   ├── expp_rank_3_mp_rank_00_optim_states.pt
    │   ├── layer_0_expert_0_mp_rank_00_model_states.pt
    │   ├── layer_0_expert_1_mp_rank_00_model_states.pt
    │   ├── ...
    │   ├── layer_5_expert_63_mp_rank_00_model_states.pt
    │   └── mp_rank_00_model_states.pt
    ├── latest
    └── latest_checkpointed_iteration.txt
However, when loading the checkpoint, the path that Megatron constructs does not match this folder layout. In the get_checkpoint_name function in megatron/checkpointing.py (around line 98), the checkpoint path is built as follows:
def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")
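For concreteness, here is a minimal sketch of the naming mismatch between the path get_checkpoint_name builds and what the DeepSpeed MoE run actually wrote. It is an illustration only; whatever the elided part of the function adds to common_path is ignored here.

import os

iteration = 10000
ckpt_dir = ("checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-"
            "gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true")

# Path shape get_checkpoint_name expects: an iter_XXXXXXX directory
# ending in model_optim_rng.pt.
megatron_path = os.path.join(ckpt_dir, "iter_{:07d}".format(iteration),
                             "model_optim_rng.pt")

# Path shape the MoE training run actually wrote (see the tree above):
# a global_stepNNNNN directory with *_model_states.pt / *_optim_states.pt files.
deepspeed_path = os.path.join(ckpt_dir, "global_step{}".format(iteration),
                              "mp_rank_00_model_states.pt")

print(megatron_path)   # .../iter_0010000/model_optim_rng.pt      -> does not exist
print(deepspeed_path)  # .../global_step10000/mp_rank_00_model_states.pt -> exists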
Bypassing the naming mismatch by changing the expected checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and returning f"{common_path}_model_states.pt"), generate_text.sh then runs into the issues mentioned above.
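For reference, the ad-hoc patch I applied is roughly the following sketch. Only the two lines described above are changed; the elided middle of the function is left as-is, so this is not a complete or verified fix.

def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        # changed: match DeepSpeed's global_stepNNNNN directory naming
        directory = 'global_step{:05d}'.format(iteration)
    ...
    # changed: point at the DeepSpeed *_model_states.pt file
    # instead of model_optim_rng.pt
    return f"{common_path}_model_states.pt"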
Is a further model conversion step needed to run the GPT-MoE models? Since the tutorials are not quite clear on this, could you explain how to run the GPT-MoE models with DeepSpeed expert parallelism?