
Model Inference issues with GPT-MoE models #458

@1155157110

Description


As mentioned in other issues:

There seem to be some compatibility issues when using generate_text.sh to run a pretrained model checkpoint produced by the examples. I have trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files below:

checkpoint
└── gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true
    ├── global_step10000
    │   ├── expp_rank_0_mp_rank_00_optim_states.pt
    │   ├── expp_rank_1_mp_rank_00_optim_states.pt
    │   ├── expp_rank_2_mp_rank_00_optim_states.pt
    │   ├── expp_rank_3_mp_rank_00_optim_states.pt
    │   ├── layer_0_expert_0_mp_rank_00_model_states.pt
    │   ├── layer_0_expert_1_mp_rank_00_model_states.pt
    │   ├── ...
    │   ├── layer_5_expert_63_mp_rank_00_model_states.pt
    │   └── mp_rank_00_model_states.pt
    ├── latest
    └── latest_checkpointed_iteration.txt

However, when loading the checkpoint, the path that the code expects does not match the layout of the checkpoint folder above. In the get_checkpoint_name function in megatron/checkpointing.py (line 98), the checkpoint path is built as:

def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")
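To make the mismatch concrete, here is a rough comparison for iteration 10000 (only a sketch; the middle of the path is elided in the snippet above):

iteration = 10000

# Directory and file name that get_checkpoint_name builds:
megatron_dir = 'iter_{:07d}'.format(iteration)     # 'iter_0010000'
megatron_file = 'model_optim_rng.pt'

# Directory and file names the DeepSpeed MoE run actually wrote (see the tree above):
deepspeed_dir = 'global_step{}'.format(iteration)  # 'global_step10000'
deepspeed_file = 'mp_rank_00_model_states.pt'      # plus per-expert layer_*_expert_* files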

Bypassing the naming issue by changing the referenced checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and returning f"{common_path}_model_states.pt"), generate_text.sh then takes me to the issues mentioned above.
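For reference, a minimal, hypothetical sketch of that workaround as a standalone helper (the name is my own, not part of the repo; it only reproduces the directory/file naming for the dense model states and does not cover the per-expert layer_*_expert_* files or the optimizer states):

import os

def deepspeed_moe_checkpoint_name(checkpoints_path, iteration, mp_rank=0):
    # DeepSpeed saves into global_step<iteration> rather than iter_<iteration>,
    # matching the global_step10000 folder in the tree above
    directory = 'global_step{}'.format(iteration)
    # The dense (non-expert) model states are written as mp_rank_XX_model_states.pt
    filename = 'mp_rank_{:02d}_model_states.pt'.format(mp_rank)
    return os.path.join(checkpoints_path, directory, filename)

# e.g. deepspeed_moe_checkpoint_name('checkpoint/gpt-0.125B-...-drop-true', 10000)
# -> 'checkpoint/gpt-0.125B-...-drop-true/global_step10000/mp_rank_00_model_states.pt'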

Is a further model conversion step needed to run the GPT-MoE models? Since the tutorials are not entirely clear on this, could you explain how to run the GPT-MoE models with DeepSpeed expert parallelism?
