Description
As mentioned in other issues, there seem to be some compatibility problems when using generate_text.sh to run the pretrained model checkpoints produced by the examples. I trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files below:
checkpoint
└── gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true
    ├── global_step10000
    │   ├── expp_rank_0_mp_rank_00_optim_states.pt
    │   ├── expp_rank_1_mp_rank_00_optim_states.pt
    │   ├── expp_rank_2_mp_rank_00_optim_states.pt
    │   ├── expp_rank_3_mp_rank_00_optim_states.pt
    │   ├── layer_0_expert_0_mp_rank_00_model_states.pt
    │   ├── layer_0_expert_1_mp_rank_00_model_states.pt
    │   ├── ...
    │   ├── layer_5_expert_63_mp_rank_00_model_states.pt
    │   └── mp_rank_00_model_states.pt
    ├── latest
    └── latest_checkpointed_iteration.txt
However, when loading the checkpoint, the path that Megatron constructs does not match this folder layout. In the get_checkpoint_name function in megatron/checkpointing.py (around line 98), the checkpoint path is built as follows:
def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")
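For concreteness, here is a minimal sketch of the naming mismatch between the path get_checkpoint_name builds and what the DeepSpeed MoE run actually wrote. It is an illustration only; whatever the elided part of the function adds to common_path is ignored here.

import os

iteration = 10000
ckpt_dir = ("checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-"
            "gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true")

# Path shape get_checkpoint_name expects: an iter_XXXXXXX directory
# ending in model_optim_rng.pt.
megatron_path = os.path.join(ckpt_dir, "iter_{:07d}".format(iteration),
                             "model_optim_rng.pt")

# Path shape the MoE training run actually wrote (see the tree above):
# a global_stepNNNNN directory with *_model_states.pt / *_optim_states.pt files.
deepspeed_path = os.path.join(ckpt_dir, "global_step{}".format(iteration),
                              "mp_rank_00_model_states.pt")

print(megatron_path)   # .../iter_0010000/model_optim_rng.pt      -> does not exist
print(deepspeed_path)  # .../global_step10000/mp_rank_00_model_states.pt -> exists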
Bypassing the naming mismatch by changing the expected checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and returning f"{common_path}_model_states.pt"), generate_text.sh then runs into the issues mentioned above.
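For reference, the ad-hoc patch I applied is roughly the following sketch. Only the two lines described above are changed; the elided middle of the function is left as-is, so this is not a complete or verified fix.

def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        # changed: match DeepSpeed's global_stepNNNNN directory naming
        directory = 'global_step{:05d}'.format(iteration)
    ...
    # changed: point at the DeepSpeed *_model_states.pt file
    # instead of model_optim_rng.pt
    return f"{common_path}_model_states.pt"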
Is a further model conversion step needed to run the GPT-MoE models? Since the tutorials are not quite clear on this, could you explain how to run the GPT-MoE models with DeepSpeed expert parallelism?