Skip to content

Checkpoint loading not working correctly #982

@sdtblck

Description

@sdtblck

We are having a critical issue in gpt-neox where the checkpoint loading from deepspeed isn't working correctly. When we reload the model from a checkpoint, the loss suggests it hasn't loaded anything at all, and is just starting again from initialization... (jumps up from ~4 at checkpoint save to 8 at checkpoint load)

The model is based on the megatron example in DeepSpeedExamples, and the checkpoint saving / loading as far as I'm aware, hasn't changed at all from there (and should be handled almost entirely by model_engine.save_checkpoint / load_checkpoint.)

It happens both with the pipeline parallel module, and without.

You can see here an example of training loss before saving a checkpoint

Screenshot from 2021-04-20 13-28-27

and then the loss directly afterward:

Screenshot from 2021-04-20 13-30-55

Would appreciate any help to decipher what's going on here.

We're technically using our deepspeed fork but there are no modification to the checkpoint saving / loading logic there either.

This is about as minimal a reproduction as i can give right now:

first set up the environment in bash:
(this is assuming you're running on an aws machine, but should work on any base image with torch + cuda 11.1, just comment out that first line)

# first activate the correct AWS env with torch 1.8.0 + cuda 11.1
source activate pytorch_latest_p37

# clone repo and cd into it
git clone https://github.com/EleutherAI/gpt-neox
cd gpt-neox

# install other requirements
pip install pybind11==2.6.2 six regex numpy==1.20.1 nltk==3.5 \
        -e git+git://github.com/EleutherAI/DeeperSpeed.git@0e9573776caf7227ecafdc8a3d57c7955165d8e2#egg=deepspeed \
        zstandard==0.15.1 cupy-cuda111 mpi4py==3.0.3 wandb==0.10.21 einops==0.3.0 transformers tokenizers lm_dataformat triton==1.0.0.dev20210329

# install apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690

# prepare dummy gpt2 data
python prepare_data.py

then in a python script, train a small model and try to continue training:

from yaml import load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper
import os

# modify train iters in small.yml to 500
with open('configs/small.yml', 'r') as f:
        data = load(f, Loader=Loader)
data.update({'train-iters': 500})
with open('configs/tmp.yml', 'w') as f:
        f.write(dump(data))

# start run
os.system('./deepy.py pretrain_gpt2.py -d configs tmp.yml local_setup.yml')

# kill any hanging processes
os.system("pkill -f deepy.py")

# update train iters again
with open('configs/small.yml', 'r') as f:
        data = load(f, Loader=Loader)
data.update({'train-iters': 550})
with open('configs/tmp.yml', 'w') as f:
        f.write(dump(data))

# start run again
os.system('./deepy.py pretrain_gpt2.py -d configs tmp.yml local_setup.yml')

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions