Description

This is a repository setup for evaluating training performance on various hardware.

Structure

trainer.py: The main training script.
model.py: The model definition.
data.py: The data loading and preprocessing.
utils.py: Utility functions.
Dockerfile: The Dockerfile for the training container.
pyproject.toml: The poetry configuration.

Local Development

Install poetry:

curl -sSL https://install.python-poetry.org | python3 -
# Add poetry to your profile
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

If you encounter issues with SSL certificates, try to follow the steps described here.

Install initial dependencies:

poetry install

To run the training script, use the following command:

poetry run accelerate launch source/trainer.py --config configs/tiny_model_local.yaml

Running in Slurm

Build the Docker image:

docker build -t climax-training:latest -f Dockerfile .

Test training with the docker image:

docker run \
  --init \
  --rm \
  -m 940g \
  --shm-size=500g \
  --gpus all \
  --network host \
  --privileged \
  --ulimit nofile=1000000 \
  -w /app \
  -e PYTHONPATH=/app \
  -v "/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu" \
  -v "/opt/hpcx:/opt/hpcx" \
  -v "/dev/infiniband:/dev/infiniband" \
  -v "/mnt/jua-shared-1:/mnt/data" \
  -v "$(pwd)/configs/tiny_model_local.yaml:/app/config.yaml:ro" \
  -v "/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml:/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml:ro" \
  -e LD_LIBRARY_PATH=/opt/hpcx/nccl_rdma_sharp_plugin/lib:/opt/hpcx/ucc/lib/ucc:/opt/hpcx/ucc/lib:/opt/hpcx/ucx/lib/ucx:/opt/hpcx/ucx/lib:/opt/hpcx/sharp/lib:/opt/hpcx/hcoll/lib:/opt/hpcx/ompi/lib \
  -e NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml \
  -e NCCL_SOCKET_NTHREADS=4 \
  -e NCCL_NSOCKS_PERTHREAD=8 \
  -e NCCL_IB_MERGE_VFS=0 \
  -e NCCL_IB_HCA=^mlx5_0:1 \
  climax-training:latest \
  poetry run accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --num_processes 8 \
    source/trainer.py \
    --run_id test-run-1 \
    --config /app/configs/tiny_model.yaml

To run the training script in Slurm, use the following command:

poetry run python launch_sbatch.py --nodes 2 --gpus_per_node 8 --config configs/medium_model.yaml --run_id test-run

Notes

You will need to update the data mount path in launch_sbatch.py.
You will need to add docker as a sudoer to each compute node. For example:

# Open sudoers file safely with visudo
sudo visudo

# Add the following line
%docker ALL=(ALL) NOPASSWD: /usr/bin/docker

Expected results

Tiny model - Total parameters: 0.009 B

1 GPU - 3650/140227 [02:20<1:24:26, 26.96timestep/s] 8 GPU's - 4200/17529 [00:38<01:57, 113.13timestep/s]

Small model - Total parameters: 0.093 B

1 GPU - 1748/140227 [01:24<1:46:44, 21.62timestep/s] 8 GPU's - 4824/17529 [00:52<02:10, 97.09timestep/s

Medium model - Total parameters: 1.067 B

1 GPU - 1776/140227 [02:00<5:56:49, 6.51timestep/s] 8 GPU's - 2240/17529 [00:54<05:51, 43.51timestep/s] 16 GPU's - 3408/8765 [00:41<01:02, 85.68timestep/s]

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
configs		configs
source		source
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
README.md		README.md
docker_runner.py		docker_runner.py
fsdp_config.yaml		fsdp_config.yaml
launch_sbatch.py		launch_sbatch.py
launch_srun.py		launch_srun.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
ruff.toml		ruff.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Description

Structure

Local Development

Running in Slurm

Notes

Expected results

Tiny model - Total parameters: 0.009 B

Small model - Total parameters: 0.093 B

Medium model - Total parameters: 1.067 B

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Description

Structure

Local Development

Running in Slurm

Notes

Expected results

Tiny model - Total parameters: 0.009 B

Small model - Total parameters: 0.093 B

Medium model - Total parameters: 1.067 B

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages