7 changes: 6 additions & 1 deletion .gitignore
@@ -160,4 +160,9 @@ cython_debug/
#.idea/
bittensor-subnet-template/
wandb/
.vscode/
.vscode/

data
wallets
lightning_logs
.scale_batch_size*
75 changes: 75 additions & 0 deletions DOCKER.md
@@ -0,0 +1,75 @@
# how to use hivetrain with docker

## install dependencies

1. [Docker](https://docs.docker.com/engine/install/)
2. [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

## clone the repo
```
git clone https://github.com/LuciferianInk/DistributedTraining.git
```

## move into the repo
```
cd DistributedTraining
```

## check out the docker-setup branch
```
git checkout docker-setup
```

## build the docker image
```
docker compose build
```

## make a .env file
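Make a file called `.env`, and place it in the root of this project. The `example.env` file in this repo lists the variables the container reads and can serve as a starting point.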
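
A minimal sketch of this step, assuming you want to start from the `example.env` file in this repo. `ensure_env_file` is a hypothetical helper name, not something the project provides. Note that `example.env` sets `INITIAL_PEERS`, so remove that line from your copy if you plan to bootstrap a new run.

```bash
#!/usr/bin/env bash
# Sketch: create .env from example.env unless .env already exists.
# ensure_env_file is a hypothetical helper, not part of the repository.
ensure_env_file() {
  local dir="${1:-.}"
  if [ -f "$dir/.env" ]; then
    # Never overwrite settings the user already made.
    echo "keeping existing $dir/.env"
  elif [ -f "$dir/example.env" ]; then
    cp "$dir/example.env" "$dir/.env"
    echo "created $dir/.env from example.env"
  else
    # No template available: fall back to an empty file.
    : > "$dir/.env"
    echo "created empty $dir/.env"
  fi
}
```

Calling `ensure_env_file .` from the root of the repo leaves an existing `.env` untouched, so it is safe to run more than once.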

## make a choice
At this point, you must make one of two choices:

### 1. bootstrap
If you intend to bootstrap a new training run, leave `INITIAL_PEERS` out of your `.env` file and start the container:
```
docker compose up
```

### 2. join
If you intend to join an existing training run, then add this environment variable to your `.env` file:
```
INITIAL_PEERS="/p2p/12D3KooWE1fyvZHhuc2UQqAN35oXgexHKRpVqgXKo9EUQ4hguny9"
```
After that, you may join the training run with:
```
docker compose up
```
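
Before joining, it can help to check that the value you pasted has the expected shape. The sketch below assumes peers are given in the `/p2p/<peer-id>` form shown above, with a base58 peer ID. `is_p2p_addr` is a hypothetical helper; it checks the shape only and says nothing about whether the peer is reachable.

```bash
#!/usr/bin/env bash
# Sketch: check that a string has the /p2p/<peer-id> shape used in this guide.
# is_p2p_addr is a hypothetical helper, not part of the repository.
is_p2p_addr() {
  local LC_ALL=C  # make the character ranges below byte-exact
  # Assumption: the peer ID is base58, i.e. digits and letters without 0, O, I and l.
  [[ "$1" =~ ^/p2p/[1-9A-HJ-NP-Za-km-z]+$ ]]
}

if is_p2p_addr "/p2p/12D3KooWE1fyvZHhuc2UQqAN35oXgexHKRpVqgXKo9EUQ4hguny9"; then
  echo "looks like a peer address"
fi
```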

## final notes

Your machine will print your own peer ID to the console at startup. It should look like this:
```
PEER-ID: /p2p/12D3KooWF9KB7PVUdbct4ryCMzDjbNT1q2w5XMw9iVG6tisY4ThB
```
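
If you want to hand your peer ID to someone else, you can pull it out of the log text instead of copying it by eye. This is a sketch that assumes the `PEER-ID: /p2p/...` line format shown above; `extract_peer_id` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: print the first "PEER-ID: /p2p/..." value found in log text on stdin.
# extract_peer_id is a hypothetical helper, not part of the repository.
extract_peer_id() {
  # Keep only the /p2p/... token that follows "PEER-ID: ", then stop at the first hit.
  sed -n 's|^.*PEER-ID: \(/p2p/[^[:space:]]*\).*$|\1|p' | head -n 1
}

# Example with a log line that carries the compose service prefix.
printf '%s\n' 'hivetrain-1 | PEER-ID: /p2p/12D3KooWF9KB7PVUdbct4ryCMzDjbNT1q2w5XMw9iVG6tisY4ThB' | extract_peer_id
```

With a running container, piping `docker compose logs hivetrain` into `extract_peer_id` should give the same result.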
If Hivemind is under-utilizing your GPU (i.e. it's not using all of your available VRAM), you may try to increase the batch size being used. To do this, add this environment variable to your `.env` file, using 2, 3, or whatever value fits in your VRAM:
```
BATCH_SIZE=2
```
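
Editing `.env` by hand works fine. If you change the batch size often, a small helper can replace the existing line instead of piling up duplicates. This is a sketch; `set_env_var` is a hypothetical helper, and it assumes simple `KEY=value` lines with no regex characters in the key.

```bash
#!/usr/bin/env bash
# Sketch: set KEY=value in an env file, replacing any earlier line for that key.
# set_env_var is a hypothetical helper, not part of the repository.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  if [ -f "$file" ]; then
    # Copy every line except an existing assignment of this key.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Demo on a throwaway file so a real .env is not touched.
demo_env="$(mktemp)"
printf 'NETUID=25\nBATCH_SIZE=1\n' > "$demo_env"
set_env_var "$demo_env" BATCH_SIZE 2
cat "$demo_env"
rm -f "$demo_env"
```

After changing `.env`, run `docker compose up` again so the container picks up the new value.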
You will know that training is progressing when you see output like this:
```
hivetrain-1 | LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
hivetrain-1 |
hivetrain-1 | | Name | Type | Params
hivetrain-1 | ------------------------------------------
hivetrain-1 | 0 | model | GPT2LMHeadModel | 186 M
hivetrain-1 | ------------------------------------------
hivetrain-1 | 186 M Trainable params
hivetrain-1 | 0 Non-trainable params
hivetrain-1 | 186 M Total params
hivetrain-1 | 747.418 Total estimated model params size (MB)
hivetrain-1 | Global Step: 0, Local Loss: 12.069, Peers: 0
hivetrain-1 | Global Step: 0, Local Loss: 12.063, Peers: 1
hivetrain-1 | Global Step: 0, Local Loss: 11.852, Peers: 2
```
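
To watch progress without reading every line, you can reduce the log to the latest step, loss, and peer count. The sketch below assumes the `Global Step: N, Local Loss: X, Peers: P` format from the sample output above; `summarize_progress` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: report the most recent "Global Step / Local Loss / Peers" line from stdin.
# summarize_progress is a hypothetical helper; the line format is taken from the sample output.
summarize_progress() {
  awk '
    /Global Step:/ {
      # Cut away everything up to each label; awk then reads the leading number.
      step = $0;  sub(/.*Global Step: /, "", step)
      loss = $0;  sub(/.*Local Loss: /, "", loss)
      peers = $0; sub(/.*Peers: /, "", peers)
      n++
      last = sprintf("step=%d loss=%.3f peers=%d", step + 0, loss + 0, peers + 0)
    }
    END { if (n) print last; else print "no progress lines found" }
  '
}

# Example using two lines from the sample output.
printf '%s\n' \
  'hivetrain-1 | Global Step: 0, Local Loss: 12.069, Peers: 0' \
  'hivetrain-1 | Global Step: 0, Local Loss: 11.852, Peers: 2' | summarize_progress
```

A falling loss together with a non-zero peer count is a reasonable sign that you have joined the run and training is moving.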
30 changes: 30 additions & 0 deletions Dockerfile
@@ -0,0 +1,30 @@
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04

LABEL sponsor="Hivetrain"

ENV DEBIAN_FRONTEND="noninteractive"

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        python3-dev \
        python3-pip \
        python3-packaging \
        python3-venv \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt requirements.txt

RUN pip install -r requirements.txt && \
    pip cache purge

COPY requirements.docker.txt requirements.docker.txt

RUN pip install -r requirements.docker.txt && \
    pip cache purge

COPY ./ /app

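# Exec form: the quoted shell form would be run as a single command named "bash ./entrypoint.sh".
ENTRYPOINT ["bash", "./entrypoint.sh"]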
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ Done : Train TINYGPT

## How Miners are Rewarded

Hivetrain uses a simmple score assignment system designed to reward users for their participation and adherence to network guidelines. The system evaluates two critical aspects of user behavior: responsiveness and loss values. By applying a set of predefined rules, we aim to foster a healthy and productive network environment where all participants are incentivized to contribute positively. Whilst maintaining network integrity with few gameable variables.
Hivetrain uses a simple score assignment system designed to reward users for their participation and adherence to network guidelines. The system evaluates two critical aspects of user behavior: responsiveness and loss values. By applying a set of predefined rules, we aim to foster a healthy and productive network environment where all participants are incentivized to contribute positively. Whilst maintaining network integrity with few gameable variables.

### 1.0
Users who actively respond to network activities and maintain their losses within an acceptable threshold are awarded a score of 1.0. This top score reflects exemplary user behavior and strict adherence to network standards, highlighting the user as a model participant.
32 changes: 32 additions & 0 deletions compose.yml
@@ -0,0 +1,32 @@
version: '3.9'

services:
  hivetrain:
    image: ghcr.io/bit-current/distributedtraining:latest
    entrypoint: bash ./entrypoint.sh
    restart: 'always'
    ipc: host
    network_mode: host
    tty: true
    stdin_open: true
    build:
      shm_size: '4gb'
      dockerfile: Dockerfile
    volumes:
      - ./neurons:/app/neurons
      - ./data:/data
      - ./wallets:/root/.bittensor/wallets
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
              count: all
    environment:
      NETUID: ${NETUID:-25}
      WALLETNAME: ${WALLETNAME:-default}
      WALLETHOTKEY: ${WALLETHOTKEY:-defaulthotkey}
      DHTPORT: ${DHTPORT:-42316}
      AXONPORT: ${AXONPORT:-42310}
    env_file:
      - .env
8 changes: 8 additions & 0 deletions entrypoint.sh
@@ -0,0 +1,8 @@
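#!/bin/bash

cd /app/neurons

# Pass each flag only when its variable is set, so an empty value
# (e.g. no INITIAL_PEERS when bootstrapping) does not swallow the next flag.
# Assumption: hiveminer.py has its own defaults for flags that are omitted.
args=()
# INITIAL_PEERS is left unquoted on purpose so several space-separated peers still split.
[ -n "${INITIAL_PEERS:-}" ] && args+=(--initial_peers ${INITIAL_PEERS})
[ -n "${BATCH_SIZE:-}" ] && args+=(--batch_size "${BATCH_SIZE}")
[ -n "${SAVE_EVERY:-}" ] && args+=(--save_every "${SAVE_EVERY}")

python3 hiveminer.py "${args[@]}"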
10 changes: 10 additions & 0 deletions example.env
@@ -0,0 +1,10 @@
NETUID=25
WALLETNAME='test'
WALLETHOTKEY='test'
WANDB_API_KEY=''
DHTPORT=42316
EXTERNALIP=104.202.156.242

CUDA_VISIBLE_DEVICES=0
INITIAL_PEERS="/p2p/12D3KooWCvMCCJDHQ7d9pfqqkxAPD6AZdAbcXPd1d9pWvQWDpqBi"
SAVE_EVERY=0