bit-current · Vectorrent · Mar 8, 2024 · Mar 9, 2024
diff --git a/.gitignore b/.gitignore
@@ -160,4 +160,9 @@ cython_debug/
 #.idea/
 bittensor-subnet-template/
 wandb/
-.vscode/
+.vscode/
+
+data
+wallets
+lightning_logs
+.scale_batch_size*
diff --git a/DOCKER.md b/DOCKER.md
@@ -0,0 +1,75 @@
+# how to use hivetrain with docker
+
+## install dependencies
+
+1. [Docker](https://docs.docker.com/engine/install/)
+2. [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+
+## clone the repo
+```
+git clone https://github.com/LuciferianInk/DistributedTraining.git
+```
+
+## move into the repo
+```
+cd DistributedTraining
+```
+
+## check out the docker-setup branch
+```
+git checkout docker-setup
+```
+
+## build the docker image
+```
+docker compose build
+```
+
+## make a .env file
+Make a file called `.env` and place it in the root of this project. You can use `example.env` as a starting point.
+
+## make a choice
+At this point, you must make one of two choices:
+
+### 1. bootstrap
+If you intend to bootstrap a new training run, start the container:
+```
+docker compose up
+```
+
+### 2. join
+If you intend to join an existing training run, then add this environment variable to your `.env` file:
+```
+INITIAL_PEERS="/p2p/12D3KooWE1fyvZHhuc2UQqAN35oXgexHKRpVqgXKo9EUQ4hguny9"
+```
+After that, you may join the training run with:
+```
+docker compose up
+```
+
+## final notes
+
+Your machine will print your own peer ID to the console at startup. It should look like this:
+```
+PEER-ID: /p2p/12D3KooWF9KB7PVUdbct4ryCMzDjbNT1q2w5XMw9iVG6tisY4ThB
+```
+If Hivemind is under-utilizing your GPU (i.e. it's not using all of your available VRAM), you may try to increase the batch size. To do this, add this environment variable to your `.env` file, raising the value (2, 3, and so on) until VRAM is well used:
+```
+BATCH_SIZE=2
+```
+You will know that training is progressing when you see output like this:
+```
+hivetrain-1  | LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
+hivetrain-1  | 
+hivetrain-1  |   | Name  | Type            | Params
+hivetrain-1  | ------------------------------------------
+hivetrain-1  | 0 | model | GPT2LMHeadModel | 186 M 
+hivetrain-1  | ------------------------------------------
+hivetrain-1  | 186 M     Trainable params
+hivetrain-1  | 0         Non-trainable params
+hivetrain-1  | 186 M     Total params
+hivetrain-1  | 747.418   Total estimated model params size (MB)
+hivetrain-1  | Global Step: 0, Local Loss: 12.069, Peers: 0
+hivetrain-1  | Global Step: 0, Local Loss: 12.063, Peers: 1
+hivetrain-1  | Global Step: 0, Local Loss: 11.852, Peers: 2
+```
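Note on the progress output in DOCKER.md: the `Global Step / Local Loss / Peers` lines are regular enough to parse when you want to monitor a run from a script. The following is a minimal sketch, assuming the log format stays exactly as shown in the sample output; `parse_progress` is a hypothetical helper and is not part of this PR.

```python
import re

# Matches the progress lines shown in DOCKER.md. The "hivetrain-1  | " prefix
# that docker compose adds is ignored because we search anywhere in the line.
PROGRESS = re.compile(
    r"Global Step: (?P<step>\d+), "
    r"Local Loss: (?P<loss>\d+(?:\.\d+)?), "
    r"Peers: (?P<peers>\d+)"
)


def parse_progress(line):
    """Return (step, loss, peers) for a progress line, or None for any other line."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return int(match["step"]), float(match["loss"]), int(match["peers"])


sample = "hivetrain-1  | Global Step: 0, Local Loss: 11.852, Peers: 2"
print(parse_progress(sample))
```

A peer count above zero on these lines is the quickest sign that the node has actually joined other peers rather than training alone.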
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,30 @@
+FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04
+
+LABEL sponsor="Hivetrain"
+
+ENV DEBIAN_FRONTEND="noninteractive"
+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends \
+    git \
+    python3-dev \
+    python3-pip \
+    python3-packaging \
+    python3-venv \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+COPY requirements.txt requirements.txt
+
+RUN pip install -r requirements.txt && \
+    pip cache purge
+
+COPY requirements.docker.txt requirements.docker.txt
+
+RUN pip install -r requirements.docker.txt && \
+    pip cache purge
+
+COPY ./ /app
+
+ENTRYPOINT ["bash", "./entrypoint.sh"]
diff --git a/README.md b/README.md
@@ -61,7 +61,7 @@ Done : Train TINYGPT
 
 ## How Miners are Rewarded
 
-Hivetrain uses a simmple score assignment system designed to reward users for their participation and adherence to network guidelines. The system evaluates two critical aspects of user behavior: responsiveness and loss values. By applying a set of predefined rules, we aim to foster a healthy and productive network environment where all participants are incentivized to contribute positively. Whilst maintaining network integrity with few gameable variables.
+Hivetrain uses a simple score assignment system designed to reward users for their participation and adherence to network guidelines. The system evaluates two critical aspects of user behavior: responsiveness and loss values. By applying a set of predefined rules, we aim to foster a healthy and productive network environment where all participants are incentivized to contribute positively, while maintaining network integrity with few gameable variables.
 
 ### 1.0 
 Users who actively respond to network activities and maintain their losses within an acceptable threshold are awarded a score of 1.0. This top score reflects exemplary user behavior and strict adherence to network standards, highlighting the user as a model participant.

diff --git a/compose.yml b/compose.yml
@@ -0,0 +1,32 @@
+version: '3.9'
+
+services:
+  hivetrain:
+    image: ghcr.io/bit-current/distributedtraining:latest
+    entrypoint: bash ./entrypoint.sh
+    restart: 'always'
+    ipc: host
+    network_mode: host
+    tty: true
+    stdin_open: true
+    build:
+      shm_size: '4gb'
+      dockerfile: Dockerfile
+    volumes:
+      - ./neurons:/app/neurons
+      - ./data:/data
+      - ./wallets:/root/.bittensor/wallets
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - capabilities: ["gpu"]
+              count: all
+    environment:
+      NETUID: ${NETUID:-25}
+      WALLETNAME: ${WALLETNAME:-default}
+      WALLETHOTKEY: ${WALLETHOTKEY:-defaulthotkey}
+      DHTPORT: ${DHTPORT:-42316}
+      AXONPORT: ${AXONPORT:-42310}
+    env_file:
+      - .env
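Note on the `environment` block in compose.yml: `${NAME:-default}` falls back to the default when the variable is unset or empty. The sketch below mimics that rule in Python so the resulting values are easy to reason about. The defaults are copied from compose.yml; `interpolate` and `resolve` are hypothetical names used only for illustration.

```python
# Defaults copied from the environment block in compose.yml.
DEFAULTS = {
    "NETUID": "25",
    "WALLETNAME": "default",
    "WALLETHOTKEY": "defaulthotkey",
    "DHTPORT": "42316",
    "AXONPORT": "42310",
}


def interpolate(name, default, env):
    """Mimic ${NAME:-default}: use the default when NAME is unset OR empty."""
    value = env.get(name, "")
    return value if value != "" else default


def resolve(env):
    """Apply the compose.yml defaults to a mapping of provided variables."""
    return {name: interpolate(name, default, env) for name, default in DEFAULTS.items()}


resolved = resolve({"NETUID": "", "WALLETNAME": "test"})
print(resolved["NETUID"], resolved["WALLETNAME"], resolved["DHTPORT"])
```

Here the empty `NETUID` still resolves to `25`, which is the difference between `:-` and the plain `-` form (the latter only covers unset variables).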
diff --git a/entrypoint.sh b/entrypoint.sh
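@@ -0,0 +1,9 @@
+#!/bin/bash
+
+cd /app/neurons || exit 1
+
+# Pass each flag only when its variable is set (assumes hiveminer.py has defaults for omitted flags).
+python3 hiveminer.py \
+    ${INITIAL_PEERS:+--initial_peers "${INITIAL_PEERS}"} \
+    ${BATCH_SIZE:+--batch_size "${BATCH_SIZE}"} \
+    ${SAVE_EVERY:+--save_every "${SAVE_EVERY}"}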
diff --git a/example.env b/example.env
@@ -0,0 +1,10 @@
+NETUID=25
+WALLETNAME='test'
+WALLETHOTKEY='test'
+WANDB_API_KEY=''
+DHTPORT=42316
+EXTERNALIP=104.202.156.242
+
+CUDA_VISIBLE_DEVICES=0
+INITIAL_PEERS="/p2p/12D3KooWCvMCCJDHQ7d9pfqqkxAPD6AZdAbcXPd1d9pWvQWDpqBi"
+SAVE_EVERY=0
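Note on example.env: a malformed value in `.env` only surfaces once the container starts, so a quick check beforehand can save a rebuild cycle. The following is a minimal sketch, assuming the simple `KEY=VALUE` layout used in example.env and assuming that `NETUID`, `DHTPORT`, `BATCH_SIZE`, and `SAVE_EVERY` are meant to be non-negative integers. `parse_env` and `check_env` are hypothetical helpers, not part of this PR.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, as in WALLETNAME='test'.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def check_env(values):
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    # Assumption: these variables are expected to hold non-negative integers.
    for key in ("NETUID", "DHTPORT", "BATCH_SIZE", "SAVE_EVERY"):
        if key in values and not values[key].isdigit():
            problems.append(f"{key} should be a non-negative integer, got {values[key]!r}")
    peers = values.get("INITIAL_PEERS", "")
    if peers and not peers.startswith("/"):
        problems.append("INITIAL_PEERS should look like /p2p/<peer id>")
    return problems


example = "NETUID=25\nWALLETNAME='test'\nDHTPORT=42316\nSAVE_EVERY=0\n"
print(check_env(parse_env(example)))
```

In practice you would read the text from the `.env` file in the project root and refuse to run `docker compose up` when the returned list is non-empty.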