# autoresearch

Autonomous LLM pretraining research, driven by AI agents.

The idea: give an AI agent a small but real LLM training setup and let it run experiments overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model.

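The overnight loop described above can be sketched in a few lines. This is only an illustration of the keep-or-discard logic; `propose_patch`, `revert_patch`, and `run_training` are hypothetical stand-ins for "agent edits `train.py`", "undo the edit", and "train for 5 minutes and return val_bpb":

```python
def overnight_loop(n_experiments, propose_patch, revert_patch, run_training):
    """Sketch of the keep-or-discard experiment loop (hypothetical names)."""
    best_bpb = run_training()           # baseline val_bpb from current train.py
    log = [("baseline", best_bpb)]
    for _ in range(n_experiments):
        desc = propose_patch()          # agent modifies train.py
        bpb = run_training()            # fixed 5-minute budget, returns val_bpb
        if bpb < best_bpb:              # lower bits per byte is better
            best_bpb = bpb              # keep the change
            log.append((desc, bpb))
        else:
            revert_patch()              # discard the change
    return best_bpb, log
```

In the morning, `log` is the record of which changes survived.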
This implementation aims to be the simplest possible baseline, but it should be clear how one could adjust the `program.md` file to run more sophisticated research programs with more elaborate instructions. For example, the agent could run small side experiments while a training job is in progress.

## How it works

The repo is deliberately small and has only a few files:

- **`constants.py`** — fixed rules: sequence length, time budget, eval tokens. Not modified.
- **`prepare.py`** — one-time data prep (downloads training data, trains a BPE tokenizer) and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc.
- **`program.md`** — instructions for the agent. Point your agent here and let it go.

Training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation). The metric is **val_bpb** (validation bits per byte) — lower is better, and it is vocab-size-independent, so architectural changes are compared fairly.
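To illustrate why bits per byte is vocab-size-independent: the summed cross-entropy over the eval set (in nats) is converted to bits and normalized by the number of raw *bytes* of eval text, not the number of tokens. A minimal sketch, assuming you already have the summed loss and the byte count (this is not the repo's exact code):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over an eval set to bits per byte.

    Normalizing by bytes rather than tokens means a model with a larger
    vocabulary (fewer tokens per byte) is compared fairly against one
    with a smaller vocabulary.
    """
    return total_nll_nats / (math.log(2) * total_bytes)
```

For example, a summed loss of `1000 * ln(2)` nats over 1000 bytes gives exactly 1.0 bpb, regardless of how those bytes were tokenized.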

## Quick start

**Requirements:** A single NVIDIA GPU (tested on H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).

```bash
# 1. Install dependencies
uv sync

# 2. Download data and train tokenizer (one-time, ~5 min)
uv run prepare.py

# 3. Run a single training experiment (5 min + startup)
uv run train.py
```

## Running the agent

Simply spin up Claude/Codex or whichever agent you prefer in this repo, then say something like:

```
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
```

The `program.md` file is essentially a super lightweight "skill".

## Project structure

```
constants.py — fixed constants (do not modify)
prepare.py — data prep + runtime utilities (do not modify)
train.py — model, optimizer, training loop (agent modifies this)
program.md — agent instructions
spawn.sh — multi-agent launcher
pyproject.toml — dependencies
```

## Design choices

- **Single file to modify.** The agent only touches `train.py`. This keeps the scope manageable and diffs reviewable.
- **Fixed time budget.** Training always runs for exactly 5 minutes. This makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.).
- **BPB metric.** Bits per byte is independent of tokenizer vocabulary size, so the agent could in principle change the vocab size and still get a fair comparison.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
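
The fixed-time-budget rule can be enforced with a simple wall-clock check inside the training loop. A minimal sketch, not the repo's actual implementation (the real `train.py` also excludes startup/compilation time from the clock):

```python
import time

TIME_BUDGET_S = 5 * 60  # fixed 5-minute budget (the real value lives in constants.py)

def train_with_budget(step_fn, budget_s=TIME_BUDGET_S):
    """Run training steps until the wall-clock budget is exhausted.

    step_fn() performs one training step; start the clock only after any
    warmup/compilation so that steady-state training alone counts against
    the budget.
    """
    start = time.perf_counter()
    steps = 0
    while time.perf_counter() - start < budget_s:
        step_fn()
        steps += 1
    return steps
```

Because the budget is wall-clock time rather than a step count, a change that makes each step faster automatically gets more steps, so throughput improvements and quality improvements compete on equal footing.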