A 5-part self-study workshop covering reinforcement learning concepts as applied to LLM alignment, from foundational RL through modern methods like DPO and GRPO.
| Part | Notebook | Topic | Key Concepts |
|---|---|---|---|
| 1 | 01_rl_foundations.ipynb | RL Foundations | Bandits, policies, REINFORCE, baselines, variance reduction |
| 2 | 02_policy_gradients_text.ipynb | Policy Gradients for Text | LLMs as policies, sentiment steering, KL divergence, mode collapse |
| 3 | 03_rlhf_pipeline.ipynb | RLHF Pipeline | Reward models, Bradley-Terry, PPO with trl |
| 4 | 04_dpo.ipynb | Direct Preference Optimization | DPO derivation, from-scratch implementation, DPO vs PPO |
| 5 | 05_grpo_frontier.ipynb | GRPO & the Frontier | GRPO, verifiable rewards, RLAIF, online DPO, DeepSeek-R1 |
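
As a rough preview of the preference objectives named in Parts 3 and 4, here is a minimal sketch in plain Python. It is not taken from the notebooks: it assumes scalar rewards and summed sequence log-probabilities as inputs, and the function names and the default `beta=0.1` are illustrative choices.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss under Bradley-Terry: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, tune per experiment
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards are log-ratios of the policy against the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference model, both margins are zero and the DPO loss is log 2, the same value the Bradley-Terry loss takes for two equal rewards.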
- Solid ML/deep learning background
- Familiarity with PyTorch and HuggingFace transformers
- No prior RL knowledge required
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook notebooks/

Designed to run locally on macOS with Apple Silicon (MPS). All notebooks use GPT-2 124M, which fits comfortably in memory. Code auto-detects MPS/CUDA/CPU.
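
Device auto-detection of this kind can be sketched as below. This is an assumption about how it might look, not the notebooks' actual code: the helper names `choose_device_name` and `detect_device` are hypothetical, and so is the priority order (CUDA, then MPS, then CPU). PyTorch is imported lazily so the priority rule can be checked without it installed.

```python
def choose_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Pure priority rule (assumed order): CUDA, then MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device():
    """Ask PyTorch which backends are available and return a torch.device."""
    import torch  # lazy import: only needed when actually probing hardware

    # Older PyTorch builds have no torch.backends.mps, so guard the lookup.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return torch.device(choose_device_name(torch.cuda.is_available(), mps_ok))
```

A typical use would be `device = detect_device()` followed by `model.to(device)`.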
Each notebook builds on the previous. The arc is:
Bandits (toy) → Policy Gradients (theory) → Text as RL (bridge) → RLHF/PPO (classic) → DPO (modern) → GRPO (frontier)
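
The first step of this arc, a policy gradient on a toy bandit, can be sketched in plain Python. This is an illustrative version under stated assumptions, not the Part 1 notebook code: the function name `reinforce_bandit`, the Bernoulli arms, the running-mean baseline, and all hyperparameters are hypothetical choices.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_probs=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-mean baseline on a Bernoulli bandit.

    arm_probs[k] is the probability that arm k pays reward 1.
    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_probs)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        # Subtracting the baseline reduces variance without biasing the gradient.
        advantage = reward - baseline
        for k in range(len(logits)):
            # d/d logit_k of log pi(action) for a softmax policy.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        # Incremental mean of all rewards seen so far.
        baseline += (reward - baseline) / t
    return softmax(logits)
```

Calling `reinforce_bandit()` returns the learned action probabilities; the policy should shift toward the arm with the higher payout probability.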
- Parts 3-4: Anthropic HH-RLHF (human preference data, loaded from HuggingFace)
- Part 5: GSM8K (grade school math, loaded from HuggingFace)
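
GSM8K reference solutions end with a `#### <answer>` line, which is what makes the reward verifiable. The sketch below shows one way a correctness reward and GRPO-style group-relative advantages might be computed. It is not the Part 5 notebook code: the function names, the last-number extraction heuristic, and the use of the population standard deviation with a small `eps` are all assumptions.

```python
import re
import statistics
from typing import List, Optional

# Matches integers and decimals, allowing thousands separators like 1,000.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(reference: str) -> str:
    """Return the text after the final '####' marker of a GSM8K reference."""
    return reference.split("####")[-1].strip().replace(",", "")


def final_number(completion: str) -> Optional[str]:
    """Heuristic (assumed): treat the last number in the completion as the answer."""
    matches = NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def correctness_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference numerically, else 0.0."""
    predicted = final_number(completion)
    if predicted is None:
        return 0.0
    try:
        return 1.0 if float(predicted) == float(gold_answer(reference)) else 0.0
    except ValueError:
        # Reference answer was not numeric; give no reward.
        return 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.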
- torch: all implementations
- transformers: GPT-2 and sentiment classifier
- trl: PPOTrainer, DPOTrainer
- datasets: HH-RLHF, GSM8K