Reinforcement Learning for LLM Alignment Workshop

A five-part self-study workshop on reinforcement learning for LLM alignment, from foundational RL through modern methods such as DPO and GRPO.

Workshop Structure

Part	Notebook	Topic	Key Concepts
1	`01_rl_foundations.ipynb`	RL Foundations	Bandits, policies, REINFORCE, baselines, variance reduction
2	`02_policy_gradients_text.ipynb`	Policy Gradients for Text	LLMs as policies, sentiment steering, KL divergence, mode collapse
3	`03_rlhf_pipeline.ipynb`	RLHF Pipeline	Reward models, Bradley-Terry, PPO with trl
4	`04_dpo.ipynb`	Direct Preference Optimization	DPO derivation, from-scratch implementation, DPO vs PPO
5	`05_grpo_frontier.ipynb`	GRPO & the Frontier	GRPO, verifiable rewards, RLAIF, online DPO, DeepSeek-R1
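
As a taste of the Part 1 material, here is a minimal sketch of REINFORCE with a running-mean baseline on a two-armed Bernoulli bandit. It is not taken from the notebooks; the function names, arm probabilities, learning rate, and step count are all illustrative assumptions, and it uses only the standard library so it runs without PyTorch.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_probs, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy on a Bernoulli bandit (hypothetical helper).

    arm_probs[k] is the chance that arm k pays a reward of 1.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_probs)
    baseline = 0.0  # running mean of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        # Subtracting the baseline reduces variance without biasing the gradient.
        advantage = reward - baseline
        for k in range(len(logits)):
            # d/d logit_k of log pi(action) = 1[k == action] - pi(k)
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        baseline += (reward - baseline) / t
    return softmax(logits)


final_probs = reinforce_bandit([0.2, 0.8])
print(final_probs)
```

With these settings the policy should end up favouring the second arm; setting the baseline to zero throughout is a quick way to see how much noisier the updates become.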

Prerequisites

Solid ML/deep learning background
Familiarity with PyTorch and Hugging Face transformers
No prior RL knowledge required

Setup

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook notebooks/

Hardware

The notebooks are designed to run locally on macOS with Apple Silicon (MPS). All of them use GPT-2 (124M parameters), which fits comfortably in memory, and the code auto-detects MPS, CUDA, or CPU.
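
Device auto-detection could look roughly like the sketch below. The helper names `pick_device` and `detect_device` are hypothetical, and the MPS, then CUDA, then CPU preference order is an assumption; the notebooks may do this differently. The `torch` import is guarded so the snippet still runs (and returns `"cpu"`) where PyTorch is not installed.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Choose a device name from availability flags (assumed order: MPS, CUDA, CPU)."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends, falling back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(mps_ok, torch.cuda.is_available())


device = detect_device()
print(f"Using device: {device}")
```

A model or tensor can then be moved with `.to(device)`.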

Progression

Each notebook builds on the previous one. The arc is:

Bandits → Policy Gradients → Text as RL → RLHF/PPO → DPO → GRPO
  (toy)      (theory)         (bridge)    (classic)   (modern) (frontier)
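
To make the last three stages of the arc concrete, the sketch below writes out the core quantity behind each one for a single example: the Bradley-Terry reward-model loss, the DPO loss, and GRPO's group-normalized advantages. These are the standard textbook forms in plain Python, not the notebooks' code; the function names, the `beta` default, and the use of the population standard deviation are assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


print(bradley_terry_loss(2.0, 0.0))
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note that the DPO loss has the same shape as the Bradley-Terry loss, with the scaled log-ratio against the reference model standing in for the reward; GRPO instead replaces a learned value baseline with the group mean.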

Datasets Used

Parts 3-4: Anthropic HH-RLHF (human preference data, loaded from Hugging Face)
Part 5: GSM8K (grade school math, loaded from Hugging Face)
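
GSM8K reference solutions end with a line of the form `#### <number>`, which is what makes the dataset a natural fit for the verifiable rewards covered in Part 5. The sketch below shows one way such a reward could be computed. It assumes the model is prompted to finish its completion in the same `####` format; the function names and the exact-match 0/1 scoring are illustrative choices, not the notebook's implementation.

```python
import re

# Matches the final-answer marker, allowing a sign, thousands separators, and decimals.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text):
    """Return the number after '####' as a string without commas, or None."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def correctness_reward(completion, reference):
    """Return 1.0 if the completion's final answer equals the reference's, else 0.0."""
    predicted = extract_final_answer(completion)
    target = extract_final_answer(reference)
    if predicted is None or target is None:
        return 0.0
    return 1.0 if float(predicted) == float(target) else 0.0


reference = "She sold 48 + 24 = 72 clips in total.\n#### 72"
print(correctness_reward("48 plus 24 is 72.\n#### 72", reference))
print(correctness_reward("I think the answer is 70.\n#### 70", reference))
```

Because the reward is computed by a rule rather than a learned model, it cannot be gamed the way a reward model can, although a completion that omits the marker simply scores zero.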

Key Libraries

torch — all implementations
transformers — GPT-2 and sentiment classifier
trl — PPOTrainer, DPOTrainer
datasets — HH-RLHF, GSM8K
