Decode AI from first principles. No black boxes. No hand-waving.
This repository is built on a simple belief: you cannot truly master AI by calling `model.fit()`. To understand how modern AI systems actually work, you need to build them from scratch — derive the math, implement the algorithms, and watch the gradients flow.
Every notebook in this repository dissects a core AI concept by implementing it from the ground up using raw PyTorch and NumPy. We go from the bias-variance tradeoff all the way to building GPT, LLaMA, DeepSeek, and GRPO — the same algorithm behind DeepSeek-R1. If a concept matters, we don't just explain it. We build it, break it, and rebuild it until the intuition is earned.
"What I cannot create, I do not understand." — Richard Feynman
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Data Processing | Bias-variance tradeoff, feature scaling, data splitting, and preprocessing pipelines |
| 02 | Regression | Linear and polynomial regression — cost functions, gradient descent, and regularization (see the sketch below) |
| 03 | Classification | Logistic regression, SVMs, decision trees, random forests, and ensemble methods |
| 04 | Clustering | K-Means, DBSCAN, hierarchical clustering — algorithms, objective functions, and evaluation |
| 05 | Dimension Reduction | PCA derivation, eigenvalue decomposition, and t-SNE for high-dimensional data |
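To give a flavor of the from-scratch approach, here is a minimal NumPy sketch of the gradient-descent loop the Regression notebook derives; the toy data and learning rate are illustrative, not the notebook's actual setup:

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise (illustrative, not the notebook's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

# Append a bias column so the intercept is learned as a weight
Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (100, 2)
w = np.zeros(2)

lr = 0.1
for _ in range(500):
    pred = Xb @ w                           # forward pass: predictions
    grad = 2 * Xb.T @ (pred - y) / len(y)   # gradient of the MSE cost
    w -= lr * grad                          # gradient descent step

print(w)  # converges to roughly [3.0, 2.0]
```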
## 02 - Deep Learning Foundation
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Neural Network Foundations | NumPy vectorization, broadcasting, forward/backward pass from scratch |
| 02 | Activation Functions | Sigmoid, tanh, ReLU, GELU — why activations matter, saturation, and dying neurons |
| 03 | Weight Initialization | Why zero init fails, variance explosion/vanishing, Xavier and He initialization proofs |
| 04 | Normalization | Batch norm, layer norm, group norm — internal covariate shift and loss landscape smoothing |
| 05 | Regularization | L2 weight decay, dropout, early stopping — fighting overfitting with math |
| 06 | Residual Connection | The degradation problem, skip connections, and why deeper networks can fail without them |
| 07 | Loss Function | BCE, cross-entropy derivations — why sigmoid+BCE and softmax+CE produce clean gradients |
| 08 | Optimizer | SGD, momentum, RMSProp, Adam — from vanilla gradient descent to adaptive learning rates (see the sketch below) |
| 09 | Model Classification | End-to-end image classification on CIFAR-10, applying all the foundations above |
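As one example from this section, a minimal NumPy sketch of the Adam update covered in the Optimizer notebook; the hyperparameters are the usual published defaults, and the quadratic objective is just a stand-in:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: running moments, bias correction, adaptive step."""
    m = b1 * m + (1 - b1) * g        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2     # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)          # bias correction for zero init
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 5)^2 as a stand-in objective
w, m, v = np.array(0.0), 0.0, 0.0
for t in range(1, 5001):
    g = 2 * (w - 5)                  # analytic gradient
    w, m, v = adam_step(w, g, m, v, t, lr=0.01)
print(w)  # converges to roughly 5.0
```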
## 03 - Large Language Model
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Vanilla RNN | RNN cell from scratch — hidden states, BPTT, vanishing/exploding gradients (see the sketch below) |
| 02 | Recurrent Classifier | Sentiment classification on IMDb using RNN/LSTM with padding and packing |
| 03 | RNN with Attention | Seq2seq bottleneck problem, Bahdanau attention for date format translation |
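A minimal PyTorch sketch of the vanilla RNN cell from notebook 01; the dimensions are made up, and a real implementation would also backpropagate through time:

```python
import torch

def rnn_cell(x_t, h_prev, W_xh, W_hh, b_h):
    """Single vanilla RNN step: h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b)."""
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Hypothetical sizes: input dim 8, hidden dim 16, sequence of 5 steps
D, H, T = 8, 16, 5
W_xh = torch.randn(D, H) * 0.1
W_hh = torch.randn(H, H) * 0.1
b_h = torch.zeros(H)

h = torch.zeros(1, H)
for t in range(T):
    x_t = torch.randn(1, D)
    h = rnn_cell(x_t, h, W_xh, W_hh, b_h)  # the hidden state carries the history
print(h.shape)  # torch.Size([1, 16])
```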
| # | Notebook | Description |
|-----|----------|-------------|
| A01 | Pretrained Model - HuggingFace | Using HuggingFace pipelines and pretrained models for text classification |
| A02 | Attention Mechanism | Bahdanau vs Luong attention — the information bottleneck and its solution |
| A03 | Transformer | Full transformer architecture from scratch — multi-head attention, positional encoding, encoder-decoder |
| B01 | BERT | Bidirectional encoder — WordPiece tokenization, MLM, NSP, and the fine-tuning paradigm |
| B02 | ColBERT | Late interaction retrieval — MaxSim scoring, query augmentation, token-level matching |
| C01 | nanoGPT | GPT-2 from scratch — byte-level BPE tokenization, causal self-attention, autoregressive decoding (see the sketch below) |
| C02 | LLaMA | LLaMA architecture deep dive — RMSNorm, RoPE, SwiGLU, grouped-query attention |
| C03 | Mistral MoE | Mixture of Experts — sparse routing, expert parallelism, sliding window attention |
| C04 | DeepSeek | Multi-head Latent Attention (MLA) — 24x KV-cache reduction via low-rank compression |
| C05 | Qwen | Advanced RoPE scaling — Position Interpolation, NTK-Aware, Dynamic NTK, YaRN |
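To make the C01 row concrete, a minimal single-head causal self-attention in PyTorch; the single head and missing output projection are simplifications of what the notebook builds:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head causal attention: each token attends only to its past."""
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / d**0.5                         # scaled dot products
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    return F.softmax(scores, dim=-1) @ v              # weighted sum of values

d_model = 32
x = torch.randn(10, d_model)                          # 10 token embeddings
W = [torch.randn(d_model, d_model) / d_model**0.5 for _ in range(3)]
out = causal_self_attention(x, *W)
print(out.shape)  # torch.Size([10, 32])
```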
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Text Embedding | Cosine similarity, dot product, L2 distance — similarity metrics and embedding spaces (see the sketch below) |
| 02 | HNSW | Approximate nearest neighbors — HNSW, Product Quantization, IVF for vector search |
| 03 | Topic Modeling | Discovering latent topics from text corpora |
| 04 | NER | Named Entity Recognition — BIO tagging, CoNLL-2003, token classification |
| 05 | RAG | Retrieval-Augmented Generation — chunking strategies, vector stores, retrieval pipeline |
| 06 | Advanced RAG | Advanced retrieval techniques — re-ranking, hybrid search, query transformation |
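A minimal NumPy sketch of the similarity search idea behind the Text Embedding notebook; the random corpus stands in for real sentence embeddings:

```python
import numpy as np

def cosine_top_k(query, corpus, k=3):
    """Rank corpus vectors by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = C @ q                          # cosine = dot product of unit vectors
    return np.argsort(-sims)[:k], np.sort(sims)[::-1][:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))    # e.g., 384-dim sentence embeddings
query = rng.normal(size=384)
idx, scores = cosine_top_k(query, corpus)
print(idx, scores)                       # top-3 indices and their similarities
```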
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Instruction Tuning | Fine-tuning Pythia-2.8B on Dolly 15k — prompt formatting, loss masking on response tokens |
| 02 | SFT | Supervised Fine-Tuning — the first step after pretraining in the LLM pipeline |
| 03 | Reward Model | ORM vs PRM — Bradley-Terry loss, step-level credit assignment for reasoning |
| 04 | DPO vs ORPO and SimPO | Direct Preference Optimization — aligning LLMs with human preferences without RL |
| 05 | GRPO with RLVR | Group Relative Policy Optimization — the algorithm behind DeepSeek-R1, with verifiable rewards |
| 06 | PEFT (LoRA / QLoRA) | LoRA and QLoRA from scratch — low-rank adaptation, 4-bit quantization, 99.6% fewer parameters (see the sketch below) |
| 07 | Abliteration | Mechanistic interpretability — finding and removing the refusal direction in activation space |
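A minimal PyTorch sketch of the LoRA idea from the PEFT notebook: freeze the pretrained weights and train only a low-rank update. The rank and alpha values are common defaults, not the notebook's settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable}/{total}")  # the adapters are a tiny fraction
```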
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Distillation | Knowledge distillation — response-level SFT, logit-level KD, rejection sampling |
| 02 | Model Pruning | Unstructured and structured pruning — magnitude-based, layer-wise, and global strategies |
| 03 | Quantization | FP32 to INT4 — numeric formats, quantization schemes, memory-accuracy tradeoffs (see the sketch below) |
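A minimal sketch of symmetric per-tensor INT8 quantization, the simplest of the schemes the Quantization notebook compares:

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q."""
    scale = w.abs().max() / 127          # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
err = (w - dequantize(q, scale)).abs().mean()
# 4 bytes per FP32 value vs 1 byte per INT8 value
print(f"mean abs error: {err:.5f}, memory: {w.numel() * 4} -> {q.numel()} bytes")
```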
| # | Notebook | Description |
|-----|----------|-------------|
| A01 | LLM Prompting | System prompt design patterns — persona, task-specific, guard-rails, few-shot, chain-of-thought |
| A02 | LangChain | LangChain fundamentals — data loaders, splitters, vectorstores, embeddings, retrieval chains |
| A03 | Agent Harness | The agent loop primitive — tool calling, finish reasons, state management from scratch (see the sketch below) |
| A04 | Agent Gateway | Intelligence layers — BASE, IDENTITY, SOUL, MEMORY, SKILLS, TOOLS, CONTEXT, HEARTBEAT |
| A05 | Agent Operation | Production observability — logs, metrics, traces, cost attribution, latency profiling |
| A06 | Self Learning Loop | Reflexion and verbal gradients — self-critique, reflection injection, iterative improvement |
| B01 | LangGraph Agent | Cyclic state graphs with LangGraph — conditional edges, tool routing, RAG agents, LangSmith tracing |
| B02 | Claude Code | Reverse-engineering Claude Code's agent loop — stop reasons, tool execution, harness internals |
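A minimal sketch of the agent-loop primitive from A03; the tool registry and the stub model are hypothetical stand-ins for a real LLM client:

```python
import json

# Hypothetical toolbox; a real harness would let an LLM choose the tool.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def agent_loop(model, user_msg, max_steps=5):
    """Core agent primitive: call the model, execute tool calls, feed results back."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model(messages)                       # model returns a dict
        if reply.get("finish_reason") == "stop":
            return reply["content"]                   # final answer, loop ends
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "max steps exceeded"

# Stub model: asks for one tool call, then echoes the result and stops
def stub_model(messages):
    if messages[-1]["role"] == "user":
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"finish_reason": "stop", "content": messages[-1]["content"]}

print(agent_loop(stub_model, "What is 2 + 3?"))  # {"result": 5}
```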
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Generation | Text generation from scratch — KV-cache, sampling strategies, batched/continuous batching, speculative decoding (see the sketch below) |
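A minimal PyTorch sketch of temperature plus top-k sampling, two of the strategies the Generation notebook covers; the cutoff and temperature values are common defaults:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=50):
    """Sample the next token: scale by temperature, keep top-k, draw from softmax."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = F.softmax(topk_vals, dim=-1)              # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)  # stochastic draw
    return topk_idx[choice]

vocab_size = 1000
logits = torch.randn(vocab_size)   # stand-in for a model's final-layer output
print(sample_next(logits).item())  # sampled token id
```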
| # | Notebook | Description |
|----|----------|-------------|
| 01 | LLM Evaluation | Perplexity, intrinsic evaluation metrics — measuring how well a model predicts text (see the sketch below) |
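Perplexity has a one-line definition, sketched here in PyTorch on stand-in logits:

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    """Perplexity = exp(mean token-level cross-entropy)."""
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return torch.exp(nll)

# Stand-in values: 20 tokens, vocabulary of 100
logits = torch.randn(20, 100)
targets = torch.randint(0, 100, (20,))
print(perplexity(logits, targets).item())  # high for random logits; a trained LM drives this down
```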
| # | Notebook | Description |
|----|----------|-------------|
| 01 | CNN Foundations | Convolutions from scratch in NumPy — zero-padding, forward/backward pass, pooling, and gradient derivations |
| 02 | CNN Architecture | Baseline CNN to ResNet on FashionMNIST — vanishing gradients and why skip connections work |
| 03 | Transfer Learning | Feature extraction vs fine-tuning on CIFAR-10 — ImageNet normalization, layer freezing strategies |
| 04 | Object Detection | IoU, NMS, anchor box assignment, and YOLO output decoding — built from scratch (see the sketch below) |
| 05 | Image Segmentation | U-Net encoder/decoder from scratch — skip connections, pixel-wise loss, SegFormer inference |
| 06 | Metric Learning | Siamese networks, contrastive loss, triplet loss, and FaceNet — embedding spaces for unseen classes |
| 07 | Vision Transformers | ViT, DeiT, and Swin from scratch — patch embedding, multi-head self-attention, hierarchical windows |
| 08 | Contrastive Learning | SimCLR, CLIP, and DINOv2 — self-supervised pretraining with NT-Xent loss |
| 09 | Diffusion Model | DDPM, DDIM, Stable Diffusion — forward/reverse process, noise schedules, ContextUNet |
| 10 | Model Explainability | Saliency maps, GradCAM, Integrated Gradients, and SHAP on ResNet-50 |
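As a taste of the from-scratch style, a minimal IoU implementation matching the Object Detection row; the [x1, y1, x2, y2] box format is an assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)   # intersection over union

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175, about 0.143
```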
| # | Notebook | Description |
|----|----------|-------------|
| 01 | Bridge Architecture | Connecting frozen ViT to frozen LLM — LLaVA projectors, Flamingo Perceiver, BLIP-2 Q-Former, MoE bridges (see the sketch below) |
| 02 | Vision Language Model | Qwen-VL style VLM from scratch — TinyViT, MLP projector, mRoPE, visual token insertion, Stage 1 training |
| 03 | Instruction Tuning | Stage 2–3 VLM training — visual instructions, multi-turn dialog, RLHF-V for hallucination reduction |
| 04 | Reasoning & Inference | VLM inference pipeline — decoding strategies, streaming, chain-of-thought, visual grounding, evaluation |
| 05 | Audio & Speech | Waveforms to Mel spectrograms from scratch — audio encoders, Whisper, Phi-4 multimodal speech |
| 06 | Video | Video understanding — spatial-temporal attention, ViViT, dynamic FPS sampling, text-timestamp alignment |
| 07 | Visual Agent & Computer Use | VLMs that act on GUIs — perception, planning, action loops, computer-use agents |
| 08 | Native Multimodal | Any-to-any unified token spaces — Chameleon, Transfusion, Emu3, Janus Pro with VQ-VAE tokenizers |
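A minimal PyTorch sketch of the projector idea behind the Bridge Architecture row: a small MLP maps frozen ViT patch features into the LLM's embedding space. All dimensions here are hypothetical:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaVA-style bridge: map frozen ViT patch features into the LLM embedding space."""
    def __init__(self, vit_dim=768, llm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):    # (batch, num_patches, vit_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim) "visual tokens"

# 196 patch features from a hypothetical frozen ViT, projected to LLM width
visual_tokens = VisionProjector()(torch.randn(1, 196, 768))
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])
```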
| Topic | Description |
|-------|-------------|
| Training Strategy | Training data curation, loss functions, distributed training, and GPU programming |
| Model Serving | vLLM, PagedAttention, autoscaling, and production deployment |
| MLOps | Experiment tracking, model versioning, CI/CD for ML, monitoring, and drift detection |
| LLM Benchmarks | MMLU, HumanEval, GSM8K — standardized evaluation and leaderboard methodology |
| AI Governance | Red teaming, toxicity benchmarks, bias evaluation, hallucination detection |
More work is coming. This repository is actively maintained and expanding as the field evolves.