A complete, end-to-end text-to-image diffusion pipeline trained from scratch on a single T4 GPU. ~32% the parameter count of Stable Diffusion v1.5, implementing modern Flow Matching instead of DDPM.
This model generates mountains, sunsets, and landscape scenery reasonably well. It fails on essentially every other concept: people, animals, objects, abstract ideas, text, you name it.
This is an unexpected outcome of our dataset choices during training. The LAION subset we used (~168K images) skewed heavily toward natural landscape photography, and our caption quality was insufficient to teach the model broader semantic understanding.
We are starting a new project that trains a fresh VAE, CLIP, and UNet (same architecture from scratch) on a curated, high-quality dataset of 2 million image-caption pairs with significantly better caption diversity and quality. This repo documents v1, the learning exercise.
Mini SD is a three-stage pipeline, conceptually identical to Stable Diffusion but built entirely from scratch:
```
Text Prompt
     │
     ▼
┌─────────┐
│  CLIP   │   Text Encoder (ViT-B/32 equivalent)
│ Encoder │   Produces a 512-dim embedding
└────┬────┘
     │ conditioning (cross-attention)
     ▼
┌─────────┐   ┌─────────┐
│  UNet   │◄──│  Noise  │   Pure Gaussian noise in latent space
│         │   └─────────┘
│  (Flow  │   Iteratively denoises over N steps
│Matching)│   using Euler integration
└────┬────┘
     │
     ▼
┌─────────┐
│   VAE   │   Decodes 64×64 latent → 512×512 RGB image
│ Decoder │
└─────────┘
     │
     ▼
 512×512 PNG
```
| Component | Parameters | Architecture | Role |
|---|---|---|---|
| VAE | 3.13 M | ResNet-based encoder/decoder, 512px → 64px (8× spatial compression), 4-channel latent | Compresses pixel space into a compact latent space for efficient diffusion |
| CLIP | 101.5 M | ViT-B/32 equivalent: 12-layer transformer, 512-dim embeddings, 77-token max sequence | Encodes text prompts into conditioning vectors via contrastive pretraining |
| UNet | 230.85 M | 192 base channels, channel multipliers (1, 2, 2, 4), 2 ResBlocks per level, cross-attention at 16×16 and 8×8 resolutions | The diffusion "brain": learns the velocity field to denoise latents |
| Total | ~335.5 M | End-to-end pipeline | ~32% the size of Stable Diffusion v1.5 (~1B total) |
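The table's "8× compression" is spatial, per side; counting the channel change too, the VAE shrinks the data the UNet must denoise by a much larger factor. A quick back-of-the-envelope check:

```python
# Pixel space vs latent space for the VAE described in the table above.
image_shape = (3, 512, 512)    # RGB input
latent_shape = (4, 64, 64)     # 4-channel latent after encoding

spatial_factor = image_shape[1] // latent_shape[1]          # per-side downsampling
pixels = image_shape[0] * image_shape[1] * image_shape[2]   # 786,432 values
latents = latent_shape[0] * latent_shape[1] * latent_shape[2]  # 16,384 values
print(spatial_factor, pixels // latents)  # prints: 8 48
```

So diffusion runs over 48× fewer values than raw pixels, which is what makes training on a single T4 feasible at all.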
Unlike most tutorials that implement DDPM, this project uses Flow Matching (Lipman et al., 2023):
- Forward process: linear interpolation → x_t = (1 - t) * noise + t * data
- Target: the model learns the velocity v = data - noise, not the noise ε
- Sampling: simple Euler integration → x_{t+dt} = x_t + v * dt
- Benefit: straighter trajectories → high-quality samples in 20 steps vs. 50–1000 for DDPM
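The forward process and regression target fit in a few lines. This is an illustrative sketch, not the repo's training code:

```python
import numpy as np

def flow_matching_pair(data, noise, t):
    """Forward interpolation and velocity target for flow matching.

    x_t moves linearly from pure noise (t = 0) to data (t = 1);
    the regression target v = data - noise is the same at every t.
    """
    x_t = (1.0 - t) * noise + t * data
    v = data - noise
    return x_t, v
```

During training, t is sampled uniformly per example and the UNet is regressed onto v with an MSE loss; because v is constant along each path, one full-length Euler step carries noise exactly to data.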
Classifier-Free Guidance (CFG) is implemented by concatenating unconditional and conditional embeddings in a single forward pass, then interpolating the predicted velocities.
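A sketch of that guided sampling loop, with a stand-in `model` in place of the UNet (NumPy here purely for illustration; the stacked batch and the guidance formula are the points being shown):

```python
import numpy as np

def sample_cfg(model, x, text_emb, null_emb, steps=20, cfg_scale=7.5):
    """Euler integration with classifier-free guidance.

    `model(x_batch, emb_batch, t)` stands in for the UNet and returns
    stacked velocity predictions for the two branches. Unconditional and
    conditional inputs are stacked into one batch so a single forward
    pass serves both.
    """
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x_pair = np.stack([x, x])                  # duplicate latent
        emb_pair = np.stack([null_emb, text_emb])  # uncond + cond embeddings
        v_uncond, v_cond = model(x_pair, emb_pair, t)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)  # guidance interpolation
        x = x + v * dt                             # Euler step
        t += dt
    return x
```

At cfg_scale = 1 this reduces to plain conditional sampling; larger scales push the trajectory further along the direction the text conditioning adds.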
```
mini-sd/
├── README.md
├── .gitignore
│
├── backend/
│   ├── architecture.py           # All model definitions: VAE, CLIP, UNet, Tokenizer, Scheduler
│   ├── config.py                 # Unified config for model hyperparameters & server settings
│   ├── main.py                   # FastAPI server: /generate, /health, /config endpoints
│   ├── model_manager.py          # Checkpoint loading, inference pipeline, image encoding
│   ├── logger.py                 # Structured logging utility
│   ├── standalone_inference.py   # Self-contained script for inference without the API server
│   ├── utils.py                  # Image ↔ base64 helpers, GPU memory info
│   └── requirements.txt          # Python dependencies
│
├── frontend/                     # Web UI (connects to FastAPI backend)
│
└── metadata/
    ├── training_script.ipynb     # Full Google Colab training notebook (VAE → CLIP → UNet)
    ├── dataset_samples.png       # Sample images from the LAION training subset
    ├── clip/                     # CLIP training metrics & sample outputs
    ├── unet/                     # UNet training metrics & generated samples
    └── vae/                      # VAE training metrics & reconstruction comparisons
```
- Python 3.12
- A mid-to-high-tier CPU; a CUDA-capable GPU with 4 GB+ VRAM is recommended for inference
- Model checkpoints (VAE, CLIP, UNet, Tokenizer; not included in the repo due to size)
```
git clone https://github.com/dheeren-tejani/mini-sd
cd mini-sd/backend
pip install -r requirements.txt
```

Note: `requirements.txt` uses `--extra-index-url https://download.pytorch.org/whl/cu121` for CUDA 12.1 builds of PyTorch.
Edit backend/config.py or set environment variables:
```
# In config.py, ServerConfig defaults:
vae_path: "./models/vae/vae_final.pt"
clip_path: "./models/clip/clip_final.pt"
unet_path: "./models/unet/unet_step_017000.pt"
tokenizer_path: "./models/tokenizer/tokenizer.pt"
```

Or via environment variables:

```
export VAE_PATH="/path/to/vae_final.pt"
export CLIP_PATH="/path/to/clip_final.pt"
export UNET_PATH="/path/to/unet_step_017000.pt"
export TOKENIZER_PATH="/path/to/tokenizer.pt"
```

Then start the backend:

```
cd backend
uvicorn main:app --reload
# Server starts at http://127.0.0.1:8000
```
In a second terminal, start the frontend:

```
cd ..
cd frontend
npm run dev
```

For standalone inference without the API server:

```
cd backend
python standalone_inference.py
# Interactive prompt loop: type a prompt, get a saved PNG
```

| Method | Endpoint | Description |
|---|---|---|
| POST | `/generate` | Generate an image from a text prompt |
| GET | `/health` | Check if models are loaded and ready |
| GET | `/config` | Retrieve inference parameter ranges for the frontend |
Example /generate request:
```
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a serene mountain lake at golden hour",
    "steps": 20,
    "cfg_scale": 7.5,
    "seed": 42
  }'
```

Response: A JSON object containing a base64-encoded PNG data-URI in the `image` field, ready to drop into an `<img src="...">` tag.
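Decoding that data-URI on the client side is a one-liner around `base64`. A minimal sketch (the commented `requests` usage and the `image` field name follow the response format described above; everything else is illustrative):

```python
import base64

def data_uri_to_png_bytes(data_uri):
    """Strip the data-URI header and decode the base64 PNG payload."""
    header, _, payload = data_uri.partition(",")
    assert header.startswith("data:image/png;base64")
    return base64.b64decode(payload)

# Hypothetical usage against a running server (requires the `requests` package):
# resp = requests.post("http://localhost:8000/generate",
#                      json={"prompt": "a serene mountain lake at golden hour",
#                            "steps": 20, "cfg_scale": 7.5, "seed": 42})
# png = data_uri_to_png_bytes(resp.json()["image"])
# open("out.png", "wb").write(png)
```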
Training was done entirely on Google Colab (T4 GPU, 16GB VRAM) using the notebook at metadata/training_script.ipynb. Training is sequential: VAE first, then CLIP, then UNet.
- Source: LAION Aesthetic subset (~168K images with captions)
- Resolution: 512Γ512 (pre-processed using preprocess.py)
| Stage | Component | Batch Size | Effective Batch | Steps | Key Detail |
|---|---|---|---|---|---|
| 1 | VAE | 32 | 128 | Until convergence | Cyclical KL annealing, free bits = 0.5 |
| 2 | CLIP | 32 | 128 | Until convergence | Contrastive loss, temperature learned |
| 3 | UNet | 32 | 128 | ~17,000+ | Flow matching, CFG dropout = 10% |
All stages use:
- Optimizer: 8-bit AdamW (`bitsandbytes`) for VRAM efficiency
- Scheduler: Cosine LR with warmup
- Mixed precision: AMP (FP16 forward, FP32 loss)
- Gradient clipping: 0.5
- EMA: decay = 0.9999
- GPU augmentations: Kornia pipeline (normalization, flips, color jitter offloaded to GPU)
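Of these, the EMA is the easiest piece to get subtly wrong. A framework-agnostic sketch of the update rule, with plain dicts standing in for parameter tensors (in PyTorch the same loop runs over `state_dict()` entries under `torch.no_grad()`):

```python
def ema_update(ema_params, model_params, decay=0.9999):
    """In-place exponential moving average: ema <- decay * ema + (1 - decay) * current."""
    for name in ema_params:
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * model_params[name]
    return ema_params
```

At decay = 0.9999 the EMA effectively averages over roughly the last 10,000 optimizer steps, smoothing out step-to-step noise in the weights.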
A sliding window keeps only the latest `max_checkpoints=2` step checkpoints to save Drive storage, while `_best.pt` and `_final.pt` checkpoints are always preserved.
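A minimal version of that rotation logic might look like this (the `*_step_NNNNNN.pt` filename pattern is assumed from the checkpoint names elsewhere in this README; this is a sketch, not the notebook's code):

```python
import re
from pathlib import Path

def prune_step_checkpoints(ckpt_dir, max_checkpoints=2):
    """Keep only the newest `max_checkpoints` step checkpoints in a directory.

    Files like *_best.pt and *_final.pt are never touched; only
    step-numbered checkpoints (e.g. unet_step_017000.pt) are rotated.
    """
    step_ckpts = sorted(
        (p for p in Path(ckpt_dir).glob("*_step_*.pt")
         if re.search(r"_step_(\d+)\.pt$", p.name)),
        key=lambda p: int(re.search(r"_step_(\d+)\.pt$", p.name).group(1)),
    )
    for old in step_ckpts[:-max_checkpoints]:
        old.unlink()  # delete everything older than the newest N
```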
Loss curves for each stage are available here: VAE Curves, CLIP Curves, UNet Curves.
All hyperparameters live in backend/config.py as two frozen dataclasses:
- `ModelConfig`: architecture dimensions (channels, layers, latent sizes, attention resolutions)
- `ServerConfig`: checkpoint paths, CORS origins, inference limits (max steps, CFG scale)
Key values at a glance:
```
# VAE
vae_image_size = 512          # Input resolution
vae_latent_dim = 4            # Latent channels (matches UNet input)
vae_hidden_dims = [128, 256, 512]

# CLIP
clip_embed_dim = 512          # Also the UNet context dim
clip_num_layers = 12
clip_max_seq_length = 77

# UNet
unet_model_channels = 192
unet_channel_mult = (1, 2, 2, 4)
unet_attention_resolutions = (16, 8)  # Apply cross-attn at 16×16 and 8×8

# Inference
flow_inference_steps = 20     # Euler steps (20 is usually enough)
num_diffusion_steps = 1000    # Training timestep range
```

Why does it only work for landscapes? The LAION subset used for training had a strong bias toward natural scenery, and the captions were often too generic ("a beautiful image", "high quality photo") to teach meaningful text-image alignment. The model overfit to this distribution.
Why is the UNet checkpoint at step 17,000? Training was done on free Colab T4 sessions with time limits. Step 17,000 represents the best checkpoint before the session expired. More training would likely improve results, but the dataset bias is the fundamental bottleneck.
Why Flow Matching and not DDPM? Flow Matching converges faster, generates comparable quality in fewer inference steps, and the loss formulation is simpler. It was a deliberate design choice to learn modern techniques over the older DDPM formulation.
EMA weights are intentionally NOT used at inference.
The checkpoint saves both `model_state_dict` and `ema_model_state_dict`, but the inference code uses `model_state_dict` exclusively, because the EMA weights were corrupted by a bug in the training script.
Some sample images are pure noise. If you look at the VAE and UNet samples, you will see pure noise in some of them. This is an artifact of bugs in the sampling code (VAE scaling and UNet flow sampling) and does not reflect training behavior. We have not pinpointed the exact root cause; if you find it by analyzing the training script, feel free to mail me at dheerennntejani@gmail.com.
We are building a new version from scratch with the same architecture but addressing the root causes of v1's limitations:
- 2 million high-quality image-caption pairs with diverse subjects (people, animals, objects, scenes, abstract concepts)
- Rich, descriptive captions generated or curated to carry real semantic signal
- Longer training runs with proper LR scheduling and evaluation
- Better VAE with higher latent quality (LPIPS loss component)
The goal is a model that actually generalizes beyond mountain photography.
torch (cu121)
diffusers
transformers
bitsandbytes
einops
pillow
tqdm
kornia
fastapi
uvicorn
numpy
matplotlib
This project is released for educational purposes. Model weights (when available) are for non-commercial research use only, consistent with the LAION dataset license terms.
- Flow Matching for Generative Modeling β Lipman et al., 2023
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow β Liu et al., 2023
- LAION-5B β Open dataset used for training
- Stable Diffusion β Architecture inspiration
- bitsandbytes β 8-bit optimizers enabling T4 training
- Kornia β GPU-accelerated augmentation pipeline
The whole project, from start to finish, was made by Dheeren Tejani. It was a great learning experience and paved a great path for phase 2.