A personal journey to explore, implement, and deeply understand GPU programming through 100 small, focused CUDA projects.
The aim is to learn GPU programming by building, testing, profiling and documenting small kernels that grow in complexity over time.
## What you'll find here
- Daily challenge folders (`challanges/1_vectorAdd`, `challanges/2_matrixMult`, ...), each containing `notes.md` and the kernel implementation (`*.cu`).
Every day, I build one CUDA kernel, from the basics (vector addition) all the way to advanced patterns (shared-memory tiling, warp-level primitives, cooperative groups, streams, graph execution, etc.).
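To give a feel for the starting point, here is a minimal vector-addition kernel in the spirit of Day 1 (a generic sketch, not the repository's exact code):

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of C = A + B.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may overshoot n
        C[i] = A[i] + B[i];
}

// Host-side launch: enough blocks to cover all n elements.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
```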
This repository documents my progress with:
- Daily Notes: `notes.md` inside each folder
- Explanations: kernels and CUDA concepts
- Code Implementations: clean, runnable examples
- A Progress Table: tracking each challenge
The goal is not just to write kernels; it's to understand how they interact with the architecture and how to write correct, fast, and maintainable GPU code.
- Understand the CUDA execution model (threads, warps, blocks, grids).
- Learn memory hierarchy and optimization: shared memory, registers, caches, and HBM/GDDR characteristics.
- Explore advanced features: cooperative groups, streams, CUDA Graphs, Tensor Cores (WMMA), and memory-bound optimizations.
- Improve profiling & benchmarking skills (nvprof / Nsight / NVTX markers).
- Produce short, self-contained notes for each day.
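For the profiling goal, NVTX ranges are a lightweight way to label regions of host code so they show up by name in Nsight Systems timelines. A minimal sketch using the header-only NVTX v3 API (the function and range names here are illustrative):

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3, ships with the CUDA toolkit

// Wrap a region of host code so it appears as a named range
// in the Nsight Systems timeline.
void run_step() {
    nvtxRangePushA("vectorAdd launch");
    // kernel launch + synchronization would go here
    nvtxRangePop();
}
```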
```
CUDA_in_100_days/
├── challanges/
│   ├── ...
│   └── N_<kernel_name>/
│       ├── notes.md
│       └── <kernel_name>.cu
├── scripts/
│   └── update_readme.py
├── notes_template.md
├── badge.svg
├── README.md
└── .gitignore
```

| Day | Folder | Topic | Short description |
|---|---|---|---|
| 1 | 1_vectorAdd | Vector Addition | Basic CUDA kernel computing element-wise addition of two float vectors. |
| 2 | 2_matrixMult | Matrix Multiplication | Naive dense matrix multiplication kernel, revisiting 2D thread indexing and memory coalescing. |
| 3 | 3_sharedMem_MatrixMult | Shared Memory Matrix Multiplication | Uses shared memory to reduce redundant global memory accesses among threads in the same block. |
| 4 | 4_sharedMem_blockTiling_MatrixMult | Shared Memory 1-D Block Tiling Matrix Multiplication | Uses 1-D tiling to increase the ratio of FLOPs per global load. |
| 5 | [5_sharedMem_2DblockTiling_MatrixMult copy](challanges/5_sharedMem_2DblockTiling_MatrixMult copy/) | Shared Memory 2-D Block Tiling Matrix Multiplication | Uses 2-D tiling to further increase the ratio of FLOPs per global load. |
| ... | ... | ... | ... |
Progress: Day 5 / 100 (5%)
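For the shared-memory entries above (Days 3–5), the core idea is a tiled kernel of roughly this shape (a generic textbook sketch, not the repository's exact implementation; it assumes square matrices with N a multiple of the tile size):

```cuda
#define TILE 16

// C = A * B for square N x N row-major matrices.
// Each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory so each global element is loaded
// once per block instead of once per thread.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings in one element per tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles while still in use
    }
    C[row * N + col] = acc;
}
```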