A personal journey to explore, implement, and deeply understand GPU programming through 100 small, focused CUDA projects.
The aim is to learn GPU programming by building, testing, profiling and documenting small kernels that grow in complexity over time.
## What you'll find here
- Daily challenge folders (`challanges/1_vectorAdd`, `challanges/2_matrixMult`, ...), each containing `notes.md` and the kernel implementation (`*.cu`).
Every day, I build one CUDA kernel, from the basics (vector addition) all the way to advanced patterns (shared-memory tiling, warp-level primitives, cooperative groups, streams, graph execution, etc.).
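To give a feel for the starting point, here is a minimal vector-addition kernel in the spirit of Day 1 (a generic sketch, not the repository's exact code):

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of C = A + B.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may overshoot n
        C[i] = A[i] + B[i];
}

// Host-side launch: enough blocks to cover all n elements.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
```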
This repository documents my progress with:
- Daily Notes: `notes.md` inside each folder
- Explanations: kernels and CUDA concepts
- Code Implementations: clean, runnable examples
- A Progress Table: tracking each challenge
The goal is not just to write kernels; it's to understand how they interact with the architecture and how to write correct, fast, and maintainable GPU code.
- Understand the CUDA execution model (threads, warps, blocks, grids).
- Learn memory hierarchy and optimization: shared memory, registers, caches, and HBM/GDDR characteristics.
- Explore advanced features: cooperative groups, streams, CUDA Graphs, Tensor Cores (WMMA), and memory-bound optimizations.
- Improve profiling & benchmarking skills (nvprof / Nsight / NVTX markers).
- Produce short, self-contained notes for each day.
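For the profiling goal, NVTX ranges are a lightweight way to label regions of host code so they show up by name in Nsight Systems timelines. A minimal sketch using the header-only NVTX v3 API (the function and range names here are illustrative):

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3, ships with the CUDA toolkit

// Wrap a region of host code so it appears as a named range
// in the Nsight Systems timeline.
void run_step() {
    nvtxRangePushA("vectorAdd launch");
    // kernel launch + synchronization would go here
    nvtxRangePop();
}
```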
```
CUDA_in_100_days/
├── challanges/
│   ├── ...
│   └── N_<kernel_name>/
│       ├── notes.md
│       └── <kernel_name>.cu
├── scripts/
│   └── update_readme.py
├── notes_template.md
├── badge.svg
├── README.md
└── .gitignore
```

| Day | Folder | Topic | Short description |
|---|---|---|---|
| 1 | 1_vectorAdd | Vector Addition | Basic CUDA kernel computing element-wise addition of two float vectors. |
| 2 | 2_matrixMult | Matrix Multiplication | Naive dense matrix multiplication kernel, revisiting 2D thread indexing and memory coalescing. |
| 3 | 3_sharedMem_MatrixMult | Shared Memory Matrix Multiplication | Uses shared memory to reduce redundant global memory accesses among threads in the same block. |
| 4 | 4_sharedMem_blockTiling_MatrixMult | Shared Memory 1-D Block Tiling Matrix Multiplication | Uses 1-D tiling to increase the ratio of FLOPs per global load. |
| 5 | [5_sharedMem_2DblockTiling_MatrixMult copy](challanges/5_sharedMem_2DblockTiling_MatrixMult copy/) | Shared Memory 2-D Block Tiling Matrix Multiplication | Uses 2-D tiling to further increase the ratio of FLOPs per global load. |
| ... | ... | ... | ... |
Progress: Day 5 / 100 (5%)
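For the shared-memory entries above (Days 3–5), the core idea is a tiled kernel of roughly this shape (a generic textbook sketch, not the repository's exact implementation; it assumes square matrices with N a multiple of the tile size):

```cuda
#define TILE 16

// C = A * B for square N x N row-major matrices.
// Each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory so each global element is loaded
// once per block instead of once per thread.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings in one element per tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles while still in use
    }
    C[row * N + col] = acc;
}
```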