CacheSim

A multi-core CPU cache hierarchy simulator with MESI coherency protocol

Built in C++20. Models the full memory hierarchy (L1 → L2 → L3 → DRAM) across two simulated cores with a complete MESI bus protocol, configurable replacement policies, and trace-driven simulation mode. Every cache line state transition is unit-tested.

Why this exists

Modern CPUs spend more time waiting for memory than executing instructions. Understanding why — and being able to simulate, measure, and prove it — is what separates systems engineers from application developers. This simulator makes the invisible visible: you can watch false sharing destroy cache performance in real time, see LRU vs FIFO behave differently under thrash, and measure exactly how many DRAM fetches a workload costs.

Features

3-level hierarchy — L1 (32KB, 8-way) per core, L2 (256KB, 4-way) per core, L3 (4MB, 16-way) shared
MESI coherency protocol — BusRd, BusRdX, BusUpgr between two cores; M→S downgrade on foreign read, write-invalidate on write
Replacement policies — LRU, FIFO, Random (selectable at runtime via --policy)
Write policies — Write-back with allocate (default), Write-through (--write-through)
Trace-driven mode — replay any memory access trace file (--trace)
5 built-in workloads — sequential scan, false sharing, ping-pong, hot loop, cache thrash
Per-level statistics — accesses, hits, misses, writebacks, MESI invalidations, hit rate, DRAM fetches
27 unit tests — every MESI state transition, LRU eviction order, writeback-on-eviction, write-through propagation

Build

# Requirements: GCC 13+ or Clang 14+, CMake >= 3.16, C++20

git clone https://github.com/YOUR_USERNAME/CacheSim.git
cd CacheSim
mkdir build && cd build
cmake ..
make -j$(nproc)

Run

Run all 27 unit tests

./test_cache

Run all built-in workloads

./cache_sim --all-workloads

Run false sharing workload

./cache_sim --workload false_sharing --iters 64

Run sequential workload

./cache_sim --workload sequential --iters 128

Run hot loop workload

./cache_sim --workload hot_loop --iters 128

Run ping-pong workload

./cache_sim --workload ping_pong --iters 64

Run cache thrash workload

./cache_sim --workload thrash --iters 128

Verbose MESI output

./cache_sim --workload false_sharing --iters 10 --verbose

Replay a trace file

./cache_sim --trace ../traces/false_sharing.trace --verbose

FIFO replacement policy

./cache_sim --workload sequential --policy fifo

Random replacement policy

./cache_sim --workload thrash --policy random

Write-through mode

./cache_sim --workload false_sharing --write-through --iters 64

Help

./cache_sim --help

Trace file format

# one access per line: <core_id> <R|W> <hex_address>
0 R 0x1000
1 W 0x1004
0 R 0x1000

Workload results

Sequential scan — spatial locality working correctly

L1-Core0   accesses=128   hits=112   misses=16   hit%=87.5%
DRAM fetches: 16

128 elements × 8 bytes = 1024 bytes. Block size 64 bytes → 16 cache line fetches. Each fetch brings 8 elements → 7 free hits per line. Exactly 87.5%.

Hot loop — temporal locality, zero DRAM traffic

L1-Core0   accesses=1280   hits=1280   misses=0   hit%=100.0%
L2/L3:     accesses=0
DRAM fetches: 0

4-element working set fits in L1. After warmup, 1280 accesses with zero memory traffic.

False sharing — the most important result

L1-Core0   accesses=128   hits=0   misses=128   WBs=128   INVs=128   hit%=0.0%
L1-Core1   accesses=128   hits=0   misses=128   WBs=127   INVs=127   hit%=0.0%
L3         accesses=256   hits=255  misses=1
DRAM fetches: 1

Only 1 DRAM fetch. The data fits comfortably in L3. Yet L1 hit rate is 0%. Core 0 writes → BusUpgr invalidates Core 1's copy. Core 1 writes → BusUpgr invalidates Core 0's copy. 128 times each. Two threads writing separate variables that share a cache line destroy each other's performance.

Verbose output shows the bus traffic:

C0 W 0x1000 | MESI BusUpgr/BusRdX: Core 1 S→I   L1 write miss  L2 write miss  L3 write hit
C1 W 0x1004 | MESI BusUpgr/BusRdX: Core 0 M→I (writeback) [WB]  L1 write miss  L3 write hit

Fix in real code: alignas(64) int a; alignas(64) int b; — one variable per cache line.

Ping-pong — alternating writes to same address

L1-Core0   hits=0   INVs=32   hit%=0.0%
L1-Core1   hits=0   INVs=31   hit%=0.0%
DRAM fetches: 1

Same address, alternating writes. Every write invalidates the other core. Zero L1 utility.

Cache thrash — conflict misses from bad stride

L1-Core0   accesses=128   hits=0   misses=128   hit%=0.0%
L3         hits=0   misses=128
DRAM fetches: 128

Stride = L1 size. All accesses map to set 0. 8-way L1 fills after 8 accesses, then evicts on every subsequent access. 128 DRAM fetches for 128 accesses — the cache is completely bypassed despite being empty.

Architecture

Core 0 ──── L1C0 (32KB, 8-way, 64B) ──── L2C0 (256KB, 4-way, 64B) ──┐
                                                                        ├── L3 (4MB, 16-way, 64B) ── DRAM
Core 1 ──── L1C1 (32KB, 8-way, 64B) ──── L2C1 (256KB, 4-way, 64B) ──┘

Key design: probe/fill separation

CacheLevel::read() and write() only probe and count stats — they never auto-allocate on miss. The CacheHierarchy controller runs the snoop bus, determines the correct MESI state (E vs S depends on whether any other core holds the line), then calls fill() with the correct state. This is how real cache controllers work. Auto-allocating on miss with a hardcoded state is incorrect for MESI.

MESI state machine

State	Meaning	On read by other core	On write by other core
Modified	Dirty, exclusive	Flush + M→S, other gets S	Flush + M→I
Exclusive	Clean, exclusive	E→S, other gets S	E→I
Shared	Clean, shared	Stay S	S→I
Invalid	Not present	—	—

File structure

CacheSim/
├── include/
│   ├── cache_line.h        # CacheLine struct, MESIState enum
│   ├── cache_config.h      # CacheConfig: sets/ways/block_size, address decode
│   ├── cache_level.h       # CacheLevel: probe/read/write/fill/invalidate/downgrade
│   ├── cache_hierarchy.h   # CacheHierarchy: L1/L2/L3 wiring + MESI bus protocol
│   └── trace_player.h      # TracePlayer: trace file parser and replay engine
├── src/
│   └── main.cpp            # CLI driver + 5 built-in workloads
├── tests/
│   └── test_cache.cpp      # 27 unit tests (self-registering, no framework needed)
├── traces/
│   └── false_sharing.trace # Example trace demonstrating MESI invalidation storm
└── CMakeLists.txt

Unit tests (27 passing)

test_cold_miss_then_hit                    — first access misses, second hits
test_mesi_exclusive_on_first_read          — sole reader gets Exclusive state
test_mesi_shared_on_second_read            — second reader downgrades both to Shared
test_mesi_modified_on_write                — write transitions line to Modified
test_mesi_write_invalidates_other_core     — BusUpgr invalidates other core's copy
test_mesi_modified_downgrade_on_foreign_read — M→S flush on read from other core
test_lru_eviction_order                    — LRU evicts least recently used way
test_false_sharing_generates_invalidations — adjacent vars in same line → INVs
test_writeback_on_eviction                 — dirty eviction increments WB counter
test_write_through_propagates_to_l2        — WT write updates L2 state to Modified
test_spatial_locality_block_reuse          — adjacent address in same block hits

Results: 27 passed, 0 failed

Extensions (planned / good interview talking points)

L3 back-invalidation on eviction — when L3 evicts a line, invalidate all L1/L2 copies
PLRU (Pseudo-LRU) — tree-based O(1) update; what real Intel/AMD CPUs use
perf mem trace import — convert Linux perf binary traces for real workload replay
MOESI / MESIF — add Owned or Forward states for directory-based protocols
Multi-level inclusion violation tracking — detect when L2 evicts a line still in L1
Valgrind cachegrind comparison — run same program through both, compare miss rates

Resume bullet

Developed a trace-driven CPU cache hierarchy simulator in C++20 — models L1/L2/L3 with MESI coherency protocol (BusRd/BusRdX/BusUpgr), LRU/FIFO/Random eviction, and write-back/write-through policies; demonstrated false sharing reducing L1 hit rate from 87% to 0% via BusUpgr invalidation storms across 2 simulated cores; 27 unit tests covering all MESI state transitions and eviction edge cases.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
include		include
screenshots		screenshots
src		src
tests		tests
traces		traces
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

CacheSim

Why this exists

Features

Build

Run

Run all 27 unit tests

Run all built-in workloads

Run false sharing workload

Run sequential workload

Run hot loop workload

Run ping-pong workload

Run cache thrash workload

Verbose MESI output

Replay a trace file

FIFO replacement policy

Random replacement policy

Write-through mode

Help

Trace file format

Workload results

Sequential scan — spatial locality working correctly

Hot loop — temporal locality, zero DRAM traffic

False sharing — the most important result

Ping-pong — alternating writes to same address

Cache thrash — conflict misses from bad stride

Architecture

Key design: probe/fill separation

MESI state machine

File structure

Unit tests (27 passing)

Extensions (planned / good interview talking points)

Resume bullet

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages