Skip to content

Amit-netizen/CacheSim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CacheSim

A multi-core CPU cache hierarchy simulator with MESI coherency protocol

Built in C++20. Models the full memory hierarchy (L1 → L2 → L3 → DRAM) across two simulated cores with a complete MESI bus protocol, configurable replacement policies, and trace-driven simulation mode. Every cache line state transition is unit-tested.


Why this exists

Modern CPUs spend more time waiting for memory than executing instructions. Understanding why — and being able to simulate, measure, and prove it — is what separates systems engineers from application developers. This simulator makes the invisible visible: you can watch false sharing destroy cache performance in real time, see LRU vs FIFO behave differently under thrash, and measure exactly how many DRAM fetches a workload costs.


Features

  • 3-level hierarchy — L1 (32KB, 8-way) per core, L2 (256KB, 4-way) per core, L3 (4MB, 16-way) shared
  • MESI coherency protocol — BusRd, BusRdX, BusUpgr between two cores; M→S downgrade on foreign read, write-invalidate on write
  • Replacement policies — LRU, FIFO, Random (selectable at runtime via --policy)
  • Write policies — Write-back with allocate (default), Write-through (--write-through)
  • Trace-driven mode — replay any memory access trace file (--trace)
  • 5 built-in workloads — sequential scan, false sharing, ping-pong, hot loop, cache thrash
  • Per-level statistics — accesses, hits, misses, writebacks, MESI invalidations, hit rate, DRAM fetches
  • 27 unit tests — every MESI state transition, LRU eviction order, writeback-on-eviction, write-through propagation

Build

# Requirements: GCC 13+ or Clang 14+, CMake >= 3.16, C++20

git clone https://github.com/YOUR_USERNAME/CacheSim.git
cd CacheSim
mkdir build && cd build
cmake ..
make -j$(nproc)

Run

Run all 27 unit tests

./test_cache

Test Cache


Run all built-in workloads

./cache_sim --all-workloads

Workloads 1

Workloads 2


Run false sharing workload

./cache_sim --workload false_sharing --iters 64

False Sharing


Run sequential workload

./cache_sim --workload sequential --iters 128

Sequential


Run hot loop workload

./cache_sim --workload hot_loop --iters 128

Hot Loop


Run ping-pong workload

./cache_sim --workload ping_pong --iters 64

Ping Pong


Run cache thrash workload

./cache_sim --workload thrash --iters 128

Thrash


Verbose MESI output

./cache_sim --workload false_sharing --iters 10 --verbose

False Sharing 10 Iterations


Replay a trace file

./cache_sim --trace ../traces/false_sharing.trace --verbose

Trace


FIFO replacement policy

./cache_sim --workload sequential --policy fifo

FIFO


Random replacement policy

./cache_sim --workload thrash --policy random

Write-through mode

./cache_sim --workload false_sharing --write-through --iters 64

Write Through


Help

./cache_sim --help

Help

Trace file format

# one access per line: <core_id> <R|W> <hex_address>
0 R 0x1000
1 W 0x1004
0 R 0x1000

Workload results

Sequential scan — spatial locality working correctly

L1-Core0   accesses=128   hits=112   misses=16   hit%=87.5%
DRAM fetches: 16

128 elements × 8 bytes = 1024 bytes. Block size 64 bytes → 16 cache line fetches. Each fetch brings 8 elements → 7 free hits per line. Exactly 87.5%.


Hot loop — temporal locality, zero DRAM traffic

L1-Core0   accesses=1280   hits=1280   misses=0   hit%=100.0%
L2/L3:     accesses=0
DRAM fetches: 0

4-element working set fits in L1. After warmup, 1280 accesses with zero memory traffic.


False sharing — the most important result

L1-Core0   accesses=128   hits=0   misses=128   WBs=128   INVs=128   hit%=0.0%
L1-Core1   accesses=128   hits=0   misses=128   WBs=127   INVs=127   hit%=0.0%
L3         accesses=256   hits=255  misses=1
DRAM fetches: 1

Only 1 DRAM fetch. The data fits comfortably in L3. Yet L1 hit rate is 0%. Core 0 writes → BusUpgr invalidates Core 1's copy. Core 1 writes → BusUpgr invalidates Core 0's copy. 128 times each. Two threads writing separate variables that share a cache line destroy each other's performance.

Verbose output shows the bus traffic:

C0 W 0x1000 | MESI BusUpgr/BusRdX: Core 1 S→I   L1 write miss  L2 write miss  L3 write hit
C1 W 0x1004 | MESI BusUpgr/BusRdX: Core 0 M→I (writeback) [WB]  L1 write miss  L3 write hit

Fix in real code: alignas(64) int a; alignas(64) int b; — one variable per cache line.


Ping-pong — alternating writes to same address

L1-Core0   hits=0   INVs=32   hit%=0.0%
L1-Core1   hits=0   INVs=31   hit%=0.0%
DRAM fetches: 1

Same address, alternating writes. Every write invalidates the other core. Zero L1 utility.


Cache thrash — conflict misses from bad stride

L1-Core0   accesses=128   hits=0   misses=128   hit%=0.0%
L3         hits=0   misses=128
DRAM fetches: 128

Stride = L1 size. All accesses map to set 0. 8-way L1 fills after 8 accesses, then evicts on every subsequent access. 128 DRAM fetches for 128 accesses — the cache is completely bypassed despite being empty.


Architecture

Core 0 ──── L1C0 (32KB, 8-way, 64B) ──── L2C0 (256KB, 4-way, 64B) ──┐
                                                                        ├── L3 (4MB, 16-way, 64B) ── DRAM
Core 1 ──── L1C1 (32KB, 8-way, 64B) ──── L2C1 (256KB, 4-way, 64B) ──┘

Key design: probe/fill separation

CacheLevel::read() and write() only probe and count stats — they never auto-allocate on miss. The CacheHierarchy controller runs the snoop bus, determines the correct MESI state (E vs S depends on whether any other core holds the line), then calls fill() with the correct state. This is how real cache controllers work. Auto-allocating on miss with a hardcoded state is incorrect for MESI.

MESI state machine

State Meaning On read by other core On write by other core
Modified Dirty, exclusive Flush + M→S, other gets S Flush + M→I
Exclusive Clean, exclusive E→S, other gets S E→I
Shared Clean, shared Stay S S→I
Invalid Not present

File structure

CacheSim/
├── include/
│   ├── cache_line.h        # CacheLine struct, MESIState enum
│   ├── cache_config.h      # CacheConfig: sets/ways/block_size, address decode
│   ├── cache_level.h       # CacheLevel: probe/read/write/fill/invalidate/downgrade
│   ├── cache_hierarchy.h   # CacheHierarchy: L1/L2/L3 wiring + MESI bus protocol
│   └── trace_player.h      # TracePlayer: trace file parser and replay engine
├── src/
│   └── main.cpp            # CLI driver + 5 built-in workloads
├── tests/
│   └── test_cache.cpp      # 27 unit tests (self-registering, no framework needed)
├── traces/
│   └── false_sharing.trace # Example trace demonstrating MESI invalidation storm
└── CMakeLists.txt

Unit tests (27 passing)

test_cold_miss_then_hit                    — first access misses, second hits
test_mesi_exclusive_on_first_read          — sole reader gets Exclusive state
test_mesi_shared_on_second_read            — second reader downgrades both to Shared
test_mesi_modified_on_write                — write transitions line to Modified
test_mesi_write_invalidates_other_core     — BusUpgr invalidates other core's copy
test_mesi_modified_downgrade_on_foreign_read — M→S flush on read from other core
test_lru_eviction_order                    — LRU evicts least recently used way
test_false_sharing_generates_invalidations — adjacent vars in same line → INVs
test_writeback_on_eviction                 — dirty eviction increments WB counter
test_write_through_propagates_to_l2        — WT write updates L2 state to Modified
test_spatial_locality_block_reuse          — adjacent address in same block hits

Results: 27 passed, 0 failed

Extensions (planned / good interview talking points)

  • L3 back-invalidation on eviction — when L3 evicts a line, invalidate all L1/L2 copies
  • PLRU (Pseudo-LRU) — tree-based O(1) update; what real Intel/AMD CPUs use
  • perf mem trace import — convert Linux perf binary traces for real workload replay
  • MOESI / MESIF — add Owned or Forward states for directory-based protocols
  • Multi-level inclusion violation tracking — detect when L2 evicts a line still in L1
  • Valgrind cachegrind comparison — run same program through both, compare miss rates

Resume bullet

Developed a trace-driven CPU cache hierarchy simulator in C++20 — models L1/L2/L3 with MESI coherency protocol (BusRd/BusRdX/BusUpgr), LRU/FIFO/Random eviction, and write-back/write-through policies; demonstrated false sharing reducing L1 hit rate from 87% to 0% via BusUpgr invalidation storms across 2 simulated cores; 27 unit tests covering all MESI state transitions and eviction edge cases.


License

MIT

About

Multi-core CPU cache hierarchy simulator in C++20 — L1/L2/L3 with MESI coherency, LRU/FIFO/Random eviction, false sharing demo, trace-driven mode. 27 tests.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors