Commit b8fc7c9 (parent d22930c) — committed by pdrobnjak and claude

docs(benchmark): add profiling analysis for executeEVMTxWithGigaExecutor

Baseline: 8600 TPS median with GIGA_EXECUTOR=true GIGA_OCC=true.

Key findings:
- CacheMultiStore snapshot allocation is the #1 target (15 GB, 27% of function)
- cachekv.NewStore creates 9 GB of sync.Map objects per 30s sample
- GC overhead (~24% CPU) is driven by allocation pressure
- Lock contention (30.7% CPU) is partially secondary to GC/alloc

Candidate optimizations: sync.Pool for cachekv.Store, lazy per-store creation, replacing sync.Map with a regular map in the giga path, caching block-level constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 file changed: +157 −0
# Profiling Analysis: executeEVMTxWithGigaExecutor

**Date:** 2026-02-13
**Branch:** pd/benchmark-profiling-improvements (commit dcbfd5a02)
**Scenario:** benchmark/scenarios/evm.json
**Config:** GIGA_EXECUTOR=true GIGA_OCC=true DURATION=120
## Baseline

| Metric | Value |
|--------|-------|
| Median TPS | 8600 |
| Avg TPS | 8495 |
| Min TPS | 7800 |
| Max TPS | 9000 |
| Readings | 21 |
| Block Height | 995 |
## Function Hot Path

```
executeEVMTxWithGigaExecutor (app/app.go:1714)
├─ msg.AsTransaction()
├─ RecoverSenderFromEthTx() — ECDSA recovery
├─ GetEVMAddress() / AssociateAddresses()
├─ NewDBImpl() — allocates state DB + initial Snapshot
│  └─ Snapshot() — clones entire CacheMultiStore
├─ GetVMBlockContext()
├─ GetParams() + ChainID() (x2) — redundant: ChainID called at lines 1721 and 1764
├─ NewGethExecutor() → vm.NewEVM() — new EVM per tx
├─ ExecuteTransaction()
│  └─ ApplyMessage() → StateTransition.Execute()
│     └─ EVM.Call() — may trigger additional Snapshot() calls
├─ stateDB.Finalize() — flushCtxs → cachekv.Store.Write
├─ WriteReceipt()
└─ Marshal response (protobuf)
```
## CPU Profile (30s sample, 120.21s total across 4 cores)

`executeEVMTxWithGigaExecutor`: **25.95s cumulative (21.6%)**

| Component | Cumulative | % of total | Notes |
|-----------|-----------|------------|-------|
| runtime.lock2 (spinlock) | 36.87s | 30.7% | OCC goroutines fighting for locks |
| runtime.usleep (lock backoff) | 30.00s | 25.0% | Spinning on contended locks |
| ExecuteTransaction → ApplyMessage | 15.31s | 12.7% | Actual EVM execution |
| GC (gcDrain + scanobject) | 17.15s + 11.49s | 23.8% | Driven by allocation pressure |
| mallocgc | 11.42s | 9.5% | Object allocation |
| runtime.kevent | 10.38s | 8.6% | I/O polling |
### CPU Focused on executeEVMTxWithGigaExecutor

Top flat contributors within the function's call tree, sorted by flat time:

| Function | Flat | Notes |
|----------|------|-------|
| runtime.mallocgc | 6.31s | Allocation pressure from stores |
| runtime.usleep | 4.95s | Lock spinning inside store operations |
| runtime.newobject | 3.72s | Heap allocations |
| runtime.cgocall | 3.09s | CGo boundary (likely crypto) |
| cachemulti.newCacheMultiStoreFromCMS | 1.22s | CMS creation during Snapshot |
| cachekv.Store.Write | 1.03s | Flushing cache stores |
| memiavl.MemNode.Get | 1.00s | IAVL tree reads |
## Heap Profile (alloc_space)

`executeEVMTxWithGigaExecutor`: **56.2 GB cumulative (54% of 104 GB total)**

| Hotspot | Allocated | % of function | What |
|---------|----------|--------------|------|
| DBImpl.Snapshot → CacheMultiStore | 15.1 GB | 27% | Full cache store clone per snapshot |
| cachekv.NewStore | 9.0 GB | 16% | Individual KV store objects (sync.Map x2-3) |
| sync.newIndirectNode (sync.Map internals) | 7.8 GB | 14% | sync.Map internal trie nodes |
| cachekv.Store.Write | 8.4 GB | 15% | Flushing stores (map iteration + write) |
| DBImpl.Finalize → flushCtxs | 9.7 GB | 17% | Writing state changes back through layers |
| GetBalance → LockedCoins | 9.0 GB | 16% | Balance lookups triggering deep store reads |
| AccountKeeper.GetAccount | 7.2 GB | 13% | Account deserialization (protobuf UnpackAny) |
| scheduler.prepareTask | 8.8 GB | 16% | OCC task preparation (VersionIndexedStore) |
### Allocation Objects

| Hotspot | Objects | % of total | Notes |
|---------|---------|------------|-------|
| cachekv.NewStore | 157M | 10.2% | Largest single flat allocator |
| DBImpl.Snapshot | 137M | 8.9% | CMS + maps + sync primitives |
| codec/types.UnpackAny | 134M | 8.7% | Protobuf deserialization |
| cachekv.Store.Write | 83M | 5.4% | Map iteration during flush |
## Mutex/Contention Profile (196s total)

| Source | Time | % | What |
|--------|------|---|------|
| runtime.unlock | 159s | 81% | Runtime-level lock contention |
| sync.Mutex.Unlock | 29s | 15% | Application mutex contention |
| sync.RWMutex.Unlock | 19s | 10% | Reader-writer lock contention |
| AccAddress.String | 10.6s | 5% | Bech32 encoding under lock |
| EventManager.EmitEvents | 9.9s | 5% | Event emission contention |
| sync.Map.Store (HashTrieMap.Swap) | 6.9s | 4% | sync.Map write contention |
## Block Profile (7.26 hrs total)

Dominated by `runtime.selectgo` (6.97 hrs / 96%) — the OCC scheduler's `select` loop waiting for tasks. Not actionable for this function.
## Key Findings

### 1. CacheMultiStore allocation is the #1 optimization target

Each `Snapshot()` call triggers `CacheMultiStore()`, which:

- Materializes ALL lazy module stores (not just the ones the tx touches)
- Creates `cachekv.NewStore` with 2-3 `sync.Map` objects per module store
- Creates `gigacachekv.NewStore` with 2 `sync.Map` objects per giga store
- Allocates map copies, a sync.RWMutex, and a sync.Once per CMS

This happens at least once per tx (NewDBImpl's initial Snapshot), plus additional times for nested EVM calls. OCC re-executions create entirely fresh chains.
### 2. GC overhead is a direct consequence of allocation pressure

17s of CPU on `gcDrain` plus 11.5s on `scanobject` adds up to ~24% of CPU spent on GC. The 104 GB of total allocations over the 30s sample creates enormous GC pressure. Reducing allocations in the Snapshot path would therefore have a compounding effect.
### 3. Lock contention is high but may be secondary

`runtime.lock2` at 30.7% is the #1 CPU consumer. Much of this is runtime-internal (GC, scheduler) and would decrease naturally if allocation pressure drops. Some comes from `sync.Map` operations and store-level mutexes.
### 4. ChainID() and DefaultChainConfig() called redundantly

`ChainID()` is called at lines 1721 and 1764. `DefaultChainConfig()` is called inside `RecoverSenderFromEthTx` and again implicitly at line 1764. Minor, but free to fix.
## Candidate Optimizations
### A. Pool/reuse cachekv.Store objects (high impact)

Replace fresh `cachekv.NewStore` allocations with `sync.Pool` recycling. On return to the pool, call `sync.Map.Clear()` (Go 1.23+) to reset state. This eliminates ~9 GB of allocations per 30s sample and reduces GC pressure.

**Expected impact:** 10-20% TPS improvement
**Risk:** Low — mechanical change, clear lifecycle boundaries
### B. Lazy per-store creation in snapshots (high impact)

Currently `materializeOnce.Do` creates a cachekv.Store for ALL module stores whenever any snapshot is taken. Instead, create wrappers only for the stores actually accessed via GetKVStore.

**Expected impact:** 10-15% TPS improvement (depends on how many stores go unused per tx)
**Risk:** Medium — changes the cachemulti.Store threading model
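The idea can be sketched as follows — `lazyMultiStore` and its fields are illustrative names, not the real `cachemulti.Store` API:

```go
package main

import (
	"fmt"
	"sync"
)

// kvCache is a toy stand-in for a per-module cachekv wrapper.
type kvCache struct{ data map[string]string }

// lazyMultiStore creates a cache wrapper per module store on first
// GetKVStore access, instead of materializing all of them up front
// the way materializeOnce.Do does today.
type lazyMultiStore struct {
	mu     sync.Mutex
	caches map[string]*kvCache // lazily created wrappers, keyed by store name
}

func newLazyMultiStore() *lazyMultiStore {
	// Small initial capacity: most txs touch only a few module stores.
	return &lazyMultiStore{caches: make(map[string]*kvCache, 4)}
}

// GetKVStore allocates the wrapper only on first access to that store.
func (ms *lazyMultiStore) GetKVStore(key string) *kvCache {
	ms.mu.Lock()
	defer ms.mu.Unlock()
	if c, ok := ms.caches[key]; ok {
		return c
	}
	c := &kvCache{data: make(map[string]string)}
	ms.caches[key] = c
	return c
}

func main() {
	ms := newLazyMultiStore()
	ms.GetKVStore("evm") // only the touched store gets a wrapper
	// An EVM transfer never touches staking/gov, so those wrappers
	// (and their sync.Maps) are never allocated.
	fmt.Println(len(ms.caches)) // 1
}
```

The mutex is what the "Risk: Medium" refers to: once creation is lazy, concurrent readers of the multistore must agree on who creates each wrapper, where today `materializeOnce.Do` settles that once up front.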
### C. Replace sync.Map with a regular map in giga cachekv (medium impact)

The giga `cachekv.Store` uses `sync.Map`, but within OCC each store belongs to a single goroutine's execution. Regular maps are ~10x cheaper to allocate and access.

**Expected impact:** 5-10% TPS improvement
**Risk:** Low — need to verify there is no concurrent access in the OCC path
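A sketch of the single-writer shape — the type and its fields are illustrative, assuming (as the risk note says, still to be verified) that each store is confined to one OCC goroutine:

```go
package main

import "fmt"

// singleWriterCache replaces the sync.Map pair with plain maps. This is
// only valid because, under the OCC model, exactly one goroutine owns
// the store for the lifetime of its transaction execution.
type singleWriterCache struct {
	dirty   map[string][]byte   // pending writes
	deleted map[string]struct{} // tombstones
}

func newSingleWriterCache() *singleWriterCache {
	return &singleWriterCache{
		dirty:   make(map[string][]byte),
		deleted: make(map[string]struct{}),
	}
}

func (c *singleWriterCache) Set(key string, value []byte) {
	delete(c.deleted, key) // a write revives a previously deleted key
	c.dirty[key] = value
}

func (c *singleWriterCache) Delete(key string) {
	delete(c.dirty, key)
	c.deleted[key] = struct{}{}
}

func (c *singleWriterCache) Get(key string) ([]byte, bool) {
	if _, gone := c.deleted[key]; gone {
		return nil, false
	}
	v, ok := c.dirty[key]
	return v, ok
}

func main() {
	c := newSingleWriterCache()
	c.Set("balance", []byte{42})
	_, ok := c.Get("balance")
	fmt.Println(ok) // true
	c.Delete("balance")
	_, ok = c.Get("balance")
	fmt.Println(ok) // false
}
```

Running the test suite with `go test -race` on the giga path is the cheapest way to confirm the single-goroutine assumption before making this swap.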
### D. Cache block-level constants (low impact)

Cache `ChainID()`, `DefaultChainConfig()`, `GetParams()`, and `EthereumConfigWithSstore()` once per block instead of recomputing them per tx.

**Expected impact:** 2-5% TPS improvement
**Risk:** Minimal
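A sketch of the per-block memoization — `blockCache` and its loader are illustrative names, not the real keeper API:

```go
package main

import "fmt"

// blockCache memoizes block-invariant values (chain ID, chain config,
// EVM params) so they are computed once per block rather than per tx.
type blockCache struct {
	height  int64
	chainID string
	params  map[string]string
}

// forBlock reloads only when the block height changes. With one writer
// (block processing is sequential at the block boundary) no lock is
// needed; concurrent per-tx readers of an immutable snapshot are safe.
func (bc *blockCache) forBlock(height int64, load func() (string, map[string]string)) (string, map[string]string) {
	if bc.height != height || bc.chainID == "" {
		bc.chainID, bc.params = load()
		bc.height = height
	}
	return bc.chainID, bc.params
}

func main() {
	loads := 0
	load := func() (string, map[string]string) {
		loads++ // stands in for ChainID()/GetParams() store reads
		return "test-chain", map[string]string{"base_fee": "1"}
	}

	var bc blockCache
	for tx := 0; tx < 1000; tx++ { // 1000 txs in one block
		bc.forBlock(100, load)
	}
	fmt.Println(loads) // 1: loaded once for the whole block
}
```

This also subsumes finding 4: the redundant `ChainID()` calls at lines 1721 and 1764 both become cheap cache reads.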
