# Profiling Analysis: executeEVMTxWithGigaExecutor

**Date:** 2026-02-13
**Branch:** pd/benchmark-profiling-improvements (commit dcbfd5a02)
**Scenario:** benchmark/scenarios/evm.json
**Config:** GIGA_EXECUTOR=true GIGA_OCC=true DURATION=120
| 7 | + |
| 8 | +## Baseline |
| 9 | + |
| 10 | +| Metric | Value | |
| 11 | +|--------|-------| |
| 12 | +| Median TPS | 8600 | |
| 13 | +| Avg TPS | 8495 | |
| 14 | +| Min TPS | 7800 | |
| 15 | +| Max TPS | 9000 | |
| 16 | +| Readings | 21 | |
| 17 | +| Block Height | 995 | |

## Function Hot Path

```
executeEVMTxWithGigaExecutor (app/app.go:1714)
 ├─ msg.AsTransaction()
 ├─ RecoverSenderFromEthTx() — ECDSA recovery
 ├─ GetEVMAddress() / AssociateAddresses()
 ├─ NewDBImpl() — allocates state DB + initial Snapshot
 │   └─ Snapshot() — clones entire CacheMultiStore
 ├─ GetVMBlockContext()
 ├─ GetParams() + ChainID() (x2) — redundant: ChainID called at lines 1721 and 1764
 ├─ NewGethExecutor() → vm.NewEVM() — new EVM per tx
 ├─ ExecuteTransaction()
 │   └─ ApplyMessage() → StateTransition.Execute()
 │       └─ EVM.Call() — may trigger additional Snapshot() calls
 ├─ stateDB.Finalize() — flushCtxs → cachekv.Store.Write
 ├─ WriteReceipt()
 └─ Marshal response (protobuf)
```

## CPU Profile (30s sample, 120.21s total across 4 cores)

`executeEVMTxWithGigaExecutor`: **25.95s cumulative (21.6%)**

| Component | Cumulative | % of total | Notes |
|-----------|-----------|------------|-------|
| runtime.lock2 (spinlock) | 36.87s | 30.7% | OCC goroutines fighting for locks |
| runtime.usleep (lock backoff) | 30.00s | 25.0% | Spinning on contended locks |
| GC (gcDrain + scanobject) | 17.15s + 11.49s | 23.8% | Driven by allocation pressure |
| ExecuteTransaction → ApplyMessage | 15.31s | 12.7% | Actual EVM execution |
| mallocgc | 11.42s | 9.5% | Object allocation |
| runtime.kevent | 10.38s | 8.6% | I/O polling |

### CPU Focused on executeEVMTxWithGigaExecutor

Top flat contributors within the function's call tree, sorted by flat time:

| Function | Flat | Notes |
|----------|------|-------|
| runtime.mallocgc | 6.31s | Allocation pressure from stores |
| runtime.usleep | 4.95s | Lock spinning inside store operations |
| runtime.newobject | 3.72s | Heap allocations |
| runtime.cgocall | 3.09s | CGo boundary (likely crypto) |
| cachemulti.newCacheMultiStoreFromCMS | 1.22s | CMS creation during Snapshot |
| cachekv.Store.Write | 1.03s | Flushing cache stores |
| memiavl.MemNode.Get | 1.00s | IAVL tree reads |

## Heap Profile (alloc_space)

`executeEVMTxWithGigaExecutor`: **56.2 GB cumulative (54% of 104 GB total)**

| Hotspot | Allocated | % of function | What |
|---------|----------|--------------|------|
| DBImpl.Snapshot → CacheMultiStore | 15.1 GB | 27% | Full cache store clone per snapshot |
| DBImpl.Finalize → flushCtxs | 9.7 GB | 17% | Writing state changes back through layers |
| cachekv.NewStore | 9.0 GB | 16% | Individual KV store objects (sync.Map x2-3) |
| GetBalance → LockedCoins | 9.0 GB | 16% | Balance lookups triggering deep store reads |
| scheduler.prepareTask | 8.8 GB | 16% | OCC task preparation (VersionIndexedStore) |
| cachekv.Store.Write | 8.4 GB | 15% | Flushing stores (map iteration + write) |
| sync.newIndirectNode (sync.Map internals) | 7.8 GB | 14% | sync.Map internal trie nodes |
| AccountKeeper.GetAccount | 7.2 GB | 13% | Account deserialization (protobuf UnpackAny) |

### Allocation Objects

| Hotspot | Objects | % of total | Notes |
|---------|---------|------------|-------|
| cachekv.NewStore | 157M | 10.2% | Largest single flat allocator |
| DBImpl.Snapshot | 137M | 8.9% | CMS + maps + sync primitives |
| codec/types.UnpackAny | 134M | 8.7% | Protobuf deserialization |
| cachekv.Store.Write | 83M | 5.4% | Map iteration during flush |

## Mutex/Contention Profile (196s total)

| Source | Time | % | What |
|--------|------|---|------|
| runtime.unlock | 159s | 81% | Runtime-level lock contention |
| sync.Mutex.Unlock | 29s | 15% | Application mutex contention |
| sync.RWMutex.Unlock | 19s | 10% | Reader-writer lock contention |
| AccAddress.String | 10.6s | 5% | Bech32 encoding under lock |
| EventManager.EmitEvents | 9.9s | 5% | Event emission contention |
| sync.Map.Store (HashTrieMap.Swap) | 6.9s | 4% | sync.Map write contention |

## Block Profile (7.26 hrs total)

Dominated by `runtime.selectgo` (6.97 hrs / 96%) — the OCC scheduler's `select` loop waiting for tasks. Not actionable for this function.

## Key Findings

### 1. CacheMultiStore allocation is the #1 optimization target

Each `Snapshot()` call triggers `CacheMultiStore()`, which:
- Materializes ALL lazy module stores (not just the ones the tx touches)
- Creates `cachekv.NewStore` with 2-3 `sync.Map` objects per module store
- Creates `gigacachekv.NewStore` with 2 `sync.Map` objects per giga store
- Allocates map copies, a `sync.RWMutex`, and a `sync.Once` per CMS

This happens at least once per tx (NewDBImpl's initial Snapshot), plus additional times for nested EVM calls. OCC re-executions create entirely fresh snapshot chains.

### 2. GC overhead is a direct consequence of allocation pressure

17s of CPU on `gcDrain` plus 11.5s on `scanobject` means ~24% of CPU is spent on GC. The 104 GB of total allocations over the 30s sample creates enormous GC pressure. Reducing allocations in the Snapshot path would have a compounding effect.
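The GC share reported by pprof can be cross-checked against the runtime's own accounting. This is a minimal, self-contained sketch (not tied to this codebase) that reads `runtime.MemStats.GCCPUFraction`, the runtime's estimate of the fraction of available CPU time spent in GC:

```go
package main

import (
	"fmt"
	"runtime"
)

// gcCPUPercent reports the percentage of available CPU time the runtime
// has spent in GC since program start, per the runtime's own bookkeeping.
func gcCPUPercent() float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.GCCPUFraction * 100
}

func main() {
	// Churn some garbage so a few GC cycles actually run.
	for i := 0; i < 500_000; i++ {
		_ = make([]byte, 256)
	}
	runtime.GC()
	fmt.Printf("GC CPU fraction: %.2f%%\n", gcCPUPercent())
}
```

Sampling this before and after a benchmark run gives a cheap, always-on signal that the Snapshot-path allocation work is moving the needle.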

### 3. Lock contention is high but may be secondary

`runtime.lock2` at 30.7% is the #1 CPU consumer. Much of this is runtime-internal (GC, scheduler) and would decrease naturally if allocation pressure drops. Some is from `sync.Map` operations and store-level mutexes.

### 4. ChainID() and DefaultChainConfig() called redundantly

`ChainID()` is called at lines 1721 and 1764. `DefaultChainConfig()` is called inside `RecoverSenderFromEthTx` and again implicitly at line 1764. Minor, but free to fix.

## Candidate Optimizations

### A. Pool/reuse cachekv.Store objects (high impact)

Replace fresh `cachekv.NewStore` allocations with `sync.Pool` recycling. On return to the pool, call `sync.Map.Clear()` (Go 1.23+) to reset state. This eliminates ~9 GB of allocations and reduces GC pressure.

**Expected impact:** 10-20% TPS improvement
**Risk:** Low — mechanical change, clear lifecycle boundaries
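A minimal sketch of the pooling idea. `cacheStore` here is a hypothetical stand-in for `cachekv.Store` (the real type has more fields); the pattern shown — `sync.Pool` plus `sync.Map.Clear` on return — requires Go 1.23+:

```go
package main

import (
	"fmt"
	"sync"
)

// cacheStore is a hypothetical stand-in for cachekv.Store: a write
// cache plus a deletion tracker, each backed by a sync.Map.
type cacheStore struct {
	cache   sync.Map // key -> value
	deleted sync.Map // key -> struct{}
}

// storePool recycles cacheStore objects instead of allocating
// fresh ones on every Snapshot.
var storePool = sync.Pool{
	New: func() any { return new(cacheStore) },
}

// getStore hands out a clean store from the pool.
func getStore() *cacheStore {
	return storePool.Get().(*cacheStore)
}

// putStore resets a store and returns it to the pool. sync.Map.Clear
// (Go 1.23+) drops all entries so no stale state leaks into the next tx.
func putStore(s *cacheStore) {
	s.cache.Clear()
	s.deleted.Clear()
	storePool.Put(s)
}

func main() {
	s := getStore()
	s.cache.Store("balance/0xabc", []byte{1})
	putStore(s)

	s2 := getStore() // may be the same object, but it has been cleared
	_, ok := s2.cache.Load("balance/0xabc")
	fmt.Println("stale entry visible:", ok) // false
}
```

The "clear lifecycle boundaries" claim matters here: a store must only re-enter the pool once its snapshot can no longer be reverted to, otherwise a later revert would read cleared state.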

### B. Lazy per-store creation in snapshots (high impact)

Currently `materializeOnce.Do` creates a `cachekv.Store` for ALL module stores whenever any snapshot is taken. Instead, create wrappers only for stores actually accessed via `GetKVStore`.

**Expected impact:** 10-15% TPS improvement (depends on how many stores go unused per tx)
**Risk:** Medium — changes the cachemulti.Store threading model
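A sketch of the lazy-materialization shape, with hypothetical `lazyCMS` and `kvStore` types standing in for `cachemulti.Store` and the per-module cache wrapper (the mutex is what the "threading model" risk refers to — eager materialization under `sync.Once` needs no per-access lock):

```go
package main

import (
	"fmt"
	"sync"
)

// kvStore is a minimal stand-in for a module's cache-wrapped KV store.
type kvStore struct{ name string }

// lazyCMS sketches a CacheMultiStore that builds per-module cache
// wrappers on first access rather than for every module upfront.
type lazyCMS struct {
	mu     sync.Mutex
	parent map[string]*kvStore // underlying module stores
	cached map[string]*kvStore // wrappers, created on demand
	made   int                 // how many wrappers were actually built
}

func newLazyCMS(parent map[string]*kvStore) *lazyCMS {
	return &lazyCMS{parent: parent, cached: make(map[string]*kvStore)}
}

// GetKVStore materializes the cache wrapper for one module only when
// that module is actually touched by the transaction.
func (c *lazyCMS) GetKVStore(name string) *kvStore {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.cached[name]; ok {
		return s
	}
	s := &kvStore{name: name} // stand-in for cachekv.NewStore(parent[name])
	c.cached[name] = s
	c.made++
	return s
}

func main() {
	parent := map[string]*kvStore{"bank": {}, "evm": {}, "staking": {}, "gov": {}}
	cms := newLazyCMS(parent)
	// A typical EVM tx touches only a couple of module stores.
	cms.GetKVStore("evm")
	cms.GetKVStore("bank")
	fmt.Println("wrappers built:", cms.made) // 2, not one per module
}
```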

### C. Replace sync.Map with regular map in giga cachekv (medium impact)

The giga `cachekv.Store` uses `sync.Map`, but within OCC each store belongs to a single goroutine's execution. Regular maps are roughly an order of magnitude cheaper to allocate and access.

**Expected impact:** 5-10% TPS improvement
**Risk:** Low — but need to verify there is no concurrent access in the OCC path
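A sketch of candidate C under the stated assumption. `occStore` is a hypothetical single-owner write set; it only works if no other goroutine touches the store between creation and commit, which is exactly the verification the risk note calls for:

```go
package main

import "fmt"

// occStore sketches a per-task write cache backed by a plain map.
// ASSUMPTION: the store is owned by exactly one goroutine from task
// start to commit, so no synchronization is needed during execution.
type occStore struct {
	writes map[string][]byte
}

// newOCCStore pre-sizes the map; a plain map literal is one allocation,
// versus the multiple internal structures a sync.Map builds lazily.
func newOCCStore() *occStore {
	return &occStore{writes: make(map[string][]byte, 16)}
}

func (s *occStore) Set(key string, val []byte) { s.writes[key] = val }

func (s *occStore) Get(key string) ([]byte, bool) {
	v, ok := s.writes[key]
	return v, ok
}

func main() {
	s := newOCCStore()
	s.Set("nonce/0xabc", []byte{7})
	v, ok := s.Get("nonce/0xabc")
	fmt.Println(ok, v[0]) // true 7
}
```

Running the test suite with `go test -race` after the swap is the practical way to confirm the single-goroutine assumption holds in the OCC path.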

### D. Cache block-level constants per block (low impact)

Cache `ChainID()`, `DefaultChainConfig()`, `GetParams()`, and `EthereumConfigWithSstore()` at block level instead of recomputing them per tx.

**Expected impact:** 2-5% TPS improvement
**Risk:** Minimal
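The shape of candidate D can be sketched as follows; `blockConstants` and `computeBlockConstants` are illustrative names, not the real keeper API, and the values shown are placeholders:

```go
package main

import "fmt"

// blockConstants holds values that are invariant within a block
// (chain ID, chain config, module params): computed once at block
// start and read by every tx, instead of re-derived per transaction.
type blockConstants struct {
	chainID string
	params  map[string]string
}

// calls counts how many times the expensive derivation actually runs.
var calls int

// computeBlockConstants stands in for the per-block derivation
// (ChainID, DefaultChainConfig, GetParams, ...).
func computeBlockConstants() *blockConstants {
	calls++
	return &blockConstants{
		chainID: "chain-1",                      // placeholder value
		params:  map[string]string{"base": "1"}, // placeholder value
	}
}

func main() {
	bc := computeBlockConstants() // once, at block start (e.g. BeginBlock)
	for tx := 0; tx < 500; tx++ {
		_ = bc.chainID // per-tx reads hit the cached struct
	}
	fmt.Println("derivations per block:", calls) // 1
}
```

Since block execution is single-threaded at BeginBlock, populating the struct there needs no locking; OCC tx goroutines then treat it as read-only.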