fix: prevent goteth stalls on networks with large validator sets#269

Open
Zyra-V21 wants to merge 2 commits into dev from fix/validator-window-batched-deletes

Conversation

@Zyra-V21
Collaborator

Motivation

goteth-hoodi (~1.2M validators) requires a manual docker compose restart every 1-3 days because the head stops advancing. Two independent failure modes compound to produce that symptom:

  1. ClickHouse mutation pressure from per-epoch validator_window retention deletes
  2. A race in head-mode AdvanceFinalized that deadlocks the processer pool

Both are latent on mainnet at current validator counts. Mainnet has not yet hit them because it remains at chain head, where each invocation of the affected code paths is short. Any sustained backlog scenario surfaces the same deadlock.

Description

1. Batch validator_window retention deletes

validator_window emits DELETE FROM t_validator_rewards_summary WHERE f_epoch <= X on every FinalizedCheckpointEvent (one per epoch, ~6:24 min). On networks with ~1M validators the table is hundreds of GiB and tens of billions of rows, so each lightweight delete becomes a ClickHouse mutation that rewrites the bulk of the in-window parts. The delete cadence outruns the drain rate: the merge thread saturates at 120-190% of one core, INSERTs back up, and the analyzer head stops advancing.

This PR introduces DELETE_CADENCE_EPOCHS (default 32, ~3.4h). The val-window runner tracks the last fired boundary and skips the DELETE until the window's lower boundary has advanced by N epochs since the last successful fire. The first event after startup always fires to anchor the baseline. The SQL statement and boundary calculation are unchanged; the only observable difference is up to (cadence-1) extra epochs retained beyond the strict window (~0.16% overshoot of the 20250-epoch window).

The reorg recovery path (deleteValidatorRewardsInEpochQuery) is untouched, so surgical per-epoch deletes still fire immediately and reorg correctness is preserved.

Setting DELETE_CADENCE_EPOCHS=1 reproduces legacy per-epoch behaviour bit-for-bit.
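
A minimal sketch of the cadence gate, assuming hypothetical names (windowRunner, fireDelete, retentionEpochs, onFinalizedCheckpoint are placeholders, not the identifiers in the diff):

```go
package valwindow

const retentionEpochs = 20250 // strict retention window in epochs

type windowRunner struct {
	cadence      uint64                      // DELETE_CADENCE_EPOCHS, default 32
	lastBoundary uint64                      // lower boundary at the last successful DELETE
	firedOnce    bool                        // first event after startup always fires
	fireDelete   func(boundary uint64) error // emits the unchanged DELETE ... WHERE f_epoch <= boundary
}

// onFinalizedCheckpoint runs once per FinalizedCheckpointEvent (~6:24 min).
func (r *windowRunner) onFinalizedCheckpoint(finalizedEpoch uint64) error {
	if finalizedEpoch < retentionEpochs {
		return nil // window not yet full, nothing to prune
	}
	boundary := finalizedEpoch - retentionEpochs // boundary math is unchanged

	// Skip until the boundary has advanced by the cadence since the last fire;
	// the first event after startup always fires to anchor the baseline.
	if r.firedOnce && boundary < r.lastBoundary+r.cadence {
		return nil
	}
	if err := r.fireDelete(boundary); err != nil {
		return err // lastBoundary not advanced, so the next event retries
	}
	r.lastBoundary = boundary
	r.firedOnce = true
	return nil
}
```

Because lastBoundary only advances on a successful fire, a failed DELETE is retried on the next event rather than silently dropped.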

2. Serialize AdvanceFinalized

Each FinalizedCheckpointEvent in head mode launches a new go AdvanceFinalized(...). When the previous invocation is still running (typical with ~1M validators, since ProcessStateTransitionMetrics can exceed the ~6:24 min finalized interval), two goroutines race over the same StateHistory: the newer one runs CleanUpTo at the end of its loop and evicts entries that the older one is still blocked on inside StateHistory.Wait / BlockHistory.Wait. The blocked goroutine waits forever while holding a processerBook slot. Successive races leak the whole 32-slot pool, surfacing as floods of Waiting for too long to acquire page slot=N warnings and a stuck head.

This PR adds an advanceFinalizedMu sync.Mutex. AdvanceFinalized calls TryLock on entry; if the lock is held by a prior invocation, it logs the skip and returns. The skipped invocation would have iterated a subset of the state keys the next invocation will see, and its CleanUpTo would have been a subset of what the next one performs, so dropping it is monotonically safe.

The historical-mode synchronous call site (runHistorical then AdvanceFinalized) is unaffected: head mode only starts after historical completes, so TryLock always succeeds there.
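
A minimal sketch of the guard, with the surrounding struct and loop body simplified (ChainAnalyzer here is a stand-in; only advanceFinalizedMu and the TryLock pattern mirror what the PR adds):

```go
package analyzer

import "sync"

type ChainAnalyzer struct {
	advanceFinalizedMu sync.Mutex
	// ... StateHistory, BlockHistory, processerBook, etc.
}

func (a *ChainAnalyzer) AdvanceFinalized(finalizedEpoch uint64) {
	// If a prior invocation is still in flight, skip this one: its key set
	// and its CleanUpTo are subsets of what the next invocation performs.
	if !a.advanceFinalizedMu.TryLock() {
		// log the skip and return without touching StateHistory
		return
	}
	defer a.advanceFinalizedMu.Unlock()

	// ... iterate pending state keys, ProcessStateTransitionMetrics,
	// then CleanUpTo at the end of the loop ...
}
```

The deferred Unlock means an early return or panic inside the loop cannot leave the mutex held across future events.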

Type of change

  • Bug fix (non-breaking)
  • Breaking change
  • Refactor

Testing

go build ./... and go vet ./... clean.

End-to-end verified on goteth-hoodi over 6 days post-deploy (deployed 2026-05-05):

container uptime:                          45h (goteth), 50h (val-window), no manual restart
chain head vs db head gap:                 2 epochs (steady state finalized lag)
Waiting for too long (acquire page):       0 / 24h
Waiting for spec.AgnosticState warnings:   0 / 24h
AdvanceFinalized SKIPS (TryLock fired):    14 / 24h  (race triggers, fix catches it)
AdvanceFinalized completions:              211 / 24h
mutations on t_validator_rewards_summary:  61 / 24h, 0 failed
ClickHouse CPU:                            2-6%  (pre-fix sustained 120-190%)
analyzer RAM:                              4.5 GiB stable  (pre-stuck instances reached 12 GiB)

Pre-fix baseline on the same machine: a stall every 1-3 days requiring a manual restart; each restart aborted in-flight mutations and worsened the zombie ratio. Hoodi's t_validator_rewards_summary reached 65% zombies / 838 GiB / 70B physical rows for ~24.5B logical rows. Post-fix the zombie ratio is no longer growing; background merges will reclaim space gradually now that mutation pressure is bounded.

Reproduction

Mutation pressure (fix 1)

  1. Run goteth on a network with ~1M validators (hoodi, mainnet).
  2. Wait 1-3 days of normal operation.
  3. docker stats shows val-window at 0% CPU and ~6 MiB RAM, analyzer at <1% CPU, ClickHouse at 120-190% CPU sustained on one core.
  4. SELECT * FROM system.mutations WHERE table='t_validator_rewards_summary' AND is_done=0 shows a growing queue.
  5. Head stops advancing until manual restart.

AdvanceFinalized race (fix 2)

  1. Run goteth on a network where ProcessStateTransitionMetrics for one epoch can exceed 6:24 min (high validator count, or catch-up after a restart with backlog).
  2. Wait for two FinalizedCheckpointEvents to fire while the first AdvanceFinalized is still in flight.
  3. Logs show Waiting for spec.AgnosticState <epoch> repeating on the same epoch (the state dependency evicted by the newer invocation's CleanUpTo).
  4. Eventually Waiting for too long to acquire page slot=N floods as processerBook exhausts.
  5. Head stops advancing until manual restart.

Backwards compatibility

  • DELETE_CADENCE_EPOCHS defaults to 32; setting it to 1 reproduces the previous per-epoch behaviour exactly.
  • No schema changes, no migrations, no changes to emitted SQL.
  • val-window image and analyzer image can be rebuilt and recreated independently. The val-window fix only requires recreating the val-window container; the AdvanceFinalized fix only requires recreating the analyzer container.

Reviewer notes

  • The two fixes are independent and compound. Without (1) the analyzer falls behind under ClickHouse pressure; without (2) any catch-up regime can deadlock the processer pool. Together they close the failure mode end to end.
  • Fix (2) overlaps in symptom space with #264 (Fix fillToHead deadlock) but addresses a different code path: #264 enforces the historical-to-head handoff gap invariant, while this PR closes the race that triggers in steady-state head mode. Both warnings (Waiting for too long to acquire page, Waiting for spec.AgnosticBlock/State) can be produced by either path. Merging both gives full coverage of the saturation pattern in runHead's event loop.
  • sync.Mutex.TryLock requires Go 1.18+. go.mod is at 1.25.
  • Zombie cleanup of existing dead rows in t_validator_rewards_summary (~570 GiB on hoodi) is intentionally deferred. Background merges reclaim space gradually now that ClickHouse is no longer saturated. OPTIMIZE TABLE t_validator_rewards_summary FINAL can force a one-shot reclamation but is operator-scheduled and out of scope for this PR.

Zyra-V21 added 2 commits May 4, 2026 17:01
…mutation pressure

The val-window service emits a lightweight DELETE on
t_validator_rewards_summary every finalized checkpoint event (~6:24 min).
On networks with ~1M validators this turns into a ClickHouse mutation that
rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes
minutes and queues up faster than it can drain, saturating one merge core
and stalling the head until an operator restart.

Make the boundary advance gate the fire: only emit DELETE once the
window's lower boundary has advanced by DELETE_CADENCE_EPOCHS since the
last successful fire (default 32 epochs ~3.4h, ~70x headroom over the
~150s mutation cost on hoodi). The first event after start always fires
to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged - the only
observable difference is up to (cadence-1) extra epochs retained beyond
the strict window (0.16% overshoot vs. the 20250-epoch window). The
per-epoch surgical delete used by reorg recovery (DeleteStateMetrics) is
untouched. Set DELETE_CADENCE_EPOCHS=1 for legacy behaviour.
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new
`go AdvanceFinalized(...)`. When a previous invocation is still running
(common when ProcessStateTransitionMetrics takes longer than the ~6:24
min finalized interval — networks with ~1M validators, or any catch-up
scenario), two goroutines race over the same StateHistory: the newer one
runs CleanUpTo at the end of its loop and evicts entries that the older
one is still blocked on inside StateHistory.Wait / BlockHistory.Wait.
The blocked goroutine then waits forever holding a processerBook slot,
and successive races leak the whole 32-slot pool, surfacing as floods
of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at
epoch 93105 was evicted while a ProcessStateTransitionMetrics goroutine
held a Wait on it, blocking that slot for 30+ minutes; the analyzer
stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have
iterated a subset of the state keys the next invocation will see, and
its CleanUpTo would have been a subset of what the next one performs,
so dropping it is monotonically safe — no work is lost.

The historical-mode synchronous call site (routines.go:208) is
unaffected: head mode only starts after historical completes, so
TryLock always succeeds there.
@Zyra-V21
Collaborator Author

Closes #237
