fix: prevent goteth stalls on networks with large validator sets#269

Open
Zyra-V21 wants to merge 2 commits into dev from fix/validator-window-batched-deletes

Conversation

@Zyra-V21
Collaborator

Motivation

goteth-hoodi (~1.2M validators) requires a manual docker compose restart every 1-3 days because the head stops advancing. Two independent failure modes compound to produce that symptom:

  1. ClickHouse mutation pressure from per-epoch validator_window retention deletes
  2. A race in head-mode AdvanceFinalized that deadlocks the processer pool

Both are latent on mainnet at current validator counts. Mainnet has not yet hit them because it remains at chain head, where each invocation of the affected code paths is short. Any sustained backlog scenario surfaces the same deadlock.

Description

1. Batch validator_window retention deletes

validator_window emits DELETE FROM t_validator_rewards_summary WHERE f_epoch <= X on every FinalizedCheckpointEvent (one per epoch, ~6:24 min). On networks with ~1M validators the table is hundreds of GiB and tens of billions of rows, so each lightweight delete becomes a ClickHouse mutation that rewrites the bulk of the in-window parts. The delete cadence outruns the drain rate: the merge thread saturates at 120-190% of one core, INSERTs back up, and the analyzer head stops advancing.

This PR introduces DELETE_CADENCE_EPOCHS (default 32, ~3.4h). The val-window runner tracks the last fired boundary and skips the DELETE until the window's lower boundary has advanced by N epochs since the last successful fire. The first event after startup always fires to anchor the baseline. The SQL statement and boundary calculation are unchanged; the only observable difference is up to (cadence-1) extra epochs retained beyond the strict window (~0.16% overshoot of the 20250-epoch window).

The reorg recovery path (deleteValidatorRewardsInEpochQuery) is untouched, so surgical per-epoch deletes still fire immediately and reorg correctness is preserved.

Setting DELETE_CADENCE_EPOCHS=1 reproduces legacy per-epoch behaviour bit-for-bit.
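
A minimal sketch of the cadence gate, assuming hypothetical names (windowRunner, fireDelete, retentionEpochs, onFinalizedCheckpoint are placeholders, not the identifiers in the diff):

```go
package valwindow

const retentionEpochs = 20250 // strict retention window in epochs

type windowRunner struct {
	cadence      uint64                      // DELETE_CADENCE_EPOCHS, default 32
	lastBoundary uint64                      // lower boundary at the last successful DELETE
	firedOnce    bool                        // first event after startup always fires
	fireDelete   func(boundary uint64) error // emits the unchanged DELETE ... WHERE f_epoch <= boundary
}

// onFinalizedCheckpoint runs once per FinalizedCheckpointEvent (~6:24 min).
func (r *windowRunner) onFinalizedCheckpoint(finalizedEpoch uint64) error {
	if finalizedEpoch < retentionEpochs {
		return nil // window not yet full, nothing to prune
	}
	boundary := finalizedEpoch - retentionEpochs // boundary math is unchanged

	// Skip until the boundary has advanced by the cadence since the last fire;
	// the first event after startup always fires to anchor the baseline.
	if r.firedOnce && boundary < r.lastBoundary+r.cadence {
		return nil
	}
	if err := r.fireDelete(boundary); err != nil {
		return err // lastBoundary not advanced, so the next event retries
	}
	r.lastBoundary = boundary
	r.firedOnce = true
	return nil
}
```

Because lastBoundary only advances on a successful fire, a failed DELETE is retried on the next event rather than silently dropped.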

2. Serialize AdvanceFinalized

Each FinalizedCheckpointEvent in head mode launches a new go AdvanceFinalized(...). When the previous invocation is still running (typical with ~1M validators, since ProcessStateTransitionMetrics can exceed the ~6:24 min finalized interval), two goroutines race over the same StateHistory: the newer one runs CleanUpTo at the end of its loop and evicts entries that the older one is still blocked on inside StateHistory.Wait / BlockHistory.Wait. The blocked goroutine waits forever while holding a processerBook slot. Successive races leak the whole 32-slot pool, surfacing as floods of Waiting for too long to acquire page slot=N warnings and a stuck head.

This PR adds an advanceFinalizedMu sync.Mutex. AdvanceFinalized calls TryLock on entry; if the lock is held by a prior invocation, it logs the skip and returns. The skipped invocation would have iterated a subset of the state keys the next invocation will see, and its CleanUpTo would have been a subset of what the next one performs, so dropping it is monotonically safe.

The historical-mode synchronous call site (runHistorical then AdvanceFinalized) is unaffected: head mode only starts after historical completes, so TryLock always succeeds there.
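
A minimal sketch of the guard, with the surrounding struct and loop body simplified (ChainAnalyzer here is a stand-in; only advanceFinalizedMu and the TryLock pattern mirror what the PR adds):

```go
package analyzer

import "sync"

type ChainAnalyzer struct {
	advanceFinalizedMu sync.Mutex
	// ... StateHistory, BlockHistory, processerBook, etc.
}

func (a *ChainAnalyzer) AdvanceFinalized(finalizedEpoch uint64) {
	// If a prior invocation is still in flight, skip this one: its key set
	// and its CleanUpTo are subsets of what the next invocation performs.
	if !a.advanceFinalizedMu.TryLock() {
		// log the skip and return without touching StateHistory
		return
	}
	defer a.advanceFinalizedMu.Unlock()

	// ... iterate pending state keys, ProcessStateTransitionMetrics,
	// then CleanUpTo at the end of the loop ...
}
```

The deferred Unlock means an early return or panic inside the loop cannot leave the mutex held across future events.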

Type of change

  • Bug fix (non-breaking)
  • Breaking change
  • Refactor

Testing

go build ./... and go vet ./... clean.

End-to-end verified on goteth-hoodi over 6 days post-deploy (deployed 2026-05-05):

container uptime:                          45h (goteth), 50h (val-window), no manual restart
chain head vs db head gap:                 2 epochs (steady state finalized lag)
Waiting for too long (acquire page):       0 / 24h
Waiting for spec.AgnosticState warnings:   0 / 24h
AdvanceFinalized SKIPS (TryLock fired):    14 / 24h  (race triggers, fix catches it)
AdvanceFinalized completions:              211 / 24h
mutations on t_validator_rewards_summary:  61 / 24h, 0 failed
ClickHouse CPU:                            2-6%  (pre-fix sustained 120-190%)
analyzer RAM:                              4.5 GiB stable  (pre-stuck instances reached 12 GiB)

Pre-fix baseline on the same machine: a stall every 1-3 days requiring a manual restart; each restart aborted in-flight mutations and worsened the zombie ratio. Hoodi's t_validator_rewards_summary reached 65% zombies / 838 GiB / 70B physical rows for ~24.5B logical rows. Post-fix the zombie ratio is no longer growing; background merges will reclaim space gradually now that mutation pressure is bounded.

Reproduction

Mutation pressure (fix 1)

  1. Run goteth on a network with ~1M validators (hoodi, mainnet).
  2. Wait 1-3 days of normal operation.
  3. docker stats shows val-window at 0% CPU and ~6 MiB RAM, analyzer at <1% CPU, ClickHouse at 120-190% CPU sustained on one core.
  4. SELECT * FROM system.mutations WHERE table='t_validator_rewards_summary' AND is_done=0 shows a growing queue.
  5. Head stops advancing until manual restart.

AdvanceFinalized race (fix 2)

  1. Run goteth on a network where ProcessStateTransitionMetrics for one epoch can exceed 6:24 min (high validator count, or catch-up after a restart with backlog).
  2. Wait for two FinalizedCheckpointEvents to fire while the first AdvanceFinalized is still in flight.
  3. Logs show Waiting for spec.AgnosticState <epoch> repeating on the same epoch (the state dependency evicted by the newer invocation's CleanUpTo).
  4. Eventually Waiting for too long to acquire page slot=N floods as processerBook exhausts.
  5. Head stops advancing until manual restart.

Backwards compatibility

  • DELETE_CADENCE_EPOCHS defaults to 32; setting it to 1 reproduces the previous per-epoch behaviour exactly.
  • No schema changes, no migrations, no changes to emitted SQL.
  • val-window image and analyzer image can be rebuilt and recreated independently. The val-window fix only requires recreating the val-window container; the AdvanceFinalized fix only requires recreating the analyzer container.

Reviewer notes

  • The two fixes are independent and compound. Without (1) the analyzer falls behind under ClickHouse pressure; without (2) any catch-up regime can deadlock the processer pool. Together they close the failure mode end to end.
  • Fix (2) overlaps in symptom space with #264 (Fix fillToHead deadlock) but addresses a different code path: #264 enforces the historical-to-head handoff gap invariant, while this PR closes the race that triggers in steady-state head mode. Both warnings (Waiting for too long to acquire page, Waiting for spec.AgnosticBlock/State) can be produced by either path. Merging both gives full coverage of the saturation pattern in runHead's event loop.
  • sync.Mutex.TryLock requires Go 1.18+. go.mod is at 1.25.
  • Zombie cleanup of existing dead rows in t_validator_rewards_summary (~570 GiB on hoodi) is intentionally deferred. Background merges reclaim space gradually now that ClickHouse is no longer saturated. OPTIMIZE TABLE t_validator_rewards_summary FINAL can force a one-shot reclamation but is operator-scheduled and out of scope for this PR.

Zyra-V21 added 2 commits May 4, 2026 17:01
…mutation pressure

The val-window service emits a lightweight DELETE on
t_validator_rewards_summary every finalized checkpoint event (~6:24 min).
On networks with ~1M validators this turns into a ClickHouse mutation that
rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes
minutes and queues up faster than it can drain, saturating one merge core
and stalling the head until an operator restart.

Make the boundary advance gate the fire: only emit DELETE once the
window's lower boundary has advanced by DELETE_CADENCE_EPOCHS since the
last successful fire (default 32 epochs ~3.4h, ~70x headroom over the
~150s mutation cost on hoodi). The first event after start always fires
to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged - the only
observable difference is up to (cadence-1) extra epochs retained beyond
the strict window (0.16% overshoot vs. the 20250-epoch window). The
per-epoch surgical delete used by reorg recovery (DeleteStateMetrics) is
untouched. Set DELETE_CADENCE_EPOCHS=1 for legacy behaviour.
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new
`go AdvanceFinalized(...)`. When a previous invocation is still running
(common when ProcessStateTransitionMetrics takes longer than the ~6:24
min finalized interval — networks with ~1M validators, or any catch-up
scenario), two goroutines race over the same StateHistory: the newer one
runs CleanUpTo at the end of its loop and evicts entries that the older
one is still blocked on inside StateHistory.Wait / BlockHistory.Wait.
The blocked goroutine then waits forever holding a processerBook slot,
and successive races leak the whole 32-slot pool, surfacing as floods
of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at
epoch 93105 was evicted while a ProcessStateTransitionMetrics goroutine
held a Wait on it, blocking that slot for 30+ minutes; the analyzer
stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have
iterated a subset of the state keys the next invocation will see, and
its CleanUpTo would have been a subset of what the next one performs,
so dropping it is monotonically safe — no work is lost.

The historical-mode synchronous call site (routines.go:208) is
unaffected: head mode only starts after historical completes, so
TryLock always succeeds there.
@Zyra-V21
Collaborator Author

Closes #237
