fix: prevent goteth stalls on networks with large validator sets #269
Open
Zyra-V21 wants to merge 2 commits into
…mutation pressure

The val-window service emits a lightweight DELETE on `t_validator_rewards_summary` every finalized checkpoint event (~6:24 min). On networks with ~1M validators this lowers to a ClickHouse mutation that rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes minutes and queues up faster than it can drain, saturating one merge core and stalling the head until an operator restart.

Make the boundary advance gate the fire: only emit the DELETE once the window's lower boundary has advanced by `DELETE_CADENCE_EPOCHS` since the last successful fire (default 32 epochs, ~3.4 h, ~70x headroom over the ~150 s mutation cost on hoodi). The first event after start always fires to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged; the only observable difference is up to (cadence-1) extra epochs retained beyond the strict window (0.16% overshoot vs. the 20250-epoch window). The per-epoch surgical delete used by reorg recovery (`DeleteStateMetrics`) is untouched. Set `DELETE_CADENCE_EPOCHS=1` for legacy behaviour.
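The gating logic described above is small enough to sketch. This is an illustrative stand-in, not the PR's actual code: the type and method names (`cadenceGate`, `shouldFire`) are invented here, but the behaviour matches the commit message (first event always fires, later events skip until the boundary has advanced by the cadence).

```go
package main

import "fmt"

// cadenceGate decides whether the retention DELETE should fire for a given
// lower window boundary. Illustrative names; not the repository's code.
type cadenceGate struct {
	cadenceEpochs uint64 // DELETE_CADENCE_EPOCHS, default 32
	lastFired     uint64 // boundary of the last successful fire
	fired         bool   // false until the first event anchors the baseline
}

// shouldFire reports whether to emit the DELETE for this boundary and, when
// it does fire, records the boundary as the new baseline.
func (g *cadenceGate) shouldFire(boundary uint64) bool {
	if !g.fired {
		// First event after startup always fires to anchor the baseline.
		g.fired = true
		g.lastFired = boundary
		return true
	}
	if boundary < g.lastFired+g.cadenceEpochs {
		return false // boundary has not advanced far enough yet
	}
	g.lastFired = boundary
	return true
}

func main() {
	g := &cadenceGate{cadenceEpochs: 32}
	fmt.Println(g.shouldFire(100)) // true: first event anchors the baseline
	fmt.Println(g.shouldFire(101)) // false: only 1 epoch of advance
	fmt.Println(g.shouldFire(132)) // true: advanced by 32 epochs
	fmt.Println(g.shouldFire(133)) // false: cadence now measured from 132
}
```

Setting `cadenceEpochs` to 1 makes every boundary advance fire, which is the legacy per-epoch behaviour.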
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new `go AdvanceFinalized(...)`. When a previous invocation is still running (common when `ProcessStateTransitionMetrics` takes longer than the ~6:24 min finalized interval: networks with ~1M validators, or any catch-up scenario), two goroutines race over the same `StateHistory`: the newer one runs `CleanUpTo` at the end of its loop and evicts entries that the older one is still blocked on inside `StateHistory.Wait` / `BlockHistory.Wait`. The blocked goroutine then waits forever holding a `processerBook` slot, and successive races leak the whole 32-slot pool, surfacing as floods of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at epoch 93105 was evicted while a `ProcessStateTransitionMetrics` goroutine held a Wait on it, blocking that slot for 30+ minutes; the analyzer stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have iterated a subset of the state keys the next invocation will see, and its `CleanUpTo` would have been a subset of what the next one performs, so dropping it is monotonically safe: no work is lost. The historical-mode synchronous call site (routines.go:208) is unaffected: head mode only starts after historical completes, so TryLock always succeeds there.
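The skip-on-overlap pattern can be sketched as follows. This is a minimal illustration, not the repository's code: the bool return and the sleep are stand-ins added here so the skip is observable, while the mutex name and `TryLock` usage follow the PR text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// advanceFinalizedMu serializes AdvanceFinalized. An invocation that arrives
// while a prior one is in flight is skipped, not queued.
var advanceFinalizedMu sync.Mutex

// AdvanceFinalized reports whether it actually ran (false means it was
// skipped because a previous invocation still holds the lock).
func AdvanceFinalized(epoch uint64) bool {
	if !advanceFinalizedMu.TryLock() {
		// Safe to drop: the in-flight run plus the next event cover a
		// superset of this invocation's state keys and CleanUpTo range.
		fmt.Printf("skip AdvanceFinalized(%d): previous run in flight\n", epoch)
		return false
	}
	defer advanceFinalizedMu.Unlock()
	time.Sleep(20 * time.Millisecond) // stand-in for slow state processing
	fmt.Printf("advanced to epoch %d\n", epoch)
	return true
}

func main() {
	done := make(chan bool, 1)
	go func() { done <- AdvanceFinalized(100) }() // slow run in flight
	time.Sleep(5 * time.Millisecond)
	fmt.Println("second event ran:", AdvanceFinalized(101)) // skipped
	fmt.Println("first event ran:", <-done)
}
```

Because the skipped invocation simply returns, it never takes a `processerBook` slot it cannot release, which is what breaks the leak chain described above. Note that `sync.Mutex.TryLock` exists since Go 1.18.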
Closes #237
Motivation
goteth-hoodi (~1.2M validators) requires a manual `docker compose restart` every 1-3 days because the head stops advancing. Two independent failure modes compound to produce that symptom:

1. retention deletes on `t_validator_rewards_summary` that saturate ClickHouse with mutations
2. a race in `AdvanceFinalized` that deadlocks the processer pool

Both are latent on mainnet at current validator counts. Mainnet has not yet hit them because it remains at chain head, where each invocation of the affected code paths is short. Any sustained backlog scenario surfaces the same deadlock.
Description
1. Batch validator_window retention deletes
`validator_window` emits `DELETE FROM t_validator_rewards_summary WHERE f_epoch <= X` on every FinalizedCheckpointEvent (one per epoch, ~6:24 min). On networks with ~1M validators the table is hundreds of GiB and tens of billions of rows; each lightweight delete becomes a ClickHouse mutation that rewrites the bulk of the in-window parts. Cadence outruns drain, the merge thread saturates at 120-190% of one core, INSERTs back up, and the analyzer head stops advancing.

This PR introduces `DELETE_CADENCE_EPOCHS` (default 32, ~3.4 h). The val-window runner tracks the last fired boundary and skips the DELETE until the window's lower boundary has advanced by N epochs since the last successful fire. The first event after startup always fires to anchor the baseline. The SQL statement and boundary calculation are unchanged; the only observable difference is up to (cadence-1) extra epochs retained beyond the strict window (~0.16% overshoot of the 20250-epoch window).

The reorg recovery path (`deleteValidatorRewardsInEpochQuery`) is untouched, so surgical per-epoch deletes still fire immediately and reorg correctness is preserved.

Setting `DELETE_CADENCE_EPOCHS=1` reproduces legacy per-epoch behaviour bit-for-bit.

2. Serialize AdvanceFinalized
Each FinalizedCheckpointEvent in head mode launches a new `go AdvanceFinalized(...)`. When the previous invocation is still running (typical with 1M validators, since `ProcessStateTransitionMetrics` can exceed the ~6:24 min finalized interval), two goroutines race over the same `StateHistory`: the newer one runs `CleanUpTo` at the end of its loop and evicts entries that the older one is still blocked on inside `StateHistory.Wait` / `BlockHistory.Wait`. The blocked goroutine waits forever holding a `processerBook` slot. Successive races leak the whole 32-slot pool, surfacing as floods of `Waiting for too long to acquire page slot=N` warnings and a stuck head.

This PR adds `advanceFinalizedMu sync.Mutex`. `AdvanceFinalized` calls `TryLock` at entry; if the lock is held by a prior invocation, it logs and returns. The skipped invocation would have iterated a subset of the state keys the next invocation will see, and its `CleanUpTo` would have been a subset of what the next one performs. Dropping it is monotonically safe.

The historical-mode synchronous call site (`runHistorical`, then `AdvanceFinalized`) is unaffected: head mode only starts after historical completes, so `TryLock` always succeeds there.

Type of change
Testing
`go build ./...` and `go vet ./...` are clean. End-to-end verified on goteth-hoodi over 6 days post-deploy (deployed 2026-05-05):

Pre-fix baseline on the same machine: a stall every 1-3 days requiring a manual restart; each restart aborted in-flight mutations and worsened the zombie ratio. Hoodi's `t_validator_rewards_summary` reached 65% zombies / 838 GiB / 70B physical rows for ~24.5B logical rows. Post-fix the zombie ratio is no longer growing; background merges will reclaim space gradually now that mutation pressure is bounded.
Mutation pressure (fix 1)
- `docker stats` shows val-window at 0% CPU and ~6 MiB RAM, analyzer at <1% CPU, and ClickHouse at 120-190% CPU sustained on one core.
- `SELECT * FROM system.mutations WHERE table='t_validator_rewards_summary' AND is_done=0` shows a growing queue.

AdvanceFinalized race (fix 2)
- `ProcessStateTransitionMetrics` for one epoch can exceed 6:24 min (high validator count, or catch-up after a restart with backlog).
- A new FinalizedCheckpointEvent fires while the previous `AdvanceFinalized` is still in flight.
- `Waiting for spec.AgnosticState <epoch>` repeats on the same epoch (the state dependency evicted by the newer invocation's `CleanUpTo`).
- `Waiting for too long to acquire page slot=N` floods as `processerBook` exhausts.

Backwards compatibility
`DELETE_CADENCE_EPOCHS` defaults to 32; setting it to 1 reproduces the previous per-epoch behaviour exactly.

Reviewer notes
- Both stall symptoms (`Waiting for too long to acquire page`, `Waiting for spec.AgnosticBlock/State`) can be produced by either path; merging both fixes gives full coverage of the saturation pattern in `runHead`'s event loop.
- `sync.Mutex.TryLock` requires Go 1.18+; `go.mod` is at 1.25.
- Reclaiming the zombie space already accumulated in `t_validator_rewards_summary` (~570 GiB on hoodi) is intentionally deferred. Background merges reclaim space gradually now that ClickHouse is no longer saturated. `OPTIMIZE TABLE t_validator_rewards_summary FINAL` can force a one-shot reclamation but is operator-scheduled and out of scope for this PR.