Fix fillToHead deadlock #264
Thanks for tracking this one down — the deadlock pattern is real, the analysis is right, and the reproduction is convincing. The fix is correct in what it does: it enforces an invariant on entry to `runHead`. A couple of nits worth tightening before merge:

Bigger picture, though: this PR addresses the symptom (the handoff is the path that triggers the deadlock today), but the underlying fragility lives deeper in `runHead`'s event loop. Not asking you to expand the PR — land this as the tactical fix; it does its job. But if you want to file a follow-up issue with the broader saturation pattern (and what would need to change in the event loop to handle it robustly), that would be the higher-leverage contribution.
Motivation
After a long historical backfill, `fillToHead` returns the original `headSlot` it queried at the start and hands off to `runHead`. If the chain has moved forward more than `SlotsPerEpoch` during the backfill, `runHead` starts with a gap larger than the processer pool can safely absorb. `runHead`'s tight loop tries to enqueue every slot between the old `nextSlotDownload` and the current head SSE event.

When the gap is hundreds or thousands of slots wide, every page in `processerBook` ends up held by `ProcessBlock`/`ProcessStateTransitionMetrics` goroutines that are blocked on `BlockHistory.Wait` for cross-epoch dependencies (state metrics for epoch E need blocks from epoch E-1, which are still in flight). No page ever frees, the loop spins forever, and the head channel is never drained. Goteth deadlocks until restart.

The symptom is repeated `Waiting for too long to acquire page slot=N` and `Waiting for spec.AgnosticBlock M` warnings on the same slots over many minutes.
Keeping the historical-to-head handoff gap below one epoch keeps the page demand bounded so the deadlock cannot form.
Related links:
Description
Wrap the `runHistorical` call in `fillToHead` with a loop that re-queries `RequestCurrentHead` after each pass. If the new head is more than `SlotsPerEpoch` ahead of the previous `headSlot`, run another historical pass for the gap. Once the gap is within one epoch, return and let `runHead` take over.

Type of change
Tasks
Testing
`go build ./...`

End-to-end verified on production. The deadlock requires real backpressure on `processerBook` plus chain advancement during backfill, so unit-testing means mocking the beacon client, SSE stream, and processer pool together.
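For concreteness, the handoff loop from the Description might be sketched as below. The names `fillToHead`, `runHistorical`, `RequestCurrentHead`, and `SlotsPerEpoch` come from the PR text, but the signatures and wiring are assumptions, not the actual implementation; the head values mirror the gaps from the proof-of-success run.

```go
package main

import "fmt"

const SlotsPerEpoch = 32 // mainnet value

// fillToHead keeps running historical passes until the chain head is
// within one epoch of where the last pass ended, then returns so
// runHead starts with a bounded gap. Sketch only: the real function
// also manages WaitGroups and the SSE subscription.
func fillToHead(start uint64, requestCurrentHead func() uint64, runHistorical func(from, to uint64)) uint64 {
	headSlot := requestCurrentHead()
	for headSlot-start > SlotsPerEpoch {
		runHistorical(start, headSlot)
		start = headSlot
		headSlot = requestCurrentHead() // the chain may have advanced meanwhile
	}
	return headSlot
}

func main() {
	// Heads observed on successive queries: gaps of 1066, 123, then 10 slots.
	heads := []uint64{2066, 2189, 2199}
	i := 0
	next := func() uint64 {
		h := heads[i]
		if i < len(heads)-1 {
			i++
		}
		return h
	}
	final := fillToHead(1000, next, func(from, to uint64) {
		fmt.Printf("historical pass %d -> %d\n", from, to)
	})
	fmt.Println("handoff at head", final) // handoff at head 2199
}
```

With these inputs the loop runs two extra passes and then returns, matching the converging-gap behavior described under Proof of Success.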
Reproduction steps (for bug fixes)
`Waiting for too long to acquire page slot=N` warnings on the same slots over many minutes with no progress.

Mitigation options considered
Proof of Success
Real run on 2026-05-01 showing the loop converging and handing off cleanly:
Three iterations: 1066-slot gap, 123-slot gap, then under one epoch so the handoff happens.
Pre-fix run that deadlocked (2026-04-28) shows the failure mode for comparison:
Documentation
`README.md` updated (if user-facing flag, install, or run change)
`docs/tables.md` updated (if persisted schema change)

Backwards compatibility
No
Reviewer notes
`Add(1)` is inside the loop because `runHistorical` does `defer s.wgMainRoutine.Done()` on entry. Each iteration is a self-contained add/done pair.
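A minimal sketch of that pairing, with hypothetical names mirroring the note above (assuming each pass calls `Done` via a defer on entry, as the note describes; this is not the goteth code):

```go
package main

import (
	"fmt"
	"sync"
)

// runPasses mirrors the Add/Done pairing from the reviewer note:
// Add(1) inside the loop, Done deferred on entry to each pass.
func runPasses(n int) int {
	var wgMainRoutine sync.WaitGroup
	passes := 0

	runHistorical := func(pass int) {
		defer wgMainRoutine.Done() // the pass owns its own Done
		passes++
	}

	// Each iteration is a self-contained Add/Done pair, so the counter
	// stays balanced however many passes end up running.
	for pass := 1; pass <= n; pass++ {
		wgMainRoutine.Add(1)
		runHistorical(pass)
	}
	wgMainRoutine.Wait()
	return passes
}

func main() {
	fmt.Println("completed passes:", runPasses(3)) // completed passes: 3
}
```

Hoisting a single `Add(n)` above the loop would require knowing the pass count up front, which the converging handoff loop does not.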