Fix fillToHead deadlock #264

Open
BitWonka wants to merge 12 commits into migalabs:dev from BitWonka:feat-fill-to-head-loop

Conversation


@BitWonka BitWonka commented May 7, 2026

Motivation

After a long historical backfill, fillToHead returns the original headSlot it queried at the start and hands off to runHead. If the chain has moved forward more than SlotsPerEpoch during the backfill, runHead starts with a gap larger than the processer pool can safely absorb.

runHead's tight loop tries to enqueue every slot between the old nextSlotDownload and the current head SSE event:

for nextSlotDownload <= event.HeadEvent.Slot {
    if s.processerBook.NumFreePages() > 0 {
        s.downloadTaskChan <- nextSlotDownload
        nextSlotDownload++
    }
}

When the gap is hundreds or thousands of slots wide, every page in processerBook ends up held by ProcessBlock/ProcessStateTransitionMetrics goroutines that are blocked on BlockHistory.Wait for cross-epoch dependencies (state metrics for epoch E need blocks from epoch E-1, which are still in flight). No page ever frees, the loop spins forever, and the head channel is never drained. Goteth deadlocks until restart.

The symptom is repeated "Waiting for too long to acquire page slot=N" and "Waiting for spec.AgnosticBlock M" warnings on the same slots over many minutes.

Keeping the historical-to-head handoff gap below one epoch keeps the page demand bounded so the deadlock cannot form.

Related links:

Description

Wrap the runHistorical call in fillToHead with a loop that re-queries RequestCurrentHead after each pass. If the new head is more than SlotsPerEpoch ahead of the previous headSlot, run another historical pass for the gap. Once the gap is within one epoch, return and let runHead take over.
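The loop described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: the closures standing in for RequestCurrentHead and runHistorical, and the slot numbers in main, are hypothetical; only the convergence logic mirrors the description.

```go
package main

import "fmt"

const SlotsPerEpoch = 32

// fillToHead sketch: run historical passes until the remaining gap to the
// live head is within one epoch, then return so head mode can take over.
// requestCurrentHead and runHistorical are stand-ins for the real beacon
// client call and backfill pass (hypothetical signatures).
func fillToHead(start uint64, requestCurrentHead func() uint64,
	runHistorical func(from, to uint64)) uint64 {
	headSlot := requestCurrentHead()
	for {
		runHistorical(start, headSlot)
		newHead := requestCurrentHead()
		if newHead-headSlot <= SlotsPerEpoch {
			// Gap is within one epoch: safe to hand off to runHead.
			return newHead
		}
		fmt.Printf("head moved %d slots during catch-up, looping historical\n",
			newHead-headSlot)
		start, headSlot = headSlot+1, newHead
	}
}

func main() {
	// Fake chain: each historical pass takes long enough for the head to
	// advance by a shrinking amount, so the loop converges.
	head := uint64(14230452)
	advances := []uint64{1066, 123, 5}
	i := 0
	request := func() uint64 { return head }
	historical := func(from, to uint64) {
		if i < len(advances) {
			head += advances[i]
			i++
		}
	}
	final := fillToHead(14200000, request, historical)
	fmt.Println("handoff at slot", final)
}
```

The key design point is that the exit condition compares against a freshly queried head on every pass, so the gap shrinks as long as backfill runs faster than the chain advances.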

Type of change

  • Bug fix (non-breaking)
  • New feature (non-breaking)
  • Breaking change (CLI flag rename, schema change, behavior change)
  • Documentation only
  • Refactor / internal cleanup
  • Performance improvement

Tasks

  • Add loop around runHistorical
  • Re-query head after each pass
  • Handoff threshold of SlotsPerEpoch
  • Verified in deployment: logs show convergence (large gap, smaller gap, handoff)

Testing

go build ./...

End-to-end verified on production. The deadlock requires real backpressure on processerBook plus chain advancement during backfill, so unit-testing it would require mocking the beacon client, the SSE stream, and the processer pool together.

Reproduction steps (for bug fixes)

  1. Run goteth, then stop it for a few hours.
  2. Restart goteth.
  3. While runHistorical runs, the chain advances by another K slots beyond the start-time head.
  4. When runHistorical returns, runHead takes over with a K-slot gap.
  5. If K is large enough (in practice, more than the processer pool size, ~hundreds of slots), the loop in runHead saturates processerBook and deadlocks. Logs show repeated "Waiting for too long to acquire page slot=N" warnings on the same slots over many minutes with no progress.

Mitigation options considered

  • Resize the processer pool: papers over the deadlock without fixing the root cause. A larger pool just shifts the threshold.
  • Single re-check, no loop: insufficient when the historical pass itself takes long enough that the chain moves another epoch during the second pass.

Proof of Success

Real run on 2026-05-01 showing the loop converging and handing off cleanly:

time="2026-05-01T02:24:13Z" level=info msg="head moved 1066 slots during catch-up, looping historical" module=analyzer
time="2026-05-01T02:24:13Z" level=info msg="Switch to historical mode: 14230453 - 14231518" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="historical mode: all download tasks sent" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="head moved 123 slots during catch-up, looping historical" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="Switch to historical mode: 14231519 - 14231641" module=analyzer
time="2026-05-01T02:51:46Z" level=info msg="historical mode: all download tasks sent" module=analyzer
time="2026-05-01T02:51:46Z" level=info msg="waiting for remaining historical blocks (14231481 to 14231641) to complete..." module=analyzer
time="2026-05-01T02:52:30Z" level=info msg="Switch to head mode: following chain head" module=analyzer

Two looped passes after the initial backfill: a 1066-slot gap, then a 123-slot gap, then under one epoch so the handoff happens.

Pre-fix run that deadlocked (2026-04-28) shows the failure mode for comparison:

time="2026-04-28T05:57:24Z" level=warning msg="Waiting for spec.AgnosticBlock 14210688..." module=analyzer
time="2026-04-28T05:57:25Z" level=warning msg="Waiting for too long to acquire page slot=14210913..." bookTag=processer module=utils
time="2026-04-28T05:57:25Z" level=warning msg="Waiting for spec.AgnosticBlock 14210910..." module=analyzer
... (same warnings on the same slots repeating for 15+ minutes, no progress)

Documentation

  • README.md updated (if user-facing flag, install, or run change)
  • docs/tables.md updated (if persisted schema change)
  • Inline comments added where the why is non-obvious

Backwards compatibility

No breaking changes.

Reviewer notes

  • Handoff threshold is SlotsPerEpoch (32 slots, ~6.4 minutes). Small enough that runHead cannot saturate the processer pool, large enough to avoid edge cases where head moves a slot or two during the head re-query itself.
  • The wait-group Add(1) is inside the loop because runHistorical does defer s.wgMainRoutine.Done() on entry. Each iteration is a self-contained add/done pair.

leobago and others added 12 commits February 17, 2026 16:29
Fix Lighthouse v8.1.0 SSE race condition and reward calculation bugs
fix: historical deadlock, attestation flag, concurrent map race, orphan duties
fix: propagate block changes to dependent epochs after reorg + v3.8.0
Fix RoutineBook.Acquire deadlock causing missing block rewards
fix: transaction value uint64 overflow and Float32 precision loss
Fix ProcessSlashings accumulation + ManualReward race condition + block rewards validation
fix: prevent Wait() deadlocks, remove dead relays, add relay circuit breaker
fix(relay): remove securerpc and wenmerge mainnet relays
@BitWonka changed the title from "Fix fillToHead" to "Fix fillToHead deadlock" on May 7, 2026
@Zyra-V21
Collaborator

Thanks for tracking this one down — the deadlock pattern is real, the analysis is right, and the reproduction is convincing. The fix is correct in what it does: it enforces an invariant on entry to runHead (the handoff gap is bounded), and that invariant happens to be exactly what runHead's inner enqueue loop needs in order not to deadlock against processerBook. Happy with the wgMainRoutine balancing and the convergence behavior shown in the log output.

A couple of nits worth tightening before merge:

  • Threshold has no headroom. SlotsPerEpoch is exactly processerBook's capacity, so the last loop iteration can leave a gap that immediately saturates the pool the moment runHead starts dispatching. Worth dropping the threshold a bit (or deriving it from the pool size with margin) so the first burst of head events doesn't sit on the edge.
  • The inline comment could spell out the link between the threshold and the pool size — right now a future reader sees SlotsPerEpoch and has to reverse-engineer why that number specifically. One extra line explaining "matches processerBook capacity so runHead's enqueue burst cannot saturate the pool" would save someone the archaeology later.
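One way the headroom suggestion in the first nit could look, as a sketch under assumed names (handoffThreshold and the 25% margin are illustrative choices, not goteth code):

```go
package main

import "fmt"

const SlotsPerEpoch = 32

// handoffThreshold derives the historical-to-head handoff gap from the
// processer pool size, keeping a safety margin so runHead's first burst
// of enqueues cannot saturate the pool the moment it starts dispatching.
func handoffThreshold(poolSize uint64) uint64 {
	t := poolSize * 3 / 4 // leave 25% headroom for in-flight head events
	if t > SlotsPerEpoch {
		t = SlotsPerEpoch // never hand off with more than an epoch outstanding
	}
	if t == 0 {
		t = 1 // degenerate pool sizes still get a usable threshold
	}
	return t
}

func main() {
	fmt.Println(handoffThreshold(32)) // 24: headroom below pool capacity
	fmt.Println(handoffThreshold(64)) // 32: capped at SlotsPerEpoch
}
```

Deriving the number from the pool size rather than hard-coding SlotsPerEpoch also makes the link between the two constants explicit, which addresses the second nit at the same time.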

Bigger picture though: this PR addresses the symptom (the handoff is the path that triggers the deadlock today) but the underlying fragility lives deeper in runHead's event handler — specifically in how the inner dispatch loop interacts with processerBook under saturation. The handoff invariant keeps the bug out of the hot path, which is fine for shipping. But if you have the appetite, it would be top-tier to follow this up with a more complete RCA on runHead's saturation behavior — i.e. what happens when the pool fills for reasons other than handoff (slow ClickHouse flush, SSE burst after reconnection, etc.). The shape of the deadlock you describe isn't unique to the handoff; it's reachable any time the pool stays full long enough for cross-epoch dependencies to lock the chain. The same patch wouldn't help in those scenarios. Worth a look.

Not asking you to expand the PR — land this as the tactical fix, it does its job. But if you want to file a follow-up issue with the broader saturation pattern (and what would need to change in the event loop to handle it robustly), that would be the higher-leverage contribution.
