
[WIP]fix: Publisher-mode synchronization option for failover scenario#3222

Open
alpe wants to merge 2 commits into main from alex/sync_race

Conversation


@alpe alpe commented Mar 31, 2026

Overview

E2E HA tests sometimes fail due to a race in which the leader blocks waiting for P2P sync to complete on a fresh start.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic Raft startup mode selection based on persisted state presence; bootstrap is now triggered automatically when no local state exists.
    • Introduced publisher-mode synchronization for failover scenarios, enabling early P2P infrastructure readiness.
  • Documentation

    • Updated Raft configuration guidance to reflect automatic startup mode selection.
    • Clarified that the bootstrap flag is now a legacy compatibility setting.
    • Updated CLI flag namespaces for Raft and aggregator configurations.
  • Improvements

    • Enhanced failover state validation and recovery logic.
    • Improved leader election robustness and error handling.


coderabbitai bot commented Mar 31, 2026

📝 Walkthrough


This PR implements publisher-mode synchronization for early P2P infrastructure readiness during Raft-based failover scenarios. It refactors Raft leader election and node startup logic to determine mode automatically based on persisted state, introduces a new StartForPublishing method for sync services, upgrades sync service state management to use mutex-based concurrency, and updates E2E test infrastructure for P2P identity handling and process management.

Changes

Cohort / File(s) Summary
Raft Startup Mode & Bootstrap Logic
pkg/raft/node.go, pkg/raft/node_test.go, pkg/raft/election.go, pkg/raft/election_test.go
Refactored Raft node startup to automatically select mode (rejoin if persisted state exists, bootstrap from peers otherwise) by removing Bootstrap flag check. Enhanced follower verification with polling for non-zero raft state and improved error recovery logic. Added nil receiver test and updated election test expectations.
Publisher-Mode Synchronization
node/failover.go, pkg/sync/sync_service.go, pkg/sync/sync_service_test.go
Added shouldStartSyncInPublisherMode logic to start sync services without waiting for P2P readiness when raft leader and stores empty. Introduced new public StartForPublishing method and refactored startSyncer to return (bool, error) to distinguish first-start from retries. Added multi-peer publisher-mode test.
Sync State Management Refactoring
pkg/sync/syncer_status.go, pkg/sync/syncer_status_test.go
Replaced atomic Bool with mutex-protected started field. Added startOnce(startFn) for exclusive start execution and stopIfStarted(stopFn) for conditional stop. Includes comprehensive concurrency tests for race-free start/stop semantics.
Syncer Block Validation
block/internal/syncing/syncer.go, block/internal/syncing/syncer_test.go
Added internal helper trySyncNextBlockWithState(ctx, event, currentState) to accept explicit state parameter. Enhanced RecoverFromRaft raft-ahead case with retry logic on errInvalidState after bootstrap. Added test cases for bootstrap initialization and chain ID validation.
E2E Test Infrastructure
test/e2e/failover_e2e_test.go, test/e2e/sut_helper.go
Added per-node P2P identity generation with stable peer IDs, refactored node metadata to separate listen and peer addresses, improved raft leader detection tolerance for transient failures. Enhanced process cleanup logging and conditional log directory handling via EV_E2E_LOG_DIR.
Configuration & Documentation
pkg/config/config.go, pkg/config/config_test.go, docs/guides/raft_production.md, docs/learn/config.md
Updated RaftConfig.Bootstrap comment to reflect new semantics as compatibility flag. Added Raft CLI flag assertions in config tests. Updated production/config docs to clarify automatic startup-mode selection based on persisted state and updated flag namespaces.
Module Replace Directives
apps/evm/go.mod
Uncommented replace directives for github.com/evstack/ev-node pointing to local ../../ and ../../execution/evm paths.
Changelog
CHANGELOG.md
Moved PR #3222 entry from ### Changes to ### Added section (contains visible merge conflict markers).

Sequence Diagram(s)

sequenceDiagram
    participant failover as Failover Manager
    participant raftNode as Raft Node
    participant syncService as Sync Service
    participant store as Block Store
    
    Note over failover: Run() startup
    
    alt Publisher Mode Eligible
        failover->>raftNode: Check raft leader + config
        raftNode-->>failover: Is leader: true
        failover->>store: Get header/data heights
        store-->>failover: Both empty
        failover->>syncService: StartForPublishing()
        syncService->>syncService: prepareStart (no P2P wait)
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
        Note over syncService: Start ingesting blocks<br/>without P2P readiness
    else Normal Mode
        failover->>syncService: Start()
        syncService->>syncService: prepareStart
        syncService->>syncService: initFromP2PWithRetry
        Note over syncService: Wait for P2P<br/>peer discovery
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • chore: update config docs #3189: Both PRs modify docs/learn/config.md and make overlapping changes to Raft configuration documentation and CLI flag namespace updates.
  • chore: add stricter linting #3132: Both PRs modify block/internal/syncing/syncer.go with related changes to internal method signatures and context propagation in sync validation logic.
  • feat(sequencer): catchup from base #3057: Both PRs implement sequencer catch-up and failover functionality with matching code-level changes to DA height handling and syncer verification logic.

Suggested reviewers

  • julienrbrt
  • chatton
  • randygrok

Poem

🐰 A syncer hops with startup pride,
Publisher mode starts the ride,
Raft rejoin when state's not bare,
Mutex guards with careful care,
Early peers meet infra's dawn!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Title check: ⚠️ Warning. Title begins with a '[WIP]' prefix, which indicates work-in-progress status and makes it unsuitable for merge. The semantic commit format is partially followed ('fix:'), but the leading '[WIP]' violates conventional commit standards. Resolution: remove the '[WIP]' prefix so the title follows semantic commit format strictly, e.g. 'fix: Publisher-mode synchronization option for failover scenario'.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 23.81%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ❓ Inconclusive. The PR description is substantially incomplete. It briefly mentions the issue (E2E HA tests failing due to a race condition) but lacks an explanation of the solution approach, implementation details, and rationale for the changes made across multiple files. Resolution: expand the Overview section to explain the specific synchronization changes, why publisher-mode was introduced, and how it resolves the race condition.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 2, 2026, 10:07 AM


claude bot commented Mar 31, 2026

Claude finished @alpe's task in 3m 34s.


Code Review

  • Gather context and understand PR scope
  • Read key changed files
  • Review concurrency and sync logic
  • Review error handling and resilience
  • Review testing coverage
  • Post comprehensive review

Summary

This PR addresses a real startup deadlock: when a raft leader boots with an empty P2P sync store and no peer can yet serve height 1, the blocking initFromP2PWithRetry loop would prevent block production from ever starting. The fix introduces a "publisher mode" for sync services and tightens several related race conditions. The overall approach is sound. Below are observations ranging from bugs to nits.


pkg/sync/syncer_status.go — Core race fix ✅

The switch from atomic.Bool to sync.Mutex + startOnce is the correct fix. The previous design had a TOCTOU gap: isStarted() returned atomically, but the check-then-start sequence in the caller was not atomic. startOnce now holds the lock across the entire check-and-start pair, eliminating the race. The tests in syncer_status_test.go (especially concurrent_start_only_runs_once) directly verify this.

One minor note: if stopFn in stopIfStarted returns an error, started remains true (the started = false reset line is never reached on error). A subsequent Stop call would attempt stopFn again. This is probably intentional for retry semantics, but it could also lead to double-stop attempts if the caller doesn't check the error. Worth a comment explaining the intent.


pkg/sync/sync_service.go — Publisher mode ✅

StartForPublishing correctly skips initFromP2PWithRetry while still setting up the P2P exchange server and pubsub subscriber needed by WriteToStoreAndBroadcast. The comment is clear and the design is well-reasoned.

TestHeaderSyncServiceStartForPublishingWithPeers covers the happy path. Missing tests:

  • Publisher mode with zero peers (should be a no-op / trivially pass)
  • Concurrent calls to StartForPublishing and WriteToStoreAndBroadcast to verify storeInitialized is set correctly by the first produced block
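The difference between the two start paths can be sketched with stand-in steps; the step names follow the walkthrough above, everything else (the struct, the recording mechanism) is hypothetical:

```go
package main

import "fmt"

// syncService is a toy stand-in; the real service wires libp2p stores and
// pubsub, while here each step just records itself for illustration.
type syncService struct{ steps []string }

func (s *syncService) prepareStart()         { s.steps = append(s.steps, "prepareStart") }
func (s *syncService) setupP2P()             { s.steps = append(s.steps, "setupP2PInfrastructure") }
func (s *syncService) initFromP2PWithRetry() { s.steps = append(s.steps, "initFromP2PWithRetry") }
func (s *syncService) startSubscriber()      { s.steps = append(s.steps, "startSubscriber") }

// Start is the normal path: block until a trusted head is fetched from peers.
func (s *syncService) Start() {
	s.prepareStart()
	s.setupP2P()
	s.initFromP2PWithRetry() // blocks on a fresh leader with no serving peers
	s.startSubscriber()
}

// StartForPublishing skips the blocking P2P init; the first locally produced
// block (via WriteToStoreAndBroadcast) initializes the store instead.
func (s *syncService) StartForPublishing() {
	s.prepareStart()
	s.setupP2P()
	s.startSubscriber()
}

func main() {
	normal, publisher := &syncService{}, &syncService{}
	normal.Start()
	publisher.StartForPublishing()
	fmt.Println(normal.steps)
	fmt.Println(publisher.steps)
}
```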

node/failover.go — Publisher mode detection ⚠️

storeHeight fetch with no effect on decision (failover.go:188-210):

storeHeight, err := f.store.Height(ctx)
if err != nil {
    f.logger.Warn()...
    return false
}
// storeHeight is then ONLY logged, not used in the condition
if headerHeight > 0 || dataHeight > 0 {
    return false
}

storeHeight triggers an I/O error path and a log field, but is never part of the boolean decision. If the intent is "check P2P store emptiness only", the main-store read adds cost and an error branch that silently falls back to blocking sync. Consider either removing the fetch entirely, or actually using storeHeight in the condition (e.g., to skip publisher mode if the main store is also empty, meaning a truly cold first-boot).
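One hedged possibility for folding storeHeight into the decision, sketched in Go; the function name mirrors the PR, but the signature and the "all three stores empty" rule are assumptions, not the PR's actual code:

```go
package main

import "fmt"

// shouldStartSyncInPublisherMode is a hypothetical sketch: publisher mode only
// on a truly fresh raft leader, with the main store and both sync stores empty,
// so storeHeight actually participates in the decision instead of only being logged.
func shouldStartSyncInPublisherMode(isLeader bool, storeHeight, headerHeight, dataHeight uint64) bool {
	if !isLeader {
		return false
	}
	return storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func main() {
	fmt.Println(shouldStartSyncInPublisherMode(true, 0, 0, 0))  // fresh leader
	fmt.Println(shouldStartSyncInPublisherMode(true, 5, 0, 0))  // main store has blocks
	fmt.Println(shouldStartSyncInPublisherMode(false, 0, 0, 0)) // follower
}
```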

TOCTOU leadership check: f.raftNode.IsLeader() is evaluated once at Run startup. Leadership can be lost between that check and StartForPublishing actually running. The consequence is that the node starts in publisher mode but then loses leadership; block production will be stopped by the raft state machine, so this is a benign race. A comment noting this would help future readers.


pkg/raft/election.go — startFollower refactor + verifyState hardening ⚠️

startFollower DRY refactor — Good. Eliminates the code duplication between the leaderCh branch and the ticker branch.

Stacked timeouts: startFollower calls waitForMsgsLanded(SendTimeout) then verifyState, which itself may spin for another full SendTimeout. Total stall before a follower is active can reach 2× SendTimeout. For a default 500ms SendTimeout this is probably fine, but should be documented.

Error handling change in verifyState: Previously, if IsSynced returned an error, it was propagated directly. Now the code attempts Recover and swallows the original IsSynced error on success. This is more resilient but masks the original failure. Consider logging the IsSynced error even when recovery succeeds, so operators can observe these events in production.

Duplicate log line (pre-existing, surfaced by review):

// line ~130
d.logger.Info().Msg("became leader, stopping follower operations")
// line ~144
d.logger.Info().Msg("became leader, stopping follower operations")

The same message is logged twice in the becameLeader && !isCurrentlyLeader branch.


pkg/raft/node.go — Bootstrap requirement removed ⚠️

-   if !n.config.Bootstrap {
-       return fmt.Errorf("raft cluster requires bootstrap mode")
-   }

This is a meaningful behavior change: previously nodes without Bootstrap: true would fail fast at startup. Now they silently pass through to the existing-state check. This is necessary for the failover scenario (a restarted leader with existing Raft state), but it removes an important guard for misconfigured nodes that should have Bootstrap: true but don't. The new comment "raft node started with existing local state" is only logged when there IS existing state; a node with Bootstrap: false and no state will now fall through to BootstrapCluster and potentially start a single-node cluster unintentionally. Consider adding a log warning when Bootstrap: false but no existing state is found, to aid debugging.
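The suggested warning could slot into the mode selection roughly like this; selectStartupMode and its string results are invented for illustration, only the branch structure follows the description above:

```go
package main

import "fmt"

// selectStartupMode is a hypothetical sketch of startup after the Bootstrap
// guard was removed, with a warning on the surprising branch the review flags.
func selectStartupMode(hasExistingState, bootstrapFlag bool, logWarn func(string)) string {
	if hasExistingState {
		return "rejoin" // raft node started with existing local state
	}
	if !bootstrapFlag {
		// Previously this failed fast; now it silently bootstraps. Warn so a
		// misconfigured node starting a single-node cluster is debuggable.
		logWarn("no existing raft state and bootstrap=false; bootstrapping anyway")
	}
	return "bootstrap"
}

func main() {
	warned := 0
	warn := func(string) { warned++ }
	fmt.Println(selectStartupMode(true, false, warn))  // rejoin, no warning
	fmt.Println(selectStartupMode(false, true, warn))  // bootstrap, no warning
	fmt.Println(selectStartupMode(false, false, warn)) // bootstrap with warning
	fmt.Println("warnings:", warned)
}
```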


block/internal/syncing/syncer.go — RecoverFromRaft retry ⚠️

trySyncNextBlockWithState refactor — Good split. Passing currentState explicitly avoids re-reading the atomic inside the block-apply path, which eliminates a subtle race where another goroutine could update lastState between the check and the apply.

Retry logic concern (syncer.go:1222-1239):

err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
    s.SetLastState(currentState)                               // sets s.lastState
    err = s.trySyncNextBlockWithState(ctx, event, currentState) // same currentState
}

trySyncNextBlockWithState receives currentState directly; s.SetLastState changes s.lastState (the atomic) but the retry call still receives the same currentState value. For this retry to have any different outcome than the first attempt, trySyncNextBlockWithState must read s.lastState internally somewhere, in addition to using the passed currentState. If that is not the case, the retry always produces the same result and the if block is dead code. Please verify (and add a comment explaining) why SetLastState + retry produces a different outcome.
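For the retry to be meaningful, the second attempt has to observe refreshed state rather than the stale snapshot. A deliberately contrived Go sketch of that pattern, where the types, the validation rule, and recoverFromRaft are all invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidState = errors.New("invalid state")

// syncer is a toy stand-in; lastState plays the role of the atomic s.lastState.
type syncer struct{ lastState int }

// trySync fails for states below 2, imitating a validation check.
func (s *syncer) trySync(state int) error {
	if state < 2 {
		return errInvalidState
	}
	return nil
}

// recoverFromRaft retries after refreshing lastState; crucially, the second
// attempt receives the refreshed value, not the original stale snapshot.
func (s *syncer) recoverFromRaft(stale int) error {
	err := s.trySync(stale)
	if errors.Is(err, errInvalidState) {
		s.lastState = stale + 2       // SetLastState-style update
		return s.trySync(s.lastState) // retry with the REFRESHED state
	}
	return err
}

func main() {
	s := &syncer{}
	fmt.Println(s.recoverFromRaft(0)) // succeeds only because the retry sees new state
}
```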

InitialHeight in genesis defaults — Good catch; without it, height validation during recovery could fail for genesis-height blocks.


Testing ✅ / ⚠️

  • SyncerStatus concurrent behavior: ✅ concurrent_start_only_runs_once is exactly the right test
  • StartForPublishing with peers: ✅ TestHeaderSyncServiceStartForPublishingWithPeers
  • RecoverFromRaft bootstrap path: ✅ Two new tests covering happy path and strict validation
  • shouldStartSyncInPublisherMode: ⚠️ No unit test; only covered implicitly by E2E
  • verifyState timeout/recovery branches: ⚠️ New wait loop and recovery path are not unit tested
  • Publisher mode with no peers: ⚠️ Not tested

The E2E test changes (test/e2e/failover_e2e_test.go, sut_helper.go) are the primary regression guard, which is appropriate for this class of bug.


Nits

  • pkg/config/config.go: The bootstrap flag comment update ("Bootstrap a new static Raft cluster during initial bring-up") is clearer. ✅
  • docs/: The production Raft guide updates look accurate. ✅
  • apps/evm/go.mod / go.sum: Dependency bumps — presumably required by the changes. ✅


codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 52.94118% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.64%. Comparing base (022b565) to head (74641e0).

Files with missing lines Patch % Lines
node/failover.go 31.25% 16 Missing and 6 partials ⚠️
pkg/sync/sync_service.go 47.36% 13 Missing and 7 partials ⚠️
pkg/raft/election.go 53.12% 10 Missing and 5 partials ⚠️
block/internal/syncing/syncer.go 63.63% 3 Missing and 1 partial ⚠️
pkg/sync/syncer_status.go 90.47% 1 Missing and 1 partial ⚠️
pkg/raft/node.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
+ Coverage   61.43%   61.64%   +0.20%     
==========================================
  Files         120      120              
  Lines       12504    12594      +90     
==========================================
+ Hits         7682     7763      +81     
+ Misses       3960     3957       -3     
- Partials      862      874      +12     
Flag Coverage Δ
combined 61.64% <52.94%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown.



github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://evstack.github.io/docs-preview/pr-3222/

Built to branch main at 2026-04-02 10:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

alpe force-pushed the alex/sync_race branch 2 times, most recently from ec0ffc4 to 59dc917 on April 2, 2026 at 08:51
alpe force-pushed the alex/sync_race branch from 59dc917 to 74641e0 on April 2, 2026 at 10:06
alpe changed the title from "fix: Publisher-mode synchronization option for failover scenario" to "[WIP]fix: Publisher-mode synchronization option for failover scenario" on Apr 2, 2026
alpe marked this pull request as ready for review on April 2, 2026 at 10:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/e2e/sut_helper.go (1)

189-195: ⚠️ Potential issue | 🟡 Minor

Close the per-process logfile handle.

os.Create on Line 192 leaves an open fd that is never closed. E2E suites spawn a lot of processes, so this can leak descriptors and delay flushing the captured logs.

Suggested change
 		logfile, err := os.Create(logfileName)
 		require.NoError(s.t, err)
+		s.t.Cleanup(func() { _ = logfile.Close() })
 		errReader = io.NopCloser(io.TeeReader(errReader, logfile))
 		outReader = io.NopCloser(io.TeeReader(outReader, logfile))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/sut_helper.go` around lines 189 - 195, The per-process logfile
created by os.Create (logfile) in the block that checks s.processLogDir() is
never closed; wrap the logfile so it is closed when the returned readers are
closed by replacing the io.NopCloser(io.TeeReader(..., logfile)) usage with a
ReadCloser that closes the underlying logfile on Close, or otherwise ensure
logfile.Close() is called (e.g., return a combined io.ReadCloser that delegates
Read to the TeeReader and Close to logfile.Close) for both errReader and
outReader so the file descriptor is released and buffered data flushed.
block/internal/syncing/syncer.go (1)

1204-1213: ⚠️ Potential issue | 🔴 Critical

Bootstrap the same genesis state as initializeState().

This fallback only sets ChainID, InitialHeight, and LastBlockHeight, but Line 808 later executes the recovered block against currentState.AppHash. On a fresh node that means raft recovery runs against a different execution baseline than normal startup, which calls exec.InitChain and seeds AppHash, DAHeight, and LastBlockTime. Please reuse that bootstrap path here before applying the raft block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1204 - 1213, The fallback
after s.store.GetState should bootstrap the same full genesis state as
initializeState() instead of only setting ChainID/InitialHeight/LastBlockHeight;
update the recovery path that sets currentState and stateBootstrapped to call
the same initialization logic (e.g., run the InitChain/bootstrap sequence used
in initializeState) so that currentState.AppHash, DAHeight, LastBlockTime (and
any other genesis-seeded fields) are populated before the raft/recovered block
is executed; ensure you invoke the same executor/init routine (the InitChain or
genesis bootstrap used by initializeState) rather than manually setting only
those three fields.
pkg/sync/sync_service.go (1)

190-205: ⚠️ Potential issue | 🟠 Major

Unwind partially started components on startup errors.

By the time Line 233 returns from setupP2PInfrastructure, the store, exchange server, and exchange can already be running. If newSyncer fails here — or startSubscriber fails in the new publisher-mode path — these methods return without tearing those pieces down, so a failed start leaks live components into the process.

As per coding guidelines, "Be mindful of goroutine leaks in Go code".

Also applies to: 219-225, 231-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service.go` around lines 190 - 205, The startup may leave
store/exchange/exchange-server running if later steps fail (e.g. newSyncer or
startSubscriber); update the startup sequence (functions prepareStart /
setupP2PInfrastructure / initFromP2PWithRetry) to register and invoke cleanup on
error: after setupP2PInfrastructure returns and before calling newSyncer or
startSubscriber, capture the created components (store, exchangeServer, exchange
or any returned tearDown/Close funcs) and ensure they are stopped when a
subsequent error occurs (use a deferred cleanup or explicit teardown call on
error paths in the block around prepareStart -> initFromP2PWithRetry ->
newSyncer/startSubscriber). Reference symbols to change: prepareStart,
setupP2PInfrastructure, initFromP2PWithRetry, newSyncer, and startSubscriber —
ensure each path that returns an error calls the appropriate stop/Close/Shutdown
methods for the started components to avoid goroutine/resource leaks.
🧹 Nitpick comments (1)
pkg/sync/sync_service_test.go (1)

63-97: Assert remote peer delivery, not just local initialization.

This test pays the cost of bringing up client2, but the only postcondition is svc.storeInitialized. A regression where StartForPublishing initializes locally but never broadcasts to peers would still pass. Please add an assertion on peer 2's view of the header so the new publisher-mode path is actually covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service_test.go` around lines 63 - 97, The test currently only
asserts svc.storeInitialized after broadcasting; add an assertion that the
remote peer (client2) actually receives the broadcasted header: before calling
svc.WriteToStoreAndBroadcast, register or subscribe a handler on client2 to
capture incoming P2PSignedHeader messages (using client2's subscribe/handler
API), then after WriteToStoreAndBroadcast use require.Eventually to poll/assert
that the handler received a header equal to signedHeader (or matching
height/DataHash/AppHash). Reference client1/client2, NewHeaderSyncService,
StartForPublishing, WriteToStoreAndBroadcast and svc.storeInitialized when
locating where to add the subscription and the eventual assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: Remove the repo-local replace directives from the module
manifest (the replace block that points github.com/evstack/ev-node and
github.com/evstack/ev-node/execution/evm to relative ../../ paths) so the
published module no longer contains repository-specific local wiring; keep the
go.mod clean for downstream consumers and, if local development linking is
needed, use a go.work file or temporary local replace only in your working copy
before publishing.

In `@block/internal/syncing/syncer.go`:
- Around line 1230-1235: The retry uses the identical currentState so
errInvalidState from AssertValidForNextState will always recur; before retrying
the call to trySyncNextBlockWithState, refresh the snapshot/state (e.g. read
s.GetLastState() or obtain the latest snapshot used by AssertValidForNextState)
and pass that refreshed state into trySyncNextBlockWithState instead of reusing
the original currentState, or recompute/validate the next state after calling
s.SetLastState(currentState) so the second call sees the updated global state.

In `@CHANGELOG.md`:
- Around line 12-21: The changelog contains unresolved git conflict markers
(<<<<<<<, =======, >>>>>>>) — remove those markers and produce a single resolved
section (e.g., keep "### Added") combining both bullet entries into the final
list so both PRs stay: the publisher-mode synchronization line [`#3222`] and the
"Improve execution/evm check..." line [`#3221`]; ensure only one header remains
and that the PR links and bullets are intact and properly formatted.

In `@docs/guides/raft_production.md`:
- Around line 93-100: Update the example to use the canonical aggregator flag
and drop the legacy bootstrap flag: replace the --rollkit.node.aggregator=true
occurrence with --evnode.node.aggregator=true, and remove the
--evnode.raft.bootstrap=true line (only keep raft-related flags like
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, and --evnode.raft.peers). Ensure the rest of the example
(e.g., --rollkit.p2p.listen_address) stays unchanged unless also intended to be
canonicalized.

In `@node/failover.go`:
- Around line 193-209: The gating logic ignores storeHeight when deciding to
start in publisher mode; change the condition that currently checks headerHeight
and dataHeight to include storeHeight so publishing only starts when the main
block store and both sync stores are empty. In other words, adjust the boolean
check around headerSyncService.Store().Height() and
dataSyncService.Store().Height() (the if that returns false) to also consider
the local storeHeight variable so StartForPublishing is only taken when
storeHeight == 0 && headerHeight == 0 && dataHeight == 0.

In `@pkg/raft/election.go`:
- Around line 93-96: The wait here uses d.node.waitForMsgsLanded which
internally uses context.Background(), so follower startup isn’t cancellable;
update this call to use the Run cancellation context (e.g., pass the existing
ctx) and add a ctx-aware variant on the node (e.g., waitForMsgsLandedCtx(ctx,
timeout) or change waitForMsgsLanded to accept a context) so the call in
election.go uses that context and honors cancellation/role changes; modify the
node implementation (pkg/raft/node.go – waitForMsgsLanded) to accept and
propagate the provided context instead of creating context.Background().

---

Outside diff comments:
In `@block/internal/syncing/syncer.go`:
- Around line 1204-1213: The fallback after s.store.GetState should bootstrap
the same full genesis state as initializeState() instead of only setting
ChainID/InitialHeight/LastBlockHeight; update the recovery path that sets
currentState and stateBootstrapped to call the same initialization logic (e.g.,
run the InitChain/bootstrap sequence used in initializeState) so that
currentState.AppHash, DAHeight, LastBlockTime (and any other genesis-seeded
fields) are populated before the raft/recovered block is executed; ensure you
invoke the same executor/init routine (the InitChain or genesis bootstrap used
by initializeState) rather than manually setting only those three fields.

In `@pkg/sync/sync_service.go`:
- Around line 190-205: The startup may leave store/exchange/exchange-server
running if later steps fail (e.g. newSyncer or startSubscriber); update the
startup sequence (functions prepareStart / setupP2PInfrastructure /
initFromP2PWithRetry) to register and invoke cleanup on error: after
setupP2PInfrastructure returns and before calling newSyncer or startSubscriber,
capture the created components (store, exchangeServer, exchange or any returned
tearDown/Close funcs) and ensure they are stopped when a subsequent error occurs
(use a deferred cleanup or explicit teardown call on error paths in the block
around prepareStart -> initFromP2PWithRetry -> newSyncer/startSubscriber).
Reference symbols to change: prepareStart, setupP2PInfrastructure,
initFromP2PWithRetry, newSyncer, and startSubscriber — ensure each path that
returns an error calls the appropriate stop/Close/Shutdown methods for the
started components to avoid goroutine/resource leaks.

In `@test/e2e/sut_helper.go`:
- Around line 189-195: The per-process logfile created by os.Create (logfile) in
the block that checks s.processLogDir() is never closed; wrap the logfile so it
is closed when the returned readers are closed by replacing the
io.NopCloser(io.TeeReader(..., logfile)) usage with a ReadCloser that closes the
underlying logfile on Close, or otherwise ensure logfile.Close() is called
(e.g., return a combined io.ReadCloser that delegates Read to the TeeReader and
Close to logfile.Close) for both errReader and outReader so the file descriptor
is released and buffered data flushed.

---

Nitpick comments:
In `@pkg/sync/sync_service_test.go`:
- Around line 63-97: The test currently only asserts svc.storeInitialized after
broadcasting; add an assertion that the remote peer (client2) actually receives
the broadcasted header: before calling svc.WriteToStoreAndBroadcast, register or
subscribe a handler on client2 to capture incoming P2PSignedHeader messages
(using client2's subscribe/handler API), then after WriteToStoreAndBroadcast use
require.Eventually to poll/assert that the handler received a header equal to
signedHeader (or matching height/DataHash/AppHash). Reference client1/client2,
NewHeaderSyncService, StartForPublishing, WriteToStoreAndBroadcast and
svc.storeInitialized when locating where to add the subscription and the
eventual assertion.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b5476c3-9ec8-4d67-a4cb-fcde2300edbd

📥 Commits

Reviewing files that changed from the base of the PR and between 022b565 and 74641e0.

⛔ Files ignored due to path filters (1)
  • apps/evm/go.sum is excluded by !**/*.sum
📒 Files selected for processing (19)
  • CHANGELOG.md
  • apps/evm/go.mod
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • docs/guides/raft_production.md
  • docs/learn/config.md
  • node/failover.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
  • pkg/sync/syncer_status.go
  • pkg/sync/syncer_status_test.go
  • test/e2e/failover_e2e_test.go
  • test/e2e/sut_helper.go

Comment on lines +5 to +8
replace (
	github.com/evstack/ev-node => ../../
	github.com/evstack/ev-node/execution/evm => ../../execution/evm
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check for go.work file (monorepo pattern)
fd -H 'go.work' .

# Explore repository structure
ls -la apps/ 2>/dev/null || echo "apps directory structure"
find . -name 'go.mod' -type f | head -20

Repository: evstack/ev-node

Length of output: 631


🏁 Script executed:

# Check the complete go.mod for apps/evm
cat -n apps/evm/go.mod

Repository: evstack/ev-node

Length of output: 14568


🏁 Script executed:

# Look for CI/CD configuration to understand build patterns
find . -name '.github' -type d
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' | head -10

Repository: evstack/ev-node

Length of output: 434


🏁 Script executed:

# Check if there are any docs or comments about module publishing
grep -r "apps/evm" --include='*.md' --include='*.txt' . 2>/dev/null | head -10

Repository: evstack/ev-node

Length of output: 579


Remove repo-local replace directives from the published module manifest.

These relative paths only resolve within this repository layout and break downstream builds when github.com/evstack/ev-node/apps/evm is used as a dependency. The module is published with independent versioning (per RELEASE.md); local wiring should use the existing go.work pattern or be removed before merge.

Suggested adjustment
-replace (
-	github.com/evstack/ev-node => ../../
-	github.com/evstack/ev-node/execution/evm => ../../execution/evm
-)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/evm/go.mod` around lines 5 - 8, Remove the repo-local replace directives
from the module manifest (the replace block that points
github.com/evstack/ev-node and github.com/evstack/ev-node/execution/evm to
relative ../../ paths) so the published module no longer contains
repository-specific local wiring; keep the go.mod clean for downstream consumers
and, if local development linking is needed, use a go.work file or temporary
local replace only in your working copy before publishing.
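The go.work alternative the comment mentions would look roughly like this at the repository root; the go directive version and the exact module layout are assumptions based on the directory listing in the scripts above:

```
go 1.23

use (
    .
    ./execution/evm
    ./apps/evm
)
```

With this in place, local builds resolve the sibling modules without any replace directives in the published apps/evm/go.mod, and go.work is typically kept out of version control or ignored by release tooling.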

Comment on lines +1230 to +1235
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
// Keep strict validation semantics; this retry only guards startup ordering races.
s.SetLastState(currentState)
err = s.trySyncNextBlockWithState(ctx, event, currentState)

⚠️ Potential issue | 🟠 Major

The retry cannot clear errInvalidState.

Lines 719-727 validate against the currentState argument, and Line 1235 passes the same snapshot again. s.SetLastState(currentState) only updates the global pointer, so an errInvalidState from AssertValidForNextState will reproduce deterministically on the retry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1230 - 1235, The retry uses
the identical currentState so errInvalidState from AssertValidForNextState will
always recur; before retrying the call to trySyncNextBlockWithState, refresh the
snapshot/state (e.g. read s.GetLastState() or obtain the latest snapshot used by
AssertValidForNextState) and pass that refreshed state into
trySyncNextBlockWithState instead of reusing the original currentState, or
recompute/validate the next state after calling s.SetLastState(currentState) so
the second call sees the updated global state.

Comment on lines +12 to +21
<<<<<<< HEAD
### Changes

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
* Improve execution/evm check for stored meta not stale [#3221](https://github.com/evstack/ev-node/pull/3221)
=======
### Added

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
>>>>>>> a2b0ff76 (Changelog)

⚠️ Potential issue | 🔴 Critical

Resolve merge conflict markers in changelog before merge.

<<<<<<<, =======, and >>>>>>> are still committed, which breaks the changelog and release notes generation.

Proposed conflict resolution
-<<<<<<< HEAD
-### Changes
-
-* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
-* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
-=======
 ### Added
 
 * Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
->>>>>>> a2b0ff76 (Changelog)
+
+### Changes
+
+* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 12 - 21, The changelog contains unresolved git
conflict markers (<<<<<<<, =======, >>>>>>>) — remove those markers and produce
a single resolved section (e.g., keep "### Added") combining both bullet entries
into the final list so both PRs stay: the publisher-mode synchronization line
[`#3222`] and the "Improve execution/evm check..." line [`#3221`]; ensure only one
header remains and that the PR links and bullets are intact and properly
formatted.

Comment on lines +93 to +100
--rollkit.node.aggregator=true \
--evnode.raft.enable=true \
--evnode.raft.node_id="node-1" \
--evnode.raft.raft_addr="0.0.0.0:5001" \
--evnode.raft.raft_dir="/var/lib/ev-node/raft" \
--evnode.raft.bootstrap=true \
--evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
--rollkit.p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \

⚠️ Potential issue | 🟡 Minor

Make the primary example match the new semantics.

The guide now says --evnode.raft.bootstrap is compatibility-only and optional, but the example still sets it and uses the rollkit. alias for the aggregator flag. That makes the happy path look more configuration-sensitive than it is. Prefer the canonical --evnode.node.aggregator form here and omit --evnode.raft.bootstrap unless this example is specifically documenting legacy compatibility.

Suggested change
-  --rollkit.node.aggregator=true \
+  --evnode.node.aggregator=true \
   --evnode.raft.enable=true \
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.bootstrap=true \
   --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
As per coding guidelines, `docs/guides/**/*.md`: Structure guides with clear step-by-step instructions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 93 - 100, Update the example to
use the canonical aggregator flag and drop the legacy bootstrap flag: replace
the --rollkit.node.aggregator=true occurrence with
--evnode.node.aggregator=true, and remove the --evnode.raft.bootstrap=true line
(only keep raft-related flags like --evnode.raft.enable, --evnode.raft.node_id,
--evnode.raft.raft_addr, --evnode.raft.raft_dir, and --evnode.raft.peers).
Ensure the rest of the example (e.g., --rollkit.p2p.listen_address) stays
unchanged unless also intended to be canonicalized.

Comment on lines +193 to +209
storeHeight, err := f.store.Height(ctx)
if err != nil {
f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
return false
}
headerHeight := f.headerSyncService.Store().Height()
dataHeight := f.dataSyncService.Store().Height()
if headerHeight > 0 || dataHeight > 0 {
return false
}

f.logger.Info().
Uint64("store_height", storeHeight).
Uint64("header_height", headerHeight).
Uint64("data_height", dataHeight).
Msg("raft-enabled aggregator with empty sync stores: starting sync services in publisher mode")
return true

⚠️ Potential issue | 🟠 Major

Gate publisher mode on an actually empty local store.

Line 193 reads storeHeight, but that value never participates in the decision. If the main block store already has blocks while the sync stores are empty, this still takes the fresh-start StartForPublishing path and skips the blocking sync startup, even outside the failover case this PR targets.

Suggested change
 	storeHeight, err := f.store.Height(ctx)
 	if err != nil {
 		f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
 		return false
 	}
 	headerHeight := f.headerSyncService.Store().Height()
 	dataHeight := f.dataSyncService.Store().Height()
-	if headerHeight > 0 || dataHeight > 0 {
+	if storeHeight > 0 || headerHeight > 0 || dataHeight > 0 {
 		return false
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node/failover.go` around lines 193 - 209, The gating logic ignores
storeHeight when deciding to start in publisher mode; change the condition that
currently checks headerHeight and dataHeight to include storeHeight so
publishing only starts when the main block store and both sync stores are empty.
In other words, adjust the boolean check around
headerSyncService.Store().Height() and dataSyncService.Store().Height() (the if
that returns false) to also consider the local storeHeight variable so
StartForPublishing is only taken when storeHeight == 0 && headerHeight == 0 &&
dataHeight == 0.
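The suggested gate reduces to a pure predicate over the three heights; the function name and signature below are illustrative, not the repo's API:

```go
package main

import "fmt"

// shouldStartPublisherMode sketches the fixed gate: publisher mode is taken
// only when the main block store AND both sync stores are empty.
func shouldStartPublisherMode(storeHeight, headerHeight, dataHeight uint64) bool {
	return storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func main() {
	fmt.Println(shouldStartPublisherMode(0, 0, 0)) // true: genuinely fresh start
	fmt.Println(shouldStartPublisherMode(3, 0, 0)) // false: main store already has blocks
}
```

The second case is exactly the scenario the review flags: under the original `headerHeight > 0 || dataHeight > 0` check it would have started in publisher mode despite a non-empty block store.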

Comment on lines +93 to +96
if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
// this wait can legitimately time out
d.logger.Debug().Err(err).Msg("timed out waiting for raft messages before follower verification; continuing")
}

⚠️ Potential issue | 🟡 Minor

Keep follower startup cancellable.

This new wait path ignores Run cancellation because pkg/raft/node.go:145-175 implements waitForMsgsLanded with its own context.Background() timeout. On shutdown or role changes, follower startup can now sit here for up to SendTimeout. A ctx-aware variant would keep this path responsive. As per coding guidelines, "Use context.Context for cancellation in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/raft/election.go` around lines 93 - 96, The wait here uses
d.node.waitForMsgsLanded which internally uses context.Background(), so follower
startup isn’t cancellable; update this call to use the Run cancellation context
(e.g., pass the existing ctx) and add a ctx-aware variant on the node (e.g.,
waitForMsgsLandedCtx(ctx, timeout) or change waitForMsgsLanded to accept a
context) so the call in election.go uses that context and honors
cancellation/role changes; modify the node implementation (pkg/raft/node.go –
waitForMsgsLanded) to accept and propagate the provided context instead of
creating context.Background().
