
[WIP]fix: Publisher-mode synchronization option for failover scenario#3222

Open
alpe wants to merge 2 commits into main from alex/sync_race

Conversation


@alpe alpe commented Mar 31, 2026

Overview

E2E HA tests sometimes fail due to a race in which the leader blocks waiting for P2P sync to complete on a fresh start.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic Raft startup mode selection based on persisted state presence; bootstrap is now triggered automatically when no local state exists.
    • Introduced publisher-mode synchronization for failover scenarios, enabling early P2P infrastructure readiness.
  • Documentation

    • Updated Raft configuration guidance to reflect automatic startup mode selection.
    • Clarified that the bootstrap flag is now a legacy compatibility setting.
    • Updated CLI flag namespaces for Raft and aggregator configurations.
  • Improvements

    • Enhanced failover state validation and recovery logic.
    • Improved leader election robustness and error handling.


coderabbitai bot commented Mar 31, 2026

📝 Walkthrough


This PR implements publisher-mode synchronization for early P2P infrastructure readiness during Raft-based failover scenarios. It refactors Raft leader election and node startup logic to determine mode automatically based on persisted state, introduces a new StartForPublishing method for sync services, upgrades sync service state management to use mutex-based concurrency, and updates E2E test infrastructure for P2P identity handling and process management.

Changes

Cohort / File(s) Summary
Raft Startup Mode & Bootstrap Logic
pkg/raft/node.go, pkg/raft/node_test.go, pkg/raft/election.go, pkg/raft/election_test.go
Refactored Raft node startup to automatically select mode (rejoin if persisted state exists, bootstrap from peers otherwise) by removing Bootstrap flag check. Enhanced follower verification with polling for non-zero raft state and improved error recovery logic. Added nil receiver test and updated election test expectations.
Publisher-Mode Synchronization
node/failover.go, pkg/sync/sync_service.go, pkg/sync/sync_service_test.go
Added shouldStartSyncInPublisherMode logic to start sync services without waiting for P2P readiness when raft leader and stores empty. Introduced new public StartForPublishing method and refactored startSyncer to return (bool, error) to distinguish first-start from retries. Added multi-peer publisher-mode test.
Sync State Management Refactoring
pkg/sync/syncer_status.go, pkg/sync/syncer_status_test.go
Replaced atomic Bool with mutex-protected started field. Added startOnce(startFn) for exclusive start execution and stopIfStarted(stopFn) for conditional stop. Includes comprehensive concurrency tests for race-free start/stop semantics.
Syncer Block Validation
block/internal/syncing/syncer.go, block/internal/syncing/syncer_test.go
Added internal helper trySyncNextBlockWithState(ctx, event, currentState) to accept explicit state parameter. Enhanced RecoverFromRaft raft-ahead case with retry logic on errInvalidState after bootstrap. Added test cases for bootstrap initialization and chain ID validation.
E2E Test Infrastructure
test/e2e/failover_e2e_test.go, test/e2e/sut_helper.go
Added per-node P2P identity generation with stable peer IDs, refactored node metadata to separate listen and peer addresses, improved raft leader detection tolerance for transient failures. Enhanced process cleanup logging and conditional log directory handling via EV_E2E_LOG_DIR.
Configuration & Documentation
pkg/config/config.go, pkg/config/config_test.go, docs/guides/raft_production.md, docs/learn/config.md
Updated RaftConfig.Bootstrap comment to reflect new semantics as compatibility flag. Added Raft CLI flag assertions in config tests. Updated production/config docs to clarify automatic startup-mode selection based on persisted state and updated flag namespaces.
Module Replace Directives
apps/evm/go.mod
Uncommented replace directives for github.com/evstack/ev-node pointing to local ../../ and ../../execution/evm paths.
Changelog
CHANGELOG.md
Moved PR #3222 entry from ### Changes to ### Added section (contains visible merge conflict markers).

Sequence Diagram(s)

sequenceDiagram
    participant failover as Failover Manager
    participant raftNode as Raft Node
    participant syncService as Sync Service
    participant store as Block Store
    
    Note over failover: Run() startup
    
    alt Publisher Mode Eligible
        failover->>raftNode: Check raft leader + config
        raftNode-->>failover: Is leader: true
        failover->>store: Get header/data heights
        store-->>failover: Both empty
        failover->>syncService: StartForPublishing()
        syncService->>syncService: prepareStart (no P2P wait)
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
        Note over syncService: Start ingesting blocks<br/>without P2P readiness
    else Normal Mode
        failover->>syncService: Start()
        syncService->>syncService: prepareStart
        syncService->>syncService: initFromP2PWithRetry
        Note over syncService: Wait for P2P<br/>peer discovery
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • chore: update config docs #3189: Both PRs modify docs/learn/config.md and make overlapping changes to Raft configuration documentation and CLI flag namespace updates.
  • chore: add stricter linting #3132: Both PRs modify block/internal/syncing/syncer.go with related changes to internal method signatures and context propagation in sync validation logic.
  • feat(sequencer): catchup from base #3057: Both PRs implement sequencer catch-up and failover functionality with matching code-level changes to DA height handling and syncer verification logic.

Suggested reviewers

  • julienrbrt
  • chatton
  • randygrok

Poem

🐰 A syncer hops with startup pride,
Publisher mode starts the ride,
Raft rejoin when state's not bare,
Mutex guards with careful care,
Early peers meet infra's dawn!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Title check: ⚠️ Warning. Title begins with a '[WIP]' prefix, which indicates work-in-progress status and makes it unsuitable for merge. The semantic commit format is partially followed ('fix:'), but the leading '[WIP]' violates conventional commit standards. Resolution: remove the '[WIP]' prefix so the title follows semantic commit format strictly, e.g. 'fix: Publisher-mode synchronization option for failover scenario'.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 23.81%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ❓ Inconclusive. The PR description is substantially incomplete. It briefly mentions the issue (E2E HA tests failing due to a race condition) but lacks an explanation of the solution approach, implementation details, and rationale for the changes made across multiple files. Resolution: expand the Overview section to explain the specific synchronization changes, why publisher-mode was introduced, and how it resolves the race condition.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 2, 2026, 10:07 AM


claude bot commented Mar 31, 2026

Claude finished @alpe's task in 3m 34s.


Code Review

  • Gather context and understand PR scope
  • Read key changed files
  • Review concurrency and sync logic
  • Review error handling and resilience
  • Review testing coverage
  • Post comprehensive review

Summary

This PR addresses a real startup deadlock: when a raft leader boots with an empty P2P sync store and no peer can yet serve height 1, the blocking initFromP2PWithRetry loop would prevent block production from ever starting. The fix introduces a "publisher mode" for sync services and tightens several related race conditions. The overall approach is sound. Below are observations ranging from bugs to nits.


pkg/sync/syncer_status.go — Core race fix ✅

The switch from atomic.Bool to sync.Mutex + startOnce is the correct fix. The previous design had a TOCTOU gap: isStarted() returned atomically, but the check-then-start sequence in the caller was not atomic. startOnce now holds the lock across the entire check-and-start pair, eliminating the race. The tests in syncer_status_test.go (especially concurrent_start_only_runs_once) directly verify this.

One minor note: if stopFn in stopIfStarted returns an error, started remains true (the started = false reset line is never reached on error). A subsequent Stop call would attempt stopFn again. This is probably intentional for retry semantics, but it could also lead to double-stop attempts if the caller doesn't check the error. Worth a comment explaining the intent.


pkg/sync/sync_service.go — Publisher mode ✅

StartForPublishing correctly skips initFromP2PWithRetry while still setting up the P2P exchange server and pubsub subscriber needed by WriteToStoreAndBroadcast. The comment is clear and the design is well-reasoned.

TestHeaderSyncServiceStartForPublishingWithPeers covers the happy path. Missing tests:

  • Publisher mode with zero peers (should be a no-op / trivially pass)
  • Concurrent calls to StartForPublishing and WriteToStoreAndBroadcast to verify storeInitialized is set correctly by the first produced block
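The difference between the two start paths can be sketched with stand-in steps; the step names follow the walkthrough above, everything else (the struct, the recording mechanism) is hypothetical:

```go
package main

import "fmt"

// syncService is a toy stand-in; the real service wires libp2p stores and
// pubsub, while here each step just records itself for illustration.
type syncService struct{ steps []string }

func (s *syncService) prepareStart()         { s.steps = append(s.steps, "prepareStart") }
func (s *syncService) setupP2P()             { s.steps = append(s.steps, "setupP2PInfrastructure") }
func (s *syncService) initFromP2PWithRetry() { s.steps = append(s.steps, "initFromP2PWithRetry") }
func (s *syncService) startSubscriber()      { s.steps = append(s.steps, "startSubscriber") }

// Start is the normal path: block until a trusted head is fetched from peers.
func (s *syncService) Start() {
	s.prepareStart()
	s.setupP2P()
	s.initFromP2PWithRetry() // blocks on a fresh leader with no serving peers
	s.startSubscriber()
}

// StartForPublishing skips the blocking P2P init; the first locally produced
// block (via WriteToStoreAndBroadcast) initializes the store instead.
func (s *syncService) StartForPublishing() {
	s.prepareStart()
	s.setupP2P()
	s.startSubscriber()
}

func main() {
	normal, publisher := &syncService{}, &syncService{}
	normal.Start()
	publisher.StartForPublishing()
	fmt.Println(normal.steps)
	fmt.Println(publisher.steps)
}
```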

node/failover.go — Publisher mode detection ⚠️

storeHeight fetch with no effect on decision (failover.go:188-210):

storeHeight, err := f.store.Height(ctx)
if err != nil {
    f.logger.Warn()...
    return false
}
// storeHeight is then ONLY logged, not used in the condition
if headerHeight > 0 || dataHeight > 0 {
    return false
}

storeHeight triggers an I/O error path and a log field, but is never part of the boolean decision. If the intent is "check P2P store emptiness only", the main-store read adds cost and an error branch that silently falls back to blocking sync. Consider either removing the fetch entirely, or actually using storeHeight in the condition (e.g., to skip publisher mode if the main store is also empty, meaning a truly cold first-boot).
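One hedged possibility for folding storeHeight into the decision, sketched in Go; the function name mirrors the PR, but the signature and the "all three stores empty" rule are assumptions, not the PR's actual code:

```go
package main

import "fmt"

// shouldStartSyncInPublisherMode is a hypothetical sketch: publisher mode only
// on a truly fresh raft leader, with the main store and both sync stores empty,
// so storeHeight actually participates in the decision instead of only being logged.
func shouldStartSyncInPublisherMode(isLeader bool, storeHeight, headerHeight, dataHeight uint64) bool {
	if !isLeader {
		return false
	}
	return storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func main() {
	fmt.Println(shouldStartSyncInPublisherMode(true, 0, 0, 0))  // fresh leader
	fmt.Println(shouldStartSyncInPublisherMode(true, 5, 0, 0))  // main store has blocks
	fmt.Println(shouldStartSyncInPublisherMode(false, 0, 0, 0)) // follower
}
```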

TOCTOU leadership check: f.raftNode.IsLeader() is evaluated once at Run startup. Leadership can be lost between that check and StartForPublishing actually running. The consequence is that the node starts in publisher mode but then loses leadership; block production will be stopped by the raft state machine, so this is a benign race. A comment noting this would help future readers.


pkg/raft/election.go — startFollower refactor + verifyState hardening ⚠️

startFollower DRY refactor — Good. Eliminates the code duplication between the leaderCh branch and the ticker branch.

Stacked timeouts: startFollower calls waitForMsgsLanded(SendTimeout) then verifyState, which itself may spin for another full SendTimeout. Total stall before a follower is active can reach 2× SendTimeout. For a default 500ms SendTimeout this is probably fine, but should be documented.

Error handling change in verifyState: Previously, if IsSynced returned an error, it was propagated directly. Now the code attempts Recover and swallows the original IsSynced error on success. This is more resilient but masks the original failure. Consider logging the IsSynced error even when recovery succeeds, so operators can observe these events in production.

Duplicate log line (pre-existing, surfaced by review):

// line ~130
d.logger.Info().Msg("became leader, stopping follower operations")
// line ~144
d.logger.Info().Msg("became leader, stopping follower operations")

The same message is logged twice in the becameLeader && !isCurrentlyLeader branch.


pkg/raft/node.go — Bootstrap requirement removed ⚠️

-   if !n.config.Bootstrap {
-       return fmt.Errorf("raft cluster requires bootstrap mode")
-   }

This is a meaningful behavior change: previously nodes without Bootstrap: true would fail fast at startup. Now they silently pass through to the existing-state check. This is necessary for the failover scenario (a restarted leader with existing Raft state), but it removes an important guard for misconfigured nodes that should have Bootstrap: true but don't. The new comment "raft node started with existing local state" is only logged when there IS existing state; a node with Bootstrap: false and no state will now fall through to BootstrapCluster and potentially start a single-node cluster unintentionally. Consider adding a log warning when Bootstrap: false but no existing state is found, to aid debugging.
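The suggested warning could slot into the mode selection roughly like this; selectStartupMode and its string results are invented for illustration, only the branch structure follows the description above:

```go
package main

import "fmt"

// selectStartupMode is a hypothetical sketch of startup after the Bootstrap
// guard was removed, with a warning on the surprising branch the review flags.
func selectStartupMode(hasExistingState, bootstrapFlag bool, logWarn func(string)) string {
	if hasExistingState {
		return "rejoin" // raft node started with existing local state
	}
	if !bootstrapFlag {
		// Previously this failed fast; now it silently bootstraps. Warn so a
		// misconfigured node starting a single-node cluster is debuggable.
		logWarn("no existing raft state and bootstrap=false; bootstrapping anyway")
	}
	return "bootstrap"
}

func main() {
	warned := 0
	warn := func(string) { warned++ }
	fmt.Println(selectStartupMode(true, false, warn))  // rejoin, no warning
	fmt.Println(selectStartupMode(false, true, warn))  // bootstrap, no warning
	fmt.Println(selectStartupMode(false, false, warn)) // bootstrap with warning
	fmt.Println("warnings:", warned)
}
```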


block/internal/syncing/syncer.go — RecoverFromRaft retry ⚠️

trySyncNextBlockWithState refactor — Good split. Passing currentState explicitly avoids re-reading the atomic inside the block-apply path, which eliminates a subtle race where another goroutine could update lastState between the check and the apply.

Retry logic concern (syncer.go:1222-1239):

err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
    s.SetLastState(currentState)                               // sets s.lastState
    err = s.trySyncNextBlockWithState(ctx, event, currentState) // same currentState
}

trySyncNextBlockWithState receives currentState directly; s.SetLastState changes s.lastState (the atomic) but the retry call still receives the same currentState value. For this retry to have any different outcome than the first attempt, trySyncNextBlockWithState must read s.lastState internally somewhere, in addition to using the passed currentState. If that is not the case, the retry always produces the same result and the if block is dead code. Please verify (and add a comment explaining) why SetLastState + retry produces a different outcome.
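For the retry to be meaningful, the second attempt has to observe refreshed state rather than the stale snapshot. A deliberately contrived Go sketch of that pattern, where the types, the validation rule, and recoverFromRaft are all invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidState = errors.New("invalid state")

// syncer is a toy stand-in; lastState plays the role of the atomic s.lastState.
type syncer struct{ lastState int }

// trySync fails for states below 2, imitating a validation check.
func (s *syncer) trySync(state int) error {
	if state < 2 {
		return errInvalidState
	}
	return nil
}

// recoverFromRaft retries after refreshing lastState; crucially, the second
// attempt receives the refreshed value, not the original stale snapshot.
func (s *syncer) recoverFromRaft(stale int) error {
	err := s.trySync(stale)
	if errors.Is(err, errInvalidState) {
		s.lastState = stale + 2       // SetLastState-style update
		return s.trySync(s.lastState) // retry with the REFRESHED state
	}
	return err
}

func main() {
	s := &syncer{}
	fmt.Println(s.recoverFromRaft(0)) // succeeds only because the retry sees new state
}
```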

InitialHeight in genesis defaults — Good catch; without it, height validation during recovery could fail for genesis-height blocks.


Testing ✅ / ⚠️

  • SyncerStatus concurrent behavior: ✅ concurrent_start_only_runs_once is exactly the right test
  • StartForPublishing with peers: ✅ TestHeaderSyncServiceStartForPublishingWithPeers
  • RecoverFromRaft bootstrap path: ✅ Two new tests covering happy path and strict validation
  • shouldStartSyncInPublisherMode: ⚠️ No unit test; only covered implicitly by E2E
  • verifyState timeout/recovery branches: ⚠️ New wait loop and recovery path are not unit tested
  • Publisher mode with no peers: ⚠️ Not tested

The E2E test changes (test/e2e/failover_e2e_test.go, sut_helper.go) are the primary regression guard, which is appropriate for this class of bug.


Nits

  • pkg/config/config.go: The bootstrap flag comment update ("Bootstrap a new static Raft cluster during initial bring-up") is clearer. ✅
  • docs/: The production Raft guide updates look accurate. ✅
  • apps/evm/go.mod / go.sum: Dependency bumps — presumably required by the changes. ✅


codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 52.94118% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.64%. Comparing base (022b565) to head (74641e0).

Files with missing lines Patch % Lines
node/failover.go 31.25% 16 Missing and 6 partials ⚠️
pkg/sync/sync_service.go 47.36% 13 Missing and 7 partials ⚠️
pkg/raft/election.go 53.12% 10 Missing and 5 partials ⚠️
block/internal/syncing/syncer.go 63.63% 3 Missing and 1 partial ⚠️
pkg/sync/syncer_status.go 90.47% 1 Missing and 1 partial ⚠️
pkg/raft/node.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
+ Coverage   61.43%   61.64%   +0.20%     
==========================================
  Files         120      120              
  Lines       12504    12594      +90     
==========================================
+ Hits         7682     7763      +81     
+ Misses       3960     3957       -3     
- Partials      862      874      +12     
Flag Coverage Δ
combined 61.64% <52.94%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown.



github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://evstack.github.io/docs-preview/pr-3222/

Built to branch main at 2026-04-02 10:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

alpe force-pushed the alex/sync_race branch 2 times, most recently from ec0ffc4 to 59dc917 on April 2, 2026 at 08:51
alpe force-pushed the alex/sync_race branch from 59dc917 to 74641e0 on April 2, 2026 at 10:06
alpe changed the title from "fix: Publisher-mode synchronization option for failover scenario" to "[WIP]fix: Publisher-mode synchronization option for failover scenario" on Apr 2, 2026
alpe marked this pull request as ready for review on April 2, 2026 at 10:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/e2e/sut_helper.go (1)

189-195: ⚠️ Potential issue | 🟡 Minor

Close the per-process logfile handle.

os.Create on Line 192 leaves an open fd that is never closed. E2E suites spawn a lot of processes, so this can leak descriptors and delay flushing the captured logs.

Suggested change
 		logfile, err := os.Create(logfileName)
 		require.NoError(s.t, err)
+		s.t.Cleanup(func() { _ = logfile.Close() })
 		errReader = io.NopCloser(io.TeeReader(errReader, logfile))
 		outReader = io.NopCloser(io.TeeReader(outReader, logfile))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/sut_helper.go` around lines 189 - 195, The per-process logfile
created by os.Create (logfile) in the block that checks s.processLogDir() is
never closed; wrap the logfile so it is closed when the returned readers are
closed by replacing the io.NopCloser(io.TeeReader(..., logfile)) usage with a
ReadCloser that closes the underlying logfile on Close, or otherwise ensure
logfile.Close() is called (e.g., return a combined io.ReadCloser that delegates
Read to the TeeReader and Close to logfile.Close) for both errReader and
outReader so the file descriptor is released and buffered data flushed.
block/internal/syncing/syncer.go (1)

1204-1213: ⚠️ Potential issue | 🔴 Critical

Bootstrap the same genesis state as initializeState().

This fallback only sets ChainID, InitialHeight, and LastBlockHeight, but Line 808 later executes the recovered block against currentState.AppHash. On a fresh node that means raft recovery runs against a different execution baseline than normal startup, which calls exec.InitChain and seeds AppHash, DAHeight, and LastBlockTime. Please reuse that bootstrap path here before applying the raft block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1204 - 1213, The fallback
after s.store.GetState should bootstrap the same full genesis state as
initializeState() instead of only setting ChainID/InitialHeight/LastBlockHeight;
update the recovery path that sets currentState and stateBootstrapped to call
the same initialization logic (e.g., run the InitChain/bootstrap sequence used
in initializeState) so that currentState.AppHash, DAHeight, LastBlockTime (and
any other genesis-seeded fields) are populated before the raft/recovered block
is executed; ensure you invoke the same executor/init routine (the InitChain or
genesis bootstrap used by initializeState) rather than manually setting only
those three fields.
pkg/sync/sync_service.go (1)

190-205: ⚠️ Potential issue | 🟠 Major

Unwind partially started components on startup errors.

By the time Line 233 returns from setupP2PInfrastructure, the store, exchange server, and exchange can already be running. If newSyncer fails here — or startSubscriber fails in the new publisher-mode path — these methods return without tearing those pieces down, so a failed start leaks live components into the process.

As per coding guidelines, "Be mindful of goroutine leaks in Go code".

Also applies to: 219-225, 231-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service.go` around lines 190 - 205, The startup may leave
store/exchange/exchange-server running if later steps fail (e.g. newSyncer or
startSubscriber); update the startup sequence (functions prepareStart /
setupP2PInfrastructure / initFromP2PWithRetry) to register and invoke cleanup on
error: after setupP2PInfrastructure returns and before calling newSyncer or
startSubscriber, capture the created components (store, exchangeServer, exchange
or any returned tearDown/Close funcs) and ensure they are stopped when a
subsequent error occurs (use a deferred cleanup or explicit teardown call on
error paths in the block around prepareStart -> initFromP2PWithRetry ->
newSyncer/startSubscriber). Reference symbols to change: prepareStart,
setupP2PInfrastructure, initFromP2PWithRetry, newSyncer, and startSubscriber —
ensure each path that returns an error calls the appropriate stop/Close/Shutdown
methods for the started components to avoid goroutine/resource leaks.
🧹 Nitpick comments (1)
pkg/sync/sync_service_test.go (1)

63-97: Assert remote peer delivery, not just local initialization.

This test pays the cost of bringing up client2, but the only postcondition is svc.storeInitialized. A regression where StartForPublishing initializes locally but never broadcasts to peers would still pass. Please add an assertion on peer 2's view of the header so the new publisher-mode path is actually covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service_test.go` around lines 63 - 97, The test currently only
asserts svc.storeInitialized after broadcasting; add an assertion that the
remote peer (client2) actually receives the broadcasted header: before calling
svc.WriteToStoreAndBroadcast, register or subscribe a handler on client2 to
capture incoming P2PSignedHeader messages (using client2's subscribe/handler
API), then after WriteToStoreAndBroadcast use require.Eventually to poll/assert
that the handler received a header equal to signedHeader (or matching
height/DataHash/AppHash). Reference client1/client2, NewHeaderSyncService,
StartForPublishing, WriteToStoreAndBroadcast and svc.storeInitialized when
locating where to add the subscription and the eventual assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: Remove the repo-local replace directives from the module
manifest (the replace block that points github.com/evstack/ev-node and
github.com/evstack/ev-node/execution/evm to relative ../../ paths) so the
published module no longer contains repository-specific local wiring; keep the
go.mod clean for downstream consumers and, if local development linking is
needed, use a go.work file or temporary local replace only in your working copy
before publishing.

In `@block/internal/syncing/syncer.go`:
- Around line 1230-1235: The retry uses the identical currentState so
errInvalidState from AssertValidForNextState will always recur; before retrying
the call to trySyncNextBlockWithState, refresh the snapshot/state (e.g. read
s.GetLastState() or obtain the latest snapshot used by AssertValidForNextState)
and pass that refreshed state into trySyncNextBlockWithState instead of reusing
the original currentState, or recompute/validate the next state after calling
s.SetLastState(currentState) so the second call sees the updated global state.

In `@CHANGELOG.md`:
- Around line 12-21: The changelog contains unresolved git conflict markers
(<<<<<<<, =======, >>>>>>>) — remove those markers and produce a single resolved
section (e.g., keep "### Added") combining both bullet entries into the final
list so both PRs stay: the publisher-mode synchronization line [`#3222`] and the
"Improve execution/evm check..." line [`#3221`]; ensure only one header remains
and that the PR links and bullets are intact and properly formatted.

In `@docs/guides/raft_production.md`:
- Around line 93-100: Update the example to use the canonical aggregator flag
and drop the legacy bootstrap flag: replace the --rollkit.node.aggregator=true
occurrence with --evnode.node.aggregator=true, and remove the
--evnode.raft.bootstrap=true line (only keep raft-related flags like
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, and --evnode.raft.peers). Ensure the rest of the example
(e.g., --rollkit.p2p.listen_address) stays unchanged unless also intended to be
canonicalized.

In `@node/failover.go`:
- Around line 193-209: The gating logic ignores storeHeight when deciding to
start in publisher mode; change the condition that currently checks headerHeight
and dataHeight to include storeHeight so publishing only starts when the main
block store and both sync stores are empty. In other words, adjust the boolean
check around headerSyncService.Store().Height() and
dataSyncService.Store().Height() (the if that returns false) to also consider
the local storeHeight variable so StartForPublishing is only taken when
storeHeight == 0 && headerHeight == 0 && dataHeight == 0.

In `@pkg/raft/election.go`:
- Around line 93-96: The wait here uses d.node.waitForMsgsLanded which
internally uses context.Background(), so follower startup isn’t cancellable;
update this call to use the Run cancellation context (e.g., pass the existing
ctx) and add a ctx-aware variant on the node (e.g., waitForMsgsLandedCtx(ctx,
timeout) or change waitForMsgsLanded to accept a context) so the call in
election.go uses that context and honors cancellation/role changes; modify the
node implementation (pkg/raft/node.go – waitForMsgsLanded) to accept and
propagate the provided context instead of creating context.Background().

---

Outside diff comments:
In `@block/internal/syncing/syncer.go`:
- Around line 1204-1213: The fallback after s.store.GetState should bootstrap
the same full genesis state as initializeState() instead of only setting
ChainID/InitialHeight/LastBlockHeight; update the recovery path that sets
currentState and stateBootstrapped to call the same initialization logic (e.g.,
run the InitChain/bootstrap sequence used in initializeState) so that
currentState.AppHash, DAHeight, LastBlockTime (and any other genesis-seeded
fields) are populated before the raft/recovered block is executed; ensure you
invoke the same executor/init routine (the InitChain or genesis bootstrap used
by initializeState) rather than manually setting only those three fields.

In `@pkg/sync/sync_service.go`:
- Around line 190-205: The startup may leave store/exchange/exchange-server
running if later steps fail (e.g. newSyncer or startSubscriber); update the
startup sequence (functions prepareStart / setupP2PInfrastructure /
initFromP2PWithRetry) to register and invoke cleanup on error: after
setupP2PInfrastructure returns and before calling newSyncer or startSubscriber,
capture the created components (store, exchangeServer, exchange or any returned
tearDown/Close funcs) and ensure they are stopped when a subsequent error occurs
(use a deferred cleanup or explicit teardown call on error paths in the block
around prepareStart -> initFromP2PWithRetry -> newSyncer/startSubscriber).
Reference symbols to change: prepareStart, setupP2PInfrastructure,
initFromP2PWithRetry, newSyncer, and startSubscriber — ensure each path that
returns an error calls the appropriate stop/Close/Shutdown methods for the
started components to avoid goroutine/resource leaks.

In `@test/e2e/sut_helper.go`:
- Around line 189-195: The per-process logfile created by os.Create (logfile) in
the block that checks s.processLogDir() is never closed; wrap the logfile so it
is closed when the returned readers are closed by replacing the
io.NopCloser(io.TeeReader(..., logfile)) usage with a ReadCloser that closes the
underlying logfile on Close, or otherwise ensure logfile.Close() is called
(e.g., return a combined io.ReadCloser that delegates Read to the TeeReader and
Close to logfile.Close) for both errReader and outReader so the file descriptor
is released and buffered data flushed.

---

Nitpick comments:
In `@pkg/sync/sync_service_test.go`:
- Around line 63-97: The test currently only asserts svc.storeInitialized after
broadcasting; add an assertion that the remote peer (client2) actually receives
the broadcasted header: before calling svc.WriteToStoreAndBroadcast, register or
subscribe a handler on client2 to capture incoming P2PSignedHeader messages
(using client2's subscribe/handler API), then after WriteToStoreAndBroadcast use
require.Eventually to poll/assert that the handler received a header equal to
signedHeader (or matching height/DataHash/AppHash). Reference client1/client2,
NewHeaderSyncService, StartForPublishing, WriteToStoreAndBroadcast and
svc.storeInitialized when locating where to add the subscription and the
eventual assertion.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b5476c3-9ec8-4d67-a4cb-fcde2300edbd

📥 Commits

Reviewing files that changed from the base of the PR and between 022b565 and 74641e0.

⛔ Files ignored due to path filters (1)
  • apps/evm/go.sum is excluded by !**/*.sum
📒 Files selected for processing (19)
  • CHANGELOG.md
  • apps/evm/go.mod
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • docs/guides/raft_production.md
  • docs/learn/config.md
  • node/failover.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
  • pkg/sync/syncer_status.go
  • pkg/sync/syncer_status_test.go
  • test/e2e/failover_e2e_test.go
  • test/e2e/sut_helper.go

Comment on lines +5 to +8
replace (
	github.com/evstack/ev-node => ../../
	github.com/evstack/ev-node/execution/evm => ../../execution/evm
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check for go.work file (monorepo pattern)
fd -H 'go.work' .

# Explore repository structure
ls -la apps/ 2>/dev/null || echo "apps directory structure"
find . -name 'go.mod' -type f | head -20

Repository: evstack/ev-node

Length of output: 631


🏁 Script executed:

# Check the complete go.mod for apps/evm
cat -n apps/evm/go.mod

Repository: evstack/ev-node

Length of output: 14568


🏁 Script executed:

# Look for CI/CD configuration to understand build patterns
find . -name '.github' -type d
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' | head -10

Repository: evstack/ev-node

Length of output: 434


🏁 Script executed:

# Check if there are any docs or comments about module publishing
grep -r "apps/evm" --include='*.md' --include='*.txt' . 2>/dev/null | head -10

Repository: evstack/ev-node

Length of output: 579


Remove repo-local replace directives from the published module manifest.

These relative paths only resolve within this repository layout and break downstream builds when github.com/evstack/ev-node/apps/evm is used as a dependency. The module is published with independent versioning (per RELEASE.md); local wiring should use the existing go.work pattern or be removed before merge.

Suggested adjustment
-replace (
-	github.com/evstack/ev-node => ../../
-	github.com/evstack/ev-node/execution/evm => ../../execution/evm
-)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/evm/go.mod` around lines 5 - 8, Remove the repo-local replace directives
from the module manifest (the replace block that points
github.com/evstack/ev-node and github.com/evstack/ev-node/execution/evm to
relative ../../ paths) so the published module no longer contains
repository-specific local wiring; keep the go.mod clean for downstream consumers
and, if local development linking is needed, use a go.work file or temporary
local replace only in your working copy before publishing.
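The go.work alternative the comment mentions would look roughly like this at the repository root; the go directive version and the exact module layout are assumptions based on the directory listing in the scripts above:

```
go 1.23

use (
    .
    ./execution/evm
    ./apps/evm
)
```

With this in place, local builds resolve the sibling modules without any replace directives in the published apps/evm/go.mod, and go.work is typically kept out of version control or ignored by release tooling.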

Comment on lines +1230 to +1235
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
// Keep strict validation semantics; this retry only guards startup ordering races.
s.SetLastState(currentState)
err = s.trySyncNextBlockWithState(ctx, event, currentState)

⚠️ Potential issue | 🟠 Major

The retry cannot clear errInvalidState.

Lines 719-727 validate against the currentState argument, and Line 1235 passes the same snapshot again. s.SetLastState(currentState) only updates the global pointer, so an errInvalidState from AssertValidForNextState will reproduce deterministically on the retry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1230 - 1235, The retry uses
the identical currentState so errInvalidState from AssertValidForNextState will
always recur; before retrying the call to trySyncNextBlockWithState, refresh the
snapshot/state (e.g. read s.GetLastState() or obtain the latest snapshot used by
AssertValidForNextState) and pass that refreshed state into
trySyncNextBlockWithState instead of reusing the original currentState, or
recompute/validate the next state after calling s.SetLastState(currentState) so
the second call sees the updated global state.

Comment on lines +12 to +21
<<<<<<< HEAD
### Changes

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
* Improve execution/evm check for stored meta not stale [#3221](https://github.com/evstack/ev-node/pull/3221)
=======
### Added

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
>>>>>>> a2b0ff76 (Changelog)

⚠️ Potential issue | 🔴 Critical

Resolve merge conflict markers in changelog before merge.

<<<<<<<, =======, and >>>>>>> are still committed, which breaks the changelog and release notes generation.

Proposed conflict resolution
-<<<<<<< HEAD
-### Changes
-
-* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
-* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
-=======
 ### Added
 
 * Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
->>>>>>> a2b0ff76 (Changelog)
+
+### Changes
+
+* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 12 - 21, The changelog contains unresolved git
conflict markers (<<<<<<<, =======, >>>>>>>) — remove those markers and produce
a single resolved section (e.g., keep "### Added") combining both bullet entries
into the final list so both PRs stay: the publisher-mode synchronization line
[`#3222`] and the "Improve execution/evm check..." line [`#3221`]; ensure only one
header remains and that the PR links and bullets are intact and properly
formatted.

Comment on lines +93 to +100
--rollkit.node.aggregator=true \
--evnode.raft.enable=true \
--evnode.raft.node_id="node-1" \
--evnode.raft.raft_addr="0.0.0.0:5001" \
--evnode.raft.raft_dir="/var/lib/ev-node/raft" \
--evnode.raft.bootstrap=true \
--evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
--rollkit.p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \

⚠️ Potential issue | 🟡 Minor

Make the primary example match the new semantics.

The guide now says --evnode.raft.bootstrap is compatibility-only and optional, but the example still sets it and uses the rollkit. alias for the aggregator flag. That makes the happy path look more configuration-sensitive than it is. Prefer the canonical --evnode.node.aggregator form here and omit --evnode.raft.bootstrap unless this example is specifically documenting legacy compatibility.

Suggested change
-  --rollkit.node.aggregator=true \
+  --evnode.node.aggregator=true \
   --evnode.raft.enable=true \
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.bootstrap=true \
   --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
As per coding guidelines, `docs/guides/**/*.md`: Structure guides with clear step-by-step instructions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 93 - 100, Update the example to
use the canonical aggregator flag and drop the legacy bootstrap flag: replace
the --rollkit.node.aggregator=true occurrence with
--evnode.node.aggregator=true, and remove the --evnode.raft.bootstrap=true line
(only keep raft-related flags like --evnode.raft.enable, --evnode.raft.node_id,
--evnode.raft.raft_addr, --evnode.raft.raft_dir, and --evnode.raft.peers).
Ensure the rest of the example (e.g., --rollkit.p2p.listen_address) stays
unchanged unless also intended to be canonicalized.

Comment on lines +193 to +209
storeHeight, err := f.store.Height(ctx)
if err != nil {
f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
return false
}
headerHeight := f.headerSyncService.Store().Height()
dataHeight := f.dataSyncService.Store().Height()
if headerHeight > 0 || dataHeight > 0 {
return false
}

f.logger.Info().
Uint64("store_height", storeHeight).
Uint64("header_height", headerHeight).
Uint64("data_height", dataHeight).
Msg("raft-enabled aggregator with empty sync stores: starting sync services in publisher mode")
return true

⚠️ Potential issue | 🟠 Major

Gate publisher mode on an actually empty local store.

Line 193 reads storeHeight, but that value never participates in the decision. If the main block store already has blocks while the sync stores are empty, this still takes the fresh-start StartForPublishing path and skips the blocking sync startup, even outside the failover case this PR targets.

Suggested change
 	storeHeight, err := f.store.Height(ctx)
 	if err != nil {
 		f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
 		return false
 	}
 	headerHeight := f.headerSyncService.Store().Height()
 	dataHeight := f.dataSyncService.Store().Height()
-	if headerHeight > 0 || dataHeight > 0 {
+	if storeHeight > 0 || headerHeight > 0 || dataHeight > 0 {
 		return false
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node/failover.go` around lines 193 - 209, The gating logic ignores
storeHeight when deciding to start in publisher mode; change the condition that
currently checks headerHeight and dataHeight to include storeHeight so
publishing only starts when the main block store and both sync stores are empty.
In other words, adjust the boolean check around
headerSyncService.Store().Height() and dataSyncService.Store().Height() (the if
that returns false) to also consider the local storeHeight variable so
StartForPublishing is only taken when storeHeight == 0 && headerHeight == 0 &&
dataHeight == 0.
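The suggested gate reduces to a pure predicate over the three heights; the function name and signature below are illustrative, not the repo's API:

```go
package main

import "fmt"

// shouldStartPublisherMode sketches the fixed gate: publisher mode is taken
// only when the main block store AND both sync stores are empty.
func shouldStartPublisherMode(storeHeight, headerHeight, dataHeight uint64) bool {
	return storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func main() {
	fmt.Println(shouldStartPublisherMode(0, 0, 0)) // true: genuinely fresh start
	fmt.Println(shouldStartPublisherMode(3, 0, 0)) // false: main store already has blocks
}
```

The second case is exactly the scenario the review flags: under the original `headerHeight > 0 || dataHeight > 0` check it would have started in publisher mode despite a non-empty block store.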

Comment on lines +93 to +96
if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
// this wait can legitimately time out
d.logger.Debug().Err(err).Msg("timed out waiting for raft messages before follower verification; continuing")
}

⚠️ Potential issue | 🟡 Minor

Keep follower startup cancellable.

This new wait path ignores Run cancellation because pkg/raft/node.go:145-175 implements waitForMsgsLanded with its own context.Background() timeout. On shutdown or role changes, follower startup can now sit here for up to SendTimeout. A ctx-aware variant would keep this path responsive. As per coding guidelines, "Use context.Context for cancellation in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/raft/election.go` around lines 93 - 96, The wait here uses
d.node.waitForMsgsLanded which internally uses context.Background(), so follower
startup isn’t cancellable; update this call to use the Run cancellation context
(e.g., pass the existing ctx) and add a ctx-aware variant on the node (e.g.,
waitForMsgsLandedCtx(ctx, timeout) or change waitForMsgsLanded to accept a
context) so the call in election.go uses that context and honors
cancellation/role changes; modify the node implementation (pkg/raft/node.go –
waitForMsgsLanded) to accept and propagate the provided context instead of
creating context.Background().
