[WIP] fix: Publisher-mode synchronization option for failover scenario #3222
Conversation
📝 Walkthrough

This PR implements publisher-mode synchronization for early P2P infrastructure readiness during Raft-based failover scenarios. It refactors Raft leader election and node startup logic to determine the mode automatically based on persisted state, and introduces a new `StartForPublishing` startup path on the sync services.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant failover as Failover Manager
    participant raftNode as Raft Node
    participant syncService as Sync Service
    participant store as Block Store
    Note over failover: Run() startup
    alt Publisher Mode Eligible
        failover->>raftNode: Check raft leader + config
        raftNode-->>failover: Is leader: true
        failover->>store: Get header/data heights
        store-->>failover: Both empty
        failover->>syncService: StartForPublishing()
        syncService->>syncService: prepareStart (no P2P wait)
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
        Note over syncService: Start ingesting blocks<br/>without P2P readiness
    else Normal Mode
        failover->>syncService: Start()
        syncService->>syncService: prepareStart
        syncService->>syncService: initFromP2PWithRetry
        Note over syncService: Wait for P2P<br/>peer discovery
        syncService->>syncService: startSubscriber
        syncService-->>failover: OK
    end
```
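A minimal Go sketch of the branch the diagram describes, assuming an illustrative `syncService` interface; `Start` and `StartForPublishing` are the PR's method names, while everything else here (the helper, the boolean gate) is hypothetical wiring, not the code under review:

```go
package sketch

import "context"

// syncService captures the two startup paths shown in the diagram. The
// interface is a stand-in; the PR's concrete type lives in pkg/sync.
type syncService interface {
	Start(ctx context.Context) error              // blocks on P2P peer discovery
	StartForPublishing(ctx context.Context) error // skips the P2P readiness wait
}

// startSyncServices sketches the branch from the diagram. publisherMode
// stands in for the shouldStartSyncInPublisherMode check in node/failover.go.
func startSyncServices(ctx context.Context, headerSvc, dataSvc syncService, publisherMode bool) error {
	if publisherMode {
		// Fresh raft leader: start publishing immediately, no P2P wait.
		if err := headerSvc.StartForPublishing(ctx); err != nil {
			return err
		}
		return dataSvc.StartForPublishing(ctx)
	}
	// Normal path: Start waits for P2P readiness before the subscriber runs.
	if err := headerSvc.Start(ctx); err != nil {
		return err
	}
	return dataSvc.Start(ctx)
}
```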
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @alpe's task in 3m 34s

Code Review
| Area | Coverage | Notes |
|---|---|---|
| SyncerStatus concurrent behavior | ✅ | `concurrent_start_only_runs_once` is exactly the right test |
| StartForPublishing with peers | ✅ | `TestHeaderSyncServiceStartForPublishingWithPeers` |
| RecoverFromRaft bootstrap path | ✅ | Two new tests covering happy path and strict validation |
| shouldStartSyncInPublisherMode | | No unit test; only covered implicitly by E2E |
| verifyState timeout/recovery branches | | New wait loop and recovery path are not unit tested |
| Publisher mode with no peers | | Not tested |
The E2E test changes (test/e2e/failover_e2e_test.go, sut_helper.go) are the primary regression guard, which is appropriate for this class of bug.
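To close the `shouldStartSyncInPublisherMode` gap, a table-driven unit test could exercise the gating predicate directly. The sketch below factors the conditions into a hypothetical `isPublisherModeEligible` helper (the real function's signature isn't shown in this review) and follows the fixed semantics proposed in the node/failover.go comment further down, including the main store height:

```go
package node

import "testing"

// isPublisherModeEligible is a hypothetical extraction of the gating logic;
// the real check lives inside shouldStartSyncInPublisherMode.
func isPublisherModeEligible(isLeader bool, storeHeight, headerHeight, dataHeight uint64) bool {
	return isLeader && storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func TestIsPublisherModeEligible(t *testing.T) {
	cases := []struct {
		name                                  string
		leader                                bool
		storeHeight, headerHeight, dataHeight uint64
		want                                  bool
	}{
		{"fresh leader", true, 0, 0, 0, true},
		{"follower never publishes", false, 0, 0, 0, false},
		{"non-empty block store", true, 5, 0, 0, false},
		{"non-empty header sync store", true, 0, 3, 0, false},
		{"non-empty data sync store", true, 0, 0, 7, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := isPublisherModeEligible(tc.leader, tc.storeHeight, tc.headerHeight, tc.dataHeight); got != tc.want {
				t.Fatalf("got %v, want %v", got, tc.want)
			}
		})
	}
}
```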
Nits
- `pkg/config/config.go`: The bootstrap flag comment update ("Bootstrap a new static Raft cluster during initial bring-up") is clearer. ✅
- `docs/`: The production Raft guide updates look accurate. ✅
- `apps/evm/go.mod`/`go.sum`: Dependency bumps, presumably required by the changes. ✅
Codecov Report ❌

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
+ Coverage   61.43%   61.64%   +0.20%
==========================================
  Files         120      120
  Lines       12504    12594      +90
==========================================
+ Hits         7682     7763      +81
+ Misses       3960     3957       -3
- Partials      862      874      +12
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from ec0ffc4 to 59dc917
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
test/e2e/sut_helper.go (1)
189-195: ⚠️ Potential issue | 🟡 Minor

Close the per-process logfile handle.
`os.Create` on Line 192 leaves an open fd that is never closed. E2E suites spawn a lot of processes, so this can leak descriptors and delay flushing the captured logs.

Suggested change
```diff
 logfile, err := os.Create(logfileName)
 require.NoError(s.t, err)
+s.t.Cleanup(func() { _ = logfile.Close() })
 errReader = io.NopCloser(io.TeeReader(errReader, logfile))
 outReader = io.NopCloser(io.TeeReader(outReader, logfile))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/sut_helper.go` around lines 189-195: the per-process logfile created by os.Create is never closed; wrap the logfile so it is closed when the returned readers are closed by replacing the io.NopCloser(io.TeeReader(..., logfile)) usage with a ReadCloser that closes the underlying logfile on Close, or otherwise ensure logfile.Close() is called (e.g., return a combined io.ReadCloser that delegates Read to the TeeReader and Close to logfile.Close) for both errReader and outReader so the file descriptor is released and buffered data flushed.
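A minimal sketch of the combined-ReadCloser option from this prompt; the names are illustrative, and since both readers would share one logfile the close is guarded with sync.Once to avoid a double Close:

```go
package e2e

import (
	"io"
	"os"
	"sync"
)

// teeLogReadCloser tees reads into the shared logfile and closes it exactly
// once, no matter which of the returned readers is closed first.
type teeLogReadCloser struct {
	io.Reader
	closeOnce *sync.Once
	logfile   *os.File
}

func (t teeLogReadCloser) Close() (err error) {
	t.closeOnce.Do(func() { err = t.logfile.Close() })
	return err
}

// newTeeLogReadClosers wraps the stderr and stdout readers so the logfile is
// flushed and released once the callers close them.
func newTeeLogReadClosers(errR, outR io.Reader, logfile *os.File) (io.ReadCloser, io.ReadCloser) {
	once := &sync.Once{}
	return teeLogReadCloser{io.TeeReader(errR, logfile), once, logfile},
		teeLogReadCloser{io.TeeReader(outR, logfile), once, logfile}
}
```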
block/internal/syncing/syncer.go (1)

1204-1213: ⚠️ Potential issue | 🔴 Critical

Bootstrap the same genesis state as `initializeState()`.

This fallback only sets `ChainID`, `InitialHeight`, and `LastBlockHeight`, but Line 808 later executes the recovered block against `currentState.AppHash`. On a fresh node that means raft recovery runs against a different execution baseline than normal startup, which calls `exec.InitChain` and seeds `AppHash`, `DAHeight`, and `LastBlockTime`. Please reuse that bootstrap path here before applying the raft block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/syncing/syncer.go` around lines 1204-1213: the fallback after s.store.GetState should bootstrap the same full genesis state as initializeState() instead of only setting ChainID/InitialHeight/LastBlockHeight; update the recovery path that sets currentState and stateBootstrapped to call the same initialization logic (e.g., run the InitChain/bootstrap sequence used in initializeState) so that currentState.AppHash, DAHeight, LastBlockTime (and any other genesis-seeded fields) are populated before the recovered raft block is executed; ensure you invoke the same executor/init routine rather than manually setting only those three fields.
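A hedged sketch of what that reuse could look like in the fallback; `initializeState` is taken from the comment above, while the exact signature and error wrapping are assumptions:

```go
currentState, err := s.store.GetState(ctx)
if err != nil {
	// Fall back to the full genesis bootstrap (exec.InitChain) so AppHash,
	// DAHeight, and LastBlockTime match what normal startup would seed.
	currentState, err = s.initializeState(ctx)
	if err != nil {
		return fmt.Errorf("bootstrap genesis state for raft recovery: %w", err)
	}
	stateBootstrapped = true
}
```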
pkg/sync/sync_service.go (1)

190-205: ⚠️ Potential issue | 🟠 Major

Unwind partially started components on startup errors.
By the time Line 233 returns from `setupP2PInfrastructure`, the store, exchange server, and exchange can already be running. If `newSyncer` fails here, or `startSubscriber` fails in the new publisher-mode path, these methods return without tearing those pieces down, so a failed start leaks live components into the process.

As per coding guidelines, "Be mindful of goroutine leaks in Go code".
Also applies to: 219-225, 231-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sync/sync_service.go` around lines 190 - 205, The startup may leave store/exchange/exchange-server running if later steps fail (e.g. newSyncer or startSubscriber); update the startup sequence (functions prepareStart / setupP2PInfrastructure / initFromP2PWithRetry) to register and invoke cleanup on error: after setupP2PInfrastructure returns and before calling newSyncer or startSubscriber, capture the created components (store, exchangeServer, exchange or any returned tearDown/Close funcs) and ensure they are stopped when a subsequent error occurs (use a deferred cleanup or explicit teardown call on error paths in the block around prepareStart -> initFromP2PWithRetry -> newSyncer/startSubscriber). Reference symbols to change: prepareStart, setupP2PInfrastructure, initFromP2PWithRetry, newSyncer, and startSubscriber — ensure each path that returns an error calls the appropriate stop/Close/Shutdown methods for the started components to avoid goroutine/resource leaks.
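A generic sketch of the unwind pattern this prompt describes; in the PR the steps would wrap setupP2PInfrastructure, newSyncer, and startSubscriber, but the `step`/`runSteps` helpers below are illustrative, not the service's actual structure:

```go
package sync

import "context"

// step pairs a component's start function with its teardown.
type step struct {
	start func(context.Context) error
	stop  func()
}

// runSteps starts each step in order. If any step fails, the ones that
// already started are stopped newest-first, so a failed startup leaves no
// live goroutines or components behind.
func runSteps(ctx context.Context, steps []step) error {
	var started []func()
	for _, st := range steps {
		if err := st.start(ctx); err != nil {
			for i := len(started) - 1; i >= 0; i-- {
				started[i]()
			}
			return err
		}
		if st.stop != nil {
			started = append(started, st.stop)
		}
	}
	return nil
}
```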
🧹 Nitpick comments (1)
pkg/sync/sync_service_test.go (1)
63-97: Assert remote peer delivery, not just local initialization.

This test pays the cost of bringing up `client2`, but the only postcondition is `svc.storeInitialized`. A regression where `StartForPublishing` initializes locally but never broadcasts to peers would still pass. Please add an assertion on peer 2's view of the header so the new publisher-mode path is actually covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sync/sync_service_test.go` around lines 63 - 97, The test currently only asserts svc.storeInitialized after broadcasting; add an assertion that the remote peer (client2) actually receives the broadcasted header: before calling svc.WriteToStoreAndBroadcast, register or subscribe a handler on client2 to capture incoming P2PSignedHeader messages (using client2's subscribe/handler API), then after WriteToStoreAndBroadcast use require.Eventually to poll/assert that the handler received a header equal to signedHeader (or matching height/DataHash/AppHash). Reference client1/client2, NewHeaderSyncService, StartForPublishing, WriteToStoreAndBroadcast and svc.storeInitialized when locating where to add the subscription and the eventual assertion.
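Inside the existing test, the assertion could look like the fragment below after `WriteToStoreAndBroadcast` returns. It assumes a second sync service (`svc2`) is started over `client2` and exposes a go-header store via `Store()`; if the test only has a raw P2P client, subscribe on `client2` directly as the prompt suggests (imports of `bytes` and `time` are also assumed):

```go
// Poll peer 2's store until the broadcast header arrives (or the test fails).
require.Eventually(t, func() bool {
	got, err := svc2.Store().GetByHeight(ctx, signedHeader.Height())
	return err == nil && bytes.Equal(got.Hash(), signedHeader.Hash())
}, 10*time.Second, 100*time.Millisecond, "peer 2 never received the broadcast header")
```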
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8b5476c3-9ec8-4d67-a4cb-fcde2300edbd
⛔ Files ignored due to path filters (1)
`apps/evm/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (19)
- CHANGELOG.md
- apps/evm/go.mod
- block/internal/syncing/syncer.go
- block/internal/syncing/syncer_test.go
- docs/guides/raft_production.md
- docs/learn/config.md
- node/failover.go
- pkg/config/config.go
- pkg/config/config_test.go
- pkg/raft/election.go
- pkg/raft/election_test.go
- pkg/raft/node.go
- pkg/raft/node_test.go
- pkg/sync/sync_service.go
- pkg/sync/sync_service_test.go
- pkg/sync/syncer_status.go
- pkg/sync/syncer_status_test.go
- test/e2e/failover_e2e_test.go
- test/e2e/sut_helper.go
```go
replace (
	github.com/evstack/ev-node => ../../
	github.com/evstack/ev-node/execution/evm => ../../execution/evm
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
# Check for go.work file (monorepo pattern)
fd -H 'go.work' .

# Explore repository structure
ls -la apps/ 2>/dev/null || echo "apps directory structure"
find . -name 'go.mod' -type f | head -20
```

Repository: evstack/ev-node
Length of output: 631
🏁 Script executed:

```bash
# Check the complete go.mod for apps/evm
cat -n apps/evm/go.mod
```

Repository: evstack/ev-node
Length of output: 14568
🏁 Script executed:

```bash
# Look for CI/CD configuration to understand build patterns
find . -name '.github' -type d
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' | head -10
```

Repository: evstack/ev-node
Length of output: 434
🏁 Script executed:

```bash
# Check if there are any docs or comments about module publishing
grep -r "apps/evm" --include='*.md' --include='*.txt' . 2>/dev/null | head -10
```

Repository: evstack/ev-node
Length of output: 579
Remove repo-local replace directives from the published module manifest.
These relative paths only resolve within this repository layout and break downstream builds when github.com/evstack/ev-node/apps/evm is used as a dependency. The module is published with independent versioning (per RELEASE.md); local wiring should use the existing go.work pattern or be removed before merge.
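If local linking is still wanted during development, a repo-root go.work keeps it out of the published manifest. A minimal sketch, assuming the module layout implied by the replace block (match the `go` directive to the repo's actual toolchain):

```
go 1.24

use (
	.
	./apps/evm
	./execution/evm
)
```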
Suggested adjustment

```diff
-replace (
-	github.com/evstack/ev-node => ../../
-	github.com/evstack/ev-node/execution/evm => ../../execution/evm
-)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/evm/go.mod` around lines 5 - 8, Remove the repo-local replace directives
from the module manifest (the replace block that points
github.com/evstack/ev-node and github.com/evstack/ev-node/execution/evm to
relative ../../ paths) so the published module no longer contains
repository-specific local wiring; keep the go.mod clean for downstream consumers
and, if local development linking is needed, use a go.work file or temporary
local replace only in your working copy before publishing.
```go
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
	s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
	// Keep strict validation semantics; this retry only guards startup ordering races.
	s.SetLastState(currentState)
	err = s.trySyncNextBlockWithState(ctx, event, currentState)
```
The retry cannot clear errInvalidState.
Lines 719-727 validate against the currentState argument, and Line 1235 passes the same snapshot again. s.SetLastState(currentState) only updates the global pointer, so an errInvalidState from AssertValidForNextState will reproduce deterministically on the retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@block/internal/syncing/syncer.go` around lines 1230 - 1235, The retry uses
the identical currentState so errInvalidState from AssertValidForNextState will
always recur; before retrying the call to trySyncNextBlockWithState, refresh the
snapshot/state (e.g. read s.GetLastState() or obtain the latest snapshot used by
AssertValidForNextState) and pass that refreshed state into
trySyncNextBlockWithState instead of reusing the original currentState, or
recompute/validate the next state after calling s.SetLastState(currentState) so
the second call sees the updated global state.
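A hedged sketch of the retry with a refreshed snapshot, as the prompt suggests; it assumes `GetLastState` returns the view updated by `SetLastState`:

```go
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
	s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
	s.SetLastState(currentState)
	// Re-read the state instead of reusing the stale snapshot, so the retry
	// can actually observe the update and clear errInvalidState.
	refreshed := s.GetLastState()
	err = s.trySyncNextBlockWithState(ctx, event, refreshed)
}
```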
```text
<<<<<<< HEAD
### Changes

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
* Improve execution/evm check for stored meta not stale [#3221](https://github.com/evstack/ev-node/pull/3221)
=======
### Added

* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [#3222](https://github.com/evstack/ev-node/pull/3222)
>>>>>>> a2b0ff76 (Changelog)
```
Resolve merge conflict markers in changelog before merge.
<<<<<<<, =======, and >>>>>>> are still committed, which breaks the changelog and release notes generation.
Proposed conflict resolution

```diff
-<<<<<<< HEAD
-### Changes
-
-* Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
-* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
-=======
 ### Added
 
 * Added publisher-mode synchronization option for failover scenarios with early P2P infrastructure readiness [`#3222`](https://github.com/evstack/ev-node/pull/3222)
->>>>>>> a2b0ff76 (Changelog)
+
+### Changes
+
+* Improve execution/evm check for stored meta not stale [`#3221`](https://github.com/evstack/ev-node/pull/3221)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CHANGELOG.md` around lines 12 - 21, The changelog contains unresolved git
conflict markers (<<<<<<<, =======, >>>>>>>) — remove those markers and produce
a single resolved section (e.g., keep "### Added") combining both bullet entries
into the final list so both PRs stay: the publisher-mode synchronization line
[`#3222`] and the "Improve execution/evm check..." line [`#3221`]; ensure only one
header remains and that the PR links and bullets are intact and properly
formatted.
```bash
--rollkit.node.aggregator=true \
--evnode.raft.enable=true \
--evnode.raft.node_id="node-1" \
--evnode.raft.raft_addr="0.0.0.0:5001" \
--evnode.raft.raft_dir="/var/lib/ev-node/raft" \
--evnode.raft.bootstrap=true \
--evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
--rollkit.p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \
```
Make the primary example match the new semantics.
The guide now says --evnode.raft.bootstrap is compatibility-only and optional, but the example still sets it and uses the rollkit. alias for the aggregator flag. That makes the happy path look more configuration-sensitive than it is. Prefer the canonical --evnode.node.aggregator form here and omit --evnode.raft.bootstrap unless this example is specifically documenting legacy compatibility.
Suggested change

```diff
- --rollkit.node.aggregator=true \
+ --evnode.node.aggregator=true \
  --evnode.raft.enable=true \
  --evnode.raft.node_id="node-1" \
  --evnode.raft.raft_addr="0.0.0.0:5001" \
  --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
- --evnode.raft.bootstrap=true \
  --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/guides/raft_production.md` around lines 93 - 100, Update the example to
use the canonical aggregator flag and drop the legacy bootstrap flag: replace
the --rollkit.node.aggregator=true occurrence with
--evnode.node.aggregator=true, and remove the --evnode.raft.bootstrap=true line
(only keep raft-related flags like --evnode.raft.enable, --evnode.raft.node_id,
--evnode.raft.raft_addr, --evnode.raft.raft_dir, and --evnode.raft.peers).
Ensure the rest of the example (e.g., --rollkit.p2p.listen_address) stays
unchanged unless also intended to be canonicalized.
```go
storeHeight, err := f.store.Height(ctx)
if err != nil {
	f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
	return false
}
headerHeight := f.headerSyncService.Store().Height()
dataHeight := f.dataSyncService.Store().Height()
if headerHeight > 0 || dataHeight > 0 {
	return false
}

f.logger.Info().
	Uint64("store_height", storeHeight).
	Uint64("header_height", headerHeight).
	Uint64("data_height", dataHeight).
	Msg("raft-enabled aggregator with empty sync stores: starting sync services in publisher mode")
return true
```
Gate publisher mode on an actually empty local store.
Line 193 reads storeHeight, but that value never participates in the decision. If the main block store already has blocks while the sync stores are empty, this still takes the fresh-start StartForPublishing path and skips the blocking sync startup outside the failover case this PR is targeting.
Suggested change

```diff
 storeHeight, err := f.store.Height(ctx)
 if err != nil {
 	f.logger.Warn().Err(err).Msg("cannot determine store height; keeping blocking sync startup")
 	return false
 }
 headerHeight := f.headerSyncService.Store().Height()
 dataHeight := f.dataSyncService.Store().Height()
-if headerHeight > 0 || dataHeight > 0 {
+if storeHeight > 0 || headerHeight > 0 || dataHeight > 0 {
 	return false
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@node/failover.go` around lines 193 - 209, The gating logic ignores
storeHeight when deciding to start in publisher mode; change the condition that
currently checks headerHeight and dataHeight to include storeHeight so
publishing only starts when the main block store and both sync stores are empty.
In other words, adjust the boolean check around
headerSyncService.Store().Height() and dataSyncService.Store().Height() (the if
that returns false) to also consider the local storeHeight variable so
StartForPublishing is only taken when storeHeight == 0 && headerHeight == 0 &&
dataHeight == 0.
```go
if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
	// this wait can legitimately time out
	d.logger.Debug().Err(err).Msg("timed out waiting for raft messages before follower verification; continuing")
}
```
Keep follower startup cancellable.
This new wait path ignores Run cancellation because pkg/raft/node.go:145-175 implements waitForMsgsLanded with its own context.Background() timeout. On shutdown or role changes, follower startup can now sit here for up to SendTimeout. A ctx-aware variant would keep this path responsive. As per coding guidelines, "Use context.Context for cancellation in Go".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/raft/election.go` around lines 93 - 96, The wait here uses
d.node.waitForMsgsLanded which internally uses context.Background(), so follower
startup isn’t cancellable; update this call to use the Run cancellation context
(e.g., pass the existing ctx) and add a ctx-aware variant on the node (e.g.,
waitForMsgsLandedCtx(ctx, timeout) or change waitForMsgsLanded to accept a
context) so the call in election.go uses that context and honors
cancellation/role changes; modify the node implementation (pkg/raft/node.go –
waitForMsgsLanded) to accept and propagate the provided context instead of
creating context.Background().
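One possible shape for the ctx-aware variant; the polling body is an assumption since `waitForMsgsLanded`'s implementation isn't shown here, and `n.raft` is presumed to be a hashicorp/raft handle exposing `AppliedIndex`/`LastIndex`:

```go
// waitForMsgsLandedCtx waits until the applied index catches up with the last
// log index, honoring both the timeout and the caller's cancellation.
func (n *Node) waitForMsgsLandedCtx(ctx context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		if n.raft.AppliedIndex() >= n.raft.LastIndex() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // shutdown/role change or timeout
		case <-ticker.C:
		}
	}
}
```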

Overview
E2E HA tests sometimes fail on a race when the leader is waiting for p2p sync to complete on a fresh start.
Summary by CodeRabbit: Release Notes (New Features, Documentation, Improvements)