
feat(pkg/p2p): reconnect on disconnected peers#3212

Open
julienrbrt wants to merge 7 commits into main from julien/p2p

Conversation

@julienrbrt
Member

@julienrbrt julienrbrt commented Mar 30, 2026

Attempts to reconnect to peers when there are connectivity issues.
PR made by AI. Still testing.

Summary by CodeRabbit

  • Bug Fixes
    • Improved P2P resilience with automatic reconnection for transient network failures and periodic peer discovery refreshes for more stable networking.
  • Documentation
    • Added an unreleased changelog entry describing the P2P connection improvements.
  • Chores
    • Restored local module mapping to enable local-source builds for development workflows.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Implements automatic reconnection and periodic peer discovery for seed peers, re-enables local module replace directives for development builds, and updates retry-loop syntax in cloud KMS signers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Changelog**<br>`CHANGELOG.md` | Added unreleased entry: "Improve P2P connection in case of transient network failure" referencing PR #3212. |
| **Go module replaces**<br>`apps/evm/go.mod`, `apps/testapp/go.mod` | Uncommented local replace directives to resolve `github.com/evstack/ev-node` (and submodule) to local paths for development builds. |
| **P2P connection resilience**<br>`pkg/p2p/client.go` | Added seed peer tracking, disconnect notifee, reconnect channel, connection semaphore, and a maintenance goroutine with per-peer exponential-backoff reconnects and periodic peer discovery/refresh. (High impact: connection lifecycle and concurrency.) |
| **Signer retry loops**<br>`pkg/signer/aws/signer.go`, `pkg/signer/gcp/signer.go` | Refactored retry loops from indexed `for attempt := 0; ...` to range-based iteration over `maxAttempts` without changing retry/backoff behavior. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as P2P Client
    participant Host as libp2p Host
    participant DHT as DHT/Discovery
    participant Notifee as Disconnect Notifier
    participant Maintenance as Maintenance Loop

    Client->>Host: startWithHost()
    Client->>Notifee: register disconnect handler
    Client->>Maintenance: start maintenance goroutine

    Note over Host,Notifee: network event
    Host-->>Notifee: peer disconnected
    Notifee->>Maintenance: send peerID on reconnectCh

    Maintenance->>Maintenance: apply per-peer backoff
    Maintenance->>Host: tryConnect(peerID) (limited by connectSem)
    Host-->>Maintenance: connection result

    Maintenance->>DHT: periodic refreshPeerDiscovery()
    DHT->>Host: re-advertise & FindPeers()
    DHT-->>Maintenance: peer list
    Maintenance->>Host: tryConnect(discovered peers)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • tac0turtle
  • randygrok

Poem

🐰 When wires fray and peers hop away,
I nibble code at break of day.
With backoff hops and discovery cheer,
I bind the network close and near.
Reconnect! — the rabbit's debugging prayer.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The description is minimal and vague, lacking the required overview section with context, background, goal, and rationale. It only provides a brief statement and notes that the PR is AI-generated and still in testing. | Expand the description with detailed context about the connectivity issue being addressed, explain the implementation approach, and clarify the current testing status before merging. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding reconnection logic when P2P peers disconnect, the primary purpose of the modifications to `pkg/p2p/client.go`. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; skipping the check. |


@github-actions
Contributor

github-actions bot commented Mar 30, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Mar 31, 2026, 8:05 PM |

@claude
Contributor

claude bot commented Mar 30, 2026

Claude finished @julienrbrt's task in 12m 10s


Code Review

  • Gather context
  • Read changed files
  • Analyze code quality, concurrency, error handling, and testing
  • Post review

Overall the PR introduces a well-structured reconnection mechanism. The design is sound (background maintenance goroutine, exponential backoff, semaphore-bounded peer discovery), and using context.Background() for the maintenance context is the right call. A few correctness issues need to be addressed before this is production-ready.


Critical: Reconnect never retries after a failed connect attempt

pkg/p2p/client.go:366-370

reconnectCh is only fed by the network Disconnected notifee. If host.Connect fails, there is no mechanism to schedule another attempt — the peer just silently disappears from future reconnections unless it disconnects again (which can't happen if we never connected). This turns a transient outage into permanent silence.

```go
go func(info peer.AddrInfo) {
	if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
		c.logger.Warn().Str("peer", info.ID.String()).Err(err).Msg("failed to reconnect to seed peer")
		// ← missing: no retry is scheduled here
	}
}(sp)
```

Fix: re-enqueue the peer ID after the backoff delay on failure, e.g.:

```go
go func(info peer.AddrInfo, delay time.Duration) {
	if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
		c.logger.Warn().Str("peer", info.ID.String()).Err(err).Msg("failed to reconnect to seed peer")
		time.AfterFunc(delay, func() {
			select {
			case <-ctx.Done():
			case c.reconnectCh <- info.ID:
			default:
			}
		})
	}
}(sp, backoff)
```

Critical: Disconnect events arriving during backoff window are silently dropped

pkg/p2p/client.go:351-353

```go
if time.Now().Before(st.lastAttempt.Add(backoff)) {
	continue // ← event lost forever
}
```

If a disconnect fires during the cooldown window (e.g. a flapping peer), the event is discarded and no retry is scheduled. After the window expires there's nothing left to trigger a reconnect. Should schedule the retry for when the backoff expires rather than dropping it:

```go
if wait := time.Until(st.lastAttempt.Add(backoff)); wait > 0 {
	time.AfterFunc(wait, func() {
		select {
		case c.reconnectCh <- pid:
		default:
		}
	})
	continue
}
```

Major: Goroutines from refreshPeerDiscovery and reconnect are untracked after Close

pkg/p2p/client.go:199-201, 366, 402

maintenanceWg.Wait() only waits for the single goroutine registered via maintenanceWg.Go(...). The fire-and-forget goroutines inside refreshPeerDiscovery and the reconnect block (go func(info peer.AddrInfo) {...}) are not tracked. After maintenanceWg.Wait() returns and c.host.Close() is called, those goroutines may still be calling c.host.Connect(...) or c.logger.Warn(), causing use-after-close.

Consider passing a sync.WaitGroup into the goroutine spawner, or using the maintenanceWg for all child goroutines:

```go
c.maintenanceWg.Add(1)
go func(info peer.AddrInfo) {
	defer c.maintenanceWg.Done()
	// ...
}(sp)
```

Major: refreshPeerDiscovery blocks the maintenance goroutine

pkg/p2p/client.go:396-409

refreshPeerDiscovery drains the peerCh channel inline inside the maintenance goroutine's select loop. Since FindPeers can return many peers and the semaphore is bounded to 16 workers, this loop can block for a significant amount of time — preventing reconnectCh events from being processed in that window. Run the refresh in its own goroutine:

```go
case <-discoveryTicker.C:
	go c.refreshPeerDiscovery(ctx) // don't block the select loop
```

Moderate: attempts counter is incremented before the connect attempt

pkg/p2p/client.go:364

```go
st.attempts++ // incremented before the dial
// ...
go func(info peer.AddrInfo) {
	if err := c.host.Connect(ctx, info); err != nil { /* ... */ }
}(sp)
```

attempts is incremented regardless of whether the connect will succeed. If it succeeds, the counter remains inflated and the next disconnect will use a longer-than-necessary backoff. Increment attempts only on failure (inside the goroutine, on error).


Moderate: go.mod local replace directives should not be committed

apps/evm/go.mod:5-8, apps/testapp/go.mod:5

These replace directives tie the build to ../../ relative paths, which work only inside the monorepo checkout. If any consumer imports github.com/evstack/ev-node/apps/evm as a versioned module, the build will fail. Use a go.work file (already standard for Go monorepos) for local development overrides, and keep go.mod free of relative paths.
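For local development, a repository-root `go.work` keeps the relative-path overrides out of the committed `go.mod` files. A sketch only: the `go` version and module list here are assumed from this repo's layout, not taken from the PR:

```
go 1.24

use (
	.
	./apps/evm
	./apps/testapp
)
```

Workspace mode applies these overrides automatically for anyone working inside the checkout, while consumers importing the modules by version are unaffected.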


Minor: Advertise errors are silently discarded

pkg/p2p/client.go:388

```go
_ = c.advertise(ctx)
```

If advertisement fails, this node becomes undiscoverable to new peers and the error is swallowed. At minimum this should be logged at Warn level:

```go
if err := c.advertise(ctx); err != nil && ctx.Err() == nil {
	c.logger.Warn().Err(err).Msg("peer re-advertisement failed")
}
```

Minor: Concurrency inconsistency between initial findPeers and refreshPeerDiscovery

pkg/p2p/client.go:512-516 vs pkg/p2p/client.go:396-409

Initial findPeers spawns unbounded goroutines (go c.tryConnect(...)), while refreshPeerDiscovery uses the connectSem semaphore. Either both should use the semaphore, or the existing connectSem is only relevant for the refresh path (which should be documented).


Positive observations

  • Using context.Background() for the maintenance context (rather than the Start() context) is correct — the maintenance loop should survive beyond startup.
  • The disconnectNotifee interface implementation is clean and all no-op methods are properly stubbed.
  • Exponential backoff with min(st.attempts, 6) cap is well-designed to avoid overflow.
  • The for attempt := range maxAttempts refactor in the signer packages is a clean Go 1.22+ modernization.
  • connectSem as a buffered channel semaphore is idiomatic Go.
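The buffered-channel semaphore pattern noted above can be shown in isolation. This is a stdlib-only sketch; `peakConcurrency` is an illustrative harness, not code from this PR:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peakConcurrency runs `jobs` goroutines gated by a buffered-channel
// semaphore of size `limit` and reports the highest observed concurrency.
func peakConcurrency(jobs, limit int) int64 {
	sem := make(chan struct{}, limit)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks at capacity)
			defer func() { <-sem }() // release the slot
			n := atomic.AddInt64(&inFlight, 1)
			for { // record the peak via CAS
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 16 matches the review's mention of the 16-worker bound on connectSem.
	fmt.Println(peakConcurrency(100, 16) <= 16) // true
}
```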

Testing

Patch coverage is at 33.7% with 57 lines uncovered in client.go. The new reconnect/maintenance paths have no test coverage. Given the concurrency bugs found above, tests are especially important here. Consider:

  • A test using mocknet that simulates a seed peer disconnect and verifies reconnection
  • A test verifying that Close() waits for in-flight goroutines to complete
  • A test for the backoff window behavior (events during cooldown get scheduled, not dropped)


@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 33.70787% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.23%. Comparing base (4a70e0b) to head (2ff6cee).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| `pkg/p2p/client.go` | 32.18% | 57 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3212      +/-   ##
==========================================
- Coverage   61.43%   61.23%   -0.21%
==========================================
  Files         120      120
  Lines       12504    12591      +87
==========================================
+ Hits         7682     7710      +28
- Misses       3960     4017      +57
- Partials      862      864       +2
```

| Flag | Coverage Δ |
|---|---|
| combined | 61.23% <33.70%> (-0.21%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@julienrbrt julienrbrt marked this pull request as ready for review March 31, 2026 16:08
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
apps/testapp/go.mod (1)

5-5: Keep this replace local to development.

This ties the module to ../../. being present at build time. That works inside this checkout, but not when the submodule is built outside the monorepo layout. Prefer a local go.work or an uncommitted dev override instead. Please verify this module is not expected to build outside the repository layout; if apps/testapp is intentionally monorepo-only, feel free to ignore this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/testapp/go.mod` at line 5, The committed go.mod contains a local replace
directive ("replace github.com/evstack/ev-node => ../../.") that forces a
monorepo layout; remove this replace from the tracked go.mod and instead keep it
as an uncommitted developer-only override (or move the replacement into a
go.work used only in development), or if this module is intended to be
monorepo-only, document that restriction; update the repository by deleting the
replace line from the committed go.mod and adding a dev-only go.work or
.gitignore'd override as appropriate so builds outside the repo layout are not
broken.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: The go.mod contains committed relative replace directives for
modules "github.com/evstack/ev-node" and
"github.com/evstack/ev-node/execution/evm" (the two replace (...) entries) which
will break consumers outside the monorepo; remove these relative replace lines
from apps/evm/go.mod before committing and, for local development only, move
them into a go.work file or use developer instructions to run `go work use` so
local overrides remain available but are not shipped in the module's go.mod.

In `@pkg/p2p/client.go`:
- Around line 343-355: The reconnect loop currently only receives signals from
reconnectCh (fed by Disconnected) so a failed host.Connect leaves the peer
without future triggers, and the current logic drops events that arrive during
the backoff window; update the reconnect handling (references: reconnectCh,
Disconnected, host.Connect, st.attempts, st.lastAttempt, reconnectCooldown,
maxReconnectCooldown) so that on a connect failure you re-schedule another retry
(e.g., push a retry event into reconnectCh or start a time.AfterFunc/goroutine
to re-enqueue after computed backoff) and when an event arrives inside the
backoff window do not drop it but schedule it to be processed after the
remaining backoff interval (use a timer to requeue into reconnectCh); apply the
same scheduling change to the other similar block (lines around the second
instance handling backoff) so transient outages cause repeated retries rather
than a single attempt.
- Line 199: The maintenance goroutine is started with the caller-supplied ctx so
timeouts/cancels on Start() prematurely stop reconnection/discovery; change this
so maintenance uses its own background context owned by the client and only
stops when Close() is called. Create a maintenance context via
context.WithCancel(context.Background()) (store the cancel func on the client
struct), pass that maintenance context into startConnectionMaintenance (and the
other maintenance/startDiscovery loops currently using the Start() ctx around
lines 317-320), and invoke the stored cancel in Close(); do not use the Start()
ctx for these long-lived loops.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c692055-7514-4e34-ad88-0d6d76c89f7d

📥 Commits

Reviewing files that changed from the base of the PR and between 4a70e0b and 32dcf35.

⛔ Files ignored due to path filters (2)
  • apps/evm/go.sum is excluded by !**/*.sum
  • apps/testapp/go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • CHANGELOG.md
  • apps/evm/go.mod
  • apps/testapp/go.mod
  • pkg/p2p/client.go
  • pkg/signer/aws/signer.go
  • pkg/signer/gcp/signer.go

Comment on lines +343 to +355
```go
if time.Since(st.lastAttempt) > maxReconnectCooldown {
	st.attempts = 0
}

backoff := reconnectCooldown * time.Duration(1<<min(st.attempts, 6))
if backoff > maxReconnectCooldown {
	backoff = maxReconnectCooldown
}
if time.Now().Before(st.lastAttempt.Add(backoff)) {
	continue
}
st.lastAttempt = time.Now()
```

Contributor


⚠️ Potential issue | 🟠 Major

Backoff never produces a second reconnect attempt.

reconnectCh is only fed by Disconnected, so a failed host.Connect leaves the peer with no future trigger. Line 352 also drops any event that arrives inside the current backoff window instead of scheduling it for later. That turns transient outages into a single reconnect attempt.

🔧 One way to keep the backoff loop alive
```diff
+func (c *Client) scheduleReconnect(ctx context.Context, pid peer.ID, delay time.Duration) {
+	time.AfterFunc(delay, func() {
+		select {
+		case <-ctx.Done():
+		case c.reconnectCh <- pid:
+		default:
+		}
+	})
+}
+
 func (c *Client) startConnectionMaintenance(parentCtx context.Context) {
 	ctx, cancel := context.WithCancel(parentCtx)
 	c.maintenanceCancel = cancel
@@
-				if time.Now().Before(st.lastAttempt.Add(backoff)) {
-					continue
-				}
+				if wait := time.Until(st.lastAttempt.Add(backoff)); wait > 0 {
+					c.scheduleReconnect(ctx, pid, wait)
+					continue
+				}
@@
-					go func(info peer.AddrInfo) {
+					go func(info peer.AddrInfo, retryAfter time.Duration) {
 						if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
 							c.logger.Warn().Str("peer", info.ID.String()).Err(err).Msg("failed to reconnect to seed peer")
+							c.scheduleReconnect(ctx, info.ID, retryAfter)
 						}
-					}(sp)
+					}(sp, backoff)
```

Also applies to: 364-370


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/p2p/client.go (1)

351-352: ⚠️ Potential issue | 🟠 Major

Backoff still turns a transient seed outage into a single retry.

At Line 351 the event is dropped if the peer is still inside its backoff window, and at Line 367 a failed host.Connect only logs. After that first failed dial there is no future trigger for that seed peer, so it can remain disconnected indefinitely.

Also applies to: 367-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 351 - 352, The code currently skips
processing a seed peer if time.Now().Before(st.lastAttempt.Add(backoff)) and
only logs on host.Connect failure, which means a transient outage may never be
retried; change this so that when a peer is within backoff you requeue/schedule
it for a future attempt instead of dropping the event (use a timer or push it
back onto the attempt queue and keep state in st), and when host.Connect(peer)
fails update st.lastAttempt and also schedule a follow-up retry (rather than
only logging) so the peer will be retried after backoff; make these changes
around the references to st.lastAttempt.Add(backoff) and the host.Connect call
to ensure failed dials trigger future retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 317-321: startConnectionMaintenance currently only handles future
Disconnected events so seed peers that are already disconnected never get
enqueued; before the maintenance select loop begins (inside the maintenanceWg.Go
started by startConnectionMaintenance), iterate c.seedPeers and for each peer
that is not currently connected (use existing connection state check used
elsewhere in the file), send that peer into c.reconnectCh once so they will be
retried; ensure sends won't block the worker (use a non-blocking select with
default or spawn a short goroutine per send) and rely on
refreshPeerDiscovery/c.getNamespace as before for discovery updates.
- Around line 388-389: The current loop repeatedly calls c.advertise(ctx) which
spawns a background re-advertisement goroutine each time (discutil.Advertise
runs until ctx.Done()), causing a goroutine leak; fix by invoking
c.advertise(ctx) exactly once from the maintenance context (e.g., start it when
maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d4f0858-895a-4ea8-aa2e-46955a48c244

📥 Commits

Reviewing files that changed from the base of the PR and between 32dcf35 and 2ff6cee.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +317 to +321
```go
func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Go(func() {
```
Contributor


⚠️ Potential issue | 🟠 Major

Queue currently disconnected seed peers when maintenance starts.

This worker only reacts to future Disconnected events. A seed peer that never connected during startup, or dropped before the notifee was registered, never enters reconnectCh, and refreshPeerDiscovery only searches c.getNamespace(). Seed any currently disconnected c.seedPeers once before the select loop begins.
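A minimal sketch of that non-blocking seeding. `seedReconnects` is an illustrative stand-in that uses string peer IDs rather than `peer.ID`; the real code would iterate `c.seedPeers` and check connectedness against the host:

```go
package main

import "fmt"

// seedReconnects enqueues every currently disconnected seed peer without
// blocking; peers that don't fit in the buffer are skipped (the periodic
// discovery refresh would pick them up later).
func seedReconnects(reconnectCh chan string, disconnected []string) int {
	queued := 0
	for _, pid := range disconnected {
		select {
		case reconnectCh <- pid:
			queued++
		default:
			// channel full: don't block the maintenance worker
		}
	}
	return queued
}

func main() {
	ch := make(chan string, 2)
	n := seedReconnects(ch, []string{"peerA", "peerB", "peerC"})
	fmt.Println(n, len(ch)) // 2 2 — third send dropped, worker never blocked
}
```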


Comment on lines +388 to +389
_ = c.advertise(ctx)

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For github.com/libp2p/go-libp2p v0.48.0, what does p2p/discovery/util.Advertise do? Does it start a background re-advertisement loop tied to ctx.Done(), or is it a single synchronous advertise call?

💡 Result:

In github.com/libp2p/go-libp2p/p2p/discovery/util v0.48.0, util.Advertise(ctx, a, ns, ...) starts a background goroutine that re-advertises in a loop until ctx.Done() is closed; it is not a single synchronous advertise call.

Fix goroutine leak in advertise refresh loop.

Calling c.advertise(ctx) every 5 minutes at line 388 spawns a new background goroutine with each call. In libp2p v0.48.0, discutil.Advertise starts a re-advertisement loop that runs until ctx.Done() — it is not a single synchronous call. Stacking these goroutines will leak resources for the client's lifetime.

Start c.advertise(ctx) once from the maintenance context and let the ticker only refresh peer discovery via FindPeers, or switch to a single-shot advertise call that does not spawn a background loop.


