feat: persist peer bans for a configurable duration (A-1157)#23922
Conversation
Peer scores decay toward zero over time, so a peer that crossed the ban
threshold recovered to a healthy score within minutes, effectively making
bans toothless (see security advisory GHSA-h4vv-85x5-6hmh).
Persist a ban when a peer's score drops below the ban threshold, holding it
for a configurable duration (P2P_PEER_BAN_DURATION_SECONDS, default 24h):
- PeerScoring records {score, expiry} in an in-memory map (so getScore stays
synchronous for the peer-manager hot paths) and writes through to a
dedicated kv-store map for durability across restarts.
- While banned, getScore returns the persisted ban score regardless of decay,
so the peer cannot recover out of the ban early. Once the ban expires it is
removed and the live (decayed) score takes over, letting the peer recover.
- PeerScoring.new() restores active bans on startup, pruning expired ones in a
single transaction.
- PeerManager.stop() flushes pending ban writes so they are durable on
shutdown.
Update the peer-scoring docs to reflect that bans are now persisted for a configurable duration rather than decaying away within minutes: clarify the ban-floor behaviour, add a Ban Persistence section, fix the recovery example that assumed app-score decay un-bans a peer, and note P2P_PEER_BAN_DURATION_SECONDS.
Replace the hand-rolled promise-chain persistence queue with a SerialQueue (created and started when a store is configured). enqueueBanPersistence now puts onto the queue and flushBanPersistence awaits a syncPoint, matching the SerialQueue pattern used by the tx pool.
Persisting a ban only mattered if the peer was actually kept out. Banned peers were disconnected reactively each heartbeat, but the libp2p connection gater (isNodeAllowedToConnect, used by denyInboundEncryptedConnection) only checked failed auth handshakes, so a banned peer could reconnect inbound between heartbeats and resume sending messages. Reject peers with an active ban in isNodeAllowedToConnect so the gater refuses the connection during the noise handshake, before any protocol stream opens, for the full ban duration.
Introduce a single getActiveBanScore helper that returns a peer's persisted ban score while the ban is active, or undefined otherwise, lazily lifting an expired ban before returning undefined. getScore and maybeBanPeer now both go through it. This removes the duplicated ban-lookup/expiry logic and fixes an edge case: maybeBanPeer previously checked bannedPeers.has(), so an expired-but-not-yet- pruned record blocked a fresh ban. It now prunes the stale record and starts a new window, so a peer that re-offends after its ban expired is re-banned.
…1157) The ban threshold (MIN_SCORE_BEFORE_BAN) is a hardcoded constant. If it changes across a software upgrade, a persisted ban score from the old threshold may no longer cross the new one, which would otherwise pin the peer at a stale floor for the rest of the ban window. restoreBannedPeers now drops (and prunes) any ban whose score is no longer below MIN_SCORE_BEFORE_BAN, in addition to expired bans.
…nts (A-1157) isNodeAllowedToConnect took `string | PeerId` and was called with either a peer id or an IP, conflating two keys with different rules (bans are peer-id only). Replace it with isPeerAllowedToConnect (ban + failed-auth, used by the encrypted inbound gater and dialing) and isAddressAllowedToConnect (failed-auth only, used by the raw inbound gater), sharing a private isWithinFailedAuthLimit helper.
| private bannedPeers: Map<string, BanRecord> = new Map(); | ||
| /** The kv-store backing bans, kept so ban pruning can run in a single transaction. */ | ||
| private readonly kvStore?: AztecAsyncKVStore; | ||
| /** Backing store for bans, so they survive restarts. */ |
There was a problem hiding this comment.
Is surviving restarts worth the hassle of a store and keeping things in sync? Assuming that restarts are not very common and that a previously banned peer would get itself banned again.
There was a problem hiding this comment.
Perhaps not, I did consider it solely being in memory. I added the persistence just for completeness as I agree, it seems like something that will rarely come in use.
There was a problem hiding this comment.
Personally I'd prefer to remove the DB persistence code, just because (1) more code means more chances for mistakes (2) once code gets added, it rarely goes away even if it could.
But I'm approving since I don't feel strongly
There was a problem hiding this comment.
Have removed the persistence. It's reduced the changeset significantly.
…157) Adds the regression test recommended by GHSA-h4vv-85x5-6hmh: a peer banned via penalties must not be silently restored to Healthy once decayAllScores removes its decayed score entry. Fails against the pre-fix behaviour (peer reads back Healthy after ~hours of idle decay); passes now the ban is persisted.
| * lifted (removed in memory and in the store) before returning undefined, so callers never see a | ||
| * stale ban. | ||
| */ | ||
| private getActiveBanScore(peerId: string): number | undefined { |
There was a problem hiding this comment.
Is this the only point at which an expired ban is removed from the maps? Could the maps potentially grow indefinitely if we lose track of a peerId and never again ask getActiveBanScore on it?
There was a problem hiding this comment.
Yes, this should now be fixed.
| private bannedPeers: Map<string, BanRecord> = new Map(); | ||
| /** The kv-store backing bans, kept so ban pruning can run in a single transaction. */ | ||
| private readonly kvStore?: AztecAsyncKVStore; | ||
| /** Backing store for bans, so they survive restarts. */ |
There was a problem hiding this comment.
Personally I'd prefer to remove the DB persistence code, just because (1) more code means more chances for mistakes (2) once code gets added, it rarely goes away even if it could.
But I'm approving since I don't feel strongly
| expect(peerManager.isPeerAllowedToConnect(peerIdStr)).toBe(false); | ||
| expect(peerManager.isAddressAllowedToConnect(ipAddress)).toBe(false); | ||
| }); | ||
|
|
There was a problem hiding this comment.
Should/do we have a test for the case where the address is allowed to connect, but then we realize it's a known peer and we don't allow connection?
There was a problem hiding this comment.
Have added one under 'allows the address but denies the banned peer id'.
Per review feedback: persisting bans across restarts isn't worth the extra code and sync surface for something that will rarely matter — a still-bad peer simply re-earns its ban. Remove the kv-store wiring, SerialQueue, restoreBannedPeers/flushBanPersistence and the PeerScoring.new factory; bans are now held in memory only and cleared on restart. The configurable ban duration (P2P_PEER_BAN_DURATION_SECONDS, default 24h) and gater enforcement are unchanged. Also add a test for the two-stage connection gate: the raw-inbound gate allows an address (no peer id to match a ban) while the encrypted-inbound gate denies the banned peer id.
Expired bans were only pruned lazily when a peer's score was next queried, so a banned peer that disconnects and is never queried again would linger in the ban map. Add PeerScoring.pruneExpiredBans(), called from PeerManager.heartbeat alongside decayAllScores, to drop elapsed bans and bound the map's size.
…g iteration (A-1157)
Fixes A-1157. Addresses security advisory GHSA-h4vv-85x5-6hmh.
Problem
Peer scores decay toward zero (~0.9/minute). A peer whose score crossed the ban threshold (
MIN_SCORE_BEFORE_BAN = -100) recovered to a healthy score within approximately 1 hour.Fix
Record a ban when a peer's score drops below the ban threshold and hold it for a configurable duration (default 24h). Bans are kept in memory only and are cleared on restart — a restarted node re-learns bad peers from their behaviour rather than carrying stale bans across runs.
PeerScoringrecords{ score, expiry }in an in-memorybannedPeersmap, sogetScore/getScoreStatestay synchronous (required by the peer-manager hot paths, including a.sort()comparator).getScorereturns the ban score regardless of decay, so a peer cannot recover its way out of the ban early — even afterdecayAllScorescleans up the decayed live-score entry. Once the ban expires it is lifted and the live (decayed) score takes over, letting the peer recover.getActiveBanScore) and swept proactively each heartbeat viapruneExpiredBans(), so a banned peer that disconnects and is never queried again does not linger in the map.Configuration
New
P2P_PEER_BAN_DURATION_SECONDS(config fieldpeerBanDurationSeconds), default86400(24h). Registered infoundationenv vars and the P2P config mappings.Tests
peer_scoring.test.tscovers the full lifecycle, asserting both score values and states:peerBanDurationSecondsdrives the window (60s case);getScorestill returns the-150ban score (not0), keeping the peer Banned;pruneExpiredBansremoves expired bans but keeps active ones.Existing
peer_managerandpeer_scoringsuites pass; the previously existing "returns to Healthy after improving score" assertion was updated to reflect the new intended behaviour (a banned peer stays banned for the full window).