Garbage collect incompatible peers in Host::run()#5624
Conversation
Garbage collect peers in Host::run() which are deemed incompatible - an incompatible peer is one with which we will never be able to connect to successfully peer with (e.g. on a different chain or network, not running any capabilities).
libp2p/Peer.cpp
Outdated
| case UserReason: | ||
| return numeric_limits<unsigned>::max(); | ||
| case TooManyPeers: | ||
| return 25 * (m_failedAttempts + 1); |
There was a problem hiding this comment.
m_failedAttempts seems to affect only the value of fallbackSeconds currently. Maybe we should use it now to retry several times and then go to "critical error, disconnect" state.
(at least for some cases of failures)
There was a problem hiding this comment.
@gumb0 Good idea - what do you think of this being taken care of in another PR? I'd like to limit the amount of changes I make to the peer gc logic in this PR so if something ends up breaking it will be easier to debug.
There was a problem hiding this comment.
Ok for another PR, but it seems to be the matter of only adding condition like m_failedAttempts >= 20 to uselessPeer function.
There was a problem hiding this comment.
Additional thought is that we could make all this change more conservative if we leave fallbackSeconds() as it was before (so that we do the same reconnects as before) and have just this check for failed attempts count in uselessPeer (plus the new check for handshake failures)
This way it would reconnect with the same intervals for each case as before, but stop after limited number of attempts.
But I'm fine with it if you think it's better to immediately stop in some cases.
Rather than use a magic number to indicate a useless peer (numeric_limits<unsigned>::max()), add a function to Peer (Peer::uselessPeer) which can be called to detect this. Also update Peer::fallbackSeconds() to return a number which is large enough to prevent us from realistically ever connecting to a useless peer but which is small enough to not overflow when added to Peer::m_lastAttempted. Also, log a message with error verbosity in Host::handshakeFailed() if a peer is not found rather than asserting, since while we never expect users to hit this case, the EIP8 handshake tests can.
Update Peer::uselessPeer() to check peer type and return false if it's a required peer, since in this case the user obviously wants to try to stay connected to this peer.
Codecov Report
@@ Coverage Diff @@
## master #5624 +/- ##
==========================================
+ Coverage 62.62% 62.63% +<.01%
==========================================
Files 350 350
Lines 29690 29742 +52
Branches 3344 3350 +6
==========================================
+ Hits 18593 18628 +35
- Misses 9889 9901 +12
- Partials 1208 1213 +5 |
gumb0
left a comment
There was a problem hiding this comment.
Some minor comments and a small bug in the loop iterating over peers.
libp2p/Host.h
Outdated
| std::shared_ptr<Peer> peer(NodeID const& _n) const; | ||
|
|
||
| /// Set a handshake failure reason for a peer | ||
| void handshakeFailed(NodeID const& _n, HandshakeFailureReason _r); |
There was a problem hiding this comment.
Also I would make it public and declare close to startPeerSession
There was a problem hiding this comment.
@gumb0 : Why make this public, does it make sense to expose the concept of a handshake to consumers of Host?
There was a problem hiding this comment.
Well it looks to me like a callback similar to startPeerSession, the callback called by RLPXHandshake when handshake is finished. One is for success another one is for failure.
It works as private, because RLPXHandshake is a friend of Host, but idealluy we should get rid of this friend declarations at some point.
Exposing it to the clients of Host is of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it to Host clients, but pass it only to RLPXHandshake. That's a bit complicated change, at least it's not for this PR.
(In other words, we won't make it much worse, because startPeerSession is already public)
There was a problem hiding this comment.
Well it looks to me like a callback similar to
startPeerSession, the callback called byRLPXHandshakewhen handshake is finished. One is for success another one is for failure.It works as private, because
RLPXHandshakeis a friend ofHost, but idealluy we should get rid of this friend declarations at some point.Exposing it to the clients of
Hostis of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it toHostclients, but pass it only toRLPXHandshake. That's a bit complicated change, at least it's not for this PR.(In other words, we won't make it much worse, because
startPeerSessionis already public)
Ah that makes sense, thank you for clarifying! 😄 I'll make the change before merging.
libp2p/Peer.cpp
Outdated
| case UselessPeer: | ||
| case IncompatibleProtocol: | ||
| case UnexpectedIdentity: | ||
| case UserReason: |
There was a problem hiding this comment.
Not sure about UserReason - this could in theory be any kind of reason specific to subprotocol, including some temporary reasons, maybe it makes sense to try to reconnect
There was a problem hiding this comment.
I chose to treat UserReason as terminal because we only use it in 2 cases - if status validation fails:
aleth/libethereum/BlockChainSync.cpp
Lines 190 to 202 in 505aead
Here's where we perform the actual validation in EthereumPeer::validate:
aleth/libethereum/EthereumPeer.cpp
Lines 58 to 68 in 505aead
- If there's a bug in our network code and Session read validation fails:
Lines 409 to 418 in 505aead
You bring up a good point though, technically it can be any reason specific to a subprotocol and other clients can also choose to send it to us for reasons specific to their implementation. I think that rather than treating it as immediately critical we should use it in some sort of reconnection count threshold.
There was a problem hiding this comment.
@gumb0 I've decided to change UserReason -> UselessPeer when status message validation fails for a peer (since if that happens the peer is effectively useless to us i.e. we can't sync with it). I've also added in logic in Peer::isUseless to take the number of failed connection attempts into account when the last disconnect isn't critical.
Status validation failing means we have no chance of syncing with the peer (e.g. it's running on a different network) so a UselessPeer disconnect reason makes more sense than UserReason. This also has the benefit of status failure validation being treated as a critical disconnect and these nodes being gc'd in Host::run.
gumb0
left a comment
There was a problem hiding this comment.
Looks good, feel free to make onHandshakeFailed public, if you agree with my reasoning, but it's not critical
Various minor changes in RLPxHandshake, Host, and Peer classes: * Initialize RLPxHandshake::m_failureReason in ctor * Reduce redundant code in RLPxHandshake which sets the last failure reason when TCP errors occur * Rename Host::handshakeFailed * Fix m_peers iteration bug in Host * Make Host::onHandshakeFailed public (we want to avoid using friend classes where possible) * Rename Peers::uselessPeer * Update deprecated comment in Peer * Take # of failed connection attempts into account in Peer function which determines if instance is useless or not Move fallback seconds computation for default case to anonymous namespace function
Garbage collect peers in
Host::run()which are deemed incompatible - an incompatible peer is one with which we will never be able to successfully peer with (e.g. on a different chain or network, not running any capabilities).