Fix multiple Ping and assertion failure in Discovery #5483
Codecov Report
@@ Coverage Diff @@
## master #5483 +/- ##
==========================================
- Coverage 61.9% 61.89% -0.02%
==========================================
Files 345 345
Lines 28702 28725 +23
Branches 3266 3266
==========================================
+ Hits 17768 17779 +11
- Misses 9768 9778 +10
- Partials 1166 1168 +2 |
|
The assert failure still happens here. |
|
Still crashes :) |
|
This still doesn't fix assertion failure, just improves handling of attempts to Ping multiple times. But it gets quite confusing and difficult to keep in mind all the possible cases. I'm thinking now that |
libp2p/NodeTable.cpp
Outdated
| if (_ec || m_timers.isStopped()) | ||
| return; | ||
|
|
||
| if (contains(m_sentPings, _nodeEntry.id)) |
Should we maybe log something here?
libp2p/NodeTable.cpp
Outdated
| if (contains(m_sentPings, _nodeEntry.id)) | ||
| return; | ||
|
|
||
| NodeIPEndpoint src; |
(Nit) Can be combined onto one line?
libp2p/NodeTable.cpp
Outdated
| PingNode p(src, _nodeEntry.endpoint); | ||
| p.ts = nextRequestExpirationTime(); | ||
| auto const pingHash = p.sign(m_secret); | ||
| LOG(m_logger) << p.typeName() << " to " << _nodeEntry.id << "@" << p.destination; |
Can we just log the _nodeEntry here to get both the ID and the destination endpoint?
|
It turned out to be quite a lot of changes, but I think it hasn't gotten more complicated, at least.
Don't put not-yet-validated nodes there, nor the ones that don't fit into the bucket and are replacement nodes for evicted ones. Replacement nodes are kept only in the m_sentPings items.
3ef9cf6 to 0daaaa5
|
Rebased and addressed @halfalicious's comments |
libp2p/NodeTable.h
Outdated
| void processEvents(); | ||
|
|
||
| /// Add node to the list of all nodes and ping it to trigger the endpoint proof. | ||
| /// Starts async node adding tot the node table by pinging it to trigger the endpoint proof. |
Typo in the comment ("tot"), and I think it could be rephrased a little, e.g. "Starts async add of node to the node table..."
libp2p/NodeTable.h
Outdated
| /// In case the node is already in the node table, pings only if the endpoint proof expired. | ||
| /// | ||
| /// @return True if the node has been added. | ||
| /// @return True if the node id valid. |
Grammar: "If the node id is valid"
|
What were the cases where we would send double pings? And why would we hit the assertion? |
|
Cases when we sent double pings:
The assertion failure: |
| } | ||
|
|
||
| void NodeTable::noteActiveNode(Public const& _pubk, bi::udp::endpoint const& _endpoint) | ||
| void NodeTable::noteActiveNode(shared_ptr<NodeEntry> _nodeEntry, bi::udp::endpoint const& _endpoint) |
The changes in this method are only:
- shared_ptr<NodeEntry> instead of Public parameter.
- insert into m_allNodes together with inserting into the bucket
|
Tests seem to be fixed now, I think it's ready for review |
| // Don't sent Ping if one is already sent | ||
| if (contains(m_sentPings, _node.id)) | ||
| { | ||
| LOG(m_logger) << "Ignoring request to ping " << _node << ", because it's already pinged"; |
This is being logged quite a lot I should say
I think we benefit from more logging since it makes it easier to track what's going on and detect bugs, but wading through the log spew can be challenging. It would be nice to eventually have some additional log levels so we could have more targeted logging...in the meantime I think we should keep/add logs unless they significantly impact the readability of the Aleth logs.
Separating logs into INFO and DEBUG levels would be good. In INFO only actual changes (node added, node replaced by) and/or stats (X new nodes discovered).
The rest should go to DEBUG. Later we can consider splitting DEBUG into DEBUG and TRACE.
For aleth-bootnode the networking INFO level should be enabled by default.
For aleth all networking logs should be disabled.
> For aleth-bootnode the networking INFO level should be enabled by default.
Agreed, filed #5499 to track this
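For reference, the INFO/DEBUG/TRACE split proposed above could be sketched roughly as follows. This is only an illustration: `Verbosity`, `defaultNetVerbosity` and `shouldLog` are hypothetical names invented here, not part of Aleth's logging API.

```cpp
#include <string>

// Hypothetical severity levels for the networking log channel.
enum class Verbosity { Silent = 0, Info = 1, Debug = 2, Trace = 3 };

// Assumed per-binary defaults, following the proposal in this thread:
// aleth-bootnode shows networking INFO, aleth shows no networking logs.
inline Verbosity defaultNetVerbosity(std::string const& _binary)
{
    return _binary == "aleth-bootnode" ? Verbosity::Info : Verbosity::Silent;
}

// A message is emitted only if its level does not exceed the configured one.
inline bool shouldLog(Verbosity _configured, Verbosity _message)
{
    return _message != Verbosity::Silent &&
           static_cast<int>(_message) <= static_cast<int>(_configured);
}
```

With this scheme, "node added" style messages would be tagged Info, while messages such as "Ignoring request to ping" would be tagged Debug and stay hidden by default.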
| entry = createNodeEntry(_node, 0, 0); | ||
| else | ||
| entry = it->second; | ||
| needToPing = (it == m_allNodes.end() || !it->second->hasValidEndpointProof()); |
Should we also check m_sentPings? What if we sent a ping recently but just haven't received a response yet?
yeah good point, it would improve things, but unfortunately it's not easy to do it here, because m_sentPings is accessed only from the network thread...
We could either make addNode asynchronous; or we could create another private method for adding nodes from the packet handlers (but it would be very similar to addNode); or maybe we could check before sending Ping in ping() whether the node already happened to be validated, then skip additional Ping.
All this seems quite ugly to me for now. I'd suggest first observing how often we get double pings because of this problem, then deciding whether it makes sense to complicate the code...
Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger with doing this is that there's a delay of the ping not actually executing until after the UDP datagram time to live (1 minute) which means that the ping is already considered expired when it gets sent out over the wire. I don't know much about boost deadline timers but I don't think this can realistically happen, even if there are a lot of boost deadline timers expiring around the same time (since you'd have to execute a lot of handlers to consume 1 minute of wall clock time and I don't think we create that many deadline timers). Another possible way to address this would be via something like m_queuedPings, and we could check both this and m_sentPings before deciding if we should queue a new ping. |
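A minimal sketch of the m_queuedPings idea from the comment above, under stated assumptions: `PingTracker` and its methods are hypothetical, `NodeID` is simplified to a string, and a mutex is added on the assumption that pings may be queued from outside the network thread. It is not the code in this PR.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the real NodeID type.
using NodeID = std::string;

// Tracks pings that are queued (scheduled, not yet sent) and pings that are
// in flight (sent, Pong not yet received), so a second ping is never queued.
class PingTracker
{
public:
    // Returns true if a ping was queued, false if one is already queued
    // or in flight for this node.
    bool tryQueue(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_sentPings.count(_id) || m_queuedPings.count(_id))
            return false;
        m_queuedPings.insert(_id);
        return true;
    }

    // Called when the queued ping actually goes out on the wire.
    void markSent(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_queuedPings.erase(_id))
            m_sentPings.insert(_id);
    }

    // Called when a valid Pong arrives or the ping expires.
    void markDone(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_sentPings.erase(_id);
    }

private:
    std::mutex m_mutex;
    std::unordered_set<NodeID> m_queuedPings;
    std::unordered_set<NodeID> m_sentPings;
};
```

The trade-off is one more container to keep consistent, which is the kind of extra complexity the earlier reply suggests weighing against how often double pings actually occur.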
| ping(*entry); | ||
| if (needToPing) | ||
| { | ||
| LOG(m_logger) << "Pending " << _node; |
This log and "Adding node " are redundant. How about we log either "Pending" or "Added"?
| LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node " << _pubk | ||
| << "@" << _endpoint; | ||
| LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node " | ||
| << _nodeEntry->id << "@" << _endpoint; |
| if (!_nodeEntry->hasValidEndpointProof()) | ||
| return; | ||
|
|
||
| LOG(m_logger) << "Active node " << _nodeEntry->id << '@' << _endpoint; |
Filed #5500 to track this |
Addresses #5471 and probably some things from #5484
Summary of changes:
- [...] Ping (by looking at m_sentPings) before sending another one
- ping method is split into ping and schedulePing - when we are already in the network thread, we can just directly access m_sentPings without additional m_timers.schedule(0, ...). So now when we need to ping from the network thread we just call ping method without scheduling
- m_allNodes now contains only the nodes from the node table buckets.
  - [...] (Pong not received yet) and the nodes that didn't fit into the bucket, and wait for older node to be evicted, are not put into m_allNodes
  - [...] m_allNodes only in noteActiveNode together with being put into the bucket.

  This allows us not to care about erasing from m_allNodes the nodes that didn't get validated or are being thrown away when eviction ends with the old node answering.
  (maybe m_allNodes should be renamed now)
- [...] m_sentPings items (as a shared_ptr<NodeEntry> replacementNodeEntry member). They are dropped when we erase from m_sentPings in Pong handler.
- [...] NodeIPEndpoint in m_sentPings now, because this data is sent to us in Ping (or in Neighbours) and when we receive it from an unknown node, we have to save it somewhere until the moment when we'll add it to the node table and to the m_allNodes.
  (actually I think it's only TCP port number that is important to save, the rest is overwritten from the actual source of UDP packet anyway. But I expect this part to be changed somewhat when addressing the problem described in Discovery: additional security check before sending Neighbours #5455 (comment). I expect m_sentPings to have in the future the UDP endpoint as a map key instead of NodeID)
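The bookkeeping described in this summary can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: `MiniNodeTable`, `SentPing`, `onPong` and `isKnown` are hypothetical names, `NodeID` and the endpoint are reduced to strings, and buckets, eviction, timers and threading are left out. It is not the actual NodeTable code.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

using NodeID = std::string;  // hypothetical stand-in for the real NodeID

struct NodeEntry
{
    NodeID id;
    std::string endpoint;  // stand-in for NodeIPEndpoint
    bool validEndpointProof = false;
};

// What is remembered about a Ping that has not been answered yet.
struct SentPing
{
    std::string endpoint;                             // where the Ping was sent
    std::shared_ptr<NodeEntry> replacementNodeEntry;  // set only during eviction
};

class MiniNodeTable
{
public:
    // Returns false if a Ping to this node is already in flight.
    bool ping(NodeEntry const& _node, std::shared_ptr<NodeEntry> _replacement = {})
    {
        if (m_sentPings.count(_node.id))
            return false;
        m_sentPings[_node.id] = SentPing{_node.endpoint, std::move(_replacement)};
        return true;
    }

    // Pong handler: the node becomes known only after it has been validated.
    bool onPong(NodeID const& _id)
    {
        auto it = m_sentPings.find(_id);
        if (it == m_sentPings.end())
            return false;  // unsolicited Pong, ignore
        auto entry = std::make_shared<NodeEntry>(NodeEntry{_id, it->second.endpoint, true});
        m_sentPings.erase(it);  // this also drops any replacementNodeEntry
        noteActiveNode(std::move(entry));
        return true;
    }

    bool isKnown(NodeID const& _id) const { return m_allNodes.count(_id) != 0; }

private:
    // The only place that inserts into m_allNodes; in the real table the
    // entry would be put into its bucket here as well.
    void noteActiveNode(std::shared_ptr<NodeEntry> _entry)
    {
        NodeID const id = _entry->id;
        m_allNodes[id] = std::move(_entry);
    }

    std::map<NodeID, SentPing> m_sentPings;
    std::map<NodeID, std::shared_ptr<NodeEntry>> m_allNodes;
};
```

Because insertion into m_allNodes happens in one place, nodes that never answer simply expire from m_sentPings and never need to be erased from m_allNodes.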