Fix multiple Ping and assertion failure in Discovery by gumb0 · Pull Request #5483 · ethereum/aleth

gumb0 · 2019-02-11T14:01:55Z

Addresses #5471 and probably some things from #5484

Summary of changes:

* check whether we've already sent a Ping (by looking at m_sentPings) before sending another one
* the ping method is split into ping and schedulePing: when we are already in the network thread, we can access m_sentPings directly without an additional m_timers.schedule(0, ...). So now when we need to ping from the network thread, we just call ping without scheduling
* m_allNodes now contains only the nodes from the node table buckets.
  So the nodes that are still pending (Pong not received yet) and the nodes that didn't fit into the bucket and are waiting for an older node to be evicted are not put into m_allNodes.
  Nodes are added to m_allNodes only in noteActiveNode, together with being put into the bucket.
  This means we don't have to care about erasing from m_allNodes the nodes that didn't get validated or that are thrown away when eviction ends with the old node answering.
  (maybe m_allNodes should be renamed now)
* replacement nodes are remembered only in the m_sentPings items (as a shared_ptr<NodeEntry> replacementNodeEntry member). They are dropped when we erase from m_sentPings in the Pong handler.
* we have to save the node's NodeIPEndpoint in m_sentPings now, because this data is sent to us in Ping (or in Neighbours), and when we receive it from an unknown node we have to keep it somewhere until we add the node to the node table and to m_allNodes.
  (actually I think only the TCP port number is important to save; the rest is overwritten from the actual source of the UDP packet anyway. But I expect this part to change somewhat when addressing the problem described in Discovery: additional security check before sending Neighbours #5455 (comment).
  I expect m_sentPings to have the UDP endpoint as the map key instead of NodeID in the future)
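
The first two points of the summary above can be illustrated with a small sketch. This is a simplified model, not the code from this PR: `PingTracker`, `SentPing`, `runNetworkThread` and the string-based `NodeID` are hypothetical names, the task queue only stands in for `m_timers.schedule(0, ...)`, and no UDP traffic is sent. It assumes that everything touching `m_sentPings` runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for illustration; these are not Aleth's real types.
using NodeID = std::string;

struct SentPing
{
    std::string endpoint;  // remembered until the matching Pong arrives
};

class PingTracker
{
public:
    // Synchronous variant, assumed to run on the network thread only.
    // Returns false when a Ping to this node is already awaiting a Pong.
    bool ping(NodeID const& _id, std::string const& _endpoint)
    {
        if (m_sentPings.count(_id) != 0)
            return false;
        m_sentPings.emplace(_id, SentPing{_endpoint});
        ++m_packetsSent;  // stands in for the actual UDP send
        return true;
    }

    // Asynchronous variant for callers outside the network thread:
    // it only queues the work, and the duplicate check runs later inside ping().
    void schedulePing(NodeID _id, std::string _endpoint)
    {
        m_tasks.push_back(
            [this, id = std::move(_id), ep = std::move(_endpoint)] { ping(id, ep); });
    }

    // Pong handler: forget the pending Ping so the node can be pinged again.
    void onPong(NodeID const& _id) { m_sentPings.erase(_id); }

    // Drains the queue, standing in for the network thread's event loop.
    void runNetworkThread()
    {
        while (!m_tasks.empty())
        {
            auto task = std::move(m_tasks.front());
            m_tasks.pop_front();
            task();
        }
    }

    std::size_t packetsSent() const { return m_packetsSent; }

private:
    std::unordered_map<NodeID, SentPing> m_sentPings;
    std::deque<std::function<void()>> m_tasks;
    std::size_t m_packetsSent = 0;
};
```

In this model a second `ping` for the same node is a no-op until `onPong` has erased the pending entry, which is the behaviour the first bullet describes.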

codecov-io · 2019-02-11T15:00:46Z

Codecov Report

Merging #5483 into master will decrease coverage by 0.01%.
The diff coverage is 85.21%.

@@            Coverage Diff             @@
##           master    #5483      +/-   ##
==========================================
- Coverage    61.9%   61.89%   -0.02%     
==========================================
  Files         345      345              
  Lines       28702    28725      +23     
  Branches     3266     3266              
==========================================
+ Hits        17768    17779      +11     
- Misses       9768     9778      +10     
- Partials     1166     1168       +2

chfast · 2019-02-12T11:15:49Z

The assert failure still happens here.

chfast · 2019-02-12T14:28:30Z

Still crashes :)

gumb0 · 2019-02-12T17:10:23Z

This still doesn't fix the assertion failure; it just improves handling of attempts to Ping multiple times.

But it gets quite confusing and difficult to keep all the possible cases in mind. I'm now thinking that m_allNodes shouldn't contain both the pending nodes and the nodes already in the node table. Maybe we'll create the NodeEntry only when it's actually added to the node table. Then there won't be a need for this assert.

halfalicious · 2019-02-13T04:34:30Z

libp2p/NodeTable.cpp

        if (_ec || m_timers.isStopped())
            return;

+        if (contains(m_sentPings, _nodeEntry.id))


Should we maybe log something here?

halfalicious · 2019-02-13T04:47:23Z

libp2p/NodeTable.cpp

+    if (contains(m_sentPings, _nodeEntry.id))
+        return;
+
+    NodeIPEndpoint src;


(Nit) Can be combined onto one line?

halfalicious · 2019-02-13T04:48:43Z

libp2p/NodeTable.cpp

+    PingNode p(src, _nodeEntry.endpoint);
+    p.ts = nextRequestExpirationTime();
+    auto const pingHash = p.sign(m_secret);
+    LOG(m_logger) << p.typeName() << " to " << _nodeEntry.id << "@" << p.destination;


Can we just log the _nodeEntry here to get both the ID and the destination endpoint?

gumb0 · 2019-02-14T17:30:38Z

It turned out to be quite a lot of change, but I think it hasn't got more complicated at least.

Don't put not yet validated nodes there, neither the ones that don't fit to the bucket and are replacement nodes for evicted ones. Replacement nodes are kept only in the m_sentPing items.

gumb0 · 2019-02-14T17:38:35Z

Rebased and addressed @halfalicious's comments

halfalicious · 2019-02-14T19:42:29Z

libp2p/NodeTable.h

    void processEvents();

-    /// Add node to the list of all nodes and ping it to trigger the endpoint proof.
+    /// Starts async node adding tot the node table by pinging it to trigger the endpoint proof.


Typo in comment (tot), and I think it could be rephrased a little bit, e.g. "Starts async add of node to the node table..."

halfalicious · 2019-02-14T19:43:09Z

libp2p/NodeTable.h

+    /// In case the node is already in the node table, pings only if the endpoint proof expired.
    ///
-    /// @return True if the node has been added.
+    /// @return True if the node id valid.


Grammar: "if the node ID is valid"

libp2p/NodeTable.cpp

halfalicious · 2019-02-14T19:50:00Z

What were the cases where we would send double pings? And why would we hit the assertion?

gumb0 · 2019-02-15T10:08:22Z

Cases when we sent double pings:

* We receive a Neighbours packet that has some nodes which should go into the same bucket. But the bucket is full, so when trying to add the first new node, we start eviction by pinging the oldest node of the bucket. Before the Ping arrives, we add subsequent new nodes, and this leads to pinging the oldest node again.
  (after the fix it ignores the attempts to add the subsequent new nodes)
* Sometimes we've already scheduled initiating a ping (with m_timers.schedule(0, ...)); if another request to add the same node happens immediately in succession, we schedule another ping (i.e. we don't look into io_service to check whether we already plan to ping soon). I believe this can sometimes still happen after these fixes, but it should be less frequent after the split into ping/schedulePing (regular ping now doesn't do scheduling with m_timers)

The assertion failure:
The reason was m_allNodes and m_sentPings getting out of sync, i.e. m_sentPings (in the Pong handler) contained the ID of some node that had already been erased from m_allNodes.
I think this could have happened when we ping a node that is already considered a "replacement node" (one that we already tried to add to the bucket, but the bucket was full and therefore we pinged the oldest node instead), but then the Pong from the oldest node arrives and we erase the replacement node from m_allNodes before the Pong of the replacement node arrives.
The fix now is just to not put replacement nodes into m_allNodes.
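
A minimal sketch of that fix, under stated assumptions: `TinyTable`, its single bucket, `onPingTimeout` and `pendingPings` are hypothetical simplifications and not the real `NodeTable`. The point it tries to show is that a replacement node lives only inside the pending ping record, so `m_allNodes` and `m_sentPings` cannot disagree about it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; these are not Aleth's real types.
using NodeID = std::string;

struct NodeEntry
{
    NodeID id;
};

struct SentPing
{
    // Null unless this ping is an eviction check for a full bucket.
    std::shared_ptr<NodeEntry> replacement;
};

class TinyTable
{
public:
    explicit TinyTable(std::size_t _bucketSize) : m_bucketSize(_bucketSize) {}

    // Simplified take on noteActiveNode: a node enters m_allNodes only
    // together with being put into the bucket.
    void noteActiveNode(std::shared_ptr<NodeEntry> _node)
    {
        if (m_allNodes.count(_node->id) != 0)
            return;
        if (m_bucket.size() < m_bucketSize)
        {
            m_allNodes[_node->id] = _node;
            m_bucket.push_back(std::move(_node));
            return;
        }
        if (m_bucket.empty())
            return;  // zero-capacity bucket, nothing to evict
        // Bucket is full: ping the oldest node and park the newcomer
        // in the ping record instead of in m_allNodes.
        NodeID const oldest = m_bucket.front()->id;
        if (m_sentPings.count(oldest) != 0)
            return;  // eviction already in progress, ignore further candidates
        m_sentPings[oldest] = SentPing{std::move(_node)};
    }

    // The oldest node answered: it stays, and the replacement disappears
    // together with the erased ping record.
    void onPong(NodeID const& _id) { m_sentPings.erase(_id); }

    // The oldest node did not answer: evict it and promote the replacement.
    void onPingTimeout(NodeID const& _id)
    {
        auto it = m_sentPings.find(_id);
        if (it == m_sentPings.end())
            return;
        auto replacement = std::move(it->second.replacement);
        m_sentPings.erase(it);
        if (!replacement)
            return;
        for (auto b = m_bucket.begin(); b != m_bucket.end(); ++b)
            if ((*b)->id == _id)
            {
                m_bucket.erase(b);
                break;
            }
        m_allNodes.erase(_id);
        m_allNodes[replacement->id] = replacement;
        m_bucket.push_back(std::move(replacement));
    }

    bool known(NodeID const& _id) const { return m_allNodes.count(_id) != 0; }
    std::size_t pendingPings() const { return m_sentPings.size(); }

private:
    std::size_t m_bucketSize;
    std::deque<std::shared_ptr<NodeEntry>> m_bucket;
    std::map<NodeID, std::shared_ptr<NodeEntry>> m_allNodes;
    std::map<NodeID, SentPing> m_sentPings;
};
```

With a bucket of size one, adding two more candidates while the oldest node is being checked produces a single pending ping, and a Pong from the oldest node leaves `m_allNodes` untouched because the candidates were never inserted there.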

gumb0 · 2019-02-15T11:41:06Z

libp2p/NodeTable.cpp

 }

-void NodeTable::noteActiveNode(Public const& _pubk, bi::udp::endpoint const& _endpoint)
+void NodeTable::noteActiveNode(shared_ptr<NodeEntry> _nodeEntry, bi::udp::endpoint const& _endpoint)


The changes in this method are only:

shared_ptr<NodeEntry> instead of Public parameter.

insert into m_allNodes together with inserting into the bucket

gumb0 · 2019-02-15T13:18:39Z

Tests seem to be fixed now, I think it's ready for review

gumb0 · 2019-02-15T13:58:36Z

libp2p/NodeTable.cpp

+    // Don't sent Ping if one is already sent
+    if (contains(m_sentPings, _node.id))
+    {
+        LOG(m_logger) << "Ignoring request to ping " << _node << ", because it's already pinged";


This is being logged quite a lot I should say

I think we benefit from more logging since it makes it easier to track what's going on and detect bugs, but wading through the log spew can be challenging. It would be nice to eventually have some additional log levels so we could have more targeted logging...in the meantime I think we should keep/add logs unless they significantly impact the readability of the Aleth logs.

Separating logs into INFO and DEBUG levels would be good. In INFO only actual changes (node added, node replaced by) and/or stats (X new nodes discovered).
The rest should go to DEBUG. Later we can consider splitting DEBUG into DEBUG and TRACE.

For aleth-bootnode the networking INFO level should be enabled by default.
For aleth all networking logs should be disabled.

> For aleth-bootnode the networking INFO level should be enabled by default.

Agreed, filed #5499 to track this
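
As a rough illustration of the INFO/DEBUG split discussed above, here is a small severity filter. It is only a sketch: `LeveledLogger` and `LogLevel` are hypothetical names, the logger writes into an in-memory buffer, and nothing here reflects how Aleth's `LOG(m_logger)` channels are actually implemented.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical severity levels, ordered from most to least verbose.
// Off is only meaningful as a threshold, not as a message level.
enum class LogLevel
{
    Trace = 0,
    Debug = 1,
    Info = 2,
    Off = 3
};

class LeveledLogger
{
public:
    explicit LeveledLogger(LogLevel _minLevel) : m_minLevel(_minLevel) {}

    // Writes the message only if its level reaches the configured threshold.
    void log(LogLevel _level, std::string const& _message)
    {
        if (_level == LogLevel::Off || _level < m_minLevel)
            return;
        m_out << label(_level) << ' ' << _message << '\n';
    }

    std::string output() const { return m_out.str(); }

private:
    static char const* label(LogLevel _level)
    {
        switch (_level)
        {
        case LogLevel::Trace:
            return "TRACE";
        case LogLevel::Debug:
            return "DEBUG";
        case LogLevel::Info:
            return "INFO";
        default:
            return "OFF";
        }
    }

    LogLevel m_minLevel;
    std::ostringstream m_out;
};
```

Under this scheme a bootnode-style configuration would construct the logger with `LogLevel::Info`, so a message like "Ignoring request to ping" logged at `LogLevel::Debug` is filtered out, while a threshold of `LogLevel::Off` suppresses all networking messages.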

halfalicious · 2019-02-15T16:23:42Z

libp2p/NodeTable.cpp

-            entry = createNodeEntry(_node, 0, 0);
-        else
-            entry = it->second;
+        needToPing = (it == m_allNodes.end() || !it->second->hasValidEndpointProof());


Should we also check m_sentPings? What if we sent a ping recently but just haven't received a response yet?

yeah good point, it would improve things, but unfortunately it's not easy to do it here, because m_sentPings is accessed only from the network thread...

We could either make addNode asynchronous; or we could create another private method for adding nodes from the packet handlers (but it would be very similar to addNode); or maybe we could check before sending Ping in ping() whether the node already happened to be validated, then skip additional Ping.

All this seems quite ugly to me for now; I'd suggest first observing how often we get double pings because of this problem, then deciding whether it makes sense to complicate the code...

halfalicious · 2019-02-16T17:07:47Z

* Sometimes we've already scheduled initiating a ping (with `m_timers.schedule(0, ...)`); if another request to add the same node happens immediately in succession, we schedule another ping (i.e. we don't look into `io_service` to check whether we already plan to ping soon). I believe this can sometimes still happen after these fixes, but it should be less frequent after the split into `ping/schedulePing` (regular `ping` now doesn't do scheduling with `m_timers`)

Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger with doing this is that the ping might not actually execute until after the UDP datagram time to live (1 minute), which means that the ping is already considered expired when it gets sent out over the wire. I don't know much about boost deadline timers, but I don't think this can realistically happen, even if there are a lot of boost deadline timers expiring around the same time (since you'd have to execute a lot of handlers to consume 1 minute of wall clock time and I don't think we create that many deadline timers).

Another possible way to address this would be via something like m_queuedPings, and we could check both this and m_sentPings before deciding if we should queue a new ping.
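
The `m_queuedPings` idea could look roughly like the sketch below. It is an assumption-laden model rather than a proposal for the real code: `QueuedPingTracker`, `sendQueued` and `runNetworkThread` are hypothetical, the task queue stands in for `m_timers`, and the sketch is single-threaded. In a real implementation where `schedulePing` can be called from other threads, access to the queued set would presumably need a mutex or similar protection.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for illustration; not Aleth's real NodeID type.
using NodeID = std::string;

class QueuedPingTracker
{
public:
    // Returns true if a ping was queued, false if one is already
    // queued or already in flight for this node.
    bool schedulePing(NodeID const& _id)
    {
        if (m_queuedPings.count(_id) != 0 || m_sentPings.count(_id) != 0)
            return false;
        m_queuedPings.insert(_id);
        m_tasks.push_back([this, _id] { sendQueued(_id); });
        return true;
    }

    // Drains the queue, standing in for the network thread's event loop.
    void runNetworkThread()
    {
        while (!m_tasks.empty())
        {
            auto task = std::move(m_tasks.front());
            m_tasks.pop_front();
            task();
        }
    }

    // Pong handler: the node may be pinged again afterwards.
    void onPong(NodeID const& _id) { m_sentPings.erase(_id); }

    std::size_t sentCount() const { return m_sent; }

private:
    // Moves the node from "queued" to "sent" at the moment of sending.
    void sendQueued(NodeID const& _id)
    {
        m_queuedPings.erase(_id);
        m_sentPings.insert(_id);
        ++m_sent;  // stands in for the actual UDP send
    }

    std::unordered_set<NodeID> m_queuedPings;
    std::unordered_set<NodeID> m_sentPings;
    std::deque<std::function<void()>> m_tasks;
    std::size_t m_sent = 0;
};
```

Because the ping timestamp would still be assigned when the packet is actually sent, this variant avoids the expiry concern raised for recording the ping at scheduling time.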

chfast · 2019-02-18T11:54:12Z

libp2p/NodeTable.cpp

-        ping(*entry);
+    if (needToPing)
+    {
+        LOG(m_logger) << "Pending " << _node;


This log and "Adding node " are redundant. How about we log either "Pending" or "Added"?

chfast · 2019-02-18T12:04:28Z

libp2p/NodeTable.cpp

+    // Don't sent Ping if one is already sent
+    if (contains(m_sentPings, _node.id))
+    {
+        LOG(m_logger) << "Ignoring request to ping " << _node << ", because it's already pinged";


Separating logs into INFO and DEBUG levels would be good. In INFO only actual changes (node added, node replaced by) and/or stats (X new nodes discovered).
The rest should go to DEBUG. Later we can consider splitting DEBUG into DEBUG and TRACE.

For aleth-bootnode the networking INFO level should be enabled by default.
For aleth all networking logs should be disabled.

chfast · 2019-02-18T12:11:02Z

libp2p/NodeTable.cpp

-        LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node " << _pubk
-                      << "@" << _endpoint;
+        LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node "
+                      << _nodeEntry->id << "@" << _endpoint;


You can log _nodeEntry here.

chfast · 2019-02-18T12:11:34Z

libp2p/NodeTable.cpp

+    if (!_nodeEntry->hasValidEndpointProof())
+        return;

+    LOG(m_logger) << "Active node " << _nodeEntry->id << '@' << _endpoint;


You can log _nodeEntry here.

halfalicious · 2019-02-20T05:48:49Z

* Sometimes we've already scheduled initiating a ping (with `m_timers.schedule(0, ...)`); if another request to add the same node happens immediately in succession, we schedule another ping (i.e. we don't look into `io_service` to check whether we already plan to ping soon). I believe this can sometimes still happen after these fixes, but it should be less frequent after the split into `ping/schedulePing` (regular `ping` now doesn't do scheduling with `m_timers`)
Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger with doing this is that the ping might not actually execute until after the UDP datagram time to live (1 minute), which means that the ping is already considered expired when it gets sent out over the wire. I don't know much about boost deadline timers, but I don't think this can realistically happen, even if there are a lot of boost deadline timers expiring around the same time (since you'd have to execute a lot of handlers to consume 1 minute of wall clock time and I don't think we create that many deadline timers).

Another possible way to address this would be via something like m_queuedPings, and we could check both this and m_sentPings before deciding if we should queue a new ping.

Filed #5500 to track this

gumb0 added the in progress label Feb 11, 2019

gumb0 mentioned this pull request Feb 11, 2019

Assertion failure in discovery #5471

Closed

halfalicious mentioned this pull request Feb 12, 2019

Potential discovery issues #5484

Open

halfalicious reviewed Feb 13, 2019

gumb0 added 4 commits February 14, 2019 18:24

Skip sending Ping when we've already sent one to this node

a939710

Add unit test for addNode the same node twice

1dd33d6

Remove ignored replacement node in case trying to ping one evicted node several times.

6537567

Split ping() into synchronous ping() and asynchronous schedulePing()

9dcc4da

gumb0 changed the title ~~Fix multiple Ping~~ Fix multiple Ping and assertion failure in Discovery Feb 14, 2019

Make m_allNodes contain only the nodes of the node table buckets.

0daaaa5

Don't put not yet validated nodes there, neither the ones that don't fit to the bucket and are replacement nodes for evicted ones. Replacement nodes are kept only in the m_sentPing items.

gumb0 force-pushed the you-only-ping-once branch from 3ef9cf6 to 0daaaa5 Compare February 14, 2019 17:34

Address minor review issues.

6b54ecf

halfalicious reviewed Feb 14, 2019

gumb0 added 2 commits February 15, 2019 11:14

Fix typos

bd42df4

Fix tests after making m_allNodes contain only node table nodes

2cdb3f4

gumb0 commented Feb 15, 2019

Fix tests crash on clang build

ceffe80

gumb0 removed the in progress label Feb 15, 2019

gumb0 requested a review from halfalicious February 15, 2019 13:18

gumb0 requested a review from chfast February 15, 2019 13:18

gumb0 commented Feb 15, 2019

gumb0 mentioned this pull request Feb 15, 2019

Discovery timer stops unexpectedly #5495

Closed

halfalicious reviewed Feb 15, 2019

chfast approved these changes Feb 18, 2019

chfast merged commit b09978b into master Feb 19, 2019

chfast deleted the you-only-ping-once branch February 19, 2019 11:36

halfalicious mentioned this pull request Feb 20, 2019

Potential double-ping during discovery #5500

Open

4 participants
