Fix deadline timer race condition which was causing premature discovery termination #5502
halfalicious merged 8 commits into master
Conversation
Force-pushed from a76a849 to d8ff0f1
Codecov Report
@@            Coverage Diff             @@
##           master    #5502      +/-   ##
==========================================
- Coverage   61.89%   61.86%   -0.03%
==========================================
  Files         345      345
  Lines       28736    28725      -11
  Branches     3265     3261       -4
==========================================
- Hits        17785    17770      -15
- Misses       9782     9796      +14
+ Partials     1169     1159      -10
Force-pushed from da4ba10 to c61ba26
Force-pushed from e95000d to 18ae15b
Rebased
Force-pushed from 6970220 to 672e9a1
The following 2 tests are failing due to near-null AVs:
The issues seem specific to clang... I can't repro locally on Windows with full page heap enabled, and the Linux GCC CircleCI run passed.
Bad news, man. It crashes when testing in parallel. I could not reproduce it at first because the tests must run in parallel to cause the crash. In the end I compiled with clang + ASan + libc++ on Linux. The report:
It looks like the shared pointer is deleted, but something still holds a pointer to the object.
Ah, I think the problem is in the DeadlineOps::reap() handler - it checks the error code and m_stopped before deciding to proceed, but the DeadlineOps dtor doesn't cancel timers before deleting them, which means that the error code won't be set and attempting to check m_stopped will trigger an AV.
Can't you just check the error code first? |
Force-pushed from 672e9a1 to 9c03d48
We do, but it doesn't indicate operation_aborted since the timer(s) haven't been cancelled and are still alive because of the shared_ptr captured by the handler.
I've rebased my changes after merging @chfast's changes which queue pings via io_service::post, and scoped them down to just adding the dedicated timers for the discovery and eviction processes. I've also removed DeadlineOps since we no longer need it. I'm going to test my changes by running discovery overnight and verifying no premature cancellation, but I don't envision any problems.
chfast
left a comment
This looks good to me, but there are some test failures that must be investigated.
Moreover, check out this tutorial about triggering timers periodically (we are close to this, but maybe you can spot some small differences): https://www.boost.org/doc/libs/1_69_0/doc/html/boost_asio/tutorial/tuttimer3.html
Also, they are using steady_timer instead of deadline_timer. The difference is only in the clock used. We don't need high precision here, so maybe one is better than the other for our case.
Force-pushed from 3dbf8d6 to 72fe8c4
One test is failing in linux-clang6:
I think it's just an unreliable test; I'll file a new issue for tracking purposes. No tests are failing in the AppVeyor config, but I'm seeing the following error which is causing the config to fail: I recall seeing this before... @chfast: Is this a known issue?
Force-pushed from 804b409 to 1df1040
Force-pushed from f5571f2 to de9bffc
Looks like the linux-clang failure is unrelated to my changes; all unit tests passed, but the sstore test appears to have timed out.
gumb0
left a comment
A couple of minor comments, otherwise looks good
m_discoveryTimer->expires_from_now(c_bucketRefreshMs);
auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait([this, discoveryTimer](boost::system::error_code const& _ec) {
    // We can't use m_logger if an error occurred because captured this might be already
I'd move this comment inside the brackets, before the clog(...) <<
auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait(
    [this, discoveryTimer, _node, _round, _tried](boost::system::error_code const& _ec) {
        // We can't use m_logger here if there's an error because captured this might already be
So I purposely put this comment above the "if" case since it applies to both the "if" and "else if", and I didn't want to include the same comment in both cases. I also phrased it so that it would apply to both cases (i.e. I mentioned "error" instead of calling out error::operation_aborted).
m_evictionTimer->expires_from_now(c_processEvictionsIntervalMs);
auto evictionTimer{m_evictionTimer};
m_evictionTimer->async_wait([this, evictionTimer](boost::system::error_code const& _ec) {
    // We can't use m_logger if an error occurred because captured this might be already
Add dedicated node table timers for the discovery and eviction algorithms - this enables us to remove the DeadlineOps timer wrapper class, which has a race condition (handlers can be executed after their timers have been deleted) and is overly complicated for what we need.

Note that the timers are shared_ptr - this is required to keep timers alive if handlers are executed after the node table has been destroyed. This shouldn't happen in practice, since the node table lifetime is tied to the host lifetime and if the host is destroyed then the io service is no longer running, but these changes will help avoid use-after-frees if the node table/host relationship changes in the future.

Also, have the node table inherit enable_shared_from_this so that the node table can be kept alive in ping handlers. As with the timers, this isn't needed but is good design in case the node table/host relationship changes in the future. Note that this change requires moving the code which starts the discovery and eviction processes into a separate function (NodeTable::start), because we cannot call shared_from_this() before the object is fully constructed.
* Fix unreliable network test findNodeIsSentAfterPong - it's unreliable because it expects the first node table to send a FindNode packet to the second node table before the second node table sends a FindNode packet to the first. However, there's no guarantee that this will happen, because both node tables are running discovery code in separate threads. While starting the first node table before the second results in the expected behavior most of the time, occasionally the second node table sends the FindNode packet first, causing the test to fail because the second node table receives a Neighbours packet when it expects a FindNode packet. I've fixed this by replacing one of the node tables with a TestUDPSocketHost - since there's only 1 discovery process running now, we can deterministically know when to expect a FindNode packet to be sent by the node table.
* Add ping validation to test invalidPong - potentially makes the test a little easier to debug on failure.
Removed unused m_enabled, renamed doDiscover to doDiscoveryRound so there's less confusion between it and doDiscovery, and made the discovery round time interval a constant (expression). Also replaced std::bind usage with lambdas, removed a NodeTable function declaration without a definition (doCheckEvictions), and renamed doHandleTimeouts -> doProcessEvictions since the latter is more descriptive of what the function does.
Cancel timers using boost::asio::io_service::post to ensure that timer cancellation happens on the network thread
Removing this functionality since it complicated things without a clear benefit (e.g. I had to move discovery/eviction start out of the node table ctor into a separate function which the node table owner would have to call) - after tracing through the steps that occur at aleth/aleth-bootnode and network test exit, I think we are safe without this functionality.

Additional context: the fear was that we could hit a case where handlers are executed after the node table has been destroyed (a use-after-free), but that is not possible, since the node table is "owned" by the host and is only destroyed after the host has stopped the network thread, which means that the io service has been shut down (i.e. no more Host::doWork() calling io_service::run) and no handlers can be run. Note that the node table owner does have to call NodeTable::stop() when shutting down, otherwise we can hit a use-after-free in the UDP socket; I think this is a reasonable requirement.
steady_timer supports std::chrono values and has enough precision on both Linux and Windows. Also remove an unused node table constant (NodeTable::c_evictionCheckInterval).
Force-pushed from de9bffc to 00ef44d
Squashed a bunch of commits and addressed PR feedback
The old name makes more sense since the function handles evictions and clears timed-out sent pings. Also update the name of the associated timer and timer interval, and fix a bug where we checked the node table member timer in the doHandleTimeouts lambda rather than the captured timer.
Force-pushed from 00ef44d to add83b7
@chfast: I'll merge tomorrow AM unless you have any objections.
He's already reviewed the core logic of the changes and is currently busy with other things.
Fix #5495
Fix the boost deadline timer race condition which was causing the discovery process to terminate (in some cases extremely) prematurely. The race condition occurred because DeadlineOps would check for timer expiration (deadline_timer::expires_from_now() < 0) and delete expired timers (see DeadlineOps::reap()), but their handlers may not have executed yet, which means that when they finally executed their boost error code would indicate operation aborted and the handler would return before executing any of its core logic.

To fix this, I created 2 dedicated deadline timers in NodeTable - 1 for the discovery process and 1 for the node eviction process. I think this makes sense since these are 2 algorithms which are supposed to be running for as long as the node table is running... additionally, having dedicated timers here makes the code easier to understand. I also only check the boost error code in these timer handlers, since the timers are cancelled in NodeTable::stop() (which is executed on both network shutdown and on NodeTable destruction), meaning that the error code is guaranteed to indicate timer cancellation when a handler is executed after network shutdown or node table destruction.

I've also removed the DeadlineOps class since it's no longer needed.