Skip to content
This repository was archived by the owner on Oct 28, 2021. It is now read-only.

Fix deadline timer race condition which was causing premature discovery termination#5502

Merged
halfalicious merged 8 commits intomasterfrom
discov-timer-cancel
Mar 14, 2019
Merged

Fix deadline timer race condition which was causing premature discovery termination#5502
halfalicious merged 8 commits intomasterfrom
discov-timer-cancel

Conversation

@halfalicious
Copy link
Contributor

@halfalicious halfalicious commented Feb 24, 2019

Fix #5495

Fix the boost deadline timer race condition which was causing the discovery process to terminate (in some cases extremely) prematurely. The race condition occurred because DeadlineOps would check for timer expiration (deadline_timer::expires_from_now() < 0) and delete expired timers (see DeadlineOps::reap()), but their handlers may not have executed yet which means that when they finally executed their boost error code would indicate operation aborted and the handler would return before executing any of its core logic.

To fix this, I created 2 dedicated deadline timers in NodeTable - 1 for the discovery process and 1 for the node eviction process. I think this makes sense since these are 2 algorithms which are supposed to be running for as long as the node table is running…additionally, having dedicated timers here makes the code easier to understand. I also only check the boost error code in these timer handlers since the timers are cancelled in NodeTable::stop() (which is executed on both network shutdown and on NodeTable destruction), meaning that the error code is guaranteed to indicate timer cancellation when a handler is executed after network shutdown or node table destruction.

I've also removed the DeadlineOps class since it's no longer needed.

@codecov-io
Copy link

codecov-io commented Feb 25, 2019

Codecov Report

Merging #5502 into master will decrease coverage by 0.02%.
The diff coverage is 82.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5502      +/-   ##
==========================================
- Coverage   61.89%   61.86%   -0.03%     
==========================================
  Files         345      345              
  Lines       28736    28725      -11     
  Branches     3265     3261       -4     
==========================================
- Hits        17785    17770      -15     
- Misses       9782     9796      +14     
+ Partials     1169     1159      -10

@halfalicious halfalicious force-pushed the discov-timer-cancel branch 3 times, most recently from e95000d to 18ae15b Compare February 26, 2019 03:22
@halfalicious
Copy link
Contributor Author

Rebased

@halfalicious halfalicious force-pushed the discov-timer-cancel branch 3 times, most recently from 6970220 to 672e9a1 Compare February 26, 2019 05:03
@halfalicious halfalicious requested a review from chfast February 26, 2019 05:07
@halfalicious halfalicious changed the title [WIP] Discov timer cancel Fix deadline timer race condition which was causing premature discovery termination Feb 26, 2019
@halfalicious
Copy link
Contributor Author

The following 2 tests are failing due to near-null AVs:

Test Case "noteActiveNodeEvictsTheNodeWhenBucketIsFull": 
unknown location(0): fatal error: in "network/net/noteActiveNodeEvictsTheNodeWhenBucketIsFull": signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
/home/builder/project/test/unittests/libp2p/net.cpp(516): last checkpoint

Test Case "evictionWithOldNodeDropped": 
unknown location(0): fatal error: in "network/net/evictionWithOldNodeDropped": signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
/home/builder/project/test/unittests/libp2p/net.cpp(881): last checkpoint

@halfalicious
Copy link
Contributor Author

The following 2 tests are failing due to near-null AVs:

Test Case "noteActiveNodeEvictsTheNodeWhenBucketIsFull": 
unknown location(0): fatal error: in "network/net/noteActiveNodeEvictsTheNodeWhenBucketIsFull": signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
/home/builder/project/test/unittests/libp2p/net.cpp(516): last checkpoint

Test Case "evictionWithOldNodeDropped": 
unknown location(0): fatal error: in "network/net/evictionWithOldNodeDropped": signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
/home/builder/project/test/unittests/libp2p/net.cpp(881): last checkpoint

Issues seem specific to clang...can't repro locally on Windows with full page heap enabled and the Linux GCC CircleCI run passed.

@chfast
Copy link
Member

chfast commented Feb 26, 2019

Bad news, man. It crashes when testing in parallel. I could not reproduce it at first, because the test must run in parallel to cause the crash. In the end I compiled with clang + ASan + libc++ on linux.

The report:

=================================================================
==19042==ERROR: AddressSanitizer: heap-use-after-free on address 0x625000027b50 at pc 0x0000013ede33 bp 0x7fc60c40a4a0 sp 0x7fc60c40a498
READ of size 1 at 0x625000027b50 thread T5 (test)
    #0 0x13ede32 in std::__1::__atomic_base<bool, false>::load(std::__1::memory_order) const /usr/include/c++/v1/atomic:926:17
    #1 0x13ede32 in std::__1::__atomic_base<bool, false>::operator bool() const /usr/include/c++/v1/atomic:930
    #2 0x13ede32 in dev::p2p::DeadlineOps::reap()::$_0::operator()(std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&) const /home/chfast/Projects/ethereum/aleth/libp2p/Common.cpp:177
    #3 0x13ef307 in std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>::operator()(std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&) const /usr/include/c++/v1/functional:1916:12
    #4 0x13ef307 in decltype(std::__1::forward<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>&>(fp)(std::__1::forward<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>&>(fp0), std::__1::forward<boost::system::error_code const&>(fp0))) std::__1::__invoke<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>&, boost::system::error_code const&>(std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>&, boost::system::error_code const&) /usr/include/c++/v1/type_traits:4482
    #5 0x13ef307 in std::__1::__bind_return<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >, std::__1::tuple<boost::system::error_code const&>, __is_valid_bind_return<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >, std::__1::tuple<boost::system::error_code const&> >::value>::type std::__1::__apply_functor<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >, 0ul, 1ul, std::__1::tuple<boost::system::error_code const&> >(std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>&, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >&, std::__1::__tuple_indices<0ul, 1ul>, std::__1::tuple<boost::system::error_code const&>&&) /usr/include/c++/v1/functional:2219
    #6 0x13ef307 in std::__1::__bind_return<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >, std::__1::tuple<boost::system::error_code const&>, __is_valid_bind_return<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)>, std::__1::tuple<std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> >, std::__1::tuple<boost::system::error_code const&> >::value>::type std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>::operator()<boost::system::error_code const&>(boost::system::error_code const&) /usr/include/c++/v1/functional:2252
    #7 0x13ef307 in boost::asio::detail::binder1<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>, boost::system::error_code>::operator()() /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/bind_handler.hpp:47
    #8 0x13ef307 in void boost::asio::asio_handler_invoke<boost::asio::detail::binder1<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>, boost::system::error_code> >(boost::asio::detail::binder1<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>, boost::system::error_code>&, ...) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/handler_invoke_hook.hpp:69
    #9 0x13ef307 in void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>, boost::system::error_code>, std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&> >(boost::asio::detail::binder1<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>, boost::system::error_code>&, std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&>&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/handler_invoke_helpers.hpp:37
    #10 0x13ef307 in boost::asio::detail::wait_handler<std::__1::__bind<std::__1::function<void (std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp> const&, boost::system::error_code const&)> const&, std::__1::shared_ptr<dev::p2p::DeadlineOps::DeadlineOp>, std::__1::placeholders::__ph<1> const&> >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/wait_handler.hpp:70
    #11 0xe748f0 in boost::asio::detail::task_io_service_operation::complete(boost::asio::detail::task_io_service&, boost::system::error_code const&, unsigned long) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/task_io_service_operation.hpp:38:5
    #12 0xe748f0 in boost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/impl/task_io_service.ipp:372
    #13 0xe74480 in boost::asio::detail::task_io_service::run(boost::system::error_code&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/detail/impl/task_io_service.ipp:149:10
    #14 0xe74339 in boost::asio::io_service::run() /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/asio/impl/io_service.ipp:59:25
    #15 0x1226cb9 in dev::Worker::workLoop() /home/chfast/Projects/ethereum/aleth/libdevcore/Worker.cpp:140:3
    #16 0x1227511 in dev::Worker::startWorking()::$_0::operator()() const /home/chfast/Projects/ethereum/aleth/libdevcore/Worker.cpp:62:6
    #17 0x1226dcb in decltype(std::__1::forward<dev::Worker::startWorking()::$_0>(fp)()) std::__1::__invoke<dev::Worker::startWorking()::$_0>(dev::Worker::startWorking()::$_0&&) /usr/include/c++/v1/type_traits:4482:1
    #18 0x1226dcb in void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dev::Worker::startWorking()::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dev::Worker::startWorking()::$_0>&, std::__1::__tuple_indices<>) /usr/include/c++/v1/thread:342
    #19 0x1226dcb in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dev::Worker::startWorking()::$_0> >(void*) /usr/include/c++/v1/thread:352
    #20 0x7fc6113286da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #21 0x7fc6103a088e in clone /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95

0x625000027b50 is located 8784 bytes inside of 9016-byte region [0x625000025900,0x625000027c38)
freed by thread T0 here:
    #0 0x9147b2 in operator delete(void*) (/home/chfast/Projects/ethereum/aleth/build/asan-libcxx/test/testeth+0x9147b2)
    #1 0xe64ef8 in std::__1::__shared_count::__release_shared() /usr/include/c++/v1/memory:3490:9
    #2 0xe64ef8 in std::__1::__shared_weak_count::__release_shared() /usr/include/c++/v1/memory:3532
    #3 0xe64ef8 in std::__1::shared_ptr<network::net::TestNodeTable>::~shared_ptr() /usr/include/c++/v1/memory:4468
    #4 0xe64ef8 in std::__1::enable_if<is_convertible<network::net::TestNodeTable*, network::net::TestNodeTable*>::value, void>::type std::__1::shared_ptr<network::net::TestNodeTable>::reset<network::net::TestNodeTable>(network::net::TestNodeTable*) /usr/include/c++/v1/memory:4616
    #5 0xe64ef8 in network::net::TestNodeTableHost::TestNodeTableHost(unsigned int, unsigned short) /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:234
    #6 0xe3b931 in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull::test_method() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:514:23
    #7 0xe3b06c in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull_invoker() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:512:1
    #8 0xb8689b in boost::function0<void>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #9 0xb8689b in boost::detail::forward::operator()() /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1300
    #10 0xb8689b in boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:138
    #11 0xb45699 in boost::function0<int>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #12 0xb45699 in int boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(boost::shared_ptr<boost::detail::translator_holder_base> const&, boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:281
    #13 0xb45699 in boost::execution_monitor::catch_signals(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:864
    #14 0xb4589a in boost::execution_monitor::execute(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1203:16
    #15 0xb3a915 in boost::execution_monitor::vexecute(boost::function<void ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1309:5
    #16 0xb418c8 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_monitor.ipp:46:9
    #17 0xb64c64 in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:771:44
    #18 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #19 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #20 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #21 0xb3f6f1 in boost::unit_test::framework::run(unsigned long, bool) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:1577:29
    #22 0xb5141c in boost::unit_test::unit_test_main(boost::unit_test::test_suite* (*)(int, char**), int, char**) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_main.ipp:231:9
    #23 0xb5e0ff in main /home/chfast/Projects/ethereum/aleth/test/tools/libtesteth/boostTest.cpp:139:14
    #24 0x7fc6102a0b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310

previously allocated by thread T0 here:
    #0 0x913b72 in operator new(unsigned long) (/home/chfast/Projects/ethereum/aleth/build/asan-libcxx/test/testeth+0x913b72)
    #1 0xe64da9 in network::net::TestNodeTableHost::TestNodeTableHost(unsigned int, unsigned short) /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:235:17
    #2 0xe3b931 in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull::test_method() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:514:23
    #3 0xe3b06c in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull_invoker() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:512:1
    #4 0xb8689b in boost::function0<void>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #5 0xb8689b in boost::detail::forward::operator()() /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1300
    #6 0xb8689b in boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:138
    #7 0xb45699 in boost::function0<int>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #8 0xb45699 in int boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(boost::shared_ptr<boost::detail::translator_holder_base> const&, boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:281
    #9 0xb45699 in boost::execution_monitor::catch_signals(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:864
    #10 0xb4589a in boost::execution_monitor::execute(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1203:16
    #11 0xb3a915 in boost::execution_monitor::vexecute(boost::function<void ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1309:5
    #12 0xb418c8 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_monitor.ipp:46:9
    #13 0xb64c64 in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:771:44
    #14 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #15 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #16 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #17 0xb3f6f1 in boost::unit_test::framework::run(unsigned long, bool) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:1577:29
    #18 0xb5141c in boost::unit_test::unit_test_main(boost::unit_test::test_suite* (*)(int, char**), int, char**) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_main.ipp:231:9
    #19 0xb5e0ff in main /home/chfast/Projects/ethereum/aleth/test/tools/libtesteth/boostTest.cpp:139:14
    #20 0x7fc6102a0b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310

Thread T5 (test) created by T0 here:
    #0 0x8ccc2d in pthread_create (/home/chfast/Projects/ethereum/aleth/build/asan-libcxx/test/testeth+0x8ccc2d)
    #1 0x1225d3f in std::__1::__libcpp_thread_create(unsigned long*, void* (*)(void*), void*) /usr/include/c++/v1/__threading_support:327:10
    #2 0x1225d3f in std::__1::thread::thread<dev::Worker::startWorking()::$_0, void>(dev::Worker::startWorking()::$_0&&) /usr/include/c++/v1/thread:368
    #3 0x1225735 in dev::Worker::startWorking() /home/chfast/Projects/ethereum/aleth/libdevcore/Worker.cpp:44:20
    #4 0xe3be72 in network::net::TestHost::start() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:37:20
    #5 0xe3be72 in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull::test_method() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:524
    #6 0xe3b06c in network::net::noteActiveNodeEvictsTheNodeWhenBucketIsFull_invoker() /home/chfast/Projects/ethereum/aleth/test/unittests/libp2p/net.cpp:512:1
    #7 0xb8689b in boost::function0<void>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #8 0xb8689b in boost::detail::forward::operator()() /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1300
    #9 0xb8689b in boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:138
    #10 0xb45699 in boost::function0<int>::operator()() const /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/function/function_template.hpp:759:14
    #11 0xb45699 in int boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(boost::shared_ptr<boost::detail::translator_holder_base> const&, boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:281
    #12 0xb45699 in boost::execution_monitor::catch_signals(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:864
    #13 0xb4589a in boost::execution_monitor::execute(boost::function<int ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1203:16
    #14 0xb3a915 in boost::execution_monitor::vexecute(boost::function<void ()> const&) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/execution_monitor.ipp:1309:5
    #15 0xb418c8 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_monitor.ipp:46:9
    #16 0xb64c64 in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:771:44
    #17 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #18 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #19 0xb650bc in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:717:54
    #20 0xb3f6f1 in boost::unit_test::framework::run(unsigned long, bool) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/framework.ipp:1577:29
    #21 0xb5141c in boost::unit_test::unit_test_main(boost::unit_test::test_suite* (*)(int, char**), int, char**) /home/chfast/.hunter/_Base/c4cfcc0/59044ab/ef3c3eb/Install/include/boost/test/impl/unit_test_main.ipp:231:9
    #22 0xb5e0ff in main /home/chfast/Projects/ethereum/aleth/test/tools/libtesteth/boostTest.cpp:139:14
    #23 0x7fc6102a0b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310

SUMMARY: AddressSanitizer: heap-use-after-free /usr/include/c++/v1/atomic:926:17 in std::__1::__atomic_base<bool, false>::load(std::__1::memory_order) const
Shadow bytes around the buggy address:
  0x0c4a7fffcf10: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffcf20: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffcf30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffcf40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffcf50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c4a7fffcf60: fd fd fd fd fd fd fd fd fd fd[fd]fd fd fd fd fd
  0x0c4a7fffcf70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c4a7fffcf80: fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa fa
  0x0c4a7fffcf90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffcfa0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a7fffcfb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==19042==ABORTING

@chfast
Copy link
Member

chfast commented Feb 26, 2019

It looks like the shared pointer is deleted, but some still has the pointer to the object.

Copy link
Member

@chfast chfast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the use-after-free.

@halfalicious
Copy link
Contributor Author

eadlineOps::DeadlineOp> const&, b

Ah, I think the problem is in the DeadlineOps::reap() handler - it checks the error code and m_stopped before deciding to proceed, but DeadlineLoops dtor doesn't cancel timers before deleting them which means that the error code won't be set and attempting to check m_stopped will trigger an AV.

@chfast
Copy link
Member

chfast commented Feb 26, 2019

Ah, I think the problem is in the DeadlineOps::reap() handler - it checks the error code and m_stopped before deciding to proceed, but DeadlineLoops dtor doesn't cancel timers before deleting them which means that the error code won't be set and attempting to check m_stopped will trigger an AV.

Can't you just check the error code first?

@halfalicious
Copy link
Contributor Author

Ah, I think the problem is in the DeadlineOps::reap() handler - it checks the error code and m_stopped before deciding to proceed, but DeadlineLoops dtor doesn't cancel timers before deleting them which means that the error code won't be set and attempting to check m_stopped will trigger an AV.

Can't you just check the error code first?

We do but it doesn't indicate operation aborted since the timer(s) haven't been cancelled + are still alive because of the shared_ptr accepted by the handler.

@halfalicious
Copy link
Contributor Author

I've rebased my changes after merging @chfast 's changes which queue pings via io_service::post and scoped them down to just adding the dedicated timers for the discovery and eviction processes. I've also removed DeadlineOps since we no longer need it.

I'm going to test my changes by running discovery overnight and verifying no cancellation but I don't envision any problems.

Copy link
Member

@chfast chfast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but there are some test failures that must be investigated.

Moreover, check out this tutorial about triggering timers periodically (we are close to this, but maybe you can spot some small differences): https://www.boost.org/doc/libs/1_69_0/doc/html/boost_asio/tutorial/tuttimer3.html

Also they are using steady_timer instead of deadline_timer. The difference is only in the clock used. We don't need high precision here, so maybe one is better than the other for our case.

@halfalicious halfalicious force-pushed the discov-timer-cancel branch 2 times, most recently from 3dbf8d6 to 72fe8c4 Compare March 3, 2019 23:08
@halfalicious
Copy link
Contributor Author

One test is failing in linux-clang6:

510 - network/net/findNodeIsSentAfterPong (Failed)

I think it's just an unreliable test, I'll file a new issue for tracking purposes.

No tests are failing in the appveyor config, but I'm seeing the following error which is causing the config to fail:

*** No errors detected
Collecting artifacts...
Found artifact 'build\aleth.zip' matching 'build\aleth.zip' path
Uploading artifacts...
Error uploading artifact the storage: Unable to connect to the remote server

I recall seeing this before... @chfast : Is this a known issue?

@halfalicious halfalicious force-pushed the discov-timer-cancel branch from 804b409 to 1df1040 Compare March 9, 2019 07:37
@halfalicious halfalicious changed the title Fix deadline timer race condition which was causing premature discovery termination [WIP] Fix deadline timer race condition which was causing premature discovery termination Mar 13, 2019
@halfalicious halfalicious force-pushed the discov-timer-cancel branch 3 times, most recently from f5571f2 to de9bffc Compare March 13, 2019 04:24
@halfalicious
Copy link
Contributor Author

Looks like the linux-clang failure is unrelated to my changes, all unit tests passed but the sstore test appears to have timed out.

@halfalicious halfalicious changed the title [WIP] Fix deadline timer race condition which was causing premature discovery termination Fix deadline timer race condition which was causing premature discovery termination Mar 13, 2019
@halfalicious
Copy link
Contributor Author

cc @chfast / @gumb0

Copy link
Member

@gumb0 gumb0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor comments, otherwise looks good

m_discoveryTimer->expires_from_now(c_bucketRefreshMs);
auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait([this, discoveryTimer](boost::system::error_code const& _ec) {
// We can't use m_logger if an error occurred because captured this might be already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this comment inside brackets, before clog(...)<<

auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait(
[this, discoveryTimer, _node, _round, _tried](boost::system::error_code const& _ec) {
// We can't use m_logger here if there's an error because captured this might already be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too, move comment

Copy link
Contributor Author

@halfalicious halfalicious Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I purposely put this comment above the "if" case since it applies to both the "if" and "else if" and I didn't want to include the same comment in both cases, plus I phrased it so that it would apply to both cases (i.e. I mentioned "error" instead of calling out error:operation_aborted)

m_evictionTimer->expires_from_now(c_processEvictionsIntervalMs);
auto evictionTimer{m_evictionTimer};
m_evictionTimer->async_wait([this, evictionTimer](boost::system::error_code const& _ec) {
// We can't use m_logger if an error occurred because captured this might be already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also move comment

halfalicious and others added 7 commits March 13, 2019 19:29
Add dedicated node table timers for the discovery and eviction algorithms - this enables us to remove the DeadlineOps timer wrapper class which has a race condition (handlers are executed after their timers have been deleted) which is overly complicated for what we need.

Note that the timers are shared_ptr - this is required to keep timers alive if handlers are executed after the node table has been destroyed. This shouldn't happen in practice since the node table lifetime is tied to the host lifetime and if the host is destroyed then the ioservice is no longer running, but these changes will help avoid use-after-frees if the node table/host relationship changes in the future.

Also, have the node table inherit enable_shared_from_this so that the node table can be kept alive in ping handlers. Similar to the timers case, this isn't needed but is good design in case the node table/host
relationship changes in the future. Note that this change requires that the code which starts the discovery and eviction processes is moved to a separate function (NodeTable::start) because we cannot call shared_from_this() before the
object is fully constructed.
* Fix unreliable network test findNodeIsSentAfterPong - it's unreliable because it expects the first node table to send a FindNode packet to the second node table before the second node table sends a FindNode packet to the first node table. However, there's no guarantee that this will happen because both node tables are running discovery code in separate threads. While starting the first node table before the second node table results in the expected behavior most of the time, occasionally the second node table will send the FindNode packet first resulting in the test failing because the second node table receives a Neighbours packet when it expects a FindNode packet.

I've fixed this by replacing one of the node tables with a TestUDPSocketHost - since there's only 1 discovery process running now we can deterministically know when to expect a FindNode packet to be sent by the node table.

* Add ping validation to test invalidPong Potentially makes test a little easier to debug on failure
Removed unused m_enabled, renamed doDiscover to doDiscoveryRound so there's less confusion between it and doDiscovery, and made the discovery round time interval a constant (expression).

Also, replace std::bind usage with lambdas and remove Node Table function declaration without a definition (doCheckEvictions) and rename doHandleTimeouts -> doProcessEvictions since the latter is more descriptive of what the function does
Cancel timers using boost::asio::io_service::post to ensure that timer
cancellation happens on the network thread
Removing this functionality since it complicated things without a clear benefit (e.g. I had to move discovery/eviction start out of the node table ctor into a separate function which the node table owner would have to call) - after tracing through the steps that occur at aleth/aleth-bootnode and network test exit I think we are safe without this functionality. Additional context:

The fear was that we could hit a case where handlers could be executed after the node table had been destroyed (UAF) but that is not possible since the node table is "owned" by the host and is only destroyed after the host has stopped the network thread which means that the io service has been shut down (i.e. no more Host::doWork() calling io_service::run) and no handlers can be run.

Note that the node table owner does have to call NodeTable::stop() when shutting down, otherwise we can hit a UAF in the UDP socket. I think this is a reasonable requirement.
steady_timer supports std::chrono values and has enough precision on both linux and Windows

Also remove unused node table constant (NodeTable::c_evictionCheckInterval)
@halfalicious
Copy link
Contributor Author

Squashed a bunch of commits and addressed PR feedback

Old name makes more sense since the function handles evictions and clears timed out sent pings. Also update the name of the associated timer and timer interval.

Also fix a bug where we check the node table member variable timer in the doHandleTimeouts lambda, rather than the captured timer.
@halfalicious
Copy link
Contributor Author

@chfast : I'll merge tomorrow AM unless you have any objections.

@halfalicious halfalicious dismissed chfast’s stale review March 14, 2019 15:43

He's already reviewed the core logic of the changes and is currently busy with other things

@halfalicious halfalicious merged commit 68b41bf into master Mar 14, 2019
@halfalicious halfalicious deleted the discov-timer-cancel branch March 14, 2019 15:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discovery timer stops unexpectedly

4 participants