Fix deadline timer race condition which was causing premature discovery termination #5502
halfalicious merged 8 commits into master
Conversation
Force-pushed from a76a849 to d8ff0f1
Codecov Report
@@            Coverage Diff             @@
##           master    #5502      +/-   ##
==========================================
- Coverage   61.89%   61.86%   -0.03%
==========================================
  Files         345      345
  Lines       28736    28725      -11
  Branches     3265     3261       -4
==========================================
- Hits        17785    17770      -15
- Misses       9782     9796      +14
+ Partials     1169     1159      -10
Force-pushed from da4ba10 to c61ba26
Force-pushed from e95000d to 18ae15b
Rebased
Force-pushed from 6970220 to 672e9a1
The following 2 tests are failing due to near-null AVs:
The issues seem specific to clang... I can't repro locally on Windows with full page heap enabled, and the Linux GCC CircleCI run passed.
Bad news, man. It crashes when testing in parallel. I could not reproduce it at first because the tests must run in parallel to cause the crash. In the end I compiled with clang + ASan + libc++ on Linux. The report:
It looks like the shared pointer is deleted, but something still holds a pointer to the object.
Ah, I think the problem is in the DeadlineOps::reap() handler - it checks the error code and m_stopped before deciding to proceed, but the DeadlineOps dtor doesn't cancel timers before deleting them, which means that the error code won't be set and attempting to check m_stopped will trigger an AV.
Can't you just check the error code first? |
Force-pushed from 672e9a1 to 9c03d48
We do, but it doesn't indicate operation_aborted since the timer(s) haven't been cancelled and are still alive because of the shared_ptr captured by the handler.
I've rebased my changes after merging @chfast's changes which queue pings via io_service::post, and scoped them down to just adding the dedicated timers for the discovery and eviction processes. I've also removed DeadlineOps since we no longer need it. I'm going to test my changes by running discovery overnight and verifying no premature cancellation, but I don't envision any problems.
chfast
left a comment
This looks good to me, but there are some test failures that must be investigated.
Moreover, check out this tutorial about triggering timers periodically (we are close to this, but maybe you can spot some small differences): https://www.boost.org/doc/libs/1_69_0/doc/html/boost_asio/tutorial/tuttimer3.html
Also, they are using steady_timer instead of deadline_timer. The difference is only in the clock used. We don't need high precision here, so maybe one is better than the other for our case.
Force-pushed from 3dbf8d6 to 72fe8c4
One test is failing in linux-clang6:
I think it's just an unreliable test; I'll file a new issue for tracking purposes. No tests are failing in the AppVeyor config, but I'm seeing the following error which is causing the config to fail: I recall seeing this before... @chfast: Is this a known issue?
Force-pushed from 804b409 to 1df1040
Force-pushed from f5571f2 to de9bffc
Looks like the linux-clang failure is unrelated to my changes; all unit tests passed, but the sstore test appears to have timed out.
gumb0
left a comment
A couple of minor comments, otherwise looks good
m_discoveryTimer->expires_from_now(c_bucketRefreshMs);
auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait([this, discoveryTimer](boost::system::error_code const& _ec) {
    // We can't use m_logger if an error occurred because captured this might be already
I'd move this comment inside the brackets, before the clog(...) <<
auto discoveryTimer{m_discoveryTimer};
m_discoveryTimer->async_wait(
    [this, discoveryTimer, _node, _round, _tried](boost::system::error_code const& _ec) {
        // We can't use m_logger here if there's an error because captured this might already be
So I purposely put this comment above the "if" case since it applies to both the "if" and "else if", and I didn't want to include the same comment in both cases. I also phrased it so that it would apply to both cases (i.e. I mentioned "error" instead of calling out error::operation_aborted).
m_evictionTimer->expires_from_now(c_processEvictionsIntervalMs);
auto evictionTimer{m_evictionTimer};
m_evictionTimer->async_wait([this, evictionTimer](boost::system::error_code const& _ec) {
    // We can't use m_logger if an error occurred because captured this might be already
Add dedicated node table timers for the discovery and eviction algorithms - this enables us to remove the DeadlineOps timer wrapper class, which has a race condition (handlers can be executed after their timers have been deleted) and is overly complicated for what we need.

Note that the timers are shared_ptr - this is required to keep timers alive if handlers are executed after the node table has been destroyed. This shouldn't happen in practice, since the node table lifetime is tied to the host lifetime and if the host is destroyed then the io service is no longer running, but these changes will help avoid use-after-frees if the node table/host relationship changes in the future.

Also, have the node table inherit enable_shared_from_this so that the node table can be kept alive in ping handlers. As with the timers, this isn't needed but is good design in case the node table/host relationship changes in the future. Note that this change requires moving the code which starts the discovery and eviction processes into a separate function (NodeTable::start), because we cannot call shared_from_this() before the object is fully constructed.
* Fix unreliable network test findNodeIsSentAfterPong - it's unreliable because it expects the first node table to send a FindNode packet to the second node table before the second node table sends a FindNode packet to the first. However, there's no guarantee that this will happen, because both node tables are running discovery code in separate threads. While starting the first node table before the second results in the expected behavior most of the time, occasionally the second node table sends the FindNode packet first, causing the test to fail because the second node table receives a Neighbours packet when it expects a FindNode packet. I've fixed this by replacing one of the node tables with a TestUDPSocketHost - since there's only 1 discovery process running now, we can deterministically know when to expect a FindNode packet to be sent by the node table.
* Add ping validation to test invalidPong - potentially makes the test a little easier to debug on failure.
Removed unused m_enabled, renamed doDiscover to doDiscoveryRound so there's less confusion between it and doDiscovery, and made the discovery round time interval a constant (expression). Also replaced std::bind usage with lambdas, removed a NodeTable function declaration without a definition (doCheckEvictions), and renamed doHandleTimeouts -> doProcessEvictions since the latter is more descriptive of what the function does.
Cancel timers using boost::asio::io_service::post to ensure that timer cancellation happens on the network thread
Removing this functionality since it complicated things without a clear benefit (e.g. I had to move discovery/eviction start out of the node table ctor into a separate function which the node table owner would have to call) - after tracing through the steps that occur at aleth/aleth-bootnode and network test exit, I think we are safe without this functionality.

Additional context: the fear was that we could hit a case where handlers are executed after the node table has been destroyed (a use-after-free), but that is not possible, since the node table is "owned" by the host and is only destroyed after the host has stopped the network thread, which means that the io service has been shut down (i.e. no more Host::doWork() calling io_service::run) and no handlers can be run. Note that the node table owner does have to call NodeTable::stop() when shutting down, otherwise we can hit a use-after-free in the UDP socket; I think this is a reasonable requirement.
steady_timer supports std::chrono values and has enough precision on both Linux and Windows. Also remove an unused node table constant (NodeTable::c_evictionCheckInterval).
Force-pushed from de9bffc to 00ef44d
Squashed a bunch of commits and addressed PR feedback
The old name makes more sense since the function handles evictions and clears timed-out sent pings. Also update the name of the associated timer and timer interval, and fix a bug where we checked the node table member timer in the doHandleTimeouts lambda rather than the captured timer.
Force-pushed from 00ef44d to add83b7
@chfast: I'll merge tomorrow AM unless you have any objections.
He's already reviewed the core logic of the changes and is currently busy with other things.
Fix #5495
Fix the boost deadline timer race condition which was causing the discovery process to terminate (in some cases extremely) prematurely. The race condition occurred because DeadlineOps would check for timer expiration (deadline_timer::expires_from_now() < 0) and delete expired timers (see DeadlineOps::reap()), but their handlers may not have executed yet, which means that when they finally executed their boost error code would indicate operation aborted and the handler would return before executing any of its core logic.

To fix this, I created 2 dedicated deadline timers in NodeTable - 1 for the discovery process and 1 for the node eviction process. I think this makes sense since these are 2 algorithms which are supposed to be running for as long as the node table is running... additionally, having dedicated timers here makes the code easier to understand. I also only check the boost error code in these timer handlers, since the timers are cancelled in NodeTable::stop() (which is executed on both network shutdown and on NodeTable destruction), meaning that the error code is guaranteed to indicate timer cancellation when a handler is executed after network shutdown or node table destruction.

I've also removed the DeadlineOps class since it's no longer needed.