Describe the bug (描述bug)
当下游节点X故障重启后,集群有时候会出现某个上游节点Y一直无法连接X的情况,其他上游节点在健康检查后会重建与X的连接。例如
- 下游节点10.26.44.32在09:02:17因故障重启后,某个上游节点Y没有重建与10.26.44.32的连接,日志中持续输出"Not connected to 10.26.44.32:8060 yet"
W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection reset by peer [104]
W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
....忽略类似内容....
W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
2)查看netstat发现没有Y与10.26.44.32的TCP连接
3)查看Y的/connections发现Socket状态为Broken,信息如下
$ curl http://localhost:8060/connections | grep 10.26.44.32:8060
Broken |10.26.44.32:8060 |55309|- |- |- |- |- |- |- |- |- |- |- |- |896
$ curl http://localhost:8060/sockets/896
# This is a broken Socket
version=1
shared_part={
ref_count=1
socket_pool=null
creator_socket=896
in_size=316616369
in_num_messages=12120483
out_size=114960271066
out_num_messages=12120511
}
nref=1
nevent=1
fd=4228
tos=0
reset_fd_to_now=485975008182us
remote_side=10.26.44.32:8060
local_side=10.22.180.15:55309
on_et_events=0x1bc5dd0
user=(brpc::InputMessenger*)0x5c0ab40
this_id=896
preferred_index=1 (baidu_std)
hc_count=0
avg_input_msg_size=26
read_buf=0
last_read_to_now=960432766us
last_write_to_now=960412394us
overcrowded=0
id_wait_list={}
parsing_context=0
pipeline_q=0
hc_interval_s=3
ninprocess=1
auth_flag_error=0
auth_id=177098681547473
auth_context=0
logoff_flag=0
recycle_flag=1
agent_socket_id=(none)
cid=0
write_head=0
ssl_state=SSL_OFF
tcpi={
state=7
ca_state=0
retransmits=0
probes=0
backoff=0
options=7
snd_wscale=7
rcv_wscale=7
rto=205000
ato=40000
snd_mss=1448
rcv_mss=736
unacked=0
sacked=0
lost=0
retrans=0
fackets=0
last_data_sent=960413
last_ack_sent=0
last_data_recv=960433
last_ack_recv=960413
pmtu=1500
rcv_ssthresh=52260
rtt=2750
rttvar=3000
snd_ssthresh=18
snd_cwnd=18
advmss=1448
reordering=3
}
4)Y日志中没有"Checking Socket"和"Revived Socket"的日志(集群启用了健康检查,health_check_interval = 3,其他上游节点有Checking和Revived日志)
5)stack中与brpc相关的pthread包括 1个sample线程 + 1个timer线程 + 1个dispatcher线程 + 48个worker线程
Thread 568 (Thread 0x7ff75256e700 (LWP 140858)):
#0 0x00007ff75263480d in nanosleep () from /lib64/libc.so.6
#1 0x00007ff7526650e4 in usleep () from /lib64/libc.so.6
#2 0x0000000001cf10a2 in bvar::detail::SamplerCollector::run (this=0x5342000) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bvar/detail/sampler.cpp:130
#3 0x0000000001cf1839 in bvar::detail::SamplerCollector::sampling_thread (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bvar/detail/sampler.cpp:81
#4 0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#5 0x00007ff75266d88d in clone () from /lib64/libc.so.6
Thread 256 (Thread 0x7ff6b58c8700 (LWP 142170)):
#0 0x00007ff752667ba9 in syscall () from /lib64/libc.so.6
#1 0x0000000001d1ad43 in futex_wait_private (timeout=0x0, expected=257, addr1=0x7ff68572cf18) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/sys_futex.h:37
#2 bthread::internal::FastPthreadMutex::lock_contended (this=0x7ff68572cf18) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/mutex.cpp:661
#3 0x0000000001d0aab4 in lock_guard (__m=..., this=<synthetic pointer>) at /usr/local/include/c++/7.2.0/bits/std_mutex.h:162
#4 erase_from_butex (state=bthread::WAITER_STATE_TIMEDOUT, wakeup=true, bw=0x7ff68572ce20) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/butex.cpp:467
#5 bthread::erase_from_butex_and_wakeup (arg=0x7ff68572ce20) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/butex.cpp:450
#6 0x0000000001d29da0 in bthread::TimerThread::Task::run_and_delete (this=this@entry=0x1737a7080) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:277
#7 0x0000000001d2aa72 in bthread::TimerThread::run (this=0xd18c300) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:391
#8 0x0000000001d2b7e9 in bthread::TimerThread::run_this (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/timer_thread.cpp:117
#9 0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ff75266d88d in clone () from /lib64/libc.so.6
Thread 254 (Thread 0x7ff6b48c6700 (LWP 142172)):
#0 0x00007ff75266de63 in epoll_wait () from /lib64/libc.so.6
#1 0x0000000001bbb64c in brpc::EventDispatcher::Run (this=0x1517d448) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/event_dispatcher.cpp:286
#2 0x0000000001bbb769 in brpc::EventDispatcher::RunThis (arg=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/event_dispatcher.cpp:271
#3 0x0000000001d24af7 in bthread::TaskGroup::task_runner (skip_remained=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:293
#4 0x0000000001d0d891 in bthread_make_fcontext ()
#5 0x00010102464c457f in ?? ()
#6 0x0000000000000000 in ?? ()
Thread 253 (Thread 0x7ff6b40c5700 (LWP 142173)): 相同堆栈的worker线程共48个(bthread_worker_count = 49)
#0 0x00007ff752667ba9 in syscall () from /lib64/libc.so.6
#1 0x0000000001d21d8d in futex_wait_private (timeout=0x0, expected=<optimized out>, addr1=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/sys_futex.h:37
#2 wait (expected_state=..., this=<optimized out>) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/parking_lot.h:57
#3 bthread::TaskGroup::wait_task (this=this@entry=0x6eff6a80, tid=tid@entry=0x7ff6b40c1728) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:121
#4 0x0000000001d252eb in bthread::TaskGroup::run_main_task (this=this@entry=0x6eff6a80) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_group.cpp:149
#5 0x0000000001d1f4b9 in bthread::TaskControl::worker_thread (arg=0x7c7a300) at /data2/olap/gaodayue/palo/thirdparty/src/incubator-brpc-0.9.5/src/bthread/task_control.cpp:73
#6 0x00007ff753268e65 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ff75266d88d in clone () from /lib64/libc.so.6
To Reproduce (复现方法)
生产环境小概率出现,目前需要通过重启上游节点恢复。
Versions (各种版本)
OS: CentOS Linux release 7.1.1503 (Core)
Compiler: gcc (GCC) 7.2.0
brpc: 0.9.5
Describe the bug (描述bug)
当下游节点X故障重启后,集群有时候会出现某个上游节点Y一直无法连接X的情况,其他上游节点在健康检查后会重建与X的连接。例如
2)查看netstat发现没有Y与10.26.44.32的TCP连接
3)查看Y的/connections发现Socket状态为Broken,信息如下
4)Y日志中没有"Checking Socket"和"Revived Socket"的日志(集群启用了健康检查,health_check_interval = 3,其他上游节点有Checking和Revived日志)
5)stack中与brpc相关的pthread包括 1个sample线程 + 1个timer线程 + 1个dispatcher线程 + 48个worker线程
To Reproduce (复现方法)
生产环境小概率出现,目前需要通过重启上游节点恢复。
Versions (各种版本)
OS: CentOS Linux release 7.1.1503 (Core)
Compiler: gcc (GCC) 7.2.0
brpc: 0.9.5