Skip to content

Fix brpc client cannot reconnect to server probabilistically when the…#1774

Closed
zhaodongzhi wants to merge 1 commit into
apache:masterfrom
zhaodongzhi:fix-client-cannot-reconnect
Closed

Fix brpc client cannot reconnect to server probabilistically when the…#1774
zhaodongzhi wants to merge 1 commit into
apache:masterfrom
zhaodongzhi:fix-client-cannot-reconnect

Conversation

@zhaodongzhi

@zhaodongzhi zhaodongzhi commented Jun 3, 2022

Copy link
Copy Markdown

Issue #1773:Fix brpc client cannot reconnect to server probabilistically when the server restarts

image

Comment thread src/brpc/socket.h
@@ -787,6 +787,7 @@ friend void DereferenceSocket(Socket*);
// Flag used to mark whether additional reference has been decreased
// by either `SetFailed' or `SetRecycle'
butil::atomic<bool> _recycle_flag;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

既然都加锁了,这个就不需要是原子变量了吧?

Comment thread src/brpc/socket.cpp Outdated
}
return;
}
_recycle_mutex.unlock();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以用std::unique_lock包一下,这里就不用手动释放

@zhaodongzhi

Copy link
Copy Markdown
Author

解决方案这里采用增加mutex的方式其实和brpc整体很多无锁的设计有点不太搭。。。。最初也考虑过一个无锁的方式但是发现比较麻烦
本质上来说_recycle_flag为true的时候socket不应该被获取到
最初的想法是将Socket::Address和Socket::AddressFailedAsWell中
if (ver1 == VersionOfSocketId(id)) 返回有效socket ptr
改为
if (ver1 == VersionOfSocketId(id) && !m->_recycle_flag) 才返回有效socket ptr【AddressFailedAsWell中再加一个对if (ver1 == VersionOfSocketId(id) + 1)的处理】
使得上述场景下1,2之间其他request无法通过address获取到socket,也就不会再进入SetFailed产生这个少释放一次ref count的问题了
但是这种做法带来的问题是,部分恢复逻辑依赖通过Socket::AddressFailedAsWell获取socket ptr,如果_recycle_flag为true的时候也让Socket::AddressFailedAsWell获取不到有效的ptr同样会导致socket无法恢复,但如果允许Socket::AddressFailedAsWell忽略_recycle_flag,部分代码中有存在类似
SocketUniquePtr ptr;
const int rc = Socket::AddressFailedAsWell(handle, &ptr);
ptr->ReleaseAdditionalReference();
的逻辑(例如ChannelBalancer::RemoveAndDestroyChannel),还是会有引发这个问题的风险

可以看下是否还有更优雅的不使用mutex的解决方案?

@wwbmmm

wwbmmm commented Jun 6, 2022

Copy link
Copy Markdown
Contributor

if (ver1 == VersionOfSocketId(id)) 返回有效socket ptr
改为
if (ver1 == VersionOfSocketId(id) && !m->_recycle_flag) 才返回有效socket ptr

Address是hot path,加了影响性能。而ReleaseAddtionalReference和Revive相对没有那么hot (除非是短连接),加锁影响小一些。

还有一种解决方案是让recycle_flag有3个值:正常、Revive中、已释放。如果ReleaseAddtionalReference发现状态是“Revive中”就sched_yield一下,直到遇到其它两个状态再继续。因为这种情况发生概率很小,这么改对性能几乎不会有影响。

@zhaodongzhi zhaodongzhi requested a review from wwbmmm June 6, 2022 04:34
@zhaodongzhi

Copy link
Copy Markdown
Author

if (ver1 == VersionOfSocketId(id)) 返回有效socket ptr
改为
if (ver1 == VersionOfSocketId(id) && !m->_recycle_flag) 才返回有效socket ptr

Address是hot path,加了影响性能。而ReleaseAddtionalReference和Revive相对没有那么hot (除非是短连接),加锁影响小一些。

还有一种解决方案是让recycle_flag有3个值:正常、Revive中、已释放。如果ReleaseAddtionalReference发现状态是“Revive中”就sched_yield一下,直到遇到其它两个状态再继续。因为这种情况发生概率很小,这么改对性能几乎不会有影响。

嗯嗯,我也是考虑到这两个函数调用频率比较低,所以最后还是选择了加锁的解决方式,相对来说会比较直观简单些

… server restarts

Problem
-------

Server restart will trigger Socket::SetFailed -> Socket check healthy
-> Socket::WaitAndReset -> Reconnect to server -> Socket::Revive
on the brpc client side

Socket::Revive
-> _versioned_ref.cas(vref, <id_ver, nref + 1>) // step 1
   -> _recycle_flag = false // step 2

After step 1 is successfully executed, the Socket can be acquired by Socket::Address
and the gap between 1 and 2 will cause the ref count  in _versioned_ref to not be
decreased in Socket::ReleaseAdditionalReference since _recycle_flag is true
Finally Socket::WaitAndReset can never wait for the expected ref count to mistakenly
think that there are still another requests holding socket references
and fall into an infinite loop

Example:

Session1(Thread1)                                   Session2(Thread2)

Socket::Revive
T1: _versioned_ref.cas(vref, <id_ver, nref + 1>)
                                                    T2: Socker::Address success
                                                    T3: Server restart or network error happened
                                                    T4: Socket::SetFailed
                                                    T5: Socket::ReleaseAdditionalReference
// Now _recycle_flag is true                        T6: _recycle_flag.cas(false, true)
                                                        -> Failed: No call Socket::Dereference
T7: _recycle_flag = false
                                                    T8: Socket::WaitAndReset
                                                        -> nref always be three, not excepted two
                                                        -> dead loop. cannot reconnected to server

Solution
--------
Add _recycle_mutex to avoid this race condition

Socket::Revive and Socket::ReleaseAdditionalReference will only be called
when some exceptions occur in the socket, so this mutex will not cause
performance degradation
@zhaodongzhi

Copy link
Copy Markdown
Author

您好,请问这里是还有其他需要修改的地方吗?

@wwbmmm

wwbmmm commented Aug 9, 2022

Copy link
Copy Markdown
Contributor

由于PR : #1817 已经解决这个问题,所以这个PR close了。

@wwbmmm wwbmmm closed this Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants