fix additional refrerence can not release when socket is reviving by chenBright · Pull Request #1817 · apache/brpc

chenBright · 2022-06-22T12:30:22Z

fix #1773

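PR #1774 used a locking approach. This PR implements the approach that @wwbmmm proposed in #1774:

Add a reviving state to _recycle_flag. If ReleaseAdditionalReference finds that the state is reviving, it keeps waiting until it sees one of the other two states, and only then goes on to release the additional reference.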
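The waiting behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `RefStatus`, `SketchSocket`, `BeginRevive`, `FinishRevive`, and the `refs` counter are hypothetical names, and `std::this_thread::yield()` stands in for whatever yield primitive the real code uses.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical three-state flag; the names are illustrative only.
enum class RefStatus { kUsing, kReviving, kRecycled };

struct SketchSocket {
    std::atomic<RefStatus> status{RefStatus::kUsing};
    std::atomic<int> refs{1};  // stands in for the additional reference

    // Returns 0 if this call released the additional reference,
    // -1 if the reference had already been released.
    int ReleaseAdditionalReference() {
        for (;;) {
            RefStatus expected = RefStatus::kUsing;
            if (status.compare_exchange_strong(expected, RefStatus::kRecycled,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                refs.fetch_sub(1, std::memory_order_release);
                return 0;
            }
            if (expected == RefStatus::kRecycled) {
                return -1;  // someone else already released it
            }
            // expected == kReviving: the reference is being re-added,
            // so wait until the state settles to one of the other two.
            std::this_thread::yield();
        }
    }

    // Assumed revive path: kRecycled -> kReviving -> kUsing.
    bool BeginRevive() {
        RefStatus expected = RefStatus::kRecycled;
        return status.compare_exchange_strong(expected, RefStatus::kReviving,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire);
    }
    void FinishRevive() {
        refs.fetch_add(1, std::memory_order_relaxed);  // re-add the reference
        status.store(RefStatus::kUsing, std::memory_order_release);
    }
};
```

In this sketch a release that races with a revive spins only for the short window between `BeginRevive` and `FinishRevive`, and then releases the re-added reference instead of silently returning.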

wwbmmm · 2022-06-23T07:54:07Z

LGTM

chenBright · 2022-06-29T07:16:47Z

@wwbmmm Roughly when can this be merged?

chenBright · 2022-07-18T04:19:50Z

@zyearn Could you please take a look at this PR?

zyearn · 2022-07-19T21:26:11Z

@chenBright I'll take a look in the next couple of days.

zyearn · 2022-07-21T21:29:18Z

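A nice-to-have: there will be two health checks (HCs) here. The first is a genuine HC; the second is a false positive caused by this extreme case (the remote server has in fact already recovered). In theory a single HC is enough.

This doesn't need to be implemented in this PR. If you're interested, think about how it could be solved :)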

chenBright · 2022-07-23T09:18:39Z

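Add a flag _is_hc_started (atomic) that defaults to false. When an HC is about to start, call _is_hc_started.compare_exchange_strong(expect, true, butil::memory_order_relaxed) with expect set to false. If the call returns true, no HC is in progress and this one can proceed; otherwise skip it and do not run an HC. When the HC ends, set _is_hc_started back to false. This should avoid running two HCs at the same time.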
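A minimal sketch of the flag described in this comment, using `std::atomic` rather than the `butil` atomics. `HealthCheckGuard`, `TryStart`, and `Finish` are hypothetical names chosen for illustration; only the flag and the compare-exchange idea come from the comment.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper around the proposed _is_hc_started flag.
class HealthCheckGuard {
public:
    // Returns true if the caller may run the HC, false if one is
    // already in progress and this attempt should be skipped.
    bool TryStart() {
        bool expected = false;
        // Relaxed ordering as proposed in the comment; in this sketch
        // the flag only gates entry and publishes no other data.
        return started_.compare_exchange_strong(expected, true,
                                                std::memory_order_relaxed);
    }

    // Called when the HC ends so that a later HC can start.
    void Finish() { started_.store(false, std::memory_order_relaxed); }

private:
    std::atomic<bool> started_{false};
};
```

A caller would run the HC only when `TryStart()` returns true and call `Finish()` once that HC has ended. If the HC publishes other state that later readers depend on, stronger memory orders than relaxed may be needed.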

zyearn · 2022-07-23T20:56:21Z

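Another nice-to-have: there is a case that leads to two consecutive (not concurrent) HCs. SetFailed is scheduled out, another thread's HC finishes, and then SetFailed resumes and runs on to StartHealthCheck. The second HC is of course necessary, because if the server goes down immediately after the first HC, the second one can detect it. In this scenario, however, the second HC will most likely succeed, since the first HC right before it succeeded.

Can we go further and optimize away this second HC? If the server does go down immediately after the first HC, the HC could instead be triggered by the next RPC call the user makes.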
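To make the interleaving concrete, the sketch below replays it in a single thread against a started-flag guard of the kind proposed earlier in this thread. It is only an illustration: `HcCounter`, `hc_runs`, and `ReplayConsecutiveHc` are hypothetical, and the "threads" are just comments marking whose turn it is.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model: a started-flag guard plus a counter of HCs run.
struct HcCounter {
    std::atomic<bool> hc_running{false};
    int hc_runs = 0;

    // Starts an HC unless one is already in progress.
    bool StartHealthCheck() {
        bool expected = false;
        if (!hc_running.compare_exchange_strong(expected, true)) {
            return false;  // a concurrent HC is running; skip
        }
        ++hc_runs;
        return true;
    }
    void FinishHealthCheck() { hc_running.store(false); }
};

// Replays the schedule from the comment and returns how many HCs ran.
int ReplayConsecutiveHc() {
    HcCounter s;
    // Thread A: SetFailed begins, then is scheduled out before it
    // reaches StartHealthCheck.
    // Thread B: its HC starts and completes in the meantime.
    s.StartHealthCheck();
    s.FinishHealthCheck();
    // Thread A resumes and reaches StartHealthCheck. The guard is
    // clear again, so a second, back-to-back HC runs.
    s.StartHealthCheck();
    s.FinishHealthCheck();
    return s.hc_runs;
}
```

The guard rejects only an HC that overlaps a running one, so in this replay both HCs run, which is the consecutive case the comment asks about.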

chenBright · 2022-07-24T11:09:55Z

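There doesn't seem to be a good way to optimize this case. The second SetFailed in this scenario is much the same as a SetFailed in the ordinary scenario. Restoring _versioned_ref requires an HC. If the server goes down immediately after the first HC and no HC is triggered at that point, _versioned_ref cannot be restored, so the user cannot address the socket and therefore cannot make an RPC call either.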

zyearn · 2022-07-24T20:52:18Z

Yes, that is the hard part of implementing this optimization. Let's set it aside for now.

zyearn · 2022-07-25T20:03:46Z

Thanks for the contribution.

wwbmmm reviewed Jun 23, 2022

Comment thread src/brpc/socket.cpp Outdated

Comment thread src/brpc/socket.cpp Outdated

zyearn self-assigned this Jul 1, 2022

zyearn reviewed Jul 21, 2022

Comment thread src/brpc/socket.h Outdated

Comment thread src/brpc/socket.cpp

chenguangming added 2 commits July 23, 2022 17:36

fix additional refrerence can not release when socket is reviving

7c784c2

fix update ReleaseAdditionalReference && use sched_yield instead of bthread_yield

2c0e031

chenBright force-pushed the fix_socket_recyle_flag branch 2 times, most recently from bfc1621 to 485bbf8 Compare July 23, 2022 10:39

avoid run 2 hc at the same time

ec32816

chenBright force-pushed the fix_socket_recyle_flag branch from 485bbf8 to ec32816 Compare July 23, 2022 15:07

zyearn reviewed Jul 24, 2022

Comment thread src/brpc/socket.h Outdated

Comment thread src/brpc/socket.h Outdated

chenguangming added 2 commits July 25, 2022 11:20

fix annotations

1acd991

add HC-related reference infomation to DebugSocket

dfe071d

zyearn merged commit 83fc7b7 into apache:master Jul 25, 2022

chenBright mentioned this pull request Jul 26, 2022

call AfterHCCompleted when WaitAndReset fails #1858

Merged

chenBright deleted the fix_socket_recyle_flag branch July 27, 2022 03:32

CalvinKirs mentioned this pull request Aug 3, 2022

Set protected branches and limit merge methods #1872

Closed

wwbmmm mentioned this pull request Aug 9, 2022

Fix brpc client cannot reconnect to server probabilistically when the… #1774

Closed

lorinlee mentioned this pull request Mar 4, 2023

brpc failure-recovery bug #2146

Open

QinchengZhang mentioned this pull request Jun 30, 2023

Client reports [E110]Fail to connect Socket errors when requesting the server after request traffic becomes high #2294

Open

howzi mentioned this pull request Aug 7, 2025

In an extreme brpc socket scenario, the version is modified but the health check is not started, so the socket cannot establish a new connection #3058

Closed
