Use monotonic time instead of wall time #3268
Merged
Contributor
LGTM
This was referenced Apr 22, 2026
Closed
hjwsm1989
pushed a commit
to hjwsm1989/brpc
that referenced
this pull request
Apr 23, 2026
Commit 12fb539 ("Use monotonic time instead of wall time", apache#3268) switched the three time-source calls in SamplerCollector::run() from gettimeofday_us() to cpuwide_time_ns(), but the surrounding code still treats the timestamps as microseconds:
- abstime += 1000000L now represents 1 ms (not 1 s), causing the sampler to spin at ~1 kHz instead of 1 Hz;
- usleep(abstime - now) receives a nanosecond delta, which usleep() interprets as microseconds.

Use cpuwide_time_us() instead, which preserves the monotonic behavior from apache#3268 while keeping the existing microsecond-based arithmetic correct.

Fixes apache#3277.
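The unit mismatch described in this commit message can be checked with plain arithmetic. The following is a minimal sketch, not brpc code: `ticks_to_ns`, `usleep_request_ns`, and the `k...` constants are hypothetical names introduced only for illustration, and the sketch covers the unit conversion alone, not the sampler loop itself.

```cpp
#include <cstdint>

// Hypothetical constants and helpers for illustration only; none of these are brpc APIs.
constexpr int64_t kNsPerNanosecondTick = 1;      // a nanosecond clock ticks once per ns
constexpr int64_t kNsPerMicrosecondTick = 1000;  // a microsecond clock ticks once per 1000 ns

// The increment from the loop described above: abstime += 1000000L.
constexpr int64_t kPeriodTicks = 1000000L;

// Real duration, in nanoseconds, of `ticks` on a clock whose tick lasts `ns_per_tick`.
constexpr int64_t ticks_to_ns(int64_t ticks, int64_t ns_per_tick) {
    return ticks * ns_per_tick;
}

// usleep() reads its argument as microseconds, so this is the sleep length
// (in nanoseconds) that is requested when it is handed the raw value `arg`.
constexpr int64_t usleep_request_ns(int64_t arg) {
    return arg * kNsPerMicrosecondTick;
}

// On a microsecond clock the increment is one second.
static_assert(ticks_to_ns(kPeriodTicks, kNsPerMicrosecondTick) == 1000000000LL,
              "1,000,000 us is 1 s");
// On a nanosecond clock the same increment is only one millisecond.
static_assert(ticks_to_ns(kPeriodTicks, kNsPerNanosecondTick) == 1000000LL,
              "1,000,000 ns is 1 ms");
```

Keeping every timestamp, increment, and sleep argument in the same unit (microseconds here) removes both discrepancies at once, which is why swapping only the time-source call is enough for the fix.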
chenBright
pushed a commit
that referenced
this pull request
Apr 23, 2026
Co-authored-by: huangjun <huangjun@xsky.com>
hjwsm1989
pushed a commit
to hjwsm1989/brpc
that referenced
this pull request
Apr 25, 2026
PR apache#3268 ("Use monotonic time instead of wall time") switched LocalityAwareLoadBalancer::Weight::Update's end_time_us and LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(), but every caller that supplies CallInfo::begin_time_us still uses butil::gettimeofday_us():
- Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC -> Controller::Call::begin_time_us -> SelectIn::begin_time_us -> CallInfo::begin_time_us
- Controller::OnVersionedRPCReturned retry sites (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on backup-request and regular retries

The mismatched time domains make
    latency = end_time_us - ci.begin_time_us = cpuwide_now - wallclock_begin ~= -1.7e15 us
trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }` short-circuit on every call. _time_q never accumulates samples, _avg_latency stays at 0, and locality-aware weight feedback is silently disabled.

Visible downstream symptom: cold-start `list://` channels with `lb=la` and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select server from list://...") on retry even when one backend is healthy.

Bisected reproduction in xsky/brpc fork:
- 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2)
- master code + LA-driven multipath probe at 2 backends, max_retry=1, repeat 500x:
  * commit 771de31 (one before apache#3268): 0/500 fail
  * commit 12fb539 (apache#3268): 25/500 fail
  * commit 12fb539 + revert only Weight::Update::end_time_us to gettimeofday_us: 0/500 fail

This commit reverts the LA-side of apache#3268's clock change so the LB lines up with its existing callers again. Channel::CallMethod and the retry paths in Controller stay on butil::gettimeofday_us(), which preserves the wall-clock semantics of Controller::_begin_time_us / Controller::latency_us() that public users rely on.

Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source which drives a series of SelectServer + Feedback cycles against LocalityAwareLoadBalancer (no Server / Channel needed) and asserts that _avg_latency reflects the elapsed time, rather than being stuck at 0 because of a time-source mismatch.
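The time-domain mismatch in this commit message can be reproduced outside brpc. The sketch below is an illustration under assumptions, not the library's code: `wall_us()` and `mono_us()` are hypothetical stand-ins for butil::gettimeofday_us() and butil::cpuwide_time_us() built on `std::chrono`, and `sample_accepted()` only mirrors the `latency <= 0` guard quoted above. The origin of `std::chrono::steady_clock` is unspecified by the standard; on common platforms it is close to boot time, which is what makes the difference so large.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a wall-clock source: microseconds since the Unix epoch.
inline int64_t wall_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
}

// Hypothetical stand-in for a monotonic source: microseconds since an
// unspecified origin (often boot time), unaffected by clock adjustments.
inline int64_t mono_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

// Latency as computed by subtracting a begin timestamp from an end timestamp.
// The result is only meaningful when both come from the same clock.
inline int64_t latency_us(int64_t begin_us, int64_t end_us) {
    return end_us - begin_us;
}

// Mirrors the guard quoted above: a non-positive latency is treated as
// time skew and the sample is dropped.
inline bool sample_accepted(int64_t begin_us, int64_t end_us) {
    return latency_us(begin_us, end_us) > 0;
}
```

With a wall-clock begin near 1.7e15 us and a monotonic end of a few seconds of uptime, `latency_us` is hugely negative and `sample_accepted` returns false every time, so no sample is ever recorded. Taking both timestamps from the same source, whichever it is, restores a small positive latency.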
chenBright
pushed a commit
that referenced
this pull request
Apr 26, 2026
…rs (#3283)
Co-authored-by: huangjun <huangjun@xsky.com>
zchuango
pushed a commit
to zchuango/brpc
that referenced
this pull request
May 9, 2026
What problem does this PR solve?
Issue Number: resolve #3258, related to #2763
Problem Summary:
Some internal modules were using wall-clock time for elapsed-time measurement, periodic scheduling, and timeout-related calculations even though they do not depend on real calendar time. This makes them vulnerable to system clock jumps caused by NTP adjustments, manual time changes, or virtualization time drift.
As discussed in #2763 and observed in #3258, clock jumps can lead to incorrect interval calculations, unstable sampling behavior, and unexpected internal state inconsistencies. This PR addresses the issue by switching the affected modules to monotonic time sources where wall-clock semantics are not required.
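As a rough illustration of the distinction drawn above, elapsed time can be measured against a monotonic clock so that wall-clock jumps do not affect the result. This is a minimal sketch using the standard library's `std::chrono::steady_clock`; the `MonotonicStopwatch` class is a hypothetical name for this example and is not part of brpc, which uses its own butil time helpers.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper for illustration only: measures elapsed time against
// a monotonic clock, so NTP steps or manual changes to the system time do
// not alter the reported interval.
class MonotonicStopwatch {
public:
    MonotonicStopwatch() : _start(std::chrono::steady_clock::now()) {}

    // Microseconds elapsed since construction; never negative because
    // steady_clock does not go backwards.
    int64_t elapsed_us() const {
        const auto delta = std::chrono::steady_clock::now() - _start;
        return std::chrono::duration_cast<std::chrono::microseconds>(delta).count();
    }

private:
    std::chrono::steady_clock::time_point _start;
};

// The standard requires steady_clock to be monotonic.
static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock must be monotonic");
```

A wall clock such as `std::chrono::system_clock` remains the right choice when the value must correspond to calendar time, for example in timestamps shown to users.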
What is changed and the side effects?
Changed:
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: