Use monotonic time instead of wall time by chenBright · Pull Request #3268 · apache/brpc

chenBright · 2026-04-11T13:53:13Z

What problem does this PR solve?

Issue Number: resolve #3258, related to #2763

Problem Summary:

Some internal modules were using wall-clock time for elapsed-time measurement, periodic scheduling, and timeout-related calculations even though they do not depend on real calendar time. This makes them vulnerable to system clock jumps caused by NTP adjustments, manual time changes, or virtualization time drift.

As discussed in #2763 and observed in #3258, clock jumps can lead to incorrect interval calculations, unstable sampling behavior, and unexpected internal state inconsistencies. This PR addresses the issue by switching the affected modules to monotonic time sources where wall-clock semantics are not required.
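
To illustrate the failure mode described above, here is a minimal sketch using only the C++ standard library. It is not brpc code: `monotonic_us`, `wall_us`, `interval_us` and `interval_with_clock_step_us` are hypothetical helpers, and `std::chrono::steady_clock` stands in for whatever monotonic source the affected modules use. It models what happens to an interval when the wall clock is stepped between two readings.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Monotonic timestamp in microseconds. Only differences between two
// readings are meaningful; the absolute value has no calendar meaning.
inline int64_t monotonic_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch()).count();
}

// Wall-clock timestamp in microseconds since the system clock's epoch.
// This value can jump when the system time is adjusted.
inline int64_t wall_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               system_clock::now().time_since_epoch()).count();
}

// Interval between two readings taken from the same clock.
inline int64_t interval_us(int64_t begin_us, int64_t end_us) {
    return end_us - begin_us;
}

// Models a clock that is stepped by `step_us` (negative = backwards)
// between the two readings, while `real_elapsed_us` actually passes.
inline int64_t interval_with_clock_step_us(int64_t real_elapsed_us,
                                           int64_t step_us) {
    assert(real_elapsed_us >= 0);
    const int64_t begin = 1000000;  // arbitrary starting reading
    const int64_t end = begin + real_elapsed_us + step_us;
    return interval_us(begin, end);
}
```

With a 2-second backward step, a real 500 us interval is computed as `interval_with_clock_step_us(500, -2000000)`, which is negative, while two consecutive `monotonic_us()` readings never produce a negative interval.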

What is changed and the side effects?

Changed:

Side effects:

Performance effects:
Breaking backward compatibility:

Check List:

Please make sure your changes are compilable.
When providing us with a new feature, it is best to add related tests.
Please follow Contributor Covenant Code of Conduct.

wwbmmm · 2026-04-12T02:58:09Z

LGTM

Commit 12fb539 ("Use monotonic time instead of wall time", #3268) switched the three time-source calls in SamplerCollector::run() from gettimeofday_us() to cpuwide_time_ns(), but the surrounding code still treats the timestamps as microseconds:

- abstime += 1000000L now represents 1 ms (not 1 s), causing the sampler to spin at ~1 kHz instead of 1 Hz;
- usleep(abstime - now) receives a nanosecond delta, which usleep() interprets as microseconds.

Use cpuwide_time_us() instead, which preserves the monotonic behavior from #3268 while keeping the existing microsecond-based arithmetic correct.

Fixes #3277.

Co-authored-by: huangjun <huangjun@xsky.com>
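
The unit arithmetic in the sampler commit message can be checked with a small sketch. This is not the SamplerCollector code: the constants and the `period_us` / `wakeups_per_second` helpers are hypothetical names, and the only assumption is that the loop advances its deadline by a fixed 1000000 ticks each round, as the message describes.

```cpp
#include <cassert>
#include <cstdint>

// Ticks per second for the two timestamp units involved.
const int64_t kUsPerSecond = 1000000LL;     // microsecond timestamps
const int64_t kNsPerSecond = 1000000000LL;  // nanosecond timestamps

// The deadline increment from the commit message (abstime += 1000000L).
// It equals one second only when the timestamps are in microseconds.
const int64_t kIntervalTicks = 1000000LL;

// Real-time length, in microseconds, of one deadline increment when
// the clock counts `ticks_per_second` ticks per second.
inline int64_t period_us(int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    return kIntervalTicks * kUsPerSecond / ticks_per_second;
}

// How many deadline increments fit into one second of real time.
inline int64_t wakeups_per_second(int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    return ticks_per_second / kIntervalTicks;
}
```

`period_us(kUsPerSecond)` is 1000000 (one second), whereas `period_us(kNsPerSecond)` is 1000 (one millisecond), which matches the 1 Hz versus ~1 kHz figures quoted above.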

…rs (#3283)

PR #3268 ("Use monotonic time instead of wall time") switched LocalityAwareLoadBalancer::Weight::Update's end_time_us and LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(), but every caller that supplies CallInfo::begin_time_us still uses butil::gettimeofday_us():

- Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC -> Controller::Call::begin_time_us -> SelectIn::begin_time_us -> CallInfo::begin_time_us
- Controller::OnVersionedRPCReturned retry sites (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on backup-request and regular retries

The mismatched time domains make

    latency = end_time_us - ci.begin_time_us
            = cpuwide_now - wallclock_begin ~= -1.7e15 us

trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }` short-circuit on every call. _time_q never accumulates samples, _avg_latency stays at 0, and locality-aware weight feedback is silently disabled.

Visible downstream symptom: cold-start `list://` channels with `lb=la` and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select server from list://...") on retry even when one backend is healthy.

Bisected reproduction in xsky/brpc fork:

- 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2)
- master code + LA-driven multipath probe at 2 backends, max_retry=1, repeat 500x:
  * commit 771de31 (one before #3268): 0/500 fail
  * commit 12fb539 (#3268): 25/500 fail
  * commit 12fb539 + revert only Weight::Update::end_time_us to gettimeofday_us: 0/500 fail

This commit reverts the LA-side of #3268's clock change so the LB lines up with its existing callers again. Channel::CallMethod and the retry paths in Controller stay on butil::gettimeofday_us(), which preserves the wall-clock semantics of Controller::_begin_time_us / Controller::latency_us() that public users rely on.

Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source which drives a series of SelectServer + Feedback cycles against LocalityAwareLoadBalancer (no Server / Channel needed) and asserts that _avg_latency reflects the elapsed time, rather than being stuck at 0 because of a time-source mismatch.

Co-authored-by: huangjun <huangjun@xsky.com>
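
The effect of mixing the two time domains can be reproduced with a simplified model. The `LatencyTracker` struct below is a hypothetical stand-in, not the real LocalityAwareLoadBalancer::Weight class; it only mirrors the `latency <= 0` guard quoted in the commit message, and the timestamp values used with it are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of latency feedback. It keeps a
// running average and drops any sample whose latency is not positive.
struct LatencyTracker {
    int64_t samples = 0;
    int64_t total_latency_us = 0;

    // Returns true if the sample was recorded, false if it was dropped.
    bool Update(int64_t begin_time_us, int64_t end_time_us) {
        const int64_t latency = end_time_us - begin_time_us;
        if (latency <= 0) {
            return false;  // time skews, ignore
        }
        ++samples;
        total_latency_us += latency;
        return true;
    }

    // Average latency in microseconds, or 0 if nothing was recorded.
    int64_t avg_latency_us() const {
        assert(samples >= 0);
        return samples > 0 ? total_latency_us / samples : 0;
    }
};
```

Feeding it a wall-clock begin time (on the order of 1.7e15 us since the Unix epoch) and a much smaller monotonic end time makes every `Update` call return false, so `avg_latency_us()` stays at 0. When both timestamps come from the same clock, for example `Update(5000000000, 5000001500)`, the sample is recorded and the average becomes 1500.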

chenBright force-pushed the monotonic_time branch from c97a737 to 023761c Compare April 11, 2026 14:39

Use monotonic time instead of wall time

0525824

chenBright force-pushed the monotonic_time branch from 023761c to 0525824 Compare April 11, 2026 15:18

chenBright merged commit 12fb539 into apache:master Apr 13, 2026
17 checks passed

chenBright deleted the monotonic_time branch April 13, 2026 06:14

This was referenced Apr 22, 2026

- bvar: sampler busy-loops at ~1 kHz after switch to cpuwide_time_ns (regression from #3268) #3277 (Closed)
- [bvar] fix sampler interval after switch to cpuwide_time_ns #3278 (Merged)

hjwsm1989 mentioned this pull request Apr 25, 2026

- Fix time source mismatch in client RPC path after #3268 migration #3283 (Merged)

zchuango pushed a commit to zchuango/brpc that referenced this pull request May 9, 2026

Use monotonic time instead of wall time (apache#3268)

e9bda71
