Use monotonic time instead of wall time #3268

Merged
chenBright merged 1 commit into apache:master from chenBright:monotonic_time on Apr 13, 2026

@chenBright

Contributor

What problem does this PR solve?

Issue Number: resolve #3258, related to #2763

Problem Summary:

Some internal modules were using wall-clock time for elapsed-time measurement, periodic scheduling, and timeout-related calculations even though they do not depend on real calendar time. This makes them vulnerable to system clock jumps caused by NTP adjustments, manual time changes, or virtualization time drift.

As discussed in #2763 and observed in #3258, clock jumps can lead to incorrect interval calculations, unstable sampling behavior, and unexpected internal state inconsistencies. This PR addresses the issue by switching the affected modules to monotonic time sources where wall-clock semantics are not required.
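
The distinction above can be sketched with the standard library alone. This is a minimal illustration, not code from this PR: `monotonic_us()`, `wall_us()`, and `elapsed_us()` are hypothetical helpers built on `std::chrono`, standing in for the project's own time utilities.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: microseconds read from a monotonic clock.
// The absolute value has no calendar meaning; only differences are useful.
inline int64_t monotonic_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Hypothetical helper: wall-clock microseconds. This value can jump
// forwards or backwards when the system time is adjusted.
inline int64_t wall_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
}

// Measures how long a callable takes, using the monotonic clock so that
// a clock adjustment during the call cannot produce a negative or
// inflated interval.
template <typename Fn>
int64_t elapsed_us(Fn&& fn) {
    const int64_t begin = monotonic_us();
    fn();
    return monotonic_us() - begin;
}
```

Intervals and periodic deadlines would use `monotonic_us()`, while anything shown to users as a calendar timestamp would keep using `wall_us()`.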

What is changed and the side effects?

Changed:

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

@wwbmmm

wwbmmm commented Apr 12, 2026

Contributor

LGTM

chenBright merged commit 12fb539 into apache:master on Apr 13, 2026
17 checks passed
chenBright deleted the monotonic_time branch on April 13, 2026 06:14
hjwsm1989 pushed a commit to hjwsm1989/brpc that referenced this pull request Apr 23, 2026
Commit 12fb539 ("Use monotonic time instead of wall time", apache#3268)
switched the three time-source calls in SamplerCollector::run() from
gettimeofday_us() to cpuwide_time_ns(), but the surrounding code still
treats the timestamps as microseconds:

- abstime += 1000000L now represents 1 ms (not 1 s), causing the
  sampler to spin at ~1 kHz instead of 1 Hz;
- usleep(abstime - now) receives a nanosecond delta, which usleep()
  interprets as microseconds.

Use cpuwide_time_us() instead, which preserves the monotonic behavior
from apache#3268 while keeping the existing microsecond-based arithmetic
correct.

Fixes apache#3277.
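
The unit mismatch described in this commit message can be reproduced with plain arithmetic. The sketch below is not the actual `SamplerCollector::run()` code; `kTickUs`, `tick_rate_hz()`, and `next_sleep_us()` are hypothetical names that only mirror the `abstime += 1000000L` and `usleep(abstime - now)` pattern quoted above.

```cpp
#include <cstdint>

// The tick constant from the message: 1,000,000 units per tick.
// It means one second only if timestamps are in microseconds.
constexpr int64_t kTickUs = 1000000L;

// How many ticks fit into one second for a clock that reports
// `units_per_second` units (1e6 for microseconds, 1e9 for nanoseconds).
inline int64_t tick_rate_hz(int64_t tick, int64_t units_per_second) {
    return units_per_second / tick;
}

// Advances the absolute deadline by one tick and returns the remaining
// time until it, in the same unit as the inputs. The result is only a
// valid usleep() argument when both inputs are in microseconds.
inline int64_t next_sleep_us(int64_t& abstime_us, int64_t now_us) {
    abstime_us += kTickUs;
    return abstime_us > now_us ? abstime_us - now_us : 0;
}
```

With a microsecond clock, `tick_rate_hz(kTickUs, 1000000)` is 1, the intended 1 Hz. With a nanosecond clock, `tick_rate_hz(kTickUs, 1000000000)` is 1000, which matches the ~1 kHz spin reported here.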
chenBright pushed a commit that referenced this pull request Apr 23, 2026
Co-authored-by: huangjun <huangjun@xsky.com>
hjwsm1989 pushed a commit to hjwsm1989/brpc that referenced this pull request Apr 25, 2026
PR apache#3268 ("Use monotonic time instead of wall time") switched
LocalityAwareLoadBalancer::Weight::Update's end_time_us and
LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(),
but every caller that supplies CallInfo::begin_time_us still uses
butil::gettimeofday_us():

  - Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC ->
    Controller::Call::begin_time_us -> SelectIn::begin_time_us ->
    CallInfo::begin_time_us
  - Controller::OnVersionedRPCReturned retry sites
    (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on
    backup-request and regular retries

The mismatched time domains make

    latency = end_time_us - ci.begin_time_us
            = cpuwide_now - wallclock_begin
            ~= -1.7e15 us

trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }`
short-circuit on every call. _time_q never accumulates samples,
_avg_latency stays at 0, and locality-aware weight feedback is silently
disabled.
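
The effect of mixing the two time domains can be shown with fixed numbers. This is a simplified sketch, not brpc's `Weight` class: `LatencyStats` is a hypothetical accumulator that only reproduces the quoted `latency <= 0` guard. The roughly -1.7e15 us figure arises because wall-clock microseconds since the Unix epoch are on the order of 1.7e15, while a monotonic clock typically counts from an arbitrary starting point and is far smaller.

```cpp
#include <cstdint>

// Hypothetical, simplified latency accumulator.
struct LatencyStats {
    int64_t total_us = 0;
    int64_t samples = 0;

    // Drops any sample whose latency is not positive, as in the guard
    // quoted above. Returns true if the sample was recorded.
    bool record(int64_t begin_time_us, int64_t end_time_us) {
        const int64_t latency = end_time_us - begin_time_us;
        if (latency <= 0) {
            return false;  // treated as time skew and ignored
        }
        total_us += latency;
        ++samples;
        return true;
    }

    // Average of the recorded samples; 0 when nothing was recorded.
    int64_t avg_us() const {
        return samples == 0 ? 0 : total_us / samples;
    }
};
```

Calling `record(1700000000000000LL, 5000000LL)`, a wall-clock begin paired with a monotonic end, yields a large negative latency, so the sample is dropped and `avg_us()` stays at 0. Calling `record(1000, 1500)` with both values from the same clock records a 500 us sample.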

Visible downstream symptom: cold-start `list://` channels with `lb=la`
and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select
server from list://...") on retry even when one backend is healthy.
Bisected reproduction in xsky/brpc fork:

  - 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2)
  - master code + LA-driven multipath probe at 2 backends, max_retry=1,
    repeat 500x:
      * commit 771de31 (one before apache#3268): 0/500 fail
      * commit 12fb539 (apache#3268):           25/500 fail
      * commit 12fb539 + revert only Weight::Update::end_time_us to
        gettimeofday_us:                    0/500 fail

This commit reverts the LA-side of apache#3268's clock change so the LB lines
up with its existing callers again. Channel::CallMethod and the retry
paths in Controller stay on butil::gettimeofday_us(), which preserves
the wall-clock semantics of Controller::_begin_time_us /
Controller::latency_us() that public users rely on.

Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source
which drives a series of SelectServer + Feedback cycles against
LocalityAwareLoadBalancer (no Server / Channel needed) and asserts
that _avg_latency reflects the elapsed time, rather than being stuck
at 0 because of a time-source mismatch.
chenBright pushed a commit that referenced this pull request Apr 26, 2026
…rs (#3283)


Co-authored-by: huangjun <huangjun@xsky.com>
zchuango pushed a commit to zchuango/brpc that referenced this pull request May 9, 2026
zchuango pushed a commit to zchuango/brpc that referenced this pull request May 9, 2026
)

zchuango pushed a commit to zchuango/brpc that referenced this pull request May 9, 2026
…rs (apache#3283)


Development

Successfully merging this pull request may close these issues.

percentile.h:103] Check failed: _num_samples == _num_added (254 vs. 4294967280)

2 participants