Skip to content

[fix](be) Fix SIGSEGV in bvar::take_sample caused by AgentCombiner/TLS Agent lifetime race under high EPS#64040

Open
vchag wants to merge 4 commits into
apache:masterfrom
vchag:bug/BE-SIGSEGV-bvar-take-sample-high-EPS
Open

[fix](be) Fix SIGSEGV in bvar::take_sample caused by AgentCombiner/TLS Agent lifetime race under high EPS#64040
vchag wants to merge 4 commits into
apache:masterfrom
vchag:bug/BE-SIGSEGV-bvar-take-sample-high-EPS

Conversation

@vchag

@vchag vchag commented Jun 3, 2026

Copy link
Copy Markdown

What problem does this PR solve?

Issue Number: close 63193

Related PR: #2949

Problem Summary:

Under high throughput, a race condition in brpc's bvar subsystem causes a SIGSEGV during take_sample. When a thread's TLS Agent destructs after its owning
AgentCombiner (Reducer, IntRecorder, or Percentile) has already been freed, the agent dereferences a dangling raw pointer in its destructor via
combiner->commit_and_erase(this).

The fix (backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via
shared_ptr. The agent destructor now calls combiner.lock() — if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating
the use-after-free.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@vchag

vchag commented Jun 8, 2026

Copy link
Copy Markdown
Author

Thank you for your contribution to Apache Doris. Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
    BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside bvar::SamplerCollector::run() and is caused by a race condition in brpc 1.4.0's AgentCombiner: when a thread exits while SamplerCollector is iterating the agent list, it dereferences already-freed memory.

At high EPS, the 28 global bvar::Adder<int64_t> instances in metadata_adder.h are updated tens of thousands of times per second across many worker threads, making this race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.

The fix (backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via
shared_ptr. The agent destructor now calls combiner.lock() — if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating
the use-after-free.

  1. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.

The fix (backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via
shared_ptr. The agent destructor now calls combiner.lock() — if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating
the use-after-free.

  1. What features were added. Why was this function added?
    No, feature or functionality updated. Consider a thirdparty library update.
  1. Which code was refactored and why was this part of the code refactored?
    Released a new thirdparty patch to backport a changes made to brpc library to address the bug described above.
  1. Which functions were optimized and what is the difference before and after the optimization?
    These changes make sure the Doris continues to function as expect under high ingestion rate (400-500).

@morningman morningman self-assigned this Jun 17, 2026
@morningman

Copy link
Copy Markdown
Contributor

I am reviewing it, but it may takes sometimes.
Could you provide a way to reproduce this bug?

@vchag

vchag commented Jun 21, 2026

Copy link
Copy Markdown
Author

I am reviewing it, but it may takes sometimes. Could you provide a way to reproduce this bug?

Sure thing, here you go!
Cluster setup:

  • 20 BE nodes, apache/doris:be-4.0.4 (brps 1.4.0, unpatched)

Table shape:

  • Duplicate Key (event_time, event_id)
  • Partition by range(event_time) with dynamic hourly partitions
  • 30+columns with at least 2-3 inverted indexes (wide table)

Load:

  • Synthetics row generator streaming via group_commit=sync_mode Stream Load.
  • 15-20K EPS/BE sustained for at least 20-30 minutes (~300-400k EPS total across 20BEs)

Expected Result:

  • Greater than 1 BE crashed with SIGSEGV in bvar::Reducer::SeriesSampler::take_sample within 30 minutes
  • Kubernetes signal: RESTART counter incrementing on BE Pods
  • Bare-METAL signal: grep SIGSEGV /path/to/be/log/bo.out

Let me know if you need additional details.

@yiguolei

Copy link
Copy Markdown
Contributor

very very great, thanks a lot. A very important bugfix

@yiguolei yiguolei added dev/4.0.x dev/4.1.x usercase Important user case type label p0_c labels Jun 22, 2026
@BiteTheDDDDt

Copy link
Copy Markdown
Contributor

do we also need to backport apache/brpc#3066?

@vchag

vchag commented Jun 25, 2026

Copy link
Copy Markdown
Author

apache/brpc#3066

apache/brpc#3066

@BiteTheDDDDt Maybe be, what problem does this backport address?

@BiteTheDDDDt

Copy link
Copy Markdown
Contributor

#2949 introduced some NPE issues, which were fixed in #3066. I think perhaps these should be backported together.

@vchag

vchag commented Jun 25, 2026

Copy link
Copy Markdown
Author

#2949 introduced some NPE issues, which were fixed in #3066. I think perhaps these should be backported together.

Oh I see, great find. It makes sense to backport these together. I'll make the change tomorrow.

@vchag vchag force-pushed the bug/BE-SIGSEGV-bvar-take-sample-high-EPS branch from c240a77 to 2fcd368 Compare June 26, 2026 00:34
@vchag

vchag commented Jun 26, 2026

Copy link
Copy Markdown
Author

#2949 introduced some NPE issues, which were fixed in #3066. I think perhaps these should be backported together.

Oh I see, great find. It makes sense to backport these together. I'll make the change tomorrow.

@BiteTheDDDDt Done. We backported #3066.

@BiteTheDDDDt

Copy link
Copy Markdown
Contributor

run buildall

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 27, 2026
@github-actions

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.23% (28510/38408)
Line Coverage 58.04% (310227/534511)
Region Coverage 54.69% (258942/473476)
Branch Coverage 56.11% (112654/200781)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.1.x p0_c reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants