[fix](be) Fix SIGSEGV in bvar::take_sample caused by AgentCombiner/TLS Agent lifetime race under high EPS #64040
At high EPS, the 28 global bvar::Adder<int64_t> instances in metadata_adder.h are updated tens of thousands of times per second across many worker threads, which makes this race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.
|
|
I am reviewing it, but it may take some time. |
Sure thing, here you go!
Table shape:
Load:
Expected Result:
Let me know if you need additional details. |
|
very very great, thanks a lot. A very important bugfix |
|
do we also need to backport apache/brpc#3066? |
|
@BiteTheDDDDt Maybe. What problem does that backport address? |
c240a77 to 2fcd368
@BiteTheDDDDt Done. We backported apache/brpc#3066. |
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
BE Regression && UT Coverage Report
Increment line coverage
Increment coverage report
|
What problem does this PR solve?
Issue Number: close 63193
Related PR: apache/brpc#2949
Problem Summary:
Under high throughput, a race condition in brpc's bvar subsystem causes a SIGSEGV during take_sample. When a thread's TLS Agent is destroyed after the AgentCombiner it points to (owned by a Reducer, IntRecorder, or Percentile) has already been freed, the agent's destructor dereferences a dangling raw pointer via combiner->commit_and_erase(this).
The fix (backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via shared_ptr. The agent destructor now calls combiner.lock(); if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating the use-after-free.
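The ownership change can be illustrated with a minimal, self-contained sketch. This is not the brpc code: `Combiner`, `Agent`, and `Owner` are hypothetical stand-ins for AgentCombiner, the TLS Agent, and an owning class such as Reducer, and the agent lives in a `unique_ptr` rather than in thread-local storage so that destruction order can be controlled explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical stand-in for AgentCombiner: accumulates values committed by agents.
class Combiner {
public:
    // Fold a departing agent's local value into the global result.
    void commit_and_erase(int64_t value) {
        std::lock_guard<std::mutex> guard(_mutex);
        _global += value;
    }
    int64_t global() const {
        std::lock_guard<std::mutex> guard(_mutex);
        return _global;
    }
private:
    mutable std::mutex _mutex;
    int64_t _global = 0;
};

// Hypothetical stand-in for the per-thread Agent.
class Agent {
public:
    explicit Agent(const std::shared_ptr<Combiner>& combiner) : _combiner(combiner) {}
    ~Agent() {
        // lock() returns an empty shared_ptr if the combiner is already destroyed,
        // so the destructor skips the commit instead of touching freed memory.
        if (auto combiner = _combiner.lock()) {
            combiner->commit_and_erase(_value);
        }
    }
    void add(int64_t v) { _value += v; }
private:
    std::weak_ptr<Combiner> _combiner;  // was a raw back-pointer before the fix
    int64_t _value = 0;
};

// Hypothetical stand-in for an owning class (the role Reducer plays in the PR).
class Owner {
public:
    Owner() : _combiner(std::make_shared<Combiner>()) {}
    std::unique_ptr<Agent> make_agent() const { return std::make_unique<Agent>(_combiner); }
    int64_t value() const { return _combiner->global(); }
private:
    std::shared_ptr<Combiner> _combiner;
};

// Normal order: the agent is destroyed first and commits its value.
int64_t agent_dies_first() {
    Owner owner;
    {
        auto agent = owner.make_agent();
        agent->add(5);
    }  // ~Agent runs here and commits 5
    return owner.value();
}

// Racy order from the bug report: the owner (and its combiner) is destroyed first.
bool owner_dies_first() {
    auto owner = std::make_unique<Owner>();
    auto agent = owner->make_agent();
    agent->add(7);
    owner.reset();  // last shared_ptr released, combiner destroyed
    agent.reset();  // ~Agent sees an expired weak_ptr and does nothing
    return true;
}
```

With a raw pointer in place of the `weak_ptr`, the second scenario would call `commit_and_erase` on freed memory. In the sketch, a successful `lock()` also keeps the combiner alive for the duration of the commit.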
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)