fix: avoid bthread mutex deadlock between tx processors and bthread waiters#491
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (8)
WalkthroughThe PR replaces mutex/condition-variable completion coordination with atomic counters and polling waits across CC request types, updates UploadBatch and SK upload flows to use atomic shared state, and adds tests for the new WaitableCc behavior. ChangesLock-free CC request synchronization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…eadlock fix Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)
8424-8452:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake
DbSizeCc::term_atomic before using it as the stale-response fence.
AddRemoteObjSize()now runs lock-free, but it still compares against a plainterm_whileIncTerm()mutates the same field for the next round. A late remote reply can now race the nextIncTerm()and either slip into the new request or get dropped nondeterministically.Suggested fix
int32_t GetTerm() { - return term_; + return term_.load(std::memory_order_acquire); } void IncTerm() { - term_++; + term_.fetch_add(1, std::memory_order_acq_rel); } @@ - if (term != term_) + if (term != term_.load(std::memory_order_acquire)) { return; } @@ - int32_t term_{0}; + std::atomic<int32_t> term_{0};Also applies to: 8491-8493
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tx_service/include/cc/cc_req_misc.h`:
- Around line 924-938: AbortCcRequest stores a non-NO_ERROR into the atomic
error_code_, but Execute currently unconditionally stores NO_ERROR on shard
success and can overwrite a prior failure; change Execute(CcShard &ccs) so it
does not overwrite a previously set non-NO_ERROR by using an atomic
compare-exchange (e.g. load expected = CcErrorCode::NO_ERROR and only store
NO_ERROR via error_code_.compare_exchange_strong if the current value is
NO_ERROR) or simply omit writing NO_ERROR altogether, and ensure FinishOne()
behavior remains unchanged (refer to AbortCcRequest, Execute, error_code_,
FinishOne).
In `@tx_service/include/cc/cc_request.h`:
- Around line 8378-8380: total_obj_sizes_ is currently being updated lock-free
via reinterpret_cast to std::atomic_int64_t and calling atomics (e.g.,
fetch_add/load), which violates the C++ atomic object model; change the storage
so each element is genuinely atomic (replace the underlying std::vector<int64_t>
with std::vector<std::atomic<int64_t>> or use std::atomic_ref against properly
aligned storage) or revert to holding the mutex around non-atomic accesses;
remove any reinterpret_cast<std::atomic_int64_t&>(total_obj_sizes_[idx]) usages
and update callsites that call fetch_add/load on those elements (and related
code paths noted around total_ref_cnt_, remote_ref_cnt_ and the ranges
referenced in the review) to use the real atomic type or guarded access
consistently.
In `@tx_service/src/cc/cc_req_misc.cpp`:
- Around line 513-523: The backoff in ClearCcNodeGroup::Wait stops growing one
step short of the intended max_interval (100000us) because the guard uses a
strict '<' against the doubled interval; change the logic so when doubling would
exceed max_interval you set interval_us to max_interval (or use '<=' to allow
reaching the cap) instead of leaving it at the previous value; update the same
pattern in the similar loop at the other occurrence (around lines shown for
1137-1147) and ensure you refer to done_, interval_us and max_interval in your
fix.
In `@tx_service/src/cc/cc_request.cpp`:
- Line 143: In cc_request.cpp update the LOG(WARNING) message that currently
reads "Waitting timeout for dbsize" to correct the typo — replace "Waitting"
with "Waiting" (i.e., change the LOG(WARNING) string literal). Locate the
LOG(WARNING) call emitting that exact message and adjust the string to "Waiting
timeout for dbsize" (or a clearer variant like "Waiting for dbsize timeout") so
log searches and alerts work as expected.
In `@tx_service/src/sk_generator.cpp`:
- Around line 774-776: res_code is being overwritten unconditionally (e.g., when
setting CcErrorCode::ESTABLISH_NODE_CHANNEL_FAILED), which allows a later
success to reset an earlier failure; change writes to res_code inside
UploadIndexInternal/UploadIndexWorker to latch the first non-NO_ERROR by using
an atomic compare-exchange that only transitions from CcErrorCode::NO_ERROR to
the new error and never writes when res_code is already non-NO_ERROR (do not
allow writes that set it back to NO_ERROR); keep finished_req_cnt updates as-is
and ensure this behavior mirrors UploadBatchCc's contract so a later successful
batch cannot clear an earlier failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2706025e-5534-4d78-8441-9a3e711ad87b
📒 Files selected for processing (8)
tx_service/include/cc/cc_req_misc.htx_service/include/cc/cc_request.htx_service/include/sk_generator.htx_service/src/cc/cc_req_misc.cpptx_service/src/cc/cc_request.cpptx_service/src/remote/cc_node_service.cpptx_service/src/remote/remote_cc_request.cpptx_service/src/sk_generator.cpp
…eadlock fix Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eadlock fix Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@CodeRabbit review |
✅ Action performedReview finished.
|
…aiters Same bug class as the CkptTsCc hang fixed in #483: a bthread::Mutex / ConditionVariable shared between CcRequest::Execute() (which runs on the brpc worker main stack via the tx processor module) and a bthread waiter can deadlock. When the worker main stack sleeps in the mutex's butex queue as a pthread waiter, an unlock may route the wake to a bthread bound to that same blocked worker; the bthread is parked in the worker's bound run queue and can never be scheduled, so both sides hang forever. Timeouts (wait_for) do not help because no timer remains pending in the end state. Convert the affected requests to the same pattern as the CkptTsCc fix: shard-side completion is tracked with atomics and the waiter polls with bthread_usleep exponential backoff. - ActiveTxMaxTsCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc: drop the mutex/cv, use an atomic counter (plus a done flag for ClearCcNodeGroup so the waiter returns only after the last core's cleanup). - WaitableCc: blocking Wait() now polls; IsFinished/IsError/ErrorCode read atomics. mux_ is kept solely for the coroutine yield/resume handshake, whose waiter is a dedicated flush-worker thread, so bthread owners never touch the butex. - DbSizeCc / RemoteDbSizeCc: atomic ref counts; Wait() keeps the give-up-on-remote-after-2s semantics. remote_ref_cnt_ is decremented before total_ref_cnt_ so the waiter never mistakes a pending local core for a missing remote response. - UploadBatchCc: the external coordination state (req_mux/req_cv/ finished_req_cnt/req_result/ng_term) becomes std::atomic references; ValidTermCheck uses a CAS instead of locking a mutex that lives on the owner bthread's stack. The CcNodeService::UploadBatch handler and the sk_generator upload path (SendIndexes, UploadIndexInternal, and the UploadBatchClosure callback, which runs on a bthread) are converted accordingly. Requests with a single in-flight execution and a single waiter (CheckTxStatusCc, WaitNoNakedBucketRefCc, UploadRangeSlicesCc, UploadBatchSlicesCc) are structurally safe and left unchanged, as are paths whose waiter is a dedicated std::thread (UpdateCceCkptTsCc, the flush-data coroutine path). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- WaitableCc: latch the first error in AbortCcRequest with a CAS and stop storing NO_ERROR on every successful Execute, so a later success on another core can no longer erase an earlier failure. - sk_generator: same latch-first-error contract for the shared res_code in SendIndexes (channel failure path and the UploadBatchClosure callback), mirroring UploadBatchCc::SetError. - DbSizeCc: back total_obj_sizes_ with real std::atomic<int64_t> storage instead of reinterpret_cast on a std::vector<int64_t> (undefined behavior per the C++ atomic object model), and make term_ atomic since AddRemoteObjSize races IncTerm lock-free. - Backoff loops: cap the polling interval at max_interval instead of stopping one doubling short of it (also fixes the pre-existing CkptTsCc::Wait). - Fix "Waitting" typo in the dbsize timeout warning log. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Guards the atomic completion counter + first-error latching that replaced the bthread::Mutex + ConditionVariable handshake in the waitable CC requests (ClearTxCc / ActiveTxMaxTsCc / EscalateStandbyCcmCc / DbSizeCc / ClearCcNodeGroup share the same atomic-counter + polling Wait()). WaitableCc is the generic carrier of that pattern. Covers: concurrent multi-core completion (Wait() returns, no lost decrement), first-error-wins latching (a later success must not clear it), Reset re-arming, and a repeated-rounds stress loop. A watchdog aborts with a clear message if Wait() hangs (the deadlock the fix prevents). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1448c93 to
302ef9a
Compare
…eadlock fix Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eadlock fix Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eadlock fix (#491) Pulls in eloqdata/tx_service#491, which fixes the same deadlock class as the CkptTsCc hang (#483) in the remaining CC requests (ActiveTxMaxTsCc, WaitableCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc, DbSizeCc, UploadBatchCc): a bthread::Mutex shared between tx processors (the brpc worker main stack) and bthread waiters can park the worker forever when the butex wake is routed to a bthread bound to that worker. For eloqkv this also covers the Redis dbsize command path. Also picks up the upstream standby replication protocol documentation commit (f62f0fa). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: build eloqkv_to_rdb in the non-cloud (ROCKSDB) config eloqkv_to_rdb is built for WITH_DATA_STORE=ROCKSDB as well as the RocksDB-Cloud backends, but ParseSizeBytes and ShardProgressPrinter were defined only inside `#if ROCKSDB_CLOUD_EXPORT` while main() referenced ParseSizeBytes and ShardProgressPrinter::kMaxScanThreads unconditionally. In a non-cloud build (the cloud data-store macros are undefined, so ROCKSDB_CLOUD_EXPORT is undefined) those symbols do not exist and the tool fails to compile -- a latent regression since #485. ParseSizeBytes parses --write_block_size into write_block_size_bytes, which the non-cloud writer (ParseWorker / Rocksdb2RDB) also reads, so it must be available in both builds: move it (and its <algorithm>/<cctype> includes) out of the cloud-only block and keep its validation in main() unconditional. ShardProgressPrinter is genuinely cloud-only, so guard only its kMaxScanThreads check with ROCKSDB_CLOUD_EXPORT. Verified with -fsyntax-only in both the non-cloud and cloud configs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore: bump data_substrate to merged main (tx_service#491) tx_service#491 (bthread-mutex CC-request deadlock fix) was squash-merged to tx_service main as 4621885. Re-point the submodule from the now-merged feature-branch commit to that main commit. The squash also brings main forward to include tx_service#507 (test-only multi-process cluster harness, under tests/). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * rebase with main --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Problem
Same bug class as the CkptTsCc hang fixed in #483, found in several more CC requests after the issue reappeared in the field.
A
bthread::Mutex/ConditionVariableshared betweenCcRequest::Execute()and a bthread owner can deadlock under our brpc fork:Execute()runs on the brpc worker main stack (viaProcessModulesTask()→TxProcessor::RunOneRound(), with a fixed 1:1 worker↔shard binding), so a contendedlock()there parks the worker as a pthread waiter in the mutex's butex queue.butex_wakeprefers bound-bthread waiters, and bound bthreads can only run on their home worker (_bound_rqis never stolen).butex_wait. The mutex may even end up free — nobody can take it.wait_fortimeouts don't help: in the end state no timer is pending.The dangerous shape is concurrent multi-acquisition (a request broadcast to multiple cores, or one mutex shared by several in-flight requests) combined with a bthread waiter. Single-flight/single-waiter requests are structurally safe.
Fix
Convert the affected requests to the pattern used by the CkptTsCc fix (#483): shard-side completion is tracked with atomics, the waiter polls with
bthread_usleepexponential backoff (100us → 100ms), and no bthread::Mutex is shared between tx processors and bthreads.ActiveTxMaxTsCc(used byResetStandbySequenceIdRPC)Wait()WaitableCc(coversOnLeaderStart/SubscribePrimaryNode/UpdateInMemoryClusterConfig/ standby RPC handlers, etc.)Wait()polls;IsFinished/IsError/ErrorCoderead atomics;mux_retained only for the coroutine yield/resume handshake whose waiter is a flush-workerstd::threadClearCcNodeGroupdone_flag set after the last core's catalog/range/bucket cleanupEscalateStandbyCcmCcWait()ClearTxCcWait()(wasstd::mutex; safe today since the waiter is the recoverystd::thread, but a latent landmine for any future bthread caller)DbSizeCc/RemoteDbSizeCcWait()keeps the give-up-on-remote-after-2s semantics;remote_ref_cnt_decremented beforetotal_ref_cnt_so the give-up condition never fires while a local core is pendingUploadBatchCc+CcNodeService::UploadBatch+ sk_generator upload pathreq_mux/req_cv/finished_req_cnt/req_result/ng_term) replaced withstd::atomicreferences;ValidTermCheck()uses CAS; theUploadBatchClosurecallback (runs on a bthread) updates atomics lock-freeLeft intentionally unchanged (structurally safe):
CheckTxStatusCc,WaitNoNakedBucketRefCc,UploadRangeSlicesCc,UploadBatchSlicesCc(single-flight + single waiter),UpdateCceCkptTsCcand the flush-data coroutine path (waiter is a dedicatedstd::thread).Verification
install/build.shexit 0.UploadBatchduring bucket migration, and the Redisdbsizecommand — these are the paths that previously hung.🤖 Generated with Claude Code
Summary by CodeRabbit
Wait()support for clearer completion handling in CC request paths.