Range read block on rw conflict#367
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces immediate abort-on-lock-failure with deadlock-check-driven blocking for range reads; makes DeadLockCheck.requested_check_ atomic with acquire/release semantics; updates Read/Lock ops and tx operations to block-and-retry; adds fault-injection hooks and explicit data-sync abort on range-version mismatch. Changes
Sequence Diagram(s)sequenceDiagram
participant Reader as ReadCc (Range Read)
participant LockMgr as Lock Manager
participant DeadLock as DeadLockCheck
participant Scheduler as Scheduler/Queue
Reader->>LockMgr: Attempt bucket/floor lock
LockMgr-->>Reader: ACQUIRE_LOCK_BLOCKED
Reader->>DeadLock: RequestCheck() (store true, release)
DeadLock-->>Reader: ack
Reader->>Scheduler: BlockByLock(request) (enqueue / mark blocking)
Scheduler-->>Reader: return false / request enqueued
Note over DeadLock,Scheduler: Background Run loads requested_check_ (acquire), gathers deps, detects deadlock, triggers recovery/unblock
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tx_service/include/cc/range_cc_map.h`:
- Around line 233-236: Update the comment to distinguish bucket-lock behavior
from range-lock behavior: explain that range reads block on range lock
contention (allowing deadlock detection to observe wait conditions) but
bucket-lock contention does not block and instead aborts immediately via
AbortQueueRequest(), and note that PostReadCc::Execute properly releases both
locks so resumed reads do not leave bucket locks pinned. Locate and replace the
ambiguous sentence in the comment near the range read handling to explicitly
call out AbortQueueRequest() for bucket locks and PostReadCc::Execute for the
resume path.
In `@tx_service/src/tx_operation.cpp`:
- Around line 4207-4248: The fault-inject branch only marks hd_results_[0] and
leaves commit_acquire_all_write_op_ counters (upload_cnt_, fail_cnt_) and other
hd_results_ untouched, which can leave AcquireAllOp unfinished and hang
SplitFlushRangeOp; to fix, when taking the fault-inject path for
commit_acquire_all_write_op_ iterate over
commit_acquire_all_write_op_.hd_results_ (or use the size from Reset) and for
each hd_result call Reset(), SetToBlock(), and
SetError(CcErrorCode::DEAD_LOCK_ABORT) (or otherwise mark them finished), and
update upload_cnt_ / fail_cnt_ consistently (or shrink upload_cnt_ to match the
number of marked results) so the operation reaches a finished state; ensure
these changes occur before returning (same branch that calls
txm->PushOperation(&commit_acquire_all_write_op_)) so AcquireAllOp and
SplitFlushRangeOp see a consistent completed state.
75c19df to
68cdca0
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tx_service/include/dead_lock_check.h`:
- Line 244: The header declares std::atomic<bool> (requested_check_) but is
missing the <atomic> include; add `#include` <atomic> to dead_lock_check.h (or to
a common precompiled header included by it) so requested_check_ compiles; ensure
the include appears before the declaration of std::atomic in the file.
In `@tx_service/src/tx_operation.cpp`:
- Around line 304-314: The retry logic treats CcErrorCode::DATA_STORE_ERR the
same as RW lock conflicts and immediately requeues the operation, risking an
infinite busy retry loop; change the branch that checks hd_result_->ErrorCode()
so DATA_STORE_ERR is handled separately from
CcErrorCode::ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT: add a per-operation retry
budget counter (e.g., data_store_retry_count_ or reuse a attempts_ field) and on
DATA_STORE_ERR increment it, apply exponential backoff or schedule a delayed
retry (instead of immediate txm->Process(*this)), and when the budget is
exhausted propagate/return the error (or set hd_result_ accordingly) rather than
spinning; keep the blocking immediate retry behavior only for
ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT, and ensure you still call
hd_result_->Value().Reset() / hd_result_->Reset() where appropriate.
68cdca0 to
33e0f48
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tx_service/src/dead_lock_check.cpp (1)
559-571:⚠️ Potential issue | 🟡 MinorAtomic operations are correct; fix comment typos.
The
memory_order_acquireload at line 559 andmemory_order_releasestore at line 569 form proper synchronization pairs withRequestCheck(). The logic correctly gates execution on both the request flag and time interval.Line 568 has typos: "brefore" → "before" and "dependancy" → "dependency".
Proposed fix
- // Reset the check flag brefore gather lock dependancy + // Reset the check flag before gathering lock dependency requested_check_.store(false, std::memory_order_release);
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3818-3825: The branch that handles a DataSync range-version
mismatch currently calls data_sync_task->SetErrorCode(...) and
SetScanTaskFinished(), but other early-exit paths use
data_sync_task->SetError(...) (which also marks the task finished/updates
unfinished_tasks_); to fix, replace the SetErrorCode call with
data_sync_task->SetError(...) passing an appropriate error message/enum (or call
data_sync_task->SetFinish() immediately after SetErrorCode) so the task is
marked finished and callers won't hang; update the branch that also calls
txservice::AbortTx(data_sync_txm) and PopPendingTask(...) to use the same finish
semantics as other error paths.
537f468 to
2c0a931
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tx_service/src/dead_lock_check.cpp`:
- Around line 568-569: Fix the typo in the inline comment above the
requested_check_ store call: change "brefore" to "before" in the comment that
currently reads "// Reset the check flag brefore gather lock dependancy" located
next to requested_check_.store(false, std::memory_order_release); and while here
also correct "dependancy" to "dependency" for clarity.
| // Reset the check flag brefore gather lock dependancy | ||
| requested_check_ = false; | ||
| requested_check_.store(false, std::memory_order_release); |
There was a problem hiding this comment.
Typo: "brefore" → "before".
- // Reset the check flag brefore gather lock dependancy
+ // Reset the check flag before gather lock dependancy📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Reset the check flag brefore gather lock dependancy | |
| requested_check_ = false; | |
| requested_check_.store(false, std::memory_order_release); | |
| // Reset the check flag before gather lock dependancy | |
| requested_check_.store(false, std::memory_order_release); |
🤖 Prompt for AI Agents
In `@tx_service/src/dead_lock_check.cpp` around lines 568 - 569, Fix the typo in
the inline comment above the requested_check_ store call: change "brefore" to
"before" in the comment that currently reads "// Reset the check flag brefore
gather lock dependancy" located next to requested_check_.store(false,
std::memory_order_release); and while here also correct "dependancy" to
"dependency" for clarity.
62a1c26 to
fc18555
Compare
Here are some reminders before you submit the pull request
fixes eloqdb/tx_service#issue_id./mtr --suite=mono_main,mono_multi,mono_basicSummary by CodeRabbit
Bug Fixes
Performance
Chores