Range read block on rw conflict #367

Merged
githubzilla merged 26 commits into main from range_read_block_on_rw_conflict on Feb 11, 2026
@githubzilla githubzilla commented Jan 17, 2026

Collaborator

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Bug Fixes

    • Lock acquisition blocks now trigger coordinated deadlock checks and move requests into a blocking/retry state (including resumed range reads) instead of immediate aborts.
    • Resumed requests correctly release or reuse prior locks before retrying.
    • Range data-sync mismatches now abort and clean up active sync work.
  • Performance

    • Deadlock-check signaling made atomic/thread-safe to avoid races.
  • Chores

    • Added diagnostics and fault‑injection hooks for debugging.

coderabbitai Bot commented Jan 17, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaces immediate abort-on-lock-failure with deadlock-check-driven blocking for range reads; makes DeadLockCheck.requested_check_ atomic with acquire/release semantics; updates Read/Lock ops and tx operations to block-and-retry; adds fault-injection hooks and explicit data-sync abort on range-version mismatch.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Deadlock detection core: tx_service/include/dead_lock_check.h, tx_service/src/dead_lock_check.cpp | requested_check_ changed from bool to std::atomic<bool>; RequestCheck() uses store(..., memory_order_release); Run/wait logic uses load(..., memory_order_acquire) and updated wait predicates. |
| Range CC lock handling: tx_service/include/cc/range_cc_map.h | Bucket/floor ACQUIRE_LOCK_BLOCKED paths now call DeadLockCheck::RequestCheck() and transition requests via ReadCc::BlockByLock(...) (enqueue/block) instead of immediate release+abort; header now includes dead_lock_check.h. |
| Transaction operations & fault-injection: tx_service/src/tx_operation.cpp | ReadLocalOperation/LockWriteRangeBucketsOp adopt blocking-retry semantics on lock conflicts (reset state and re-issue via txm->Process); added DLOG adjustments and CODE_FAULT_INJECTOR scaffolding in SplitFlushRangeOp to simulate deadlock aborts. |
| Range partition data-sync handling: tx_service/src/cc/local_cc_shards.cpp | On range-version mismatch during data sync, explicitly aborts the active data-sync transaction, sets range error (GET_RANGE_ID_ERR), finishes the scan task, and pops the pending task before returning. |
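
As a rough illustration of the deadlock-detection row, the release-store / acquire-load pairing on the request flag could look like the sketch below. Only `requested_check_`, `RequestCheck()`, and the memory orders come from the summary; `CheckSignal`, `WaitForRequest`, and the condition-variable wiring are hypothetical names made up for this example and are not taken from `dead_lock_check.cpp`.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the checker's request flag (not the real class).
class CheckSignal
{
public:
    // Called by any worker thread that sees a blocked lock request.
    void RequestCheck()
    {
        requested_check_.store(true, std::memory_order_release);
        {
            // Briefly taking the mutex avoids a lost wakeup if the checker
            // is between evaluating its predicate and going to sleep.
            std::lock_guard<std::mutex> lk(mux_);
        }
        cv_.notify_one();
    }

    // Called by the checker thread. Returns true if a check was requested
    // before the timeout, and consumes the request.
    bool WaitForRequest(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lk(mux_);
        bool requested = cv_.wait_for(lk, timeout, [this] {
            return requested_check_.load(std::memory_order_acquire);
        });
        if (requested)
        {
            // Reset before gathering dependencies so that a request arriving
            // during the check triggers another round.
            requested_check_.store(false, std::memory_order_release);
        }
        return requested;
    }

private:
    std::atomic<bool> requested_check_{false};
    std::mutex mux_;
    std::condition_variable cv_;
};
```

Because the flag is atomic, `RequestCheck()` does not need to hold the mutex while setting it; the mutex is only touched to order the notification against the waiter.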

Sequence Diagram(s)

sequenceDiagram
    participant Reader as ReadCc (Range Read)
    participant LockMgr as Lock Manager
    participant DeadLock as DeadLockCheck
    participant Scheduler as Scheduler/Queue

    Reader->>LockMgr: Attempt bucket/floor lock
    LockMgr-->>Reader: ACQUIRE_LOCK_BLOCKED

    Reader->>DeadLock: RequestCheck() (store true, release)
    DeadLock-->>Reader: ack

    Reader->>Scheduler: BlockByLock(request) (enqueue / mark blocking)
    Scheduler-->>Reader: return false / request enqueued

    Note over DeadLock,Scheduler: Background Run loads requested_check_ (acquire), gathers deps, detects deadlock, triggers recovery/unblock
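
The block-instead-of-abort flow in the diagram can be modelled with a toy lock, shown here only as a sketch. `ToyLock`, `AcquireResult`, and `CheckRequested()` are hypothetical and stand in for the real lock manager, `ReadCc::BlockByLock(...)`, and `DeadLockCheck::RequestCheck()`; the real code tracks far more state.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

enum class AcquireResult { Granted, Blocked };

// Hypothetical single exclusive lock with a FIFO wait queue.
class ToyLock
{
public:
    // Tries to take the lock for `tx`. On conflict the request is queued
    // (not aborted) and a deadlock check is requested.
    AcquireResult Acquire(uint64_t tx)
    {
        if (!holder_ || *holder_ == tx)
        {
            holder_ = tx;
            return AcquireResult::Granted;
        }
        waiters_.push_back(tx);
        check_requested_ = true;
        return AcquireResult::Blocked;
    }

    // Releases the lock held by `tx` and hands it to the oldest waiter.
    // Returns the transaction that should be resumed, if any.
    std::optional<uint64_t> Release(uint64_t tx)
    {
        if (!holder_ || *holder_ != tx)
        {
            return std::nullopt;
        }
        holder_.reset();
        if (waiters_.empty())
        {
            return std::nullopt;
        }
        holder_ = waiters_.front();
        waiters_.pop_front();
        return holder_;
    }

    bool CheckRequested() const { return check_requested_; }

private:
    std::optional<uint64_t> holder_;
    std::deque<uint64_t> waiters_;
    bool check_requested_{false};
};
```

A resumed transaction already owns the lock when it retries, so its second `Acquire` call returns `Granted`, which mirrors the "reuse prior locks before retrying" behaviour described in the summary.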

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • liunyl
  • MrGuin

Poem

🐰 I set a tiny flag to true,
I let the readers pause in queue,
No sudden abort — just patient check,
I sniff for knots and guide the trek,
Then tap the locks so work goes through. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request description contains only an unchecked pre-submission checklist template with no implementation details, rationale, or changes explanation provided. | Replace the checklist template with a detailed description explaining the purpose, design changes, affected components, and impact. Complete and check off the actual pre-submission items. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Range read block on rw conflict' directly relates to the main changes: enabling range reads to block on read-write conflicts via deadlock detection instead of aborting. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tx_service/include/cc/range_cc_map.h`:
- Around line 233-236: Update the comment to distinguish bucket-lock behavior
from range-lock behavior: explain that range reads block on range lock
contention (allowing deadlock detection to observe wait conditions) but
bucket-lock contention does not block and instead aborts immediately via
AbortQueueRequest(), and note that PostReadCc::Execute properly releases both
locks so resumed reads do not leave bucket locks pinned. Locate and replace the
ambiguous sentence in the comment near the range read handling to explicitly
call out AbortQueueRequest() for bucket locks and PostReadCc::Execute for the
resume path.

In `@tx_service/src/tx_operation.cpp`:
- Around line 4207-4248: The fault-inject branch only marks hd_results_[0] and
leaves commit_acquire_all_write_op_ counters (upload_cnt_, fail_cnt_) and other
hd_results_ untouched, which can leave AcquireAllOp unfinished and hang
SplitFlushRangeOp; to fix, when taking the fault-inject path for
commit_acquire_all_write_op_ iterate over
commit_acquire_all_write_op_.hd_results_ (or use the size from Reset) and for
each hd_result call Reset(), SetToBlock(), and
SetError(CcErrorCode::DEAD_LOCK_ABORT) (or otherwise mark them finished), and
update upload_cnt_ / fail_cnt_ consistently (or shrink upload_cnt_ to match the
number of marked results) so the operation reaches a finished state; ensure
these changes occur before returning (same branch that calls
txm->PushOperation(&commit_acquire_all_write_op_)) so AcquireAllOp and
SplitFlushRangeOp see a consistent completed state.
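
A minimal sketch of the suggested fix, assuming a simplified operation that owns one result per shard. `ToyAcquireAllOp`, `ToyResult`, `finished_cnt`, and `InjectDeadlockAbort()` are hypothetical; the real `AcquireAllOp` uses `hd_results_`, `upload_cnt_`, and `fail_cnt_` with semantics that are not reproduced here.

```cpp
#include <cstddef>
#include <vector>

enum class ToyError { NoError, DeadLockAbort };

struct ToyResult
{
    bool finished{false};
    ToyError error{ToyError::NoError};
};

// Hypothetical multi-shard acquire operation.
struct ToyAcquireAllOp
{
    std::vector<ToyResult> results;
    std::size_t finished_cnt{0};
    std::size_t fail_cnt{0};

    void Reset(std::size_t shard_cnt)
    {
        results.assign(shard_cnt, ToyResult{});
        finished_cnt = 0;
        fail_cnt = 0;
    }

    // Fault-injection path: mark every result as failed and finished, and
    // keep the counters in step so the operation can complete.
    void InjectDeadlockAbort()
    {
        for (ToyResult &res : results)
        {
            if (!res.finished)
            {
                res.finished = true;
                ++finished_cnt;
            }
            if (res.error == ToyError::NoError)
            {
                res.error = ToyError::DeadLockAbort;
                ++fail_cnt;
            }
        }
    }

    bool IsFinished() const { return finished_cnt == results.size(); }
};
```

Marking only `results[0]` in this model would leave `IsFinished()` false for any operation spanning more than one shard, which is the hang the comment warns about.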

Comment thread tx_service/include/cc/range_cc_map.h
Comment thread tx_service/src/tx_operation.cpp
@githubzilla force-pushed the range_read_block_on_rw_conflict branch from 75c19df to 68cdca0 on January 30, 2026 10:52

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tx_service/include/dead_lock_check.h`:
- Line 244: The header declares std::atomic<bool> (requested_check_) but is
missing the <atomic> include; add `#include <atomic>` to dead_lock_check.h (or to
a common precompiled header included by it) so requested_check_ compiles; ensure
the include appears before the declaration of std::atomic in the file.

In `@tx_service/src/tx_operation.cpp`:
- Around line 304-314: The retry logic treats CcErrorCode::DATA_STORE_ERR the
same as RW lock conflicts and immediately requeues the operation, risking an
infinite busy retry loop; change the branch that checks hd_result_->ErrorCode()
so DATA_STORE_ERR is handled separately from
CcErrorCode::ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT: add a per-operation retry
budget counter (e.g., data_store_retry_count_ or reuse an attempts_ field) and on
DATA_STORE_ERR increment it, apply exponential backoff or schedule a delayed
retry (instead of immediate txm->Process(*this)), and when the budget is
exhausted propagate/return the error (or set hd_result_ accordingly) rather than
spinning; keep the blocking immediate retry behavior only for
ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT, and ensure you still call
hd_result_->Value().Reset() / hd_result_->Reset() where appropriate.
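
The policy described above could be sketched roughly as follows. Everything here is an assumption for illustration: `RetryPolicy`, `RetryDecision`, `ToyErr`, and the budget/backoff parameters do not exist in `tx_operation.cpp`, and the real fix may schedule the delayed retry differently.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

enum class ToyErr { RwConflict, DataStoreErr };
enum class RetryAction { RetryNow, RetryAfterDelay, GiveUp };

struct RetryDecision
{
    RetryAction action;
    std::chrono::milliseconds delay;
};

// Hypothetical per-operation retry policy.
class RetryPolicy
{
public:
    RetryPolicy(uint32_t budget,
                std::chrono::milliseconds base,
                std::chrono::milliseconds cap)
        : budget_(budget), base_(base), cap_(cap)
    {
    }

    RetryDecision OnError(ToyErr err)
    {
        if (err == ToyErr::RwConflict)
        {
            // Lock conflicts keep the immediate block-and-retry behaviour
            // and do not consume the budget.
            return {RetryAction::RetryNow, std::chrono::milliseconds(0)};
        }
        if (attempts_ >= budget_)
        {
            // Budget exhausted: surface the error instead of spinning.
            return {RetryAction::GiveUp, std::chrono::milliseconds(0)};
        }
        // Exponential backoff: base * 2^attempts, capped at cap_.
        std::chrono::milliseconds delay = base_;
        for (uint32_t i = 0; i < attempts_ && delay < cap_; ++i)
        {
            delay *= 2;
        }
        delay = std::min(delay, cap_);
        ++attempts_;
        return {RetryAction::RetryAfterDelay, delay};
    }

private:
    uint32_t budget_;
    std::chrono::milliseconds base_;
    std::chrono::milliseconds cap_;
    uint32_t attempts_{0};
};
```

Keeping the counter per operation means one transaction with a failing data store cannot monopolise the processing loop, while lock-conflict retries stay unaffected.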

Comment thread tx_service/include/dead_lock_check.h
Comment thread tx_service/src/tx_operation.cpp Outdated
@githubzilla force-pushed the range_read_block_on_rw_conflict branch from 68cdca0 to 33e0f48 on February 2, 2026 03:16

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/dead_lock_check.cpp (1)

559-571: ⚠️ Potential issue | 🟡 Minor

Atomic operations are correct; fix comment typos.

The memory_order_acquire load at line 559 and memory_order_release store at line 569 form proper synchronization pairs with RequestCheck(). The logic correctly gates execution on both the request flag and time interval.

Line 568 has typos: "brefore" → "before" and "dependancy" → "dependency".

Proposed fix
-        // Reset the check flag brefore gather lock dependancy
+        // Reset the check flag before gathering lock dependency
         requested_check_.store(false, std::memory_order_release);

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3818-3825: The branch that handles a DataSync range-version
mismatch currently calls data_sync_task->SetErrorCode(...) and
SetScanTaskFinished(), but other early-exit paths use
data_sync_task->SetError(...) (which also marks the task finished/updates
unfinished_tasks_); to fix, replace the SetErrorCode call with
data_sync_task->SetError(...) passing an appropriate error message/enum (or call
data_sync_task->SetFinish() immediately after SetErrorCode) so the task is
marked finished and callers won't hang; update the branch that also calls
txservice::AbortTx(data_sync_txm) and PopPendingTask(...) to use the same finish
semantics as other error paths.
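
The difference the comment points at can be shown with a toy model, assuming (as the comment states) that `SetError` both records the error and marks the task finished while `SetErrorCode` only records it. `ToySyncTask`, `ToySyncStatus`, and `Finish()` are hypothetical names; the real `data_sync_task` API is not reproduced here.

```cpp
#include <cstddef>

enum class ToyErrCode { NoError, GetRangeIdErr };

// Hypothetical shared status that a caller waits on.
struct ToySyncStatus
{
    std::size_t unfinished_tasks{0};
    ToyErrCode err{ToyErrCode::NoError};

    bool AllFinished() const { return unfinished_tasks == 0; }
};

// Hypothetical data-sync task registered against a shared status.
class ToySyncTask
{
public:
    explicit ToySyncTask(ToySyncStatus &status) : status_(status)
    {
        ++status_.unfinished_tasks;
    }

    // Records the error only; the task still counts as unfinished.
    void SetErrorCode(ToyErrCode code) { status_.err = code; }

    // Records the error and marks the task finished.
    void SetError(ToyErrCode code)
    {
        SetErrorCode(code);
        Finish();
    }

    // Idempotent: the shared counter is decremented at most once per task.
    void Finish()
    {
        if (!finished_)
        {
            finished_ = true;
            --status_.unfinished_tasks;
        }
    }

private:
    ToySyncStatus &status_;
    bool finished_{false};
};
```

In this model a caller polling `AllFinished()` never makes progress after a bare `SetErrorCode`, which is why the comment asks for the same finish semantics on every early-exit path.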

Comment thread tx_service/src/cc/local_cc_shards.cpp

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tx_service/src/dead_lock_check.cpp`:
- Around line 568-569: Fix the typo in the inline comment above the
requested_check_ store call: change "brefore" to "before" in the comment that
currently reads "// Reset the check flag brefore gather lock dependancy" located
next to requested_check_.store(false, std::memory_order_release); and while here
also correct "dependancy" to "dependency" for clarity.

Comment thread tx_service/src/dead_lock_check.cpp Outdated
Comment on lines +568 to +569
         // Reset the check flag brefore gather lock dependancy
-        requested_check_ = false;
+        requested_check_.store(false, std::memory_order_release);


⚠️ Potential issue | 🟡 Minor

Typo: "brefore" → "before".

-        // Reset the check flag brefore gather lock dependancy
+        // Reset the check flag before gather lock dependancy
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Reset the check flag brefore gather lock dependancy
requested_check_ = false;
requested_check_.store(false, std::memory_order_release);
// Reset the check flag before gather lock dependancy
requested_check_.store(false, std::memory_order_release);

@githubzilla force-pushed the range_read_block_on_rw_conflict branch from 62a1c26 to fc18555 on February 11, 2026 06:04
@githubzilla githubzilla merged commit 3be56b8 into main Feb 11, 2026
4 checks passed

2 participants