Skip to content

Range read block on rw conflict#283

Merged
githubzilla merged 14 commits into
mainfrom
range_read_block_on_rw_conflict
Dec 12, 2025
Merged

Range read block on rw conflict#283
githubzilla merged 14 commits into
mainfrom
range_read_block_on_rw_conflict

Conversation

@githubzilla

@githubzilla githubzilla commented Dec 10, 2025

Copy link
Copy Markdown
Collaborator

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Bug Fixes

    • Range read requests now block and wait for locks (instead of aborting), preserving read locks and queueing retries; remote requesters receive acknowledgments when applicable.
    • Error paths now enter a safe waiting state to allow deadlock detection to observe real wait conditions.
  • Chores

    • Added per-second throttling for global deadlock checks and made deadlock-check signaling thread-safe.
    • Improved diagnostic logging and fault-injection hooks for transaction flows.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Dec 10, 2025

Copy link
Copy Markdown

Warning

Rate limit exceeded

@githubzilla has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 33 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between a2c4eb8 and 9306036.

📒 Files selected for processing (1)
  • tx_service/include/dead_lock_check.h (2 hunks)

Walkthrough

Range read conflict handling now blocks on lock-acquire failures (retaining bucket read locks) and invokes a throttled deadlock checker instead of aborting; several pre-abort checks were removed and fault-injection/diagnostics added in commit-acquire-all flows.

Changes

Cohort / File(s) Summary
Deadlock detection core
tx_service/include/dead_lock_check.h, tx_service/src/dead_lock_check.cpp
requested_check_ changed from bool to std::atomic<bool> with atomic loads/stores (memory_order_acquire/release). Added last_throttled_check_time_ and CHECK_THROTTLE_INTERVAL_ to throttle global deadlock-check requests; wait/notify logic updated to use atomics.
Range CC blocking behavior
tx_service/include/cc/range_cc_map.h
ReadCc execution now blocks when lock acquisition fails (marks request as blocked, returns false), retains bucket read lock during blocking, allows resumed range-read CC requests to persist after blocking, and invokes throttled deadlock checks instead of abort-and-log. Sends remote acknowledgments for remote-origin requests on block.
Operation forwarding & retries
tx_service/src/tx_operation.cpp
Removed pre-flight/pre-abort loops in ReadLocalOperation::Forward and LockWriteRangeBucketsOp::Forward, now relying on deadlock-detection and retry paths. Added diagnostic logging in AcquireAllOp::Forward.
Fault injection & diagnostics
tx_service/src/tx_operation.cpp
Added fault-injection handling for split_flush_commit_acquire_all_deadlock in SplitFlushRangeOp::Forward (inject DEAD_LOCK_ABORT path, initialize state, log, remove fault), DLOG diagnostics in commit-acquire-all hot paths, and an optional term hook to short-circuit continuation.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Req as ReadCc Request
  participant LM as LockManager/BucketLock
  participant DLD as DeadLockCheck
  participant Remote as RemoteNode

  Req->>LM: attempt acquire bucket/range lock
  alt acquire succeeds
    LM-->>Req: lock acquired
    Req->>Req: proceed with read
  else acquire fails
    LM-->>Req: lock busy / contention
    Req->>Req: mark blocked (retain bucket read lock)
    Req->>DLD: try request deadlock check (throttled)
    note right of DLD: uses atomic requested_check_ and last_throttled_check_time_
    alt remote-origin
      Req->>Remote: send ack of block
    end
    DLD-->>Req: (possibly) schedule deadlock analysis / no-op if throttled
    LM-->>Req: on lock release -> retry queued request
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify correct atomic memory ordering (loads/stores) for requested_check_ and that no races remain.
  • Confirm throttle interval semantics and that last_throttled_check_time_ uses consistent time units.
  • Inspect lock-retention during blocking in range_cc_map.h to ensure no long-lived undesired lock-holding or self-deadlock.
  • Review fault-injection and term-hook paths in tx_operation.cpp for safe teardown and no silent bypass.

Possibly related PRs

Suggested reviewers

  • liunyl

Poem

🐰 I held a bucket, waited still,
Locks danced round the quiet hill,
Throttled checks to wake the night,
Blocked, then retried when freed — delight!
A hop, a fix, and logs that twinkle bright. ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description only contains an unchecked template checklist with no implementation details, context, or actual description of the changes made. Provide a detailed description explaining what changes were made, why they were needed, and how they affect deadlock detection. Complete the checklist items and reference the relevant issue number.
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Range read block on rw conflict' directly describes the main change: making range reads block on read-write conflicts to enable deadlock detection.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/tx_operation.cpp (1)

317-327: Do not blindly retry DATA_STORE_ERR in ReadLocalOperation::Forward

In ReadLocalOperation::Forward (Lines 317–327), all errors (both ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT and DATA_STORE_ERR) now cause the handler result to be reset and the operation to be re‑processed:

  • This makes sense for ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT under the new “blocking range read + deadlock detection” model.
  • For DATA_STORE_ERR, this turns a non‑transient local data‑store failure into an unbounded immediate retry loop with no backoff and no explicit timeout handling in this op. The transaction never sees the terminal error and may just spin until some external timeout/abort kicks in.

You likely want to preserve the old “propagate DATA_STORE_ERR and finish” behavior and only apply the retry logic to the RW‑conflict case.

A minimal, more robust structure would be:

-        else if (hd_result_->IsError())
-        {
-            assert(hd_result_->ErrorCode() ==
-                       CcErrorCode::ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT ||
-                   hd_result_->ErrorCode() == CcErrorCode::DATA_STORE_ERR);
-            // With blocking range reads and deadlock detection, we no longer
-            // need to preemptively abort. The request will block and deadlock
-            // detection will handle actual deadlocks. Retry the operation.
-            hd_result_->Value().Reset();
-            hd_result_->Reset();
-            execute_immediately_ = false;
-            txm->Process(*this);
-            return;
-        }
+        else if (hd_result_->IsError())
+        {
+            const auto err = hd_result_->ErrorCode();
+            if (err ==
+                CcErrorCode::ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT)
+            {
+                // With blocking range reads and deadlock detection, we no
+                // longer need to preemptively abort. The request will block and
+                // deadlock detection will handle actual deadlocks. Retry the
+                // operation.
+                hd_result_->Value().Reset();
+                hd_result_->Reset();
+                execute_immediately_ = false;
+                txm->Process(*this);
+                return;
+            }
+
+            // Non‑retryable local error: bubble it up instead of spinning.
+            assert(err == CcErrorCode::DATA_STORE_ERR);
+            txm->PostProcess(*this);
+            return;
+        }
🧹 Nitpick comments (1)
tx_service/src/tx_operation.cpp (1)

612-622: Consider bounding or backing off retries in LockWriteRangeBucketsOp::Forward

For LockWriteRangeBucketsOp::Forward (Lines 612–622), on ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT you now:

  • Reset lock_result_
  • Set is_running_ = false; execute_immediately_ = false;
  • Immediately call txm->Process(*this);

This implements the new “blocking range read” semantics by retrying instead of aborting, which is consistent with the PR’s intent. However, if for any reason the CC layer keeps returning the conflict code (e.g. a bug or a permanently blocked writer), this path will repeatedly re‑issue the same lock request with no retry bound and no backoff in this op.

It might be safer to either:

  • Consume retry_num_ and go through ReRunOp(txm) (with the existing exponential sleep), or
  • Add a small retry counter/backoff specific to this op.

Not strictly a blocker, but worth tightening before this path becomes hot.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1344f8 and a43d6d1.

📒 Files selected for processing (4)
  • tx_service/include/cc/range_cc_map.h (3 hunks)
  • tx_service/include/dead_lock_check.h (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
  • tx_service/src/tx_operation.cpp (7 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/tx_operation.cpp
  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
🧬 Code graph analysis (2)
tx_service/include/dead_lock_check.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
tx_service/include/cc/range_cc_map.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
🔇 Additional comments (7)
tx_service/src/tx_operation.cpp (1)

49-49: Include order change looks fine

Moving #include "tx_operation.h" (Line 49) below other headers is safe and aligns with including the corresponding header last. No functional impact.

tx_service/include/cc/range_cc_map.h (4)

32-32: LGTM!

The include directive is necessary for the new DeadLockCheck::RequestCheckWithThrottle call added below.


233-235: LGTM!

The updated comments clearly explain the new blocking behavior for range reads, which is consistent with the implementation changes below.


338-343: The remote request acknowledgement pattern is correct and safe. RemoteRead properly inherits from ReadCc (confirmed in remote_cc_request.h:180), making the static_cast to remote::RemoteRead& safe when guarded by the !req.IsLocal() check. The Acknowledge() method is properly defined and correctly notifies the remote node by setting transaction number and handler address in the output message. This pattern is consistently used throughout the codebase for distinguishing remote vs. local request handling.


328-346: Bucket lock handling is consistent with the design and safely managed.

The code correctly holds the bucket read lock while the request is blocked on range lock acquisition, matching the pattern used for regular data reads. This lock is intentionally retained to avoid re-acquiring it when the request resumes.

Lock lifecycle:

  • During blocking: Bucket lock is held (not released) along with BlockByLock status set
  • On resumption: Request re-enters PostReadCc Execute, which checks if BlockedBy == BlockByLock and calls LockHandleForResumedRequest to retry lock acquisition
  • Lock release guarantee: PostReadCc Execute ensures the bucket lock is always released via ReleaseCceLock() regardless of the outcome

The design is sound: holding a read lock during blocking has minimal deadlock risk since read locks don't block other readers, and the lock is guaranteed to be released in PostReadCc. This matches how regular template-based reads handle the same scenario (template_cc_map.h lines 1489-1503).

tx_service/include/dead_lock_check.h (2)

202-205: LGTM with a note.

The method declaration and documentation are clear. However, this is related to the earlier comment about the unused tx_id parameter in the implementation.


251-254: LGTM!

The member variable declarations and throttle interval constant are well-defined. The initialization of last_throttled_check_time_ to 0 means the first deadlock check request will always be allowed, which is reasonable behavior.

Comment thread tx_service/src/dead_lock_check.cpp Outdated
Comment thread tx_service/src/tx_operation.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a43d6d1 and 7848048.

📒 Files selected for processing (3)
  • tx_service/include/cc/range_cc_map.h (3 hunks)
  • tx_service/include/dead_lock_check.h (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/src/dead_lock_check.cpp
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
🧬 Code graph analysis (2)
tx_service/include/cc/range_cc_map.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
tx_service/include/dead_lock_check.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
🔇 Additional comments (5)
tx_service/include/dead_lock_check.h (2)

202-205: LGTM!

The throttled deadlock check method is well-documented and provides a clear mechanism to limit the frequency of deadlock checks.


251-254: Verify throttling behavior when deadlock check is skipped.

The throttle implementation correctly limits checks to once per second. However, in range_cc_map.h line 329, the return value of RequestCheckWithThrottle() is not checked. When the throttle prevents a deadlock check (returns false), the request still blocks and waits. This means if multiple transactions hit lock contention within the same second, only the first triggers a deadlock check while others block indefinitely until the next scheduled check or lock release.

Verify this is the intended behavior by checking if the periodic deadlock checker will eventually detect these blocked transactions, or if there are other mechanisms to wake them:

#!/bin/bash
# Search for periodic deadlock check mechanism and blocked request wake-up logic
rg -nP -C3 'time_interval_|requested_check_|con_var_\.notify' --type=cpp
tx_service/include/cc/range_cc_map.h (3)

32-32: LGTM!

The include is necessary for the RequestCheckWithThrottle() call added in the Execute method.


233-235: LGTM!

The updated comment accurately reflects the behavioral change from aborting to blocking when lock acquisition fails, enabling deadlock detection.


328-346: Verify resumed request retry path is properly implemented in RangeCcMap.

The design approach is sound—retaining the bucket read lock and retrying range lock acquisition on resumption is consistent with regular data reads and allows deadlock detection. However, RangeCcMap's Execute(ReadCc) override (lines 202–350) lacks the resume handling pattern found in the parent TemplateCcMap class. The parent class checks req.CcePtr() to detect resumed requests and calls LockHandleForResumedRequest() when blocked by lock (template_cc_map.h:1489–1503), but RangeCcMap's implementation has no such check. This means resumed requests will attempt to re-acquire the bucket lock without verifying prior acquisition state. Confirm whether RangeCcMap intentionally bypasses this pattern or if the resume handling needs to be added to properly support resumed range read requests with retained bucket locks.

Comment thread tx_service/include/cc/range_cc_map.h Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/include/cc/range_cc_map.h (1)

217-217: assert(req.IsLocal()) conflicts with new remote‑acknowledgement branch

At line 217 you assert req.IsLocal(), but at lines 338‑343 you added a if (!req.IsLocal()) branch that casts to remote::RemoteRead and calls Acknowledge().

In debug/test builds this makes the remote path effectively unreachable: any non‑local ReadCc reaching this method would trip the assert before hitting the new acknowledgement logic. Please either:

  • Relax/remove the assert(req.IsLocal()) if remote ReadCc requests are expected here, or
  • Remove the remote‑specific branch and keep the invariant that RangeCcMap::Execute(ReadCc) only handles local requests.

Also applies to: 338-343

♻️ Duplicate comments (1)
tx_service/src/tx_operation.cpp (1)

4217-4255: Fault-injected split_flush_commit_acquire_all_deadlock can still stall the state machine

Even after this change, the WITH_FAULT_INJECT path for split_flush_commit_acquire_all_deadlock risks leaving commit_acquire_all_write_op_ (an AcquireAllOp) stuck and never reaching the op_ == &commit_acquire_all_write_op_ branch in SplitFlushRangeOp::Forward, for two reasons:

  1. Sub‑op never reaches a “logically finished” state:

    • commit_acquire_all_write_op_.Reset(all_node_groups->size()) sets:
      • upload_cnt_ = node_cnt * keys_.size() (for range split this is #node_groups * 1).
      • finish_cnt_ = fail_cnt_ = remote_ack_cnt_ = 0.
    • You then mark only hd_results_[0] as DEAD_LOCK_ABORT via SetError + ForceError.
    • Assuming ForceError triggers the post_lambda_, you get finish_cnt_ = 1 and fail_cnt_ = 1, but all other hd_results_[i>0] remain unfinished and remote_ack_cnt_ is still 0.
    • In AcquireAllOp::Forward, this means:
      • remote_ack_cnt_ == 0, so the first timeout branch is skipped.
      • finish_cnt_ != upload_cnt_, so the “all done” branch is never taken.
      • The final timeout branch iterates unfinished hd_results_, but only acts when ac_res.node_term_ > 0 && !blocked_remote_cce_addr_.empty(), which is not true for the synthetic entries created by Reset. Those entries are never ForceError’d, so finish_cnt_ never reaches upload_cnt_ and txm->PostProcess(*this) is never called.
  2. You still don’t run the sub‑op’s Forward with real work:

    • In the fault branch you do:
      • op_ = &commit_acquire_all_write_op_;
      • txm->PushOperation(&commit_acquire_all_write_op_);
      • commit_acquire_all_write_op_.is_running_ = true;
    • Unlike the normal path, you do not call ForwardToSubOperation(txm, &commit_acquire_all_write_op_), so there is no guaranteed AcquireAllOp::Forward(txm) invocation tied to this transition.
    • Even if Forward is eventually called because the op is now on top of the stack, is_running_ == true prevents AcquireAllOp from calling txm->Process(*this) and sending any RPCs, and (per point 1) the synthetic state still never reaches the “all requests finished” branch.

Net effect: with the split_flush_commit_acquire_all_deadlock fault enabled, the composite split‑flush operation can hang rather than driving the deadlock downgrade path you’re trying to test — essentially the same failure mode highlighted in the earlier review.

A safer pattern is to fully synthesize a “completed, deadlock’ed AcquireAllOp” before forwarding to it, so AcquireAllOp::Forward immediately takes the finish_cnt_ == upload_cnt_ branch and PostProcesses without issuing RPCs. For example:

-#ifdef WITH_FAULT_INJECT
-            // Check if fault injector is enabled
-            FaultEntry *fault_entry =
-                FaultInject::Entry("split_flush_commit_acquire_all_deadlock");
-            if (fault_entry != nullptr)
-            {
-                // Only assign to op_ and push to txm, do not call Process
-                op_ = &commit_acquire_all_write_op_;
-                txm->PushOperation(&commit_acquire_all_write_op_);
-                auto all_node_groups = Sharder::Instance().AllNodeGroups();
-                commit_acquire_all_write_op_.Reset(all_node_groups->size());
-                commit_acquire_all_write_op_.is_running_ = true;
-
-                // Ensure at least one result slot exists.
-                assert(commit_acquire_all_write_op_.hd_results_.size() >= 1);
-                // Set error on the first hd_result with DEAD_LOCK_ABORT
-                commit_acquire_all_write_op_.hd_results_[0].SetError(
-                    CcErrorCode::DEAD_LOCK_ABORT);
-                commit_acquire_all_write_op_.hd_results_[0].ForceError();
-
-                DLOG(INFO)
-                    << "FaultInject split_flush_commit_acquire_all_deadlock, "
-                    << " commit_acquire_all_write_op_.IsDeadlock(): "
-                    << commit_acquire_all_write_op_.IsDeadlock()
-                    << " upload_cnt_: "
-                    << commit_acquire_all_write_op_.upload_cnt_
-                    << " fail_cnt_: "
-                    << commit_acquire_all_write_op_.fail_cnt_.load(
-                           std::memory_order_relaxed);
-                // Auto-remove the fault to prevent it from being triggered
-                // again
-                FaultInject::Instance().InjectFault(
-                    "split_flush_commit_acquire_all_deadlock", "remove");
-            }
-            else
-            {
-                ForwardToSubOperation(txm, &commit_acquire_all_write_op_);
-            }
-#else
-            ForwardToSubOperation(txm, &commit_acquire_all_write_op_);
-#endif
+#ifdef WITH_FAULT_INJECT
+            if (FaultInject::Entry(
+                    "split_flush_commit_acquire_all_deadlock") != nullptr)
+            {
+                // Synthesize a completed DEAD_LOCK_ABORT without issuing RPCs.
+                auto &results = commit_acquire_all_write_op_.hd_results_;
+                if (results.empty())
+                {
+                    commit_acquire_all_write_op_.Resize(1);
+                }
+
+                commit_acquire_all_write_op_.upload_cnt_ = 1;
+                commit_acquire_all_write_op_.finish_cnt_.store(
+                    1, std::memory_order_relaxed);
+                commit_acquire_all_write_op_.fail_cnt_.store(
+                    1, std::memory_order_relaxed);
+                commit_acquire_all_write_op_.remote_ack_cnt_.store(
+                    0, std::memory_order_relaxed);
+                commit_acquire_all_write_op_.is_running_ = true;
+
+                auto &h = results[0];
+                h.Reset();
+                h.SetError(CcErrorCode::DEAD_LOCK_ABORT);
+                h.ForceError();
+
+                DLOG(INFO) << "FaultInject split_flush_commit_acquire_all_deadlock, "
+                           << "IsDeadlock=" << commit_acquire_all_write_op_.IsDeadlock();
+
+                FaultInject::Instance().InjectFault(
+                    "split_flush_commit_acquire_all_deadlock", "remove");
+            }
+#endif
+            // In both normal and fault-injected paths, drive the sub-op once
+            // so AcquireAllOp::Forward can PostProcess and let the parent see
+            // the deadlock state.
+            ForwardToSubOperation(txm, &commit_acquire_all_write_op_);

This keeps AcquireAllOp::Process() suppressed (via is_running_ = true), but guarantees:

  • upload_cnt_ == finish_cnt_ == 1, fail_cnt_ > 0, remote_ack_cnt_ == 0.
  • AcquireAllOp::Forward immediately takes the finish_cnt_ == upload_cnt_ branch, runs the fail‑handling logic once, and calls txm->PostProcess(*this).
  • On the next SplitFlushRangeOp::Forward call, the op_ == &commit_acquire_all_write_op_ branch sees IsDeadlock() == true and exercises the downgrade path instead of hanging.

If you prefer not to rely on AcquireAllOp::Forward at all, an alternative is to skip pushing the sub‑op and simply set op_ = &commit_acquire_all_write_op_, fail_cnt_ = 1, upload_cnt_ = 1, and mark one hd_result as DEAD_LOCK_ABORT, then handle the downgrade directly in the parent on the next Forward call — but the key point is that the current shape still leaves the sub‑op in a permanently “incomplete” state under the fault.

Also applies to: 4532-4573

🧹 Nitpick comments (3)
tx_service/include/cc/range_cc_map.h (2)

233-235: Clarify which lock type “range reads now block” refers to

The comment at lines 233‑235 says range reads now block when lock acquisition fails, but the bucket lock acquisition path (lines 268‑290) still aborts via AbortQueueRequest, while only the range lock path (lines 325‑347) actually blocks and participates in deadlock detection.

Consider tightening the wording to something like “range record lock acquisition now blocks” or similar, or, if desired, extend the same blocking/deadlock‑check semantics to bucket lock conflicts as well.

Also applies to: 283-290, 325-347


267-283: Confirm bucket read‑lock cleanup for blocked‑but‑aborted range reads

In the blocked range‑lock path you now intentionally “keep the bucket read lock” (lines 332‑336) and return false with req.SetBlockType(ReadCc::BlockByLock). Since the bucket read lock is not tracked in the tx readset and is manually released only in Execute(PostReadCc) (lines 357‑368), any path that aborts/cancels a blocked ReadCc before it reaches PostReadCc must also ensure the bucket lock is released.

If existing generic abort/cancellation logic already does this for blocked range reads, then behavior is fine; otherwise you may need either to:

  • Track the bucket lock in the readset for these requests, or
  • Add explicit cleanup in the abort path for blocked range reads.

Also applies to: 332-336, 350-369

tx_service/src/tx_operation.cpp (1)

309-327: ReadLocalOperation retry comment matches behavior; watch unbounded retries

The new comment correctly describes the retry-on-error behavior for RW conflicts and data store errors. Note that this path will keep retrying indefinitely if hd_result_ keeps returning DATA_STORE_ERR or ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT, since it doesn’t consult retry_num_ or ReRunOp. If that’s intentional (e.g., higher-level timeout/abort will eventually kick in), this is fine; otherwise, consider wiring this through ReRunOp with a bounded retry count.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7848048 and c981cd6.

📒 Files selected for processing (2)
  • tx_service/include/cc/range_cc_map.h (3 hunks)
  • tx_service/src/tx_operation.cpp (8 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-09-25T06:15:41.186Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 142
File: include/tx_req_result.h:140-159
Timestamp: 2025-09-25T06:15:41.186Z
Learning: When cc_notify_ is true in TxResult, both Finish() and Wait() are guaranteed to be called on the same thread, eliminating concurrency concerns and making relaxed memory ordering sufficient for status_combo_ updates.

Applied to files:

  • tx_service/src/tx_operation.cpp
🧬 Code graph analysis (1)
tx_service/include/cc/range_cc_map.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
🔇 Additional comments (4)
tx_service/include/cc/range_cc_map.h (1)

32-32: Deadlock check integration looks fine

Including dead_lock_check.h and calling DeadLockCheck::RequestCheckWithThrottle() in the blocked-range path is consistent with the new throttled deadlock-check API. Ignoring the boolean return and relying only on the side effect is reasonable here.

Also applies to: 328-330

tx_service/src/tx_operation.cpp (3)

49-49: Include reordering for tx_operation.h is fine

Moving this include down into the project header block is harmless and improves include hygiene; no behavioral impact.


601-623: LockWriteRangeBucketsOp RW-conflict comment aligns with retry behavior

The updated comment accurately reflects that RW conflicts are now handled by retrying instead of preemptive abort, deferring to blocking + deadlock detection. The control flow (Reset result, flip execute_immediately_, re-Process) is consistent with that design.


1846-1848: Additional AcquireAllOp completion DLOG is useful and low‑risk

The new debug log on “all acquire-all requests finished” (including fail_cnt_) is inexpensive and should help diagnose lock-acquisition behavior without affecting semantics.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tx_service/src/tx_operation.cpp (1)

4218-4256: Fault injection for split_flush_commit_acquire_all_deadlock still leaves commit_acquire_all_write_op_ stuck

The new fault‑inject branch improves diagnostics but still doesn’t let AcquireAllOp complete, so SplitFlushRangeOp can stall indefinitely when this fault is enabled:

  • commit_acquire_all_write_op_.Reset(all_node_groups->size()) sets upload_cnt_ = node_cnt * keys_.size() (≥ number of node groups), and zeroes finish_cnt_ / fail_cnt_.
  • You then set is_running_ = true, push the sub‑op without calling Process, and mark only hd_results_[0] as DEAD_LOCK_ABORT via SetError + ForceError(). That increments fail_cnt_ and finish_cnt_ to 1 via the per‑result post_lambda_.
  • With is_running_ == true, AcquireAllOp::Forward will not call txm->Process(*this) (no RPCs are issued). At the same time:
    • remote_ack_cnt_ stays 0.
    • finish_cnt_ (==1) != upload_cnt_, so the finish_cnt_ == upload_cnt_ branch (which calls txm->PostProcess(*this)) is never taken.
    • The timeout/deadlock path at the end of AcquireAllOp::Forward inspects ac_res.node_term_ > 0 && has_blocked_remote_cce, but Reset() leaves node_term_ = -1, so it never forces additional completions or calls PostProcess either.
  • As a result, commit_acquire_all_write_op_ never reaches a “finished” state in which it would call txm->PostProcess(*this), and the parent SplitFlushRangeOp never re‑enters its op_ == &commit_acquire_all_write_op_ branch to see fail_cnt_ > 0 and IsDeadlock() == true. Any test that enables this fault will hang rather than exercising the downgrade path.

To keep the fault semantics (no real RPCs) but guarantee progress, you need to either:

  1. Do not push the sub‑op at all; just synthesize a failed, deadlock’ed sub‑op state and let the parent handle it on the next Forward, e.g.:
#ifdef WITH_FAULT_INJECT
-    FaultEntry *fault_entry =
-        FaultInject::Entry("split_flush_commit_acquire_all_deadlock");
-    if (fault_entry != nullptr)
-    {
-        // Only assign to op_ and push to txm, do not call Process
-        op_ = &commit_acquire_all_write_op_;
-        txm->PushOperation(&commit_acquire_all_write_op_);
-        auto all_node_groups = Sharder::Instance().AllNodeGroups();
-        commit_acquire_all_write_op_.Reset(all_node_groups->size());
-        commit_acquire_all_write_op_.is_running_ = true;
-
-        // Ensure at least one result slot exists.
-        assert(commit_acquire_all_write_op_.hd_results_.size() >= 1);
-        // Set error on the first hd_result with DEAD_LOCK_ABORT
-        commit_acquire_all_write_op_.hd_results_[0].SetError(
-            CcErrorCode::DEAD_LOCK_ABORT);
-        commit_acquire_all_write_op_.hd_results_[0].ForceError();
+    if (FaultInject::Entry(
+            "split_flush_commit_acquire_all_deadlock") != nullptr)
+    {
+        // Synthesize a completed, deadlock'ed acquire-all without issuing RPCs.
+        op_ = &commit_acquire_all_write_op_;
+
+        commit_acquire_all_write_op_.upload_cnt_ = 1;
+        commit_acquire_all_write_op_.finish_cnt_.store(
+            1, std::memory_order_relaxed);
+        commit_acquire_all_write_op_.fail_cnt_.store(
+            1, std::memory_order_relaxed);
+        commit_acquire_all_write_op_.remote_ack_cnt_.store(
+            0, std::memory_order_relaxed);
+
+        auto &res = commit_acquire_all_write_op_.hd_results_[0];
+        res.Reset();
+        res.SetError(CcErrorCode::DEAD_LOCK_ABORT);
+        res.ForceError();  // make IsError() true for IsDeadlock()
  1. Or keep pushing the sub‑op but also synthesize a fully completed state so that AcquireAllOp::Forward will immediately observe finish_cnt_ == upload_cnt_ and call txm->PostProcess(*this) on the next tick (as suggested in the prior review).

Either way, the key is: after the fault fires, commit_acquire_all_write_op_ must transition to a finished, deadlock’ed state that lets the parent branch run once more and hit the IsDeadlock() downgrade path, instead of remaining perpetually incomplete.

This is the same underlying issue that was flagged in the earlier review; the current changes haven’t eliminated the risk of the state machine getting stuck under this fault.

Also applies to: 4258-4258

🧹 Nitpick comments (4)
tx_service/include/cc/range_cc_map.h (1)

328-347: Blocking path and deadlock check integration look good; clarify remote/local and lock-lifetime invariants

Switching the range-lock conflict path to:

  • trigger a throttled deadlock check,
  • keep the bucket read lock,
  • mark the request as BlockByLock, and
  • return false to let the scheduler re-run it

is consistent with typical “block instead of abort” CC behavior and should make deadlock detection see real wait edges.

Two things are worth tightening/confirming:

  • The if (!req.IsLocal()) branch sits in a function that begins with assert(req.IsLocal());, so in debug builds that Acknowledge() path appears unreachable. If remote ReadCc instances are ever routed through this Execute, the assert will fire instead of acknowledging; if they are truly never routed here, the remote-specific code is dead and could be removed to avoid confusion.
  • Because the bucket read lock is deliberately kept while the request is blocked and is not tracked in the readset, it’s important that every eventual completion/abort path for a blocked range read goes through PostReadCc (or an equivalent cleanup path) so bucket locks can’t leak. The pattern looks intentional and mirrors regular data reads, but it’s worth double-checking that the resume + post-read flow for blocked range reads follows the same conventions.
tx_service/src/tx_operation.cpp (3)

321-328: Range-local read now retries on RW conflict instead of aborting – confirm scheduling semantics

The new retry path for ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT / DATA_STORE_ERR clears hd_result_, marks execute_immediately_ = false, and immediately calls txm->Process(*this) without any backoff or ReRunOp use. That matches the “blocking range reads + deadlock detection” intent, but relies on the surrounding scheduler to avoid a tight retry loop.

Please double‑check that:

  • execute_immediately_ = false actually re-queues this op onto the blocked/waiting path rather than spinning on the TxProcessor thread; and
  • the DATA_STORE_ERR case is intended to be retriable under this new behavior (vs surfacing an error after bounded retries).

If both are intentional and covered by tests, then this looks good.


615-623: LockWriteRangeBucketsOp now retries on RW conflict — ensure it blocks instead of spinning

Similar to ReadLocalOperation, a bucket-lock error with code ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT now resets lock_result_, sets is_running_ = false / execute_immediately_ = false, and re-invokes txm->Process(*this).

This seems correct for “blocking range read” semantics, but it assumes:

  • execute_immediately_ = false causes the operation to be parked until notified (e.g., via deadlock check and unblocking), and
  • no fast spin occurs on the TxProcessor thread while the conflicting writer still holds the lock.

If your concurrency tests confirm this behavior, no change is needed; otherwise, consider wiring this through ReRunOp (with backoff) for consistency with other retryable ops.


1847-1848: New AcquireAllOp DLOG is fine; consider explicit memory order for consistency

The extra DLOG(INFO) when all acquire-all requests finish is useful diagnostics and is in a non‑hot path (debug logging only), so it’s safe.

For consistency with the rest of AcquireAllOp (which uses load(std::memory_order_relaxed) everywhere), you might want to specify the memory order here as well:

-       DLOG(INFO) << "All acquire all requests finished for tx "
-                  << txm->TxNumber() << ", fail count: " << fail_cnt_.load();
+       DLOG(INFO) << "All acquire all requests finished for tx "
+                  << txm->TxNumber()
+                  << ", fail count: "
+                  << fail_cnt_.load(std::memory_order_relaxed);

Purely stylistic; behavior is already correct.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c981cd6 and d2f5a8a.

📒 Files selected for processing (3)
  • tx_service/include/cc/range_cc_map.h (3 hunks)
  • tx_service/include/dead_lock_check.h (2 hunks)
  • tx_service/src/tx_operation.cpp (7 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-09-25T06:15:41.186Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 142
File: include/tx_req_result.h:140-159
Timestamp: 2025-09-25T06:15:41.186Z
Learning: When cc_notify_ is true in TxResult, both Finish() and Wait() are guaranteed to be called on the same thread, eliminating concurrency concerns and making relaxed memory ordering sufficient for status_combo_ updates.

Applied to files:

  • tx_service/src/tx_operation.cpp
🧬 Code graph analysis (2)
tx_service/include/dead_lock_check.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
tx_service/include/cc/range_cc_map.h (1)
tx_service/src/dead_lock_check.cpp (2)
  • RequestCheckWithThrottle (56-79)
  • RequestCheckWithThrottle (56-56)
🔇 Additional comments (6)
tx_service/include/dead_lock_check.h (2)

202-205: Throttle API shape and contract look consistent with implementation

The RequestCheckWithThrottle() declaration and its documented behavior match the cpp implementation (singleton guard, mutex, 1s window, boolean return). No issues from the header side.


251-256: Throttle state and interval wiring look correct

Storing last_throttled_check_time_ per DeadLockCheck instance and using CHECK_THROTTLE_INTERVAL_ = 1 * MICRO_SECOND correctly encode a 1-second global throttle for the singleton; this aligns with the comments and the cpp usage.

tx_service/include/cc/range_cc_map.h (2)

32-32: Including dead_lock_check.h here is appropriate

The new include is needed for DeadLockCheck::RequestCheckWithThrottle() and fits the existing include style in this header.


233-235: Behavioral comment matches the new blocking semantics

The added comment accurately reflects that range reads can now block on lock acquisition and later resume, which is an important invariant shift from the previous “always abort on conflict” behavior.

tx_service/src/tx_operation.cpp (2)

4535-4537: Additional diagnostics around commit-acquire-all failures look good

The new DLOG(INFO) for commit_acquire_all_write_op_.fail_cnt_ and the DLOG(ERROR) when IsDeadlock() is true provide useful context in split‑flush range deadlock scenarios, with negligible overhead in non‑debug builds.

No functional issues from these logging additions.

Also applies to: 4548-4548


4569-4575: New term_SplitFlushOp_CommitAcquireAllWriteOp_Continue fault hook is consistent with existing patterns

The CODE_FAULT_INJECTOR("term_SplitFlushOp_CommitAcquireAllWriteOp_Continue", { ...; return; }); hook is in line with other *_Continue injection points in this file (e.g., the install‑dirty and data‑sync phases). It cleanly short‑circuits the branch while the fault is enabled and relies on test code to remove the entry when it’s time to resume.

No correctness concerns here; just ensure test harnesses always remove this fault to avoid persistent stalls.

Comment thread tx_service/include/cc/range_cc_map.h Outdated
// bucket lock and can immediately retry the range lock acquisition.
// If the read request comes from a remote node, send
// acknowledgement
if (!req.IsLocal())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should never happen. range cc map ReadCc should only be initiated by local node

Comment thread tx_service/src/dead_lock_check.cpp Outdated
return false;
}

std::unique_lock<std::mutex> lk(inst_->mutex_);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove lock

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tx_service/src/tx_operation.cpp (1)

4218-4256: Fault injection for split_flush_commit_acquire_all_deadlock still risks leaving commit_acquire_all_write_op_ never-finished

This new fault path sets op_, pushes commit_acquire_all_write_op_, calls Reset(...), marks is_running_ = true, and only manipulates hd_results_[0]. It does not synthesize a fully “finished” state for AcquireAllOp:

  • upload_cnt_ remains node_cnt * keys_.size() (≥ 1, typically > 1).
  • finish_cnt_ and fail_cnt_ are left at 0 (or whatever previous values were).
  • No ForceError()-driven post-lambda runs for indices [1, upload_cnt_), so finish_cnt_ != upload_cnt_.
  • remote_ack_cnt_ stays 0.

As a result, AcquireAllOp::Forward never hits the finish_cnt_ == upload_cnt_ branch that calls txm->PostProcess(*this), and the composite SplitFlushRangeOp can remain stuck with op_ == &commit_acquire_all_write_op_ for tests that enable this fault. This is essentially the same liveness issue called out in the earlier review.

You can preserve the “skip real Process()/RPCs” intent while guaranteeing progress by fully synthesizing a one-slot, failed, deadlock’ed state:

 #ifdef WITH_FAULT_INJECT
-            // Check if fault injector is enabled
-            FaultEntry *fault_entry =
-                FaultInject::Entry("split_flush_commit_acquire_all_deadlock");
-            if (fault_entry != nullptr)
-            {
-                // Only assign to op_ and push to txm, do not call Process
-                op_ = &commit_acquire_all_write_op_;
-                txm->PushOperation(&commit_acquire_all_write_op_);
-                auto all_node_groups = Sharder::Instance().AllNodeGroups();
-                commit_acquire_all_write_op_.Reset(all_node_groups->size());
-                commit_acquire_all_write_op_.is_running_ = true;
-
-                // Ensure at least one result slot exists.
-                assert(commit_acquire_all_write_op_.hd_results_.size() >= 1);
-                auto &hd_result0 = commit_acquire_all_write_op_.hd_results_[0];
-                hd_result0.Reset();
-                hd_result0.SetToBlock();
-                // Set error on the first hd_result with DEAD_LOCK_ABORT
-                hd_result0.SetError(CcErrorCode::DEAD_LOCK_ABORT);
-                DLOG(INFO)
-                    << "FaultInject split_flush_commit_acquire_all_deadlock, "
-                    << " commit_acquire_all_write_op_.IsDeadlock(): "
-                    << commit_acquire_all_write_op_.IsDeadlock()
-                    << " upload_cnt_: "
-                    << commit_acquire_all_write_op_.upload_cnt_
-                    << " fail_cnt_: "
-                    << commit_acquire_all_write_op_.fail_cnt_.load(
-                           std::memory_order_relaxed);
+            FaultEntry *fault_entry =
+                FaultInject::Entry("split_flush_commit_acquire_all_deadlock");
+            if (fault_entry != nullptr)
+            {
+                // Put the sub-op on the stack but skip real RPCs; synthesize a
+                // completed DEAD_LOCK_ABORT so AcquireAllOp::Forward will
+                // immediately finish and be seen as deadlocked.
+                op_ = &commit_acquire_all_write_op_;
+                txm->PushOperation(&commit_acquire_all_write_op_);
+                commit_acquire_all_write_op_.is_running_ = true;
+
+                if (commit_acquire_all_write_op_.hd_results_.empty())
+                {
+                    commit_acquire_all_write_op_.Resize(1);
+                }
+                auto &hd_result0 = commit_acquire_all_write_op_.hd_results_[0];
+                hd_result0.Reset();
+                hd_result0.SetError(CcErrorCode::DEAD_LOCK_ABORT);
+                hd_result0.ForceError();  // runs post_lambda_, bumps counters
+
+                commit_acquire_all_write_op_.upload_cnt_ = 1;
+                commit_acquire_all_write_op_.finish_cnt_.store(
+                    1, std::memory_order_relaxed);
+                commit_acquire_all_write_op_.fail_cnt_.store(
+                    1, std::memory_order_relaxed);
+                commit_acquire_all_write_op_.remote_ack_cnt_.store(
+                    0, std::memory_order_relaxed);
+
+                DLOG(INFO)
+                    << "FaultInject split_flush_commit_acquire_all_deadlock, "
+                    << " commit_acquire_all_write_op_.IsDeadlock(): "
+                    << commit_acquire_all_write_op_.IsDeadlock()
+                    << " upload_cnt_: "
+                    << commit_acquire_all_write_op_.upload_cnt_
+                    << " fail_cnt_: "
+                    << commit_acquire_all_write_op_.fail_cnt_.load(
+                           std::memory_order_relaxed);
                 // Auto-remove the fault to prevent it from being triggered
                 // again
                 FaultInject::Instance().InjectFault(
                     "split_flush_commit_acquire_all_deadlock", "remove");
             }
             else
             {
                 ForwardToSubOperation(txm, &commit_acquire_all_write_op_);
             }
 #else
             ForwardToSubOperation(txm, &commit_acquire_all_write_op_);
 #endif

This keeps AcquireAllOp::Process() from issuing real locks but guarantees finish_cnt_ == upload_cnt_, fail_cnt_ > 0, and IsDeadlock() == true, so the parent reaches the downgrade path instead of hanging.

🧹 Nitpick comments (3)
tx_service/src/tx_operation.cpp (3)

318-329: Range-read local RW-conflict now always retries; consider bounding or differentiating non-retriable errors

The new behavior for ReadLocalOperation::Forward on ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT and DATA_STORE_ERR is to unconditionally reset and re-run the op. With blocking range reads this makes sense for true lock conflicts, but for persistent DATA_STORE_ERR this can create an effectively infinite retry loop with no backoff or fail-out.

Consider:

  • Treating DATA_STORE_ERR as non-retriable (propagate error instead of retry), or
  • Adding a bounded retry counter / backoff here, similar to other ops using ReRunOp, so a permanently failing KV or CC layer doesn’t leave the tx stuck forever.

613-623: Bucket-range RW-conflict also unboundedly retries; align with desired failure semantics

LockWriteRangeBucketsOp::Forward now handles ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT by resetting and re-processing the op without ever surfacing an abort. That matches the new “blocking + deadlock detection” design, but again there is no retry bound or differentiation between transient lock contention and structural failures.

If ACQUIRE_KEY_LOCK_FAILED_FOR_RW_CONFLICT can be returned for reasons that will not be resolved by deadlock detection (e.g., misconfiguration, protocol mismatch), this op will also spin indefinitely.

I’d recommend:

  • Confirming that this error code is now only used for genuine lock conflicts that must be blocked, and/or
  • Introducing a small bounded retry / timeout path analogous to other operations to preserve liveness on misbehaving nodes.

4535-4575: New logging around commit-phase AcquireAll and fault injector behavior

  • The extra DLOG(INFO) and DLOG(ERROR) lines around commit_acquire_all_write_op_ failure and deadlock handling are reasonable and help when debugging range-split issues.
  • The new CODE_FAULT_INJECTOR("term_SplitFlushOp_CommitAcquireAllWriteOp_Continue", ...) intentionally early-returns under fault to simulate a stalled commit; that’s consistent with other term_*_Continue injectors in this file, but it will leave the op parked until the fault is cleared.

Given how invasive this state-machine is, it’s worth double-checking that all these fault names are documented in your tests/tooling and that they’re always cleaned up (either via auto-remove, like above, or explicit “remove” calls) to avoid accidentally wedging a test environment.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4cc016 and c306859.

📒 Files selected for processing (3)
  • tx_service/include/cc/range_cc_map.h (3 hunks)
  • tx_service/include/dead_lock_check.h (1 hunks)
  • tx_service/src/tx_operation.cpp (7 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/range_cc_map.h
  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/tx_operation.cpp
📚 Learning: 2025-09-25T06:15:41.186Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 142
File: include/tx_req_result.h:140-159
Timestamp: 2025-09-25T06:15:41.186Z
Learning: When cc_notify_ is true in TxResult, both Finish() and Wait() are guaranteed to be called on the same thread, eliminating concurrency concerns and making relaxed memory ordering sufficient for status_combo_ updates.

Applied to files:

  • tx_service/src/tx_operation.cpp
🧬 Code graph analysis (1)
tx_service/include/cc/range_cc_map.h (1)
tx_service/include/dead_lock_check.h (1)
  • RequestCheck (192-200)
🔇 Additional comments (4)
tx_service/include/dead_lock_check.h (1)

246-251: Throttle metadata fits existing time model

The new last_throttled_check_time_ and CHECK_THROTTLE_INTERVAL_ fields align with the existing microsecond-based timing in this class and with the singleton usage of DeadLockCheck. From the header alone there’s no correctness concern; the actual throttling policy will hinge on how these are used in Run()/GatherLockDependancy().

tx_service/include/cc/range_cc_map.h (2)

32-32: Including dead_lock_check.h here is appropriate

Wiring RangeCcMap directly to DeadLockCheck via the header makes sense given the new blocking behavior in ReadCc; no layering or dependency issues are apparent from this file.


233-333: New blocking behavior on range-key lock looks consistent; verify deadlock/cleanup assumptions

The shift from eager abort to blocking on ACQUIRE_LOCK_BLOCKED for the range entry, plus calling DeadLockCheck::RequestCheck() and marking ReadCc as BlockByLock, is consistent with how regular data reads normally participate in deadlock detection and should let the detector see these waits.

Two things are worth explicitly verifying:

  1. DeadLockCheck availability
    DeadLockCheck::RequestCheck() is a no-op if inst_ has not been initialized. With the new blocking behavior, range reads rely on deadlock detection to resolve genuine cycles. Please confirm that DeadLockCheck::Init(...) is called in all deployments/tests before any RangeCcMap::Execute(ReadCc&) can run, or consider a fallback (e.g., logging + abort) if inst_ is null so we don’t risk silent permanent blocking in configurations where deadlock checking is disabled.

  2. Bucket read lock release on deadlock-victim aborts
    This path acquires the bucket read lock first and keeps it while the range lock is blocked. Since the bucket lock is not tracked in the transaction readset and is instead manually released in Execute(PostReadCc&) (and cleaned up in the destructor for drop-table), please double-check that the transaction-abort path used when the deadlock detector kills a victim will also eventually release this bucket lock (either by driving PostReadCc or an equivalent cleanup). Otherwise, a deadlock-induced abort could leave the bucket read lock pinned longer than intended.

Functionally, the change looks sound and aligns range reads with the broader deadlock-detection strategy; the above items are about ensuring progress and avoiding subtle lock leaks under deadlock-handling scenarios.

tx_service/src/tx_operation.cpp (1)

1847-1848: Additional AcquireAll completion DLOG is fine

The added DLOG(INFO) on “all acquire all requests finished” is low-impact and useful for debugging; it’s gated behind DLOG so it won’t spam production logs.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tx_service/src/dead_lock_check.cpp (1)

558-565: Consider simplifying redundant condition checks.

Lines 558-565 re-check the same conditions already verified by the wait_for predicate at lines 535-541. While these checks handle the timeout case correctly, they duplicate logic and reduce readability.

Consider simplifying to:

-        if (!requested_check_.load(std::memory_order_acquire))
-        {
-            continue;
-        }
-        else if (LocalCcShards::ClockTs() - last_check_time_ < time_interval_)
-        {
-            continue;
-        }
-
-        // Reset the check flag brefore gather lock dependancy
         requested_check_.store(false, std::memory_order_release);
         lk.unlock();
         GatherLockDependancy();

The wait_for predicate already ensures both conditions are met when returning true (non-timeout), so these redundant checks can be removed. The defensive programming doesn't add safety here since the mutex protects access to these variables.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c306859 and a2c4eb8.

📒 Files selected for processing (2)
  • tx_service/include/dead_lock_check.h (2 hunks)
  • tx_service/src/dead_lock_check.cpp (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/dead_lock_check.cpp
🔇 Additional comments (2)
tx_service/include/dead_lock_check.h (1)

192-199: LGTM! Proper atomic signaling pattern.

The atomic store with memory_order_release correctly publishes the check request to the waiting thread, which uses memory_order_acquire to observe it.

tx_service/src/dead_lock_check.cpp (1)

525-545: LGTM! Correct atomic wait condition with throttling.

The wait_for lambda properly uses memory_order_acquire to observe the flag set by RequestCheck() and correctly throttles check requests by requiring time_interval_ to elapse since the last check.

Comment thread tx_service/include/dead_lock_check.h Outdated
@githubzilla githubzilla merged commit 7c0a426 into main Dec 12, 2025
4 checks passed
liunyl pushed a commit that referenced this pull request Dec 16, 2025
liunyl added a commit that referenced this pull request Dec 16, 2025
This reverts commit 7c0a426.
Causes checkpoint unable to proceed
@coderabbitai coderabbitai Bot mentioned this pull request Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants