Fix eloqkv report oom#203
Conversation
WalkthroughAdds a per-request abort-on-OOM flag propagated through CC request/handler/proto interfaces, wires the flag from TransactionExecution::HoldingRangeReadLock(), changes CcShard memory-wait handling to optionally abort waiting requests, and updates local/remote handler and request creation paths to carry abort_if_oom. Changes
Sequence Diagram(s)sequenceDiagram
participant Op as Operation (Forward)
participant Tx as TransactionExecution
participant CC as CcHandler (Local/Remote)
participant CcReq as CcRequest
participant Aborter as RequestAborterCc
Note right of Tx `#f3f4f6`: compute abort_if_oom = HoldingRangeReadLock()
Op->>Tx: Forward()
Tx->>CC: Invoke (Acquire/Read/Post/ObjectCommand, abort_if_oom)
CC->>CcReq: create/enqueue request (abort_if_oom_ set)
CcReq->>CC: Execute -> returns OUT_OF_MEMORY
CC->>Aborter: handle OOM
alt CcReq->AbortIfOom() == true
Aborter->>CC: re-enqueue request (retry later)
else CcReq->AbortIfOom() == false
Aborter->>CC: abort request immediately (finish with OOM)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 2 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cc/cc_shard.cpp (1)
380-399: Requests queued for abort never get drained once memory recoversRange-read-lock holders are redirected into
wait_for_memory_to_abort_, but the only code that clears that deque isAbortRequestsAfterMemoryFree(). In the common success path whereShardCleanCc::Executefrees enough memory, we only callDequeueWaitListAfterMemoryFree(), so the abort deque is never touched—those requests hang forever, holding their locks and starving callers. Please make sure the abort deque is drained whenever we retry the wait-list, e.g.:bool CcShard::DequeueWaitListAfterMemoryFree(bool deque_all) { + AbortRequestsAfterMemoryFree(); if (cc_wait_list_for_memory_.size() == 0) { return true;
🧹 Nitpick comments (1)
include/cc/cc_request.h (1)
8434-8434: Good: use AbortRequestsAfterMemoryFree on terminal OOM pathSwitching to AbortRequestsAfterMemoryFree() when we’ve exhausted cleaning is aligned with the new abort-queue design and prevents futile retries. Consider emitting a counter/metric for “oom_aborted_requests” to aid RCA and alerting.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
include/cc/cc_request.h(5 hunks)include/cc/cc_shard.h(3 hunks)include/cc/object_cc_map.h(2 hunks)include/cc/template_cc_map.h(10 hunks)src/cc/cc_shard.cpp(4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-11T07:10:40.324Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.324Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
include/cc/object_cc_map.hinclude/cc/template_cc_map.hinclude/cc/cc_request.hsrc/cc/cc_shard.cppinclude/cc/cc_shard.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
include/cc/cc_request.hsrc/cc/cc_shard.cppinclude/cc/cc_shard.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
src/cc/cc_shard.cppinclude/cc/cc_shard.h
🧬 Code graph analysis (1)
include/cc/cc_shard.h (2)
src/cc/cc_shard.cpp (8)
EnqueueWaitListIfMemoryFull(368-378)EnqueueWaitListIfMemoryFull(368-369)DequeueWaitListAfterMemoryFree(380-400)DequeueWaitListAfterMemoryFree(380-380)AbortRequestsAfterMemoryFree(402-414)AbortRequestsAfterMemoryFree(402-402)HoldingRangeReadLock(954-991)HoldingRangeReadLock(954-954)include/cc/cc_request.h (2)
CcRequestBase(4468-4488)CcRequestBase(8340-8476)
🔇 Additional comments (5)
include/cc/cc_request.h (2)
3467-3467: DequeueWaitListAfterMemoryFree usage is correctVerification confirms the API is properly documented and usage is intentional:
- Line 3467: Return ignored in periodic opportunistic drain (context: every 20 batches in DefragShardHeapCc). No scheduling dependency on emptiness signal.
- Line 3514:
trueflag passes deque_all to drain all waiting requests at end-of-defrag—documented in method signature ("If true, dequeue all the requests in the wait list") and code comment.- AbortRequestsAfterMemoryFree (line 8433) operates on a separate abort queue when OOM persists, distinct from the main wait list drain.
No changes needed.
8454-8454: Behavior is consistent across both branches; no starvation risk with current queue designBoth call sites (lines 8453 and 8469) follow the same pattern: capture
DequeueWaitListAfterMemoryFree()return and conditionally re-enqueue based on emptiness. The two-queue design (cc_wait_list_for_memory_vs.wait_for_memory_to_abort_) inherently prevents starvation—the abort queue is only processed viaAbortRequestsAfterMemoryFree()at line 8433 when memory pressure persists, not starved by normal queue re-enqueueing. Adding a brief comment at the API declaration ininclude/cc/cc_shard.hlines 335 and 339 explaining this queue separation and fairness strategy would clarify intent for future readers.include/cc/object_cc_map.h (2)
454-455: LGTM! API signature update correctly applied.The addition of
req.NodeGroupId()as the second parameter aligns with the refactored memory-management API. This ensures node-group context is properly passed when enqueueing requests that are blocked on memory availability.
2201-2201: LGTM! Consistent API update for log replay path.The change correctly applies the same API signature update for the log replay OOM handling path, maintaining consistency with the other call site at lines 454-455.
include/cc/template_cc_map.h (1)
335-336: NodeGroupId plumbed into EnqueueWaitListIfMemoryFull — looks good.All updated callsites now pass the request’s NodeGroupId to the memory‑full wait list, aligning with the new OOM handling semantics. No issues spotted here.
Also applies to: 841-842, 1075-1075, 1657-1657, 1719-1719, 1790-1791, 6742-6742, 7515-7515, 7707-7707
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cc/cc_shard.cpp (1)
430-433: Include both wait lists in the size calculation.The issue is valid.
WaitListSizeForMemory()omitswait_for_memory_to_abort_, causing two problems:
Incorrect wakeup logic:
OnDirtyDataFlushed()(line 1050) checksif (WaitListSizeForMemory() > 0)to decide whether to wake up the shard. If only the abort list has pending requests, this check fails, missing necessary cleanup.Incomplete metrics: Stats recording at line 7358 undercounts actual waiting requests.
Both queues hold memory-waiting requests (comments at lines 1147–1152); they differ only in abort semantics. The parallel check in
DequeueWaitListAfterMemoryFree()(lines 382–383) treats them as a unit, confirming they belong together.Change line 432 to:
return cc_wait_list_for_memory_.size() + wait_for_memory_to_abort_.size();
🧹 Nitpick comments (1)
src/cc/cc_shard.cpp (1)
368-378: Clarify the rationale for routing range-read-lock holders to the abort queue.The logic routes requests holding range read locks to
wait_for_memory_to_abort_while others go tocc_wait_list_for_memory_. However, the design rationale isn't documented. Consider adding a comment explaining why holding a range read lock necessitates aborting on OOM (likely to prevent deadlock or prolonged lock holding during memory pressure).Additionally, there's no null check on
reqbefore callingreq->Txn(). While the caller might guarantee non-null, defensive validation would prevent potential crashes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/cc/cc_shard.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-11T07:10:40.324Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.324Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
src/cc/cc_shard.cpp
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
src/cc/cc_shard.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
src/cc/cc_shard.cpp
🔇 Additional comments (2)
src/cc/cc_shard.cpp (2)
380-414: Semantic confusion inwait_for_memory_to_abort_queue handling is valid; consider renaming for clarity.The usage patterns confirm your concern.
DequeueWaitListAfterMemoryFreeis called only when memory becomes available (heap not full, post-defrag, memory freed), whileAbortRequestsAfterMemoryFreeis called only when memory remains constrained. However, the naming ofwait_for_memory_to_abort_is misleading:
- Requests in this queue are enqueued/retried in
DequeueWaitListAfterMemoryFreewhen memory is available- Requests in this queue are aborted in
AbortRequestsAfterMemoryFreeonly when memory remains unavailableThe queue serves dual purposes: retry-on-recovery or abort-on-failure. Consider renaming to clarify intent, e.g.,
wait_for_memory_retry_or_abort_or restructuring the logic to make the dual purpose explicit.Additionally, the header doc for
AbortRequestsAfterMemoryFree(line 337–338) says "Abort the cc requests whose tx is holding a range read lock" but the implementation abortswait_for_memory_to_abort_requests. Update the documentation to match the actual behavior.
968-1005: Code implementation is correct; API verification complete.The
lock->ReadLocks().contains(txn)call on line 998 is valid. absl::flat_hash_set supports a contains(key) method (returns bool), following the C++20 container API. The method logic correctly checks for range partition read locks with proper null checks and early returns.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/cc/cc_shard.cpp (2)
382-421: Consider simplifying the dual-list processing logic.The current implementation interleaves processing of two lists using a shared flags list, which adds significant complexity:
Maintenance burden: The loop must carefully manage three iterators (
it,abort_it, and flags) in sync, making the code harder to understand and maintain.Potential starvation: With the 20-item limit and FIFO processing, if requests are continuously added to the front of the combined order, requests at the back might experience delays.
Alternative approach: Consider using a single list of
std::pair<CcRequestBase*, bool>where the boolean indicates whether to abort. This would simplify the logic and reduce the risk of synchronization errors.Example refactor:
std::deque<std::pair<CcRequestBase*, bool>> wait_list_for_memory_; // true = needs abort check, false = can retry
990-1027: Consider caching the result to avoid repeated computation.
HoldingRangeReadLock()is called on the memory allocation path inEnqueueWaitListIfMemoryFull(). The method iterates through all locks held by a transaction, which could be expensive if a transaction holds many locks.Performance considerations:
Hot path: This method is invoked during OOM scenarios when the system is already under memory pressure. The O(n) iteration through
cce_list_could add latency.Optimization opportunity: Consider caching a boolean flag in
TxLockInfothat tracks whether the transaction holds any range read locks. Update this flag when locks are acquired/released.Alternative: If range read locks are rare, the current implementation may be acceptable. Profile to determine if this is a bottleneck.
include/cc/cc_shard.h (1)
1146-1156: Consider using std::deque instead of std::list for flags.The current implementation uses
std::list<bool>forwait_for_memory_flags_, which has higher memory overhead:
Memory efficiency:
std::list<bool>stores each boolean in a separate node with pointer overhead (typically 16-24 bytes per bool).std::deque<bool>would be more memory-efficient.Cache locality:
std::dequeprovides better cache locality thanstd::list.Iterator stability: If
std::listwas chosen for iterator stability during erase operations inAbortRequestsAfterMemoryFree(), note thatstd::dequealso supports stable iteration when erasing from the front.Apply this diff to use std::deque:
- std::list<bool> wait_for_memory_flags_; + std::deque<bool> wait_for_memory_flags_;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
include/cc/cc_shard.h(3 hunks)src/cc/cc_shard.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-11T07:10:40.324Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.324Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
src/cc/cc_shard.cppinclude/cc/cc_shard.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
src/cc/cc_shard.cppinclude/cc/cc_shard.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
src/cc/cc_shard.cppinclude/cc/cc_shard.h
🧬 Code graph analysis (1)
include/cc/cc_shard.h (2)
src/cc/cc_shard.cpp (8)
EnqueueWaitListIfMemoryFull(368-380)EnqueueWaitListIfMemoryFull(368-369)DequeueWaitListAfterMemoryFree(382-421)DequeueWaitListAfterMemoryFree(382-382)AbortRequestsAfterMemoryFree(423-448)AbortRequestsAfterMemoryFree(423-423)HoldingRangeReadLock(990-1027)HoldingRangeReadLock(990-990)include/cc/cc_request.h (3)
CcRequestBase(4468-4488)CcRequestBase(8340-8476)bool(114-131)
🔇 Additional comments (4)
src/cc/cc_shard.cpp (2)
450-455: Good defensive programming with the assertion.The assertion ensures that the three data structures (
cc_wait_list_for_memory_,wait_for_memory_to_abort_, andwait_for_memory_flags_) remain synchronized. This is essential given the complexity of managing three separate structures.The assertion will catch synchronization bugs early in development and testing.
368-380: The review comment's concerns about thread safety and exception safety are not substantiated by the codebase architecture.The verification reveals that this code follows a per-shard, per-thread work distribution design:
Thread safety: The
Enqueue()method signatures includethd_idparameters, and LocalCcShards routes requests to shards by thread ID. The three data structures are accessed within the context of a single shard's processing, protected implicitly through the work distribution architecture rather than explicit locks. This is a valid and common synchronization pattern.Exception safety:
HoldingRangeReadLock()is a const method usingstd::unordered_map::find(), which does not throw exceptions. Thepush_back()operations onstd::dequeandstd::listalso cannot throw (no capacity reservations are involved).Atomicity: The assert in
WaitListSizeForMemory()verifies consistency across the three structures, preventing inconsistent states from being exposed to callers.The code does not require the explicit mutex synchronization suggested in the review because the synchronization model is enforced through thread-safe work distribution, not through locks on shared data structures.
Likely an incorrect or invalid review comment.
include/cc/cc_shard.h (2)
323-339: LGTM - API signatures and documentation are clear.The updated method signatures and documentation clearly reflect the new OOM handling design:
EnqueueWaitListIfMemoryFullnow requires the node group ID to determine abort behavior.DequeueWaitListAfterMemoryFreehas a simplified signature with a single boolean parameter.AbortRequestsAfterMemoryFreeprovides a clear API for aborting waiting requests.The documentation accurately describes the new behavior.
519-519: The verification shows only one declaration ofHoldingRangeReadLockat line 519 ininclude/cc/cc_shard.h. The AI summary's claim of duplicate declarations in both public and private sections is not supported by the code. No compilation errors from duplicates would occur because none exist.Likely an incorrect or invalid review comment.
71c99bf to
0b7210f
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/tx_execution.cpp (5)
1090-1099: Wrong variable in trace macro (compilation error)ReloadCacheTxRequest handler passes &fi_req to TX_TRACE_ACTION_WITH_CONTEXT, but the parameter is req.
Apply this diff:
void TransactionExecution::ProcessTxRequest(ReloadCacheTxRequest &req) { TX_TRACE_ACTION_WITH_CONTEXT( this, - &fi_req, + &req, [this]() -> std::string { return std::string("\"tx_number\":") .append(std::to_string(this->TxNumber())) .append("\"tx_term\":") .append(std::to_string(this->tx_term_)); });
1136-1140: to_string misuse in trace string (compilation error)std::string(this->TxNumber()) is invalid. Use std::to_string.
Apply this diff:
- return std::string("\"tx_number\":") - .append(std::string(this->TxNumber())) + return std::string("\"tx_number\":") + .append(std::to_string(this->TxNumber())) .append("\"tx_term\":") .append(std::to_string(this->tx_term_));
6396-6405: Wrong variable name in trace macro (compilation error)In PostProcess(ReleaseScanExtraLockOp&), TX_TRACE_ACTION_WITH_CONTEXT uses &lock_op, but the parameter is unlock_op.
Apply this diff:
TX_TRACE_ACTION_WITH_CONTEXT( this, - &lock_op, + &unlock_op, [this]() -> std::string { return std::string("\"tx_number\":") .append(std::to_string(this->TxNumber())) .append("\"tx_term\":") .append(std::to_string(this->tx_term_)); });
6452-6461: Missing semicolon in trace string (compilation error)The return expression building the string is missing a terminating semicolon.
Apply this diff:
- .append(std::to_string(this->tx_term_)) + .append(std::to_string(this->tx_term_));
7444-7452: Missing semicolon in trace string (compilation error)Same issue in PostProcess(KickoutDataAllOp&).
Apply this diff:
- .append(std::to_string(this->tx_term_)) + .append(std::to_string(this->tx_term_));
🧹 Nitpick comments (3)
include/tx_execution.h (1)
559-560: LGTM! Consider adding documentation.The method addition correctly supports the OOM retry logic by checking whether range read locks are held. The private access with friend visibility to operation classes is appropriate.
To improve maintainability, consider adding a brief doc comment:
+ /** + * @brief Checks if the transaction currently holds any range read locks. + * + * Used to determine whether it's safe to retry on OUT_OF_MEMORY errors. + * Transactions holding range locks should not retry to avoid consistency issues. + */ bool HoldingRangeReadLock() const;include/read_write_set.h (1)
554-566: Consider adding documentation to clarify the exclusion logic.While the implementation is correct, adding a doc comment would help future maintainers understand why
catalog_ccm_name_sv,range_bucket_ccm_name_sv, andcluster_config_ccm_name_svare excluded from being classified as range read locks. This distinction is important for the OOM retry logic.Example documentation:
+ /** + * @brief Checks if the transaction is holding any range read locks. + * + * Range read locks are represented by metadata read entries, excluding + * system catalog tables (catalog, range_bucket, cluster_config) which + * do not prevent OOM retry operations. + * + * @return true if holding range read locks, false otherwise. + */ bool HasRangeReadLock() constsrc/tx_operation.cpp (1)
8431-8445: Align BatchReadOperation OOM handling with new gating.This path still retries on OOM unconditionally. Consider gating with !HoldingRangeReadLock() to avoid starving checkpoints when range-read locks are held.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
include/cc/cc_request.h(0 hunks)include/read_write_set.h(1 hunks)include/tx_execution.h(1 hunks)src/tx_execution.cpp(1 hunks)src/tx_operation.cpp(5 hunks)
💤 Files with no reviewable changes (1)
- include/cc/cc_request.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
include/read_write_set.h
🧬 Code graph analysis (1)
include/tx_execution.h (1)
src/tx_execution.cpp (2)
HoldingRangeReadLock(337-340)HoldingRangeReadLock(337-337)
🔇 Additional comments (6)
src/tx_execution.cpp (1)
337-341: Expose range-lock state for OOM retry gating — looks goodSimple, const delegation to rw_set_.HasRangeReadLock(). Fits the intended use in OOM retry gating.
Please confirm HasRangeReadLock() returns true only when a real range table read lock is present (and excludes catalog, range_bucket, and cluster_config entries) to avoid false positives.
include/read_write_set.h (1)
554-566: Verification complete: Constants are properly defined and accessible.The three constants (
catalog_ccm_name_sv,range_bucket_ccm_name_sv,cluster_config_ccm_name_sv) are defined ininclude/type.hat lines 577-581 and are accessible toinclude/read_write_set.hvia transitive include throughread_write_entry.h. The implementation ofHasRangeReadLock()is correct and follows the same pattern used elsewhere in the codebase (e.g.,src/tx_operation.cpp:645-650).src/tx_operation.cpp (4)
199-214: Good OOM retry gate in ReadOperation.Gated by !HoldingRangeReadLock(), resets hd_result_ and re-runs with a sane 2s backoff via retry_num_++ semantics. Looks correct.
2079-2089: Confirm ref-count reinit on PostWriteAllOp retry.After hd_result_.Reset() + Value().Clear(), ensure the ref count is re-established on retry (via Process() path). If not, calls may never complete.
Can you confirm the ref-count is set again on retry or add an explicit hd_result_.SetRefCnt(ng_cnt * keys_.size()) before ReRunOp?
5468-5478: OOM retry gate in ObjectCommandOp looks good.Gated by !HoldingRangeReadLock(), resets result before retry. No issues spotted.
1-8697: Action: Verify OOM retry gating and handlingMost tx ops correctly gate retries on CcErrorCode::OUT_OF_MEMORY with !txm->HoldingRangeReadLock(); however two areas need explicit verification:
- BatchReadOperation::Forward — src/tx_operation.cpp (around lines 8427–8432): OUT_OF_MEMORY is marked and ReRunOp is invoked without a HoldingRangeReadLock() guard — confirm this is intentional or add the same gate.
- sk_generator retry logic — src/sk_generator.cpp (around lines 374–379 and 1112–1114): uses sleep/retry on OOM — confirm this behavior is acceptable.
Other OUT_OF_MEMORY occurrences in include/cc/* and src/cc/cc_shard.cpp appear to be CC-internal error returns (expected).
| // If return CcErrorCode::OUT_OF_MEMORY, and if the tx is not holding | ||
| // range read lock, we should retry the request. | ||
| else if (hd_result_.ErrorCode() == CcErrorCode::OUT_OF_MEMORY && | ||
| !txm->HoldingRangeReadLock()) | ||
| { | ||
| retry_num_++; | ||
| hd_result_.Reset(); | ||
| ReRunOp(txm); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Reset remote_ack_cnt_ before re-run to avoid stale acks.
On OOM you re-run without clearing remote_ack_cnt_. Stale acks can mislead the timeout/deadlock path. Reset it before retry.
Apply this diff:
- else if (hd_result_.ErrorCode() == CcErrorCode::OUT_OF_MEMORY &&
- !txm->HoldingRangeReadLock())
+ else if (hd_result_.ErrorCode() == CcErrorCode::OUT_OF_MEMORY &&
+ !txm->HoldingRangeReadLock())
{
retry_num_++;
hd_result_.Reset();
+ remote_ack_cnt_.store(0, std::memory_order_relaxed);
ReRunOp(txm);
return;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // If return CcErrorCode::OUT_OF_MEMORY, and if the tx is not holding | |
| // range read lock, we should retry the request. | |
| else if (hd_result_.ErrorCode() == CcErrorCode::OUT_OF_MEMORY && | |
| !txm->HoldingRangeReadLock()) | |
| { | |
| retry_num_++; | |
| hd_result_.Reset(); | |
| ReRunOp(txm); | |
| return; | |
| } | |
| else if (hd_result_.ErrorCode() == CcErrorCode::OUT_OF_MEMORY && | |
| !txm->HoldingRangeReadLock()) | |
| { | |
| retry_num_++; | |
| hd_result_.Reset(); | |
| remote_ack_cnt_.store(0, std::memory_order_relaxed); | |
| ReRunOp(txm); | |
| return; | |
| } |
🤖 Prompt for AI Agents
In src/tx_operation.cpp around lines 517 to 526, on the OUT_OF_MEMORY retry path
you re-run the operation without clearing remote_ack_cnt_, which can leave stale
ACKs and confuse timeout/deadlock logic; before calling ReRunOp(txm)
increment/reset logic, explicitly reset remote_ack_cnt_ (e.g., set it to 0 or
call the existing reset helper) and any related per-run ack state, then clear
hd_result_ and call ReRunOp(txm) as before.
| // If return CcErrorCode::OUT_OF_MEMORY, and if the tx is | ||
| // not holding range read lock, we should retry the request. | ||
| else if (hd_result.ErrorCode() == | ||
| CcErrorCode::OUT_OF_MEMORY && | ||
| !txm->HoldingRangeReadLock()) | ||
| { | ||
| retry_num_++; | ||
| ReRunOp(txm); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Reset finish/fail/ack counters on AcquireAllOp OOM retry.
Counters accumulate across retries and can prematurely satisfy finish conditions or inflate failures. Reset them before re-run.
Apply this diff:
- else if (hd_result.ErrorCode() ==
- CcErrorCode::OUT_OF_MEMORY &&
- !txm->HoldingRangeReadLock())
+ else if (hd_result.ErrorCode() ==
+ CcErrorCode::OUT_OF_MEMORY &&
+ !txm->HoldingRangeReadLock())
{
retry_num_++;
+ finish_cnt_.store(0, std::memory_order_relaxed);
+ fail_cnt_.store(0, std::memory_order_relaxed);
+ remote_ack_cnt_.store(0, std::memory_order_relaxed);
ReRunOp(txm);
return;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // If return CcErrorCode::OUT_OF_MEMORY, and if the tx is | |
| // not holding range read lock, we should retry the request. | |
| else if (hd_result.ErrorCode() == | |
| CcErrorCode::OUT_OF_MEMORY && | |
| !txm->HoldingRangeReadLock()) | |
| { | |
| retry_num_++; | |
| ReRunOp(txm); | |
| return; | |
| } | |
| // If return CcErrorCode::OUT_OF_MEMORY, and if the tx is | |
| // not holding range read lock, we should retry the request. | |
| else if (hd_result.ErrorCode() == | |
| CcErrorCode::OUT_OF_MEMORY && | |
| !txm->HoldingRangeReadLock()) | |
| { | |
| retry_num_++; | |
| finish_cnt_.store(0, std::memory_order_relaxed); | |
| fail_cnt_.store(0, std::memory_order_relaxed); | |
| remote_ack_cnt_.store(0, std::memory_order_relaxed); | |
| ReRunOp(txm); | |
| return; | |
| } |
🤖 Prompt for AI Agents
In src/tx_operation.cpp around lines 1951 to 1960, the AcquireAllOp path retries
on CcErrorCode::OUT_OF_MEMORY without resetting the operation's finish/fail/ack
counters, causing counters to accumulate across retries; before incrementing
retry_num_ and calling ReRunOp(txm) reset the finish, fail, and ack counters to
their initial zeroed state so the retried operation starts with clean counters
(e.g., set finish_count_, fail_count_, ack_count_ = 0 or call the op's reset
method) and then proceed to increment retry_num_ and ReRunOp(txm).
0b7210f to
a88aa31
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
include/cc/cc_request.h (1)
411-417: Bug confirmed: iso_level not propagated in AcquireCc Reset (string-key path).The pointer-key Reset overload (lines 384–385) passes
iso_levelto the base Reset, but the string-key overload (lines 415–416) does not, despite acceptingiso_levelas a parameter. This causes silent behavior divergence.Apply this fix at lines 415–416:
TemplatedCcRequest<AcquireCc, std::vector<AcquireKeyResult>>::Reset( - tname, res, ng_id, txn, tx_term, proto); + tname, res, ng_id, txn, tx_term, proto, iso_level);include/cc/cc_shard.h (1)
28-35: Add the missing<list>include.
cc_wait_list_for_memory_is now astd::list, but this header still only pulls in<deque>. Any translation unit that includescc_shard.hwithout already including<list>will fail to compile. Please include<list>here (and drop<deque>if it’s no longer used).Apply this diff:
-#include <deque> +#include <list>
🧹 Nitpick comments (3)
include/cc/cc_request.h (2)
3516-3517: Clarify the meaning of thedeque_allparameter and ensure consistent usage.The function
DequeueWaitListAfterMemoryFree(bool deque_all = false)is a single API with an optional parameter, but it's called inconsistently: without arguments (lines 3516, 8528, 8544) and withtrue(line 3563). Add documentation to clarify whatdeque_allcontrols (likely "dequeue all pending requests vs. dequeue a batch"), and standardize call sites to use explicit arguments when the intent differs from the default, or rename overload-style to reduce ambiguity.
7383-7390: Now let me examine the actual code to verify the concern:Based on the evidence gathered from the script results and web searches, I can now provide a definitive analysis. The search results confirm that:
- Backpressure prevents excessive memory or CPU consumption caused by an unbounded number of queued requests
- When a queue is unavailable due to being full, that signals backpressure and threads should sleep until the situation changes
- To prevent OOM, backpressure should limit pending task queues; continuously submitting faster than processing causes task accumulation and OOM
From the script output, I verified that:
- A dedicated
EnqueueWaitListIfMemoryFull()API exists to route requests to a wait list- The code at lines 7383-7390 calls regular
Enqueue()instead of the wait list method- This regular Enqueue places requests back into the ready queue, NOT the wait list
The review comment's concern is valid and substantive: re-enqueueing to the ready queue under OOM causes requests to loop back immediately instead of yielding, creating thrashing behavior rather than proper backpressure.
Re-enqueue to wait list, not ready queue under OOM.
The review identifies a real performance issue: when OUT_OF_MEMORY and
req->AbortIfOom()is false, the code at lines 7383-7390 ofinclude/cc/cc_request.hcallsccs.Enqueue(ccs.LocalCoreId(), req), which routes the request to the shard's ready queue. Under sustained memory pressure, this causes requests to loop through OOM again immediately instead of yielding, creating thrashing.The codebase provides
EnqueueWaitListIfMemoryFull()specifically for this scenario. Replace the call withccs.EnqueueWaitListIfMemoryFull(req)so requests defer until memory is freed, allowing proper memory recovery and avoiding tight re-queueing loops.src/tx_execution.cpp (1)
2064-2081: Passing partition_id and abort_if_oom into Read() is consistentThe added
partition_idandHoldingRangeReadLock()arguments line up with howpartition_idis computed earlier for both range and hash-partitioned tables, and they let the CC layer distinguish OOM-on-range-lock cases without changing other semantics. Looks correct for the single-key read path.If you later add more calls with
HoldingRangeReadLock()in this method, consider caching it into a localabort_if_oomfor readability, but it’s not necessary now.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
include/cc/cc_handler.h(9 hunks)include/cc/cc_req_base.h(1 hunks)include/cc/cc_req_misc.h(1 hunks)include/cc/cc_request.h(43 hunks)include/cc/cc_shard.h(2 hunks)include/cc/local_cc_handler.h(6 hunks)include/cc/template_cc_map.h(1 hunks)include/proto/cc_request.proto(5 hunks)include/read_write_set.h(1 hunks)include/remote/remote_cc_handler.h(5 hunks)include/tx_execution.h(1 hunks)src/cc/cc_shard.cpp(2 hunks)src/cc/local_cc_handler.cpp(17 hunks)src/remote/remote_cc_handler.cpp(10 hunks)src/remote/remote_cc_request.cpp(6 hunks)src/tx_execution.cpp(20 hunks)src/tx_operation.cpp(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- include/cc/template_cc_map.h
🚧 Files skipped from review as they are similar to previous changes (1)
- src/tx_operation.cpp
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
src/tx_execution.cppsrc/cc/cc_shard.cppsrc/cc/local_cc_handler.cppinclude/cc/cc_shard.hinclude/cc/cc_request.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
src/cc/cc_shard.cppinclude/read_write_set.hsrc/cc/local_cc_handler.cppinclude/cc/cc_shard.hinclude/cc/cc_request.hinclude/cc/cc_handler.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
src/cc/local_cc_handler.cppinclude/cc/cc_shard.hinclude/cc/cc_request.hinclude/cc/cc_handler.h
🧬 Code graph analysis (3)
include/tx_execution.h (1)
src/tx_execution.cpp (2)
HoldingRangeReadLock(337-340)HoldingRangeReadLock(337-337)
include/cc/cc_shard.h (2)
src/cc/cc_shard.cpp (6)
EnqueueWaitListIfMemoryFull(378-381)EnqueueWaitListIfMemoryFull(378-378)DequeueWaitListAfterMemoryFree(383-404)DequeueWaitListAfterMemoryFree(383-383)AbortRequestsAfterMemoryFree(406-426)AbortRequestsAfterMemoryFree(406-406)include/cc/cc_request.h (2)
CcRequestBase(4517-4537)CcRequestBase(8415-8551)
include/cc/cc_request.h (2)
include/cc/cc_req_misc.h (2)
err_code_(530-554)err_code_(691-691)include/error_messages.h (1)
CcErrorCode(237-436)
🔇 Additional comments (35)
include/cc/cc_request.h (8)
8509-8509: Shard OOM handling path reads well.Combining AbortRequestsAfterMemoryFree() with DequeueWaitListAfterMemoryFree() ensures abort-on-OOM requests are surfaced promptly and others resume after free. LGTM.
Also applies to: 8528-8529, 8544-8545
931-948: PostWriteAllCc: tx_term and abort-if-OOM propagation LGTM.All Reset paths carry tx_term and persist abort_if_oom_ consistently.
Also applies to: 959-976, 989-1005
539-556: AcquireAllCc: defaults and propagation look correct.iso_level defaults and abort_if_oom defaults are sensible; both paths set abort_if_oom_. Good.
Also applies to: 569-586
1307-1313: ReadCc: abort-if-OOM wiring is consistent across overloads.All Reset variants carry abort_if_oom and set the flag. Good.
Also applies to: 1361-1366, 1415-1421
2958-2962: ScanSliceCc always aborts on OOM — appropriate for range scans.Given ScanSliceCc serves range-partition scans, aborting instead of queuing under OOM prevents prolonged memory pressure. LGTM. Based on learnings.
6621-6654: ApplyCc abort-if-OOM semantics confirmed correct.Verification shows
abort_if_oomis set consistently based onHoldingRangeReadLock()across all Reset() overloads and CC request types. Since range read locks are used for blocking commands (BLMOVE, BLMPOP, etc.), the parameter is set judiciously—true only when the transaction holds a range read lock, preventing OOM during blocking operations. The implementation is sound.
380-398: Inaccurate characterization of implementation.The review states that "abort_if_oom_ flags are consistently added to AcquireCc, AcquireAllCc, PostWriteAllCc, ReadCc, ApplyCc, UploadBatchCc, UploadBatchSlicesCc, and ScanSliceCc." Verification shows this is incorrect:
- 5 types (AcquireCc, AcquireAllCc, PostWriteAllCc, ReadCc, ApplyCc) define configurable
abort_if_oom_members- 3 types (UploadBatchCc, UploadBatchSlicesCc, ScanSliceCc) have no
abort_if_oom_member and return hardcodedtruefromAbortIfOom()The code itself is correct—different types require different abort-on-OOM semantics, and hardcoding
truefor the latter three is intentional and appropriate. The review's approval is valid, but its supporting statement is factually inaccurate.Likely an incorrect or invalid review comment.
7688-7692: Verified: Callers have retry/backoff logic in place for UploadBatch operations.The review's concern is already addressed in the codebase. In
src/sk_generator.cpp(lines 1113-1122),OUT_OF_MEMORYerrors are explicitly handled within a do-while retry loop with a 30-second backoff before retrying. The loop continues untilNO_ERRORis returned or a terminal error (leadership changes) occurs. Transient failures use a 1-second backoff. TheAbortIfOom() = truedesign for memory-heavy batch operations is intentional and appropriate.include/cc/cc_req_misc.h (1)
515-518: LGTM! Clean override implementation.The
AbortIfOom()override correctly signals thatFillStoreSliceCcrequests should abort on out-of-memory conditions rather than waiting. This aligns with the PR's goal of improving OOM handling for slice operations.include/tx_execution.h (1)
559-559: LGTM! Well-placed method declaration.The
HoldingRangeReadLock()method is appropriately declared and will be used to determine whether a transaction should abort on OOM based on its range read lock state.include/cc/cc_req_base.h (1)
95-98: LGTM! Safe virtual method addition.The virtual
AbortIfOom()method with a safe default (returnsfalse) allows derived classes to opt into abort-on-OOM behavior without breaking existing functionality.include/proto/cc_request.proto (1)
799-799: LGTM! Consistent protobuf field additions.The
abort_if_oomboolean field has been added to five request messages with appropriate field numbers. The proto3 default value offalseprovides safe backwards compatibility.Also applies to: 827-827, 889-889, 952-952, 1327-1327
include/read_write_set.h (1)
554-566: The constants are properly accessible; no compilation issues exist.The string_view constants are defined in
include/type.hand are correctly accessible to theHasRangeReadLock()method through the transitive include chain:read_write_set.h→read_write_entry.h→type.h. No modifications are needed.Likely an incorrect or invalid review comment.
src/tx_execution.cpp (9)
337-340: Expose rw_set range-lock state via HoldingRangeReadLock()Thin wrapper over
rw_set_.HasRangeReadLock()is clear, const-correct, and centralizes the “holding range lock” predicate for reuse in the new OOM paths.
4013-4055: AcquireWrite now propagates iso_level and abort_if_oom correctlyDeriving
abort_if_oomonce perAcquireWriteOperationand threading it (withiso_level_) into allAcquireWritecalls—both primary and forwarded—keeps behavior uniform across all write locks. This matches the intended “abort if holding any range read lock” semantics and doesn’t alter the acquire loop structure.Given this touches commit-time locking, please run/extend tests that (a) hold range locks and hit OOM during acquire-write, and (b) commit without range locks, to confirm error reporting and abort behavior are as expected.
5605-5625: AcquireAllOp: wiring abort_if_oom into AcquireWriteAllUsing a single
abort_if_oom = HoldingRangeReadLock()and passing it, together withacq_all_op.cc_op_, intoAcquireWriteAllfor every (ng, key) pair is consistent with the rest of the CC API and gives cluster-wide acquisitions the same OOM behavior as per-key acquires. No issues spotted.
5663-5704: PostWriteAllOp: abort_if_oom and write_type propagation looks correct
PostWriteAllOpnow derivesabort_if_oomonce and supplies bothwrite_type_andabort_if_oomfor remote and localPostWriteAllcalls. That keeps semantics symmetric across node groups and matches the updated CC interface. The surrounding control flow is unchanged.
5778-5818: ReleaseCatalogWriteAll now honors abort_if_oomComputing
abort_if_oomviaHoldingRangeReadLock()and passingPostWriteType::PostCommitplusabort_if_oominto all catalogPostWriteAllcalls aligns catalog post-commit with the general OOM-handling contract, while still usingcommit_ts/tx_op_failed_ts_to distinguish commit vs abort. Wiring looks consistent.
6675-6693: ObjectCommandOp: OOM behavior gated by range-lock stateThe extended
ObjectCommandcall now passesobj_cmd_op.always_redirect_andHoldingRangeReadLock()as the final arguments. That lets object commands inherit the same abort-on-OOM behavior when the transaction is holding range locks, without changing existing routing or commit flags. Signature usage looks correct.
7065-7090: MultiObjectCommandOp: shared abort_if_oom flag per batchHere
abort_if_oomis computed once and reused for each per-keyObjectCommandinvocation in the batch, which is appropriate since the condition depends only on the transaction’s global range-lock state. Combined with the precomputed shard codes, this keeps multi-key commands consistent with single-keyObjectCommandOp.
7317-7347: CmdForwardAcquireWriteOp: forward writes aligned with main acquire pathIntroducing
abort_if_oom = HoldingRangeReadLock()and passing bothiso_level_andabort_if_oomintoAcquireWritefor forwarded entries ensures forwarded write locks see the same isolation/abort semantics as primary ones. The placement and reuse ofabort_if_oomare sound.
7647-7691: BatchReadOperation: new partition_id and abort_if_oom parametersFor each batch tuple,
partition_idis derived from eitherMapKeyHashToHashPartitionIdorrange_ids_, then passed along with a per-operationabort_if_oomintoRead. This mirrors the single-key read path changes and should give batched reads the same OOM behavior when range locks are held. The additional arguments look correctly ordered and typed.src/remote/remote_cc_handler.cpp (5)
52-93: AcquireWrite: abort_if_oom and iso_level are plumbed correctly into the RPCThe new
IsolationLevel iso_levelandbool abort_if_oomparameters are consistently forwarded and serialized (set_iso_level,set_abort_if_oom), matching the interface expectations. No issues here.
108-148: AcquireWriteAll: abort_if_oom flag is correctly serialized
CcOperation cc_opandabort_if_oomare cleanly added and forwarded, andabort_if_oomis written toAcquireAllRequest. This aligns with the updated local handler and interface.
213-268: PostWriteAll: abort_if_oom added without changing existing semanticsThe new
PostWriteType post_write_typeandabort_if_oomparameters are used to setcommit_typeandabort_if_oomonPostWriteAllRequest, preserving the existing record/commit_ts handling. Looks consistent with the rest of the API.
330-384: Read: partition_id and abort_if_oom are propagated as expectedAdding
partition_idandabort_if_oomto the signature and serializing them intoReadRequestkeeps the wire format aligned with the new proto fields. Usingpartition_idas anint32_twith the local default of-1is reasonable.
1060-1089: ObjectCommand: explicit commit + abort_if_oom wiring looks correctThe new
commitparameter is used to setapply_and_commit, andabort_if_oomis serialized intoApplyRequest. This matches the LocalCcHandler changes and keeps commit semantics explicit.src/cc/local_cc_handler.cpp (5)
64-122: AcquireWrite: abort_if_oom flag is consistently forwarded (local + remote paths)The additional
IsolationLevel iso_levelandbool abort_if_oomare passed through both toAcquireCc::Resetand toremote_hd_.AcquireWrite(...). The pre‑existing setup ofAcquireKeyResultand enqueue logic remains unchanged, so behavior is preserved with the new OOM control.
135-180: AcquireWriteAll: new cc_op and abort_if_oom parameters are wired correctly
CcOperation cc_opandabort_if_oomare passed to bothAcquireAllCc::Reset(withIsolationLevel::ReadCommittedkept as the local iso) and the remote handler. This keeps the replicated‑map path consistent while enabling abort‑on‑OOM behavior.
194-265: PostWriteAll: abort_if_oom is threaded through all local/remote pathsBoth branches (
ng_id == cc_ng_idvs. duplicate record) now callPostWriteAllCc::Resetwithpost_write_type,tx_term, andabort_if_oom, and the remote branch forwardspost_write_typeandabort_if_oomas well. The record‑ownership comments and behavior remain intact.
438-504: Read: partition_id and abort_if_oom are propagated end‑to‑endThe new
partition_idandabort_if_oomparameters are:
- Stored in the signature.
- Passed into
ReadCc::Resetfor local execution.- Forwarded to
remote_hd_.Read(...)for remote execution.This keeps read behavior unchanged while enabling partition‑aware routing and OOM abort semantics where needed.
1667-1763: ObjectCommand: always_redirect override and abort_if_oom integration look sound
always_redirectis only consulted when the leader is remote, allowing selected callers to force redirect even iftxservice_auto_redirect_redis_cmdis disabled.abort_if_oomis passed toApplyCc::Resetin both standby and local‑leader paths and forwarded toremote_hd_.ObjectCommand(...)in the remote case.- Existing standby constraints (read‑only and NG locality) and
cmd_result_plumbing are preserved.Semantics are clear and backward‑compatible aside from the intentional redirect override.
include/cc/cc_handler.h (3)
86-145: AcquireWrite / AcquireWriteAll / PostWriteAll: interface extended cleanly with abort_if_oomThe new
abort_if_oomparameter and its documentation are added consistently across these write APIs, with no default values, which forces implementations/callers to make an explicit choice. This matches the LocalCcHandler/RemoteCcHandler usage.
225-245: Read: new partition_id and abort_if_oom arguments with sensible defaultsExposing
partition_idandabort_if_oomonCcHandler::Readwith defaults (-1,false) keeps existing callers source‑compatible while allowing the new behavior where needed. The brief doc forabort_if_oomis clear enough for this interface level.
503-519: ObjectCommand: abort_if_oom and always_redirect are clearly documented in the interfaceAdding
always_redirectandabort_if_oomtoObjectCommandaligns the public API with the LocalCcHandler implementation, and the new doc comment makes the OOM behavior explicit. Signature ordering matches the implementation and remote handler usage.
1. Normally, when a `ccrequest` can not find enough memory when `FindEmplace`, it triggers a shard clean operation and waits until free memory becomes available. However, if the transaction already holds a range read lock, the `ccrequest` should be aborted with error code `OUT_OF_MEMORY` if no free memory is available; otherwise, it may block range splitting. 2. Remove the macro `ON_KEY_OBJECT`.
a88aa31 to
2ace73c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tx_execution.cpp (1)
2064-2082: Fix line 2064: incorrect abort_if_oom parameterThe call at line 2064 incorrectly passes
HoldingRangeReadLock()as the finalabort_if_oomparameter. These are semantically distinct:HoldingRangeReadLock()returns whether the transaction holds a range read lock, whileabort_if_oomcontrols out-of-memory abort behavior. The second call site at line 7671 correctly passesabort_if_oomseparately. Change the last argument fromHoldingRangeReadLock()to the appropriate abort-on-OOM value (likelyfalseor derived from transaction context).
🧹 Nitpick comments (3)
include/cc/cc_request.h (2)
2958-2961: ScanSliceCc always aborts on OOM — aligns with range-scan semantics.This fits the “don’t block range split” requirement. Consider adding a brief comment documenting why it’s unconditional.
Based on learnings
7371-7406: Add counters/logs for OOM-aborted vs requeued to aid ops.Expose metrics: requests_aborted_by_oom, requests_requeued_due_to_oom, wait_list_length_peak. Helps validate the new policy in prod.
src/tx_execution.cpp (1)
2755-2866: Valid design gap: ScanNextBatch lacks abort_if_oom for range-locked scansVerification confirms the concern. Lines 2776 and 2835 call
ScanNextBatch()for range partition scans while holding range read locks. The method has noabort_if_oomparameter, unlike other CC operations (Read, Write, Lock, ScanOpen). If OOM occurs during these calls, the range lock would block splitting. Addingabort_if_oomto bothScanNextBatchoverloads in the handler interface (and implementations) would align with existing patterns and allow callers to passHoldingRangeReadLock()as suggested.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
include/cc/cc_handler.h(9 hunks)include/cc/cc_req_base.h(1 hunks)include/cc/cc_req_misc.h(1 hunks)include/cc/cc_request.h(43 hunks)include/cc/cc_shard.h(3 hunks)include/cc/local_cc_handler.h(6 hunks)include/cc/template_cc_map.h(1 hunks)include/proto/cc_request.proto(5 hunks)include/read_write_set.h(1 hunks)include/remote/remote_cc_handler.h(5 hunks)include/tx_execution.h(1 hunks)src/cc/cc_shard.cpp(2 hunks)src/cc/local_cc_handler.cpp(17 hunks)src/remote/remote_cc_handler.cpp(10 hunks)src/remote/remote_cc_request.cpp(6 hunks)src/tx_execution.cpp(20 hunks)src/tx_operation.cpp(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- include/cc/cc_req_base.h
- src/tx_operation.cpp
- include/proto/cc_request.proto
- include/tx_execution.h
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
include/read_write_set.hinclude/cc/cc_shard.hsrc/cc/cc_shard.cppinclude/cc/cc_handler.hsrc/cc/local_cc_handler.cppinclude/cc/cc_request.hsrc/tx_execution.cpp
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
include/cc/cc_shard.hinclude/cc/template_cc_map.hsrc/cc/cc_shard.cppsrc/cc/local_cc_handler.cppinclude/cc/cc_request.hsrc/tx_execution.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
include/cc/cc_shard.hinclude/cc/cc_handler.hsrc/cc/local_cc_handler.cppinclude/cc/cc_request.h
🧬 Code graph analysis (2)
include/cc/cc_shard.h (2)
src/cc/cc_shard.cpp (6)
EnqueueWaitListIfMemoryFull(378-381)EnqueueWaitListIfMemoryFull(378-378)DequeueWaitListAfterMemoryFree(383-404)DequeueWaitListAfterMemoryFree(383-383)AbortRequestsAfterMemoryFree(406-426)AbortRequestsAfterMemoryFree(406-406)include/cc/cc_request.h (2)
CcRequestBase(4517-4537)CcRequestBase(8415-8551)
include/cc/cc_request.h (2)
src/remote/remote_cc_request.cpp (15)
Reset(93-135)Reset(93-94)Reset(198-252)Reset(198-199)Reset(323-359)Reset(323-324)Reset(409-502)Reset(409-409)Reset(563-631)Reset(563-564)Reset(658-722)Reset(658-659)Reset(791-880)Reset(791-792)Reset(1052-1180)include/error_messages.h (1)
CcErrorCode(237-436)
🔇 Additional comments (45)
include/cc/cc_req_misc.h (1)
515-518: LGTM! Code is correct as implemented.The
AbortIfOom()override forFillStoreSliceCcis the only request type in this file that needs this behavior, which is appropriate.FillStoreSliceCcis specifically involved in store slice loading during range operations, while other request types in the file (FetchCatalogCc,FetchTableStatisticsCc,FetchTableRangesCc, etc.) are read-only fetch or utility operations that don't require abort-on-OOM handling.The implementation correctly signals that this request type should abort rather than wait when memory is unavailable, preventing OOM from blocking critical range split operations.
include/cc/template_cc_map.h (1)
7498-7513: Callsites are consistent and correct; no action required.All 16 callsites to
EnqueueWaitListIfMemoryFulluse the current single-argument formEnqueueWaitListIfMemoryFull(&req)with no inconsistencies. The method signature ininclude/cc/cc_shard.h:330remainsvoid EnqueueWaitListIfMemoryFull(CcRequestBase *req)and has not changed to require aNodeGroupIdparameter. No two-argument calls orNodeGroup()vsNodeGroupId()mismatches exist across the codebase. The implementation (src/cc/cc_shard.cpp:378–380) simply queues requests without conditional logic.Likely an incorrect or invalid review comment.
include/cc/cc_shard.h (2)
33-33: LGTM! Container change supports efficient selective abortion.The change from
std::dequetostd::listis appropriate for the new abort pattern where requests may be selectively removed from the middle of the wait list based on theAbortIfOom()flag. Lists provide O(1) erasure at any position when you have an iterator, whereas deque erasure in the middle requires element shifting.The separation of
DequeueWaitListAfterMemoryFree()(for re-enqueueing) andAbortRequestsAfterMemoryFree()(for aborting) is a clean design that makes the intent explicit at each call site.Also applies to: 328-340
337-340: AbortIfOom() method is properly declared and implemented.Verification confirms that
AbortIfOom()is declared as a virtual method inCcRequestBase(include/cc/cc_req_base.h:95) and is correctly overridden in all relevant derived classes. The call to(*req_it)->AbortIfOom()in the newAbortRequestsAfterMemoryFree()method is valid and safe.src/cc/cc_shard.cpp (2)
383-404: LGTM! Correct iterator-based processing for std::list.The implementation correctly:
- Uses iterator-based traversal and counting
- Enqueues waiting requests back to the shard queue
- Erases processed elements using
erase(begin, it)which is O(n) for list but necessary here- Returns whether the wait list is now empty
The limit of 20 requests per call (when
deque_allis false) prevents blocking the processor for too long.
406-426: Core fix looks correct - selective abortion based on abort_if_oom flag.This method implements the PR objective: abort requests holding range read locks when OOM occurs. The implementation correctly:
- Iterates through the wait list checking each request's
AbortIfOom()status- Aborts qualifying requests with
OUT_OF_MEMORYerror code- Uses proper erase-increment pattern for
std::listNote: This method is O(n) where n is the wait list size, but this is unavoidable since we must check each request. The call site (ShardCleanCc) invokes this only when memory is still full after cleaning, so the performance impact should be acceptable.
Consider adding a metric to track how often requests are aborted due to OOM, which could help with capacity planning and debugging.
include/cc/local_cc_handler.h (1)
57-58: All verification requirements confirmed.The implementation is correct:
Parameter propagation verified:
src/cc/local_cc_handler.cppproperly passesabort_if_oomto CC request constructors at lines 95, 121, 158, 179, 225, 240, 264, 475, 503, 1710, 1732, 1762.AbortIfOom() implementation verified: All CC request classes implement the method:
- AcquireCc (line 491), AcquireAllCc (648), PostWriteAllCc (1074), ReadCc (1574), ApplyCc (6802)
- Each returns the stored
abort_if_oom_member variableCall site logic verified:
src/tx_execution.cppconsistently sets the flag based on range read lock status usingHoldingRangeReadLock()at lines 4012, 5605, 5663, 5777, 7065, 7316, 7647 before invoking handler methods.include/cc/cc_request.h (7)
7688-7692: UploadBatchCc/UploadBatchSlicesCc: AbortIfOom returns true.Makes sense for memory-heavy bulk paths; OK.
Also applies to: 8226-8230
380-398: abort_if_oom propagation verified across all call sites; implementation is correct.Verification confirms the API change is properly implemented: virtual AbortIfOom() exists in CcRequestBase (returns false by default), all Reset overloads consistently accept and store the flag, remote handlers properly set protobuf fields, local handlers delegate correctly, and tx_execution consistently passes HoldingRangeReadLock() result. No issues found.
3516-3517: Memory pressure handling verified: selective abort correctly filters by AbortIfOom().The implementation at
src/cc/cc_shard.cpp:406–426correctly filters abort candidates—only requests withAbortIfOom()==trueare aborted (line 416–419), while others remain queued (line 421–424). Both DefragShardHeapCc (lines 3563–3564) and ShardCleanCc (line 8508) callAbortRequestsAfterMemoryFree()appropriately for their respective cleanup scenarios.
1290-1343: Verified: abort_if_oom correctly threaded through all ReadCc::Reset overloads and properly propagated via RemoteRead::Reset.All three ReadCc::Reset overloads accept the
abort_if_oomparameter and store it in theabort_if_oom_member. RemoteRead::Reset extractsreq.abort_if_oom()from the incoming request and passes it to both ReadCc::Reset call sites (lines 449 and 476). The override accessorAbortIfOom()is present at line 1574.
922-949: PostWriteAllCc: OOM flag verified across all Reset overloads.All three Reset overloads include the
bool abort_if_oomparameter. The only call site at line 703 insrc/remote/remote_cc_request.cppcorrectly passespost_write_all.abort_if_oom(). TheAbortIfOom()getter method exists to return the stored flag.
6621-6654: Verification complete: abort_if_oom parameter handling is correct and consistent.All observations confirmed:
- The flag is properly stored in the
abort_if_oom_member variable and exposed via theAbortIfOom()getter (lines 6802-6805)- Parameter order is identical across both Reset overloads (local and remote versions end with
bool commit, bool abort_if_oom)- The upstream remote handler in
remote_cc_request.cppcorrectly passesreq.abort_if_oom()as the final parameter to the Reset call
539-556: Incorrect consistency claim: AcquireAllCc and AcquireCc have different parameter defaults.The review asserts that AcquireAllCc is "consistent with AcquireCc" and that defaults are "reasonable." However, verification shows:
- AcquireCc::Reset:
bool abort_if_oomparameter has no default (lines 371, 400)- AcquireAllCc::Reset:
bool abort_if_oom = falseparameter has default (lines 539, 569)This is an API inconsistency. While current call sites pass the value explicitly in both cases (so backward compatibility isn't broken), the parameter defaults are not aligned as the review claims.
Likely an incorrect or invalid review comment.
include/read_write_set.h (1)
554-566: Constants verified—implementation is sound.All three string_view constants are properly defined in
include/type.h(lines 577–581):
catalog_ccm_name_sv="__catalog"range_bucket_ccm_name_sv="__range_bucket"cluster_config_ccm_name_sv="__cluster_config"The
HasRangeReadLock()method correctly identifies range read locks by filtering out these system tables and returningtrueonly for user-table reads. The implementation aligns with the PR's abort-on-OOM behavior.include/remote/remote_cc_handler.h (1)
63-63: All call sites properly updated with abort_if_oom parameter.Verification confirms:
- Method signatures correctly implement the mixed approach as stated (defaults for AcquireWrite/Read, no defaults for AcquireWriteAll/PostWriteAll/ObjectCommand)
- All three methods without defaults have been updated at their single call site in
src/cc/local_cc_handler.cppto pass theabort_if_oomparametersrc/tx_execution.cpp (9)
337-341: Helper for OOM-abort decision looks goodSimple, clear, and consistent with the intended semantics.
4012-4055: Verification confirms all implementations accept new parameters correctlyAll handler implementations accept
iso_levelandabort_if_oomin the correct order:
- Base class (virtual): iso_level, abort_if_oom
- LocalCcHandler: iso_level, abort_if_oom
- RemoteCcHandler: iso_level, abort_if_oom
All three call sites in tx_execution.cpp (lines 4022, 4040, 7332) pass these parameters consistently and in the correct order, including within the forward_addr_ loop.
5776-5819: ReleaseCatalogWriteAll(): abort_if_oom correctly passedVerified: Both catalog post-write all calls pass abort_if_oom to remote nodes (line 5801) and local node (line 5818), correctly deriving it from HoldingRangeReadLock() to ensure OOM safety when holding range locks during commit.
5663-5705: Code change verified: abort_if_oom parameter consistency confirmedVerification confirms the review assessment. Both local and remote
PostWriteAll()calls consistently pass theabort_if_oomparameter:
- Remote calls (lines 5675, 5789): Pass
abort_if_oom✓- Local calls (lines 5693, 5807): Pass
abort_if_oom✓Method signatures properly define
abort_if_oomas the final parameter for bothLocalCcHandler::PostWriteAll()andRemoteCcHandler::PostWriteAll(). No inconsistencies or missing parameters detected.
7316-7347: <!-- [scratchpad_update]
The web search provided general background on OOM handling but not specific to this codebase. I need to search within the codebase itself to understand:
- How abort_if_oom is actually used and defined
- Whether this is a forward-acquire specific fix or a general pattern
- Check other AcquireWrite calls to see consistency
[pending]
- Search for HoldingRangeReadLock definition and how abort_if_oom derives from it
- Find other calls to AcquireWrite to see if they also use abort_if_oom
- Look for git diff or comments explaining what the fix was
[scratchpad_end] --><!-- [scratchpad_update]
These results are too general. I need to focus on checking the codebase directly to understand:
- The HoldingRangeReadLock() function
- How abort_if_oom is used in other AcquireWrite calls
- Whether there's a git log/commit message
Let me search the codebase more specifically.
[scratchpad_end] -->Now let me search the codebase directly for more context:<!-- [scratchpad_update]
Generic web search results on abort() behavior and locks don't help. I need to directly examine the codebase for:
- HoldingRangeReadLock() definition
- Other AcquireWrite calls to see if abort_if_oom is used consistently
Let me search the codebase more targeted.
[scratchpad_end] -->Based on my analysis of the code snippet and the search results from the codebase, I can now provide a final assessment. The review comment is too vague and unsubstantiated to provide clear verification.Clarify the original issue and fix
The review comment lacks substance—it says changes are good but doesn't explain what was caught or why
abort_if_oommatters in the forward-acquire path. Without understanding the underlying bug or prior behavior, verification is incomplete.Was this fix addressing a missing parameter, incorrect value, or consistency issue with other code paths? Provide context on the specific problem that prompted this change.
7065-7092: I need to examine the implementation of the multi-key handler path to verify theabort_if_oomflag is properly respected throughout.Let me search for the full implementation of the multi-key handler to verify the
abort_if_oomflag propagation.Now I need to search the codebase directly for the implementation details. Let me look for the full context around the multi-object command processing.
Based on my analysis of the provided information, I can now provide a final response. The script output shows the code structure and the review comment indicates this is already approved but needs verification. Let me generate the rewritten review comment:
Verify: MultiObjectCommand abort_if_oom propagation in multi-key handler path
The per-key
ObjectCommandinvocations in the loop correctly receive theabort_if_oomflag. To complete verification, confirm that:
HoldingRangeReadLock()semantics correctly reflect the desired abort behavior- All multi-key command execution paths (including remote handlers via
RemoteCcHandler::ObjectCommand) consistently receive and respect the flag- No alternate multi-key processing code paths bypass this OOM abort behavior
5605-5626: Design verified: abort_if_oom propagation complete across layers.All concerns confirmed:
- Local handler receives and forwards
abort_if_oomto remote handler (src/cc/local_cc_handler.cpp:179)- Remote handler sets it in proto message (src/remote/remote_cc_handler.cpp:147)
- Proto definition includes field (include/proto/cc_request.proto:827 in AcquireAllRequest)
7647-7691: No issues found; batch read abort_if_oom handling is correct and verifiedThe batch read implementation correctly aligns with single-key read behavior. The
abort_if_oomflag is computed once viaHoldingRangeReadLock()before the batch loop and passed to eachRead()call within the loop. The handler processes each request individually with uniform flag handling—there is no separate "bulk-read path." TheCcShardprocesses each request'sAbortIfOom()flag independently, treating batch reads as a series of uniform individual requests, identical to single-key read semantics.
6675-6694: No changes needed – signatures are consistent across all declarations and implementations.All ObjectCommand signatures properly include the abort_if_oom parameter:
- Base interface (CcHandler): ✓ includes abort_if_oom
- Local handler (LocalCcHandler): ✓ matches base signature with abort_if_oom
- Remote handler (RemoteCcHandler): ✓ has abort_if_oom (intentionally omits always_redirect as it's not used in remote calls)
- Call site (tx_execution.cpp:6679): ✓ passes abort_if_oom as the 15th argument (HoldingRangeReadLock())
src/remote/remote_cc_request.cpp (5)
115-127: LGTM: abort_if_oom propagated correctly in AcquireCc path.The
abort_if_oomparameter is properly extracted from the request and passed toAcquireCc::Reset.
449-466: LGTM: abort_if_oom propagated correctly in both ReadCc branches.Both the
Insideread type branch (lines 449-466) and theOutsideNormal/OutsideDeletedbranch (lines 476-493) correctly propagateabort_if_oomtoReadCc::Reset. The else branch properly fills in default values for parameters not used in that path (nullptr, false, -1) before adding the abort_if_oom flag.Also applies to: 476-493
703-714: LGTM: abort_if_oom propagated correctly in PostWriteAllCc path.The
abort_if_oomparameter is properly extracted frompost_write_all.abort_if_oom()and passed toPostWriteAllCc::Reset.
2233-2245: LGTM: abort_if_oom propagated correctly in ApplyCc path.The
abort_if_oomparameter is properly extracted from the request and passed as the final parameter toApplyCc::Reset.
232-244: Verify hardcoded IsolationLevel::ReadCommitted for AcquireAllCc and confirm it doesn't require dynamic conversion.Line 243 in src/remote/remote_cc_request.cpp hardcodes
IsolationLevel::ReadCommittedwhen callingAcquireAllCc::Reset. However, in the same file, other similar operations (e.g., AcquireCc at line 126 and ReadOutside at lines 459, 486) dynamically convert the isolation level from the request viaToLocalType::ConvertIsolation(req.iso_level()). The request object has aniso_level()field available.Confirm whether hardcoding is intentional for replicated CC semantics or if this should follow the same pattern as other operations in this file.
src/remote/remote_cc_handler.cpp (5)
37-95: LGTM: abort_if_oom properly added to AcquireWrite.The parameter is correctly added to the method signature (lines 52-53) and properly serialized into the
AcquireRequestprotobuf message (line 92).
97-150: LGTM: abort_if_oom properly added to AcquireWriteAll.The parameter is correctly added to the method signature (lines 108-109) and properly serialized into the
AcquireAllRequestprotobuf message (line 147).
201-270: LGTM: abort_if_oom properly added to PostWriteAll.The parameter is correctly added to the method signature (lines 213-214) and properly serialized into the
PostWriteAllRequestprotobuf message (line 267).
311-386: LGTM: abort_if_oom properly added to Read.The parameter is correctly added to the method signature (lines 330-331) and properly serialized into the
ReadRequestprotobuf message (line 383).
1045-1095: LGTM: abort_if_oom properly added to ObjectCommand.The parameter is correctly added to the method signature (lines 1060-1061) and properly serialized into the
ApplyRequestprotobuf message (line 1088).src/cc/local_cc_handler.cpp (5)
51-123: LGTM: abort_if_oom properly propagated in AcquireWrite.Both the local path (lines 82-95) and remote path (lines 106-121) correctly propagate the
abort_if_oomparameter. The local path passes it toAcquireCc::Resetand the remote path forwards it toremote_hd_.AcquireWrite.
125-181: LGTM: abort_if_oom properly propagated in AcquireWriteAll.Both the local path (lines 146-158) and remote path (lines 168-179) correctly propagate the
abort_if_oomparameter. The local path passes it toAcquireAllCc::Resetand the remote path forwards it toremote_hd_.AcquireWriteAll.
183-266: LGTM: abort_if_oom properly propagated in PostWriteAll.Both the local path (lines 206-240) and remote path (lines 252-264) correctly propagate the
abort_if_oomparameter. The local path handles two scenarios (same and different node groups) and passes the flag toPostWriteAllCc::Resetin both cases.
422-505: LGTM: abort_if_oom properly propagated in Read.Both the local path (lines 456-475) and remote path (lines 484-503) correctly propagate the
abort_if_oomparameter. The local path passes it toReadCc::Resetand the remote path forwards it toremote_hd_.Read.
1653-1764: LGTM: abort_if_oom properly propagated in ObjectCommand.All three execution paths correctly propagate the
abort_if_oomparameter:
- Standby transaction path (lines 1697-1710)
- Local node path (lines 1718-1732)
- Remote node path (lines 1747-1762)
Each path passes the flag to either
ApplyCc::Reset(local/standby) orremote_hd_.ObjectCommand(remote).include/cc/cc_handler.h (5)
70-102: LGTM: AcquireWrite signature updated correctly.The method signature correctly adds
abort_if_oomas the final parameter (line 102), and the documentation is updated to describe the parameter (line 86).
104-131: LGTM: AcquireWriteAll signature updated correctly.The method signature correctly adds
abort_if_oomas the final parameter (line 131), and the documentation is updated to describe the parameter (line 119).
133-144: LGTM: PostWriteAll signature updated correctly.The method signature correctly adds
abort_if_oomas the final parameter (line 144).
207-244: LGTM: Read signature updated correctly.The method signature correctly adds
abort_if_oomwith a default value offalse(line 244), and the documentation is updated to describe the parameter (line 225). The default value ensures backward compatibility for existing callers.
485-519: LGTM: ObjectCommand signature updated correctly.The method signature correctly adds
abort_if_oomas the final parameter (line 519), and the documentation is updated to describe the parameter (line 503).
| { | ||
| for (CcRequestBase *req : reqs_) | ||
| { | ||
| if (err_code_ == CcErrorCode::OUT_OF_MEMORY && !req->AbortIfOom()) |
There was a problem hiding this comment.
when do we use the request aborter cc on OOM error? This cc req is usually used when we want to abort deadlock cc req
There was a problem hiding this comment.
If LoadSlice fails (FillStoreSlice failed), call the waiting request abort.
| // to avoid blocking the range split. | ||
| // This is a performance optimization, and it does't matter if the | ||
| // request is aborted. | ||
| shard_->EnqueueWaitListIfMemoryFull(&req); |
There was a problem hiding this comment.
should we put the req in wait list here? Seems like the request should fail directly according to comment
There was a problem hiding this comment.
Only if no memory is available after attempting to clean up the memory should the request be aborted.
ccrequestcan not find enough memory whenFindEmplace, it triggers a shard clean operation and waits until free memory becomes available. However, if the transaction already holds a range read lock, theccrequestshould be aborted with error codeOUT_OF_MEMORYif no free memory is available; otherwise, it may block range splitting.ON_KEY_OBJECT.Summary by CodeRabbit
New Features
Bug Fixes
Refactor