PrepareCommit: create table set payload status to delete#232
Conversation
|
Warning Rate limit exceeded@xiexiaoy has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 44 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (3)
WalkthroughChanges simplify and clarify schema handling in transaction service for CreateTable operations, removing dirty_schema_ fallback logic and standardizing behavior when catalog entries or schemas are missing. The modifications centralize error responses and explicitly mark residual states for cleanup. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/cc/cc_request.h (1)
4045-4055: Add null guard and simplify the loop logic at lines 4043–4052The code dereferences
old_slices_delta_sizewithout checking for null, even though the parameter defaults tonullptr(line 4020). Inside the!export_base_table_item_branch (line 4043),old_slices_delta_size->size(),->begin(), and->end()are called directly, creating an undefined behavior risk if a caller passesexport_base_table_item = falsewithout providing the pointer.Additionally, the lambda with
std::move(elem.first.GetShallowCopy())is unnecessarily complex—GetShallowCopy()already returns a prvalue, so the explicit move is redundant.Replace with an assertion and simpler range-based for loop:
if (!export_base_table_item_) { + assert(old_slices_delta_size != nullptr); slices_to_scan_.reserve(old_slices_delta_size->size()); - std::for_each(old_slices_delta_size->begin(), - old_slices_delta_size->end(), - [&](decltype(*old_slices_delta_size->begin()) &elem) - { - slices_to_scan_.emplace_back( - std::move(elem.first.GetShallowCopy())); - }); + for (const auto &[slice_key, _] : *old_slices_delta_size) + { + slices_to_scan_.emplace_back(slice_key.GetShallowCopy()); + } }
🧹 Nitpick comments (2)
include/cc/template_cc_map.h (1)
1125-1134: PrepareCommit(CreateTable) semantics look correct; adjust comment to avoid confusionThe new branch that, on
PostWriteType::PrepareCommitforOperationType::CreateTable, sets(commit_ts, status) = (1, RecordStatus::Deleted)is consistent with howcommit_ts == 1+Deletedis used elsewhere to mean “known non-existent key” (e.g., key-cache and slice backfill paths). It keeps the catalog entry effectively invisible between prepare and final commit, which matches the intent implied by the PR title.The only mismatch is the preceding comment:
// A prepare commit request only installs the dirty value, // and does not change the record status and commit_ts.This is no longer true for the CreateTable special case. Consider updating the comment to call out the CreateTable exception (or move the comment inside the
Commit/PostCommitbranch) so future readers don’t assume PrepareCommit is always metadata‑no‑op. Optionally, you may also want to use1Uhere for consistency with other call sites ofSetCommitTsPayloadStatus.include/cc/cc_request.h (1)
159-172: Null schema for range-partition tables now treated asREQUESTED_TABLE_NOT_EXISTSThis branch now immediately returns
REQUESTED_TABLE_NOT_EXISTSwhencatalog_entry->schema_ == nullptr, removing the previous retry path that depended ondirty_schema_state. That’s fine if the new invariant is “wheneverschema_is null for a range table, the table is logically dropped and no DDL is in-flight,” but if there are still cases whereschema_ == nullptrand a DDL is usingdirty_schema_, requests that used to be retried will now fail permanently.To make the intent safer and self-documenting, consider:
- Asserting the expected invariant (e.g.,
assert(catalog_entry->dirty_schema_ == nullptr);) here, or- Adding a brief comment that a null
schema_means a deleted table and not “DDL in progress,” and- Adding a test that exercises a range-partition read/scan while a drop/create is in progress to confirm the new behavior is acceptable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
include/cc/catalog_cc_map.h(1 hunks)include/cc/cc_request.h(2 hunks)include/cc/template_cc_map.h(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
include/cc/catalog_cc_map.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
include/cc/cc_request.hinclude/cc/template_cc_map.h
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
include/cc/cc_request.hinclude/cc/template_cc_map.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.
Applied to files:
include/cc/template_cc_map.h
🧬 Code graph analysis (2)
include/cc/cc_request.h (1)
include/error_messages.h (1)
CcErrorCode(237-436)
include/cc/template_cc_map.h (2)
include/remote/remote_type.h (2)
PostWriteType(346-365)RecordStatus(382-403)include/tx_record.h (1)
RecordStatus(36-138)
🔇 Additional comments (1)
include/cc/template_cc_map.h (1)
3515-3519:is_cache_fulllambda change is stylistic; behavior preservedWrapping the lambda body in braces retains the original semantics (
req.IsLocal() ? scan_cache->IsFull() : remote_scan_cache->IsFull()) and the captures remain correct given the earlier initialization ofscan_cache/remote_scan_cache. No issues here.
| shard_->FetchCatalog( | ||
| table_key->Name(), req.NodeGroupId(), ng_term, &req); | ||
| return false; |
There was a problem hiding this comment.
🧩 Analysis chain
Simplified catalog fetch logic looks good.
The unconditional fetch-and-retry pattern cleanly delegates error handling to FetchCatalog. Ensure that FetchCatalog has proper error handling to set appropriate error codes on the request (e.g., REQUESTED_TABLE_NOT_EXISTS, REQUESTED_NODE_NOT_LEADER) and prevents infinite retry loops if the catalog cannot be loaded.
Run the following script to verify error handling in the FetchCatalog implementation:
🏁 Script executed:
#!/bin/bash
# Description: Verify FetchCatalog has proper error handling and termination conditions
# Search for FetchCatalog implementation to check error handling
ast-grep --pattern $'FetchCatalog($$$) {
$$$
}'
# Also search for error code assignments in FetchCatalog context
rg -nP -A10 -B5 'FetchCatalog.*\(' --type=cpp | rg -P '(SetError|error_code|ErrorCode|return false|return true)'Length of output: 4584
Need to search the actual codebase for the FetchCatalog implementation:
🏁 Script executed:
#!/bin/bash
# Search for FetchCatalog function definition/implementation
# Find FetchCatalog in .h files to understand signature and behavior
echo "=== Searching FetchCatalog in header files ==="
rg -n -A20 "FetchCatalog\s*\(" --type=cpp include/cc/cc_shard.h 2>/dev/null | head -100
echo -e "\n=== Searching FetchCatalog calls to see how errors are handled ==="
rg -n -B5 -A15 "->FetchCatalog" --type=cpp include/cc/catalog_cc_map.h | head -150Length of output: 2386
🏁 Script executed:
#!/bin/bash
# Fix syntax: escape or properly quote the pattern
echo "=== Searching FetchCatalog calls in catalog_cc_map.h ==="
rg -n -B5 -A15 'FetchCatalog' --type=cpp include/cc/catalog_cc_map.h | head -200
echo -e "\n=== Searching for max retry limits or retry count in codebase ==="
rg -n -i 'max.*retry|retry.*limit|attempt.*limit' --type=cpp include/cc/ 2>/dev/null | head -50
echo -e "\n=== Looking for retry mechanism implementation ==="
rg -n -B2 -A8 'req.*Retry|retry.*queue|enqueue.*retry' --type=cpp include/cc/ 2>/dev/null | head -100Length of output: 6609
🏁 Script executed:
#!/bin/bash
# Search for retry/re-enqueue mechanism and FetchCatalog implementation
echo "=== Searching for request retry mechanism (where requests get re-enqueued) ==="
rg -n -B3 -A3 'SetFinish|setFinish|Finish.*retry|re.*queue' --type=cpp include/cc/cc_request.h | head -80
echo -e "\n=== Looking for FetchCatalog implementation details ==="
rg -n 'void.*FetchCatalog' --type=cpp -A50 src/cc/ include/cc/ 2>/dev/null | head -150
echo -e "\n=== Checking if there's a mechanism to prevent infinite retries ==="
rg -n 'attempt|retry.*count|exec.*count' --type=cpp include/cc/cc_request.h 2>/dev/null | head -50Length of output: 268
🏁 Script executed:
#!/bin/bash
# First, find the actual files
echo "=== Finding cc_request and cc_shard files ==="
fd -t f 'cc_request\.(h|cpp)$' .
fd -t f 'cc_shard\.(h|cpp)$' .
echo -e "\n=== Finding catalog_cc_map files ==="
fd -t f 'catalog_cc_map\.(h|cpp)$' .Length of output: 257
🏁 Script executed:
#!/bin/bash
# Search for retry mechanism in cc_request.h
echo "=== Searching for retry mechanism and max retries in cc_request.h ==="
rg -n -B2 -A2 'retry|Retry|attempt|Attempt' ./include/cc/cc_request.h | head -100
echo -e "\n=== Searching for FetchCatalog implementation in cc_shard.cpp ==="
rg -n -B3 -A30 'void.*FetchCatalog' ./src/cc/cc_shard.cpp | head -150
echo -e "\n=== Checking for max retry limits or execution count ==="
rg -n 'max.*attempt|execute.*count|retry.*max' ./include/cc/cc_request.h ./include/cc/cc_shard.h | head -50Length of output: 2338
🏁 Script executed:
#!/bin/bash
# Get full FetchCatalog implementation
echo "=== Full FetchCatalog implementation ==="
rg -n -A80 'void CcShard::FetchCatalog' ./src/cc/cc_shard.cpp
echo -e "\n=== Search for FetchCatalogCc class and error handling ==="
rg -n -B5 'class FetchCatalogCc' ./include/cc/ ./src/cc/
echo -e "\n=== Search for fetch_reqs_ cleanup/removal ==="
rg -n 'fetch_reqs_.*erase|RemoveFetchRequest' ./src/cc/cc_shard.cpp | head -30Length of output: 4074
🏁 Script executed:
#!/bin/bash
# Search for FetchCatalogCc implementation and how it completes
echo "=== FetchCatalogCc class definition and implementation ==="
rg -n -B3 -A40 'class FetchCatalogCc' ./include/cc/cc_request.h
echo -e "\n=== FetchCatalogClosure and completion callback ==="
rg -n -B3 -A30 'class FetchCatalogClosure|FetchCatalogClosure::' ./include/rpc_closure.h ./src/
echo -e "\n=== How requesters are notified after fetch completes ==="
rg -n 'AddRequester|notify.*requester|wake.*requester' ./include/cc/cc_request.h ./src/cc/ 2>/dev/null | head -40Length of output: 2662
🏁 Script executed:
#!/bin/bash
# Get full FetchCatalogClosure::Run() implementation
echo "=== Full FetchCatalogClosure::Run() implementation ==="
rg -n -A100 'void Run\(\) override' ./include/rpc_closure.h | grep -A100 'FetchCatalogClosure' | head -120
echo -e "\n=== SetFinish implementation in FetchCatalogCc/FetchCc ==="
rg -n -B3 -A15 'void.*SetFinish.*RecordStatus' ./include/cc/cc_request.h ./src/cc/ 2>/dev/null | head -80
echo -e "\n=== Understanding what happens after SetFinish - request completion ==="
rg -n 'SetFinish.*Deleted|SET_FINISH|finish.*requester' ./include/cc/cc_request.h | head -30Length of output: 5772
Infinite retry loop risk in FetchCatalog error handling confirmed—critical fix required.
The simplification at lines 1206-1208 unconditionally delegates to FetchCatalog, but its error handling has a critical issue: FetchCatalogClosure::Run() (include/rpc_closure.h lines 998-1013) retries indefinitely for certain transient errors without any max attempt limit. When errors like EOVERCROWDED, timeouts, or non-leader errors occur, the RPC is retried with self_guard.release() and no retry counter, causing the original request to hang indefinitely.
Required fix: Add max retry limits or exponential backoff with termination conditions to FetchCatalogClosure::Run() for transient error cases (lines 943–959, 961–974, 998–1013), or add a max attempt counter to prevent infinite retries.
🤖 Prompt for AI Agents
In include/cc/catalog_cc_map.h around lines 1206-1208 and related retry logic in
include/rpc_closure.h (notably blocks around lines 943–959, 961–974, 998–1013),
the current FetchCatalogClosure::Run() path can retry indefinitely for transient
errors (EOVERCROWDED, timeouts, non-leader) because self_guard.release() is used
with no attempt counter or backoff; modify the closure to track attempts (e.g.,
a max_attempts field on the closure or shared state), increment it on each
retry, apply exponential backoff (or jitter) between retries, and when
max_attempts is reached or terminal errors occur invoke the original callback
with an appropriate error instead of retrying further; ensure
self_guard.release() use is paired with the attempt logic so resources are not
leaked and add tests or logging to cover retry limit termination.
| { | ||
| // For prepare commit of create table operation, | ||
| // set the status to Deleted. | ||
| cce_ptr->SetCommitTsPayloadStatus( |
There was a problem hiding this comment.
might need to add this to ReplayLogCc when participant sees prepare log
0ad72e5 to
0fede4d
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tx_service/include/cc/catalog_cc_map.h (1)
1228-1230: Infinite retry loop risk already flagged.This simplification maintains the same infinite retry issue previously identified. The FetchCatalog error handling lacks max retry limits for transient errors (EOVERCROWDED, timeouts, non-leader), causing indefinite retries.
See the detailed analysis in previous review comments for the fix approach.
tx_service/include/cc/template_cc_map.h (1)
1129-1137: Ensure PrepareCommit(CreateTable) semantics and recovery are consistentThe new PrepareCommit branch for
OperationType::CreateTable(settingcommit_ts=1andRecordStatus::Deleted) looks intentional and matches the sentinel semantics used elsewhere for “known non‑existent” entries. However:
- The preceding comment (Lines 1116–1118: “A prepare commit request only installs the dirty value, and does not change the record status and commit_ts.”) is no longer accurate for this special case. It should be updated to call out the CreateTable exception, or moved inside the
Commit/PostCommitbranch to avoid confusion.- Please double‑check that recovery paths mirror this behavior. In particular,
Execute(ReplayLogCc &req)(and any other code that replays prepare logs for CreateTable) must produce the samecommit_ts=1/Deletedstate, otherwise a node recovering from logs could diverge from an online node. This aligns with the earlier review note about handling this inReplayLogCcwhen participants see prepare logs.- Minor: for consistency with surrounding code, consider using
1Uinstead of1in theSetCommitTsPayloadStatuscall.
🧹 Nitpick comments (1)
tx_service/include/cc/catalog_cc_map.h (1)
1700-1705: Consider runtime error check instead of assertion.The safeguard correctly handles the CreateTable edge case during log replay where
schema_is null butdirty_schema_exists, marking the payload as Deleted at timestamp 1. However, the assertion at line 1703 will crash if the assumption is violated.Consider replacing the assertion with a runtime error check that logs the violation and sets an appropriate error code, improving debuggability if log corruption or unexpected states occur during recovery.
🔎 Proposed refactor
if (catalog_entry->schema_ == nullptr && catalog_entry->dirty_schema_ != nullptr) { - assert(op_type == OperationType::CreateTable); + if (op_type != OperationType::CreateTable) + { + LOG(ERROR) << "Unexpected operation type " << static_cast<int>(op_type) + << " with null schema and non-null dirty_schema during replay" + << " for table " << table_name.StringView(); + req.Result()->SetError(CcErrorCode::REQUESTED_TABLE_NOT_EXISTS); + return false; + } cce->SetCommitTsPayloadStatus(1, RecordStatus::Deleted); }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tx_service/include/cc/catalog_cc_map.htx_service/include/cc/cc_request.htx_service/include/cc/template_cc_map.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/include/cc/catalog_cc_map.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
tx_service/include/cc/cc_request.h
🧬 Code graph analysis (1)
tx_service/include/cc/catalog_cc_map.h (1)
tx_service/src/cc/local_cc_shards.cpp (4)
table_key(819-819)table_key(3646-3646)table_key(4637-4637)table_key(6141-6141)
🔇 Additional comments (1)
tx_service/include/cc/cc_request.h (1)
170-176: Confirm semantics of treatingcatalog_entry->schema_ == nullptras hard TABLE_NOT_EXISTSThis branch now unconditionally maps a missing base-table
schema_toREQUESTED_TABLE_NOT_EXISTSand returns, with nodirty_schema_fallback or retry. That’s a clear and simpler behavior, but it changes the semantics for range-partition requests that hit a catalog entry in a transitional DDL state.Please double‑check that in all states where
schema_ == nullptrcan occur (e.g., CreateTable/TruncateTable in progress vs. fully dropped table), we truly want to surface a permanentREQUESTED_TABLE_NOT_EXISTSto callers rather than re-enqueueing or consultingdirty_schema_, and that this aligns with the new PrepareCommit / payload‑status‑Deleted handling elsewhere in the PR.If those invariants hold, this early‑exit looks good and avoids any accidental use of a null schema pointer.
0fede4d to
2ae5e68
Compare
Summary by CodeRabbit
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.