fix standby OnLeaderStart and FetchCatalog from primary #326
Walkthrough
Adds a new error code and introduces a Sharder atomic cache/API to mark a standby becoming leader. Several term checks and abort/log paths were added or gated across catalog fetch, RPC closure, shard fetch, stream receiver, and leader-start logic to block operations during the standby→leader transition.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Client
participant Remote as RemoteRPC/Receiver
participant Sharder
participant CcShard
participant FetchClosure as FetchCatalogClosure
participant CCNode as Fault::CCNode
Note over Sharder: standby_becoming_leader_term_cache_ (atomic)
Client->>Remote: send catalog fetch / messages
Remote->>Sharder: check StandbyBecomingLeaderNodeTerm()
alt StandbyBecomingLeaderNodeTerm == -1
Remote->>CcShard: forward / increment inflight / FetchCatalog
CcShard->>Sharder: validate standby terms (StandbyNodeTerm / CandidateStandbyNodeTerm)
alt terms valid
CcShard->>FetchClosure: issue RPC fetch
FetchClosure->>Remote: handle RPC response
alt RPC success & NO_ERROR
FetchClosure->>CcShard: deliver catalog
else RPC error
FetchClosure->>CcShard: finish with LEADER_NODE_UNREACHABLE
end
else terms mismatch
CcShard->>FetchClosure: abort (LEADER_NODE_UNREACHABLE)
end
else Standby becoming leader (term != -1)
Remote->>FetchClosure: short-circuit / abort with LEADER_NODE_UNREACHABLE
end
Note left of CCNode: OnLeaderStart may set/clear standby_becoming_leader_term_cache_
CCNode->>Sharder: SetStandbyBecomingLeaderNodeTerm(...) / reset to -1
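The nested alt branches in the diagram above can be read as a single decision function. The following is a minimal sketch of that reading, not the code from the PR: `FetchOutcome`, `FetchState`, and `DecideFetchOutcome` are hypothetical names introduced only for illustration, and the real code reports results through `CcErrorCode` and `SetFinish` rather than a return value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome type for this sketch only.
enum class FetchOutcome
{
    DeliverCatalog,
    LeaderNodeUnreachable,
};

// Hypothetical snapshot of what the receiver and shard observe
// when a catalog fetch from the primary is handled.
struct FetchState
{
    int64_t standby_becoming_leader_term;  // -1 means "not in transition"
    bool standby_terms_valid;              // standby / candidate-standby term check
    bool rpc_succeeded;                    // RPC finished with NO_ERROR
};

// Walks the three alt levels of the diagram from the outside in.
inline FetchOutcome DecideFetchOutcome(const FetchState &s)
{
    // Outer alt: a standby that is becoming leader short-circuits the fetch.
    if (s.standby_becoming_leader_term != -1)
    {
        return FetchOutcome::LeaderNodeUnreachable;
    }
    // Middle alt: a term mismatch aborts before any RPC is issued.
    if (!s.standby_terms_valid)
    {
        return FetchOutcome::LeaderNodeUnreachable;
    }
    // Inner alt: only a successful RPC delivers the catalog.
    return s.rpc_succeeded ? FetchOutcome::DeliverCatalog
                           : FetchOutcome::LeaderNodeUnreachable;
}
```

In this reading, every failure branch ends in the same error, so the catalog is delivered only when the node is not in transition, the terms are valid, and the RPC succeeds.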
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
tx_service/include/error_messages.h (1)
339-348: Add LEADER_NODE_UNREACHABLE to cc_error_messages for clearer diagnostics
You added CcErrorCode::LEADER_NODE_UNREACHABLE, but it is not present in cc_error_messages, so CcErrorMessage(LEADER_NODE_UNREACHABLE) will fall back to the numeric code string.
Consider adding a mapping entry for consistency and better logs.
Proposed addition
 static const std::unordered_map<CcErrorCode, std::string> cc_error_messages{
@@
-    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::LEADER_NODE_UNREACHABLE, "LEADER_NODE_UNREACHABLE"},
@@
-    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
+    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
 };
tx_service/include/rpc_closure.h (1)
924-1029: FetchCatalogClosure correctly aborts on term / RPC failures, but specific error codes are not propagated to requesters
The new logic:
- Aborts on !ValidTermCheck() with SetFinish(Deleted, NG_TERM_CHANGED).
- Aborts on non-retriable RPC failure with SetFinish(Deleted, LEADER_NODE_UNREACHABLE).
- Logs and finishes with Unknown + err_code for terminal response errors.
Functionally this ensures FetchCatalogCc stops retrying when the node is becoming leader or the leader is unreachable, which matches the PR intent.
Note, though, that FetchCatalogCc::Execute() currently treats any error_code_ != 0 as a generic data-store failure and calls AbortCcRequest(CcErrorCode::DATA_STORE_ERR) for all waiting requests, so callers never see NG_TERM_CHANGED or LEADER_NODE_UNREACHABLE.
If you want upstream code to distinguish these cases, consider mapping error_code_ through in Execute() instead of collapsing everything to DATA_STORE_ERR (at least for the new codes).
Illustrative change in FetchCatalogCc::Execute (cc_req_misc.cpp)
-    if (error_code_ == 0)
+    if (error_code_ == 0)
@@
-    else
-    {
-        for (CcRequestBase *req : requesters_)
-        {
-            req->AbortCcRequest(CcErrorCode::DATA_STORE_ERR);
-        }
-    }
+    else
+    {
+        CcErrorCode ec = static_cast<CcErrorCode>(error_code_);
+        // Preserve more specific error codes for leader/term issues.
+        CcErrorCode mapped =
+            (ec == CcErrorCode::NG_TERM_CHANGED ||
+             ec == CcErrorCode::LEADER_NODE_UNREACHABLE)
+                ? ec
+                : CcErrorCode::DATA_STORE_ERR;
+
+        for (CcRequestBase *req : requesters_)
+        {
+            req->AbortCcRequest(mapped);
+        }
+    }
This keeps existing behavior for generic failures while exposing the new, more specific errors.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tx_service/include/error_messages.h
tx_service/include/rpc_closure.h
tx_service/include/sharder.h
tx_service/src/cc/cc_req_misc.cpp
tx_service/src/fault/cc_node.cpp
tx_service/src/remote/cc_stream_receiver.cpp
tx_service/src/sharder.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/src/cc/cc_req_misc.cpp
tx_service/src/remote/cc_stream_receiver.cpp
tx_service/src/sharder.cpp
tx_service/src/fault/cc_node.cpp
tx_service/include/sharder.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.
Applied to files:
tx_service/src/remote/cc_stream_receiver.cpp
🧬 Code graph analysis (1)
tx_service/src/cc/cc_req_misc.cpp (1)
tx_service/include/cc/cc_request.h (1)
cc_ng_term_(3021-3129)
🔇 Additional comments (5)
tx_service/src/remote/cc_stream_receiver.cpp (1)
196-207: Guarding standby inflight counting during leader transition looks correct
Requiring both PrimaryNodeTerm() > 0 and StandbyBecomingLeaderNodeTerm() == -1 before incrementing InflightStandbyReqCount cleanly prevents new standby traffic from extending the drain window while a standby is transitioning to leader. Existing standby vs primary paths remain unchanged.
tx_service/src/cc/cc_req_misc.cpp (1)
88-107: Expanded term check for fetch-from-primary aligns with standby→leader transition
Including StandbyBecomingLeaderNodeTerm() != -1 in the guard ensures catalog fetches from primary are aborted while the node is in the "becoming leader" window, preventing use of potentially stale primary state. The added debug log gives enough context to diagnose these rejections.
tx_service/src/sharder.cpp (1)
165-178: Initialization of standby_becoming_leader_term_cache_ is consistent
Seeding standby_becoming_leader_term_cache_ to -1 alongside the other term caches matches the getter's contract and avoids relying on default-initialized atomics.
tx_service/src/fault/cc_node.cpp (1)
283-342: Standby-becoming-leader marker is correctly scoped around the transition
Marking StandbyBecomingLeaderNodeTerm(term) before unsubscribing / draining, and then clearing it back to -1 after standby term reset, cleanly brackets the "becoming leader" window. That gives cc_stream_receiver and term checks a precise signal without leaving the marker set on error paths in this branch.
tx_service/include/sharder.h (1)
684-705: New standby-becoming-leader term cache API is consistent with existing term handling
SetStandbyBecomingLeaderNodeTerm / StandbyBecomingLeaderNodeTerm mirror the patterns used for standby and candidate-standby term caches (cc_nodes_init_ guard, release store / acquire load), and the added atomic member is documented clearly. This gives other components a clean way to detect the transient "becoming leader" state.
Also applies to: 809-814
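To make the release store / acquire load pattern and the inflight guard concrete, here is a minimal sketch under stated assumptions. `TermCacheSketch` and `ShouldCountInflightStandbyReq` are hypothetical names, the class is a stand-in rather than the real `Sharder`, and the `cc_nodes_init_` guard mentioned in the review is deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the term caches discussed in the review.
class TermCacheSketch
{
public:
    void SetPrimaryNodeTerm(int64_t term)
    {
        primary_term_.store(term, std::memory_order_release);
    }

    int64_t PrimaryNodeTerm() const
    {
        return primary_term_.load(std::memory_order_acquire);
    }

    // Set to the leader term when the transition starts, back to -1 when it ends.
    void SetStandbyBecomingLeaderNodeTerm(int64_t term)
    {
        standby_becoming_leader_term_.store(term, std::memory_order_release);
    }

    int64_t StandbyBecomingLeaderNodeTerm() const
    {
        return standby_becoming_leader_term_.load(std::memory_order_acquire);
    }

    // Guard described for cc_stream_receiver: count a standby request as
    // inflight only when a primary term is known and no transition is running.
    bool ShouldCountInflightStandbyReq() const
    {
        return PrimaryNodeTerm() > 0 && StandbyBecomingLeaderNodeTerm() == -1;
    }

private:
    // Both caches are seeded to -1 explicitly instead of relying on defaults.
    std::atomic<int64_t> primary_term_{-1};
    std::atomic<int64_t> standby_becoming_leader_term_{-1};
};
```

Setting the marker before draining and clearing it afterwards means the guard returns false for the whole "becoming leader" window, which is the bracketing behavior the review describes.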
        return;
    }

    standby_becoming_leader_term_cache_.store(leader_term,
Seems like this only needs to be a boolean.
        return;
    }
    DLOG(ERROR) << "FetchCatalogClosure fails, abort fetch_cc_";
    fetch_cc_->SetFinish(RecordStatus::Deleted,
Need to fix the ForwardStandbyCc abort logic. A ForwardStandbyCc cannot be discarded even if abort cc req is called on it; it needs to be re-enqueued into the cc req queue, otherwise we might lose data.
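The re-enqueue requirement can be illustrated with a small sketch. Everything here is hypothetical: `RequestSketch`, `RequestQueueSketch`, and the `must_not_drop` flag are invented for illustration and do not reflect how the real cc request queue or `ForwardStandbyCc` are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical request record; must_not_drop marks requests whose loss
// would lose data, as described for ForwardStandbyCc.
struct RequestSketch
{
    std::string name;
    bool must_not_drop;
};

// Hypothetical queue with an abort path that never discards such requests.
class RequestQueueSketch
{
public:
    void Enqueue(RequestSketch req)
    {
        queue_.push_back(std::move(req));
    }

    // Returns true if the aborted request was put back for a later retry,
    // false if it was discarded.
    bool Abort(RequestSketch req)
    {
        if (req.must_not_drop)
        {
            queue_.push_back(std::move(req));
            return true;
        }
        return false;
    }

    std::size_t Size() const
    {
        return queue_.size();
    }

private:
    std::deque<RequestSketch> queue_;
};
```

The design point is only that the abort path branches on whether a request may be dropped; how the real code identifies such requests is not shown in this conversation.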
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tx_service/include/error_messages.h
tx_service/src/cc/cc_req_misc.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- tx_service/src/cc/cc_req_misc.cpp
🔇 Additional comments (1)
tx_service/include/error_messages.h (1)
342-343: LGTM! New error code is correctly defined.
The addition of LEADER_NODE_UNREACHABLE is consistent between the enum definition and the error messages map. The placement before LAST_ERROR_CODE is correct, and the comment clearly describes its purpose for leader node connectivity issues.
Also applies to: 425-425
@@ -339,6 +339,9 @@ enum struct CcErrorCode : uint8_t
     // The cursor has become invalid.
     INVALID_CURSOR,
🧩 Analysis chain
🏁 Script executed:
# Read the file to understand the structure around line 340 and locate the map
cat -n tx_service/include/error_messages.h | sed -n '330,450p'
Repository: eloqdata/tx_service
Length of output: 4922
🏁 Script executed:
# Search for CcErrorCode::INVALID_CURSOR usage
echo "=== Searching for CcErrorCode::INVALID_CURSOR usage ==="
rg -n --type=cpp 'INVALID_CURSOR' -C2
# Also check for any direct references to INVALID_CURSOR
echo -e "\n=== Checking for plain INVALID_CURSOR references ==="
rg -n 'INVALID_CURSOR' --type=cpp
Repository: eloqdata/tx_service
Length of output: 3285
Add INVALID_CURSOR to the cc_error_messages map.
The INVALID_CURSOR enum value is actively used in the codebase (tx_operation.cpp:1377, tx_execution.cpp:404) but is missing from the cc_error_messages map. When CcErrorMessage() is called with this error code, it will return a generic "CcErrorCode:340" message instead of a descriptive one. Add the mapping to maintain consistency with other error codes:
{CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"},
🤖 Prompt for AI Agents
In tx_service/include/error_messages.h around line 340, the
CcErrorCode::INVALID_CURSOR enum is missing from the cc_error_messages map
causing CcErrorMessage() to fall back to a generic code string; add an entry to
the map associating CcErrorCode::INVALID_CURSOR with the string "INVALID_CURSOR"
(i.e. add the mapping {CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"}, matching
the style/format of the surrounding entries).
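The fallback behavior that both error-message comments rely on can be sketched as follows. This is an assumption-laden illustration: `ErrorCodeSketch` and `ErrorMessageSketch` are hypothetical stand-ins for `CcErrorCode` and `CcErrorMessage()`, the enum holds only three values, and the `"CcErrorCode:<n>"` fallback format is inferred from the review text rather than taken from the source file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, heavily reduced stand-in for CcErrorCode.
enum struct ErrorCodeSketch : uint8_t
{
    INVALID_CURSOR,
    LEADER_NODE_UNREACHABLE,
    LAST_ERROR_CODE,
};

// Looks up a descriptive name and falls back to a numeric string when
// the code has no map entry (assumed fallback format).
inline std::string ErrorMessageSketch(ErrorCodeSketch code)
{
    static const std::unordered_map<ErrorCodeSketch, std::string> messages{
        // INVALID_CURSOR is intentionally missing to show the fallback.
        {ErrorCodeSketch::LEADER_NODE_UNREACHABLE, "LEADER_NODE_UNREACHABLE"},
        {ErrorCodeSketch::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
    };

    auto it = messages.find(code);
    if (it != messages.end())
    {
        return it->second;
    }
    return "CcErrorCode:" + std::to_string(static_cast<int>(code));
}
```

Adding the missing entry to the map is enough to replace the numeric fallback with the descriptive name; no call sites need to change.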