Skip to content

fix standby OnLeaderStart and FetchCatalog from primary#326

Merged
MrGuin merged 4 commits into
mainfrom
fix_standby
Dec 29, 2025
Merged

fix standby OnLeaderStart and FetchCatalog from primary#326
MrGuin merged 4 commits into
mainfrom
fix_standby

Conversation

@MrGuin

@MrGuin MrGuin commented Dec 26, 2025

Copy link
Copy Markdown
Collaborator
  • mark that this node is becoming leader and discard the standby msgs in OnLeaderStart before wait InflightStandbyReqCount;
  • FetchCatalog from primary aborts the fetch_cc if rpc fails or node is becoming leader.

Summary by CodeRabbit

  • New Features

    • Public API to track when a standby node is becoming leader, with initialization support.
  • Bug Fixes

    • Improved error logging and clearer aborts when leader is unreachable during catalog fetches.
    • Added validation gates to prevent actions during standby→leader transitions, including gating inflight forwarding and remote fetches.
  • Chores

    • Introduced a new error code to surface leader-unreachable conditions.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Dec 26, 2025

Copy link
Copy Markdown

Warning

Rate limit exceeded

@MrGuin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 12 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 305a29c and 918e859.

📒 Files selected for processing (1)
  • tx_service/src/sharder.cpp

Walkthrough

Adds a new error code and introduces a Sharder atomic cache/API to mark a standby becoming leader; several term-checks and abort/log paths were added or gated across catalog fetch, RPC closure, shard fetch, stream receiver, and leader-start logic to block operations during the standby→leader transition.

Changes

Cohort / File(s) Summary
Error Code Addition
tx_service/include/error_messages.h
Added LEADER_NODE_UNREACHABLE to CcErrorCode and to cc_error_messages.
Sharder API & State
tx_service/include/sharder.h, tx_service/src/sharder.cpp
Added standby_becoming_leader_term_cache_ (atomic), SetStandbyBecomingLeaderNodeTerm(int64_t), StandbyBecomingLeaderNodeTerm(), and initialized cache to -1.
Leader start / standby term handling
tx_service/src/fault/cc_node.cpp
Reordered OnLeaderStart/standby flow: early handling for prev_candidate_standby_term escalation, set/clear becoming-leader marker, clear caches and stop retries when escalation detected.
Catalog fetch validation
tx_service/src/cc/cc_req_misc.cpp
FetchCatalogCc::ValidTermCheck() now fails if StandbyBecomingLeaderNodeTerm() != -1; AbortCcRequest uses actual CcErrorCode value when aborting.
RPC closure error handling
tx_service/include/rpc_closure.h
FetchCatalogClosure logs and finishes with LEADER_NODE_UNREACHABLE on term-check failures and on RPC timeout/retry aborts; adds logging for other catalog errors.
Shard fetch gating
tx_service/src/cc/cc_shard.cpp
CcShard::FetchCatalog now requires StandbyBecomingLeaderNodeTerm() == -1 (and matching standby/candidate term) before fetching from primary.
Stream receiver gating
tx_service/src/remote/cc_stream_receiver.cpp
on_received_messages only increments inflight standby requests / forwards when StandbyBecomingLeaderNodeTerm() == -1.
Request abort behavior
tx_service/include/cc/cc_request.h
KeyObjectStandbyForwardCc::AbortCcRequest no longer sets status to Finished before enqueueing.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Client
    participant Remote as RemoteRPC/Receiver
    participant Sharder
    participant CcShard
    participant FetchClosure as FetchCatalogClosure
    participant CCNode as Fault::CCNode

    Note over Sharder: standby_becoming_leader_term_cache_ (atomic)

    Client->>Remote: send catalog fetch / messages
    Remote->>Sharder: check StandbyBecomingLeaderNodeTerm()
    alt StandbyBecomingLeaderNodeTerm == -1
        Remote->>CcShard: forward / increment inflight / FetchCatalog
        CcShard->>Sharder: validate standby terms (StandbyNodeTerm / CandidateStandbyNodeTerm)
        alt terms valid
            CcShard->>FetchClosure: issue RPC fetch
            FetchClosure->>Remote: handle RPC response
            alt RPC success & NO_ERROR
                FetchClosure->>CcShard: deliver catalog
            else RPC error
                FetchClosure->>CcShard: finish with LEADER_NODE_UNREACHABLE
            end
        else terms mismatch
            CcShard->>FetchClosure: abort (LEADER_NODE_UNREACHABLE)
        end
    else Standby becoming leader (term != -1)
        Remote->>FetchClosure: short-circuit / abort with LEADER_NODE_UNREACHABLE
    end

    Note left of CCNode: OnLeaderStart may set/clear standby_becoming_leader_term_cache_
    CCNode->>Sharder: SetStandbyBecomingLeaderNodeTerm(...) / reset to -1
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • liunyl

Poem

🐰 I cached a hop, a leader near,
I pause the stream when terms appear,
If messages knock while I ascend,
I wave them off till shifts all end,
Safe hops, safe cluster — nibble cheer! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description explains the key changes but lacks reference to the issue, RFC links, testing notes, and documentation details specified in the template. Add issue reference (fixes #112), mention testing with ./mtr suite, and clarify if documentation updates are needed for the new API additions.
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main changes: fixing standby OnLeaderStart behavior and FetchCatalog from primary, which aligns with the core objectives of marking nodes as becoming leader and aborting fetches appropriately.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@MrGuin MrGuin requested review from liunyl and lokax December 26, 2025 09:19
@MrGuin

MrGuin commented Dec 26, 2025

Copy link
Copy Markdown
Collaborator Author

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tx_service/include/error_messages.h (1)

339-348: Add LEADER_NODE_UNREACHABLE to cc_error_messages for clearer diagnostics

You added CcErrorCode::LEADER_NODE_UNREACHABLE but it’s not present in cc_error_messages, so CcErrorMessage(LEADER_NODE_UNREACHABLE) will fall back to the numeric code string.

Consider adding a mapping entry for consistency and better logs.

Proposed addition
 static const std::unordered_map<CcErrorCode, std::string> cc_error_messages{
@@
-    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::LEADER_NODE_UNREACHABLE, "LEADER_NODE_UNREACHABLE"},
@@
-    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
+    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
 };
tx_service/include/rpc_closure.h (1)

924-1029: FetchCatalogClosure correctly aborts on term / RPC failures, but specific error codes are not propagated to requesters

The new logic:

  • Aborts on !ValidTermCheck() with SetFinish(Deleted, NG_TERM_CHANGED).
  • Aborts on non-retriable RPC failure with SetFinish(Deleted, LEADER_NODE_UNREACHABLE).
  • Logs and finishes with Unknown + err_code for terminal response errors.

Functionally this ensures FetchCatalogCc stops retrying when the node is becoming leader or the leader is unreachable, which matches the PR intent.

Note, though, that FetchCatalogCc::Execute() currently treats any error_code_ != 0 as a generic data-store failure and calls AbortCcRequest(CcErrorCode::DATA_STORE_ERR) for all waiting requests, so callers never see NG_TERM_CHANGED or LEADER_NODE_UNREACHABLE.

If you want upstream code to distinguish these cases, consider mapping error_code_ through in Execute() instead of collapsing everything to DATA_STORE_ERR (at least for the new codes).

Illustrative change in FetchCatalogCc::Execute (cc_req_misc.cpp)
-    if (error_code_ == 0)
+    if (error_code_ == 0)
@@
-    else
-    {
-        for (CcRequestBase *req : requesters_)
-        {
-            req->AbortCcRequest(CcErrorCode::DATA_STORE_ERR);
-        }
-    }
+    else
+    {
+        CcErrorCode ec = static_cast<CcErrorCode>(error_code_);
+        // Preserve more specific error codes for leader/term issues.
+        CcErrorCode mapped =
+            (ec == CcErrorCode::NG_TERM_CHANGED ||
+             ec == CcErrorCode::LEADER_NODE_UNREACHABLE)
+                ? ec
+                : CcErrorCode::DATA_STORE_ERR;
+
+        for (CcRequestBase *req : requesters_)
+        {
+            req->AbortCcRequest(mapped);
+        }
+    }

This keeps existing behavior for generic failures while exposing the new, more specific errors.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2d73fb and ef2f57f.

📒 Files selected for processing (7)
  • tx_service/include/error_messages.h
  • tx_service/include/rpc_closure.h
  • tx_service/include/sharder.h
  • tx_service/src/cc/cc_req_misc.cpp
  • tx_service/src/fault/cc_node.cpp
  • tx_service/src/remote/cc_stream_receiver.cpp
  • tx_service/src/sharder.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/cc_req_misc.cpp
  • tx_service/src/remote/cc_stream_receiver.cpp
  • tx_service/src/sharder.cpp
  • tx_service/src/fault/cc_node.cpp
  • tx_service/include/sharder.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/src/remote/cc_stream_receiver.cpp
🧬 Code graph analysis (1)
tx_service/src/cc/cc_req_misc.cpp (1)
tx_service/include/cc/cc_request.h (1)
  • cc_ng_term_ (3021-3129)
🔇 Additional comments (5)
tx_service/src/remote/cc_stream_receiver.cpp (1)

196-207: Guarding standby inflight counting during leader transition looks correct

Requiring both PrimaryNodeTerm() > 0 and StandbyBecomingLeaderNodeTerm() == -1 before incrementing InflightStandbyReqCount cleanly prevents new standby traffic from extending the drain window while a standby is transitioning to leader. Existing standby vs primary paths remain unchanged.

tx_service/src/cc/cc_req_misc.cpp (1)

88-107: Expanded term check for fetch-from-primary aligns with standby→leader transition

Including StandbyBecomingLeaderNodeTerm() != -1 in the guard ensures catalog fetches from primary are aborted while the node is in the “becoming leader” window, preventing use of potentially stale primary state. The added debug log gives enough context to diagnose these rejections.

tx_service/src/sharder.cpp (1)

165-178: Initialization of standby_becoming_leader_term_cache_ is consistent

Seeding standby_becoming_leader_term_cache_ to -1 alongside the other term caches matches the getter’s contract and avoids relying on default-initialized atomics.

tx_service/src/fault/cc_node.cpp (1)

283-342: Standby-becoming-leader marker is correctly scoped around the transition

Marking StandbyBecomingLeaderNodeTerm(term) before unsubscribing / draining, and then clearing it back to -1 after standby term reset, cleanly brackets the “becoming leader” window. That gives cc_stream_receiver and term checks a precise signal without leaving the marker set on error paths in this branch.

tx_service/include/sharder.h (1)

684-705: New standby-becoming-leader term cache API is consistent with existing term handling

SetStandbyBecomingLeaderNodeTerm / StandbyBecomingLeaderNodeTerm mirror the patterns used for standby and candidate-standby term caches (cc_nodes_init_ guard, release store / acquire load), and the added atomic member is documented clearly. This gives other components a clean way to detect the transient “becoming leader” state.

Also applies to: 809-814

return;
}

standby_becoming_leader_term_cache_.store(leader_term,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this only need to be a boolean

return;
}
DLOG(ERROR) << "FetchCatalogClosure fails, abort fetch_cc_";
fetch_cc_->SetFinish(RecordStatus::Deleted,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix ForwardStandbyCc abort logic. ForwardStandbyCc cannot be discarded even if called abort cc req, it needs to be reenqueued into the cc req queue otherwise we might lost data.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f189803 and 57411f1.

📒 Files selected for processing (2)
  • tx_service/include/error_messages.h
  • tx_service/src/cc/cc_req_misc.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/src/cc/cc_req_misc.cpp
🔇 Additional comments (1)
tx_service/include/error_messages.h (1)

342-343: LGTM! New error code is correctly defined.

The addition of LEADER_NODE_UNREACHABLE is consistent between the enum definition and the error messages map. The placement before LAST_ERROR_CODE is correct, and the comment clearly describes its purpose for leader node connectivity issues.

Also applies to: 425-425

@@ -339,6 +339,9 @@ enum struct CcErrorCode : uint8_t
// The cursor has become invalid.
INVALID_CURSOR,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Read the file to understand the structure around line 340 and locate the map
cat -n tx_service/include/error_messages.h | sed -n '330,450p'

Repository: eloqdata/tx_service

Length of output: 4922


🏁 Script executed:

# Search for CcErrorCode::INVALID_CURSOR usage
echo "=== Searching for CcErrorCode::INVALID_CURSOR usage ==="
rg -n --type=cpp 'INVALID_CURSOR' -C2

# Also check for any direct references to INVALID_CURSOR
echo -e "\n=== Checking for plain INVALID_CURSOR references ==="
rg -n 'INVALID_CURSOR' --type=cpp

Repository: eloqdata/tx_service

Length of output: 3285


Add INVALID_CURSOR to the cc_error_messages map.

The INVALID_CURSOR enum value is actively used in the codebase (tx_operation.cpp:1377, tx_execution.cpp:404) but is missing from the cc_error_messages map. When CcErrorMessage() is called with this error code, it will return a generic "CcErrorCode:340" message instead of a descriptive one. Add the mapping to maintain consistency with other error codes:

{CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"},
🤖 Prompt for AI Agents
In tx_service/include/error_messages.h around line 340, the
CcErrorCode::INVALID_CURSOR enum is missing from the cc_error_messages map
causing CcErrorMessage() to fall back to a generic code string; add an entry to
the map associating CcErrorCode::INVALID_CURSOR with the string "INVALID_CURSOR"
(i.e. add the mapping {CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"}, matching
the style/format of the surrounding entries).

@MrGuin MrGuin merged commit 527c2d3 into main Dec 29, 2025
4 checks passed
@MrGuin MrGuin deleted the fix_standby branch December 29, 2025 03:49
@coderabbitai coderabbitai Bot mentioned this pull request Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants