fix standby OnLeaderStart and FetchCatalog from primary by MrGuin · Pull Request #326 · eloqdata/tx_service

MrGuin · 2025-12-26T09:16:59Z

mark that this node is becoming leader and discard the standby msgs in OnLeaderStart before wait InflightStandbyReqCount;
FetchCatalog from primary aborts the fetch_cc if rpc fails or node is becoming leader.

Summary by CodeRabbit

New Features
- Public API to track when a standby node is becoming leader, with initialization support.
Bug Fixes
- Improved error logging and clearer aborts when leader is unreachable during catalog fetches.
- Added validation gates to prevent actions during standby→leader transitions, including gating inflight forwarding and remote fetches.
Chores
- Introduced a new error code to surface leader-unreachable conditions.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-26T09:17:09Z

Warning

Rate limit exceeded

@MrGuin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 12 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 305a29c and 918e859.

📒 Files selected for processing (1)

tx_service/src/sharder.cpp

Walkthrough

Adds a new error code and introduces a Sharder atomic cache/API to mark a standby becoming leader; several term-checks and abort/log paths were added or gated across catalog fetch, RPC closure, shard fetch, stream receiver, and leader-start logic to block operations during the standby→leader transition.

Changes

Cohort / File(s)	Summary
Error Code Addition `tx_service/include/error_messages.h`	Added `LEADER_NODE_UNREACHABLE` to `CcErrorCode` and to `cc_error_messages`.
Sharder API & State `tx_service/include/sharder.h`, `tx_service/src/sharder.cpp`	Added `standby_becoming_leader_term_cache_` (atomic), `SetStandbyBecomingLeaderNodeTerm(int64_t)`, `StandbyBecomingLeaderNodeTerm()`, and initialized cache to `-1`.
Leader start / standby term handling `tx_service/src/fault/cc_node.cpp`	Reordered OnLeaderStart/standby flow: early handling for prev_candidate_standby_term escalation, set/clear becoming-leader marker, clear caches and stop retries when escalation detected.
Catalog fetch validation `tx_service/src/cc/cc_req_misc.cpp`	`FetchCatalogCc::ValidTermCheck()` now fails if `StandbyBecomingLeaderNodeTerm() != -1`; `AbortCcRequest` uses actual `CcErrorCode` value when aborting.
RPC closure error handling `tx_service/include/rpc_closure.h`	`FetchCatalogClosure` logs and finishes with `LEADER_NODE_UNREACHABLE` on term-check failures and on RPC timeout/retry aborts; adds logging for other catalog errors.
Shard fetch gating `tx_service/src/cc/cc_shard.cpp`	`CcShard::FetchCatalog` now requires `StandbyBecomingLeaderNodeTerm() == -1` (and matching standby/candidate term) before fetching from primary.
Stream receiver gating `tx_service/src/remote/cc_stream_receiver.cpp`	`on_received_messages` only increments inflight standby requests / forwards when `StandbyBecomingLeaderNodeTerm() == -1`.
Request abort behavior `tx_service/include/cc/cc_request.h`	`KeyObjectStandbyForwardCc::AbortCcRequest` no longer sets status to Finished before enqueueing.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Client
    participant Remote as RemoteRPC/Receiver
    participant Sharder
    participant CcShard
    participant FetchClosure as FetchCatalogClosure
    participant CCNode as Fault::CCNode

    Note over Sharder: standby_becoming_leader_term_cache_ (atomic)

    Client->>Remote: send catalog fetch / messages
    Remote->>Sharder: check StandbyBecomingLeaderNodeTerm()
    alt StandbyBecomingLeaderNodeTerm == -1
        Remote->>CcShard: forward / increment inflight / FetchCatalog
        CcShard->>Sharder: validate standby terms (StandbyNodeTerm / CandidateStandbyNodeTerm)
        alt terms valid
            CcShard->>FetchClosure: issue RPC fetch
            FetchClosure->>Remote: handle RPC response
            alt RPC success & NO_ERROR
                FetchClosure->>CcShard: deliver catalog
            else RPC error
                FetchClosure->>CcShard: finish with LEADER_NODE_UNREACHABLE
            end
        else terms mismatch
            CcShard->>FetchClosure: abort (LEADER_NODE_UNREACHABLE)
        end
    else Standby becoming leader (term != -1)
        Remote->>FetchClosure: short-circuit / abort with LEADER_NODE_UNREACHABLE
    end

    Note left of CCNode: OnLeaderStart may set/clear standby_becoming_leader_term_cache_
    CCNode->>Sharder: SetStandbyBecomingLeaderNodeTerm(...) / reset to -1

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Add OnLeaderStop in DataStoreHandler #181: Overlaps with OnLeaderStart() sequencing and leader-start/standby handling in src/fault/cc_node.cpp.
Store ng leader term with ng_leader_cache. #129: Related Sharder term-caching and term-aware APIs additions.
Replace fixed-size standby message buffer with memory-bounded queue #211: Touches standby/candidate-standby management and related gating logic.

Suggested reviewers

liunyl

Poem

🐰 I cached a hop, a leader near,
I pause the stream when terms appear,
If messages knock while I ascend,
I wave them off till shifts all end,
Safe hops, safe cluster — nibble cheer! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description explains the key changes but lacks reference to the issue, RFC links, testing notes, and documentation details specified in the template.	Add issue reference (fixes #112), mention testing with ./mtr suite, and clarify if documentation updates are needed for the new API additions.
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the main changes: fixing standby OnLeaderStart behavior and FetchCatalog from primary, which aligns with the core objectives of marking nodes as becoming leader and aborting fetches appropriately.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

MrGuin · 2025-12-26T09:19:46Z

Fixes https://github.com/eloqdata/project_tracker/issues/112

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tx_service/include/error_messages.h (1)
339-348: Add LEADER_NODE_UNREACHABLE to cc_error_messages for clearer diagnostics

You added CcErrorCode::LEADER_NODE_UNREACHABLE but it’s not present in cc_error_messages, so CcErrorMessage(LEADER_NODE_UNREACHABLE) will fall back to the numeric code string.

Consider adding a mapping entry for consistency and better logs.
Proposed addition
 static const std::unordered_map<CcErrorCode, std::string> cc_error_messages{
@@
-    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::UPDATE_SEQUENCE_TABLE_FAIL, "UPDATE_SEQUENCE_TABLE_FAIL"},
+    {CcErrorCode::LEADER_NODE_UNREACHABLE, "LEADER_NODE_UNREACHABLE"},
@@
-    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
+    {CcErrorCode::LAST_ERROR_CODE, "LAST_ERROR_CODE"},
 };
tx_service/include/rpc_closure.h (1)
924-1029: FetchCatalogClosure correctly aborts on term / RPC failures, but specific error codes are not propagated to requesters

The new logic:

Aborts on !ValidTermCheck() with SetFinish(Deleted, NG_TERM_CHANGED).

Aborts on non-retriable RPC failure with SetFinish(Deleted, LEADER_NODE_UNREACHABLE).

Logs and finishes with Unknown + err_code for terminal response errors.

Functionally this ensures FetchCatalogCc stops retrying when the node is becoming leader or the leader is unreachable, which matches the PR intent.

Note, though, that FetchCatalogCc::Execute() currently treats any error_code_ != 0 as a generic data-store failure and calls AbortCcRequest(CcErrorCode::DATA_STORE_ERR) for all waiting requests, so callers never see NG_TERM_CHANGED or LEADER_NODE_UNREACHABLE.

If you want upstream code to distinguish these cases, consider mapping error_code_ through in Execute() instead of collapsing everything to DATA_STORE_ERR (at least for the new codes).
Illustrative change in FetchCatalogCc::Execute (cc_req_misc.cpp)
-    if (error_code_ == 0)
+    if (error_code_ == 0)
@@
-    else
-    {
-        for (CcRequestBase *req : requesters_)
-        {
-            req->AbortCcRequest(CcErrorCode::DATA_STORE_ERR);
-        }
-    }
+    else
+    {
+        CcErrorCode ec = static_cast<CcErrorCode>(error_code_);
+        // Preserve more specific error codes for leader/term issues.
+        CcErrorCode mapped =
+            (ec == CcErrorCode::NG_TERM_CHANGED ||
+             ec == CcErrorCode::LEADER_NODE_UNREACHABLE)
+                ? ec
+                : CcErrorCode::DATA_STORE_ERR;
+
+        for (CcRequestBase *req : requesters_)
+        {
+            req->AbortCcRequest(mapped);
+        }
+    }
This keeps existing behavior for generic failures while exposing the new, more specific errors.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2d73fb and ef2f57f.

📒 Files selected for processing (7)

tx_service/include/error_messages.h
tx_service/include/rpc_closure.h
tx_service/include/sharder.h
tx_service/src/cc/cc_req_misc.cpp
tx_service/src/fault/cc_node.cpp
tx_service/src/remote/cc_stream_receiver.cpp
tx_service/src/sharder.cpp

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-12-02T10:43:27.431Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

tx_service/src/cc/cc_req_misc.cpp
tx_service/src/remote/cc_stream_receiver.cpp
tx_service/src/sharder.cpp
tx_service/src/fault/cc_node.cpp
tx_service/include/sharder.h

📚 Learning: 2025-10-21T06:46:53.700Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

tx_service/src/remote/cc_stream_receiver.cpp

🧬 Code graph analysis (1)

tx_service/src/cc/cc_req_misc.cpp (1)

tx_service/include/cc/cc_request.h (1)

cc_ng_term_ (3021-3129)

🔇 Additional comments (5)

tx_service/src/remote/cc_stream_receiver.cpp (1)

196-207: Guarding standby inflight counting during leader transition looks correct

Requiring both PrimaryNodeTerm() > 0 and StandbyBecomingLeaderNodeTerm() == -1 before incrementing InflightStandbyReqCount cleanly prevents new standby traffic from extending the drain window while a standby is transitioning to leader. Existing standby vs primary paths remain unchanged.

tx_service/src/cc/cc_req_misc.cpp (1)

88-107: Expanded term check for fetch-from-primary aligns with standby→leader transition

Including StandbyBecomingLeaderNodeTerm() != -1 in the guard ensures catalog fetches from primary are aborted while the node is in the “becoming leader” window, preventing use of potentially stale primary state. The added debug log gives enough context to diagnose these rejections.

tx_service/src/sharder.cpp (1)

165-178: Initialization of standby_becoming_leader_term_cache_ is consistent

Seeding standby_becoming_leader_term_cache_ to -1 alongside the other term caches matches the getter’s contract and avoids relying on default-initialized atomics.

tx_service/src/fault/cc_node.cpp (1)

283-342: Standby-becoming-leader marker is correctly scoped around the transition

Marking StandbyBecomingLeaderNodeTerm(term) before unsubscribing / draining, and then clearing it back to -1 after standby term reset, cleanly brackets the “becoming leader” window. That gives cc_stream_receiver and term checks a precise signal without leaving the marker set on error paths in this branch.

tx_service/include/sharder.h (1)

684-705: New standby-becoming-leader term cache API is consistent with existing term handling

SetStandbyBecomingLeaderNodeTerm / StandbyBecomingLeaderNodeTerm mirror the patterns used for standby and candidate-standby term caches (cc_nodes_init_ guard, release store / acquire load), and the added atomic member is documented clearly. This gives other components a clean way to detect the transient “becoming leader” state.

Also applies to: 809-814

liunyl · 2025-12-26T09:21:10Z

+            return;
+        }
+
+        standby_becoming_leader_term_cache_.store(leader_term,


seems like this only need to be a boolean

liunyl · 2025-12-26T10:15:43Z

                return;
            }
+            DLOG(ERROR) << "FetchCatalogClosure fails, abort fetch_cc_";
+            fetch_cc_->SetFinish(RecordStatus::Deleted,


need to fix ForwardStandbyCc abort logic. ForwardStandbyCc cannot be discarded even if called abort cc req, it needs to be reenqueued into the cc req queue otherwise we might lost data.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f189803 and 57411f1.

📒 Files selected for processing (2)

tx_service/include/error_messages.h
tx_service/src/cc/cc_req_misc.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

tx_service/src/cc/cc_req_misc.cpp

🔇 Additional comments (1)

tx_service/include/error_messages.h (1)

342-343: LGTM! New error code is correctly defined.

The addition of LEADER_NODE_UNREACHABLE is consistent between the enum definition and the error messages map. The placement before LAST_ERROR_CODE is correct, and the comment clearly describes its purpose for leader node connectivity issues.

Also applies to: 425-425

coderabbitai · 2025-12-29T03:25:14Z

@@ -339,6 +339,9 @@ enum struct CcErrorCode : uint8_t
    // The cursor has become invalid.
    INVALID_CURSOR,


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Read the file to understand the structure around line 340 and locate the map cat -n tx_service/include/error_messages.h | sed -n '330,450p'

Repository: eloqdata/tx_service

Length of output: 4922

🏁 Script executed:

# Search for CcErrorCode::INVALID_CURSOR usage echo "=== Searching for CcErrorCode::INVALID_CURSOR usage ===" rg -n --type=cpp 'INVALID_CURSOR' -C2 # Also check for any direct references to INVALID_CURSOR echo -e "\n=== Checking for plain INVALID_CURSOR references ===" rg -n 'INVALID_CURSOR' --type=cpp

Repository: eloqdata/tx_service

Length of output: 3285

Add INVALID_CURSOR to the cc_error_messages map.

The INVALID_CURSOR enum value is actively used in the codebase (tx_operation.cpp:1377, tx_execution.cpp:404) but is missing from the cc_error_messages map. When CcErrorMessage() is called with this error code, it will return a generic "CcErrorCode:340" message instead of a descriptive one. Add the mapping to maintain consistency with other error codes:

{CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"},

🤖 Prompt for AI Agents

In tx_service/include/error_messages.h around line 340, the CcErrorCode::INVALID_CURSOR enum is missing from the cc_error_messages map causing CcErrorMessage() to fall back to a generic code string; add an entry to the map associating CcErrorCode::INVALID_CURSOR with the string "INVALID_CURSOR" (i.e. add the mapping {CcErrorCode::INVALID_CURSOR, "INVALID_CURSOR"}, matching the style/format of the surrounding entries).

MrGuin requested review from liunyl and lokax December 26, 2025 09:19

coderabbitai Bot reviewed Dec 26, 2025

View reviewed changes

liunyl reviewed Dec 26, 2025

View reviewed changes

coderabbitai Bot reviewed Dec 29, 2025

View reviewed changes

MrGuin added 3 commits December 29, 2025 11:30

fix standby OnLeaderStart and FetchCatalog from primary

9e39934

open store_handler first in OnLeaderStart; do not abort StandbyForwardCc

85416c0

comment fix

305a29c

MrGuin force-pushed the fix_standby branch from 57411f1 to 305a29c Compare December 29, 2025 03:30

liunyl approved these changes Dec 29, 2025

View reviewed changes

format

918e859

MrGuin merged commit 527c2d3 into main Dec 29, 2025
4 checks passed

MrGuin deleted the fix_standby branch December 29, 2025 03:49

coderabbitai Bot mentioned this pull request Jan 21, 2026

Pass term to eloqstore during start data store. #318

Merged

5 tasks

coderabbitai Bot mentioned this pull request Feb 2, 2026

Allow run on candidate #392

Open

coderabbitai Bot mentioned this pull request Mar 20, 2026

tx_service: guard standby term promotion with candidate check #469

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix standby OnLeaderStart and FetchCatalog from primary#326

fix standby OnLeaderStart and FetchCatalog from primary#326
MrGuin merged 4 commits into
mainfrom
fix_standby

MrGuin commented Dec 26, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Dec 26, 2025 •

edited

Loading

Rate limit exceeded

Uh oh!

MrGuin commented Dec 26, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

liunyl Dec 26, 2025

Uh oh!

liunyl Dec 26, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Dec 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -339,6 +339,9 @@ enum struct CcErrorCode : uint8_t
		// The cursor has become invalid.
		INVALID_CURSOR,

Uh oh!

Conversation

MrGuin commented Dec 26, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Dec 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

MrGuin commented Dec 26, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

liunyl Dec 26, 2025

Choose a reason for hiding this comment

Uh oh!

liunyl Dec 26, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MrGuin commented Dec 26, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Dec 26, 2025 •

edited

Loading