Skip to content

tx_service: guard standby term promotion with candidate check#469

Merged
MrGuin merged 4 commits into
mainfrom
fix_async_term
Mar 25, 2026
Merged

tx_service: guard standby term promotion with candidate check#469
MrGuin merged 4 commits into
mainfrom
fix_async_term

Conversation

@MrGuin

@MrGuin MrGuin commented Mar 20, 2026

Copy link
Copy Markdown
Collaborator

Fixes https://github.com/eloqdata/project_tracker/issues/243.

Summary by CodeRabbit

  • Improvements
    • Standby-term promotions now validate against the current candidate term to prevent applying stale promotions.
    • Snapshot and checkpoint tasks re-check staleness when executed and discard outdated work to reduce race conditions.
    • Cache restoration after snapshot synchronization only runs when a promotion is confirmed, avoiding unnecessary or incorrect cache updates.

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e924c81-e78d-4ac7-a68b-14f8b7d91a61

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf5193 and e2b1dbb.

📒 Files selected for processing (5)
  • tx_service/include/fault/cc_node.h
  • tx_service/include/sharder.h
  • tx_service/src/fault/cc_node.cpp
  • tx_service/src/remote/cc_node_service.cpp
  • tx_service/src/sharder.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • tx_service/include/sharder.h
  • tx_service/src/sharder.cpp

Walkthrough

Adds conditional standby-term promotion: new private CcNode::PromoteStandbyTermIfCandidate and public Sharder::PromoteStandbyTermIfCandidate; CcNode implements latch-protected check/update. Snapshot-sync and checkpoint tasks now move stale checks into enqueued tasks and only restore cache/log success when promotion succeeds.

Changes

Cohort / File(s) Summary
Header Declarations
tx_service/include/sharder.h, tx_service/include/fault/cc_node.h
Added Sharder::PromoteStandbyTermIfCandidate(uint32_t ng_id, int64_t standby_term) (public) and CcNode::PromoteStandbyTermIfCandidate(int64_t) (private) declarations.
CcNode Implementation
tx_service/src/fault/cc_node.cpp
New method acquires is_processing_ latch (spin), compares requested term with CandidateStandbyNodeTerm(), promotes by setting StandbyNodeTerm and clearing candidate if matched; logs and returns bool.
Sharder Implementation
tx_service/src/sharder.cpp
New Sharder::PromoteStandbyTermIfCandidate acquires shared config lock, finds CC node for ng_id, delegates to CcNode::PromoteStandbyTermIfCandidate, and returns the result.
Remote Service Flow
tx_service/src/remote/cc_node_service.cpp
Moved stale-checks into enqueued tasks; replaced unconditional term-setting after snapshot sync with call to Sharder::PromoteStandbyTermIfCandidate and conditionally run RestoreTxCache / success logging based on promotion result.

Sequence Diagram

sequenceDiagram
    participant Task as RequestSyncSnapshot Task
    participant Sharder as Sharder
    participant CcNode as CcNode
    participant KV as KVStore

    Task->>Sharder: PromoteStandbyTermIfCandidate(ng_id, standby_term)
    activate Sharder
    Sharder->>Sharder: acquire shared cluster_cnf_mux_
    Sharder->>CcNode: PromoteStandbyTermIfCandidate(standby_term)
    activate CcNode
    CcNode->>CcNode: acquire is_processing_ latch (spin)
    CcNode->>CcNode: read CandidateStandbyNodeTerm()
    alt candidate_term == standby_term
        CcNode->>KV: SetStandbyNodeTerm(standby_term)
        CcNode->>KV: SetCandidateStandbyNodeTerm(-1)
        CcNode-->>Sharder: return true
    else
        CcNode-->>Sharder: return false (log mismatch)
    end
    CcNode->>CcNode: release latch
    deactivate CcNode
    Sharder-->>Task: promoted (bool)
    deactivate Sharder

    alt promoted == true
        Task->>KV: RestoreTxCache(...)
        Task->>Task: log "RequestSyncSnapshot successfully"
    else
        Task->>Task: log "stale term is discarded"
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • liunyl
  • lokax

Poem

🐇 I nudged a term from shadow to sun,
Latch gripped tight until the check was done.
If numbers matched, I hopped it up high,
If not, I logged and let it slide by.
Tiny paws, big hops — a carrot-coded cry. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description references an issue but provides minimal detail about the implementation, testing, or rationale. It does not follow the repository's template with checklists for tests, documentation, and test suite verification. Complete the description template by documenting changes, adding test information, confirming test suite execution, and providing implementation details beyond the issue reference.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'tx_service: guard standby term promotion with candidate check' accurately summarizes the main change—introducing a candidate check mechanism to guard standby term promotion.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix_async_term

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tx_service/src/remote/cc_node_service.cpp`:
- Around line 2087-2094: The call to store_hd->RequestSyncSnapshot(ng_id, term,
snapshot_ts) can start stale work because it uses only
PrimaryTermFromStandbyTerm; either guard that call with the full standby_term or
thread subscribe_id through the API so the store can ignore stale requests:
modify the logic around PrimaryTermFromStandbyTerm/RequestSyncSnapshot and
PromoteStandbyTermIfCandidate so you first validate the standby_term (or pass
the subscribe_id) before invoking RequestSyncSnapshot, or change
RequestSyncSnapshot signature to accept subscribe_id and ensure the store's
ReloadData filters by subscribe_id as well; update calls and callers of
RequestSyncSnapshot and any store_hd implementations accordingly (also ensure
RestoreTxCache usage remains gated by PromoteStandbyTermIfCandidate).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eee38712-3c8a-4dfe-938e-ad7ea20727f6

📥 Commits

Reviewing files that changed from the base of the PR and between 1446380 and 55f58e6.

📒 Files selected for processing (5)
  • tx_service/include/fault/cc_node.h
  • tx_service/include/sharder.h
  • tx_service/src/fault/cc_node.cpp
  • tx_service/src/remote/cc_node_service.cpp
  • tx_service/src/sharder.cpp

Comment thread tx_service/src/remote/cc_node_service.cpp Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tx_service/src/remote/cc_node_service.cpp (1)

2090-2092: ⚠️ Potential issue | 🟠 Major

The new timestamp gate still lets stale same-term subscriptions start store work.

The snapshot_ts discard helps, but Line 2091 still calls the store with only term; the full standby_term is checked only afterwards in PromoteStandbyTermIfCandidate(). Two follow sessions on the same primary term can still kick off store-side sync for the older subscribe_id before it is rejected. Please gate the store call on the full standby term, or thread standby_term/subscribe_id through the store API.

Verify whether the downstream snapshot path ever sees the full standby term or subscribe ID. If every implementation still only takes (ng_id, term, snapshot_ts), this stale-work window is still open.

#!/bin/bash
set -euo pipefail

echo "== RequestSyncSnapshot signatures and implementations =="
rg -n -C3 '\bRequestSyncSnapshot\s*\(' --type=cpp --type=h

echo
echo "== Store-side snapshot filtering inputs =="
rg -n -C3 'ReloadData\(|subscribe_id|standby_term|standby_node_term|PrimaryTermFromStandbyTerm' --type=cpp --type=h
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/remote/cc_node_service.cpp` around lines 2090 - 2092, The call
to store_hd->RequestSyncSnapshot(ng_id, term, snapshot_ts) can start store work
for a stale same-term subscriber; change the logic in the caller (around
RequestSyncSnapshot and PromoteStandbyTermIfCandidate) to gate the store
invocation on the full standby term/subscribe_id (e.g., compare
standby_term/subscribe_id before calling RequestSyncSnapshot) or modify the
store API to accept standby_term and/or subscribe_id and have the store
reject/ignore stale requests; update references to RequestSyncSnapshot,
PromoteStandbyTermIfCandidate, store_hd, standby_term, subscribe_id,
snapshot_ts, ng_id and term accordingly and run the supplied ripgrep checks to
verify all RequestSyncSnapshot implementations and downstream snapshot paths
accept and enforce the full standby term/subscribe_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tx_service/src/remote/cc_node_service.cpp`:
- Around line 2090-2092: The call to store_hd->RequestSyncSnapshot(ng_id, term,
snapshot_ts) can start store work for a stale same-term subscriber; change the
logic in the caller (around RequestSyncSnapshot and
PromoteStandbyTermIfCandidate) to gate the store invocation on the full standby
term/subscribe_id (e.g., compare standby_term/subscribe_id before calling
RequestSyncSnapshot) or modify the store API to accept standby_term and/or
subscribe_id and have the store reject/ignore stale requests; update references
to RequestSyncSnapshot, PromoteStandbyTermIfCandidate, store_hd, standby_term,
subscribe_id, snapshot_ts, ng_id and term accordingly and run the supplied
ripgrep checks to verify all RequestSyncSnapshot implementations and downstream
snapshot paths accept and enforce the full standby term/subscribe_id.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e3e053f-d4e1-4302-ab71-d1633a525fd4

📥 Commits

Reviewing files that changed from the base of the PR and between 55f58e6 and c585aac.

📒 Files selected for processing (1)
  • tx_service/src/remote/cc_node_service.cpp

@MrGuin MrGuin merged commit 4d64cce into main Mar 25, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants