Skip to content

Store ng leader term with ng_leader_cache.#129

Merged
MrGuin merged 3 commits into
mainfrom
ng_leader_term_cache
Sep 18, 2025
Merged

Store ng leader term with ng_leader_cache.#129
MrGuin merged 3 commits into
mainfrom
ng_leader_term_cache

Conversation

@MrGuin

@MrGuin MrGuin commented Sep 10, 2025

Copy link
Copy Markdown
Collaborator

Check term before UpdateLeader.

Related PR:
https://github.com/eloqdata/raft_host_manager/pull/3

Summary by CodeRabbit

  • New Features
    • Leader updates now include term information, enabling term-aware leadership changes.
    • Added per-group tracking of leader terms for more precise state management.
  • Bug Fixes
    • Prevents applying outdated leadership updates, improving consistency during failovers and reducing misrouting under contention.
    • More robust handling of concurrent leader changes to avoid stale state propagation.

@MrGuin MrGuin force-pushed the ng_leader_term_cache branch from 593ed58 to ae0369d Compare September 18, 2025 07:27
@coderabbitai

coderabbitai Bot commented Sep 18, 2025

Copy link
Copy Markdown

Walkthrough

Adds leader term handling across the system: a term field is added to the NotifyNewLeaderStartRequest proto; Sharder gains a term-aware UpdateLeader method and a per-node-group leader-term cache; callers (CcNode and cc_node_service) pass term through; Sharder implements atomic, term-ordered updates and init for the new cache.

Changes

Cohort / File(s) Summary of Changes
Protocol: leader term
include/proto/cc_request.proto
Added field int64 term = 3 to message NotifyNewLeaderStartRequest.
Sharder API and state
include/sharder.h
UpdateLeader signature now void UpdateLeader(uint32_t ng_id, uint32_t node_id, int64_t term = -1);. Added std::atomic<int64_t> ng_leader_term_cache_[1000]; with note on reliability.
Sharder implementation
src/sharder.cpp
Initialized ng_leader_term_cache_ to -1 in Init. Implemented term-aware UpdateLeader: rejects outdated terms, CAS loop to publish newer terms, then updates ng_leader_cache_. Uses memory orderings and sentinel -1.
Inbound RPC path
src/remote/cc_node_service.cpp
NotifyNewLeaderStart reads term from request and calls UpdateLeader(ng_id, node_id, term).
Subscription path
src/fault/cc_node.cpp
CcNode::SubscribePrimaryNode passes primary_term to UpdateLeader(ng_id_, leader_node_id, primary_term).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client/Peer
  participant Svc as CCNodeService
  participant Sh as Sharder
  participant Caches as Leader Caches

  C->>Svc: NotifyNewLeaderStart(ng_id, node_id, term)
  Svc->>Sh: UpdateLeader(ng_id, node_id, term)
  alt term provided (!= -1)
    Sh->>Caches: read cached_term[ng_id]
    opt outdated check
      Note over Sh,Caches: If term <= cached_term and cached != -1, skip
    end
    Sh->>Caches: CAS loop: cached_term -> term
  else no term
    Note over Sh: Skip term update
  end
  Sh->>Caches: publish leader_id[ng_id] = node_id
  Svc-->>C: OK/Status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

A whisk of term, a hop of lead,
I cache the truth the shards all need.
Old tales fade—new epochs bloom,
Atomic thumps disperse the gloom.
With ears alert, I guard the turn—
The newest term is what we learn. 🐇🔧

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description is minimal ("Check term before UpdateLeader." plus a related-PR link) and does not follow the repository's required template checklist; it omits confirmation of added tests, documentation, an issue/RFC reference, and the required ./mtr test results. Because several required sections from the description_template are missing or not addressed, the description does not meet the repository's standards for a complete PR description. Reviewers will lack the information needed to verify tests, docs, and CI expectations. Please update the PR description using the repository template: state whether tests were added (and link them), document the change and rationale, reference the issue or RFC if applicable, include the related PR link, and report the results of running ./mtr --suite=mono_main,mono_multi,mono_basic; if any checklist items are not applicable, mark them and explain why. This will speed review and ensure CI/test expectations are clear.
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Store ng leader term with ng_leader_cache." accurately captures the primary change in the diff: adding per-node-group leader-term storage alongside the existing ng_leader_cache (see changes to sharder.h/sharder.cpp and UpdateLeader signature), and it is concise and relevant for someone scanning history. It directly relates to the main implementation work (ng_leader_term_cache_ and term-aware UpdateLeader) rather than listing files or extraneous details. The phrasing is clear enough for a teammate to understand the primary intent of the PR at a glance.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ng_leader_term_cache

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
include/sharder.h (1)

765-766: Document the unreliability of the cache more clearly.

The comment states "this is not reliable" but doesn't explain why or what the implications are. Consider adding more context about when and how this cache might be unreliable, and what callers should be aware of.

     // Ng leader cache. We preallocate it to the max cluster size so that we
     // don't need to modify the size of it.
     std::atomic<uint32_t> ng_leader_cache_[1000];
-    // Ng leader term cache. Note this is not reliable.
+    // Ng leader term cache. Note this is not reliable as it may contain stale
+    // term values during network partitions or when updates from remote nodes
+    // are delayed. The cache is best-effort and should not be used for 
+    // critical leader verification - use CheckLeaderTerm() instead.
     std::atomic<int64_t> ng_leader_term_cache_[1000];
src/sharder.cpp (1)

684-709: Consider potential ABA problem in the CAS loop.

The current implementation has a theoretical ABA problem where the cached_leader_term could change from A to B and back to A between the load and CAS, potentially allowing an outdated update. While unlikely in practice, consider whether this edge case matters for your consistency requirements.

Additionally, the double-check pattern inside the loop is redundant since cached_leader_term is updated by the CAS failure.

 void Sharder::UpdateLeader(uint32_t ng_id, uint32_t node_id, int64_t term)
 {
     DLOG(INFO) << "ccnode group ng" << ng_id
                << " updates leader to node_id:" << node_id;
     auto cached_leader_term =
         ng_leader_term_cache_[ng_id].load(std::memory_order_relaxed);
     // If the term is passed and out-dated, skip it.
     if (term != -1 && cached_leader_term != -1 && term <= cached_leader_term)
     {
         DLOG(INFO) << "skip out-dated leader update, term: " << term
                    << ", cached term: " << cached_leader_term;
         return;
     }
 
-    while (!ng_leader_term_cache_[ng_id].compare_exchange_weak(
-        cached_leader_term, term))
-    {
-        if (term != -1 && cached_leader_term != -1 &&
-            term <= cached_leader_term)
-        {
-            DLOG(INFO) << "skip out-dated leader update, term: " << term
-                       << ", cached term: " << cached_leader_term;
-            return;
-        }
-    }
+    // Update term cache if we have a newer term
+    if (term != -1) {
+        while (!ng_leader_term_cache_[ng_id].compare_exchange_weak(
+            cached_leader_term, term))
+        {
+            // cached_leader_term is updated by CAS failure
+            if (cached_leader_term != -1 && term <= cached_leader_term)
+            {
+                DLOG(INFO) << "skip out-dated leader update, term: " << term
+                           << ", cached term: " << cached_leader_term;
+                return;
+            }
+        }
+    }
     ng_leader_cache_[ng_id].store(node_id, std::memory_order_release);
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd21f9a and ae0369d.

📒 Files selected for processing (5)
  • include/proto/cc_request.proto (1 hunks)
  • include/sharder.h (2 hunks)
  • src/fault/cc_node.cpp (1 hunks)
  • src/remote/cc_node_service.cpp (1 hunks)
  • src/sharder.cpp (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
include/sharder.h (1)
src/sharder.cpp (4)
  • UpdateLeader (663-682)
  • UpdateLeader (663-663)
  • UpdateLeader (684-710)
  • UpdateLeader (684-684)
🔇 Additional comments (6)
include/proto/cc_request.proto (1)

44-44: LGTM!

The addition of the term field to NotifyNewLeaderStartRequest is appropriate for tracking leader terms.

include/sharder.h (1)

356-358: LGTM!

The updated API signature with the optional term parameter is well-designed. The default value of -1 provides backward compatibility.

src/remote/cc_node_service.cpp (1)

219-224: LGTM!

The integration correctly reads the term from the request and passes it to UpdateLeader.

src/fault/cc_node.cpp (1)

838-838: LGTM!

The call site correctly passes the primary_term to UpdateLeader, maintaining consistency with the term-aware updates.

src/sharder.cpp (2)

172-172: LGTM!

Proper initialization of the ng_leader_term_cache_ to -1 for all potential node groups.


698-708: No code path depends on ng_leader_term_cache_ and ng_leader_cache_ being strictly consistent.
Repo search shows only initialization and the update in src/sharder.cpp; LeaderNodeId (include/sharder.h) loads only ng_leader_cache_ and ng_leader_term_cache_ is documented "not reliable." No action required.

@MrGuin MrGuin merged commit 48e5c86 into main Sep 18, 2025
3 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Nov 3, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants