Add OnLeaderStop in DataStoreHandler by lzxddz · Pull Request #181 · eloqdata/tx_service

lzxddz · 2025-10-29T10:39:31Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Reference the link of issue using fixes eloqdb/tx_service#issue_id
Reference the link of RFC if exists
Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Note

Introduce datastore OnLeaderStop and parameterized OnStartFollowing, wire them into cc_node, and add retrying snapshot-sync notification plus improved diagnostics.

Data store handler (include/store/data_store_handler.h):
- Add virtual bool OnLeaderStop(int64_t term).
- Extend OnStartFollowing to OnStartFollowing(uint32_t node_id, int64_t term, int64_t standby_term, bool resubscribe).
CC node (src/fault/cc_node.cpp):
- Call store_hd_->OnLeaderStart(...) unconditionally when KV enabled (remove shared-storage check).
- Invoke store_hd_->OnLeaderStop(term) on leader step-down with retry gating.
- Pass leader/follow terms to store_hd_->OnStartFollowing(leader_node_id, primary_term, standby_term, resubscribe).
- Add concise DLOG/LOG context in snapshot receive and subscribe flows.
Snapshot manager (src/store/snapshot_manager.cpp):
- Add DLOG on receive; when notifying standby via OnSnapshotSynced, add retry loop with backoff and warnings.
Docs/Comments:
- Minor comment wrap in include/sharder.h.

^{Written by Cursor Bugbot for commit e4913d6. This will update automatically on new commits. Configure here.}

Summary by CodeRabbit

Bug Fixes
- Improved snapshot synchronization reliability with automatic retry attempts for more stable standby startup.
- Hardened leader transition and follower state handling to reduce failures during role changes.
- Added retry-backed notifications and backoff checks for shard sync, improving timeout recovery and error reporting.

coderabbitai · 2025-10-29T10:39:56Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Added a new OnLeaderStop(term) hook and expanded OnStartFollowing to accept (node_id, term, standby_term, resubscribe). Updated call sites and logging across leader/standby flows and added retry loops for shard/snapshot synchronization.

Changes

Cohort / File(s)	Summary
API Contract Updates `include/store/data_store_handler.h`	Added `OnLeaderStop(int64_t term)` and changed `OnStartFollowing()` to `OnStartFollowing(uint32_t node_id, int64_t term, int64_t standby_term, bool resubscribe)`; minor formatting tweak to `ScheduleTimerTasks`.
Leader lifecycle & standby handling `src/fault/cc_node.cpp`	Removed IsSharedStorage() gate in `OnLeaderStart`; added KV-store `OnLeaderStop(term)` invocation with retry-on-failure; updated `OnStartFollowing` call to use new signature; enhanced logging and standby_term propagation in `OnSnapshotReceived` and `SubscribePrimaryNode`; replaced one-shot shard-sync check with a retry loop.
Standby snapshot synchronization `src/store/snapshot_manager.cpp`	Added debug logs for snapshot sync requests; replaced single `OnSnapshotSynced` RPC with a retry loop (up to 5 attempts, short RPC timeout, 1s retry sleep) and warning logs on failures.

Sequence Diagram(s)

sequenceDiagram
    participant Leader as Leader Node
    participant Store as Store Handler
    participant Standby as Standby Node

    rect rgb(240,248,255)
    Note over Leader,Store: Leader stop (NEW)
    Leader->>Store: OnLeaderStop(term)
    alt stop succeeds
        Store-->>Leader: true
    else stop fails
        Store-->>Leader: false (retry by caller)
    end
    end

    rect rgb(245,245,220)
    Note over Leader,Store: Start following (UPDATED)
    Leader->>Store: OnStartFollowing(node_id, term, standby_term, resubscribe)
    Store-->>Leader: done
    end

    rect rgb(255,250,240)
    Note over Standby,Store: Snapshot sync with retries (NEW)
    Standby->>Store: OnSnapshotSynced()
    alt RPC fails
        Store-->>Standby: error (log & retry up to 5x with 1s backoff)
        Standby->>Store: OnSnapshotSynced() (retry)
    else RPC succeeds
        Store-->>Standby: success
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Areas needing attention:
- All call sites and implementations of OnStartFollowing and the new OnLeaderStop.
- Retry logic correctness and backoff in WaitForShardSyncWithRetry and snapshot RPC retries.
- Correct propagation and usage of standby_term across snapshot and restore paths.

Possibly related PRs

Store ng leader term with ng_leader_cache. #129 — Edits src/fault/cc_node.cpp to propagate term information during leader/standby transitions; closely related to the term-aware changes here.

Suggested reviewers

liunyl
thweetkomputer

Poem

🐰 A leader bows, the term rolls on,
Standby hums and logs the song,
Retries tap at sleepy gates,
Hooks now watch those stateful fates,
Little rabbit cheers: sync strong! 🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding OnLeaderStop method to DataStoreHandler, which is a core addition highlighted in the changeset.
Description check	✅ Passed	The description includes comprehensive details of changes across multiple files, objectives, and implementation details, though it follows a custom format rather than strictly adhering to the template checklist.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch update_cluster_config

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d24514a and 4889d95.

📒 Files selected for processing (3)

include/store/data_store_handler.h (3 hunks)
src/fault/cc_node.cpp (7 hunks)
src/store/snapshot_manager.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/store/snapshot_manager.cpp

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

src/fault/cc_node.cpp

🧬 Code graph analysis (1)

include/store/data_store_handler.h (3)

src/fault/cc_node.cpp (4)

OnStartFollowing (691-715)

OnStartFollowing (691-691)

OnLeaderStop (471-520)

OnLeaderStop (471-471)

src/sharder.cpp (4)

OnStartFollowing (906-921)

OnStartFollowing (906-909)

OnLeaderStop (861-873)

OnLeaderStop (861-861)

src/remote/cc_node_service.cpp (4)

OnStartFollowing (125-145)

OnStartFollowing (125-129)

OnLeaderStop (113-123)

OnLeaderStop (113-116)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

liunyl · 2025-11-03T08:27:15Z

@CodeRabbit review

coderabbitai · 2025-11-03T08:27:25Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cursor

This PR is being reviewed by Cursor Bugbot

Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/fault/cc_node.cpp (1)

487-503: Don’t clear leader terms before the store shutdown finishes.

We still set LeaderTerm/CandidateTerm to -1 before calling store_hd_->OnLeaderStop. When the store asks for a retry (returns false), we exit with both terms cleared. A concurrent new leader can install its term in between retries, but the next attempt will immediately overwrite it with -1 again—this is the same race flagged earlier. Guard the term transitions around the store callback so we only clear them once the store finishes and the term hasn’t advanced.

-    {
-        std::lock_guard<std::mutex> lk(recovery_mux_);
-        Sharder::Instance().SetLeaderTerm(ng_id_, -1);
-        Sharder::Instance().SetCandidateTerm(ng_id_, -1);
-    }
-
-    if (!txservice_skip_kv)
-    {
-        if (!local_cc_shards_.store_hd_->OnLeaderStop(term))
-        {
-            // Keep retrying
-            return false;
-        }
-    }
+    int64_t prev_leader_term;
+    int64_t prev_candidate_term;
+    {
+        std::lock_guard<std::mutex> lk(recovery_mux_);
+        prev_leader_term = Sharder::Instance().LeaderTerm(ng_id_);
+        prev_candidate_term = Sharder::Instance().CandidateLeaderTerm(ng_id_);
+        if (prev_leader_term != term && prev_candidate_term != term)
+        {
+            return true;
+        }
+    }
+
+    if (!txservice_skip_kv)
+    {
+        if (!local_cc_shards_.store_hd_->OnLeaderStop(term))
+        {
+            return false;
+        }
+    }
+
+    {
+        std::lock_guard<std::mutex> lk(recovery_mux_);
+        if (Sharder::Instance().LeaderTerm(ng_id_) != prev_leader_term ||
+            Sharder::Instance().CandidateLeaderTerm(ng_id_) != prev_candidate_term)
+        {
+            return true;
+        }
+        Sharder::Instance().SetLeaderTerm(ng_id_, -1);
+        Sharder::Instance().SetCandidateTerm(ng_id_, -1);
+    }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7cab24 and d24514a.

📒 Files selected for processing (3)

include/store/data_store_handler.h (3 hunks)
src/fault/cc_node.cpp (7 hunks)
src/store/snapshot_manager.cpp (2 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

src/fault/cc_node.cpp

🧬 Code graph analysis (2)

src/fault/cc_node.cpp (1)

include/sharder.h (1)

node_id_ (119-156)

include/store/data_store_handler.h (3)

src/fault/cc_node.cpp (4)

OnStartFollowing (691-715)

OnStartFollowing (691-691)

OnLeaderStop (471-520)

OnLeaderStop (471-471)

src/sharder.cpp (4)

OnStartFollowing (906-921)

OnStartFollowing (906-909)

OnLeaderStop (861-873)

OnLeaderStop (861-861)

src/remote/cc_node_service.cpp (4)

OnStartFollowing (125-145)

OnStartFollowing (125-129)

OnLeaderStop (113-123)

OnLeaderStop (113-116)

coderabbitai · 2025-11-03T09:41:27Z

+                while (retry_times-- > 0)
+                {
+                    brpc::Controller cntl;
+                    cntl.set_timeout_ms(1);
+                    remote::OnSnapshotSyncedRequest on_synced_req;
+                    remote::OnSnapshotSyncedResponse on_sync_resp;
+                    on_synced_req.set_snapshot_path(req.dest_path());
+                    on_synced_req.set_standby_node_term(
+                        req.standby_node_term());
+                    stub.OnSnapshotSynced(
+                        &cntl, &on_synced_req, &on_sync_resp, nullptr);
+                    if (cntl.Failed())
+                    {
+                        LOG(WARNING) << "OnSnapshotSynced to standby node #"
+                                     << req.standby_node_id() << " failed, "
+                                     << " error: " << cntl.ErrorText()
+                                     << " error code: " << cntl.ErrorCode();
+                        // sleep 1 second and retry
+                        std::this_thread::sleep_for(std::chrono::seconds(1));
+                        continue;
+                    }
+                    else
+                    {
+                        break;
+                    }
+                }


⚠️ Potential issue | 🟠 Major

Increase the retry RPC timeout from 1 ms.

cntl.set_timeout_ms(1) gives the RPC only a microsecond-scale budget, so every attempt will almost certainly time out even when the standby is healthy. That turns the new retry loop into guaranteed log spam and prevents “best-effort” notification from ever succeeding. Please restore a realistic per-attempt timeout (e.g., 1000 ms to match the backoff cadence).

- cntl.set_timeout_ms(1); + cntl.set_timeout_ms(1000);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

while (retry_times-- > 0)

{

brpc::Controller cntl;

cntl.set_timeout_ms(1);

remote::OnSnapshotSyncedRequest on_synced_req;

remote::OnSnapshotSyncedResponse on_sync_resp;

on_synced_req.set_snapshot_path(req.dest_path());

on_synced_req.set_standby_node_term(

req.standby_node_term());

stub.OnSnapshotSynced(

&cntl, &on_synced_req, &on_sync_resp, nullptr);

if (cntl.Failed())

{

LOG(WARNING) << "OnSnapshotSynced to standby node #"

<< req.standby_node_id() << " failed, "

<< " error: " << cntl.ErrorText()

<< " error code: " << cntl.ErrorCode();

// sleep 1 second and retry

std::this_thread::sleep_for(std::chrono::seconds(1));

continue;

}

else

{

break;

}

}

while (retry_times-- > 0)

{

brpc::Controller cntl;

cntl.set_timeout_ms(1000);

remote::OnSnapshotSyncedRequest on_synced_req;

remote::OnSnapshotSyncedResponse on_sync_resp;

on_synced_req.set_snapshot_path(req.dest_path());

on_synced_req.set_standby_node_term(

req.standby_node_term());

stub.OnSnapshotSynced(

&cntl, &on_synced_req, &on_sync_resp, nullptr);

if (cntl.Failed())

{

LOG(WARNING) << "OnSnapshotSynced to standby node #"

<< req.standby_node_id() << " failed, "

<< " error: " << cntl.ErrorText()

<< " error code: " << cntl.ErrorCode();

// sleep 1 second and retry

std::this_thread::sleep_for(std::chrono::seconds(1));

continue;

}

else

{

break;

}

}

🤖 Prompt for AI Agents

In src/store/snapshot_manager.cpp around lines 249 to 274, the RPC per-attempt timeout is set to 1ms which will always time out; change cntl.set_timeout_ms(1) to a realistic value (e.g., 1000) so each retry has a 1s budget to succeed (or make it a named constant matching the retry/backoff cadence) and keep the existing retry sleep logic.

…ng up

lzxddz mentioned this pull request Oct 29, 2025

Add func to convert ng configs to dss_config eloqdata/eloqkv#249

Merged

lzxddz force-pushed the update_cluster_config branch from ec89efb to 4263417 Compare October 29, 2025 10:42

liunyl marked this pull request as ready for review November 3, 2025 08:27

cursor Bot reviewed Nov 3, 2025

View reviewed changes

Comment thread src/fault/cc_node.cpp

liunyl approved these changes Nov 3, 2025

View reviewed changes

githubzilla force-pushed the update_cluster_config branch from e4913d6 to d24514a Compare November 3, 2025 09:35

coderabbitai Bot reviewed Nov 3, 2025

View reviewed changes

lzxddz and others added 4 commits November 4, 2025 12:13

add OnLeaderStop

0c8cb85

Add OnStartFollowing

4c0956a

Retry OnSnapshotSync when the following node is in progress of spinni…

5101d4c

…ng up

Refine OnStartFollowing signature

4889d95

githubzilla force-pushed the update_cluster_config branch from d24514a to 4889d95 Compare November 4, 2025 04:14

githubzilla merged commit 07caff3 into main Nov 4, 2025
3 of 4 checks passed

coderabbitai Bot mentioned this pull request Nov 17, 2025

Change bucket num to 1024 and dss support multi-shards #199

Merged

This was referenced Dec 23, 2025

Pass term to eloqstore during start data store. #318

Merged

fix standby OnLeaderStart and FetchCatalog from primary #326

Merged

coderabbitai Bot mentioned this pull request Jan 28, 2026

Sync file cache from cloud for eloqstore on standby node #380

Merged

5 tasks

coderabbitai Bot mentioned this pull request Mar 3, 2026

Refactor standby for eloqstore cloud #405

Merged

5 tasks

coderabbitai Bot mentioned this pull request Mar 23, 2026

tx_service: guard standby term promotion with candidate check #469

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add OnLeaderStop in DataStoreHandler#181

Add OnLeaderStop in DataStoreHandler#181
githubzilla merged 4 commits into
mainfrom
update_cluster_config

lzxddz commented Oct 29, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Oct 29, 2025 •

edited

Loading

Other AI code review bot(s) detected

Uh oh!

liunyl commented Nov 3, 2025

Uh oh!

coderabbitai Bot commented Nov 3, 2025

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

lzxddz commented Oct 29, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Here are some reminders before you submit the pull request

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

liunyl commented Nov 3, 2025

Uh oh!

coderabbitai Bot commented Nov 3, 2025

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

This PR is being reviewed by Cursor Bugbot

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lzxddz commented Oct 29, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 29, 2025 •

edited

Loading