feat: normal standby sync do not let master node snapshot by thweetkomputer · Pull Request #479 · eloqdata/tx_service

thweetkomputer · 2026-04-27T02:23:30Z

Here are some reminders before you submit the pull request

- [ ] Add tests for the change
- [ ] Document changes
- [ ] Reference the link of issue using `fixes eloqdb/tx_service#issue_id`
- [ ] Reference the link of RFC if exists
- [ ] Pass `./mtr --suite=mono_main,mono_multi,mono_basic`

Summary by CodeRabbit

Improvements
- Better snapshot reload behavior with clearer reload context for standby operations
- TTL-based snapshot cleanup to free storage more predictably
- Simplified snapshot synchronization and error handling for more robust standby syncs
- Standby broadcast made non-blocking and less stateful to reduce coordination delays

coderabbitai · 2026-04-27T02:23:43Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 18a32966-0caa-41c0-a4c0-3d4f1fe4278e

📥 Commits

Reviewing files that changed from the base of the PR and between 209996f and 88c3400.

📒 Files selected for processing (1)

store_handler/eloq_data_store_service/eloqstore

✅ Files skipped from review due to trivial changes (1)

store_handler/eloq_data_store_service/eloqstore

Walkthrough

Adds a new bool from_snapshot parameter to ReloadData across datastore interfaces and implementations; updates call sites to pass the flag; converts standby snapshot lifecycle to TTL-based cleanup with an in-memory cleanup queue; gates standby task worker to ELOQSTORE builds; removes current_ckpt_ts from a protobuf response; updates an eloqstore submodule pointer.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| ReloadData Interface & Overrides: `store_handler/eloq_data_store_service/data_store.h`, `store_handler/eloq_data_store_service/eloq_store_data_store.h`, `store_handler/eloq_data_store_service/rocksdb_data_store_common.h` | Extended virtual `ReloadData` signature to add `bool from_snapshot`; default/unused handling updated. |
| ReloadData Implementations: `store_handler/eloq_data_store_service/data_store_service.cpp`, `store_handler/eloq_data_store_service/eloq_store_data_store.cpp` | Propagate and log `from_snapshot`; EloqStore conditional tagging now depends on `from_snapshot`. |
| Call Site Updates: `store_handler/data_store_service_client.cpp` | Call sites updated to invoke `ReloadData(..., false)` or `ReloadData(..., true)` depending on path. |
| Snapshot lifecycle & manager: `tx_service/src/store/snapshot_manager.cpp`, `tx_service/include/store/snapshot_manager.h` | Introduce TTL-based snapshot cleanup: cleanup queue, helper methods, expire/delete logic; adjust worker loop and idempotency handling. |
| Standby task worker gating: `tx_service/include/remote/cc_node_service.h`, `tx_service/src/remote/cc_node_service.cpp` | Enqueue/worker/thread and related state compiled only for `DATA_STORE_TYPE_ELOQDSS_ELOQSTORE`; condition-variable wait refactor. |
| RPC / Standby broadcast changes: `tx_service/include/proto/cc_request.proto`, `tx_service/src/standby.cpp` | Remove `current_ckpt_ts` from `UpdateStandbyCkptTsResponse`; simplify broadcast handler to per-RPC logging and remove aggregate ack tracking and pre-broadcast snapshot creation. |
| Minor includes & submodule: `tx_service/src/remote/cc_stream_receiver.cpp`, `store_handler/eloq_data_store_service/eloqstore` | Add `store/snapshot_manager.h` include; update eloqstore submodule commit reference. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CcNodeService
    participant SnapshotManager
    participant DataStoreService
    participant DataStore

    Client->>CcNodeService: RequestSyncSnapshot(snapshot_ts)
    alt ELOQSTORE build
        CcNodeService->>SnapshotManager: Ensure snapshot exists / track TTL
        CcNodeService->>DataStoreService: ReloadData(shard, term, snapshot_ts, true)
        DataStoreService->>DataStore: ReloadData(term, snapshot_ts, true)
        DataStore-->>DataStoreService: reload result
        DataStoreService-->>CcNodeService: result
        CcNodeService-->>Client: response
    else non-ELOQSTORE build
        CcNodeService-->>Client: noop response (current_ckpt_ts via NativeNodeGroupCkptTs)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

tx_service: guard standby term promotion with candidate check #469 — Modifies the same remote node service paths (RequestSyncSnapshot / UpdateStandbyCkptTs) and task enqueue handling.
fix snapshot coverage check for snapshot sync requests #292 — Changes SnapshotManager::SyncWithStandby logic and pending-task coverage checks overlapping snapshot lifecycle edits.
fix while loop in SyncWithStandby #280 — Adjusts SnapshotManager sync/cleanup control flow; closely related to TTL-based cleanup refactor.

Suggested reviewers

liunyl
lzxddz
MrGuin

"I hop and log where snapshots sleep,
TTL winds blow and old files leap,
A tiny flag says where it came—
From snapshot's nest or live-time flame,
Fresh reloads hum, the archives keep." 🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description only contains an unchecked submission checklist with no actual content explaining the changes, rationale, or implementation details. | Complete the checklist items and add a detailed description explaining what changed, why, and which issue/RFC it addresses. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title refers to a real aspect of the changes (preventing master node snapshots during standby sync) but lacks clarity and specificity about the actual implementation. | Consider a more descriptive title such as 'Add from_snapshot parameter to ReloadData to distinguish snapshot-sourced reloads' to better reflect the core change. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


liunyl · 2026-04-28T10:43:43Z

@CodeRabbit review

coderabbitai · 2026-04-28T10:43:52Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tx_service/src/store/snapshot_manager.cpp (2)
878-887: ⚠️ Potential issue | 🟠 Major

Keep or drain the cleanup queue on leader loss instead of clearing it.

snapshot_cleanup_queue_ is the only TTL schedule for deleting archived standby snapshots. Clearing it here forgets every snapshot already created in the previous leader term, so those archives will never be deleted until process restart runs the startup purge again. This can leak standby snapshot storage across failovers.
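One way to picture the suggestion above is a queue that hands its entries back to the caller on leader loss instead of discarding them. This is only a minimal sketch: `CleanupEntry`, `CleanupQueue`, `Track`, and `DrainAll` are hypothetical names for illustration and are not taken from `snapshot_manager.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one archived standby snapshot.
struct CleanupEntry {
  uint64_t snapshot_ts;
  std::string archive_path;
};

// Hypothetical stand-in for the in-memory cleanup schedule.
class CleanupQueue {
 public:
  void Track(uint64_t snapshot_ts, std::string archive_path) {
    queue_.push_back(CleanupEntry{snapshot_ts, std::move(archive_path)});
  }

  // Returns every tracked entry and leaves the queue empty, so the caller
  // can still delete the archives after losing leadership. A plain
  // queue_.clear() would forget them instead.
  std::vector<CleanupEntry> DrainAll() {
    std::vector<CleanupEntry> drained(queue_.begin(), queue_.end());
    queue_.clear();
    return drained;
  }

  bool Empty() const { return queue_.empty(); }

 private:
  std::deque<CleanupEntry> queue_;
};
```

A leader-loss handler could call `DrainAll()` and pass the result to whatever routine deletes archives, or simply leave the queue untouched so the existing TTL path keeps running. Which of the two fits depends on code not shown in this review.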

1062-1189: ⚠️ Potential issue | 🟠 Major

Do not mark standby sync complete before the async RPC actually succeeds.

DispatchRequestSyncSnapshotAsync() only confirms that the RPC was queued. Lines 1144-1189 still mark the term completed and erase the pending request/barrier before RequestSyncSnapshotDone::Run() tells you whether the standby accepted it. If that RPC times out or returns error=true, the callback only updates counters, so this sync can be dropped permanently with no retry state left on the primary. Please move MarkSnapshotSyncCompletedLocked() / barrier removal behind the async success path, or re-queue failures from the callback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/store/snapshot_manager.cpp` around lines 1062 - 1189, The
current code calls MarkSnapshotSyncCompletedLocked(...) and
EraseSubscriptionBarrier(...) immediately after
DispatchRequestSyncSnapshotAsync(...) returns success (notify_succ), but
DispatchRequestSyncSnapshotAsync only enqueues an async RPC; move the
completion/cleanup to the async success callback (RequestSyncSnapshotDone::Run)
so the snapshot is only marked completed when the RPC actually succeeds, or
alternatively have the callback re-queue the pending_req_ and barrier on
failure; specifically, remove/relocate the calls to
MarkSnapshotSyncCompletedLocked, EraseSubscriptionBarrierLocked and
EraseSubscriptionBarrier from the synchronous path that follows
DispatchRequestSyncSnapshotAsync (and from the non-braft channel branch), and
ensure RequestSyncSnapshotDone::Run inspects the RPC result and invokes
MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrierLocked(...) on
success (or reinserts/keeps pending_req_ and barrier on failure) so no pending
state is dropped.
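
The ordering this comment asks for can be sketched without any RPC framework: pending state is erased only when the completion callback reports success, and a failed or stale callback leaves it in place for a retry. `StandbySyncTracker` and its methods are hypothetical names for illustration. The real code would do this from `RequestSyncSnapshotDone::Run()` using `MarkSnapshotSyncCompletedLocked()` and the barrier helpers, whose signatures are not shown in this review.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical primary-side bookkeeping for standby snapshot syncs.
class StandbySyncTracker {
 public:
  void AddPending(uint32_t node_id, int64_t term) {
    std::lock_guard<std::mutex> lk(mu_);
    pending_[node_id] = term;
  }

  // Meant to be called from the async RPC completion callback, not from
  // the code path that merely queued the RPC.
  void OnRpcDone(uint32_t node_id, int64_t term, bool rpc_failed,
                 bool response_error) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = pending_.find(node_id);
    if (it == pending_.end() || it->second != term) {
      return;  // Stale callback for an older or unknown request.
    }
    if (rpc_failed || response_error) {
      return;  // Keep the pending entry so a later pass can retry.
    }
    completed_[node_id] = term;  // Success: mark complete, then clean up.
    pending_.erase(it);
  }

  bool IsPending(uint32_t node_id) const {
    std::lock_guard<std::mutex> lk(mu_);
    return pending_.count(node_id) > 0;
  }

  bool IsCompleted(uint32_t node_id, int64_t term) const {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = completed_.find(node_id);
    return it != completed_.end() && it->second == term;
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<uint32_t, int64_t> pending_;
  std::unordered_map<uint32_t, int64_t> completed_;
};
```

With this shape, a timeout or `error=true` response costs one retry rather than silently dropping the sync.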

🧹 Nitpick comments (3)

tx_service/src/remote/cc_stream_receiver.cpp (1)
39-39: Remove (or justify) the newly added snapshot_manager.h include.

tx_service/src/remote/cc_stream_receiver.cpp now includes store/snapshot_manager.h (Line 39), but in the provided file contents there’s no visible reference to SnapshotManager symbols afterward. If it isn’t needed for types used in this translation unit, please drop it to avoid extra compile-time cost and tighter coupling.
♻️ Proposed fix
-#include "store/snapshot_manager.h"
If you intentionally added it for a side effect or future use, consider adding a short comment explaining why it must remain here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/remote/cc_stream_receiver.cpp` at line 39, The new include of
store/snapshot_manager.h is unused in cc_stream_receiver.cpp; remove the include
(snapshot_manager.h) unless you actually reference SnapshotManager or other
symbols from it—if you must keep it for a side-effect or future use, add a short
comment above the include explaining the reason; otherwise delete the line to
avoid unnecessary coupling and compile-time cost.
store_handler/eloq_data_store_service/data_store_service.h (1)
680-683: Replace bool from_snapshot with a small enum before this API spreads further.

This flag changes reload semantics, but the call sites now read as ReloadData(..., false/true), which is easy to invert and hard to scan. An enum like ReloadSource::Snapshot / ReloadSource::Latest would keep the interface self-describing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store_handler/eloq_data_store_service/data_store_service.h` around lines 680
- 683, The ReloadData API uses a boolean flag (from_snapshot) that should be
replaced with a small enum to make call sites self-describing; add an enum
(e.g., enum class ReloadSource { Snapshot, Latest }) and change the signature of
ReloadData(uint32_t shard_id, int64_t ng_term, uint64_t snapshot_ts,
ReloadSource source), then update all callers of ReloadData to pass
ReloadSource::Snapshot or ReloadSource::Latest instead of true/false and adjust
any switch/if logic inside the ReloadData implementation to branch on the new
enum.
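
To illustrate the readability argument, here is a minimal sketch of an enum-based call site. `ReloadSource` follows the name proposed in the comment. `DescribeReload` is a hypothetical stand-in for `ReloadData` that only returns a description, and the reading of `Latest` as "no snapshot involved" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Proposed replacement for the bool from_snapshot flag.
enum class ReloadSource { Latest, Snapshot };

// Hypothetical stand-in for ReloadData: it only reports which path would
// be taken, so the branching on the enum is easy to see.
std::string DescribeReload(uint32_t shard_id, int64_t ng_term,
                           uint64_t snapshot_ts, ReloadSource source) {
  const std::string prefix = "shard " + std::to_string(shard_id) +
                             " term " + std::to_string(ng_term) + ": ";
  switch (source) {
    case ReloadSource::Snapshot:
      return prefix + "reload from snapshot " + std::to_string(snapshot_ts);
    case ReloadSource::Latest:
      return prefix + "reload latest data";
  }
  return prefix + "unknown source";
}
```

A call such as `DescribeReload(3, 7, 42, ReloadSource::Snapshot)` states its intent at the call site, whereas a trailing `true` does not.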
tx_service/include/store/snapshot_manager.h (1)
145-150: Use std::chrono::steady_clock for the snapshot cleanup queue.

CollectExpiredSnapshotsLocked() and NextSnapshotCleanupDeadlineLocked() treat snapshot_cleanup_queue_ as an ordered expiry queue drained from the front, making it vulnerable to wall-clock adjustments (NTP corrections, manual system time changes, leap seconds). Switch SnapshotCleanupEntry::expire_at and all related method signatures to std::chrono::steady_clock::time_point for monotonic TTL bookkeeping.

Affected: SnapshotCleanupEntry (line 171), method signatures (lines 145–150), and all call sites in snapshot_manager.cpp that compute or compare expiry deadlines.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/include/store/snapshot_manager.h` around lines 145 - 150, Change
snapshot cleanup timing to use a monotonic clock: update
SnapshotCleanupEntry::expire_at and the method signatures TrackSnapshotLocked,
CollectExpiredSnapshotsLocked, and NextSnapshotCleanupDeadlineLocked to use
std::chrono::steady_clock::time_point instead of system_clock::time_point, and
update all uses of snapshot_cleanup_queue_ comparisons/assignments in
snapshot_manager.cpp to compute and compare steady_clock::time_point values
(convert any system_clock now() calls to steady_clock::now() at the call sites
or compute steady_clock offsets consistently) so the expiry queue is monotonic
and resilient to wall-clock changes.
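
A monotonic expiry queue along the lines described above might look like the following sketch. It assumes a single fixed TTL, so insertion order equals expiry order. `SnapshotTtlQueue`, `Track`, `CollectExpired`, and `NextDeadline` are hypothetical simplifications of the `...Locked` helpers named in the comment, without the locking.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

using SteadyClock = std::chrono::steady_clock;

struct SnapshotCleanupEntry {
  uint64_t snapshot_ts;
  SteadyClock::time_point expire_at;  // Monotonic, immune to wall-clock jumps.
};

class SnapshotTtlQueue {
 public:
  explicit SnapshotTtlQueue(std::chrono::seconds ttl) : ttl_(ttl) {}

  // With a fixed TTL and a monotonic clock, entries are appended in
  // non-decreasing expiry order.
  void Track(uint64_t snapshot_ts, SteadyClock::time_point now) {
    queue_.push_back(SnapshotCleanupEntry{snapshot_ts, now + ttl_});
  }

  // Pops every entry whose deadline has passed, front first.
  std::vector<uint64_t> CollectExpired(SteadyClock::time_point now) {
    std::vector<uint64_t> expired;
    while (!queue_.empty() && queue_.front().expire_at <= now) {
      expired.push_back(queue_.front().snapshot_ts);
      queue_.pop_front();
    }
    return expired;
  }

  // Earliest pending deadline, or nullopt when nothing is tracked.
  std::optional<SteadyClock::time_point> NextDeadline() const {
    if (queue_.empty()) {
      return std::nullopt;
    }
    return queue_.front().expire_at;
  }

 private:
  std::chrono::seconds ttl_;
  std::deque<SnapshotCleanupEntry> queue_;
};
```

Passing `now` in as a parameter keeps the sketch testable. A worker loop would pass `SteadyClock::now()` and could use `NextDeadline()` as the timeout for its condition-variable wait.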

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 429916a1-9bed-4391-990c-8f3dfb938b83

📥 Commits

Reviewing files that changed from the base of the PR and between 2c6c757 and 209996f.

📒 Files selected for processing (15)

store_handler/data_store_service_client.cpp
store_handler/eloq_data_store_service/data_store.h
store_handler/eloq_data_store_service/data_store_service.cpp
store_handler/eloq_data_store_service/data_store_service.h
store_handler/eloq_data_store_service/eloq_store_data_store.cpp
store_handler/eloq_data_store_service/eloq_store_data_store.h
store_handler/eloq_data_store_service/eloqstore
store_handler/eloq_data_store_service/rocksdb_data_store_common.h
tx_service/include/proto/cc_request.proto
tx_service/include/remote/cc_node_service.h
tx_service/include/store/snapshot_manager.h
tx_service/src/remote/cc_node_service.cpp
tx_service/src/remote/cc_stream_receiver.cpp
tx_service/src/standby.cpp
tx_service/src/store/snapshot_manager.cpp

💤 Files with no reviewable changes (1)

tx_service/include/proto/cc_request.proto

thweetkomputer added 2 commits April 22, 2026 19:22

fix

8f79b94

clean

325338a

fmt

209996f

liunyl approved these changes Apr 28, 2026

coderabbitai Bot reviewed Apr 28, 2026

update

88c3400

thweetkomputer merged commit a22b77e into main Apr 30, 2026
4 checks passed

thweetkomputer merged 4 commits into main from fix-standby-snapshot-zc
