feat: normal standby sync do not let master node snapshot #479

Merged: thweetkomputer merged 4 commits into main from fix-standby-snapshot-zc on Apr 30, 2026
@thweetkomputer thweetkomputer commented Apr 27, 2026

Collaborator

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Improvements
    • Better snapshot reload behavior with clearer reload context for standby operations
    • TTL-based snapshot cleanup to free storage more predictably
    • Simplified snapshot synchronization and error handling for more robust standby syncs
    • Standby broadcast made non-blocking and less stateful to reduce coordination delays

coderabbitai Bot commented Apr 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 18a32966-0caa-41c0-a4c0-3d4f1fe4278e

📥 Commits

Reviewing files that changed from the base of the PR and between 209996f and 88c3400.

📒 Files selected for processing (1)
  • store_handler/eloq_data_store_service/eloqstore
✅ Files skipped from review due to trivial changes (1)
  • store_handler/eloq_data_store_service/eloqstore

Walkthrough

Adds a new bool from_snapshot parameter to ReloadData across datastore interfaces and implementations; updates call sites to pass the flag; converts standby snapshot lifecycle to TTL-based cleanup with an in-memory cleanup queue; gates standby task worker to ELOQSTORE builds; removes current_ckpt_ts from a protobuf response; updates an eloqstore submodule pointer.

Changes

| Cohort / File(s) | Summary |
|---|---|
| ReloadData Interface & Overrides: store_handler/eloq_data_store_service/data_store.h, store_handler/eloq_data_store_service/eloq_store_data_store.h, store_handler/eloq_data_store_service/rocksdb_data_store_common.h | Extended virtual ReloadData signature to add bool from_snapshot; default/unused handling updated. |
| ReloadData Implementations: store_handler/eloq_data_store_service/data_store_service.cpp, store_handler/eloq_data_store_service/eloq_store_data_store.cpp | Propagate and log from_snapshot; EloqStore conditional tagging now depends on from_snapshot. |
| Call Site Updates: store_handler/data_store_service_client.cpp | Call sites updated to invoke ReloadData(..., false) or ReloadData(..., true) depending on path. |
| Snapshot lifecycle & manager: tx_service/src/store/snapshot_manager.cpp, tx_service/include/store/snapshot_manager.h | Introduce TTL-based snapshot cleanup: cleanup queue, helper methods, expire/delete logic; adjust worker loop and idempotency handling. |
| Standby task worker gating: tx_service/include/remote/cc_node_service.h, tx_service/src/remote/cc_node_service.cpp | Enqueue/worker/thread and related state compiled only for DATA_STORE_TYPE_ELOQDSS_ELOQSTORE; condition-variable wait refactor. |
| RPC / Standby broadcast changes: tx_service/include/proto/cc_request.proto, tx_service/src/standby.cpp | Remove current_ckpt_ts from UpdateStandbyCkptTsResponse; simplify broadcast handler to per-RPC logging and remove aggregate ack tracking and pre-broadcast snapshot creation. |
| Minor includes & submodule: tx_service/src/remote/cc_stream_receiver.cpp, store_handler/eloq_data_store_service/eloqstore | Add store/snapshot_manager.h include; update eloqstore submodule commit reference. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CcNodeService
    participant SnapshotManager
    participant DataStoreService
    participant DataStore

    Client->>CcNodeService: RequestSyncSnapshot(snapshot_ts)
    alt ELOQSTORE build
        CcNodeService->>SnapshotManager: Ensure snapshot exists / track TTL
        CcNodeService->>DataStoreService: ReloadData(shard, term, snapshot_ts, true)
        DataStoreService->>DataStore: ReloadData(term, snapshot_ts, true)
        DataStore-->>DataStoreService: reload result
        DataStoreService-->>CcNodeService: result
        CcNodeService-->>Client: response
    else non-ELOQSTORE build
        CcNodeService-->>Client: noop response (current_ckpt_ts via NativeNodeGroupCkptTs)
    end
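
The two branches of the diagram can be sketched as a compile-time switch. This is only an illustration: `HandleRequestSyncSnapshot` is a hypothetical function, and using `#if defined(...)` on `DATA_STORE_TYPE_ELOQDSS_ELOQSTORE` is an assumption about how the gating is expressed in the build.

```cpp
#include <cstdint>
#include <string>

// Hypothetical handler. The real logic lives in CcNodeService and talks to
// SnapshotManager and DataStoreService; this sketch only reports the branch.
std::string HandleRequestSyncSnapshot(uint64_t snapshot_ts)
{
#if defined(DATA_STORE_TYPE_ELOQDSS_ELOQSTORE)
    // ELOQSTORE build: the snapshot is tracked and data is reloaded from it.
    return "reload from snapshot at ts " + std::to_string(snapshot_ts);
#else
    // Other builds: no standby task worker is compiled in, so reply with a noop.
    (void)snapshot_ts;
    return "noop";
#endif
}
```

Without the macro defined, the call compiles down to the noop branch, which matches the `else` arm of the diagram.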

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • liunyl
  • lzxddz
  • MrGuin

"I hop and log where snapshots sleep,
TTL winds blow and old files leap,
A tiny flag says where it came—
From snapshot's nest or live-time flame,
Fresh reloads hum, the archives keep." 🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description only contains an unchecked submission checklist with no actual content explaining the changes, rationale, or implementation details. | Complete the checklist items and add a detailed description explaining what changed, why, and which issue/RFC it addresses. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title refers to a real aspect of the changes (preventing master node snapshots during standby sync) but lacks clarity and specificity about the actual implementation. | Consider a more descriptive title such as 'Add from_snapshot parameter to ReloadData to distinguish snapshot-sourced reloads' to better reflect the core change. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


liunyl commented Apr 28, 2026

Contributor

@CodeRabbit review

coderabbitai Bot commented Apr 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tx_service/src/store/snapshot_manager.cpp (2)

878-887: ⚠️ Potential issue | 🟠 Major

Keep or drain the cleanup queue on leader loss instead of clearing it.

snapshot_cleanup_queue_ is the only TTL schedule for deleting archived standby snapshots. Clearing it here forgets every snapshot already created in the previous leader term, so those archives will never be deleted until process restart runs the startup purge again. This can leak standby snapshot storage across failovers.
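
A minimal sketch of the difference, assuming a hypothetical `CleanupQueue` wrapper (the type, its methods, and the entry fields are illustrative and are not the actual SnapshotManager members). `ClearOnLeaderLoss` models the leaky behavior described above; `DrainOnLeaderLoss` returns every tracked archive so the caller can still delete it.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of snapshot_cleanup_queue_.
struct CleanupEntry
{
    std::string snapshot_path;
    std::chrono::steady_clock::time_point expire_at;
};

// Hypothetical queue wrapper; not the real SnapshotManager API.
class CleanupQueue
{
public:
    void Track(std::string path, std::chrono::steady_clock::time_point expire_at)
    {
        queue_.push_back({std::move(path), expire_at});
    }

    // Leaky variant: forgets every tracked archive without deleting it.
    void ClearOnLeaderLoss()
    {
        queue_.clear();
    }

    // Draining variant: hands all tracked archives to the caller, whether or
    // not they have expired, so none is left on disk without an owner.
    std::vector<std::string> DrainOnLeaderLoss()
    {
        std::vector<std::string> to_delete;
        to_delete.reserve(queue_.size());
        for (CleanupEntry &entry : queue_)
        {
            to_delete.push_back(std::move(entry.snapshot_path));
        }
        queue_.clear();
        return to_delete;
    }

    std::size_t Size() const
    {
        return queue_.size();
    }

private:
    std::deque<CleanupEntry> queue_;
};
```

Whether to delete the drained archives immediately or simply keep the queue intact until each TTL fires depends on whether a standby may still be reading them; the sketch does not decide that.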


1062-1189: ⚠️ Potential issue | 🟠 Major

Do not mark standby sync complete before the async RPC actually succeeds.

DispatchRequestSyncSnapshotAsync() only confirms that the RPC was queued. Lines 1144-1189 still mark the term completed and erase the pending request/barrier before RequestSyncSnapshotDone::Run() tells you whether the standby accepted it. If that RPC times out or returns error=true, the callback only updates counters, so this sync can be dropped permanently with no retry state left on the primary. Please move MarkSnapshotSyncCompletedLocked() / barrier removal behind the async success path, or re-queue failures from the callback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/store/snapshot_manager.cpp` around lines 1062 - 1189, The
current code calls MarkSnapshotSyncCompletedLocked(...) and
EraseSubscriptionBarrier(...) immediately after
DispatchRequestSyncSnapshotAsync(...) returns success (notify_succ), but
DispatchRequestSyncSnapshotAsync only enqueues an async RPC; move the
completion/cleanup to the async success callback (RequestSyncSnapshotDone::Run)
so the snapshot is only marked completed when the RPC actually succeeds, or
alternatively have the callback re-queue the pending_req_ and barrier on
failure; specifically, remove/relocate the calls to
MarkSnapshotSyncCompletedLocked, EraseSubscriptionBarrierLocked and
EraseSubscriptionBarrier from the synchronous path that follows
DispatchRequestSyncSnapshotAsync (and from the non-braft channel branch), and
ensure RequestSyncSnapshotDone::Run inspects the RPC result and invokes
MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrierLocked(...) on
success (or reinserts/keeps pending_req_ and barrier on failure) so no pending
state is dropped.
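
The ordering being asked for can be sketched as follows, assuming a hypothetical single-threaded `SyncTracker` (the class and its methods are illustrative; the real code would hold the SnapshotManager mutex and use `RequestSyncSnapshotDone::Run` as the callback). Dispatch only queues the RPC; the pending entry moves to completed only when the callback reports success.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hypothetical model of the primary's sync bookkeeping; names are illustrative.
class SyncTracker
{
public:
    void AddPending(int64_t standby_term)
    {
        pending_.insert(standby_term);
    }

    // Queues the RPC and returns its completion callback. Nothing is marked
    // completed here, because a queued RPC has not succeeded yet.
    std::function<void(bool)> DispatchAsync(int64_t standby_term)
    {
        return [this, standby_term](bool rpc_ok)
        { OnRpcDone(standby_term, rpc_ok); };
    }

    bool IsPending(int64_t standby_term) const
    {
        return pending_.count(standby_term) != 0;
    }

    bool IsCompleted(int64_t standby_term) const
    {
        return completed_.count(standby_term) != 0;
    }

private:
    void OnRpcDone(int64_t standby_term, bool rpc_ok)
    {
        if (!rpc_ok)
        {
            // Failure or timeout: keep the pending entry so a later round
            // can retry instead of silently dropping the sync.
            return;
        }
        pending_.erase(standby_term);
        completed_.insert(standby_term);
    }

    std::unordered_set<int64_t> pending_;
    std::unordered_set<int64_t> completed_;
};
```

The sketch omits locking and the subscription barrier; it only shows that completion state changes on the success path of the callback and nowhere else.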
🧹 Nitpick comments (3)
tx_service/src/remote/cc_stream_receiver.cpp (1)

39-39: Remove (or justify) the newly added snapshot_manager.h include.

tx_service/src/remote/cc_stream_receiver.cpp now includes store/snapshot_manager.h (Line 39), but in the provided file contents there’s no visible reference to SnapshotManager symbols afterward. If it isn’t needed for types used in this translation unit, please drop it to avoid extra compile-time cost and tighter coupling.

♻️ Proposed fix
-#include "store/snapshot_manager.h"

If you intentionally added it for a side-effect or future-use, consider adding a short comment explaining why it must remain here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/remote/cc_stream_receiver.cpp` at line 39, The new include of
store/snapshot_manager.h is unused in cc_stream_receiver.cpp; remove the include
(snapshot_manager.h) unless you actually reference SnapshotManager or other
symbols from it—if you must keep it for a side-effect or future use, add a short
comment above the include explaining the reason; otherwise delete the line to
avoid unnecessary coupling and compile-time cost.
store_handler/eloq_data_store_service/data_store_service.h (1)

680-683: Replace bool from_snapshot with a small enum before this API spreads further.

This flag changes reload semantics, but the call sites now read as ReloadData(..., false/true), which is easy to invert and hard to scan. An enum like ReloadSource::Snapshot / ReloadSource::Latest would keep the interface self-describing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store_handler/eloq_data_store_service/data_store_service.h` around lines 680
- 683, The ReloadData API uses a boolean flag (from_snapshot) that should be
replaced with a small enum to make call sites self-describing; add an enum
(e.g., enum class ReloadSource { Snapshot, Latest }) and change the signature of
ReloadData(uint32_t shard_id, int64_t ng_term, uint64_t snapshot_ts,
ReloadSource source), then update all callers of ReloadData to pass
ReloadSource::Snapshot or ReloadSource::Latest instead of true/false and adjust
any switch/if logic inside the ReloadData implementation to branch on the new
enum.
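
A small sketch of the suggested shape. The `ReloadSource` enum and its two values come from the comment above; `DescribeReload` is a hypothetical free function that stands in for the real `ReloadData` and only reports which branch would run.

```cpp
#include <cstdint>
#include <string>

// Enum proposed in the review comment.
enum class ReloadSource
{
    Snapshot,
    Latest
};

// Hypothetical stand-in for ReloadData: returns a description of the branch
// taken instead of performing a reload.
std::string DescribeReload(uint32_t shard_id,
                           uint64_t snapshot_ts,
                           ReloadSource source)
{
    switch (source)
    {
    case ReloadSource::Snapshot:
        return "shard " + std::to_string(shard_id) +
               ": reload from snapshot at ts " + std::to_string(snapshot_ts);
    case ReloadSource::Latest:
        return "shard " + std::to_string(shard_id) + ": reload latest data";
    }
    return "unknown reload source";
}
```

A call site then reads `DescribeReload(3, 100, ReloadSource::Snapshot)` rather than passing a bare `true`, so an inverted argument is visible at a glance.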
tx_service/include/store/snapshot_manager.h (1)

145-150: Use std::chrono::steady_clock for the snapshot cleanup queue.

CollectExpiredSnapshotsLocked() and NextSnapshotCleanupDeadlineLocked() treat snapshot_cleanup_queue_ as an ordered expiry queue drained from the front, making it vulnerable to wall-clock adjustments (NTP corrections, manual system time changes, leap seconds). Switch SnapshotCleanupEntry::expire_at and all related method signatures to std::chrono::steady_clock::time_point for monotonic TTL bookkeeping.

Affected: SnapshotCleanupEntry (line 171), method signatures (lines 145–150), and all call sites in snapshot_manager.cpp that compute or compare expiry deadlines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/include/store/snapshot_manager.h` around lines 145 - 150, Change
snapshot cleanup timing to use a monotonic clock: update
SnapshotCleanupEntry::expire_at and the method signatures TrackSnapshotLocked,
CollectExpiredSnapshotsLocked, and NextSnapshotCleanupDeadlineLocked to use
std::chrono::steady_clock::time_point instead of system_clock::time_point, and
update all uses of snapshot_cleanup_queue_ comparisons/assignments in
snapshot_manager.cpp to compute and compare steady_clock::time_point values
(convert any system_clock now() calls to steady_clock::now() at the call sites
or compute steady_clock offsets consistently) so the expiry queue is monotonic
and resilient to wall-clock changes.
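
A minimal sketch of monotonic TTL bookkeeping, assuming simplified free functions in place of the `...Locked` members and a hypothetical `path` field on the entry. It also assumes entries are pushed in non-decreasing `expire_at` order, which holds when a fixed TTL is added to `steady_clock::now()`.

```cpp
#include <chrono>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using SteadyTime = std::chrono::steady_clock::time_point;

// Simplified entry; the real SnapshotCleanupEntry may carry more fields.
struct SnapshotCleanupEntry
{
    std::string path;
    SteadyTime expire_at;
};

// Pops entries from the front while their deadline has passed.
std::vector<std::string> CollectExpired(std::deque<SnapshotCleanupEntry> &queue,
                                        SteadyTime now)
{
    std::vector<std::string> expired;
    while (!queue.empty() && queue.front().expire_at <= now)
    {
        expired.push_back(std::move(queue.front().path));
        queue.pop_front();
    }
    return expired;
}

// Writes the earliest deadline to `out` and returns true, or returns false
// when the queue is empty and the worker can wait without a timeout.
bool NextDeadline(const std::deque<SnapshotCleanupEntry> &queue, SteadyTime &out)
{
    if (queue.empty())
    {
        return false;
    }
    out = queue.front().expire_at;
    return true;
}
```

Because `steady_clock` never goes backwards, the front of the queue is always the next entry to expire, and a wall-clock adjustment cannot delay or trigger cleanup early.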
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 429916a1-9bed-4391-990c-8f3dfb938b83

📥 Commits

Reviewing files that changed from the base of the PR and between 2c6c757 and 209996f.

📒 Files selected for processing (15)
  • store_handler/data_store_service_client.cpp
  • store_handler/eloq_data_store_service/data_store.h
  • store_handler/eloq_data_store_service/data_store_service.cpp
  • store_handler/eloq_data_store_service/data_store_service.h
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp
  • store_handler/eloq_data_store_service/eloq_store_data_store.h
  • store_handler/eloq_data_store_service/eloqstore
  • store_handler/eloq_data_store_service/rocksdb_data_store_common.h
  • tx_service/include/proto/cc_request.proto
  • tx_service/include/remote/cc_node_service.h
  • tx_service/include/store/snapshot_manager.h
  • tx_service/src/remote/cc_node_service.cpp
  • tx_service/src/remote/cc_stream_receiver.cpp
  • tx_service/src/standby.cpp
  • tx_service/src/store/snapshot_manager.cpp
💤 Files with no reviewable changes (1)
  • tx_service/include/proto/cc_request.proto

@thweetkomputer thweetkomputer merged commit a22b77e into main Apr 30, 2026
4 checks passed
2 participants