feat: normal standby sync do not let master node snapshot#479
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
WalkthroughAdds a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant CcNodeService
participant SnapshotManager
participant DataStoreService
participant DataStore
Client->>CcNodeService: RequestSyncSnapshot(snapshot_ts)
alt ELOQSTORE build
CcNodeService->>SnapshotManager: Ensure snapshot exists / track TTL
CcNodeService->>DataStoreService: ReloadData(shard, term, snapshot_ts, true)
DataStoreService->>DataStore: ReloadData(term, snapshot_ts, true)
DataStore-->>DataStoreService: reload result
DataStoreService-->>CcNodeService: result
CcNodeService-->>Client: response
else non-ELOQSTORE build
CcNodeService-->>Client: noop response (current_ckpt_ts via NativeNodeGroupCkptTs)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tx_service/src/store/snapshot_manager.cpp (2)
878-887:⚠️ Potential issue | 🟠 MajorKeep or drain the cleanup queue on leader loss instead of clearing it.
snapshot_cleanup_queue_is the only TTL schedule for deleting archived standby snapshots. Clearing it here forgets every snapshot already created in the previous leader term, so those archives will never be deleted until process restart runs the startup purge again. This can leak standby snapshot storage across failovers.
1062-1189:⚠️ Potential issue | 🟠 MajorDo not mark standby sync complete before the async RPC actually succeeds.
DispatchRequestSyncSnapshotAsync()only confirms that the RPC was queued. Lines 1144-1189 still mark the term completed and erase the pending request/barrier beforeRequestSyncSnapshotDone::Run()tells you whether the standby accepted it. If that RPC times out or returnserror=true, the callback only updates counters, so this sync can be dropped permanently with no retry state left on the primary. Please moveMarkSnapshotSyncCompletedLocked()/ barrier removal behind the async success path, or re-queue failures from the callback.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tx_service/src/store/snapshot_manager.cpp` around lines 1062 - 1189, The current code calls MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrier(...) immediately after DispatchRequestSyncSnapshotAsync(...) returns success (notify_succ), but DispatchRequestSyncSnapshotAsync only enqueues an async RPC; move the completion/cleanup to the async success callback (RequestSyncSnapshotDone::Run) so the snapshot is only marked completed when the RPC actually succeeds, or alternatively have the callback re-queue the pending_req_ and barrier on failure; specifically, remove/relocate the calls to MarkSnapshotSyncCompletedLocked, EraseSubscriptionBarrierLocked and EraseSubscriptionBarrier from the synchronous path that follows DispatchRequestSyncSnapshotAsync (and from the non-braft channel branch), and ensure RequestSyncSnapshotDone::Run inspects the RPC result and invokes MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrierLocked(...) on success (or reinserts/keeps pending_req_ and barrier on failure) so no pending state is dropped.
🧹 Nitpick comments (3)
tx_service/src/remote/cc_stream_receiver.cpp (1)
39-39: Remove (or justify) the newly addedsnapshot_manager.hinclude.
tx_service/src/remote/cc_stream_receiver.cppnow includesstore/snapshot_manager.h(Line 39), but in the provided file contents there’s no visible reference toSnapshotManagersymbols afterward. If it isn’t needed for types used in this translation unit, please drop it to avoid extra compile-time cost and tighter coupling.♻️ Proposed fix
-#include "store/snapshot_manager.h"If you intentionally added it for a side-effect or future-use, consider adding a short comment explaining why it must remain here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tx_service/src/remote/cc_stream_receiver.cpp` at line 39, The new include of store/snapshot_manager.h is unused in cc_stream_receiver.cpp; remove the include (snapshot_manager.h) unless you actually reference SnapshotManager or other symbols from it—if you must keep it for a side-effect or future use, add a short comment above the include explaining the reason; otherwise delete the line to avoid unnecessary coupling and compile-time cost.store_handler/eloq_data_store_service/data_store_service.h (1)
680-683: Replacebool from_snapshotwith a small enum before this API spreads further.This flag changes reload semantics, but the call sites now read as
ReloadData(..., false/true), which is easy to invert and hard to scan. An enum likeReloadSource::Snapshot/ReloadSource::Latestwould keep the interface self-describing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store_handler/eloq_data_store_service/data_store_service.h` around lines 680 - 683, The ReloadData API uses a boolean flag (from_snapshot) that should be replaced with a small enum to make call sites self-describing; add an enum (e.g., enum class ReloadSource { Snapshot, Latest }) and change the signature of ReloadData(uint32_t shard_id, int64_t ng_term, uint64_t snapshot_ts, ReloadSource source), then update all callers of ReloadData to pass ReloadSource::Snapshot or ReloadSource::Latest instead of true/false and adjust any switch/if logic inside the ReloadData implementation to branch on the new enum.tx_service/include/store/snapshot_manager.h (1)
145-150: Usestd::chrono::steady_clockfor the snapshot cleanup queue.
CollectExpiredSnapshotsLocked()andNextSnapshotCleanupDeadlineLocked()treatsnapshot_cleanup_queue_as an ordered expiry queue drained from the front, making it vulnerable to wall-clock adjustments (NTP corrections, manual system time changes, leap seconds). SwitchSnapshotCleanupEntry::expire_atand all related method signatures tostd::chrono::steady_clock::time_pointfor monotonic TTL bookkeeping.Affected:
SnapshotCleanupEntry(line 171), method signatures (lines 145–150), and all call sites in snapshot_manager.cpp that compute or compare expiry deadlines.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tx_service/include/store/snapshot_manager.h` around lines 145 - 150, Change snapshot cleanup timing to use a monotonic clock: update SnapshotCleanupEntry::expire_at and the method signatures TrackSnapshotLocked, CollectExpiredSnapshotsLocked, and NextSnapshotCleanupDeadlineLocked to use std::chrono::steady_clock::time_point instead of system_clock::time_point, and update all uses of snapshot_cleanup_queue_ comparisons/assignments in snapshot_manager.cpp to compute and compare steady_clock::time_point values (convert any system_clock now() calls to steady_clock::now() at the call sites or compute steady_clock offsets consistently) so the expiry queue is monotonic and resilient to wall-clock changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tx_service/src/store/snapshot_manager.cpp`:
- Around line 1062-1189: The current code calls
MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrier(...)
immediately after DispatchRequestSyncSnapshotAsync(...) returns success
(notify_succ), but DispatchRequestSyncSnapshotAsync only enqueues an async RPC;
move the completion/cleanup to the async success callback
(RequestSyncSnapshotDone::Run) so the snapshot is only marked completed when the
RPC actually succeeds, or alternatively have the callback re-queue the
pending_req_ and barrier on failure; specifically, remove/relocate the calls to
MarkSnapshotSyncCompletedLocked, EraseSubscriptionBarrierLocked and
EraseSubscriptionBarrier from the synchronous path that follows
DispatchRequestSyncSnapshotAsync (and from the non-braft channel branch), and
ensure RequestSyncSnapshotDone::Run inspects the RPC result and invokes
MarkSnapshotSyncCompletedLocked(...) and EraseSubscriptionBarrierLocked(...) on
success (or reinserts/keeps pending_req_ and barrier on failure) so no pending
state is dropped.
---
Nitpick comments:
In `@store_handler/eloq_data_store_service/data_store_service.h`:
- Around line 680-683: The ReloadData API uses a boolean flag (from_snapshot)
that should be replaced with a small enum to make call sites self-describing;
add an enum (e.g., enum class ReloadSource { Snapshot, Latest }) and change the
signature of ReloadData(uint32_t shard_id, int64_t ng_term, uint64_t
snapshot_ts, ReloadSource source), then update all callers of ReloadData to pass
ReloadSource::Snapshot or ReloadSource::Latest instead of true/false and adjust
any switch/if logic inside the ReloadData implementation to branch on the new
enum.
In `@tx_service/include/store/snapshot_manager.h`:
- Around line 145-150: Change snapshot cleanup timing to use a monotonic clock:
update SnapshotCleanupEntry::expire_at and the method signatures
TrackSnapshotLocked, CollectExpiredSnapshotsLocked, and
NextSnapshotCleanupDeadlineLocked to use std::chrono::steady_clock::time_point
instead of system_clock::time_point, and update all uses of
snapshot_cleanup_queue_ comparisons/assignments in snapshot_manager.cpp to
compute and compare steady_clock::time_point values (convert any system_clock
now() calls to steady_clock::now() at the call sites or compute steady_clock
offsets consistently) so the expiry queue is monotonic and resilient to
wall-clock changes.
In `@tx_service/src/remote/cc_stream_receiver.cpp`:
- Line 39: The new include of store/snapshot_manager.h is unused in
cc_stream_receiver.cpp; remove the include (snapshot_manager.h) unless you
actually reference SnapshotManager or other symbols from it—if you must keep it
for a side-effect or future use, add a short comment above the include
explaining the reason; otherwise delete the line to avoid unnecessary coupling
and compile-time cost.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 429916a1-9bed-4391-990c-8f3dfb938b83
📒 Files selected for processing (15)
store_handler/data_store_service_client.cppstore_handler/eloq_data_store_service/data_store.hstore_handler/eloq_data_store_service/data_store_service.cppstore_handler/eloq_data_store_service/data_store_service.hstore_handler/eloq_data_store_service/eloq_store_data_store.cppstore_handler/eloq_data_store_service/eloq_store_data_store.hstore_handler/eloq_data_store_service/eloqstorestore_handler/eloq_data_store_service/rocksdb_data_store_common.htx_service/include/proto/cc_request.prototx_service/include/remote/cc_node_service.htx_service/include/store/snapshot_manager.htx_service/src/remote/cc_node_service.cpptx_service/src/remote/cc_stream_receiver.cpptx_service/src/standby.cpptx_service/src/store/snapshot_manager.cpp
💤 Files with no reviewable changes (1)
- tx_service/include/proto/cc_request.proto
Here are some reminders before you submit the pull request
fixes eloqdb/tx_service#issue_id./mtr --suite=mono_main,mono_multi,mono_basicSummary by CodeRabbit