add compaction queue, fix CkptTsCc hang bug#483
Conversation
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds distributed approximate key-count and compaction APIs (contracts → proto → service → RocksDB → client aggregation) and refactors CkptTsCc synchronization to use an atomic completion counter with a polling Wait(). ChangesStore Key Count and Compaction
CkptTsCc Checkpoint Synchronization Refactor
Sequence DiagramsequenceDiagram
participant Client as DataStoreServiceClient
participant Service as DataStoreService
participant RocksDB as RocksDBDataStoreCommon
participant Pool as WorkerPool
Client->>Service: Request per-shard operation
Service->>RocksDB: Call ApproxStoreKeyCount or CompactStore
RocksDB->>RocksDB: Check DB open / read property or CAS compact_running
RocksDB->>Pool: Submit async compaction task
Pool->>RocksDB: Run db.CompactRange and reset compact_running
Service-->>Client: Return result / CommonResult
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@store_handler/data_store_service_client.cpp`:
- Around line 618-621: The code currently treats remote failures (checking
cntl.Failed() || resp.result().error_code() != remote::DataStoreError::NO_ERROR)
as an immediate continue/skip which ignores transient owner-change errors;
update the handling in the blocks around these checks (the if at cntl/Fault and
the similar block near line 660) to call HandleShardingError(cntl, resp,
/*context info as needed*/) and use retry_limit_ to retry the request when
HandleShardingError indicates a shard-map/owner-change transient error, only
falling back to continue/fail after retries are exhausted or when
HandleShardingError deems the error permanent; ensure you reference the same
control objects (cntl, resp) and respect retry_limit_ and existing retry/backoff
semantics so behavior matches other callers that already use
HandleShardingError.
In `@tx_service/include/cc/cc_request.h`:
- Around line 3033-3040: The decrement of unfinish_cnt_ currently uses
fetch_sub(..., std::memory_order_release) which only synchronizes with the final
decrement and can leave earlier shard writes (memory_allocated_vec_,
memory_committed_vec_, heap_full_vec_, total_key_cnt_vec_, dirty_key_cnt_vec_,
standby_msg_seq_id_vec_, subscribed_node_ids_) not visible to the waiter; update
the fetch_sub call on unfinish_cnt_ to use std::memory_order_acq_rel (or
otherwise establish an acquire-release barrier) so each shard's writes performed
before fetch_sub are released and become visible to the thread that does the
corresponding acquire (e.g., the waiter in ckpt_req.Wait()); keep the decrement
on the same unfinish_cnt_ variable and use ccs.LocalCoreId() references
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 06e0baf9-af54-4556-bd13-a66f7618f34f
📒 Files selected for processing (11)
store_handler/data_store_service_client.cppstore_handler/data_store_service_client.hstore_handler/eloq_data_store_service/data_store.hstore_handler/eloq_data_store_service/data_store_service.cppstore_handler/eloq_data_store_service/data_store_service.hstore_handler/eloq_data_store_service/ds_request.protostore_handler/eloq_data_store_service/rocksdb_data_store_common.cppstore_handler/eloq_data_store_service/rocksdb_data_store_common.htx_service/include/cc/cc_request.htx_service/include/store/data_store_handler.htx_service/src/cc/cc_request.cpp
| if (cntl.Failed() || | ||
| resp.result().error_code() != remote::DataStoreError::NO_ERROR) | ||
| { | ||
| continue; |
There was a problem hiding this comment.
Handle owner-change errors with retry before skipping/failing.
At Line 618 and Line 660, remote failures are handled as immediate skip/fail. During topology changes (REQUESTED_NODE_NOT_OWNER), this can produce partial key counts or false compaction failure even though a retry after shard-map refresh would succeed. Reuse HandleShardingError(...) and retry_limit_ here.
🔧 Suggested fix sketch
for (uint32_t shard_id : shard_ids)
{
+ bool shard_ok = false;
+ for (int retry = 0; retry <= retry_limit_; ++retry)
+ {
uint32_t node_index = GetOwnerNodeIndexOfShard(shard_id);
auto *channel = dss_nodes_[node_index].Channel();
if (channel == nullptr)
{
- continue;
+ break;
}
remote::DataStoreRpcService_Stub stub(channel);
brpc::Controller cntl;
cntl.set_timeout_ms(60000);
// ... issue RPC ...
- if (cntl.Failed() ||
- resp.result().error_code() != remote::DataStoreError::NO_ERROR)
- {
- continue; // or success = false
- }
+ if (!cntl.Failed() &&
+ resp.result().error_code() == remote::DataStoreError::NO_ERROR)
+ {
+ // apply resp payload
+ shard_ok = true;
+ break;
+ }
+ if (!cntl.Failed() &&
+ resp.result().error_code() ==
+ remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
+ {
+ HandleShardingError(resp.result());
+ }
+ }
+ if (!shard_ok)
+ {
+ // ApproxStoreKeyCount: log partial-result warning
+ // CompactStore: success = false
+ }
}Also applies to: 660-663
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@store_handler/data_store_service_client.cpp` around lines 618 - 621, The code
currently treats remote failures (checking cntl.Failed() ||
resp.result().error_code() != remote::DataStoreError::NO_ERROR) as an immediate
continue/skip which ignores transient owner-change errors; update the handling
in the blocks around these checks (the if at cntl/Fault and the similar block
near line 660) to call HandleShardingError(cntl, resp, /*context info as
needed*/) and use retry_limit_ to retry the request when HandleShardingError
indicates a shard-map/owner-change transient error, only falling back to
continue/fail after retries are exhausted or when HandleShardingError deems the
error permanent; ensure you reference the same control objects (cntl, resp) and
respect retry_limit_ and existing retry/backoff semantics so behavior matches
other callers that already use HandleShardingError.
…aiters Same bug class as the CkptTsCc hang fixed in #483: a bthread::Mutex / ConditionVariable shared between CcRequest::Execute() (which runs on the brpc worker main stack via the tx processor module) and a bthread waiter can deadlock. When the worker main stack sleeps in the mutex's butex queue as a pthread waiter, an unlock may route the wake to a bthread bound to that same blocked worker; the bthread is parked in the worker's bound run queue and can never be scheduled, so both sides hang forever. Timeouts (wait_for) do not help because no timer remains pending in the end state. Convert the affected requests to the same pattern as the CkptTsCc fix: shard-side completion is tracked with atomics and the waiter polls with bthread_usleep exponential backoff. - ActiveTxMaxTsCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc: drop the mutex/cv, use an atomic counter (plus a done flag for ClearCcNodeGroup so the waiter returns only after the last core's cleanup). - WaitableCc: blocking Wait() now polls; IsFinished/IsError/ErrorCode read atomics. mux_ is kept solely for the coroutine yield/resume handshake, whose waiter is a dedicated flush-worker thread, so bthread owners never touch the butex. - DbSizeCc / RemoteDbSizeCc: atomic ref counts; Wait() keeps the give-up-on-remote-after-2s semantics. remote_ref_cnt_ is decremented before total_ref_cnt_ so the waiter never mistakes a pending local core for a missing remote response. - UploadBatchCc: the external coordination state (req_mux/req_cv/ finished_req_cnt/req_result/ng_term) becomes std::atomic references; ValidTermCheck uses a CAS instead of locking a mutex that lives on the owner bthread's stack. The CcNodeService::UploadBatch handler and the sk_generator upload path (SendIndexes, UploadIndexInternal, and the UploadBatchClosure callback, which runs on a bthread) are converted accordingly. Requests with a single in-flight execution and a single waiter (CheckTxStatusCc, WaitNoNakedBucketRefCc, UploadRangeSlicesCc, UploadBatchSlicesCc) are structurally safe and left unchanged, as are paths whose waiter is a dedicated std::thread (UpdateCceCkptTsCc, the flush-data coroutine path). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aiters (#491) * fix: avoid bthread mutex deadlock between tx processors and bthread waiters Same bug class as the CkptTsCc hang fixed in #483: a bthread::Mutex / ConditionVariable shared between CcRequest::Execute() (which runs on the brpc worker main stack via the tx processor module) and a bthread waiter can deadlock. When the worker main stack sleeps in the mutex's butex queue as a pthread waiter, an unlock may route the wake to a bthread bound to that same blocked worker; the bthread is parked in the worker's bound run queue and can never be scheduled, so both sides hang forever. Timeouts (wait_for) do not help because no timer remains pending in the end state. Convert the affected requests to the same pattern as the CkptTsCc fix: shard-side completion is tracked with atomics and the waiter polls with bthread_usleep exponential backoff. - ActiveTxMaxTsCc, ClearCcNodeGroup, EscalateStandbyCcmCc, ClearTxCc: drop the mutex/cv, use an atomic counter (plus a done flag for ClearCcNodeGroup so the waiter returns only after the last core's cleanup). - WaitableCc: blocking Wait() now polls; IsFinished/IsError/ErrorCode read atomics. mux_ is kept solely for the coroutine yield/resume handshake, whose waiter is a dedicated flush-worker thread, so bthread owners never touch the butex. - DbSizeCc / RemoteDbSizeCc: atomic ref counts; Wait() keeps the give-up-on-remote-after-2s semantics. remote_ref_cnt_ is decremented before total_ref_cnt_ so the waiter never mistakes a pending local core for a missing remote response. - UploadBatchCc: the external coordination state (req_mux/req_cv/ finished_req_cnt/req_result/ng_term) becomes std::atomic references; ValidTermCheck uses a CAS instead of locking a mutex that lives on the owner bthread's stack. The CcNodeService::UploadBatch handler and the sk_generator upload path (SendIndexes, UploadIndexInternal, and the UploadBatchClosure callback, which runs on a bthread) are converted accordingly. Requests with a single in-flight execution and a single waiter (CheckTxStatusCc, WaitNoNakedBucketRefCc, UploadRangeSlicesCc, UploadBatchSlicesCc) are structurally safe and left unchanged, as are paths whose waiter is a dedicated std::thread (UpdateCceCkptTsCc, the flush-data coroutine path). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: address CodeRabbit review comments - WaitableCc: latch the first error in AbortCcRequest with a CAS and stop storing NO_ERROR on every successful Execute, so a later success on another core can no longer erase an earlier failure. - sk_generator: same latch-first-error contract for the shared res_code in SendIndexes (channel failure path and the UploadBatchClosure callback), mirroring UploadBatchCc::SetError. - DbSizeCc: back total_obj_sizes_ with real std::atomic<int64_t> storage instead of reinterpret_cast on a std::vector<int64_t> (undefined behavior per the C++ atomic object model), and make term_ atomic since AddRemoteObjSize races IncTerm lock-free. - Backoff loops: cap the polling interval at max_interval instead of stopping one doubling short of it (also fixes the pre-existing CkptTsCc::Wait). - Fix "Waitting" typo in the dbsize timeout warning log. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * style: apply clang-format-18 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: add WaitableCc completion-counter regression tests Guards the atomic completion counter + first-error latching that replaced the bthread::Mutex + ConditionVariable handshake in the waitable CC requests (ClearTxCc / ActiveTxMaxTsCc / EscalateStandbyCcmCc / DbSizeCc / ClearCcNodeGroup share the same atomic-counter + polling Wait()). WaitableCc is the generic carrier of that pattern. Covers: concurrent multi-core completion (Wait() returns, no lost decrement), first-error-wins latching (a later success must not clear it), Reset re-arming, and a repeated-rounds stress loop. A watchdog aborts with a clear message if Wait() hangs (the deadlock the fix prevents). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Here are some reminders before you submit the pull request
fixes eloqdb/tx_service#issue_id./mtr --suite=mono_main,mono_multi,mono_basicSummary by CodeRabbit