Skip to content

Notify checkpoint scan workers when raising concurrency#167

Merged
thweetkomputer merged 4 commits into
mainfrom
feat-notify-ckpt-zc
Dec 11, 2025
Merged

Notify checkpoint scan workers when raising concurrency#167
thweetkomputer merged 4 commits into
mainfrom
feat-notify-ckpt-zc

Conversation

@thweetkomputer

@thweetkomputer thweetkomputer commented Oct 21, 2025

Copy link
Copy Markdown
Collaborator

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Performance Improvements
    • Reduced unnecessary worker wake-ups and spurious notifications during data synchronization and concurrent scanning by making notifications conditional and better synchronized, lowering CPU overhead and improving stability under high concurrency.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Oct 21, 2025

Copy link
Copy Markdown

Walkthrough

Adjusts data-sync notification behavior: notifications are now sent while holding the worker mutex when waking the DataSyncWorker, and DataSyncForHashPartition conditionally notifies only when an acquire-load detects an increase in scan_concurrency before storing the new value.

Changes

Cohort / File(s) Summary
Guarded notify in worker wake-up
tx_service/src/cc/local_cc_shards.cpp
When waking a DataSyncWorker with no pending data_sync_tasks, notify_all() is invoked while holding a lock_guard on the associated mutex.
Conditional notify on concurrency increase
tx_service/src/cc/local_cc_shards.cpp
In DataSyncForHashPartition, when scan_concurrency > 0 the code performs an acquire-load of the previous value, computes need_notify if the new value is larger, stores the new value, and calls cv_.notify_all() only if need_notify is true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Inspect acquire-load/store ordering and the compare logic for scan_concurrency.
  • Verify correct mutex usage around the guarded notify_all() to avoid missed wakeups.

Possibly related PRs

Suggested reviewers

  • liunyl

Poem

🐰 I held the lock before the call tonight,

Quiet paws keep race and ruckus light.
If counters climb, I nudge the sleeping crew,
Else gentle silence lets the threads chew. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is a template checklist with all items unchecked, lacking actual details about the changes, motivation, testing, documentation, and issue references. Complete the checklist items by adding actual descriptions, test information, documentation changes, issue/RFC references, and confirming test suite passage.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Notify checkpoint scan workers when raising concurrency' directly describes the main change: conditionally notifying workers when scan_concurrency is increased.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat-notify-ckpt-zc

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dcbe09 and 2265b08.

📒 Files selected for processing (1)
  • tx_service/src/cc/local_cc_shards.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/src/cc/local_cc_shards.cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

4655-4661: Conditional notify on scan_concurrency_ increase matches intent; atomic ordering can be simplified

The new need_notify check correctly avoids spurious notify_all() calls when the concurrency doesn't increase and still wakes sleepers when it does, so behaviorally this looks good and aligns with the gating logic in DataSyncWorker.

One nit: using std::memory_order_acquire on the load while the store uses memory_order_relaxed doesn’t actually buy extra ordering here and may give a misleading sense of a release/acquire pair. You can either:

  • Keep the logic as-is but make both operations memory_order_relaxed for clarity, or
  • If you ever need a precise “old value” for need_notify, switch to an atomic RMW (e.g. exchange with memory_order_acq_rel) so the comparison is against the real previous value.

Given this is only a tuning hint for concurrency, the current code is acceptable; the above is just for clarity/semantics hygiene.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53dcec0 and 02b3b20.

📒 Files selected for processing (1)
  • tx_service/src/cc/local_cc_shards.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

2867-2901: Remove the redundant lock; task_worker_lk already holds data_sync_worker_ctx_.mux_

At line 2856, DataSyncWorker acquires std::unique_lock<std::mutex> task_worker_lk(data_sync_worker_ctx_.mux_) and holds it through the entire while(true) loop. The new std::lock_guard<std::mutex> lk(data_sync_worker_ctx_.mux_); at lines 2891-2893 attempts to lock the same non-recursive mutex again, which will deadlock when need_notify becomes true.

Condition variable notifications may be safely issued while already holding the mutex. Remove the extra lock:

             if (need_notify)
             {
-                std::lock_guard<std::mutex> lk(data_sync_worker_ctx_.mux_);
                 data_sync_worker_ctx_.cv_.notify_all();
             }
🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

4654-4663: Conditional notify on scan_concurrency_ looks good; consider tightening memory‑order usage

The new logic to only call notify_all() when the computed scan_concurrency exceeds the previous value makes sense and should reduce unnecessary wakeups. It also avoids ever writing 0 (which would stall workers), since the block is gated on scan_concurrency > 0.

If you want cleaner happens‑before reasoning between the producer here and the wait predicate in DataSyncWorker, you might consider:

  • Storing with memory_order_release, and
  • Loading in the wait predicate with memory_order_acquire instead of relaxed.

This isn’t a correctness blocker, more about making the concurrency story easier to reason about.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 404e31f and 7dcbe09.

📒 Files selected for processing (1)
  • tx_service/src/cc/local_cc_shards.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp

@thweetkomputer thweetkomputer merged commit ef4c17b into main Dec 11, 2025
4 checks passed
@thweetkomputer thweetkomputer deleted the feat-notify-ckpt-zc branch December 11, 2025 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants