Notify checkpoint scan workers when raising concurrency by thweetkomputer · Pull Request #167 · eloqdata/tx_service

thweetkomputer · 2025-10-21T03:00:25Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Reference the link of issue using fixes eloqdb/tx_service#issue_id
Reference the link of RFC if exists
Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

Performance Improvements
- Reduced unnecessary worker wake-ups and spurious notifications during data synchronization and concurrent scanning by making notifications conditional and better synchronized, lowering CPU overhead and improving stability under high concurrency.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-10-21T03:00:35Z

Walkthrough

Adjusts data-sync notification behavior: notifications are now sent while holding the worker mutex when waking the DataSyncWorker, and DataSyncForHashPartition conditionally notifies only when an acquire-load detects an increase in scan_concurrency before storing the new value.

Changes

Cohort / File(s)	Summary
Guarded notify in worker wake-up `tx_service/src/cc/local_cc_shards.cpp`	When waking a DataSyncWorker with no pending `data_sync_tasks`, `notify_all()` is invoked while holding a `lock_guard` on the associated mutex.
Conditional notify on concurrency increase `tx_service/src/cc/local_cc_shards.cpp`	In `DataSyncForHashPartition`, when `scan_concurrency > 0` the code performs an acquire-load of the previous value, computes `need_notify` if the new value is larger, stores the new value, and calls `cv_.notify_all()` only if `need_notify` is true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Inspect acquire-load/store ordering and the compare logic for scan_concurrency.
Verify correct mutex usage around the guarded notify_all() to avoid missed wakeups.

Possibly related PRs

Make hash partition checkpoint workers use a single heap #185 — Modifies DataSyncForHashPartition in the same file and changes related notification/reporting behavior.

Suggested reviewers

liunyl

Poem

🐰 I held the lock before the call tonight,

Quiet paws keep race and ruckus light.
If counters climb, I nudge the sleeping crew,
Else gentle silence lets the threads chew. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is a template checklist with all items unchecked, lacking actual details about the changes, motivation, testing, documentation, and issue references.	Complete the checklist items by adding actual descriptions, test information, documentation changes, issue/RFC references, and confirming test suite passage.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Notify checkpoint scan workers when raising concurrency' directly describes the main change: conditionally notifying workers when scan_concurrency is increased.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat-notify-ckpt-zc

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dcbe09 and 2265b08.

📒 Files selected for processing (1)

tx_service/src/cc/local_cc_shards.cpp (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tx_service/src/cc/local_cc_shards.cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tx_service/src/cc/local_cc_shards.cpp (1)

4655-4661: Conditional notify on scan_concurrency_ increase matches intent; atomic ordering can be simplified

The new need_notify check correctly avoids spurious notify_all() calls when the concurrency doesn't increase and still wakes sleepers when it does, so behaviorally this looks good and aligns with the gating logic in DataSyncWorker.

One nit: using std::memory_order_acquire on the load while the store uses memory_order_relaxed doesn’t actually buy extra ordering here and may give a misleading sense of a release/acquire pair. You can either:

Keep the logic as-is but make both operations memory_order_relaxed for clarity, or

If you ever need a precise “old value” for need_notify, switch to an atomic RMW (e.g. exchange with memory_order_acq_rel) so the comparison is against the real previous value.

Given this is only a tuning hint for concurrency, the current code is acceptable; the above is just for clarity/semantics hygiene.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53dcec0 and 02b3b20.

📒 Files selected for processing (1)

tx_service/src/cc/local_cc_shards.cpp (1 hunks)

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-11-11T07:10:40.346Z

Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-12-02T10:43:27.431Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-10-20T04:30:07.884Z

Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tx_service/src/cc/local_cc_shards.cpp (1)
2867-2901: Remove the redundant lock; task_worker_lk already holds data_sync_worker_ctx_.mux_

At line 2856, DataSyncWorker acquires std::unique_lock<std::mutex> task_worker_lk(data_sync_worker_ctx_.mux_) and holds it through the entire while(true) loop. The new std::lock_guard<std::mutex> lk(data_sync_worker_ctx_.mux_); at lines 2891-2893 attempts to lock the same non-recursive mutex again, which will deadlock when need_notify becomes true.

Condition variable notifications may be safely issued while already holding the mutex. Remove the extra lock:
             if (need_notify)
             {
-                std::lock_guard<std::mutex> lk(data_sync_worker_ctx_.mux_);
                 data_sync_worker_ctx_.cv_.notify_all();
             }

🧹 Nitpick comments (1)

tx_service/src/cc/local_cc_shards.cpp (1)

4654-4663: Conditional notify on scan_concurrency_ looks good; consider tightening memory‑order usage

The new logic to only call notify_all() when the computed scan_concurrency exceeds the previous value makes sense and should reduce unnecessary wakeups. It also avoids ever writing 0 (which would stall workers), since the block is gated on scan_concurrency > 0.

If you want cleaner happens‑before reasoning between the producer here and the wait predicate in DataSyncWorker, you might consider:

Storing with memory_order_release, and

Loading in the wait predicate with memory_order_acquire instead of relaxed.

This isn’t a correctness blocker, more about making the concurrency story easier to reason about.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 404e31f and 7dcbe09.

📒 Files selected for processing (1)

tx_service/src/cc/local_cc_shards.cpp (2 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-12-02T10:43:27.431Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-11-11T07:10:40.346Z

Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

thweetkomputer force-pushed the feat-notify-ckpt-zc branch from fa2560c to 02b3b20 Compare December 11, 2025 03:33

coderabbitai Bot reviewed Dec 11, 2025

View reviewed changes

fix

2b11c6f

thweetkomputer force-pushed the feat-notify-ckpt-zc branch from 02b3b20 to 2b11c6f Compare December 11, 2025 06:36

thweetkomputer added 2 commits December 11, 2025 14:59

fix

404e31f

fix

7dcbe09

coderabbitai Bot reviewed Dec 11, 2025

View reviewed changes

fix

2265b08

MrGuin approved these changes Dec 11, 2025

View reviewed changes

thweetkomputer merged commit ef4c17b into main Dec 11, 2025
4 checks passed

thweetkomputer deleted the feat-notify-ckpt-zc branch December 11, 2025 07:54

coderabbitai Bot mentioned this pull request Jan 8, 2026

enhancement: tune hash partition checkpoint params, increase ckpt scan buffer, fix hash ckpt partition calc, and limit min data sync scan concurrency #347

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Notify checkpoint scan workers when raising concurrency#167

Notify checkpoint scan workers when raising concurrency#167
thweetkomputer merged 4 commits into
mainfrom
feat-notify-ckpt-zc

thweetkomputer commented Oct 21, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Oct 21, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

thweetkomputer commented Oct 21, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Here are some reminders before you submit the pull request

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

thweetkomputer commented Oct 21, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 21, 2025 •

edited

Loading