Skip to content

Bound low priority reqs process time by 50 us#320

Merged
liunyl merged 2 commits into
mainfrom
bound_low_prio
Dec 25, 2025
Merged

Bound low priority reqs process time by 50 us#320
liunyl merged 2 commits into
mainfrom
bound_low_prio

Conversation

@liunyl

@liunyl liunyl commented Dec 24, 2025

Copy link
Copy Markdown
Contributor

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • New Features

    • Low-priority request queue with bounded processing for background/batch work.
    • Configurable per-run data-size hint and enforced export/data-sync limits.
    • Exposed low-priority queue size for visibility.
  • Refactor

    • Replaced std::chrono checks with a low-overhead microsecond timer for consistent cross-platform timing.
    • Rerouted retry/requeue paths to low-priority flows to better budget CPU/time for background tasks.
  • Chores

    • Initialize high-resolution timer early at startup.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Dec 24, 2025

Copy link
Copy Markdown

Walkthrough

Adds a low‑priority CC request path (enqueue, queue‑size, and time‑bounded processing), a DataSyncScanDataSize constant for data‑bounded exports, and cross‑platform ReadTimeMicroseconds()/InitializeTscFrequency() timing utilities; startup is updated to initialize the timer.

Changes

Cohort / File(s) Summary
Low-priority queue & shard APIs
tx_service/include/cc/cc_shard.h, tx_service/src/cc/cc_shard.cpp, tx_service/include/cc/local_cc_shards.h
Added ProcessLowPriorityRequests() and LowPriorityQueueSize() to CcShard; LocalCcShards delegators; implemented a time‑bounded (~50µs) low‑priority processing loop and moved background/defrag work into the low‑priority queue.
Enqueue & template map changes
tx_service/include/cc/template_cc_map.h
Replaced Enqueue(...) calls with EnqueueLowPriorityCcRequest(...) on retry/requeue paths; added export_data_size tracking limited by DataSyncScanDataSize; switched timing checks to ReadTimeMicroseconds() with wraparound handling.
TxService integration
tx_service/include/tx_service.h
RunOneRound now calls ProcessLowPriorityRequests(thd_id_), includes LowPriorityQueueSize() in busy‑round detection, and exposes ProcessLowPriorityRequests on LocalCcShards.
Request constant addition
tx_service/include/cc/cc_request.h
Added static constexpr size_t DataSyncScanDataSize = 100 * 1024 to HashPartitionDataSyncScanCc.
Timing utility & startup
tx_service/include/util.h, core/src/data_substrate.cpp
Added txservice::InitializeTscFrequency() and txservice::ReadTimeMicroseconds() with platform handling (x86_64/aarch64/chrono fallback); DataSubstrate::Start() calls the initializer.

Sequence Diagram

sequenceDiagram
    participant Tx as TxService
    participant Local as LocalCcShards
    participant Shard as CcShard
    participant LPQ as LowPriorityQueue
    participant Req as CC_Request

    rect rgb(240,248,255)
    note over Tx,Local: RunOneRound triggers low‑priority processing
    Tx->>Local: ProcessLowPriorityRequests(thd_id)
    Local->>Shard: ProcessLowPriorityRequests()
    end

    rect rgb(250,250,240)
    note over Shard,LPQ: Time‑bounded loop (~50µs) using ReadTimeMicroseconds()
    loop while time budget && queue not empty
      Shard->>LPQ: Dequeue()
      LPQ-->>Shard: Req
      Shard->>Req: Execute()
      Req-->>Shard: Completed / Free
      Shard->>Shard: update counters / check time (ReadTimeMicroseconds)
    end
    end

    Shard-->>Local: processed_count
    Local-->>Tx: processed_count
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • thweetkomputer
  • MrGuin

Poem

🐰 I nibble bytes in tiny hops, timed tight,
Low‑priority carrots glowing in the night.
Microsecond beats guide each small quest,
Requeue, rest, then onward — code at best.
A happy hop for the merge request.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description only contains an unchecked template checklist with no actual implementation details, rationale, or explanations of the changes. Add substantive description explaining the motivation, changes made, and why the 50-microsecond bound was chosen. Complete or address the checklist items with actual references or explanations.
Docstring Coverage ⚠️ Warning Docstring coverage is 10.34% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: introducing a time bound (50 microseconds) for low-priority request processing, which aligns with the core modifications across multiple files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bound_low_prio

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e51f30d and 17e3bdf.

📒 Files selected for processing (3)
  • core/src/data_substrate.cpp
  • tx_service/include/tx_service.h
  • tx_service/include/util.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/src/data_substrate.cpp
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/tx_service.h
  • tx_service/include/util.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/tx_service.h
  • tx_service/include/util.h
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/include/tx_service.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/tx_service.h
🧬 Code graph analysis (1)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
  • metrics (103-106)
🔇 Additional comments (6)
tx_service/include/util.h (4)

28-41: LGTM: Includes are appropriate for the timing utilities.

The new includes support the cross-platform timing implementation. The conditional inclusion of x86intrin.h is correctly guarded for x86_64 platforms.


479-482: LGTM: Global state declarations are correct.

The use of inline (without static) ensures a single shared instance across all translation units, with proper atomic and once_flag semantics for thread-safe initialization.


489-572: LGTM: TSC frequency initialization is well-designed.

The implementation properly handles platform-specific measurement:

  • x86_64: Calibrates TSC frequency with stability checks
  • aarch64: Reads hardware counter frequency directly

The use of std::call_once ensures thread-safe single initialization. The fallback value of 2000 (2 GHz) on x86_64 is reasonable for rare measurement failures.


580-619: LGTM: ReadTimeMicroseconds implementation is correct and portable.

The function provides low-overhead timing with appropriate platform optimizations:

  • x86_64/aarch64: Direct hardware counter reads with division (avoids overflow)
  • Fallback: std::chrono for portability

The lazy initialization pattern with __builtin_expect hint optimizes the common case. The division-based approach on aarch64 correctly avoids the overflow issue mentioned in previous reviews.

tx_service/include/tx_service.h (2)

398-400: LGTM: Busy round detection correctly accounts for both queue types.

The busy round threshold check now includes both regular and low-priority queue sizes, ensuring metrics capture the full workload.


406-413: LGTM: Low-priority request processing is properly integrated.

The processing flow is correct:

  1. Regular requests are processed in the main loop (3 iterations)
  2. Low-priority requests are processed once afterward
  3. All processed requests count toward req_cnt for metrics

This aligns with the PR objective of processing low-priority requests after regular requests with time bounds.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@liunyl liunyl requested a review from MrGuin December 24, 2025 05:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
tx_service/include/util.h (1)

32-35: Guard x86 intrinsics for your actual platform targets

#if defined(__x86_64__) || defined(_M_X64) plus #include <x86intrin.h> is fine for GCC/Clang on Linux, but _M_X64 with MSVC typically expects <intrin.h> instead. If Windows/MSVC is (or might become) a target, consider tightening the condition to the compilers you actually support or adding a separate #ifdef _MSC_VER include path.

tx_service/include/cc/cc_request.h (1)

3715-3717: New DataSyncScanDataSize constant looks fine; consider documenting units/intent

The DataSyncScanDataSize = 100 * 1024 bound is reasonable for a batch limit and aligns with the “100KB” comment. If this constant is used as a hard limit in TemplateCcMap export loops, consider:

  • Making the “bytes” unit explicit in the name or comment (e.g., // 100 KiB in bytes) and
  • Ensuring it’s either configurable or at least referenced in any tuning doc, since it directly affects low‑priority export pacing.
tx_service/include/tx_service.h (1)

391-407: Integrating low-priority queue into busy-round metrics and req count looks correct

Including LowPriorityQueueSize(thd_id_) in the busy-round threshold and adding ProcessLowPriorityRequests(thd_id_) into req_cnt matches the intent of treating low-priority work as part of overall load while still bounding its per-round processing time.

Minor nit: the comment on Line 400 is now misleading – low-priority requests are processed after the for-loop, not “already processed above.” Consider updating it to avoid confusion.

tx_service/src/cc/cc_shard.cpp (1)

554-619: Time-bounded low-priority loop looks correct and thread-safe

The new ProcessLowPriorityRequests():

  • Uses the existing low_priority_cc_queue_ plus low_priority_cc_queue_size_ counter in a manner consistent with the main queue.
  • Breaks on empty queue, dequeue failure, or after ~50µs based on ReadTimeMicroseconds(), so low-priority work cannot monopolize the tx-processor thread.
  • Preserves the Execute / Free lifecycle identical to ProcessRequests().

Given only the tx-processor thread consumes this queue and all producers only enqueue, the relaxed loads and fetch_sub pattern are safe here.

If you find that individual low-priority requests are expensive, you might later consider checking elapsed time less frequently (e.g., every N requests) to reduce ReadTimeMicroseconds() overhead, but it’s fine as-is for correctness and the stated 50µs bound.

tx_service/include/cc/template_cc_map.h (3)

5836-6053: Time-bounding hash-partition data sync and tracking export size look correct; consider minor polish

The additions in HashPartitionDataSyncScanCc::Execute all look consistent with the intent:

  • export_data_size plus the export_data_size < HashPartitionDataSyncScanCc::DataSyncScanDataSize guard cleanly caps bytes exported per call without affecting overall progress (which still uses accumulated_scan_cnt_ / scan_batch_size_ and pause position).
  • The 50µs time-slice via ReadTimeMicroseconds() (with wrap check l_now < l_start) is safe and only causes an early loop break; state (it, pause key, counters) is preserved and the request is re-enqueued to continue.
  • Re-enqueuing via EnqueueLowPriorityCcRequest(&req) is appropriate for this background task.

Two small refinements you might consider:

  1. The time-slice check is gated on export_data_cnt (incremented only in the non-VersionedRecord branch). If VersionedRecord == true also appears in production, that path will still be governed only by DataSyncScanBatchSize, not the 50µs cap. If you want a uniform 50µs bound, using scan_cnt instead of export_data_cnt (or incrementing export_data_cnt in both branches) would be clearer.

  2. export_data_size is size_t while flush_data_size / accumulated_flush_data_size_ are uint64_t; aligning these types (e.g., making export_data_size a uint64_t) would avoid any surprise on non-64-bit builds.

You might also want to hoist the literal 50 into a named constant shared with the other time-sliced loops to make the policy easier to tune.


6099-6204: DefragShardHeapCc now time-sliced & low-priority; behavior preserved with minor nits

In both phases of DefragShardHeapCc::Execute:

  • Introducing auto l_start = ReadTimeMicroseconds(); with a l_now - l_start >= 50 || l_now < l_start break every 4 successful defrags bounds per-call work without affecting correctness; resume state (pause_pos, ccmp_key_defraged) is still correctly updated before re-enqueue.
  • Swapping Enqueue(...) for EnqueueLowPriorityCcRequest(&req) ensures this maintenance work runs on the low-priority path, which matches the PR’s goal.

Optional cleanups:

  • The time-slice pattern here is almost identical to the one in HashPartitionDataSyncScanCc and KickoutCcEntryCc; a small helper (or at least a shared named constant for the 50µs budget) would reduce duplication and keep the policy centralized.
  • Because the time check is conditioned on current_defrag_cnt > 0, a batch that scans but doesn’t actually defrag anything can still run for a full scan_batch_size without the 50µs guard. If you want a stricter wall-clock bound, basing the check on scan_cnt instead would handle that edge case.

Current behavior is still safe; these are purely maintainability/performance-polish suggestions.

Also applies to: 6228-6239, 6335-6336


7273-7287: KickoutCcEntryCc loop is now bounded in wall-clock time

The added ReadTimeMicroseconds() sampling in KickoutCcEntryCc::Execute cleanly caps each kickout batch: after every two cleaned pages, the loop breaks if more than ~50µs have elapsed (or on wrap). This integrates well with existing resume logic (resume key and re-enqueue) and does not alter termination conditions.

If you end up tuning the 50µs budget elsewhere, consider promoting that literal to a shared constant to avoid diverging values across different maintenance loops.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84f8234 and 2b66e65.

📒 Files selected for processing (7)
  • tx_service/include/cc/cc_request.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/util.h
  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/tx_service.h
  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/template_cc_map.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/tx_service.h
  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/template_cc_map.h
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/tx_service.h
  • tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/tx_service.h
  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/cc_request.h
  • tx_service/include/cc/template_cc_map.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (4)
tx_service/include/cc/local_cc_shards.h (2)
tx_service/src/cc/cc_shard.cpp (2)
  • ProcessLowPriorityRequests (554-619)
  • ProcessLowPriorityRequests (554-554)
tx_service/include/cc/cc_shard.h (1)
  • LowPriorityQueueSize (465-468)
tx_service/include/cc/cc_shard.h (2)
tx_service/include/cc/local_cc_shards.h (2)
  • ProcessLowPriorityRequests (244-247)
  • LowPriorityQueueSize (254-257)
tx_service/src/cc/cc_shard.cpp (2)
  • ProcessLowPriorityRequests (554-619)
  • ProcessLowPriorityRequests (554-554)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
  • metrics (103-106)
tx_service/src/cc/cc_shard.cpp (1)
tx_service/include/cc/local_cc_shards.h (1)
  • ProcessLowPriorityRequests (244-247)
🔇 Additional comments (9)
tx_service/include/cc/cc_shard.h (2)

379-382: Low-priority processing API is consistent with existing design

Adding ProcessLowPriorityRequests() directly beside ProcessRequests() keeps the interface clean and matches how LocalCcShards delegates per-shard work. No issues from an API or threading perspective.


460-468: LowPriorityQueueSize accessor correctly mirrors QueueSize

LowPriorityQueueSize() reuses the same relaxed-load pattern as QueueSize(), so callers (busy-round detection, IsIdle, etc.) get a consistent view of both queues.

tx_service/include/cc/local_cc_shards.h (2)

239-247: Delegation of low-priority processing to per-shard CcShard is straightforward

ProcessLowPriorityRequests(size_t thd_id) follows the same pattern as ProcessRequests and keeps all per-shard logic inside CcShard. This is the right level of abstraction.


249-257: LowPriorityQueueSize wrapper is aligned with other shard accessors

LowPriorityQueueSize(thd_id) is a thin, consistent wrapper around the shard’s own accessor, making it easy for higher layers (TxProcessor) to treat both queues uniformly.

tx_service/src/cc/cc_shard.cpp (1)

2643-2645: DefragShardHeapCc correctly moved onto low-priority queue

Scheduling defrag_heap_cc_ via EnqueueLowPriorityCcRequest(defrag_heap_cc_.get()) aligns defragmentation with the new low-priority processing path, so shard heap defrag now adheres to the same 50µs-per-round budget instead of competing with foreground CC work on the main queue.

tx_service/include/cc/template_cc_map.h (4)

5709-5714: Route range-partition data sync scans through low-priority queue

Switching RangePartitionDataSyncScanCc re-enqueue from Enqueue(&req) to EnqueueLowPriorityCcRequest(&req) matches the “low-priority bounded work” goal and doesn’t change completion semantics (pause state and finish conditions are unchanged). No issues spotted here.


8088-8103: ScanSliceDeltaSizeCcForRangePartition retries and continuations moved to low-priority queue

Within ScanSliceDeltaSizeCcForRangePartition::Execute:

  • On RangeSliceOpStatus::Retry in pin_range_slice, switching from Enqueue(LocalCoreId, &req) to EnqueueLowPriorityCcRequest(LocalCoreId, &req) ensures repeated pin attempts don’t compete with high-priority CC work.
  • At the end of the scan, when not finished, using EnqueueLowPriorityCcRequest(&req) instead of Enqueue(&req) keeps subsequent scan batches on the low-priority path; finish vs. resume semantics are unchanged.

Assuming EnqueueLowPriorityCcRequest(core_id, ...) preserves the same per-core routing guarantees as Enqueue(core_id, ...), this looks correct and aligned with the PR objective.

Also applies to: 8188-8194


8338-8350: Hash-partition delta-size scans now resume via low-priority queue

The only change in ScanDeltaSizeCcForHashPartition::Execute is to re-enqueue unfinished work via EnqueueLowPriorityCcRequest(&req) instead of the normal Enqueue(&req). The finish path is untouched, and pause-key handling remains the same, so semantics are preserved while ensuring this accounting job runs at reduced priority.


8453-8456: Sampling sub-range keys: background continuation correctly downgraded to low priority

For SampleSubRangeKeysCc::Execute, when the scan doesn’t reach end_it, the request is now re-enqueued through EnqueueLowPriorityCcRequest(&req) instead of the main queue. This is appropriate for an analysis/sampling task and doesn’t change how pause position or completion are determined.

Comment thread tx_service/include/util.h

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tx_service/include/util.h (1)

446-472: Fix ReadTimeMicroseconds() accuracy on aarch64 and optionally refine x86/calibration

The current implementation has a correctness problem on aarch64 and a couple of portability/accuracy nits:

  • aarch64 scaling is wrong: cntvct_el0 increments at a platform‑defined frequency (readable from cntfrq_el0), not at ~2 GHz. Dividing ticks by the hard‑coded APPROX_CYCLES_PER_MICROSECOND = 2000 will mis-estimate microseconds by a large factor on typical ARM systems, so the 50 µs processing window will be far off (often an order of magnitude). This breaks the intent of the bound even if it “fails safe”.
  • Better pattern for aarch64: Keep using the architectural timer and convert with its real frequency, but avoid overflow via a wider intermediate as previously suggested. That gives an accurate, monotonic µs reading with negligible extra cost on ARM.
  • x86_64 approximation & MSVC nit (optional):
    • The 2000 cycles/µs constant assumes ~2 GHz invariant TSC; deployments at 2.5–3.5 GHz will see time bounds off by ~25–75%. If this matters, consider calibrating once at startup versus steady_clock and caching the ratio, or at least document the assumption.
    • For _M_X64 with MSVC, <x86intrin.h> may not be available; __rdtsc() normally lives in <intrin.h>. A small #ifdef _MSC_VER wrapper would make this more portable.

A concrete aarch64 fix that keeps the low‑overhead path while restoring correct scaling:

Proposed aarch64 conversion using cntfrq_el0 with 128‑bit intermediate
 static inline uint64_t ReadTimeMicroseconds()
 {
     // Approximate conversion: assumes ~2GHz CPU (1μs = 2000 cycles)
     // This is approximate but acceptable per requirements
     static constexpr uint64_t APPROX_CYCLES_PER_MICROSECOND = 2000;
 #if defined(__x86_64__) || defined(_M_X64)
@@
-#elif defined(__aarch64__)
-    // Read ARM virtual counter - returns timer ticks
-    uint64_t ticks;
-    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks));
-    return ticks / APPROX_CYCLES_PER_MICROSECOND;
+#elif defined(__aarch64__)
+    // Read ARM virtual counter and its frequency, then convert ticks→microseconds.
+    uint64_t ticks;
+    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks));
+    uint64_t freq_hz;
+    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(freq_hz));
+    // ticks * 1'000'000 / freq_hz, in a widened intermediate to avoid overflow.
+    __uint128_t us =
+        (static_cast<__uint128_t>(ticks) * 1000000ULL) / freq_hz;
+    return static_cast<uint64_t>(us);
 #else

If __uint128_t is not available on one of your ARM toolchains, we can instead rearrange the math to divide first where possible and keep it in 64‑bit.

Optionally, to improve x86 portability to MSVC:

Optional: make the x86 include MSVC‑friendly
-#if defined(__x86_64__) || defined(_M_X64)
-#include <x86intrin.h>  // For __rdtsc()
-#endif
+#if defined(__x86_64__) || defined(_M_X64)
+#  if defined(_MSC_VER)
+#    include <intrin.h>
+#  else
+#    include <x86intrin.h>  // For __rdtsc()
+#  endif
+#endif
🧹 Nitpick comments (4)
tx_service/include/cc/template_cc_map.h (4)

5836-5854: HashPartitionDataSyncScanCc: time & data-size bounding logic is sound; consider tightening types

The new export_data_size accumulator and the ReadTimeMicroseconds()-based 50µs guard give this handler both a per-call time budget and a per-batch data-size budget, on top of the existing DataSyncScanBatchSize and scan_batch_size_ limits. The wrap-around guard (l_now < l_start) is harmless and ensures we bail out on any clock anomaly.

One small refinement you might consider: flush_data_size and HashPartitionDataSyncScanDataSize are conceptually byte counts; if they are wider than size_t on any supported platform, using uint64_t for export_data_size would avoid any narrowing in 32‑bit builds. Behavior is otherwise correct, including the re-enqueue via EnqueueLowPriorityCcRequest(&req) when only the size/time budget (not data exhaustion) was hit.

Also applies to: 5903-5907, 5995-5999, 6050-6055


6099-6112: DefragShardHeapCc: 50µs slicing + low-priority scheduling match the PR goal

Both defrag phases (ccmap keys and pages) now use ReadTimeMicroseconds() to cap work at ~50µs per pass (checked every few successful defrags) and re-enqueue via EnqueueLowPriorityCcRequest(&req). That achieves the intended latency bounding while still honoring scan_batch_size_ and the pause_pos state machine.

Just ensure the low-priority queue is serviced frequently enough in the scheduler so long-running defrag sequences can still make steady progress; the logic here otherwise looks consistent.

Also applies to: 6202-6203, 6230-6242, 6335-6336


7273-7289: KickoutCcEntryCc: new 50µs budget is fine; verify desired priority class

Using ReadTimeMicroseconds() to break after ~50µs of work in the kickout loop (checked every couple of cleaned pages) keeps eviction from monopolizing a TxProcessor. Unlike the data-sync/defrag handlers, this still re-enqueues on the normal queue, which is reasonable if you want TTL/memory-pressure cleanup to retain higher priority.

If the intent is to classify all long-running maintenance work as “low priority”, you could also consider moving this to EnqueueLowPriorityCcRequest, but that’s more of a policy choice than a correctness issue.


8091-8094: Routing scan-delta and sampling requests to low-priority queue aligns with background nature

ScanSliceDeltaSizeCcForRangePartition, ScanDeltaSizeCcForHashPartition, and SampleSubRangeKeysCc now reschedule via EnqueueLowPriorityCcRequest(...) instead of the main Enqueue(...). Given these handlers are used for data-size accounting and statistics rather than foreground transaction processing, demoting them to the low-priority queue is consistent with the stated objective of bounding impact on critical paths.

Please just confirm there are no callers depending on strict interleaving with high-priority CC work (e.g., expecting these to complete before subsequent foreground operations), since low-priority scheduling can relax those ordering assumptions.

Also applies to: 8192-8194, 8455-8456

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b66e65 and bf37013.

📒 Files selected for processing (7)
  • tx_service/include/cc/cc_request.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/include/cc/cc_shard.h
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/tx_service.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/tx_service.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/include/cc/local_cc_shards.h
  • tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/tx_service.h
  • tx_service/include/cc/cc_request.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (3)
tx_service/include/cc/local_cc_shards.h (2)
tx_service/src/cc/cc_shard.cpp (2)
  • ProcessLowPriorityRequests (554-620)
  • ProcessLowPriorityRequests (554-554)
tx_service/include/cc/cc_shard.h (1)
  • LowPriorityQueueSize (465-468)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
  • metrics (103-106)
tx_service/src/cc/cc_shard.cpp (1)
tx_service/include/cc/local_cc_shards.h (1)
  • ProcessLowPriorityRequests (244-247)
🔇 Additional comments (7)
tx_service/include/cc/local_cc_shards.h (2)

244-247: LGTM! Clean delegation following established patterns.

The implementation correctly delegates low-priority request processing to the per-thread shard, following the exact same pattern as ProcessRequests(). The actual 50μs time-bound logic is appropriately handled at the CcShard level.


254-257: LGTM! Consistent with existing queue size accessor.

The implementation correctly mirrors the pattern of QueueSize() and properly delegates to the per-thread shard's low-priority queue size accessor.

tx_service/include/cc/cc_request.h (1)

3711-3717: New DataSyncScanDataSize constant looks good

Constant is clearly named, scoped, and sized (100 * 1024 bytes) alongside the existing batch-size knob; no issues from this header-side addition.

tx_service/include/util.h (1)

24-34: Header additions align with new timing helper

Including <chrono> and gating <x86intrin.h> on x86_64/_M_X64 matches the new ReadTimeMicroseconds implementation; no problems here from a header/ODR standpoint.

tx_service/src/cc/cc_shard.cpp (2)

2645-2645: LGTM!

Routing defragmentation work to the low-priority queue is appropriate since it's a background maintenance task that shouldn't impact latency-sensitive operations.


554-620: Implementation is correct for the 50µs time-bounded processing.

The one-at-a-time dequeue approach provides fine-grained timing control compared to the bulk dequeue in ProcessRequests(). The wraparound handling at lines 607-616 is conservative but safe.

ReadTimeMicroseconds() (in tx_service/include/util.h:452-467) uses __rdtsc() on x86_64 and ARM virtual counter on aarch64—both single-instruction operations with minimal overhead. It divides CPU cycles by 2000 to approximate microseconds (assuming ~2GHz CPU), which provides sufficient precision for the 50µs bound. The implementation explicitly notes this approximation is "acceptable per requirements" and is consistently used throughout the codebase with the same 50µs threshold.

tx_service/include/cc/template_cc_map.h (1)

5703-5715: Re-enqueue RangePartitionDataSyncScanCc via low-priority queue looks appropriate

Switching the reschedule path to shard_->EnqueueLowPriorityCcRequest(&req); keeps this background range data-sync work off the main CC queue while still respecting the existing batch limits (DataSyncScanBatchSize and scan_batch_size_). Semantics of pause/resume and PausePos handling remain unchanged, so this looks safe.

Comment thread tx_service/include/tx_service.h

@MrGuin MrGuin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

butil::cpuwide_time_ns has a similar implementation based on cpu cycles, which is enabled only in BAIDU_INTERNAL. Outside baidu, it just uses clock_gettime. It seems there are several cpu frequency issues they want to get rid of.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
tx_service/include/tx_service.h (1)

406-407: Incorrect comment on line 406.

The comment states "low priority already processed above" but the low-priority processing actually happens below on lines 410-413. This was already flagged in a previous review.

tx_service/include/util.h (1)

608-611: Critical: Syntax error in aarch64 path breaks compilation.

Line 609-610 has a malformed namespace qualifier: std: memory_order_acquire should be std::memory_order_acquire.

🔎 Proposed fix
         cycles_per_us =
-            tsc_cycles_per_microsecond_.load(std
-                                             : memory_order_acquire);
+            tsc_cycles_per_microsecond_.load(std::memory_order_acquire);
🧹 Nitpick comments (4)
tx_service/include/cc/template_cc_map.h (4)

5845-5855: HashPartitionDataSyncScanCc: bounded by both time and exported data size

  • export_data_size + HashPartitionDataSyncScanCc::DataSyncScanDataSize adds a clear data‑volume cap per execute, and you consistently accumulate flush_data_size in both Versioned and non‑Versioned branches.
  • Replacing chrono-based timing with ReadTimeMicroseconds() and breaking when l_now - l_start >= 50 || l_now < l_start gives a defensible 50µs budget with simple wrap‑around protection.
  • Requeueing with EnqueueLowPriorityCcRequest(&req) keeps this maintenance work off the normal CC queue.

Consider (non‑blocking) making export_data_size a uint64_t to match flush_data_size / DataSyncScanDataSize, and lifting the hardcoded 50 into a named constant shared across these low‑prio loops.

Also applies to: 5857-5864, 5913-5922, 6005-6012, 6061-6063


6110-6121: DefragShardHeapCc (key defrag): time‑bounded iterations + low‑priority requeue

The added ReadTimeMicroseconds() budget and early break on >= 50µs (with l_now < l_start guard) nicely caps each defrag slice, and switching to EnqueueLowPriorityCcRequest(&req) keeps this maintenance task low‑impact. Logic for pause/resume via pause_pos is untouched and still correct.

Also applies to: 6212-6213


6240-6251: DefragShardHeapCc (page/entry defrag): consistent low‑prio, time‑capped behavior

Mirroring the key-defrag branch, this loop now:

  • Uses ReadTimeMicroseconds() with a 50µs cap checked every few effective operations.
  • Requeues via EnqueueLowPriorityCcRequest(&req).

This keeps long page/entry defrag passes from monopolizing a TxProcessor core while preserving existing pause/resume semantics.

Also applies to: 6245-6246


7282-7283: KickoutCcEntryCc: add 50µs guard around per‑batch page cleaning

The new ReadTimeMicroseconds() start/now checks in the kickout loop bound how long a single KickoutCcEntryCc execution can run, on top of the existing KickoutPageBatchSize limit. Breaking before CleanPageAndReBalance on time‑out is safe because ccp still points at the next page and resume_key is derived from that page’s FirstKey().

Also applies to: 7291-7297

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf37013 and 669511a.

📒 Files selected for processing (8)
  • core/src/data_substrate.cpp
  • tx_service/include/cc/cc_request.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • tx_service/include/cc/local_cc_shards.h
  • tx_service/include/cc/cc_shard.h
  • tx_service/include/cc/cc_request.h
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/util.h
  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/tx_service.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/src/cc/cc_shard.cpp
  • tx_service/include/cc/template_cc_map.h
  • tx_service/include/tx_service.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

  • tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (3)
core/src/data_substrate.cpp (1)
tx_service/include/util.h (1)
  • InitializeTscFrequency (492-575)
tx_service/src/cc/cc_shard.cpp (2)
tx_service/include/cc/local_cc_shards.h (1)
  • ProcessLowPriorityRequests (252-255)
tx_service/include/util.h (1)
  • ReadTimeMicroseconds (583-623)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
  • metrics (103-106)
🔇 Additional comments (10)
core/src/data_substrate.cpp (1)

38-40: LGTM!

The TSC frequency initialization is appropriately placed early in the Start() sequence, ensuring the timing infrastructure is calibrated before any time-bounded processing occurs. The std::call_once inside InitializeTscFrequency() makes this safe to call even if the init_state check fails later.

Also applies to: 139-140

tx_service/include/tx_service.h (2)

398-400: LGTM!

The busy round threshold check correctly sums both regular and low-priority queue sizes to determine if metrics collection is warranted.


410-413: LGTM!

Low-priority requests are correctly processed after regular requests, and the count is appropriately added to the total req_cnt.

tx_service/src/cc/cc_shard.cpp (2)

564-630: LGTM - Time-bounded low-priority processing implementation.

The implementation correctly bounds processing time to 50 microseconds:

  • Uses ReadTimeMicroseconds() for low-overhead timing
  • Conservative wraparound handling (exits on detected wraparound)
  • Per-request dequeue is appropriate for a best-effort low-priority path

One minor observation: the time check happens after processing each request, so the actual elapsed time may slightly exceed 50µs by the duration of the last processed request. This is acceptable for a soft bound.


2706-2706: LGTM!

Routing defragmentation work through the low-priority queue is appropriate since defrag is a background maintenance operation that shouldn't compete with regular request processing.

tx_service/include/util.h (1)

492-575: Well-designed TSC frequency calibration.

The InitializeTscFrequency() implementation is solid:

  • Uses std::call_once for thread-safe one-time initialization
  • Retry loop with stability threshold (1%) ensures accurate measurement
  • Graceful fallback if stable measurement cannot be obtained
  • aarch64 path correctly reads cntfrq_el0 for timer frequency
tx_service/include/cc/template_cc_map.h (4)

5719-5721: RangePartitionDataSyncScanCc now correctly requeues via low‑priority CC queue

Using EnqueueLowPriorityCcRequest(&req) for RangePartitionDataSyncScanCc matches its background nature and keeps it from competing with foreground CC work; the control flow (return false) is unchanged otherwise.


8098-8104: RangePartitionDataSyncScanCc: retries and continuation moved to low‑priority queue

Both:

  • The RangeSliceOpStatus::Retry path, and
  • The normal continuation path at the end of Execute(ScanSliceDeltaSizeCcForRangePartition &req)

now use EnqueueLowPriorityCcRequest(...) instead of the main queue. Given these are data‑sync / delta‑size maintenance scans, this is an appropriate downgrade in scheduling priority and should help bound impact on foreground CC traffic.

Also applies to: 8203-8204


8359-8360: HashPartition delta‑size scan: requeue on low‑priority CC queue

Switching the re‑enqueue from Enqueue to EnqueueLowPriorityCcRequest(&req) for ScanDeltaSizeCcForHashPartition keeps this accounting/estimation work out of the hot CC path, consistent with other data‑sync scans in this PR.


8465-8466: SampleSubRangeKeysCc: treat key‑sampling scans as low‑priority

Re‑enqueuing SampleSubRangeKeysCc via EnqueueLowPriorityCcRequest(&req) is aligned with it being a statistics/sampling job rather than foreground transaction work; resume behavior via PausePos is otherwise unchanged.

Comment thread tx_service/include/util.h Outdated
Comment thread tx_service/include/util.h

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 669511a and e51f30d.

📒 Files selected for processing (2)
  • core/src/data_substrate.cpp
  • tx_service/include/util.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/src/data_substrate.cpp
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/include/util.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/include/util.h
🔇 Additional comments (3)
tx_service/include/util.h (3)

28-41: LGTM!

The new includes are appropriate for the timing functionality: <atomic> for thread-safe state, <mutex> for std::call_once, <thread> for sleep, <chrono> for the fallback path, and <x86intrin.h> properly guarded for x86_64.


489-572: Initialization logic is sound with appropriate stability checks.

The x86_64 path with retry-until-stable is well designed. The aarch64 path using freq_hz / 1000000 trades ~2-3% precision for simplicity and overflow safety—acceptable for approximate 50 µs bounds.

One edge case: if cntfrq_el0 ever returns < 1 MHz (highly unlikely on any real ARM64 hardware), cycles_per_us would be 0, causing a division-by-zero in ReadTimeMicroseconds(). Consider adding a minimum-value guard if defensive coding is desired.


580-619: Implementation is correct with appropriate memory ordering.

The lazy-initialization pattern with memory_order_relaxed on the fast path and memory_order_acquire after initialization is correct. The __builtin_expect hint for the unlikely cold path is a nice optimization. The previous overflow and syntax issues have been properly addressed.

Comment thread tx_service/include/util.h
@liunyl liunyl merged commit 962d604 into main Dec 25, 2025
4 checks passed
@liunyl liunyl deleted the bound_low_prio branch December 25, 2025 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants