Bound low priority reqs process time by 50 us#320
Conversation
WalkthroughAdds a low‑priority CC request path (enqueue, queue‑size, and time‑bounded processing), a DataSyncScanDataSize constant for data‑bounded exports, and cross‑platform ReadTimeMicroseconds()/InitializeTscFrequency() timing utilities; startup is updated to initialize the timer. Changes
Sequence DiagramsequenceDiagram
participant Tx as TxService
participant Local as LocalCcShards
participant Shard as CcShard
participant LPQ as LowPriorityQueue
participant Req as CC_Request
rect rgb(240,248,255)
note over Tx,Local: RunOneRound triggers low‑priority processing
Tx->>Local: ProcessLowPriorityRequests(thd_id)
Local->>Shard: ProcessLowPriorityRequests()
end
rect rgb(250,250,240)
note over Shard,LPQ: Time‑bounded loop (~50µs) using ReadTimeMicroseconds()
loop while time budget && queue not empty
Shard->>LPQ: Dequeue()
LPQ-->>Shard: Req
Shard->>Req: Execute()
Req-->>Shard: Completed / Free
Shard->>Shard: update counters / check time (ReadTimeMicroseconds)
end
end
Shard-->>Local: processed_count
Local-->>Tx: processed_count
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used🧠 Learnings (4)📚 Learning: 2025-11-11T07:10:40.346ZApplied to files:
📚 Learning: 2025-10-09T03:56:58.811ZApplied to files:
📚 Learning: 2025-12-02T10:43:27.431ZApplied to files:
📚 Learning: 2025-10-20T04:30:07.884ZApplied to files:
🧬 Code graph analysis (1)tx_service/include/tx_service.h (1)
🔇 Additional comments (6)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (7)
tx_service/include/util.h (1)
32-35: Guard x86 intrinsics for your actual platform targets
#if defined(__x86_64__) || defined(_M_X64)plus#include <x86intrin.h>is fine for GCC/Clang on Linux, but_M_X64with MSVC typically expects<intrin.h>instead. If Windows/MSVC is (or might become) a target, consider tightening the condition to the compilers you actually support or adding a separate#ifdef _MSC_VERinclude path.tx_service/include/cc/cc_request.h (1)
3715-3717: New DataSyncScanDataSize constant looks fine; consider documenting units/intentThe
DataSyncScanDataSize = 100 * 1024bound is reasonable for a batch limit and aligns with the “100KB” comment. If this constant is used as a hard limit inTemplateCcMapexport loops, consider:
- Making the “bytes” unit explicit in the name or comment (e.g.,
// 100 KiB in bytes) and- Ensuring it’s either configurable or at least referenced in any tuning doc, since it directly affects low‑priority export pacing.
tx_service/include/tx_service.h (1)
391-407: Integrating low-priority queue into busy-round metrics and req count looks correctIncluding
LowPriorityQueueSize(thd_id_)in the busy-round threshold and addingProcessLowPriorityRequests(thd_id_)intoreq_cntmatches the intent of treating low-priority work as part of overall load while still bounding its per-round processing time.Minor nit: the comment on Line 400 is now misleading – low-priority requests are processed after the for-loop, not “already processed above.” Consider updating it to avoid confusion.
tx_service/src/cc/cc_shard.cpp (1)
554-619: Time-bounded low-priority loop looks correct and thread-safeThe new
ProcessLowPriorityRequests():
- Uses the existing
low_priority_cc_queue_pluslow_priority_cc_queue_size_counter in a manner consistent with the main queue.- Breaks on empty queue, dequeue failure, or after ~50µs based on
ReadTimeMicroseconds(), so low-priority work cannot monopolize the tx-processor thread.- Preserves the
Execute/Freelifecycle identical toProcessRequests().Given only the tx-processor thread consumes this queue and all producers only enqueue, the relaxed loads and
fetch_subpattern are safe here.If you find that individual low-priority requests are expensive, you might later consider checking elapsed time less frequently (e.g., every N requests) to reduce
ReadTimeMicroseconds()overhead, but it’s fine as-is for correctness and the stated 50µs bound.tx_service/include/cc/template_cc_map.h (3)
5836-6053: Time-bounding hash-partition data sync and tracking export size look correct; consider minor polishThe additions in
HashPartitionDataSyncScanCc::Executeall look consistent with the intent:
export_data_sizeplus theexport_data_size < HashPartitionDataSyncScanCc::DataSyncScanDataSizeguard cleanly caps bytes exported per call without affecting overall progress (which still usesaccumulated_scan_cnt_/scan_batch_size_and pause position).- The 50µs time-slice via
ReadTimeMicroseconds()(with wrap checkl_now < l_start) is safe and only causes an early loop break; state (it, pause key, counters) is preserved and the request is re-enqueued to continue.- Re-enqueuing via
EnqueueLowPriorityCcRequest(&req)is appropriate for this background task.Two small refinements you might consider:
The time-slice check is gated on
export_data_cnt(incremented only in the non-VersionedRecordbranch). IfVersionedRecord == truealso appears in production, that path will still be governed only byDataSyncScanBatchSize, not the 50µs cap. If you want a uniform 50µs bound, usingscan_cntinstead ofexport_data_cnt(or incrementingexport_data_cntin both branches) would be clearer.
export_data_sizeissize_twhileflush_data_size/accumulated_flush_data_size_areuint64_t; aligning these types (e.g., makingexport_data_sizeauint64_t) would avoid any surprise on non-64-bit builds.You might also want to hoist the literal
50into a named constant shared with the other time-sliced loops to make the policy easier to tune.
6099-6204: DefragShardHeapCc now time-sliced & low-priority; behavior preserved with minor nitsIn both phases of
DefragShardHeapCc::Execute:
- Introducing
auto l_start = ReadTimeMicroseconds();with al_now - l_start >= 50 || l_now < l_startbreak every 4 successful defrags bounds per-call work without affecting correctness; resume state (pause_pos,ccmp_key_defraged) is still correctly updated before re-enqueue.- Swapping
Enqueue(...)forEnqueueLowPriorityCcRequest(&req)ensures this maintenance work runs on the low-priority path, which matches the PR’s goal.Optional cleanups:
- The time-slice pattern here is almost identical to the one in
HashPartitionDataSyncScanCcandKickoutCcEntryCc; a small helper (or at least a shared named constant for the 50µs budget) would reduce duplication and keep the policy centralized.- Because the time check is conditioned on
current_defrag_cnt > 0, a batch that scans but doesn’t actually defrag anything can still run for a fullscan_batch_sizewithout the 50µs guard. If you want a stricter wall-clock bound, basing the check onscan_cntinstead would handle that edge case.Current behavior is still safe; these are purely maintainability/performance-polish suggestions.
Also applies to: 6228-6239, 6335-6336
7273-7287: KickoutCcEntryCc loop is now bounded in wall-clock timeThe added
ReadTimeMicroseconds()sampling inKickoutCcEntryCc::Executecleanly caps each kickout batch: after every two cleaned pages, the loop breaks if more than ~50µs have elapsed (or on wrap). This integrates well with existing resume logic (resume key and re-enqueue) and does not alter termination conditions.If you end up tuning the 50µs budget elsewhere, consider promoting that literal to a shared constant to avoid diverging values across different maintenance loops.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tx_service/include/cc/cc_request.htx_service/include/cc/cc_shard.htx_service/include/cc/local_cc_shards.htx_service/include/cc/template_cc_map.htx_service/include/tx_service.htx_service/include/util.htx_service/src/cc/cc_shard.cpp
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
tx_service/include/util.htx_service/include/cc/local_cc_shards.htx_service/include/cc/cc_shard.htx_service/include/tx_service.htx_service/src/cc/cc_shard.cpptx_service/include/cc/template_cc_map.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
tx_service/include/cc/local_cc_shards.htx_service/include/cc/cc_shard.htx_service/include/tx_service.htx_service/src/cc/cc_shard.cpptx_service/include/cc/template_cc_map.h
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/include/cc/local_cc_shards.htx_service/include/tx_service.htx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
tx_service/include/tx_service.htx_service/src/cc/cc_shard.cpptx_service/include/cc/cc_request.htx_service/include/cc/template_cc_map.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.
Applied to files:
tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (4)
tx_service/include/cc/local_cc_shards.h (2)
tx_service/src/cc/cc_shard.cpp (2)
ProcessLowPriorityRequests(554-619)ProcessLowPriorityRequests(554-554)tx_service/include/cc/cc_shard.h (1)
LowPriorityQueueSize(465-468)
tx_service/include/cc/cc_shard.h (2)
tx_service/include/cc/local_cc_shards.h (2)
ProcessLowPriorityRequests(244-247)LowPriorityQueueSize(254-257)tx_service/src/cc/cc_shard.cpp (2)
ProcessLowPriorityRequests(554-619)ProcessLowPriorityRequests(554-554)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
metrics(103-106)
tx_service/src/cc/cc_shard.cpp (1)
tx_service/include/cc/local_cc_shards.h (1)
ProcessLowPriorityRequests(244-247)
🔇 Additional comments (9)
tx_service/include/cc/cc_shard.h (2)
379-382: Low-priority processing API is consistent with existing designAdding
ProcessLowPriorityRequests()directly besideProcessRequests()keeps the interface clean and matches howLocalCcShardsdelegates per-shard work. No issues from an API or threading perspective.
460-468: LowPriorityQueueSize accessor correctly mirrors QueueSize
LowPriorityQueueSize()reuses the same relaxed-load pattern asQueueSize(), so callers (busy-round detection, IsIdle, etc.) get a consistent view of both queues.tx_service/include/cc/local_cc_shards.h (2)
239-247: Delegation of low-priority processing to per-shard CcShard is straightforward
ProcessLowPriorityRequests(size_t thd_id)follows the same pattern asProcessRequestsand keeps all per-shard logic insideCcShard. This is the right level of abstraction.
249-257: LowPriorityQueueSize wrapper is aligned with other shard accessors
LowPriorityQueueSize(thd_id)is a thin, consistent wrapper around the shard’s own accessor, making it easy for higher layers (TxProcessor) to treat both queues uniformly.tx_service/src/cc/cc_shard.cpp (1)
2643-2645: DefragShardHeapCc correctly moved onto low-priority queueScheduling
defrag_heap_cc_viaEnqueueLowPriorityCcRequest(defrag_heap_cc_.get())aligns defragmentation with the new low-priority processing path, so shard heap defrag now adheres to the same 50µs-per-round budget instead of competing with foreground CC work on the main queue.tx_service/include/cc/template_cc_map.h (4)
5709-5714: Route range-partition data sync scans through low-priority queueSwitching
RangePartitionDataSyncScanCcre-enqueue fromEnqueue(&req)toEnqueueLowPriorityCcRequest(&req)matches the “low-priority bounded work” goal and doesn’t change completion semantics (pause state and finish conditions are unchanged). No issues spotted here.
8088-8103: ScanSliceDeltaSizeCcForRangePartition retries and continuations moved to low-priority queueWithin
ScanSliceDeltaSizeCcForRangePartition::Execute:
- On
RangeSliceOpStatus::Retryinpin_range_slice, switching fromEnqueue(LocalCoreId, &req)toEnqueueLowPriorityCcRequest(LocalCoreId, &req)ensures repeated pin attempts don’t compete with high-priority CC work.- At the end of the scan, when not finished, using
EnqueueLowPriorityCcRequest(&req)instead ofEnqueue(&req)keeps subsequent scan batches on the low-priority path; finish vs. resume semantics are unchanged.Assuming
EnqueueLowPriorityCcRequest(core_id, ...)preserves the same per-core routing guarantees asEnqueue(core_id, ...), this looks correct and aligned with the PR objective.Also applies to: 8188-8194
8338-8350: Hash-partition delta-size scans now resume via low-priority queueThe only change in
ScanDeltaSizeCcForHashPartition::Executeis to re-enqueue unfinished work viaEnqueueLowPriorityCcRequest(&req)instead of the normalEnqueue(&req). The finish path is untouched, and pause-key handling remains the same, so semantics are preserved while ensuring this accounting job runs at reduced priority.
8453-8456: Sampling sub-range keys: background continuation correctly downgraded to low priorityFor
SampleSubRangeKeysCc::Execute, when the scan doesn’t reachend_it, the request is now re-enqueued throughEnqueueLowPriorityCcRequest(&req)instead of the main queue. This is appropriate for an analysis/sampling task and doesn’t change how pause position or completion are determined.
2b66e65 to
bf37013
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tx_service/include/util.h (1)
446-472: Fix ReadTimeMicroseconds() accuracy on aarch64 and optionally refine x86/calibrationThe current implementation has a correctness problem on aarch64 and a couple of portability/accuracy nits:
- aarch64 scaling is wrong:
cntvct_el0increments at a platform‑defined frequency (readable fromcntfrq_el0), not at ~2 GHz. Dividing ticks by the hard‑codedAPPROX_CYCLES_PER_MICROSECOND = 2000will mis-estimate microseconds by a large factor on typical ARM systems, so the 50 µs processing window will be far off (often an order of magnitude). This breaks the intent of the bound even if it “fails safe”.- Better pattern for aarch64: Keep using the architectural timer and convert with its real frequency, but avoid overflow via a wider intermediate as previously suggested. That gives an accurate, monotonic µs reading with negligible extra cost on ARM.
- x86_64 approximation & MSVC nit (optional):
- The 2000 cycles/µs constant assumes ~2 GHz invariant TSC; deployments at 2.5–3.5 GHz will see time bounds off by ~25–75%. If this matters, consider calibrating once at startup versus
steady_clockand caching the ratio, or at least document the assumption.- For
_M_X64with MSVC,<x86intrin.h>may not be available;__rdtsc()normally lives in<intrin.h>. A small#ifdef _MSC_VERwrapper would make this more portable.A concrete aarch64 fix that keeps the low‑overhead path while restoring correct scaling:
Proposed aarch64 conversion using cntfrq_el0 with 128‑bit intermediate
static inline uint64_t ReadTimeMicroseconds() { // Approximate conversion: assumes ~2GHz CPU (1μs = 2000 cycles) // This is approximate but acceptable per requirements static constexpr uint64_t APPROX_CYCLES_PER_MICROSECOND = 2000; #if defined(__x86_64__) || defined(_M_X64) @@ -#elif defined(__aarch64__) - // Read ARM virtual counter - returns timer ticks - uint64_t ticks; - __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks)); - return ticks / APPROX_CYCLES_PER_MICROSECOND; +#elif defined(__aarch64__) + // Read ARM virtual counter and its frequency, then convert ticks→microseconds. + uint64_t ticks; + __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks)); + uint64_t freq_hz; + __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(freq_hz)); + // ticks * 1'000'000 / freq_hz, in a widened intermediate to avoid overflow. + __uint128_t us = + (static_cast<__uint128_t>(ticks) * 1000000ULL) / freq_hz; + return static_cast<uint64_t>(us); #elseIf
__uint128_tis not available on one of your ARM toolchains, we can instead rearrange the math to divide first where possible and keep it in 64‑bit.Optionally, to improve x86 portability to MSVC:
Optional: make the x86 include MSVC‑friendly
-#if defined(__x86_64__) || defined(_M_X64) -#include <x86intrin.h> // For __rdtsc() -#endif +#if defined(__x86_64__) || defined(_M_X64) +# if defined(_MSC_VER) +# include <intrin.h> +# else +# include <x86intrin.h> // For __rdtsc() +# endif +#endif
🧹 Nitpick comments (4)
tx_service/include/cc/template_cc_map.h (4)
5836-5854: HashPartitionDataSyncScanCc: time & data-size bounding logic is sound; consider tightening typesThe new
export_data_sizeaccumulator and theReadTimeMicroseconds()-based 50µs guard give this handler both a per-call time budget and a per-batch data-size budget, on top of the existingDataSyncScanBatchSizeandscan_batch_size_limits. The wrap-around guard (l_now < l_start) is harmless and ensures we bail out on any clock anomaly.One small refinement you might consider:
flush_data_sizeandHashPartitionDataSyncScanDataSizeare conceptually byte counts; if they are wider thansize_ton any supported platform, usinguint64_tforexport_data_sizewould avoid any narrowing in 32‑bit builds. Behavior is otherwise correct, including the re-enqueue viaEnqueueLowPriorityCcRequest(&req)when only the size/time budget (not data exhaustion) was hit.Also applies to: 5903-5907, 5995-5999, 6050-6055
6099-6112: DefragShardHeapCc: 50µs slicing + low-priority scheduling match the PR goalBoth defrag phases (ccmap keys and pages) now use
ReadTimeMicroseconds()to cap work at ~50µs per pass (checked every few successful defrags) and re-enqueue viaEnqueueLowPriorityCcRequest(&req). That achieves the intended latency bounding while still honoringscan_batch_size_and thepause_posstate machine.Just ensure the low-priority queue is serviced frequently enough in the scheduler so long-running defrag sequences can still make steady progress; the logic here otherwise looks consistent.
Also applies to: 6202-6203, 6230-6242, 6335-6336
7273-7289: KickoutCcEntryCc: new 50µs budget is fine; verify desired priority classUsing
ReadTimeMicroseconds()to break after ~50µs of work in the kickout loop (checked every couple of cleaned pages) keeps eviction from monopolizing a TxProcessor. Unlike the data-sync/defrag handlers, this still re-enqueues on the normal queue, which is reasonable if you want TTL/memory-pressure cleanup to retain higher priority.If the intent is to classify all long-running maintenance work as “low priority”, you could also consider moving this to
EnqueueLowPriorityCcRequest, but that’s more of a policy choice than a correctness issue.
8091-8094: Routing scan-delta and sampling requests to low-priority queue aligns with background nature
ScanSliceDeltaSizeCcForRangePartition,ScanDeltaSizeCcForHashPartition, andSampleSubRangeKeysCcnow reschedule viaEnqueueLowPriorityCcRequest(...)instead of the mainEnqueue(...). Given these handlers are used for data-size accounting and statistics rather than foreground transaction processing, demoting them to the low-priority queue is consistent with the stated objective of bounding impact on critical paths.Please just confirm there are no callers depending on strict interleaving with high-priority CC work (e.g., expecting these to complete before subsequent foreground operations), since low-priority scheduling can relax those ordering assumptions.
Also applies to: 8192-8194, 8455-8456
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tx_service/include/cc/cc_request.htx_service/include/cc/cc_shard.htx_service/include/cc/local_cc_shards.htx_service/include/cc/template_cc_map.htx_service/include/tx_service.htx_service/include/util.htx_service/src/cc/cc_shard.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- tx_service/include/cc/cc_shard.h
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
tx_service/include/cc/local_cc_shards.htx_service/include/tx_service.htx_service/include/cc/template_cc_map.htx_service/include/util.htx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
tx_service/include/cc/local_cc_shards.htx_service/include/tx_service.htx_service/include/cc/template_cc_map.htx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/include/cc/local_cc_shards.htx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
tx_service/include/tx_service.htx_service/include/cc/cc_request.htx_service/include/cc/template_cc_map.htx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.
Applied to files:
tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (3)
tx_service/include/cc/local_cc_shards.h (2)
tx_service/src/cc/cc_shard.cpp (2)
ProcessLowPriorityRequests(554-620)ProcessLowPriorityRequests(554-554)tx_service/include/cc/cc_shard.h (1)
LowPriorityQueueSize(465-468)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
metrics(103-106)
tx_service/src/cc/cc_shard.cpp (1)
tx_service/include/cc/local_cc_shards.h (1)
ProcessLowPriorityRequests(244-247)
🔇 Additional comments (7)
tx_service/include/cc/local_cc_shards.h (2)
244-247: LGTM! Clean delegation following established patterns.The implementation correctly delegates low-priority request processing to the per-thread shard, following the exact same pattern as
ProcessRequests(). The actual 50μs time-bound logic is appropriately handled at theCcShardlevel.
254-257: LGTM! Consistent with existing queue size accessor.The implementation correctly mirrors the pattern of
QueueSize()and properly delegates to the per-thread shard's low-priority queue size accessor.tx_service/include/cc/cc_request.h (1)
3711-3717: New DataSyncScanDataSize constant looks goodConstant is clearly named, scoped, and sized (100 * 1024 bytes) alongside the existing batch-size knob; no issues from this header-side addition.
tx_service/include/util.h (1)
24-34: Header additions align with new timing helperIncluding
<chrono>and gating<x86intrin.h>on x86_64/_M_X64 matches the new ReadTimeMicroseconds implementation; no problems here from a header/ODR standpoint.tx_service/src/cc/cc_shard.cpp (2)
2645-2645: LGTM!Routing defragmentation work to the low-priority queue is appropriate since it's a background maintenance task that shouldn't impact latency-sensitive operations.
554-620: Implementation is correct for the 50µs time-bounded processing.The one-at-a-time dequeue approach provides fine-grained timing control compared to the bulk dequeue in
ProcessRequests(). The wraparound handling at lines 607-616 is conservative but safe.
ReadTimeMicroseconds()(intx_service/include/util.h:452-467) uses __rdtsc() on x86_64 and ARM virtual counter on aarch64—both single-instruction operations with minimal overhead. It divides CPU cycles by 2000 to approximate microseconds (assuming ~2GHz CPU), which provides sufficient precision for the 50µs bound. The implementation explicitly notes this approximation is "acceptable per requirements" and is consistently used throughout the codebase with the same 50µs threshold.tx_service/include/cc/template_cc_map.h (1)
5703-5715: Re-enqueue RangePartitionDataSyncScanCc via low-priority queue looks appropriateSwitching the reschedule path to
shard_->EnqueueLowPriorityCcRequest(&req);keeps this background range data-sync work off the main CC queue while still respecting the existing batch limits (DataSyncScanBatchSizeandscan_batch_size_). Semantics of pause/resume andPausePoshandling remain unchanged, so this looks safe.
bf37013 to
669511a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
tx_service/include/tx_service.h (1)
406-407: Incorrect comment on line 406.The comment states "low priority already processed above" but the low-priority processing actually happens below on lines 410-413. This was already flagged in a previous review.
tx_service/include/util.h (1)
608-611: Critical: Syntax error in aarch64 path breaks compilation.Line 609-610 has a malformed namespace qualifier:
std: memory_order_acquireshould bestd::memory_order_acquire.🔎 Proposed fix
cycles_per_us = - tsc_cycles_per_microsecond_.load(std - : memory_order_acquire); + tsc_cycles_per_microsecond_.load(std::memory_order_acquire);
🧹 Nitpick comments (4)
tx_service/include/cc/template_cc_map.h (4)
5845-5855: HashPartitionDataSyncScanCc: bounded by both time and exported data size
export_data_size+HashPartitionDataSyncScanCc::DataSyncScanDataSizeadds a clear data‑volume cap per execute, and you consistently accumulateflush_data_sizein both Versioned and non‑Versioned branches.- Replacing chrono-based timing with
ReadTimeMicroseconds()and breaking whenl_now - l_start >= 50 || l_now < l_startgives a defensible 50µs budget with simple wrap‑around protection.- Requeueing with
EnqueueLowPriorityCcRequest(&req)keeps this maintenance work off the normal CC queue.Consider (non‑blocking) making
export_data_sizeauint64_tto matchflush_data_size/DataSyncScanDataSize, and lifting the hardcoded50into a named constant shared across these low‑prio loops.Also applies to: 5857-5864, 5913-5922, 6005-6012, 6061-6063
6110-6121: DefragShardHeapCc (key defrag): time‑bounded iterations + low‑priority requeueThe added
ReadTimeMicroseconds()budget and early break on>= 50µs (withl_now < l_startguard) nicely caps each defrag slice, and switching toEnqueueLowPriorityCcRequest(&req)keeps this maintenance task low‑impact. Logic for pause/resume viapause_posis untouched and still correct.Also applies to: 6212-6213
6240-6251: DefragShardHeapCc (page/entry defrag): consistent low‑prio, time‑capped behaviorMirroring the key-defrag branch, this loop now:
- Uses
ReadTimeMicroseconds()with a 50µs cap checked every few effective operations.- Requeues via
EnqueueLowPriorityCcRequest(&req).This keeps long page/entry defrag passes from monopolizing a TxProcessor core while preserving existing pause/resume semantics.
Also applies to: 6245-6246
7282-7283: KickoutCcEntryCc: add 50µs guard around per‑batch page cleaningThe new
ReadTimeMicroseconds()start/now checks in the kickout loop bound how long a singleKickoutCcEntryCcexecution can run, on top of the existingKickoutPageBatchSizelimit. Breaking beforeCleanPageAndReBalanceon time‑out is safe becauseccpstill points at the next page andresume_keyis derived from that page’sFirstKey().Also applies to: 7291-7297
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
core/src/data_substrate.cpptx_service/include/cc/cc_request.htx_service/include/cc/cc_shard.htx_service/include/cc/local_cc_shards.htx_service/include/cc/template_cc_map.htx_service/include/tx_service.htx_service/include/util.htx_service/src/cc/cc_shard.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
- tx_service/include/cc/local_cc_shards.h
- tx_service/include/cc/cc_shard.h
- tx_service/include/cc/cc_request.h
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
tx_service/include/util.htx_service/src/cc/cc_shard.cpptx_service/include/cc/template_cc_map.htx_service/include/tx_service.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
tx_service/include/util.htx_service/src/cc/cc_shard.cpptx_service/include/cc/template_cc_map.htx_service/include/tx_service.h
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/src/cc/cc_shard.cpptx_service/include/tx_service.h
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
Applied to files:
tx_service/src/cc/cc_shard.cpptx_service/include/cc/template_cc_map.htx_service/include/tx_service.h
📚 Learning: 2025-10-21T06:46:53.700Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.
Applied to files:
tx_service/include/cc/template_cc_map.h
🧬 Code graph analysis (3)
core/src/data_substrate.cpp (1)
tx_service/include/util.h (1)
InitializeTscFrequency(492-575)
tx_service/src/cc/cc_shard.cpp (2)
tx_service/include/cc/local_cc_shards.h (1)
ProcessLowPriorityRequests(252-255)tx_service/include/util.h (1)
ReadTimeMicroseconds(583-623)
tx_service/include/tx_service.h (1)
core/include/data_substrate.h (1)
metrics(103-106)
🔇 Additional comments (10)
core/src/data_substrate.cpp (1)
38-40: LGTM!The TSC frequency initialization is appropriately placed early in the
Start()sequence, ensuring the timing infrastructure is calibrated before any time-bounded processing occurs. Thestd::call_onceinsideInitializeTscFrequency()makes this safe to call even if the init_state check fails later.Also applies to: 139-140
tx_service/include/tx_service.h (2)
398-400: LGTM!The busy round threshold check correctly sums both regular and low-priority queue sizes to determine if metrics collection is warranted.
410-413: LGTM!Low-priority requests are correctly processed after regular requests, and the count is appropriately added to the total
req_cnt.tx_service/src/cc/cc_shard.cpp (2)
564-630: LGTM - Time-bounded low-priority processing implementation.The implementation correctly bounds processing time to 50 microseconds:
- Uses
ReadTimeMicroseconds()for low-overhead timing- Conservative wraparound handling (exits on detected wraparound)
- Per-request dequeue is appropriate for a best-effort low-priority path
One minor observation: the time check happens after processing each request, so the actual elapsed time may slightly exceed 50µs by the duration of the last processed request. This is acceptable for a soft bound.
2706-2706: LGTM!Routing defragmentation work through the low-priority queue is appropriate since defrag is a background maintenance operation that shouldn't compete with regular request processing.
tx_service/include/util.h (1)
492-575: Well-designed TSC frequency calibration.The
InitializeTscFrequency()implementation is solid:
- Uses
std::call_oncefor thread-safe one-time initialization- Retry loop with stability threshold (1%) ensures accurate measurement
- Graceful fallback if stable measurement cannot be obtained
- aarch64 path correctly reads
cntfrq_el0for timer frequencytx_service/include/cc/template_cc_map.h (4)
5719-5721: RangePartitionDataSyncScanCc now correctly requeues via low‑priority CC queueUsing
EnqueueLowPriorityCcRequest(&req)forRangePartitionDataSyncScanCcmatches its background nature and keeps it from competing with foreground CC work; the control flow (return false) is unchanged otherwise.
8098-8104: RangePartitionDataSyncScanCc: retries and continuation moved to low‑priority queueBoth:
- The
RangeSliceOpStatus::Retrypath, and- The normal continuation path at the end of
Execute(ScanSliceDeltaSizeCcForRangePartition &req)now use
EnqueueLowPriorityCcRequest(...)instead of the main queue. Given these are data‑sync / delta‑size maintenance scans, this is an appropriate downgrade in scheduling priority and should help bound impact on foreground CC traffic.Also applies to: 8203-8204
8359-8360: HashPartition delta‑size scan: requeue on low‑priority CC queueSwitching the re‑enqueue from
EnqueuetoEnqueueLowPriorityCcRequest(&req)forScanDeltaSizeCcForHashPartitionkeeps this accounting/estimation work out of the hot CC path, consistent with other data‑sync scans in this PR.
8465-8466: SampleSubRangeKeysCc: treat key‑sampling scans as low‑priorityRe‑enqueuing
SampleSubRangeKeysCcviaEnqueueLowPriorityCcRequest(&req)is aligned with it being a statistics/sampling job rather than foreground transaction work; resume behavior viaPausePosis otherwise unchanged.
669511a to
e51f30d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
core/src/data_substrate.cpptx_service/include/util.h
🚧 Files skipped from review as they are similar to previous changes (1)
- core/src/data_substrate.cpp
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.
Applied to files:
tx_service/include/util.h
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
tx_service/include/util.h
🔇 Additional comments (3)
tx_service/include/util.h (3)
28-41: LGTM!The new includes are appropriate for the timing functionality:
<atomic>for thread-safe state,<mutex>forstd::call_once,<thread>for sleep,<chrono>for the fallback path, and<x86intrin.h>properly guarded for x86_64.
489-572: Initialization logic is sound with appropriate stability checks.The x86_64 path with retry-until-stable is well designed. The aarch64 path using
freq_hz / 1000000trades ~2-3% precision for simplicity and overflow safety—acceptable for approximate 50 µs bounds.One edge case: if
cntfrq_el0ever returns < 1 MHz (highly unlikely on any real ARM64 hardware),cycles_per_uswould be 0, causing a division-by-zero inReadTimeMicroseconds(). Consider adding a minimum-value guard if defensive coding is desired.
580-619: Implementation is correct with appropriate memory ordering.The lazy-initialization pattern with
memory_order_relaxedon the fast path andmemory_order_acquireafter initialization is correct. The__builtin_expecthint for the unlikely cold path is a nice optimization. The previous overflow and syntax issues have been properly addressed.
e51f30d to
17e3bdf
Compare
Here are some reminders before you submit the pull request
fixes eloqdb/tx_service#issue_id./mtr --suite=mono_main,mono_multi,mono_basicSummary by CodeRabbit
New Features
Refactor
Chores
✏️ Tip: You can customize this high-level summary in your review settings.