Index mem by liunyl · Pull Request #323 · eloqdata/tx_service

liunyl · 2025-12-25T09:21:18Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Reference the link of issue using fixes eloqdb/tx_service#issue_id
Reference the link of RFC if exists
Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

Chores
- Improved memory cleanup in batch operations and reduced retained capacity after clears.
- Tightened memory/quota calculations for data synchronization by including additional in-memory footprints.
- Broadened lock-handling during transaction recovery to treat more non-writable locks uniformly.
- Increased the default load-factor used for range partitioning decisions.


CLAassistant · 2025-12-25T09:21:25Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ liunyl
❌ Ubuntu

Ubuntu does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

coderabbitai · 2025-12-25T09:21:28Z

Walkthrough

Adds vector capacity shrinking and explicit partition-map clearing, increases a range load-factor constant, broadens non-write orphan-lock detection, and includes per-core in-memory vector sizes in data-sync flush memory accounting for both range and hash partitions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Memory cleanup in store handler: `store_handler/data_store_service_client.cpp` | `DataStoreServiceClient::PutAll` now calls `clear()` on `hash_partitions_map` and `range_partitions_map` after their processing loops to release retained memory. |
| Partition batch capacity shrink: `store_handler/data_store_service_client_closure.h` | `PartitionBatchRequest::Clear()` now calls `shrink_to_fit()` on `key_parts`, `record_parts`, `records_ts`, `records_ttl`, `record_tmp_mem_area`, and `op_types` after clearing to reduce retained capacity. |
| Range tuning constant: `tx_service/include/cc/range_slice.h` | `StoreRange::new_range_load_factor` changed from `0.1` to `0.6`. |
| Orphan-lock detection: `tx_service/src/cc/cc_shard.cpp` | In `CheckRecoverTx`, condition broadened: any non-`WriteLock` now triggers the orphan-lock skip/recovery path (previously only `ReadLock`). |
| Data-sync memory accounting: `tx_service/src/cc/local_cc_shards.cpp` | `DataSyncForRangePartition` and `DataSyncForHashPartition` now include per-core sizes of `data_sync_vec`, `archive_vec`, and `mv_base_vec` in `flush_data_size` when evaluating/allocating flush memory quota. |

Sequence Diagram(s)

(Skipped — changes are local bookkeeping and conditional logic adjustments that do not introduce a new multi-component sequential flow.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Improve hash-partition data sync scheduling and memory control. #139: Overlaps per-core memory-accounting adjustments for DataSync paths.
Avoid forcing checkpoint flush when data sync memory quota is full #162: Related changes to data-sync memory/quota handling in local_cc_shards.

Poem

🐰 I hopped through code with gentle paws,
Cleared the maps beneath the gauze,
Shrunk the bins that bloomed too wide,
Load-factor rose, locks learned to bide,
Counting bytes so flushes glide.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete and only contains the unchecked contributor checklist template without any substantive explanation of the changes, objectives, or technical details. | Add a detailed description explaining the changes made, including the rationale for memory optimization changes, the load factor update, and how the lock type handling was modified. Reference the related issue and RFC if applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title 'Index mem' is vague and does not clearly convey the specific changes made; it lacks descriptive detail about memory optimization or the affected components. | Consider using a more descriptive title that reflects the main changes, such as 'Optimize memory allocation and improve partition load factor handling' or 'Refactor memory management for hash and range partitions.' |


📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6edf9f7 and b6b35d7.

📒 Files selected for processing (1)

tx_service/src/cc/local_cc_shards.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

tx_service/src/cc/local_cc_shards.cpp


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tx_service/src/cc/cc_shard.cpp (1)
1011-1020: Update the log message to reflect the broadened condition.

The condition now triggers for any lock type that is not WriteLock (line 1011), but the log message on line 1018 specifically says "read lock detected". This is misleading if other non-write lock types exist and trigger this path.
🔎 Suggested fix
-                    DLOG(INFO) << "read lock detected in tx "
-                               << lock_holding_txn << ", skip recovery";
+                    DLOG(INFO) << "non-write lock detected in tx "
+                               << lock_holding_txn << " (lock type: " 
+                               << static_cast<int>(lk_type) << "), skip recovery";
Additionally, consider adding a comment explaining why the broadened check is appropriate:
// Check if this is a non-write lock. According to the logic above (lines 1013-1017),
// only write locks should appear as orphan locks on local nodes. Any other lock type
// indicates this is not an orphan lock scenario, so skip recovery.
if (lk_type != LockType::WriteLock)
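To make the difference between the old and new condition concrete, here is a minimal sketch. The `LockType` enum below is a stand-in: only `ReadLock` and `WriteLock` are named on this page, and `OtherLock` is a hypothetical placeholder for whatever additional lock types the real enum may define. The two predicates are illustrative names, not functions in `cc_shard.cpp`.

```cpp
#include <cassert>

// Stand-in for the real enum. OtherLock is a hypothetical placeholder
// for any lock type that is neither a read lock nor a write lock.
enum class LockType
{
    ReadLock,
    WriteLock,
    OtherLock
};

// Condition before this PR: only read locks took the skip path.
inline bool SkipsRecoveryBefore(LockType lk_type)
{
    return lk_type == LockType::ReadLock;
}

// Condition after this PR: every non-write lock takes the skip path.
inline bool SkipsRecoveryAfter(LockType lk_type)
{
    return lk_type != LockType::WriteLock;
}
```

The two predicates agree for `ReadLock` and `WriteLock` and differ only for the remaining lock types, which is why a log line that says "read lock detected" can become inaccurate.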

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4943d80 and dd1e7fd.

📒 Files selected for processing (5)

store_handler/data_store_service_client.cpp
store_handler/data_store_service_client_closure.h
tx_service/include/cc/range_slice.h
tx_service/src/cc/cc_shard.cpp
tx_service/src/cc/local_cc_shards.cpp

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-11-11T07:10:40.346Z

Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp
tx_service/src/cc/cc_shard.cpp

📚 Learning: 2025-12-02T10:43:27.431Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp
tx_service/src/cc/cc_shard.cpp

📚 Learning: 2025-10-20T04:30:07.884Z

Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp
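The first learning above (mask with `0x3FF`, then modulo by the core count) can be sketched as follows. This is an illustration of the arithmetic only: the function name `CoreIndexFor` and the `std::uint32_t` type for `shard_code` are assumptions, not the actual `EnqueueCcRequest` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper showing the two-step mapping described in the
// learning: keep the low 10 bits (0..1023), then reduce modulo the
// number of cores.
inline std::size_t CoreIndexFor(std::uint32_t shard_code,
                                std::size_t core_count)
{
    assert(core_count > 0);
    return (shard_code & 0x3FF) % core_count;
}
```

For example, with 8 cores a `shard_code` of `0x12345` keeps the low bits `0x345` (837) and maps to core 5.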

🔇 Additional comments (4)

tx_service/include/cc/range_slice.h (1)

590-606: Behavioral change: larger target subrange size due to higher load factor

new_range_load_factor going from 0.1 to 0.6 directly increases the target subrange size in CalculateRangeSplitKeys (fewer, larger subranges per post_ckpt_size). The math and usage are consistent; this is a pure tuning change. Please just double‑check this higher load factor matches the intended balance between range size, memory usage, and split frequency in your deployment.
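As a rough illustration of why a higher load factor yields fewer, larger subranges, consider the toy model below. It is not the formula used by `CalculateRangeSplitKeys`, which this page does not show. It simply assumes that the target subrange size is `load_factor * max_range_size` and that the post-checkpoint data is divided into pieces of that size. `EstimatedSubrangeCount` and `max_range_size` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Toy model (an assumption, not the project's real computation):
// target subrange size = load_factor * max_range_size, and the data is
// cut into ceil(post_ckpt_size / target) subranges.
inline std::size_t EstimatedSubrangeCount(std::size_t post_ckpt_size,
                                          std::size_t max_range_size,
                                          double load_factor)
{
    assert(load_factor > 0.0 && max_range_size > 0);
    const double target =
        load_factor * static_cast<double>(max_range_size);
    return static_cast<std::size_t>(
        std::ceil(static_cast<double>(post_ckpt_size) / target));
}
```

Under this model, raising the factor from `0.1` to `0.6` makes each subrange about six times larger, so the same `post_ckpt_size` splits into roughly one sixth as many subranges.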

store_handler/data_store_service_client.cpp (2)

361-387: Good: early cleanup of hash_partitions_map to reduce peak memory

Clearing hash_partitions_map right after preparing all PartitionFlushState batches is safe (no further uses) and helps release potentially large index vectors while partition RPCs are in flight.

387-409: Good: symmetric cleanup of range_partitions_map

Similarly, clearing range_partitions_map once all range partition batches are built is correct and reduces transient memory usage for large flushes, without affecting control flow or error handling.

tx_service/src/cc/local_cc_shards.cpp (1)

5001-5004: Hash-partition mem accounting correctly adds FlushRecord/container overhead

The added terms based on scan_cc.DataSyncVec(), scan_cc.ArchiveVec(), and scan_cc.MoveBaseIdxVec() expand flush_data_size to include the in-memory FlushRecord and mv-base container cost, which aligns with the quota controller’s intent and mirrors the range-partition path (once fixed).

No further changes needed here.

coderabbitai · 2025-12-25T09:28:40Z

    void Clear()
    {
        key_parts.clear();
+        key_parts.shrink_to_fit();
        record_parts.clear();
+        record_parts.shrink_to_fit();
        records_ts.clear();
+        records_ts.shrink_to_fit();
        records_ttl.clear();
+        records_ttl.shrink_to_fit();
        record_tmp_mem_area.clear();
+        record_tmp_mem_area.shrink_to_fit();
        op_types.clear();
+        op_types.shrink_to_fit();
        parts_cnt_per_key = 1;
        parts_cnt_per_record = 1;
    }


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -E "data_store_service_client_closure\.(h|cpp)" | head -5

Repository: eloqdata/tx_service

Length of output: 164

🏁 Script executed:

cat -n store_handler/data_store_service_client_closure.h | sed -n '340,380p'

Repository: eloqdata/tx_service

Length of output: 1534

🏁 Script executed:

# Search for PartitionBatchRequest usage and reuse patterns
rg -n "PartitionBatchRequest" --type=cpp -A 3 -B 3 | head -100

Repository: eloqdata/tx_service

Length of output: 7863

🏁 Script executed:

# Look for Clear() and Reset() calls in context
rg -n "\.Clear\(\)|\.Reset\(" --type=cpp store_handler/data_store_service_client_closure.h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 9265

Consider the reallocation cost vs. memory benefit trade-off in the hot path.

The shrink_to_fit() calls after clear() will release excess capacity, reducing memory footprint. However, since Reset() (lines 361-374) immediately reserves new capacity before reuse, this will cause unnecessary deallocation/reallocation cycles if PartitionBatchRequest objects are frequently reused—as they are in batch processing pipelines (see inflight_batch reuse patterns and GetNextBatch() method).

Consider profiling to verify this optimization doesn't introduce performance regression in high-throughput scenarios. Alternatively, consider the pattern used elsewhere (e.g., line 1632-1636) which swaps with empty objects instead of shrinking and re-reserving.

🤖 Prompt for AI Agents

In store_handler/data_store_service_client_closure.h around lines 343 to 359, the Clear() method calls shrink_to_fit() after clear() on multiple vectors which forces deallocation and will cause unnecessary reallocation when these objects are immediately Reset() with reserved capacity; remove the shrink_to_fit() calls (or replace each shrink_to_fit() pattern with swap-with-empty-object idiom used elsewhere) so Clear() only clears the vectors without freeing capacity, matching the Reset() reservation behavior and avoiding hot-path reallocations in frequent reuse scenarios.
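The three options discussed in this comment behave differently with respect to retained capacity. The sketch below uses only standard `std::vector` operations; the helper names are illustrative and are not part of `PartitionBatchRequest`. Each helper takes its argument by value so the caller's vector is left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// clear() destroys the elements but keeps the heap buffer, so
// capacity() is unchanged.
template <typename T>
std::size_t CapacityAfterClear(std::vector<T> v)
{
    v.clear();
    return v.capacity();
}

// shrink_to_fit() is a non-binding request to drop unused capacity;
// an implementation is allowed to ignore it.
template <typename T>
std::size_t CapacityAfterClearAndShrink(std::vector<T> v)
{
    v.clear();
    v.shrink_to_fit();
    return v.capacity();
}

// Swap-with-empty idiom: the old buffer moves into the temporary and
// is freed when the temporary is destroyed at the end of the statement.
template <typename T>
std::size_t CapacityAfterSwapWithEmpty(std::vector<T> v)
{
    std::vector<T>().swap(v);
    return v.capacity();
}
```

Either releasing variant means the next `reserve()` has to allocate again, which is the reallocation cost the comment asks to profile.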

xiexiaoy · 2025-12-25T10:38:08Z

The memory-usage calculation relies on sizeof, which is inaccurate. It would be better to call size_t malloc_usable_size(void *ptr);.

TODO: verify size of PartitionBatchRequest. Maybe we should only free memory if length is more than a threshold
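One way the threshold idea in this TODO could look is sketched below. `ClearAndMaybeRelease` and `max_retained_bytes` are hypothetical names, and the right threshold would need to be measured rather than guessed. The helper keeps small buffers for reuse and frees only those whose reserved bytes exceed the limit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of tx_service: clear the vector, and
// release its buffer only when the retained capacity is larger than
// max_retained_bytes.
template <typename T>
void ClearAndMaybeRelease(std::vector<T> &v, std::size_t max_retained_bytes)
{
    v.clear();
    if (v.capacity() * sizeof(T) > max_retained_bytes)
    {
        // Swap with an empty temporary so the old buffer is freed.
        std::vector<T>().swap(v);
    }
    assert(v.empty());
}
```

With this shape, frequently reused batches of typical size avoid a free/allocate cycle, while an unusually large batch does not pin its memory indefinitely.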

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tx_service/src/cc/local_cc_shards.cpp (1)

4097-4105: Memory quota should consider vector capacity, not just size() * sizeof(T)

Using scan_cc.DataSyncVec*().size() * sizeof(FlushRecord) (and same for archive/mv-base) underestimates actual heap usage, since std::vector typically holds extra reserved capacity. For the quota controller this can materially under-count memory, especially under growth, even though switching to the scan_cc vectors here is otherwise correct.

Consider basing the additional charge on capacity() (or allocator‑reported usable size) instead of size() for both range and hash paths:

Suggested change (range & hash paths)

-            // The cost of FlushRecord also needs to be considered.
-            for (size_t i = 0; i < cc_shards_.size(); ++i)
-            {
-                flush_data_size +=
-                    (scan_cc.DataSyncVec(i).size() * sizeof(FlushRecord) +
-                     scan_cc.ArchiveVec(i).size() * sizeof(FlushRecord) +
-                     scan_cc.MoveBaseIdxVec(i).size() *
-                         sizeof(std::pair<TxKey, int32_t>));
-            }
+            // Also account for the in‑memory containers that will hold this batch.
+            for (size_t i = 0; i < cc_shards_.size(); ++i)
+            {
+                const auto &data_vec = scan_cc.DataSyncVec(i);
+                const auto &archive_vec = scan_cc.ArchiveVec(i);
+                const auto &mv_base_vec = scan_cc.MoveBaseIdxVec(i);
+                flush_data_size +=
+                    data_vec.capacity() * sizeof(FlushRecord) +
+                    archive_vec.capacity() * sizeof(FlushRecord) +
+                    mv_base_vec.capacity() *
+                        sizeof(std::pair<TxKey, int32_t>);
+            }

-        // The cost of FlushRecord also needs to be considered.
-        flush_data_size +=
-            (scan_cc.DataSyncVec().size() * sizeof(FlushRecord) +
-             scan_cc.ArchiveVec().size() * sizeof(FlushRecord) +
-             scan_cc.MoveBaseIdxVec().size() *
-                 sizeof(std::pair<TxKey, int32_t>));
+        // Also account for the in‑memory containers that will hold this batch.
+        const auto &data_vec = scan_cc.DataSyncVec();
+        const auto &archive_vec = scan_cc.ArchiveVec();
+        const auto &mv_base_vec = scan_cc.MoveBaseIdxVec();
+        flush_data_size +=
+            data_vec.capacity() * sizeof(FlushRecord) +
+            archive_vec.capacity() * sizeof(FlushRecord) +
+            mv_base_vec.capacity() *
+                sizeof(std::pair<TxKey, int32_t>);

If you want even tighter accounting, you could go further and use allocator‑specific *_usable_size(vector.data()) instead of capacity()*sizeof(T), but the above keeps things portable while avoiding the worst underestimation.

Also applies to: 5002-5007
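The gap between the two estimates is easy to see in isolation. The helpers below are illustrative and work for any element type; in the PR the element types would be `FlushRecord` and `std::pair<TxKey, int32_t>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes occupied by live elements only.
template <typename T>
std::size_t BytesBySize(const std::vector<T> &v)
{
    return v.size() * sizeof(T);
}

// Bytes reserved by the vector's buffer, including unused slots.
template <typename T>
std::size_t BytesByCapacity(const std::vector<T> &v)
{
    return v.capacity() * sizeof(T);
}
```

A vector that reserved room for 100 elements but holds one reports `sizeof(T)` by size and at least `100 * sizeof(T)` by capacity. Neither figure includes allocator rounding or memory owned indirectly by the elements.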

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd1e7fd and 6edf9f7.

📒 Files selected for processing (1)

tx_service/src/cc/local_cc_shards.cpp

🧰 Additional context used

🧠 Learnings (5)

📚 Learning: 2025-11-11T07:10:40.346Z

Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-12-02T10:43:27.431Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-10-20T04:30:07.884Z

Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

📚 Learning: 2025-10-21T06:46:53.700Z

Learnt from: lokax
Repo: eloqdata/tx_service PR: 149
File: src/remote/cc_stream_receiver.cpp:1066-1075
Timestamp: 2025-10-21T06:46:53.700Z
Learning: In src/remote/cc_stream_receiver.cpp, for ScanNextRequest handling, BucketIds() on RemoteScanNextBatch should never be empty—this is an expected invariant of the scan protocol.

Applied to files:

tx_service/src/cc/local_cc_shards.cpp

xiexiaoy · 2025-12-26T06:32:56Z

+            size_t flush_record_size = mi_malloc_usable_size(&flush_record);
+#ifdef WITH_JEMALLOC
+            flush_record_size = sizeof(FlushRecord);
+#endif


Please take a look at the figure I pasted in eloqdata/project_tracker#73.

The std::vector members data_sync_vecs, archive_vecs and mv_base_vecs might allocate more memory than the calculated value.

I suggest using mi_malloc_usable_size to get the memory usage of those vectors, not the variable flush_record.
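A minimal sketch of this suggestion, under stated assumptions: it uses glibc's `malloc_usable_size` from `<malloc.h>` rather than mimalloc's `mi_malloc_usable_size`, and it assumes the vector uses the default allocator whose `operator new` forwards to `malloc`. `UsableBytes` is a hypothetical name. The key point is that the query is made on the vector's heap buffer (`data()`), not on the address of a local variable; passing a pointer that did not come from the allocator is invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <malloc.h>  // glibc-specific: malloc_usable_size
#include <vector>

// Hypothetical helper. Assumes glibc and the default std::allocator,
// so that v.data() is a pointer previously returned by malloc.
template <typename T>
std::size_t UsableBytes(const std::vector<T> &v)
{
    // data() points at the start of the heap buffer, or may be null
    // when nothing has been allocated; malloc_usable_size(nullptr)
    // returns 0.
    return malloc_usable_size(const_cast<T *>(v.data()));
}
```

The returned value is at least `capacity() * sizeof(T)` and can be larger, since the allocator may round the request up. With a different allocator (mimalloc, jemalloc) the matching query function of that allocator would be needed instead.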

address comment in previous pr #323

liunyl and others added 2 commits December 25, 2025 09:18

consider flush record meta data size

10eeecc

Change new range fill factor

dd1e7fd

coderabbitai Bot reviewed Dec 25, 2025


fix comment

6edf9f7

TODO: verify size of PartitionBatchRequest. Maybe we should only free memory if length is more than a threshold

coderabbitai Bot reviewed Dec 25, 2025


address comment

b6b35d7

xiexiaoy approved these changes Dec 26, 2025


liunyl merged commit c5f9174 into main Dec 26, 2025
2 of 4 checks passed

liunyl deleted the index_mem branch December 26, 2025 07:23

liunyl restored the index_mem branch December 26, 2025 07:23

liunyl added a commit that referenced this pull request Dec 26, 2025

address comment (#325)

a2d73fb

address comment in previous pr #323

This was referenced Dec 26, 2025

address comment #325

Merged

fix crash in malloc_usable_size and rocksdb_gcs compile #328

Merged

This was referenced Jan 8, 2026

enhancement: tune hash partition checkpoint params, increase ckpt scan buffer, fix hash ckpt partition calc, and limit min data sync scan concurrency #347

Merged

Fix recent bugs #362

Closed

This was referenced Jan 18, 2026

Bug fix #368

Merged

Make putall across tables parallel #373

Merged

coderabbitai Bot mentioned this pull request Feb 2, 2026

shard flush buffer to multi core #387

Closed

5 tasks

coderabbitai Bot mentioned this pull request Feb 12, 2026

Dirty memory ratio and size can trigger ckpt #406

Merged

5 tasks

Index mem #323: liunyl merged 4 commits into main from index_mem
