Skip to content

fix bug and refactor init process#250

Merged
liunyl merged 6 commits into
mainfrom
eloqdb
Dec 3, 2025
Merged

fix bug and refactor init process#250
liunyl merged 6 commits into
mainfrom
eloqdb

Conversation

@liunyl

@liunyl liunyl commented Dec 2, 2025

Copy link
Copy Markdown
Contributor

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced improved initialization lifecycle for core services with synchronized startup sequencing.
  • Bug Fixes

    • Optimized lock acquisition for non-partitioned table operations.
  • Chores

    • Simplified path handling in data store configuration.
    • Adjusted metrics configuration options and removed unused metrics wiring.
    • Updated default build configuration settings.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Dec 2, 2025

Copy link
Copy Markdown

Caution

Review failed

The pull request is closed.

Walkthrough

This PR refactors DataSubstrate's initialization from a global singleton pattern to a controlled lifecycle API with explicit Init/Start phases and per-engine registration. It removes CMake configuration options, simplifies data store path handling, and adjusts lock acquisition semantics in transaction execution for non-hash-partitioned tables.

Changes

Cohort / File(s) Summary
Build Configuration
CMakeLists.txt
Changed ELOQ_MODULE_ENABLED default from OFF to ON; removed USE_ONE_ELOQDSS_PARTITION_ENABLED option; removed reverse synchronization logic for WITH_DATA_STORE based on WITH_LOG_STATE; removed conditional USE_ONE_ELOQDSS_PARTITION compile definition emission.
DataSubstrate Lifecycle Refactor
core/include/data_substrate.h, core/src/data_substrate.cpp
Replaced legacy Initialize*/GetGlobal pattern with singleton API (Instance(), Init(), Start()); introduced InitState enum (NotInitialized, ConfigLoaded, Started); added per-engine lifecycle methods (RegisterEngine(), EnableEngine(), WaitForEnabledEnginesRegistered(), WaitForDataSubstrateStarted()); made constructor private; added synchronization primitives for engine registration and substrate startup; deprecated RegisterEngines() as non-operational alias.
Metrics Initialization
core/src/metrics_init.cpp
Removed EloqKV Redis command metrics wiring block; renamed config key from "enable_busy_round_metrics" to "enable_remote_request_metrics".
Data Store Path Handling
store_handler/eloq_data_store_service/eloq_store_data_store.cpp
Removed path augmentation logic that previously appended "/ds_<shard_id>" suffix to store paths.
Transaction Service Updates
tx_service/src/tx_operation.cpp, tx_service/src/tx_execution.cpp
SetCommitTsOperation constructor now calls Reset() after construction; cluster config read lock acquisition in tx_execution.cpp now gates only for hash-partitioned tables instead of all tables.

Sequence Diagram

sequenceDiagram
    participant Main
    participant Engine
    participant DataSubstrate
    participant Config
    participant Services

    Main->>DataSubstrate: Init(config_file_path)
    activate DataSubstrate
    DataSubstrate->>Config: Load config from path
    Config-->>DataSubstrate: Config loaded
    DataSubstrate->>DataSubstrate: Set init_state_=ConfigLoaded
    deactivate DataSubstrate
    DataSubstrate-->>Main: success

    Main->>DataSubstrate: EnableEngine(engine_type)
    activate DataSubstrate
    DataSubstrate->>DataSubstrate: Mark engine as enabled
    deactivate DataSubstrate

    Engine->>DataSubstrate: RegisterEngine(engine_type, factory, ...)
    activate DataSubstrate
    DataSubstrate->>DataSubstrate: Validate init_state_=ConfigLoaded
    DataSubstrate->>DataSubstrate: Store engine registration
    DataSubstrate->>DataSubstrate: Signal engine_registered
    deactivate DataSubstrate
    DataSubstrate-->>Engine: success

    Engine->>DataSubstrate: WaitForEnabledEnginesRegistered(timeout)
    activate DataSubstrate
    DataSubstrate->>DataSubstrate: Block until all enabled engines registered
    deactivate DataSubstrate
    DataSubstrate-->>Engine: all registered or timeout

    Main->>DataSubstrate: Start()
    activate DataSubstrate
    DataSubstrate->>Services: Initialize log storage
    Services-->>DataSubstrate: ready
    DataSubstrate->>Services: Initialize metrics/storage
    Services-->>DataSubstrate: ready
    DataSubstrate->>DataSubstrate: Set init_state_=Started
    DataSubstrate->>DataSubstrate: Signal data_substrate_started
    deactivate DataSubstrate
    DataSubstrate-->>Main: success

    Engine->>DataSubstrate: WaitForDataSubstrateStarted(timeout)
    activate DataSubstrate
    DataSubstrate->>DataSubstrate: Block until Started state
    deactivate DataSubstrate
    DataSubstrate-->>Engine: Started or timeout
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Key areas requiring careful attention:

  • DataSubstrate state machine transitions — verify correct ordering of Init/Start/RegisterEngine phases and mutex guarding
  • Synchronization primitives — review condition variable usage in WaitForEnabledEnginesRegistered() and WaitForDataSubstrateStarted() for deadlock and spurious wakeup handling
  • Engine registration per-slot storage — confirm that concurrent RegisterEngine() calls from multiple engines do not race or corrupt state
  • Lock gating change in tx_execution.cpp — validate that skipping cluster config lock for non-hash-partitioned tables does not introduce race conditions
  • Path removal in data store — ensure removal of "/ds_<shard_id>" path suffix does not break shard isolation or configuration expectations

Possibly related PRs

  • fix bug and refactor init process #250 — Applies identical DataSubstrate lifecycle refactoring (Init/Instance/Start pattern, RegisterEngine/EnableEngine APIs, InitState state machine) and identical CMakeLists option changes
  • Refactor scan next batch for hash partition #131 — Modifies hash-partitioned scan logic and cluster-config lock gating in tx_execution.cpp, overlapping with lock acquisition gating changes in this PR
  • Restructure for eloqconverge #168 — Modifies DataSubstrate initialization APIs (replacing Initialize/InitializeGlobal/GetGlobal with Instance/Init/Start), directly related to lifecycle refactoring in this PR

Suggested reviewers

  • lzxddz
  • thweetkomputer
  • githubzilla

Poem

🐰 A lifecycle blooms anew, three phases clear—
Init then Start, no globals here!
Engines register with grace and care,
State machines dance through config'd air.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description contains only the template checklist with all items unchecked and no actual description of the changes, objectives, testing performed, or context. Complete the description with: actual changes made, rationale for the refactoring, bugs fixed, testing performed, and check off or address each template item (tests, documentation, issue/RFC references).
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'fix bug and refactor init process' is vague and generic, using non-descriptive terms that don't clearly convey which specific bug is being fixed or the scope of the refactoring. Replace with a more specific title that identifies the main change, e.g., 'Refactor DataSubstrate initialization to use singleton pattern with new lifecycle API' or reference the bug being fixed.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 247a1d0 and df61db7.

📒 Files selected for processing (7)
  • CMakeLists.txt (1 hunks)
  • core/include/data_substrate.h (5 hunks)
  • core/src/data_substrate.cpp (3 hunks)
  • core/src/metrics_init.cpp (2 hunks)
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp (0 hunks)
  • tx_service/src/tx_execution.cpp (1 hunks)
  • tx_service/src/tx_operation.cpp (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/metrics_init.cpp (1)

85-89: Possible copy-paste error in config key.

Line 87 uses "enable_busy_round_metrics" as the config key but assigns the result to metrics::enable_remote_request_metrics. This appears to be a copy-paste error and should likely read "enable_remote_request_metrics".

         // remote request metrics
         metrics::enable_remote_request_metrics = config_reader.GetBoolean(
-            "metrics", "enable_busy_round_metrics", metrics::enable_tx_metrics);
+            "metrics", "enable_remote_request_metrics", metrics::enable_tx_metrics);
🧹 Nitpick comments (5)
tx_service/src/tx_operation.cpp (1)

798-803: Constructor-level Reset for SetCommitTsOperation looks correct; consider ordering with TX_TRACE_ASSOCIATE

Initializing hd_result_ via Reset() in the constructor is a good way to guarantee a clean handler state before the op is ever used, and it aligns with how other ops rely on Reset() to clear state.

One minor nit: elsewhere in this file, the pattern is consistently:

SetCommitTsOperation::SetCommitTsOperation(TransactionExecution* txm)
    : hd_result_(txm) {
    TX_TRACE_ASSOCIATE(this, &hd_result_);
    // ...
}

For consistency and to avoid any future surprises if tracing starts depending on pre‑associated state, you may want to reorder as:

SetCommitTsOperation::SetCommitTsOperation(TransactionExecution* txm)
    : hd_result_(txm) {
    TX_TRACE_ASSOCIATE(this, &hd_result_);
    Reset();
}

Functionally, the current change is fine; this is just a style/ordering suggestion.

core/src/data_substrate.cpp (3)

355-365: Misleading comment about mutex release.

The comment at Line 357 states "instance_mutex_ is released here when lock goes out of scope" but the lock variable (Line 272) doesn't go out of scope until the function returns at Line 367. Consider restructuring to release instance_mutex_ before notifying, or correct the comment.

     LOG(INFO) << "Engine " << static_cast<int>(engine_type)
               << " registered successfully";
-    // Note: instance_mutex_ is released here when lock goes out of scope
 
     // Set engine_registered and notify waiting threads
     // Use engine_registration_mutex_ to protect engine_registered flag
     {
         std::lock_guard<std::mutex> reg_lock(instance.engine_registration_mutex_);
         instance.engines_[engine_index].engine_registered = true;
         instance.engine_registration_cv_.notify_all();
     }
+    // Note: instance_mutex_ is released when function returns
 
     return true;
 }

149-157: Config file parsed twice during initialization.

The config file is parsed in both Init() (Line 103) and Start() (Line 150). Consider caching the parsed configuration during Init() to avoid redundant file I/O and parsing.


467-483: Deprecated method duplicates Sequences table insertion.

The deprecated RegisterEngines() adds the Sequences table (Lines 476-478), which is already added in Init() (Lines 119-121). While try_emplace handles duplicates safely, this duplication is confusing. Consider removing the duplicate insertion or documenting the intentional redundancy.

core/include/data_substrate.h (1)

288-295: Clarify that instance_mutex_ is defined in the implementation file.

The comment at Lines 293-294 mentions instance_mutex_ which is not visible in this header. Consider adding a note that this mutex is defined as a file-static variable in the implementation, to avoid confusion for header readers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7f8e72 and 9470242.

📒 Files selected for processing (6)
  • CMakeLists.txt (1 hunks)
  • core/include/data_substrate.h (5 hunks)
  • core/src/data_substrate.cpp (3 hunks)
  • core/src/metrics_init.cpp (1 hunks)
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp (0 hunks)
  • tx_service/src/tx_operation.cpp (1 hunks)
💤 Files with no reviewable changes (1)
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
core/include/data_substrate.h (1)
core/src/data_substrate.cpp (15)
  • Instance (82-86)
  • Instance (82-82)
  • Init (88-128)
  • Init (88-88)
  • Start (130-196)
  • Start (130-130)
  • RegisterEngine (260-368)
  • RegisterEngine (260-267)
  • EnableEngine (370-396)
  • EnableEngine (370-370)
  • WaitForEnabledEnginesRegistered (398-443)
  • WaitForEnabledEnginesRegistered (398-399)
  • WaitForDataSubstrateStarted (445-465)
  • WaitForDataSubstrateStarted (445-446)
  • DataSubstrate (198-210)
🔇 Additional comments (3)
CMakeLists.txt (1)

17-17: Verify impact of changing default from OFF to ON.

Changing ELOQ_MODULE_ENABLED default from OFF to ON is a potentially breaking change for downstream builds that relied on the previous default. Ensure all dependent projects and CI pipelines are updated accordingly.

core/include/data_substrate.h (2)

111-124: LGTM on InitState enum and EngineConfig struct.

The InitState enum provides clear lifecycle states, and the EngineConfig struct additions (enable_engine, engine_registered) are well-documented with comments explaining their purpose and synchronization semantics.


171-200: Well-designed public API for singleton lifecycle.

The new API clearly separates concerns: Init() for config loading, Start() for service initialization, RegisterEngine() for engine registration, and WaitFor* methods for synchronization. The use of rvalue references for vectors enables efficient ownership transfer.

Comment thread core/include/data_substrate.h
Comment thread core/src/data_substrate.cpp
Comment thread core/src/data_substrate.cpp
Comment thread core/src/data_substrate.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/data_substrate.cpp (1)

786-791: Bug: Incorrect error message.

The error message says "when set enable_io_uring, should also set enable_io_uring" which is a copy-paste error. It should reference raft_log_async_fsync in the first part.

     if (raft_log_async_fsync && !enable_io_uring)
     {
-        LOG(ERROR) << "Invalid config: when set `enable_io_uring`, "
+        LOG(ERROR) << "Invalid config: when set `raft_log_async_fsync`, "
                       "should also set `enable_io_uring`.";
         return false;
     }
♻️ Duplicate comments (3)
core/src/data_substrate.cpp (3)

372-399: Previous bug fixed: instance reference now used correctly.

The past review flagged that engine_registration_mutex_ was accessed without the instance. prefix. This has been corrected at line 394.


401-446: Previous bugs fixed: instance references now used correctly.

The past review flagged missing instance. prefixes for engine_registration_mutex_ and engine_registration_cv_. Both are now correctly referenced through the instance.


448-468: Previous bug fixed: instance reference now used correctly for condition variable.

The past review flagged missing instance. prefix for data_substrate_started_cv_. This is now correctly referenced at line 462. The static instance_mutex_ usage at line 452 is intentional and correct.

🧹 Nitpick comments (2)
core/include/data_substrate.h (2)

118-124: Consider adding default member initializers to EngineConfig.

The enable_engine and engine_registered bool fields lack default initializers. While the deprecated RegisterEngines() initializes the engines_[] array, if EngineConfig is used in isolation or copied, uninitialized bools can lead to undefined behavior.

     struct EngineConfig
     {
-        txservice::CatalogFactory *catalog_factory;
+        txservice::CatalogFactory *catalog_factory{nullptr};
         txservice::TableEngine engine;
-        bool enable_engine;      // Set by main thread before engine init
-        bool engine_registered;  // Set true when engine calls RegisterEngine()
+        bool enable_engine{false};      // Set by main thread before engine init
+        bool engine_registered{false};  // Set true when engine calls RegisterEngine()
     };

171-178: Inconsistent static/non-static API design.

Init() is declared as static (line 173), but Start() is a non-static instance method (line 178). Both operate on the singleton via instance_mutex_ and Instance(). For consistency and clarity, consider making Start() static as well:

     // Complete initialization by starting all services
     // Must be called after Init() and all engines have registered
     // Uses config file path saved during Init()
-    bool Start();
+    static bool Start();

This would allow consistent usage: DataSubstrate::Init(path) followed by DataSubstrate::Start().

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9470242 and 30e2bd1.

📒 Files selected for processing (6)
  • CMakeLists.txt (1 hunks)
  • core/include/data_substrate.h (5 hunks)
  • core/src/data_substrate.cpp (3 hunks)
  • core/src/metrics_init.cpp (2 hunks)
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp (0 hunks)
  • tx_service/src/tx_operation.cpp (1 hunks)
💤 Files with no reviewable changes (1)
  • store_handler/eloq_data_store_service/eloq_store_data_store.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • core/src/data_substrate.cpp
🧬 Code graph analysis (1)
core/include/data_substrate.h (1)
core/src/data_substrate.cpp (15)
  • Instance (82-86)
  • Instance (82-82)
  • Init (88-128)
  • Init (88-88)
  • Start (130-197)
  • Start (130-130)
  • RegisterEngine (261-370)
  • RegisterEngine (261-268)
  • EnableEngine (372-399)
  • EnableEngine (372-372)
  • WaitForEnabledEnginesRegistered (401-446)
  • WaitForEnabledEnginesRegistered (401-402)
  • WaitForDataSubstrateStarted (448-468)
  • WaitForDataSubstrateStarted (448-449)
  • DataSubstrate (199-211)
🔇 Additional comments (9)
tx_service/src/tx_operation.cpp (1)

798-803: SetCommitTsOperation now starts from a clean handler state

Calling Reset() in the constructor to clear hd_result_ looks correct and should prevent stale state when the op is (re)used, while remaining safe given Reset() is idempotent and only touches the handler result. Please ensure existing tests (or new ones) cover the init/reuse path for SetCommitTsOperation under the new lifecycle.

core/src/metrics_init.cpp (1)

86-87: Config key change looks correct.

The config key now correctly reads from "enable_remote_request_metrics" instead of "enable_busy_round_metrics", which aligns with the variable name. The default fallback to metrics::enable_tx_metrics is reasonable since remote request metrics typically require transaction metrics to be enabled.

core/include/data_substrate.h (3)

111-116: Well-designed lifecycle state machine.

The InitState enum clearly defines the three lifecycle stages: NotInitializedConfigLoadedStarted. This provides clear semantics for the initialization workflow.


238-241: Good: Copy and move operations properly deleted.

The singleton pattern is now correctly enforced with deleted copy/move constructors and assignment operators, preventing accidental duplication.


293-300: Synchronization design is appropriate.

The separation between engine_registration_mutex_ (for engine registration coordination) and the static instance_mutex_ (for Init/Start lifecycle) correctly avoids contention and provides clear ownership semantics for each concern.

core/src/data_substrate.cpp (4)

82-86: Correct Meyer's singleton implementation.

The static local variable pattern provides thread-safe lazy initialization in C++11 and later.


88-128: Init() implementation looks correct.

The two-phase initialization pattern with proper mutex protection and state checking is well-implemented. The idempotent behavior (returning true if already initialized) is appropriate.


191-194: Notification while holding mutex is intentional.

The notify_all() on data_substrate_started_cv_ is called while holding instance_mutex_, which is correct here. The waiting threads in WaitForDataSubstrateStarted() will re-acquire the mutex after waking and check init_state_, ensuring proper synchronization.


356-370: Verify lock release ordering between instance_mutex_ and engine_registration_mutex_.

The code releases instance_mutex_ (implicitly at end of scope) before acquiring engine_registration_mutex_ to set engine_registered. While this appears safe because:

  1. Start() also requires instance_mutex_ and only proceeds when init_state_ == ConfigLoaded
  2. WaitForEnabledEnginesRegistered() will wait for engine_registered to become true

There's a brief window where the engine data is written but engine_registered is still false. This seems intentional to avoid holding two mutexes simultaneously. Consider adding a comment clarifying this design decision.

Comment thread core/src/data_substrate.cpp Outdated
Comment on lines 470 to 486
[[deprecated("Engines should call RegisterEngine() directly")]]
bool DataSubstrate::RegisterEngines()
{
// Initialize engines array to default values
engines_[0] = {nullptr, txservice::TableEngine::EloqSql, false, false};
engines_[1] = {nullptr, txservice::TableEngine::EloqKv, false, false};
engines_[2] = {nullptr, txservice::TableEngine::EloqDoc, false, false};

// Add Sequences table (system table, not engine-specific)
prebuilt_tables_.try_emplace(
txservice::Sequences::table_name_,
std::string(txservice::Sequences::table_name_sv_));

// Note: Engines must register themselves via RegisterEngine() API
// This method no longer performs engine-specific registration
return true;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cd core/src && rg -n 'engines_\[' --type cpp -C 2

Repository: eloqdata/tx_service

Length of output: 3380


🏁 Script executed:

cd core/src && rg -n 'engines_' --type hpp --type h -C 2 | head -80

Repository: eloqdata/tx_service

Length of output: 90


🏁 Script executed:

cd core && find . -name "data_substrate.h*" -o -name "data_substrate.hpp"

Repository: eloqdata/tx_service

Length of output: 89


🏁 Script executed:

cd core && cat -n include/data_substrate.h | grep -A 10 -B 2 "engines_\|EngineConfig"

Repository: eloqdata/tx_service

Length of output: 1148


🏁 Script executed:

cd core/src && sed -n '100,250p' data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 5492


🏁 Script executed:

cd core/src && rg -n "void DataSubstrate::Init" -A 30 data_substrate.cpp

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

cd core/src && sed -n '235,250p' tx_service_init.cpp

Repository: eloqdata/tx_service

Length of output: 587


🏁 Script executed:

cd core/src && rg -n "RegisterEngines\(\)" data_substrate.cpp

Repository: eloqdata/tx_service

Length of output: 187


🏁 Script executed:

cd core/src && rg -n "RegisterEngines" --type cpp

Repository: eloqdata/tx_service

Length of output: 225


Fix uninitialized engines_ array access in the new flow.

The engines_ array is being read with uninitialized member values. In InitializeTxService() (tx_service_init.cpp:242), the code reads engines_[i].catalog_factory before any initialization has occurred. Since RegisterEngines() is no longer called in the new flow and EngineConfig has no default constructor, array members contain indeterminate values.

This causes undefined behavior when uninitialized pointers are passed to the TxService constructor.

Add initialization in the DataSubstrate constructor:

for (size_t i = 0; i < NUM_EXTERNAL_ENGINES; i++)
{
    engines_[i] = {nullptr, txservice::TableEngine::EloqSql, false, false};
}

Alternatively, add a default constructor to EngineConfig struct to ensure zero-initialization.

🤖 Prompt for AI Agents
In core/src/data_substrate.cpp around lines 470 to 486, engines_ is left with
indeterminate members because RegisterEngines() is no longer invoked;
InitializeTxService() reads engines_[i].catalog_factory before initialization,
causing undefined behavior. Fix by ensuring engines_ elements are initialized:
either initialize the engines_ array in DataSubstrate's constructor (loop
setting each EngineConfig to safe defaults: null catalog_factory and sensible
default enum/flags) or add a default constructor to EngineConfig that
zero-initializes/allows safe defaults so any uninitialized array elements are
well-defined before use.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (4)
core/include/data_substrate.h (1)

234-241: LGTM - Singleton copy/move semantics properly deleted.

The copy and move constructors/assignment operators are now properly deleted, preventing accidental duplication of the singleton instance.

core/src/data_substrate.cpp (3)

372-399: LGTM - EnableEngine() correctly uses instance reference.

The past review concern about missing instance. prefix has been addressed. All member accesses are now properly qualified through the instance reference.


401-446: LGTM - WaitForEnabledEnginesRegistered() correctly uses instance reference.

The past review concerns about missing instance. prefix have been addressed. The wait loop correctly handles spurious wakeups by re-checking the condition after each wake.


448-468: LGTM - WaitForDataSubstrateStarted() correctly uses instance reference.

The past review concern about missing instance. prefix for data_substrate_started_cv_ has been addressed. The use of the predicate form of wait_until correctly handles spurious wakeups.

🧹 Nitpick comments (1)
core/src/data_substrate.cpp (1)

149-157: Config file parsed twice during initialization.

The config file is parsed in both Init() (line 103) and Start() (line 150). While functionally correct, this introduces redundant I/O and parsing. Consider caching the parsed INIReader or the extracted configuration values during Init().

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30e2bd1 and 909f22e.

📒 Files selected for processing (3)
  • core/include/data_substrate.h (5 hunks)
  • core/src/data_substrate.cpp (3 hunks)
  • core/src/log_init.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • core/src/data_substrate.cpp
🧬 Code graph analysis (1)
core/include/data_substrate.h (1)
core/src/data_substrate.cpp (15)
  • Instance (82-86)
  • Instance (82-82)
  • Init (88-128)
  • Init (88-88)
  • Start (130-197)
  • Start (130-130)
  • RegisterEngine (261-370)
  • RegisterEngine (261-268)
  • EnableEngine (372-399)
  • EnableEngine (372-372)
  • WaitForEnabledEnginesRegistered (401-446)
  • WaitForEnabledEnginesRegistered (401-402)
  • WaitForDataSubstrateStarted (448-468)
  • WaitForDataSubstrateStarted (448-449)
  • DataSubstrate (199-211)
🔇 Additional comments (4)
core/include/data_substrate.h (1)

292-299: LGTM - Synchronization primitives for coordinated startup.

The separation of engine_registration_mutex_/engine_registration_cv_ for engine registration and data_substrate_started_cv_ (using the static instance_mutex_) for start signaling provides proper coordination between engine threads and the main initialization flow.

core/src/log_init.cpp (1)

39-42: LGTM - AWS credential declarations aligned with log state types.

The preprocessor guard change correctly aligns the aws_access_key_id and aws_secret_key declarations with log state types (LOG_STATE_TYPE_RKDB_S3 or LOG_STATE_TYPE_RKDB_GCS) rather than data store types. This is consistent with the usage at lines 285-292 where these credentials are accessed within #if defined(LOG_STATE_TYPE_RKDB_S3).

core/src/data_substrate.cpp (2)

79-86: LGTM - Thread-safe singleton implementation.

The combination of Meyer's singleton (static local variable in Instance()) with a static mutex for protecting state transitions provides thread-safe initialization.


88-128: LGTM - Init() method properly implements configuration loading phase.

The method correctly:

  • Guards against double initialization
  • Saves the config path for the later Start() phase
  • Properly resets init_state_ to NotInitialized on failure
  • Transitions to ConfigLoaded on success

Comment thread core/include/data_substrate.h
Comment thread core/src/data_substrate.cpp Outdated
Comment on lines 356 to 367
LOG(INFO) << "Engine " << static_cast<int>(engine_type)
<< " registered successfully";
// Note: instance_mutex_ is released here when lock goes out of scope

// Set engine_registered and notify waiting threads
// Use engine_registration_mutex_ to protect engine_registered flag
{
std::lock_guard<std::mutex> reg_lock(
instance.engine_registration_mutex_);
instance.engines_[engine_index].engine_registered = true;
instance.engine_registration_cv_.notify_all();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all lock acquisitions for both mutexes in the file
rg -n 'instance_mutex_|engine_registration_mutex_' core/src/data_substrate.cpp

Repository: eloqdata/tx_service

Length of output: 851


🏁 Script executed:

#!/bin/bash
# Read the actual code around lines 270-370 to see lock scope and function boundaries
sed -n '260,370p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 4948


🏁 Script executed:

#!/bin/bash
# Read around line 394 and 405 to see how engine_registration_mutex_ is used in other methods
sed -n '380,420p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1689


🏁 Script executed:

#!/bin/bash
# Read around line 452 to see if any method acquires engine_registration_mutex_ then instance_mutex_
sed -n '440,480p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1579


🏁 Script executed:

#!/bin/bash
# Get all method definitions to ensure we've checked all lock orderings
grep -n "bool DataSubstrate::\|void DataSubstrate::" core/src/data_substrate.cpp

Repository: eloqdata/tx_service

Length of output: 559


🏁 Script executed:

#!/bin/bash
# Check Init and Start methods for lock patterns
sed -n '88,212p' core/src/data_substrate.cpp | grep -n "instance_mutex_\|engine_registration_mutex_"

Repository: eloqdata/tx_service

Length of output: 236


🏁 Script executed:

#!/bin/bash
# Final verification: check Shutdown method to be thorough
sed -n '213,260p' core/src/data_substrate.cpp | grep -n "instance_mutex_\|engine_registration_mutex_"

Repository: eloqdata/tx_service

Length of output: 45


Correct the misleading comment about lock scope in RegisterEngine().

The comment at line 358 incorrectly states that instance_mutex_ is released at that point. The lock's scope extends to the end of the function (line 370), meaning instance_mutex_ remains held when engine_registration_mutex_ is acquired at line 363. This nested lock pattern is safe (consistent ordering across all methods prevents deadlock), but the comment should clarify the actual scope.

    LOG(INFO) << "Engine " << static_cast<int>(engine_type)
              << " registered successfully";
-   // Note: instance_mutex_ is released here when lock goes out of scope
 
    // Set engine_registered and notify waiting threads
-   // Use engine_registration_mutex_ to protect engine_registered flag
+   // Note: instance_mutex_ is still held here; acquiring engine_registration_mutex_
+   // to protect engine_registered flag and notify waiting threads
    {
        std::lock_guard<std::mutex> reg_lock(
            instance.engine_registration_mutex_);
        instance.engines_[engine_index].engine_registered = true;
        instance.engine_registration_cv_.notify_all();
    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
LOG(INFO) << "Engine " << static_cast<int>(engine_type)
<< " registered successfully";
// Note: instance_mutex_ is released here when lock goes out of scope
// Set engine_registered and notify waiting threads
// Use engine_registration_mutex_ to protect engine_registered flag
{
std::lock_guard<std::mutex> reg_lock(
instance.engine_registration_mutex_);
instance.engines_[engine_index].engine_registered = true;
instance.engine_registration_cv_.notify_all();
}
LOG(INFO) << "Engine " << static_cast<int>(engine_type)
<< " registered successfully";
// Set engine_registered and notify waiting threads
// Note: instance_mutex_ is still held here; acquiring engine_registration_mutex_
// to protect engine_registered flag and notify waiting threads
{
std::lock_guard<std::mutex> reg_lock(
instance.engine_registration_mutex_);
instance.engines_[engine_index].engine_registered = true;
instance.engine_registration_cv_.notify_all();
}
🤖 Prompt for AI Agents
In core/src/data_substrate.cpp around lines 356 to 367, the comment saying
"instance_mutex_ is released here when lock goes out of scope" is incorrect
because the instance_mutex_ lock actually remains held until the end of the
function; update the comment to state that instance_mutex_ is still held here,
explain that engine_registration_mutex_ is being locked while instance_mutex_ is
held, and clarify that this nested locking is safe because the code uses a
consistent lock acquisition order across methods (preventing deadlock). Ensure
the revised comment accurately describes the lock scopes and the reason the
nested lock is safe.

Comment thread core/src/data_substrate.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tx_service/src/tx_execution.cpp (4)

1088-1097: Fix incorrect trace argument: fi_req is undefined here

ProcessTxRequest(ReloadCacheTxRequest &req) passes &fi_req into TX_TRACE_ACTION_WITH_CONTEXT, but fi_req is not in scope in this function. This will not compile and should use &req instead.

-    TX_TRACE_ACTION_WITH_CONTEXT(
-        this,
-        &fi_req,
+    TX_TRACE_ACTION_WITH_CONTEXT(
+        this,
+        &req,
         [this]() -> std::string
         {
             return std::string("\"tx_number\":")
                 .append(std::to_string(this->TxNumber()))
                 .append("\"tx_term\":")
                 .append(std::to_string(this->tx_term_));
         });

6422-6439: Use unlock_op in trace macro instead of undefined lock_op

In PostProcess(ReleaseScanExtraLockOp &unlock_op), the trace macro passes &lock_op, which is not defined in this scope. This will fail to compile and should reference unlock_op.

-    TX_TRACE_ACTION_WITH_CONTEXT(
-        this,
-        &lock_op,
+    TX_TRACE_ACTION_WITH_CONTEXT(
+        this,
+        &unlock_op,
         [this]() -> std::string
         {
             return std::string("\"tx_number\":")
                 .append(std::to_string(this->TxNumber()))
                 .append("\"tx_term\":")
                 .append(std::to_string(this->tx_term_));
         });

6446-6465: Fix trace macro to use kickout_data_op in Process(KickoutDataOp&)

Process(KickoutDataOp &kickout_data_op) passes &kickout_data_all_op to the trace macro, but that name isn’t in scope in this function (it belongs to a different overload). This breaks compilation.

-    TX_TRACE_ACTION_WITH_CONTEXT(
-        this,
-        &kickout_data_all_op,
+    TX_TRACE_ACTION_WITH_CONTEXT(
+        this,
+        &kickout_data_op,
         [this]() -> std::string
         {
             return std::string("\"tx_number\":")
                 .append(std::to_string(this->TxNumber()))
                 .append("\"tx_term\":")
                 .append(std::to_string(this->tx_term_));
         });

7712-7723: Correct trace macro argument in PostProcess(BatchReadOperation&)

PostProcess(BatchReadOperation &batch_read_op) calls TX_TRACE_ACTION_WITH_CONTEXT with &read, but read is not declared in this scope. It should trace batch_read_op.

-    TX_TRACE_ACTION_WITH_CONTEXT(
-        this,
-        &read,
+    TX_TRACE_ACTION_WITH_CONTEXT(
+        this,
+        &batch_read_op,
         [this]() -> std::string
         {
             return std::string("\"tx_number\":")
                 .append(std::to_string(this->TxNumber()))
                 .append("\"tx_term\":")
                 .append(std::to_string(this->tx_term_));
         });
♻️ Duplicate comments (1)
core/include/data_substrate.h (1)

234-248: Singleton semantics and init state tracking are now correctly enforced

Making the constructor private, deleting copy/move ctor/assignment, and adding init_state_ with a default of InitState::NotInitialized plus config_file_path_ storage aligns the header with the singleton implementation in the .cpp and prevents accidental duplication or misuse of the global instance. This also addresses the earlier review concern about copy/move operations on the singleton.

🧹 Nitpick comments (1)
core/include/data_substrate.h (1)

171-201: Consider enforcing the “engines registered before Start()” contract at runtime

The Init/Start + RegisterEngine/EnableEngine + Wait* API cleanly separates configuration and startup phases. Right now the “must be called after Init() and all engines have registered” constraint for Start() is documented but not enforced; a misordered call sequence would only be visible via engine behavior.

If you want stronger safeguards, consider having Start() either (a) internally call WaitForEnabledEnginesRegistered() with a configurable timeout, or (b) in debug builds, assert that all enable_engine slots have engine_registered == true before proceeding, to fail fast on configuration mistakes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 909f22e and 247a1d0.

📒 Files selected for processing (2)
  • core/include/data_substrate.h (5 hunks)
  • tx_service/src/tx_execution.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/tx_execution.cpp
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/src/tx_execution.cpp
🧬 Code graph analysis (1)
core/include/data_substrate.h (1)
core/src/data_substrate.cpp (15)
  • Instance (82-86)
  • Instance (82-82)
  • Init (88-128)
  • Init (88-88)
  • Start (130-197)
  • Start (130-130)
  • RegisterEngine (261-370)
  • RegisterEngine (261-268)
  • EnableEngine (372-399)
  • EnableEngine (372-372)
  • WaitForEnabledEnginesRegistered (401-446)
  • WaitForEnabledEnginesRegistered (401-402)
  • WaitForDataSubstrateStarted (448-468)
  • WaitForDataSubstrateStarted (448-449)
  • DataSubstrate (199-211)
🔇 Additional comments (2)
core/include/data_substrate.h (1)

111-124: InitState and EngineConfig defaults make the lifecycle/engine array safe by default

The explicit InitState enum and default member initialization in EngineConfig ensure a well-defined initial state for the singleton and every slot in engines_, avoiding UB when scanning or toggling engines before explicit configuration.

tx_service/src/tx_execution.cpp (1)

2371-2383: Scope cluster-config read lock to hash-partitioned tables only (LGTM)

Conditioning the cluster-config read lock on table_name.IsHashPartitioned() is consistent with how bucket metadata is consumed later and avoids unnecessary metadata locking for non-hash scans. No additional issues spotted here.

Comment on lines +293 to +299
// Synchronization for engine registration (Option A: condition_variable)
std::condition_variable engine_registration_cv_;
std::mutex engine_registration_mutex_;

// Synchronization for DataSubstrate::Start() completion (engines wait on
// this) Uses instance_mutex_ and init_state_ instead of separate mutex/flag
std::condition_variable data_substrate_started_cv_;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for all uses of engine_registered and enable_engine with context
rg -n "engine_registered|enable_engine" core -C3

Repository: eloqdata/tx_service

Length of output: 3765


🏁 Script executed:

#!/bin/bash
# Get context around RegisterEngine() and the line 316 check
sed -n '300,370p' core/src/data_substrate.cpp | cat -n

# Also check around the WaitForEngineRegistration function
sed -n '400,450p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 4825


🏁 Script executed:

#!/bin/bash
# Let's see the full RegisterEngine function to understand the overall structure
sed -n '280,375p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 4128


🏁 Script executed:

#!/bin/bash
# Find the start of RegisterEngine function and see where locks are acquired
rg -n "bool DataSubstrate::RegisterEngine" core -A 50 | head -80

Repository: eloqdata/tx_service

Length of output: 3764


🏁 Script executed:

#!/bin/bash
# Check where init_state_ transitions happen and what guards them
rg -n "init_state_\s*=" core/src/data_substrate.cpp -B2 -A2

Repository: eloqdata/tx_service

Length of output: 1734


🏁 Script executed:

#!/bin/bash
# Check the Start() function to see where init_state_ is set to Started and what guards it
sed -n '130,195p' core/src/data_substrate.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 2489


🏁 Script executed:

#!/bin/bash
# Check WaitForDataSubstrateStarted() implementation
rg -n "bool DataSubstrate::WaitForDataSubstrateStarted" core -A 30

Repository: eloqdata/tx_service

Length of output: 2106


Fix synchronization guard for engine_registered in RegisterEngine()

The duplicate-registration check at line 316 reads engine_registered under instance_mutex_, but the flag is written under engine_registration_mutex_ (line 365) and read under engine_registration_mutex_ elsewhere (lines 415–416, 436–437 in WaitForEnabledEnginesRegistered()). This inconsistency creates a potential data race: EnableEngine() modifies engine_registered under engine_registration_mutex_ while RegisterEngine() reads it under a different mutex.

The read at line 316 should be moved inside the engine_registration_mutex_ guard (similar to lines 361–367) to maintain consistent synchronization.

The data_substrate_started_cv_ synchronization is correct: Start() transitions init_state_ and notifies under instance_mutex_ (lines 59, 65), and WaitForDataSubstrateStarted() waits under the same mutex (line 452).

🤖 Prompt for AI Agents
In core/include/data_substrate.h around lines 293 to 299 (and specifically the
RegisterEngine() body near line ~316), the duplicate-registration check reads
the engine_registered flag while holding instance_mutex_, but that flag is
written and otherwise read under engine_registration_mutex_, causing a potential
data race; move the read/duplicate-check of engine_registered into the scope
protected by engine_registration_mutex_ (i.e., acquire
engine_registration_mutex_ before checking engine_registered, matching the
pattern used at lines ~361–367), remove the read under instance_mutex_ or
re-check under engine_registration_mutex_ before proceeding, and ensure any
notifications or subsequent writes to engine_registered remain under
engine_registration_mutex_ to keep synchronization consistent.

public:
enum class InitState
{
NotInitialized,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotInitialized -> UnInitialized ?

@liunyl liunyl merged commit 213ff3a into main Dec 3, 2025
2 of 3 checks passed
@liunyl liunyl deleted the eloqdb branch December 3, 2025 09:51
@coderabbitai coderabbitai Bot mentioned this pull request Mar 11, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants