Open-source log service, host manager, and tx-log-protos in-tree (BSL) by liunyl · Pull Request #509 · eloqdata/tx_service

liunyl · 2026-06-15T07:10:34Z

Summary

Open-source the log service, host manager, and tx-log-protos by folding them directly into data_substrate as in-tree source under the repository's root Business Source License 2.0 (the same model as eloqstore), and remove the build toggles that distinguished the open vs proprietary variants.

Absorb eloq_log_service/ and tx_service/raft_host_manager/ (were gitignored nested clones) and tx_service/tx-log-protos/ (was a submodule) as tracked in-tree files. Drop tx-log-protos's GPL/AGPL LICENSE (now covered by the root BSL).
Remove the open-source log_service submodule and build_log_service.cmake.
Remove the OPEN_LOG_SERVICE option; always build the in-tree eloq_log_service via build_eloq_log_service.cmake (target logservice) under WITH_LOG_SERVICE.
Remove the FORK_HM_PROCESS option; the host manager is always built, installed, and forked.
Collapse the now-dead OPEN_LOG_SERVICE / FORK_HM_PROCESS preprocessor guards, keeping the eloq / forked code paths.
Update module docs (01, 06, 07, 08, 10, README) and CLAUDE.md.

WITH_LOG_SERVICE is unchanged. Paired with a companion PR in eloqdata/eloqkv that bumps this submodule and removes the matching CI clone/symlink plumbing.

Test plan

cmake configure + cmake --build (Debug, WITH_DATA_STORE=ELOQDSS_ELOQSTORE, WITH_LOG_STATE=ROCKSDB) succeeds; produces eloqkv and data_substrate/host_manager.
Dangling-reference greps clean (OPEN_LOG_SERVICE, FORK_HM_PROCESS, build_log_service.cmake, removed submodule sections).
Preprocessor-guard collapses reviewed: kept the eloq LogServer constructors and the real braft LogAgent paths (not the stubs); #if/#endif balance confirmed by the build.
Live single-node smoke test (not run locally; behavior is unchanged — same sources/paths, only dead branches removed).

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Build Changes
- Renamed build option OPEN_LOG_SERVICE to WITH_LOG_SERVICE and updated related build wiring.
- Log service now consistently uses the in-tree Raft-based implementation when enabled.
- Host manager is now always built and installed.
New Features
- Added/expanded transaction-log cloud backends support for S3 and GCS.
Documentation
- Refreshed architecture and build-time documentation to reflect the in-tree log service as canonical.
CI/Tests
- Added a log-service-focused CI workflow and broadened log service test coverage (including cloud-backed scenarios).

Fold the previously-separate eloq_log_service (log service) and raft_host_manager (host manager) repos, plus the tx-log-protos submodule, directly into data_substrate as in-tree source. They are now covered by the repository's root Business Source License 2.0 (like eloqstore); the open-vs- proprietary build toggles are removed. - Absorb eloq_log_service/ and tx_service/raft_host_manager/ (were gitignored nested clones) and tx_service/tx-log-protos/ (was a submodule) as tracked files. Drop tx-log-protos's GPL/AGPL LICENSE (now under root BSL). - Remove the open-source log_service submodule and build_log_service.cmake. - Remove the OPEN_LOG_SERVICE option; always build the in-tree eloq_log_service via build_eloq_log_service.cmake under WITH_LOG_SERVICE. - Remove the FORK_HM_PROCESS option; the host manager is always built, installed, and forked. - Collapse the now-dead OPEN_LOG_SERVICE / FORK_HM_PROCESS preprocessor guards, keeping the eloq / forked code paths. - Update module docs (01, 06, 07, 08, 10, README) and CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-15T07:11:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 233f2fcb-310e-4de3-a84c-8865f947655c

📥 Commits

Reviewing files that changed from the base of the PR and between 72f202c and be50d56.

📒 Files selected for processing (2)

.github/workflows/log-service-tests.yml
eloq_log_service/include/log_state.h

🚧 Files skipped from review as they are similar to previous changes (2)

.github/workflows/log-service-tests.yml
eloq_log_service/include/log_state.h

Walkthrough

The PR replaces prior log-service wiring with the in-tree eloq_log_service, adds Raft-backed runtime and storage implementations, updates tx-service and host-manager integration, and refreshes related docs, tests, and CI.

Changes

In-tree log service and host manager

Layer / File(s)	Summary
Build and repository wiring `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `build_log_service.cmake`, `build_eloq_log_service.cmake`, `build_tx_service.cmake`, `tx_service/CMakeLists.txt`, `tx_service/raft_host_manager/*`, `store_handler/eloq_data_store_service/CMakeLists.txt`, `core/src/{data_substrate.cpp,log_init.cpp,tx_service_init.cpp}`, `CLAUDE.md`	Root build flags now use `WITH_LOG_SERVICE`, old open-log wiring is removed, host manager is built and installed unconditionally, INI sources move to `third_party/ini`, and tx-service startup or shutdown paths are updated for the in-tree log service and runtime HM config.
Log service build, contracts, and utilities `eloq_log_service/CMakeLists.txt`, `eloq_log_service/.gitignore`, `eloq_log_service/include/{closure.h,fault_inject.h,glog_error_logging.h,log_utils.h,rocksdb_cloud_config.h,log_service_metrics.h}`	The new log-service tree adds its standalone CMake project, logging and fault-injection helpers, size and flag utilities, cloud-storage config parsing, a Raft completion closure, and a metrics flag.
Log state backends and persistence `eloq_log_service/include/{log_state.h,log_state_memory_impl.h,log_state_rocksdb_impl.h}`, `eloq_log_service/src/{log_state_memory_impl.cpp,log_state_rocksdb_impl.cpp,log_state_rocksdb_cloud_impl.cpp}`	Replay state abstractions track per-node-group metadata and staged operations, while memory, RocksDB, and RocksDB Cloud backends implement log writes, replay lookup, snapshots, purge logic, cloud refill, and replay-size accounting.
Raft service runtime and log flow `eloq_log_service/include/{log_instance.h,log_server.h,log_service.h,log_shipping_agent.h}`, `eloq_log_service/src/{log_instance.cpp,log_server.cpp,launch_sv.cpp,launch_server.cpp,benchmark.cpp,fault_inject.cpp}`	The runtime adds the log server, routed RPC service, Raft state machine, replay-stream shipping, leader or peer management, snapshot handling, recovery paths, and standalone launch or benchmark binaries.
Validation assets and documentation `docs/`, `eloq_log_service/{README.md,how2run.md,logstate.md,blob/images/...}`, `eloq_log_service/include/replay_service_for_test.h`, `eloq_log_service/src/launch_cl.cpp`, `eloq_log_service/src/test_log_state_rocksdb.cpp`, `eloq_log_service/test/`, `.github/workflows/log-service-tests.yml`	Docs now describe `eloq_log_service` as the canonical log service, and new replay helpers, benchmarks, integration tests, cloud tests, and CI jobs exercise write, replay, recovery, snapshot, and RocksDB Cloud flows.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

eloqdata/tx_service#315: Both PRs change the DataSubstrate::Shutdown() RocksDB storage-path guard in core/src/data_substrate.cpp.
eloqdata/tx_service#308: Both PRs change host-manager initialization and fork_hm_process handling in core/src/tx_service_init.cpp.
eloqdata/tx_service#265: Both PRs touch the same RocksDB/log-service startup and shutdown areas in tx-service.

Poem

🐇 I hopped through logs from dune to sky,
where Raft waves hummed and SSTs flew by.
New paths of build and replay gleam,
like carrots packed in a moonlit stream.
A tidy hop, a braver flow—
the log trail shines, all set to go.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/opensource-logservice-hostmanager

liunyl · 2026-06-15T07:12:29Z

Companion PR: eloqdata/eloqkv#512 (bumps this submodule + removes the matching CI plumbing). Merge order: this PR first, then eloqkv#512.

coderabbitai

Actionable comments posted: 14

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

eloq_log_service/.clang-format (1)
1-47: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use the repository root .clang-format as the single source of formatting rules.

Adding eloq_log_service/.clang-format creates a subtree override and can diverge style from the required root configuration.

As per coding guidelines, C++ code should use clang-format configuration from the repository root .clang-format.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/.clang-format` around lines 1 - 47, Remove the
`.clang-format` file from the eloq_log_service directory, as it creates a
subtree override that can cause C++ formatting to diverge from the repository's
root configuration. The repository should rely solely on the root
`.clang-format` file as the single source of truth for formatting rules across
all directories and subdirectories.
Source: Coding guidelines

🟠 Major comments (37)

eloq_log_service/src/INIReader.cpp-52-53 (1)

52-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Typed getters do not enforce the documented fallback semantics.

GetString does not treat whitespace-only values as empty, and GetInteger/GetReal accept partial parses instead of requiring full numeric validity before returning parsed values.

Suggested fix

 string INIReader::GetString(const string &section,
                             const string &name,
                             const string &default_value) const
 {
     const string str = Get(section, name, "");
-    return str.empty() ? default_value : str;
+    const auto first = str.find_first_not_of(" \t\r\n");
+    return (str.empty() || first == string::npos) ? default_value : str;
 }
 ...
 long INIReader::GetInteger(const string &section,
                            const string &name,
                            long default_value) const
 {
     string valstr = Get(section, name, "");
     const char *value = valstr.c_str();
     char *end;
     long n = strtol(value, &end, 0);
-    return end > value ? n : default_value;
+    return (end > value && *end == '\0') ? n : default_value;
 }
 ...
 double INIReader::GetReal(const string &section,
                           const string &name,
                           double default_value) const
 {
     string valstr = Get(section, name, "");
     const char *value = valstr.c_str();
     char *end;
     double n = strtod(value, &end);
-    return end > value ? n : default_value;
+    return (end > value && *end == '\0') ? n : default_value;
 }

Also applies to: 64-66, 75-77

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/INIReader.cpp` around lines 52 - 53, The GetString,
GetInteger, and GetReal methods do not enforce the documented fallback
semantics. For GetString at lines 52-53, after obtaining the string value via
Get(), you must trim whitespace from the result and treat whitespace-only
strings as empty before returning the default value. For GetInteger at lines
64-66 and GetReal at lines 75-77, you must validate that the numeric parsing
consumes the entire string (not just a partial parse) by checking that no
non-whitespace characters remain after the parse operation; if the parse is
incomplete or invalid, return the default value instead of the partially parsed
result.

eloq_log_service/CMakeLists.txt-284-287 (1)

284-287: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Check protoc execution result during configure.

execute_process(...) ignores failure status; if protoc fails, configure may continue and break later on generated source assumptions.

Suggested fix

 execute_process(
-	COMMAND protoc ./${PROTO_NAME}.proto --cpp_out=./
-	WORKING_DIRECTORY ${PROTO_SRC}
+    COMMAND protoc ./${PROTO_NAME}.proto --cpp_out=./
+    WORKING_DIRECTORY ${PROTO_SRC}
+    RESULT_VARIABLE PROTOC_RESULT
+    ERROR_VARIABLE PROTOC_STDERR
 )
+if(NOT PROTOC_RESULT EQUAL 0)
+    message(FATAL_ERROR "protoc failed for ${PROTO_NAME}.proto: ${PROTOC_STDERR}")
+endif()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/CMakeLists.txt` around lines 284 - 287, The execute_process
call for running protoc does not check whether the command succeeds or fails,
which means CMake configuration will continue even if protoc fails to generate
the required source files, causing build failures later. Add a RESULT_VARIABLE
parameter to the execute_process call to capture the exit code, then check if
the result is non-zero and use message(FATAL_ERROR ...) to halt the
configuration with a clear error message if protoc execution fails. This ensures
the configure step fails immediately when protoc cannot generate the necessary
files rather than deferring the problem to the build phase.

build_tx_service.cmake-234-247 (1)

234-247: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve braft include path before composing host_manager include dirs.

HOST_MANAGER_INCLUDE_DIR is assembled using ${BRAFT_INCLUDE_PATH} before find_path(BRAFT_INCLUDE_PATH ...), so the braft include path can be omitted from the include list.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@build_tx_service.cmake` around lines 234 - 247, The
set(HOST_MANAGER_INCLUDE_DIR ...) command uses ${BRAFT_INCLUDE_PATH} before it
has been defined by find_path(BRAFT_INCLUDE_PATH NAMES braft/raft.h). Move the
find_path(BRAFT_INCLUDE_PATH...) call along with the validation block that
checks if BRAFT_INCLUDE_PATH and BRAFT_LIB were found to execute before the
set(HOST_MANAGER_INCLUDE_DIR...) call that references ${BRAFT_INCLUDE_PATH}.
This ensures the BRAFT_INCLUDE_PATH variable is properly resolved before being
added to the HOST_MANAGER_INCLUDE_DIR list.

build_tx_service.cmake-262-270 (1)

262-270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid loop-dependent proto source selection for host_manager.

${PROTO_SRC}/${PROTO_NAME}.pb.cc depends on the final value left by the earlier FOREACH(PROTO_FILE ...) loop, so the selected proto cc file is order-dependent and can miss required symbols.

Use a deterministic explicit proto source (or a curated list) for RaftHM_SOURCES.

Suggested fix

 SET(RaftHM_SOURCES
     ${HOST_MANAGER_SOURCE_DIR}/src/main.cpp
     ${HOST_MANAGER_SOURCE_DIR}/src/raft_host_manager_service.cpp
     ${HOST_MANAGER_SOURCE_DIR}/src/raft_host_manager.cpp
     ${HOST_MANAGER_SOURCE_DIR}/src/ini.c
     ${HOST_MANAGER_SOURCE_DIR}/src/INIReader.cpp
-    ${PROTO_SRC}/${PROTO_NAME}.pb.cc
+    ${PROTO_CC_FILES}
     ${LOG_PROTO_SRC}/log_agent.cpp
     ${LOG_PROTO_SRC}/${LOG_PROTO_NAME}.pb.cc
 )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@build_tx_service.cmake` around lines 262 - 270, The RaftHM_SOURCES variable
uses ${PROTO_NAME}.pb.cc which is set by an earlier FOREACH loop over
PROTO_FILE, making it order-dependent and fragile. Replace the loop-dependent
${PROTO_SRC}/${PROTO_NAME}.pb.cc reference in RaftHM_SOURCES with an explicit,
deterministic proto source file name (either hardcode the specific proto file
name that should be used for host_manager, or set a dedicated variable for the
host_manager proto before the RaftHM_SOURCES definition) to ensure the correct
proto symbols are always included regardless of loop iteration order.

CMakeLists.txt-245-246 (1)

245-246: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard logservice install with WITH_LOG_SERVICE.

build_eloq_log_service.cmake is included conditionally, but install(TARGETS logservice ...) is unconditional. With -DWITH_LOG_SERVICE=OFF, CMake can fail because logservice is undefined.

Suggested fix

-if(WITH_LOG_SERVICE)
-    INCLUDE(build_eloq_log_service.cmake)
-endif()
+if(WITH_LOG_SERVICE)
+    INCLUDE(build_eloq_log_service.cmake)
+endif()
...
-install(TARGETS logservice
-        ARCHIVE DESTINATION lib
-        LIBRARY DESTINATION lib)
+if(WITH_LOG_SERVICE)
+    install(TARGETS logservice
+            ARCHIVE DESTINATION lib
+            LIBRARY DESTINATION lib)
+endif()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CMakeLists.txt` around lines 245 - 246, The install(TARGETS logservice ...)
command is being executed unconditionally, but the logservice target is only
defined when the INCLUDE(build_eloq_log_service.cmake) line is executed inside
the if() block. Move the install(TARGETS logservice ...) command inside the same
if() block that guards the INCLUDE(build_eloq_log_service.cmake) statement, so
that the install command only runs when WITH_LOG_SERVICE is enabled and the
logservice target actually exists.

eloq_log_service/include/replay_service_for_test.h-24-30 (1)

24-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the controller downcast before dereference.

dynamic_cast may return nullptr; dereferencing *cntl before validation can crash the RPC handler.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/replay_service_for_test.h` around lines 24 - 30, The
dynamic_cast of controller to brpc::Controller pointer is not validated before
being dereferenced in the brpc::StreamAccept call. Add a null check immediately
after the dynamic_cast to verify that cntl is not nullptr, and if it is nullptr,
handle the error appropriately (such as setting a failure status and logging an
error) before attempting to dereference cntl with the * operator or pass it to
any functions.

eloq_log_service/include/log_state_rocksdb_impl.h-62-75 (1)

62-75: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deserialize needs runtime key-length validation in release builds.

assert(key.size() == 20) disappears in optimized builds; corrupted keys will still be memcpy’d, causing out-of-bounds reads.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_state_rocksdb_impl.h` around lines 62 - 75, The
key-length validation in the Deserialize method relies on assert(key.size() ==
20), which is compiled out in release builds, leaving the subsequent memcpy
operations vulnerable to out-of-bounds reads if a corrupted or incorrectly-sized
key is passed. Replace the assert with an explicit runtime validation check
(such as an if statement) that verifies key.size() equals 20 bytes and either
returns an error code or throws an exception if the validation fails, ensuring
this safety check persists across all build configurations.

eloq_log_service/test/proto/test.proto-2-2 (1)

2-2: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Package path does not satisfy Buf’s directory rule.

package txlog in this location violates PACKAGE_DIRECTORY_MATCH; this will fail Buf lint unless the file is moved or package/directory conventions are aligned.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/proto/test.proto` at line 2, The package declaration
`txlog` in test.proto does not match the directory structure where the file is
located, violating Buf's PACKAGE_DIRECTORY_MATCH rule. Update the package
statement to align with the directory hierarchy: the package name should reflect
the path `eloq_log_service/test/proto/` (such as `eloq_log_service.test.proto`)
to satisfy Buf's naming conventions and allow the linter to pass.
Source: Linters/SAST tools

eloq_log_service/include/glog_error_logging.h-67-70 (1)

67-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid throwing exceptions from log directory setup.

std::filesystem::create_directories can throw; this init path should use std::error_code and fallback behavior instead of risking process termination.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/glog_error_logging.h` around lines 67 - 70, The
std::filesystem::create_directories call in the directory existence check can
throw an exception and terminate the process. Replace the call to
create_directories(FLAGS_log_dir) with the overload that accepts a
std::error_code reference parameter, then check the error code and implement
appropriate fallback behavior (such as logging a warning or using an alternative
log directory) instead of allowing the exception to propagate.

eloq_log_service/include/log_state_rocksdb_impl.h-102-103 (1)

102-103: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Shard range initialization ignores the start_ts filter.

Per-worker iterators are partitioned from global DB first_ts, not from requested start_ts, so replay scans can include records older than requested.

Also applies to: 121-133

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_state_rocksdb_impl.h` around lines 102 - 103,
The initialization of first_ts and last_ts variables is hardcoded to 0, which
ignores the start_ts filter parameter and causes replay scans to include records
older than requested. Locate where first_ts and last_ts are initialized to 0 and
change them to initialize from the start_ts parameter instead. This same fix
needs to be applied at the additional location mentioned in the comment (around
lines 121-133) where similar timestamp range initialization occurs. This ensures
per-worker iterators respect the requested start_ts boundary rather than using
the global database's first_ts.

eloq_log_service/include/log_service_metrics.h-5-5 (1)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an atomic type for this global runtime flag.

A mutable global bool can race under concurrent reads/writes; metrics toggles are typically touched from multiple threads.

Suggested fix

+#include <atomic>
@@
-inline bool enable_log_service_metrics = false;
+inline std::atomic_bool enable_log_service_metrics{false};

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_service_metrics.h` at line 5, The global
variable enable_log_service_metrics is declared as a plain bool, which is not
thread-safe for concurrent access from multiple threads and can lead to data
races. Replace the bool type with std::atomic<bool> to ensure atomic operations
and thread-safe access to this runtime flag that is shared across threads.

eloq_log_service/include/rocksdb_cloud_config.h-166-180 (1)

166-180: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a fallback when no cloud backend macro is selected.

status is only defined in S3/GCS branches. With LOG_STATE_TYPE_RKDB_CLOUD but no backend macro, this function fails to compile.

Suggested fix

 inline rocksdb::Status NewCloudFileSystem(
     const rocksdb::CloudFileSystemOptions &cfs_options,
     rocksdb::CloudFileSystem **cfs)
 {
 `#if` defined(LOG_STATE_TYPE_RKDB_S3)
@@
 `#elif` defined(LOG_STATE_TYPE_RKDB_GCS)
@@
+#else
+    return rocksdb::Status::InvalidArgument(
+        "No cloud backend selected: define LOG_STATE_TYPE_RKDB_S3 or LOG_STATE_TYPE_RKDB_GCS");
 `#endif`
-    return status;
+    return status;
 };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/rocksdb_cloud_config.h` around lines 166 - 180, The
NewCloudFileSystem function declares the status variable only within the
LOG_STATE_TYPE_RKDB_S3 and LOG_STATE_TYPE_RKDB_GCS preprocessor conditional
branches, but the function unconditionally returns status at the end. When
neither backend macro is defined, the status variable is undeclared, causing
compilation failure. Add an `#else` clause after the `#elif`
defined(LOG_STATE_TYPE_RKDB_GCS) block to initialize status with an appropriate
error Status value (such as a Status indicating no cloud backend is configured)
to ensure status is always defined before the return statement.

tx_service/raft_host_manager/CMakeLists.txt-48-49 (1)

48-49: ⚠️ Potential issue | 🟠 Major

Dependency resolution is not enforcing what the script intends.

find_library(... VERSION ">=0.6.0" ...) does not enforce version constraints—CMake's find_library command does not recognize or honor the VERSION parameter. Additionally, ${OPENSSL_LIB} is used at line 109 but never defined (the corresponding find_library call is commented out in the parent CMakeLists.txt), which will silently result in an empty variable during linking.

Use find_package(glog 0.6 CONFIG REQUIRED) instead to properly enforce version requirements and modern CMake targets (glog::glog, OpenSSL::SSL, OpenSSL::Crypto) for correct dependency management.
Suggested fix
-        find_library(GLOG_LIB NAMES glog VERSION ">=0.6.0" REQUIRED)
+        find_package(glog 0.6 CONFIG REQUIRED)
@@
-        set(LINK_LIB ${LINK_LIB} ${GLOG_LIB})
+        set(LINK_LIB ${LINK_LIB} glog::glog)
@@
-       ${OPENSSL_LIB}
+       OpenSSL::SSL
+       OpenSSL::Crypto
Also applies to: 101-110
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/raft_host_manager/CMakeLists.txt` around lines 48 - 49, The
find_library command for glog does not enforce version constraints because
CMake's find_library does not recognize the VERSION parameter. Additionally,
${OPENSSL_LIB} is used but never defined since the corresponding find_library
call is commented out. Replace the glog find_library call with find_package(glog
0.6 CONFIG REQUIRED) to properly enforce version requirements and use the modern
CMake target glog::glog. Similarly, add find_package(OpenSSL REQUIRED) and use
the modern CMake targets OpenSSL::SSL and OpenSSL::Crypto instead of the raw
${OPENSSL_LIB} variable to ensure correct dependency management and version
enforcement throughout the CMakeLists.txt file.

eloq_log_service/src/launch_server.cpp-17-31 (1)

17-31: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

run_direct() ignores all its parameters.

The function accepts ip, port, id, raft_conf, group, storage_path parameters but constructs RaftServer using FLAGS_* globals instead. This appears to be a copy-paste error.

🐛 Proposed fix

 void run_direct(std::string ip,
                 uint32_t port,
                 uint32_t id,
                 std::string raft_conf,
                 std::string group,
                 std::string storage_path)
 {
-    txlog::RaftServer raftServer(FLAGS_ip,
-                                 FLAGS_port,
-                                 FLAGS_id,
-                                 FLAGS_raft_conf,
-                                 FLAGS_group,
-                                 FLAGS_storage_path);
+    txlog::RaftServer raftServer(ip,
+                                 port,
+                                 id,
+                                 raft_conf,
+                                 group,
+                                 storage_path);
     raftServer.Run();
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/launch_server.cpp` around lines 17 - 31, The
run_direct() function accepts parameters (ip, port, id, raft_conf, group,
storage_path) but ignores them by constructing RaftServer using FLAGS_* global
variables instead. Replace each FLAGS_* reference in the RaftServer constructor
call with its corresponding function parameter: FLAGS_ip becomes ip, FLAGS_port
becomes port, FLAGS_id becomes id, FLAGS_raft_conf becomes raft_conf,
FLAGS_group becomes group, and FLAGS_storage_path becomes storage_path.

eloq_log_service/src/launch_sv.cpp-596-619 (1)

596-619: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invalid log_purger_schedule causes silent server non-start.

When parsing fails, only an error is logged but launch() is never called. The process appears to start normally but the server isn't running. Consider either returning an error code or exiting with failure.

🛠️ Suggested fix

     if (iss.fail())
     {
         LOG(ERROR)
             << "The argument `log_purger_schedule` has invalid time format. "
                "expected: HH:MM:SS";
+        if (!FLAGS_alsologtostderr)
+        {
+            std::cout << "Failed to start LogServer, invalid log_purger_schedule format"
+                      << std::endl;
+        }
+        return -1;
     }
-    else
-    {
+
         rocksdb_cloud_config.log_purger_starting_hour_ = log_purger_tm.tm_hour;
         // ... rest of launch logic
-    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/launch_sv.cpp` around lines 596 - 619, When parsing of
log_purger_schedule fails in the if (iss.fail()) block, the code only logs an
error but never calls launch(), causing the server to not actually start while
appearing successful. After logging the error for the invalid time format, add
an explicit error handling mechanism such as returning an error code from the
function or calling a process exit function with a failure status to prevent
silent server non-start. Ensure the else block containing the successful
launch() call remains unchanged.

eloq_log_service/include/log_utils.h-48-53 (1)

48-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add suffix-length guard before offset math in ends_with.

At Line 51, str.size() - suffix.size() is unsafe when suffix is longer than str. Add an early if (str.size() < suffix.size()) return false;.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_utils.h` around lines 48 - 53, The ends_with
function performs unsafe arithmetic on unsigned values when calculating the
offset for str.compare(). Add a guard condition at the beginning of the
ends_with function that returns false immediately if str.size() is less than
suffix.size(). This prevents integer underflow when suffix is longer than str,
ensuring the subsequent str.size() - suffix.size() operation is safe.

eloq_log_service/src/log_state_rocksdb_impl.cpp-250-295 (1)

250-295: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release checkpoint object after snapshot creation.

checkpoint allocated at Line 250/251 is never released in this function. Repeated snapshots will leak memory/resources.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_state_rocksdb_impl.cpp` around lines 250 - 295, The
checkpoint object created by rocksdb::Checkpoint::Create is never released
before the function returns, causing a memory leak on repeated snapshot calls.
Add a delete statement to release the checkpoint pointer before the return res
statement at the end of the function, ensuring the allocated checkpoint resource
is properly cleaned up.

eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp-952-977 (1)

952-977: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Do not call observer_->OnInMemStateFull while holding in_mem_state_mutex_.

At Line 974, external callback dispatch occurs under in_mem_state_mutex_. If observer execution is synchronous or re-entrant, this can deadlock or significantly stall writers.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp` around lines 952 -
977, The observer callback `OnInMemStateFull` is being invoked while holding the
`in_mem_state_mutex_` lock, which can cause deadlock or writer stalling if the
callback is synchronous or re-entrant. Release the lock by ending the
unique_lock scope before calling the observer callback. Capture the necessary
parameter values (such as `log_count_before_purge_`) into local variables while
still holding the lock, then release the lock before passing those values to
`observer_->OnInMemStateFull`. The lambda callback `done` can continue to
acquire the lock when needed to update state after the purge completes.

eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp-1791-1798 (1)

1791-1798: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Free SstFileReader iterators in all paths during refill.

The iterator created at Line 1791 is never deleted, including early-return error paths. Refill over many files will leak iterators and degrade long-running stability.

Also applies to: 1800-1815

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp` around lines 1791 -
1798, The iterator created from sst_reader.NewIterator at line 1791 is never
deleted, causing a memory leak especially in the error path where the function
returns false. You must ensure the iterator is properly freed in all code paths
within the refill operation, including the error handling block at lines
1794-1797 and any other paths up to line 1815. Use RAII pattern with a scoped
cleanup mechanism or add explicit delete statements before every return path to
guarantee the iterator pointed to by it is deallocated. Identify all locations
where iterators are created and returned without cleanup, and apply consistent
cleanup logic across all affected code paths in the refill operation.

eloq_log_service/include/log_utils.h-23-27 (1)

23-27: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align size validation with integer-only byte-size parsing.

is_number currently accepts decimal/scientific/signed input, but parse_size treats input as integer capacity. Values like 1.5GB or negative values pass validation and can be parsed incorrectly.

Also applies to: 59-81, 101-113

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_utils.h` around lines 23 - 27, The is_number
function's regex pattern accepts decimal values, scientific notation, and signed
numbers, but parse_size (referenced in the affected ranges at lines 59-81 and
101-113) treats input as integer-only byte capacity, causing values like "1.5GB"
or negative numbers to incorrectly pass validation. Update the regex pattern in
is_number to match only unsigned integer formats (digits only, no decimal point,
no scientific notation, no sign). Then verify that parse_size and any related
size parsing logic at the consolidated sites (lines 59-81, 101-113) correctly
handle the restricted input format and reject non-integer or negative inputs
appropriately.

eloq_log_service/src/fault_inject.cpp-17-19 (1)

17-19: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix inverted strike-window condition in trigger gating.

Line 17-18 rejects the in-range case. The guard should return when count_strike_ is outside [start_strike_, end_strike_], not inside it.

Proposed fix

-        if (entry->start_strike_ < entry->count_strike_ ||
-            entry->end_strike_ > entry->count_strike_)
+        if (entry->count_strike_ < entry->start_strike_ ||
+            entry->count_strike_ > entry->end_strike_)
         {
             return;
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/fault_inject.cpp` around lines 17 - 19, The condition on
lines 17-18 has inverted inequality comparisons that cause it to reject when
count_strike_ is inside the range [start_strike_, end_strike_] instead of
outside. Fix the guard condition by inverting the comparisons: change
"entry->start_strike_ < entry->count_strike_" to "entry->count_strike_ <
entry->start_strike_" and change "entry->end_strike_ > entry->count_strike_" to
"entry->count_strike_ > entry->end_strike_" so the condition correctly returns
when count_strike_ falls outside the valid strike window.

tx_service/raft_host_manager/include/INIReader.h-41-106 (1)

41-106: ⚠️ Potential issue | 🟠 Major

Wrap INIReader in module-specific namespaces to comply with coding guidelines and prevent symbol collisions.

INIReader is declared in global scope, but your codebase includes three separate implementations:

tx_service/raft_host_manager/include/INIReader.h

eloq_log_service/include/INIReader.h

store_handler/eloq_data_store_service/INIReader.h

All three are compiled into the same data_substrate library, creating an ODR (One Definition Rule) violation risk. Your CMakeLists.txt acknowledges this ("The INIReader definition in the eloqstore library was renamed to avoid clashing with the other INIReader implementations"), but the actual code still declares all three in global scope.

Per coding guidelines, C++ code must be organized within namespaces. Move each INIReader class into its respective namespace (e.g., namespace txservice, namespace eloq_log, namespace eloq_data_store) to eliminate the collision risk and comply with project standards.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/raft_host_manager/include/INIReader.h` around lines 41 - 106, The
INIReader class is declared in global scope across three separate
implementations in different parts of the codebase, creating an ODR (One
Definition Rule) violation risk when all three are compiled into the same
library. Wrap the INIReader class declaration in appropriate module-specific
namespaces (e.g., txservice for the raft_host_manager, eloq_log for the
eloq_log_service, and eloq_data_store for the store_handler implementation) to
prevent symbol collisions and comply with coding guidelines. Ensure that the
namespace wrapping is applied consistently across all three INIReader
implementations in their respective headers.
Source: Coding guidelines

eloq_log_service/src/log_state_memory_impl.cpp-88-110 (1)

88-110: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use binary mode and check stream state in snapshot I/O paths.

Line [92] and Line [105] open snapshot files in default text mode, but the format is binary (read/write on integers and raw bytes). Also, read/write results are not validated, so short I/O can silently produce corrupted in-memory state.

Also applies to: 136-170, 173-207
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_state_memory_impl.cpp` around lines 88 - 110, The
ReadSnapshot and WriteSnapshot methods in LogStateMemoryImpl are opening
snapshot files in text mode but performing binary I/O operations
(reading/writing integers and raw bytes), which can cause data corruption.
Additionally, the results of read/write operations are not being validated,
allowing silent failures. Fix this by opening both ifstream in ReadSnapshot and
ofstream in WriteSnapshot with the std::ios::binary flag, and after each
load/write operation, check the stream state using fail() or similar validation
to ensure the I/O completed successfully before proceeding. If stream validation
fails, return an error code or log the failure appropriately.

eloq_log_service/include/log_state.h-1241-1360 (1)

1241-1360: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fail fast on snapshot read/parse errors before mutating in-memory state.

The deserialization path reads lengths/messages and parses protobuf payloads without checking istream state or ParseFromString results. On truncated/corrupt snapshots, this can partially load invalid state and poison recovery.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_state.h` around lines 1241 - 1360, The snapshot
deserialization code reads binary data and parses protobuf messages without
validating the input stream state or parse results before inserting into
in-memory state containers. Add error checking after each is.read() call to
verify the stream is in a good state, and check the return value of each
ParseFromString() call to ensure successful deserialization. If any read fails
(stream state is bad) or if ParseFromString returns false, log the error and
exit the deserialization immediately before calling try_emplace() on
cc_ng_info_, tx_catalog_ops_, tx_split_range_ops_, or tx_cluster_scale_ops_, to
prevent poisoning recovery with partially-loaded or corrupted data.

eloq_log_service/include/log_service.h-345-348 (1)

345-348: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not log cloud credentials in plaintext.

Line 347 logs aws_secret_key directly. This leaks secrets to service logs.

Suggested fix

 `#if` defined(LOG_STATE_TYPE_RKDB_S3)
             << ", aws_access_key_id = " << aws_access_key_id
-            << ", aws_secret_key = " << aws_secret_key
+            << ", aws_secret_key = [REDACTED]"
 `#endif`

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_service.h` around lines 345 - 348, The
aws_secret_key is being logged in plaintext within the LOG_STATE_TYPE_RKDB_S3
conditional block, which is a security vulnerability as it exposes sensitive
credentials to service logs. Replace the plaintext logging of aws_secret_key
with a redacted or masked representation such as a placeholder string like
"[REDACTED]" or a truncated version that does not expose the actual secret
value.

eloq_log_service/include/log_service.h-386-393 (1)

386-393: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

node_id validation can be bypassed by integer narrowing.

On Line 386, casting node_id to uint16_t can wrap large invalid values and let them pass the check, then fail later with out-of-range access.

Suggested fix

-        if ((uint16_t) node_id >= ip_port_list.size())
+        if (node_id < 0 ||
+            static_cast<size_t>(node_id) >= ip_port_list.size())
         {
             std::string err_msg =
                 "Invalid configuration: `node_id` must be less than node size, "

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_service.h` around lines 386 - 393, The
validation check on line 386 uses an unsafe cast that narrows the `node_id`
value to `uint16_t` before comparison, allowing large invalid values to wrap
around and bypass the bounds check, leading to out-of-range access later. Remove
the cast to `uint16_t` in the condition and compare `node_id` directly against
`ip_port_list.size()` so that invalid values are properly caught without integer
wrapping.

eloq_log_service/include/log_service.h-910-918 (1)

910-918: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Start() may report success after partial startup failure.

Lines 913-916 overwrite err_code per replica. If an early replica fails and a later one succeeds, this returns 0 and masks failure.

Suggested fix

     int Start()
     {
-        int err_code = 0;
         for (auto &logger_pair : log_replicas_)
         {
-            err_code = logger_pair.second.Start();
+            int err_code = logger_pair.second.Start();
+            if (err_code != 0)
+            {
+                return err_code;
+            }
         }
-
-        return err_code;
+        return 0;
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_service.h` around lines 910 - 918, The Start()
method overwrites the err_code variable in each iteration of the loop over
log_replicas_, which causes it to report success if the last replica succeeds
even if earlier replicas failed. Instead of unconditionally overwriting err_code
with the result of each logger_pair.second.Start() call, modify the loop to
preserve error codes from earlier replicas. Either accumulate errors using a
logical OR operation (err_code = err_code || logger_pair.second.Start()) so that
any failure is preserved, or use early exit logic (if (err_code == 0) { err_code
= logger_pair.second.Start(); }) to return the first encountered error. This
ensures the method only returns success if all replicas start successfully.

eloq_log_service/include/log_shipping_agent.h-79-94 (1)

79-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry loop never re-attempts channel initialization.

On Line 79, while (err != 0) loops on a stale err value; channel_.Init(...) is never called again inside the loop, so a transient failure becomes a permanent sleep loop.

Suggested fix

-                while (err != 0)
+                auto init_channel = [&]() -> int
+                {
+                    if (0 != butil::str2ip(node_ip_str.c_str(), &ip_t))
+                    {
+                        std::string naming_service_url;
+                        braft::HostNameAddr hostname_addr(node_ip_str, node_port);
+                        braft::HostNameAddr2NSUrl(hostname_addr, naming_service_url);
+                        return channel_.Init(naming_service_url.c_str(),
+                                             braft::LOAD_BALANCER_NAME,
+                                             &options);
+                    }
+                    return channel_.Init(full_ip_.c_str(), &options);
+                };
+
+                while (err != 0)
                 {
                     if (shipping_agent_status_.load(
                             std::memory_order_acquire) != Status::Active)
@@
                     LOG(ERROR)
                         << "Log shipping agent of the log group #"
                         << log_group_id_
                         << " fails to connect to the cc node at " << full_ip_;
                     using namespace std::chrono_literals;
                     std::this_thread::sleep_for(1s);
+                    err = init_channel();
                 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/include/log_shipping_agent.h` around lines 79 - 94, The
retry loop conditional on `while (err != 0)` continuously checks a stale error
value without re-attempting the initialization. The `channel_.Init(...)` method
call must be invoked again inside the loop body after the sleep to actually
retry the connection; otherwise, the loop will sleep indefinitely without ever
retrying. Move the channel initialization call (that sets the `err` value) into
the loop so that each iteration attempts to re-initialize the channel and
updates the `err` variable before the next loop condition check.

eloq_log_service/test/launch_replay_service.cpp-32-35 (1)

32-35: ⚠️ Potential issue | 🟠 Major

Parse command-line flags and propagate server startup status from main.

The main function currently ignores both CLI flag parsing and the return code from start_replay_rpc_server(). This means FLAGS_replay_port overrides passed on the command line won't take effect (only the default value 8888 is used), and startup failures will still cause the process to exit successfully.
Suggested fix
 int main(int argc, char *argv[])
 {
-    start_replay_rpc_server();
+    GFLAGS_NAMESPACE::ParseCommandLineFlags(&argc, &argv, true);
+    return start_replay_rpc_server();
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/launch_replay_service.cpp` around lines 32 - 35, The
main function needs to parse command-line flags before starting the server and
propagate the server startup status as the process exit code. Add a call to
parse command-line arguments (such as gflags::ParseCommandLineFlags) at the
beginning of main to ensure CLI flags like FLAGS_replay_port are processed
correctly, then capture the return value from start_replay_rpc_server() and use
it as the return code for main so that startup failures result in a non-zero
exit code.

eloq_log_service/test/rocksdb_cloud_delete_range_test.cpp-300-302 (1)

300-302: ⚠️ Potential issue | 🟠 Major

Check and handle DeleteFilesInRange failures.

The function returns a rocksdb::Status that must be checked. If deletion fails, the test continues with misleading results. The file already handles Status checks consistently (see lines 224–230 and 253–257).

Suggested fix

-        rocksdb::DeleteFilesInRange(
-            db, db->DefaultColumnFamily(), &lower_bound, &upper_bound);
+        auto delete_status = rocksdb::DeleteFilesInRange(
+            db, db->DefaultColumnFamily(), &lower_bound, &upper_bound);
+        if (!delete_status.ok())
+        {
+            LOG(ERROR) << "DeleteFilesInRange failed for ng_id " << ng_id
+                       << ": " << delete_status.ToString();
+            return -1;
+        }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/rocksdb_cloud_delete_range_test.cpp` around lines 300 -
302, The rocksdb::DeleteFilesInRange call is not checking the returned
rocksdb::Status, which means if deletion fails the test continues with invalid
results. Capture the Status returned by DeleteFilesInRange and check it for
errors following the existing pattern already used in the file at lines 224–230
and 253–257. Handle any error appropriately (typically by asserting or returning
early from the test) to prevent misleading test results.

eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp-151-164 (1)

151-164: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize captured replay messages instead of relying on sleeps.

on_received_messages() mutates replay_messages_ on BRPC stream callbacks while the test thread reads and clears the same vector after fixed sleeps. Guard the vector with a mutex or signal replay completion before assertions/clears to avoid data races and flaky results.

Also applies to: 1105-1202, 1270-1306, 1389-1454, 1536-1591, 1670-1696
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp` around lines 151 -
164, The on_received_messages() method mutates the replay_messages_ vector from
a BRPC stream callback thread while the test thread reads and clears it after
fixed sleeps, creating a data race. Add a mutex to guard all accesses to
replay_messages_ in the on_received_messages() method (at lines 151-164 anchor)
and at all other locations where replay_messages_ is accessed: lines 1105-1202,
1270-1306, 1389-1454, 1536-1591, and 1670-1696. Additionally, consider adding a
condition variable to signal when replay is complete instead of relying on fixed
sleeps, ensuring proper synchronization between the BRPC callback thread and the
test thread before assertions or clears on the shared vector.

eloq_log_service/src/log_instance.cpp-636-699 (1)

636-699: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include cluster-scale and migration logs in retry success detection.

The retry path only re-checks DataLog/SchemaLog/SplitRangeLog. A previously committed kClusterScaleLog or kMigrationLog retried after a term mismatch will fall through as Fail, even though normal execution later treats those log types as duplicate-aware writes. Add searches for SearchTxClusterScaleOp() and SearchTxDataMigrationOp() here before returning failure.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_instance.cpp` around lines 636 - 699, The switch
statement on log_content.content_case() is missing handlers for kClusterScaleLog
and kMigrationLog log types. When these log types are retried after a term
mismatch, they fall through to the default case and are incorrectly marked as
failures. Add two new cases before the default case: one for
LogContentMessage::ContentCase::kClusterScaleLog that calls
log_state_->SearchTxClusterScaleOp(tx_number) and sets committed based on
whether the result indicates the operation was previously found and flushed at
the appropriate stage, and another for
LogContentMessage::ContentCase::kMigrationLog that calls
log_state_->SearchTxDataMigrationOp(tx_number) with the same pattern. Follow the
same structure as the existing kSchemaLog and kSplitRangeLog cases, extracting
the found boolean and flushed stage, then comparing against the current stage
from log_content to set the committed flag.

eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp-410-435 (1)

410-435: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require at least one replayed record in VerifyReplayMessages().

A finish-only replay message has empty binary_log_records() and currently passes without checking verify_commit_ts or verify_message. Count decoded records and CATCH_REQUIRE(record_count > 0) so data-loss regressions cannot pass as successful replay.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp` around lines 410 -
435, The VerifyReplayMessages() function currently allows finish-only replay
messages with empty binary_log_records to pass verification without checking
verify_commit_ts or verify_message. Add a counter variable initialized to zero
before the if statement that checks log_records.size(), increment this counter
each time a record is successfully decoded in the while loop (after the
blob_length offset is processed), and then add a CATCH_REQUIRE assertion after
the if block to ensure record_count is greater than zero, so that data-loss
regressions cannot pass as successful replay.

eloq_log_service/src/log_instance.cpp-1207-1295 (1)

1207-1295: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always clean snapshot state on cloud snapshot failures.

BeginSnapshot() runs before the async save, but the RocksDB Cloud error returns skip Line 1308, leaving snapshot state uncleared after flush/serialization/proposal failures. Use an RAII cleanup guard or a single cleanup exit path so CleanSnapshotState() runs on every return after WriteSnapshot().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_instance.cpp` around lines 1207 - 1295, The snapshot
state initialized by BeginSnapshot() is not being cleaned up when errors occur
during the RocksDB Cloud flush, serialization, or proposal operations. Multiple
return paths at different error points (CheckOrWaitForMemDBInSync failure, Flush
failure, SerializeToZeroCopyStream failure, and ProposeTaskAndWait failure) skip
the CleanSnapshotState() call. Create an RAII cleanup guard that wraps the scope
after BeginSnapshot() and ensures CleanSnapshotState() is called in its
destructor on all error returns, or consolidate all error handling to a single
cleanup exit path before the function returns.

eloq_log_service/test/recover_time_test.cpp-143-145 (1)

143-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard latency summaries when no writes succeed.

Both averages divide by cnt; when the cluster is unavailable or no sender succeeds, this becomes an integer divide-by-zero during failure reporting.

Suggested guard

-              << (time_us / cnt) << " microseconds";
+              << (cnt == 0 ? 0 : time_us / cnt) << " microseconds";

Apply the same guard to the aggregate summary.

Also applies to: 199-202

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/recover_time_test.cpp` around lines 143 - 145, The
latency summary logging statements perform integer division by cnt without
checking if cnt is zero, which causes a divide-by-zero error when the cluster is
unavailable or no sender succeeds. Add a guard condition to check if cnt is
greater than zero before computing and logging the average response time. Apply
this guard at the LOG(INFO) statement around line 143-145 (the aggregate summary
for write log requests) and also at the corresponding location around line
199-202 to protect both latency summary calculations from division-by-zero
failures.

eloq_log_service/src/log_instance.cpp-1122-1124 (1)

1122-1124: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Put finite timeouts on synchronous BRPC calls.

These RPCs run without a controller timeout, so a stalled peer can block leader notification, the replay-size checker, or peer-change redirection indefinitely; on_leader_stop() then may hang while joining the checker thread. Set a bounded timeout on each controller before the synchronous call.
Suggested shape
 brpc::Controller cntl;
-cntl.set_timeout_ms(-1);
+cntl.set_timeout_ms(10000);
 stub.UpdateLogGroupLeader(&cntl, &req, &res, nullptr);
Apply the same pattern to NotifyCheckpointer, AddPeer, and RemovePeer.
Also applies to: 1650-1654, 1755-1757, 1846-1848
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/log_instance.cpp` around lines 1122 - 1124, Replace the
timeout value of -1 with a bounded positive timeout value in all synchronous
BRPC controller calls to prevent indefinite blocking. In
eloq_log_service/src/log_instance.cpp at lines 1122-1124 (UpdateLogGroupLeader
call), change cntl.set_timeout_ms(-1) to set a finite timeout such as
cntl.set_timeout_ms(5000) or an appropriate bounded value. Apply the same
pattern to the consolidated sibling locations at lines 1650-1654
(NotifyCheckpointer call), lines 1755-1757 (AddPeer call), and lines 1846-1848
(RemovePeer call), ensuring each controller has a bounded timeout before its
corresponding synchronous RPC call to prevent stalled peers from blocking
indefinitely.

eloq_log_service/test/recover_time_test.cpp-38-40 (1)

38-40: ⚠️ Potential issue | 🟠 Major

Make the random generator thread-local or synchronized.

All sender threads call generate_log_request(), which mutates the global std::default_random_engine via distribution(generator). Since std::default_random_engine is not thread-safe, concurrent access from multiple threads (FLAGS_thread_num, default 48) creates a data race and undefined behavior. Use a thread_local engine in the sender or protect the generator with a mutex.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/test/recover_time_test.cpp` around lines 38 - 40, The global
std::default_random_engine generator declared in the test is accessed
concurrently by multiple sender threads that call generate_log_request(), which
mutates the generator's state via distribution(generator). Since
std::default_random_engine is not thread-safe, this creates a data race. Fix
this by making the generator thread_local so each thread has its own isolated
copy, eliminating concurrent access to the same generator instance.
Alternatively, protect the generator with a mutex around distribution(generator)
calls if a single shared sequence is required.

coderabbitai · 2026-06-15T07:32:06Z

+    uint32_t LatestCommittedTxnNumber(uint32_t cc_ng) const
+    {
+        auto it = cc_ng_info_.find(cc_ng);
+        if (it == cc_ng_info_.end())
+        {
+            return 0;
+        }
+        return it->second.latest_txn_no_;
+    }
+
+    void UpdateLatestCommittedTxnNumber(uint32_t tx_cc_ng, uint32_t tx_ident)
+    {
+        // access different fields of node group info with RecoverTx RPC
+        // thread, no need to lock
+        auto it = cc_ng_info_.find(tx_cc_ng);
+        if (it == cc_ng_info_.end())
+        {
+            return;
+        }
+        CcNgInfo &info = it->second;
+
+        // to handle the situation that committed txn number wraps around
+        // uint32, assuming that active txn numbers won't span half of
+        // UINT32_MAX
+        if (tx_ident - info.latest_txn_no_ < (UINT32_MAX >> 1))
+        {
+            info.latest_txn_no_ = tx_ident;
+        }
+    }
+
+    uint64_t LatestCommitTsOfAllNodeGroups() const
+    {
+        uint64_t latest_commit_ts = 0;
+        for (const auto &entry : cc_ng_info_)
+        {
+            latest_commit_ts =
+                std::max(latest_commit_ts, entry.second.latest_commit_ts_);
+        }
+        return latest_commit_ts;
+    }
+
+    void UpdateLatestCommitTs(uint32_t tx_cc_ng, uint64_t commit_ts)
+    {
+        // access different fields of node group info with RecoverTx RPC
+        // thread, no need to lock
+        auto it = cc_ng_info_.find(tx_cc_ng);
+        if (it == cc_ng_info_.end())
+        {
+            return;
+        }
+        CcNgInfo &info = it->second;
+
+        if (commit_ts != 0 && commit_ts > info.latest_commit_ts_)
+        {
+            info.latest_commit_ts_ = commit_ts;
+        }
+    }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard all cc_ng_info_ accesses with log_state_mutex_.

Line [991], Line [1005], Line [1024], and Line [1036] access cc_ng_info_ without locking, while other paths mutate the same map under std::unique_lock (for example UpdateNgTerm/RemoveCcNodeGroup). Concurrent read/write on std::unordered_map is undefined behavior and can crash or corrupt state.

Suggested fix

uint32_t LatestCommittedTxnNumber(uint32_t cc_ng) const { + std::shared_lock s_lk(log_state_mutex_); auto it = cc_ng_info_.find(cc_ng); if (it == cc_ng_info_.end()) { return 0; } return it->second.latest_txn_no_; } void UpdateLatestCommittedTxnNumber(uint32_t tx_cc_ng, uint32_t tx_ident) { + std::unique_lock lk(log_state_mutex_); auto it = cc_ng_info_.find(tx_cc_ng); if (it == cc_ng_info_.end()) { return; } ... } uint64_t LatestCommitTsOfAllNodeGroups() const { + std::shared_lock s_lk(log_state_mutex_); uint64_t latest_commit_ts = 0; ... } void UpdateLatestCommitTs(uint32_t tx_cc_ng, uint64_t commit_ts) { + std::unique_lock lk(log_state_mutex_); auto it = cc_ng_info_.find(tx_cc_ng); ... }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eloq_log_service/include/log_state.h` around lines 991 - 1047, The methods LatestCommittedTxnNumber, UpdateLatestCommittedTxnNumber, LatestCommitTsOfAllNodeGroups, and UpdateLatestCommitTs all access the cc_ng_info_ member map without holding log_state_mutex_, while other code paths like UpdateNgTerm and RemoveCcNodeGroup mutate this map under lock, creating a race condition. Guard all accesses to cc_ng_info_ in each of these four methods by wrapping the relevant code with a std::unique_lock on log_state_mutex_ at the beginning of each method to protect both the map lookups and subsequent field accesses.

The log service, host manager, and store handler each carried an identical copy of the inih/INIReader library. Move one canonical copy to data_substrate/third_party/ini (BSD-3-Clause, kept separate from the BSL first-party code) and point every build target at it. - Add data_substrate/third_party/ini/{ini.c,ini.h,INIReader.cpp,INIReader.h}. - Repoint host_manager, logservice, and the data_substrate core library, plus the standalone tx_service / eloq_log_service / store_handler / raft_host_manager builds, at third_party/ini via CMAKE_CURRENT_LIST_DIR-relative paths. - Delete the three duplicate copies (12 files). Compile-per-target is preserved, so cross-library INIReader symbol resolution is unchanged. eloqkv keeps its own copy (separate repo). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

store_handler/eloq_data_store_service/CMakeLists.txt (1)

406-412: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix malformed include_directories block around Line 408.

if()/endif() is currently inside include_directories(...), so the conditional is parsed as arguments instead of control flow. Move the conditional outside the call.

Suggested fix

-include_directories(
-        ${PROJECT_SOURCE_DIR}/
-        ${CMAKE_CURRENT_LIST_DIR}/../../third_party/ini
-if(WITH_DATA_STORE STREQUAL "ELOQDSS_ELOQSTORE")
-        ${PROJECT_SOURCE_DIR}/eloqstore/
-endif ()
-)
+include_directories(
+        ${PROJECT_SOURCE_DIR}/
+        ${CMAKE_CURRENT_LIST_DIR}/../../third_party/ini
+)
+
+if(WITH_DATA_STORE STREQUAL "ELOQDSS_ELOQSTORE")
+    include_directories(${PROJECT_SOURCE_DIR}/eloqstore/)
+endif()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@store_handler/eloq_data_store_service/CMakeLists.txt` around lines 406 - 412,
The include_directories block has an if()/endif() conditional incorrectly nested
inside the function call, causing CMake to parse it as arguments rather than
control flow. Close the initial include_directories() call after the
${CMAKE_CURRENT_LIST_DIR}/../../third_party/ini line, then place the
if(WITH_DATA_STORE STREQUAL "ELOQDSS_ELOQSTORE") conditional outside and after
the closing parenthesis. Inside the if block, create a separate
include_directories() call containing just the ${PROJECT_SOURCE_DIR}/eloqstore/
path, followed by the endif() statement.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@store_handler/eloq_data_store_service/CMakeLists.txt`:
- Around line 406-412: The include_directories block has an if()/endif()
conditional incorrectly nested inside the function call, causing CMake to parse
it as arguments rather than control flow. Close the initial
include_directories() call after the
${CMAKE_CURRENT_LIST_DIR}/../../third_party/ini line, then place the
if(WITH_DATA_STORE STREQUAL "ELOQDSS_ELOQSTORE") conditional outside and after
the closing parenthesis. Inside the if block, create a separate
include_directories() call containing just the ${PROJECT_SOURCE_DIR}/eloqstore/
path, followed by the endif() statement.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1d2eddf-241c-4635-9e49-0419fe081d83

📥 Commits

Reviewing files that changed from the base of the PR and between 10fbd55 and 7ec2557.

📒 Files selected for processing (11)

CMakeLists.txt
build_eloq_log_service.cmake
build_tx_service.cmake
eloq_log_service/CMakeLists.txt
store_handler/eloq_data_store_service/CMakeLists.txt
third_party/ini/INIReader.cpp
third_party/ini/INIReader.h
third_party/ini/ini.c
third_party/ini/ini.h
tx_service/CMakeLists.txt
tx_service/raft_host_manager/CMakeLists.txt

🚧 Files skipped from review as they are similar to previous changes (3)

CMakeLists.txt
build_tx_service.cmake
eloq_log_service/CMakeLists.txt

Triage and fix the actionable findings from the PR #509 review across eloq_log_service, the host-manager build, and the consolidated cmake. Findings that were speculative or by-design were verified and intentionally left as-is (single-threaded raft-apply paths, async braft snapshot callback, fire-and-forget control-plane RPCs, Buf lint where Buf is unused, etc.). Correctness/safety fixes: - glog init: handle readlink failure (UB) and use error_code for create_directories. - fault_inject: FaultEntry use-after-free (store shared_ptr); inverted strike-window guard. - resource leaks: rocksdb Checkpoint and SstFileReader iterator (unique_ptr); CleanSnapshotState on cloud-snapshot error paths. - snapshot deserialize: validate stream state + ParseFromString; open snapshot files binary. - log_service: redact aws_secret_key; fix node_id narrowing bypass; Start() no longer masks an early replica failure. - log_shipping_agent: retry loop now re-initializes the channel. - log_utils: ends_with underflow guard; integer-only size validation. - log_state_rocksdb_impl: guard worker_num==0; runtime key-length check; shard range honors start_ts. - metrics flag made std::atomic; rocksdb_cloud_config #else fallback. - tests: remove AWS credential logging; check DeleteFilesInRange/DeleteBucket status; fix DropBucket IsNotFound ordering and wrong status variable; bounded binary append; synchronize shared test state; parse flags / propagate exit codes; bounds-checked record parsing + dynamic_cast guard; proto/iterator API updates. - cmake: guard logservice install with WITH_LOG_SERVICE; resolve braft find_path before use; check protoc result; fix malformed include_directories; legacy host_manager link order + OpenSSL libs. Remove eloq_log_service/.clang-format (identical to root). Also repoint the standalone eloq_log_service build at tx_service/tx-log-protos (its local copy was removed during the earlier inih consolidation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add data_substrate/.github/workflows/log-service-tests.yml: on PRs touching the log service, build eloq_log_service in both the RocksDB and RocksDB-Cloud(S3) configurations (compile-checking all test targets, incl. the Catch2 cloud tests), then run its RocksDB-Cloud unit tests against a MinIO S3-compatible service — no real AWS credentials or buckets required. To let the cloud tests reach MinIO, add an env-gated S3 endpoint override to the cloud test harness (test/test_utils.h + test/log_server_rocksdb_cloud_tests.cpp): when TEST_S3_ENDPOINT is set, install a rocksdb-cloud s3_client_factory (and point the raw bucket-delete S3 client) at that endpoint with path-style addressing. It is a no-op when unset, so real-AWS / production behavior is unchanged. Verified locally against MinIO: routing works (buckets are created in MinIO) and test_rocksdb_cloud_sst_reader passes. The DBCloud::Open-based cases currently crash inside librocksdb-cloud (CreateLoggerFromOptions -> GetAbsolutePath), independently of this change, and are excluded pending a separate investigation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

eloq_log_service/src/benchmark.cpp (1)

97-116: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Break out of the inner send loop after RPC failure/redirect.

Using continue here keeps sending on the same stale channel/stub, so leader recovery is delayed and failures can repeat for the rest of the million-iteration batch.

Suggested fix

 if (cntl.Failed())
 {
     LOG(WARNING) << "Fail to send request to " << leader << " : "
                  << cntl.ErrorText();
     braft::rtb::update_leader(FLAGS_group, braft::PeerId());
     bthread_usleep(FLAGS_timeout_ms * 1000L);
-    continue;
+    break;
 }

 if (response.response_status() !=
     txlog::LogResponse::ResponseStatus::LogResponse_ResponseStatus_Success)
 {
     LOG(WARNING) << "Fail to send request to " << leader
                  << ", redirecting to "
                  << response.write_log_response().redirect();
     braft::rtb::update_leader(
         FLAGS_group, response.write_log_response().redirect());
-    continue;
+    break;
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/benchmark.cpp` around lines 97 - 116, The code uses
`continue` statements when handling RPC failures and redirect responses, which
keeps the loop executing with the same stale channel/stub instead of breaking
out to allow connection recovery. Replace the `continue` statement in the `if
(cntl.Failed())` block and the `continue` statement in the response status check
block (where response.response_status() is not Success) with `break` statements
instead. This will exit the inner send loop and allow the channel/stub to be
re-established on the next attempt, preventing delayed leader recovery and
repeated failures.

🧹 Nitpick comments (1)

eloq_log_service/src/test_log_state_rocksdb.cpp (1)

93-111: ⚡ Quick win

test_truncate_log no longer validates truncation behavior.

The function still reports “before/after truncate,” but no truncation-triggering operation is executed, so this path cannot catch regressions.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eloq_log_service/src/test_log_state_rocksdb.cpp` around lines 93 - 111, The
test function test_truncate_log sets up a truncate_timestamp but does not invoke
any truncation operation before verifying results, making the test unable to
catch regressions. Replace the `(void) truncate_timestamp;` line with a call to
the current truncation API (either UpdateCkptTs or TryCleanMultiStageOps)
passing the truncate_timestamp value, since the old DeleteLogItems API has been
removed and truncation is now driven by checkpoint timestamps. This ensures the
test actually executes a truncation operation and validates that log items are
properly truncated before checking the log replay list.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/log-service-tests.yml:
- Around line 20-30: Replace the unpinned `:latest` image tags with specific
immutable digests: update the eloqdata/eloq-dev-ci-ubuntu2404 image and the
bitnami/minio service image to reference specific version digests rather than
`:latest`. Additionally, update the actions/checkout action reference from the
major version pin (e.g., `@v4`) to a fully qualified commit SHA to ensure
immutable dependency pinning. Finally, add the `persist-credentials: false`
parameter to the checkout step to prevent the GitHub token from being left in
`.git/config` after the checkout action completes.

In `@eloq_log_service/include/log_state.h`:
- Around line 1342-1350: Parse failures in snapshot reading are not being
propagated to the stream state, allowing partial snapshots to be accepted if
upstream code only checks is.fail(). At file
eloq_log_service/include/log_state.h lines 1342-1350 (anchor), 1385-1392
(sibling), and 1434-1441 (sibling), when the ParseFromString call fails and the
error is logged, set the stream is to a failed state (using setstate or similar
stream failure mechanism) before returning, ensuring that parse failures are
properly detected by upstream callers checking is.fail().

---

Outside diff comments:
In `@eloq_log_service/src/benchmark.cpp`:
- Around line 97-116: The code uses `continue` statements when handling RPC
failures and redirect responses, which keeps the loop executing with the same
stale channel/stub instead of breaking out to allow connection recovery. Replace
the `continue` statement in the `if (cntl.Failed())` block and the `continue`
statement in the response status check block (where response.response_status()
is not Success) with `break` statements instead. This will exit the inner send
loop and allow the channel/stub to be re-established on the next attempt,
preventing delayed leader recovery and repeated failures.

---

Nitpick comments:
In `@eloq_log_service/src/test_log_state_rocksdb.cpp`:
- Around line 93-111: The test function test_truncate_log sets up a
truncate_timestamp but does not invoke any truncation operation before verifying
results, making the test unable to catch regressions. Replace the `(void)
truncate_timestamp;` line with a call to the current truncation API (either
UpdateCkptTs or TryCleanMultiStageOps) passing the truncate_timestamp value,
since the old DeleteLogItems API has been removed and truncation is now driven
by checkpoint timestamps. This ensures the test actually executes a truncation
operation and validates that log items are properly truncated before checking
the log replay list.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a8e51fb-f900-4967-b45c-34107b5d2c06

📥 Commits

Reviewing files that changed from the base of the PR and between 7ec2557 and 72f202c.

📒 Files selected for processing (31)

.github/workflows/log-service-tests.yml
CMakeLists.txt
build_tx_service.cmake
eloq_log_service/CMakeLists.txt
eloq_log_service/include/fault_inject.h
eloq_log_service/include/glog_error_logging.h
eloq_log_service/include/log_service.h
eloq_log_service/include/log_service_metrics.h
eloq_log_service/include/log_shipping_agent.h
eloq_log_service/include/log_state.h
eloq_log_service/include/log_state_rocksdb_impl.h
eloq_log_service/include/log_utils.h
eloq_log_service/include/replay_service_for_test.h
eloq_log_service/include/rocksdb_cloud_config.h
eloq_log_service/src/benchmark.cpp
eloq_log_service/src/fault_inject.cpp
eloq_log_service/src/launch_server.cpp
eloq_log_service/src/launch_sv.cpp
eloq_log_service/src/log_instance.cpp
eloq_log_service/src/log_state_memory_impl.cpp
eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp
eloq_log_service/src/log_state_rocksdb_impl.cpp
eloq_log_service/src/test_log_state_rocksdb.cpp
eloq_log_service/test/launch_replay_service.cpp
eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp
eloq_log_service/test/recover_time_test.cpp
eloq_log_service/test/rocksdb_cloud_delete_range_test.cpp
eloq_log_service/test/rocksdb_test.cpp
eloq_log_service/test/test_utils.h
store_handler/eloq_data_store_service/CMakeLists.txt
tx_service/raft_host_manager/CMakeLists.txt

✅ Files skipped from review due to trivial changes (1)

eloq_log_service/include/log_service_metrics.h

🚧 Files skipped from review as they are similar to previous changes (24)

eloq_log_service/test/launch_replay_service.cpp
store_handler/eloq_data_store_service/CMakeLists.txt
eloq_log_service/src/launch_server.cpp
tx_service/raft_host_manager/CMakeLists.txt
CMakeLists.txt
eloq_log_service/test/test_utils.h
eloq_log_service/src/fault_inject.cpp
eloq_log_service/include/replay_service_for_test.h
eloq_log_service/include/log_utils.h
eloq_log_service/include/rocksdb_cloud_config.h
build_tx_service.cmake
eloq_log_service/include/glog_error_logging.h
eloq_log_service/src/log_state_memory_impl.cpp
eloq_log_service/CMakeLists.txt
eloq_log_service/test/log_server_rocksdb_cloud_tests.cpp
eloq_log_service/include/log_service.h
eloq_log_service/include/log_shipping_agent.h
eloq_log_service/src/launch_sv.cpp
eloq_log_service/test/rocksdb_cloud_delete_range_test.cpp
eloq_log_service/test/rocksdb_test.cpp
eloq_log_service/src/log_instance.cpp
eloq_log_service/src/log_state_rocksdb_impl.cpp
eloq_log_service/include/log_state_rocksdb_impl.h
eloq_log_service/src/log_state_rocksdb_cloud_impl.cpp

coderabbitai · 2026-06-15T09:42:22Z

+      image: eloqdata/eloq-dev-ci-ubuntu2404:latest
+      # Run as root so actions/checkout can write under /__w/_temp (mirrors
+      # unit-tests.yml); otherwise checkout fails with EACCES in this image.
+      options: --user root
+    services:
+      # bitnami/minio starts the server with its default entrypoint (no custom
+      # command needed, which GitHub `services:` cannot provide). Reachable from
+      # the job container by the service hostname `minio`.
+      minio:
+        image: bitnami/minio:latest
+        env:


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail wf=".github/workflows/log-service-tests.yml" echo "== Action refs ==" rg -n 'uses:\s*' "$wf" echo "== Mutable image tags (latest) ==" rg -n 'image:\s+.*:latest' "$wf" || true echo "== Checkout credential persistence setting ==" rg -n 'persist-credentials' "$wf" || true

Repository: eloqdata/tx_service

Length of output: 292

Pin workflow dependencies immutably and disable credential persistence in checkout.

Unpinned :latest images (lines 20, 29) allow mutation of dependencies without version control, creating supply-chain risk. Action @v4 (line 45) pins to major version but not immutable commit SHA. Missing persist-credentials: false in checkout leaves GitHub token in .git/config for subsequent steps, exposing credentials. Replace :latest with specific digests, update actions/checkout@v4 to a pinned commit SHA, and add persist-credentials: false to the checkout step.

🧰 Tools

🪛 zizmor (1.25.2)

[error] 20-20: unpinned image references (unpinned-images): container image is pinned to latest

(unpinned-images)

[error] 29-29: unpinned image references (unpinned-images): container image is pinned to latest

(unpinned-images)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/log-service-tests.yml around lines 20 - 30, Replace the unpinned `:latest` image tags with specific immutable digests: update the eloqdata/eloq-dev-ci-ubuntu2404 image and the bitnami/minio service image to reference specific version digests rather than `:latest`. Additionally, update the actions/checkout action reference from the major version pin (e.g., `@v4`) to a fully qualified commit SHA to ensure immutable dependency pinning. Finally, add the `persist-credentials: false` parameter to the checkout step to prevent the GitHub token from being left in `.git/config` after the checkout action completes.

Source: Linters/SAST tools

- log_state.h: on ParseFromString failures in the snapshot deserialize helper (schema / split-range / cluster-scale op messages), set the input stream to failbit before returning, so callers that check is.fail() detect the parse failure instead of silently accepting partially-loaded state. (read failures already set failbit themselves.) - log-service-tests.yml: the bitnami/minio:latest service image was retired (docker pull => "manifest unknown"), failing the job at container init. GitHub `services:` can't run the official minio image (it needs a `server` command), so run MinIO as a local binary in the job container and point TEST_S3_ENDPOINT at 127.0.0.1:9000. Verified locally: the cloud test rebuilds with the change and test_rocksdb_cloud_sst_reader passes against the binary MinIO (bucket create/drop included). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The prior run failed at Start MinIO although the server started fine: the probe used bash-only /dev/tcp but GitHub runs the step under sh, so it never succeeded. Use a POSIX-sh curl health check (/minio/health/live) and merge MinIO startup + the test into one step so the backgrounded server is alive for the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…RVICE (#512) Companion to eloqdata/tx_service#509. data_substrate now bundles the log service, host manager, and tx-log-protos in-tree, so the separate clone/symlink build steps are no longer needed. - Bump the data_substrate submodule to the in-tree open-sourcing commit. - Remove the OPEN_LOG_SERVICE cmake option (WITH_LOG_SERVICE is kept) and the stale -DOPEN_LOG_SERVICE / -DFORK_HM_PROCESS flags. - Drop the eloq_log_service / raft_host_manager clone+symlink plumbing from GitHub Actions (build.yml), the Concourse pipelines/tasks/scripts, and scripts/git-checkout.sh / git-tag.sh; the host_manager binary copy steps are kept. - Remove the obsolete log.pr.ent Concourse pipeline (it tested the now-absorbed eloq_log_service repo). - Update CLAUDE.md build instructions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Squash of the 8 chore/data-substrate-third-party-workspace commits, rebased onto current main and reconciled with main's in-tree restructuring (#509): - Build third-party deps from pinned sources: third_party/manifest.yml, scripts/third_party/{fetch,build,install,common}.sh, cmake/EloqThirdParty.cmake. brpc/braft/mimalloc/rocksdb-cloud remain submodules; others fetched from pinned upstream sources. - tx-log-protos, log_service, and host_manager follow main's open-sourced in-tree layout (not re-introduced as submodules). - Drop the top-level abseil-cpp submodule (workspace provides third_party/src/abseil-cpp). - Build the workspace abseil with its default options.h (ABSL_OPTION_USE_*=2 => std:: string_view/any/optional/variant under C++17+), matching how main consumes the abseil submodule via add_subdirectory(tx_service/abseil-cpp). This avoids any source changes for flat_hash_map<std::string> lookups. Build verification pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

liunyl mentioned this pull request Jun 15, 2026

Bundle log service + host manager in data_substrate; drop OPEN_LOG_SERVICE eloqdata/eloqkv#512

Merged

4 tasks

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

liunyl and others added 2 commits June 15, 2026 09:07

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

liunyl and others added 2 commits June 15, 2026 10:08

thweetkomputer approved these changes Jun 16, 2026

View reviewed changes

liunyl merged commit 561872f into main Jun 16, 2026
6 checks passed

liunyl deleted the chore/opensource-logservice-hostmanager branch June 16, 2026 02:49

coderabbitai Bot mentioned this pull request Jun 29, 2026

[codex] build: add data substrate third-party workspace #487

Merged

Uh oh!

Conversation

liunyl commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

liunyl commented Jun 15, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

liunyl commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading