test: Phase 2 — multi-process cluster harness (txnode + scripted host manager + TestCluster) by liunyl · Pull Request #507 · eloqdata/tx_service

liunyl · 2026-06-13T19:45:38Z

Summary

Phase 2 foundation of the data_substrate test framework (Tier 2): a multi-process cluster harness that spawns real node processes and drives transactions across them. Builds on the merged Phase 1 (TestNode/MemDataStore).

Because the engine's singletons (Sharder, TxStartTsCollector, …) are process-global (one node per process), each node runs as its own txnode OS process. The test driver runs an in-process scripted host manager and deterministically scripts leadership by calling each node's CcRpcService.OnLeaderStart — exactly what the real raft host manager does — instead of relying on raft elections. This makes multi-node behavior testable without timing nondeterminism.

What's included

txnode binary (tx_service/tests/cluster/txnode_main.cpp, txnode_bringup.{h,cpp}): brings up a real multi-node TxService (mock catalog + in-process MemDataStore, skip_wal/skip_kv, external host manager, fork_host_manager=false) and hosts a small workload brpc service (txnode_workload.proto: BeginTx/Upsert/Read/Commit/Abort/NodeInfo/WaitReady/Shutdown) the driver uses to drive and inspect the node.
ScriptedHostManager (scripted_host_manager.{h,cpp}): minimal HostMangerService — nodes register via StartNode, learn NG leaders via GetLeader from a driver-set map.
TestCluster (test_cluster.{h,cpp}): spawns/kills txnode processes (own process group, robust reap — no orphans even on test failure), reserves port windows, runs the scripted HM, drives leadership (OnLeaderStart + NotifyNewLeaderStart + concurrent WaitReady), and exposes Client(node).
ClusterCrossNg-Test: brings up a 2-node / 2-NG cluster and asserts both NGs are led and that cross-NG write transactions commit from both nodes (writing keys that span both NGs exercises remote write-lock acquisition + 2PC across nodes). 3× green, zero orphan processes.
port_util.h: factored the port-reservation helpers out of Phase 1's test_node.cpp (shared by both); Phase 1 suite still passes.
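
The "driver-set map" behind the scripted host manager's GetLeader can be pictured as a small mutex-protected table from node-group id to leader node id. The following is a minimal sketch under that assumption; `LeaderTable`, `SetLeader` and `GetLeader` are hypothetical names chosen for illustration, and the brpc service plumbing around the table is omitted.

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for the driver-set leader map; not the harness's
// actual class. A real service would call into a table like this from its
// RPC handlers.
class LeaderTable
{
public:
    // Called by the test driver once it has scripted leadership for an NG.
    void SetLeader(uint32_t ng_id, uint32_t node_id)
    {
        std::lock_guard<std::mutex> lk(mu_);
        leaders_[ng_id] = node_id;
    }

    // Called on behalf of a node asking who leads an NG. Returns nullopt
    // when the driver has not scripted a leader for that NG yet.
    std::optional<uint32_t> GetLeader(uint32_t ng_id) const
    {
        std::lock_guard<std::mutex> lk(mu_);
        auto it = leaders_.find(ng_id);
        if (it == leaders_.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

private:
    mutable std::mutex mu_;
    std::unordered_map<uint32_t, uint32_t> leaders_;
};
```

Because only the driver writes the table, leadership changes exactly when the test says so, which is the property the harness relies on to avoid election timing.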

Known limitation (documented, tracked — NOT a harness defect)

Cross-NG read-back is intentionally not asserted. Under skip_kv, a freshly-promoted leader's InitRangeBuckets copies bucket info (enough for write routing) but does not seed the RangeBucket CcMap entries that a read pins (ReadOperation → lock_bucket_op_) — this is flagged // TODO: HARDCORE SEED in cc_node.cpp. So cross-NG/cross-bucket reads hang on the bucket-meta lock while writes commit. This is an engine gap (same skip_kv family that deferred storage-backed Phase 1 work to e2e); resolving it needs core tx_execution/InitRangeBuckets changes and is a follow-up. The cluster test is non-gating in CI (like LargeObjLRU-Test) until then.

CI

ClusterCrossNg-Test runs in the non-gating step (ctest -L), excluded from the gating run (-LE "LargeObjLRU-Test|ClusterCrossNg-Test"), since it spawns processes and is heavier; promote to gating once proven stable across CI runs.

Follow-up Phase 2 increments (own PRs)

standby replication + failover; crash-recovery with an embedded LogServer; message-level fault injection; simplified scale in/out; and the cross-NG read once the RangeBucket seed gap is fixed.

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added comprehensive multi-node cluster test infrastructure, standalone node driver, scripted host manager, workload RPCs, and end-to-end cross-node commit test.
Chores
- Updated CI to gate unit tests while running stress/cluster tests non-blocking for informational reporting.
- Added shared port-reservation and test orchestration utilities.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…able NodeConfig defaults is_candidate=false (a voter); a non-candidate cannot be made leader (cc_node.cpp rejects a non-candidate leader target). In this 1-member-per-NG foundation each member is its NG's sole leader, so mark it a candidate, matching how production's ParseNgConfig marks NG primaries. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Implement TestCluster::Start() (HM up -> spawn nodes -> await workload servers -> drive OnLeaderStart per NG + SetLeader -> NotifyNewLeaderStart to all nodes -> WaitReady) and add a WaitReady RPC to the txnode workload service that finishes the native-NG recovery (skip_wal leaves no log service to promote candidate->leader) and drives WaitClusterReady.

Add ClusterCrossNg-Test: a real 2-node cluster, write 20 keys via node 0 (some remote to ng1) and read them back via node 1 (some remote to ng0).

Cluster bring-up and cross-NG writes work; the cross-NG READ is currently blocked by a pre-existing engine issue in the bucket-meta lock path on a freshly-promoted skip_kv leader (ReadOperation::Forward lock_range_bucket_result_->IsFinished() / the RangeBucket ReadLocal never completing under skip_kv). See the STATUS comment in ClusterCrossNg-Test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
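
The bring-up order in this commit message is strictly sequential, so a failure in one phase should stop the run and name that phase. Below is a minimal sketch of that control flow only; `Phase`, `RunPhases` and `MakeStartPhases` are hypothetical names, and each step is a placeholder lambda rather than the real driver logic.

```cpp
#include <functional>
#include <string>
#include <vector>

// One bring-up step: a label for diagnostics plus the action itself.
struct Phase
{
    std::string name;
    std::function<bool()> run;  // returns false on failure
};

// Runs phases strictly in order and stops at the first failure. Returns the
// name of the failed phase, or an empty string when every phase succeeded.
std::string RunPhases(const std::vector<Phase> &phases)
{
    for (const Phase &p : phases)
    {
        if (!p.run())
        {
            return p.name;
        }
    }
    return "";
}

// Hypothetical wiring that mirrors the order in the commit message. Each
// placeholder step only records its name in `trace` and reports success.
std::vector<Phase> MakeStartPhases(std::vector<std::string> &trace)
{
    auto step = [&trace](const char *name)
    {
        return Phase{name,
                     [&trace, name]
                     {
                         trace.push_back(name);
                         return true;
                     }};
    };
    return {step("host manager up"),
            step("spawn nodes"),
            step("await workload servers"),
            step("OnLeaderStart per NG + SetLeader"),
            step("NotifyNewLeaderStart to all nodes"),
            step("WaitReady")};
}
```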

…ine skip_kv gap); harden node cleanup; non-gating CI Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

coderabbitai · 2026-06-13T19:45:50Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdb0c88a-c147-40a2-a503-317b23ae4609

📥 Commits

Reviewing files that changed from the base of the PR and between 12a07e4 and c426f06.

📒 Files selected for processing (9)

.github/workflows/unit-tests.yml
tx_service/include/cc/template_cc_map.h
tx_service/tests/LargeObjLRU-Test.cpp
tx_service/tests/cluster/scripted_host_manager.cpp
tx_service/tests/cluster/scripted_host_manager.h
tx_service/tests/cluster/test_cluster.cpp
tx_service/tests/cluster/test_cluster.h
tx_service/tests/cluster/txnode_bringup.cpp
tx_service/tests/cluster/txnode_bringup.h

Walkthrough

Implements Phase 2 multi-process cluster test harness: workload proto and port utilities, txnode bring‑up fixtures, scripted HostManager, TestCluster orchestration to spawn txnode subprocesses, txnode binary exposing WorkloadService, Catch2 cross‑NG integration test, and CMake/CI wiring.

Changes

Phase 2 Multi-Process Cluster Test Harness

Layer / File(s)	Summary
Workload protocol and port utilities `tx_service/tests/cluster/txnode_workload.proto`, `tx_service/tests/harness/port_util.h`	brpc WorkloadService contract (BeginTx, Upsert, Read, Commit, Abort, NodeInfo, WaitReady, Shutdown) and header-only port-binding utilities (`BindEphemeralPort`, `TryBindPort`, `ReserveTxPortWindow`).
ScriptedHostManager test service `tx_service/tests/cluster/scripted_host_manager.h`, `tx_service/tests/cluster/scripted_host_manager.cpp`	In-driver brpc HostManager mock service that registers nodes (StartNode RPC) and resolves NG leaders (GetLeader RPC) from a mutex-protected in-memory map.
TxNode bring-up and configuration `tx_service/tests/cluster/txnode_bringup.h`, `tx_service/tests/cluster/txnode_bringup.cpp`, `tx_service/tests/harness/test_node.cpp`	TxNodeConfig and TxNode to construct per-node in-process DataStore and TxService, register test tables, reserve ports/dirs, and refactor test_node.cpp to use shared port utilities.
txnode binary entry point `tx_service/tests/cluster/txnode_main.cpp`	Standalone executable implementing WorkloadServiceImpl, topology parsing, transaction-handle management, RPC forwarding to TxService, NodeInfo/WaitReady proxying, and clean shutdown on SIGTERM/SIGINT.
TestCluster driver orchestration `tx_service/tests/cluster/test_cluster.h`, `tx_service/tests/cluster/test_cluster.cpp`	Out-of-process cluster driver orchestrating port/directory reservations, txnode subprocess spawning, workload client construction, per-NG leadership driving, leader notification propagation, and readiness synchronization (concurrent WaitReady + NodeInfo + liveness Commit).
Cross-NG integration test and build `tx_service/tests/ClusterCrossNg-Test.cpp`, `tx_service/tests/CMakeLists.txt`, `.github/workflows/unit-tests.yml`	Catch2 end-to-end test validating 2-node cluster bring-up and cross-NG write/commit flows; CMake adds protoc codegen, `cluster_harness`, `txnode`, and `ClusterCrossNg-Test` targets; CI excludes ClusterCrossNg-Test from gated unit tests and runs a non-gating stress/cluster step.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

eloqdata/tx_service#426: Related CI/test gating changes around quarantine of LargeObjLRU-Test.

Suggested reviewers

thweetkomputer
githubzilla

Poem

🐇 I spun up nodes in a testing glade,
RPCs hopped through the grassy glen,
Leaders found homes and writes were made,
Commits returned true—then we ran again. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main contribution: Phase 2 of the multi-process cluster harness implementation (txnode, scripted host manager, TestCluster).
Description check	✅ Passed	The description is comprehensive and well-structured, covering the summary, included components, known limitations, CI placement, and follow-up work. It follows good documentation practices and aligns with the template's intent despite missing explicit checklist items.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tx_service/tests/cluster/txnode_main.cpp (1)

164-168: 💤 Low value

Consider validating port range before casting to uint16_t.

std::stoul succeeds for values > 65535 (e.g., "70000"), but static_cast<uint16_t> silently truncates them. While TCP ports are inherently 16-bit and the test harness controls inputs, explicit validation would surface configuration errors early.

♻️ Proposed validation

         try
         {
             uint32_t ng_id = static_cast<uint32_t>(std::stoul(ng_str));
             uint32_t node_id = static_cast<uint32_t>(std::stoul(node_str));
-            uint16_t port = static_cast<uint16_t>(std::stoul(port_str));
+            unsigned long port_ul = std::stoul(port_str);
+            if (port_ul > 65535)
+            {
+                throw std::runtime_error("port out of range");
+            }
+            uint16_t port = static_cast<uint16_t>(port_ul);
             ng_members[ng_id].emplace_back(node_id, host, port);
         }
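
To make the truncation concrete: 70000 does not fit in 16 bits, so the cast keeps 70000 mod 65536 = 4464. A standalone version of the check might look like the sketch below. `ParsePort` is a hypothetical helper name, and it throws `std::out_of_range` rather than the `std::runtime_error` used in the proposed diff; either works as long as the caller's catch block handles it.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper that parses a decimal port and rejects values that do
// not fit in 16 bits instead of silently truncating them.
uint16_t ParsePort(const std::string &port_str)
{
    // std::stoul itself throws std::invalid_argument or std::out_of_range
    // when the text is not a number or does not fit in unsigned long.
    unsigned long port_ul = std::stoul(port_str);
    if (port_ul > std::numeric_limits<uint16_t>::max())
    {
        throw std::out_of_range("port out of range: " + port_str);
    }
    return static_cast<uint16_t>(port_ul);
}
```

Note that `std::stoul` stops at the first non-digit character, so input such as "80abc" still parses as 80 unless the caller also checks the optional position argument.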

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/cluster/txnode_main.cpp` around lines 164 - 168, The code
converts port_str via std::stoul then static_cast<uint16_t>, which silently
truncates values > 65535; change the port parsing in the try block to first
parse into an unsigned long (or uint32_t) (use the current std::stoul result),
validate that the value is <= 65535 (UINT16_MAX) before casting, and if out of
range throw or log an error and fail the parse (so port is not silently
truncated). Update the parsing for port (the variable currently named port and
the std::stoul(port_str) call) to perform this range check and handle the error
path consistently with the existing ng_id/node_id parse error handling.

tx_service/tests/ClusterCrossNg-Test.cpp (1)

61-114: 💤 Low value

Consider extracting the hardcoded RPC timeout to a named constant.

The 5000ms timeout appears in all four helper functions (lines 64, 78, 92, 107). Extracting it to a constexpr int kRpcTimeoutMs = 5000; at namespace scope would make future adjustments easier and document the timeout policy.

♻️ Proposed refactor

 namespace
 {
+constexpr int kRpcTimeoutMs = 5000;
+
 // --- Workload-RPC helper wrappers over a WorkloadService_Stub. Each opens a
 // fresh brpc::Controller (a Controller is single-use) and asserts the RPC did
 // not fail at the transport level before inspecting the reply. ---
 
 int64_t LeaderTerm(WorkloadStub &stub, uint32_t ng_id)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 uint64_t Begin(WorkloadStub &stub)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 bool Upsert(WorkloadStub &stub, uint64_t handle, int key, int value)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 bool Commit(WorkloadStub &stub, uint64_t handle)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/ClusterCrossNg-Test.cpp` around lines 61 - 114, The four
helper functions LeaderTerm, Begin, Upsert, and Commit duplicate the hardcoded
RPC timeout (cntl.set_timeout_ms(5000)); introduce a single namespace-scope
constexpr int kRpcTimeoutMs = 5000 and replace each cntl.set_timeout_ms(5000)
call with cntl.set_timeout_ms(kRpcTimeoutMs) so the timeout policy is documented
and easy to change in one place; keep the new constant near the top of the test
file in the anonymous or test namespace scope so it is visible to all four
functions.

tx_service/tests/cluster/txnode_workload.proto (1)

2-2: 💤 Low value

Consider aligning package directory with proto convention.

The Buf linter flags that package txnode_workload should reside in a txnode_workload/ directory relative to the repository root, but the file is in tx_service/tests/cluster/. While this works functionally and may be intentional for test organization, aligning with protobuf directory conventions improves tooling compatibility and consistency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/cluster/txnode_workload.proto` at line 2, The proto package
declaration package txnode_workload does not match protobuf directory
conventions; either move the .proto into a matching txnode_workload/ directory
(so the file path aligns with package txnode_workload) or change the package
declaration to reflect the current test directory layout; update any import or
build references that depend on this package name accordingly so the linter and
tooling see a consistent package-to-directory mapping.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tx_service/tests/cluster/txnode_bringup.cpp`:
- Around line 63-90: Remove the duplicate BindEphemeralPort implementation and
instead include the shared helper from harness/port_util.h and add a using
declaration (e.g., using ::BindEphemeralPort;) so this file reuses the common
function (as test_node.cpp does); also fix the misleading comment that says
"loopback"—either remove that phrase or change it to reflect that the bind uses
INADDR_ANY (0.0.0.0).
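
For readers unfamiliar with the technique, an ephemeral-port helper of this kind typically binds to port 0 and reads the kernel's choice back with `getsockname`. The sketch below is written under that assumption; the name `BindEphemeralPortSketch` and its signature are hypothetical, and the shared helper in harness/port_util.h may differ.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Binds a TCP socket to port 0 on INADDR_ANY (0.0.0.0, not loopback) so the
// kernel picks a free port. On success returns the bound fd and stores the
// chosen port in *port; on failure returns -1. The caller owns the fd and
// must close() it.
int BindEphemeralPortSketch(uint16_t *port)
{
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
    {
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);  // 0 asks the kernel to choose
    if (::bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0)
    {
        ::close(fd);
        return -1;
    }

    // Read back the port the kernel actually assigned.
    socklen_t len = sizeof(addr);
    if (::getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) != 0)
    {
        ::close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}
```

The port stays reserved only while the fd is open; once it is closed so that another process can bind the same number, a different process may claim it first.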



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 415700e9-30bf-49cc-bd47-843fb46ff92d

📥 Commits

Reviewing files that changed from the base of the PR and between e9899d5 and 52c5dbc.

📒 Files selected for processing (13)

.github/workflows/unit-tests.yml
tx_service/tests/CMakeLists.txt
tx_service/tests/ClusterCrossNg-Test.cpp
tx_service/tests/cluster/scripted_host_manager.cpp
tx_service/tests/cluster/scripted_host_manager.h
tx_service/tests/cluster/test_cluster.cpp
tx_service/tests/cluster/test_cluster.h
tx_service/tests/cluster/txnode_bringup.cpp
tx_service/tests/cluster/txnode_bringup.h
tx_service/tests/cluster/txnode_main.cpp
tx_service/tests/cluster/txnode_workload.proto
tx_service/tests/harness/port_util.h
tx_service/tests/harness/test_node.cpp

Apply fixes from a local multi-agent review of PR #507:

Diagnosability / robustness (test_cluster.cpp):
- AwaitWorkloadServers: detect a txnode that died during bring-up via waitpid(WNOHANG) and fail immediately with its exit/signal status, instead of burning the full 30s on connection-refused RPCs. Give each node its own timeout budget so a slow first node cannot starve later ones.
- SpawnNode: check the posix_spawn_file_actions_* / posix_spawnattr_* setup return values; a silently-failed setpgroup would leave the child in the driver's group and make the dtor's group-kill backstop target the wrong group.
- KillNode: retry waitpid on EINTR so a signal mid-wait cannot leave the child unreaped.
- BuildClient: set max_retry=0 -- the workload RPCs are non-idempotent, so a transport-level retry could leak a tx handle or re-commit.
- DriveLeader: treat resp.retry() as a retry condition (matches the documented contract; keeps the loop correct if the engine ever returns retry without error).
- AwaitClusterReady: on a non-failed WaitReady reply with ready=false (a driver-ordering bug), fail loudly and immediately rather than spinning to the deadline.

txnode_main.cpp:
- WaitReady: single-flight the WaitClusterReady handshake so a retried WaitReady RPC (after a client-side timeout) cannot launch a second concurrent WaitClusterReady in the same process.

Test (ClusterCrossNg-Test.cpp):
- Assert NodeId(c0)==0 / NodeId(c1)==1 so a topology/port mix-up cannot silently degenerate the cross-NG test into a single-node test.

Comment accuracy:
- Note log-group binds base+2 in the 4-wide port-window comments (txnode_main.cpp, txnode_bringup.cpp, test_cluster.cpp).
- DriveLeader doc: it sends only the OnLeaderStart RPC; NotifyLeader and the recovery handshake are driven separately.
- WaitReady: name the real promoter (FinishLogGroupReplay).
- Clarify that Upsert buffers locally and the cross-NG work is at Commit; describe the node-1 case as the coordinator/remote-role mirror.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
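
The two `waitpid` behaviors named in this commit (a non-blocking liveness probe with WNOHANG, and retrying a blocking wait on EINTR) can be sketched as follows. This is an illustration under assumed names (`ReapChild`, `ProbeChild`, `ChildState`), not the code in test_cluster.cpp.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>

// Blocks until `pid` is reaped, retrying when a signal interrupts the wait.
// Returns true and fills *status on success; returns false on any other
// error (for example ECHILD when pid is not an unreaped child of ours).
bool ReapChild(pid_t pid, int *status)
{
    for (;;)
    {
        pid_t r = ::waitpid(pid, status, 0);
        if (r == pid)
        {
            return true;
        }
        if (r < 0 && errno == EINTR)
        {
            continue;  // interrupted by a signal: wait again
        }
        return false;
    }
}

enum class ChildState
{
    kRunning,
    kExited,
    kError
};

// Non-blocking liveness probe. With WNOHANG, waitpid returns 0 while the
// child is still running. A kExited result means the child has already been
// reaped by this call and *status holds its exit or signal status.
ChildState ProbeChild(pid_t pid, int *status)
{
    pid_t r;
    do
    {
        r = ::waitpid(pid, status, WNOHANG);
    } while (r < 0 && errno == EINTR);

    if (r == 0)
    {
        return ChildState::kRunning;
    }
    return r == pid ? ChildState::kExited : ChildState::kError;
}
```

A bring-up loop can call `ProbeChild` between connection attempts and report `WEXITSTATUS` or `WTERMSIG` from the status as soon as the child is gone, rather than waiting out the full RPC deadline.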

- txnode_bringup.cpp: drop the duplicate BindEphemeralPort and reuse the shared txservice::test::BindEphemeralPort from harness/port_util.h (same behavior; the file is already in namespace txservice::test, so the call resolves without a using-declaration). Removes the now-unused <netinet/in.h>/<sys/socket.h> includes.
- txnode_main.cpp ParseTopology: reject ports > 65535 explicitly instead of silently truncating std::stoul's result to uint16_t.
- ClusterCrossNg-Test.cpp: hoist the duplicated 5000ms RPC timeout into a single constexpr kRpcTimeoutMs.

Skipped CodeRabbit's proto-package-directory nitpick: txnode_workload.proto is a test-only definition deliberately colocated with the cluster harness it serves and referenced there by the CMake proto-gen; moving it for the Buf directory convention has no functional benefit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

liunyl · 2026-06-13T22:15:43Z

Addressed the CodeRabbit review in 12a07e4:

txnode_main.cpp ParseTopology (port range): now parses into unsigned long, rejects values > 65535, then casts — no more silent truncation.
ClusterCrossNg-Test.cpp (RPC timeout): hoisted the duplicated 5000 ms into a single constexpr int kRpcTimeoutMs at namespace scope.
txnode_bringup.cpp (duplicate BindEphemeralPort): resolved (see inline thread).

Skipped: the txnode_workload.proto package-directory nitpick. It is a test-only proto deliberately colocated with the cluster harness it serves and referenced there by the CMake proto-gen add_custom_command; moving it to a txnode_workload/ directory for the Buf convention has no functional benefit and would only complicate the test build.

Verified: ClusterCrossNg-Test green (96 assertions, 0 orphan processes) and the Phase 1 gating suite 16/16.

Quality cleanup from a /simplify pass (no behavior change):
- test_cluster.cpp DriveLeader/NotifyLeader: build the CcRpcService stub and the request once before the retry loop instead of per iteration (a stub is a stateless channel wrapper; the request fields are constant across retries). Only the Controller/response are per-attempt.
- test_cluster.cpp NotifyLeader: drop the dead leader-NodeProc lookup -- it was found but never dereferenced; the request needs only leader_node_id (already known-valid from the caller's iteration).
- test_cluster.cpp: resolve the txnode binary path once in Start() and pass it to SpawnNode, instead of calling LocateTxnodeBinary (readlink + stat) per node. Matches the existing BuildTopologyString hoist.
- txnode_bringup.cpp: collapse the two identical CatalogFactory*[3] arrays into one shared by both the DataStoreServiceClient and TxService ctors.
- Hoist the duplicated cluster_config_version literal into a single shared constant kClusterConfigVersion in txnode_bringup.h, used by both the node bring-up and the driver's OnLeaderStart (the two must agree; the coupling was previously only documented in a comment).

Skipped (noted for follow-up): parallelizing AwaitWorkloadServers and reusing cc-node channels across phases (future N-node scaling, adds concurrency/state not worth it at the 2-node foundation); factoring the shared conf-map / gflags / DSS bring-up blocks out of the merged Phase 1 test_node.cpp (a larger refactor of already-merged code, outside this diff); generalizing ClusterOptions to multi-member NGs and a dedicated Promote(ng,term) RPC (deliberate YAGNI until the failover/standby increment).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
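
The first cleanup item separates what is constant across retries (the request) from what must be fresh per attempt (the controller and response). A minimal sketch of that loop shape follows, with plain structs standing in for the brpc-generated types; `LeaderRequest`, `LeaderResponse`, `SendFn` and `DriveWithRetry` are all hypothetical, and treating every non-success reply as retryable is an assumption of this sketch.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical stand-ins for an RPC request/response pair.
struct LeaderRequest
{
    uint32_t ng_id;
    int64_t term;
};

struct LeaderResponse
{
    bool error = false;
    bool retry = false;
};

// Invoked once per attempt; returns false on a transport-level failure.
using SendFn = std::function<bool(const LeaderRequest &, LeaderResponse *)>;

bool DriveWithRetry(const SendFn &send,
                    uint32_t ng_id,
                    int64_t term,
                    int max_attempts,
                    std::chrono::milliseconds backoff)
{
    // Constant across attempts, so it is built once outside the loop.
    const LeaderRequest req{ng_id, term};

    for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
        // Per-attempt state: a fresh response for every call.
        LeaderResponse resp;
        bool transport_ok = send(req, &resp);
        if (transport_ok && !resp.error && !resp.retry)
        {
            return true;
        }
        // Transport failure, error, or an explicit retry hint: back off and
        // try again until the attempts run out.
        std::this_thread::sleep_for(backoff);
    }
    return false;
}
```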

Two distinct bugs were chained under the quarantined LargeObjLRU-Test (the FastMetaDataMutex tls_shard_idx spin was fixed earlier in this branch):

1. Engine bug (template_cc_map.h, FindEmplace, LO_LRU policy): when the large-object page is the LAST page in the map and a key greater than it is inserted, the `target_it == ccmp_.end()` branch creates the new next-page but never repoints `target_page` to it (the parallel `else` branch does). The key is then Emplaced back into the large-object page, and TryUpdatePageKey calls FirstKey() on the still-empty new page -> assert(!keys_.empty()) (TC-FE-01). In NDEBUG this is worse: FirstKey() reads keys_.front() on an empty vector (UB) and the large-object "alone on its page" invariant is violated. Fix: add the missing `target_page = target_it->second.get();`, mirroring the else branch.

2. Test-setup bug (LargeObjLRU-Test.cpp, 4 sites): tests create partial-commit dirty entries and compensated dirty_data_key_count_ via f.shard.AdjustDataKeyStats(...) -- the SHARD counter only. OnCommittedUpdate normally bumps both the shard and the MAP counters, so the map's dirty_data_key_count_ was left at 0 while the entries are dirty, and Terminate()'s decrement underflowed it (assert in TemplateCcMap::AdjustDataKeyStats). This surfaced only once bug #1 stopped killing the suite earlier. Fix: drive the real clean->dirty API (cc_map.OnCommittedUpdate) so both counters stay in step. (This is unrelated to the production dirty-count underflow paths in #465.)

With both fixes the full LargeObjLRU-Test (23 cases, 3701 assertions) passes deterministically, so it is un-gated in CI (only ClusterCrossNg-Test remains non-gating). Full ctest suite: 40/40 green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
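
The shape of the FindEmplace bug is easier to see in a toy model. The sketch below is not the TemplateCcMap code: `Page`, `PageMap` and `InsertAfterLargeObject` are hypothetical, pages are keyed by their first key in a `std::map`, and re-keying of existing pages is ignored. It only illustrates why the end-of-map branch must repoint `target_page` at the page it has just created.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <memory>
#include <vector>

// Toy page: a sorted run of keys. A large-object page must stay alone with
// its single key.
struct Page
{
    std::vector<int> keys;
    bool large_object = false;
};

// Pages indexed by their first key, loosely imitating an ordered page map.
using PageMap = std::map<int, std::unique_ptr<Page>>;

// Inserts `key` (greater than the key on the large-object page at lo_it)
// into the page that follows it, creating that page when lo_it is the last
// page. Returns the page that received the key.
Page *InsertAfterLargeObject(PageMap &pages, PageMap::iterator lo_it, int key)
{
    Page *target_page = lo_it->second.get();
    auto target_it = std::next(lo_it);
    if (target_it == pages.end())
    {
        target_it = pages.emplace(key, std::make_unique<Page>()).first;
        // The assignment the fix adds. Without it, target_page still points
        // at the large-object page and the new page stays empty.
        target_page = target_it->second.get();
    }
    else
    {
        target_page = target_it->second.get();
    }

    // Keep the page's keys sorted.
    auto pos = std::upper_bound(
        target_page->keys.begin(), target_page->keys.end(), key);
    target_page->keys.insert(pos, key);
    return target_page;
}
```

Dropping the assignment in the first branch reproduces both symptoms from the commit message in miniature: the large-object page ends up with two keys, and the freshly created page has none.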

Eliminate the port reserve-then-bind TOCTOU that made ClusterCrossNg-Test a gating risk, then gate it.

- ScriptedHostManager: bind the HM brpc server on a brpc PortRange instead of a pre-picked port, and expose Port(). brpc atomically finds+holds a free port, so there is no reserve/bind window at all. The driver reads the bound port back (Start() -> hm_.Port()); the ctor no longer reserves an HM port.
- TestCluster workload port: AwaitWorkloadServers now detects a node that died during bring-up (waitpid WNOHANG) and re-spawns it on a fresh workload port (up to a few times) instead of failing the cluster -- self-healing the workload-port TOCTOU.
- TestCluster tx-port window: the tx window is shared across the topology (every node must agree on every peer's ports), so it cannot be re-picked for one node. Start() now wraps the whole spawn+await in a bounded retry that, on a bring-up failure, kills all nodes, re-picks every node's ports, rebuilds the topology, and re-spawns -- covering tx-port collisions under heavy port contention.

Stress: 30/30 sequential, 36/36 at 6x parallel, 50/50 at 10x parallel, 0 orphaned processes. Before the whole-cluster retry, 6x parallel showed a ~1/18 SIGABRT during bring-up (a tx-port collision); it is gone now.

With this ClusterCrossNg-Test is stable enough to gate, so the CI carve-out is removed and the full suite gates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
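
The whole-cluster retry described for the tx-port window amounts to a bounded loop that resets everything between attempts, because a shared topology cannot be patched for one node. A minimal sketch of that loop is below; `BringUpHooks` and `StartWithRetry` are hypothetical names, and the three hooks stand in for the driver's real port-picking, spawn and kill steps.

```cpp
#include <functional>

// Hooks the driver would supply. All three are hypothetical placeholders.
struct BringUpHooks
{
    // Choose fresh ports for every node and rebuild the shared topology.
    std::function<void()> pick_ports;
    // Spawn all nodes and wait for them; returns false on bring-up failure.
    std::function<bool()> spawn_and_await;
    // Tear down whatever was spawned in the failed attempt.
    std::function<void()> kill_all;
};

// Tries a full bring-up at most max_attempts times. Returns the 1-based
// attempt number that succeeded, or 0 if every attempt failed.
int StartWithRetry(const BringUpHooks &hooks, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
    {
        hooks.pick_ports();
        if (hooks.spawn_and_await())
        {
            return attempt;
        }
        // Every node must agree on every peer's ports, so a failed attempt
        // is discarded as a whole before the next one starts.
        hooks.kill_all();
    }
    return 0;
}
```

Keeping the attempt count bounded means a genuinely broken node still fails the test instead of retrying forever.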

liunyl · 2026-06-13T23:48:17Z

@CodeRabbit review

coderabbitai · 2026-06-13T23:48:23Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

liunyl and others added 8 commits June 13, 2026 18:18

test(cluster): workload proto + cluster_harness lib skeleton

42ee303

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(cluster): multi-node TxNode bring-up (external HM, skip_wal)

c1b6d26

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(cluster): txnode binary with workload brpc service

d2b0d94

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(cluster): scripted HostMangerService (StartNode/GetLeader)

4584199

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(cluster): TestCluster process management (spawn/kill/ports/stubs)

40443e2

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(cluster): assert cross-NG write commit (read-back blocked by eng…

52c5dbc

…ine skip_kv gap); harden node cleanup; non-gating CI Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 13, 2026


Comment thread tx_service/tests/cluster/txnode_bringup.cpp Outdated

liunyl and others added 2 commits June 13, 2026 20:04

liunyl and others added 3 commits June 13, 2026 22:27

thweetkomputer approved these changes Jun 14, 2026


liunyl merged commit a210368 into main Jun 14, 2026
5 checks passed

liunyl deleted the feat/test-framework-phase2 branch June 14, 2026 08:54

liunyl merged 13 commits into main from feat/test-framework-phase2
