Skip to content

test: Phase 2 — multi-process cluster harness (txnode + scripted host manager + TestCluster)#507

Merged
liunyl merged 13 commits into
mainfrom
feat/test-framework-phase2
Jun 14, 2026
Merged

test: Phase 2 — multi-process cluster harness (txnode + scripted host manager + TestCluster)#507
liunyl merged 13 commits into
mainfrom
feat/test-framework-phase2

Conversation

@liunyl

@liunyl liunyl commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 2 foundation of the data_substrate test framework (Tier 2): a multi-process cluster harness that spawns real node processes and drives transactions across them. Builds on the merged Phase 1 (TestNode/MemDataStore).

Because the engine's singletons (Sharder, TxStartTsCollector, …) are process-global (one node per process), each node runs as its own txnode OS process. The test driver runs an in-process scripted host manager and deterministically scripts leadership by calling each node's CcRpcService.OnLeaderStart — exactly what the real raft host manager does — instead of relying on raft elections. This makes multi-node behavior testable without timing nondeterminism.

What's included

  • txnode binary (tx_service/tests/cluster/txnode_main.cpp, txnode_bringup.{h,cpp}): brings up a real multi-node TxService (mock catalog + in-process MemDataStore, skip_wal/skip_kv, external host manager, fork_host_manager=false) and hosts a small workload brpc service (txnode_workload.proto: BeginTx/Upsert/Read/Commit/Abort/NodeInfo/WaitReady/Shutdown) the driver uses to drive and inspect the node.
  • ScriptedHostManager (scripted_host_manager.{h,cpp}): minimal HostMangerService — nodes register via StartNode, learn NG leaders via GetLeader from a driver-set map.
  • TestCluster (test_cluster.{h,cpp}): spawns/kills txnode processes (own process group, robust reap — no orphans even on test failure), reserves port windows, runs the scripted HM, drives leadership (OnLeaderStart + NotifyNewLeaderStart + concurrent WaitReady), and exposes Client(node).
  • ClusterCrossNg-Test: brings up a 2-node / 2-NG cluster and asserts both NGs are led and that cross-NG write transactions commit from both nodes (writing keys that span both NGs exercises remote write-lock acquisition + 2PC across nodes). 3× green, zero orphan processes.
  • port_util.h: factored the port-reservation helpers out of Phase 1's test_node.cpp (shared by both); Phase 1 suite still passes.

Known limitation (documented, tracked — NOT a harness defect)

Cross-NG read-back is intentionally not asserted. Under skip_kv, a freshly-promoted leader's InitRangeBuckets copies bucket info (enough for write routing) but does not seed the RangeBucket CcMap entries that a read pins (ReadOperationlock_bucket_op_) — this is flagged // TODO: HARDCORE SEED in cc_node.cpp. So cross-NG/cross-bucket reads hang on the bucket-meta lock while writes commit. This is an engine gap (same skip_kv family that deferred storage-backed Phase 1 work to e2e); resolving it needs core tx_execution/InitRangeBuckets changes and is a follow-up. The cluster test is non-gating in CI (like LargeObjLRU-Test) until then.

CI

ClusterCrossNg-Test runs in the non-gating step (ctest -L), excluded from the gating run (-LE "LargeObjLRU-Test|ClusterCrossNg-Test"), since it spawns processes and is heavier; promote to gating once proven stable across CI runs.

Follow-up Phase 2 increments (own PRs)

standby replication + failover; crash-recovery with an embedded LogServer; message-level fault injection; simplified scale in/out; and the cross-NG read once the RangeBucket seed gap is fixed.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added comprehensive multi-node cluster test infrastructure, standalone node driver, scripted host manager, workload RPCs, and end-to-end cross-node commit test.
  • Chores
    • Updated CI to gate unit tests while running stress/cluster tests non-blocking for informational reporting.
    • Added shared port-reservation and test orchestration utilities.

liunyl and others added 8 commits June 13, 2026 18:18
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…able

NodeConfig defaults is_candidate=false (a voter); a non-candidate cannot be
made leader (cc_node.cpp rejects a non-candidate leader target). In this
1-member-per-NG foundation each member is its NG's sole leader, so mark it a
candidate, matching how production's ParseNgConfig marks NG primaries.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Implement TestCluster::Start() (HM up -> spawn nodes -> await workload
servers -> drive OnLeaderStart per NG + SetLeader -> NotifyNewLeaderStart
to all nodes -> WaitReady) and add a WaitReady RPC to the txnode workload
service that finishes the native-NG recovery (skip_wal leaves no log
service to promote candidate->leader) and drives WaitClusterReady.

Add ClusterCrossNg-Test: a real 2-node cluster, write 20 keys via node 0
(some remote to ng1) and read them back via node 1 (some remote to ng0).

Cluster bring-up and cross-NG writes work; the cross-NG READ is currently
blocked by a pre-existing engine issue in the bucket-meta lock path on a
freshly-promoted skip_kv leader (ReadOperation::Forward
lock_range_bucket_result_->IsFinished() / the RangeBucket ReadLocal never
completing under skip_kv). See the STATUS comment in ClusterCrossNg-Test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine skip_kv gap); harden node cleanup; non-gating CI

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@liunyl, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 40 minutes. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdb0c88a-c147-40a2-a503-317b23ae4609

📥 Commits

Reviewing files that changed from the base of the PR and between 12a07e4 and c426f06.

📒 Files selected for processing (9)
  • .github/workflows/unit-tests.yml
  • tx_service/include/cc/template_cc_map.h
  • tx_service/tests/LargeObjLRU-Test.cpp
  • tx_service/tests/cluster/scripted_host_manager.cpp
  • tx_service/tests/cluster/scripted_host_manager.h
  • tx_service/tests/cluster/test_cluster.cpp
  • tx_service/tests/cluster/test_cluster.h
  • tx_service/tests/cluster/txnode_bringup.cpp
  • tx_service/tests/cluster/txnode_bringup.h

Walkthrough

Implements Phase 2 multi-process cluster test harness: workload proto and port utilities, txnode bring‑up fixtures, scripted HostManager, TestCluster orchestration to spawn txnode subprocesses, txnode binary exposing WorkloadService, Catch2 cross‑NG integration test, and CMake/CI wiring.

Changes

Phase 2 Multi-Process Cluster Test Harness

Layer / File(s) Summary
Workload protocol and port utilities
tx_service/tests/cluster/txnode_workload.proto, tx_service/tests/harness/port_util.h
gRPC WorkloadService contract (BeginTx, Upsert, Read, Commit, Abort, NodeInfo, WaitReady, Shutdown) and header-only port-binding utilities (BindEphemeralPort, TryBindPort, ReserveTxPortWindow).
ScriptedHostManager test service
tx_service/tests/cluster/scripted_host_manager.h, tx_service/tests/cluster/scripted_host_manager.cpp
In-driver brpc HostManager mock service that registers nodes (StartNode RPC) and resolves NG leaders (GetLeader RPC) from a mutex-protected in-memory map.
TxNode bring-up and configuration
tx_service/tests/cluster/txnode_bringup.h, tx_service/tests/cluster/txnode_bringup.cpp, tx_service/tests/harness/test_node.cpp
TxNodeConfig and TxNode to construct per-node in-process DataStore and TxService, register test tables, reserve ports/dirs, and refactor test_node.cpp to use shared port utilities.
txnode binary entry point
tx_service/tests/cluster/txnode_main.cpp
Standalone executable implementing WorkloadServiceImpl, topology parsing, transaction-handle management, RPC forwarding to TxService, NodeInfo/WaitReady proxying, and clean shutdown on SIGTERM/SIGINT.
TestCluster driver orchestration
tx_service/tests/cluster/test_cluster.h, tx_service/tests/cluster/test_cluster.cpp
Out-of-process cluster driver orchestrating port/directory reservations, txnode subprocess spawning, workload client construction, per-NG leadership driving, leader notification propagation, and readiness synchronization (concurrent WaitReady + NodeInfo + liveness Commit).
Cross-NG integration test and build
tx_service/tests/ClusterCrossNg-Test.cpp, tx_service/tests/CMakeLists.txt, .github/workflows/unit-tests.yml
Catch2 end-to-end test validating 2-node cluster bring-up and cross-NG write/commit flows; CMake adds protoc codegen, cluster_harness, txnode, and ClusterCrossNg-Test targets; CI excludes ClusterCrossNg-Test from gated unit tests and runs a non-gating stress/cluster step.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • thweetkomputer
  • githubzilla

Poem

🐇 I spun up nodes in a testing glade,
RPCs hopped through the grassy glen,
Leaders found homes and writes were made,
Commits returned true—then we ran again. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main contribution: Phase 2 of the multi-process cluster harness implementation (txnode, scripted host manager, TestCluster).
Description check ✅ Passed The description is comprehensive and well-structured, covering the summary, included components, known limitations, CI placement, and follow-up work. It follows good documentation practices and aligns with the template's intent despite missing explicit checklist items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/test-framework-phase2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
tx_service/tests/cluster/txnode_main.cpp (1)

164-168: 💤 Low value

Consider validating port range before casting to uint16_t.

std::stoul succeeds for values > 65535 (e.g., "70000"), but static_cast<uint16_t> silently truncates them. While TCP ports are inherently 16-bit and the test harness controls inputs, explicit validation would surface configuration errors early.

♻️ Proposed validation
         try
         {
             uint32_t ng_id = static_cast<uint32_t>(std::stoul(ng_str));
             uint32_t node_id = static_cast<uint32_t>(std::stoul(node_str));
-            uint16_t port = static_cast<uint16_t>(std::stoul(port_str));
+            unsigned long port_ul = std::stoul(port_str);
+            if (port_ul > 65535)
+            {
+                throw std::runtime_error("port out of range");
+            }
+            uint16_t port = static_cast<uint16_t>(port_ul);
             ng_members[ng_id].emplace_back(node_id, host, port);
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/cluster/txnode_main.cpp` around lines 164 - 168, The code
converts port_str via std::stoul then static_cast<uint16_t>, which silently
truncates values > 65535; change the port parsing in the try block to first
parse into an unsigned long (or uint32_t) (use the current std::stoul result),
validate that the value is <= 65535 (UINT16_MAX) before casting, and if out of
range throw or log an error and fail the parse (so port is not silently
truncated). Update the parsing for port (the variable currently named port and
the std::stoul(port_str) call) to perform this range check and handle the error
path consistently with the existing ng_id/node_id parse error handling.
tx_service/tests/ClusterCrossNg-Test.cpp (1)

61-114: 💤 Low value

Consider extracting the hardcoded RPC timeout to a named constant.

The 5000ms timeout appears in all four helper functions (lines 64, 78, 92, 107). Extracting it to a constexpr int kRpcTimeoutMs = 5000; at namespace scope would make future adjustments easier and document the timeout policy.

♻️ Proposed refactor
 namespace
 {
+constexpr int kRpcTimeoutMs = 5000;
+
 // --- Workload-RPC helper wrappers over a WorkloadService_Stub. Each opens a
 // fresh brpc::Controller (a Controller is single-use) and asserts the RPC did
 // not fail at the transport level before inspecting the reply. ---
 
 int64_t LeaderTerm(WorkloadStub &stub, uint32_t ng_id)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 uint64_t Begin(WorkloadStub &stub)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 bool Upsert(WorkloadStub &stub, uint64_t handle, int key, int value)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
 
 bool Commit(WorkloadStub &stub, uint64_t handle)
 {
     brpc::Controller cntl;
-    cntl.set_timeout_ms(5000);
+    cntl.set_timeout_ms(kRpcTimeoutMs);
     ...
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/ClusterCrossNg-Test.cpp` around lines 61 - 114, The four
helper functions LeaderTerm, Begin, Upsert, and Commit duplicate the hardcoded
RPC timeout (cntl.set_timeout_ms(5000)); introduce a single namespace-scope
constexpr int kRpcTimeoutMs = 5000 and replace each cntl.set_timeout_ms(5000)
call with cntl.set_timeout_ms(kRpcTimeoutMs) so the timeout policy is documented
and easy to change in one place; keep the new constant near the top of the test
file in the anonymous or test namespace scope so it is visible to all four
functions.
tx_service/tests/cluster/txnode_workload.proto (1)

2-2: 💤 Low value

Consider aligning package directory with proto convention.

The Buf linter flags that package txnode_workload should reside in a txnode_workload/ directory relative to the repository root, but the file is in tx_service/tests/cluster/. While this works functionally and may be intentional for test organization, aligning with protobuf directory conventions improves tooling compatibility and consistency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tx_service/tests/cluster/txnode_workload.proto` at line 2, The proto package
declaration package txnode_workload does not match protobuf directory
conventions; either move the .proto into a matching txnode_workload/ directory
(so the file path aligns with package txnode_workload) or change the package
declaration to reflect the current test directory layout; update any import or
build references that depend on this package name accordingly so the linter and
tooling see a consistent package-to-directory mapping.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tx_service/tests/cluster/txnode_bringup.cpp`:
- Around line 63-90: Remove the duplicate BindEphemeralPort implementation and
instead include the shared helper from harness/port_util.h and add a using
declaration (e.g., using ::BindEphemeralPort;) so this file reuses the common
function (as test_node.cpp does); also fix the misleading comment that says
"loopback"—either remove that phrase or change it to reflect that the bind uses
INADDR_ANY (0.0.0.0).

---

Nitpick comments:
In `@tx_service/tests/cluster/txnode_main.cpp`:
- Around line 164-168: The code converts port_str via std::stoul then
static_cast<uint16_t>, which silently truncates values > 65535; change the port
parsing in the try block to first parse into an unsigned long (or uint32_t) (use
the current std::stoul result), validate that the value is <= 65535 (UINT16_MAX)
before casting, and if out of range throw or log an error and fail the parse (so
port is not silently truncated). Update the parsing for port (the variable
currently named port and the std::stoul(port_str) call) to perform this range
check and handle the error path consistently with the existing ng_id/node_id
parse error handling.

In `@tx_service/tests/cluster/txnode_workload.proto`:
- Line 2: The proto package declaration package txnode_workload does not match
protobuf directory conventions; either move the .proto into a matching
txnode_workload/ directory (so the file path aligns with package
txnode_workload) or change the package declaration to reflect the current test
directory layout; update any import or build references that depend on this
package name accordingly so the linter and tooling see a consistent
package-to-directory mapping.

In `@tx_service/tests/ClusterCrossNg-Test.cpp`:
- Around line 61-114: The four helper functions LeaderTerm, Begin, Upsert, and
Commit duplicate the hardcoded RPC timeout (cntl.set_timeout_ms(5000));
introduce a single namespace-scope constexpr int kRpcTimeoutMs = 5000 and
replace each cntl.set_timeout_ms(5000) call with
cntl.set_timeout_ms(kRpcTimeoutMs) so the timeout policy is documented and easy
to change in one place; keep the new constant near the top of the test file in
the anonymous or test namespace scope so it is visible to all four functions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 415700e9-30bf-49cc-bd47-843fb46ff92d

📥 Commits

Reviewing files that changed from the base of the PR and between e9899d5 and 52c5dbc.

📒 Files selected for processing (13)
  • .github/workflows/unit-tests.yml
  • tx_service/tests/CMakeLists.txt
  • tx_service/tests/ClusterCrossNg-Test.cpp
  • tx_service/tests/cluster/scripted_host_manager.cpp
  • tx_service/tests/cluster/scripted_host_manager.h
  • tx_service/tests/cluster/test_cluster.cpp
  • tx_service/tests/cluster/test_cluster.h
  • tx_service/tests/cluster/txnode_bringup.cpp
  • tx_service/tests/cluster/txnode_bringup.h
  • tx_service/tests/cluster/txnode_main.cpp
  • tx_service/tests/cluster/txnode_workload.proto
  • tx_service/tests/harness/port_util.h
  • tx_service/tests/harness/test_node.cpp

Comment thread tx_service/tests/cluster/txnode_bringup.cpp Outdated
liunyl and others added 2 commits June 13, 2026 20:04
Apply fixes from a local multi-agent review of PR #507:

Diagnosability / robustness (test_cluster.cpp):
- AwaitWorkloadServers: detect a txnode that died during bring-up via
  waitpid(WNOHANG) and fail immediately with its exit/signal status,
  instead of burning the full 30s on connection-refused RPCs. Give each
  node its own timeout budget so a slow first node cannot starve later
  ones.
- SpawnNode: check the posix_spawn_file_actions_* / posix_spawnattr_*
  setup return values; a silently-failed setpgroup would leave the child
  in the driver's group and make the dtor's group-kill backstop target
  the wrong group.
- KillNode: retry waitpid on EINTR so a signal mid-wait cannot leave the
  child unreaped.
- BuildClient: set max_retry=0 -- the workload RPCs are non-idempotent,
  so a transport-level retry could leak a tx handle or re-commit.
- DriveLeader: treat resp.retry() as a retry condition (matches the
  documented contract; keeps the loop correct if the engine ever returns
  retry without error).
- AwaitClusterReady: on a non-failed WaitReady reply with ready=false
  (a driver-ordering bug), fail loudly and immediately rather than
  spinning to the deadline.

txnode_main.cpp:
- WaitReady: single-flight the WaitClusterReady handshake so a retried
  WaitReady RPC (after a client-side timeout) cannot launch a second
  concurrent WaitClusterReady in the same process.

Test (ClusterCrossNg-Test.cpp):
- Assert NodeId(c0)==0 / NodeId(c1)==1 so a topology/port mix-up cannot
  silently degenerate the cross-NG test into a single-node test.

Comment accuracy:
- Note log-group binds base+2 in the 4-wide port-window comments
  (txnode_main.cpp, txnode_bringup.cpp, test_cluster.cpp).
- DriveLeader doc: it sends only the OnLeaderStart RPC; NotifyLeader and
  the recovery handshake are driven separately.
- WaitReady: name the real promoter (FinishLogGroupReplay).
- Clarify that Upsert buffers locally and the cross-NG work is at Commit;
  describe the node-1 case as the coordinator/remote-role mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- txnode_bringup.cpp: drop the duplicate BindEphemeralPort and reuse the
  shared txservice::test::BindEphemeralPort from harness/port_util.h (same
  behavior; the file is already in namespace txservice::test, so the call
  resolves without a using-declaration). Removes the now-unused
  <netinet/in.h>/<sys/socket.h> includes.
- txnode_main.cpp ParseTopology: reject ports > 65535 explicitly instead of
  silently truncating std::stoul's result to uint16_t.
- ClusterCrossNg-Test.cpp: hoist the duplicated 5000ms RPC timeout into a
  single constexpr kRpcTimeoutMs.

Skipped CodeRabbit's proto-package-directory nitpick: txnode_workload.proto
is a test-only definition deliberately colocated with the cluster harness it
serves and referenced there by the CMake proto-gen; moving it for the Buf
directory convention has no functional benefit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@liunyl

liunyl commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the CodeRabbit review in 12a07e4:

  • txnode_main.cpp ParseTopology (port range): now parses into unsigned long, rejects values > 65535, then casts — no more silent truncation.
  • ClusterCrossNg-Test.cpp (RPC timeout): hoisted the duplicated 5000 ms into a single constexpr int kRpcTimeoutMs at namespace scope.
  • txnode_bringup.cpp (duplicate BindEphemeralPort): resolved (see inline thread).

Skipped: the txnode_workload.proto package-directory nitpick. It is a test-only proto deliberately colocated with the cluster harness it serves and referenced there by the CMake proto-gen add_custom_command; moving it to a txnode_workload/ directory for the Buf convention has no functional benefit and would only complicate the test build.

Verified: ClusterCrossNg-Test green (96 assertions, 0 orphan processes) and the Phase 1 gating suite 16/16.

liunyl and others added 3 commits June 13, 2026 22:27
Quality cleanup from a /simplify pass (no behavior change):

- test_cluster.cpp DriveLeader/NotifyLeader: build the CcRpcService stub
  and the request once before the retry loop instead of per iteration
  (a stub is a stateless channel wrapper; the request fields are constant
  across retries). Only the Controller/response are per-attempt.
- test_cluster.cpp NotifyLeader: drop the dead leader-NodeProc lookup --
  it was found but never dereferenced; the request needs only
  leader_node_id (already known-valid from the caller's iteration).
- test_cluster.cpp: resolve the txnode binary path once in Start() and
  pass it to SpawnNode, instead of calling LocateTxnodeBinary
  (readlink + stat) per node. Matches the existing BuildTopologyString
  hoist.
- txnode_bringup.cpp: collapse the two identical CatalogFactory*[3]
  arrays into one shared by both the DataStoreServiceClient and TxService
  ctors.
- Hoist the duplicated cluster_config_version literal into a single
  shared constant kClusterConfigVersion in txnode_bringup.h, used by both
  the node bring-up and the driver's OnLeaderStart (the two must agree;
  the coupling was previously only documented in a comment).

Skipped (noted for follow-up): parallelizing AwaitWorkloadServers and
reusing cc-node channels across phases (future N-node scaling, adds
concurrency/state not worth it at the 2-node foundation); factoring the
shared conf-map / gflags / DSS bring-up blocks out of the merged Phase 1
test_node.cpp (a larger refactor of already-merged code, outside this
diff); generalizing ClusterOptions to multi-member NGs and a dedicated
Promote(ng,term) RPC (deliberate YAGNI until the failover/standby
increment).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two distinct bugs were chained under the quarantined LargeObjLRU-Test (the
FastMetaDataMutex tls_shard_idx spin was fixed earlier in this branch):

1. Engine bug (template_cc_map.h, FindEmplace, LO_LRU policy): when the
   large-object page is the LAST page in the map and a key greater than it
   is inserted, the `target_it == ccmp_.end()` branch creates the new
   next-page but never repoints `target_page` to it (the parallel `else`
   branch does). The key is then Emplaced back into the large-object page,
   and TryUpdatePageKey calls FirstKey() on the still-empty new page ->
   assert(!keys_.empty()) (TC-FE-01). In NDEBUG this is worse: FirstKey()
   reads keys_.front() on an empty vector (UB) and the large-object
   "alone on its page" invariant is violated. Fix: add the missing
   `target_page = target_it->second.get();`, mirroring the else branch.

2. Test-setup bug (LargeObjLRU-Test.cpp, 4 sites): tests create partial-
   commit dirty entries and compensated dirty_data_key_count_ via
   f.shard.AdjustDataKeyStats(...) -- the SHARD counter only. OnCommittedUpdate
   normally bumps both the shard and the MAP counters, so the map's
   dirty_data_key_count_ was left at 0 while the entries are dirty, and
   Terminate()'s decrement underflowed it (assert in
   TemplateCcMap::AdjustDataKeyStats). This surfaced only once bug #1 stopped
   killing the suite earlier. Fix: drive the real clean->dirty API
   (cc_map.OnCommittedUpdate) so both counters stay in step. (This is
   unrelated to the production dirty-count underflow paths in #465.)

With both fixes the full LargeObjLRU-Test (23 cases, 3701 assertions) passes
deterministically, so it is un-gated in CI (only ClusterCrossNg-Test remains
non-gating). Full ctest suite: 40/40 green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Eliminate the port reserve-then-bind TOCTOU that made ClusterCrossNg-Test a
gating risk, then gate it.

- ScriptedHostManager: bind the HM brpc server on a brpc PortRange instead of
  a pre-picked port, and expose Port(). brpc atomically finds+holds a free
  port, so there is no reserve/bind window at all. The driver reads the bound
  port back (Start() -> hm_.Port()); the ctor no longer reserves an HM port.

- TestCluster workload port: AwaitWorkloadServers now detects a node that died
  during bring-up (waitpid WNOHANG) and re-spawns it on a fresh workload port
  (up to a few times) instead of failing the cluster -- self-healing the
  workload-port TOCTOU.

- TestCluster tx-port window: the tx window is shared across the topology
  (every node must agree on every peer's ports), so it cannot be re-picked for
  one node. Start() now wraps the whole spawn+await in a bounded retry that, on
  a bring-up failure, kills all nodes, re-picks every node's ports, rebuilds
  the topology, and re-spawns -- covering tx-port collisions under heavy port
  contention.

Stress: 30/30 sequential, 36/36 at 6x parallel, 50/50 at 10x parallel, 0
orphaned processes. Before the whole-cluster retry, 6x parallel showed a
~1/18 SIGABRT during bring-up (a tx-port collision); it is gone now. With this
ClusterCrossNg-Test is stable enough to gate, so the CI carve-out is removed
and the full suite gates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@liunyl

liunyl commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@liunyl liunyl merged commit a210368 into main Jun 14, 2026
5 checks passed
@liunyl liunyl deleted the feat/test-framework-phase2 branch June 14, 2026 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants