Skip to content

feat(fleet): broker heartbeat carries node roster snapshot for liveness (factory p11)#1139

Merged
khaliqgant merged 8 commits into
mainfrom
ricky/factory-p11-broker-heartbeat
Jun 16, 2026
Merged

feat(fleet): broker heartbeat carries node roster snapshot for liveness (factory p11)#1139
khaliqgant merged 8 commits into
mainfrom
ricky/factory-p11-broker-heartbeat

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 16, 2026

Copy link
Copy Markdown
Member

What

The broker fleet node heartbeat now carries the node roster snapshot (name, node_id, capabilities, max_agents, version) alongside live load/active_agents/handlers_live, so the relaycast engine can refresh a node's descriptor from the steady-state heartbeat without waiting for a fresh node.register — keeping nodes.list() accurate for live load, active agents, capabilities, and liveness across reconnects and engine restarts.

Engine extended, not feature reverted

An earlier iteration of this branch reverted the roster fields because the relaycast engine's node.heartbeat Zod schema was .strict() and accepted only load/active_agents/handlers_live — every roster-carrying heartbeat was rejected wholesale, dropping all live load updates and timing out the Fleet E2E spawn/load scenarios. This version instead extends the engine to accept and adopt the roster:

  • relaycast#197 (feat/heartbeat-roster-snapshot, commit 2f685fa99ad486b09de5fb61091594fd48459815): extends FleetNodeHeartbeatMessageSchema with the optional roster fields and has heartbeatNode() refresh the node row + register newly-advertised capability actions.

.github/workflows/fleet-e2e.yml is repointed from b673dfb to 2f685fa so CI runs this broker feature against the engine that accepts it.

Single source of truth

  • last_heartbeat_at is intentionally NOT sent — the engine stamps receipt time server-side as the authoritative liveness clock (engine ignores any client value; broker omits it entirely).
  • max_agents in the heartbeat is sourced from the active FleetLoadSnapshot (the same denominator used for the load ratio, kept in lockstep with node.register via RegisterNode/UpdateLoad), so load and max_agents in a single heartbeat never diverge.

Kept (unchanged from the prior commit)

  • Republishing the fleet load snapshot on worker release/exit/restart (api.rs/fleet.rs/runtime/maintenance.rs).
  • The maintenance.rs same-tick post-restart publish fix (CodeRabbit Major).

Verification (local)

  • cargo build -p agent-relay-broker + --release: clean.
  • cargo fmt --check: clean. cargo clippy -p agent-relay-broker -- -D warnings: clean.
  • broker cargo test: 767 + 12 + 1 pass (incl. node_heartbeat_carries_roster_snapshot + fleet_wire_fixtures_round_trip_semantically).
  • Fleet E2E against the relaycast#197 engine (RELAYCAST_ENGINE_DIR + release broker): 12/13 pass, including all roster/load/spawn/reschedule scenarios (capability query, capability-routed spawn, scheduled spawn → least-loaded node, reschedule on death + restart reconcile). The one failing test (resume: a resumable spawn re-binds to ORIGIN node) passes 3/3 in isolation and only times out under back-to-back full-suite resource contention — a pre-existing flake, not a regression from the roster change.

Cross-repo dependency

This PR's Fleet E2E depends on relaycast#197 (the engine that accepts the roster heartbeat). It is pushed and the e2e ref points at its SHA, but relaycast#197 needs human review/merge. Do not merge this until that engine change is reviewed.

🤖 Generated with Claude Code

factory p11 — broker outbound heartbeat/liveness + reconnect inventory sync.

The fleet node lifecycle (register, ~12s heartbeat tick, inventory.sync on
(re)connect, graceful node.deregister) already existed in the broker control
plane; the emitted heartbeat did not carry the roster snapshot, so nodes.list()
could not report accurate capabilities / maxAgents / name / version / liveness.

- crates/broker/src/fleet_wire.rs: NodeHeartbeat gains name, node_id,
  capabilities, max_agents, last_heartbeat_at, version (+ roster-snapshot unit
  test).
- crates/broker/src/node_control.rs: FleetLoadSnapshot::heartbeat(&node)
  populates the new fields from the active NodeRegister; node_register is kept
  fresh across re-register so post-reconnect heartbeats carry current roster
  data; last_heartbeat_at is broker-stamped via chrono.
- crates/broker/tests/fixtures/fleet-wire/node.heartbeat.json: golden fixture
  updated for the new required fields.
- packages/sdk RelayNode: add nodeId (maps broker node_id distinctly from the
  roster id; snake/camel tolerant) + node roster normalization tests.

AC#2 (first-to-completed dedup) and AC#4 (offline-on-lapse) are relaycast
server-side; the broker side (re-announce + cadenced timestamped heartbeat) is
implemented here.

Note: the standalone AgentWorkforce/relay-broker repo is archived; the broker
source lives in this monorepo (crates/broker), so broker + SDK ship in one PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@khaliqgant khaliqgant requested a review from willwashburn as a code owner June 16, 2026 12:31
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a publish_fleet_load_snapshot helper in the broker fleet runtime and calls it from both the agent release path (api.rs) and the maintenance tick worker exit/restart path (maintenance.rs). Extends the SDK RelayNode interface with an optional nodeId field, populated via toRelayNode normalization. Adds supporting tests for both the fleet wire heartbeat schema and SDK node normalization.

Changes

Fleet Load Freshness and SDK RelayNode nodeId

Layer / File(s) Summary
publish_fleet_load_snapshot helper extraction
crates/broker/src/runtime/fleet.rs
publish_fleet_load delegates to a new pub(super) publish_fleet_load_snapshot function that sends UpdateLoad and conditionally HeartbeatNow; a tokio test verifies the exact command ordering.
Snapshot published on agent release
crates/broker/src/runtime/api.rs
handle_api_request binds fleet_max_agents and fleet_handlers_live, then calls publish_fleet_load_snapshot after pruning agent state on both the success and unknown-worker Release paths.
Snapshot published on worker exit/restart
crates/broker/src/runtime/maintenance.rs
handle_maintenance_tick caches fleet capacity, tracks fleet_load_changed across reap and respawn steps, and calls publish_fleet_load_snapshot once per tick when the live set changed.
SDK RelayNode.nodeId type, normalization, and tests
packages/sdk/src/messaging/types.ts, packages/sdk/src/messaging/relaycast.ts, packages/sdk/src/__tests__/messaging.test.ts, CHANGELOG.md
RelayNode gains optional nodeId; toRelayNode reads nodeId/node_id from raw payloads; messaging tests extend the nodes mock and assert normalized field mappings across single and multi-node responses.
fleet_wire.rs minimal heartbeat schema test
crates/broker/src/fleet_wire.rs
Adds a unit test asserting node.heartbeat serializes to only type, v, load, active_agents, and handlers_live with no roster or timestamp fields.
Workflow trajectory artifacts
.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/...
Adds completed trajectory summary and JSON record for factory-p11-broker-heartbeat-workflow capturing agent roles, chapter timeline, and the p11 scoping decision.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • AgentWorkforce/relay#1106: Both PRs add tests against crates/broker/src/fleet_wire.rs for the node.heartbeat serialization schema.
  • AgentWorkforce/relay#1108: Both PRs modify RelayNode normalization in packages/sdk/src/messaging/relaycast.ts and types.ts to expose nodeId.
  • AgentWorkforce/relay#1107: This PR's fleet load snapshot republishing on release/exit directly extends the node heartbeat and load reporting control plane introduced in PR #1107.

Poem

🐇 Hop, hop! No more stale nodes in sight,
A snapshot is sent when workers take flight.
Release or restart, the roster stays true,
nodeId now surfaces fresh in the view.
The meadow of metrics blooms bright and new! 🌸

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: broker heartbeat now carries node roster snapshot for improved liveness reporting (factory p11 task).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and follows the expected structure with Summary, Test Plan, and detailed implementation notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ricky/factory-p11-broker-heartbeat

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the fleet node heartbeat payload in agent-relay-broker to include additional metadata, such as the node roster snapshot (capabilities, name, node ID, max agents), a heartbeat timestamp, and the broker version. This allows nodes.list() to report live load, active agents, capabilities, and liveness. The Rust broker implementation, tests, JSON fixtures, and TypeScript SDK have been updated to support and validate these new fields. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/broker/src/node_control.rs (1)

47-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use one source of truth for load and max_agents.

At Line 48, load is derived from self.max_agents, but at Line 59, heartbeat publishes max_agents from node.max_agents. If these diverge, the payload can report inconsistent state (for example, non-zero load with max_agents = 0), which breaks roster accuracy downstream.

Suggested fix
 impl FleetLoadSnapshot {
     fn heartbeat(&self, node: &NodeRegister) -> NodeHeartbeat {
-        let load = if self.max_agents == 0 {
+        let capacity = node.max_agents;
+        let load = if capacity == 0 {
             0.0
         } else {
-            (self.active_agents as f64 / self.max_agents as f64).clamp(0.0, 1.0)
+            (self.active_agents as f64 / capacity as f64).clamp(0.0, 1.0)
         };
         NodeHeartbeat {
             v: FLEET_WIRE_VERSION,
             id: None,
             name: node.name.clone(),
             node_id: node.node_id.clone(),
             capabilities: node.capabilities.clone(),
             max_agents: node.max_agents,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/node_control.rs` around lines 47 - 65, The load calculation
in the heartbeat method uses self.max_agents to derive the load value, but the
NodeHeartbeat struct is populated with max_agents from node.max_agents, creating
a potential inconsistency. To ensure a single source of truth, modify the load
calculation to use node.max_agents instead of self.max_agents, so that the load
value always corresponds to the max_agents value being reported in the heartbeat
payload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/broker/src/node_control.rs`:
- Around line 47-65: The load calculation in the heartbeat method uses
self.max_agents to derive the load value, but the NodeHeartbeat struct is
populated with max_agents from node.max_agents, creating a potential
inconsistency. To ensure a single source of truth, modify the load calculation
to use node.max_agents instead of self.max_agents, so that the load value always
corresponds to the max_agents value being reported in the heartbeat payload.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7defc62d-1467-4274-b493-be17bf1ebfaa

📥 Commits

Reviewing files that changed from the base of the PR and between edd6430 and db10f39.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • crates/broker/src/fleet_wire.rs
  • crates/broker/src/node_control.rs
  • crates/broker/tests/fixtures/fleet-wire/node.heartbeat.json
  • packages/sdk/src/__tests__/messaging.test.ts
  • packages/sdk/src/messaging/relaycast.ts
  • packages/sdk/src/messaging/types.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md (1)

50-50: 💤 Low value

Remove duplicate text in the shadow-review chapter entry.

Line 50 repeats "Kept p11 residual scoped..." on both sides of the colon, reducing clarity.

♻️ Proposed fix
- Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields: Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields
+ Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md
at line 50, The shadow-review chapter entry in the summary contains duplicate
text where "Kept p11 residual scoped to activeAgents/load freshness and SDK node
liveness fields" is repeated on both sides of the colon. Remove the redundant
repetition after the colon to improve clarity, keeping only one instance of this
description on the appropriate side of the colon.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/broker/src/runtime/maintenance.rs`:
- Around line 277-286: Move the publish_fleet_load_snapshot call block
(currently at lines 277-286) to execute after the restart handling loop that
follows (at line 288+). This ensures that any worker restarts occurring in the
same tick are accounted for before publishing the fleet load snapshot,
preventing stale active_agents values in the published snapshot.

---

Nitpick comments:
In @.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md:
- Line 50: The shadow-review chapter entry in the summary contains duplicate
text where "Kept p11 residual scoped to activeAgents/load freshness and SDK node
liveness fields" is repeated on both sides of the colon. Remove the redundant
repetition after the colon to improve clarity, keeping only one instance of this
description on the appropriate side of the colon.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4c10e281-37ff-4e97-b594-bb3782b19f01

📥 Commits

Reviewing files that changed from the base of the PR and between bbaf7c4 and 7a966d0.

📒 Files selected for processing (8)
  • .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md
  • .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/trajectory.json
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/fleet.rs
  • crates/broker/src/runtime/maintenance.rs
  • packages/sdk/src/__tests__/messaging.test.ts
  • packages/sdk/src/messaging/relaycast.ts
  • packages/sdk/src/messaging/types.ts
✅ Files skipped from review due to trivial changes (1)
  • .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/trajectory.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/sdk/src/messaging/relaycast.ts
  • packages/sdk/src/messaging/types.ts

Comment thread crates/broker/src/runtime/maintenance.rs Outdated
Proactive Runtime Bot and others added 2 commits June 16, 2026 21:14
…d after restart

The relaycast engine's node.heartbeat schema is .strict() and accepts only
load/active_agents/handlers_live. Carrying the node roster snapshot
(name/node_id/capabilities/max_agents/last_heartbeat_at/version) in the
heartbeat made the engine reject every heartbeat, so node load/active_agents
never updated and the Fleet E2E spawn/load scenarios timed out. The roster is
already owned by node.register and last_heartbeat_at is stamped server-side, so
revert the heartbeat to its minimal wire shape.

Keep the genuine fix — republish the fleet load snapshot after a worker is
released/exits/restarts — and address the CodeRabbit comment by moving the
maintenance-tick publish to after restart handling (set on restart success) so
the broadcast count reflects post-restart state, not a same-tick post-reap
intermediate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nded to accept it)

Re-applies the roster-in-heartbeat feature, this time with the relaycast
engine extended to accept it (relaycast#197) instead of reverting it.

The node heartbeat now carries the node roster snapshot (name, node_id,
capabilities, max_agents, version) alongside live load/active_agents/
handlers_live, so the relaycast engine can refresh a node's descriptor
from the steady-state heartbeat without a fresh node.register — keeping
nodes.list() accurate across reconnects and engine restarts. The engine's
node.heartbeat schema (previously .strict(), accepting only
load/active_agents/handlers_live) is extended in relaycast#197 to accept
and adopt these optional roster fields; relay's Fleet E2E is repointed at
that engine commit so CI exercises the new wire contract.

Single source of truth:
- last_heartbeat_at is NOT sent — the engine stamps receipt time
  server-side as the authoritative liveness clock.
- max_agents in the heartbeat is sourced from the active FleetLoadSnapshot
  (the same denominator used for the load ratio, kept in lockstep with
  node.register via RegisterNode/UpdateLoad), so load and max_agents in
  one heartbeat never diverge.

Kept from the prior commit (unchanged): republishing the fleet load
snapshot on release/exit/restart (api.rs/fleet.rs/maintenance.rs) and the
maintenance.rs same-tick post-restart publish fix.

fleet_wire test repointed to positively assert the heartbeat carries the
roster AND omits last_heartbeat_at; fixture updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@khaliqgant khaliqgant merged commit 5a901c8 into main Jun 16, 2026
43 checks passed
@khaliqgant khaliqgant deleted the ricky/factory-p11-broker-heartbeat branch June 16, 2026 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant