feat(fleet): broker heartbeat carries node roster snapshot for liveness (factory p11) #1139
factory p11: broker outbound heartbeat/liveness + reconnect inventory sync.

The fleet node lifecycle (register, ~12s heartbeat tick, inventory.sync on (re)connect, graceful node.deregister) already existed in the broker control plane; the emitted heartbeat did not carry the roster snapshot, so nodes.list() could not report accurate capabilities / maxAgents / name / version / liveness.

- crates/broker/src/fleet_wire.rs: NodeHeartbeat gains name, node_id, capabilities, max_agents, last_heartbeat_at, version (+ roster-snapshot unit test).
- crates/broker/src/node_control.rs: FleetLoadSnapshot::heartbeat(&node) populates the new fields from the active NodeRegister; node_register is kept fresh across re-register so post-reconnect heartbeats carry current roster data; last_heartbeat_at is broker-stamped via chrono.
- crates/broker/tests/fixtures/fleet-wire/node.heartbeat.json: golden fixture updated for the new required fields.
- packages/sdk RelayNode: add nodeId (maps broker node_id distinctly from the roster id; snake/camel tolerant) + node roster normalization tests.

AC#2 (first-to-completed dedup) and AC#4 (offline-on-lapse) are relaycast server-side; the broker side (re-announce + cadenced timestamped heartbeat) is implemented here.

Note: the standalone AgentWorkforce/relay-broker repo is archived; the broker source lives in this monorepo (crates/broker), so broker + SDK ship in one PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request updates the fleet node heartbeat payload in agent-relay-broker to include additional metadata, such as the node roster snapshot (capabilities, name, node ID, max agents), a heartbeat timestamp, and the broker version. This allows nodes.list() to report live load, active agents, capabilities, and liveness. The Rust broker implementation, tests, JSON fixtures, and TypeScript SDK have been updated to support and validate these new fields. There are no review comments, and I have no feedback to provide.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/broker/src/node_control.rs (1)
47-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use one source of truth for `load` and `max_agents`.

At Line 48, `load` is derived from `self.max_agents`, but at Line 59, heartbeat publishes `max_agents` from `node.max_agents`. If these diverge, the payload can report inconsistent state (for example, non-zero `load` with `max_agents = 0`), which breaks roster accuracy downstream.

Suggested fix

     impl FleetLoadSnapshot {
         fn heartbeat(&self, node: &NodeRegister) -> NodeHeartbeat {
    -        let load = if self.max_agents == 0 {
    +        let capacity = node.max_agents;
    +        let load = if capacity == 0 {
                 0.0
             } else {
    -            (self.active_agents as f64 / self.max_agents as f64).clamp(0.0, 1.0)
    +            (self.active_agents as f64 / capacity as f64).clamp(0.0, 1.0)
             };
             NodeHeartbeat {
                 v: FLEET_WIRE_VERSION,
                 id: None,
                 name: node.name.clone(),
                 node_id: node.node_id.clone(),
                 capabilities: node.capabilities.clone(),
                 max_agents: node.max_agents,

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/broker/src/node_control.rs` around lines 47 - 65, The load calculation in the heartbeat method uses self.max_agents to derive the load value, but the NodeHeartbeat struct is populated with max_agents from node.max_agents, creating a potential inconsistency. To ensure a single source of truth, modify the load calculation to use node.max_agents instead of self.max_agents, so that the load value always corresponds to the max_agents value being reported in the heartbeat payload.
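The single-source-of-truth idea in this comment can be sketched in isolation. The following is a minimal illustration, not the broker's actual code: `LoadSnapshot` and `Heartbeat` are hypothetical stand-ins for `FleetLoadSnapshot` and `NodeHeartbeat`, and only the fields needed for the ratio are modeled. The point is that one `capacity` value feeds both the `load` ratio and the reported `max_agents`.

```rust
/// Hypothetical stand-in for the broker's load snapshot (assumption:
/// only the field needed for the ratio is modeled).
struct LoadSnapshot {
    active_agents: u32,
}

/// Hypothetical stand-in for the heartbeat payload.
#[derive(Debug, PartialEq)]
struct Heartbeat {
    load: f64,
    active_agents: u32,
    max_agents: u32,
}

impl LoadSnapshot {
    /// Build a heartbeat in which `load` and `max_agents` are both derived
    /// from the same `capacity`, so they cannot disagree within one payload.
    fn heartbeat(&self, capacity: u32) -> Heartbeat {
        let load = if capacity == 0 {
            // No capacity advertised: report zero load instead of dividing by zero.
            0.0
        } else {
            (self.active_agents as f64 / capacity as f64).clamp(0.0, 1.0)
        };
        Heartbeat {
            load,
            active_agents: self.active_agents,
            max_agents: capacity,
        }
    }
}
```

For example, `LoadSnapshot { active_agents: 2 }.heartbeat(4)` produces a `load` of 0.5 next to `max_agents: 4`, and a capacity of 0 always produces a `load` of 0.0. Whether the capacity should come from the register message or from the load snapshot is a separate decision; the sketch only shows that it should be read once.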
📒 Files selected for processing (7)
- CHANGELOG.md
- crates/broker/src/fleet_wire.rs
- crates/broker/src/node_control.rs
- crates/broker/tests/fixtures/fleet-wire/node.heartbeat.json
- packages/sdk/src/__tests__/messaging.test.ts
- packages/sdk/src/messaging/relaycast.ts
- packages/sdk/src/messaging/types.ts
Actionable comments posted: 1
🧹 Nitpick comments (1)
.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md (1)
50-50: 💤 Low value

Remove duplicate text in the shadow-review chapter entry.

Line 50 repeats "Kept p11 residual scoped..." on both sides of the colon, reducing clarity.

♻️ Proposed fix

    - Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields: Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields
    + Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md at line 50, The shadow-review chapter entry in the summary contains duplicate text where "Kept p11 residual scoped to activeAgents/load freshness and SDK node liveness fields" is repeated on both sides of the colon. Remove the redundant repetition after the colon to improve clarity, keeping only one instance of this description on the appropriate side of the colon.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/broker/src/runtime/maintenance.rs`:
- Around line 277-286: Move the publish_fleet_load_snapshot call block
(currently at lines 277-286) to execute after the restart handling loop that
follows (at line 288+). This ensures that any worker restarts occurring in the
same tick are accounted for before publishing the fleet load snapshot,
preventing stale active_agents values in the published snapshot.
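The ordering concern in the inline comment above can be illustrated with a small sketch. This is an assumption-laden model, not the code in `maintenance.rs`: `Worker`, `active_agents`, and `maintenance_tick` are hypothetical names, restarts always succeed here, and "publishing" is modeled as pushing a count onto a vector. It shows restart handling running first and a single publish afterwards, guarded by a dirty flag.

```rust
/// Hypothetical worker state for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Worker {
    Running,
    Exited { restartable: bool },
}

/// Count workers that are currently running.
fn active_agents(workers: &[Worker]) -> usize {
    workers.iter().filter(|w| matches!(w, Worker::Running)).count()
}

/// One maintenance tick: handle restarts and reaping first, then publish
/// at most one snapshot that reflects the post-restart state.
fn maintenance_tick(workers: &mut Vec<Worker>, published: &mut Vec<usize>) {
    let mut dirty = false;

    // Restart handling: in this sketch every restartable worker comes back.
    for worker in workers.iter_mut() {
        if let Worker::Exited { restartable: true } = *worker {
            *worker = Worker::Running;
            dirty = true; // set on restart success
        }
    }

    // Reap workers that exited and cannot be restarted.
    let before = workers.len();
    workers.retain(|w| matches!(w, Worker::Running));
    dirty |= workers.len() != before;

    // Publish only after restart handling, and only if something changed.
    if dirty {
        published.push(active_agents(workers));
    }
}
```

With one running worker, one restartable exit, and one non-restartable exit, this tick publishes a count of 2. Publishing before the restart loop would have reported 1 for the same tick, which is the stale intermediate value the comment warns about.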
📒 Files selected for processing (8)
- .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/summary.md
- .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/trajectory.json
- crates/broker/src/runtime/api.rs
- crates/broker/src/runtime/fleet.rs
- crates/broker/src/runtime/maintenance.rs
- packages/sdk/src/__tests__/messaging.test.ts
- packages/sdk/src/messaging/relaycast.ts
- packages/sdk/src/messaging/types.ts
✅ Files skipped from review due to trivial changes (1)
- .agentworkforce/trajectories/completed/2026-06/traj_a1xsazrek0x1/trajectory.json
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/sdk/src/messaging/relaycast.ts
- packages/sdk/src/messaging/types.ts
…d after restart

The relaycast engine's node.heartbeat schema is .strict() and accepts only load/active_agents/handlers_live. Carrying the node roster snapshot (name/node_id/capabilities/max_agents/last_heartbeat_at/version) in the heartbeat made the engine reject every heartbeat, so node load/active_agents never updated and the Fleet E2E spawn/load scenarios timed out. The roster is already owned by node.register and last_heartbeat_at is stamped server-side, so revert the heartbeat to its minimal wire shape.

Keep the genuine fix (republish the fleet load snapshot after a worker is released/exits/restarts) and address the CodeRabbit comment by moving the maintenance-tick publish to after restart handling (set on restart success) so the broadcast count reflects post-restart state, not a same-tick post-reap intermediate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
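The failure mode described in this commit message, a strict schema rejecting the whole payload because of extra keys, can be modeled with a short sketch. The real engine schema is a Zod schema in TypeScript; the Rust below is only an analogy under that assumption. `validate_strict`, `MINIMAL_KEYS`, `ROSTER_KEYS`, and `extended_keys` are hypothetical names, and the payload is reduced to its list of keys.

```rust
/// Keys accepted by the minimal heartbeat shape named in the commit message.
const MINIMAL_KEYS: [&str; 3] = ["load", "active_agents", "handlers_live"];

/// Roster keys that an extended schema would additionally accept
/// (assumption: modeled as a plain allow-list).
const ROSTER_KEYS: [&str; 5] = ["name", "node_id", "capabilities", "max_agents", "version"];

/// Strict validation: the first key outside `allowed` rejects the whole
/// payload, and that key is returned as the error.
fn validate_strict<'a>(payload_keys: &[&'a str], allowed: &[&str]) -> Result<(), &'a str> {
    for key in payload_keys.iter().copied() {
        if !allowed.iter().any(|a| *a == key) {
            return Err(key);
        }
    }
    Ok(())
}

/// Allow-list for a schema extended with the optional roster fields.
fn extended_keys() -> Vec<&'static str> {
    MINIMAL_KEYS.iter().chain(ROSTER_KEYS.iter()).copied().collect()
}
```

A heartbeat carrying `name` fails against `MINIMAL_KEYS` even though its `load` and `active_agents` values are valid, which is why live load stopped updating. The same payload passes against `extended_keys()`.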
…nded to accept it)

Re-applies the roster-in-heartbeat feature, this time with the relaycast engine extended to accept it (relaycast#197) instead of reverting it.

The node heartbeat now carries the node roster snapshot (name, node_id, capabilities, max_agents, version) alongside live load/active_agents/handlers_live, so the relaycast engine can refresh a node's descriptor from the steady-state heartbeat without a fresh node.register, keeping nodes.list() accurate across reconnects and engine restarts.

The engine's node.heartbeat schema (previously .strict(), accepting only load/active_agents/handlers_live) is extended in relaycast#197 to accept and adopt these optional roster fields; relay's Fleet E2E is repointed at that engine commit so CI exercises the new wire contract.

Single source of truth:
- last_heartbeat_at is NOT sent: the engine stamps receipt time server-side as the authoritative liveness clock.
- max_agents in the heartbeat is sourced from the active FleetLoadSnapshot (the same denominator used for the load ratio, kept in lockstep with node.register via RegisterNode/UpdateLoad), so load and max_agents in one heartbeat never diverge.

Kept from the prior commit (unchanged): republishing the fleet load snapshot on release/exit/restart (api.rs/fleet.rs/maintenance.rs) and the maintenance.rs same-tick post-restart publish fix.

fleet_wire test repointed to positively assert the heartbeat carries the roster AND omits last_heartbeat_at; fixture updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
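The receiving side of this contract can be sketched roughly as follows. This is not the engine's `heartbeatNode()` (which lives in relaycast and is written in TypeScript); `Heartbeat`, `NodeRow`, `apply_heartbeat`, `is_live`, and the lapse threshold are all hypothetical, and only two of the roster fields are modeled. The sketch shows optional roster fields being adopted when present and the liveness timestamp coming from the receiver's clock, never from the payload.

```rust
/// Hypothetical heartbeat as received; roster fields are optional and
/// there is deliberately no timestamp field.
struct Heartbeat {
    load: f64,
    active_agents: u32,
    max_agents: Option<u32>,
    version: Option<String>,
}

/// Hypothetical node row kept by the receiver.
#[derive(Debug, Default, PartialEq)]
struct NodeRow {
    load: f64,
    active_agents: u32,
    max_agents: u32,
    version: String,
    last_heartbeat_at_ms: u64,
}

impl NodeRow {
    /// Refresh the row from a heartbeat. `received_at_ms` is the receiver's
    /// own clock reading, which acts as the authoritative liveness stamp.
    fn apply_heartbeat(&mut self, hb: &Heartbeat, received_at_ms: u64) {
        self.load = hb.load;
        self.active_agents = hb.active_agents;
        // Adopt roster fields only when the sender included them.
        if let Some(max_agents) = hb.max_agents {
            self.max_agents = max_agents;
        }
        if let Some(version) = &hb.version {
            self.version = version.clone();
        }
        self.last_heartbeat_at_ms = received_at_ms;
    }

    /// A node counts as live while the last stamp is within `lapse_ms`
    /// (assumption: the real lapse policy is defined by the engine).
    fn is_live(&self, now_ms: u64, lapse_ms: u64) -> bool {
        now_ms.saturating_sub(self.last_heartbeat_at_ms) <= lapse_ms
    }
}
```

Because the payload has no timestamp field, a node with a skewed clock cannot make itself look fresher or staler than it is; liveness depends only on when the receiver last heard from it.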
What
The broker fleet node heartbeat now carries the node roster snapshot (`name`, `node_id`, `capabilities`, `max_agents`, `version`) alongside live `load`/`active_agents`/`handlers_live`, so the relaycast engine can refresh a node's descriptor from the steady-state heartbeat without waiting for a fresh `node.register`. This keeps `nodes.list()` accurate for live load, active agents, capabilities, and liveness across reconnects and engine restarts.

Engine extended, not feature reverted

An earlier iteration of this branch reverted the roster fields because the relaycast engine's `node.heartbeat` Zod schema was `.strict()` and accepted only `load`/`active_agents`/`handlers_live`. Every roster-carrying heartbeat was rejected wholesale, dropping all live load updates and timing out the Fleet E2E spawn/load scenarios. This version instead extends the engine to accept and adopt the roster:

- relaycast#197 (`feat/heartbeat-roster-snapshot`, commit `2f685fa99ad486b09de5fb61091594fd48459815`): extends `FleetNodeHeartbeatMessageSchema` with the optional roster fields and has `heartbeatNode()` refresh the node row + register newly-advertised capability actions.
- `.github/workflows/fleet-e2e.yml` is repointed from `b673dfb` to `2f685fa` so CI runs this broker feature against the engine that accepts it.

Single source of truth

- `last_heartbeat_at` is intentionally NOT sent: the engine stamps receipt time server-side as the authoritative liveness clock (engine ignores any client value; broker omits it entirely).
- `max_agents` in the heartbeat is sourced from the active `FleetLoadSnapshot` (the same denominator used for the `load` ratio, kept in lockstep with `node.register` via `RegisterNode`/`UpdateLoad`), so `load` and `max_agents` in a single heartbeat never diverge.

Kept (unchanged from the prior commit)

- Republishing the fleet load snapshot on release/exit/restart (`api.rs`/`fleet.rs`/`runtime/maintenance.rs`).
- `maintenance.rs` same-tick post-restart publish fix (CodeRabbit Major).

Verification (local)

- `cargo build -p agent-relay-broker` + `--release`: clean.
- `cargo fmt --check`: clean.
- `cargo clippy -p agent-relay-broker -- -D warnings`: clean.
- `cargo test`: 767 + 12 + 1 pass (incl. `node_heartbeat_carries_roster_snapshot` + `fleet_wire_fixtures_round_trip_semantically`).
- Fleet E2E (`RELAYCAST_ENGINE_DIR` + release broker): 12/13 pass, including all roster/load/spawn/reschedule scenarios (`capability query`, `capability-routed spawn`, `scheduled spawn → least-loaded node`, `reschedule on death + restart reconcile`). The one failing test (`resume: a resumable spawn re-binds to ORIGIN node`) passes 3/3 in isolation and only times out under back-to-back full-suite resource contention; this is a pre-existing flake, not a regression from the roster change.

Cross-repo dependency

This PR's Fleet E2E depends on relaycast#197 (the engine that accepts the roster heartbeat). It is pushed and the e2e ref points at its SHA, but relaycast#197 needs human review/merge. Do not merge this until that engine change is reviewed.
🤖 Generated with Claude Code