docs: Complete trajectory for user routing fix by khaliqgant · Pull Request #214 · AgentWorkforce/relay

khaliqgant · 2026-01-18T09:23:54Z

Summary

Documents the completed work on cross-machine user message routing with trajectory files and final updates.

This PR includes:

Cloud Infrastructure: Added PresenceRegistry for user discovery and CloudMessageBus for WebSocket message delivery
Comprehensive Tests: 23 new tests covering user routing scenarios
Trajectory Documentation: Complete work history showing this was a missing feature (not a regression)

Changes

Infrastructure Added

PresenceRegistry - Tracks user presence across daemons for routing
CloudMessageBus - Delivers messages to users via WebSocket connections
API endpoints for daemon registration and user lookups
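
To make the presence-tracking idea concrete, here is a minimal in-memory sketch. It is not the code from this PR: the class shape, the `register` and `lookup` method names, the `ttlMs` timeout, and the injectable clock are all assumptions made for illustration. Only the PresenceRegistry name and the idea of tracking which daemon a user is reachable through come from the description above.

```typescript
// Hypothetical sketch of a presence registry. Names and shapes are
// assumptions, not the PR's actual implementation.
interface PresenceEntry {
  daemonId: string; // daemon the user is currently connected through
  lastSeenMs: number; // time of the most recent registration
}

class PresenceRegistrySketch {
  private readonly entries = new Map<string, PresenceEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Record (or refresh) that `username` is reachable via `daemonId`.
  // A later registration from another daemon replaces the earlier one.
  register(username: string, daemonId: string): void {
    this.entries.set(username, { daemonId, lastSeenMs: this.now() });
  }

  // Return the daemon to route to, or undefined when the user is unknown
  // or the entry is older than the timeout.
  lookup(username: string): string | undefined {
    const entry = this.entries.get(username);
    if (entry === undefined) return undefined;
    if (this.now() - entry.lastSeenMs > this.ttlMs) {
      this.entries.delete(username); // expire stale presence lazily
      return undefined;
    }
    return entry.daemonId;
  }
}
```

In this sketch the last daemon to register a user wins, which is one plausible way to handle the multi-daemon case; the real registry may resolve it differently.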

Tests Added

PresenceRegistry: User registration, timeout, multi-daemon scenarios
CloudMessageBus: Message delivery, error handling, connection management
End-to-end routing tests
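
The delivery, error-handling, and connection-management cases listed above can be pictured with a small sketch. Everything here is hypothetical: the `SocketLike` interface, the `DeliveryResult` shape, and the `connect`, `disconnect`, and `deliver` method names are assumptions, and the real CloudMessageBus may differ.

```typescript
// Hypothetical sketch of WebSocket-style delivery. The socket interface and
// the result shape are assumptions, not the PR's API.
interface SocketLike {
  send(data: string): void; // may throw when the connection is broken
}

type DeliveryResult =
  | { ok: true }
  | { ok: false; reason: "not_connected" | "send_failed" };

class CloudMessageBusSketch {
  private readonly sockets = new Map<string, SocketLike>();

  connect(username: string, socket: SocketLike): void {
    this.sockets.set(username, socket);
  }

  disconnect(username: string): void {
    this.sockets.delete(username);
  }

  // Try to deliver one message to a connected user.
  deliver(to: string, from: string, body: string): DeliveryResult {
    const socket = this.sockets.get(to);
    if (socket === undefined) return { ok: false, reason: "not_connected" };
    try {
      socket.send(JSON.stringify({ from, to, body }));
      return { ok: true };
    } catch {
      // Treat a failed send as a dead connection and forget it.
      this.sockets.delete(to);
      return { ok: false, reason: "send_failed" };
    }
  }
}
```

Returning a result value instead of throwing keeps the three outcomes (delivered, not connected, send failed) easy to assert on in tests.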

Trajectory

Documents key decisions and approach
Identifies this as a net-new feature, not a regression
Records 90% confidence in the solution

Testing

All tests pass:

npm test -- src/cloud/services/

Context

Previous attempts (commits 41d7b4f, ba37864, 37996c0) only implemented agent-to-agent routing. User routing required additional cloud-side infrastructure that was missing.

🤖 Generated with Claude Code

Records completed work on cross-machine user message routing.

- Added cloud infrastructure (PresenceRegistry, CloudMessageBus)
- Created PR #213 with 23 comprehensive tests
- Identified as missing feature (not regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

my-senior-dev-pr-review · 2026-01-18T09:24:47Z

🤖 My Senior Dev — Analysis Complete

👤 For @khaliqgant

📁 Expert in src/dashboard/react-components/ (10 edits) • ⚡ 70th PR this month

📊 3 files reviewed • 5 need attention

⚠️ Needs Attention:

.trajectories/index.json — Modifies the index impacting data integrity and logical consistency.
.trajectories/completed/2026-01/traj_qt6gh6tzb1fy.json — Introduces data structures for task management without validation checks.

🚀 Open Interactive Review →

The full interface unlocks features not available in GitHub:

💬 AI Chat — Ask questions on any file, get context-aware answers
🔍 Smart Hovers — See symbol definitions and usage without leaving the diff
📚 Code Archeology — Understand how files evolved over time (/archeology)
🎯 Learning Insights — See how this PR compares to similar changes

💬 Chat here: @my-senior-dev explain this change — or try @chaos-monkey @security-auditor @optimizer @skeptic @junior-dev

📖 View all 12 personas & slash commands

You can interact with me by mentioning @my-senior-dev in any comment:

In PR comments or on any line of code:

Ask questions about the code or PR
Request explanations of specific changes
Get suggestions for improvements

Slash commands:

/help — Show all available commands
/archeology — See the history and evolution of changed files
/profile — Performance analysis and suggestions
/expertise — Find who knows this code best
/personas — List all available AI personas

AI Personas (mention to get their perspective):

| Persona | Focus |
| --- | --- |
| `@chaos-monkey` 🐵 | Edge cases & failure scenarios |
| `@skeptic` 🤨 | Challenge assumptions |
| `@optimizer` ⚡ | Performance & efficiency |
| `@security-auditor` 🔒 | Security vulnerabilities |
| `@accessibility-advocate` ♿ | Inclusive design |
| `@junior-dev` 🌱 | Simple explanations |
| `@tech-debt-collector` 💳 | Code quality & shortcuts |
| `@ux-champion` 🎨 | User experience |
| `@devops-engineer` 🚀 | Deployment & scaling |
| `@documentation-nazi` 📚 | Documentation gaps |
| `@legacy-whisperer` 🏛️ | Working with existing code |
| `@test-driven-purist` ✅ | Testing & TDD |

For the best experience, view this PR on myseniordev.com — includes AI chat, file annotations, and interactive reviews.

* feat(broker): compile on relaycast v5.0.1 and prep node spawn/release

Bump crates/broker to relaycast =5.0.1 (#214, increment 1 of the broker node-only delivery migration).

- Remove the workspace-stream toggle (ensure_workspace_stream_enabled and its RelaycastWsClient::run call); RelayCast::workspace_stream_set is gone in v5.
- Add node-frame fields the v5.0.1 engine sends (structs are deny_unknown_fields, so missing fields drop the frame): Deliver gains agent_id and delivery_id; ActionInvoke gains optional agent_id and agent_name.
- Extract the AgentReleaseRequested / AgentSpawnRequested firehose match arms into reusable release_worker_locally / spawn_worker_from_request async fns and drop the arms (those WsEvent variants no longer exist in v5.0.1). Increment 2 will call these from action.invoke.
- Point register-flow test mocks at the new /v1/agents endpoint with the CreateAgentResponse body (v5.0.1 register_agent_token registers via /v1/agents instead of /v1/agents/spawn).
- Add the new fields to fleet-wire deliver/action.invoke fixtures.

cargo build and cargo test -p agent-relay-broker both green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): drop unreachable spawn fallback; classify control events by ws_type

In relaycast v5.0.1 WsEvent ends in #[serde(other)] Unknown, so an agent.spawn_requested frame deserializes to Ok(WsEvent::Unknown), not Err. The firehose handler gated its raw-JSON spawn fallback (and a deser-warning) on from_value::<WsEvent>(..).is_ok(), which is now always true, making both paths dead code, contrary to the prior "preserved untouched" claim. Node control owns spawn/release via action.invoke (the extracted spawn_worker_from_request / release_worker_locally helpers); the workspace firehose no longer drives these events.

Replace the meaningless is_ok() gate with an explicit ws_type match that ignores agent.spawn_requested / agent.release_requested (already deduped) instead of letting them fall through to map_ws_event. Remove the dead fallback, dead warning, and now-unused local bindings. Add regression tests pinning that both control frames decode to WsEvent::Unknown so future dispatch must classify by ws_type.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(broker): node-only delivery: enroll, bind agents, deliver over /v1/node/ws

Increment 2 of the broker node-only delivery migration (relaycast v5.0.1). The broker now enrolls as a relaycast node, binds its agents through node control, and delivers/injects solely over /v1/node/ws.

- NODE-TOKEN BOOTSTRAP (init.rs): resolve_node_token prefers RELAY_NODE_TOKEN, then a token cached for this exact node id, otherwise mints one via RelayCast::create_node (kind=ws, role=broker) with the workspace key and persists it next to the node id (scoped to node_id so id rotation invalidates it). node_control gains load/persist_node_token helpers.
- UNCONDITIONAL node.register (init.rs): push FleetControlCommand::RegisterNode with a broker self-manifest (spawn capability) right after spawning the node-control client, so the broker enrolls every startup regardless of any sidecar.
- BIND EVERY SPAWNED AGENT via node-control agent.register: extracted register_node_agent_token; both /api/spawn (api.rs) and the Inc1 spawn_worker_from_request now mint the agent token over node control (HTTP pre-registration only as fallback when node binding is unavailable). The minted token injects RELAY_AGENT_TOKEN + RELAY_SKIP_BOOTSTRAP via snippets, so the worker MCP never re-registers over HTTP.
- INBOUND NODE FRAMES (fleet.rs): handle_fleet_deliver now uses the real delivery_id (no longer derived from msg_id) and branches on payload.type: message.created/thread.reply/dm.received/group_dm.received (and legacy empty type) inject into the worker PTY; message.reacted/message.read are acked with a tracing log only (PTY surfacing deferred); unknown types are acked without surfacing. action.invoke routes spawn/spawn:* and release to the Inc1 spawn/release fns, replying action.result {output} on success or {error} on failure.
- fleet_mode_enabled flips on FleetControlEvent::Connected (and is not cleared on disconnect) so workspace-firehose delivery is suppressed once node delivery is live, avoiding double-delivery while honoring at-least-once resume.

Runtime delivery cannot be exercised without a live engine; added unit tests for the delivery-classification and action.invoke identity/field helpers. cargo build and cargo test -p agent-relay-broker both green (779 passing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fixup: read node deliver fields from payload.data and widen message-alias classification

Node /v1/node/ws deliver frames nest the message under payload.data per relaycast 5.0.1 normalize_node_deliver (data.text/channel_name/agent_name/from_name/thread_id). fleet_relay_delivery read only flat /text,/from,/channel paths, so every node-delivered message injected the raw JSON blob attributed to "relaycast". Extract from the data envelope (data.* first, legacy flat paths as fallback) via a testable fleet_delivery_fields helper.

classify_fleet_delivery only injected message.created|thread.reply|dm.received|group_dm.received; the engine may emit any relaycast parse_inbound_kind alias (message.received/new/sent/delivered, dm.created/new/sent/message.created, direct_message.*, thread.message.created/sent, group_dm.*). Those were acked-and-dropped (permanent loss under at-least-once). Widen the Inject arm to the full alias set.

Update the deliver.json fixture to the real {type,data} shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(broker): remove dead firehose delivery path (node-only delivery)

Inc3 of the broker node-only delivery migration. Message delivery now flows solely over /v1/node/ws via handle_fleet_deliver, so the workspace firehose delivery path in the broker fleet runtime is dead.

- Drop the fleet_mode_enabled field and all its assignments; its only reader was the firehose drop in handle_relaycast_message.
- Reduce handle_relaycast_message to log-and-discard. The map_ws_event injection block, self-echo filtering, DM resolution, dashboard rebroadcast, and the fleet-mode drop are all gone.
- Remove now-dead firehose-only helpers and their tests: relaycast_ws_control_dedup_key; routing is_self_echo, resolve_delivery_targets, worker_names_for_dm_participants, display_target_for_dashboard, DeliveryPlan; queue_and_try_delivery; WorkerRegistry::has_any_worker / has_worker_by_name_ignoring_case; the unused dm_participants_cache runtime field.
- Rewrite routing tests to cover the surviving worker_names_for_channel_delivery / worker_names_for_direct_target (sender exclusion, case-insensitivity, workspace-id filtering).

Deliberately left intact: RelaycastWsClient::run and map_ws_event are still used by `agent-relay-broker wrap` (single-agent PTY mode), which legitimately consumes the workspace firehose. Spawn/release stay owned by node control (spawn_worker_from_request / release_worker_locally). No observer/observer-token path added (separate follow-up).

cargo build and clippy clean; 781 broker tests pass.
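
The "data.* first, legacy flat paths as fallback" rule from the fixup commit above can be sketched as follows. The broker's real helper (fleet_delivery_fields) is Rust; this TypeScript version only illustrates the lookup order. The function and type names are hypothetical, and preferring from_name over agent_name for the sender is an assumption, since the commit text lists both fields without stating a precedence.

```typescript
// Illustrative only: field names come from the commit text, everything else
// (types, helper names, sender precedence) is an assumption.
type JsonObject = Record<string, unknown>;

interface DeliveryFields {
  text: string | undefined;
  from: string | undefined;
  channel: string | undefined;
  threadId: string | undefined;
}

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function nonEmptyString(value: unknown): string | undefined {
  return typeof value === "string" && value.length > 0 ? value : undefined;
}

// Read each field from payload.data first, then fall back to the flat
// top-level path used by the legacy frame shape.
function deliveryFields(payload: JsonObject): DeliveryFields {
  const nested = payload["data"];
  const data: JsonObject = isObject(nested) ? nested : {};
  return {
    text: nonEmptyString(data["text"]) ?? nonEmptyString(payload["text"]),
    from:
      nonEmptyString(data["from_name"]) ??
      nonEmptyString(data["agent_name"]) ??
      nonEmptyString(payload["from"]),
    channel: nonEmptyString(data["channel_name"]) ?? nonEmptyString(payload["channel"]),
    threadId: nonEmptyString(data["thread_id"]),
  };
}
```

With this order a frame in the {type,data} shape and a legacy flat frame both yield usable text, sender, and channel values rather than a raw JSON blob.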

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): correctness fixes for node-only delivery (v5.0.1)

Apply targeted adversarial-review fixes to the broker's node-only delivery migration:

- HTTP-register spawn fallback now binds the agent to the broker's node via SDK bind_agent_to_node so it becomes via_node (the only kind the engine delivers to); a failed bind emits a loud registration_warning. A missing node token is logged as a hard fault, not a quiet warning.
- seq:0 fan-out frames are no longer dropped: special-cased in FleetDeliveryBook::observe/commit to always surface-and-ack with msg_id dedup; action.completed/action.failed/action.denied route to Inject (delivered to the caller), message.reacted/message.read stay ack-only (PTY surfacing deferred).
- Remove deny_unknown_fields from inbound Deliver/ActionInvoke so a future engine field no longer drops the frame without an ack (infinite redelivery). Outbound frames keep it.
- Bound AgentDeliveryCursor.seen_msg_ids with a FIFO cap (512).
- release action.result is now faithful: genuinely-unknown worker returns an error; already-exited worker still reports success.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): stop opening the rejected /v1/ws workspace stream

In v5.0.1 node-only delivery the /v1/ws workspace-stream WebSocket is observer-only and rejects the broker's workspace key with HTTP 401. The broker kept opening it anyway, 401-looping every 3s and burning reconnects even though delivery already flows over /v1/node/ws. The earlier teardown removed firehose message handling but left the connection itself.

MultiWorkspaceSession::new no longer spawns the workspace-stream WS task; it drains the WsControl channel to a no-op and keeps the inbound channel open as an inert empty source (kept alive by a sender clone so the wrap action consumer and runtime no-op handler never busy-loop on a closed receiver).
RelaycastWsClient (the 401-looping run loop) is deleted. The workspace HTTP client, WsControl plumbing, and ws_control_tx senders are all kept intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* style: auto-format with Prettier

* chore: drop session trajectory records swept into the broker PR

The workflow/fix sub-agents' `git add -A` committed this session's .agentworkforce/trajectories/ records into the code branch. Untrack them (kept on disk) so the PR is a reviewable code-only diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): scope node token to workspace and re-mint on node-ws 401

The broker persisted its minted node token at a single global path and load_node_token only checked node_id, so a token minted for workspace A (or a local engine) was reused against workspace B / prod and rejected with HTTP 401 on /v1/node/ws, and the connect loop looped forever on the rejected cached token.

- Scope the persisted token to workspace_id (and engine base_url): add both to PersistedNodeToken; load_node_token returns the cached token only when node_id AND workspace_id match (and base_url when both sides know it); legacy caches without base_url still reuse on a workspace match. persist_node_token / load_node_token / resolve_node_token signatures and call sites in runtime/init.rs updated accordingly.
- Re-mint on 401: detect an HTTP 401 handshake rejection on the node-control connect, discard the cached file + in-memory token, and re-mint via RelayCast::create_node (wired through a NodeTokenMinter) before retrying. Bounded to MAX_UNAUTHORIZED_BEFORE_GIVING_UP (5) consecutive 401s, after which a loud hard error is surfaced instead of spinning.
- Tests: load_node_token workspace/node/base_url mismatch + round-trip, legacy-cache reuse, and 401 connect-error detection.
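
The cache-reuse rule described in the node-token commit above (node_id and workspace_id must match, base_url is compared only when both sides know it, legacy caches without base_url still reuse) can be written as a small predicate. This is a TypeScript illustration of the stated rule, not the broker's Rust load_node_token; the interface and the `reusableToken` function name are hypothetical.

```typescript
// Illustrative sketch of the cache-reuse rule. The shape mirrors the commit
// text; the names are assumptions.
interface PersistedNodeTokenSketch {
  nodeId: string;
  workspaceId: string;
  baseUrl: string | undefined; // undefined models a legacy cache without base_url
  token: string;
}

function reusableToken(
  cached: PersistedNodeTokenSketch | undefined,
  nodeId: string,
  workspaceId: string,
  baseUrl: string | undefined,
): string | undefined {
  if (cached === undefined) return undefined;
  // Both the node id and the workspace id must match.
  if (cached.nodeId !== nodeId || cached.workspaceId !== workspaceId) return undefined;
  // Compare base URLs only when both the cache and the caller know one.
  if (cached.baseUrl !== undefined && baseUrl !== undefined && cached.baseUrl !== baseUrl) {
    return undefined;
  }
  return cached.token;
}
```

Returning undefined on any mismatch lets the caller fall through to minting a fresh token rather than retrying with one the engine will reject.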

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): derive node id from machine-id + cwd hash

The engine scopes relaycast nodes globally, so a host running brokers for two different workspaces (each in its own working directory) collided on a single global machine-id node, failing create_node and enrollment.

Derive the node id deterministically from (machine_seed, canonical_cwd) via sha2, keeping it stable across restarts in the same directory but distinct across directories. The machine-id file stays the per-machine seed; cwd is read from current_dir (canonicalized when possible) with a seed-only fallback when cwd is unreadable. derive_node_id is a pure testable fn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): correlate agent.register replies by name; silence non-agent reply WARN

The engine replies to every node-control request (node.register, inventory.sync) with a fresh snowflake reply id, since the broker sends those frames without an id. The broker routed every reply frame to complete_agent_registration, so these non-agent replies never matched a pending agent registration and produced a spurious WARN 'agent.register reply did not match a pending registration id=<snowflake>'.

Resolve the pending registration by request id first, then fall back to matching the validated reply data.name against a pending entry (robust against an engine that drops/regenerates the id). Replies that resolve to neither are treated as non-agent replies and logged at debug, not warn.
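
The reply-correlation order described in the commit above (request id first, then the reply's data.name, otherwise a non-agent reply) can be sketched like this. It is a TypeScript illustration of the stated order rather than the broker's Rust code; the types and the `resolveReply` function name are hypothetical.

```typescript
// Illustrative sketch of the correlation order. Names are assumptions.
interface PendingRegistration {
  requestId: string; // id the broker attached to its agent.register request
  name: string; // agent name the registration is for
}

type ReplyResolution =
  | { kind: "agent"; requestId: string }
  | { kind: "non_agent" };

function resolveReply(
  pending: ReadonlyArray<PendingRegistration>,
  replyId: string,
  replyName: string | undefined,
): ReplyResolution {
  // 1. Prefer an exact request-id match.
  const byId = pending.find((p) => p.requestId === replyId);
  if (byId !== undefined) return { kind: "agent", requestId: byId.requestId };

  // 2. Fall back to the name carried in the reply, if any.
  if (replyName !== undefined) {
    const byName = pending.find((p) => p.name === replyName);
    if (byName !== undefined) return { kind: "agent", requestId: byName.requestId };
  }

  // 3. Anything else is a reply to node.register, inventory.sync, and so on.
  return { kind: "non_agent" };
}
```

A caller would complete the matching registration for the "agent" case and log the "non_agent" case at debug level, which is the behavior the commit message describes.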

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): scope node-token cache per node id, forward invoke session ref, bound re-mint loop

Three node-only delivery fixes:

- Node token cache is now scoped per node_id (node-tokens/{node_id}.json, filename-sanitized) instead of one host-wide file, so two brokers in different cwds on a host no longer overwrite each other's token.
- node action.invoke spawns forward invocation_id and the harness session ref into agent.register (was hardcoded None,None), restoring invocation correlation and session resume. HTTP /api/spawn derives session ref from its spec too.
- node-control re-mint loop only resets the consecutive-401 counter when a connection actually establishes (not on a successful mint) and drops the post-mint `continue` so retries honor the reconnect backoff, making the give-up cap reachable and stopping a tight POST /v1/nodes loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): use pinned node id verbatim when RELAY_NODE_TOKEN is set

Operator-enrolled / fleet nodes pin their node id (via the machine-id file) to match an engine-issued node token. The cwd-hash derivation broke that: the broker registered a derived id, so the engine rejected node.register with node_id_mismatch and the node never came online (two-node fleet E2E timed out on online+handlers_live). Only derive when auto-minting (no RELAY_NODE_TOKEN); otherwise use the pinned id verbatim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): drop bare "spawn" capability from bootstrap node.register

The broker's pre-sidecar bootstrap node.register advertised a generic capability literally named "spawn". The relaycast engine does not treat bare "spawn" as a placement capability (only spawn:* is), so its ensureCapabilityActions materialized a regular `spawn` ACTION pinned to whichever node bootstrapped first.
From then on every spawn invoke resolved that action and was dispatched to the bootstrapping node, short-circuiting capability-based spawn placement for the whole workspace; cli/target_node/least-loaded routing were all ignored. This regressed the two-node fleet E2E: spawn:codex/spawn:claude landed on the wrong node, target_node placement-mismatch returned 201 instead of 409, and least-loaded scheduling misrouted.

The bootstrap descriptor now carries no capabilities; the node's real spawn:*/action capabilities arrive on the sidecar's node.register. The node is still registered (online) before the sidecar connects, but claims no handler until the authoritative capability set is reported. Extracts the manifest into `bootstrap_node_manifest` with a unit test asserting the empty capability set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): run sidecar-declared harness for fleet spawn:* actions

A `spawn:<harness>` node action registered by a sidecar was being run by the broker directly with the literal `cli` from the action input (`handle_fleet_action_spawn`), launching the real CLI instead of the node's declared harness. In the fleet/sidecar model the sidecar owns the harness: its `spawn(<harness>)` handler resolves the declared harness spec and calls `ctx.spawnAgent` (-> `spawn_agent` -> handle_fleet_spawn_agent).

Route `spawn:*` to the sidecar's registered handler (same path as echo/work) whenever the sidecar declared a handler for that capability, gated by a new `HandlerDispatchState::has_handler`. The broker-direct raw-`cli` spawn is reserved for the direct / no-sidecar path where no sidecar handler is registered.

Fixes the fleet-E2E least-loaded scheduling flake: heavyweight real processes (instead of the lightweight stub PTY) lingered/exited and triggered broker re-init, collapsing per-node active_agents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
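
One of the correctness fixes in the commit log above bounds AgentDeliveryCursor.seen_msg_ids with a FIFO cap of 512. A minimal TypeScript sketch of a FIFO-capped dedup set follows. The broker's actual structure is Rust and is not shown in this page, so the class name, the `observe` method, and the array-plus-set layout are assumptions.

```typescript
// Illustrative FIFO-capped "seen" set for message-id dedup. The layout and
// names are assumptions, not the broker's implementation.
class BoundedSeenSet {
  private readonly order: string[] = []; // insertion order, oldest first
  private readonly members = new Set<string>();
  private readonly cap: number;

  constructor(cap: number) {
    if (!Number.isInteger(cap) || cap < 1) {
      throw new RangeError("cap must be a positive integer");
    }
    this.cap = cap;
  }

  // Returns true when `id` is new (and records it), false when it is a
  // duplicate that is still remembered.
  observe(id: string): boolean {
    if (this.members.has(id)) return false;
    this.members.add(id);
    this.order.push(id);
    if (this.order.length > this.cap) {
      const evicted = this.order.shift(); // drop the oldest id
      if (evicted !== undefined) this.members.delete(evicted);
    }
    return true;
  }

  get size(): number {
    return this.members.size;
  }
}
```

The trade-off is that an id evicted from the window is treated as new if it is redelivered later, so the cap should exceed the number of messages that can plausibly arrive between a delivery and its redelivery. A cap of 512 would be created with `new BoundedSeenSet(512)`.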

khaliqgant merged commit d4c1be0 into main Jan 18, 2026
8 checks passed

khaliqgant deleted the fix/cross-machine-user-routing branch January 18, 2026 09:25

khaliqgant merged 1 commit into main from fix/cross-machine-user-routing
