
docs: Complete trajectory for user routing fix #214

Merged
khaliqgant merged 1 commit into main from fix/cross-machine-user-routing on Jan 18, 2026

Conversation

@khaliqgant

Member

Summary

Documents the completed work on cross-machine user message routing with trajectory files and final updates.

This PR includes:

  • Cloud Infrastructure: Added PresenceRegistry for user discovery and CloudMessageBus for WebSocket message delivery
  • Comprehensive Tests: 23 new tests covering user routing scenarios
  • Trajectory Documentation: Complete work history showing this was a missing feature (not a regression)

Changes

Infrastructure Added

  • PresenceRegistry - Tracks user presence across daemons for routing
  • CloudMessageBus - Delivers messages to users via WebSocket connections
  • API endpoints for daemon registration and user lookups

Tests Added

  • PresenceRegistry: User registration, timeout, multi-daemon scenarios
  • CloudMessageBus: Message delivery, error handling, connection management
  • End-to-end routing tests

Trajectory

  • Documents key decisions and approach
  • Identifies this as net-new feature, not regression
  • 90% confidence in solution

Testing

All tests pass:

npm test -- src/cloud/services/

Context

Previous attempts (commits 41d7b4f, ba37864, 37996c0) only implemented agent-to-agent routing. User routing required additional cloud-side infrastructure that was missing.

🤖 Generated with Claude Code

Records completed work on cross-machine user message routing.
- Added cloud infrastructure (PresenceRegistry, CloudMessageBus)
- Created PR #213 with 23 comprehensive tests
- Identified as missing feature (not regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@my-senior-dev-pr-review


🤖 My Senior Dev — Analysis Complete

👤 For @khaliqgant

📁 Expert in src/dashboard/react-components/ (10 edits) • ⚡ 70th PR this month


📊 3 files reviewed • 5 need attention

⚠️ Needs Attention:

  • .trajectories/index.json — Modifies the index impacting data integrity and logical consistency.
  • .trajectories/completed/2026-01/traj_qt6gh6tzb1fy.json — Introduces data structures for task management without validation checks.

@khaliqgant khaliqgant merged commit d4c1be0 into main Jan 18, 2026
8 checks passed
@khaliqgant khaliqgant deleted the fix/cross-machine-user-routing branch January 18, 2026 09:25
willwashburn added a commit that referenced this pull request Jun 26, 2026
* feat(broker): compile on relaycast v5.0.1 and prep node spawn/release

Bump crates/broker to relaycast =5.0.1 (#214, increment 1 of the broker
node-only delivery migration).

- Remove the workspace-stream toggle (ensure_workspace_stream_enabled and
  its RelaycastWsClient::run call); RelayCast::workspace_stream_set is gone
  in v5.
- Add node-frame fields the v5.0.1 engine sends (structs are
  deny_unknown_fields, so missing fields drop the frame): Deliver gains
  agent_id and delivery_id; ActionInvoke gains optional agent_id and
  agent_name.
- Extract the AgentReleaseRequested / AgentSpawnRequested firehose match
  arms into reusable release_worker_locally / spawn_worker_from_request
  async fns and drop the arms (those WsEvent variants no longer exist in
  v5.0.1). Increment 2 will call these from action.invoke.
- Point register-flow test mocks at the new /v1/agents endpoint with the
  CreateAgentResponse body (v5.0.1 register_agent_token registers via
  /v1/agents instead of /v1/agents/spawn).
- Add the new fields to fleet-wire deliver/action.invoke fixtures.

cargo build and cargo test -p agent-relay-broker both green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): drop unreachable spawn fallback; classify control events by ws_type

In relaycast v5.0.1 WsEvent ends in #[serde(other)] Unknown, so an
agent.spawn_requested frame deserializes to Ok(WsEvent::Unknown), not Err.
The firehose handler gated its raw-JSON spawn fallback (and a deser-warning)
on from_value::<WsEvent>(..).is_ok(), which is now always true — making both
paths dead code, contrary to the prior "preserved untouched" claim.

Node control owns spawn/release via action.invoke (the extracted
spawn_worker_from_request / release_worker_locally helpers); the workspace
firehose no longer drives these events. Replace the meaningless is_ok() gate
with an explicit ws_type match that ignores agent.spawn_requested /
agent.release_requested (already deduped) instead of letting them fall through
to map_ws_event. Remove the dead fallback, dead warning, and now-unused local
bindings. Add regression tests pinning that both control frames decode to
WsEvent::Unknown so future dispatch must classify by ws_type.
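
The ws_type gate described above could look roughly like the sketch below. It assumes the handler has the raw frame type as a string; `FirehoseDispatch` and `classify_ws_type` are hypothetical names for illustration, not the broker's actual items.

```rust
/// Hypothetical dispatch decision for a raw workspace-firehose frame.
#[derive(Debug, PartialEq, Eq)]
pub enum FirehoseDispatch {
    /// Spawn/release control frames: owned by node control, so ignored here.
    IgnoreControl,
    /// Everything else goes on to the regular event mapping.
    MapEvent,
}

/// Classify by the raw type string instead of by whether the typed
/// deserialization succeeded (which is always Ok once a catch-all
/// variant exists).
pub fn classify_ws_type(ws_type: &str) -> FirehoseDispatch {
    match ws_type {
        "agent.spawn_requested" | "agent.release_requested" => FirehoseDispatch::IgnoreControl,
        _ => FirehoseDispatch::MapEvent,
    }
}
```

Matching on the string keeps the decision independent of how the typed enum decodes unknown variants.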

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(broker): node-only delivery — enroll, bind agents, deliver over /v1/node/ws

Increment 2 of the broker node-only delivery migration (relaycast v5.0.1). The
broker now enrolls as a relaycast node, binds its agents through node control,
and delivers/injects solely over /v1/node/ws.

- NODE-TOKEN BOOTSTRAP (init.rs): resolve_node_token prefers RELAY_NODE_TOKEN,
  then a token cached for this exact node id, otherwise mints one via
  RelayCast::create_node (kind=ws, role=broker) with the workspace key and
  persists it next to the node id (scoped to node_id so id rotation
  invalidates it). node_control gains load/persist_node_token helpers.
- UNCONDITIONAL node.register (init.rs): push FleetControlCommand::RegisterNode
  with a broker self-manifest (spawn capability) right after spawning the
  node-control client, so the broker enrolls every startup regardless of any
  sidecar.
- BIND EVERY SPAWNED AGENT via node-control agent.register: extracted
  register_node_agent_token; both /api/spawn (api.rs) and the Inc1
  spawn_worker_from_request now mint the agent token over node control (HTTP
  pre-registration only as fallback when node binding is unavailable). The
  minted token injects RELAY_AGENT_TOKEN + RELAY_SKIP_BOOTSTRAP via snippets,
  so the worker MCP never re-registers over HTTP.
- INBOUND NODE FRAMES (fleet.rs): handle_fleet_deliver now uses the real
  delivery_id (no longer derived from msg_id) and branches on payload.type —
  message.created/thread.reply/dm.received/group_dm.received (and legacy empty
  type) inject into the worker PTY; message.reacted/message.read are acked with
  a tracing log only (PTY surfacing deferred); unknown types are acked without
  surfacing. action.invoke routes spawn/spawn:* and release to the Inc1 spawn/
  release fns, replying action.result {output} on success or {error} on
  failure.
- fleet_mode_enabled flips on FleetControlEvent::Connected (and is not cleared
  on disconnect) so workspace-firehose delivery is suppressed once node
  delivery is live, avoiding double-delivery while honoring at-least-once
  resume.
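
The token precedence in the NODE-TOKEN BOOTSTRAP bullet (environment, then cache, then mint) can be sketched as a pure function. `pick_node_token` and `TokenSource` are hypothetical names, the mint step is passed in as a closure standing in for the create_node call, and treating an empty environment value as unset is an assumption.

```rust
/// Where the chosen node token came from (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum TokenSource {
    Env,
    Cache,
    Minted,
}

/// Prefer the RELAY_NODE_TOKEN value, then a token cached for this node id,
/// and only mint a new one when neither is available.
pub fn pick_node_token<F>(
    env_token: Option<&str>,
    cached_token: Option<&str>,
    mint: F,
) -> Result<(String, TokenSource), String>
where
    F: FnOnce() -> Result<String, String>,
{
    // Assumption: an empty environment value counts as unset.
    if let Some(t) = env_token.filter(|t| !t.is_empty()) {
        return Ok((t.to_string(), TokenSource::Env));
    }
    if let Some(t) = cached_token {
        return Ok((t.to_string(), TokenSource::Cache));
    }
    // The caller would persist a freshly minted token next to the node id.
    mint().map(|t| (t, TokenSource::Minted))
}
```

Passing the mint step as a closure keeps the precedence logic testable without a live engine.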

Runtime delivery cannot be exercised without a live engine; added unit tests
for the delivery-classification and action.invoke identity/field helpers.
cargo build and cargo test -p agent-relay-broker both green (779 passing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fixup: read node deliver fields from payload.data and widen message-alias classification

Node /v1/node/ws deliver frames nest the message under payload.data per
relaycast 5.0.1 normalize_node_deliver (data.text/channel_name/agent_name/
from_name/thread_id). fleet_relay_delivery read only flat /text,/from,/channel
paths, so every node-delivered message injected the raw JSON blob attributed to
"relaycast". Extract from the data envelope (data.* first, legacy flat paths as
fallback) via a testable fleet_delivery_fields helper.

classify_fleet_delivery only injected message.created|thread.reply|dm.received|
group_dm.received; the engine may emit any relaycast parse_inbound_kind alias
(message.received/new/sent/delivered, dm.created/new/sent/message.created,
direct_message.*, thread.message.created/sent, group_dm.*). Those were
acked-and-dropped (permanent loss under at-least-once). Widen the Inject arm to
the full alias set. Update the deliver.json fixture to the real {type,data}
shape.
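
A minimal sketch of the widened classification follows. The expansion of the shorthand alias list (for example reading "message.received/new/sent/delivered" as four `message.*` names) is an interpretation, and `DeliveryAction` / `classify_delivery` are hypothetical names rather than the broker's real items.

```rust
/// What to do with an inbound node deliver frame (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum DeliveryAction {
    /// Surface the message in the worker PTY, then ack.
    Inject,
    /// Ack with a log line only (reactions, read receipts).
    AckOnly,
    /// Unknown type: ack without surfacing.
    AckUnknown,
}

pub fn classify_delivery(payload_type: &str) -> DeliveryAction {
    match payload_type {
        // Legacy frames with an empty type are treated as messages.
        "" => DeliveryAction::Inject,
        "message.created" | "message.received" | "message.new" | "message.sent"
        | "message.delivered" | "thread.reply" | "thread.message.created"
        | "thread.message.sent" | "dm.received" | "dm.created" | "dm.new" | "dm.sent"
        | "dm.message.created" => DeliveryAction::Inject,
        // Prefix families written as direct_message.* and group_dm.* above.
        t if t.starts_with("direct_message.") || t.starts_with("group_dm.") => {
            DeliveryAction::Inject
        }
        "message.reacted" | "message.read" => DeliveryAction::AckOnly,
        _ => DeliveryAction::AckUnknown,
    }
}
```

Anything that lands in `AckUnknown` is acknowledged and never redelivered, which is why a missing alias means permanent loss rather than a retry.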

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(broker): remove dead firehose delivery path (node-only delivery)

Inc3 of the broker node-only delivery migration. Message delivery now
flows solely over /v1/node/ws via handle_fleet_deliver, so the workspace
firehose delivery path in the broker fleet runtime is dead.

- Drop the fleet_mode_enabled field and all its assignments; its only
  reader was the firehose drop in handle_relaycast_message.
- Reduce handle_relaycast_message to log-and-discard. The map_ws_event
  injection block, self-echo filtering, DM resolution, dashboard
  rebroadcast, and the fleet-mode drop are all gone.
- Remove now-dead firehose-only helpers and their tests:
  relaycast_ws_control_dedup_key; routing is_self_echo,
  resolve_delivery_targets, worker_names_for_dm_participants,
  display_target_for_dashboard, DeliveryPlan; queue_and_try_delivery;
  WorkerRegistry::has_any_worker / has_worker_by_name_ignoring_case;
  the unused dm_participants_cache runtime field.
- Rewrite routing tests to cover the surviving
  worker_names_for_channel_delivery / worker_names_for_direct_target
  (sender exclusion, case-insensitivity, workspace-id filtering).

Deliberately left intact: RelaycastWsClient::run and map_ws_event are
still used by `agent-relay-broker wrap` (single-agent PTY mode), which
legitimately consumes the workspace firehose. Spawn/release stay owned
by node control (spawn_worker_from_request / release_worker_locally).
No observer/observer-token path added (separate follow-up).

cargo build and clippy clean; 781 broker tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): correctness fixes for node-only delivery (v5.0.1)

Apply targeted adversarial-review fixes to the broker's node-only
delivery migration:

- HTTP-register spawn fallback now binds the agent to the broker's node
  via SDK bind_agent_to_node so it becomes via_node (the only kind the
  engine delivers to); a failed bind emits a loud registration_warning.
  A missing node token is logged as a hard fault, not a quiet warning.
- seq:0 fan-out frames are no longer dropped: special-cased in
  FleetDeliveryBook::observe/commit to always surface-and-ack with
  msg_id dedup; action.completed/action.failed/action.denied route to
  Inject (delivered to the caller), message.reacted/message.read stay
  ack-only (PTY surfacing deferred).
- Remove deny_unknown_fields from inbound Deliver/ActionInvoke so a
  future engine field no longer drops the frame without an ack
  (infinite redelivery). Outbound frames keep it.
- Bound AgentDeliveryCursor.seen_msg_ids with a FIFO cap (512).
- release action.result is now faithful: genuinely-unknown worker
  returns an error; already-exited worker still reports success.
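
The FIFO cap on seen message ids can be illustrated with a queue plus a set. This is a sketch under assumptions: `SeenMsgIds` is a hypothetical stand-in for the `seen_msg_ids` field, and the real structure and eviction details may differ.

```rust
use std::collections::{HashSet, VecDeque};

/// Bounded dedup set: remembers at most `cap` ids, evicting the oldest first.
pub struct SeenMsgIds {
    cap: usize,
    order: VecDeque<String>,
    set: HashSet<String>,
}

impl SeenMsgIds {
    pub fn new(cap: usize) -> Self {
        // A zero cap would make dedup impossible, so clamp to at least 1.
        let cap = cap.max(1);
        Self { cap, order: VecDeque::with_capacity(cap), set: HashSet::with_capacity(cap) }
    }

    /// Returns true when the id has not been seen (and records it),
    /// false when it is a duplicate still inside the window.
    pub fn insert(&mut self, id: &str) -> bool {
        if self.set.contains(id) {
            return false;
        }
        if self.order.len() == self.cap {
            if let Some(oldest) = self.order.pop_front() {
                self.set.remove(&oldest);
            }
        }
        self.order.push_back(id.to_string());
        self.set.insert(id.to_string());
        true
    }

    pub fn len(&self) -> usize {
        self.order.len()
    }
}
```

With the cap of 512 mentioned above this would be `SeenMsgIds::new(512)`. The trade-off is that an id older than the window can be accepted again.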

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): stop opening the rejected /v1/ws workspace stream

In v5.0.1 node-only delivery the /v1/ws workspace-stream WebSocket is
observer-only and rejects the broker's workspace key with HTTP 401. The
broker kept opening it anyway, 401-looping every 3s and burning reconnects
even though delivery already flows over /v1/node/ws. The earlier teardown
removed firehose message handling but left the connection itself.

MultiWorkspaceSession::new no longer spawns the workspace-stream WS task;
it drains the WsControl channel to a no-op and keeps the inbound channel
open as an inert empty source (kept alive by a sender clone so the wrap
action consumer and runtime no-op handler never busy-loop on a closed
receiver). RelaycastWsClient (the 401-looping run loop) is deleted. The
workspace HTTP client, WsControl plumbing, and ws_control_tx senders are
all kept intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* style: auto-format with Prettier

* chore: drop session trajectory records swept into the broker PR

The workflow/fix sub-agents' `git add -A` committed this session's
.agentworkforce/trajectories/ records into the code branch. Untrack them
(kept on disk) so the PR is a reviewable code-only diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): scope node token to workspace and re-mint on node-ws 401

The broker persisted its minted node token at a single global path and
load_node_token only checked node_id, so a token minted for workspace A
(or a local engine) was reused against workspace B / prod and rejected
with HTTP 401 on /v1/node/ws, and the connect loop looped forever on the
rejected cached token.

- Scope the persisted token to workspace_id (and engine base_url):
  add both to PersistedNodeToken; load_node_token returns the cached
  token only when node_id AND workspace_id match (and base_url when both
  sides know it); legacy caches without base_url still reuse on a
  workspace match. persist_node_token / load_node_token / resolve_node_token
  signatures and call sites in runtime/init.rs updated accordingly.
- Re-mint on 401: detect an HTTP 401 handshake rejection on the
  node-control connect, discard the cached file + in-memory token, and
  re-mint via RelayCast::create_node (wired through a NodeTokenMinter)
  before retrying. Bounded to MAX_UNAUTHORIZED_BEFORE_GIVING_UP (5)
  consecutive 401s, after which a loud hard error is surfaced instead of
  spinning.
- Tests: load_node_token workspace/node/base_url mismatch + round-trip,
  legacy-cache reuse, and 401 connect-error detection.
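
The cache-reuse rule in the first bullet can be sketched as a pure check. The field names mirror those mentioned above, but the struct layout and the `reusable_token` helper are assumptions for illustration.

```rust
/// Assumed shape of the persisted cache entry; `base_url` is None for
/// legacy caches written before the engine URL was recorded.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PersistedNodeToken {
    pub node_id: String,
    pub workspace_id: String,
    pub base_url: Option<String>,
    pub token: String,
}

/// Return the cached token only when node id AND workspace id match,
/// and the base URL matches whenever both sides know it.
pub fn reusable_token<'a>(
    cached: &'a PersistedNodeToken,
    node_id: &str,
    workspace_id: &str,
    base_url: Option<&str>,
) -> Option<&'a str> {
    if cached.node_id != node_id || cached.workspace_id != workspace_id {
        return None;
    }
    match (cached.base_url.as_deref(), base_url) {
        // Both sides know the engine URL and they disagree: do not reuse.
        (Some(have), Some(want)) if have != want => None,
        // Match, or one side unknown (legacy cache): reuse.
        _ => Some(cached.token.as_str()),
    }
}
```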

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): derive node id from machine-id + cwd hash

The engine scopes relaycast nodes globally, so a host running brokers for
two different workspaces (each in its own working directory) collided on a
single global machine-id node, failing create_node and enrollment.

Derive the node id deterministically from (machine_seed, canonical_cwd) via
sha2, keeping it stable across restarts in the same directory but distinct
across directories. The machine-id file stays the per-machine seed; cwd is
read from current_dir (canonicalized when possible) with a seed-only
fallback when cwd is unreadable. derive_node_id is a pure testable fn.
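
The shape of such a derivation is sketched below. To stay dependency-free it swaps in a hand-written FNV-1a 64-bit hash where the commit uses sha2, so the ids it produces are not the broker's. The `node-` prefix, the separator byte, and hashing the seed alone in the fallback are all assumptions.

```rust
use std::path::Path;

/// FNV-1a 64-bit. A stand-in for sha2 here, chosen only because it needs
/// no external crate; it is not cryptographic.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Pure derivation: same (seed, cwd) gives the same id, and a different
/// cwd gives a different id. `cwd` is None when it could not be read.
pub fn derive_node_id(machine_seed: &str, cwd: Option<&Path>) -> String {
    let mut input = machine_seed.as_bytes().to_vec();
    if let Some(dir) = cwd {
        // Separator byte so ("ab", "c") and ("a", "bc") hash differently.
        input.push(0);
        input.extend_from_slice(dir.to_string_lossy().as_bytes());
    }
    format!("node-{:016x}", fnv1a64(&input))
}
```

A caller would canonicalize `std::env::current_dir()` where possible and pass `None` when that fails, which matches the seed-only fallback described above.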

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): correlate agent.register replies by name; silence non-agent reply WARN

The engine replies to every node-control request (node.register,
inventory.sync) with a fresh snowflake reply id, since the broker sends
those frames without an id. The broker routed every reply frame to
complete_agent_registration, so these non-agent replies never matched a
pending agent registration and produced a spurious WARN
'agent.register reply did not match a pending registration id=<snowflake>'.

Resolve the pending registration by request id first, then fall back to
matching the validated reply data.name against a pending entry (robust
against an engine that drops/regenerates the id). Replies that resolve to
neither are treated as non-agent replies and logged at debug, not warn.
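
The two-step resolution (request id first, then name, otherwise a non-agent reply) could be sketched like this. `PendingRegistrations` and `ReplyMatch` are hypothetical names, and the real pending table likely carries more than an agent name.

```rust
use std::collections::HashMap;

/// Outcome of matching a reply frame against pending agent registrations.
#[derive(Debug, PartialEq, Eq)]
pub enum ReplyMatch {
    ById(String),
    ByName(String),
    /// Reply to node.register, inventory.sync, etc.: log at debug, not warn.
    NonAgent,
}

#[derive(Default)]
pub struct PendingRegistrations {
    /// request id -> agent name awaiting its agent.register reply
    by_request_id: HashMap<String, String>,
}

impl PendingRegistrations {
    pub fn add(&mut self, request_id: &str, agent_name: &str) {
        self.by_request_id.insert(request_id.to_string(), agent_name.to_string());
    }

    pub fn resolve(&mut self, reply_id: Option<&str>, reply_name: Option<&str>) -> ReplyMatch {
        // 1. Exact request-id match.
        if let Some(name) = reply_id.and_then(|id| self.by_request_id.remove(id)) {
            return ReplyMatch::ById(name);
        }
        // 2. Fall back to the name carried in the reply data.
        if let Some(name) = reply_name {
            let key = self
                .by_request_id
                .iter()
                .find(|(_, pending)| pending.as_str() == name)
                .map(|(id, _)| id.clone());
            if let Some(id) = key {
                self.by_request_id.remove(&id);
                return ReplyMatch::ByName(name.to_string());
            }
        }
        // 3. Neither matched: not an agent registration reply.
        ReplyMatch::NonAgent
    }
}
```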

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): scope node-token cache per node id, forward invoke session ref, bound re-mint loop

Three node-only delivery fixes:

- Node token cache is now scoped per node_id (node-tokens/{node_id}.json,
  filename-sanitized) instead of one host-wide file, so two brokers in
  different cwds on a host no longer overwrite each other's token.
- node action.invoke spawns forward invocation_id and the harness session
  ref into agent.register (was hardcoded None,None), restoring invocation
  correlation and session resume. HTTP /api/spawn derives session ref from
  its spec too.
- node-control re-mint loop only resets the consecutive-401 counter when a
  connection actually establishes (not on a successful mint) and drops the
  post-mint `continue` so retries honor the reconnect backoff, making the
  give-up cap reachable and stopping a tight POST /v1/nodes loop.
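
The counter behaviour in the last bullet can be modelled as a small state machine. This is a sketch: the constant name and value come from the earlier commit message, while `ConnectOutcome`, `NextStep`, `UnauthorizedTracker`, and giving up on exactly the fifth consecutive 401 are assumptions.

```rust
pub const MAX_UNAUTHORIZED_BEFORE_GIVING_UP: u32 = 5;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectOutcome {
    Connected,
    Unauthorized,
    OtherError,
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    Run,
    /// Discard the cached token, mint a new one, then wait out the backoff.
    RemintThenBackoff,
    Backoff,
    GiveUp,
}

#[derive(Default)]
pub struct UnauthorizedTracker {
    consecutive_401s: u32,
}

impl UnauthorizedTracker {
    pub fn on_outcome(&mut self, outcome: ConnectOutcome) -> NextStep {
        match outcome {
            // Only an established connection resets the counter; a
            // successful mint does not.
            ConnectOutcome::Connected => {
                self.consecutive_401s = 0;
                NextStep::Run
            }
            ConnectOutcome::Unauthorized => {
                self.consecutive_401s += 1;
                if self.consecutive_401s >= MAX_UNAUTHORIZED_BEFORE_GIVING_UP {
                    NextStep::GiveUp
                } else {
                    NextStep::RemintThenBackoff
                }
            }
            ConnectOutcome::OtherError => NextStep::Backoff,
        }
    }
}
```

Because a mint never touches the counter and every retry waits for the backoff, repeated 401s reach the cap instead of spinning.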

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): use pinned node id verbatim when RELAY_NODE_TOKEN is set

Operator-enrolled / fleet nodes pin their node id (via the machine-id
file) to match an engine-issued node token. The cwd-hash derivation broke
that: the broker registered a derived id, so the engine rejected
node.register with node_id_mismatch and the node never came online
(two-node fleet E2E timed out on online+handlers_live). Only derive when
auto-minting (no RELAY_NODE_TOKEN); otherwise use the pinned id verbatim.
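
The resulting rule is small enough to sketch directly. `effective_node_id` is a hypothetical helper, the derivation is passed in as a closure, and treating an empty RELAY_NODE_TOKEN as unset is an assumption.

```rust
/// Use the pinned id verbatim when an operator supplied a node token;
/// derive a per-directory id only when the broker will auto-mint.
pub fn effective_node_id(
    pinned_id: &str,
    relay_node_token: Option<&str>,
    derive: impl Fn(&str) -> String,
) -> String {
    match relay_node_token {
        // Assumption: an empty value counts as "not set".
        Some(token) if !token.is_empty() => pinned_id.to_string(),
        _ => derive(pinned_id),
    }
}
```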

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(broker): drop bare "spawn" capability from bootstrap node.register

The broker's pre-sidecar bootstrap node.register advertised a generic
capability literally named "spawn". The relaycast engine does not treat
bare "spawn" as a placement capability (only spawn:* is), so its
ensureCapabilityActions materialized a regular `spawn` ACTION pinned to
whichever node bootstrapped first. From then on every spawn invoke
resolved that action and was dispatched to the bootstrapping node,
short-circuiting capability-based spawn placement for the whole
workspace — cli/target_node/least-loaded routing were all ignored.

This regressed the two-node fleet E2E: spawn:codex/spawn:claude landed on
the wrong node, target_node placement-mismatch returned 201 instead of
409, and least-loaded scheduling misrouted.

The bootstrap descriptor now carries no capabilities; the node's real
spawn:*/action capabilities arrive on the sidecar's node.register. The
node is still registered (online) before the sidecar connects, but claims
no handler until the authoritative capability set is reported.

Extracts the manifest into `bootstrap_node_manifest` with a unit test
asserting the empty capability set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: auto-format Rust code with cargo fmt

* fix(broker): run sidecar-declared harness for fleet spawn:* actions

A `spawn:<harness>` node action registered by a sidecar was being run by
the broker directly with the literal `cli` from the action input
(`handle_fleet_action_spawn`), launching the real CLI instead of the
node's declared harness. In the fleet/sidecar model the sidecar owns the
harness: its `spawn(<harness>)` handler resolves the declared harness
spec and calls `ctx.spawnAgent` (-> `spawn_agent` -> handle_fleet_spawn_agent).

Route `spawn:*` to the sidecar's registered handler (same path as
echo/work) whenever the sidecar declared a handler for that capability,
gated by a new `HandlerDispatchState::has_handler`. The broker-direct
raw-`cli` spawn is reserved for the direct / no-sidecar path where no
sidecar handler is registered.
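
A sketch of that routing decision follows. `HandlerDispatchState::has_handler` is named above, but its signature, the set-based storage, and the `SpawnRoute` / `route_spawn` items are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Assumed shape: the set of capabilities the sidecar declared handlers for.
#[derive(Default)]
pub struct HandlerDispatchState {
    handlers: HashSet<String>,
}

impl HandlerDispatchState {
    pub fn register(&mut self, capability: &str) {
        self.handlers.insert(capability.to_string());
    }

    pub fn has_handler(&self, capability: &str) -> bool {
        self.handlers.contains(capability)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SpawnRoute {
    /// The sidecar owns the harness: dispatch to its registered handler.
    SidecarHandler,
    /// Direct / no-sidecar path: the broker spawns the raw cli itself.
    BrokerDirect,
}

/// Returns None for actions that are not spawn actions at all.
pub fn route_spawn(state: &HandlerDispatchState, action: &str) -> Option<SpawnRoute> {
    if action != "spawn" && !action.starts_with("spawn:") {
        return None;
    }
    if action.starts_with("spawn:") && state.has_handler(action) {
        Some(SpawnRoute::SidecarHandler)
    } else {
        Some(SpawnRoute::BrokerDirect)
    }
}
```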

Fixes the fleet-E2E least-loaded scheduling flake: heavyweight real
processes (instead of the lightweight stub PTY) lingered/exited and
triggered broker re-init, collapsing per-node active_agents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>