feat(evals): combined offline SDK harness + live behavioral eval suite by khaliqgant · Pull Request #1109 · AgentWorkforce/relay

khaliqgant · 2026-06-12T08:27:49Z

Summary

Combines two eval PRs (#1092 and #1066) into a single merge:

Offline SDK harness (evals/, scripts/evals/) — 124 deterministic cases across 18 suites, zero tokens, zero broker required
Live behavioral harness (tests/integration/broker/evals/) — 8 scenarios that spawn real agent CLIs and measure whether agents actually use the messaging protocol
local-workflow CLI command and specs/fleet-delivery.md (from Protocol Evals #1066)

Closes #1092, Closes #1066.

Two-layer eval strategy

Layer	What it tests	How it runs	Cost
Offline (`npm run evals:offline`)	SDK correctness — does the SDK correctly execute relay operations?	In-memory executor, no broker	Free, instant
Live (`npm run eval`)	Agent behavior — do real models actually use the messaging protocol?	Spawns real CLIs against live broker	Tokens, needs `RELAY_INTEGRATION_REAL_CLI=1`

Running the evals

# Offline (no broker, no tokens)
npm run evals:compile         # .md → .jsonl (auto-runs before other cmds)
npm run evals:list            # see all 124 cases
npm run evals:offline         # run everything
npm run evals:offline -- --suite messaging    # single suite
npm run evals:offline -- --tag pending-executor

# Live behavioral (needs broker binary + agent CLIs)
npm run eval:unit             # scorer unit tests — fast, no CLIs
npm run eval:selftest         # negative control (cat shim, deterministic)
npm run eval:toolcheck        # wrong-tool-name trap, no broker
npm run eval                  # live eval, realistic tier
npm run eval:claude           # claude only
npm run eval:matrix           # claude + codex + gemini + grok

Test plan

npm run evals:offline passes (124 cases, 0 failed, 6 pending-executor)
npm run eval:unit passes
npm run eval:selftest goes RED (negative control)
npm run eval:toolcheck passes

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a comprehensive evaluation suite for the Relay SDK and agent messaging, including offline deterministic test suites, an in-memory executor, and live integration evals to measure message delivery, phantom rates, and protocol adherence. The code review identified several key areas for improvement, including resolving Windows path compatibility issues in the reporting and toolcheck scripts, fixing a workspace key collision bug in the in-memory executor, removing a redundant ternary expression in the HTML reporter, and making the regex patterns for phantom message detection and ACK/DONE protocol scoring more robust against false positives and markdown formatting.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T08:31:45Z

+import { execSync } from 'node:child_process';
+import fs from 'node:fs';
+import path from 'node:path';
+
+import type { EvalReport, MatrixReport, MetricSet } from '../types.js';
+import { renderMatrixHtml, renderReportHtml } from './html.js';
+
+/** Directory where reports are written (gitignored). */
+export function reportsDir(): string {
+  // Compiled location: dist/evals/report/ → scenario source lives under evals/.
+  return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');
+}


Using new URL(import.meta.url).pathname directly to resolve paths is a known compatibility issue on Windows platforms because it leaves a leading slash and does not correctly handle drive letters or backslashes. Use fileURLToPath from node:url instead to ensure cross-platform compatibility.

Suggested change

import { execSync } from 'node:child_process';

import fs from 'node:fs';

import path from 'node:path';

import type { EvalReport, MatrixReport, MetricSet } from '../types.js';

import { renderMatrixHtml, renderReportHtml } from './html.js';

/** Directory where reports are written (gitignored). */

export function reportsDir(): string {

// Compiled location: dist/evals/report/ → scenario source lives under evals/.

return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');

}

import { execSync } from 'node:child_process';

import fs from 'node:fs';

import path from 'node:path';

import { fileURLToPath } from 'node:url';

import type { EvalReport, MatrixReport, MetricSet } from '../types.js';

import { renderMatrixHtml, renderReportHtml } from './html.js';

/** Directory where reports are written (gitignored). */

export function reportsDir(): string {

// Compiled location: dist/evals/report/ → scenario source lives under evals/.

return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..', '..', '..', 'evals-reports');

}

gemini-code-assist · 2026-06-12T08:31:45Z

+import fs from 'node:fs';
+import path from 'node:path';
+
+import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js';
+
+function repoRoot(): string {
+  // dist/evals/toolcheck-cli.js → up 5 to repo root.
+  return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');
+}


Using new URL(import.meta.url).pathname directly to resolve paths is a known compatibility issue on Windows platforms because it leaves a leading slash and does not correctly handle drive letters or backslashes. Use fileURLToPath from node:url instead to ensure cross-platform compatibility.

Suggested change

import fs from 'node:fs';

import path from 'node:path';

import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js';

function repoRoot(): string {

// dist/evals/toolcheck-cli.js → up 5 to repo root.

return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');

}

import fs from 'node:fs';

import path from 'node:path';

import { fileURLToPath } from 'node:url';

import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js';

function repoRoot(): string {

// dist/evals/toolcheck-cli.js → up 5 to repo root.

return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '../../../../..');

}

gemini-code-assist · 2026-06-12T08:31:45Z

+        key:
+          operation.workspaceKey ??
+          operation.key ??
+          nextWorkspaceKey(operation.id ?? operation.name ?? state.workspaces.size + 1),


When operation.id is not provided but operation.name is, creating multiple workspaces with duplicate display names will result in duplicate keys (e.g., rk_live_duplicate_display_name). This causes the second workspace to silently overwrite the first one in the state.workspaces Map. Appending the workspace size or index to the key when generating it from the display name ensures uniqueness.

Suggested change

key:

operation.workspaceKey ??

operation.key ??

nextWorkspaceKey(operation.id ?? operation.name ?? state.workspaces.size + 1),

key:

operation.workspaceKey ??

operation.key ??

nextWorkspaceKey(operation.id ?? (operation.name ? operation.name + "_" + (state.workspaces.size + 1) : undefined) ?? state.workspaces.size + 1),

gemini-code-assist · 2026-06-12T08:31:45Z

+  const rows = Object.entries(matrix.harnesses)
+    .map(([h, m]) => {
+      const link = links[h] ? `<a href="${esc(links[h])}">${esc(h)}</a>` : esc(h);
+      return `<tr><td>${link}</td><td>${pct(m.messageSentRate)}</td><td class="${m.phantomCount ? '' : ''}">${pct(m.phantomRate)} (${m.phantomCount})</td><td>${pct(m.protocolAdherence)}</td><td>${pct(m.deliverySuccessRate)}</td><td>${m.wrongChannelReplies}</td><td>${m.scenariosPassed}/${m.scenariosTotal}</td></tr>`;


The class attribute contains a redundant ternary expression class="${m.phantomCount ? '' : ''}" which always evaluates to an empty string. This can be safely removed to clean up the generated HTML.

Suggested change

return `<tr><td>${link}</td><td>${pct(m.messageSentRate)}</td><td class="${m.phantomCount ? '' : ''}">${pct(m.phantomRate)} (${m.phantomCount})</td><td>${pct(m.protocolAdherence)}</td><td>${pct(m.deliverySuccessRate)}</td><td>${m.wrongChannelReplies}</td><td>${m.scenariosPassed}/${m.scenariosTotal}</td></tr>`;

return "<tr><td>" + link + "</td><td>" + pct(m.messageSentRate) + "</td><td>" + pct(m.phantomRate) + " (" + m.phantomCount + ")</td><td>" + pct(m.protocolAdherence) + "</td><td>" + pct(m.deliverySuccessRate) + "</td><td>" + m.wrongChannelReplies + "</td><td>" + m.scenariosPassed + "/" + m.scenariosTotal + "</td></tr>";

gemini-code-assist · 2026-06-12T08:31:45Z

+const INTENT_PATTERNS: IntentPattern[] = [
+  // "I'll / I will / I'm going to / going to / let me / I can" + comm verb + optional target
+  {
+    re: /\b(?:i'?ll|i will|i'?m going to|going to|let me|i can|i should)\s+(tell|message|dm|notify|reply to|respond to|post|send|report(?:\s+to)?|relay(?:\s+to)?|forward(?:\s+to)?|ping|update|let)\s+(?:to\s+|the\s+)?([a-z0-9_#-]+)?/gi,


gemini-code-assist · 2026-06-12T08:31:46Z

+  const sends = inboundFrom(events, worker).filter((e) => !isChannelTarget(e.target));
+  const ackIdx = sends.findIndex((e) => /^\s*ack\b/i.test(e.body));


The regexes for matching ack and done messages are brittle and will fail if the agent formats the status message with markdown bolding or italics (e.g., **ACK**: ... or *DONE*). Allowing optional leading markdown asterisks or underscores makes the protocol scoring much more robust.

Suggested change

const sends = inboundFrom(events, worker).filter((e) => !isChannelTarget(e.target));

const ackIdx = sends.findIndex((e) => /^\s*ack\b/i.test(e.body));

const ackIdx = sends.findIndex((e) => /^\s*[\*_]*ack\b/i.test(e.body));

const doneIdx = sends.findIndex((e) => /^\s*[\*_]*done\b/i.test(e.body));

coderabbitai · 2026-06-12T08:32:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds an offline Relay eval harness (cases, compiler, in-memory executor, expectation checks), CI/run/reporting tooling, many generated suite specs and broker real-CLI scenarios with scoring, auto-routing CLI helpers, fleet-delivery spec, and related tests/config updates.

Changes

Relay eval harness

Layer / File(s)	Summary
Planning and docs `evals/PLAN.md`, `evals/README.md`, `CHANGELOG.md`, `.trajectories/`, `.claude/skills/`	Top-level plan, README, changelog entries, skill/trajectory docs and spec updates describing eval architecture and usage.
Suite specs and generated cases `evals/suites//.md`, `evals/suites/*/cases.jsonl`	Added rubric and `cases.md` for all suites and generated `cases.jsonl` files for deterministic eval cases across messaging, channels, threads, reactions, read-receipts, search, actions, schemas, listeners, facade, capabilities, delivery-modes, auth-errors, protocol-framing, agent-directory, session, workspaces, and more.
Case compiler and checks `scripts/evals/compile-cases.mjs`, `scripts/evals/relay-checks.mjs`	Markdown-to-JSONL compiler and expectation assertion helpers to validate executor output against deterministic checks.
In-memory executor `scripts/evals/relay-executor.mjs`	Large in-memory Relay executor implementing messaging, channels, DMs, inbox/delivery, actions, listeners, workspace, and stable output; exports createRelayExecutor().
Run CLI & CI `scripts/evals/run-relay-evals.mjs`, `scripts/evals/ci-summary.mjs`, `.github/workflows/relay-evals.yml`, `package.json`	Run CLI to execute suites/trials, CI workflow to run offline evals, CI summary writer, and package scripts for eval workflows.
Reports & HTML `tests/integration/broker/evals/report/*`	Report writing utilities and standalone HTML renderers for per-harness and matrix reports plus CLI to render HTML from JSON reports.

Broker eval framework

Layer / File(s)	Summary
Types & scoring `tests/integration/broker/evals/types.ts`, `tests/integration/broker/evals/scoring/*`	Eval type model, scoring helpers (base, phantom, protocol, lifecycle), metrics aggregation, fixtures, and unit tests.
Scenarios & registry `tests/integration/broker/evals/scenarios/*`, `tests/integration/broker/evals/scenarios/index.ts`	Many new realistic and smoke scenarios (lifecycle, messaging, auto-routing, phrasing, etc.), onboarding variants, helpers, and scenario registry exports.
Runner & harness `tests/integration/broker/evals/runner.ts`, `tests/integration/broker/utils/broker-harness.ts`, `tests/integration/broker/evals/*`	Real-CLI runner, harness adjustments (drop AgentRelay facade, spawn options model forwarding), toolcheck/selftest scripts, CI test entrypoints, and tsconfig/vitest updates.

Auto-routing

Layer / File(s)	Summary
Classifier & composer `packages/cli/src/auto/*`	Heuristic task classifier, team composer, director-prompt builder, tests, and barrel exports for model=auto auto-routing.
Auto-routing spec `specs/auto-routing.md`	Design document describing classifier contract, routing table, director meta-prompt, phased plan, and new eval scenario s06-auto-routing.

Fleet delivery & snippets

Layer / File(s)	Summary
Fleet delivery spec `specs/fleet-delivery.md`	Architecture specification for fleet delivery, mailbox durability, spawn/resume semantics, and open questions.
Grok/snippets `crates/broker/src/snippets.rs`	Grok MCP arg rendering fixes for positional vs flag args and updated unit tests.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

#1092: feat(evals): add Relay SDK eval harness — This PR implements the offline eval harness, cases, executor, and CI referenced by the issue.
#1056: (fleet-delivery spec overlap) — The fleet-delivery spec and delivery-related evals appear to address objectives in that issue.

Suggested reviewers

willwashburn

"I hop with a carrot near the CI tree,
Cases compiled, reports bloom for me.
Executors race, checks pass in line,
A rabbit cheers: green builds shine. 🐇"

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/combined-evals

coderabbitai

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/broker/utils/broker-harness.ts (1)

127-131: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Iterate over a snapshot when dispatching broker events.

waitForEvent() listeners remove themselves from this.eventListeners, but start() dispatches over the live array. If one listener splices itself out during the loop, the next listener can be skipped and miss the event entirely.

Suggested fix

     this.unsubEvent = this.client.onEvent((event: BrokerEvent) => {
       this.events.push(event);
-      for (const listener of this.eventListeners) {
+      for (const listener of [...this.eventListeners]) {
         listener(event);
       }
     });

#!/bin/bash
node <<'NODE'
const listeners = [];
const seen = [];
const a = () => {
  seen.push('a');
  listeners.splice(listeners.indexOf(a), 1);
};
const b = () => seen.push('b');
listeners.push(a, b);
for (const listener of listeners) listener();
console.log(seen.join(','));
NODE

Expected result: the script prints only a, demonstrating how mutating the live array during iteration skips the next listener.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/utils/broker-harness.ts` around lines 127 - 131, The
event dispatch iterates the live this.eventListeners array so listeners that
remove themselves (e.g., waitForEvent) can cause the next listener to be
skipped; change the dispatch in the this.client.onEvent callback (the block that
pushes into this.events and then loops over this.eventListeners) to iterate a
snapshot copy of the listeners (for example by using a shallow copy via slice()
or spread) so removals during iteration don't affect which listeners are
invoked.

🧹 Nitpick comments (2)

evals/suites/auth-errors/cases.md (1)
61-74: 💤 Low value

Redundant must/mustNot in Deterministic Checks section.

The "detect-code-case-insensitive" case (and similarly "detect-cause-chain" and "ignore-tool-result-without-marker") includes must/mustNot in the Deterministic Checks block (lines 65–68), then repeats them in separate Must/Must Not sections (lines 70–74). The compiler uses the final Must/Must Not sections. Remove the inline must/mustNot from Deterministic Checks to clarify the intended structure.
♻️ Simplify the Deterministic Checks section
 ### Deterministic Checks
 ok: true
 contentIncludes:
 - true
-must:
-- Normalize error codes before comparison.
-mustNot:
-- Treat casing as significant.

 ### Must
(Apply the same pattern to lines 174–181 and lines 295–302.)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/auth-errors/cases.md` around lines 61 - 74, Remove the redundant
inline must/mustNot entries from the "Deterministic Checks" block in the cases
"detect-code-case-insensitive", "detect-cause-chain", and
"ignore-tool-result-without-marker" so that those checks only appear in the
final "Must" / "Must Not" sections; specifically, edit the Deterministic Checks
section to keep only deterministic flags (ok, contentIncludes) and delete the
inline must/mustNot lines, and apply the same cleanup for the other duplicate
occurrences in the file where Deterministic Checks repeats the final Must/Must
Not entries.
tests/integration/broker/evals/toolcheck-cli.ts (1)
17-20: ⚡ Quick win

Use fileURLToPath(import.meta.url) before passing ESM URLs into path.*.

Both locations derive filesystem paths from new URL(import.meta.url).pathname. In Node, that leaves percent-encoding intact and turns Windows paths into /C:/..., so these helpers can fail to locate the repo root or report directory on Windows and on paths containing spaces. Convert the file URL with fileURLToPath() first, then call path.dirname().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/toolcheck-cli.ts` around lines 17 - 20, The
repoRoot() helper uses path.dirname(new URL(import.meta.url).pathname) which
leaves percent-encoding and Windows drive slashes intact; replace that by
converting the ESM file URL to a filesystem path first using
fileURLToPath(import.meta.url) and then call path.dirname on the result (update
the repoRoot function and add the fileURLToPath import from 'url'). Ensure the
new logic still resolves the same upward path (keep the '../../../../..'
hierarchy) but uses path.resolve(path.dirname(filePath), '../../../../..') where
filePath is fileURLToPath(import.meta.url).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/relay-evals.yml:
- Around line 45-46: Workflow uses unpinned action tags and leaves checkout
credentials persisted; update the checkout, setup-node, and upload-artifact
steps by replacing the tag versions (actions/checkout@v4, actions/setup-node@v4,
actions/upload-artifact@v4) with their corresponding full commit SHAs, and add
the checkout input persist-credentials: false under the Checkout step (the step
that references actions/checkout) to prevent the runner token from being exposed
to later steps—ensure you update the three action references consistently to
fixed SHAs.

In `@scripts/evals/relay-executor.mjs`:
- Around line 28-38: The executor currently only iterates
normalizeOperations(testCase.input?.operation ?? testCase.input?.operations) and
ignores cases authored with input.message; update the loop in relay-executor.mjs
to include testCase.input.message (or convert message to an operation) so
Message-authored evals are executed—e.g., feed normalizeOperations with
testCase.input?.message when present or merge message into the operations array
before calling executeOperation, locating the use of normalizeOperations and
executeOperation to implement the conversion/merge.
- Around line 638-650: The send handler currently calls upsertChannel, which
silently creates or resurrects channels; change it to look up the channel
without mutating state (e.g., use the existing channel lookup logic instead of
upsertChannel in send) and validate lifecycle: if the channel is missing, throw
or return an error, and if the channel is archived, reject/post an error instead
of proceeding; only after confirming a non-archived channel exists should you
call createMessage and emit('messageCreated'). Refer to the send function,
replace upsertChannel usage, and add checks for channel existence and
channel.archived before createMessage/emit.

In `@tests/integration/broker/evals/runner.ts`:
- Around line 215-217: The current code uses flags.baseline with
compareReports(readReport(flags.baseline), report) for every harness in the run,
causing cross-harness comparisons; change behavior so baseline must either be a
single-harness file and the run contains exactly one harness (validate
flags.baseline is only allowed when the harness list length === 1 and throw/exit
with a clear message if not), or treat flags.baseline as a matrix and load a map
of harness→baseline (e.g. parse readReport(flags.baseline) into a per-harness
lookup and call compareReports(baselineByHarness[harnessName], report) for each
harness, erroring if a harness has no matching baseline). Ensure you update the
code paths that call compareReports and readReport and reference the loop
variable for the current harness when selecting the baseline.
- Around line 57-60: selectScenarios currently drops unknown IDs by mapping
flags.scenarioIds through scenarioById and filtering falsy results; instead
detect any missing IDs and fail fast. Modify selectScenarios to compute the list
of requested IDs (flags.scenarioIds), map them to scenarioById results, collect
the IDs that produced no EvalScenario, and if any are missing throw an error (or
process exit) listing the unknown IDs so the invocation fails before running any
scenarios; otherwise return the resolved EvalScenario[] as before.
- Around line 113-138: After the scenario loop, detect when no scenarios
actually ran (results.length === 0) due to all scenarios being filtered by
scenario.harnessFilter for the current cli and treat that as a skip/error
instead of emitting a normal report: add a guard right before building/returning
the report (after the loop that calls runOnce and mergeRepeats) that either
throws a descriptive Error(`no scenarios ran for harness ${cli}`) or returns a
report object marked as skipped (e.g., add skipped: true and empty metrics) so
downstream code doesn't treat an empty scenarios array as a healthy run;
reference results, scenarios, scenario.harnessFilter, cli, runOnce,
mergeRepeats, aggregateMetrics and SCHEMA_VERSION when implementing the check
and handling.

In `@tests/integration/broker/evals/scenarios/03-ack-done.ts`:
- Around line 31-47: The test spawns agents with harness.spawnAgent but only
calls harness.releaseAgent on the happy path, risking leaked agents if an
awaited step throws; wrap the scenario body (the sequence from
harness.spawnAgent through waits and scoring in 03-ack-done.ts and
04-relay-chain.ts) in a try/finally and move all harness.releaseAgent(worker)
(and any other spawned agent releases) into the finally block so release always
runs even on errors.

In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts`:
- Around line 25-66: The run function in each scenario (e.g., the async run in
r01-incidental-report.ts) currently calls harness.releaseAgent(...) only on the
success path, causing leaked agents; wrap the entire run body in a try/finally
where you call harness.spawnAgent(...) / harness.sendMessage(...) / waits in the
try block and always call harness.releaseAgent(worker) in the finally block
(catching errors from releaseAgent if needed) so agents are released regardless
of failures; apply the same pattern to the other scenario files
(r02-forget-to-report.ts, r03-proactive-handoff.ts, r04-channel-vs-dm.ts) to
prevent cross-test flakiness.

In `@tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts`:
- Around line 43-47: The pass check allows channel-targeted handoffs because
sentTo(events, author, reviewer) treats normalized channel names like
"`#reviewer-`*" as a match; update the pass criteria so it only counts a direct DM
handoff to the reviewer. Replace or augment the handedOff check (the sentTo
call) with a check that the event(s) from author target the reviewer actor
directly (e.g., compare explicit recipient ID or channelType === 'dm' /
recipient === reviewer.id) and reject matches that are channel names; keep the
other conditions (base.phantoms.length === 0 && base.deliveryOk) unchanged.

In `@tests/integration/broker/evals/scoring/protocol.ts`:
- Around line 115-121: scoreRelayChain currently treats payloadIntact true if
any RelayInbound to finalChannel contains the payload; restrict this to only
count posts from the expected relay sender. Update the finalPost filter (where
RelayInbound events are selected) to also check that e.sender equals the
expected relay identity (e.g., compare against the last relay/expected relay
variable used in scoreRelayChain such as lastRelay.sender or
expectedRelaySender) while keeping the existing normalizeChannel and payload
contains checks.

In `@tests/integration/broker/evals/scoring/stream-clean.ts`:
- Line 15: ANSI_PATTERN currently only matches OSC sequences terminated by BEL
(\x07) and thus leaves ST-terminated OSCs (ESC backslash) intact; update the
regex constant ANSI_PATTERN to also match OSC sequences terminated by the String
Terminator (ESC backslash) so stripAnsi() removes OSC 8 hyperlinks and similar
sequences—specifically extend the pattern that matches OSC (\x1b\][^\x07]*\x07)
to also accept sequences ending with ESC "\" (escape-backslash), ensuring the
pattern covers both \x07 and ESC "\" terminators.

In `@tests/integration/broker/evals/selftest.ts`:
- Around line 59-68: The failure branches in
tests/integration/broker/evals/selftest.ts use process.exit(1) which aborts
before the finally block can run; replace those calls inside the try (the checks
for evalWouldPass and score.sent) with throwing an Error (e.g., throw new
Error(...)) so the exception propagates and allows the finally block to run
harness.releaseAgent() / harness.stop(); ensure the thrown Error messages
contain the same descriptive text currently passed to console.error so test
harness teardown still executes and the process will exit non‑zero after
cleanup.

---

Outside diff comments:
In `@tests/integration/broker/utils/broker-harness.ts`:
- Around line 127-131: The event dispatch iterates the live this.eventListeners
array so listeners that remove themselves (e.g., waitForEvent) can cause the
next listener to be skipped; change the dispatch in the this.client.onEvent
callback (the block that pushes into this.events and then loops over
this.eventListeners) to iterate a snapshot copy of the listeners (for example by
using a shallow copy via slice() or spread) so removals during iteration don't
affect which listeners are invoked.

---

Nitpick comments:
In `@evals/suites/auth-errors/cases.md`:
- Around line 61-74: Remove the redundant inline must/mustNot entries from the
"Deterministic Checks" block in the cases "detect-code-case-insensitive",
"detect-cause-chain", and "ignore-tool-result-without-marker" so that those
checks only appear in the final "Must" / "Must Not" sections; specifically, edit
the Deterministic Checks section to keep only deterministic flags (ok,
contentIncludes) and delete the inline must/mustNot lines, and apply the same
cleanup for the other duplicate occurrences in the file where Deterministic
Checks repeats the final Must/Must Not entries.

In `@tests/integration/broker/evals/toolcheck-cli.ts`:
- Around line 17-20: The repoRoot() helper uses path.dirname(new
URL(import.meta.url).pathname) which leaves percent-encoding and Windows drive
slashes intact; replace that by converting the ESM file URL to a filesystem path
first using fileURLToPath(import.meta.url) and then call path.dirname on the
result (update the repoRoot function and add the fileURLToPath import from
'url'). Ensure the new logic still resolves the same upward path (keep the
'../../../../..' hierarchy) but uses path.resolve(path.dirname(filePath),
'../../../../..') where filePath is fileURLToPath(import.meta.url).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: db4d0ac4-695b-462a-9cfb-8041a8dd3b89

📥 Commits

Reviewing files that changed from the base of the PR and between fe2f9f6 and b4d1170.

⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (104)

.agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
.github/workflows/relay-evals.yml
.gitignore
.trajectories/completed/2026-06/traj_o61z0ze6kvla/summary.md
.trajectories/completed/2026-06/traj_o61z0ze6kvla/trajectory.json
CHANGELOG.md
evals/PLAN.md
evals/README.md
evals/suites/action-errors/cases.jsonl
evals/suites/action-errors/cases.md
evals/suites/action-errors/rubric.md
evals/suites/action-schema/cases.jsonl
evals/suites/action-schema/cases.md
evals/suites/action-schema/rubric.md
evals/suites/actions/cases.jsonl
evals/suites/actions/cases.md
evals/suites/actions/rubric.md
evals/suites/agent-directory/cases.jsonl
evals/suites/agent-directory/cases.md
evals/suites/agent-directory/rubric.md
evals/suites/auth-errors/cases.jsonl
evals/suites/auth-errors/cases.md
evals/suites/auth-errors/rubric.md
evals/suites/capabilities/cases.jsonl
evals/suites/capabilities/cases.md
evals/suites/capabilities/rubric.md
evals/suites/channels/cases.jsonl
evals/suites/channels/cases.md
evals/suites/channels/rubric.md
evals/suites/delivery-modes/cases.jsonl
evals/suites/delivery-modes/cases.md
evals/suites/delivery-modes/rubric.md
evals/suites/facade/cases.jsonl
evals/suites/facade/cases.md
evals/suites/facade/rubric.md
evals/suites/listeners/cases.jsonl
evals/suites/listeners/cases.md
evals/suites/listeners/rubric.md
evals/suites/messaging/cases.jsonl
evals/suites/messaging/cases.md
evals/suites/messaging/rubric.md
evals/suites/protocol-framing/cases.jsonl
evals/suites/protocol-framing/cases.md
evals/suites/protocol-framing/rubric.md
evals/suites/reactions/cases.jsonl
evals/suites/reactions/cases.md
evals/suites/reactions/rubric.md
evals/suites/read-receipts/cases.jsonl
evals/suites/read-receipts/cases.md
evals/suites/read-receipts/rubric.md
evals/suites/search/cases.jsonl
evals/suites/search/cases.md
evals/suites/search/rubric.md
evals/suites/session/cases.jsonl
evals/suites/session/cases.md
evals/suites/session/rubric.md
evals/suites/threads/cases.jsonl
evals/suites/threads/cases.md
evals/suites/threads/rubric.md
evals/suites/workspaces/cases.jsonl
evals/suites/workspaces/cases.md
evals/suites/workspaces/rubric.md
package.json
scripts/evals/ci-summary.mjs
scripts/evals/compile-cases.mjs
scripts/evals/relay-checks.mjs
scripts/evals/relay-executor.mjs
scripts/evals/run-relay-evals.mjs
specs/fleet-delivery.md
tests/integration/broker/evals/README.md
tests/integration/broker/evals/eval.test.ts
tests/integration/broker/evals/report/html.ts
tests/integration/broker/evals/report/render-cli.ts
tests/integration/broker/evals/report/write.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
tests/integration/broker/evals/scenarios/02-channel-reply.ts
tests/integration/broker/evals/scenarios/03-ack-done.ts
tests/integration/broker/evals/scenarios/04-relay-chain.ts
tests/integration/broker/evals/scenarios/helpers.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/r01-incidental-report.ts
tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
tests/integration/broker/evals/scoring/base.ts
tests/integration/broker/evals/scoring/fixtures.ts
tests/integration/broker/evals/scoring/metrics.ts
tests/integration/broker/evals/scoring/metrics.unit.test.ts
tests/integration/broker/evals/scoring/phantom.ts
tests/integration/broker/evals/scoring/phantom.unit.test.ts
tests/integration/broker/evals/scoring/protocol.ts
tests/integration/broker/evals/scoring/protocol.unit.test.ts
tests/integration/broker/evals/scoring/stream-clean.ts
tests/integration/broker/evals/scoring/toolcheck.ts
tests/integration/broker/evals/scoring/toolcheck.unit.test.ts
tests/integration/broker/evals/selftest.ts
tests/integration/broker/evals/toolcheck-cli.ts
tests/integration/broker/evals/tsconfig.json
tests/integration/broker/evals/types.ts
tests/integration/broker/tsconfig.json
tests/integration/broker/utils/assert-helpers.ts
tests/integration/broker/utils/broker-harness.ts
vitest.config.ts

coderabbitai · 2026-06-12T08:46:17Z

+      - name: Checkout
+        uses: actions/checkout@v4


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify unpinned action tags and missing persist-credentials in relay-evals workflow. rg -nP 'uses:\s*actions\/[A-Za-z0-9._-]+@v[0-9]+' .github/workflows/relay-evals.yml rg -n 'persist-credentials:\s*false' .github/workflows/relay-evals.yml

Repository: AgentWorkforce/relay

Length of output: 184

Harden relay-evals workflow: pin action SHAs and disable checkout credential persistence.

.github/workflows/relay-evals.yml uses unpinned action tags (actions/checkout@v4, actions/setup-node@v4, actions/upload-artifact@v4) at lines 46/49/70—pin each to full commit SHAs.

The Checkout step (line 46) is missing with: persist-credentials: false, which can unnecessarily expose the runner token to later steps.

Suggested hardening patch

- name: Checkout - uses: actions/checkout@v4 + uses: actions/checkout@<full_commit_sha_for_v4> + with: + persist-credentials: false - name: Setup Node.js - uses: actions/setup-node@v4 + uses: actions/setup-node@<full_commit_sha_for_v4> with: node-version: ${{ env.NODE_VERSION }} cache: npm cache-dependency-path: package-lock.json - name: Upload eval artifacts if: always() - uses: actions/upload-artifact@v4 + uses: actions/upload-artifact@<full_commit_sha_for_v4> with: name: relay-eval-run path: .relay/evals/runs/ retention-days: 14 if-no-files-found: ignore

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 45-46: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 46-46: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/relay-evals.yml around lines 45 - 46, Workflow uses unpinned action tags and leaves checkout credentials persisted; update the checkout, setup-node, and upload-artifact steps by replacing the tag versions (actions/checkout@v4, actions/setup-node@v4, actions/upload-artifact@v4) with their corresponding full commit SHAs, and add the checkout input persist-credentials: false under the Checkout step (the step that references actions/checkout) to prevent the runner token from being exposed to later steps—ensure you update the three action references consistently to fixed SHAs.

Source: Linters/SAST tools

coderabbitai · 2026-06-12T08:46:17Z

+      for (const operation of normalizeOperations(testCase.input?.operation ?? testCase.input?.operations)) {
+        try {
+          await executeOperation(state, operation);
+        } catch (error) {
+          state.observed.error = normalizeError(error);
+          state.contentLines.push(
+            `error ${operation.op ?? operation.type}: ${state.observed.error.code} ${state.observed.error.message}`
+          );
+          break;
+        }
+      }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Message-authored eval cases never execute.

compile-cases.mjs accepts cases with input.message, but this executor only walks input.operation(s). Any suite authored with ### Message will no-op and return an empty harness snapshot without ever exercising the SDK path.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/evals/relay-executor.mjs` around lines 28 - 38, The executor currently only iterates normalizeOperations(testCase.input?.operation ?? testCase.input?.operations) and ignores cases authored with input.message; update the loop in relay-executor.mjs to include testCase.input.message (or convert message to an operation) so Message-authored evals are executed—e.g., feed normalizeOperations with testCase.input?.message when present or merge message into the operations array before calling executeOperation, locating the use of normalizeOperations and executeOperation to implement the conversion/merge.

coderabbitai · 2026-06-12T08:46:17Z

+    send: async (input) => {
+      const channel = upsertChannel(state, input.channel, { members: [actorName] });
+      const message = createMessage(state, {
+        id: input.__id,
+        kind: 'channel',
+        from: actorName,
+        text: input.text,
+        channel: channel.name,
+        attachments: input.attachments,
+        mode: input.mode,
+      });
+      emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message });
+      return message;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

post_message bypasses channel lifecycle checks.

This path uses upsertChannel(...), so posting to a missing channel silently creates it, and posting to an archived channel also succeeds. That breaks the create_channel / archive_channel contract the rest of this harness is modeling.

Suggested fix

send: async (input) => { - const channel = upsertChannel(state, input.channel, { members: [actorName] }); + const channel = requireChannel(state, input.channel); + if (channel.archived) { + throw codedError('channel_archived', `channel is archived: ${channel.name}`); + } + channel.members.add(actorName); const message = createMessage(state, { id: input.__id, kind: 'channel', from: actorName, text: input.text,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

send: async (input) => {

const channel = upsertChannel(state, input.channel, { members: [actorName] });

const message = createMessage(state, {

id: input.__id,

kind: 'channel',

from: actorName,

text: input.text,

channel: channel.name,

attachments: input.attachments,

mode: input.mode,

});

emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message });

return message;

send: async (input) => {

const channel = requireChannel(state, input.channel);

if (channel.archived) {

throw codedError('channel_archived', `channel is archived: ${channel.name}`);

}

channel.members.add(actorName);

const message = createMessage(state, {

id: input.__id,

kind: 'channel',

from: actorName,

text: input.text,

channel: channel.name,

attachments: input.attachments,

mode: input.mode,

});

emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message });

return message;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/evals/relay-executor.mjs` around lines 638 - 650, The send handler currently calls upsertChannel, which silently creates or resurrects channels; change it to look up the channel without mutating state (e.g., use the existing channel lookup logic instead of upsertChannel in send) and validate lifecycle: if the channel is missing, throw or return an error, and if the channel is archived, reject/post an error instead of proceeding; only after confirming a non-archived channel exists should you call createMessage and emit('messageCreated'). Refer to the send function, replace upsertChannel usage, and add checks for channel existence and channel.archived before createMessage/emit.

coderabbitai · 2026-06-12T08:46:18Z

+function selectScenarios(flags: Flags): EvalScenario[] {
+  if (flags.scenarioIds) {
+    return flags.scenarioIds.map(scenarioById).filter((s): s is EvalScenario => Boolean(s));
+  }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unknown --scenario IDs instead of dropping them.

Lines 57-60 silently filter out misses from scenarioById(), so --scenario=01-dm-roundtrip,typo still exits after running only one scenario. That hides invocation mistakes and under-runs the suite. Fail fast with the unknown IDs before execution starts.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 57 - 60, selectScenarios currently drops unknown IDs by mapping flags.scenarioIds through scenarioById and filtering falsy results; instead detect any missing IDs and fail fast. Modify selectScenarios to compute the list of requested IDs (flags.scenarioIds), map them to scenarioById results, collect the IDs that produced no EvalScenario, and if any are missing throw an error (or process exit) listing the unknown IDs so the invocation fails before running any scenarios; otherwise return the resolved EvalScenario[] as before.

coderabbitai · 2026-06-12T08:46:18Z

+  for (const scenario of scenarios) {
+    if (scenario.harnessFilter && !scenario.harnessFilter.includes(cli)) continue;
+    const runs: ScenarioResult[] = [];
+    for (let i = 0; i < repeat; i++) {
+      process.stdout.write(`  [${cli}] ${scenario.id} (run ${i + 1}/${repeat})… `);
+      try {
+        const result = await runOnce(scenario, cli);
+        runs.push(result);
+        console.log(result.pass ? 'PASS' : 'FAIL');
+      } catch (err) {
+        console.log(`ERROR: ${(err as Error)?.message ?? err}`);
+        runs.push(failedResult(scenario, `error: ${(err as Error)?.message ?? err}`));
+      }
+    }
+    results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]);
+  }
+  return {
+    schemaVersion: SCHEMA_VERSION,
+    startedAt: startedAt.toISOString(),
+    durationMs: Date.now() - startedAt.getTime(),
+    harness: cli,
+    gitSha: gitSha(),
+    env: { realCli: process.env.RELAY_INTEGRATION_REAL_CLI === '1', repeat },
+    metrics: aggregateMetrics(results),
+    scenarios: results,
+  };


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip harness reports when no scenario actually ran.

If every selected scenario is excluded by scenario.harnessFilter, results stays empty and the downstream report shows perfect-looking rates with scenariosTotal = 0. That makes an unsupported harness appear healthy in JSON, HTML, and the matrix. Treat this as a skip/error instead of emitting a normal report.

Possible fix

async function runHarness( cli: string, scenarios: EvalScenario[], repeat: number, startedAt: Date ): Promise<EvalReport> { const results: ScenarioResult[] = []; for (const scenario of scenarios) { if (scenario.harnessFilter && !scenario.harnessFilter.includes(cli)) continue; @@ results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]); } + if (results.length === 0) { + throw new Error(`No applicable scenarios ran for harness "${cli}"`); + } return {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 113 - 138, After the scenario loop, detect when no scenarios actually ran (results.length === 0) due to all scenarios being filtered by scenario.harnessFilter for the current cli and treat that as a skip/error instead of emitting a normal report: add a guard right before building/returning the report (after the loop that calls runOnce and mergeRepeats) that either throws a descriptive Error(`no scenarios ran for harness ${cli}`) or returns a report object marked as skipped (e.g., add skipped: true and empty metrics) so downstream code doesn't treat an empty scenarios array as a healthy run; reference results, scenarios, scenario.harnessFilter, cli, runOnce, mergeRepeats, aggregateMetrics and SCHEMA_VERSION when implementing the check and handling.

coderabbitai · 2026-06-12T08:46:18Z

+  run: async (ctx): Promise<ScenarioResult> => {
+    const { harness, cli, suffix, sleep } = ctx;
+    const worker = `worker-${suffix}`;
+
+    await harness.spawnAgent(worker, cli, ['general'], { task: ROLE });
+    await sleep(STARTUP_MS);
+    harness.clearEvents();
+
+    await harness.sendMessage({
+      to: worker,
+      from: 'Orchestrator',
+      text: 'Can you work out the sum of all prime numbers below 30 and let me know the answer? I need it to continue.',
+    });
+
+    await waitForSends(harness, worker, 1, RESPONSE_MS);
+
+    const events = harness.getEvents();
+    const base = baseScore(events, [worker]);
+    const ackDone = scoreAckDone(events, worker);
+    const reported = sentDirectMessage(events, worker);
+    await harness.releaseAgent(worker).catch(() => {});
+
+    const pass = reported && base.phantoms.length === 0 && base.deliveryOk;
+
+    return {
+      id: scenario.id,
+      title: scenario.title,
+      pass,
+      agents: [{ name: worker, cli, role: 'worker', prompt: ROLE }],
+      transcript: base.transcript,
+      sent: base.sent,
+      expected: 1,
+      phantoms: base.phantoms,
+      totalIntents: base.totalIntents,
+      protocolAdherence: ackDone.score,
+      wrongChannelReplies: 0,
+      deliveryOk: base.deliveryOk,
+      events: base.events,
+      notes: `repliedPrivately=${reported} (DM back to requester) · ack=${ackDone.ackPresent} done=${ackDone.donePresent}`,
+    };
+  },
+};


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Shared root cause: non-finalized agent teardown across realistic scenarios.

tests/integration/broker/evals/scenarios/r01-incidental-report.ts, tests/integration/broker/evals/scenarios/r02-forget-to-report.ts, tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts, and tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts all release agents only on the success path. Wrap each run body in try/finally and perform all releaseAgent(...) calls in finally to prevent leaked agent processes and cross-scenario flakiness.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts` around lines 25 - 66, The run function in each scenario (e.g., the async run in r01-incidental-report.ts) currently calls harness.releaseAgent(...) only on the success path, causing leaked agents; wrap the entire run body in a try/finally where you call harness.spawnAgent(...) / harness.sendMessage(...) / waits in the try block and always call harness.releaseAgent(worker) in the finally block (catching errors from releaseAgent if needed) so agents are released regardless of failures; apply the same pattern to the other scenario files (r02-forget-to-report.ts, r03-proactive-handoff.ts, r04-channel-vs-dm.ts) to prevent cross-test flakiness.

coderabbitai · 2026-06-12T08:46:18Z

+    const handedOff = sentTo(events, author, reviewer);
+    await harness.releaseAgent(author).catch(() => {});
+    await harness.releaseAgent(reviewer).catch(() => {});
+
+    const pass = handedOff && base.phantoms.length === 0 && base.deliveryOk;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass criteria does not enforce DM-only handoff.

Scenario text says PASS requires author DM to reviewer, but sentTo(...) also accepts channel targets after normalization (e.g., #reviewer-*). That can produce false passes.

Suggested fix

- const handedOff = sentTo(events, author, reviewer); + const handedOff = events.some( + (e) => + e.kind === 'relay_inbound' && + e.from === author && + !e.target.startsWith('#') && + e.target.replace(/^[#@]/, '').trim().toLowerCase() === reviewer.toLowerCase() + );

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts` around lines 43 - 47, The pass check allows channel-targeted handoffs because sentTo(events, author, reviewer) treats normalized channel names like "`#reviewer-`*" as a match; update the pass criteria so it only counts a direct DM handoff to the reviewer. Replace or augment the handedOff check (the sentTo call) with a check that the event(s) from author target the reviewer actor directly (e.g., compare explicit recipient ID or channelType === 'dm' / recipient === reviewer.id) and reject matches that are channel names; keep the other conditions (base.phantoms.length === 0 && base.deliveryOk) unchanged.

coderabbitai · 2026-06-12T08:46:18Z

+  const finalPost = events
+    .filter((e): e is RelayInbound => e.kind === 'relay_inbound')
+    .find(
+      (e) =>
+        normalizeChannel(e.target) === normalizeChannel(finalChannel) &&
+        e.body.includes(payload)
+    );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constrain final payload validation to the expected relay sender.

scoreRelayChain currently marks payloadIntact true if any sender posts the payload to finalChannel. That can over-credit broken chains when unrelated traffic contains the same payload.

Suggested fix

export function scoreRelayChain( events: BrokerEvent[], hops: Array<{ from: string; to: string }>, payload: string, finalChannel: string ): { hopsCompleted: number; payloadIntact: boolean } { + const expectedFinalSender = hops.length > 0 ? hops[hops.length - 1]!.from : null; let hopsCompleted = 0; for (const hop of hops) { const sent = inboundFrom(events, hop.from).some( (e) => normalizeChannel(e.target) === normalizeChannel(hop.to) ); if (sent) hopsCompleted += 1; else break; } const finalPost = events .filter((e): e is RelayInbound => e.kind === 'relay_inbound') .find( (e) => + (expectedFinalSender === null || e.from === expectedFinalSender) && normalizeChannel(e.target) === normalizeChannel(finalChannel) && e.body.includes(payload) ); return { hopsCompleted, payloadIntact: Boolean(finalPost) }; }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/protocol.ts` around lines 115 - 121, scoreRelayChain currently treats payloadIntact true if any RelayInbound to finalChannel contains the payload; restrict this to only count posts from the expected relay sender. Update the finalPost filter (where RelayInbound events are selected) to also check that e.sender equals the expected relay identity (e.g., compare against the last relay/expected relay variable used in scoreRelayChain such as lastRelay.sender or expectedRelaySender) while keeping the existing normalizeChannel and payload contains checks.

coderabbitai · 2026-06-12T08:46:18Z

+import { eventsForAgent } from '../../utils/assert-helpers.js';
+
+/** Matches CSI sequences, OSC sequences, and private-mode toggles. */
+const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash node <<'NODE' const pattern = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; const input = '\x1b]8;;https://example.com\x1b\\link\x1b]8;;\x1b\\'; const output = input.replace(pattern, ''); console.log(JSON.stringify(output)); NODE

Repository: AgentWorkforce/relay

Length of output: 126

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the target file around the ANSI_PATTERN definition and its usage. FILE="tests/integration/broker/evals/scoring/stream-clean.ts" echo "=== File: $FILE (lines ~1-120) ===" nl -ba "$FILE" | sed -n '1,120p' echo echo "=== Find cleanStreamOutput and its callsites in this file ===" rg -n "cleanStreamOutput|ANSI_PATTERN|detectIntents" "$FILE" || true echo echo "=== Node regex behavior: BEL-terminated vs ST-terminated OSC (OSC 8 hyperlink) ===" node <<'NODE' const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; // BEL-terminated OSC: ESC ] ... BEL const oscBel = '\x1b]8;;https://example.com\x07' + 'link'; const outBel = oscBel.replace(ANSI_PATTERN, ''); console.log("BEL input matches:", JSON.stringify(oscBel)); console.log("BEL output:", JSON.stringify(outBel)); // ST-terminated OSC: ESC ] ... ESC \ const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\'; const outSt = oscSt.replace(ANSI_PATTERN, ''); console.log("ST input matches:", JSON.stringify(oscSt)); console.log("ST output:", JSON.stringify(outSt)); // Quick check: CSI should be removed (e.g., cursor movement) const csi = '\x1b[2K' + 'hello'; const outCsi = csi.replace(ANSI_PATTERN, ''); console.log("CSI input:", JSON.stringify(csi)); console.log("CSI output:", JSON.stringify(outCsi)); NODE

Repository: AgentWorkforce/relay

Length of output: 209

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/integration/broker/evals/scoring/stream-clean.ts" echo "=== File: $FILE (lines 1-160) ===" cat -n "$FILE" | sed -n '1,160p' echo echo "=== Find cleanStreamOutput and ANSI_PATTERN usage in this file ===" rg -n "cleanStreamOutput|ANSI_PATTERN|detectIntents" "$FILE" || true echo echo "=== Node regex behavior: BEL-terminated vs ST-terminated OSC (OSC 8 hyperlink) ===" node <<'NODE' const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; // BEL-terminated OSC: ESC ] ... BEL const oscBel = '\x1b]8;;https://example.com\x07' + 'link'; const outBel = oscBel.replace(ANSI_PATTERN, ''); console.log("BEL input:", JSON.stringify(oscBel)); console.log("BEL output:", JSON.stringify(outBel)); // ST-terminated OSC: ESC ] ... ESC \ const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\'; const outSt = oscSt.replace(ANSI_PATTERN, ''); console.log("ST input:", JSON.stringify(oscSt)); console.log("ST output:", JSON.stringify(outSt)); // CSI should be removed const csi = '\x1b[2K' + 'hello'; const outCsi = csi.replace(ANSI_PATTERN, ''); console.log("CSI output:", JSON.stringify(outCsi)); NODE

Repository: AgentWorkforce/relay

Length of output: 2391

🏁 Script executed:

#!/bin/bash set -euo pipefail node <<'NODE' const OLD = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; const NEW = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g; // BEL-terminated OSC const oscBel = '\x1b]8;;https://example.com\x07' + 'link' + '\x1b]8;;\x07'; // ST-terminated OSC const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\'; // CSI const csi = '\x1b[2K' + 'hello'; // Private-mode toggle (DECTCEM/DECSET style) const priv = '\x1b[?25h' + 'cursor' + '\x1b[?25l'; for (const [name, pattern] of [['OLD', OLD], ['NEW', NEW]]) { console.log('---', name, '---'); console.log("BEL output:", JSON.stringify(oscBel.replace(pattern, ''))); console.log("ST output :", JSON.stringify(oscSt.replace(pattern, ''))); console.log("CSI output:", JSON.stringify(csi.replace(pattern, ''))); console.log("Priv output:", JSON.stringify(priv.replace(pattern, ''))); } NODE

Repository: AgentWorkforce/relay

Length of output: 303

Handle ST-terminated OSC sequences too

ANSI_PATTERN only strips OSC sequences terminated by BEL (\x07). ST-terminated OSC sequences (ESC \, commonly used for OSC 8 hyperlinks) remain in stripAnsi() output, which can skew phantom detection/scoring.

Suggested fix

-const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; +const ANSI_PATTERN = + /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;

const ANSI_PATTERN =

/\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/stream-clean.ts` at line 15, ANSI_PATTERN currently only matches OSC sequences terminated by BEL (\x07) and thus leaves ST-terminated OSCs (ESC backslash) intact; update the regex constant ANSI_PATTERN to also match OSC sequences terminated by the String Terminator (ESC backslash) so stripAnsi() removes OSC 8 hyperlinks and similar sequences—specifically extend the pattern that matches OSC (\x1b\][^\x07]*\x07) to also accept sequences ending with ESC "\" (escape-backslash), ensuring the pattern covers both \x07 and ESC "\" terminators.

coderabbitai · 2026-06-12T08:46:18Z

+    if (evalWouldPass) {
+      console.error(
+        '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
+          'The eval is producing false greens.'
+      );
+      process.exit(1);
+    }
+    if (score.sent !== 0) {
+      console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
+      process.exit(1);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set +e node -e 'try { process.exit(1); } finally { console.log("finally-ran"); }' status=$? echo "exit_status=$status"

Repository: AgentWorkforce/relay

Length of output: 78

🏁 Script executed:

#!/bin/bash set -euo pipefail ls -la tests/integration/broker/evals/selftest.ts tests/integration/broker/utils/broker-harness.ts echo "---- selftest.ts (around lines 40-110) ----" nl -ba tests/integration/broker/evals/selftest.ts | sed -n '40,120p' echo "---- selftest.ts (end) ----" nl -ba tests/integration/broker/evals/selftest.ts | tail -n 80 echo "---- broker-harness.ts (around harness lifecycle / cleanup) ----" nl -ba tests/integration/broker/utils/broker-harness.ts | sed -n '1,200p'

Repository: AgentWorkforce/relay

Length of output: 361

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "---- selftest.ts (lines 1-140) ----" cat -n tests/integration/broker/evals/selftest.ts | sed -n '1,140p' echo "---- broker-harness.ts (first 260 lines) ----" cat -n tests/integration/broker/utils/broker-harness.ts | sed -n '1,260p' echo "---- broker-harness.ts (lines 260-end) ----" cat -n tests/integration/broker/utils/broker-harness.ts | sed -n '261,520p' || true

Repository: AgentWorkforce/relay

Length of output: 16258

Replace process.exit(1) in selftest.ts failure branches to ensure finally cleanup runs.

tests/integration/broker/evals/selftest.ts calls process.exit(1) inside the try (lines ~64 and ~68), which terminates the process immediately and bypasses the finally block—so harness.releaseAgent() / harness.stop() may not run, leaving the broker/agent running.

Suggested fix

if (evalWouldPass) { - console.error( - '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + - 'The eval is producing false greens.' - ); - process.exit(1); + throw new Error( + '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + + 'The eval is producing false greens.' + ); } if (score.sent !== 0) { - console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`); - process.exit(1); + throw new Error( + `✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.` + ); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (evalWouldPass) {

console.error(

'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +

'The eval is producing false greens.'

);

process.exit(1);

}

if (score.sent !== 0) {

console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);

process.exit(1);

if (evalWouldPass) {

throw new Error(

'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +

'The eval is producing false greens.'

);

}

if (score.sent !== 0) {

throw new Error(

`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`

);

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/selftest.ts` around lines 59 - 68, The failure branches in tests/integration/broker/evals/selftest.ts use process.exit(1) which aborts before the finally block can run; replace those calls inside the try (the checks for evalWouldPass and score.sent) with throwing an Error (e.g., throw new Error(...)) so the exception propagates and allows the finally block to run harness.releaseAgent() / harness.stop(); ensure the thrown Error messages contain the same descriptive text currently passed to console.error so test harness teardown still executes and the process will exit non‑zero after cleanup.

agent-relay-code · 2026-06-12T08:51:59Z

Implemented scoped fixes for PR #1109:

Synced package-lock.json with the new @agent-assistant/telemetry dependency; the original npm ci failure was a real PR blocker.
Updated package.json so eval:build builds SDK and harness-driver declarations before compiling broker evals.
Added missing @agent-relay/harness-driver path mappings to tests/integration/broker/tsconfig.json, matching the changed broker-harness.ts import.

Addressed comments

No bot/reviewer comment files or comment metadata were present under .workforce/; .workforce/context.json only contained PR metadata, so there were no external review comments to validate or fix.

Advisory Notes

None.

Validation performed:

npm install --package-lock-only --ignore-scripts --no-audit --no-fund passed.
npm run evals:compile passed and compiled 124 cases.
package.json, package-lock.json, and broker tsconfigs parse successfully.
tsc --showConfig confirms both broker tsconfigs now include @agent-relay/harness-driver path mappings.

Blocked validation:

Full npm ci, npm run evals:offline, npm run eval:build, and npm run eval:unit could not be completed because this sandbox repeatedly terminated dependency extraction before installing TypeScript/Vitest. Initial npm ci did reproduce the lockfile mismatch; after fixing the lockfile, the environment still killed installs before .bin creation, so I cannot claim full CI verification.

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (3)

tests/integration/broker/evals/runner.ts (3)

146-169: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip harness reports when no scenario actually ran.

If every selected scenario is excluded by scenario.harnessFilter, results stays empty and the downstream report shows perfect-looking rates with scenariosTotal = 0. That makes an unsupported harness appear healthy in JSON, HTML, and the matrix. Treat this as a skip/error instead of emitting a normal report.

🚨 Guard against empty results

   for (const scenario of scenarios) {
     if (scenario.harnessFilter && !scenario.harnessFilter.includes(spec.cli)) continue;
     ...
     results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]);
   }
+  if (results.length === 0) {
+    throw new Error(`No applicable scenarios ran for harness "${label}"`);
+  }
   return {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/runner.ts` around lines 146 - 169, runHarness
currently returns a normal EvalReport even when no scenarios actually ran
(because every scenario was filtered out), producing misleading "perfect"
results; after the loop over scenarios check if results.length === 0 and instead
return an EvalReport that marks the harness as skipped/errored (e.g., set a
skipped flag or a non-zero failure/skip count and include a clear message) so
downstream consumers see the skip; update the return path in runHarness to use
the same EvalReport shape as normal returns (using startedAt and any existing
metadata) but populate scenariosTotal/summary to reflect the skip and include an
explanatory message so JSON/HTML/matrix outputs don't show a healthy run when
nothing executed.

84-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unknown --scenario IDs instead of dropping them.

Line 85 silently filters out misses from scenarioById(), so --scenario=01-dm-roundtrip,typo exits after running only one scenario. That hides invocation mistakes and under-runs the suite. Fail fast with unknown IDs before execution starts.

🚨 Validate before filtering

 function selectScenarios(flags: Flags): EvalScenario[] {
   if (flags.scenarioIds) {
-    return flags.scenarioIds.map(scenarioById).filter((s): s is EvalScenario => Boolean(s));
+    const resolved = flags.scenarioIds.map((id) => ({ id, scenario: scenarioById(id) }));
+    const missing = resolved.filter((r) => !r.scenario).map((r) => r.id);
+    if (missing.length > 0) {
+      console.error(`Unknown scenario IDs: ${missing.join(', ')}`);
+      process.exit(2);
+    }
+    return resolved.map((r) => r.scenario!);
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/runner.ts` around lines 84 - 86, The current
branch quietly drops unknown IDs when mapping flags.scenarioIds with
scenarioById, which hides invalid CLI input; change the logic in the runner
(where flags.scenarioIds is handled) to first map flags.scenarioIds to results
via scenarioById, collect any missing entries (where result is undefined), and
if any unknown IDs exist throw an error (or exit non-zero) listing those unknown
IDs; otherwise return the mapped array as EvalScenario. This ensures
flags.scenarioIds, scenarioById, and EvalScenario are used to validate and fail
fast on bad IDs.

312-323: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't compare every harness against the same baseline file.

Line 314 reuses one --baseline report for each harness in the loop. A baseline captured for claude will also be compared against codex, gemini, and grok, and compareReports() does not validate harness identity before scoring deltas. That can manufacture regressions and fail the run incorrectly. Require exactly one harness for --baseline, or load a matrix baseline and compare by harness.

🚨 Validate baseline harness match

     if (flags.baseline) {
       try {
-        const deltas = compareReports(readReport(flags.baseline), report);
+        const baselineReport = readReport(flags.baseline);
+        if (baselineReport.harness !== label) {
+          console.warn(`  baseline harness mismatch: baseline=${baselineReport.harness}, current=${label} — skipping comparison`);
+        } else {
+          const deltas = compareReports(baselineReport, report);

Or require single harness when --baseline is provided:

   const scenarios = selectScenarios(flags);
+  if (flags.baseline && flags.harnesses.length > 1) {
+    console.error('--baseline requires exactly one --harness (or use a matrix baseline)');
+    process.exit(2);
+  }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/runner.ts` around lines 312 - 323, The
baseline comparison currently reuses a single baseline file for every harness
inside the loop (flags.baseline + compareReports(readReport(flags.baseline),
report)), which can misattribute regressions; change the logic to either (A)
require exactly one harness when flags.baseline is provided by validating the
harness count before entering the loop and erroring if >1, or (B) support a
harness-keyed baseline matrix by loading the baseline via
readReport(flags.baseline) and selecting the per-harness baseline (matching
harness id/name) before calling compareReports; implement the
validation/selection near the existing baseline branch where flags.baseline is
checked and use harness identity (the same identifier used to produce report) to
look up the correct baseline instead of reusing the same file for every harness.

🧹 Nitpick comments (2)

tests/integration/broker/evals/scenarios/s01-spawn-worker.ts (1)

36-36: ⚡ Quick win

Remove dead ternary branch.

Both branches of the ternary return 'realistic'. If all onboarding variants should use the same tier, remove the conditional; otherwise, distinguish the 'bare' variant from the others.

♻️ Simplify to constant

-    tier: onboarding === 'bare' ? 'realistic' : 'realistic',
+    tier: 'realistic',

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts` at line 36, The
ternary expression assigning tier is dead (both branches return 'realistic');
replace the conditional with a single constant assignment by setting the tier
property to 'realistic' directly (update the object property "tier" in the
s01-spawn-worker scenario where the current "tier: onboarding === 'bare' ?
'realistic' : 'realistic'," appears); if instead the 'bare' onboarding should
differ, change the second branch to the correct value for non-'bare'
cases—otherwise remove the ternary and use the constant.

tests/integration/broker/evals/scoring/metrics.ts (1)

41-48: ⚡ Quick win

Remove redundant nullish coalescing inside undefined guards.

Lines 43 and 47 use (r.spawnCount ?? 0) and (r.releaseCount ?? 0) after already confirming these fields are not undefined on lines 41 and 45. The ?? 0 fallback will never execute in this context.

♻️ Simplify the conditionals

     // Lifecycle metrics: any scenario that measured spawn/release.
     if (r.spawnCount !== undefined) {
       spawnTotal += 1;
-      if ((r.spawnCount ?? 0) > 0) spawnPassed += 1;
+      if (r.spawnCount > 0) spawnPassed += 1;
     }
     if (r.releaseCount !== undefined) {
       releaseTotal += 1;
-      if ((r.releaseCount ?? 0) > 0) releasePassed += 1;
+      if (r.releaseCount > 0) releasePassed += 1;
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/metrics.ts` around lines 41 - 48, The
checks for r.spawnCount and r.releaseCount already ensure these fields are not
undefined, so the nullish coalescing (r.spawnCount ?? 0) and (r.releaseCount ??
0) are redundant; replace those with direct comparisons (e.g., use r.spawnCount
> 0 and r.releaseCount > 0) in the blocks that update spawnPassed/releasePassed
to simplify the conditionals while keeping the existing guards that increment
spawnTotal/releaseTotal.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/broker/evals/runner.ts`:
- Around line 132-140: The test helper runOnce sets OPENCODE_MODEL when
spec.model is present but silently ignores model overrides for non-opencode
harnesses (e.g., "--harness=claude:gpt-4"); update runOnce to validate the
override scope by checking spec.model against spec.cli (e.g., ensure spec.cli
indicates opencode before setting OPENCODE_MODEL) and throw a clear error or
assertion if a model override is provided for a non-opencode harness (reference
runOnce, OPENCODE_MODEL, spec.model, and BrokerHarness), or alternatively add a
concise comment documenting that model overrides apply only to the opencode
harness.
- Line 239: The test hardcodes the onboarding variants with const variants =
['bare', 'one-liner', 'brief', 'skill'], which can drift from the canonical
list; import and use the exported ONBOARDING_VARIANTS constant from
onboarding.ts instead of the literal array (add the import at the top of the
file and replace the const variants assignment with ONBOARDING_VARIANTS so the
test matrix always matches the source of truth).

In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts`:
- Around line 56-60: The predicate passed to harness.waitForEvent for
'agent_spawned' (used when creating spawnWaiter) unsafely casts the event to {
parent?: string }; replace it with a runtime guard that first verifies the event
is a non-null object and that it has a 'parent' property of type string before
comparing to lead (or extract parent only after the guard). Update the predicate
used in harness.waitForEvent (the anonymous arrow function) to perform this
check so malformed events won't cause silent failures or exceptions.

In `@tests/integration/broker/evals/scenarios/s02-release-worker.ts`:
- Around line 50-54: The predicate passed to harness.waitForEvent for
'agent_released' currently casts the event to { name: string } without checking
shape; change the predicate used in releaseWaiter to first guard that the event
is an object and has a string 'name' property (e.g., typeof e === 'object' && e
!== null && typeof (e as any).name === 'string') before comparing it to the
worker variable so the comparison is safe and won't throw if the broker emits a
different structure; update the predicate inside harness.waitForEvent (the
callback passed to releaseWaiter) accordingly.

In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts`:
- Around line 100-104: The notes assembly can interpolate a null workerName
(used in the template strings for `spawned ${workerName} but never released` and
`spawned+released ${workerName}`); change the logic around `notesParts` to guard
against null by either early-returning when `workerName` is null or substituting
a safe fallback like '<unknown>' before interpolation, ensuring checks around
`spawn.phantomSpawn`, `spawnOk`, `releaseOk`, and `pass` still behave the same
and that `notesParts` never receives a template that interpolates a null
`workerName`.
- Around line 56-65: The code assumes the event returned by spawnWaiter.promise
contains a name and casts directly, which can fail; modify the promise handler
for harness.waitForEvent('agent_spawned', ...) (where workerName is set) to
validate the event shape at runtime before assigning: check that the resolved
event has a 'name' property and typeof name === 'string' (and optionally parent
matches), only then set workerName, otherwise handle the unexpected shape
(reject, log, or ignore) to avoid unsafe casting and runtime errors.

---

Duplicate comments:
In `@tests/integration/broker/evals/runner.ts`:
- Around line 146-169: runHarness currently returns a normal EvalReport even
when no scenarios actually ran (because every scenario was filtered out),
producing misleading "perfect" results; after the loop over scenarios check if
results.length === 0 and instead return an EvalReport that marks the harness as
skipped/errored (e.g., set a skipped flag or a non-zero failure/skip count and
include a clear message) so downstream consumers see the skip; update the return
path in runHarness to use the same EvalReport shape as normal returns (using
startedAt and any existing metadata) but populate scenariosTotal/summary to
reflect the skip and include an explanatory message so JSON/HTML/matrix outputs
don't show a healthy run when nothing executed.
- Around line 84-86: The current branch quietly drops unknown IDs when mapping
flags.scenarioIds with scenarioById, which hides invalid CLI input; change the
logic in the runner (where flags.scenarioIds is handled) to first map
flags.scenarioIds to results via scenarioById, collect any missing entries
(where result is undefined), and if any unknown IDs exist throw an error (or
exit non-zero) listing those unknown IDs; otherwise return the mapped array as
EvalScenario. This ensures flags.scenarioIds, scenarioById, and EvalScenario are
used to validate and fail fast on bad IDs.
- Around line 312-323: The baseline comparison currently reuses a single
baseline file for every harness inside the loop (flags.baseline +
compareReports(readReport(flags.baseline), report)), which can misattribute
regressions; change the logic to either (A) require exactly one harness when
flags.baseline is provided by validating the harness count before entering the
loop and erroring if >1, or (B) support a harness-keyed baseline matrix by
loading the baseline via readReport(flags.baseline) and selecting the
per-harness baseline (matching harness id/name) before calling compareReports;
implement the validation/selection near the existing baseline branch where
flags.baseline is checked and use harness identity (the same identifier used to
produce report) to look up the correct baseline instead of reusing the same file
for every harness.

---

Nitpick comments:
In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts`:
- Line 36: The ternary expression assigning tier is dead (both branches return
'realistic'); replace the conditional with a single constant assignment by
setting the tier property to 'realistic' directly (update the object property
"tier" in the s01-spawn-worker scenario where the current "tier: onboarding ===
'bare' ? 'realistic' : 'realistic'," appears); if instead the 'bare' onboarding
should differ, change the second branch to the correct value for non-'bare'
cases—otherwise remove the ternary and use the constant.

In `@tests/integration/broker/evals/scoring/metrics.ts`:
- Around line 41-48: The checks for r.spawnCount and r.releaseCount already
ensure these fields are not undefined, so the nullish coalescing (r.spawnCount
?? 0) and (r.releaseCount ?? 0) are redundant; replace those with direct
comparisons (e.g., use r.spawnCount > 0 and r.releaseCount > 0) in the blocks
that update spawnPassed/releasePassed to simplify the conditionals while keeping
the existing guards that increment spawnTotal/releaseTotal.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 911d0907-2720-4765-8669-73bed72e3a8d

📥 Commits

Reviewing files that changed from the base of the PR and between b4d1170 and 7cca70a.

📒 Files selected for processing (11)

package.json
tests/integration/broker/evals/report/write.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/onboarding.ts
tests/integration/broker/evals/scenarios/s01-spawn-worker.ts
tests/integration/broker/evals/scenarios/s02-release-worker.ts
tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts
tests/integration/broker/evals/scoring/lifecycle.ts
tests/integration/broker/evals/scoring/metrics.ts
tests/integration/broker/evals/types.ts

✅ Files skipped from review due to trivial changes (1)

tests/integration/broker/evals/scoring/lifecycle.ts

🚧 Files skipped from review as they are similar to previous changes (3)

tests/integration/broker/evals/types.ts
tests/integration/broker/evals/report/write.ts
package.json

coderabbitai · 2026-06-12T08:56:46Z

+    }
+  }
+
+  const variants = ['bare', 'one-liner', 'brief', 'skill'];


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Avoid hardcoding onboarding variant list.

Line 239 hardcodes ['bare', 'one-liner', 'brief', 'skill']. If ONBOARDING_VARIANTS in onboarding.ts changes, this matrix will silently diverge. Import and reuse ONBOARDING_VARIANTS instead.

♻️ Import the canonical list

+import { ONBOARDING_VARIANTS } from './scenarios/onboarding.js'; ... function printLifecycleMatrix(allReports: Array<{ harness: string; report: EvalReport }>): void { ... - const variants = ['bare', 'one-liner', 'brief', 'skill']; + const variants = ONBOARDING_VARIANTS;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` at line 239, The test hardcodes the onboarding variants with const variants = ['bare', 'one-liner', 'brief', 'skill'], which can drift from the canonical list; import and use the exported ONBOARDING_VARIANTS constant from onboarding.ts instead of the literal array (add the import at the top of the file and replace the const variants assignment with ONBOARDING_VARIANTS so the test matrix always matches the source of truth).

coderabbitai · 2026-06-12T08:56:47Z

+      const releaseWaiter = harness.waitForEvent(
+        'agent_released',
+        RESPONSE_MS,
+        (e) => (e as { name: string }).name === worker
+      );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate event structure before casting.

Line 53 casts to { name: string } without confirming the event shape. If the broker emits a different agent_released structure, the name comparison may fail or throw.

🛡️ Add runtime guard

const releaseWaiter = harness.waitForEvent( 'agent_released', RESPONSE_MS, - (e) => (e as { name: string }).name === worker + (e) => { + const evt = e as Record<string, unknown>; + return typeof evt.name === 'string' && evt.name === worker; + } );

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s02-release-worker.ts` around lines 50 - 54, The predicate passed to harness.waitForEvent for 'agent_released' currently casts the event to { name: string } without checking shape; change the predicate used in releaseWaiter to first guard that the event is an object and has a string 'name' property (e.g., typeof e === 'object' && e !== null && typeof (e as any).name === 'string') before comparing it to the worker variable so the comparison is safe and won't throw if the broker emits a different structure; update the predicate inside harness.waitForEvent (the callback passed to releaseWaiter) accordingly.

coderabbitai · 2026-06-12T08:56:47Z

+      const spawnWaiter = harness.waitForEvent(
+        'agent_spawned',
+        RESPONSE_MS,
+        (e) => (e as { parent?: string }).parent === lead
+      );
+      await spawnWaiter.promise
+        .then((e) => {
+          workerName = (e as { name: string }).name;
+        })
+        .catch(() => {});


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate event structure before casting.

Line 63 casts to { name: string } without confirming the event shape. If the broker emits a different structure, extracting .name may fail or capture the wrong value.

🛡️ Add runtime guard

await spawnWaiter.promise .then((e) => { - workerName = (e as { name: string }).name; + const evt = e as Record<string, unknown>; + if (typeof evt.name === 'string') { + workerName = evt.name; + } }) .catch(() => {});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts` around lines 56 - 65, The code assumes the event returned by spawnWaiter.promise contains a name and casts directly, which can fail; modify the promise handler for harness.waitForEvent('agent_spawned', ...) (where workerName is set) to validate the event shape at runtime before assigning: check that the resolved event has a 'name' property and typeof name === 'string' (and optionally parent matches), only then set workerName, otherwise handle the unexpected shape (reject, log, or ignore) to avoid unsafe casting and runtime errors.

coderabbitai · 2026-06-12T08:56:47Z

+      const notesParts: string[] = [];
+      if (spawn.phantomSpawn) notesParts.push('phantom spawn');
+      if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`);
+      if (!spawnOk) notesParts.push('no spawn');
+      if (pass) notesParts.push(`spawned+released ${workerName}`);


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against null workerName in notes.

Line 102 interpolates workerName which may be null if the spawn event arrived but the promise handler failed or if scoring logic diverges from the captured name. Add a fallback or early-return guard.

🛡️ Add null guard

const notesParts: string[] = []; if (spawn.phantomSpawn) notesParts.push('phantom spawn'); - if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`); + if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName ?? 'unknown'} but never released`); if (!spawnOk) notesParts.push('no spawn'); - if (pass) notesParts.push(`spawned+released ${workerName}`); + if (pass) notesParts.push(`spawned+released ${workerName ?? 'unknown'}`);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const notesParts: string[] = [];

if (spawn.phantomSpawn) notesParts.push('phantom spawn');

if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`);

if (!spawnOk) notesParts.push('no spawn');

if (pass) notesParts.push(`spawned+released ${workerName}`);

const notesParts: string[] = [];

if (spawn.phantomSpawn) notesParts.push('phantom spawn');

if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName ?? 'unknown'} but never released`);

if (!spawnOk) notesParts.push('no spawn');

if (pass) notesParts.push(`spawned+released ${workerName ?? 'unknown'}`);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts` around lines 100 - 104, The notes assembly can interpolate a null workerName (used in the template strings for `spawned ${workerName} but never released` and `spawned+released ${workerName}`); change the logic around `notesParts` to guard against null by either early-returning when `workerName` is null or substituting a safe fallback like '<unknown>' before interpolation, ensuring checks around `spawn.phantomSpawn`, `spawnOk`, `releaseOk`, and `pass` still behave the same and that `notesParts` never receives a template that interpolates a null `workerName`.

coderabbitai

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

crates/broker/src/runtime/api.rs (1)

214-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor skip_relay_prompt for the small-model skill injection.

This path still prepends relay guidance even when the caller explicitly set skip_relay_prompt. That changes the task text for requests that asked for no relay prompt augmentation, which can skew evals and custom prompts. Guard the prefix behind !skip_relay_prompt.

Suggested fix

-                        if let Some(prefix) = model_skill_prefix(effective_spec.model.as_deref()) {
-                            effective_task = Some(match effective_task {
-                                Some(task) => format!("{prefix}\n\n{task}"),
-                                None => prefix.to_string(),
-                            });
-                            tracing::debug!(
-                                agent = %name,
-                                model = ?effective_spec.model,
-                                "injected relay skill prefix for small-tier model"
-                            );
+                        if !skip_relay_prompt {
+                            if let Some(prefix) = model_skill_prefix(effective_spec.model.as_deref()) {
+                                effective_task = Some(match effective_task {
+                                    Some(task) => format!("{prefix}\n\n{task}"),
+                                    None => prefix.to_string(),
+                                });
+                                tracing::debug!(
+                                    agent = %name,
+                                    model = ?effective_spec.model,
+                                    "injected relay skill prefix for small-tier model"
+                                );
+                            }
                         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/runtime/api.rs` around lines 214 - 233, The code always
injects the relay skill prefix into effective_task regardless of the caller's
intent; update the logic that prepends
model_skill_prefix(effective_spec.model.as_deref()) so it only runs when
skip_relay_prompt is false (i.e. guard the injection with a check like if
!skip_relay_prompt && let Some(prefix) = model_skill_prefix(...)). Modify the
block that mutates effective_task (and emits the tracing::debug!) to first
ensure skip_relay_prompt is honored, using the existing symbols effective_spec,
effective_task, model_skill_prefix, skip_relay_prompt, and name.

tests/integration/broker/evals/scenarios/onboarding.ts (1)

30-59: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't hardcode cli: "claude" in shared onboarding text.

These prompts are reused by scenarios that already carry a runtime-selected ctx.cli, but the brief and skill variants instruct agents to spawn workers with cli: "claude" unconditionally. That skews non-Claude evals and can turn a compliant agent into a false failure.

Proposed refactor

-export function onboardingText(variant: OnboardingVariant): string {
+export function onboardingText(variant: OnboardingVariant, cli: string): string {
   switch (variant) {
@@
     case 'brief':
       return `
 
 ## Agent management
-- Spawn a worker: mcp__agent-relay__add_agent({ name, cli: "claude", task })
+- Spawn a worker: mcp__agent-relay__add_agent({ name, cli: "${cli}", task })
 - Release a worker: mcp__agent-relay__remove_agent({ name })
 Spawn when a task needs dedicated focus. Release as soon as the worker reports done.`;
@@
 ### Spawn a worker
 To delegate work, call:
-  mcp__agent-relay__add_agent({ name: "WorkerName", cli: "claude", task: "detailed instructions" })
+  mcp__agent-relay__add_agent({ name: "WorkerName", cli: "${cli}", task: "detailed instructions" })
 
-Required fields: name (unique string), cli ("claude"), task (full instructions for the worker).
+Required fields: name (unique string), cli ("${cli}"), task (full instructions for the worker).

Then thread the current scenario CLI through the callers instead of baking one provider into the shared prompt text.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scenarios/onboarding.ts` around lines 30 - 59,
The onboarding prompt cases 'brief' and 'skill' hardcode cli: "claude" in the
spawn examples; change them to use the runtime-selected CLI instead (e.g.,
replace the literal cli: "claude" with a placeholder injected from the scenario
context such as cli: "${ctx.cli}" or format the string with the caller-provided
cli) and update callers to thread the scenario CLI into these prompt templates
so non-Claude evals receive the correct provider value; ensure the worker
examples in both the 'brief' and 'skill' branches reference the dynamic value
rather than the fixed "claude".

tests/integration/broker/evals/scoring/lifecycle.ts (1)

75-80: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't treat an empty expected-worker list as "match every release".

When workerNames is empty, scoreRelease currently accepts every agent_released event. That lets harness teardown of the lead agent flip releaseConfirmed to true, and scoreLifecycle(..., expectedWorkers = []) makes that false positive the default path.

Proposed fix

 export function scoreRelease(events: BrokerEvent[], workerNames: string[]): ReleaseScore {
+  if (workerNames.length === 0) {
+    return {
+      releaseConfirmed: false,
+      releasedNames: [],
+      releaseCount: 0,
+    };
+  }
+
   const nameSet = new Set(workerNames);
   const releaseEvents = events.filter(
     (e): e is Extract<BrokerEvent, { kind: 'agent_released' }> =>
-      e.kind === 'agent_released' && (workerNames.length === 0 || nameSet.has(e.name))
+      e.kind === 'agent_released' && nameSet.has(e.name)
   );

Also applies to: 95-102

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/lifecycle.ts` around lines 75 - 80,
scoreRelease currently treats an empty workerNames array as "match every
release" causing false positives; change the filter in scoreRelease to require
workerNames.length > 0 AND nameSet.has(e.name) (i.e., only match releases when
an expected worker list is provided and the event name is in nameSet), or
explicitly return an empty releaseEvents when workerNames is empty; apply the
same change to the analogous workerName-based filter later in the file (the
other lifecycle scoring filter that uses workerNames) so empty expected-workers
never matches all agent_released events.

tests/integration/broker/evals/scenarios/04-relay-chain.ts (1)

20-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always release spawned agents from finally in these real-CLI scenarios.

Both tests/integration/broker/evals/scenarios/04-relay-chain.ts and tests/integration/broker/evals/scenarios/s04-no-native-subagents.ts only call releaseAgent(...) on the happy path. Any failure in spawnAgent, sendMessage, waiter timeouts, or scoring skips teardown and leaves real CLI agents behind, which can bleed into later evals.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts` around lines 20 -
89, The test's run function currently releases agents only on the happy path;
move agent teardown into a finally block so spawned CLI agents are always
released even on errors/timeouts. Wrap the main test flow in try { ... } finally
{ ... } inside run, track successful spawns (e.g., push agent names after each
harness.spawnAgent call) and in finally call harness.releaseAgent for each
tracked name (use Promise.allSettled or catch each release) to ensure cleanup
regardless of failures in spawnAgent, sendMessage, waiters, or scoring
(references: run, harness.spawnAgent, harness.releaseAgent,
finalWaiter/waitForEvent).

🧹 Nitpick comments (4)

packages/cli/src/auto/classifier.ts (2)

83-89: ⚡ Quick win

Document the parallelizability threshold reasoning.

The condition parallelSignals >= 2 || domains.length >= 3 uses specific thresholds (2 signals, 3 domains) without explanation. Adding a brief inline comment would improve maintainability and help future calibration based on eval results.

📝 Suggested documentation

   const domains = detectDomains(task);
   const parallelSignals = countParallelSignals(task);
+  // Threshold: 2+ parallel signals OR 3+ domains indicates task can be split.
+  // Calibrated empirically; adjust based on routing eval results.
   const parallelizable = parallelSignals >= 2 || domains.length >= 3;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/cli/src/auto/classifier.ts` around lines 83 - 89, Add a concise
inline comment next to the parallelizability calculation in classifyTask
explaining the chosen thresholds: why parallelSignals >= 2 and domains.length >=
3 are used to mark tasks as parallelizable (e.g., two or more explicit parallel
cues or three or more distinct domains implying independent sub-tasks), and note
that these values are heuristics subject to calibration via evaluation
metrics—update the comment near the parallelizable assignment referencing the
variables parallelSignals and domains so future maintainers know the rationale
and where to adjust thresholds.

90-101: ⚡ Quick win

Extract word-count thresholds as named constants.

The hardcoded values 100 and 40 for complexity boundaries would benefit from extraction as named constants (e.g., HIGH_COMPLEXITY_WORD_THRESHOLD, MEDIUM_COMPLEXITY_WORD_THRESHOLD) to improve readability and make tuning easier based on eval results.

♻️ Proposed refactor

+const HIGH_COMPLEXITY_WORD_THRESHOLD = 100;
+const MEDIUM_COMPLEXITY_WORD_THRESHOLD = 40;
+const HIGH_COMPLEXITY_DOMAIN_THRESHOLD = 4;
+const MEDIUM_COMPLEXITY_DOMAIN_THRESHOLD = 2;
+
 export function classifyTask(task: string): TaskAssessment {
   // ... existing code ...
   
   // ── Complexity ──────────────────────────────────────────────────────────────
   let complexity: TaskAssessment['complexity'];
   const highHit = HIGH_COMPLEXITY_KEYWORDS.some((kw) => lower.includes(kw));
   const medHit  = MEDIUM_COMPLEXITY_KEYWORDS.some((kw) => lower.includes(kw));
 
-  if (highHit || words > 100 || domains.length >= 4) {
+  if (highHit || words > HIGH_COMPLEXITY_WORD_THRESHOLD || domains.length >= HIGH_COMPLEXITY_DOMAIN_THRESHOLD) {
     complexity = 'high';
-  } else if (medHit || words > 40 || domains.length >= 2) {
+  } else if (medHit || words > MEDIUM_COMPLEXITY_WORD_THRESHOLD || domains.length >= MEDIUM_COMPLEXITY_DOMAIN_THRESHOLD) {
     complexity = 'medium';
   } else {
     complexity = 'low';
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/cli/src/auto/classifier.ts` around lines 90 - 101, Extract the magic
numbers used for word-count thresholds into named constants and use them in the
complexity decision logic: introduce constants like
HIGH_COMPLEXITY_WORD_THRESHOLD and MEDIUM_COMPLEXITY_WORD_THRESHOLD and replace
the literals 100 and 40 in the complexity calculation block (the logic that sets
the variable complexity, which also references HIGH_COMPLEXITY_KEYWORDS,
MEDIUM_COMPLEXITY_KEYWORDS, words and domains.length) so the comparisons read
words > HIGH_COMPLEXITY_WORD_THRESHOLD and words >
MEDIUM_COMPLEXITY_WORD_THRESHOLD; keep the existing boolean hits (highHit,
medHit) and domain checks unchanged.

specs/auto-routing.md (1)

78-90: 💤 Low value

Add language specifiers to fenced code blocks.

Markdownlint flags three fenced code blocks (lines 80-90, 113-119, 133-144) for missing language specifiers. Adding ```text or ```plaintext would satisfy the linter and improve rendering consistency.

📝 Suggested fixes

 **Classifier prompt** (compressed):
 
-```
+```text
 You are a task router. Given a task description, assess:
 ...

Routing table (derived from eval data):

- +text
complexity=low, parallel=false → lead: sonnet/one-liner, workers: [haiku×1]
...


This is critical: pre-composing the team means the lead's job is coordination, not planning. The lead doesn't have to decide whether to delegate — it's already been decided.

-```
+```text
You are Director, leading a {N}-worker team on this task:
...

Also applies to: 111-119, 131-144

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/auto-routing.md` around lines 78 - 90, Several fenced code blocks in
the "Classifier prompt" and the later "Routing table" and "Director" sections
are missing language specifiers causing markdownlint warnings; update each
triple-backtick fence around the prompt, routing table examples, and Director
example (the blocks containing the compressed classifier prompt, the
routing-table snippet, and the Director/team snippet) to use a language token
such as ```text or ```plaintext so they validate and render consistently. Ensure
you replace the opening ``` with ```text (or ```plaintext) for all three blocks
referenced in the comment.

Source: Linters/SAST tools

packages/cli/src/auto/composer.ts (1)

72-81: ⚡ Quick win

Consider extracting the worker task template to a shared builder.

The worker task format at line 79 is later parsed via string manipulation in director-prompt.ts line 32. This creates fragile coupling where format changes break the parsing logic.

Extracting a shared buildWorkerTask(domain, subtask, originalTask) function would eliminate duplication and make the format contract explicit.

♻️ Proposed refactor

Create a shared helper:

// In a shared utilities file or in composer.ts
function buildWorkerTask(domain: string, subtask: string, originalTask: string): string {
  return `You are a specialised ${domain} worker. Your task:\n\n${originalTask}\n\nFocus exclusively on the ${subtask}. Report DONE when complete with a concise summary.`;
}

function extractWorkerDomain(task: string): string {
  // Extract domain from the standard format
  const match = task.match(/You are a specialised (\w+) worker/);
  return match?.[1] ?? 'general';
}

Then use in composer.ts:

return {
  role: `Worker-${domain.charAt(0).toUpperCase() + domain.slice(1)}`,
  model: row.workerModel,
  task: buildWorkerTask(domain, subtask, originalTask),
};

And in director-prompt.ts:

const workerList = workers
  .map((w, i) => {
    const domain = extractWorkerDomain(w.task);
    return `  ${i + 1}. **${w.role}** (${w.model} relay worker): ${domain} work`;
  })
  .join('\n');

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/cli/src/auto/composer.ts` around lines 72 - 81, The worker task
string is built inline in the composer mapping, which couples formatting to
fragile string parsing elsewhere; extract a shared builder function
buildWorkerTask(domain, subtask, originalTask) that returns the exact task
template and replace the inline template in the WorkerSpec creation (the mapping
that sets role/model/task) to call this builder; also add a complementary
extractor extractWorkerDomain(task) (used by director-prompt.ts instead of
ad-hoc regexes) so parsing and generation share the same contract, and update
both composer usage and director-prompt.ts to call these helpers.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/using-agent-relay/SKILL.md:
- Around line 91-93: Several fenced code examples (e.g., the
mcp__agent-relay__send_dm(...) snippet and other examples in the same section)
are missing fence language tags; update each fenced block (including the
examples around lines with mcp__agent-relay__send_dm and the other listed
ranges) to include an appropriate language tag like ```text or ```bash so
markdownlint MD040 is satisfied; search for the literal snippets (e.g.,
mcp__agent-relay__send_dm) to locate each block and add the language identifier
to the opening triple backticks.
- Line 3: The description uses the wrong tool name "agent_register" which
doesn't match the actual API; update the text to reference the correct tool name
"register_agent" (or fully-qualified "mcp__agent-relay__register_agent") so
readers can find the real call, and scan the skill for any other occurrences of
"agent_register" and replace them with the correct identifier (register_agent /
mcp__agent-relay__register_agent).

In `@CHANGELOG.md`:
- Line 12: Split the two long CHANGELOG bullets into multiple short, user-facing
bullets (one change per bullet) and remove lower-level implementation details;
for the lifecycle entry (referencing eval:lifecycle,
eval:lifecycle:claude-models, eval:lifecycle:full, add_agent/remove_agent,
broker events) create separate bullets for the lifecycle eval layer,
phantom-spawn detection (broker lifecycle events distinguishing phantom vs real
spawns), and the HTML dashboard/CI per-model/variant spawn-rate visuals, each
written impact-first and concise; likewise for the messaging entry (referencing
npm run eval, eval:matrix, ACK/DONE, messaging-rate, wrong-channel, self-test,
toolcheck, JSON/HTML reports, baseline diffing) break it into distinct bullets
for the messaging eval suite, scoring metrics (message rate, phantom-message
rate, ACK/DONE adherence, wrong-channel detection), scenario tiers (realistic,
smoke, negative-control), and report output formats, and strip implementation
details like “via broker events” or “call add_agent” so each bullet describes
only the user-visible change.

In `@packages/cli/src/auto/classifier.ts`:
- Around line 25-43: The keyword lists HIGH_COMPLEXITY_KEYWORDS,
MEDIUM_COMPLEXITY_KEYWORDS and PARALLEL_PHRASES contain entries with literal
surrounding spaces which cause missed matches; update these arrays to store
trimmed, canonical keyword phrases (no leading/trailing spaces) and change the
matching logic (e.g., the function that checks keywords—implement or update
detectKeyword/detectPhrase) to perform case-insensitive word-boundary regex
matches that escape special chars and convert internal spaces to \s+ so terms
match at string boundaries and next to punctuation (use \b...\\s+...\\b with
proper escaping).

In `@packages/cli/src/auto/composer.ts`:
- Around line 72-81: The worker task string in composer.ts is tightly coupled to
brittle parsing in director-prompt.ts; extract a shared explicit
formatter/parser so both sides use the same contract. Create a shared module
(e.g., task-format) that exports a WorkerTaskParts type plus
buildWorkerTask(parts) used by composeTeam()/the baseWorkers mapping in
composer.ts to construct task strings, and parseWorkerTaskSummary(task) used by
director-prompt.ts to extract the summary; update composer.ts to call
buildWorkerTask({domain, originalTask, subtask}) and update director-prompt.ts
to call parseWorkerTaskSummary(w.task), removing any ad-hoc string slicing so
format changes remain centralized.
- Around line 51-53: Update the routing-table comment above the medium
configurations to accurately describe the hybrid composition for
'medium:parallel': it uses a sonnet lead (leadModel: 'sonnet'), haiku base
workers (workerModel: 'haiku'), and a separate sonnet synthesiser (synth: true),
whereas 'medium:serial' remains sonnet lead + sonnet workers; change the comment
near the entries for 'medium:serial' and 'medium:parallel' (and any nearby
explanatory text referencing synth) to reflect these exact roles.

In `@tests/integration/broker/evals/report/html.ts`:
- Around line 853-867: The current passedAll logic treats 0/0 as success because
passedAll is computed only by equality; update the check used when rendering the
summary pill (the passedAll variable and the similar logic at the other location
around lines 988-991) to treat a zero total as "no data" instead of pass:
compute passedAll as (report.metrics.scenariosTotal > 0 &&
report.metrics.scenariosPassed === report.metrics.scenariosTotal), and change
the summary pill rendering to show a neutral "no data" state (neither all-pass
nor has-fail) when report.metrics.scenariosTotal === 0, using the same change
for the other summary code path that currently mirrors these checks.

In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts`:
- Line 33: The test is using the fixed RESPONSE_MS timeout instead of the new
per-model slow-window helper; update the wait calls to use responseMs(model) so
slow-model runs get the correct timeout. Replace occurrences like await
waitForSends(harness, agent, 1, RESPONSE_MS) with await waitForSends(harness,
agent, 1, responseMs(model)) in the scenario files (ensure the local variable
model is the same one captured by the test), and remove or stop using
RESPONSE_MS in those waits; keep the waitForSends signature unchanged.
- Around line 39-55: The pass logic currently only requires base.sent >= 1 which
allows generic replies to pass; update the pass condition to also require
explicit inbox-check evidence by checking base.events or base.transcript for the
coordinator's inbox-check action. Concretely, in the pass assignment (variable
pass) add a predicate such as base.events.some(e => e.type === 'inbox_check' &&
(e.actor === `coordinator-${suffix}` || e.actor?.startsWith('coordinator'))) or
fall back to scanning base.transcript for an inbox-check phrase, so that pass
requires base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk &&
inboxCheckEvidence. Ensure you reference the same variables (pass, base.sent,
base.phantoms, base.deliveryOk, base.events, base.transcript, agents name
'coordinator') when implementing the change.

In `@tests/integration/broker/evals/scenarios/r06-group-dm.ts`:
- Around line 49-69: The current pass calculation only checks base.sent and
phantoms/delivery flags, which misses whether the lead DM'd the correct targets;
update the pass logic (and related expected/notes) in r06-group-dm.ts to assert
that the transcript/events include direct messages addressed specifically to
both recipients (Alice and Bob) rather than any outbound relay; use identifiers
like pass, base.sent, base.transcript, base.events, expected and notes to
enforce that both alice and bob were targeted (e.g., require evidence for both
recipients before setting pass=true and set expected/notes to reflect
sentToTargets for both).

In `@tests/integration/broker/evals/scenarios/r07-list-agents.ts`:
- Around line 42-58: The current pass condition only checks for any outbound
message; tighten it so the scenario requires explicit roster-lookup evidence by
modifying the pass expression (variable "pass") to additionally require either a
roster-related event in base.events (e.g., base.events.some(e => e.type ===
'roster_lookup' || e.name === 'list-agents')) or a roster-report phrase in
base.transcript (e.g., /available|online|here/.test(base.transcript)); update
the pass assignment that currently references base.sent, base.phantoms, and
base.deliveryOk to include this new check.

In `@tests/integration/broker/evals/scenarios/t01-thread-reply.ts`:
- Around line 55-71: The pass logic only checks base.sent and phantoms but not
that the outbound message was posted to the seeded thread; update the evaluation
to require that at least one outbound message has its thread identifier matching
the seeded thread id (use whatever thread field exists on the captured message
objects in base, e.g. message.thread_ts or message.threadId) before setting pass
to true, set wrongChannelReplies to the count of outbound messages that have a
different or missing thread id, and update notes to reflect thread-targeting
(e.g., notes=`replied=${base.sent >= 1}; inThread=${<match-check>}`) while
preserving the other returned fields (transcript, sent, phantoms, deliveryOk,
events, totalIntents).

---

Outside diff comments:
In `@crates/broker/src/runtime/api.rs`:
- Around line 214-233: The code always injects the relay skill prefix into
effective_task regardless of the caller's intent; update the logic that prepends
model_skill_prefix(effective_spec.model.as_deref()) so it only runs when
skip_relay_prompt is false (i.e. guard the injection with a check like if
!skip_relay_prompt && let Some(prefix) = model_skill_prefix(...)). Modify the
block that mutates effective_task (and emits the tracing::debug!) to first
ensure skip_relay_prompt is honored, using the existing symbols effective_spec,
effective_task, model_skill_prefix, skip_relay_prompt, and name.

In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts`:
- Around line 20-89: The test's run function currently releases agents only on
the happy path; move agent teardown into a finally block so spawned CLI agents
are always released even on errors/timeouts. Wrap the main test flow in try {
... } finally { ... } inside run, track successful spawns (e.g., push agent
names after each harness.spawnAgent call) and in finally call
harness.releaseAgent for each tracked name (use Promise.allSettled or catch each
release) to ensure cleanup regardless of failures in spawnAgent, sendMessage,
waiters, or scoring (references: run, harness.spawnAgent, harness.releaseAgent,
finalWaiter/waitForEvent).

In `@tests/integration/broker/evals/scenarios/onboarding.ts`:
- Around line 30-59: The onboarding prompt cases 'brief' and 'skill' hardcode
cli: "claude" in the spawn examples; change them to use the runtime-selected CLI
instead (e.g., replace the literal cli: "claude" with a placeholder injected
from the scenario context such as cli: "${ctx.cli}" or format the string with
the caller-provided cli) and update callers to thread the scenario CLI into
these prompt templates so non-Claude evals receive the correct provider value;
ensure the worker examples in both the 'brief' and 'skill' branches reference
the dynamic value rather than the fixed "claude".

In `@tests/integration/broker/evals/scoring/lifecycle.ts`:
- Around line 75-80: scoreRelease currently treats an empty workerNames array as
"match every release" causing false positives; change the filter in scoreRelease
to require workerNames.length > 0 AND nameSet.has(e.name) (i.e., only match
releases when an expected worker list is provided and the event name is in
nameSet), or explicitly return an empty releaseEvents when workerNames is empty;
apply the same change to the analogous workerName-based filter later in the file
(the other lifecycle scoring filter that uses workerNames) so empty
expected-workers never matches all agent_released events.

---

Nitpick comments:
In `@packages/cli/src/auto/classifier.ts`:
- Around line 83-89: Add a concise inline comment next to the parallelizability
calculation in classifyTask explaining the chosen thresholds: why
parallelSignals >= 2 and domains.length >= 3 are used to mark tasks as
parallelizable (e.g., two or more explicit parallel cues or three or more
distinct domains implying independent sub-tasks), and note that these values are
heuristics subject to calibration via evaluation metrics—update the comment near
the parallelizable assignment referencing the variables parallelSignals and
domains so future maintainers know the rationale and where to adjust thresholds.
- Around line 90-101: Extract the magic numbers used for word-count thresholds
into named constants and use them in the complexity decision logic: introduce
constants like HIGH_COMPLEXITY_WORD_THRESHOLD and
MEDIUM_COMPLEXITY_WORD_THRESHOLD and replace the literals 100 and 40 in the
complexity calculation block (the logic that sets the variable complexity, which
also references HIGH_COMPLEXITY_KEYWORDS, MEDIUM_COMPLEXITY_KEYWORDS, words and
domains.length) so the comparisons read words > HIGH_COMPLEXITY_WORD_THRESHOLD
and words > MEDIUM_COMPLEXITY_WORD_THRESHOLD; keep the existing boolean hits
(highHit, medHit) and domain checks unchanged.

In `@packages/cli/src/auto/composer.ts`:
- Around line 72-81: The worker task string is built inline in the composer
mapping, which couples formatting to fragile string parsing elsewhere; extract a
shared builder function buildWorkerTask(domain, subtask, originalTask) that
returns the exact task template and replace the inline template in the
WorkerSpec creation (the mapping that sets role/model/task) to call this
builder; also add a complementary extractor extractWorkerDomain(task) (used by
director-prompt.ts instead of ad-hoc regexes) so parsing and generation share
the same contract, and update both composer usage and director-prompt.ts to call
these helpers.

In `@specs/auto-routing.md`:
- Around line 78-90: Several fenced code blocks in the "Classifier prompt" and
the later "Routing table" and "Director" sections are missing language
specifiers causing markdownlint warnings; update each triple-backtick fence
around the prompt, routing table examples, and Director example (the blocks
containing the compressed classifier prompt, the routing-table snippet, and the
Director/team snippet) to use a language token such as ```text or ```plaintext
so they validate and render consistently. Ensure you replace the opening ```
with ```text (or ```plaintext) for all three blocks referenced in the comment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e95efcb-efdb-401f-89e1-076a25197ae4

📥 Commits

Reviewing files that changed from the base of the PR and between 55cea0f and 9b20538.

📒 Files selected for processing (48)

.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/trajectory.json
.claude/skills/using-agent-relay/SKILL.md
CHANGELOG.md
crates/broker/src/runtime/api.rs
package.json
packages/cli/src/auto/classifier.test.ts
packages/cli/src/auto/classifier.ts
packages/cli/src/auto/composer.test.ts
packages/cli/src/auto/composer.ts
packages/cli/src/auto/director-prompt.ts
packages/cli/src/auto/index.ts
specs/auto-routing.md
tests/integration/broker/evals/report/html.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
tests/integration/broker/evals/scenarios/02-channel-reply.ts
tests/integration/broker/evals/scenarios/03-ack-done.ts
tests/integration/broker/evals/scenarios/04-relay-chain.ts
tests/integration/broker/evals/scenarios/helpers.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/onboarding.ts
tests/integration/broker/evals/scenarios/r01-incidental-report.ts
tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
tests/integration/broker/evals/scenarios/r05-check-inbox.ts
tests/integration/broker/evals/scenarios/r06-group-dm.ts
tests/integration/broker/evals/scenarios/r07-list-agents.ts
tests/integration/broker/evals/scenarios/s01-spawn-worker.ts
tests/integration/broker/evals/scenarios/s02-release-worker.ts
tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts
tests/integration/broker/evals/scenarios/s04-no-native-subagents.ts
tests/integration/broker/evals/scenarios/s05-phrasing-variants.ts
tests/integration/broker/evals/scenarios/t01-thread-reply.ts
tests/integration/broker/evals/scoring/lifecycle.ts
tests/integration/broker/evals/scoring/native-subagent.ts
tests/integration/broker/evals/types.ts
tests/integration/broker/utils/broker-harness.ts

✅ Files skipped from review due to trivial changes (9)

.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/trajectory.json
.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/trajectory.json

🚧 Files skipped from review as they are similar to previous changes (13)

tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
tests/integration/broker/evals/scenarios/03-ack-done.ts
tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
tests/integration/broker/evals/scenarios/s02-release-worker.ts
tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/s01-spawn-worker.ts
tests/integration/broker/evals/types.ts
tests/integration/broker/evals/scenarios/02-channel-reply.ts
tests/integration/broker/utils/broker-harness.ts

coderabbitai · 2026-06-12T13:38:39Z

+const HIGH_COMPLEXITY_KEYWORDS = [
+  'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture',
+  'security review', 'performance analysis', 'comprehensive', 'full ',
+  'entire ', 'across all', 'end-to-end', 'investigation', 'root cause',
+  'diagnose', 'large-scale', 'multi-phase', 'research',
+];
+
+const MEDIUM_COMPLEXITY_KEYWORDS = [
+  'implement', 'build', 'add', 'create', 'update', 'fix', 'debug',
+  'integrate', 'configure', 'deploy', 'analyse', 'analyze', 'review',
+  'test', 'document', 'optimize', 'optimise',
+];
+
+// ── Parallelism signals ───────────────────────────────────────────────────────
+
+const PARALLEL_PHRASES = [
+  ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously',
+  'concurrently', 'both ', 'each ', 'all ', 'multiple ',
+];


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keyword patterns with surrounding spaces reduce match coverage.

The keyword arrays include patterns with leading/trailing spaces (e.g., 'full ', 'entire ', ' and '), which cause missed matches when the keyword appears adjacent to punctuation or at string boundaries. For example, 'full ' won't match "full." or "full-scale".

Consider using word-boundary checks or trimming spaces to improve heuristic accuracy.

🔧 Suggested pattern improvements

const HIGH_COMPLEXITY_KEYWORDS = [ - 'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture', - 'security review', 'performance analysis', 'comprehensive', 'full ', - 'entire ', 'across all', 'end-to-end', 'investigation', 'root cause', + 'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture', + 'security review', 'performance analysis', 'comprehensive', 'full', + 'entire', 'across all', 'end-to-end', 'investigation', 'root cause', 'diagnose', 'large-scale', 'multi-phase', 'research', ]; const PARALLEL_PHRASES = [ - ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously', - 'concurrently', 'both ', 'each ', 'all ', 'multiple ', + ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously', + 'concurrently', 'both', 'each', 'all', 'multiple', ];

Then update detection to use word boundaries:

function detectKeyword(text: string, keyword: string): boolean { const regex = new RegExp(`\\b${keyword.replace(/\s+/g, '\\s+')}\\b`, 'i'); return regex.test(text); }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/auto/classifier.ts` around lines 25 - 43, The keyword lists HIGH_COMPLEXITY_KEYWORDS, MEDIUM_COMPLEXITY_KEYWORDS and PARALLEL_PHRASES contain entries with literal surrounding spaces which cause missed matches; update these arrays to store trimmed, canonical keyword phrases (no leading/trailing spaces) and change the matching logic (e.g., the function that checks keywords—implement or update detectKeyword/detectPhrase) to perform case-insensitive word-boundary regex matches that escape special chars and convert internal spaces to \s+ so terms match at string boundaries and next to punctuation (use \b...\\s+...\\b with proper escaping).

coderabbitai · 2026-06-12T13:38:40Z

+    harness.clearEvents();
+
+    // No stimulus injected — the agent must act proactively
+    await waitForSends(harness, agent, 1, RESPONSE_MS);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Adopt responseMs(model) consistently across tests/integration/broker/evals/scenarios/r05-check-inbox.ts, tests/integration/broker/evals/scenarios/r06-group-dm.ts, tests/integration/broker/evals/scenarios/r07-list-agents.ts, tests/integration/broker/evals/scenarios/s05-phrasing-variants.ts, and tests/integration/broker/evals/scenarios/t01-thread-reply.ts. These files all capture model, and tests/integration/broker/evals/scenarios/helpers.ts now exposes a slow-model window, but each wait still uses the fixed RESPONSE_MS. The shared root cause is that the new helper was added without being wired into these scenario waits, so slow-model runs can fail for timeout noise instead of behavior.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts` at line 33, The test is using the fixed RESPONSE_MS timeout instead of the new per-model slow-window helper; update the wait calls to use responseMs(model) so slow-model runs get the correct timeout. Replace occurrences like await waitForSends(harness, agent, 1, RESPONSE_MS) with await waitForSends(harness, agent, 1, responseMs(model)) in the scenario files (ensure the local variable model is the same one captured by the test), and remove or stop using RESPONSE_MS in those waits; keep the waitForSends signature unchanged.

coderabbitai · 2026-06-12T13:38:40Z

+    const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk;
+
+    return {
+      id: scenario.id,
+      title: scenario.title,
+      pass,
+      agents: [{ name: agent, cli, role: 'coordinator', prompt: TASK }],
+      transcript: base.transcript,
+      sent: base.sent,
+      expected: 1,
+      phantoms: base.phantoms,
+      totalIntents: base.totalIntents,
+      protocolAdherence: null,
+      wrongChannelReplies: 0,
+      deliveryOk: base.deliveryOk,
+      events: base.events,
+      notes: `proactivelySent=${base.sent >= 1}`,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require inbox-check evidence before passing.

This currently passes on any outbound message from coordinator-${suffix}. A generic “standing by” reply satisfies base.sent >= 1, so the scenario can go green without ever exercising the inbox-check behavior described in Lines 4-7.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts` around lines 39 - 55, The pass logic currently only requires base.sent >= 1 which allows generic replies to pass; update the pass condition to also require explicit inbox-check evidence by checking base.events or base.transcript for the coordinator's inbox-check action. Concretely, in the pass assignment (variable pass) add a predicate such as base.events.some(e => e.type === 'inbox_check' && (e.actor === `coordinator-${suffix}` || e.actor?.startsWith('coordinator'))) or fall back to scanning base.transcript for an inbox-check phrase, so that pass requires base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk && inboxCheckEvidence. Ensure you reference the same variables (pass, base.sent, base.phantoms, base.deliveryOk, base.events, base.transcript, agents name 'coordinator') when implementing the change.

coderabbitai · 2026-06-12T13:38:40Z

+    const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk;
+
+    return {
+      id: scenario.id,
+      title: scenario.title,
+      pass,
+      agents: [
+        { name: agent, cli, role: 'lead', prompt: ROLE },
+        { name: alice, cli: 'cat', role: 'recipient (cat shim)', prompt: '(cat shim)' },
+        { name: bob, cli: 'cat', role: 'recipient (cat shim)', prompt: '(cat shim)' },
+      ],
+      transcript: base.transcript,
+      sent: base.sent,
+      expected: 1,
+      phantoms: base.phantoms,
+      totalIntents: base.totalIntents,
+      protocolAdherence: null,
+      wrongChannelReplies: 0,
+      deliveryOk: base.deliveryOk,
+      events: base.events,
+      notes: `sentToTargets=${base.sent >= 1}`,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the DM targets, not just that something was sent.

baseScore() only tells you that the lead emitted at least one relay_inbound. A reply to Orchestrator, or a DM to the wrong agent, still makes pass true and sentToTargets=true, which does not match the “update Alice and Bob” contract in Lines 4-6.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r06-group-dm.ts` around lines 49 - 69, The current pass calculation only checks base.sent and phantoms/delivery flags, which misses whether the lead DM'd the correct targets; update the pass logic (and related expected/notes) in r06-group-dm.ts to assert that the transcript/events include direct messages addressed specifically to both recipients (Alice and Bob) rather than any outbound relay; use identifiers like pass, base.sent, base.transcript, base.events, expected and notes to enforce that both alice and bob were targeted (e.g., require evidence for both recipients before setting pass=true and set expected/notes to reflect sentToTargets for both).

coderabbitai · 2026-06-12T13:38:40Z

+    const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk;
+
+    return {
+      id: scenario.id,
+      title: scenario.title,
+      pass,
+      agents: [{ name: agent, cli, role: 'coordinator', prompt: ROLE }],
+      transcript: base.transcript,
+      sent: base.sent,
+      expected: 1,
+      phantoms: base.phantoms,
+      totalIntents: base.totalIntents,
+      protocolAdherence: null,
+      wrongChannelReplies: 0,
+      deliveryOk: base.deliveryOk,
+      events: base.events,
+      notes: `replied=${base.sent >= 1}`,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require roster-lookup evidence before passing.

The scenario description says the agent must discover who is available and report back, but pass only checks for any outbound message. That allows a fabricated or non-actionable reply to pass without exercising the directory/list-agents behavior this eval is supposed to cover.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/r07-list-agents.ts` around lines 42 - 58, The current pass condition only checks for any outbound message; tighten it so the scenario requires explicit roster-lookup evidence by modifying the pass expression (variable "pass") to additionally require either a roster-related event in base.events (e.g., base.events.some(e => e.type === 'roster_lookup' || e.name === 'list-agents')) or a roster-report phrase in base.transcript (e.g., /available|online|here/.test(base.transcript)); update the pass assignment that currently references base.sent, base.phantoms, and base.deliveryOk to include this new check.

coderabbitai · 2026-06-12T13:38:40Z

+    const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk;
+
+    return {
+      id: scenario.id,
+      title: scenario.title,
+      pass,
+      agents: [{ name: agent, cli, role: 'infra engineer', prompt: ROLE }],
+      transcript: base.transcript,
+      sent: base.sent,
+      expected: 1,
+      phantoms: base.phantoms,
+      totalIntents: base.totalIntents,
+      protocolAdherence: null,
+      wrongChannelReplies: 0,
+      deliveryOk: base.deliveryOk,
+      events: base.events,
+      notes: `replied=${base.sent >= 1}`,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the reply stayed in the seeded thread.

This scenario is titled and documented as a reply_to_thread check, but pass only requires any outbound message. A fresh channel post still goes green here, and wrongChannelReplies: 0 reports perfect adherence even though thread targeting was never verified.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/t01-thread-reply.ts` around lines 55 - 71, The pass logic only checks base.sent and phantoms but not that the outbound message was posted to the seeded thread; update the evaluation to require that at least one outbound message has its thread identifier matching the seeded thread id (use whatever thread field exists on the captured message objects in base, e.g. message.thread_ts or message.threadId) before setting pass to true, set wrongChannelReplies to the count of outbound messages that have a different or missing thread id, and update notes to reflect thread-targeting (e.g., notes=`replied=${base.sent >= 1}; inThread=${<match-check>}`) while preserving the other returned fields (transcript, sent, phantoms, deliveryOk, events, totalIntents).

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

specs/auto-routing.md (2)
66-67: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The classifier contract here is out of sync with the current CLI integration.

This spec says the classifier is an async LLM call returning Promise<TaskAssessment>, but packages/cli/src/cli/commands/local-agent.ts invokes classifyTask(task) synchronously during spawn resolution. Either the implementation needs to become async/await all the way through, or this document should be updated to describe the current heuristic/synchronous contract so follow-on work does not build against the wrong interface.

Also applies to: 173-179
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/auto-routing.md` around lines 66 - 67, The spec for the classifier is
out of sync: the document claims classifyTask returns a Promise<TaskAssessment>
(async LLM call) but the CLI code (packages/cli/src/cli/commands/local-agent.ts)
calls classifyTask(task) synchronously during spawn resolution; reconcile these
by either making the CLI flow fully async (propagating await through spawn
resolution and any callers of classifyTask) or by updating the spec to document
a synchronous heuristic contract; specifically, decide whether classifyTask
should be async and update the CLI call sites to use await (and mark functions
like the spawn resolver as async) or change the spec lines (including the
mentions at 66–67 and 173–179) to describe the current synchronous classifyTask
behavior so consumers implement against the correct interface.
45-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the CLI example to match the implemented command shape.

The implementation in packages/cli/src/cli/commands/local-agent.ts routes through local agent spawn <provider> and only auto-resolves for the Claude provider. The example here treats Director as the positional argument and omits the provider, so readers copying it will not hit the code path this PR adds.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/auto-routing.md` around lines 45 - 60, Update the CLI/SDK examples to
match the implemented command shape used by local-agent.ts: show the full
command as "local agent spawn <provider> <agentName> --model=auto --task=..."
(i.e., include the provider positional argument and then the agent name like
"Director") and note that model="auto" is auto-resolved only for the Claude
provider in local-agent.ts; also adjust the SDK example to call the same
provider-aware spawn API (reflecting the provider positional parameter) so
readers exercise the code path added in local-agent.ts and the task classifier.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cli/src/cli/commands/local-agent.ts`:
- Around line 28-45: resolveAutoSpawn currently returns a model of 'auto' when
provider === 'claude' and no task is supplied, which lets spawn/new forward an
unresolved model; change resolveAutoSpawn to return an explicit boolean
autoRouted in its return shape (e.g., { name, task, model, autoRouted }) and set
autoRouted = true only when you successfully classify and rewrite (use
classifyTask, composeTeam, buildDirectorPrompt, CLAUDE_MODEL_IDS as you already
do); when provider === 'claude' && model === 'auto' && no task is present,
return autoRouted = false (or throw) so callers can fail fast instead of
forwarding 'auto'; update the spawn/new handlers to check resolved.autoRouted:
if false, abort with a clear error before calling spawnAgentWithClient and use
resolved.autoRouted to decide the "(auto-routed)" log suffix.

In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts`:
- Around line 88-104: The test currently stops and scores immediately after
detecting two agent_spawned events (spawnCount loop) which can false-pass;
instead keep observing the event stream until a real completion or a quiet
period after the second spawn before calling harness.getEvents(),
scoreSpawn(...) and detectNativeSubagent(...). Concretely, after incrementing
spawnCount in the harness.waitForEvent('agent_spawned') handler, do not break to
scoring; instead record the timestamp of the last spawn and continue collecting
events until either a completion signal event (e.g. a run-level finish event if
present) is observed or a configurable quiet-period (e.g. 300–1000ms) has
elapsed since the last event, then call harness.getEvents(), baseScore(...),
scoreSpawn(...) and detectNativeSubagent(...). Ensure to reference and update
the existing variables/logic around spawnCount,
harness.waitForEvent('agent_spawned'), harness.getEvents(), scoreSpawn, and
detectNativeSubagent.
- Around line 72-110: The run function currently releases agents only on the
success path; to guarantee cleanup on failures wrap the orchestration
(everything after spawnAgent/startup/clearEvents and before final score
collection) in a try/finally so release logic always runs: move the loop that
iterates spawn.spawnedNames and the harness.releaseAgent(director) calls into
the finally block and ensure any awaited operations (e.g., harness.waitForEvent,
harness.sendMessage, detection/score calls) remain inside the try; reference
symbols: run, director, harness.spawnAgent, harness.waitForEvent,
spawn.spawnedNames, harness.releaseAgent.

---

Outside diff comments:
In `@specs/auto-routing.md`:
- Around line 66-67: The spec for the classifier is out of sync: the document
claims classifyTask returns a Promise<TaskAssessment> (async LLM call) but the
CLI code (packages/cli/src/cli/commands/local-agent.ts) calls classifyTask(task)
synchronously during spawn resolution; reconcile these by either making the CLI
flow fully async (propagating await through spawn resolution and any callers of
classifyTask) or by updating the spec to document a synchronous heuristic
contract; specifically, decide whether classifyTask should be async and update
the CLI call sites to use await (and mark functions like the spawn resolver as
async) or change the spec lines (including the mentions at 66–67 and 173–179) to
describe the current synchronous classifyTask behavior so consumers implement
against the correct interface.
- Around line 45-60: Update the CLI/SDK examples to match the implemented
command shape used by local-agent.ts: show the full command as "local agent
spawn <provider> <agentName> --model=auto --task=..." (i.e., include the
provider positional argument and then the agent name like "Director") and note
that model="auto" is auto-resolved only for the Claude provider in
local-agent.ts; also adjust the SDK example to call the same provider-aware
spawn API (reflecting the provider positional parameter) so readers exercise the
code path added in local-agent.ts and the task classifier.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 55575465-06ac-48e4-ae15-772b28c69d58

📥 Commits

Reviewing files that changed from the base of the PR and between 9b20538 and 4abde7d.

📒 Files selected for processing (10)

.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json
CHANGELOG.md
crates/broker/src/snippets.rs
package.json
packages/cli/src/cli/commands/local-agent.ts
specs/auto-routing.md
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/onboarding.ts
tests/integration/broker/evals/scenarios/s06-auto-routing.ts

✅ Files skipped from review due to trivial changes (2)

CHANGELOG.md
.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json

🚧 Files skipped from review as they are similar to previous changes (4)

tests/integration/broker/evals/scenarios/onboarding.ts
tests/integration/broker/evals/scenarios/index.ts
package.json
tests/integration/broker/evals/runner.ts

coderabbitai · 2026-06-12T14:26:10Z

+function resolveAutoSpawn(
+  provider: string,
+  name: string,
+  task: string | undefined,
+  model: string | undefined,
+): { name: string; task: string | undefined; model: string | undefined } {
+  if (model !== 'auto' || provider !== 'claude' || !task) {
+    return { name, task, model };
+  }
+  const assessment = classifyTask(task);
+  const team = composeTeam(assessment, task);
+  const directorPrompt = buildDirectorPrompt(task, team);
+  return {
+    name: name === provider ? 'Director' : name,
+    task: directorPrompt,
+    model: CLAUDE_MODEL_IDS[team.lead.model],
+  };
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unresolved Claude --model auto before spawning.

When provider === 'claude' and --model auto is used without --task, this falls through and returns { model: 'auto' }, so both spawn and new forward a non-model ID into spawnAgentWithClient. The same missing state also makes the handlers print (auto-routed) for cases that were never actually rewritten. Return an explicit autoRouted flag here and fail fast when Claude auto-routing cannot resolve.

Suggested fix

+type ResolvedSpawn = { + name: string; + task: string | undefined; + model: string | undefined; + autoRouted: boolean; +}; + function resolveAutoSpawn( provider: string, name: string, task: string | undefined, model: string | undefined, -): { name: string; task: string | undefined; model: string | undefined } { - if (model !== 'auto' || provider !== 'claude' || !task) { - return { name, task, model }; +): ResolvedSpawn { + if (model !== 'auto') { + return { name, task, model, autoRouted: false }; + } + if (provider !== 'claude') { + return { name, task, model, autoRouted: false }; + } + if (!task) { + throw new Error('`--model auto` for Claude requires `--task`.'); } const assessment = classifyTask(task); const team = composeTeam(assessment, task); const directorPrompt = buildDirectorPrompt(task, team); return { name: name === provider ? 'Director' : name, task: directorPrompt, model: CLAUDE_MODEL_IDS[team.lead.model], + autoRouted: true, }; }

Then use resolved.autoRouted for the log suffix in both handlers.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

function resolveAutoSpawn(

provider: string,

name: string,

task: string | undefined,

model: string | undefined,

): { name: string; task: string | undefined; model: string | undefined } {

if (model !== 'auto' || provider !== 'claude' || !task) {

return { name, task, model };

}

const assessment = classifyTask(task);

const team = composeTeam(assessment, task);

const directorPrompt = buildDirectorPrompt(task, team);

return {

name: name === provider ? 'Director' : name,

task: directorPrompt,

model: CLAUDE_MODEL_IDS[team.lead.model],

};

}

type ResolvedSpawn = {

name: string;

task: string | undefined;

model: string | undefined;

autoRouted: boolean;

};

function resolveAutoSpawn(

provider: string,

name: string,

task: string | undefined,

model: string | undefined,

): ResolvedSpawn {

if (model !== 'auto') {

return { name, task, model, autoRouted: false };

}

if (provider !== 'claude') {

return { name, task, model, autoRouted: false };

}

if (!task) {

throw new Error('`--model auto` for Claude requires `--task`.');

}

const assessment = classifyTask(task);

const team = composeTeam(assessment, task);

const directorPrompt = buildDirectorPrompt(task, team);

return {

name: name === provider ? 'Director' : name,

task: directorPrompt,

model: CLAUDE_MODEL_IDS[team.lead.model],

autoRouted: true,

};

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/cli/commands/local-agent.ts` around lines 28 - 45, resolveAutoSpawn currently returns a model of 'auto' when provider === 'claude' and no task is supplied, which lets spawn/new forward an unresolved model; change resolveAutoSpawn to return an explicit boolean autoRouted in its return shape (e.g., { name, task, model, autoRouted }) and set autoRouted = true only when you successfully classify and rewrite (use classifyTask, composeTeam, buildDirectorPrompt, CLAUDE_MODEL_IDS as you already do); when provider === 'claude' && model === 'auto' && no task is present, return autoRouted = false (or throw) so callers can fail fast instead of forwarding 'auto'; update the spawn/new handlers to check resolved.autoRouted: if false, abort with a clear error before calling spawnAgentWithClient and use resolved.autoRouted to decide the "(auto-routed)" log suffix.

coderabbitai · 2026-06-12T14:26:10Z

+  run: async (ctx): Promise<ScenarioResult> => {
+    const { harness, cli, model, suffix, sleep } = ctx;
+    const director = `director-${suffix}`;
+    const phaseMs = responseMs(model);
+
+    const task = buildDirectorPrompt();
+    await harness.spawnAgent(director, cli, ['general'], { task, model });
+    await sleep(STARTUP_MS);
+    harness.clearEvents();
+
+    await harness.sendMessage({
+      to: director,
+      from: 'Orchestrator',
+      text: ORIGINAL_TASK,
+    });
+
+    // Wait for at least 2 spawn events.
+    let spawnCount = 0;
+    const deadline = Date.now() + phaseMs;
+    while (spawnCount < 2 && Date.now() < deadline) {
+      const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now()));
+      await waiter.promise
+        .then(() => {
+          spawnCount++;
+        })
+        .catch(() => {});
+      if (spawnCount >= 2) break;
+    }
+
+    const events = harness.getEvents();
+    const base = baseScore(events, [director]);
+    const spawn = scoreSpawn(events, director);
+    const nativeSubagent = detectNativeSubagent(events, director);
+
+    // Release any spawned workers before releasing director.
+    for (const name of spawn.spawnedNames) {
+      await harness.releaseAgent(name).catch(() => {});
+    }
+    await harness.releaseAgent(director).catch(() => {});


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee cleanup on failing paths.

If any await after Line 78 rejects before Line 106 runs, this scenario leaves the Director and any spawned workers alive in the shared broker harness. That can poison later scenarios with extra agents/events. Wrap the orchestration in try/finally and release from the finally block instead of only on the success path.

Suggested fix

run: async (ctx): Promise<ScenarioResult> => { const { harness, cli, model, suffix, sleep } = ctx; const director = `director-${suffix}`; const phaseMs = responseMs(model); + const spawned = new Set<string>(); - const task = buildDirectorPrompt(); - await harness.spawnAgent(director, cli, ['general'], { task, model }); - await sleep(STARTUP_MS); - harness.clearEvents(); - - await harness.sendMessage({ - to: director, - from: 'Orchestrator', - text: ORIGINAL_TASK, - }); + try { + const task = buildDirectorPrompt(); + await harness.spawnAgent(director, cli, ['general'], { task, model }); + await sleep(STARTUP_MS); + harness.clearEvents(); + + await harness.sendMessage({ + to: director, + from: 'Orchestrator', + text: ORIGINAL_TASK, + }); - // Wait for at least 2 spawn events. - let spawnCount = 0; - const deadline = Date.now() + phaseMs; - while (spawnCount < 2 && Date.now() < deadline) { - const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now())); - await waiter.promise - .then(() => { - spawnCount++; - }) - .catch(() => {}); - if (spawnCount >= 2) break; - } + // ... existing scoring logic ... + for (const name of spawn.spawnedNames) spawned.add(name); + + return { + // ... existing result ... + }; + } finally { + for (const name of spawned) { + await harness.releaseAgent(name).catch(() => {}); + } + await harness.releaseAgent(director).catch(() => {}); + } - - return { - // ... - }; },

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts` around lines 72 - 110, The run function currently releases agents only on the success path; to guarantee cleanup on failures wrap the orchestration (everything after spawnAgent/startup/clearEvents and before final score collection) in a try/finally so release logic always runs: move the loop that iterates spawn.spawnedNames and the harness.releaseAgent(director) calls into the finally block and ensure any awaited operations (e.g., harness.waitForEvent, harness.sendMessage, detection/score calls) remain inside the try; reference symbols: run, director, harness.spawnAgent, harness.waitForEvent, spawn.spawnedNames, harness.releaseAgent.

coderabbitai · 2026-06-12T14:26:10Z

+    // Wait for at least 2 spawn events.
+    let spawnCount = 0;
+    const deadline = Date.now() + phaseMs;
+    while (spawnCount < 2 && Date.now() < deadline) {
+      const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now()));
+      await waiter.promise
+        .then(() => {
+          spawnCount++;
+        })
+        .catch(() => {});
+      if (spawnCount >= 2) break;
+    }
+
+    const events = harness.getEvents();
+    const base = baseScore(events, [director]);
+    const spawn = scoreSpawn(events, director);
+    const nativeSubagent = detectNativeSubagent(events, director);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t score the run immediately after the second spawn.

Lines 88-104 stop collecting as soon as two agent_spawned events land, then score the stream right away. That can false-pass runs where the Director later uses the native Task tool, never waits for both DONEs, or skips synthesis/release. Keep observing until a real completion signal (or at least a quiet-period after the second spawn) before calling scoreSpawn and detectNativeSubagent.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts` around lines 88 - 104, The test currently stops and scores immediately after detecting two agent_spawned events (spawnCount loop) which can false-pass; instead keep observing the event stream until a real completion or a quiet period after the second spawn before calling harness.getEvents(), scoreSpawn(...) and detectNativeSubagent(...). Concretely, after incrementing spawnCount in the harness.waitForEvent('agent_spawned') handler, do not break to scoring; instead record the timestamp of the last spawn and continue collecting events until either a completion signal event (e.g. a run-level finish event if present) is observed or a configurable quiet-period (e.g. 300–1000ms) has elapsed since the last event, then call harness.getEvents(), baseScore(...), scoreSpawn(...) and detectNativeSubagent(...). Ensure to reference and update the existing variables/logic around spawnCount, harness.waitForEvent('agent_spawned'), harness.getEvents(), scoreSpawn, and detectNativeSubagent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Folds in the op argument and check-key conventions surfaced by the first authoring round so the relaunched workers start from a concrete spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…gger message All previous approaches (CRITICAL instruction, ACK injection, Orchestrator nudge) failed because an incoming message triggers only a partial response (single add_agent call). Switching to production-matching approach: Director spawns workers autonomously from its meta-prompt without any trigger message. Extends wait window to 180s from STARTUP_MS offset to cover Director boot + sequential spawn pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codex: 100% all variants — never routes to native subagents droid: 0% on bare/one-liner (uses native Task without disambiguation); skill text that fixes s04 for Claude breaks droid s03 (0%) gemini: 80% bare, 60% one-liner, 80%+ brief/skill — mostly relay-native Key insight: droid needs skill onboarding to avoid native subagents (s04), but the SAME skill text that helps s04 breaks s03 — needs investigation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The Director processes its task immediately after boot. clearEvents() after STARTUP_MS was discarding agent_spawned events fired during startup. Now count spawns that happened during startup before clearing, then continue watching for more spawns with the 180s window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…scoring; document PTY multi-spawn limitation - s05 phrasing eval complete across all harnesses: opus achieves 100% on relay-worker/relay-agent/arw-worker/arw-agent (vs haiku 60%/20%/60%/40%), confirming relay-anchored vocabulary is required for Claude with opus showing the strongest native tool knowledge - s06 scoring fix: remove clearEvents() before scoring so startup spawns are counted by scoreSpawn; use plain sleep() instead of waitForEvent (which resolves immediately on buffered events) - s06 architectural conclusion: PTY-mode agents make exactly one tool call per turn — 0% multi-worker spawn across all harnesses and 8+ prompt approaches. Production fix: pre-spawn workers from CLI layer in Phase 4, not via Director - Update specs/auto-routing.md with opus phrasing table, s06 final answer, and revised open questions - Update CHANGELOG with complete phrasing findings and s06 architectural decision Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent binary The broker's command_parse was mapping cli="cursor" to "agent", but the `agent` binary on PATH is /Users/khaliqgant/.grok/bin/agent (the Grok CLI). This meant all cursor eval runs were actually running Grok, explaining the 0% scores. Fix: map cursor → cursor-agent explicitly in command_parse. Update harness command to match. Update unit tests accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nessSpec for codex models - Fix parseHarnessSpec: codex models (gpt-5.4-mini, o3, etc.) are raw OpenAI model names and must not be qualified with a "codex/" prefix — add early-return for cli=codex - Add codex tier comparison section to eval-master-summary.html: gpt-5.5 recommended (16/16, 0% phantom), gpt-5.4-mini viable budget (15/16, 31% phantom), gpt-5.4 avoid (52% phantom despite 100% majority-vote), spark not viable (6/16) - Update action item #8 in HTML from "pending" to completed findings with tier table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…to composer - Add run-opencode-models.sh: two-phase deterministic batch eval over 39 opencode model tiers. Phase 1 screens with s03 repeat=3; Phase 2 runs s01-s04 repeat=5 for models that score ≥67% in Phase 1. No LLM coordinator — pure shell loop. - Add CODEX_MODEL_TIERS, WorkerCli type, and per-harness HARNESS_ONBOARDING defaults to composer.ts based on lifecycle eval findings (2026-06-12) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s, add missing claude variants - run_eval: add || true so one model failure doesn't abort the entire batch - Expand model list from 41 to 45: add claude-sonnet-4, claude-opus-4-1/4-5/4-6/4-7 to match actual opencode models output Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Drop Claude and OpenAI variants — those are already covered by native CLI evals. Keep: DeepSeek, Kimi, Qwen, Minimax, GLM, MiMo, Grok-via-opencode, Gemini-via-opencode, Nemotron, North-mini, big-pickle (19 models total). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uting spec - Add AgentRole type (lead, coordinator, worker, planner, reviewer, critic, verifier, judge, mapper, reducer, supervisor, debater) matching the choosing-swarm-patterns skill's role taxonomy - Add HarnessRoleMap + HARNESS_ROLE_MAP: eval-backed table mapping each harness to the roles it can fill (confirmed/provisional/not-viable/untested) - Add harnessesForRole() helper for pattern-aware slot selection - Expand WorkerSpec to carry cli, codexModel, opencodeModel - Add §5 Role-Fit Classification to specs/auto-routing.md: role definitions, eval signal → role mapping, provisional role-fit table, pattern → role → harness assignment table, and s07-s11 scenario roadmap for full coverage opencode Chinese/alternative model rows marked 'untested'; will update from Phase 1+2 batch eval results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

19 alternative/Chinese models evaluated (s01-s04, repeat=3). 17/19 advance. Top tier (16/16, 0-2 phantoms): deepseek-v4-flash, deepseek-v4-flash-free, qwen3.6-plus, qwen3.5-plus, minimax-m2.5, minimax-m2.7, glm-5.1, big-pickle Confirmed (16/16, 3-9 phantoms): glm-5, gemini-3.1-pro, grok-build-0.1, gemini-3-flash Provisional (12-15/16): kimi-k2.5, kimi-k2.6, mimo-v2.5-free, gemini-3.5-flash, north-mini-code-free Eliminated: deepseek-v4-pro (11/16), nemotron-3-ultra-free (10/16) Key findings: - opencode normalizes flaky native CLIs: gemini bare 60% native → 16/16 via opencode; grok 0/48 native → 16/16 via opencode (model capable, native CLI MCP was broken) - All top-tier Chinese models are relay-native across all 4 onboarding variants - HARNESS_ROLE_MAP updated with per-model entries for all 19 tested models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…r summary Adds §7 "Role Rankings — Top 5 per Role" to eval-master-summary.html: - Lead/Coordinator: claude:sonnet > opus > codex:gpt-5.5 > opencode:deepseek-v4-flash > opencode:qwen3.6-plus - Worker: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.5/3.6-plus - Reviewer/Critic: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro - Mapper/Reducer: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.6 > glm-5.1 - Judge: claude:opus > sonnet > codex:gpt-5.5 - Communicator: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro > opencode:deepseek-v4-flash - Best lead callout: claude:sonnet + one-liner is the clear #1 across all criteria Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Creates packages/evals/ as the shared eval harness package: - package.json with subpath exports for types, runner, harness, scoring, and scenarios - tsconfig.json matching the monorepo TypeScript config - src/types.ts (canonical copy of EvalScenario/ScenarioResult/etc. types) - src/harness.ts (BrokerHarness interface for downstream type-checking) - src/index.ts (barrel re-export) Adds build:evals to root build:core chain (after harness-driver, before cli). Updates publish.yml: converts publish-harnesses to a matrix job covering both harnesses and evals — same dependency pattern (needs publish-packages so harness-driver is on the registry before evals can be installed). Also adds specs/agent-relay-evals-package.md scoping the full migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three new eval scenarios that test lead role discipline — the gap that existing s03/s04 don't cover (they only check spawn mechanics, not whether the lead actually delegates vs. self-implements): l01 — Unconditional delegation: coding task with no "don't do it yourself" hint. Good lead spawns; bad lead writes code itself. l02 — Temptation resistance: task explicitly says "you'd be faster doing it yourself". Lead must still delegate. l03 — Post-delegation synthesis: after workers report DONE, lead must send a synthesis message referencing their results. New scoring fields on ScenarioResult: - selfImplemented: lead's PTY stream contained code blocks / impl tool calls - synthesisOk: lead sent a synthesis message after receiving DONE All three run across bare/one-liner/brief/skill onboarding variants (12 total). Runner: --group=lead-delegation, npm scripts eval:lead-delegation, eval:lead-delegation:claude-models, eval:lead-delegation:all-harnesses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SpawnAgentRequest has no top-level model field; model must be passed through metadata so the broker can forward --model to the launched CLI. Also auto-format prettier fixes in auto/ and local-agent.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Package was pinned at 8.3.1 causing npm ci to fail — lock file was missing @agent-relay/evals and several stale version entries. Updated to match monorepo version 8.6.0 and regenerated package-lock.json. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eval artifacts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s eval contentIncludes does literal substring matching; rk*live* was treated as the literal string, not a glob. The mock key rk_live_ws_eval_create matches rk_live_ as a prefix check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. composer.test.ts: check for role === 'reducer' not 'Synthesiser'; the AgentRole type uses 'reducer' for aggregator workers. 2. packages/cli/package.json: bump all @agent-relay/* workspace deps from 8.2.0 → 8.7.2 so npm ci resolves them to the local workspace symlinks instead of installing stale registry copies. This was shadowing the local @agent-relay/cloud build and hiding cloudWorkerStateDir. 3. package-lock.json: regenerated to remove the nested packages/cli/node_modules/@agent-relay/{cloud,config,sdk,...}@8.2.0 entries that were installed from npm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The prettier bot ran on the pre-fix commit and its auto-format push overwrote the rk*live* → rk_live_ fix. Reapplied: contentIncludes does literal substring matching, not glob, so rk_live_ is correct. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the glob-style rk*live* (reverted by prettier bot) with the full exact key rk_live_ws_eval_create that the mock actually returns. This is a precise substring check, passes prettier with no diff, and won't be overwritten by the prettier auto-format workflow. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…odules When CLI's @agent-relay/* deps were bumped from 8.2.0 → 8.7.2, the old @agent-relay/broker-*@8.2.0 optional entries left by the previous npm install were not cleaned by --package-lock-only. These shadowed the root workspace broker binaries (8.7.2), causing 'unexpected argument --instance-name' from the old CLI init interface. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

khaliqgant requested a review from willwashburn as a code owner June 12, 2026 08:27

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

agent-relay-code Bot added a commit that referenced this pull request Jun 12, 2026

chore: apply pr-reviewer fixes for #1109

55cea0f

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

khaliqgant pushed a commit that referenced this pull request Jun 13, 2026

chore: apply pr-reviewer fixes for #1109

bdfea1d

khaliqgant force-pushed the feature/combined-evals branch 2 times, most recently from 7a71316 to 565d110 Compare June 14, 2026 06:19

Proactive Runtime Bot and others added 18 commits June 14, 2026 08:20

docs(evals): add master plan for relay SDK eval suite

8387f4e

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(evals): add op-arg reference + executor result contract

12f5969

Folds in the op argument and check-key conventions surfaced by the first authoring round so the relaunched workers start from a concrete spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(evals): mark op vocabulary LOCKED per W1 confirmation

f822f82

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add messaging relay eval suites

463f964

Add session listener facade eval suites

55f6a09

Add channel workspace agent directory eval suites

6b2fa83

Add delivery actions eval suites

c08db33

Relax W3 eval content expectations

b0ecdff

Align W3 eval duplicate error codes

baaca7e

Relax delivery action eval expectations

878a12f

Align W5 eval cases with runner output

ca01c5a

Align workspace eval assertions with runner output

52e865b

Mark capability eval hooks pending

6a4d96b

Relax unknown agent removal eval assertion

9d86980

feat(evals): add relay eval harness

194e9ac

chore: apply pr-reviewer fixes for #1092

23a85b2

Add local workflow run commands

94c94ea

Avoid log read file race

81d5c9c

Proactive Runtime Bot and others added 17 commits June 14, 2026 08:20

docs(evals): cursor-agent confirmed 0% — not viable relay worker

eee5e93

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

style: auto-format Rust code with cargo fmt

f9733d7

khaliqgant force-pushed the feature/combined-evals branch from 565d110 to 71dbb19 Compare June 14, 2026 06:20

Proactive Runtime Bot and others added 9 commits June 14, 2026 08:23

style: auto-format all files with prettier; add broker gitignore for …

20805cd

…eval artifacts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

style: auto-format with Prettier

a69fcb1

style: auto-format with Prettier

bedddcc

khaliqgant merged commit 87c05f1 into main Jun 14, 2026
47 checks passed

khaliqgant deleted the feature/combined-evals branch June 14, 2026 08:13

coderabbitai Bot mentioned this pull request Jun 14, 2026

fix(broker): apply eval-derived relay worker guidance #1126

Merged

	return `<tr><td>${link}</td><td>${pct(m.messageSentRate)}</td><td class="${m.phantomCount ? '' : ''}">${pct(m.phantomRate)} (${m.phantomCount})</td><td>${pct(m.protocolAdherence)}</td><td>${pct(m.deliverySuccessRate)}</td><td>${m.wrongChannelReplies}</td><td>${m.scenariosPassed}/${m.scenariosTotal}</td></tr>`;
	return "<tr><td>" + link + "</td><td>" + pct(m.messageSentRate) + "</td><td>" + pct(m.phantomRate) + " (" + m.phantomCount + ")</td><td>" + pct(m.protocolAdherence) + "</td><td>" + pct(m.deliverySuccessRate) + "</td><td>" + m.wrongChannelReplies + "</td><td>" + m.scenariosPassed + "/" + m.scenariosTotal + "</td></tr>";

	re: /\b(?:i'?ll\|i will\|i'?m going to\|going to\|let me\|i can\|i should)\s+(tell\|message\|dm\|notify\|reply to\|respond to\|post\|send\|report(?:\s+to)?\|relay(?:\s+to)?\|forward(?:\s+to)?\|ping\|update\|let)\s+(?:to\s+\|the\s+)?([a-z0-9_#-]+)?/gi,
	re: /\b(?:i'?ll\|i will\|i'?m going to\|going to\|let me\|i can\|i should)\s+(tell\|message\|dm\|notify\|reply to\|respond to\|post\|send\|report(?:\s+to)?\|relay(?:\s+to)?\|forward(?:\s+to)?\|ping\|update\|inform)\s+(?:to\s+\|the\s+)?([a-z0-9_#-]+)?/gi,

		const sends = inboundFrom(events, worker).filter((e) => !isChannelTarget(e.target));
		const ackIdx = sends.findIndex((e) => /^\s*ack\b/i.test(e.body));

	const ANSI_PATTERN = /\x1b\[[0-9;][a-zA-Z]\|\x1b\][^\x07]\x07\|\x1b\[[?][0-9]*[a-z]/g;
	const ANSI_PATTERN =
	/\x1b\[[0-9;][a-zA-Z]\|\x1b\][\s\S]?(?:\x07\|\x1b\\)\|\x1b\[[?][0-9]*[a-z]/g;

Conversation

khaliqgant commented Jun 12, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Two-layer eval strategy

Running the evals

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

agent-relay-code Bot commented Jun 12, 2026

Addressed comments

Advisory Notes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

khaliqgant commented Jun 12, 2026 •

edited by cubic-dev-ai Bot

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading