feat(evals): combined offline SDK harness + live behavioral eval suite#1109
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive evaluation suite for the Relay SDK and agent messaging, including offline deterministic test suites, an in-memory executor, and live integration evals to measure message delivery, phantom rates, and protocol adherence. The code review identified several key areas for improvement, including resolving Windows path compatibility issues in the reporting and toolcheck scripts, fixing a workspace key collision bug in the in-memory executor, removing a redundant ternary expression in the HTML reporter, and making the regex patterns for phantom message detection and ACK/DONE protocol scoring more robust against false positives and markdown formatting.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| import { execSync } from 'node:child_process'; | ||
| import fs from 'node:fs'; | ||
| import path from 'node:path'; | ||
|
|
||
| import type { EvalReport, MatrixReport, MetricSet } from '../types.js'; | ||
| import { renderMatrixHtml, renderReportHtml } from './html.js'; | ||
|
|
||
| /** Directory where reports are written (gitignored). */ | ||
| export function reportsDir(): string { | ||
| // Compiled location: dist/evals/report/ → scenario source lives under evals/. | ||
| return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports'); | ||
| } |
There was a problem hiding this comment.
Using new URL(import.meta.url).pathname directly to resolve paths is a known compatibility issue on Windows platforms because it leaves a leading slash and does not correctly handle drive letters or backslashes. Use fileURLToPath from node:url instead to ensure cross-platform compatibility.
| import { execSync } from 'node:child_process'; | |
| import fs from 'node:fs'; | |
| import path from 'node:path'; | |
| import type { EvalReport, MatrixReport, MetricSet } from '../types.js'; | |
| import { renderMatrixHtml, renderReportHtml } from './html.js'; | |
| /** Directory where reports are written (gitignored). */ | |
| export function reportsDir(): string { | |
| // Compiled location: dist/evals/report/ → scenario source lives under evals/. | |
| return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports'); | |
| } | |
| import { execSync } from 'node:child_process'; | |
| import fs from 'node:fs'; | |
| import path from 'node:path'; | |
| import { fileURLToPath } from 'node:url'; | |
| import type { EvalReport, MatrixReport, MetricSet } from '../types.js'; | |
| import { renderMatrixHtml, renderReportHtml } from './html.js'; | |
| /** Directory where reports are written (gitignored). */ | |
| export function reportsDir(): string { | |
| // Compiled location: dist/evals/report/ → scenario source lives under evals/. | |
| return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..', '..', '..', 'evals-reports'); | |
| } |
| import fs from 'node:fs'; | ||
| import path from 'node:path'; | ||
|
|
||
| import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js'; | ||
|
|
||
| function repoRoot(): string { | ||
| // dist/evals/toolcheck-cli.js → up 5 to repo root. | ||
| return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..'); | ||
| } |
There was a problem hiding this comment.
Using new URL(import.meta.url).pathname directly to resolve paths is a known compatibility issue on Windows platforms because it leaves a leading slash and does not correctly handle drive letters or backslashes. Use fileURLToPath from node:url instead to ensure cross-platform compatibility.
| import fs from 'node:fs'; | |
| import path from 'node:path'; | |
| import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js'; | |
| function repoRoot(): string { | |
| // dist/evals/toolcheck-cli.js → up 5 to repo root. | |
| return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..'); | |
| } | |
| import fs from 'node:fs'; | |
| import path from 'node:path'; | |
| import { fileURLToPath } from 'node:url'; | |
| import { checkToolNames, extractToolRefs } from './scoring/toolcheck.js'; | |
| function repoRoot(): string { | |
| // dist/evals/toolcheck-cli.js → up 5 to repo root. | |
| return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '../../../../..'); | |
| } |
| key: | ||
| operation.workspaceKey ?? | ||
| operation.key ?? | ||
| nextWorkspaceKey(operation.id ?? operation.name ?? state.workspaces.size + 1), |
There was a problem hiding this comment.
When operation.id is not provided but operation.name is, creating multiple workspaces with duplicate display names will result in duplicate keys (e.g., rk_live_duplicate_display_name). This causes the second workspace to silently overwrite the first one in the state.workspaces Map. Appending the workspace size or index to the key when generating it from the display name ensures uniqueness.
| key: | |
| operation.workspaceKey ?? | |
| operation.key ?? | |
| nextWorkspaceKey(operation.id ?? operation.name ?? state.workspaces.size + 1), | |
| key: | |
| operation.workspaceKey ?? | |
| operation.key ?? | |
| nextWorkspaceKey(operation.id ?? (operation.name ? operation.name + "_" + (state.workspaces.size + 1) : undefined) ?? state.workspaces.size + 1), |
| const rows = Object.entries(matrix.harnesses) | ||
| .map(([h, m]) => { | ||
| const link = links[h] ? `<a href="${esc(links[h])}">${esc(h)}</a>` : esc(h); | ||
| return `<tr><td>${link}</td><td>${pct(m.messageSentRate)}</td><td class="${m.phantomCount ? '' : ''}">${pct(m.phantomRate)} (${m.phantomCount})</td><td>${pct(m.protocolAdherence)}</td><td>${pct(m.deliverySuccessRate)}</td><td>${m.wrongChannelReplies}</td><td>${m.scenariosPassed}/${m.scenariosTotal}</td></tr>`; |
There was a problem hiding this comment.
The class attribute contains a redundant ternary expression class="${m.phantomCount ? '' : ''}" which always evaluates to an empty string. This can be safely removed to clean up the generated HTML.
| return `<tr><td>${link}</td><td>${pct(m.messageSentRate)}</td><td class="${m.phantomCount ? '' : ''}">${pct(m.phantomRate)} (${m.phantomCount})</td><td>${pct(m.protocolAdherence)}</td><td>${pct(m.deliverySuccessRate)}</td><td>${m.wrongChannelReplies}</td><td>${m.scenariosPassed}/${m.scenariosTotal}</td></tr>`; | |
| return "<tr><td>" + link + "</td><td>" + pct(m.messageSentRate) + "</td><td>" + pct(m.phantomRate) + " (" + m.phantomCount + ")</td><td>" + pct(m.protocolAdherence) + "</td><td>" + pct(m.deliverySuccessRate) + "</td><td>" + m.wrongChannelReplies + "</td><td>" + m.scenariosPassed + "/" + m.scenariosTotal + "</td></tr>"; |
| const INTENT_PATTERNS: IntentPattern[] = [ | ||
| // "I'll / I will / I'm going to / going to / let me / I can" + comm verb + optional target | ||
| { | ||
| re: /\b(?:i'?ll|i will|i'?m going to|going to|let me|i can|i should)\s+(tell|message|dm|notify|reply to|respond to|post|send|report(?:\s+to)?|relay(?:\s+to)?|forward(?:\s+to)?|ping|update|let)\s+(?:to\s+|the\s+)?([a-z0-9_#-]+)?/gi, |
There was a problem hiding this comment.
Using let as a standalone communication verb is highly prone to false positives in programming/agent contexts (e.g., 'I'll let the process run', 'I'll let the script finish'). Replacing it with inform avoids these false positives while adding support for a very common communication verb.
| re: /\b(?:i'?ll|i will|i'?m going to|going to|let me|i can|i should)\s+(tell|message|dm|notify|reply to|respond to|post|send|report(?:\s+to)?|relay(?:\s+to)?|forward(?:\s+to)?|ping|update|let)\s+(?:to\s+|the\s+)?([a-z0-9_#-]+)?/gi, | |
| re: /\b(?:i'?ll|i will|i'?m going to|going to|let me|i can|i should)\s+(tell|message|dm|notify|reply to|respond to|post|send|report(?:\s+to)?|relay(?:\s+to)?|forward(?:\s+to)?|ping|update|inform)\s+(?:to\s+|the\s+)?([a-z0-9_#-]+)?/gi, |
| const sends = inboundFrom(events, worker).filter((e) => !isChannelTarget(e.target)); | ||
| const ackIdx = sends.findIndex((e) => /^\s*ack\b/i.test(e.body)); |
There was a problem hiding this comment.
The regexes for matching ack and done messages are brittle and will fail if the agent formats the status message with markdown bolding or italics (e.g., **ACK**: ... or *DONE*). Allowing optional leading markdown asterisks or underscores makes the protocol scoring much more robust.
| const sends = inboundFrom(events, worker).filter((e) => !isChannelTarget(e.target)); | |
| const ackIdx = sends.findIndex((e) => /^\s*ack\b/i.test(e.body)); | |
| const ackIdx = sends.findIndex((e) => /^\s*[\*_]*ack\b/i.test(e.body)); | |
| const doneIdx = sends.findIndex((e) => /^\s*[\*_]*done\b/i.test(e.body)); |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an offline Relay eval harness (cases, compiler, in-memory executor, expectation checks), CI/run/reporting tooling, many generated suite specs and broker real-CLI scenarios with scoring, auto-routing CLI helpers, fleet-delivery spec, and related tests/config updates. ChangesRelay eval harness
Broker eval framework
Auto-routing
Fleet delivery & snippets
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/broker/utils/broker-harness.ts (1)
127-131:⚠️ Potential issue | 🟠 Major | ⚡ Quick winIterate over a snapshot when dispatching broker events.
waitForEvent()listeners remove themselves fromthis.eventListeners, butstart()dispatches over the live array. If one listener splices itself out during the loop, the next listener can be skipped and miss the event entirely.Suggested fix
this.unsubEvent = this.client.onEvent((event: BrokerEvent) => { this.events.push(event); - for (const listener of this.eventListeners) { + for (const listener of [...this.eventListeners]) { listener(event); } });#!/bin/bash node <<'NODE' const listeners = []; const seen = []; const a = () => { seen.push('a'); listeners.splice(listeners.indexOf(a), 1); }; const b = () => seen.push('b'); listeners.push(a, b); for (const listener of listeners) listener(); console.log(seen.join(',')); NODEExpected result: the script prints only
a, demonstrating how mutating the live array during iteration skips the next listener.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/utils/broker-harness.ts` around lines 127 - 131, The event dispatch iterates the live this.eventListeners array so listeners that remove themselves (e.g., waitForEvent) can cause the next listener to be skipped; change the dispatch in the this.client.onEvent callback (the block that pushes into this.events and then loops over this.eventListeners) to iterate a snapshot copy of the listeners (for example by using a shallow copy via slice() or spread) so removals during iteration don't affect which listeners are invoked.
🧹 Nitpick comments (2)
evals/suites/auth-errors/cases.md (1)
61-74: 💤 Low valueRedundant must/mustNot in Deterministic Checks section.
The "detect-code-case-insensitive" case (and similarly "detect-cause-chain" and "ignore-tool-result-without-marker") includes
must/mustNotin the Deterministic Checks block (lines 65–68), then repeats them in separate Must/Must Not sections (lines 70–74). The compiler uses the final Must/Must Not sections. Remove the inline must/mustNot from Deterministic Checks to clarify the intended structure.♻️ Simplify the Deterministic Checks section
### Deterministic Checks ok: true contentIncludes: - true -must: -- Normalize error codes before comparison. -mustNot: -- Treat casing as significant. ### Must(Apply the same pattern to lines 174–181 and lines 295–302.)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/suites/auth-errors/cases.md` around lines 61 - 74, Remove the redundant inline must/mustNot entries from the "Deterministic Checks" block in the cases "detect-code-case-insensitive", "detect-cause-chain", and "ignore-tool-result-without-marker" so that those checks only appear in the final "Must" / "Must Not" sections; specifically, edit the Deterministic Checks section to keep only deterministic flags (ok, contentIncludes) and delete the inline must/mustNot lines, and apply the same cleanup for the other duplicate occurrences in the file where Deterministic Checks repeats the final Must/Must Not entries.tests/integration/broker/evals/toolcheck-cli.ts (1)
17-20: ⚡ Quick winUse
fileURLToPath(import.meta.url)before passing ESM URLs intopath.*.Both locations derive filesystem paths from
new URL(import.meta.url).pathname. In Node, that leaves percent-encoding intact and turns Windows paths into/C:/..., so these helpers can fail to locate the repo root or report directory on Windows and on paths containing spaces. Convert the file URL withfileURLToPath()first, then callpath.dirname().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/toolcheck-cli.ts` around lines 17 - 20, The repoRoot() helper uses path.dirname(new URL(import.meta.url).pathname) which leaves percent-encoding and Windows drive slashes intact; replace that by converting the ESM file URL to a filesystem path first using fileURLToPath(import.meta.url) and then call path.dirname on the result (update the repoRoot function and add the fileURLToPath import from 'url'). Ensure the new logic still resolves the same upward path (keep the '../../../../..' hierarchy) but uses path.resolve(path.dirname(filePath), '../../../../..') where filePath is fileURLToPath(import.meta.url).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/relay-evals.yml:
- Around line 45-46: Workflow uses unpinned action tags and leaves checkout
credentials persisted; update the checkout, setup-node, and upload-artifact
steps by replacing the tag versions (actions/checkout@v4, actions/setup-node@v4,
actions/upload-artifact@v4) with their corresponding full commit SHAs, and add
the checkout input persist-credentials: false under the Checkout step (the step
that references actions/checkout) to prevent the runner token from being exposed
to later steps—ensure you update the three action references consistently to
fixed SHAs.
In `@scripts/evals/relay-executor.mjs`:
- Around line 28-38: The executor currently only iterates
normalizeOperations(testCase.input?.operation ?? testCase.input?.operations) and
ignores cases authored with input.message; update the loop in relay-executor.mjs
to include testCase.input.message (or convert message to an operation) so
Message-authored evals are executed—e.g., feed normalizeOperations with
testCase.input?.message when present or merge message into the operations array
before calling executeOperation, locating the use of normalizeOperations and
executeOperation to implement the conversion/merge.
- Around line 638-650: The send handler currently calls upsertChannel, which
silently creates or resurrects channels; change it to look up the channel
without mutating state (e.g., use the existing channel lookup logic instead of
upsertChannel in send) and validate lifecycle: if the channel is missing, throw
or return an error, and if the channel is archived, reject/post an error instead
of proceeding; only after confirming a non-archived channel exists should you
call createMessage and emit('messageCreated'). Refer to the send function,
replace upsertChannel usage, and add checks for channel existence and
channel.archived before createMessage/emit.
In `@tests/integration/broker/evals/runner.ts`:
- Around line 215-217: The current code uses flags.baseline with
compareReports(readReport(flags.baseline), report) for every harness in the run,
causing cross-harness comparisons; change behavior so baseline must either be a
single-harness file and the run contains exactly one harness (validate
flags.baseline is only allowed when the harness list length === 1 and throw/exit
with a clear message if not), or treat flags.baseline as a matrix and load a map
of harness→baseline (e.g. parse readReport(flags.baseline) into a per-harness
lookup and call compareReports(baselineByHarness[harnessName], report) for each
harness, erroring if a harness has no matching baseline). Ensure you update the
code paths that call compareReports and readReport and reference the loop
variable for the current harness when selecting the baseline.
- Around line 57-60: selectScenarios currently drops unknown IDs by mapping
flags.scenarioIds through scenarioById and filtering falsy results; instead
detect any missing IDs and fail fast. Modify selectScenarios to compute the list
of requested IDs (flags.scenarioIds), map them to scenarioById results, collect
the IDs that produced no EvalScenario, and if any are missing throw an error (or
process exit) listing the unknown IDs so the invocation fails before running any
scenarios; otherwise return the resolved EvalScenario[] as before.
- Around line 113-138: After the scenario loop, detect when no scenarios
actually ran (results.length === 0) due to all scenarios being filtered by
scenario.harnessFilter for the current cli and treat that as a skip/error
instead of emitting a normal report: add a guard right before building/returning
the report (after the loop that calls runOnce and mergeRepeats) that either
throws a descriptive Error(`no scenarios ran for harness ${cli}`) or returns a
report object marked as skipped (e.g., add skipped: true and empty metrics) so
downstream code doesn't treat an empty scenarios array as a healthy run;
reference results, scenarios, scenario.harnessFilter, cli, runOnce,
mergeRepeats, aggregateMetrics and SCHEMA_VERSION when implementing the check
and handling.
In `@tests/integration/broker/evals/scenarios/03-ack-done.ts`:
- Around line 31-47: The test spawns agents with harness.spawnAgent but only
calls harness.releaseAgent on the happy path, risking leaked agents if an
awaited step throws; wrap the scenario body (the sequence from
harness.spawnAgent through waits and scoring in 03-ack-done.ts and
04-relay-chain.ts) in a try/finally and move all harness.releaseAgent(worker)
(and any other spawned agent releases) into the finally block so release always
runs even on errors.
In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts`:
- Around line 25-66: The run function in each scenario (e.g., the async run in
r01-incidental-report.ts) currently calls harness.releaseAgent(...) only on the
success path, causing leaked agents; wrap the entire run body in a try/finally
where you call harness.spawnAgent(...) / harness.sendMessage(...) / waits in the
try block and always call harness.releaseAgent(worker) in the finally block
(catching errors from releaseAgent if needed) so agents are released regardless
of failures; apply the same pattern to the other scenario files
(r02-forget-to-report.ts, r03-proactive-handoff.ts, r04-channel-vs-dm.ts) to
prevent cross-test flakiness.
In `@tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts`:
- Around line 43-47: The pass check allows channel-targeted handoffs because
sentTo(events, author, reviewer) treats normalized channel names like
"`#reviewer-`*" as a match; update the pass criteria so it only counts a direct DM
handoff to the reviewer. Replace or augment the handedOff check (the sentTo
call) with a check that the event(s) from author target the reviewer actor
directly (e.g., compare explicit recipient ID or channelType === 'dm' /
recipient === reviewer.id) and reject matches that are channel names; keep the
other conditions (base.phantoms.length === 0 && base.deliveryOk) unchanged.
In `@tests/integration/broker/evals/scoring/protocol.ts`:
- Around line 115-121: scoreRelayChain currently treats payloadIntact true if
any RelayInbound to finalChannel contains the payload; restrict this to only
count posts from the expected relay sender. Update the finalPost filter (where
RelayInbound events are selected) to also check that e.sender equals the
expected relay identity (e.g., compare against the last relay/expected relay
variable used in scoreRelayChain such as lastRelay.sender or
expectedRelaySender) while keeping the existing normalizeChannel and payload
contains checks.
In `@tests/integration/broker/evals/scoring/stream-clean.ts`:
- Line 15: ANSI_PATTERN currently only matches OSC sequences terminated by BEL
(\x07) and thus leaves ST-terminated OSCs (ESC backslash) intact; update the
regex constant ANSI_PATTERN to also match OSC sequences terminated by the String
Terminator (ESC backslash) so stripAnsi() removes OSC 8 hyperlinks and similar
sequences—specifically extend the pattern that matches OSC (\x1b\][^\x07]*\x07)
to also accept sequences ending with ESC "\" (escape-backslash), ensuring the
pattern covers both \x07 and ESC "\" terminators.
In `@tests/integration/broker/evals/selftest.ts`:
- Around line 59-68: The failure branches in
tests/integration/broker/evals/selftest.ts use process.exit(1) which aborts
before the finally block can run; replace those calls inside the try (the checks
for evalWouldPass and score.sent) with throwing an Error (e.g., throw new
Error(...)) so the exception propagates and allows the finally block to run
harness.releaseAgent() / harness.stop(); ensure the thrown Error messages
contain the same descriptive text currently passed to console.error so test
harness teardown still executes and the process will exit non‑zero after
cleanup.
---
Outside diff comments:
In `@tests/integration/broker/utils/broker-harness.ts`:
- Around line 127-131: The event dispatch iterates the live this.eventListeners
array so listeners that remove themselves (e.g., waitForEvent) can cause the
next listener to be skipped; change the dispatch in the this.client.onEvent
callback (the block that pushes into this.events and then loops over
this.eventListeners) to iterate a snapshot copy of the listeners (for example by
using a shallow copy via slice() or spread) so removals during iteration don't
affect which listeners are invoked.
---
Nitpick comments:
In `@evals/suites/auth-errors/cases.md`:
- Around line 61-74: Remove the redundant inline must/mustNot entries from the
"Deterministic Checks" block in the cases "detect-code-case-insensitive",
"detect-cause-chain", and "ignore-tool-result-without-marker" so that those
checks only appear in the final "Must" / "Must Not" sections; specifically, edit
the Deterministic Checks section to keep only deterministic flags (ok,
contentIncludes) and delete the inline must/mustNot lines, and apply the same
cleanup for the other duplicate occurrences in the file where Deterministic
Checks repeats the final Must/Must Not entries.
In `@tests/integration/broker/evals/toolcheck-cli.ts`:
- Around line 17-20: The repoRoot() helper uses path.dirname(new
URL(import.meta.url).pathname) which leaves percent-encoding and Windows drive
slashes intact; replace that by converting the ESM file URL to a filesystem path
first using fileURLToPath(import.meta.url) and then call path.dirname on the
result (update the repoRoot function and add the fileURLToPath import from
'url'). Ensure the new logic still resolves the same upward path (keep the
'../../../../..' hierarchy) but uses path.resolve(path.dirname(filePath),
'../../../../..') where filePath is fileURLToPath(import.meta.url).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: db4d0ac4-695b-462a-9cfb-8041a8dd3b89
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (104)
.agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json.github/workflows/relay-evals.yml.gitignore.trajectories/completed/2026-06/traj_o61z0ze6kvla/summary.md.trajectories/completed/2026-06/traj_o61z0ze6kvla/trajectory.jsonCHANGELOG.mdevals/PLAN.mdevals/README.mdevals/suites/action-errors/cases.jsonlevals/suites/action-errors/cases.mdevals/suites/action-errors/rubric.mdevals/suites/action-schema/cases.jsonlevals/suites/action-schema/cases.mdevals/suites/action-schema/rubric.mdevals/suites/actions/cases.jsonlevals/suites/actions/cases.mdevals/suites/actions/rubric.mdevals/suites/agent-directory/cases.jsonlevals/suites/agent-directory/cases.mdevals/suites/agent-directory/rubric.mdevals/suites/auth-errors/cases.jsonlevals/suites/auth-errors/cases.mdevals/suites/auth-errors/rubric.mdevals/suites/capabilities/cases.jsonlevals/suites/capabilities/cases.mdevals/suites/capabilities/rubric.mdevals/suites/channels/cases.jsonlevals/suites/channels/cases.mdevals/suites/channels/rubric.mdevals/suites/delivery-modes/cases.jsonlevals/suites/delivery-modes/cases.mdevals/suites/delivery-modes/rubric.mdevals/suites/facade/cases.jsonlevals/suites/facade/cases.mdevals/suites/facade/rubric.mdevals/suites/listeners/cases.jsonlevals/suites/listeners/cases.mdevals/suites/listeners/rubric.mdevals/suites/messaging/cases.jsonlevals/suites/messaging/cases.mdevals/suites/messaging/rubric.mdevals/suites/protocol-framing/cases.jsonlevals/suites/protocol-framing/cases.mdevals/suites/protocol-framing/rubric.mdevals/suites/reactions/cases.jsonlevals/suites/reactions/cases.mdevals/suites/reactions/rubric.mdevals/suites/read-receipts/cases.jsonlevals/suites/read-receipts/cases.mdevals/suites/read-receipts/rubric.mdevals/suites/search/cases.jsonlevals/suites/search/cases.mdevals/suites/search/rubric.mdevals/suites/session/cases.jsonlevals/suites/session/cases.mdevals/suites/session/rubric.mdevals/suites/threads/cases.jsonlevals/suites/threads/cases.mdevals/suites/threads/rubric.mdevals/suites/workspaces/cases.jsonlevals/suites/workspaces/cases.mdevals/suites/workspaces/rubric.mdpackage.jsonscripts/evals/ci-summary.mjsscripts/evals/compile-cases.mjsscripts/evals/relay-checks.mjsscripts/evals/relay-executor.mjsscripts/evals/run-relay-evals.mjsspecs/fleet-delivery.mdtests/integration/broker/evals/README.mdtests/integration/broker/evals/eval.test.tstests/integration/broker/evals/report/html.tstests/integration/broker/evals/report/render-cli.tstests/integration/broker/evals/report/write.tstests/integration/broker/evals/runner.tstests/integration/broker/evals/scenarios/01-dm-roundtrip.tstests/integration/broker/evals/scenarios/02-channel-reply.tstests/integration/broker/evals/scenarios/03-ack-done.tstests/integration/broker/evals/scenarios/04-relay-chain.tstests/integration/broker/evals/scenarios/helpers.tstests/integration/broker/evals/scenarios/index.tstests/integration/broker/evals/scenarios/r01-incidental-report.tstests/integration/broker/evals/scenarios/r02-forget-to-report.tstests/integration/broker/evals/scenarios/r03-proactive-handoff.tstests/integration/broker/evals/scenarios/r04-channel-vs-dm.tstests/integration/broker/evals/scoring/base.tstests/integration/broker/evals/scoring/fixtures.tstests/integration/broker/evals/scoring/metrics.tstests/integration/broker/evals/scoring/metrics.unit.test.tstests/integration/broker/evals/scoring/phantom.tstests/integration/broker/evals/scoring/phantom.unit.test.tstests/integration/broker/evals/scoring/protocol.tstests/integration/broker/evals/scoring/protocol.unit.test.tstests/integration/broker/evals/scoring/stream-clean.tstests/integration/broker/evals/scoring/toolcheck.tstests/integration/broker/evals/scoring/toolcheck.unit.test.tstests/integration/broker/evals/selftest.tstests/integration/broker/evals/toolcheck-cli.tstests/integration/broker/evals/tsconfig.jsontests/integration/broker/evals/types.tstests/integration/broker/tsconfig.jsontests/integration/broker/utils/assert-helpers.tstests/integration/broker/utils/broker-harness.tsvitest.config.ts
| - name: Checkout | ||
| uses: actions/checkout@v4 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify unpinned action tags and missing persist-credentials in relay-evals workflow.
rg -nP 'uses:\s*actions\/[A-Za-z0-9._-]+@v[0-9]+' .github/workflows/relay-evals.yml
rg -n 'persist-credentials:\s*false' .github/workflows/relay-evals.ymlRepository: AgentWorkforce/relay
Length of output: 184
Harden relay-evals workflow: pin action SHAs and disable checkout credential persistence.
.github/workflows/relay-evals.ymluses unpinned action tags (actions/checkout@v4,actions/setup-node@v4,actions/upload-artifact@v4) at lines 46/49/70—pin each to full commit SHAs.- The
Checkoutstep (line 46) is missingwith: persist-credentials: false, which can unnecessarily expose the runner token to later steps.
Suggested hardening patch
- name: Checkout
- uses: actions/checkout@v4
+ uses: actions/checkout@<full_commit_sha_for_v4>
+ with:
+ persist-credentials: false
- name: Setup Node.js
- uses: actions/setup-node@v4
+ uses: actions/setup-node@<full_commit_sha_for_v4>
with:
node-version: ${{ env.NODE_VERSION }}
cache: npm
cache-dependency-path: package-lock.json
- name: Upload eval artifacts
if: always()
- uses: actions/upload-artifact@v4
+ uses: actions/upload-artifact@<full_commit_sha_for_v4>
with:
name: relay-eval-run
path: .relay/evals/runs/
retention-days: 14
if-no-files-found: ignore🧰 Tools
🪛 zizmor (1.25.2)
[warning] 45-46: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 46-46: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/relay-evals.yml around lines 45 - 46, Workflow uses
unpinned action tags and leaves checkout credentials persisted; update the
checkout, setup-node, and upload-artifact steps by replacing the tag versions
(actions/checkout@v4, actions/setup-node@v4, actions/upload-artifact@v4) with
their corresponding full commit SHAs, and add the checkout input
persist-credentials: false under the Checkout step (the step that references
actions/checkout) to prevent the runner token from being exposed to later
steps—ensure you update the three action references consistently to fixed SHAs.
Source: Linters/SAST tools
| for (const operation of normalizeOperations(testCase.input?.operation ?? testCase.input?.operations)) { | ||
| try { | ||
| await executeOperation(state, operation); | ||
| } catch (error) { | ||
| state.observed.error = normalizeError(error); | ||
| state.contentLines.push( | ||
| `error ${operation.op ?? operation.type}: ${state.observed.error.code} ${state.observed.error.message}` | ||
| ); | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
Message-authored eval cases never execute.
compile-cases.mjs accepts cases with input.message, but this executor only walks input.operation(s). Any suite authored with ### Message will no-op and return an empty harness snapshot without ever exercising the SDK path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/evals/relay-executor.mjs` around lines 28 - 38, The executor
currently only iterates normalizeOperations(testCase.input?.operation ??
testCase.input?.operations) and ignores cases authored with input.message;
update the loop in relay-executor.mjs to include testCase.input.message (or
convert message to an operation) so Message-authored evals are executed—e.g.,
feed normalizeOperations with testCase.input?.message when present or merge
message into the operations array before calling executeOperation, locating the
use of normalizeOperations and executeOperation to implement the
conversion/merge.
| send: async (input) => { | ||
| const channel = upsertChannel(state, input.channel, { members: [actorName] }); | ||
| const message = createMessage(state, { | ||
| id: input.__id, | ||
| kind: 'channel', | ||
| from: actorName, | ||
| text: input.text, | ||
| channel: channel.name, | ||
| attachments: input.attachments, | ||
| mode: input.mode, | ||
| }); | ||
| emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message }); | ||
| return message; |
There was a problem hiding this comment.
post_message bypasses channel lifecycle checks.
This path uses upsertChannel(...), so posting to a missing channel silently creates it, and posting to an archived channel also succeeds. That breaks the create_channel / archive_channel contract the rest of this harness is modeling.
Suggested fix
send: async (input) => {
- const channel = upsertChannel(state, input.channel, { members: [actorName] });
+ const channel = requireChannel(state, input.channel);
+ if (channel.archived) {
+ throw codedError('channel_archived', `channel is archived: ${channel.name}`);
+ }
+ channel.members.add(actorName);
const message = createMessage(state, {
id: input.__id,
kind: 'channel',
from: actorName,
text: input.text,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| send: async (input) => { | |
| const channel = upsertChannel(state, input.channel, { members: [actorName] }); | |
| const message = createMessage(state, { | |
| id: input.__id, | |
| kind: 'channel', | |
| from: actorName, | |
| text: input.text, | |
| channel: channel.name, | |
| attachments: input.attachments, | |
| mode: input.mode, | |
| }); | |
| emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message }); | |
| return message; | |
| send: async (input) => { | |
| const channel = requireChannel(state, input.channel); | |
| if (channel.archived) { | |
| throw codedError('channel_archived', `channel is archived: ${channel.name}`); | |
| } | |
| channel.members.add(actorName); | |
| const message = createMessage(state, { | |
| id: input.__id, | |
| kind: 'channel', | |
| from: actorName, | |
| text: input.text, | |
| channel: channel.name, | |
| attachments: input.attachments, | |
| mode: input.mode, | |
| }); | |
| emit(state, 'messageCreated', { type: 'messageCreated', channel: channel.name, message }); | |
| return message; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/evals/relay-executor.mjs` around lines 638 - 650, The send handler
currently calls upsertChannel, which silently creates or resurrects channels;
change it to look up the channel without mutating state (e.g., use the existing
channel lookup logic instead of upsertChannel in send) and validate lifecycle:
if the channel is missing, throw or return an error, and if the channel is
archived, reject/post an error instead of proceeding; only after confirming a
non-archived channel exists should you call createMessage and
emit('messageCreated'). Refer to the send function, replace upsertChannel usage,
and add checks for channel existence and channel.archived before
createMessage/emit.
| function selectScenarios(flags: Flags): EvalScenario[] { | ||
| if (flags.scenarioIds) { | ||
| return flags.scenarioIds.map(scenarioById).filter((s): s is EvalScenario => Boolean(s)); | ||
| } |
There was a problem hiding this comment.
Reject unknown --scenario IDs instead of dropping them.
Lines 57-60 silently filter out misses from scenarioById(), so --scenario=01-dm-roundtrip,typo still exits after running only one scenario. That hides invocation mistakes and under-runs the suite. Fail fast with the unknown IDs before execution starts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/runner.ts` around lines 57 - 60,
selectScenarios currently drops unknown IDs by mapping flags.scenarioIds through
scenarioById and filtering falsy results; instead detect any missing IDs and
fail fast. Modify selectScenarios to compute the list of requested IDs
(flags.scenarioIds), map them to scenarioById results, collect the IDs that
produced no EvalScenario, and if any are missing throw an error (or process
exit) listing the unknown IDs so the invocation fails before running any
scenarios; otherwise return the resolved EvalScenario[] as before.
| for (const scenario of scenarios) { | ||
| if (scenario.harnessFilter && !scenario.harnessFilter.includes(cli)) continue; | ||
| const runs: ScenarioResult[] = []; | ||
| for (let i = 0; i < repeat; i++) { | ||
| process.stdout.write(` [${cli}] ${scenario.id} (run ${i + 1}/${repeat})… `); | ||
| try { | ||
| const result = await runOnce(scenario, cli); | ||
| runs.push(result); | ||
| console.log(result.pass ? 'PASS' : 'FAIL'); | ||
| } catch (err) { | ||
| console.log(`ERROR: ${(err as Error)?.message ?? err}`); | ||
| runs.push(failedResult(scenario, `error: ${(err as Error)?.message ?? err}`)); | ||
| } | ||
| } | ||
| results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]); | ||
| } | ||
| return { | ||
| schemaVersion: SCHEMA_VERSION, | ||
| startedAt: startedAt.toISOString(), | ||
| durationMs: Date.now() - startedAt.getTime(), | ||
| harness: cli, | ||
| gitSha: gitSha(), | ||
| env: { realCli: process.env.RELAY_INTEGRATION_REAL_CLI === '1', repeat }, | ||
| metrics: aggregateMetrics(results), | ||
| scenarios: results, | ||
| }; |
There was a problem hiding this comment.
Skip harness reports when no scenario actually ran.
If every selected scenario is excluded by scenario.harnessFilter, results stays empty and the downstream report shows perfect-looking rates with scenariosTotal = 0. That makes an unsupported harness appear healthy in JSON, HTML, and the matrix. Treat this as a skip/error instead of emitting a normal report.
Possible fix
async function runHarness(
cli: string,
scenarios: EvalScenario[],
repeat: number,
startedAt: Date
): Promise<EvalReport> {
const results: ScenarioResult[] = [];
for (const scenario of scenarios) {
if (scenario.harnessFilter && !scenario.harnessFilter.includes(cli)) continue;
@@
results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]);
}
+ if (results.length === 0) {
+ throw new Error(`No applicable scenarios ran for harness "${cli}"`);
+ }
return {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/runner.ts` around lines 113 - 138, After the
scenario loop, detect when no scenarios actually ran (results.length === 0) due
to all scenarios being filtered by scenario.harnessFilter for the current cli
and treat that as a skip/error instead of emitting a normal report: add a guard
right before building/returning the report (after the loop that calls runOnce
and mergeRepeats) that either throws a descriptive Error(`no scenarios ran for
harness ${cli}`) or returns a report object marked as skipped (e.g., add
skipped: true and empty metrics) so downstream code doesn't treat an empty
scenarios array as a healthy run; reference results, scenarios,
scenario.harnessFilter, cli, runOnce, mergeRepeats, aggregateMetrics and
SCHEMA_VERSION when implementing the check and handling.
| run: async (ctx): Promise<ScenarioResult> => { | ||
| const { harness, cli, suffix, sleep } = ctx; | ||
| const worker = `worker-${suffix}`; | ||
|
|
||
| await harness.spawnAgent(worker, cli, ['general'], { task: ROLE }); | ||
| await sleep(STARTUP_MS); | ||
| harness.clearEvents(); | ||
|
|
||
| await harness.sendMessage({ | ||
| to: worker, | ||
| from: 'Orchestrator', | ||
| text: 'Can you work out the sum of all prime numbers below 30 and let me know the answer? I need it to continue.', | ||
| }); | ||
|
|
||
| await waitForSends(harness, worker, 1, RESPONSE_MS); | ||
|
|
||
| const events = harness.getEvents(); | ||
| const base = baseScore(events, [worker]); | ||
| const ackDone = scoreAckDone(events, worker); | ||
| const reported = sentDirectMessage(events, worker); | ||
| await harness.releaseAgent(worker).catch(() => {}); | ||
|
|
||
| const pass = reported && base.phantoms.length === 0 && base.deliveryOk; | ||
|
|
||
| return { | ||
| id: scenario.id, | ||
| title: scenario.title, | ||
| pass, | ||
| agents: [{ name: worker, cli, role: 'worker', prompt: ROLE }], | ||
| transcript: base.transcript, | ||
| sent: base.sent, | ||
| expected: 1, | ||
| phantoms: base.phantoms, | ||
| totalIntents: base.totalIntents, | ||
| protocolAdherence: ackDone.score, | ||
| wrongChannelReplies: 0, | ||
| deliveryOk: base.deliveryOk, | ||
| events: base.events, | ||
| notes: `repliedPrivately=${reported} (DM back to requester) · ack=${ackDone.ackPresent} done=${ackDone.donePresent}`, | ||
| }; | ||
| }, | ||
| }; |
There was a problem hiding this comment.
Shared root cause: non-finalized agent teardown across realistic scenarios.
tests/integration/broker/evals/scenarios/r01-incidental-report.ts, tests/integration/broker/evals/scenarios/r02-forget-to-report.ts, tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts, and tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts all release agents only on the success path. Wrap each run body in try/finally and perform all releaseAgent(...) calls in finally to prevent leaked agent processes and cross-scenario flakiness.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts` around
lines 25 - 66, The run function in each scenario (e.g., the async run in
r01-incidental-report.ts) currently calls harness.releaseAgent(...) only on the
success path, causing leaked agents; wrap the entire run body in a try/finally
where you call harness.spawnAgent(...) / harness.sendMessage(...) / waits in the
try block and always call harness.releaseAgent(worker) in the finally block
(catching errors from releaseAgent if needed) so agents are released regardless
of failures; apply the same pattern to the other scenario files
(r02-forget-to-report.ts, r03-proactive-handoff.ts, r04-channel-vs-dm.ts) to
prevent cross-test flakiness.
| const handedOff = sentTo(events, author, reviewer); | ||
| await harness.releaseAgent(author).catch(() => {}); | ||
| await harness.releaseAgent(reviewer).catch(() => {}); | ||
|
|
||
| const pass = handedOff && base.phantoms.length === 0 && base.deliveryOk; |
There was a problem hiding this comment.
Pass criteria does not enforce DM-only handoff.
Scenario text says PASS requires author DM to reviewer, but sentTo(...) also accepts channel targets after normalization (e.g., #reviewer-*). That can produce false passes.
Suggested fix
- const handedOff = sentTo(events, author, reviewer);
+ const handedOff = events.some(
+ (e) =>
+ e.kind === 'relay_inbound' &&
+ e.from === author &&
+ !e.target.startsWith('#') &&
+ e.target.replace(/^[#@]/, '').trim().toLowerCase() === reviewer.toLowerCase()
+ );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts` around
lines 43 - 47, The pass check allows channel-targeted handoffs because
sentTo(events, author, reviewer) treats normalized channel names like
"`#reviewer-`*" as a match; update the pass criteria so it only counts a direct DM
handoff to the reviewer. Replace or augment the handedOff check (the sentTo
call) with a check that the event(s) from author target the reviewer actor
directly (e.g., compare explicit recipient ID or channelType === 'dm' /
recipient === reviewer.id) and reject matches that are channel names; keep the
other conditions (base.phantoms.length === 0 && base.deliveryOk) unchanged.
| const finalPost = events | ||
| .filter((e): e is RelayInbound => e.kind === 'relay_inbound') | ||
| .find( | ||
| (e) => | ||
| normalizeChannel(e.target) === normalizeChannel(finalChannel) && | ||
| e.body.includes(payload) | ||
| ); |
There was a problem hiding this comment.
Constrain final payload validation to the expected relay sender.
scoreRelayChain currently marks payloadIntact true if any sender posts the payload to finalChannel. That can over-credit broken chains when unrelated traffic contains the same payload.
Suggested fix
export function scoreRelayChain(
events: BrokerEvent[],
hops: Array<{ from: string; to: string }>,
payload: string,
finalChannel: string
): { hopsCompleted: number; payloadIntact: boolean } {
+ const expectedFinalSender = hops.length > 0 ? hops[hops.length - 1]!.from : null;
let hopsCompleted = 0;
for (const hop of hops) {
const sent = inboundFrom(events, hop.from).some(
(e) => normalizeChannel(e.target) === normalizeChannel(hop.to)
);
if (sent) hopsCompleted += 1;
else break;
}
const finalPost = events
.filter((e): e is RelayInbound => e.kind === 'relay_inbound')
.find(
(e) =>
+ (expectedFinalSender === null || e.from === expectedFinalSender) &&
normalizeChannel(e.target) === normalizeChannel(finalChannel) &&
e.body.includes(payload)
);
return { hopsCompleted, payloadIntact: Boolean(finalPost) };
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scoring/protocol.ts` around lines 115 - 121,
scoreRelayChain currently treats payloadIntact true if any RelayInbound to
finalChannel contains the payload; restrict this to only count posts from the
expected relay sender. Update the finalPost filter (where RelayInbound events
are selected) to also check that e.sender equals the expected relay identity
(e.g., compare against the last relay/expected relay variable used in
scoreRelayChain such as lastRelay.sender or expectedRelaySender) while keeping
the existing normalizeChannel and payload contains checks.
| import { eventsForAgent } from '../../utils/assert-helpers.js'; | ||
|
|
||
| /** Matches CSI sequences, OSC sequences, and private-mode toggles. */ | ||
| const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
node <<'NODE'
const pattern = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;
const input = '\x1b]8;;https://example.com\x1b\\link\x1b]8;;\x1b\\';
const output = input.replace(pattern, '');
console.log(JSON.stringify(output));
NODERepository: AgentWorkforce/relay
Length of output: 126
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file around the ANSI_PATTERN definition and its usage.
FILE="tests/integration/broker/evals/scoring/stream-clean.ts"
echo "=== File: $FILE (lines ~1-120) ==="
nl -ba "$FILE" | sed -n '1,120p'
echo
echo "=== Find cleanStreamOutput and its callsites in this file ==="
rg -n "cleanStreamOutput|ANSI_PATTERN|detectIntents" "$FILE" || true
echo
echo "=== Node regex behavior: BEL-terminated vs ST-terminated OSC (OSC 8 hyperlink) ==="
node <<'NODE'
const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;
// BEL-terminated OSC: ESC ] ... BEL
const oscBel = '\x1b]8;;https://example.com\x07' + 'link';
const outBel = oscBel.replace(ANSI_PATTERN, '');
console.log("BEL input matches:", JSON.stringify(oscBel));
console.log("BEL output:", JSON.stringify(outBel));
// ST-terminated OSC: ESC ] ... ESC \
const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\';
const outSt = oscSt.replace(ANSI_PATTERN, '');
console.log("ST input matches:", JSON.stringify(oscSt));
console.log("ST output:", JSON.stringify(outSt));
// Quick check: CSI should be removed (e.g., cursor movement)
const csi = '\x1b[2K' + 'hello';
const outCsi = csi.replace(ANSI_PATTERN, '');
console.log("CSI input:", JSON.stringify(csi));
console.log("CSI output:", JSON.stringify(outCsi));
NODERepository: AgentWorkforce/relay
Length of output: 209
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tests/integration/broker/evals/scoring/stream-clean.ts"
echo "=== File: $FILE (lines 1-160) ==="
cat -n "$FILE" | sed -n '1,160p'
echo
echo "=== Find cleanStreamOutput and ANSI_PATTERN usage in this file ==="
rg -n "cleanStreamOutput|ANSI_PATTERN|detectIntents" "$FILE" || true
echo
echo "=== Node regex behavior: BEL-terminated vs ST-terminated OSC (OSC 8 hyperlink) ==="
node <<'NODE'
const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;
// BEL-terminated OSC: ESC ] ... BEL
const oscBel = '\x1b]8;;https://example.com\x07' + 'link';
const outBel = oscBel.replace(ANSI_PATTERN, '');
console.log("BEL input:", JSON.stringify(oscBel));
console.log("BEL output:", JSON.stringify(outBel));
// ST-terminated OSC: ESC ] ... ESC \
const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\';
const outSt = oscSt.replace(ANSI_PATTERN, '');
console.log("ST input:", JSON.stringify(oscSt));
console.log("ST output:", JSON.stringify(outSt));
// CSI should be removed
const csi = '\x1b[2K' + 'hello';
const outCsi = csi.replace(ANSI_PATTERN, '');
console.log("CSI output:", JSON.stringify(outCsi));
NODERepository: AgentWorkforce/relay
Length of output: 2391
🏁 Script executed:
#!/bin/bash
set -euo pipefail
node <<'NODE'
const OLD = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;
const NEW =
/\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g;
// BEL-terminated OSC
const oscBel = '\x1b]8;;https://example.com\x07' + 'link' + '\x1b]8;;\x07';
// ST-terminated OSC
const oscSt = '\x1b]8;;https://example.com\x1b\\' + 'link' + '\x1b]8;;\x1b\\';
// CSI
const csi = '\x1b[2K' + 'hello';
// Private-mode toggle (DECTCEM/DECSET style)
const priv = '\x1b[?25h' + 'cursor' + '\x1b[?25l';
for (const [name, pattern] of [['OLD', OLD], ['NEW', NEW]]) {
console.log('---', name, '---');
console.log("BEL output:", JSON.stringify(oscBel.replace(pattern, '')));
console.log("ST output :", JSON.stringify(oscSt.replace(pattern, '')));
console.log("CSI output:", JSON.stringify(csi.replace(pattern, '')));
console.log("Priv output:", JSON.stringify(priv.replace(pattern, '')));
}
NODERepository: AgentWorkforce/relay
Length of output: 303
Handle ST-terminated OSC sequences too
ANSI_PATTERN only strips OSC sequences terminated by BEL (\x07). ST-terminated OSC sequences (ESC \, commonly used for OSC 8 hyperlinks) remain in stripAnsi() output, which can skew phantom detection/scoring.
Suggested fix
-const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g;
+const ANSI_PATTERN =
+ /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const ANSI_PATTERN = /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][^\x07]*\x07|\x1b\[[?][0-9]*[a-z]/g; | |
| const ANSI_PATTERN = | |
| /\x1b\[[0-9;]*[a-zA-Z]|\x1b\][\s\S]*?(?:\x07|\x1b\\)|\x1b\[[?][0-9]*[a-z]/g; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scoring/stream-clean.ts` at line 15,
ANSI_PATTERN currently only matches OSC sequences terminated by BEL (\x07) and
thus leaves ST-terminated OSCs (ESC backslash) intact; update the regex constant
ANSI_PATTERN to also match OSC sequences terminated by the String Terminator
(ESC backslash) so stripAnsi() removes OSC 8 hyperlinks and similar
sequences—specifically extend the pattern that matches OSC (\x1b\][^\x07]*\x07)
to also accept sequences ending with ESC "\" (escape-backslash), ensuring the
pattern covers both \x07 and ESC "\" terminators.
| if (evalWouldPass) { | ||
| console.error( | ||
| '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + | ||
| 'The eval is producing false greens.' | ||
| ); | ||
| process.exit(1); | ||
| } | ||
| if (score.sent !== 0) { | ||
| console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`); | ||
| process.exit(1); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set +e
node -e 'try { process.exit(1); } finally { console.log("finally-ran"); }'
status=$?
echo "exit_status=$status"Repository: AgentWorkforce/relay
Length of output: 78
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ls -la tests/integration/broker/evals/selftest.ts tests/integration/broker/utils/broker-harness.ts
echo "---- selftest.ts (around lines 40-110) ----"
nl -ba tests/integration/broker/evals/selftest.ts | sed -n '40,120p'
echo "---- selftest.ts (end) ----"
nl -ba tests/integration/broker/evals/selftest.ts | tail -n 80
echo "---- broker-harness.ts (around harness lifecycle / cleanup) ----"
nl -ba tests/integration/broker/utils/broker-harness.ts | sed -n '1,200p'Repository: AgentWorkforce/relay
Length of output: 361
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- selftest.ts (lines 1-140) ----"
cat -n tests/integration/broker/evals/selftest.ts | sed -n '1,140p'
echo "---- broker-harness.ts (first 260 lines) ----"
cat -n tests/integration/broker/utils/broker-harness.ts | sed -n '1,260p'
echo "---- broker-harness.ts (lines 260-end) ----"
cat -n tests/integration/broker/utils/broker-harness.ts | sed -n '261,520p' || trueRepository: AgentWorkforce/relay
Length of output: 16258
Replace process.exit(1) in selftest.ts failure branches to ensure finally cleanup runs.
tests/integration/broker/evals/selftest.ts calls process.exit(1) inside the try (lines ~64 and ~68), which terminates the process immediately and bypasses the finally block—so harness.releaseAgent() / harness.stop() may not run, leaving the broker/agent running.
Suggested fix
if (evalWouldPass) {
- console.error(
- '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
- 'The eval is producing false greens.'
- );
- process.exit(1);
+ throw new Error(
+ '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
+ 'The eval is producing false greens.'
+ );
}
if (score.sent !== 0) {
- console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
- process.exit(1);
+ throw new Error(
+ `✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`
+ );
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (evalWouldPass) { | |
| console.error( | |
| '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + | |
| 'The eval is producing false greens.' | |
| ); | |
| process.exit(1); | |
| } | |
| if (score.sent !== 0) { | |
| console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`); | |
| process.exit(1); | |
| if (evalWouldPass) { | |
| throw new Error( | |
| '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + | |
| 'The eval is producing false greens.' | |
| ); | |
| } | |
| if (score.sent !== 0) { | |
| throw new Error( | |
| `✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.` | |
| ); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/selftest.ts` around lines 59 - 68, The failure
branches in tests/integration/broker/evals/selftest.ts use process.exit(1) which
aborts before the finally block can run; replace those calls inside the try (the
checks for evalWouldPass and score.sent) with throwing an Error (e.g., throw new
Error(...)) so the exception propagates and allows the finally block to run
harness.releaseAgent() / harness.stop(); ensure the thrown Error messages
contain the same descriptive text currently passed to console.error so test
harness teardown still executes and the process will exit non‑zero after
cleanup.
|
Implemented scoped fixes for PR #1109:
Addressed comments
Advisory Notes
Validation performed:
Blocked validation:
|
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (3)
tests/integration/broker/evals/runner.ts (3)
146-169:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSkip harness reports when no scenario actually ran.
If every selected scenario is excluded by
scenario.harnessFilter,resultsstays empty and the downstream report shows perfect-looking rates withscenariosTotal = 0. That makes an unsupported harness appear healthy in JSON, HTML, and the matrix. Treat this as a skip/error instead of emitting a normal report.🚨 Guard against empty results
for (const scenario of scenarios) { if (scenario.harnessFilter && !scenario.harnessFilter.includes(spec.cli)) continue; ... results.push(repeat > 1 ? mergeRepeats(runs) : runs[0]); } + if (results.length === 0) { + throw new Error(`No applicable scenarios ran for harness "${label}"`); + } return {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 146 - 169, runHarness currently returns a normal EvalReport even when no scenarios actually ran (because every scenario was filtered out), producing misleading "perfect" results; after the loop over scenarios check if results.length === 0 and instead return an EvalReport that marks the harness as skipped/errored (e.g., set a skipped flag or a non-zero failure/skip count and include a clear message) so downstream consumers see the skip; update the return path in runHarness to use the same EvalReport shape as normal returns (using startedAt and any existing metadata) but populate scenariosTotal/summary to reflect the skip and include an explanatory message so JSON/HTML/matrix outputs don't show a healthy run when nothing executed.
84-86:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject unknown
--scenarioIDs instead of dropping them.Line 85 silently filters out misses from
scenarioById(), so--scenario=01-dm-roundtrip,typoexits after running only one scenario. That hides invocation mistakes and under-runs the suite. Fail fast with unknown IDs before execution starts.🚨 Validate before filtering
function selectScenarios(flags: Flags): EvalScenario[] { if (flags.scenarioIds) { - return flags.scenarioIds.map(scenarioById).filter((s): s is EvalScenario => Boolean(s)); + const resolved = flags.scenarioIds.map((id) => ({ id, scenario: scenarioById(id) })); + const missing = resolved.filter((r) => !r.scenario).map((r) => r.id); + if (missing.length > 0) { + console.error(`Unknown scenario IDs: ${missing.join(', ')}`); + process.exit(2); + } + return resolved.map((r) => r.scenario!); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 84 - 86, The current branch quietly drops unknown IDs when mapping flags.scenarioIds with scenarioById, which hides invalid CLI input; change the logic in the runner (where flags.scenarioIds is handled) to first map flags.scenarioIds to results via scenarioById, collect any missing entries (where result is undefined), and if any unknown IDs exist throw an error (or exit non-zero) listing those unknown IDs; otherwise return the mapped array as EvalScenario. This ensures flags.scenarioIds, scenarioById, and EvalScenario are used to validate and fail fast on bad IDs.
312-323:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't compare every harness against the same baseline file.
Line 314 reuses one
--baselinereport for each harness in the loop. A baseline captured forclaudewill also be compared againstcodex,gemini, andgrok, andcompareReports()does not validate harness identity before scoring deltas. That can manufacture regressions and fail the run incorrectly. Require exactly one harness for--baseline, or load a matrix baseline and compare by harness.🚨 Validate baseline harness match
if (flags.baseline) { try { - const deltas = compareReports(readReport(flags.baseline), report); + const baselineReport = readReport(flags.baseline); + if (baselineReport.harness !== label) { + console.warn(` baseline harness mismatch: baseline=${baselineReport.harness}, current=${label} — skipping comparison`); + } else { + const deltas = compareReports(baselineReport, report);Or require single harness when
--baselineis provided:const scenarios = selectScenarios(flags); + if (flags.baseline && flags.harnesses.length > 1) { + console.error('--baseline requires exactly one --harness (or use a matrix baseline)'); + process.exit(2); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 312 - 323, The baseline comparison currently reuses a single baseline file for every harness inside the loop (flags.baseline + compareReports(readReport(flags.baseline), report)), which can misattribute regressions; change the logic to either (A) require exactly one harness when flags.baseline is provided by validating the harness count before entering the loop and erroring if >1, or (B) support a harness-keyed baseline matrix by loading the baseline via readReport(flags.baseline) and selecting the per-harness baseline (matching harness id/name) before calling compareReports; implement the validation/selection near the existing baseline branch where flags.baseline is checked and use harness identity (the same identifier used to produce report) to look up the correct baseline instead of reusing the same file for every harness.
🧹 Nitpick comments (2)
tests/integration/broker/evals/scenarios/s01-spawn-worker.ts (1)
36-36: ⚡ Quick winRemove dead ternary branch.
Both branches of the ternary return
'realistic'. If all onboarding variants should use the same tier, remove the conditional; otherwise, distinguish the'bare'variant from the others.♻️ Simplify to constant
- tier: onboarding === 'bare' ? 'realistic' : 'realistic', + tier: 'realistic',🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts` at line 36, The ternary expression assigning tier is dead (both branches return 'realistic'); replace the conditional with a single constant assignment by setting the tier property to 'realistic' directly (update the object property "tier" in the s01-spawn-worker scenario where the current "tier: onboarding === 'bare' ? 'realistic' : 'realistic'," appears); if instead the 'bare' onboarding should differ, change the second branch to the correct value for non-'bare' cases—otherwise remove the ternary and use the constant.tests/integration/broker/evals/scoring/metrics.ts (1)
41-48: ⚡ Quick winRemove redundant nullish coalescing inside undefined guards.
Lines 43 and 47 use
(r.spawnCount ?? 0)and(r.releaseCount ?? 0)after already confirming these fields are notundefinedon lines 41 and 45. The?? 0fallback will never execute in this context.♻️ Simplify the conditionals
// Lifecycle metrics: any scenario that measured spawn/release. if (r.spawnCount !== undefined) { spawnTotal += 1; - if ((r.spawnCount ?? 0) > 0) spawnPassed += 1; + if (r.spawnCount > 0) spawnPassed += 1; } if (r.releaseCount !== undefined) { releaseTotal += 1; - if ((r.releaseCount ?? 0) > 0) releasePassed += 1; + if (r.releaseCount > 0) releasePassed += 1; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/metrics.ts` around lines 41 - 48, The checks for r.spawnCount and r.releaseCount already ensure these fields are not undefined, so the nullish coalescing (r.spawnCount ?? 0) and (r.releaseCount ?? 0) are redundant; replace those with direct comparisons (e.g., use r.spawnCount > 0 and r.releaseCount > 0) in the blocks that update spawnPassed/releasePassed to simplify the conditionals while keeping the existing guards that increment spawnTotal/releaseTotal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/broker/evals/runner.ts`:
- Around line 132-140: The test helper runOnce sets OPENCODE_MODEL when
spec.model is present but silently ignores model overrides for non-opencode
harnesses (e.g., "--harness=claude:gpt-4"); update runOnce to validate the
override scope by checking spec.model against spec.cli (e.g., ensure spec.cli
indicates opencode before setting OPENCODE_MODEL) and throw a clear error or
assertion if a model override is provided for a non-opencode harness (reference
runOnce, OPENCODE_MODEL, spec.model, and BrokerHarness), or alternatively add a
concise comment documenting that model overrides apply only to the opencode
harness.
- Line 239: The test hardcodes the onboarding variants with const variants =
['bare', 'one-liner', 'brief', 'skill'], which can drift from the canonical
list; import and use the exported ONBOARDING_VARIANTS constant from
onboarding.ts instead of the literal array (add the import at the top of the
file and replace the const variants assignment with ONBOARDING_VARIANTS so the
test matrix always matches the source of truth).
In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts`:
- Around line 56-60: The predicate passed to harness.waitForEvent for
'agent_spawned' (used when creating spawnWaiter) unsafely casts the event to {
parent?: string }; replace it with a runtime guard that first verifies the event
is a non-null object and that it has a 'parent' property of type string before
comparing to lead (or extract parent only after the guard). Update the predicate
used in harness.waitForEvent (the anonymous arrow function) to perform this
check so malformed events won't cause silent failures or exceptions.
In `@tests/integration/broker/evals/scenarios/s02-release-worker.ts`:
- Around line 50-54: The predicate passed to harness.waitForEvent for
'agent_released' currently casts the event to { name: string } without checking
shape; change the predicate used in releaseWaiter to first guard that the event
is an object and has a string 'name' property (e.g., typeof e === 'object' && e
!== null && typeof (e as any).name === 'string') before comparing it to the
worker variable so the comparison is safe and won't throw if the broker emits a
different structure; update the predicate inside harness.waitForEvent (the
callback passed to releaseWaiter) accordingly.
In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts`:
- Around line 100-104: The notes assembly can interpolate a null workerName
(used in the template strings for `spawned ${workerName} but never released` and
`spawned+released ${workerName}`); change the logic around `notesParts` to guard
against null by either early-returning when `workerName` is null or substituting
a safe fallback like '<unknown>' before interpolation, ensuring checks around
`spawn.phantomSpawn`, `spawnOk`, `releaseOk`, and `pass` still behave the same
and that `notesParts` never receives a template that interpolates a null
`workerName`.
- Around line 56-65: The code assumes the event returned by spawnWaiter.promise
contains a name and casts directly, which can fail; modify the promise handler
for harness.waitForEvent('agent_spawned', ...) (where workerName is set) to
validate the event shape at runtime before assigning: check that the resolved
event has a 'name' property and typeof name === 'string' (and optionally parent
matches), only then set workerName, otherwise handle the unexpected shape
(reject, log, or ignore) to avoid unsafe casting and runtime errors.
---
Duplicate comments:
In `@tests/integration/broker/evals/runner.ts`:
- Around line 146-169: runHarness currently returns a normal EvalReport even
when no scenarios actually ran (because every scenario was filtered out),
producing misleading "perfect" results; after the loop over scenarios check if
results.length === 0 and instead return an EvalReport that marks the harness as
skipped/errored (e.g., set a skipped flag or a non-zero failure/skip count and
include a clear message) so downstream consumers see the skip; update the return
path in runHarness to use the same EvalReport shape as normal returns (using
startedAt and any existing metadata) but populate scenariosTotal/summary to
reflect the skip and include an explanatory message so JSON/HTML/matrix outputs
don't show a healthy run when nothing executed.
- Around line 84-86: The current branch quietly drops unknown IDs when mapping
flags.scenarioIds with scenarioById, which hides invalid CLI input; change the
logic in the runner (where flags.scenarioIds is handled) to first map
flags.scenarioIds to results via scenarioById, collect any missing entries
(where result is undefined), and if any unknown IDs exist throw an error (or
exit non-zero) listing those unknown IDs; otherwise return the mapped array as
EvalScenario. This ensures flags.scenarioIds, scenarioById, and EvalScenario are
used to validate and fail fast on bad IDs.
- Around line 312-323: The baseline comparison currently reuses a single
baseline file for every harness inside the loop (flags.baseline +
compareReports(readReport(flags.baseline), report)), which can misattribute
regressions; change the logic to either (A) require exactly one harness when
flags.baseline is provided by validating the harness count before entering the
loop and erroring if >1, or (B) support a harness-keyed baseline matrix by
loading the baseline via readReport(flags.baseline) and selecting the
per-harness baseline (matching harness id/name) before calling compareReports;
implement the validation/selection near the existing baseline branch where
flags.baseline is checked and use harness identity (the same identifier used to
produce report) to look up the correct baseline instead of reusing the same file
for every harness.
---
Nitpick comments:
In `@tests/integration/broker/evals/scenarios/s01-spawn-worker.ts`:
- Line 36: The ternary expression assigning tier is dead (both branches return
'realistic'); replace the conditional with a single constant assignment by
setting the tier property to 'realistic' directly (update the object property
"tier" in the s01-spawn-worker scenario where the current "tier: onboarding ===
'bare' ? 'realistic' : 'realistic'," appears); if instead the 'bare' onboarding
should differ, change the second branch to the correct value for non-'bare'
cases—otherwise remove the ternary and use the constant.
In `@tests/integration/broker/evals/scoring/metrics.ts`:
- Around line 41-48: The checks for r.spawnCount and r.releaseCount already
ensure these fields are not undefined, so the nullish coalescing (r.spawnCount
?? 0) and (r.releaseCount ?? 0) are redundant; replace those with direct
comparisons (e.g., use r.spawnCount > 0 and r.releaseCount > 0) in the blocks
that update spawnPassed/releasePassed to simplify the conditionals while keeping
the existing guards that increment spawnTotal/releaseTotal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 911d0907-2720-4765-8669-73bed72e3a8d
📒 Files selected for processing (11)
package.jsontests/integration/broker/evals/report/write.tstests/integration/broker/evals/runner.tstests/integration/broker/evals/scenarios/index.tstests/integration/broker/evals/scenarios/onboarding.tstests/integration/broker/evals/scenarios/s01-spawn-worker.tstests/integration/broker/evals/scenarios/s02-release-worker.tstests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.tstests/integration/broker/evals/scoring/lifecycle.tstests/integration/broker/evals/scoring/metrics.tstests/integration/broker/evals/types.ts
✅ Files skipped from review due to trivial changes (1)
- tests/integration/broker/evals/scoring/lifecycle.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/broker/evals/types.ts
- tests/integration/broker/evals/report/write.ts
- package.json
| } | ||
| } | ||
|
|
||
| const variants = ['bare', 'one-liner', 'brief', 'skill']; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Avoid hardcoding onboarding variant list.
Line 239 hardcodes ['bare', 'one-liner', 'brief', 'skill']. If ONBOARDING_VARIANTS in onboarding.ts changes, this matrix will silently diverge. Import and reuse ONBOARDING_VARIANTS instead.
♻️ Import the canonical list
+import { ONBOARDING_VARIANTS } from './scenarios/onboarding.js';
...
function printLifecycleMatrix(allReports: Array<{ harness: string; report: EvalReport }>): void {
...
- const variants = ['bare', 'one-liner', 'brief', 'skill'];
+ const variants = ONBOARDING_VARIANTS;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/runner.ts` at line 239, The test hardcodes the
onboarding variants with const variants = ['bare', 'one-liner', 'brief',
'skill'], which can drift from the canonical list; import and use the exported
ONBOARDING_VARIANTS constant from onboarding.ts instead of the literal array
(add the import at the top of the file and replace the const variants assignment
with ONBOARDING_VARIANTS so the test matrix always matches the source of truth).
| const releaseWaiter = harness.waitForEvent( | ||
| 'agent_released', | ||
| RESPONSE_MS, | ||
| (e) => (e as { name: string }).name === worker | ||
| ); |
There was a problem hiding this comment.
Validate event structure before casting.
Line 53 casts to { name: string } without confirming the event shape. If the broker emits a different agent_released structure, the name comparison may fail or throw.
🛡️ Add runtime guard
const releaseWaiter = harness.waitForEvent(
'agent_released',
RESPONSE_MS,
- (e) => (e as { name: string }).name === worker
+ (e) => {
+ const evt = e as Record<string, unknown>;
+ return typeof evt.name === 'string' && evt.name === worker;
+ }
);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/s02-release-worker.ts` around lines
50 - 54, The predicate passed to harness.waitForEvent for 'agent_released'
currently casts the event to { name: string } without checking shape; change the
predicate used in releaseWaiter to first guard that the event is an object and
has a string 'name' property (e.g., typeof e === 'object' && e !== null &&
typeof (e as any).name === 'string') before comparing it to the worker variable
so the comparison is safe and won't throw if the broker emits a different
structure; update the predicate inside harness.waitForEvent (the callback passed
to releaseWaiter) accordingly.
| const spawnWaiter = harness.waitForEvent( | ||
| 'agent_spawned', | ||
| RESPONSE_MS, | ||
| (e) => (e as { parent?: string }).parent === lead | ||
| ); | ||
| await spawnWaiter.promise | ||
| .then((e) => { | ||
| workerName = (e as { name: string }).name; | ||
| }) | ||
| .catch(() => {}); |
There was a problem hiding this comment.
Validate event structure before casting.
Line 63 casts to { name: string } without confirming the event shape. If the broker emits a different structure, extracting .name may fail or capture the wrong value.
🛡️ Add runtime guard
await spawnWaiter.promise
.then((e) => {
- workerName = (e as { name: string }).name;
+ const evt = e as Record<string, unknown>;
+ if (typeof evt.name === 'string') {
+ workerName = evt.name;
+ }
})
.catch(() => {});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts`
around lines 56 - 65, The code assumes the event returned by spawnWaiter.promise
contains a name and casts directly, which can fail; modify the promise handler
for harness.waitForEvent('agent_spawned', ...) (where workerName is set) to
validate the event shape at runtime before assigning: check that the resolved
event has a 'name' property and typeof name === 'string' (and optionally parent
matches), only then set workerName, otherwise handle the unexpected shape
(reject, log, or ignore) to avoid unsafe casting and runtime errors.
| const notesParts: string[] = []; | ||
| if (spawn.phantomSpawn) notesParts.push('phantom spawn'); | ||
| if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`); | ||
| if (!spawnOk) notesParts.push('no spawn'); | ||
| if (pass) notesParts.push(`spawned+released ${workerName}`); |
There was a problem hiding this comment.
Guard against null workerName in notes.
Line 102 interpolates workerName which may be null if the spawn event arrived but the promise handler failed or if scoring logic diverges from the captured name. Add a fallback or early-return guard.
🛡️ Add null guard
const notesParts: string[] = [];
if (spawn.phantomSpawn) notesParts.push('phantom spawn');
- if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`);
+ if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName ?? 'unknown'} but never released`);
if (!spawnOk) notesParts.push('no spawn');
- if (pass) notesParts.push(`spawned+released ${workerName}`);
+ if (pass) notesParts.push(`spawned+released ${workerName ?? 'unknown'}`);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const notesParts: string[] = []; | |
| if (spawn.phantomSpawn) notesParts.push('phantom spawn'); | |
| if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName} but never released`); | |
| if (!spawnOk) notesParts.push('no spawn'); | |
| if (pass) notesParts.push(`spawned+released ${workerName}`); | |
| const notesParts: string[] = []; | |
| if (spawn.phantomSpawn) notesParts.push('phantom spawn'); | |
| if (spawnOk && !releaseOk) notesParts.push(`spawned ${workerName ?? 'unknown'} but never released`); | |
| if (!spawnOk) notesParts.push('no spawn'); | |
| if (pass) notesParts.push(`spawned+released ${workerName ?? 'unknown'}`); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts`
around lines 100 - 104, The notes assembly can interpolate a null workerName
(used in the template strings for `spawned ${workerName} but never released` and
`spawned+released ${workerName}`); change the logic around `notesParts` to guard
against null by either early-returning when `workerName` is null or substituting
a safe fallback like '<unknown>' before interpolation, ensuring checks around
`spawn.phantomSpawn`, `spawnOk`, `releaseOk`, and `pass` still behave the same
and that `notesParts` never receives a template that interpolates a null
`workerName`.
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
crates/broker/src/runtime/api.rs (1)
214-233:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
skip_relay_promptfor the small-model skill injection.This path still prepends relay guidance even when the caller explicitly set
skip_relay_prompt. That changes the task text for requests that asked for no relay prompt augmentation, which can skew evals and custom prompts. Guard the prefix behind!skip_relay_prompt.Suggested fix
- if let Some(prefix) = model_skill_prefix(effective_spec.model.as_deref()) { - effective_task = Some(match effective_task { - Some(task) => format!("{prefix}\n\n{task}"), - None => prefix.to_string(), - }); - tracing::debug!( - agent = %name, - model = ?effective_spec.model, - "injected relay skill prefix for small-tier model" - ); + if !skip_relay_prompt { + if let Some(prefix) = model_skill_prefix(effective_spec.model.as_deref()) { + effective_task = Some(match effective_task { + Some(task) => format!("{prefix}\n\n{task}"), + None => prefix.to_string(), + }); + tracing::debug!( + agent = %name, + model = ?effective_spec.model, + "injected relay skill prefix for small-tier model" + ); + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/broker/src/runtime/api.rs` around lines 214 - 233, The code always injects the relay skill prefix into effective_task regardless of the caller's intent; update the logic that prepends model_skill_prefix(effective_spec.model.as_deref()) so it only runs when skip_relay_prompt is false (i.e. guard the injection with a check like if !skip_relay_prompt && let Some(prefix) = model_skill_prefix(...)). Modify the block that mutates effective_task (and emits the tracing::debug!) to first ensure skip_relay_prompt is honored, using the existing symbols effective_spec, effective_task, model_skill_prefix, skip_relay_prompt, and name.tests/integration/broker/evals/scenarios/onboarding.ts (1)
30-59:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't hardcode
cli: "claude"in shared onboarding text.These prompts are reused by scenarios that already carry a runtime-selected
ctx.cli, but thebriefandskillvariants instruct agents to spawn workers withcli: "claude"unconditionally. That skews non-Claude evals and can turn a compliant agent into a false failure.Proposed refactor
-export function onboardingText(variant: OnboardingVariant): string { +export function onboardingText(variant: OnboardingVariant, cli: string): string { switch (variant) { @@ case 'brief': return ` ## Agent management -- Spawn a worker: mcp__agent-relay__add_agent({ name, cli: "claude", task }) +- Spawn a worker: mcp__agent-relay__add_agent({ name, cli: "${cli}", task }) - Release a worker: mcp__agent-relay__remove_agent({ name }) Spawn when a task needs dedicated focus. Release as soon as the worker reports done.`; @@ ### Spawn a worker To delegate work, call: - mcp__agent-relay__add_agent({ name: "WorkerName", cli: "claude", task: "detailed instructions" }) + mcp__agent-relay__add_agent({ name: "WorkerName", cli: "${cli}", task: "detailed instructions" }) -Required fields: name (unique string), cli ("claude"), task (full instructions for the worker). +Required fields: name (unique string), cli ("${cli}"), task (full instructions for the worker).Then thread the current scenario CLI through the callers instead of baking one provider into the shared prompt text.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/onboarding.ts` around lines 30 - 59, The onboarding prompt cases 'brief' and 'skill' hardcode cli: "claude" in the spawn examples; change them to use the runtime-selected CLI instead (e.g., replace the literal cli: "claude" with a placeholder injected from the scenario context such as cli: "${ctx.cli}" or format the string with the caller-provided cli) and update callers to thread the scenario CLI into these prompt templates so non-Claude evals receive the correct provider value; ensure the worker examples in both the 'brief' and 'skill' branches reference the dynamic value rather than the fixed "claude".tests/integration/broker/evals/scoring/lifecycle.ts (1)
75-80:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't treat an empty expected-worker list as "match every release".
When
workerNamesis empty,scoreReleasecurrently accepts everyagent_releasedevent. That lets harness teardown of the lead agent flipreleaseConfirmedtotrue, andscoreLifecycle(..., expectedWorkers = [])makes that false positive the default path.Proposed fix
export function scoreRelease(events: BrokerEvent[], workerNames: string[]): ReleaseScore { + if (workerNames.length === 0) { + return { + releaseConfirmed: false, + releasedNames: [], + releaseCount: 0, + }; + } + const nameSet = new Set(workerNames); const releaseEvents = events.filter( (e): e is Extract<BrokerEvent, { kind: 'agent_released' }> => - e.kind === 'agent_released' && (workerNames.length === 0 || nameSet.has(e.name)) + e.kind === 'agent_released' && nameSet.has(e.name) );Also applies to: 95-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/lifecycle.ts` around lines 75 - 80, scoreRelease currently treats an empty workerNames array as "match every release" causing false positives; change the filter in scoreRelease to require workerNames.length > 0 AND nameSet.has(e.name) (i.e., only match releases when an expected worker list is provided and the event name is in nameSet), or explicitly return an empty releaseEvents when workerNames is empty; apply the same change to the analogous workerName-based filter later in the file (the other lifecycle scoring filter that uses workerNames) so empty expected-workers never matches all agent_released events.tests/integration/broker/evals/scenarios/04-relay-chain.ts (1)
20-89:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlways release spawned agents from
finallyin these real-CLI scenarios.Both
tests/integration/broker/evals/scenarios/04-relay-chain.tsandtests/integration/broker/evals/scenarios/s04-no-native-subagents.tsonly callreleaseAgent(...)on the happy path. Any failure inspawnAgent,sendMessage, waiter timeouts, or scoring skips teardown and leaves real CLI agents behind, which can bleed into later evals.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts` around lines 20 - 89, The test's run function currently releases agents only on the happy path; move agent teardown into a finally block so spawned CLI agents are always released even on errors/timeouts. Wrap the main test flow in try { ... } finally { ... } inside run, track successful spawns (e.g., push agent names after each harness.spawnAgent call) and in finally call harness.releaseAgent for each tracked name (use Promise.allSettled or catch each release) to ensure cleanup regardless of failures in spawnAgent, sendMessage, waiters, or scoring (references: run, harness.spawnAgent, harness.releaseAgent, finalWaiter/waitForEvent).
🧹 Nitpick comments (4)
packages/cli/src/auto/classifier.ts (2)
83-89: ⚡ Quick winDocument the parallelizability threshold reasoning.
The condition
parallelSignals >= 2 || domains.length >= 3uses specific thresholds (2 signals, 3 domains) without explanation. Adding a brief inline comment would improve maintainability and help future calibration based on eval results.📝 Suggested documentation
const domains = detectDomains(task); const parallelSignals = countParallelSignals(task); + // Threshold: 2+ parallel signals OR 3+ domains indicates task can be split. + // Calibrated empirically; adjust based on routing eval results. const parallelizable = parallelSignals >= 2 || domains.length >= 3;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/auto/classifier.ts` around lines 83 - 89, Add a concise inline comment next to the parallelizability calculation in classifyTask explaining the chosen thresholds: why parallelSignals >= 2 and domains.length >= 3 are used to mark tasks as parallelizable (e.g., two or more explicit parallel cues or three or more distinct domains implying independent sub-tasks), and note that these values are heuristics subject to calibration via evaluation metrics—update the comment near the parallelizable assignment referencing the variables parallelSignals and domains so future maintainers know the rationale and where to adjust thresholds.
90-101: ⚡ Quick winExtract word-count thresholds as named constants.
The hardcoded values
100and40for complexity boundaries would benefit from extraction as named constants (e.g.,HIGH_COMPLEXITY_WORD_THRESHOLD,MEDIUM_COMPLEXITY_WORD_THRESHOLD) to improve readability and make tuning easier based on eval results.♻️ Proposed refactor
+const HIGH_COMPLEXITY_WORD_THRESHOLD = 100; +const MEDIUM_COMPLEXITY_WORD_THRESHOLD = 40; +const HIGH_COMPLEXITY_DOMAIN_THRESHOLD = 4; +const MEDIUM_COMPLEXITY_DOMAIN_THRESHOLD = 2; + export function classifyTask(task: string): TaskAssessment { // ... existing code ... // ── Complexity ────────────────────────────────────────────────────────────── let complexity: TaskAssessment['complexity']; const highHit = HIGH_COMPLEXITY_KEYWORDS.some((kw) => lower.includes(kw)); const medHit = MEDIUM_COMPLEXITY_KEYWORDS.some((kw) => lower.includes(kw)); - if (highHit || words > 100 || domains.length >= 4) { + if (highHit || words > HIGH_COMPLEXITY_WORD_THRESHOLD || domains.length >= HIGH_COMPLEXITY_DOMAIN_THRESHOLD) { complexity = 'high'; - } else if (medHit || words > 40 || domains.length >= 2) { + } else if (medHit || words > MEDIUM_COMPLEXITY_WORD_THRESHOLD || domains.length >= MEDIUM_COMPLEXITY_DOMAIN_THRESHOLD) { complexity = 'medium'; } else { complexity = 'low'; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/auto/classifier.ts` around lines 90 - 101, Extract the magic numbers used for word-count thresholds into named constants and use them in the complexity decision logic: introduce constants like HIGH_COMPLEXITY_WORD_THRESHOLD and MEDIUM_COMPLEXITY_WORD_THRESHOLD and replace the literals 100 and 40 in the complexity calculation block (the logic that sets the variable complexity, which also references HIGH_COMPLEXITY_KEYWORDS, MEDIUM_COMPLEXITY_KEYWORDS, words and domains.length) so the comparisons read words > HIGH_COMPLEXITY_WORD_THRESHOLD and words > MEDIUM_COMPLEXITY_WORD_THRESHOLD; keep the existing boolean hits (highHit, medHit) and domain checks unchanged.specs/auto-routing.md (1)
78-90: 💤 Low valueAdd language specifiers to fenced code blocks.
Markdownlint flags three fenced code blocks (lines 80-90, 113-119, 133-144) for missing language specifiers. Adding
```textor```plaintextwould satisfy the linter and improve rendering consistency.📝 Suggested fixes
**Classifier prompt** (compressed): -``` +```text You are a task router. Given a task description, assess: ...Routing table (derived from eval data):
-
+text
complexity=low, parallel=false → lead: sonnet/one-liner, workers: [haiku×1]
...This is critical: pre-composing the team means the lead's job is coordination, not planning. The lead doesn't have to decide whether to delegate — it's already been decided. -``` +```text You are Director, leading a {N}-worker team on this task: ...Also applies to: 111-119, 131-144
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/auto-routing.md` around lines 78 - 90, Several fenced code blocks in the "Classifier prompt" and the later "Routing table" and "Director" sections are missing language specifiers causing markdownlint warnings; update each triple-backtick fence around the prompt, routing table examples, and Director example (the blocks containing the compressed classifier prompt, the routing-table snippet, and the Director/team snippet) to use a language token such as ```text or ```plaintext so they validate and render consistently. Ensure you replace the opening ``` with ```text (or ```plaintext) for all three blocks referenced in the comment.Source: Linters/SAST tools
packages/cli/src/auto/composer.ts (1)
72-81: ⚡ Quick winConsider extracting the worker task template to a shared builder.
The worker task format at line 79 is later parsed via string manipulation in
director-prompt.tsline 32. This creates fragile coupling where format changes break the parsing logic.Extracting a shared
buildWorkerTask(domain, subtask, originalTask)function would eliminate duplication and make the format contract explicit.♻️ Proposed refactor
Create a shared helper:
// In a shared utilities file or in composer.ts function buildWorkerTask(domain: string, subtask: string, originalTask: string): string { return `You are a specialised ${domain} worker. Your task:\n\n${originalTask}\n\nFocus exclusively on the ${subtask}. Report DONE when complete with a concise summary.`; } function extractWorkerDomain(task: string): string { // Extract domain from the standard format const match = task.match(/You are a specialised (\w+) worker/); return match?.[1] ?? 'general'; }Then use in composer.ts:
return { role: `Worker-${domain.charAt(0).toUpperCase() + domain.slice(1)}`, model: row.workerModel, task: buildWorkerTask(domain, subtask, originalTask), };And in director-prompt.ts:
const workerList = workers .map((w, i) => { const domain = extractWorkerDomain(w.task); return ` ${i + 1}. **${w.role}** (${w.model} relay worker): ${domain} work`; }) .join('\n');🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/auto/composer.ts` around lines 72 - 81, The worker task string is built inline in the composer mapping, which couples formatting to fragile string parsing elsewhere; extract a shared builder function buildWorkerTask(domain, subtask, originalTask) that returns the exact task template and replace the inline template in the WorkerSpec creation (the mapping that sets role/model/task) to call this builder; also add a complementary extractor extractWorkerDomain(task) (used by director-prompt.ts instead of ad-hoc regexes) so parsing and generation share the same contract, and update both composer usage and director-prompt.ts to call these helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/using-agent-relay/SKILL.md:
- Around line 91-93: Several fenced code examples (e.g., the
mcp__agent-relay__send_dm(...) snippet and other examples in the same section)
are missing fence language tags; update each fenced block (including the
examples around lines with mcp__agent-relay__send_dm and the other listed
ranges) to include an appropriate language tag like ```text or ```bash so
markdownlint MD040 is satisfied; search for the literal snippets (e.g.,
mcp__agent-relay__send_dm) to locate each block and add the language identifier
to the opening triple backticks.
- Line 3: The description uses the wrong tool name "agent_register" which
doesn't match the actual API; update the text to reference the correct tool name
"register_agent" (or fully-qualified "mcp__agent-relay__register_agent") so
readers can find the real call, and scan the skill for any other occurrences of
"agent_register" and replace them with the correct identifier (register_agent /
mcp__agent-relay__register_agent).
In `@CHANGELOG.md`:
- Line 12: Split the two long CHANGELOG bullets into multiple short, user-facing
bullets (one change per bullet) and remove lower-level implementation details;
for the lifecycle entry (referencing eval:lifecycle,
eval:lifecycle:claude-models, eval:lifecycle:full, add_agent/remove_agent,
broker events) create separate bullets for the lifecycle eval layer,
phantom-spawn detection (broker lifecycle events distinguishing phantom vs real
spawns), and the HTML dashboard/CI per-model/variant spawn-rate visuals, each
written impact-first and concise; likewise for the messaging entry (referencing
npm run eval, eval:matrix, ACK/DONE, messaging-rate, wrong-channel, self-test,
toolcheck, JSON/HTML reports, baseline diffing) break it into distinct bullets
for the messaging eval suite, scoring metrics (message rate, phantom-message
rate, ACK/DONE adherence, wrong-channel detection), scenario tiers (realistic,
smoke, negative-control), and report output formats, and strip implementation
details like “via broker events” or “call add_agent” so each bullet describes
only the user-visible change.
In `@packages/cli/src/auto/classifier.ts`:
- Around line 25-43: The keyword lists HIGH_COMPLEXITY_KEYWORDS,
MEDIUM_COMPLEXITY_KEYWORDS and PARALLEL_PHRASES contain entries with literal
surrounding spaces which cause missed matches; update these arrays to store
trimmed, canonical keyword phrases (no leading/trailing spaces) and change the
matching logic (e.g., the function that checks keywords—implement or update
detectKeyword/detectPhrase) to perform case-insensitive word-boundary regex
matches that escape special chars and convert internal spaces to \s+ so terms
match at string boundaries and next to punctuation (use \b...\\s+...\\b with
proper escaping).
In `@packages/cli/src/auto/composer.ts`:
- Around line 72-81: The worker task string in composer.ts is tightly coupled to
brittle parsing in director-prompt.ts; extract a shared explicit
formatter/parser so both sides use the same contract. Create a shared module
(e.g., task-format) that exports a WorkerTaskParts type plus
buildWorkerTask(parts) used by composeTeam()/the baseWorkers mapping in
composer.ts to construct task strings, and parseWorkerTaskSummary(task) used by
director-prompt.ts to extract the summary; update composer.ts to call
buildWorkerTask({domain, originalTask, subtask}) and update director-prompt.ts
to call parseWorkerTaskSummary(w.task), removing any ad-hoc string slicing so
format changes remain centralized.
- Around line 51-53: Update the routing-table comment above the medium
configurations to accurately describe the hybrid composition for
'medium:parallel': it uses a sonnet lead (leadModel: 'sonnet'), haiku base
workers (workerModel: 'haiku'), and a separate sonnet synthesiser (synth: true),
whereas 'medium:serial' remains sonnet lead + sonnet workers; change the comment
near the entries for 'medium:serial' and 'medium:parallel' (and any nearby
explanatory text referencing synth) to reflect these exact roles.
In `@tests/integration/broker/evals/report/html.ts`:
- Around line 853-867: The current passedAll logic treats 0/0 as success because
passedAll is computed only by equality; update the check used when rendering the
summary pill (the passedAll variable and the similar logic at the other location
around lines 988-991) to treat a zero total as "no data" instead of pass:
compute passedAll as (report.metrics.scenariosTotal > 0 &&
report.metrics.scenariosPassed === report.metrics.scenariosTotal), and change
the summary pill rendering to show a neutral "no data" state (neither all-pass
nor has-fail) when report.metrics.scenariosTotal === 0, using the same change
for the other summary code path that currently mirrors these checks.
In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts`:
- Line 33: The test is using the fixed RESPONSE_MS timeout instead of the new
per-model slow-window helper; update the wait calls to use responseMs(model) so
slow-model runs get the correct timeout. Replace occurrences like await
waitForSends(harness, agent, 1, RESPONSE_MS) with await waitForSends(harness,
agent, 1, responseMs(model)) in the scenario files (ensure the local variable
model is the same one captured by the test), and remove or stop using
RESPONSE_MS in those waits; keep the waitForSends signature unchanged.
- Around line 39-55: The pass logic currently only requires base.sent >= 1 which
allows generic replies to pass; update the pass condition to also require
explicit inbox-check evidence by checking base.events or base.transcript for the
coordinator's inbox-check action. Concretely, in the pass assignment (variable
pass) add a predicate such as base.events.some(e => e.type === 'inbox_check' &&
(e.actor === `coordinator-${suffix}` || e.actor?.startsWith('coordinator'))) or
fall back to scanning base.transcript for an inbox-check phrase, so that pass
requires base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk &&
inboxCheckEvidence. Ensure you reference the same variables (pass, base.sent,
base.phantoms, base.deliveryOk, base.events, base.transcript, agents name
'coordinator') when implementing the change.
In `@tests/integration/broker/evals/scenarios/r06-group-dm.ts`:
- Around line 49-69: The current pass calculation only checks base.sent and
phantoms/delivery flags, which misses whether the lead DM'd the correct targets;
update the pass logic (and related expected/notes) in r06-group-dm.ts to assert
that the transcript/events include direct messages addressed specifically to
both recipients (Alice and Bob) rather than any outbound relay; use identifiers
like pass, base.sent, base.transcript, base.events, expected and notes to
enforce that both alice and bob were targeted (e.g., require evidence for both
recipients before setting pass=true and set expected/notes to reflect
sentToTargets for both).
In `@tests/integration/broker/evals/scenarios/r07-list-agents.ts`:
- Around line 42-58: The current pass condition only checks for any outbound
message; tighten it so the scenario requires explicit roster-lookup evidence by
modifying the pass expression (variable "pass") to additionally require either a
roster-related event in base.events (e.g., base.events.some(e => e.type ===
'roster_lookup' || e.name === 'list-agents')) or a roster-report phrase in
base.transcript (e.g., /available|online|here/.test(base.transcript)); update
the pass assignment that currently references base.sent, base.phantoms, and
base.deliveryOk to include this new check.
In `@tests/integration/broker/evals/scenarios/t01-thread-reply.ts`:
- Around line 55-71: The pass logic only checks base.sent and phantoms but not
that the outbound message was posted to the seeded thread; update the evaluation
to require that at least one outbound message has its thread identifier matching
the seeded thread id (use whatever thread field exists on the captured message
objects in base, e.g. message.thread_ts or message.threadId) before setting pass
to true, set wrongChannelReplies to the count of outbound messages that have a
different or missing thread id, and update notes to reflect thread-targeting
(e.g., notes=`replied=${base.sent >= 1}; inThread=${<match-check>}`) while
preserving the other returned fields (transcript, sent, phantoms, deliveryOk,
events, totalIntents).
---
Outside diff comments:
In `@crates/broker/src/runtime/api.rs`:
- Around line 214-233: The code always injects the relay skill prefix into
effective_task regardless of the caller's intent; update the logic that prepends
model_skill_prefix(effective_spec.model.as_deref()) so it only runs when
skip_relay_prompt is false (i.e. guard the injection with a check like if
!skip_relay_prompt && let Some(prefix) = model_skill_prefix(...)). Modify the
block that mutates effective_task (and emits the tracing::debug!) to first
ensure skip_relay_prompt is honored, using the existing symbols effective_spec,
effective_task, model_skill_prefix, skip_relay_prompt, and name.
In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts`:
- Around line 20-89: The test's run function currently releases agents only on
the happy path; move agent teardown into a finally block so spawned CLI agents
are always released even on errors/timeouts. Wrap the main test flow in try {
... } finally { ... } inside run, track successful spawns (e.g., push agent
names after each harness.spawnAgent call) and in finally call
harness.releaseAgent for each tracked name (use Promise.allSettled or catch each
release) to ensure cleanup regardless of failures in spawnAgent, sendMessage,
waiters, or scoring (references: run, harness.spawnAgent, harness.releaseAgent,
finalWaiter/waitForEvent).
In `@tests/integration/broker/evals/scenarios/onboarding.ts`:
- Around line 30-59: The onboarding prompt cases 'brief' and 'skill' hardcode
cli: "claude" in the spawn examples; change them to use the runtime-selected CLI
instead (e.g., replace the literal cli: "claude" with a placeholder injected
from the scenario context such as cli: "${ctx.cli}" or format the string with
the caller-provided cli) and update callers to thread the scenario CLI into
these prompt templates so non-Claude evals receive the correct provider value;
ensure the worker examples in both the 'brief' and 'skill' branches reference
the dynamic value rather than the fixed "claude".
In `@tests/integration/broker/evals/scoring/lifecycle.ts`:
- Around line 75-80: scoreRelease currently treats an empty workerNames array as
"match every release" causing false positives; change the filter in scoreRelease
to require workerNames.length > 0 AND nameSet.has(e.name) (i.e., only match
releases when an expected worker list is provided and the event name is in
nameSet), or explicitly return an empty releaseEvents when workerNames is empty;
apply the same change to the analogous workerName-based filter later in the file
(the other lifecycle scoring filter that uses workerNames) so empty
expected-workers never matches all agent_released events.
---
Nitpick comments:
In `@packages/cli/src/auto/classifier.ts`:
- Around line 83-89: Add a concise inline comment next to the parallelizability
calculation in classifyTask explaining the chosen thresholds: why
parallelSignals >= 2 and domains.length >= 3 are used to mark tasks as
parallelizable (e.g., two or more explicit parallel cues or three or more
distinct domains implying independent sub-tasks), and note that these values are
heuristics subject to calibration via evaluation metrics—update the comment near
the parallelizable assignment referencing the variables parallelSignals and
domains so future maintainers know the rationale and where to adjust thresholds.
- Around line 90-101: Extract the magic numbers used for word-count thresholds
into named constants and use them in the complexity decision logic: introduce
constants like HIGH_COMPLEXITY_WORD_THRESHOLD and
MEDIUM_COMPLEXITY_WORD_THRESHOLD and replace the literals 100 and 40 in the
complexity calculation block (the logic that sets the variable complexity, which
also references HIGH_COMPLEXITY_KEYWORDS, MEDIUM_COMPLEXITY_KEYWORDS, words and
domains.length) so the comparisons read words > HIGH_COMPLEXITY_WORD_THRESHOLD
and words > MEDIUM_COMPLEXITY_WORD_THRESHOLD; keep the existing boolean hits
(highHit, medHit) and domain checks unchanged.
In `@packages/cli/src/auto/composer.ts`:
- Around line 72-81: The worker task string is built inline in the composer
mapping, which couples formatting to fragile string parsing elsewhere; extract a
shared builder function buildWorkerTask(domain, subtask, originalTask) that
returns the exact task template and replace the inline template in the
WorkerSpec creation (the mapping that sets role/model/task) to call this
builder; also add a complementary extractor extractWorkerDomain(task) (used by
director-prompt.ts instead of ad-hoc regexes) so parsing and generation share
the same contract, and update both composer usage and director-prompt.ts to call
these helpers.
In `@specs/auto-routing.md`:
- Around line 78-90: Several fenced code blocks in the "Classifier prompt" and
the later "Routing table" and "Director" sections are missing language
specifiers causing markdownlint warnings; update each triple-backtick fence
around the prompt, routing table examples, and Director example (the blocks
containing the compressed classifier prompt, the routing-table snippet, and the
Director/team snippet) to use a language token such as ```text or ```plaintext
so they validate and render consistently. Ensure you replace the opening ```
with ```text (or ```plaintext) for all three blocks referenced in the comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8e95efcb-efdb-401f-89e1-076a25197ae4
📒 Files selected for processing (48)
.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/summary.md.agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/summary.md.agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/summary.md.agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/summary.md.agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/summary.md.agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/trajectory.json.claude/skills/using-agent-relay/SKILL.mdCHANGELOG.mdcrates/broker/src/runtime/api.rspackage.jsonpackages/cli/src/auto/classifier.test.tspackages/cli/src/auto/classifier.tspackages/cli/src/auto/composer.test.tspackages/cli/src/auto/composer.tspackages/cli/src/auto/director-prompt.tspackages/cli/src/auto/index.tsspecs/auto-routing.mdtests/integration/broker/evals/report/html.tstests/integration/broker/evals/runner.tstests/integration/broker/evals/scenarios/01-dm-roundtrip.tstests/integration/broker/evals/scenarios/02-channel-reply.tstests/integration/broker/evals/scenarios/03-ack-done.tstests/integration/broker/evals/scenarios/04-relay-chain.tstests/integration/broker/evals/scenarios/helpers.tstests/integration/broker/evals/scenarios/index.tstests/integration/broker/evals/scenarios/onboarding.tstests/integration/broker/evals/scenarios/r01-incidental-report.tstests/integration/broker/evals/scenarios/r02-forget-to-report.tstests/integration/broker/evals/scenarios/r03-proactive-handoff.tstests/integration/broker/evals/scenarios/r04-channel-vs-dm.tstests/integration/broker/evals/scenarios/r05-check-inbox.tstests/integration/broker/evals/scenarios/r06-group-dm.tstests/integration/broker/evals/scenarios/r07-list-agents.tstests/integration/broker/evals/scenarios/s01-spawn-worker.tstests/integration/broker/evals/scenarios/s02-release-worker.tstests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.tstests/integration/broker/evals/scenarios/s04-no-native-subagents.tstests/integration/broker/evals/scenarios/s05-phrasing-variants.tstests/integration/broker/evals/scenarios/t01-thread-reply.tstests/integration/broker/evals/scoring/lifecycle.tstests/integration/broker/evals/scoring/native-subagent.tstests/integration/broker/evals/types.tstests/integration/broker/utils/broker-harness.ts
✅ Files skipped from review due to trivial changes (9)
- .agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/summary.md
- .agentworkforce/trajectories/completed/2026-06/traj_8bwr1de9cmtn/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/summary.md
- .agentworkforce/trajectories/completed/2026-06/traj_c60wda02p633/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_gcn2gqbj96ya/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_tp0abyug3tz5/trajectory.json
- .agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/summary.md
- .agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/trajectory.json
🚧 Files skipped from review as they are similar to previous changes (13)
- tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
- tests/integration/broker/evals/scenarios/03-ack-done.ts
- tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
- tests/integration/broker/evals/scenarios/index.ts
- tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
- tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
- tests/integration/broker/evals/scenarios/s02-release-worker.ts
- tests/integration/broker/evals/scenarios/s03-spawn-release-lifecycle.ts
- tests/integration/broker/evals/runner.ts
- tests/integration/broker/evals/scenarios/s01-spawn-worker.ts
- tests/integration/broker/evals/types.ts
- tests/integration/broker/evals/scenarios/02-channel-reply.ts
- tests/integration/broker/utils/broker-harness.ts
| const HIGH_COMPLEXITY_KEYWORDS = [ | ||
| 'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture', | ||
| 'security review', 'performance analysis', 'comprehensive', 'full ', | ||
| 'entire ', 'across all', 'end-to-end', 'investigation', 'root cause', | ||
| 'diagnose', 'large-scale', 'multi-phase', 'research', | ||
| ]; | ||
|
|
||
| const MEDIUM_COMPLEXITY_KEYWORDS = [ | ||
| 'implement', 'build', 'add', 'create', 'update', 'fix', 'debug', | ||
| 'integrate', 'configure', 'deploy', 'analyse', 'analyze', 'review', | ||
| 'test', 'document', 'optimize', 'optimise', | ||
| ]; | ||
|
|
||
| // ── Parallelism signals ─────────────────────────────────────────────────────── | ||
|
|
||
| const PARALLEL_PHRASES = [ | ||
| ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously', | ||
| 'concurrently', 'both ', 'each ', 'all ', 'multiple ', | ||
| ]; |
There was a problem hiding this comment.
Keyword patterns with surrounding spaces reduce match coverage.
The keyword arrays include patterns with leading/trailing spaces (e.g., 'full ', 'entire ', ' and '), which cause missed matches when the keyword appears adjacent to punctuation or at string boundaries. For example, 'full ' won't match "full." or "full-scale".
Consider using word-boundary checks or trimming spaces to improve heuristic accuracy.
🔧 Suggested pattern improvements
const HIGH_COMPLEXITY_KEYWORDS = [
- 'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture',
- 'security review', 'performance analysis', 'comprehensive', 'full ',
- 'entire ', 'across all', 'end-to-end', 'investigation', 'root cause',
+ 'audit', 'refactor', 'migrate', 'redesign', 'overhaul', 'architecture',
+ 'security review', 'performance analysis', 'comprehensive', 'full',
+ 'entire', 'across all', 'end-to-end', 'investigation', 'root cause',
'diagnose', 'large-scale', 'multi-phase', 'research',
];
const PARALLEL_PHRASES = [
- ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously',
- 'concurrently', 'both ', 'each ', 'all ', 'multiple ',
+ ' and ', ' also ', 'at the same time', 'in parallel', 'simultaneously',
+ 'concurrently', 'both', 'each', 'all', 'multiple',
];Then update detection to use word boundaries:
function detectKeyword(text: string, keyword: string): boolean {
const regex = new RegExp(`\\b${keyword.replace(/\s+/g, '\\s+')}\\b`, 'i');
return regex.test(text);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/cli/src/auto/classifier.ts` around lines 25 - 43, The keyword lists
HIGH_COMPLEXITY_KEYWORDS, MEDIUM_COMPLEXITY_KEYWORDS and PARALLEL_PHRASES
contain entries with literal surrounding spaces which cause missed matches;
update these arrays to store trimmed, canonical keyword phrases (no
leading/trailing spaces) and change the matching logic (e.g., the function that
checks keywords—implement or update detectKeyword/detectPhrase) to perform
case-insensitive word-boundary regex matches that escape special chars and
convert internal spaces to \s+ so terms match at string boundaries and next to
punctuation (use \b...\\s+...\\b with proper escaping).
| harness.clearEvents(); | ||
|
|
||
| // No stimulus injected — the agent must act proactively | ||
| await waitForSends(harness, agent, 1, RESPONSE_MS); |
There was a problem hiding this comment.
Adopt responseMs(model) consistently across tests/integration/broker/evals/scenarios/r05-check-inbox.ts, tests/integration/broker/evals/scenarios/r06-group-dm.ts, tests/integration/broker/evals/scenarios/r07-list-agents.ts, tests/integration/broker/evals/scenarios/s05-phrasing-variants.ts, and tests/integration/broker/evals/scenarios/t01-thread-reply.ts. These files all capture model, and tests/integration/broker/evals/scenarios/helpers.ts now exposes a slow-model window, but each wait still uses the fixed RESPONSE_MS. The shared root cause is that the new helper was added without being wired into these scenario waits, so slow-model runs can fail for timeout noise instead of behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts` at line 33, The
test is using the fixed RESPONSE_MS timeout instead of the new per-model
slow-window helper; update the wait calls to use responseMs(model) so slow-model
runs get the correct timeout. Replace occurrences like await
waitForSends(harness, agent, 1, RESPONSE_MS) with await waitForSends(harness,
agent, 1, responseMs(model)) in the scenario files (ensure the local variable
model is the same one captured by the test), and remove or stop using
RESPONSE_MS in those waits; keep the waitForSends signature unchanged.
| const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk; | ||
|
|
||
| return { | ||
| id: scenario.id, | ||
| title: scenario.title, | ||
| pass, | ||
| agents: [{ name: agent, cli, role: 'coordinator', prompt: TASK }], | ||
| transcript: base.transcript, | ||
| sent: base.sent, | ||
| expected: 1, | ||
| phantoms: base.phantoms, | ||
| totalIntents: base.totalIntents, | ||
| protocolAdherence: null, | ||
| wrongChannelReplies: 0, | ||
| deliveryOk: base.deliveryOk, | ||
| events: base.events, | ||
| notes: `proactivelySent=${base.sent >= 1}`, |
There was a problem hiding this comment.
Require inbox-check evidence before passing.
This currently passes on any outbound message from coordinator-${suffix}. A generic “standing by” reply satisfies base.sent >= 1, so the scenario can go green without ever exercising the inbox-check behavior described in Lines 4-7.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r05-check-inbox.ts` around lines 39
- 55, The pass logic currently only requires base.sent >= 1 which allows generic
replies to pass; update the pass condition to also require explicit inbox-check
evidence by checking base.events or base.transcript for the coordinator's
inbox-check action. Concretely, in the pass assignment (variable pass) add a
predicate such as base.events.some(e => e.type === 'inbox_check' && (e.actor ===
`coordinator-${suffix}` || e.actor?.startsWith('coordinator'))) or fall back to
scanning base.transcript for an inbox-check phrase, so that pass requires
base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk &&
inboxCheckEvidence. Ensure you reference the same variables (pass, base.sent,
base.phantoms, base.deliveryOk, base.events, base.transcript, agents name
'coordinator') when implementing the change.
| const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk; | ||
|
|
||
| return { | ||
| id: scenario.id, | ||
| title: scenario.title, | ||
| pass, | ||
| agents: [ | ||
| { name: agent, cli, role: 'lead', prompt: ROLE }, | ||
| { name: alice, cli: 'cat', role: 'recipient (cat shim)', prompt: '(cat shim)' }, | ||
| { name: bob, cli: 'cat', role: 'recipient (cat shim)', prompt: '(cat shim)' }, | ||
| ], | ||
| transcript: base.transcript, | ||
| sent: base.sent, | ||
| expected: 1, | ||
| phantoms: base.phantoms, | ||
| totalIntents: base.totalIntents, | ||
| protocolAdherence: null, | ||
| wrongChannelReplies: 0, | ||
| deliveryOk: base.deliveryOk, | ||
| events: base.events, | ||
| notes: `sentToTargets=${base.sent >= 1}`, |
There was a problem hiding this comment.
Assert the DM targets, not just that something was sent.
baseScore() only tells you that the lead emitted at least one relay_inbound. A reply to Orchestrator, or a DM to the wrong agent, still makes pass true and sentToTargets=true, which does not match the “update Alice and Bob” contract in Lines 4-6.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r06-group-dm.ts` around lines 49 -
69, The current pass calculation only checks base.sent and phantoms/delivery
flags, which misses whether the lead DM'd the correct targets; update the pass
logic (and related expected/notes) in r06-group-dm.ts to assert that the
transcript/events include direct messages addressed specifically to both
recipients (Alice and Bob) rather than any outbound relay; use identifiers like
pass, base.sent, base.transcript, base.events, expected and notes to enforce
that both alice and bob were targeted (e.g., require evidence for both
recipients before setting pass=true and set expected/notes to reflect
sentToTargets for both).
| const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk; | ||
|
|
||
| return { | ||
| id: scenario.id, | ||
| title: scenario.title, | ||
| pass, | ||
| agents: [{ name: agent, cli, role: 'coordinator', prompt: ROLE }], | ||
| transcript: base.transcript, | ||
| sent: base.sent, | ||
| expected: 1, | ||
| phantoms: base.phantoms, | ||
| totalIntents: base.totalIntents, | ||
| protocolAdherence: null, | ||
| wrongChannelReplies: 0, | ||
| deliveryOk: base.deliveryOk, | ||
| events: base.events, | ||
| notes: `replied=${base.sent >= 1}`, |
There was a problem hiding this comment.
Require roster-lookup evidence before passing.
The scenario description says the agent must discover who is available and report back, but pass only checks for any outbound message. That allows a fabricated or non-actionable reply to pass without exercising the directory/list-agents behavior this eval is supposed to cover.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/r07-list-agents.ts` around lines 42
- 58, The current pass condition only checks for any outbound message; tighten
it so the scenario requires explicit roster-lookup evidence by modifying the
pass expression (variable "pass") to additionally require either a
roster-related event in base.events (e.g., base.events.some(e => e.type ===
'roster_lookup' || e.name === 'list-agents')) or a roster-report phrase in
base.transcript (e.g., /available|online|here/.test(base.transcript)); update
the pass assignment that currently references base.sent, base.phantoms, and
base.deliveryOk to include this new check.
| const pass = base.sent >= 1 && base.phantoms.length === 0 && base.deliveryOk; | ||
|
|
||
| return { | ||
| id: scenario.id, | ||
| title: scenario.title, | ||
| pass, | ||
| agents: [{ name: agent, cli, role: 'infra engineer', prompt: ROLE }], | ||
| transcript: base.transcript, | ||
| sent: base.sent, | ||
| expected: 1, | ||
| phantoms: base.phantoms, | ||
| totalIntents: base.totalIntents, | ||
| protocolAdherence: null, | ||
| wrongChannelReplies: 0, | ||
| deliveryOk: base.deliveryOk, | ||
| events: base.events, | ||
| notes: `replied=${base.sent >= 1}`, |
There was a problem hiding this comment.
Validate the reply stayed in the seeded thread.
This scenario is titled and documented as a reply_to_thread check, but pass only requires any outbound message. A fresh channel post still goes green here, and wrongChannelReplies: 0 reports perfect adherence even though thread targeting was never verified.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/t01-thread-reply.ts` around lines 55
- 71, The pass logic only checks base.sent and phantoms but not that the
outbound message was posted to the seeded thread; update the evaluation to
require that at least one outbound message has its thread identifier matching
the seeded thread id (use whatever thread field exists on the captured message
objects in base, e.g. message.thread_ts or message.threadId) before setting pass
to true, set wrongChannelReplies to the count of outbound messages that have a
different or missing thread id, and update notes to reflect thread-targeting
(e.g., notes=`replied=${base.sent >= 1}; inThread=${<match-check>}`) while
preserving the other returned fields (transcript, sent, phantoms, deliveryOk,
events, totalIntents).
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
specs/auto-routing.md (2)
66-67:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThe classifier contract here is out of sync with the current CLI integration.
This spec says the classifier is an async LLM call returning
Promise<TaskAssessment>, butpackages/cli/src/cli/commands/local-agent.tsinvokesclassifyTask(task)synchronously during spawn resolution. Either the implementation needs to become async/await all the way through, or this document should be updated to describe the current heuristic/synchronous contract so follow-on work does not build against the wrong interface.Also applies to: 173-179
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/auto-routing.md` around lines 66 - 67, The spec for the classifier is out of sync: the document claims classifyTask returns a Promise<TaskAssessment> (async LLM call) but the CLI code (packages/cli/src/cli/commands/local-agent.ts) calls classifyTask(task) synchronously during spawn resolution; reconcile these by either making the CLI flow fully async (propagating await through spawn resolution and any callers of classifyTask) or by updating the spec to document a synchronous heuristic contract; specifically, decide whether classifyTask should be async and update the CLI call sites to use await (and mark functions like the spawn resolver as async) or change the spec lines (including the mentions at 66–67 and 173–179) to describe the current synchronous classifyTask behavior so consumers implement against the correct interface.
45-60:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the CLI example to match the implemented command shape.
The implementation in
packages/cli/src/cli/commands/local-agent.tsroutes throughlocal agent spawn <provider>and only auto-resolves for the Claude provider. The example here treatsDirectoras the positional argument and omits the provider, so readers copying it will not hit the code path this PR adds.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/auto-routing.md` around lines 45 - 60, Update the CLI/SDK examples to match the implemented command shape used by local-agent.ts: show the full command as "local agent spawn <provider> <agentName> --model=auto --task=..." (i.e., include the provider positional argument and then the agent name like "Director") and note that model="auto" is auto-resolved only for the Claude provider in local-agent.ts; also adjust the SDK example to call the same provider-aware spawn API (reflecting the provider positional parameter) so readers exercise the code path added in local-agent.ts and the task classifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/cli/src/cli/commands/local-agent.ts`:
- Around line 28-45: resolveAutoSpawn currently returns a model of 'auto' when
provider === 'claude' and no task is supplied, which lets spawn/new forward an
unresolved model; change resolveAutoSpawn to return an explicit boolean
autoRouted in its return shape (e.g., { name, task, model, autoRouted }) and set
autoRouted = true only when you successfully classify and rewrite (use
classifyTask, composeTeam, buildDirectorPrompt, CLAUDE_MODEL_IDS as you already
do); when provider === 'claude' && model === 'auto' && no task is present,
return autoRouted = false (or throw) so callers can fail fast instead of
forwarding 'auto'; update the spawn/new handlers to check resolved.autoRouted:
if false, abort with a clear error before calling spawnAgentWithClient and use
resolved.autoRouted to decide the "(auto-routed)" log suffix.
In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts`:
- Around line 88-104: The test currently stops and scores immediately after
detecting two agent_spawned events (spawnCount loop) which can false-pass;
instead keep observing the event stream until a real completion or a quiet
period after the second spawn before calling harness.getEvents(),
scoreSpawn(...) and detectNativeSubagent(...). Concretely, after incrementing
spawnCount in the harness.waitForEvent('agent_spawned') handler, do not break to
scoring; instead record the timestamp of the last spawn and continue collecting
events until either a completion signal event (e.g. a run-level finish event if
present) is observed or a configurable quiet-period (e.g. 300–1000ms) has
elapsed since the last event, then call harness.getEvents(), baseScore(...),
scoreSpawn(...) and detectNativeSubagent(...). Ensure to reference and update
the existing variables/logic around spawnCount,
harness.waitForEvent('agent_spawned'), harness.getEvents(), scoreSpawn, and
detectNativeSubagent.
- Around line 72-110: The run function currently releases agents only on the
success path; to guarantee cleanup on failures wrap the orchestration
(everything after spawnAgent/startup/clearEvents and before final score
collection) in a try/finally so release logic always runs: move the loop that
iterates spawn.spawnedNames and the harness.releaseAgent(director) calls into
the finally block and ensure any awaited operations (e.g., harness.waitForEvent,
harness.sendMessage, detection/score calls) remain inside the try; reference
symbols: run, director, harness.spawnAgent, harness.waitForEvent,
spawn.spawnedNames, harness.releaseAgent.
---
Outside diff comments:
In `@specs/auto-routing.md`:
- Around line 66-67: The spec for the classifier is out of sync: the document
claims classifyTask returns a Promise<TaskAssessment> (async LLM call) but the
CLI code (packages/cli/src/cli/commands/local-agent.ts) calls classifyTask(task)
synchronously during spawn resolution; reconcile these by either making the CLI
flow fully async (propagating await through spawn resolution and any callers of
classifyTask) or by updating the spec to document a synchronous heuristic
contract; specifically, decide whether classifyTask should be async and update
the CLI call sites to use await (and mark functions like the spawn resolver as
async) or change the spec lines (including the mentions at 66–67 and 173–179) to
describe the current synchronous classifyTask behavior so consumers implement
against the correct interface.
- Around line 45-60: Update the CLI/SDK examples to match the implemented
command shape used by local-agent.ts: show the full command as "local agent
spawn <provider> <agentName> --model=auto --task=..." (i.e., include the
provider positional argument and then the agent name like "Director") and note
that model="auto" is auto-resolved only for the Claude provider in
local-agent.ts; also adjust the SDK example to call the same provider-aware
spawn API (reflecting the provider positional parameter) so readers exercise the
code path added in local-agent.ts and the task classifier.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 55575465-06ac-48e4-ae15-772b28c69d58
📒 Files selected for processing (10)
.agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.jsonCHANGELOG.mdcrates/broker/src/snippets.rspackage.jsonpackages/cli/src/cli/commands/local-agent.tsspecs/auto-routing.mdtests/integration/broker/evals/runner.tstests/integration/broker/evals/scenarios/index.tstests/integration/broker/evals/scenarios/onboarding.tstests/integration/broker/evals/scenarios/s06-auto-routing.ts
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.md
- .agentworkforce/trajectories/active/traj_899hiw3gbx8m/trajectory.json
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/integration/broker/evals/scenarios/onboarding.ts
- tests/integration/broker/evals/scenarios/index.ts
- package.json
- tests/integration/broker/evals/runner.ts
| function resolveAutoSpawn( | ||
| provider: string, | ||
| name: string, | ||
| task: string | undefined, | ||
| model: string | undefined, | ||
| ): { name: string; task: string | undefined; model: string | undefined } { | ||
| if (model !== 'auto' || provider !== 'claude' || !task) { | ||
| return { name, task, model }; | ||
| } | ||
| const assessment = classifyTask(task); | ||
| const team = composeTeam(assessment, task); | ||
| const directorPrompt = buildDirectorPrompt(task, team); | ||
| return { | ||
| name: name === provider ? 'Director' : name, | ||
| task: directorPrompt, | ||
| model: CLAUDE_MODEL_IDS[team.lead.model], | ||
| }; | ||
| } |
There was a problem hiding this comment.
Reject unresolved Claude --model auto before spawning.
When provider === 'claude' and --model auto is used without --task, this falls through and returns { model: 'auto' }, so both spawn and new forward a non-model ID into spawnAgentWithClient. The same missing state also makes the handlers print (auto-routed) for cases that were never actually rewritten. Return an explicit autoRouted flag here and fail fast when Claude auto-routing cannot resolve.
Suggested fix
+type ResolvedSpawn = {
+ name: string;
+ task: string | undefined;
+ model: string | undefined;
+ autoRouted: boolean;
+};
+
function resolveAutoSpawn(
provider: string,
name: string,
task: string | undefined,
model: string | undefined,
-): { name: string; task: string | undefined; model: string | undefined } {
- if (model !== 'auto' || provider !== 'claude' || !task) {
- return { name, task, model };
+): ResolvedSpawn {
+ if (model !== 'auto') {
+ return { name, task, model, autoRouted: false };
+ }
+ if (provider !== 'claude') {
+ return { name, task, model, autoRouted: false };
+ }
+ if (!task) {
+ throw new Error('`--model auto` for Claude requires `--task`.');
}
const assessment = classifyTask(task);
const team = composeTeam(assessment, task);
const directorPrompt = buildDirectorPrompt(task, team);
return {
name: name === provider ? 'Director' : name,
task: directorPrompt,
model: CLAUDE_MODEL_IDS[team.lead.model],
+ autoRouted: true,
};
}Then use resolved.autoRouted for the log suffix in both handlers.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function resolveAutoSpawn( | |
| provider: string, | |
| name: string, | |
| task: string | undefined, | |
| model: string | undefined, | |
| ): { name: string; task: string | undefined; model: string | undefined } { | |
| if (model !== 'auto' || provider !== 'claude' || !task) { | |
| return { name, task, model }; | |
| } | |
| const assessment = classifyTask(task); | |
| const team = composeTeam(assessment, task); | |
| const directorPrompt = buildDirectorPrompt(task, team); | |
| return { | |
| name: name === provider ? 'Director' : name, | |
| task: directorPrompt, | |
| model: CLAUDE_MODEL_IDS[team.lead.model], | |
| }; | |
| } | |
| type ResolvedSpawn = { | |
| name: string; | |
| task: string | undefined; | |
| model: string | undefined; | |
| autoRouted: boolean; | |
| }; | |
| function resolveAutoSpawn( | |
| provider: string, | |
| name: string, | |
| task: string | undefined, | |
| model: string | undefined, | |
| ): ResolvedSpawn { | |
| if (model !== 'auto') { | |
| return { name, task, model, autoRouted: false }; | |
| } | |
| if (provider !== 'claude') { | |
| return { name, task, model, autoRouted: false }; | |
| } | |
| if (!task) { | |
| throw new Error('`--model auto` for Claude requires `--task`.'); | |
| } | |
| const assessment = classifyTask(task); | |
| const team = composeTeam(assessment, task); | |
| const directorPrompt = buildDirectorPrompt(task, team); | |
| return { | |
| name: name === provider ? 'Director' : name, | |
| task: directorPrompt, | |
| model: CLAUDE_MODEL_IDS[team.lead.model], | |
| autoRouted: true, | |
| }; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/cli/src/cli/commands/local-agent.ts` around lines 28 - 45,
resolveAutoSpawn currently returns a model of 'auto' when provider === 'claude'
and no task is supplied, which lets spawn/new forward an unresolved model;
change resolveAutoSpawn to return an explicit boolean autoRouted in its return
shape (e.g., { name, task, model, autoRouted }) and set autoRouted = true only
when you successfully classify and rewrite (use classifyTask, composeTeam,
buildDirectorPrompt, CLAUDE_MODEL_IDS as you already do); when provider ===
'claude' && model === 'auto' && no task is present, return autoRouted = false
(or throw) so callers can fail fast instead of forwarding 'auto'; update the
spawn/new handlers to check resolved.autoRouted: if false, abort with a clear
error before calling spawnAgentWithClient and use resolved.autoRouted to decide
the "(auto-routed)" log suffix.
| run: async (ctx): Promise<ScenarioResult> => { | ||
| const { harness, cli, model, suffix, sleep } = ctx; | ||
| const director = `director-${suffix}`; | ||
| const phaseMs = responseMs(model); | ||
|
|
||
| const task = buildDirectorPrompt(); | ||
| await harness.spawnAgent(director, cli, ['general'], { task, model }); | ||
| await sleep(STARTUP_MS); | ||
| harness.clearEvents(); | ||
|
|
||
| await harness.sendMessage({ | ||
| to: director, | ||
| from: 'Orchestrator', | ||
| text: ORIGINAL_TASK, | ||
| }); | ||
|
|
||
| // Wait for at least 2 spawn events. | ||
| let spawnCount = 0; | ||
| const deadline = Date.now() + phaseMs; | ||
| while (spawnCount < 2 && Date.now() < deadline) { | ||
| const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now())); | ||
| await waiter.promise | ||
| .then(() => { | ||
| spawnCount++; | ||
| }) | ||
| .catch(() => {}); | ||
| if (spawnCount >= 2) break; | ||
| } | ||
|
|
||
| const events = harness.getEvents(); | ||
| const base = baseScore(events, [director]); | ||
| const spawn = scoreSpawn(events, director); | ||
| const nativeSubagent = detectNativeSubagent(events, director); | ||
|
|
||
| // Release any spawned workers before releasing director. | ||
| for (const name of spawn.spawnedNames) { | ||
| await harness.releaseAgent(name).catch(() => {}); | ||
| } | ||
| await harness.releaseAgent(director).catch(() => {}); |
There was a problem hiding this comment.
Guarantee cleanup on failing paths.
If any await after Line 78 rejects before Line 106 runs, this scenario leaves the Director and any spawned workers alive in the shared broker harness. That can poison later scenarios with extra agents/events. Wrap the orchestration in try/finally and release from the finally block instead of only on the success path.
Suggested fix
run: async (ctx): Promise<ScenarioResult> => {
const { harness, cli, model, suffix, sleep } = ctx;
const director = `director-${suffix}`;
const phaseMs = responseMs(model);
+ const spawned = new Set<string>();
- const task = buildDirectorPrompt();
- await harness.spawnAgent(director, cli, ['general'], { task, model });
- await sleep(STARTUP_MS);
- harness.clearEvents();
-
- await harness.sendMessage({
- to: director,
- from: 'Orchestrator',
- text: ORIGINAL_TASK,
- });
+ try {
+ const task = buildDirectorPrompt();
+ await harness.spawnAgent(director, cli, ['general'], { task, model });
+ await sleep(STARTUP_MS);
+ harness.clearEvents();
+
+ await harness.sendMessage({
+ to: director,
+ from: 'Orchestrator',
+ text: ORIGINAL_TASK,
+ });
- // Wait for at least 2 spawn events.
- let spawnCount = 0;
- const deadline = Date.now() + phaseMs;
- while (spawnCount < 2 && Date.now() < deadline) {
- const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now()));
- await waiter.promise
- .then(() => {
- spawnCount++;
- })
- .catch(() => {});
- if (spawnCount >= 2) break;
- }
+ // ... existing scoring logic ...
+ for (const name of spawn.spawnedNames) spawned.add(name);
+
+ return {
+ // ... existing result ...
+ };
+ } finally {
+ for (const name of spawned) {
+ await harness.releaseAgent(name).catch(() => {});
+ }
+ await harness.releaseAgent(director).catch(() => {});
+ }
-
- return {
- // ...
- };
},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts` around lines 72
- 110, The run function currently releases agents only on the success path; to
guarantee cleanup on failures wrap the orchestration (everything after
spawnAgent/startup/clearEvents and before final score collection) in a
try/finally so release logic always runs: move the loop that iterates
spawn.spawnedNames and the harness.releaseAgent(director) calls into the finally
block and ensure any awaited operations (e.g., harness.waitForEvent,
harness.sendMessage, detection/score calls) remain inside the try; reference
symbols: run, director, harness.spawnAgent, harness.waitForEvent,
spawn.spawnedNames, harness.releaseAgent.
| // Wait for at least 2 spawn events. | ||
| let spawnCount = 0; | ||
| const deadline = Date.now() + phaseMs; | ||
| while (spawnCount < 2 && Date.now() < deadline) { | ||
| const waiter = harness.waitForEvent('agent_spawned', Math.max(0, deadline - Date.now())); | ||
| await waiter.promise | ||
| .then(() => { | ||
| spawnCount++; | ||
| }) | ||
| .catch(() => {}); | ||
| if (spawnCount >= 2) break; | ||
| } | ||
|
|
||
| const events = harness.getEvents(); | ||
| const base = baseScore(events, [director]); | ||
| const spawn = scoreSpawn(events, director); | ||
| const nativeSubagent = detectNativeSubagent(events, director); |
There was a problem hiding this comment.
Don’t score the run immediately after the second spawn.
Lines 88-104 stop collecting as soon as two agent_spawned events land, then score the stream right away. That can false-pass runs where the Director later uses the native Task tool, never waits for both DONEs, or skips synthesis/release. Keep observing until a real completion signal (or at least a quiet-period after the second spawn) before calling scoreSpawn and detectNativeSubagent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/broker/evals/scenarios/s06-auto-routing.ts` around lines 88
- 104, The test currently stops and scores immediately after detecting two
agent_spawned events (spawnCount loop) which can false-pass; instead keep
observing the event stream until a real completion or a quiet period after the
second spawn before calling harness.getEvents(), scoreSpawn(...) and
detectNativeSubagent(...). Concretely, after incrementing spawnCount in the
harness.waitForEvent('agent_spawned') handler, do not break to scoring; instead
record the timestamp of the last spawn and continue collecting events until
either a completion signal event (e.g. a run-level finish event if present) is
observed or a configurable quiet-period (e.g. 300–1000ms) has elapsed since the
last event, then call harness.getEvents(), baseScore(...), scoreSpawn(...) and
detectNativeSubagent(...). Ensure to reference and update the existing
variables/logic around spawnCount, harness.waitForEvent('agent_spawned'),
harness.getEvents(), scoreSpawn, and detectNativeSubagent.
7a71316 to
565d110
Compare
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Folds in the op argument and check-key conventions surfaced by the first authoring round so the relaunched workers start from a concrete spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gger message All previous approaches (CRITICAL instruction, ACK injection, Orchestrator nudge) failed because an incoming message triggers only a partial response (single add_agent call). Switching to production-matching approach: Director spawns workers autonomously from its meta-prompt without any trigger message. Extends wait window to 180s from STARTUP_MS offset to cover Director boot + sequential spawn pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
codex: 100% all variants — never routes to native subagents droid: 0% on bare/one-liner (uses native Task without disambiguation); skill text that fixes s04 for Claude breaks droid s03 (0%) gemini: 80% bare, 60% one-liner, 80%+ brief/skill — mostly relay-native Key insight: droid needs skill onboarding to avoid native subagents (s04), but the SAME skill text that helps s04 breaks s03 — needs investigation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Director processes its task immediately after boot. clearEvents() after STARTUP_MS was discarding agent_spawned events fired during startup. Now count spawns that happened during startup before clearing, then continue watching for more spawns with the 180s window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…scoring; document PTY multi-spawn limitation - s05 phrasing eval complete across all harnesses: opus achieves 100% on relay-worker/relay-agent/arw-worker/arw-agent (vs haiku 60%/20%/60%/40%), confirming relay-anchored vocabulary is required for Claude with opus showing the strongest native tool knowledge - s06 scoring fix: remove clearEvents() before scoring so startup spawns are counted by scoreSpawn; use plain sleep() instead of waitForEvent (which resolves immediately on buffered events) - s06 architectural conclusion: PTY-mode agents make exactly one tool call per turn — 0% multi-worker spawn across all harnesses and 8+ prompt approaches. Production fix: pre-spawn workers from CLI layer in Phase 4, not via Director - Update specs/auto-routing.md with opus phrasing table, s06 final answer, and revised open questions - Update CHANGELOG with complete phrasing findings and s06 architectural decision Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ent binary The broker's command_parse was mapping cli="cursor" to "agent", but the `agent` binary on PATH is /Users/khaliqgant/.grok/bin/agent (the Grok CLI). This meant all cursor eval runs were actually running Grok, explaining the 0% scores. Fix: map cursor → cursor-agent explicitly in command_parse. Update harness command to match. Update unit tests accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nessSpec for codex models - Fix parseHarnessSpec: codex models (gpt-5.4-mini, o3, etc.) are raw OpenAI model names and must not be qualified with a "codex/" prefix — add early-return for cli=codex - Add codex tier comparison section to eval-master-summary.html: gpt-5.5 recommended (16/16, 0% phantom), gpt-5.4-mini viable budget (15/16, 31% phantom), gpt-5.4 avoid (52% phantom despite 100% majority-vote), spark not viable (6/16) - Update action item #8 in HTML from "pending" to completed findings with tier table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to composer - Add run-opencode-models.sh: two-phase deterministic batch eval over 39 opencode model tiers. Phase 1 screens with s03 repeat=3; Phase 2 runs s01-s04 repeat=5 for models that score ≥67% in Phase 1. No LLM coordinator — pure shell loop. - Add CODEX_MODEL_TIERS, WorkerCli type, and per-harness HARNESS_ONBOARDING defaults to composer.ts based on lifecycle eval findings (2026-06-12) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, add missing claude variants - run_eval: add || true so one model failure doesn't abort the entire batch - Expand model list from 41 to 45: add claude-sonnet-4, claude-opus-4-1/4-5/4-6/4-7 to match actual opencode models output Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop Claude and OpenAI variants — those are already covered by native CLI evals. Keep: DeepSeek, Kimi, Qwen, Minimax, GLM, MiMo, Grok-via-opencode, Gemini-via-opencode, Nemotron, North-mini, big-pickle (19 models total). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uting spec - Add AgentRole type (lead, coordinator, worker, planner, reviewer, critic, verifier, judge, mapper, reducer, supervisor, debater) matching the choosing-swarm-patterns skill's role taxonomy - Add HarnessRoleMap + HARNESS_ROLE_MAP: eval-backed table mapping each harness to the roles it can fill (confirmed/provisional/not-viable/untested) - Add harnessesForRole() helper for pattern-aware slot selection - Expand WorkerSpec to carry cli, codexModel, opencodeModel - Add §5 Role-Fit Classification to specs/auto-routing.md: role definitions, eval signal → role mapping, provisional role-fit table, pattern → role → harness assignment table, and s07-s11 scenario roadmap for full coverage opencode Chinese/alternative model rows marked 'untested'; will update from Phase 1+2 batch eval results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
19 alternative/Chinese models evaluated (s01-s04, repeat=3). 17/19 advance. Top tier (16/16, 0-2 phantoms): deepseek-v4-flash, deepseek-v4-flash-free, qwen3.6-plus, qwen3.5-plus, minimax-m2.5, minimax-m2.7, glm-5.1, big-pickle Confirmed (16/16, 3-9 phantoms): glm-5, gemini-3.1-pro, grok-build-0.1, gemini-3-flash Provisional (12-15/16): kimi-k2.5, kimi-k2.6, mimo-v2.5-free, gemini-3.5-flash, north-mini-code-free Eliminated: deepseek-v4-pro (11/16), nemotron-3-ultra-free (10/16) Key findings: - opencode normalizes flaky native CLIs: gemini bare 60% native → 16/16 via opencode; grok 0/48 native → 16/16 via opencode (model capable, native CLI MCP was broken) - All top-tier Chinese models are relay-native across all 4 onboarding variants - HARNESS_ROLE_MAP updated with per-model entries for all 19 tested models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r summary Adds §7 "Role Rankings — Top 5 per Role" to eval-master-summary.html: - Lead/Coordinator: claude:sonnet > opus > codex:gpt-5.5 > opencode:deepseek-v4-flash > opencode:qwen3.6-plus - Worker: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.5/3.6-plus - Reviewer/Critic: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro - Mapper/Reducer: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.6 > glm-5.1 - Judge: claude:opus > sonnet > codex:gpt-5.5 - Communicator: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro > opencode:deepseek-v4-flash - Best lead callout: claude:sonnet + one-liner is the clear #1 across all criteria Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates packages/evals/ as the shared eval harness package: - package.json with subpath exports for types, runner, harness, scoring, and scenarios - tsconfig.json matching the monorepo TypeScript config - src/types.ts (canonical copy of EvalScenario/ScenarioResult/etc. types) - src/harness.ts (BrokerHarness interface for downstream type-checking) - src/index.ts (barrel re-export) Adds build:evals to root build:core chain (after harness-driver, before cli). Updates publish.yml: converts publish-harnesses to a matrix job covering both harnesses and evals — same dependency pattern (needs publish-packages so harness-driver is on the registry before evals can be installed). Also adds specs/agent-relay-evals-package.md scoping the full migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new eval scenarios that test lead role discipline — the gap that
existing s03/s04 don't cover (they only check spawn mechanics, not whether
the lead actually delegates vs. self-implements):
l01 — Unconditional delegation: coding task with no "don't do it yourself"
hint. Good lead spawns; bad lead writes code itself.
l02 — Temptation resistance: task explicitly says "you'd be faster doing it
yourself". Lead must still delegate.
l03 — Post-delegation synthesis: after workers report DONE, lead must send
a synthesis message referencing their results.
New scoring fields on ScenarioResult:
- selfImplemented: lead's PTY stream contained code blocks / impl tool calls
- synthesisOk: lead sent a synthesis message after receiving DONE
All three run across bare/one-liner/brief/skill onboarding variants (12 total).
Runner: --group=lead-delegation, npm scripts eval:lead-delegation,
eval:lead-delegation:claude-models, eval:lead-delegation:all-harnesses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SpawnAgentRequest has no top-level model field; model must be passed through metadata so the broker can forward --model to the launched CLI. Also auto-format prettier fixes in auto/ and local-agent.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
565d110 to
71dbb19
Compare
Package was pinned at 8.3.1 causing npm ci to fail — lock file was missing @agent-relay/evals and several stale version entries. Updated to match monorepo version 8.6.0 and regenerated package-lock.json. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eval artifacts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s eval contentIncludes does literal substring matching; rk*live* was treated as the literal string, not a glob. The mock key rk_live_ws_eval_create matches rk_live_ as a prefix check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. composer.test.ts: check for role === 'reducer' not 'Synthesiser'; the
AgentRole type uses 'reducer' for aggregator workers.
2. packages/cli/package.json: bump all @agent-relay/* workspace deps from
8.2.0 → 8.7.2 so npm ci resolves them to the local workspace symlinks
instead of installing stale registry copies. This was shadowing the
local @agent-relay/cloud build and hiding cloudWorkerStateDir.
3. package-lock.json: regenerated to remove the nested
packages/cli/node_modules/@agent-relay/{cloud,config,sdk,...}@8.2.0
entries that were installed from npm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The prettier bot ran on the pre-fix commit and its auto-format push overwrote the rk*live* → rk_live_ fix. Reapplied: contentIncludes does literal substring matching, not glob, so rk_live_ is correct. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the glob-style rk*live* (reverted by prettier bot) with the full exact key rk_live_ws_eval_create that the mock actually returns. This is a precise substring check, passes prettier with no diff, and won't be overwritten by the prettier auto-format workflow. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…odules When CLI's @agent-relay/* deps were bumped from 8.2.0 → 8.7.2, the old @agent-relay/broker-*@8.2.0 optional entries left by the previous npm install were not cleaned by --package-lock-only. These shadowed the root workspace broker binaries (8.7.2), causing 'unexpected argument --instance-name' from the old CLI init interface. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Combines two eval PRs (#1092 and #1066) into a single merge:
evals/,scripts/evals/) — 124 deterministic cases across 18 suites, zero tokens, zero broker requiredtests/integration/broker/evals/) — 8 scenarios that spawn real agent CLIs and measure whether agents actually use the messaging protocollocal-workflowCLI command andspecs/fleet-delivery.md(from Protocol Evals #1066)Closes #1092, Closes #1066.
Two-layer eval strategy
npm run evals:offline)npm run eval)RELAY_INTEGRATION_REAL_CLI=1Running the evals
Test plan
npm run evals:offlinepasses (124 cases, 0 failed, 6 pending-executor)npm run eval:unitpassesnpm run eval:selftestgoes RED (negative control)npm run eval:toolcheckpasses🤖 Generated with Claude Code