Protocol Evals by willwashburn · Pull Request #1066 · AgentWorkforce/relay

willwashburn · 2026-06-09T10:00:32Z

tbd

Introduce a new integration eval suite that exercises agent-to-agent messaging via the broker and scores protocol adherence (message-sent rate, phantom messages, ACK/DONE protocol, wrong-channel replies). Adds a full eval runner, scenarios, deterministic scoring, reporters (JSON + self-contained HTML viewer), a matrix roll-up, unit tests for scoring, and CLI helpers under tests/integration/broker/evals. Adds npm scripts (eval:build, eval:unit, eval:selftest, eval:toolcheck, eval:html, eval, eval:claude, eval:matrix) and gitignore entry for evals-reports. Also adds a Fleet Delivery design doc (specs/fleet-delivery.md), updates CHANGELOG.md, and adjusts integration test config/files (tsconfig, vitest, and broker harness utilities) to align the broker-harness with the current SDK/harness-driver APIs so the evals build/run cleanly.

codeant-ai · 2026-06-09T10:00:36Z

Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI.

coderabbitai · 2026-06-09T10:00:47Z

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive agent messaging evaluation suite for broker integration testing (6 scenarios with phantom detection, protocol scoring, and reporting), adds local workflow CLI commands (run, logs, sync), refactors the broker test harness to use low-level driver-only APIs, and specifies a fleet delivery architecture for reliable multi-agent messaging.

Changes

Broker Eval Suite

Layer / File(s)	Summary
Eval types and build configuration `tests/integration/broker/evals/types.ts`, `tests/integration/broker/evals/tsconfig.json`, `package.json`, `tests/integration/broker/evals/README.md`	Central type definitions for scenarios, phantom/metric/report schemas, SCHEMA_VERSION constant, dedicated TypeScript build config, and npm eval scripts for building/running scenarios and reporters.
Eval scenarios and timing `tests/integration/broker/evals/scenarios/*`	Six scenarios (DM roundtrip, channel reply, ACK/DONE protocol, 3-agent relay chain, incidental report, proactive handoff, channel-vs-DM) plus shared timing constants and `waitForSends` polling helper.
Scoring: phantom, protocol, metrics `tests/integration/broker/evals/scoring/base.ts`, `phantom.ts`, `protocol.ts`, `metrics.ts`, `stream-clean.ts`, `toolcheck.ts`, `fixtures.ts`	Transcript building, phantom intent detection with ANSI normalization, ACK/DONE and channel-reply scoring, metrics aggregation from scenario results, stream output cleaning, tool reference validation, and event test fixtures.
Report generation `tests/integration/broker/evals/report/html.ts`, `report/write.ts`, `report/render-cli.ts`	HTML rendering for metric cards, transcripts, scenario sections, and matrix tables; JSON/HTML persistence with git SHA and iso timestamps; baseline regression comparison via `compareReports`.
Runner and integration `tests/integration/broker/evals/runner.ts`, `eval.test.ts`, `selftest.ts`, `toolcheck-cli.ts`	Eval scenario matrix runner with CLI flags, per-harness lifecycle, failure handling, metric aggregation, and optional baseline regression exit codes; negative-control selftest for false-positive detection; toolcheck CLI for validating tool-reference consistency; integration test wrapper.

Local Workflow CLI

Layer / File(s)	Summary
Workflow implementation `packages/cli/src/cli/commands/local-workflow.ts`	Dependency injection interface for run management, file-type inference, run-ID generation, atomic JSON persistence, monitor script generation that spawns workflows and tracks state, log streaming with follow/poll, and commander wiring for `run`, `logs`, `sync` subcommands.
Testing and bootstrap `packages/cli/src/cli/commands/local-workflow.test.ts`, `packages/cli/src/cli/bootstrap.ts`, `packages/cli/src/cli/bootstrap.test.ts`	Vitest suite verifying command registration and end-to-end workflow execution, log output capture, and entrypoint delegation; CLI bootstrap registration and updated leaf-command expectations.
Relayflows dependency `packages/cli/package.json`	Added `@relayflows/cli@^1.0.1` for delegating YAML, TypeScript, and Python workflow execution.

Broker Harness Refactoring

Layer / File(s)	Summary
Harness driver-only `tests/integration/broker/utils/broker-harness.ts`, `tests/integration/broker/utils/assert-helpers.ts`	Removed high-level `AgentRelay` facade usage; now provisions `RELAY_API_KEY` via `RelayCast.createWorkspace` and spawns only low-level `HarnessDriverClient`; updated type imports from `@agent-relay/sdk` to `@agent-relay/harness-driver`.

Fleet Delivery Specification and Documentation

Layer / File(s)	Summary
Architecture specification `specs/fleet-delivery.md`	189-line design document defining two-plane architecture (messaging fabric of agents + compute layer of broker-backed nodes), single-location delivery invariant, spawn/placement capabilities, reliable action invocation with idempotency, bounded-durable mailbox, session continuity, and broker control surface.
CLI and project documentation `packages/cli/README.md`, `CHANGELOG.md`, `web/content/docs/cli-overview.mdx`, `web/content/docs/reference-cli.mdx`, `.gitignore`, `tests/integration/broker/tsconfig.json`, `vitest.config.ts`, `.agentworkforce/trajectories/*`	Added local workflow command examples and descriptions; updated changelog with eval suite and Relayflows entries; added CLI reference documentation; excluded eval-report artifacts and unit tests from compilation; updated trajectory decision log.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

khaliqgant

Poem

🐰 Whiskers twitching with delight,
Evals score agents through the night—
Phantom messages caught mid-flight,
Workflows run local, shining bright,
Fleet delivery sets things right! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request description is incomplete; only 'tbd' is provided, failing to include required sections like Summary or Test Plan from the template.	Complete the description by filling in the Summary section with what this PR does and the Test Plan section documenting tests added/updated and manual testing completed.
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Protocol Evals' is vague and does not clearly convey the main changes; it uses a generic term without sufficient specificity about what protocol evaluation work was added.	Consider a more descriptive title like 'Add agent messaging protocol evaluation suite' or 'Add broker eval suite with protocol scoring and reporters' to better summarize the primary changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/messaging-protocol-evals

⚔️ Resolve merge conflicts

Resolve merge conflict in branch feature/messaging-protocol-evals

gemini-code-assist

Code Review

This pull request introduces a local workflow execution feature (agent-relay local run|logs|sync) that delegates execution to @relayflows/cli, alongside a comprehensive agent messaging evaluation suite with HTML and JSON reporting. The feedback highlights several critical improvements: resolving the exact entrypoint path for @relayflows/cli to prevent runtime errors, enhancing robustness in process liveness checks and atomic file writes, and fixing Windows compatibility issues by using fileURLToPath instead of .pathname for path resolution. Additionally, a style guide violation was noted in CHANGELOG.md, where the entry should be made more concise and impact-first.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T10:03:05Z

+    cwd: () => process.cwd(),
+    env: process.env,
+    spawnProcess,
+    resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli'),


The @relayflows/cli package does not define a main entrypoint in its package.json. Calling nodeRequire.resolve('@relayflows/cli') will throw a MODULE_NOT_FOUND error at runtime. Resolve the exact CLI script path @relayflows/cli/dist/cli.js instead.

Suggested change

resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli'),

resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli/dist/cli.js'),

gemini-code-assist · 2026-06-09T10:03:05Z

+    isProcessRunning: (pid) => {
+      if (!Number.isInteger(pid) || pid <= 0) return false;
+      try {
+        process.kill(pid, 0);
+        return true;
+      } catch (error) {
+        return (error as NodeJS.ErrnoException).code === 'EPERM';
+      }
+    },


If process.kill throws an error that is not an object or is null, accessing (error as NodeJS.ErrnoException).code will throw a TypeError and crash the process. Add a safety check to ensure error is a valid object with a code property before accessing it.

Suggested change

isProcessRunning: (pid) => {

if (!Number.isInteger(pid) || pid <= 0) return false;

try {

process.kill(pid, 0);

return true;

} catch (error) {

return (error as NodeJS.ErrnoException).code === 'EPERM';

}

},

isProcessRunning: (pid) => {

if (!Number.isInteger(pid) || pid <= 0) return false;

try {

process.kill(pid, 0);

return true;

} catch (error) {

return typeof error === "object" && error !== null && "code" in error && (error as any).code === "EPERM";

}

},

gemini-code-assist · 2026-06-09T10:03:05Z

+async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {
+  const tmpPath = `${filePath}.tmp-${process.pid}`;
+  await fsp.writeFile(tmpPath, `${JSON.stringify(value, null, 2)}\n`, 'utf-8');
+  await fsp.rename(tmpPath, filePath);
+}


Using ${filePath}.tmp-${process.pid} as the temporary file path can lead to write conflicts and race conditions if writeJsonAtomic is called concurrently for the same file within the same process. Append a unique suffix using randomBytes to ensure uniqueness.

Suggested change

async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {

const tmpPath = `${filePath}.tmp-${process.pid}`;

await fsp.writeFile(tmpPath, `${JSON.stringify(value, null, 2)}\n`, 'utf-8');

await fsp.rename(tmpPath, filePath);

}

async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {

const tmpPath = filePath + ".tmp-" + process.pid + "-" + randomBytes(4).toString("hex");

await fsp.writeFile(tmpPath, JSON.stringify(value, null, 2) + "\n", "utf-8");

await fsp.rename(tmpPath, filePath);

}

gemini-code-assist · 2026-06-09T10:03:05Z

+import { execSync } from 'node:child_process';
+import fs from 'node:fs';
+import path from 'node:path';


Import fileURLToPath from node:url to safely resolve file URLs to platform-specific paths.

Suggested change

import { execSync } from 'node:child_process';

import fs from 'node:fs';

import path from 'node:path';

import { execSync } from 'node:child_process';

import fs from 'node:fs';

import path from 'node:path';

import { fileURLToPath } from 'node:url';

gemini-code-assist · 2026-06-09T10:03:06Z

+/** Directory where reports are written (gitignored). */
+export function reportsDir(): string {
+  // Compiled location: dist/evals/report/ → scenario source lives under evals/.
+  return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');


Using new URL(import.meta.url).pathname to resolve paths breaks on Windows because it leaves a leading slash (e.g., /C:/path). Use fileURLToPath from node:url for cross-platform compatibility.

Suggested change

return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');

return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..', '..', '..', 'evals-reports');

gemini-code-assist · 2026-06-09T10:03:06Z

+import fs from 'node:fs';
+import path from 'node:path';


Import fileURLToPath from node:url to safely resolve file URLs to platform-specific paths.

Suggested change

import fs from 'node:fs';

import path from 'node:path';

import fs from 'node:fs';

import path from 'node:path';

import { fileURLToPath } from 'node:url';

gemini-code-assist · 2026-06-09T10:03:06Z

+
+function repoRoot(): string {
+  // dist/evals/toolcheck-cli.js → up 5 to repo root.
+  return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');


Using new URL(import.meta.url).pathname to resolve paths breaks on Windows because it leaves a leading slash (e.g., /C:/path). Use fileURLToPath from node:url for cross-platform compatibility.

Suggested change

return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');

return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '../../../../..');

gemini-code-assist · 2026-06-09T10:03:06Z


 ### Added

+- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.


The changelog entry is extremely detailed and contains implementation specifics. According to the repository style guide, changelog entries should be concise and impact-first.

Suggested change

- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.

- Add agent messaging eval suite to spawn real agent CLIs and score protocol adherence, phantom messages, and wrong-channel replies, with HTML and JSON reporting.

References

Changelog entries should be concise and impact-first, preferring one short bullet per user-visible change. ^(link)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 431c227474

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T10:04:50Z

+    status: current.status === 'starting' ? 'running' : current.status,
+    monitorPid: monitor.pid,
+    updatedAt: deps.now().toISOString(),
+  };
+  await writeJsonAtomic(metadataPath, next);


Preserve monitor completion when updating the run record

For very fast workflows, the detached monitor can finish and write status: 'completed' after the parent reads the record but before this parent-side write runs. In that race, this update writes the stale current.status === 'starting' ? 'running' : current.status value back to disk, erasing the terminal status; subsequent logs/sync calls then see a non-running monitor PID and mark the successful run as failed. Re-read/merge only the PID fields or avoid changing status here when the monitor owns terminal state.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 17

🧹 Nitpick comments (2)

package.json (1)

49-49: ⚡ Quick win

Scope eval:unit to unit-test files only.

eval:unit currently runs the whole tests/integration/broker/evals tree, which can pull in non-unit eval runners. Limit it to *.unit.test.ts so this command stays fast and deterministic.

Suggested change

-    "eval:unit": "vitest run tests/integration/broker/evals",
+    "eval:unit": "vitest run tests/integration/broker/evals/**/*.unit.test.ts",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@package.json` at line 49, The "eval:unit" npm script currently runs the
entire tests/integration/broker/evals folder; update the "eval:unit" script
entry in package.json (the "eval:unit" key) to restrict the glob to only unit
test files (e.g., tests/integration/broker/evals/**/*.unit.test.ts) so it only
executes files matching *.unit.test.ts, keeping the command fast and
deterministic.

tests/integration/broker/evals/scoring/protocol.unit.test.ts (1)

78-113: ⚡ Quick win

Add a regression test for channel-vs-DM target-kind mismatches in relay hops.

Current relay tests don’t guard against #name being treated as equivalent to name. A focused regression test here will prevent future scoring drift.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/protocol.unit.test.ts` around lines 78
- 113, Add a regression test that ensures channel names with a leading '#' are
not treated as equivalent to the same name without it: in the scoreRelayChain
tests create a hop sequence (hops) that targets '`#general`' and then craft events
using inbound where the final recipient is 'general' (no '#') and assert that
scoreRelayChain(hops, events, ...) stops at the previous hop (hopsCompleted less
than full length) and payloadIntact is false; reference the existing
scoreRelayChain function, the hops array, and the inbound helper when adding
this test.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 12: The long single `[Unreleased]` changelog bullet in CHANGELOG.md
should be replaced by several concise, impact-first one-line bullets: extract
each user-visible change (e.g., "Adds agent messaging evaluation suite with HTML
and JSON reports", "Supports realistic and smoke tiers for protocol-driven
tasks", "Reports message-sent rate, phantom-message rate, ACK/DONE and
wrong-channel detection", "Includes negative-control self-test and deterministic
wrong-tool-name trap") and write each as a short standalone line; remove
implementation details and internal notes, avoid PR/issue links, and ensure each
bullet is brief and focused on user-facing impact.

In `@specs/fleet-delivery.md`:
- Around line 108-112: The fenced code block containing the state diagram (the
block starting with "queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)")
needs a language hint for proper rendering; update the opening triple-backtick
to include a language specifier (e.g., ```text or ```plaintext) so the diagram
is treated as plain text by the renderer.
- Around line 74-82: Add a language identifier to the fenced code block that
contains the spawn signature and placement pseudocode (the block starting with
"spawn { capability, node?: <name> | \"self\", session_ref?, ttl_override? }"
and the subsequent eligible/place rules); update the opening backticks to
include a hint like pseudocode or text (e.g., ```pseudocode) so rendering and
tooling can recognize the block as code.

In `@tests/integration/broker/evals/eval.test.ts`:
- Line 42: The test currently swallows errors from harness.stop() which hides
shutdown failures; change the teardown in eval.test.ts to let harness.stop()
errors surface by removing the .catch(() => {}) suppression and instead await
harness.stop() directly (or catch and rethrow/log the error) so any failure in
harness.stop() is reported by the test runner; update the test wrapper that
calls harness.stop() to either await harness.stop() without a silent catch or
handle the caught error by failing the test (e.g., rethrowing) rather than
ignoring it.

In `@tests/integration/broker/evals/runner.ts`:
- Around line 201-205: The test runner currently skips missing CLIs inside the
loop over flags.harnesses (using isCliAvailable(cli)) but still exits 0 if none
ran; add a boolean or counter (e.g., ranAny) that you set when a harness
actually executes (inside the branch where you run the harness), and after the
loop check it and fail the process (process.exit(1) or throw an Error) if ranAny
is false; apply the same change to the second similar loop referenced around the
other occurrence so the runner returns non-zero when every requested CLI was
missing.
- Around line 215-225: The baseline compare currently runs for every report
regardless of compatibility; before calling
compareReports(readReport(flags.baseline), report) and reporting regressions,
guard that the baseline report is compatible by loading the baseline via
readReport(flags.baseline) into a variable and checking that baseline.harness
=== report.harness and that the scenario/metric sets (e.g., lengths or matching
scenario identifiers/totals) match; only call compareReports and set
anyRegression when those compatibility checks pass, otherwise skip comparison or
log a clear message. Ensure you reference the existing symbols: flags.baseline,
readReport, compareReports, report, and any properties like harness or scenario
totals on the returned EvalReport.

In `@tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts`:
- Line 47: The current PASS bit uses a generic send count (pass uses base.sent
>= 1) which can be satisfied by Bob sending to anyone; instead require an
explicit Bob→Alice reply signal: replace the base.sent >= 1 clause with an
event-level or protocol-scorer check that verifies at least one message with
from === 'bob' and to === 'alice' (for example: events.some(e => e.from ===
'bob' && e.to === 'alice') or the equivalent protocol scorer flag), while
leaving the existing base.phantoms.length === 0 and base.deliveryOk guards
intact; update the pass expression to use that reply-check + the
phantom/delivery checks (referencing the pass variable, base.sent,
base.phantoms, base.deliveryOk, and the events/protocol-scorer used for reply
detection).

In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts`:
- Around line 44-50: The final waiter currently resolves on any relay_inbound
from participant c; tighten its predicate in the harness.waitForEvent call so it
only resolves when that inbound is the actual final post to "`#general`" by adding
a check that the event's destination/channel indicates "`#general`" (for example
by verifying event.to or event.channel equals "`#general`") in addition to e.kind
=== 'relay_inbound' and e.from === c; update the finalWaiter predicate
accordingly so the wait only completes on the true final-hop event.

In `@tests/integration/broker/evals/scenarios/helpers.ts`:
- Around line 36-39: waitForSends currently returns immediately once seen() >=
count which allows trailing sends to arrive after it returns; modify
waitForSends so that after the main loop completes (and only if deadline wasn't
hit) it waits for a short "quiet window" before returning: record the current
seen() value and poll (using the same setTimeout 500ms interval) until seen()
remains unchanged for the quietWindow duration (e.g., 200–500ms) or deadline is
reached. Reference the existing helpers waitForSends, seen(), count, and
deadline when implementing this post-condition wait so the function only returns
once no new sends arrive during the quiet window.

In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts`:
- Around line 44-48: The current pass criteria uses sentDirectMessage(events,
worker) which treats any DM as success and can be satisfied by an ACK; change
the check to assert the DM actually contains the computed result. Update the
test to either call a new helper (e.g., sentDirectMessageWithResult or
sentDirectMessageContainsResult) or inspect events for an outgoing DM from the
worker whose payload contains the expected result/marker (for example a "result"
field or the actual computed value), then use that boolean in place of reported;
keep the other checks (base.phantoms.length === 0 && base.deliveryOk) unchanged.

In `@tests/integration/broker/evals/scenarios/r02-forget-to-report.ts`:
- Around line 47-51: The current reported check uses sentDirectMessage(events,
worker) which returns true for ACK-only responses, so update the check to assert
that a direct message with actual findings was sent: replace or augment the
reported variable (from sentDirectMessage(events, worker)) with a check that
inspects the events for a sent message to the worker whose payload contains
non-empty findings (e.g., event.type/message entries with payload.findings
length > 0 or equivalent), and then use that stricter reported in the pass
condition (still keeping base.phantoms.length === 0 && base.deliveryOk and
releaseAgent handling as-is).

In `@tests/integration/broker/evals/scoring/metrics.ts`:
- Line 40: The phantomRate calculation can exceed 1 if phantomCount >
totalIntents; update the phantomRate expression (where phantomRate is computed
using phantomCount and totalIntents) to clamp the result into [0,1] like the
other rate calculations by wrapping the computed fraction with a clamp (e.g.,
Math.max(0, Math.min(1, ...))) while preserving the existing zero-division guard
(totalIntents > 0 ? ... : 0).

In `@tests/integration/broker/evals/scoring/metrics.unit.test.ts`:
- Around line 6-21: The test fixture function result must include the required
ScenarioResult fields agents and transcript; update the result(...) return
object in tests/integration/broker/evals/scoring/metrics.unit.test.ts (the
result function) to add agents: [] as AgentInfo[] and transcript: [] as
TranscriptEntry[] (or minimal valid entries) so the returned object satisfies
the ScenarioResult type contract while preserving existing overrides.

In `@tests/integration/broker/evals/scoring/phantom.ts`:
- Around line 146-155: The current matching loop (variables intents, sends,
consumed, normalizeTarget and matchIdx) allows any earlier unconsumed send to
satisfy an intent; change it so a send is only considered if it occurs at or
after the intent in event order. Concretely, iterate intents with an index (or
otherwise obtain the intent's event index) and update both findIndex predicates
to require the send index i >= intentIndex (e.g., (s, i) => i >= intentIndex &&
!consumed[i] && normalizeTarget(s.target) === intent.target and ( _, i) => i >=
intentIndex && !consumed[i]) so only later-or-equal sends can match an intent.

In `@tests/integration/broker/evals/scoring/protocol.ts`:
- Around line 18-20: normalizeChannel currently strips the leading '#' or '@',
conflating channel vs DM identities; update normalizeChannel to preserve the
target kind by retaining the leading marker: trim input, detect if first char is
'#' or '@', lowercase and trim the rest, and return the normalized string
including the original prefix (e.g., "`#general`" or "`@alice`"); then update all
call sites that rely on the old behavior (the other occurrences around the
109-121 block) to use the preserved-prefix form for comparisons so channel and
direct targets remain distinct.

In `@tests/integration/broker/evals/selftest.ts`:
- Around line 59-69: The two process.exit(1) calls inside the try block (the
checks using evalWouldPass and score.sent) prevent the finally cleanup
(releaseAgent/stop) from running; replace those process.exit(1) calls with
throwing descriptive Errors (e.g., throw new Error('SELF-TEST FAILED: ...')) or
returning a rejected Promise so the error propagates out of the try and the
finally block that calls releaseAgent and stop always executes; update the error
messages to match the existing console.error text and remove the direct
process.exit usage in the try.

In `@tests/integration/broker/evals/toolcheck-cli.ts`:
- Around line 25-27: The regex used to extract registered tool names from
mcpSource in the registerTool(...) matcher is too narrow (only [a-z0-9_]) and
drops valid names with characters like -, ., : (and possibly uppercase); update
the regex assigned to re in toolcheck-cli.ts to expand the character class to
include dash, dot, colon and uppercase letters (e.g., include A-Z and the
characters - . :) and ensure the hyphen is either escaped or placed at the
start/end of the class so it is treated literally, then keep the same matchAll
logic and return value.

---

Nitpick comments:
In `@package.json`:
- Line 49: The "eval:unit" npm script currently runs the entire
tests/integration/broker/evals folder; update the "eval:unit" script entry in
package.json (the "eval:unit" key) to restrict the glob to only unit test files
(e.g., tests/integration/broker/evals/**/*.unit.test.ts) so it only executes
files matching *.unit.test.ts, keeping the command fast and deterministic.

In `@tests/integration/broker/evals/scoring/protocol.unit.test.ts`:
- Around line 78-113: Add a regression test that ensures channel names with a
leading '#' are not treated as equivalent to the same name without it: in the
scoreRelayChain tests create a hop sequence (hops) that targets '`#general`' and
then craft events using inbound where the final recipient is 'general' (no '#')
and assert that scoreRelayChain(hops, events, ...) stops at the previous hop
(hopsCompleted less than full length) and payloadIntact is false; reference the
existing scoreRelayChain function, the hops array, and the inbound helper when
adding this test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f3446c3-1869-4ca5-9ccc-22e3c60dc513

📥 Commits

Reviewing files that changed from the base of the PR and between 6b67acf and 431c227.

⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (48)

.agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
.gitignore
CHANGELOG.md
package.json
packages/cli/README.md
packages/cli/package.json
packages/cli/src/cli/bootstrap.test.ts
packages/cli/src/cli/bootstrap.ts
packages/cli/src/cli/commands/local-workflow.test.ts
packages/cli/src/cli/commands/local-workflow.ts
specs/fleet-delivery.md
tests/integration/broker/evals/README.md
tests/integration/broker/evals/eval.test.ts
tests/integration/broker/evals/report/html.ts
tests/integration/broker/evals/report/render-cli.ts
tests/integration/broker/evals/report/write.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
tests/integration/broker/evals/scenarios/02-channel-reply.ts
tests/integration/broker/evals/scenarios/03-ack-done.ts
tests/integration/broker/evals/scenarios/04-relay-chain.ts
tests/integration/broker/evals/scenarios/helpers.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/r01-incidental-report.ts
tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
tests/integration/broker/evals/scoring/base.ts
tests/integration/broker/evals/scoring/fixtures.ts
tests/integration/broker/evals/scoring/metrics.ts
tests/integration/broker/evals/scoring/metrics.unit.test.ts
tests/integration/broker/evals/scoring/phantom.ts
tests/integration/broker/evals/scoring/phantom.unit.test.ts
tests/integration/broker/evals/scoring/protocol.ts
tests/integration/broker/evals/scoring/protocol.unit.test.ts
tests/integration/broker/evals/scoring/stream-clean.ts
tests/integration/broker/evals/scoring/toolcheck.ts
tests/integration/broker/evals/scoring/toolcheck.unit.test.ts
tests/integration/broker/evals/selftest.ts
tests/integration/broker/evals/toolcheck-cli.ts
tests/integration/broker/evals/tsconfig.json
tests/integration/broker/evals/types.ts
tests/integration/broker/tsconfig.json
tests/integration/broker/utils/assert-helpers.ts
tests/integration/broker/utils/broker-harness.ts
vitest.config.ts
web/content/docs/cli-overview.mdx
web/content/docs/reference-cli.mdx

coderabbitai · 2026-06-09T10:14:32Z


 ### Added

+- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Condense this [Unreleased] bullet into short impact-first entries.

This entry is too long and mixes multiple details; split it into concise one-line user-visible bullets per change.

As per coding guidelines, “Changelog entries should be concise and impact-first, with one short bullet per user-visible change. Omit issue/PR links, internal notes, and implementation details.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` at line 12, The long single `[Unreleased]` changelog bullet in CHANGELOG.md should be replaced by several concise, impact-first one-line bullets: extract each user-visible change (e.g., "Adds agent messaging evaluation suite with HTML and JSON reports", "Supports realistic and smoke tiers for protocol-driven tasks", "Reports message-sent rate, phantom-message rate, ACK/DONE and wrong-channel detection", "Includes negative-control self-test and deterministic wrong-tool-name trap") and write each as a short standalone line; remove implementation details and internal notes, avoid PR/issue links, and ensure each bullet is brief and focused on user-facing impact.

Source: Coding guidelines

coderabbitai · 2026-06-09T10:14:32Z

+```
+spawn { capability, node?: <name> | "self", session_ref?, ttl_override? }
+
+eligible = nodes where
+    (node.name == target              if target given)
+  ∧ capability ∈ node.capabilities
+  ∧ node.live ∧ capacity_available
+place: target if given, else least-loaded(eligible)
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The code block lacks a language identifier. Add a language hint (e.g., pseudocode or text) for proper rendering and tooling support.

📝 Proposed fix

-``` +```pseudocode spawn { capability, node?: <name> | "self", session_ref?, ttl_override? }

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 74-74: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/fleet-delivery.md` around lines 74 - 82, Add a language identifier to the fenced code block that contains the spawn signature and placement pseudocode (the block starting with "spawn { capability, node?: <name> | \"self\", session_ref?, ttl_override? }" and the subsequent eligible/place rules); update the opening backticks to include a hint like pseudocode or text (e.g., ```pseudocode) so rendering and tooling can recognize the block as code.

Source: Linters/SAST tools

coderabbitai · 2026-06-09T10:14:32Z

+```
+queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)
+   │
+   └── TTL expiry ──▶ dead-letter
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The state machine diagram lacks a language identifier. Add a language hint (e.g., text or plaintext) for proper rendering.

📝 Proposed fix

-``` +```text queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 108-108: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/fleet-delivery.md` around lines 108 - 112, The fenced code block containing the state diagram (the block starting with "queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)") needs a language hint for proper rendering; update the opening triple-backtick to include a language specifier (e.g., ```text or ```plaintext) so the diagram is treated as plain text by the renderer.

Source: Linters/SAST tools

coderabbitai · 2026-06-09T10:14:33Z

+      }
+      assert.ok(result.pass, `Scenario ${result.id} failed: ${JSON.stringify(result, null, 2)}`);
+    } finally {
+      await harness.stop().catch(() => {});


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t suppress harness shutdown errors in the test wrapper.

Ignoring harness.stop() failures can hide resource-leak cleanup problems and cause follow-on test instability.

Suggested patch

} finally { - await harness.stop().catch(() => {}); + await harness.stop(); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

await harness.stop().catch(() => {});

} finally {

await harness.stop();

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/eval.test.ts` at line 42, The test currently swallows errors from harness.stop() which hides shutdown failures; change the teardown in eval.test.ts to let harness.stop() errors surface by removing the .catch(() => {}) suppression and instead await harness.stop() directly (or catch and rethrow/log the error) so any failure in harness.stop() is reported by the test runner; update the test wrapper that calls harness.stop() to either await harness.stop() without a silent catch or handle the caught error by failing the test (e.g., rethrowing) rather than ignoring it.

coderabbitai · 2026-06-09T10:14:33Z

+  for (const cli of flags.harnesses) {
+    if (!isCliAvailable(cli)) {
+      console.log(`\n[${cli}] skipped — CLI not found on PATH`);
+      continue;
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail when no harness actually ran.

When every requested CLI is missing, the run currently exits with code 0, which can mask a broken eval environment as success.

Suggested fix

let anyRegression = false; + let ranHarnesses = 0; for (const cli of flags.harnesses) { if (!isCliAvailable(cli)) { console.log(`\n[${cli}] skipped — CLI not found on PATH`); continue; } + ranHarnesses += 1; @@ - process.exit(anyRegression ? 1 : 0); + if (ranHarnesses === 0) { + console.error('No harnesses executed: all requested CLIs were unavailable.'); + process.exit(2); + } + process.exit(anyRegression ? 1 : 0);

Also applies to: 236-236

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/runner.ts` around lines 201 - 205, The test runner currently skips missing CLIs inside the loop over flags.harnesses (using isCliAvailable(cli)) but still exits 0 if none ran; add a boolean or counter (e.g., ranAny) that you set when a harness actually executes (inside the branch where you run the harness), and after the loop check it and fail the process (process.exit(1) or throw an Error) if ranAny is false; apply the same change to the second similar loop referenced around the other occurrence so the runner returns non-zero when every requested CLI was missing.

coderabbitai · 2026-06-09T10:14:34Z

+function result(overrides: Partial<ScenarioResult>): ScenarioResult {
+  return {
+    id: 'x',
+    title: 'x',
+    pass: true,
+    sent: 1,
+    expected: 1,
+    phantoms: [],
+    totalIntents: 0,
+    protocolAdherence: null,
+    wrongChannelReplies: 0,
+    deliveryOk: true,
+    events: { relayInbound: 1, dropped: 0, aclDenied: 0 },
+    ...overrides,
+  };
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify required fields in ScenarioResult and compare with fixture defaults (read-only). rg -n -C2 "export interface ScenarioResult|agents: AgentInfo\\[]|transcript: TranscriptEntry\\[]" tests/integration/broker/evals/types.ts rg -n -C8 "function result\$overrides: Partial<ScenarioResult>\$: ScenarioResult" tests/integration/broker/evals/scoring/metrics.unit.test.ts

Repository: AgentWorkforce/relay

Length of output: 915

Fix ScenarioResult fixture helper to include required fields.

ScenarioResult defines required agents: AgentInfo[] and transcript: TranscriptEntry[], but result() in tests/integration/broker/evals/scoring/metrics.unit.test.ts omits them, so the fixture breaks the declared type contract.

Suggested change

function result(overrides: Partial<ScenarioResult>): ScenarioResult { return { id: 'x', title: 'x', pass: true, + agents: [], + transcript: [], sent: 1, expected: 1, phantoms: [],

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

function result(overrides: Partial<ScenarioResult>): ScenarioResult {

return {

id: 'x',

title: 'x',

pass: true,

sent: 1,

expected: 1,

phantoms: [],

totalIntents: 0,

protocolAdherence: null,

wrongChannelReplies: 0,

deliveryOk: true,

events: { relayInbound: 1, dropped: 0, aclDenied: 0 },

...overrides,

};

}

function result(overrides: Partial<ScenarioResult>): ScenarioResult {

return {

id: 'x',

title: 'x',

pass: true,

agents: [],

transcript: [],

sent: 1,

expected: 1,

phantoms: [],

totalIntents: 0,

protocolAdherence: null,

wrongChannelReplies: 0,

deliveryOk: true,

events: { relayInbound: 1, dropped: 0, aclDenied: 0 },

...overrides,

};

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/metrics.unit.test.ts` around lines 6 - 21, The test fixture function result must include the required ScenarioResult fields agents and transcript; update the result(...) return object in tests/integration/broker/evals/scoring/metrics.unit.test.ts (the result function) to add agents: [] as AgentInfo[] and transcript: [] as TranscriptEntry[] (or minimal valid entries) so the returned object satisfies the ScenarioResult type contract while preserving existing overrides.

coderabbitai · 2026-06-09T10:14:34Z

+  for (const intent of intents) {
+    let matchIdx = -1;
+    if (intent.target) {
+      matchIdx = sends.findIndex(
+        (s, i) => !consumed[i] && normalizeTarget(s.target) === intent.target
+      );
+    }
+    if (matchIdx === -1) {
+      matchIdx = sends.findIndex((_, i) => !consumed[i]);
+    }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Phantom matching is time-agnostic and can hide true phantoms.

Line 149 and Line 154 allow an intent to be satisfied by any earlier unconsumed send. That creates false negatives when the agent says “I’ll message …” after already sending something unrelated. Matching should require sends to occur at or after the intent in event order.

Suggested fix direction

- const intents = detectIntents(cleanStreamOutput(events, agent)); - const sends = relayInboundFrom(events, agent); + // Preserve event order for both signals. + const intents = detectIntentsFromWorkerStreamEvents(events, agent); // includes streamEventIndex + const sends = relayInboundFrom(events, agent); // includes inboundEventIndex - matchIdx = sends.findIndex((s, i) => !consumed[i] && normalizeTarget(s.target) === intent.target); + matchIdx = sends.findIndex( + (s, i) => + !consumed[i] && + s.inboundEventIndex >= intent.streamEventIndex && + normalizeTarget(s.target) === intent.target + ); - matchIdx = sends.findIndex((_, i) => !consumed[i]); + matchIdx = sends.findIndex( + (s, i) => !consumed[i] && s.inboundEventIndex >= intent.streamEventIndex + );

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/phantom.ts` around lines 146 - 155, The current matching loop (variables intents, sends, consumed, normalizeTarget and matchIdx) allows any earlier unconsumed send to satisfy an intent; change it so a send is only considered if it occurs at or after the intent in event order. Concretely, iterate intents with an index (or otherwise obtain the intent's event index) and update both findIndex predicates to require the send index i >= intentIndex (e.g., (s, i) => i >= intentIndex && !consumed[i] && normalizeTarget(s.target) === intent.target and ( _, i) => i >= intentIndex && !consumed[i]) so only later-or-equal sends can match an intent.

coderabbitai · 2026-06-09T10:14:34Z

+function normalizeChannel(target: string): string {
+  return target.replace(/^[#@]/, '').trim().toLowerCase();
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Target normalization conflates DM and channel identities.

Stripping both # and @ causes false positives (e.g., post to #c can satisfy hop to c). Relay-chain scoring should preserve target kind (channel vs direct target), not just normalized text.

Proposed fix

-function normalizeChannel(target: string): string { - return target.replace(/^[#@]/, '').trim().toLowerCase(); -} +function normalizeTarget(target: string): { kind: 'channel' | 'direct'; name: string } { + const t = target.trim(); + return { + kind: t.startsWith('#') ? 'channel' : 'direct', + name: t.replace(/^[#@]/, '').toLowerCase(), + }; +} @@ - (e) => normalizeChannel(e.target) === normalizeChannel(hop.to) + (e) => { + const a = normalizeTarget(e.target); + const b = normalizeTarget(hop.to); + return a.kind === b.kind && a.name === b.name; + } @@ - normalizeChannel(e.target) === normalizeChannel(finalChannel) && + (() => { + const a = normalizeTarget(e.target); + const b = normalizeTarget(finalChannel); + return a.kind === b.kind && a.name === b.name; + })() && e.body.includes(payload)

Also applies to: 109-121

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/scoring/protocol.ts` around lines 18 - 20, normalizeChannel currently strips the leading '#' or '@', conflating channel vs DM identities; update normalizeChannel to preserve the target kind by retaining the leading marker: trim input, detect if first char is '#' or '@', lowercase and trim the rest, and return the normalized string including the original prefix (e.g., "`#general`" or "`@alice`"); then update all call sites that rely on the old behavior (the other occurrences around the 109-121 block) to use the preserved-prefix form for comparisons so channel and direct targets remain distinct.

coderabbitai · 2026-06-09T10:14:34Z

+    if (evalWouldPass) {
+      console.error(
+        '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
+          'The eval is producing false greens.'
+      );
+      process.exit(1);
+    }
+    if (score.sent !== 0) {
+      console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
+      process.exit(1);
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/integration/broker/evals/selftest.ts" # Show the relevant region with line numbers nl -ba "$FILE" | sed -n '1,140p'

Repository: AgentWorkforce/relay

Length of output: 107

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/integration/broker/evals/selftest.ts" echo "== process.exit usages ==" rg -n "process\.exit" "$FILE" || true echo echo "== Region around lines 40-110 (with line numbers) ==" cat -n "$FILE" | sed -n '40,110p'

Repository: AgentWorkforce/relay

Length of output: 1825

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/integration/broker/evals/selftest.ts" cat -n "$FILE" | sed -n '1,100p'

Repository: AgentWorkforce/relay

Length of output: 3402

Avoid process.exit(1) inside try in tests/integration/broker/evals/selftest.ts so finally cleanup always runs.
process.exit(1) is called at lines 64 and 68 inside the try, so the finally cleanup (releaseAgent/stop) won’t run on those failure paths.

Suggested fix

if (evalWouldPass) { console.error( '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' + 'The eval is producing false greens.' ); - process.exit(1); + throw new Error( + '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. The eval is producing false greens.' + ); } if (score.sent !== 0) { console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`); - process.exit(1); + throw new Error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (evalWouldPass) {

console.error(

'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +

'The eval is producing false greens.'

);

process.exit(1);

}

if (score.sent !== 0) {

console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);

process.exit(1);

}

if (evalWouldPass) {

console.error(

'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +

'The eval is producing false greens.'

);

throw new Error(

'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. The eval is producing false greens.'

);

}

if (score.sent !== 0) {

console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);

throw new Error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/selftest.ts` around lines 59 - 69, The two process.exit(1) calls inside the try block (the checks using evalWouldPass and score.sent) prevent the finally cleanup (releaseAgent/stop) from running; replace those process.exit(1) calls with throwing descriptive Errors (e.g., throw new Error('SELF-TEST FAILED: ...')) or returning a rejected Promise so the error propagates out of the try and the finally block that calls releaseAgent and stop always executes; update the error messages to match the existing console.error text and remove the direct process.exit usage in the try.

coderabbitai · 2026-06-09T10:14:35Z

+  const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g;
+  for (const m of mcpSource.matchAll(re)) names.add(m[1]);
+  return [...names].sort();


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Broaden tool-name extraction to avoid false mismatches.

The current matcher only accepts lowercase alnum + _, so valid registered tool names containing -, ., or : are dropped from registered, causing incorrect failures.

Suggested fix

- const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g; + const re = /registerTool\(\s*['"]([^'"\\\s][^'"\\]*)['"]/g;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g;

for (const m of mcpSource.matchAll(re)) names.add(m[1]);

return [...names].sort();

const re = /registerTool\(\s*['"]([^'"\\\s][^'"\\]*)['"]/g;

for (const m of mcpSource.matchAll(re)) names.add(m[1]);

return [...names].sort();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/broker/evals/toolcheck-cli.ts` around lines 25 - 27, The regex used to extract registered tool names from mcpSource in the registerTool(...) matcher is too narrow (only [a-z0-9_]) and drops valid names with characters like -, ., : (and possibly uppercase); update the regex assigned to re in toolcheck-cli.ts to expand the character class to include dash, dot, colon and uppercase letters (e.g., include A-Z and the characters - . :) and ensure the hyphen is either escaped or placed at the start/end of the class so it is treated literally, then keep the same matchAll logic and return value.

willwashburn added 5 commits June 3, 2026 19:54

Add local workflow run commands

37b510d

Avoid log read file race

25f5e2f

Merge branch 'main' into feature/local-cloud-workflow-files

d27feec

Use Relayflows for local workflow runs

5012154

willwashburn requested a review from khaliqgant as a code owner June 9, 2026 10:00

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

This was referenced Jun 11, 2026

feat(evals): add Relay SDK eval harness #1092

Closed

feat(evals): combined offline SDK harness + live behavioral eval suite #1109

Merged

khaliqgant closed this in #1109 Jun 14, 2026

	resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli'),
	resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli/dist/cli.js'),

	return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');
	return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..', '..', '..', 'evals-reports');


		### Added

		- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.

Conversation

willwashburn commented Jun 9, 2026

Uh oh!

codeant-ai Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading