Skip to content

Protocol Evals#1066

Closed
willwashburn wants to merge 5 commits into
mainfrom
feature/messaging-protocol-evals
Closed

Protocol Evals#1066
willwashburn wants to merge 5 commits into
mainfrom
feature/messaging-protocol-evals

Conversation

@willwashburn

Copy link
Copy Markdown
Member

tbd

Introduce a new integration eval suite that exercises agent-to-agent messaging via the broker and scores protocol adherence (message-sent rate, phantom messages, ACK/DONE protocol, wrong-channel replies). Adds a full eval runner, scenarios, deterministic scoring, reporters (JSON + self-contained HTML viewer), a matrix roll-up, unit tests for scoring, and CLI helpers under tests/integration/broker/evals. Adds npm scripts (eval:build, eval:unit, eval:selftest, eval:toolcheck, eval:html, eval, eval:claude, eval:matrix) and gitignore entry for evals-reports. Also adds a Fleet Delivery design doc (specs/fleet-delivery.md), updates CHANGELOG.md, and adjusts integration test config/files (tsconfig, vitest, and broker harness utilities) to align the broker-harness with the current SDK/harness-driver APIs so the evals build/run cleanly.
@willwashburn willwashburn requested a review from khaliqgant as a code owner June 9, 2026 10:00
@codeant-ai

codeant-ai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive agent messaging evaluation suite for broker integration testing (6 scenarios with phantom detection, protocol scoring, and reporting), adds local workflow CLI commands (run, logs, sync), refactors the broker test harness to use low-level driver-only APIs, and specifies a fleet delivery architecture for reliable multi-agent messaging.

Changes

Broker Eval Suite

Layer / File(s) Summary
Eval types and build configuration
tests/integration/broker/evals/types.ts, tests/integration/broker/evals/tsconfig.json, package.json, tests/integration/broker/evals/README.md
Central type definitions for scenarios, phantom/metric/report schemas, SCHEMA_VERSION constant, dedicated TypeScript build config, and npm eval scripts for building/running scenarios and reporters.
Eval scenarios and timing
tests/integration/broker/evals/scenarios/*
Six scenarios (DM roundtrip, channel reply, ACK/DONE protocol, 3-agent relay chain, incidental report, proactive handoff, channel-vs-DM) plus shared timing constants and waitForSends polling helper.
Scoring: phantom, protocol, metrics
tests/integration/broker/evals/scoring/base.ts, phantom.ts, protocol.ts, metrics.ts, stream-clean.ts, toolcheck.ts, fixtures.ts
Transcript building, phantom intent detection with ANSI normalization, ACK/DONE and channel-reply scoring, metrics aggregation from scenario results, stream output cleaning, tool reference validation, and event test fixtures.
Report generation
tests/integration/broker/evals/report/html.ts, report/write.ts, report/render-cli.ts
HTML rendering for metric cards, transcripts, scenario sections, and matrix tables; JSON/HTML persistence with git SHA and iso timestamps; baseline regression comparison via compareReports.
Runner and integration
tests/integration/broker/evals/runner.ts, eval.test.ts, selftest.ts, toolcheck-cli.ts
Eval scenario matrix runner with CLI flags, per-harness lifecycle, failure handling, metric aggregation, and optional baseline regression exit codes; negative-control selftest for false-positive detection; toolcheck CLI for validating tool-reference consistency; integration test wrapper.

Local Workflow CLI

Layer / File(s) Summary
Workflow implementation
packages/cli/src/cli/commands/local-workflow.ts
Dependency injection interface for run management, file-type inference, run-ID generation, atomic JSON persistence, monitor script generation that spawns workflows and tracks state, log streaming with follow/poll, and commander wiring for run, logs, sync subcommands.
Testing and bootstrap
packages/cli/src/cli/commands/local-workflow.test.ts, packages/cli/src/cli/bootstrap.ts, packages/cli/src/cli/bootstrap.test.ts
Vitest suite verifying command registration and end-to-end workflow execution, log output capture, and entrypoint delegation; CLI bootstrap registration and updated leaf-command expectations.
Relayflows dependency
packages/cli/package.json
Added @relayflows/cli@^1.0.1 for delegating YAML, TypeScript, and Python workflow execution.

Broker Harness Refactoring

Layer / File(s) Summary
Harness driver-only
tests/integration/broker/utils/broker-harness.ts, tests/integration/broker/utils/assert-helpers.ts
Removed high-level AgentRelay facade usage; now provisions RELAY_API_KEY via RelayCast.createWorkspace and spawns only low-level HarnessDriverClient; updated type imports from @agent-relay/sdk to @agent-relay/harness-driver.

Fleet Delivery Specification and Documentation

Layer / File(s) Summary
Architecture specification
specs/fleet-delivery.md
189-line design document defining two-plane architecture (messaging fabric of agents + compute layer of broker-backed nodes), single-location delivery invariant, spawn/placement capabilities, reliable action invocation with idempotency, bounded-durable mailbox, session continuity, and broker control surface.
CLI and project documentation
packages/cli/README.md, CHANGELOG.md, web/content/docs/cli-overview.mdx, web/content/docs/reference-cli.mdx, .gitignore, tests/integration/broker/tsconfig.json, vitest.config.ts, .agentworkforce/trajectories/*
Added local workflow command examples and descriptions; updated changelog with eval suite and Relayflows entries; added CLI reference documentation; excluded eval-report artifacts and unit tests from compilation; updated trajectory decision log.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • khaliqgant

Poem

🐰 Whiskers twitching with delight,
Evals score agents through the night—
Phantom messages caught mid-flight,
Workflows run local, shining bright,
Fleet delivery sets things right! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is incomplete; only 'tbd' is provided, failing to include required sections like Summary or Test Plan from the template. Complete the description by filling in the Summary section with what this PR does and the Test Plan section documenting tests added/updated and manual testing completed.
Docstring Coverage ⚠️ Warning Docstring coverage is 47.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Protocol Evals' is vague and does not clearly convey the main changes; it uses a generic term without sufficient specificity about what protocol evaluation work was added. Consider a more descriptive title like 'Add agent messaging protocol evaluation suite' or 'Add broker eval suite with protocol scoring and reporters' to better summarize the primary changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/messaging-protocol-evals
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch feature/messaging-protocol-evals

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a local workflow execution feature (agent-relay local run|logs|sync) that delegates execution to @relayflows/cli, alongside a comprehensive agent messaging evaluation suite with HTML and JSON reporting. The feedback highlights several critical improvements: resolving the exact entrypoint path for @relayflows/cli to prevent runtime errors, enhancing robustness in process liveness checks and atomic file writes, and fixing Windows compatibility issues by using fileURLToPath instead of .pathname for path resolution. Additionally, a style guide violation was noted in CHANGELOG.md, where the entry should be made more concise and impact-first.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

cwd: () => process.cwd(),
env: process.env,
spawnProcess,
resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli'),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The @relayflows/cli package does not define a main entrypoint in its package.json. Calling nodeRequire.resolve('@relayflows/cli') will throw a MODULE_NOT_FOUND error at runtime. Resolve the exact CLI script path @relayflows/cli/dist/cli.js instead.

Suggested change
resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli'),
resolveRelayflowsCliEntrypoint: () => nodeRequire.resolve('@relayflows/cli/dist/cli.js'),

Comment on lines +74 to +82
isProcessRunning: (pid) => {
if (!Number.isInteger(pid) || pid <= 0) return false;
try {
process.kill(pid, 0);
return true;
} catch (error) {
return (error as NodeJS.ErrnoException).code === 'EPERM';
}
},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If process.kill throws an error that is not an object or is null, accessing (error as NodeJS.ErrnoException).code will throw a TypeError and crash the process. Add a safety check to ensure error is a valid object with a code property before accessing it.

Suggested change
isProcessRunning: (pid) => {
if (!Number.isInteger(pid) || pid <= 0) return false;
try {
process.kill(pid, 0);
return true;
} catch (error) {
return (error as NodeJS.ErrnoException).code === 'EPERM';
}
},
isProcessRunning: (pid) => {
if (!Number.isInteger(pid) || pid <= 0) return false;
try {
process.kill(pid, 0);
return true;
} catch (error) {
return typeof error === "object" && error !== null && "code" in error && (error as any).code === "EPERM";
}
},

Comment on lines +162 to +166
async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {
const tmpPath = `${filePath}.tmp-${process.pid}`;
await fsp.writeFile(tmpPath, `${JSON.stringify(value, null, 2)}\n`, 'utf-8');
await fsp.rename(tmpPath, filePath);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using ${filePath}.tmp-${process.pid} as the temporary file path can lead to write conflicts and race conditions if writeJsonAtomic is called concurrently for the same file within the same process. Append a unique suffix using randomBytes to ensure uniqueness.

Suggested change
async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {
const tmpPath = `${filePath}.tmp-${process.pid}`;
await fsp.writeFile(tmpPath, `${JSON.stringify(value, null, 2)}\n`, 'utf-8');
await fsp.rename(tmpPath, filePath);
}
async function writeJsonAtomic(filePath: string, value: unknown): Promise<void> {
const tmpPath = filePath + ".tmp-" + process.pid + "-" + randomBytes(4).toString("hex");
await fsp.writeFile(tmpPath, JSON.stringify(value, null, 2) + "\n", "utf-8");
await fsp.rename(tmpPath, filePath);
}

Comment on lines +4 to +6
import { execSync } from 'node:child_process';
import fs from 'node:fs';
import path from 'node:path';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import fileURLToPath from node:url to safely resolve file URLs to platform-specific paths.

Suggested change
import { execSync } from 'node:child_process';
import fs from 'node:fs';
import path from 'node:path';
import { execSync } from 'node:child_process';
import fs from 'node:fs';
import path from 'node:path';
import { fileURLToPath } from 'node:url';

/** Directory where reports are written (gitignored). */
export function reportsDir(): string {
// Compiled location: dist/evals/report/ → scenario source lives under evals/.
return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using new URL(import.meta.url).pathname to resolve paths breaks on Windows because it leaves a leading slash (e.g., /C:/path). Use fileURLToPath from node:url for cross-platform compatibility.

Suggested change
return path.resolve(path.dirname(new URL(import.meta.url).pathname), '..', '..', '..', 'evals-reports');
return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..', '..', '..', 'evals-reports');

Comment on lines +12 to +13
import fs from 'node:fs';
import path from 'node:path';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import fileURLToPath from node:url to safely resolve file URLs to platform-specific paths.

Suggested change
import fs from 'node:fs';
import path from 'node:path';
import fs from 'node:fs';
import path from 'node:path';
import { fileURLToPath } from 'node:url';


function repoRoot(): string {
// dist/evals/toolcheck-cli.js → up 5 to repo root.
return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using new URL(import.meta.url).pathname to resolve paths breaks on Windows because it leaves a leading slash (e.g., /C:/path). Use fileURLToPath from node:url for cross-platform compatibility.

Suggested change
return path.resolve(path.dirname(new URL(import.meta.url).pathname), '../../../../..');
return path.resolve(path.dirname(fileURLToPath(import.meta.url)), '../../../../..');

Comment thread CHANGELOG.md

### Added

- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The changelog entry is extremely detailed and contains implementation specifics. According to the repository style guide, changelog entries should be concise and impact-first.

Suggested change
- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.
- Add agent messaging eval suite to spawn real agent CLIs and score protocol adherence, phantom messages, and wrong-channel replies, with HTML and JSON reporting.
References
  1. Changelog entries should be concise and impact-first, preferring one short bullet per user-visible change. (link)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 431c227474

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +378 to +382
status: current.status === 'starting' ? 'running' : current.status,
monitorPid: monitor.pid,
updatedAt: deps.now().toISOString(),
};
await writeJsonAtomic(metadataPath, next);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve monitor completion when updating the run record

For very fast workflows, the detached monitor can finish and write status: 'completed' after the parent reads the record but before this parent-side write runs. In that race, this update writes the stale current.status === 'starting' ? 'running' : current.status value back to disk, erasing the terminal status; subsequent logs/sync calls then see a non-running monitor PID and mark the successful run as failed. Re-read/merge only the PID fields or avoid changing status here when the monitor owns terminal state.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

🧹 Nitpick comments (2)
package.json (1)

49-49: ⚡ Quick win

Scope eval:unit to unit-test files only.

eval:unit currently runs the whole tests/integration/broker/evals tree, which can pull in non-unit eval runners. Limit it to *.unit.test.ts so this command stays fast and deterministic.

Suggested change
-    "eval:unit": "vitest run tests/integration/broker/evals",
+    "eval:unit": "vitest run tests/integration/broker/evals/**/*.unit.test.ts",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@package.json` at line 49, The "eval:unit" npm script currently runs the
entire tests/integration/broker/evals folder; update the "eval:unit" script
entry in package.json (the "eval:unit" key) to restrict the glob to only unit
test files (e.g., tests/integration/broker/evals/**/*.unit.test.ts) so it only
executes files matching *.unit.test.ts, keeping the command fast and
deterministic.
tests/integration/broker/evals/scoring/protocol.unit.test.ts (1)

78-113: ⚡ Quick win

Add a regression test for channel-vs-DM target-kind mismatches in relay hops.

Current relay tests don’t guard against #name being treated as equivalent to name. A focused regression test here will prevent future scoring drift.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/protocol.unit.test.ts` around lines 78
- 113, Add a regression test that ensures channel names with a leading '#' are
not treated as equivalent to the same name without it: in the scoreRelayChain
tests create a hop sequence (hops) that targets '`#general`' and then craft events
using inbound where the final recipient is 'general' (no '#') and assert that
scoreRelayChain(hops, events, ...) stops at the previous hop (hopsCompleted less
than full length) and payloadIntact is false; reference the existing
scoreRelayChain function, the hops array, and the inbound helper when adding
this test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 12: The long single `[Unreleased]` changelog bullet in CHANGELOG.md
should be replaced by several concise, impact-first one-line bullets: extract
each user-visible change (e.g., "Adds agent messaging evaluation suite with HTML
and JSON reports", "Supports realistic and smoke tiers for protocol-driven
tasks", "Reports message-sent rate, phantom-message rate, ACK/DONE and
wrong-channel detection", "Includes negative-control self-test and deterministic
wrong-tool-name trap") and write each as a short standalone line; remove
implementation details and internal notes, avoid PR/issue links, and ensure each
bullet is brief and focused on user-facing impact.

In `@specs/fleet-delivery.md`:
- Around line 108-112: The fenced code block containing the state diagram (the
block starting with "queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)")
needs a language hint for proper rendering; update the opening triple-backtick
to include a language specifier (e.g., ```text or ```plaintext) so the diagram
is treated as plain text by the renderer.
- Around line 74-82: Add a language identifier to the fenced code block that
contains the spawn signature and placement pseudocode (the block starting with
"spawn { capability, node?: <name> | \"self\", session_ref?, ttl_override? }"
and the subsequent eligible/place rules); update the opening backticks to
include a hint like pseudocode or text (e.g., ```pseudocode) so rendering and
tooling can recognize the block as code.

In `@tests/integration/broker/evals/eval.test.ts`:
- Line 42: The test currently swallows errors from harness.stop() which hides
shutdown failures; change the teardown in eval.test.ts to let harness.stop()
errors surface by removing the .catch(() => {}) suppression and instead await
harness.stop() directly (or catch and rethrow/log the error) so any failure in
harness.stop() is reported by the test runner; update the test wrapper that
calls harness.stop() to either await harness.stop() without a silent catch or
handle the caught error by failing the test (e.g., rethrowing) rather than
ignoring it.

In `@tests/integration/broker/evals/runner.ts`:
- Around line 201-205: The test runner currently skips missing CLIs inside the
loop over flags.harnesses (using isCliAvailable(cli)) but still exits 0 if none
ran; add a boolean or counter (e.g., ranAny) that you set when a harness
actually executes (inside the branch where you run the harness), and after the
loop check it and fail the process (process.exit(1) or throw an Error) if ranAny
is false; apply the same change to the second similar loop referenced around the
other occurrence so the runner returns non-zero when every requested CLI was
missing.
- Around line 215-225: The baseline compare currently runs for every report
regardless of compatibility; before calling
compareReports(readReport(flags.baseline), report) and reporting regressions,
guard that the baseline report is compatible by loading the baseline via
readReport(flags.baseline) into a variable and checking that baseline.harness
=== report.harness and that the scenario/metric sets (e.g., lengths or matching
scenario identifiers/totals) match; only call compareReports and set
anyRegression when those compatibility checks pass, otherwise skip comparison or
log a clear message. Ensure you reference the existing symbols: flags.baseline,
readReport, compareReports, report, and any properties like harness or scenario
totals on the returned EvalReport.

In `@tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts`:
- Line 47: The current PASS bit uses a generic send count (pass uses base.sent
>= 1) which can be satisfied by Bob sending to anyone; instead require an
explicit Bob→Alice reply signal: replace the base.sent >= 1 clause with an
event-level or protocol-scorer check that verifies at least one message with
from === 'bob' and to === 'alice' (for example: events.some(e => e.from ===
'bob' && e.to === 'alice') or the equivalent protocol scorer flag), while
leaving the existing base.phantoms.length === 0 and base.deliveryOk guards
intact; update the pass expression to use that reply-check + the
phantom/delivery checks (referencing the pass variable, base.sent,
base.phantoms, base.deliveryOk, and the events/protocol-scorer used for reply
detection).

In `@tests/integration/broker/evals/scenarios/04-relay-chain.ts`:
- Around line 44-50: The final waiter currently resolves on any relay_inbound
from participant c; tighten its predicate in the harness.waitForEvent call so it
only resolves when that inbound is the actual final post to "`#general`" by adding
a check that the event's destination/channel indicates "`#general`" (for example
by verifying event.to or event.channel equals "`#general`") in addition to e.kind
=== 'relay_inbound' and e.from === c; update the finalWaiter predicate
accordingly so the wait only completes on the true final-hop event.

In `@tests/integration/broker/evals/scenarios/helpers.ts`:
- Around line 36-39: waitForSends currently returns immediately once seen() >=
count which allows trailing sends to arrive after it returns; modify
waitForSends so that after the main loop completes (and only if deadline wasn't
hit) it waits for a short "quiet window" before returning: record the current
seen() value and poll (using the same setTimeout 500ms interval) until seen()
remains unchanged for the quietWindow duration (e.g., 200–500ms) or deadline is
reached. Reference the existing helpers waitForSends, seen(), count, and
deadline when implementing this post-condition wait so the function only returns
once no new sends arrive during the quiet window.

In `@tests/integration/broker/evals/scenarios/r01-incidental-report.ts`:
- Around line 44-48: The current pass criteria uses sentDirectMessage(events,
worker) which treats any DM as success and can be satisfied by an ACK; change
the check to assert the DM actually contains the computed result. Update the
test to either call a new helper (e.g., sentDirectMessageWithResult or
sentDirectMessageContainsResult) or inspect events for an outgoing DM from the
worker whose payload contains the expected result/marker (for example a "result"
field or the actual computed value), then use that boolean in place of reported;
keep the other checks (base.phantoms.length === 0 && base.deliveryOk) unchanged.

In `@tests/integration/broker/evals/scenarios/r02-forget-to-report.ts`:
- Around line 47-51: The current reported check uses sentDirectMessage(events,
worker) which returns true for ACK-only responses, so update the check to assert
that a direct message with actual findings was sent: replace or augment the
reported variable (from sentDirectMessage(events, worker)) with a check that
inspects the events for a sent message to the worker whose payload contains
non-empty findings (e.g., event.type/message entries with payload.findings
length > 0 or equivalent), and then use that stricter reported in the pass
condition (still keeping base.phantoms.length === 0 && base.deliveryOk and
releaseAgent handling as-is).

In `@tests/integration/broker/evals/scoring/metrics.ts`:
- Line 40: The phantomRate calculation can exceed 1 if phantomCount >
totalIntents; update the phantomRate expression (where phantomRate is computed
using phantomCount and totalIntents) to clamp the result into [0,1] like the
other rate calculations by wrapping the computed fraction with a clamp (e.g.,
Math.max(0, Math.min(1, ...))) while preserving the existing zero-division guard
(totalIntents > 0 ? ... : 0).

In `@tests/integration/broker/evals/scoring/metrics.unit.test.ts`:
- Around line 6-21: The test fixture function result must include the required
ScenarioResult fields agents and transcript; update the result(...) return
object in tests/integration/broker/evals/scoring/metrics.unit.test.ts (the
result function) to add agents: [] as AgentInfo[] and transcript: [] as
TranscriptEntry[] (or minimal valid entries) so the returned object satisfies
the ScenarioResult type contract while preserving existing overrides.

In `@tests/integration/broker/evals/scoring/phantom.ts`:
- Around line 146-155: The current matching loop (variables intents, sends,
consumed, normalizeTarget and matchIdx) allows any earlier unconsumed send to
satisfy an intent; change it so a send is only considered if it occurs at or
after the intent in event order. Concretely, iterate intents with an index (or
otherwise obtain the intent's event index) and update both findIndex predicates
to require the send index i >= intentIndex (e.g., (s, i) => i >= intentIndex &&
!consumed[i] && normalizeTarget(s.target) === intent.target and ( _, i) => i >=
intentIndex && !consumed[i]) so only later-or-equal sends can match an intent.

In `@tests/integration/broker/evals/scoring/protocol.ts`:
- Around line 18-20: normalizeChannel currently strips the leading '#' or '@',
conflating channel vs DM identities; update normalizeChannel to preserve the
target kind by retaining the leading marker: trim input, detect if first char is
'#' or '@', lowercase and trim the rest, and return the normalized string
including the original prefix (e.g., "`#general`" or "`@alice`"); then update all
call sites that rely on the old behavior (the other occurrences around the
109-121 block) to use the preserved-prefix form for comparisons so channel and
direct targets remain distinct.

In `@tests/integration/broker/evals/selftest.ts`:
- Around line 59-69: The two process.exit(1) calls inside the try block (the
checks using evalWouldPass and score.sent) prevent the finally cleanup
(releaseAgent/stop) from running; replace those process.exit(1) calls with
throwing descriptive Errors (e.g., throw new Error('SELF-TEST FAILED: ...')) or
returning a rejected Promise so the error propagates out of the try and the
finally block that calls releaseAgent and stop always executes; update the error
messages to match the existing console.error text and remove the direct
process.exit usage in the try.

In `@tests/integration/broker/evals/toolcheck-cli.ts`:
- Around line 25-27: The regex used to extract registered tool names from
mcpSource in the registerTool(...) matcher is too narrow (only [a-z0-9_]) and
drops valid names with characters like -, ., : (and possibly uppercase); update
the regex assigned to re in toolcheck-cli.ts to expand the character class to
include dash, dot, colon and uppercase letters (e.g., include A-Z and the
characters - . :) and ensure the hyphen is either escaped or placed at the
start/end of the class so it is treated literally, then keep the same matchAll
logic and return value.

---

Nitpick comments:
In `@package.json`:
- Line 49: The "eval:unit" npm script currently runs the entire
tests/integration/broker/evals folder; update the "eval:unit" script entry in
package.json (the "eval:unit" key) to restrict the glob to only unit test files
(e.g., tests/integration/broker/evals/**/*.unit.test.ts) so it only executes
files matching *.unit.test.ts, keeping the command fast and deterministic.

In `@tests/integration/broker/evals/scoring/protocol.unit.test.ts`:
- Around line 78-113: Add a regression test that ensures channel names with a
leading '#' are not treated as equivalent to the same name without it: in the
scoreRelayChain tests create a hop sequence (hops) that targets '`#general`' and
then craft events using inbound where the final recipient is 'general' (no '#')
and assert that scoreRelayChain(hops, events, ...) stops at the previous hop
(hopsCompleted less than full length) and payloadIntact is false; reference the
existing scoreRelayChain function, the hops array, and the inbound helper when
adding this test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f3446c3-1869-4ca5-9ccc-22e3c60dc513

📥 Commits

Reviewing files that changed from the base of the PR and between 6b67acf and 431c227.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (48)
  • .agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
  • .gitignore
  • CHANGELOG.md
  • package.json
  • packages/cli/README.md
  • packages/cli/package.json
  • packages/cli/src/cli/bootstrap.test.ts
  • packages/cli/src/cli/bootstrap.ts
  • packages/cli/src/cli/commands/local-workflow.test.ts
  • packages/cli/src/cli/commands/local-workflow.ts
  • specs/fleet-delivery.md
  • tests/integration/broker/evals/README.md
  • tests/integration/broker/evals/eval.test.ts
  • tests/integration/broker/evals/report/html.ts
  • tests/integration/broker/evals/report/render-cli.ts
  • tests/integration/broker/evals/report/write.ts
  • tests/integration/broker/evals/runner.ts
  • tests/integration/broker/evals/scenarios/01-dm-roundtrip.ts
  • tests/integration/broker/evals/scenarios/02-channel-reply.ts
  • tests/integration/broker/evals/scenarios/03-ack-done.ts
  • tests/integration/broker/evals/scenarios/04-relay-chain.ts
  • tests/integration/broker/evals/scenarios/helpers.ts
  • tests/integration/broker/evals/scenarios/index.ts
  • tests/integration/broker/evals/scenarios/r01-incidental-report.ts
  • tests/integration/broker/evals/scenarios/r02-forget-to-report.ts
  • tests/integration/broker/evals/scenarios/r03-proactive-handoff.ts
  • tests/integration/broker/evals/scenarios/r04-channel-vs-dm.ts
  • tests/integration/broker/evals/scoring/base.ts
  • tests/integration/broker/evals/scoring/fixtures.ts
  • tests/integration/broker/evals/scoring/metrics.ts
  • tests/integration/broker/evals/scoring/metrics.unit.test.ts
  • tests/integration/broker/evals/scoring/phantom.ts
  • tests/integration/broker/evals/scoring/phantom.unit.test.ts
  • tests/integration/broker/evals/scoring/protocol.ts
  • tests/integration/broker/evals/scoring/protocol.unit.test.ts
  • tests/integration/broker/evals/scoring/stream-clean.ts
  • tests/integration/broker/evals/scoring/toolcheck.ts
  • tests/integration/broker/evals/scoring/toolcheck.unit.test.ts
  • tests/integration/broker/evals/selftest.ts
  • tests/integration/broker/evals/toolcheck-cli.ts
  • tests/integration/broker/evals/tsconfig.json
  • tests/integration/broker/evals/types.ts
  • tests/integration/broker/tsconfig.json
  • tests/integration/broker/utils/assert-helpers.ts
  • tests/integration/broker/utils/broker-harness.ts
  • vitest.config.ts
  • web/content/docs/cli-overview.mdx
  • web/content/docs/reference-cli.mdx

Comment thread CHANGELOG.md

### Added

- Agent messaging eval suite (`npm run eval`, `eval:matrix`, `eval:unit`, `eval:selftest`, `eval:toolcheck`) spawns real agent CLIs and scores, from broker events, whether agents actually used the MCP/CLI to message — reporting message-sent rate, phantom-message rate (intent stated in prose but no send), ACK/DONE protocol adherence, and wrong-channel replies. Includes a `realistic` tier (natural-language tasks where the protocol must come from the injected onboarding) and a `smoke` tier (leading prompts as a plumbing canary), a negative-control self-test, and a deterministic wrong-tool-name trap that flags onboarding referencing tools the MCP server doesn't register. Each run emits a self-contained HTML viewer (overview, per-scenario prompts, full message transcript, phantom call-outs) alongside JSON reports with baseline regression diffing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Condense this [Unreleased] bullet into short impact-first entries.

This entry is too long and mixes multiple details; split it into concise one-line user-visible bullets per change.

As per coding guidelines, “Changelog entries should be concise and impact-first, with one short bullet per user-visible change. Omit issue/PR links, internal notes, and implementation details.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 12, The long single `[Unreleased]` changelog bullet in
CHANGELOG.md should be replaced by several concise, impact-first one-line
bullets: extract each user-visible change (e.g., "Adds agent messaging
evaluation suite with HTML and JSON reports", "Supports realistic and smoke
tiers for protocol-driven tasks", "Reports message-sent rate, phantom-message
rate, ACK/DONE and wrong-channel detection", "Includes negative-control
self-test and deterministic wrong-tool-name trap") and write each as a short
standalone line; remove implementation details and internal notes, avoid
PR/issue links, and ensure each bullet is brief and focused on user-facing
impact.

Source: Coding guidelines

Comment thread specs/fleet-delivery.md
Comment on lines +74 to +82
```
spawn { capability, node?: <name> | "self", session_ref?, ttl_override? }

eligible = nodes where
(node.name == target if target given)
∧ capability ∈ node.capabilities
∧ node.live ∧ capacity_available
place: target if given, else least-loaded(eligible)
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The code block lacks a language identifier. Add a language hint (e.g., pseudocode or text) for proper rendering and tooling support.

📝 Proposed fix
-```
+```pseudocode
 spawn { capability, node?: <name> | "self", session_ref?, ttl_override? }
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 74-74: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/fleet-delivery.md` around lines 74 - 82, Add a language identifier to
the fenced code block that contains the spawn signature and placement pseudocode
(the block starting with "spawn { capability, node?: <name> | \"self\",
session_ref?, ttl_override? }" and the subsequent eligible/place rules); update
the opening backticks to include a hint like pseudocode or text (e.g.,
```pseudocode) so rendering and tooling can recognize the block as code.

Source: Linters/SAST tools

Comment thread specs/fleet-delivery.md
Comment on lines +108 to +112
```
queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)
└── TTL expiry ──▶ dead-letter
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The state machine diagram lacks a language identifier. Add a language hint (e.g., text or plaintext) for proper rendering.

📝 Proposed fix
-```
+```text
 queued ──deliver(seq)──▶ delivered ──ack──▶ acked (≈read)
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 108-108: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/fleet-delivery.md` around lines 108 - 112, The fenced code block
containing the state diagram (the block starting with "queued ──deliver(seq)──▶
delivered ──ack──▶ acked (≈read)") needs a language hint for proper rendering;
update the opening triple-backtick to include a language specifier (e.g.,
```text or ```plaintext) so the diagram is treated as plain text by the
renderer.

Source: Linters/SAST tools

}
assert.ok(result.pass, `Scenario ${result.id} failed: ${JSON.stringify(result, null, 2)}`);
} finally {
await harness.stop().catch(() => {});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t suppress harness shutdown errors in the test wrapper.

Ignoring harness.stop() failures can hide resource-leak cleanup problems and cause follow-on test instability.

Suggested patch
     } finally {
-      await harness.stop().catch(() => {});
+      await harness.stop();
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
await harness.stop().catch(() => {});
} finally {
await harness.stop();
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/eval.test.ts` at line 42, The test currently
swallows errors from harness.stop() which hides shutdown failures; change the
teardown in eval.test.ts to let harness.stop() errors surface by removing the
.catch(() => {}) suppression and instead await harness.stop() directly (or catch
and rethrow/log the error) so any failure in harness.stop() is reported by the
test runner; update the test wrapper that calls harness.stop() to either await
harness.stop() without a silent catch or handle the caught error by failing the
test (e.g., rethrowing) rather than ignoring it.

Comment on lines +201 to +205
for (const cli of flags.harnesses) {
if (!isCliAvailable(cli)) {
console.log(`\n[${cli}] skipped — CLI not found on PATH`);
continue;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail when no harness actually ran.

When every requested CLI is missing, the run currently exits with code 0, which can mask a broken eval environment as success.

Suggested fix
   let anyRegression = false;
+  let ranHarnesses = 0;
   for (const cli of flags.harnesses) {
     if (!isCliAvailable(cli)) {
       console.log(`\n[${cli}] skipped — CLI not found on PATH`);
       continue;
     }
+    ranHarnesses += 1;
@@
-  process.exit(anyRegression ? 1 : 0);
+  if (ranHarnesses === 0) {
+    console.error('No harnesses executed: all requested CLIs were unavailable.');
+    process.exit(2);
+  }
+  process.exit(anyRegression ? 1 : 0);

Also applies to: 236-236

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/runner.ts` around lines 201 - 205, The test
runner currently skips missing CLIs inside the loop over flags.harnesses (using
isCliAvailable(cli)) but still exits 0 if none ran; add a boolean or counter
(e.g., ranAny) that you set when a harness actually executes (inside the branch
where you run the harness), and after the loop check it and fail the process
(process.exit(1) or throw an Error) if ranAny is false; apply the same change to
the second similar loop referenced around the other occurrence so the runner
returns non-zero when every requested CLI was missing.

Comment on lines +6 to +21
function result(overrides: Partial<ScenarioResult>): ScenarioResult {
return {
id: 'x',
title: 'x',
pass: true,
sent: 1,
expected: 1,
phantoms: [],
totalIntents: 0,
protocolAdherence: null,
wrongChannelReplies: 0,
deliveryOk: true,
events: { relayInbound: 1, dropped: 0, aclDenied: 0 },
...overrides,
};
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify required fields in ScenarioResult and compare with fixture defaults (read-only).
rg -n -C2 "export interface ScenarioResult|agents: AgentInfo\\[]|transcript: TranscriptEntry\\[]" tests/integration/broker/evals/types.ts
rg -n -C8 "function result\\(overrides: Partial<ScenarioResult>\\): ScenarioResult" tests/integration/broker/evals/scoring/metrics.unit.test.ts

Repository: AgentWorkforce/relay

Length of output: 915


Fix ScenarioResult fixture helper to include required fields.

ScenarioResult defines required agents: AgentInfo[] and transcript: TranscriptEntry[], but result() in tests/integration/broker/evals/scoring/metrics.unit.test.ts omits them, so the fixture breaks the declared type contract.

Suggested change
 function result(overrides: Partial<ScenarioResult>): ScenarioResult {
   return {
     id: 'x',
     title: 'x',
     pass: true,
+    agents: [],
+    transcript: [],
     sent: 1,
     expected: 1,
     phantoms: [],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function result(overrides: Partial<ScenarioResult>): ScenarioResult {
return {
id: 'x',
title: 'x',
pass: true,
sent: 1,
expected: 1,
phantoms: [],
totalIntents: 0,
protocolAdherence: null,
wrongChannelReplies: 0,
deliveryOk: true,
events: { relayInbound: 1, dropped: 0, aclDenied: 0 },
...overrides,
};
}
function result(overrides: Partial<ScenarioResult>): ScenarioResult {
return {
id: 'x',
title: 'x',
pass: true,
agents: [],
transcript: [],
sent: 1,
expected: 1,
phantoms: [],
totalIntents: 0,
protocolAdherence: null,
wrongChannelReplies: 0,
deliveryOk: true,
events: { relayInbound: 1, dropped: 0, aclDenied: 0 },
...overrides,
};
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/metrics.unit.test.ts` around lines 6 -
21, The test fixture function result must include the required ScenarioResult
fields agents and transcript; update the result(...) return object in
tests/integration/broker/evals/scoring/metrics.unit.test.ts (the result
function) to add agents: [] as AgentInfo[] and transcript: [] as
TranscriptEntry[] (or minimal valid entries) so the returned object satisfies
the ScenarioResult type contract while preserving existing overrides.

Comment on lines +146 to +155
for (const intent of intents) {
let matchIdx = -1;
if (intent.target) {
matchIdx = sends.findIndex(
(s, i) => !consumed[i] && normalizeTarget(s.target) === intent.target
);
}
if (matchIdx === -1) {
matchIdx = sends.findIndex((_, i) => !consumed[i]);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Phantom matching is time-agnostic and can hide true phantoms.

Line 149 and Line 154 allow an intent to be satisfied by any earlier unconsumed send. That creates false negatives when the agent says “I’ll message …” after already sending something unrelated. Matching should require sends to occur at or after the intent in event order.

Suggested fix direction
- const intents = detectIntents(cleanStreamOutput(events, agent));
- const sends = relayInboundFrom(events, agent);
+ // Preserve event order for both signals.
+ const intents = detectIntentsFromWorkerStreamEvents(events, agent); // includes streamEventIndex
+ const sends = relayInboundFrom(events, agent); // includes inboundEventIndex

- matchIdx = sends.findIndex((s, i) => !consumed[i] && normalizeTarget(s.target) === intent.target);
+ matchIdx = sends.findIndex(
+   (s, i) =>
+     !consumed[i] &&
+     s.inboundEventIndex >= intent.streamEventIndex &&
+     normalizeTarget(s.target) === intent.target
+ );

- matchIdx = sends.findIndex((_, i) => !consumed[i]);
+ matchIdx = sends.findIndex(
+   (s, i) => !consumed[i] && s.inboundEventIndex >= intent.streamEventIndex
+ );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/phantom.ts` around lines 146 - 155,
The current matching loop (variables intents, sends, consumed, normalizeTarget
and matchIdx) allows any earlier unconsumed send to satisfy an intent; change it
so a send is only considered if it occurs at or after the intent in event order.
Concretely, iterate intents with an index (or otherwise obtain the intent's
event index) and update both findIndex predicates to require the send index i >=
intentIndex (e.g., (s, i) => i >= intentIndex && !consumed[i] &&
normalizeTarget(s.target) === intent.target and ( _, i) => i >= intentIndex &&
!consumed[i]) so only later-or-equal sends can match an intent.

Comment on lines +18 to +20
function normalizeChannel(target: string): string {
return target.replace(/^[#@]/, '').trim().toLowerCase();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Target normalization conflates DM and channel identities.

Stripping both # and @ causes false positives (e.g., post to #c can satisfy hop to c). Relay-chain scoring should preserve target kind (channel vs direct target), not just normalized text.

Proposed fix
-function normalizeChannel(target: string): string {
-  return target.replace(/^[#@]/, '').trim().toLowerCase();
-}
+function normalizeTarget(target: string): { kind: 'channel' | 'direct'; name: string } {
+  const t = target.trim();
+  return {
+    kind: t.startsWith('#') ? 'channel' : 'direct',
+    name: t.replace(/^[#@]/, '').toLowerCase(),
+  };
+}
@@
-      (e) => normalizeChannel(e.target) === normalizeChannel(hop.to)
+      (e) => {
+        const a = normalizeTarget(e.target);
+        const b = normalizeTarget(hop.to);
+        return a.kind === b.kind && a.name === b.name;
+      }
@@
-        normalizeChannel(e.target) === normalizeChannel(finalChannel) &&
+        (() => {
+          const a = normalizeTarget(e.target);
+          const b = normalizeTarget(finalChannel);
+          return a.kind === b.kind && a.name === b.name;
+        })() &&
         e.body.includes(payload)

Also applies to: 109-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/scoring/protocol.ts` around lines 18 - 20,
normalizeChannel currently strips the leading '#' or '@', conflating channel vs
DM identities; update normalizeChannel to preserve the target kind by retaining
the leading marker: trim input, detect if first char is '#' or '@', lowercase
and trim the rest, and return the normalized string including the original
prefix (e.g., "`#general`" or "`@alice`"); then update all call sites that rely on
the old behavior (the other occurrences around the 109-121 block) to use the
preserved-prefix form for comparisons so channel and direct targets remain
distinct.

Comment on lines +59 to +69
if (evalWouldPass) {
console.error(
'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
'The eval is producing false greens.'
);
process.exit(1);
}
if (score.sent !== 0) {
console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
process.exit(1);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tests/integration/broker/evals/selftest.ts"

# Show the relevant region with line numbers
nl -ba "$FILE" | sed -n '1,140p'

Repository: AgentWorkforce/relay

Length of output: 107


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tests/integration/broker/evals/selftest.ts"

echo "== process.exit usages =="
rg -n "process\.exit" "$FILE" || true

echo
echo "== Region around lines 40-110 (with line numbers) =="
cat -n "$FILE" | sed -n '40,110p'

Repository: AgentWorkforce/relay

Length of output: 1825


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tests/integration/broker/evals/selftest.ts"

cat -n "$FILE" | sed -n '1,100p'

Repository: AgentWorkforce/relay

Length of output: 3402


Avoid process.exit(1) inside try in tests/integration/broker/evals/selftest.ts so finally cleanup always runs.
process.exit(1) is called at lines 64 and 68 inside the try, so the finally cleanup (releaseAgent/stop) won’t run on those failure paths.

Suggested fix
     if (evalWouldPass) {
       console.error(
         '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
           'The eval is producing false greens.'
       );
-      process.exit(1);
+      throw new Error(
+        '✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. The eval is producing false greens.'
+      );
     }
     if (score.sent !== 0) {
       console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
-      process.exit(1);
+      throw new Error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (evalWouldPass) {
console.error(
'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
'The eval is producing false greens.'
);
process.exit(1);
}
if (score.sent !== 0) {
console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
process.exit(1);
}
if (evalWouldPass) {
console.error(
'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. ' +
'The eval is producing false greens.'
);
throw new Error(
'✗ SELF-TEST FAILED: the eval reported PASS for an agent that cannot message. The eval is producing false greens.'
);
}
if (score.sent !== 0) {
console.error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
throw new Error(`✗ SELF-TEST FAILED: expected 0 real sends from a no-op agent, got ${score.sent}.`);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/selftest.ts` around lines 59 - 69, The two
process.exit(1) calls inside the try block (the checks using evalWouldPass and
score.sent) prevent the finally cleanup (releaseAgent/stop) from running;
replace those process.exit(1) calls with throwing descriptive Errors (e.g.,
throw new Error('SELF-TEST FAILED: ...')) or returning a rejected Promise so the
error propagates out of the try and the finally block that calls releaseAgent
and stop always executes; update the error messages to match the existing
console.error text and remove the direct process.exit usage in the try.

Comment on lines +25 to +27
const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g;
for (const m of mcpSource.matchAll(re)) names.add(m[1]);
return [...names].sort();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Broaden tool-name extraction to avoid false mismatches.

The current matcher only accepts lowercase alnum + _, so valid registered tool names containing -, ., or : are dropped from registered, causing incorrect failures.

Suggested fix
-  const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g;
+  const re = /registerTool\(\s*['"]([^'"\\\s][^'"\\]*)['"]/g;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const re = /registerTool\(\s*['"]([a-z0-9_]+)['"]/g;
for (const m of mcpSource.matchAll(re)) names.add(m[1]);
return [...names].sort();
const re = /registerTool\(\s*['"]([^'"\\\s][^'"\\]*)['"]/g;
for (const m of mcpSource.matchAll(re)) names.add(m[1]);
return [...names].sort();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/broker/evals/toolcheck-cli.ts` around lines 25 - 27, The
regex used to extract registered tool names from mcpSource in the
registerTool(...) matcher is too narrow (only [a-z0-9_]) and drops valid names
with characters like -, ., : (and possibly uppercase); update the regex assigned
to re in toolcheck-cli.ts to expand the character class to include dash, dot,
colon and uppercase letters (e.g., include A-Z and the characters - . :) and
ensure the hyphen is either escaped or placed at the start/end of the class so
it is treated literally, then keep the same matchAll logic and return value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant