Add eval harness + cases for all 10 agents#58
Conversation
Repeatable dry-runs of the showcase agents, ported from the watchdog-agents eval harness. Each case fires one event at one agent's handler and asserts routing + side effects. - scripts/evals/run-evals.mjs: simulate (free, offline, stubbed harness/llm) and live (cheap opencode model + LLM-as-judge) executors. Flat/nested agent layout; seeds materialized to both ctx.files and the disk mount so relayClient reads (linearClient().getIssue) resolve. New expect.status: "failed" + expect.errorIncludes for required-input guard cases. - evals/cases.jsonl: 11 cases covering linear-slack, linear, review, repo-hygiene, hn-monitor, spotify-releases, vendor-monitor, granola, and the two cloud-team members. 11/11 green in simulate. - evals/seeds/*: linear board fixtures + PR meta, granola note, issue alias. - evals/README.md: documents the two executors, _index.json seeding, what simulate can/can't observe, and the guard-case pattern. - package.json: `evals` / `evals:live` scripts (compile personas first); tsx devDep. .gitignore: .evals/ + provider draft-tree safety net. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 16 minutes and 27 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR introduces a comprehensive evaluation framework for agent showcase cases. It includes detailed documentation of the eval workflow, 11 structured test cases with supporting seed data, a Node.js eval runner supporting both simulate (deterministic, stubbed) and live (real handler, opencode model) execution modes, and integration into the development workflow via npm scripts and environment configuration. ChangesAgent Evaluation Framework
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a repeatable evaluation suite for showcase agents, adding a runner script (run-evals.mjs), test cases, mock seed data, and npm scripts for both simulated and live runs. The review feedback highlights several key improvements for the runner script: fixing the premature cleanup of the RELAYFILE_MOUNT_ROOT environment variable, adding support for named handler exports, implementing error handling for the opencode process execution, and ensuring that malformed or failed LLM judgments correctly fail the test cases instead of silently passing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const mod = await tsImport(pathToFileURL(agentEntry(testCase.agent)).href, import.meta.url); | ||
| if (!event) throw new Error('envelopeToAgentEvent returned null (unsupported envelope)'); | ||
| const handler = mod.default?.handler ?? mod.default; | ||
| await withCaseEnv(personaSpec, testCase.inputs ?? {}, { RELAYFILE_MOUNT_ROOT: mount }, () => handler(ctx, event)); |
There was a problem hiding this comment.
The comment on lines 268-273 states that RELAYFILE_MOUNT_ROOT is pinned for the rest of the process to allow fire-and-forget draft writes to complete without falling back to the current working directory. However, because it is passed as part of extraEnv to withCaseEnv, it gets deleted or restored in the finally block of withCaseEnv as soon as the handler promise resolves. This defeats the purpose of pinning it.
To fix this, set process.env.RELAYFILE_MOUNT_ROOT = mount; globally before calling withCaseEnv, and pass an empty object {} for extraEnv so it is not cleaned up prematurely.
process.env.RELAYFILE_MOUNT_ROOT = mount;
await withCaseEnv(personaSpec, testCase.inputs ?? {}, {}, () => handler(ctx, event));| const rec = await withCaseEnv(persona, testCase.inputs ?? {}, { RELAYFILE_MOUNT_ROOT: tmp }, () => | ||
| simulateInvocation({ | ||
| persona, | ||
| handler: mod.default?.handler ?? mod.default, |
There was a problem hiding this comment.
The current handler resolution only supports default exports (mod.default?.handler ?? mod.default). If an agent uses a named export (e.g., export const handler = ...), mod.default will be undefined, causing the handler to resolve to undefined and crash at runtime.
Updating this to support named exports improves robustness and prevents future agents from failing. Please apply this same change to line 282 in runLive as well.
| handler: mod.default?.handler ?? mod.default, | |
| handler: mod.handler ?? mod.default?.handler ?? mod.default, |
| maxBuffer: 16 * 1024 * 1024, | ||
| env: { ...process.env }, | ||
| }); | ||
| const raw = (res.stdout ?? '').replace(/\x1b\[[0-9;]*m/g, ''); |
There was a problem hiding this comment.
spawnSync is called without checking for execution errors or non-zero exit codes. If the opencode binary is missing, misconfigured, or fails during execution, the function will fail silently and return an empty string, leading to hard-to-debug test failures or false positives.
Adding explicit error handling ensures that any issues with the LLM runner are surfaced immediately.
if (res.error) {
throw new Error(`Failed to execute opencode: ${res.error.message}`);
}
if (res.status !== 0) {
throw new Error(`opencode exited with code ${res.status}: ${res.stderr || res.stdout}`);
}
const raw = (res.stdout ?? '').replace(/\x1b\[[0-9;]*m/g, '');| // A case may deliberately expect a failure (e.g. a required-input guard throw); | ||
| // only treat an unexpected failed status as an automatic fail. | ||
| const expectsFailure = (testCase.expect?.status ?? null) === 'failed'; | ||
| const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass !== false : true); |
There was a problem hiding this comment.
If the LLM judge fails to parse the response (e.g., due to invalid JSON), judge returns { pass: null, reason: ... }. The check verdict ? verdict.pass !== false : true evaluates to true when verdict.pass is null. This means a failed or malformed judgment silently passes the test instead of failing it.
Changing this to verdict.pass === true ensures that any non-true judgment (including parsing errors) correctly fails the test case.
| const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass !== false : true); | |
| const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass === true : true); |
|
Reviewed PR #58 and made scoped fixes. Changes made:
Validation run locally:
Addressed comments
Advisory Notes
|
|
Reviewed PR #58 and made one scoped runner fix pass in scripts/evals/run-evals.mjs. Changes made:
Validation run locally:
Addressed comments
Advisory Notes
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/evals/run-evals.mjs (1)
268-273: 💤 Low valueTemp directory leak is documented but may accumulate.
The comment explains that the mount directory is intentionally not cleaned up to avoid breaking async draft writes, but this means every live eval run leaves a directory in
/tmp. Consider tracking these directories and cleaning up older runs (e.g., older than 1 hour) to prevent unbounded accumulation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/evals/run-evals.mjs` around lines 268 - 273, The temp mount directories created and intentionally not removed (see RELAYFILE_MOUNT_ROOT and local variable mount) can accumulate; add a short cleanup routine at the top of the run-evals flow that scans the same temp-root pattern used for mount (e.g., RELAYFILE_MOUNT_ROOT/* or whatever naming convention creates the per-run mount dirs), checks mtime/ctime, and deletes any directories older than a threshold (suggest 1 hour) using fs/stat and fs.rm or rimraf; invoke this cleanup before creating a new mount to avoid unbounded accumulation while preserving recent mounts used by running drafts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/cases.jsonl`:
- Line 4: The test case currently can't assert "no harness run" because the
expectation only checks status/logs; update the case in evals/cases.jsonl to
include a new boolean flag (e.g., "sideEffectsNone": true) and then modify the
runner's checkExpectations function in scripts/evals/run-evals.mjs to assert
that when sideEffectsNone is true no side-effecting methods were invoked
(specifically ensure harness.run was not called) by adding a negative assertion
path that fails if harness.run or equivalent side-effect markers were observed;
reference the expectation key "sideEffectsNone", the runner helper
checkExpectations, and the side-effecting method harness.run to locate where to
add the check.
---
Nitpick comments:
In `@scripts/evals/run-evals.mjs`:
- Around line 268-273: The temp mount directories created and intentionally not
removed (see RELAYFILE_MOUNT_ROOT and local variable mount) can accumulate; add
a short cleanup routine at the top of the run-evals flow that scans the same
temp-root pattern used for mount (e.g., RELAYFILE_MOUNT_ROOT/* or whatever
naming convention creates the per-run mount dirs), checks mtime/ctime, and
deletes any directories older than a threshold (suggest 1 hour) using fs/stat
and fs.rm or rimraf; invoke this cleanup before creating a new mount to avoid
unbounded accumulation while preserving recent mounts used by running drafts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8cbecaaf-f1a8-4e0d-bd04-2bc29897ec9a
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (14)
.gitignoreevals/README.mdevals/cases.jsonlevals/seeds/github-pr-widget-7-meta.jsonevals/seeds/granola-note-prospect.jsonevals/seeds/linear-issue-1.jsonevals/seeds/linear-issues.jsonevals/seeds/linear-projects.jsonevals/seeds/linear-teams.jsonevals/seeds/linear-workflow-states.jsonevals/seeds/slack-users.jsonpackage.jsonscripts/evals/run-evals.mjstests/linear-slack-agent.test.mjs
| {"id":"linear-slack.chat","agent":"linear-slack","kind":"chat","fixture":{"type":"slack.message.created","resource":{"channel":"C0TEST","ts":"100.1","text":"What's open on the board for the export work?","user":"U1"}},"inputs":{"SLACK_CHANNEL":"C0TEST"},"seeds":["linear/projects","linear/issues","linear/teams"],"expect":{"status":"succeeded","eventSource":"slack","sideEffectsAll":["harness.run"]},"rubric":"A grounded Slack answer about open Linear issues for the export work, citing real issues from the board. Read-only unless asked to create; must not fabricate issue refs."} | ||
| {"id":"linear.chat","agent":"linear","kind":"chat","fixture":{"type":"linear.AgentSessionEvent.prompted","resource":{"payload":{"agentSession":{"id":"session-1","issue":{"id":"issue-1"}},"agentActivity":{"body":"What's the current status of this issue?"}}}},"inputs":{},"seeds":[{"vfs":"/linear/issues/by-uuid/issue-1.json","file":"linear-issue-1.json"}],"expect":{"status":"succeeded","eventSource":"linear","sideEffectsAll":["llm.complete"],"logsAny":["linear event"]},"rubric":"A grounded conversational status reply about the issue. Read-only: must not claim to have edited or closed anything."} | ||
| {"id":"review.review","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","user":{"login":"alice"},"head":{"sha":"abc123"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"]},"rubric":"A code review that runs the harness against the PR diff and surfaces real issues (e.g. the unpaginated export OOM)."} | ||
| {"id":"review.skip-label","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":8,"html_url":"https://github.com/acme/widget/pull/8","user":{"login":"alice"},"labels":[{"name":"no-agent-relay-review"}],"head":{"sha":"def456"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},"rubric":"A PR carrying the opt-out label must be skipped without running the review harness."} |
There was a problem hiding this comment.
Rubric requires “no harness run,” but this case cannot enforce it.
Line 4 says skip must happen without running review, but expect only checks status/logs. With current runner checks, this can still pass even if harness.run is called.
Proposed direction (cases + runner)
-{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},...}
+{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"],"sideEffectsNone":["harness.run"]},...}And in scripts/evals/run-evals.mjs, add a corresponding negative assertion in checkExpectations(...) for sideEffectsNone.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@evals/cases.jsonl` at line 4, The test case currently can't assert "no
harness run" because the expectation only checks status/logs; update the case in
evals/cases.jsonl to include a new boolean flag (e.g., "sideEffectsNone": true)
and then modify the runner's checkExpectations function in
scripts/evals/run-evals.mjs to assert that when sideEffectsNone is true no
side-effecting methods were invoked (specifically ensure harness.run was not
called) by adding a negative assertion path that fails if harness.run or
equivalent side-effect markers were observed; reference the expectation key
"sideEffectsNone", the runner helper checkExpectations, and the side-effecting
method harness.run to locate where to add the check.
Repeatable dry-runs of the showcase agents, ported from the watchdog-agents eval harness. Each case fires one event at one agent's real handler and asserts routing + side effects. Complements the existing
npm testunit suite (which uses hand-rolled ctx spies) by running the real handler through the runtime's simulation API and, in live mode, against an actual cheap model.What's here
scripts/evals/run-evals.mjs— two executors:simulateInvocationagainst an in-memory VFS;harness.run/llm.completeare stubbed but recorded as side effects. Assertsstatus,eventSource,sideEffectsAll/Any,logsAny.--live): backs the model calls with a cheap opencode model (gpt-5-nano);--judgegrades chat replies against the caserubric(LLM-as-judge).ctx.filesand the disk mount so relayClient reads (linearClient().getIssue) resolve; newexpect.status:"failed"+expect.errorIncludesfor required-input guard cases.evals/cases.jsonl— 11 cases covering all 10 agents (linear-slack, linear, review, repo-hygiene, hn-monitor, spotify-releases, vendor-monitor, granola, both cloud-team members). 11/11 green in simulate.evals/seeds/*,evals/README.md— fixtures + docs.package.json:evals/evals:livescripts (compile personas first);tsxdevDep..gitignore:.evals/+ provider draft-tree safety net.How to run
Verification
tsc --noEmitclean · all 10 personas compile ·npm run evals→ 11/11.🤖 Generated with Claude Code
Summary by cubic
Adds a repeatable eval harness that runs real agent handlers in simulate or live mode, with 11 cases across all 10 showcase agents. Improves coverage for routing, side effects, and chat reply quality.
New Features
scripts/evals/run-evals.mjs: simulate (offline, stubbedharness.run/llm.complete) and live (--liveviaopencode, optional--judge); supports flat and nested agent dirs.evals/cases.jsonlwith fixtures and expectations, including required-input guard checks (expect.status: "failed"+errorIncludes)..evals/runs/<stamp>/{result.json,summary.md};evals/README.mddocuments usage and seeding;.gitignoreignores.evals/and adds a provider draft-tree safety net.package.json:evals/evals:livescripts (compile personas first) andtsxdev dependency.Bug Fixes
tests/linear-slack-agent.test.mjsto mapSLACK_CHANNELtoTEST_SLACK_CHANNELviainputSpecs, preventing env collisions and aligning with the harness env mapping.Written for commit ad556b5. Summary will update on new commits.