Add eval harness + cases for all 10 agents #58
Merged
3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| # agents evals | ||
|
|
||
| Repeatable dry-runs of the showcase agents. Each **case** (`cases.jsonl`) fires | ||
| one event at one agent's handler and checks what it did. This complements | ||
| `npm test` (the `node --test` unit suite + the persona-scope guard): the unit | ||
| tests assert handler internals with hand-rolled spies; these evals run the | ||
| **real handler** through the runtime's simulation API and, in live mode, against | ||
| an actual cheap model. | ||
|
|
||
| ```sh | ||
| npm run evals # simulate mode, all cases (free, offline) | ||
| npm run evals -- --agent review | ||
| npm run evals -- --case linear-slack.chat | ||
| npm run evals -- --suite chat # by kind: chat|triage|scheduled|guard | ||
| npm run evals -- --list | ||
| npm run evals:live # real cheap-model replies + LLM judge | ||
| ``` | ||
|
|
||
| `npm run evals` compiles personas first (they are gitignored — `*/persona.json` | ||
| is built from `persona.ts` on demand), then runs the suite. | ||
|
|
||
| ## Two executors | ||
|
|
||
| **simulate** (default) runs each case through the runtime's local | ||
| `simulateInvocation`. The handler executes for real against an in-memory VFS | ||
| seeded from the case, but `harness.run` / `llm.complete` are **stubbed** (no | ||
| model, no tokens). We assert deterministic facts: the run succeeded (or failed | ||
| as expected), it routed to the expected `eventSource`, and the expected side | ||
| effects / log lines appeared. This is the fast, free **routing/plumbing | ||
| regression gate**. | ||
|
|
||
| **live** (`--live`) runs the real handler with `harness.run` and `llm.complete` | ||
| backed by a cheap **opencode** model (default `opencode/gpt-5-nano`, override | ||
| with `WD_EVAL_MODEL`), so chat cases get an **actual agent reply**. Add | ||
| `--judge` to grade each chat reply against the case `rubric` with the same cheap | ||
| model (LLM-as-judge). Needs `opencode` on PATH and `OPENCODE_API_KEY`. The judge | ||
| only grades `kind:"chat"` cases; for the rest the routing + side-effect checks | ||
| are the gate. | ||
|
|
||
| ## How seeding works (and why it's `_index.json`) | ||
|
|
||
| The simulator backs the VFS as an in-memory map exposed only through | ||
| `ctx.files.read(exactPath)` — it has **no directory-listing primitive**. A | ||
| case's `seeds` list names provider dirs (`"linear/projects"`), which the runner | ||
| maps to `/linear/projects/_index.json` ← `seeds/linear-projects.json`. Agents | ||
| that enumerate a dir read that blessed `_index.json`; agents that list via | ||
| `sandbox.exec`/`find` see nothing (it's stubbed). | ||
|
|
||
| Seeds are **also materialized to the disk mount** (`RELAYFILE_MOUNT_ROOT`), so | ||
| agents that read through a relay-helpers client (e.g. | ||
| `linearClient().getIssue` resolves `/linear/issues/by-uuid/<id>.json`) get their | ||
| data too. Use the long seed form to drop a file at an exact VFS path: | ||
| `{ "vfs": "/linear/issues/by-uuid/issue-1.json", "file": "linear-issue-1.json" }`. | ||
|
|
||
| ## What simulate can and can't see | ||
|
|
||
| Only `harness.run` and `llm.complete` are recorded as side effects. | ||
| `slackClient` / `linearClient` / `githubClient` writes are **not** recorded, and | ||
| real `fetch` calls (hn-monitor's HN Algolia, spotify-releases' Spotify API, | ||
| vendor-monitor's npm registry) hit the network or fail closed. So: | ||
|
|
||
| - **harness/llm agents** (linear-slack, linear, review, repo-hygiene, granola) | ||
| assert the recorded side effect + a happy-path log. | ||
| - **cron warn-only team members** (cloud-team-implementer/reviewer) assert the | ||
| misroute warning — their real work happens only in the team dispatcher's | ||
| sandbox, which simulate does not run. | ||
| - **fetch-only agents with no recorded side effect** (spotify-releases, | ||
| vendor-monitor) can't assert a positive in simulate, so their deterministic | ||
| coverage is the **required-input guard**: a case with `expect.status:"failed"` | ||
| + `expect.errorIncludes` proves the agent refuses to run without its inputs. | ||
|
|
||
| ## Case shape (`cases.jsonl`, one JSON object per line) | ||
|
|
||
| ```jsonc | ||
| { | ||
| "id": "review.review", "agent": "review", "kind": "triage", | ||
| "fixture": { "type": "github.pull_request.opened", | ||
| "resource": { "pull_request": { "number": 7, ... }, "repository": { ... } } }, | ||
| "inputs": { "APPROVERS": "alice" }, | ||
| "seeds": [{ "vfs": "/github/repos/acme/widget/pulls/7/meta.json", "file": "github-pr-widget-7-meta.json" }], | ||
| "expect": { "status": "succeeded", "eventSource": "github", "sideEffectsAll": ["harness.run"] }, | ||
| "rubric": "..." | ||
| } | ||
| ``` | ||
|
|
||
| `agent` is the dir path relative to the repo root — flat (`review`) or nested | ||
| team member (`competitor/market-competitor`), both work. `expect` keys: | ||
| `status` (`succeeded` | `failed`), `errorIncludes` (substring of the thrown | ||
| error, for guard cases), `eventSource`, `sideEffectsAll` (all must appear), | ||
| `sideEffectsAny` (≥1), `logsAny` (≥1 log message). | ||
|
|
||
| Artifacts land in `.evals/runs/<stamp>/{result.json,summary.md}` (gitignored). |
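The `expect` keys listed in the README can be read as a small set of independent checks. The sketch below is one way such a check could look, not the runner's actual code: the function name `checkExpectationsSketch`, the `result` shape (`status`, `error`, `eventSource`, `sideEffects`, `logs`), and substring matching for `logsAny` are all assumptions made for illustration.

```javascript
// Hypothetical sketch. `result` is ASSUMED to look like:
//   { status, error, eventSource, sideEffects: string[], logs: string[] }
// The real runner's result shape is not shown in this PR.
function checkExpectationsSketch(expect, result) {
  const failures = [];
  const sideEffects = result.sideEffects ?? [];
  const logs = result.logs ?? [];
  const has = (name) => sideEffects.includes(name);

  if (expect.status !== undefined && result.status !== expect.status) {
    failures.push(`status: expected ${expect.status}, got ${result.status}`);
  }
  // Guard cases: the thrown error must contain the given substring.
  if (
    expect.errorIncludes !== undefined &&
    !String(result.error ?? "").includes(expect.errorIncludes)
  ) {
    failures.push(`error does not include "${expect.errorIncludes}"`);
  }
  if (expect.eventSource !== undefined && result.eventSource !== expect.eventSource) {
    failures.push(`eventSource: expected ${expect.eventSource}, got ${result.eventSource}`);
  }
  // sideEffectsAll: every listed side effect must have been recorded.
  for (const name of expect.sideEffectsAll ?? []) {
    if (!has(name)) failures.push(`missing side effect: ${name}`);
  }
  // sideEffectsAny: at least one listed side effect must have been recorded.
  if (expect.sideEffectsAny?.length && !expect.sideEffectsAny.some(has)) {
    failures.push(`none of sideEffectsAny recorded: ${expect.sideEffectsAny.join(", ")}`);
  }
  // logsAny: at least one needle must match a log message (substring match assumed).
  if (
    expect.logsAny?.length &&
    !expect.logsAny.some((needle) => logs.some((msg) => msg.includes(needle)))
  ) {
    failures.push(`none of logsAny found: ${expect.logsAny.join(", ")}`);
  }
  return failures; // an empty array means the case passed
}
```

Returning a list of failure strings rather than throwing on the first mismatch lets a summary report every failed expectation for a case at once.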
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| {"id":"linear-slack.chat","agent":"linear-slack","kind":"chat","fixture":{"type":"slack.message.created","resource":{"channel":"C0TEST","ts":"100.1","text":"What's open on the board for the export work?","user":"U1"}},"inputs":{"SLACK_CHANNEL":"C0TEST"},"seeds":["linear/projects","linear/issues","linear/teams"],"expect":{"status":"succeeded","eventSource":"slack","sideEffectsAll":["harness.run"]},"rubric":"A grounded Slack answer about open Linear issues for the export work, citing real issues from the board. Read-only unless asked to create; must not fabricate issue refs."} | ||
| {"id":"linear.chat","agent":"linear","kind":"chat","fixture":{"type":"linear.AgentSessionEvent.prompted","resource":{"payload":{"agentSession":{"id":"session-1","issue":{"id":"issue-1"}},"agentActivity":{"body":"What's the current status of this issue?"}}}},"inputs":{},"seeds":[{"vfs":"/linear/issues/by-uuid/issue-1.json","file":"linear-issue-1.json"}],"expect":{"status":"succeeded","eventSource":"linear","sideEffectsAll":["llm.complete"],"logsAny":["linear event"]},"rubric":"A grounded conversational status reply about the issue. Read-only: must not claim to have edited or closed anything."} | ||
| {"id":"review.review","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","user":{"login":"alice"},"head":{"sha":"abc123"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"]},"rubric":"A code review that runs the harness against the PR diff and surfaces real issues (e.g. the unpaginated export OOM)."} | ||
| {"id":"review.skip-label","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":8,"html_url":"https://github.com/acme/widget/pull/8","user":{"login":"alice"},"labels":[{"name":"no-agent-relay-review"}],"head":{"sha":"def456"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},"rubric":"A PR carrying the opt-out label must be skipped without running the review harness."} | ||
| {"id":"repo-hygiene.diagnose","agent":"repo-hygiene","kind":"triage","fixture":{"type":"github.pull_request.synchronize","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","title":"Add CSV export endpoint","user":{"login":"alice"},"head":{"sha":"abc123","ref":"feat/x"},"base":{"ref":"main"}},"repository":{"name":"widget","full_name":"acme/widget","owner":{"login":"acme"}}}},"inputs":{"NOTION_DATABASE_ID":"db_test"},"seeds":[{"vfs":"/github/repos/acme/widget/pulls/7/meta.json","file":"github-pr-widget-7-meta.json"}],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"],"logsAny":["repo-hygiene.notion-page.no-receipt","repo-hygiene.journal-failed"]},"rubric":"Diagnoses repo-hygiene issues in the PR and runs the harness; the Notion journal write is best-effort and logs when no receipt comes back."} | ||
| {"id":"hn-monitor.nothing-new","agent":"hn-monitor","kind":"scheduled","fixture":{"type":"cron.tick","name":"scan","cron":"0 9,17 * * *"},"inputs":{"SLACK_CHANNEL":"C123","TOPICS":"zzqx-no-topic-will-match-this"},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["hn-monitor.nothing-new"]},"rubric":"With no front-page story matching the topics, the scan posts nothing and logs it found nothing new."} | ||
| {"id":"spotify-releases.requires-token","agent":"spotify-releases","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 10 * * *"},"inputs":{"SLACK_USER":"U123"},"seeds":[],"expect":{"status":"failed","errorIncludes":"SPOTIFY_TOKEN"},"rubric":"Missing the required SPOTIFY_TOKEN input must fail loudly rather than silently no-op."} | ||
| {"id":"vendor-monitor.requires-channel","agent":"vendor-monitor","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 8 * * 1-5"},"inputs":{},"seeds":[],"expect":{"status":"failed","errorIncludes":"SLACK_CHANNEL"},"rubric":"Missing the required SLACK_CHANNEL input must fail loudly rather than silently no-op."} | ||
| {"id":"granola.note-classify","agent":"granola","kind":"triage","fixture":{"type":"granola.file.created","provider":"granola","paths":["/granola/notes/note-2.json"],"resource":{"path":"/granola/notes/note-2.json","kind":"file","id":"note-2","provider":"granola"}},"inputs":{},"seeds":[{"vfs":"/granola/notes/note-2.json","file":"granola-note-prospect.json"}],"expect":{"status":"succeeded","eventSource":"granola","sideEffectsAll":["llm.complete"],"logsAny":["granola-prospect.not-a-prospect"]},"rubric":"Reads the meeting note and classifies whether it describes a sales prospect before doing any Linear/PR work."} | ||
| {"id":"cloud-team-implementer.misroute","agent":"cloud-team-implementer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-implementer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."} | ||
| {"id":"cloud-team-reviewer.misroute","agent":"cloud-team-reviewer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-reviewer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."} | ||
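The cases above use both seed forms: the short provider-dir form (`"linear/projects"`) and the long form with an exact `vfs` path. A minimal sketch of how the two could be normalized into one shape follows. The helper name `resolveSeed` and the returned `{ vfs, file }` object are hypothetical, and `file` is assumed to be relative to the `seeds/` directory.

```javascript
// Hypothetical helper: the name and return shape are illustrative only.
// `file` is ASSUMED to be relative to the seeds/ directory.
function resolveSeed(seed) {
  if (typeof seed === "string") {
    // Short form: "linear/projects" names a provider dir, which maps to
    // /linear/projects/_index.json backed by seeds/linear-projects.json.
    const dir = seed.replace(/^\/+|\/+$/g, "");
    return {
      vfs: `/${dir}/_index.json`,
      file: `${dir.split("/").join("-")}.json`,
    };
  }
  // Long form: drop a seed file at an exact VFS path.
  if (seed && typeof seed.vfs === "string" && typeof seed.file === "string") {
    return { vfs: seed.vfs, file: seed.file };
  }
  throw new TypeError(`unrecognized seed: ${JSON.stringify(seed)}`);
}

// Usage with the two forms that appear in cases.jsonl:
console.log(resolveSeed("linear/projects"));
console.log(
  resolveSeed({ vfs: "/linear/issues/by-uuid/issue-1.json", file: "linear-issue-1.json" })
);
```

Normalizing both forms up front means the code that fills the in-memory VFS and the code that materializes files to the disk mount can share a single list of `{ vfs, file }` pairs.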
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| { | ||
| "number": 7, | ||
| "title": "Add CSV export endpoint", | ||
| "state": "open", | ||
| "draft": false, | ||
| "labels": [], | ||
| "diff": "diff --git a/src/export.ts b/src/export.ts\nindex 1111111..2222222 100644\n--- a/src/export.ts\n+++ b/src/export.ts\n@@ -1,3 +1,12 @@\n-export function noop() {}\n+import { db } from './db';\n+\n+export async function exportCsv(accountId: string): Promise<string> {\n+ const rows = await db.activity.findMany({ where: { accountId } });\n+ const header = 'id,type,at\\n';\n+ const body = rows.map((r) => `${r.id},${r.type},${r.at}`).join('\\n');\n+ // NOTE: no pagination — will OOM on large accounts\n+ return header + body;\n+}\n" | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| { | ||
| "title": "Intro call — Acme Corp", | ||
| "transcript": "Sarah (Acme, VP Eng): we're a 200-person fintech, evaluating workflow automation. We looked at Zapier and n8n but need something that can run agents against our Linear and GitHub. Budget is approved for this quarter. Can you send pricing and SSO details? We'd want to start a pilot in two weeks.", | ||
| "summary": "Acme Corp evaluating us vs Zapier/n8n; approved budget; asked for pricing + SSO; wants a 2-week pilot." | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| { | ||
| "id": "issue-1", | ||
| "identifier": "AR-1", | ||
| "title": "CSV export still failing for Acme", | ||
| "description": "Acme reports the account-activity CSV export times out on large accounts. Needs pagination.", | ||
| "url": "https://linear.app/agentrelay/issue/AR-1", | ||
| "state": { "name": "In Progress", "type": "started" }, | ||
| "priority": 2, | ||
| "assignee": { "name": "Benjamin" }, | ||
| "team": { "id": "team-1", "key": "AR" } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| [ | ||
| { "id": "AR-1", "identifier": "AR-1", "title": "Export job times out on large accounts", "projectId": "acme", "state": "In Progress", "dueDate": "2026-05-01", "updatedAt": "2026-05-20T00:00:00Z", "objectId": "AR-1" }, | ||
| { "id": "AR-2", "identifier": "AR-2", "title": "Add SSO for Acme", "projectId": "acme", "state": "Backlog", "dueDate": "2026-07-01", "updatedAt": "2026-06-05T00:00:00Z", "objectId": "AR-2" }, | ||
| { "id": "AR-9", "identifier": "AR-9", "title": "Globex onboarding checklist", "projectId": "globex", "state": "In Progress", "dueDate": "2026-06-15", "updatedAt": "2026-06-09T00:00:00Z", "objectId": "AR-9" } | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| [ | ||
| { "id": "acme", "name": "Acme Corp", "description": "Enterprise customer; data-export integration. Domain acme.com." }, | ||
| { "id": "globex", "name": "Globex", "description": "Mid-market; onboarding in progress. Domain globex.io." } | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| [ { "id": "team-eng", "name": "Engineering" } ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| [ { "id": "ws-done", "name": "Done", "type": "completed", "teamId": "team-eng" }, | ||
| { "id": "ws-prog", "name": "In Progress", "type": "started", "teamId": "team-eng" } ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| [ { "id": "U0BENJ", "name": "benjamin", "real_name": "Benjamin Watchdog", "profile": { "email": "benjamin@watchdog.no", "real_name": "Benjamin Watchdog" } } ] |
Rubric requires “no harness run,” but this case cannot enforce it.
Line 4 says skip must happen without running review, but `expect` only checks status/logs. With current runner checks, this can still pass even if `harness.run` is called.

Proposed direction (cases + runner)

And in `scripts/evals/run-evals.mjs`, add a corresponding negative assertion in `checkExpectations(...)` for `sideEffectsNone`.
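
A negative assertion of this kind could be as small as the sketch below. Only the key name `sideEffectsNone` comes from the comment; the helper name `checkSideEffectsNone` and the `result.sideEffects` array of recorded names are assumptions, since the runner's internals are not shown here.

```javascript
// Hypothetical sketch of a negative side-effect assertion.
// ASSUMES result.sideEffects is an array of recorded names such as "harness.run".
function checkSideEffectsNone(expect, result) {
  const forbidden = expect.sideEffectsNone ?? [];
  const recorded = new Set(result.sideEffects ?? []);
  // One failure message per forbidden side effect that was actually recorded.
  return forbidden
    .filter((name) => recorded.has(name))
    .map((name) => `forbidden side effect recorded: ${name}`);
}

// Usage: a skip case that must not run the review harness.
const skipExpect = { status: "succeeded", sideEffectsNone: ["harness.run"] };
console.log(checkSideEffectsNone(skipExpect, { sideEffects: [] }));
console.log(checkSideEffectsNone(skipExpect, { sideEffects: ["harness.run"] }));
```

With a check like this in place, `review.skip-label` could carry `"sideEffectsNone": ["harness.run"]` in its `expect`, so the case would fail if the handler logs the skip but still calls the harness.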