Merged
9 changes: 9 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -7,3 +7,12 @@ node_modules/
# Relay VFS runtime state — rewritten on every reconcile cycle, never commit it
.relay/
memory/workspace/.relay/

# Eval run artifacts
.evals/
# Safety net: stray provider draft trees a mis-pointed mount root can leave behind
/slack/
/linear/
/github/
/notion/
/google-mail/
92 changes: 92 additions & 0 deletions evals/README.md
@@ -0,0 +1,92 @@
# agents evals

Repeatable dry-runs of the showcase agents. Each **case** (`cases.jsonl`) fires
one event at one agent's handler and checks what it did. This complements
`npm test` (the `node --test` unit suite + the persona-scope guard): the unit
tests assert handler internals with hand-rolled spies; these evals run the
**real handler** through the runtime's simulation API and, in live mode, against
an actual cheap model.

```sh
npm run evals # simulate mode, all cases (free, offline)
npm run evals -- --agent review
npm run evals -- --case linear-slack.chat
npm run evals -- --suite chat # by kind: chat|triage|scheduled|guard
npm run evals -- --list
npm run evals:live # real cheap-model replies + LLM judge
```

`npm run evals` compiles personas first (they are gitignored — `*/persona.json`
is built from `persona.ts` on demand), then runs the suite.
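
As a rough illustration of how the filter flags above could narrow the case list, here is a minimal sketch. The helper names `parseFilters` and `selectCases` are hypothetical and not taken from the runner; the only assumptions carried over from this README are that `--agent` matches a case's `agent`, `--case` matches its `id`, and `--suite` matches its `kind`.

```javascript
// Hypothetical sketch: these helpers are illustrative, not the runner's code.
function parseFilters(argv) {
  const filters = { agent: null, caseId: null, suite: null, list: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--agent") filters.agent = argv[++i] ?? null;
    else if (arg === "--case") filters.caseId = argv[++i] ?? null;
    else if (arg === "--suite") filters.suite = argv[++i] ?? null;
    else if (arg === "--list") filters.list = true;
  }
  return filters;
}

// A case is kept only if it matches every filter that was supplied.
function selectCases(cases, filters) {
  return cases.filter(
    (c) =>
      (filters.agent === null || c.agent === filters.agent) &&
      (filters.caseId === null || c.id === filters.caseId) &&
      (filters.suite === null || c.kind === filters.suite)
  );
}

const demoCases = [
  { id: "linear-slack.chat", agent: "linear-slack", kind: "chat" },
  { id: "review.review", agent: "review", kind: "triage" },
  { id: "review.skip-label", agent: "review", kind: "triage" },
];
const picked = selectCases(demoCases, parseFilters(["--agent", "review"]));
console.log(picked.map((c) => c.id));
```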

## Two executors

**simulate** (default) runs each case through the runtime's local
`simulateInvocation`. The handler executes for real against an in-memory VFS
seeded from the case, but `harness.run` / `llm.complete` are **stubbed** (no
model, no tokens). We assert deterministic facts: the run succeeded (or failed
as expected), it routed to the expected `eventSource`, and the expected side
effects / log lines appeared. This is the fast, free **routing/plumbing
regression gate**.

**live** (`--live`) runs the real handler with `harness.run` and `llm.complete`
backed by a cheap **opencode** model (default `opencode/gpt-5-nano`, override
with `WD_EVAL_MODEL`), so chat cases get an **actual agent reply**. Add
`--judge` to grade each chat reply against the case `rubric` with the same cheap
model (LLM-as-judge). Needs `opencode` on PATH and `OPENCODE_API_KEY`. The judge
only grades `kind:"chat"` cases; for the rest the routing + side-effect checks
are the gate.
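
The choice between the two executors can be pictured as a small decision function. This is a sketch under stated assumptions: `planRun` and its return shape are hypothetical, and only the default model name, the `WD_EVAL_MODEL` override, and the chat-only judging rule come from the text above.

```javascript
// Hypothetical helper; the real runner's option handling may differ.
const DEFAULT_EVAL_MODEL = "opencode/gpt-5-nano"; // default named in this README

function planRun(evalCase, { live = false, judge = false } = {}, env = {}) {
  if (!live) {
    // simulate: harness.run / llm.complete are stubbed, so no model is needed.
    return { executor: "simulate", model: null, judged: false };
  }
  return {
    executor: "live",
    model: env.WD_EVAL_MODEL || DEFAULT_EVAL_MODEL,
    // The judge only grades chat cases; other kinds rely on routing checks.
    judged: judge && evalCase.kind === "chat",
  };
}

console.log(planRun({ kind: "chat" }, { live: true, judge: true }));
console.log(planRun({ kind: "guard" }, { live: true, judge: true }));
```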

## How seeding works (and why it's `_index.json`)

The simulator backs the VFS as an in-memory map exposed only through
`ctx.files.read(exactPath)` — it has **no directory-listing primitive**. A
case's `seeds` list names provider dirs (`"linear/projects"`), which the runner
maps to `/linear/projects/_index.json` ← `seeds/linear-projects.json`. Agents
that enumerate a dir read that blessed `_index.json`; agents that list via
`sandbox.exec`/`find` see nothing (it's stubbed).

Seeds are **also materialized to the disk mount** (`RELAYFILE_MOUNT_ROOT`), so
agents that read through a relay-helpers client (e.g.
`linearClient().getIssue` resolves `/linear/issues/by-uuid/<id>.json`) get their
data too. Use the long seed form to drop a file at an exact VFS path:
`{ "vfs": "/linear/issues/by-uuid/issue-1.json", "file": "linear-issue-1.json" }`.
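
The two seed forms can be normalized into one `{ vfs, file }` shape before seeding. The sketch below is illustrative only: `resolveSeed` is a hypothetical name, and the rule that slashes in the short form become dashes in the seed filename is inferred from the single `linear/projects` example above, not confirmed against the runner.

```javascript
// Hypothetical sketch of the seed mapping described above.
function resolveSeed(seed) {
  if (typeof seed === "string") {
    // Short form: "linear/projects" -> /linear/projects/_index.json,
    // backed by seeds/linear-projects.json (slash-to-dash is an assumption).
    const dir = seed.replace(/^\/+|\/+$/g, "");
    return {
      vfs: `/${dir}/_index.json`,
      file: `${dir.replace(/\//g, "-")}.json`,
    };
  }
  // Long form: the case names the exact VFS path and the seed file itself.
  return { vfs: seed.vfs, file: seed.file };
}

console.log(resolveSeed("linear/projects"));
console.log(
  resolveSeed({ vfs: "/linear/issues/by-uuid/issue-1.json", file: "linear-issue-1.json" })
);
```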

## What simulate can and can't see

Only `harness.run` and `llm.complete` are recorded as side effects.
`slackClient` / `linearClient` / `githubClient` writes are **not** recorded, and
real `fetch` calls (hn-monitor's HN Algolia, spotify-releases' Spotify API,
vendor-monitor's npm registry) hit the network or fail closed. So:

- **harness/llm agents** (linear-slack, linear, review, repo-hygiene, granola)
assert the recorded side effect + a happy-path log.
- **cron warn-only team members** (cloud-team-implementer/reviewer) assert the
misroute warning — their real work happens only in the team dispatcher's
sandbox, which simulate does not run.
- **fetch-only agents with no recorded side effect** (spotify-releases,
vendor-monitor) can't assert a positive in simulate, so their deterministic
coverage is the **required-input guard**: a case with `expect.status:"failed"`
+ `expect.errorIncludes` proves the agent refuses to run without its inputs.

## Case shape (`cases.jsonl`, one JSON object per line)

```jsonc
{
  "id": "review.review", "agent": "review", "kind": "triage",
  "fixture": { "type": "github.pull_request.opened",
               "resource": { "pull_request": { "number": 7, ... }, "repository": { ... } } },
  "inputs": { "APPROVERS": "alice" },
  "seeds": [{ "vfs": "/github/repos/acme/widget/pulls/7/meta.json", "file": "github-pr-widget-7-meta.json" }],
  "expect": { "status": "succeeded", "eventSource": "github", "sideEffectsAll": ["harness.run"] },
  "rubric": "..."
}
```

`agent` is the dir path relative to the repo root — flat (`review`) or nested
team member (`competitor/market-competitor`), both work. `expect` keys:
`status` (`succeeded` | `failed`), `errorIncludes` (substring of the thrown
error, for guard cases), `eventSource`, `sideEffectsAll` (all must appear),
`sideEffectsAny` (≥1), `logsAny` (≥1 log message).
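
To make the `expect` semantics concrete, here is a minimal sketch of how those keys could be evaluated against a run result. It is not the runner's implementation: the function name `checkExpectationsSketch`, the result shape (`status`, `error`, `eventSource`, `sideEffects`, `logs`), and substring matching for `logsAny` are all assumptions for illustration.

```javascript
// Hypothetical sketch; the result shape and matching rules are assumptions.
function checkExpectationsSketch(expect, result) {
  const failures = [];
  const effects = result.sideEffects ?? [];
  const logs = result.logs ?? [];

  if (expect.status && result.status !== expect.status) {
    failures.push(`status: expected ${expect.status}, got ${result.status}`);
  }
  if (expect.errorIncludes && !(result.error ?? "").includes(expect.errorIncludes)) {
    failures.push(`error does not include "${expect.errorIncludes}"`);
  }
  if (expect.eventSource && result.eventSource !== expect.eventSource) {
    failures.push(`eventSource: expected ${expect.eventSource}, got ${result.eventSource}`);
  }
  // sideEffectsAll: every listed side effect must have been recorded.
  for (const name of expect.sideEffectsAll ?? []) {
    if (!effects.includes(name)) failures.push(`missing side effect ${name}`);
  }
  // sideEffectsAny: at least one listed side effect must have been recorded.
  const anyEffects = expect.sideEffectsAny ?? [];
  if (anyEffects.length > 0 && !anyEffects.some((n) => effects.includes(n))) {
    failures.push(`none of ${anyEffects.join(", ")} recorded`);
  }
  // logsAny: at least one expected message must appear in some log line.
  const anyLogs = expect.logsAny ?? [];
  if (anyLogs.length > 0 && !anyLogs.some((needle) => logs.some((l) => l.includes(needle)))) {
    failures.push(`no log matched any of: ${anyLogs.join(" | ")}`);
  }
  return failures; // an empty array means the case passed
}

console.log(
  checkExpectationsSketch(
    { status: "failed", errorIncludes: "SPOTIFY_TOKEN" },
    { status: "failed", error: "missing required input SPOTIFY_TOKEN" }
  )
);
```

Returning a list of failures rather than throwing on the first mismatch lets a summary report every unmet expectation for a case at once.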

Artifacts land in `.evals/runs/<stamp>/{result.json,summary.md}` (gitignored).
11 changes: 11 additions & 0 deletions evals/cases.jsonl
@@ -0,0 +1,11 @@
{"id":"linear-slack.chat","agent":"linear-slack","kind":"chat","fixture":{"type":"slack.message.created","resource":{"channel":"C0TEST","ts":"100.1","text":"What's open on the board for the export work?","user":"U1"}},"inputs":{"SLACK_CHANNEL":"C0TEST"},"seeds":["linear/projects","linear/issues","linear/teams"],"expect":{"status":"succeeded","eventSource":"slack","sideEffectsAll":["harness.run"]},"rubric":"A grounded Slack answer about open Linear issues for the export work, citing real issues from the board. Read-only unless asked to create; must not fabricate issue refs."}
{"id":"linear.chat","agent":"linear","kind":"chat","fixture":{"type":"linear.AgentSessionEvent.prompted","resource":{"payload":{"agentSession":{"id":"session-1","issue":{"id":"issue-1"}},"agentActivity":{"body":"What's the current status of this issue?"}}}},"inputs":{},"seeds":[{"vfs":"/linear/issues/by-uuid/issue-1.json","file":"linear-issue-1.json"}],"expect":{"status":"succeeded","eventSource":"linear","sideEffectsAll":["llm.complete"],"logsAny":["linear event"]},"rubric":"A grounded conversational status reply about the issue. Read-only: must not claim to have edited or closed anything."}
{"id":"review.review","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","user":{"login":"alice"},"head":{"sha":"abc123"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"]},"rubric":"A code review that runs the harness against the PR diff and surfaces real issues (e.g. the unpaginated export OOM)."}
{"id":"review.skip-label","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":8,"html_url":"https://github.com/acme/widget/pull/8","user":{"login":"alice"},"labels":[{"name":"no-agent-relay-review"}],"head":{"sha":"def456"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},"rubric":"A PR carrying the opt-out label must be skipped without running the review harness."}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rubric requires “no harness run,” but this case cannot enforce it.

The rubric on line 4 says the skip must happen without running the review harness, but `expect` only checks status and logs. With the current runner checks, this case can still pass even if harness.run is called.

Proposed direction (cases + runner)
-{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},...}
+{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"],"sideEffectsNone":["harness.run"]},...}

And in scripts/evals/run-evals.mjs, add a corresponding negative assertion in checkExpectations(...) for sideEffectsNone.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/cases.jsonl` at line 4, the test case currently can't assert "no
harness run" because the expectation only checks status/logs. Update the case in
evals/cases.jsonl to include a new expectation key that lists forbidden side
effects (e.g., "sideEffectsNone": ["harness.run"]), then modify the runner's
checkExpectations function in scripts/evals/run-evals.mjs to add a negative
assertion that fails when any side effect named in sideEffectsNone was recorded
(specifically, ensure harness.run was not called). Reference the expectation key
"sideEffectsNone", the runner helper checkExpectations, and the side-effecting
method harness.run to locate where to add the check.
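
A minimal sketch of the negative assertion, assuming `sideEffectsNone` is an array of forbidden side-effect names (as in the proposed diff) and that the runner has the recorded side effects available as an array of names. The function name `checkSideEffectsNone` is hypothetical; the real check would live inside checkExpectations.

```javascript
// Hypothetical sketch: fails when any forbidden side effect was recorded.
function checkSideEffectsNone(expect, recordedSideEffects) {
  const forbidden = expect.sideEffectsNone ?? [];
  return forbidden
    .filter((name) => recordedSideEffects.includes(name))
    .map((name) => `forbidden side effect recorded: ${name}`);
}

const skipLabelExpect = {
  status: "succeeded",
  eventSource: "github",
  logsAny: ["pr-reviewer skipped"],
  sideEffectsNone: ["harness.run"],
};

// A correct skip records nothing, so there are no failures.
console.log(checkSideEffectsNone(skipLabelExpect, []));
// A regression that still runs the harness is reported.
console.log(checkSideEffectsNone(skipLabelExpect, ["harness.run"]));
```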

{"id":"repo-hygiene.diagnose","agent":"repo-hygiene","kind":"triage","fixture":{"type":"github.pull_request.synchronize","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","title":"Add CSV export endpoint","user":{"login":"alice"},"head":{"sha":"abc123","ref":"feat/x"},"base":{"ref":"main"}},"repository":{"name":"widget","full_name":"acme/widget","owner":{"login":"acme"}}}},"inputs":{"NOTION_DATABASE_ID":"db_test"},"seeds":[{"vfs":"/github/repos/acme/widget/pulls/7/meta.json","file":"github-pr-widget-7-meta.json"}],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"],"logsAny":["repo-hygiene.notion-page.no-receipt","repo-hygiene.journal-failed"]},"rubric":"Diagnoses repo-hygiene issues in the PR and runs the harness; the Notion journal write is best-effort and logs when no receipt comes back."}
{"id":"hn-monitor.nothing-new","agent":"hn-monitor","kind":"scheduled","fixture":{"type":"cron.tick","name":"scan","cron":"0 9,17 * * *"},"inputs":{"SLACK_CHANNEL":"C123","TOPICS":"zzqx-no-topic-will-match-this"},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["hn-monitor.nothing-new"]},"rubric":"With no front-page story matching the topics, the scan posts nothing and logs it found nothing new."}
{"id":"spotify-releases.requires-token","agent":"spotify-releases","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 10 * * *"},"inputs":{"SLACK_USER":"U123"},"seeds":[],"expect":{"status":"failed","errorIncludes":"SPOTIFY_TOKEN"},"rubric":"Missing the required SPOTIFY_TOKEN input must fail loudly rather than silently no-op."}
{"id":"vendor-monitor.requires-channel","agent":"vendor-monitor","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 8 * * 1-5"},"inputs":{},"seeds":[],"expect":{"status":"failed","errorIncludes":"SLACK_CHANNEL"},"rubric":"Missing the required SLACK_CHANNEL input must fail loudly rather than silently no-op."}
{"id":"granola.note-classify","agent":"granola","kind":"triage","fixture":{"type":"granola.file.created","provider":"granola","paths":["/granola/notes/note-2.json"],"resource":{"path":"/granola/notes/note-2.json","kind":"file","id":"note-2","provider":"granola"}},"inputs":{},"seeds":[{"vfs":"/granola/notes/note-2.json","file":"granola-note-prospect.json"}],"expect":{"status":"succeeded","eventSource":"granola","sideEffectsAll":["llm.complete"],"logsAny":["granola-prospect.not-a-prospect"]},"rubric":"Reads the meeting note and classifies whether it describes a sales prospect before doing any Linear/PR work."}
{"id":"cloud-team-implementer.misroute","agent":"cloud-team-implementer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-implementer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."}
{"id":"cloud-team-reviewer.misroute","agent":"cloud-team-reviewer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-reviewer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."}
8 changes: 8 additions & 0 deletions evals/seeds/github-pr-widget-7-meta.json
@@ -0,0 +1,8 @@
{
  "number": 7,
  "title": "Add CSV export endpoint",
  "state": "open",
  "draft": false,
  "labels": [],
  "diff": "diff --git a/src/export.ts b/src/export.ts\nindex 1111111..2222222 100644\n--- a/src/export.ts\n+++ b/src/export.ts\n@@ -1,3 +1,12 @@\n-export function noop() {}\n+import { db } from './db';\n+\n+export async function exportCsv(accountId: string): Promise<string> {\n+ const rows = await db.activity.findMany({ where: { accountId } });\n+ const header = 'id,type,at\\n';\n+ const body = rows.map((r) => `${r.id},${r.type},${r.at}`).join('\\n');\n+ // NOTE: no pagination — will OOM on large accounts\n+ return header + body;\n+}\n"
}
5 changes: 5 additions & 0 deletions evals/seeds/granola-note-prospect.json
@@ -0,0 +1,5 @@
{
  "title": "Intro call — Acme Corp",
  "transcript": "Sarah (Acme, VP Eng): we're a 200-person fintech, evaluating workflow automation. We looked at Zapier and n8n but need something that can run agents against our Linear and GitHub. Budget is approved for this quarter. Can you send pricing and SSO details? We'd want to start a pilot in two weeks.",
  "summary": "Acme Corp evaluating us vs Zapier/n8n; approved budget; asked for pricing + SSO; wants a 2-week pilot."
}
11 changes: 11 additions & 0 deletions evals/seeds/linear-issue-1.json
@@ -0,0 +1,11 @@
{
  "id": "issue-1",
  "identifier": "AR-1",
  "title": "CSV export still failing for Acme",
  "description": "Acme reports the account-activity CSV export times out on large accounts. Needs pagination.",
  "url": "https://linear.app/agentrelay/issue/AR-1",
  "state": { "name": "In Progress", "type": "started" },
  "priority": 2,
  "assignee": { "name": "Benjamin" },
  "team": { "id": "team-1", "key": "AR" }
}
5 changes: 5 additions & 0 deletions evals/seeds/linear-issues.json
@@ -0,0 +1,5 @@
[
{ "id": "AR-1", "identifier": "AR-1", "title": "Export job times out on large accounts", "projectId": "acme", "state": "In Progress", "dueDate": "2026-05-01", "updatedAt": "2026-05-20T00:00:00Z", "objectId": "AR-1" },
{ "id": "AR-2", "identifier": "AR-2", "title": "Add SSO for Acme", "projectId": "acme", "state": "Backlog", "dueDate": "2026-07-01", "updatedAt": "2026-06-05T00:00:00Z", "objectId": "AR-2" },
{ "id": "AR-9", "identifier": "AR-9", "title": "Globex onboarding checklist", "projectId": "globex", "state": "In Progress", "dueDate": "2026-06-15", "updatedAt": "2026-06-09T00:00:00Z", "objectId": "AR-9" }
]
4 changes: 4 additions & 0 deletions evals/seeds/linear-projects.json
@@ -0,0 +1,4 @@
[
{ "id": "acme", "name": "Acme Corp", "description": "Enterprise customer; data-export integration. Domain acme.com." },
{ "id": "globex", "name": "Globex", "description": "Mid-market; onboarding in progress. Domain globex.io." }
]
1 change: 1 addition & 0 deletions evals/seeds/linear-teams.json
@@ -0,0 +1 @@
[ { "id": "team-eng", "name": "Engineering" } ]
2 changes: 2 additions & 0 deletions evals/seeds/linear-workflow-states.json
@@ -0,0 +1,2 @@
[ { "id": "ws-done", "name": "Done", "type": "completed", "teamId": "team-eng" },
{ "id": "ws-prog", "name": "In Progress", "type": "started", "teamId": "team-eng" } ]
1 change: 1 addition & 0 deletions evals/seeds/slack-users.json
@@ -0,0 +1 @@
[ { "id": "U0BENJ", "name": "benjamin", "real_name": "Benjamin Watchdog", "profile": { "email": "benjamin@watchdog.no", "real_name": "Benjamin Watchdog" } } ]