AgentWorkforce · khaliqgant · Jun 11, 2026
diff --git a/.gitignore b/.gitignore
@@ -7,3 +7,12 @@ node_modules/
 # Relay VFS runtime state — rewritten on every reconcile cycle, never commit it
 .relay/
 memory/workspace/.relay/
+
+# Eval run artifacts
+.evals/
+# Safety net: stray provider draft trees a mis-pointed mount root can leave behind
+/slack/
+/linear/
+/github/
+/notion/
+/google-mail/
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,92 @@
+# agents evals
+
+Repeatable dry-runs of the showcase agents. Each **case** (`cases.jsonl`) fires
+one event at one agent's handler and checks what it did. This complements
+`npm test` (the `node --test` unit suite + the persona-scope guard): the unit
+tests assert handler internals with hand-rolled spies; these evals run the
+**real handler** through the runtime's simulation API and, in live mode, against
+an actual cheap model.
+
+```sh
+npm run evals                 # simulate mode, all cases (free, offline)
+npm run evals -- --agent review
+npm run evals -- --case linear-slack.chat
+npm run evals -- --suite chat        # by kind: chat|triage|scheduled|guard
+npm run evals -- --list
+npm run evals:live            # real cheap-model replies + LLM judge
+```
+
+`npm run evals` compiles personas first (they are gitignored — `*/persona.json`
+is built from `persona.ts` on demand), then runs the suite.
+
+## Two executors
+
+**simulate** (default) runs each case through the runtime's local
+`simulateInvocation`. The handler executes for real against an in-memory VFS
+seeded from the case, but `harness.run` / `llm.complete` are **stubbed** (no
+model, no tokens). We assert deterministic facts: the run succeeded (or failed
+as expected), it routed to the expected `eventSource`, and the expected side
+effects / log lines appeared. This is the fast, free **routing/plumbing
+regression gate**.
+
+**live** (`--live`) runs the real handler with `harness.run` and `llm.complete`
+backed by a cheap **opencode** model (default `opencode/gpt-5-nano`, override
+with `WD_EVAL_MODEL`), so chat cases get an **actual agent reply**. Add
+`--judge` to grade each chat reply against the case `rubric` with the same cheap
+model (LLM-as-judge). Needs `opencode` on PATH and `OPENCODE_API_KEY`. The judge
+only grades `kind:"chat"` cases; for the rest the routing + side-effect checks
+are the gate.
+
+## How seeding works (and why it's `_index.json`)
+
+The simulator backs the VFS with an in-memory map exposed only through
+`ctx.files.read(exactPath)` — it has **no directory-listing primitive**. A
+case's `seeds` list names provider dirs (`"linear/projects"`), which the runner
+maps to `/linear/projects/_index.json` ← `seeds/linear-projects.json`. Agents
+that enumerate a dir read that blessed `_index.json`; agents that list via
+`sandbox.exec`/`find` see nothing (it's stubbed).
+
+Seeds are **also materialized to the disk mount** (`RELAYFILE_MOUNT_ROOT`), so
+agents that read through a relay-helpers client (e.g.
+`linearClient().getIssue` resolves `/linear/issues/by-uuid/<id>.json`) get their
+data too. Use the long seed form to drop a file at an exact VFS path:
+`{ "vfs": "/linear/issues/by-uuid/issue-1.json", "file": "linear-issue-1.json" }`.
+
+## What simulate can and can't see
+
+Only `harness.run` and `llm.complete` are recorded as side effects.
+`slackClient` / `linearClient` / `githubClient` writes are **not** recorded, and
+real `fetch` calls (hn-monitor's HN Algolia, spotify-releases' Spotify API,
+vendor-monitor's npm registry) hit the network or fail closed. So:
+
+- **harness/llm agents** (linear-slack, linear, review, repo-hygiene, granola)
+  assert the recorded side effect + a happy-path log.
+- **cron warn-only team members** (cloud-team-implementer/reviewer) assert the
+  misroute warning — their real work happens only in the team dispatcher's
+  sandbox, which simulate does not run.
+- **fetch-only agents with no recorded side effect** (spotify-releases,
+  vendor-monitor) can't assert a positive in simulate, so their deterministic
+  coverage is the **required-input guard**: a case with `expect.status:"failed"`
+  + `expect.errorIncludes` proves the agent refuses to run without its inputs.
+
+## Case shape (`cases.jsonl`, one JSON object per line)
+
+```jsonc
+{
+  "id": "review.review", "agent": "review", "kind": "triage",
+  "fixture": { "type": "github.pull_request.opened",
+               "resource": { "pull_request": { "number": 7, ... }, "repository": { ... } } },
+  "inputs": { "APPROVERS": "alice" },
+  "seeds": [{ "vfs": "/github/repos/acme/widget/pulls/7/meta.json", "file": "github-pr-widget-7-meta.json" }],
+  "expect": { "status": "succeeded", "eventSource": "github", "sideEffectsAll": ["harness.run"] },
+  "rubric": "..."
+}
+```
+
+`agent` is the dir path relative to the repo root; both flat (`review`) and
+nested team-member (`competitor/market-competitor`) paths work. `expect` keys:
+`status` (`succeeded` | `failed`), `errorIncludes` (substring of the thrown
+error, for guard cases), `eventSource`, `sideEffectsAll` (all must appear),
+`sideEffectsAny` (at least one), `logsAny` (at least one listed log message).
+
+Artifacts land in `.evals/runs/<stamp>/{result.json,summary.md}` (gitignored).
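Note on the `expect` keys documented in the README above: the runner that evaluates them is not part of this diff, so the following is only a minimal sketch of how those keys could be checked against a run result. The `Expect` and `RunResult` types and the `checkExpect` function are hypothetical names, and matching `logsAny` by substring is an assumption.

```typescript
// Hypothetical shapes: the README names the `expect` keys, but the runner's
// real result type is not shown in this diff.
type Expect = {
  status?: "succeeded" | "failed";
  errorIncludes?: string;
  eventSource?: string;
  sideEffectsAll?: string[];
  sideEffectsAny?: string[];
  logsAny?: string[];
};

type RunResult = {
  status: "succeeded" | "failed";
  error?: string;
  eventSource?: string;
  sideEffects: string[]; // e.g. ["harness.run"]
  logs: string[]; // log messages emitted by the handler
};

// Returns human-readable failures; an empty list means the case passed.
function checkExpect(expect: Expect, result: RunResult): string[] {
  const failures: string[] = [];
  if (expect.status !== undefined && result.status !== expect.status) {
    failures.push(`status: expected ${expect.status}, got ${result.status}`);
  }
  if (
    expect.errorIncludes !== undefined &&
    !(result.error ?? "").includes(expect.errorIncludes)
  ) {
    failures.push(`error does not include "${expect.errorIncludes}"`);
  }
  if (expect.eventSource !== undefined && result.eventSource !== expect.eventSource) {
    failures.push(`eventSource: expected ${expect.eventSource}, got ${result.eventSource}`);
  }
  // sideEffectsAll: every listed effect must have been recorded.
  for (const effect of expect.sideEffectsAll ?? []) {
    if (!result.sideEffects.includes(effect)) {
      failures.push(`missing side effect ${effect}`);
    }
  }
  // sideEffectsAny: at least one listed effect must have been recorded.
  const anyEffects = expect.sideEffectsAny ?? [];
  if (anyEffects.length > 0 && !anyEffects.some((e) => result.sideEffects.includes(e))) {
    failures.push(`none of the side effects [${anyEffects.join(", ")}] appeared`);
  }
  // logsAny: at least one listed message must appear (substring match is assumed).
  const anyLogs = expect.logsAny ?? [];
  if (anyLogs.length > 0 && !anyLogs.some((n) => result.logs.some((l) => l.includes(n)))) {
    failures.push(`none of the log messages [${anyLogs.join(", ")}] appeared`);
  }
  return failures;
}
```

Returning a list of failures rather than a boolean would let a `summary.md` report every mismatch for a case at once; whether the real runner does this is not shown here.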
diff --git a/evals/cases.jsonl b/evals/cases.jsonl
@@ -0,0 +1,11 @@
+{"id":"linear-slack.chat","agent":"linear-slack","kind":"chat","fixture":{"type":"slack.message.created","resource":{"channel":"C0TEST","ts":"100.1","text":"What's open on the board for the export work?","user":"U1"}},"inputs":{"SLACK_CHANNEL":"C0TEST"},"seeds":["linear/projects","linear/issues","linear/teams"],"expect":{"status":"succeeded","eventSource":"slack","sideEffectsAll":["harness.run"]},"rubric":"A grounded Slack answer about open Linear issues for the export work, citing real issues from the board. Read-only unless asked to create; must not fabricate issue refs."}
+{"id":"linear.chat","agent":"linear","kind":"chat","fixture":{"type":"linear.AgentSessionEvent.prompted","resource":{"payload":{"agentSession":{"id":"session-1","issue":{"id":"issue-1"}},"agentActivity":{"body":"What's the current status of this issue?"}}}},"inputs":{},"seeds":[{"vfs":"/linear/issues/by-uuid/issue-1.json","file":"linear-issue-1.json"}],"expect":{"status":"succeeded","eventSource":"linear","sideEffectsAll":["llm.complete"],"logsAny":["linear event"]},"rubric":"A grounded conversational status reply about the issue. Read-only: must not claim to have edited or closed anything."}
+{"id":"review.review","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","user":{"login":"alice"},"head":{"sha":"abc123"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"]},"rubric":"A code review that runs the harness against the PR diff and surfaces real issues (e.g. the unpaginated export OOM)."}
+{"id":"review.skip-label","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":8,"html_url":"https://github.com/acme/widget/pull/8","user":{"login":"alice"},"labels":[{"name":"no-agent-relay-review"}],"head":{"sha":"def456"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},"rubric":"A PR carrying the opt-out label must be skipped without running the review harness."}
+{"id":"repo-hygiene.diagnose","agent":"repo-hygiene","kind":"triage","fixture":{"type":"github.pull_request.synchronize","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","title":"Add CSV export endpoint","user":{"login":"alice"},"head":{"sha":"abc123","ref":"feat/x"},"base":{"ref":"main"}},"repository":{"name":"widget","full_name":"acme/widget","owner":{"login":"acme"}}}},"inputs":{"NOTION_DATABASE_ID":"db_test"},"seeds":[{"vfs":"/github/repos/acme/widget/pulls/7/meta.json","file":"github-pr-widget-7-meta.json"}],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"],"logsAny":["repo-hygiene.notion-page.no-receipt","repo-hygiene.journal-failed"]},"rubric":"Diagnoses repo-hygiene issues in the PR and runs the harness; the Notion journal write is best-effort and logs when no receipt comes back."}
+{"id":"hn-monitor.nothing-new","agent":"hn-monitor","kind":"scheduled","fixture":{"type":"cron.tick","name":"scan","cron":"0 9,17 * * *"},"inputs":{"SLACK_CHANNEL":"C123","TOPICS":"zzqx-no-topic-will-match-this"},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["hn-monitor.nothing-new"]},"rubric":"With no front-page story matching the topics, the scan posts nothing and logs it found nothing new."}
+{"id":"spotify-releases.requires-token","agent":"spotify-releases","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 10 * * *"},"inputs":{"SLACK_USER":"U123"},"seeds":[],"expect":{"status":"failed","errorIncludes":"SPOTIFY_TOKEN"},"rubric":"Missing the required SPOTIFY_TOKEN input must fail loudly rather than silently no-op."}
+{"id":"vendor-monitor.requires-channel","agent":"vendor-monitor","kind":"guard","fixture":{"type":"cron.tick","name":"check","cron":"0 8 * * 1-5"},"inputs":{},"seeds":[],"expect":{"status":"failed","errorIncludes":"SLACK_CHANNEL"},"rubric":"Missing the required SLACK_CHANNEL input must fail loudly rather than silently no-op."}
+{"id":"granola.note-classify","agent":"granola","kind":"triage","fixture":{"type":"granola.file.created","provider":"granola","paths":["/granola/notes/note-2.json"],"resource":{"path":"/granola/notes/note-2.json","kind":"file","id":"note-2","provider":"granola"}},"inputs":{},"seeds":[{"vfs":"/granola/notes/note-2.json","file":"granola-note-prospect.json"}],"expect":{"status":"succeeded","eventSource":"granola","sideEffectsAll":["llm.complete"],"logsAny":["granola-prospect.not-a-prospect"]},"rubric":"Reads the meeting note and classifies whether it describes a sales prospect before doing any Linear/PR work."}
+{"id":"cloud-team-implementer.misroute","agent":"cloud-team-implementer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-implementer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."}
+{"id":"cloud-team-reviewer.misroute","agent":"cloud-team-reviewer","kind":"guard","fixture":{"type":"cron.tick","name":"tick","cron":"0 * * * *"},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"cron","logsAny":["cloud-team-reviewer received a direct event; members are launched by the team dispatcher, not by subscriptions"]},"rubric":"A team member should never act on a direct event — it warns that it is launched by the dispatcher."}
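Note on the `seeds` field in the cases above: the README describes a short form (a provider dir such as `"linear/projects"`) and a long form (`{ "vfs", "file" }`). The sketch below shows one way that mapping could be written. `Seed`, `ResolvedSeed`, and `resolveSeed` are hypothetical names, and resolving the long form's `file` relative to `seeds/` is an assumption based on where the seed files sit in this diff.

```typescript
// Hypothetical types for the two seed forms used in cases.jsonl.
type Seed = string | { vfs: string; file: string };
type ResolvedSeed = { vfsPath: string; seedFile: string };

function resolveSeed(seed: Seed): ResolvedSeed {
  if (typeof seed === "string") {
    // Short form: "linear/projects" maps to the VFS path
    // /linear/projects/_index.json, read from seeds/linear-projects.json.
    const dir = seed.replace(/^\/+|\/+$/g, "");
    return {
      vfsPath: `/${dir}/_index.json`,
      seedFile: `seeds/${dir.split("/").join("-")}.json`,
    };
  }
  // Long form: drop the named seed file at an exact VFS path.
  return { vfsPath: seed.vfs, seedFile: `seeds/${seed.file}` };
}

// Usage: resolve the seeds of one parsed JSONL line.
const exampleLine =
  '{"id":"linear-slack.chat","seeds":["linear/projects","linear/issues","linear/teams"]}';
const exampleCase = JSON.parse(exampleLine) as { id: string; seeds: Seed[] };
const resolvedSeeds = exampleCase.seeds.map(resolveSeed);
console.log(resolvedSeeds);
```

Because the simulator has no directory-listing primitive, the short form only helps agents that read the `_index.json` file directly; agents that fetch a single record need the long form.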
diff --git a/evals/seeds/github-pr-widget-7-meta.json b/evals/seeds/github-pr-widget-7-meta.json
@@ -0,0 +1,8 @@
+{
+  "number": 7,
+  "title": "Add CSV export endpoint",
+  "state": "open",
+  "draft": false,
+  "labels": [],
+  "diff": "diff --git a/src/export.ts b/src/export.ts\nindex 1111111..2222222 100644\n--- a/src/export.ts\n+++ b/src/export.ts\n@@ -1,3 +1,12 @@\n-export function noop() {}\n+import { db } from './db';\n+\n+export async function exportCsv(accountId: string): Promise<string> {\n+  const rows = await db.activity.findMany({ where: { accountId } });\n+  const header = 'id,type,at\\n';\n+  const body = rows.map((r) => `${r.id},${r.type},${r.at}`).join('\\n');\n+  // NOTE: no pagination — will OOM on large accounts\n+  return header + body;\n+}\n"
+}
diff --git a/evals/seeds/granola-note-prospect.json b/evals/seeds/granola-note-prospect.json
@@ -0,0 +1,5 @@
+{
+  "title": "Intro call — Acme Corp",
+  "transcript": "Sarah (Acme, VP Eng): we're a 200-person fintech, evaluating workflow automation. We looked at Zapier and n8n but need something that can run agents against our Linear and GitHub. Budget is approved for this quarter. Can you send pricing and SSO details? We'd want to start a pilot in two weeks.",
+  "summary": "Acme Corp evaluating us vs Zapier/n8n; approved budget; asked for pricing + SSO; wants a 2-week pilot."
+}
diff --git a/evals/seeds/linear-issue-1.json b/evals/seeds/linear-issue-1.json
@@ -0,0 +1,11 @@
+{
+  "id": "issue-1",
+  "identifier": "AR-1",
+  "title": "CSV export still failing for Acme",
+  "description": "Acme reports the account-activity CSV export times out on large accounts. Needs pagination.",
+  "url": "https://linear.app/agentrelay/issue/AR-1",
+  "state": { "name": "In Progress", "type": "started" },
+  "priority": 2,
+  "assignee": { "name": "Benjamin" },
+  "team": { "id": "team-1", "key": "AR" }
+}
diff --git a/evals/seeds/linear-issues.json b/evals/seeds/linear-issues.json
@@ -0,0 +1,5 @@
+[
+  { "id": "AR-1", "identifier": "AR-1", "title": "Export job times out on large accounts", "projectId": "acme", "state": "In Progress", "dueDate": "2026-05-01", "updatedAt": "2026-05-20T00:00:00Z", "objectId": "AR-1" },
+  { "id": "AR-2", "identifier": "AR-2", "title": "Add SSO for Acme", "projectId": "acme", "state": "Backlog", "dueDate": "2026-07-01", "updatedAt": "2026-06-05T00:00:00Z", "objectId": "AR-2" },
+  { "id": "AR-9", "identifier": "AR-9", "title": "Globex onboarding checklist", "projectId": "globex", "state": "In Progress", "dueDate": "2026-06-15", "updatedAt": "2026-06-09T00:00:00Z", "objectId": "AR-9" }
+]
diff --git a/evals/seeds/linear-projects.json b/evals/seeds/linear-projects.json
@@ -0,0 +1,4 @@
+[
+  { "id": "acme", "name": "Acme Corp", "description": "Enterprise customer; data-export integration. Domain acme.com." },
+  { "id": "globex", "name": "Globex", "description": "Mid-market; onboarding in progress. Domain globex.io." }
+]
diff --git a/evals/seeds/linear-teams.json b/evals/seeds/linear-teams.json
@@ -0,0 +1 @@
+[ { "id": "team-eng", "name": "Engineering" } ]
diff --git a/evals/seeds/linear-workflow-states.json b/evals/seeds/linear-workflow-states.json
@@ -0,0 +1,2 @@
+[ { "id": "ws-done", "name": "Done", "type": "completed", "teamId": "team-eng" },
+  { "id": "ws-prog", "name": "In Progress", "type": "started", "teamId": "team-eng" } ]
diff --git a/evals/seeds/slack-users.json b/evals/seeds/slack-users.json
@@ -0,0 +1 @@
+[ { "id": "U0BENJ", "name": "benjamin", "real_name": "Benjamin Watchdog", "profile": { "email": "benjamin@watchdog.no", "real_name": "Benjamin Watchdog" } } ]