Add eval harness + cases for all 10 agents #58

Merged
khaliqgant merged 3 commits into main from feat/eval-harness on Jun 11, 2026

Conversation

@khaliqgant khaliqgant commented Jun 11, 2026

Member

Repeatable dry-runs of the showcase agents, ported from the watchdog-agents eval harness. Each case fires one event at one agent's real handler and asserts routing + side effects. Complements the existing npm test unit suite (which uses hand-rolled ctx spies) by running the real handler through the runtime's simulation API and, in live mode, against an actual cheap model.

What's here

  • scripts/evals/run-evals.mjs — two executors:
    • simulate (default): free, offline. Runs each handler through the runtime's simulateInvocation against an in-memory VFS; harness.run/llm.complete are stubbed but recorded as side effects. Asserts status, eventSource, sideEffectsAll/Any, logsAny.
    • live (--live): backs the model calls with a cheap opencode model (gpt-5-nano); --judge grades chat replies against the case rubric (LLM-as-judge).
    • Flat/nested agent layout; seeds materialized to both ctx.files and the disk mount so relayClient reads (linearClient().getIssue) resolve; new expect.status:"failed" + expect.errorIncludes for required-input guard cases.
  • evals/cases.jsonl — 11 cases covering all 10 agents (linear-slack, linear, review, repo-hygiene, hn-monitor, spotify-releases, vendor-monitor, granola, both cloud-team members). 11/11 green in simulate.
  • evals/seeds/*, evals/README.md — fixtures + docs.
  • package.json: evals / evals:live scripts (compile personas first); tsx devDep. .gitignore: .evals/ + provider draft-tree safety net.
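The expectation keys listed above (status, eventSource, sideEffectsAll/Any, logsAny, errorIncludes) could be evaluated along the following lines. This is a minimal sketch, not the code in scripts/evals/run-evals.mjs: the `outcome` shape (`status`, `eventSource`, `sideEffects`, `logs`, `error`) and the substring matching used for logsAny and errorIncludes are assumptions made for illustration.

```javascript
// Hypothetical sketch of expectation checking. The outcome shape and the
// substring semantics are assumptions, not the runner's actual behaviour.
function checkExpectations(expect, outcome) {
  const checks = [];
  const effects = new Set(outcome.sideEffects ?? []);
  const logs = outcome.logs ?? [];

  if (expect.status !== undefined) {
    checks.push({ name: 'status', pass: outcome.status === expect.status });
  }
  if (expect.eventSource !== undefined) {
    checks.push({ name: 'eventSource', pass: outcome.eventSource === expect.eventSource });
  }
  if (expect.sideEffectsAll) {
    // Every listed side effect must have been recorded.
    checks.push({ name: 'sideEffectsAll', pass: expect.sideEffectsAll.every((e) => effects.has(e)) });
  }
  if (expect.sideEffectsAny) {
    // At least one listed side effect must have been recorded.
    checks.push({ name: 'sideEffectsAny', pass: expect.sideEffectsAny.some((e) => effects.has(e)) });
  }
  if (expect.logsAny) {
    // At least one needle must appear as a substring of some log line.
    const hit = expect.logsAny.some((needle) => logs.some((line) => line.includes(needle)));
    checks.push({ name: 'logsAny', pass: hit });
  }
  if (expect.errorIncludes !== undefined) {
    checks.push({ name: 'errorIncludes', pass: String(outcome.error ?? '').includes(expect.errorIncludes) });
  }
  return checks;
}
```

A guard case would then pair `status: "failed"` with `errorIncludes`, so the case passes only when the handler fails for the expected reason.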

How to run

npm run evals                 # simulate, all cases
npm run evals -- --list
npm run evals:live -- --judge # real cheap-model replies + judge
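For orientation, the three flags used above could be interpreted roughly as follows. This is a sketch under assumptions: the function name `parseFlags` and the returned shape are hypothetical, and only `--list`, `--live`, and `--judge` come from this PR.

```javascript
// Hypothetical flag parser. Only --list, --live and --judge are taken from
// the PR description; the function name and return shape are illustrative.
function parseFlags(argv) {
  const flags = { list: false, live: false, judge: false, unknown: [] };
  for (const arg of argv) {
    if (arg === '--list') flags.list = true;
    else if (arg === '--live') flags.live = true;
    else if (arg === '--judge') flags.judge = true;
    else flags.unknown.push(arg);
  }
  // simulate is the default executor; --live switches to the live executor.
  flags.executor = flags.live ? 'live' : 'simulate';
  return flags;
}
```

In a runner this would typically be called as `parseFlags(process.argv.slice(2))`.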

Verification

tsc --noEmit clean · all 10 personas compile · npm run evals → 11/11.

🤖 Generated with Claude Code


Summary by cubic

Adds a repeatable eval harness that runs real agent handlers in simulate or live mode, with 11 cases across all 10 showcase agents. Improves coverage for routing, side effects, and chat reply quality.

  • New Features

    • scripts/evals/run-evals.mjs: simulate (offline, stubbed harness.run/llm.complete) and live (--live via opencode, optional --judge); supports flat and nested agent dirs.
    • Seeds written to in-memory VFS and disk mount so client reads resolve; supports short and exact VFS-path seeds.
    • 11 cases in evals/cases.jsonl with fixtures and expectations, including required-input guard checks (expect.status: "failed" + errorIncludes).
    • Artifacts saved to .evals/runs/<stamp>/{result.json,summary.md}; evals/README.md documents usage and seeding; .gitignore ignores .evals/ and adds a provider draft-tree safety net.
    • package.json: evals / evals:live scripts (compile personas first) and tsx dev dependency.
  • Bug Fixes

    • Updated tests/linear-slack-agent.test.mjs to map SLACK_CHANNEL to TEST_SLACK_CHANNEL via inputSpecs, preventing env collisions and aligning with the harness env mapping.

Written for commit ad556b5. Summary will update on new commits.

Review in cubic

Repeatable dry-runs of the showcase agents, ported from the watchdog-agents
eval harness. Each case fires one event at one agent's handler and asserts
routing + side effects.

- scripts/evals/run-evals.mjs: simulate (free, offline, stubbed harness/llm)
  and live (cheap opencode model + LLM-as-judge) executors. Flat/nested agent
  layout; seeds materialized to both ctx.files and the disk mount so
  relayClient reads (linearClient().getIssue) resolve. New expect.status:
  "failed" + expect.errorIncludes for required-input guard cases.
- evals/cases.jsonl: 11 cases covering linear-slack, linear, review,
  repo-hygiene, hn-monitor, spotify-releases, vendor-monitor, granola, and the
  two cloud-team members. 11/11 green in simulate.
- evals/seeds/*: linear board fixtures + PR meta, granola note, issue alias.
- evals/README.md: documents the two executors, _index.json seeding, what
  simulate can/can't observe, and the guard-case pattern.
- package.json: `evals` / `evals:live` scripts (compile personas first); tsx
  devDep. .gitignore: .evals/ + provider draft-tree safety net.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Warning

Review limit reached

@agent-relay-code[bot], we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 16 minutes and 27 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5a1cbcfd-5f9c-4c9e-8fe0-d6ca76b28a67

📥 Commits

Reviewing files that changed from the base of the PR and between 7c0e59f and ad556b5.

📒 Files selected for processing (1)
  • scripts/evals/run-evals.mjs
📝 Walkthrough

This PR introduces a comprehensive evaluation framework for agent showcase cases. It includes detailed documentation of the eval workflow, 11 structured test cases with supporting seed data, a Node.js eval runner supporting both simulate (deterministic, stubbed) and live (real handler, opencode model) execution modes, and integration into the development workflow via npm scripts and environment configuration.

Changes

Agent Evaluation Framework

| Layer / File(s) | Summary |
| --- | --- |
| Evaluation documentation and architecture (evals/README.md) | README explains repeatable evaluation workflow, distinguishes simulate (deterministic stubs) and live (real handler + opencode model) modes, details VFS seeding via _index.json, documents recorded vs. untracked side effects, and specifies cases.jsonl schema with fixture, inputs, seeds, expect, and rubric fields. |
| Test cases and seed data (evals/cases.jsonl, evals/seeds/*) | 11 structured evaluation cases covering chat, triage, guard, and scheduled agents with fixtures (Slack, Linear, GitHub, cron, Granola), plus supporting seed data: GitHub PR diffs with exportCsv function, Linear issues/projects/teams/workflow states, Slack users, and Granola prospect records. |
| Eval runner script implementation (scripts/evals/run-evals.mjs) | Node.js runner with CLI parsing, simulate executor (seed mounts, stubbed model behavior), live executor (real handler + opencode integration), expectation validation, optional rubric-based grading for live chat cases, and orchestration with result.json/summary.md output and exit code handling. |
| Project configuration and integration (.gitignore, package.json, tests/linear-slack-agent.test.mjs) | Updates .gitignore for eval artifacts and provider drafts, adds compile/evals/evals:live npm scripts, extends devDependencies with tsx and pins @relayfile/adapter-linear to 0.3.11, and wires SLACK_CHANNEL via TEST_SLACK_CHANNEL in linear-slack test context. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • AgentWorkforce/agents#53: The update to tests/linear-slack-agent.test.mjs wiring SLACK_CHANNEL via TEST_SLACK_CHANNEL directly aligns with the newly added linear-slack persona/agent that depends on this environment variable for Slack-channel scoping.

Poem

🐰 Eval cases hop and skip so fine,
Seeds germinate in JSON line,
Simulate runs and live runs dance,
Agents showcase their best stance,
Results bloom in summary's light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly matches the main change: adding an eval harness and test cases for all 10 agents, which is the primary purpose of this PR. |
| Description check | ✅ Passed | The description is comprehensively related to the changeset, detailing the eval harness structure, execution modes, cases, seeds, and how to run them. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a repeatable evaluation suite for showcase agents, adding a runner script (run-evals.mjs), test cases, mock seed data, and npm scripts for both simulated and live runs. The review feedback highlights several key improvements for the runner script: fixing the premature cleanup of the RELAYFILE_MOUNT_ROOT environment variable, adding support for named handler exports, implementing error handling for the opencode process execution, and ensuring that malformed or failed LLM judgments correctly fail the test cases instead of silently passing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread scripts/evals/run-evals.mjs Outdated
const mod = await tsImport(pathToFileURL(agentEntry(testCase.agent)).href, import.meta.url);
if (!event) throw new Error('envelopeToAgentEvent returned null (unsupported envelope)');
const handler = mod.default?.handler ?? mod.default;
await withCaseEnv(personaSpec, testCase.inputs ?? {}, { RELAYFILE_MOUNT_ROOT: mount }, () => handler(ctx, event));

high

The comment on lines 268-273 states that RELAYFILE_MOUNT_ROOT is pinned for the rest of the process to allow fire-and-forget draft writes to complete without falling back to the current working directory. However, because it is passed as part of extraEnv to withCaseEnv, it gets deleted or restored in the finally block of withCaseEnv as soon as the handler promise resolves. This defeats the purpose of pinning it.

To fix this, set process.env.RELAYFILE_MOUNT_ROOT = mount; globally before calling withCaseEnv, and pass an empty object {} for extraEnv so it is not cleaned up prematurely.

    process.env.RELAYFILE_MOUNT_ROOT = mount;
    await withCaseEnv(personaSpec, testCase.inputs ?? {}, {}, () => handler(ctx, event));
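The failure mode described above can be reproduced in isolation. The sketch below is not the runner's withCaseEnv: `withEnv` is a reduced, hypothetical stand-in that sets variables, awaits the callback, and restores them in a `finally` block, and `DEMO_MOUNT_ROOT` is a made-up variable name. It shows that a fire-and-forget task started by the handler sees the restored environment, not the per-case one.

```javascript
// Hypothetical stand-in for withCaseEnv: set env vars, run fn, restore in finally.
async function withEnv(extraEnv, fn) {
  const previous = {};
  for (const [key, value] of Object.entries(extraEnv)) {
    previous[key] = process.env[key];
    process.env[key] = value;
  }
  try {
    return await fn();
  } finally {
    // Runs as soon as fn's promise settles, even if background work is pending.
    for (const [key, value] of Object.entries(previous)) {
      if (value === undefined) delete process.env[key];
      else process.env[key] = value;
    }
  }
}

// Records what a late, un-awaited write observes.
const lateWrite = { ran: false, sawMountRoot: undefined };

// A handler that resolves before its background task runs.
async function lateWriter() {
  setTimeout(() => {
    lateWrite.ran = true;
    lateWrite.sawMountRoot = process.env.DEMO_MOUNT_ROOT;
  }, 10);
  return process.env.DEMO_MOUNT_ROOT; // value visible while the handler runs
}
```

Calling `withEnv({ DEMO_MOUNT_ROOT: '/tmp/demo-mount' }, lateWriter)` returns the pinned value, while the timer callback later reads the variable after it has been restored. Setting the variable on `process.env` before the call and leaving it out of `extraEnv`, as suggested above, avoids the early restore.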

Comment thread scripts/evals/run-evals.mjs Outdated
const rec = await withCaseEnv(persona, testCase.inputs ?? {}, { RELAYFILE_MOUNT_ROOT: tmp }, () =>
simulateInvocation({
persona,
handler: mod.default?.handler ?? mod.default,

medium

The current handler resolution only supports default exports (mod.default?.handler ?? mod.default). If an agent uses a named export (e.g., export const handler = ...), mod.default will be undefined, causing the handler to resolve to undefined and crash at runtime.

Updating this to support named exports improves robustness and prevents future agents from failing. Please apply this same change to line 282 in runLive as well.

Suggested change
-handler: mod.default?.handler ?? mod.default,
+handler: mod.handler ?? mod.default?.handler ?? mod.default,
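Factored into a helper, the resolution order in the suggestion might look like the sketch below. The name `resolveHandler` matches the helper mentioned later in this thread, but the body here is an assumption, and the explicit `TypeError` for a missing handler is an addition for illustration.

```javascript
// Sketch of the suggested resolution order: named export first, then
// default.handler, then the default export itself. The explicit throw is an
// illustrative addition, not something the PR text confirms.
function resolveHandler(mod) {
  const handler = mod.handler ?? mod.default?.handler ?? mod.default;
  if (typeof handler !== 'function') {
    throw new TypeError('agent module does not export a handler function');
  }
  return handler;
}
```

Failing at resolution time gives a clearer error than a later "handler is not a function" from inside the executor.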

maxBuffer: 16 * 1024 * 1024,
env: { ...process.env },
});
const raw = (res.stdout ?? '').replace(/\x1b\[[0-9;]*m/g, '');

medium

spawnSync is called without checking for execution errors or non-zero exit codes. If the opencode binary is missing, misconfigured, or fails during execution, the function will fail silently and return an empty string, leading to hard-to-debug test failures or false positives.

Adding explicit error handling ensures that any issues with the LLM runner are surfaced immediately.

  if (res.error) {
    throw new Error(`Failed to execute opencode: ${res.error.message}`);
  }
  if (res.status !== 0) {
    throw new Error(`opencode exited with code ${res.status}: ${res.stderr || res.stdout}`);
  }
  const raw = (res.stdout ?? '').replace(/\x1b\[[0-9;]*m/g, '');

Comment thread scripts/evals/run-evals.mjs Outdated
// A case may deliberately expect a failure (e.g. a required-input guard throw);
// only treat an unexpected failed status as an automatic fail.
const expectsFailure = (testCase.expect?.status ?? null) === 'failed';
const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass !== false : true);

medium

If the LLM judge fails to parse the response (e.g., due to invalid JSON), judge returns { pass: null, reason: ... }. The check verdict ? verdict.pass !== false : true evaluates to true when verdict.pass is null. This means a failed or malformed judgment silently passes the test instead of failing it.

Changing this to verdict.pass === true ensures that any non-true judgment (including parsing errors) correctly fails the test case.

Suggested change
-const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass !== false : true);
+const passed = checks.every((c) => c.pass) && (expectsFailure || outcome.status !== 'failed') && (verdict ? verdict.pass === true : true);
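The difference between the two comparisons is easiest to see on the four verdict shapes the judge can produce. The sketch below isolates only the verdict term of the expression; the gate names and the sample `reason` string are illustrative.

```javascript
// The verdict term from the original expression and from the suggested fix.
const lenientGate = (verdict) => (verdict ? verdict.pass !== false : true);
const strictGate = (verdict) => (verdict ? verdict.pass === true : true);

const verdicts = [
  { label: 'judge not run', verdict: undefined },
  { label: 'judge passed', verdict: { pass: true } },
  { label: 'judge failed', verdict: { pass: false } },
  { label: 'judge output unparseable', verdict: { pass: null, reason: 'invalid JSON' } },
];

// One row per verdict shape, showing how each gate treats it.
const table = verdicts.map(({ label, verdict }) => ({
  label,
  lenient: lenientGate(verdict),
  strict: strictGate(verdict),
}));

console.table(table);
```

Only the unparseable row differs: `null !== false` is true, so the original gate lets it through, while `null === true` is false and the stricter gate fails the case.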

@agent-relay-code

Contributor

Reviewed PR #58 and made scoped fixes.

Changes made:

Validation run locally:

  • npm ci
  • npm test passed, 60/60
  • npm run typecheck passed
  • npm run evals passed, 11/11
  • npm ls @relayfile/adapter-linear @relayfile/adapter-core @relayfile/relay-helpers passed

Addressed comments

  • No bot or reviewer comments were present in the provided .workforce artifacts, so there were no external review threads to validate or address.

Advisory Notes

  • I could not verify GitHub-side CI status or mergeability from this checkout without using GitHub tooling, so I am not marking this as READY.

@agent-relay-code

Contributor

Reviewed PR #58 and made one scoped runner fix pass in scripts/evals/run-evals.mjs.

Changes made:

  • Added named handler export support for simulate/live eval execution.
  • Made opencode execution failures explicit in live mode.
  • Kept live-mode RELAYFILE_MOUNT_ROOT and WORKSPACE_ROOT pinned after handler return so async writebacks do not fall back to cwd.
  • Made malformed/non-true LLM judge verdicts fail instead of passing.

Validation run locally:

  • npm ci passed
  • npm test passed, 60/60
  • npm run typecheck passed
  • npm run evals passed, 11/11
  • npm ls @relayfile/adapter-linear @relayfile/adapter-core @relayfile/relay-helpers passed

Addressed comments

  • gemini-code-assist[bot]: live-mode mount env was restored immediately despite the comment saying it stays pinned; fixed in scripts/evals/run-evals.mjs.
  • gemini-code-assist[bot]: handler resolution ignored named handler exports; fixed with resolveHandler in scripts/evals/run-evals.mjs, used at scripts/evals/run-evals.mjs and scripts/evals/run-evals.mjs.
  • gemini-code-assist[bot]: spawnSync failures from opencode were swallowed; fixed in scripts/evals/run-evals.mjs.
  • gemini-code-assist[bot]: malformed judge output with pass: null could pass; fixed in scripts/evals/run-evals.mjs.
  • coderabbitai[bot]: processing/status comment only; no concrete code finding to address.
  • agent-relay-code[bot]: prior summary noted scoped fixes already present in the current checkout; I validated them and only changed the still-reproducible Gemini findings above.

Advisory Notes

  • I am not printing READY: GitHub API currently reports combined status pending because CodeRabbit is pending, and PR mergeable_state is unstable, even though local validation is green.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/evals/run-evals.mjs (1)

268-273: 💤 Low value

Temp directory leak is documented but may accumulate.

The comment explains that the mount directory is intentionally not cleaned up to avoid breaking async draft writes, but this means every live eval run leaves a directory in /tmp. Consider tracking these directories and cleaning up older runs (e.g., older than 1 hour) to prevent unbounded accumulation.
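One way to keep such a cleanup testable is to separate the age decision from the filesystem calls. The sketch below is hypothetical: `selectStaleDirs` and the `{ path, mtimeMs }` entry shape are assumptions, and how the runner names or locates its per-run mount directories is not shown in this PR.

```javascript
// Hypothetical helper: choose per-run mount directories older than a threshold.
// `entries` is assumed to be [{ path, mtimeMs }], for example built from
// fs.statSync(path).mtimeMs for each candidate directory.
function selectStaleDirs(entries, nowMs, maxAgeMs = 60 * 60 * 1000) {
  return entries
    .filter((entry) => nowMs - entry.mtimeMs > maxAgeMs)
    .map((entry) => entry.path);
}
```

Each returned path could then be removed with `fs.rmSync(path, { recursive: true, force: true })` before a new mount is created, leaving recent mounts in place for drafts that are still being written.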

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/run-evals.mjs` around lines 268 - 273, The temp mount
directories created and intentionally not removed (see RELAYFILE_MOUNT_ROOT and
local variable mount) can accumulate; add a short cleanup routine at the top of
the run-evals flow that scans the same temp-root pattern used for mount (e.g.,
RELAYFILE_MOUNT_ROOT/* or whatever naming convention creates the per-run mount
dirs), checks mtime/ctime, and deletes any directories older than a threshold
(suggest 1 hour) using fs/stat and fs.rm or rimraf; invoke this cleanup before
creating a new mount to avoid unbounded accumulation while preserving recent
mounts used by running drafts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8cbecaaf-f1a8-4e0d-bd04-2bc29897ec9a

📥 Commits

Reviewing files that changed from the base of the PR and between 946c200 and 7c0e59f.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (14)
  • .gitignore
  • evals/README.md
  • evals/cases.jsonl
  • evals/seeds/github-pr-widget-7-meta.json
  • evals/seeds/granola-note-prospect.json
  • evals/seeds/linear-issue-1.json
  • evals/seeds/linear-issues.json
  • evals/seeds/linear-projects.json
  • evals/seeds/linear-teams.json
  • evals/seeds/linear-workflow-states.json
  • evals/seeds/slack-users.json
  • package.json
  • scripts/evals/run-evals.mjs
  • tests/linear-slack-agent.test.mjs

Comment thread evals/cases.jsonl
{"id":"linear-slack.chat","agent":"linear-slack","kind":"chat","fixture":{"type":"slack.message.created","resource":{"channel":"C0TEST","ts":"100.1","text":"What's open on the board for the export work?","user":"U1"}},"inputs":{"SLACK_CHANNEL":"C0TEST"},"seeds":["linear/projects","linear/issues","linear/teams"],"expect":{"status":"succeeded","eventSource":"slack","sideEffectsAll":["harness.run"]},"rubric":"A grounded Slack answer about open Linear issues for the export work, citing real issues from the board. Read-only unless asked to create; must not fabricate issue refs."}
{"id":"linear.chat","agent":"linear","kind":"chat","fixture":{"type":"linear.AgentSessionEvent.prompted","resource":{"payload":{"agentSession":{"id":"session-1","issue":{"id":"issue-1"}},"agentActivity":{"body":"What's the current status of this issue?"}}}},"inputs":{},"seeds":[{"vfs":"/linear/issues/by-uuid/issue-1.json","file":"linear-issue-1.json"}],"expect":{"status":"succeeded","eventSource":"linear","sideEffectsAll":["llm.complete"],"logsAny":["linear event"]},"rubric":"A grounded conversational status reply about the issue. Read-only: must not claim to have edited or closed anything."}
{"id":"review.review","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":7,"html_url":"https://github.com/acme/widget/pull/7","user":{"login":"alice"},"head":{"sha":"abc123"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","sideEffectsAll":["harness.run"]},"rubric":"A code review that runs the harness against the PR diff and surfaces real issues (e.g. the unpaginated export OOM)."}
{"id":"review.skip-label","agent":"review","kind":"triage","fixture":{"type":"github.pull_request.opened","resource":{"pull_request":{"number":8,"html_url":"https://github.com/acme/widget/pull/8","user":{"login":"alice"},"labels":[{"name":"no-agent-relay-review"}],"head":{"sha":"def456"},"state":"open","draft":false},"repository":{"name":"widget","owner":{"login":"acme"}}}},"inputs":{},"seeds":[],"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},"rubric":"A PR carrying the opt-out label must be skipped without running the review harness."}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rubric requires “no harness run,” but this case cannot enforce it.

Line 4 says skip must happen without running review, but expect only checks status/logs. With current runner checks, this can still pass even if harness.run is called.

Proposed direction (cases + runner)
-{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"]},...}
+{"id":"review.skip-label",...,"expect":{"status":"succeeded","eventSource":"github","logsAny":["pr-reviewer skipped"],"sideEffectsNone":["harness.run"]},...}

And in scripts/evals/run-evals.mjs, add a corresponding negative assertion in checkExpectations(...) for sideEffectsNone.
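A negative assertion of this kind might look like the sketch below, using the array form shown in the proposed case change. The helper name, the `observed` list of recorded side-effect names, and the returned check shape are assumptions; the runner's actual checkExpectations is not shown in this thread.

```javascript
// Hypothetical negative check: none of the forbidden side effects may appear
// in the recorded list. `observed` is assumed to be an array of names such as
// 'harness.run' or 'llm.complete'.
function checkSideEffectsNone(forbidden, observed) {
  const hits = forbidden.filter((name) => observed.includes(name));
  return {
    name: 'sideEffectsNone',
    pass: hits.length === 0,
    detail: hits.length === 0 ? 'none observed' : `unexpected: ${hits.join(', ')}`,
  };
}
```

With this in place, the skip-label case fails if `harness.run` is recorded even when the "pr-reviewer skipped" log line is present.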

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/cases.jsonl` at line 4, The test case currently can't assert "no
harness run" because the expectation only checks status/logs; update the case in
evals/cases.jsonl to include a new boolean flag (e.g., "sideEffectsNone": true)
and then modify the runner's checkExpectations function in
scripts/evals/run-evals.mjs to assert that when sideEffectsNone is true no
side-effecting methods were invoked (specifically ensure harness.run was not
called) by adding a negative assertion path that fails if harness.run or
equivalent side-effect markers were observed; reference the expectation key
"sideEffectsNone", the runner helper checkExpectations, and the side-effecting
method harness.run to locate where to add the check.

@khaliqgant khaliqgant merged commit b109a4d into main Jun 11, 2026
2 checks passed
@khaliqgant khaliqgant deleted the feat/eval-harness branch June 11, 2026 08:47