feat: eval scorecard — (persona × profile) score timeline#78

Merged: drewstone merged 1 commit into main from feat/eval-scorecard, May 22, 2026


@drewstone
Contributor

Summary

A single eval run answers "what's the score now." It cannot answer the question that actually gates a feature PR: did this change regress persona P on model+profile F, even while the aggregate improved? Today RunRecords land in per-run .runs/ dirs and baseline.ts compares one run to one baseline — there is no persistent per-cell history.

This adds two primitives, built additively on what already exists — RunRecord carries every cell dimension (model, scenarioId, commitSha, seed, …); statistics.ts/baseline.ts already do the rigorous comparison. No RunRecord schema change.

AgentProfile + agentProfileHash (src/agent-profile.ts)

The harness's unit of variation — model + skills + promptVersion + tools. The model is not a separate axis; it lives inside the profile, so "same model, different skills" is simply two profiles. agentProfileHash is behaviour identity: skill/tool order doesn't matter, the human id label is excluded, a blank model fails loud.
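
To make the identity rules above concrete, here is a minimal sketch of an order-insensitive profile hash. It is not the code in src/agent-profile.ts: the field types, the canonical JSON layout, and the FNV-1a digest are all assumptions chosen so that the example runs without dependencies.

```typescript
// Hypothetical shape: the PR names these fields but not their exact types.
interface AgentProfile {
  id: string; // human-readable label, deliberately left out of the hash
  model: string;
  skills: string[];
  promptVersion: string;
  tools: string[];
}

// FNV-1a (32-bit) over UTF-16 code units. A stand-in digest; the real
// implementation may well use a cryptographic hash instead.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("00000000" + (h >>> 0).toString(16)).slice(-8);
}

function agentProfileHash(profile: AgentProfile): string {
  // Fail loud: a blank model would silently merge unrelated cells.
  if (profile.model.trim() === "") {
    throw new Error("AgentProfile.model must not be blank");
  }
  // Sort copies of the lists so that skill/tool order cannot change the hash.
  // The `id` label is omitted on purpose.
  const canonical = JSON.stringify({
    model: profile.model,
    skills: [...profile.skills].sort(),
    promptVersion: profile.promptVersion,
    tools: [...profile.tools].sort(),
  });
  return fnv1a(canonical);
}
```

Under this sketch, two profiles that differ only in `id` or in the order of `skills` and `tools` hash to the same value, while adding or removing a skill produces a different profile.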

scorecard (src/scorecard.ts)

An append-only JSONL store keyed (scenarioId, profileHash):

  • recordRuns / recordRunsToScorecard — fold a RunRecord[] + its AgentProfile into one per-cell entry (per-seed scores + median composite + provenance).
  • loadScorecard — fold the log into the queryable Scorecard. Append-only means concurrent campaign runs cannot clobber; a malformed line is skipped, never breaking the read.
  • diffScorecard — compare each cell's latest entry to its predecessor (or a named baselineCommit) with Cohen's d + Welch's t-test → improved | regressed | flat | new. When n<2, it falls back to a raw-delta threshold.
  • formatScorecardDiff — the PR-facing report, regressions first.
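
The load-and-diff flow described in the list above can be sketched roughly as follows. This is an illustration, not the code in src/scorecard.ts: the `ScorecardEntry` fields, the `::` cell key, the `classify` helper, and the thresholds (`dMin`, `tMin`, `rawMin`) are all assumptions, and a fixed |t| cutoff stands in for a proper p-value with Welch–Satterthwaite degrees of freedom.

```typescript
// Hypothetical entry shape; one JSON object per line of the log.
interface ScorecardEntry {
  scenarioId: string;
  profileHash: string;
  commitSha: string;
  scores: number[]; // per-seed scores
}
type Verdict = "improved" | "regressed" | "flat" | "new";

// Returns undefined for a malformed line so one bad write never breaks the read.
function parseEntry(line: string): ScorecardEntry | undefined {
  try {
    const v = JSON.parse(line);
    const ok = v && typeof v.scenarioId === "string" &&
      typeof v.profileHash === "string" && Array.isArray(v.scores) &&
      v.scores.length > 0 && v.scores.every((s: unknown) => typeof s === "number");
    return ok ? (v as ScorecardEntry) : undefined;
  } catch (e) {
    return undefined;
  }
}

// Fold the append-only log into per-cell histories, oldest entry first.
function loadScorecard(jsonl: string): Map<string, ScorecardEntry[]> {
  const cells = new Map<string, ScorecardEntry[]>();
  for (const line of jsonl.split("\n")) {
    const entry = parseEntry(line);
    if (!entry) continue;
    const key = `${entry.scenarioId}::${entry.profileHash}`;
    const history = cells.get(key) ?? [];
    history.push(entry);
    cells.set(key, history);
  }
  return cells;
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;
const variance = (xs: number[]): number => {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
};

// Thresholds are illustrative defaults, not values taken from the PR.
function classify(prev: number[] | undefined, next: number[],
                  dMin = 0.5, tMin = 2, rawMin = 0.05): Verdict {
  if (!prev) return "new";
  const delta = mean(next) - mean(prev);
  const byRawDelta = (): Verdict =>
    Math.abs(delta) < rawMin ? "flat" : delta > 0 ? "improved" : "regressed";
  if (prev.length < 2 || next.length < 2) return byRawDelta(); // n<2 fallback
  const v1 = variance(prev);
  const v2 = variance(next);
  const pooledSd = Math.sqrt(
    ((prev.length - 1) * v1 + (next.length - 1) * v2) / (prev.length + next.length - 2));
  const stdErr = Math.sqrt(v1 / prev.length + v2 / next.length);
  if (pooledSd === 0 || stdErr === 0) return byRawDelta(); // no variance to test against
  const d = delta / pooledSd; // Cohen's d (effect size)
  const t = delta / stdErr;   // Welch's t statistic
  if (Math.abs(d) < dMin || Math.abs(t) < tMin) return "flat";
  return delta > 0 ? "improved" : "regressed";
}

// Compare each cell's latest entry to its immediate predecessor.
function diffScorecard(cells: Map<string, ScorecardEntry[]>): Map<string, Verdict> {
  const verdicts = new Map<string, Verdict>();
  cells.forEach((history, key) => {
    const latest = history[history.length - 1];
    const prev = history.length > 1 ? history[history.length - 2] : undefined;
    verdicts.set(key, classify(prev?.scores, latest.scores));
  });
  return verdicts;
}
```

Requiring both an effect size and a test statistic to clear their thresholds is one way to keep seed noise from being reported as a regression; the named-baseline variant would pick the predecessor by `commitSha` instead of by position.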

Test plan

  • pnpm typecheck — clean
  • pnpm test — 1287 passed (26 new: hash determinism / order-insensitivity / fail-loud; record→load round-trip; malformed-line skip; regression vs flat vs improved vs new verdicts; named-baseline diff)
  • pnpm exec biome check src — clean
  • pnpm build — green

Next: wire into one agent's canonical.ts (needs a minor release first) — the runner folds its RunRecord[] into the scorecard and prints the diff.

A single eval run answers "what's the score now." It cannot answer the
question that actually gates a feature PR: did this change regress
persona P on model+profile F, even while the aggregate improved? Today
RunRecords land in per-run .runs/ dirs and baseline.ts compares one run
to one baseline — there is no per-cell history.

Add two primitives, built additively on what already exists (RunRecord
carries every cell dimension; statistics.ts/baseline.ts do the rigorous
comparison):

- AgentProfile + agentProfileHash — the harness's unit of variation
  (model + skills + promptVersion + tools). The model is not a separate
  axis; it lives in the profile, so "same model, different skills" is
  two profiles. The hash is behaviour identity — skill/tool order does
  not matter, the id label is excluded, a blank model fails loud.

- scorecard — an append-only JSONL store keyed (scenarioId, profileHash).
  recordRuns folds a RunRecord[] into per-cell entries (per-seed scores
  + median composite); loadScorecard folds the log (concurrent appends
  cannot clobber; a malformed line is skipped); diffScorecard compares
  each cell's latest entry to its predecessor (or a named baseline
  commit) with Cohen's d + Welch's t-test → improved | regressed | flat
  | new; formatScorecardDiff renders the PR-facing report.

26 tests; agent-eval suite 1287 green; typecheck + biome + build clean.
Contributor

@tangletools tangletools left a comment

Verified — AgentProfile (behaviour-identity hash, model lives inside the profile) + append-only JSONL scorecard keyed (scenarioId, profileHash). diffScorecard classifies per-cell moves with Cohen's d + Welch's t-test (improved/regressed/flat/new). Additive — no RunRecord schema change; concurrent appends cannot clobber; malformed lines skipped. 26 tests incl. the regression-vs-flat keystone; suite 1287 green; typecheck + biome + build clean.

@drewstone drewstone merged commit 0e18c55 into main May 22, 2026
1 check failed
@drewstone drewstone deleted the feat/eval-scorecard branch May 22, 2026 18:46
drewstone added a commit that referenced this pull request May 22, 2026
src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`,
`summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since
the run-record refactor landed, but `src/pr-review-benchmark.ts` and
its co-located test were authored locally and never committed. A fresh
clone fails typecheck; CI on main has been red on #78, #79, and #81.

The files were already typecheck-clean, biome-clean, and the 5
co-located tests pass. No content changes — only `git add`.
2 participants