feat: eval scorecard — (persona × profile) score timeline #78
Merged
Conversation
A single eval run answers "what's the score now." It cannot answer the question that actually gates a feature PR: did this change regress persona P on model+profile F, even while the aggregate improved? Today RunRecords land in per-run .runs/ dirs and baseline.ts compares one run to one baseline; there is no per-cell history.

Add two primitives, built additively on what already exists (RunRecord carries every cell dimension; statistics.ts/baseline.ts do the rigorous comparison):

- AgentProfile + agentProfileHash: the harness's unit of variation (model + skills + promptVersion + tools). The model is not a separate axis; it lives in the profile, so "same model, different skills" is two profiles. The hash is behaviour identity: skill/tool order does not matter, the id label is excluded, and a blank model fails loud.
- scorecard: an append-only JSONL store keyed (scenarioId, profileHash). recordRuns folds a RunRecord[] into per-cell entries (per-seed scores + median composite); loadScorecard folds the log (concurrent appends cannot clobber; a malformed line is skipped); diffScorecard compares each cell's latest entry to its predecessor (or a named baseline commit) with Cohen's d + Welch's t-test → improved | regressed | flat | new; formatScorecardDiff renders the PR-facing report.

26 tests; agent-eval suite 1287 green; typecheck + biome + build clean.
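To make the "behaviour identity" idea concrete, here is a minimal sketch of an order-insensitive profile hash. It is not the code in this PR: the `SketchProfile` shape, the `sketchProfileHash` name, and the choice of SHA-256 over a canonical JSON string are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape; the real AgentProfile may carry different fields.
interface SketchProfile {
  id: string; // human label, deliberately left out of the hash
  model: string;
  skills: string[];
  promptVersion: string;
  tools: string[];
}

// Hash only what changes behaviour. Skill and tool lists are sorted on a copy,
// so their order cannot change the result; `id` is never read.
function sketchProfileHash(profile: SketchProfile): string {
  if (profile.model.trim() === "") {
    // Fail loud rather than hashing an under-specified profile.
    throw new Error("profile.model must not be blank");
  }
  const canonical = JSON.stringify({
    model: profile.model,
    promptVersion: profile.promptVersion,
    skills: [...profile.skills].sort(),
    tools: [...profile.tools].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```

Under these assumptions, two profiles that differ only in label or list order collapse into the same cell, while adding or removing a skill yields a different hash and therefore a different cell.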
tangletools (Contributor) approved these changes on May 22, 2026 and left a comment:
Verified — AgentProfile (behaviour-identity hash, model lives inside the profile) + append-only JSONL scorecard keyed (scenarioId, profileHash). diffScorecard classifies per-cell moves with Cohen's d + Welch's t-test (improved/regressed/flat/new). Additive — no RunRecord schema change; concurrent appends cannot clobber; malformed lines skipped. 26 tests incl. the regression-vs-flat keystone; suite 1287 green; typecheck + biome + build clean.
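The per-cell classification mentioned above could look roughly like the sketch below. It is illustrative only: the function names and all three thresholds (`MIN_EFFECT`, `T_CRITICAL`, `RAW_DELTA`) are assumptions, and instead of deriving a p-value from the t-distribution it compares Welch's t statistic against a fixed critical value.

```typescript
type Verdict = "improved" | "regressed" | "flat" | "new";

// Hypothetical thresholds; the PR does not state the values it uses.
const MIN_EFFECT = 0.5; // minimum |Cohen's d| to count as a real move
const T_CRITICAL = 2.0; // stand-in for a proper p-value cut-off
const RAW_DELTA = 0.05; // fallback threshold when either side has n < 2

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance; callers guarantee xs.length >= 2.
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// Divide, but map a zero denominator to 0 or a signed Infinity.
function ratio(diff: number, denom: number): number {
  if (denom === 0) return diff === 0 ? 0 : Math.sign(diff) * Infinity;
  return diff / denom;
}

// Cohen's d using the pooled standard deviation of both samples.
function cohensD(prev: number[], next: number[]): number {
  const pooledVar =
    ((prev.length - 1) * variance(prev) + (next.length - 1) * variance(next)) /
    (prev.length + next.length - 2);
  return ratio(mean(next) - mean(prev), Math.sqrt(pooledVar));
}

// Welch's t statistic: unequal variances, unequal sample sizes.
function welchT(prev: number[], next: number[]): number {
  const se = Math.sqrt(variance(prev) / prev.length + variance(next) / next.length);
  return ratio(mean(next) - mean(prev), se);
}

function classifyCell(prev: number[] | undefined, next: number[]): Verdict {
  if (prev === undefined) return "new"; // no earlier entry for this cell
  const delta = mean(next) - mean(prev);
  if (prev.length < 2 || next.length < 2) {
    // Too few seeds for variance-based statistics: raw-delta fallback.
    if (Math.abs(delta) < RAW_DELTA) return "flat";
    return delta > 0 ? "improved" : "regressed";
  }
  const significant =
    Math.abs(cohensD(prev, next)) >= MIN_EFFECT && Math.abs(welchT(prev, next)) >= T_CRITICAL;
  if (!significant) return "flat";
  return delta > 0 ? "improved" : "regressed";
}
```

Requiring both an effect size and a test statistic means a tiny but consistent shift, or a large but noisy one, stays `flat` rather than being reported as a regression.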
drewstone added a commit that referenced this pull request on May 22, 2026
src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.
Summary

A single eval run answers "what's the score now." It cannot answer the question that actually gates a feature PR: did this change regress persona P on model+profile F, even while the aggregate improved? Today `RunRecord`s land in per-run `.runs/` dirs and `baseline.ts` compares one run to one baseline; there is no persistent per-cell history.

This adds two primitives, built additively on what already exists: `RunRecord` carries every cell dimension (`model`, `scenarioId`, `commitSha`, `seed`, …), and `statistics.ts` / `baseline.ts` already do the rigorous comparison. No `RunRecord` schema change.

`AgentProfile` + `agentProfileHash` (`src/agent-profile.ts`)

The harness's unit of variation: model + skills + promptVersion + tools. The model is not a separate axis; it lives inside the profile, so "same model, different skills" is simply two profiles. `agentProfileHash` is behaviour identity: skill/tool order doesn't matter, the human `id` label is excluded, and a blank model fails loud.

`scorecard` (`src/scorecard.ts`)

An append-only JSONL store keyed `(scenarioId, profileHash)`:

- `recordRuns` / `recordRunsToScorecard`: fold a `RunRecord[]` + its `AgentProfile` into one per-cell entry (per-seed scores + median composite + provenance).
- `loadScorecard`: fold the log into the queryable `Scorecard`. Append-only means concurrent campaign runs cannot clobber; a malformed line is skipped, never breaking the read.
- `diffScorecard`: compare each cell's latest entry to its predecessor (or a named `baselineCommit`) with Cohen's d + Welch's t-test → `improved | regressed | flat | new`. n<2 falls back to a raw-delta threshold.
- `formatScorecardDiff`: the PR-facing report, regressions first.

Test plan

- `pnpm typecheck`: clean
- `pnpm test`: 1287 passed (26 new: hash determinism / order-insensitivity / fail-loud; record→load round-trip; malformed-line skip; regression vs flat vs improved vs new verdicts; named-baseline diff)
- `pnpm exec biome check src`: clean
- `pnpm build`: green

Next: wire into one agent's `canonical.ts` (needs a minor release first); the runner folds its `RunRecord[]` into the scorecard and prints the diff.
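As a rough illustration of the append-then-fold pattern behind the store, the sketch below appends one JSON line per cell entry and rebuilds per-cell history on read, skipping lines that do not parse. The `CellEntry` shape, the helper names, and the `scenarioId::profileHash` key format are hypothetical and are not taken from `src/scorecard.ts`.

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical entry shape; the real entries also carry provenance.
interface CellEntry {
  scenarioId: string;
  profileHash: string;
  commitSha: string;
  seedScores: number[];
  medianComposite: number;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Writers only ever add a line; existing lines are never rewritten.
function appendEntry(path: string, e: Omit<CellEntry, "medianComposite">): void {
  const entry: CellEntry = { ...e, medianComposite: median(e.seedScores) };
  appendFileSync(path, JSON.stringify(entry) + "\n");
}

function isCellEntry(v: unknown): v is CellEntry {
  if (typeof v !== "object" || v === null) return false;
  const e = v as Record<string, unknown>;
  return (
    typeof e.scenarioId === "string" &&
    typeof e.profileHash === "string" &&
    Array.isArray(e.seedScores)
  );
}

// Fold the log into per-cell histories, oldest entry first.
function loadCells(path: string): Map<string, CellEntry[]> {
  const cells = new Map<string, CellEntry[]>();
  if (!existsSync(path)) return cells;
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // malformed line: skip it, keep reading
    }
    if (!isCellEntry(parsed)) continue;
    const key = `${parsed.scenarioId}::${parsed.profileHash}`;
    const history = cells.get(key) ?? [];
    history.push(parsed);
    cells.set(key, history);
  }
  return cells;
}

// Usage: two entries for one cell with a corrupt line in between.
const logPath = join(mkdtempSync(join(tmpdir(), "scorecard-")), "scorecard.jsonl");
appendEntry(logPath, { scenarioId: "persona-p", profileHash: "abc", commitSha: "c1", seedScores: [0.8, 0.7, 0.9] });
appendFileSync(logPath, "{not json\n");
appendEntry(logPath, { scenarioId: "persona-p", profileHash: "abc", commitSha: "c2", seedScores: [0.6, 0.5] });
const demoCells = loadCells(logPath);
```

In this sketch the last element of each history is the "latest entry" and the one before it is the "predecessor", which is the pair a diff step would compare.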