feat: eval scorecard — (persona × profile) score timeline by drewstone · Pull Request #78 · tangle-network/agent-eval

drewstone · 2026-05-22T18:46:24Z

Summary

A single eval run answers "what's the score now?" It cannot answer the question that actually gates a feature PR: did this change regress persona P on model+profile F, even while the aggregate improved? Today, RunRecords land in per-run .runs/ dirs and baseline.ts compares one run to one baseline, so there is no persistent per-cell history.

This PR adds two primitives, built additively on what already exists: RunRecord already carries every cell dimension (model, scenarioId, commitSha, seed, …), and statistics.ts/baseline.ts already do the rigorous comparison. There is no RunRecord schema change.

`AgentProfile` + `agentProfileHash` (`src/agent-profile.ts`)

The harness's unit of variation: model + skills + promptVersion + tools. The model is not a separate axis; it lives inside the profile, so "same model, different skills" is simply two profiles. agentProfileHash is a behaviour-identity hash: skill/tool order doesn't matter, the human-readable id label is excluded, and a blank model fails loudly.
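The three hash properties described above (order-insensitive, label-excluded, fail-loud on a blank model) can be illustrated with a minimal sketch. The `ProfileSketch` shape, the `sketchProfileHash` name, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration; the actual fields and algorithm in `src/agent-profile.ts` may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape, for illustration only.
interface ProfileSketch {
  id: string; // human label; deliberately left out of the hash
  model: string;
  skills: string[];
  promptVersion: string;
  tools: string[];
}

// Sketch of a behaviour-identity hash (not the library's implementation).
function sketchProfileHash(profile: ProfileSketch): string {
  // Fail loudly rather than hashing an unusable profile.
  if (profile.model.trim() === "") {
    throw new Error("profile.model must be non-empty");
  }
  // Sort copies of the lists so skill/tool order cannot change the identity.
  const canonical = JSON.stringify({
    model: profile.model,
    promptVersion: profile.promptVersion,
    skills: [...profile.skills].sort(),
    tools: [...profile.tools].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```

Under these assumptions, two profiles that differ only in `id` or in list order hash identically, while adding or removing a skill yields a different cell.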

`scorecard` (`src/scorecard.ts`)

An append-only JSONL store keyed by (scenarioId, profileHash):

- `recordRuns` / `recordRunsToScorecard`: fold a RunRecord[] plus its AgentProfile into one per-cell entry (per-seed scores + median composite + provenance).
- `loadScorecard`: fold the log into the queryable Scorecard. Append-only means concurrent campaign runs cannot clobber each other; a malformed line is skipped and never breaks the read.
- `diffScorecard`: compare each cell's latest entry to its predecessor (or to a named baselineCommit) using Cohen's d + Welch's t-test, and classify the cell as improved | regressed | flat | new. When n<2, it falls back to a raw-delta threshold.
- `formatScorecardDiff`: the PR-facing report, regressions first.
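The per-cell verdict logic described for diffScorecard can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/scorecard.ts`: the names `classifyCell`, `Thresholds`, and `DEFAULTS` are hypothetical, the threshold values are invented, and a cutoff on the absolute Welch t statistic stands in for a proper p-value computed from the t distribution.

```typescript
type Verdict = "improved" | "regressed" | "flat" | "new";

// Hypothetical thresholds; the real defaults are not stated in this PR.
interface Thresholds {
  minEffect: number; // minimum |Cohen's d|
  minT: number; // minimum |Welch t|, a stand-in for a p-value cutoff
  rawDelta: number; // n<2 fallback: minimum |difference of means|
}
const DEFAULTS: Thresholds = { minEffect: 0.5, minT: 2.0, rawDelta: 0.05 };

const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;

// Unbiased sample variance; callers guarantee xs.length >= 2.
const variance = (xs: number[]): number => {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
};

// Cohen's d with a pooled standard deviation (positive when b scores higher).
function cohensD(a: number[], b: number[]): number {
  const diff = mean(b) - mean(a);
  if (diff === 0) return 0;
  const pooled = Math.sqrt(
    ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
      (a.length + b.length - 2),
  );
  return pooled === 0 ? Math.sign(diff) * Infinity : diff / pooled;
}

// Welch's t statistic: does not assume equal variances across the two runs.
function welchT(a: number[], b: number[]): number {
  const diff = mean(b) - mean(a);
  if (diff === 0) return 0;
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return se === 0 ? Math.sign(diff) * Infinity : diff / se;
}

// prev: per-seed scores of the predecessor entry (undefined if none exists).
// latest: per-seed scores of the newest entry (assumed non-empty).
function classifyCell(
  prev: number[] | undefined,
  latest: number[],
  th: Thresholds = DEFAULTS,
): Verdict {
  if (prev === undefined || prev.length === 0) return "new";
  const delta = mean(latest) - mean(prev);
  let significant: boolean;
  if (prev.length < 2 || latest.length < 2) {
    // Too few seeds for variance-based statistics: use the raw delta.
    significant = Math.abs(delta) >= th.rawDelta;
  } else {
    significant =
      Math.abs(cohensD(prev, latest)) >= th.minEffect &&
      Math.abs(welchT(prev, latest)) >= th.minT;
  }
  if (!significant) return "flat";
  return delta > 0 ? "improved" : "regressed";
}
```

Requiring both an effect-size and a test-statistic threshold is one plausible way to keep seed noise from being reported as a regression; whether the real implementation combines them this way is not stated in the PR.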

Test plan

- `pnpm typecheck`: clean
- `pnpm test`: 1287 passed (26 new: hash determinism / order-insensitivity / fail-loud; record→load round-trip; malformed-line skip; regression vs flat vs improved vs new verdicts; named-baseline diff)
- `pnpm exec biome check src`: clean
- `pnpm build`: green
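The record→load round-trip and malformed-line skip covered by the tests can be pictured with a small sketch of an append-only JSONL store. The `CellEntry` shape, the `appendEntry` / `loadCells` names, and the `scenarioId::profileHash` key format are assumptions for illustration and are not taken from `src/scorecard.ts`.

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical per-cell entry, loosely following the PR description.
interface CellEntry {
  scenarioId: string;
  profileHash: string;
  commitSha: string;
  seedScores: number[];
  medianComposite: number;
}

// Runtime guard so that valid JSON of the wrong shape is also skipped.
function isCellEntry(v: unknown): v is CellEntry {
  if (typeof v !== "object" || v === null) return false;
  const e = v as Record<string, unknown>;
  return (
    typeof e["scenarioId"] === "string" &&
    typeof e["profileHash"] === "string" &&
    typeof e["commitSha"] === "string" &&
    Array.isArray(e["seedScores"]) &&
    typeof e["medianComposite"] === "number"
  );
}

// One JSON object per line; appending never rewrites earlier history.
function appendEntry(path: string, entry: CellEntry): void {
  appendFileSync(path, `${JSON.stringify(entry)}\n`, "utf8");
}

// Fold the log into per-cell histories, oldest entry first.
function loadCells(path: string): Map<string, CellEntry[]> {
  const cells = new Map<string, CellEntry[]>();
  if (!existsSync(path)) return cells;
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // malformed line: skip it instead of failing the whole read
    }
    if (!isCellEntry(parsed)) continue;
    const key = `${parsed.scenarioId}::${parsed.profileHash}`;
    const history = cells.get(key) ?? [];
    history.push(parsed);
    cells.set(key, history);
  }
  return cells;
}

// Usage: two entries for one cell with a corrupt line in between.
const file = join(mkdtempSync(join(tmpdir(), "scorecard-")), "scorecard.jsonl");
const base = { scenarioId: "persona-p", profileHash: "abc123" };
appendEntry(file, { ...base, commitSha: "c1", seedScores: [0.8, 0.82], medianComposite: 0.81 });
appendFileSync(file, "{not json\n", "utf8");
appendEntry(file, { ...base, commitSha: "c2", seedScores: [0.6, 0.62], medianComposite: 0.61 });
const cells = loadCells(file);
// The corrupt line is skipped, so the cell keeps both valid entries in order.
console.log(cells.get("persona-p::abc123")?.map((e) => e.commitSha));
```

Because each write is a single appended line and nothing is rewritten in place, a reader can reconstruct every cell's timeline from the log alone.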

Next: wire this into one agent's canonical.ts (needs a minor release first), so that the runner folds its RunRecord[] into the scorecard and prints the diff.

tangletools

Verified — AgentProfile (behaviour-identity hash, model lives inside the profile) + append-only JSONL scorecard keyed (scenarioId, profileHash). diffScorecard classifies per-cell moves with Cohen's d + Welch's t-test (improved/regressed/flat/new). Additive — no RunRecord schema change; concurrent appends cannot clobber; malformed lines skipped. 26 tests incl. the regression-vs-flat keystone; suite 1287 green; typecheck + biome + build clean.

src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.

tangletools approved these changes May 22, 2026

drewstone merged commit 0e18c55 into main May 22, 2026
1 check failed

drewstone deleted the feat/eval-scorecard branch May 22, 2026 18:46

tangletools mentioned this pull request May 22, 2026

feat(examples): scorecard, held-out-gate, user-simulation-driver #81

Merged

4 tasks

drewstone mentioned this pull request May 22, 2026

fix: commit pr-review-benchmark source — restores green CI on main #83

Merged

3 tasks

drewstone mentioned this pull request May 22, 2026

chore(0.34.0): release — eval scorecard + agent profile cells #84

Merged

3 tasks

drewstone merged 1 commit into main from feat/eval-scorecard
