chore(0.34.0): release — eval scorecard + agent profile cells by drewstone · Pull Request #84 · tangle-network/agent-eval

drewstone · 2026-05-22T22:54:58Z

Headline

A feature PR's eval can finally answer the question a single run cannot — did this change regress persona P on profile F, even while the aggregate improved?

AgentProfile + agentProfileHash — the harness's unit of variation (feat: eval scorecard — (persona × profile) score timeline #78).
Append-only JSONL scorecard keyed (scenarioId, profileHash) (feat: eval scorecard — (persona × profile) score timeline #78).
diffScorecard — per-cell improved/regressed/flat/new verdict with Cohen's d + Welch's t-test (feat: eval scorecard — (persona × profile) score timeline #78).
Agent profile cells — extends the contract end-to-end through RunRecord + runEvalCampaign (feat(eval): add agent profile cells #79).
Stats consolidation (refactor: consolidate paired-stats + power-analysis into statistics #73), LLM retry classifier unified (refactor: unify the LLM retry classifier across client + judge retry #74).
pr-review-benchmark source committed — restored green CI (fix: commit pr-review-benchmark source — restores green CI on main #83).
Examples: scorecard, held-out-gate, user-simulation-driver (feat(examples): scorecard, held-out-gate, user-simulation-driver #81).
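
To make the first two bullets concrete, here is a minimal sketch of hashing a profile and appending rows keyed by (scenarioId, profileHash). Every name in it (`AgentProfile` fields, `ScorecardRow`, `profileHashSketch`, `appendRow`, `groupByCell`) is hypothetical and is not taken from the agent-eval source. FNV-1a stands in for whatever digest `agentProfileHash` actually uses, and the JSONL log is modelled as an in-memory string rather than a file.

```typescript
// Hypothetical shapes; the real AgentProfile and scorecard row types may differ.
type AgentProfile = { model: string; temperature: number; tools: string[] };
type ScorecardRow = { scenarioId: string; profileHash: string; score: number; at: string };

// Serialize with sorted object keys so that key order does not change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Assumption: 32-bit FNV-1a over UTF-16 code units, used only as a stand-in digest.
function profileHashSketch(profile: AgentProfile): string {
  const text = canonicalJson(profile);
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Append-only: existing lines are never rewritten, a new line is added at the end.
function appendRow(jsonl: string, row: ScorecardRow): string {
  return jsonl + JSON.stringify(row) + "\n";
}

// Rebuild the per-cell score timeline from the log, in append order.
function groupByCell(jsonl: string): Map<string, number[]> {
  const cells = new Map<string, number[]>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as ScorecardRow;
    const key = row.scenarioId + "::" + row.profileHash;
    const scores = cells.get(key) ?? [];
    scores.push(row.score);
    cells.set(key, scores);
  }
  return cells;
}
```

Hashing a canonical form means two profiles that differ only in key order land in the same cell, which is what lets a cell's history accumulate across runs.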

No breaking changes — additive across the board. See CHANGELOG.md for the full list.

Test plan

pnpm typecheck — 0 errors
pnpm test — 1306 passed (135 files)
pnpm build — green; OpenAPI spec emitted

After merge: push tag v0.34.0 to trigger publish.yml for the npm + Python tarballs.

Headline: a feature PR's eval can finally answer the question a single run cannot — did this change regress persona P on profile F, even while the aggregate improved? AgentProfile is the unit of variation; the append-only JSONL scorecard is the per-cell timeline; diffScorecard issues per-cell verdicts with Cohen's d + Welch's t-test. Also consolidates the paired statistics, unifies the LLM retry classifier, commits the long-missing pr-review-benchmark source, and adds three production-pattern examples. See CHANGELOG.md for the full list.
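
As a rough illustration of what a per-cell verdict with Cohen's d and Welch's t-test could look like, the sketch below compares one cell's scores before and after a change. The function names, the `Verdict` type, and both thresholds (|d| ≥ 0.2, |t| ≥ 2) are assumptions chosen for the example. They are not the values or signatures `diffScorecard` uses, and the sketch stops at the t statistic instead of computing a p-value.

```typescript
type Verdict = "improved" | "regressed" | "flat" | "new";

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1); needs at least two samples.
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Cohen's d using the pooled standard deviation of the two samples.
function cohensD(before: number[], after: number[]): number {
  const n1 = before.length;
  const n2 = after.length;
  const pooledVar =
    ((n1 - 1) * sampleVariance(before) + (n2 - 1) * sampleVariance(after)) / (n1 + n2 - 2);
  return (mean(after) - mean(before)) / Math.sqrt(pooledVar);
}

// Welch's t statistic with Welch-Satterthwaite degrees of freedom.
function welchT(before: number[], after: number[]): { t: number; df: number } {
  const a = sampleVariance(before) / before.length;
  const b = sampleVariance(after) / after.length;
  const t = (mean(after) - mean(before)) / Math.sqrt(a + b);
  const df = ((a + b) * (a + b)) / ((a * a) / (before.length - 1) + (b * b) / (after.length - 1));
  return { t, df };
}

// Hypothetical decision rule: thresholds are illustrative, not the library's.
function cellVerdict(before: number[], after: number[]): Verdict {
  if (before.length === 0) return "new"; // no history for this cell yet
  if (before.length < 2 || after.length < 2) return "flat"; // too few samples to test
  const diff = mean(after) - mean(before);
  if (diff === 0) return "flat";
  const d = cohensD(before, after);
  const { t } = welchT(before, after);
  if (Math.abs(d) < 0.2 || Math.abs(t) < 2) return "flat";
  return diff > 0 ? "improved" : "regressed";
}
```

Running the rule per (scenarioId, profileHash) cell, not on the pooled scores, is what surfaces a regression in one cell even when the aggregate mean goes up.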

tangletools

Verified — 0.34.0 release bump. package.json + clients/python/pyproject.toml in lockstep at 0.34.0; CHANGELOG entry covers the scorecard, agent-profile-cells, stats consolidation, llm-retry unify, pr-review-benchmark commit, and examples. typecheck 0, suite 1306 green, build emits OpenAPI cleanly. Tag v0.34.0 after merge to trigger publish.yml.

tangletools approved these changes May 22, 2026


drewstone merged commit 13c995a into main May 22, 2026
1 check passed

drewstone deleted the chore/release-0.34.0 branch May 22, 2026 22:55

drewstone mentioned this pull request May 22, 2026

chore(release): sync agent-eval 0.33.1 #80

Closed

drewstone merged 1 commit into main from chore/release-0.34.0
