From c9beb2aa6822e18227296e2c7546fd8d10ed8844 Mon Sep 17 00:00:00 2001
From: Drew Stone
Date: Sat, 23 May 2026 01:54:40 +0300
Subject: [PATCH] =?UTF-8?q?chore(0.34.0):=20release=20=E2=80=94=20eval=20s?=
 =?UTF-8?q?corecard=20+=20agent=20profile=20cells?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Headline: a feature PR's eval can finally answer the question a single
run cannot — did this change regress persona P on profile F, even while
the aggregate improved? AgentProfile is the unit of variation; the
append-only JSONL scorecard is the per-cell timeline; diffScorecard
issues per-cell verdicts with Cohen's d + Welch's t-test.

Also consolidates the paired statistics, unifies the LLM retry
classifier, commits the long-missing pr-review-benchmark source, and
adds three production-pattern examples. See CHANGELOG.md for the full
list.
---
 CHANGELOG.md                  | 33 +++++++++++++++++++++++++++++++++
 clients/python/pyproject.toml |  2 +-
 package.json                  |  2 +-
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03e0044..f07a67f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,38 @@
 # Changelog
 
+## 0.34.0 — 2026-05-23
+
+### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard
+
+The headline shift: a feature PR's eval can now answer the question a single
+run cannot — *did this change regress persona P on profile F, even while the
+aggregate improved?*
+
+- **`AgentProfile` + `agentProfileHash`** — the harness's unit of variation.
+  Model lives inside the profile (skill/tool order doesn't matter; the `id`
+  label is excluded from identity), so "same model, different skills" is two
+  profiles. (#78)
+- **Append-only JSONL scorecard** keyed `(scenarioId, profileHash)` —
+  `recordRuns` / `recordRunsToScorecard` / `loadScorecard`. Idempotent
+  appends on `eventId` so concurrent campaign runs cannot clobber. (#78)
+- **`diffScorecard`** — per-cell verdict (`improved` / `regressed` / `flat` /
+  `new`) using Cohen's d + Welch's t-test; the keystone CI guard is
+  `diff.cells.filter(c => c.verdict === 'regressed')`. `formatScorecardDiff`
+  renders the PR-facing report. (#78)
+- **Agent profile cells** — `src/agent-profile-cell.ts` extends the profile
+  contract into `RunRecord` rows and `runEvalCampaign` so every campaign row
+  is keyed by `(profile, scenario, seed)` end-to-end. (#79)
+- **Stats consolidation** — `pairedBootstrap`, power analysis, and the
+  paired/Welch primitives now all live in `src/statistics.ts`. (#73)
+- **LLM retry classifier unified** across `llm-client` and `judge-retry`
+  via `isTransientLlmError`. (#74)
+- **`pr-review-benchmark` source committed** — the module was exported from
+  `index.ts` since the run-record refactor but the source files were never
+  committed; CI on `main` has been red on #78/#79/#81 as a result. (#83)
+- **Examples**: `scorecard/`, `held-out-gate/`, `user-simulation-driver/`. (#81)
+
+No breaking changes — additive across the board.
+
 ## 0.33.0 — 2026-05-21
 
 ### Release — `decideNextUserTurn` in the published tarball
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
index e5561f7..d37d5f2 100644
--- a/clients/python/pyproject.toml
+++ b/clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "agent-eval-rpc"
-version = "0.33.0"
+version = "0.34.0"
 description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/package.json b/package.json
index cab1c23..c1c17aa 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.33.0",
+  "version": "0.34.0",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {