From c9beb2aa6822e18227296e2c7546fd8d10ed8844 Mon Sep 17 00:00:00 2001
From: Drew Stone <drewstone329@gmail.com>
Date: Sat, 23 May 2026 01:54:40 +0300
Subject: [PATCH] =?UTF-8?q?chore(0.34.0):=20release=20=E2=80=94=20eval=20s?=
 =?UTF-8?q?corecard=20+=20agent=20profile=20cells?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Headline: a feature PR's eval can finally answer the question a single
run cannot — did this change regress persona P on profile F, even
while the aggregate improved? AgentProfile is the unit of variation;
the append-only JSONL scorecard is the per-cell timeline; diffScorecard
issues per-cell verdicts with Cohen's d + Welch's t-test. Also
consolidates the paired statistics, unifies the LLM retry classifier,
commits the long-missing pr-review-benchmark source, and adds three
production-pattern examples.

See CHANGELOG.md for the full list.
---
 CHANGELOG.md                  | 33 +++++++++++++++++++++++++++++++++
 clients/python/pyproject.toml |  2 +-
 package.json                  |  2 +-
 3 files changed, 35 insertions(+), 2 deletions(-)
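
Reviewer note: the per-cell verdict idea in the changelog entry can be sketched roughly as below. This is a minimal illustration and not the code in this release: every name here (`CellSamples`, `verdictFor`, `minEffect`, `tCutoff`) is hypothetical, the thresholds are assumed values, and the sketch compares |t| against a fixed cutoff instead of computing a p-value from the t distribution. The actual `diffScorecard` signature and thresholds may differ.

```typescript
// Hypothetical sketch of a per-cell verdict; not the package's real API.
type Verdict = "improved" | "regressed" | "flat" | "new";

interface CellSamples {
  scenarioId: string;
  profileHash: string;
  baseline: number[]; // scores before the change; empty when the cell is new
  candidate: number[]; // scores after the change
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance (n - 1 denominator); needs at least 2 samples.
const variance = (xs: number[]): number => {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
};

// Divide, mapping a zero denominator to 0 or a signed infinity.
const safeRatio = (num: number, den: number): number =>
  den === 0 ? (num === 0 ? 0 : Math.sign(num) * Infinity) : num / den;

// Cohen's d with a pooled standard deviation (candidate minus baseline).
function cohensD(a: number[], b: number[]): number {
  const pooled = Math.sqrt(
    ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
      (a.length + b.length - 2),
  );
  return safeRatio(mean(b) - mean(a), pooled);
}

// Welch's t statistic: does not assume equal variances or sample sizes.
function welchT(a: number[], b: number[]): number {
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return safeRatio(mean(b) - mean(a), se);
}

// Assumed thresholds: |d| >= 0.5 and |t| >= 2.0 must both hold to leave "flat".
function verdictFor(cell: CellSamples, minEffect = 0.5, tCutoff = 2.0): Verdict {
  if (cell.baseline.length === 0) return "new";
  // Too few samples to estimate a variance: report no change.
  if (cell.baseline.length < 2 || cell.candidate.length < 2) return "flat";
  const d = cohensD(cell.baseline, cell.candidate);
  const t = welchT(cell.baseline, cell.candidate);
  if (Math.abs(d) < minEffect || Math.abs(t) < tCutoff) return "flat";
  return d > 0 ? "improved" : "regressed";
}

// CI-guard shape: collect regressed cells even if the aggregate improved.
const cells: CellSamples[] = [
  { scenarioId: "s1", profileHash: "aaa", baseline: [0.5, 0.52, 0.49, 0.51], candidate: [0.9, 0.92, 0.88, 0.91] },
  { scenarioId: "s2", profileHash: "aaa", baseline: [0.8, 0.82, 0.79, 0.81], candidate: [0.6, 0.62, 0.58, 0.61] },
];
const regressedCells = cells.filter((c) => verdictFor(c) === "regressed");
console.log(regressedCells.map((c) => `${c.scenarioId}/${c.profileHash}`));
```

Requiring both an effect-size floor and a test-statistic floor is one way to keep tiny but consistent shifts, and large but noisy ones, from tripping the guard. A gate would then fail the build when `regressedCells` is non-empty.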

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03e0044..f07a67f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,38 @@
 # Changelog
 
+## 0.34.0 — 2026-05-23
+
+### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard
+
+The headline shift: a feature PR's eval can now answer the question a single
+run cannot — *did this change regress persona P on profile F, even while the
+aggregate improved?*
+
+- **`AgentProfile` + `agentProfileHash`** — the harness's unit of variation.
+  The model lives inside the profile, so "same model, different skills" is
+  two profiles. Identity ignores skill/tool order and excludes the `id` label.
+  (#78)
+- **Append-only JSONL scorecard** keyed `(scenarioId, profileHash)` —
+  `recordRuns` / `recordRunsToScorecard` / `loadScorecard`. Idempotent
+  appends on `eventId`, so concurrent campaign runs cannot clobber each other. (#78)
+- **`diffScorecard`** — per-cell verdict (`improved` / `regressed` / `flat` /
+  `new`) using Cohen's d + Welch's t-test; the keystone CI guard is
+  `diff.cells.filter(c => c.verdict === 'regressed')`. `formatScorecardDiff`
+  renders the PR-facing report. (#78)
+- **Agent profile cells** — `src/agent-profile-cell.ts` extends the profile
+  contract into `RunRecord` rows and `runEvalCampaign` so every campaign row
+  is keyed by `(profile, scenario, seed)` end-to-end. (#79)
+- **Stats consolidation** — `pairedBootstrap`, power analysis, and the
+  paired/Welch primitives now all live in `src/statistics.ts`. (#73)
+- **LLM retry classifier unified** across `llm-client` and `judge-retry`
+  via `isTransientLlmError`. (#74)
+- **`pr-review-benchmark` source committed** — the module has been exported
+  from `index.ts` since the run-record refactor, but the source files were
+  never committed; CI on `main` has been red on #78/#79/#81 as a result. (#83)
+- **Examples**: `scorecard/`, `held-out-gate/`, `user-simulation-driver/`. (#81)
+
+No breaking changes — additive across the board.
+
 ## 0.33.0 — 2026-05-21
 
 ### Release — `decideNextUserTurn` in the published tarball
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
index e5561f7..d37d5f2 100644
--- a/clients/python/pyproject.toml
+++ b/clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "agent-eval-rpc"
-version = "0.33.0"
+version = "0.34.0"
 description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/package.json b/package.json
index cab1c23..c1c17aa 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.33.0",
+  "version": "0.34.0",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {