feat: 0.2.0 — nine framework primitives for stellar evals by drewstone · Pull Request #1 · tangle-network/agent-eval

drewstone · 2026-04-20T23:03:34Z

Summary

Ships the nine primitives every hand-rolled eval in tax/legal/gtm had to reinvent, as framework with real regression coverage.

Capabilities added

PromptRegistry — versioned prompts + SHA-256 content hash, idempotent re-register, loud error on conflict
TraceStore (Memory + FileSystem NDJSON with rollover) — per-LLM-call log queryable by runId/scenarioId/role
AntiSlopJudge — deterministic pattern check (banned phrases, hedging, apology, repetition, length); plugs into the judge array
ArtifactValidator — pluggable file/JSON/blob validators with partial credit; built-ins for regex, keys, size, substrings; composable with weighted aggregation
WorkspaceInspector — post-run state snapshot + fileExists/fileContains/rowCount/rowWhere assertions
ExperimentTracker — run grouping + persistence + per-scenario/aggregate/config diff between runs
PromptOptimizer — A/B variants with per-variant bootstrap CI + pairwise Mann-Whitney U; only flags winner as 'significant' when it beats every alternative
DualAgentBench — proposer/critic convergence pattern with round history + aggregate convergence rate
Statistics + — pairedTTest (with Student-t CDF), wilcoxonSignedRank, cohensD
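
The PromptRegistry contract listed above (content hash, idempotent re-register, loud conflict) can be illustrated with a minimal sketch. The class name `SketchPromptRegistry`, the `PromptEntry` shape, and the `key@version` lookup id are assumptions made for illustration; the actual 0.2.0 API is not shown in this PR text.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real PromptRegistry types may differ.
interface PromptEntry {
  key: string;
  version: string;
  content: string;
  hash: string; // SHA-256 hex digest of the prompt content
}

class SketchPromptRegistry {
  private entries = new Map<string, PromptEntry>();

  register(key: string, version: string, content: string): PromptEntry {
    const hash = createHash("sha256").update(content, "utf8").digest("hex");
    const id = `${key}@${version}`;
    const existing = this.entries.get(id);
    if (existing) {
      // Same content on the same version: idempotent, return the stored entry.
      if (existing.hash === hash) return existing;
      // Different content on the same version: fail loudly instead of overwriting.
      throw new Error(`Prompt conflict for ${id}: ${existing.hash} != ${hash}`);
    }
    const entry: PromptEntry = { key, version, content, hash };
    this.entries.set(id, entry);
    return entry;
  }

  get(key: string, version: string): PromptEntry {
    const entry = this.entries.get(`${key}@${version}`);
    // No silent defaults: unknown keys are an error.
    if (!entry) throw new Error(`Unknown prompt ${key}@${version}`);
    return entry;
  }
}
```

Keying on version and comparing hashes means a prompt edit without a version bump is caught at registration time, not discovered later in A/B history.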

Tests

11 files, 119 passing (up from 27). Every regression case names the specific bug it would catch.

Diff

| Before | After |
| --- | --- |
| 12 src files, 2,008 LOC | 21 src files, ~4,500 LOC |
| 3 test files, 27 tests | 11 test files, 119 tests |

API stability

Strictly additive. Every 0.1.0 export still works unchanged.

Migration plan (separate PRs)

Don't migrate tax/legal/gtm until their existing features are reproducible in <200 lines of domain config on top of 0.2.0 — bar set per Drew's 'strictly better first' directive. Film-agent will get migrated first as the proof case.

Every hand-rolled eval in the fleet (tax/legal/gtm, plus a dozen future verticals) had to re-invent the same five patterns. This release ships them as framework, with real tests, so consumers write <200 lines of domain config on top instead of 2,000+ lines of bespoke harness.

## What lands

1. PromptRegistry: versioned prompts with SHA-256 content hashing. get() throws on unknown keys (no silent defaults). Re-registering identical content is idempotent; conflicting content on the same version errors loudly so A/B history stays auditable.
2. TraceStore: pluggable per-LLM-call log (prompt, output, tokens, cost, timing). MemoryTraceStore for tests; FileSystemTraceStore for append-only NDJSON segments with rollover past 32 MB. Malformed lines skipped so one bad row cannot kill a whole query.
3. AntiSlopJudge: deterministic pattern-based quality check, no LLM call. Catches banned phrases, hedging, apology padding, n-gram repetition, and length bounds. Emits a JudgeScore with dimension 'anti_slop' so it composes into BenchmarkRunner's judge array transparently. Counts every occurrence, not just first match.
4. ArtifactValidator: interface for grading a produced file/JSON/blob. Built-ins: regexMatch, jsonHasKeys (partial credit), byteLengthRange (proportional score), containsAll (case-insensitive default). composeValidators aggregates pass as AND, score as weighted mean, issues namespaced by validator.
5. WorkspaceInspector: post-run state snapshot + assertion library. fileExists, fileContains, rowCount (proportional score inside/outside range), rowWhere (predicate-based row matching). runAssertions aggregates for a snapshot.
6. ExperimentTracker: groups runs, persists via pluggable ExperimentStore, computes per-scenario diff + aggregate delta + config delta between two completed runs. Not MLflow; the 20% that actually ships. Timeline view for drift tracking.
7. PromptOptimizer: A/B test N prompt variants with statistical rigor. Per-variant bootstrap CI + pairwise Mann-Whitney U. Winner is only marked 'significant' when it beats every other variant at the threshold; no declaring winners on noise. Rejects <2 variants, empty scenarios, non-finite scores.
8. DualAgentBench: proposer/critic convergence pattern lifted from tax-agent + legal-agent. Proposer sees prior critique each round. Configurable max rounds + convergence threshold. Aggregate convergenceRate + avgRoundsToConverge on the report.
9. Expanded statistics: pairedTTest (for before/after with paired removal of inter-item variance via Student-t CDF approximation), wilcoxonSignedRank (non-parametric alternative), cohensD (standardized effect size). Complements existing bootstrap CI, Mann-Whitney, inter-rater reliability.

## Tests

11 test files, 119 passing (was 27). Every regression case names the specific bug it would catch.

## Deps

Adds @types/node to devDependencies (required for FileSystemTraceStore). No runtime deps change; framework is still dependency-lean.

## Bump: 0.1.0 -> 0.2.0 (strictly additive; no existing API changed)
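
The AntiSlopJudge note that it counts every occurrence, not just the first match, is worth making concrete. The sketch below is an assumption-laden illustration: `SlopRule`, `countSlop`, `antiSlopScore`, and the linear per-hit penalty are hypothetical, and only the `'anti_slop'` dimension name comes from the commit message.

```typescript
// Hypothetical rule and finding shapes, for illustration only.
interface SlopRule { label: string; pattern: RegExp; }
interface SlopFinding { label: string; count: number; }

function countSlop(text: string, rules: SlopRule[]): SlopFinding[] {
  return rules.map(({ label, pattern }) => {
    // Force the global flag so exec() walks every occurrence, not just the first.
    const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
    const global = new RegExp(pattern.source, flags);
    let count = 0;
    let m: RegExpExecArray | null;
    while ((m = global.exec(text)) !== null) {
      count++;
      if (m[0] === "") global.lastIndex++; // avoid looping forever on empty matches
    }
    return { label, count };
  });
}

function antiSlopScore(text: string, rules: SlopRule[], penaltyPerHit = 0.1) {
  const findings = countSlop(text, rules);
  const hits = findings.reduce((sum, f) => sum + f.count, 0);
  // Assumed scoring rule: linear penalty per hit, floored at 0.
  const score = Math.max(0, 1 - hits * penaltyPerHit);
  return { dimension: "anti_slop", score, findings };
}
```

Because the check is a pure function of the text, the same output always receives the same score, which is what lets it sit in a judge array next to LLM-backed judges without adding variance.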

…arness

Ships the evaluation platform spine. Every score, failure class, and metric is now a view over a shared TraceSchema v1 corpus (Run + Span + Event + ToolCall + Retrieval + JudgeVerdict + BudgetLedger + Artifact). Pipelines query traces; they don't hook live agents. This unlocks retroactive re-scoring, cross-run diff, and prod/eval observability parity.

Chassis:
- trace/schema.ts: TraceSchema v1 (OTEL-compatible wire, agent-extended kinds)
- trace/store.ts: InMemoryTraceStore + FileSystemTraceStore (NDJSON, 32MB rollover)
- trace/emitter.ts: hierarchical span builder with auto-parenting + within()
- trace/query.ts: typed query helpers (spans by kind, argHash, aggregateLlm)
- trace/redact.ts: DEFAULT_REDACTION_RULES (email/ssn/card/bearer/private-key)
- trace/otel.ts: exportRunAsOtlp for Jaeger/Honeycomb/Langfuse ingest

Producers:
- sandbox-harness.ts: SubprocessSandboxDriver + DockerSandboxDriver + vitest/pytest/jest parsers; emits SandboxSpan
- test-graded-scenario.ts: SWE-bench pattern, score = testsPassed/testsTotal
- budget-guard.ts: enforces tokens/wallMs/calls/usd; BudgetBreachError + ledger
- failure-taxonomy.ts: 17-class enum + rule-based classifier
- trajectory.ts: DFS chronological view over run spans
- tool-use-metrics.ts: error/duplicate/retry rates + optional selectionLabels

Pipelines (views over traces):
- stuck-loop, tool-waste, budget-breach, failure-cluster, judge-agreement, first-divergence, regression

Auxiliary decision modules (also consumable over trace data):
- slo.ts (DEFAULT_AGENT_SLOS + critical/warning severity)
- baseline.ts (Welch's t-test + Cohen's d + per-side IQR stability)
- oracle.ts (textInSnapshot / urlContains / jsonShape / regex / notBlocked)
- pareto.ts, cost-tracker.ts, series-convergence.ts, state-continuity.ts

169 tests pass. Typecheck + tsup build clean. Legacy LlmTrace removed; nothing external consumed it yet.
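
The idea that metrics are views over stored traces, not hooks on a live agent, can be sketched as a pure function over recorded spans. The `ToolCallSpan` shape and `toolUseView` name below are hypothetical; the actual TraceSchema v1 fields are not shown in this PR text, and only the notions of `argHash` and error/duplicate rates come from the commit message.

```typescript
// Hypothetical span shape, for illustration only.
interface ToolCallSpan {
  runId: string;
  tool: string;
  argHash: string; // hash of the call arguments, used to detect repeated calls
  error?: boolean;
}

interface ToolUseSummary { total: number; errorRate: number; duplicateRate: number; }

// A "view": reads recorded spans only, so it can be re-run on old traces at any time.
function toolUseView(spans: ToolCallSpan[], runId: string): ToolUseSummary {
  const calls = spans.filter((s) => s.runId === runId);
  const seen = new Set<string>();
  let duplicates = 0;
  let errors = 0;
  for (const call of calls) {
    const key = `${call.tool}:${call.argHash}`;
    if (seen.has(key)) duplicates++; // same tool, same arguments, seen earlier in the run
    else seen.add(key);
    if (call.error) errors++;
  }
  const total = calls.length;
  return {
    total,
    errorRate: total === 0 ? 0 : errors / total,
    duplicateRate: total === 0 ? 0 : duplicates / total,
  };
}
```

Since the function never touches the agent, changing the metric definition later only requires re-running it over the stored corpus, which is the retroactive re-scoring property described above.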

…ion + behavior DSL

Phase-2 deliverable: makes the framework defensible for external customers.

New primitives:
- Dataset: versioned scenarios with SHA-256 content hash + provenance, split (train/dev/test/holdout), difficulty tiers, seeded-deterministic slice. `HoldoutLockedError` prevents post-release mutation; jsonl serialization is byte-stable for contamination-verifiable archives.
- ContaminationGuard: canary token checks, canaryLeakView over trace corpus, HoldoutAuditor wraps holdout dataset with purpose-tagged access logging.
- RedTeamBattery: 9-case default corpus across 8 categories (prompt-injection direct/indirect, jailbreak DAN/persona, PII, permission-escalation, data-exfil, policy-override). Per-case scorer checks forbidden strings, forbidden tools, refusal markers, and default PII regex rules.
- PowerAnalysis + FDR: requiredSampleSize (normal-approx, ~63 per arm for d=0.5 @ alpha=0.05 power=0.8), bonferroni, benjaminiHochberg with monotonicity preserved. Acklam inverse-normal-CDF included; no deps.
- BehaviorDSL: expectAgent(store, runId).toCall('x').withArgs({q: /./}).times(3) / .toRefuse() / .toOutputMatch(re) / .toRespectBudget(dim) / .toCompleteWithin({wallMs, toolCalls, llmTurns}) / .toNeverCall(tool). runExpectations collects results without throwing.
- JudgeCalibration: calibrateJudge (pearson + quadratic weighted κ + MAE + worst-5), positionalBias (A/B swap), verbosityBias (length-score correlation), selfPreference (in-family vs out-of-family mean delta).

Fixed:
- PromptOptimizer now applies BH-FDR correction across n*(n-1)/2 pairwise tests before flagging "significant". Prior 0.2/0.3 behavior inflated false-positive rate past alpha as variants grew past 2.

215 tests pass; typecheck + tsup build clean.
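
The Benjamini-Hochberg correction with monotonicity preserved, as mentioned above, follows a standard step-up procedure. The sketch below is a generic textbook version, not the framework's own `benjaminiHochberg`; the function name and return shape (adjusted p-values in input order) are assumptions.

```typescript
// Generic Benjamini-Hochberg adjusted p-values, returned in the caller's original order.
function benjaminiHochbergSketch(pValues: number[]): number[] {
  const m = pValues.length;
  // Sort ascending while remembering each p-value's original position.
  const order = pValues.map((p, i) => ({ p, i })).sort((a, b) => a.p - b.p);
  const adjusted = new Array<number>(m).fill(0);
  let runningMin = 1;
  // Walk from the largest p-value down, so each adjusted value is capped by
  // the ones ranked above it. This is what keeps the result monotone.
  for (let rank = m; rank >= 1; rank--) {
    const { p, i } = order[rank - 1];
    runningMin = Math.min(runningMin, (p * m) / rank);
    adjusted[i] = runningMin;
  }
  return adjusted;
}

// Number of pairwise comparisons among n variants: n*(n-1)/2.
function pairwiseTestCount(n: number): number {
  return (n * (n - 1)) / 2;
}
```

With 5 variants there are 10 pairwise tests, so running each at an uncorrected alpha of 0.05 would push the chance of at least one false positive well above 5%. That is the inflation the fix above addresses.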

… + visual-diff

Closes the 0.4 roadmap. Every piece consumes the trace corpus; zero duplicate accounting paths.

- ci-gate.ts: ThresholdContract → evaluateContract → renderMarkdownReport. Baseline vs candidate comparison (delegates to baseline.ts), SLO evaluation, per-breach reasons, GitHub-PR-ready markdown output. Drop-in for GH Actions.
- observability.ts: toLangfuseEnvelope (generations + scores, wire-compatible), toPrometheusText (8 counters + tool breakdown), replayTraceThroughJudge (the canonical retroactive re-scoring path; no agent re-execution needed).
- paraphrase.ts: paraphraseRobustness harness + 5 mutators (lowercase, sentence-reorder, typo, politeness-prefix, whitespace-collapse). Surfaces brittle prompts whose score depends on surface form.
- visual-diff.ts: pixel-delta scoring with configurable tolerance; maps to unchanged / changed / severely-changed. Zero deps; consumers pass decoded RGBA pixel arrays.

234 tests pass; typecheck + build (178KB ESM bundle) clean.

0.4.0 surface now covers: trace chassis, sandbox harness, 7 pipelines, failure taxonomy, budget guard, trajectory, tool-use metrics, SLO, baseline, oracle, pareto, cost tracker, state continuity, series convergence, dataset artifact, contamination guard, red-team battery, power analysis + FDR, behavior DSL, judge calibration, CI gate, Langfuse/Prometheus observability adapters, paraphrase robustness, visual diff. OTEL export ships everywhere via shared trace schema.
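
A pixel-delta score over decoded RGBA arrays, as described for visual-diff.ts, might look roughly like the following. The function name, parameter names, and both threshold defaults are assumptions for illustration; only the three verdict labels and the idea of a configurable tolerance come from the commit message.

```typescript
type DiffVerdict = "unchanged" | "changed" | "severely-changed";

// Hypothetical signature; thresholds are illustrative defaults, not the library's.
function pixelDelta(
  a: Uint8ClampedArray,
  b: Uint8ClampedArray,
  tolerance = 0,     // max per-channel difference still treated as equal
  changedAt = 0.001, // fraction of differing pixels that counts as "changed"
  severeAt = 0.1,    // fraction that counts as "severely-changed"
): { ratio: number; verdict: DiffVerdict } {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("expected two RGBA buffers of equal length");
  }
  const pixels = a.length / 4;
  let differing = 0;
  for (let p = 0; p < pixels; p++) {
    const offset = p * 4;
    // A pixel differs if any of its four channels exceeds the tolerance.
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[offset + c] - b[offset + c]) > tolerance) {
        differing++;
        break;
      }
    }
  }
  const ratio = pixels === 0 ? 0 : differing / pixels;
  const verdict: DiffVerdict =
    ratio >= severeAt ? "severely-changed" : ratio >= changedAt ? "changed" : "unchanged";
  return { ratio, verdict };
}
```

Taking already-decoded pixel arrays keeps image decoding (PNG, JPEG) outside the function, which is how a module like this can stay dependency-free.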

…l + correlation

The evaluation shape for agent-builder specifically. Every build is a three-layer nested Run tree: builder (L0, the Forge chat), app-build (L1, the sandbox that compiled + tested the generated scaffold), app-runtime (L2, domain scenarios run against the generated agent).

The framework-critical signal is the meta-runtime correlation: does the builder's self-score predict real-world success? If r < 0.4 the builder is optimizing for the wrong thing.

Schema (backward-compat, all optional):
- Run.parentRunId: ties app-build → builder, app-runtime → app-build
- Run.projectId / chatId: group across chats + sessions
- Run.layer: 'builder' | 'app-build' | 'app-runtime' | 'meta' | 'custom'
- RunFilter + matchesRun + all callers updated

builder-eval/:
- BuilderSession: startChat → ship → runAppScenario, auto-parents child Runs, records meta-score as a JudgeVerdict on the builder Run.
- resumeBuilderSession: reconstructs the latest builder/build/runtime runs from the trace store; chat-first UIs call this on project open.
- three-layer-eval.scoreProject: meta/build/runtime scores + complete flag.
- correlation.correlateLayers: Pearson + Spearman across meta↔build, meta↔runtime, build↔runtime; surfaces the meta-runtime predictive coefficient.
- ProjectRegistry: listProjects, projectTimeline, projectChats; the read-side surface a chat-first dashboard consumes.

242 tests pass; typecheck + 189KB ESM build clean.
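
The meta-runtime check above reduces to a Pearson correlation between the builder's self-scores and the runtime scores of the agents it produced. The sketch below is a plain textbook Pearson coefficient plus the r < 0.4 rule from the commit message; `pearson` and `metaPredictsRuntime` are hypothetical names, not the `correlateLayers` API.

```typescript
// Textbook Pearson correlation between two equal-length samples.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length samples with n >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((s, x) => s + x, 0) / n;
  const meanY = ys.reduce((s, y) => s + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  if (sxx === 0 || syy === 0) return NaN; // no variance, correlation undefined
  return sxy / Math.sqrt(sxx * syy);
}

// Applies the r < 0.4 rule of thumb quoted above; the result shape is an assumption.
function metaPredictsRuntime(metaScores: number[], runtimeScores: number[], minR = 0.4) {
  const r = pearson(metaScores, runtimeScores);
  return { r, predictive: r >= minR }; // NaN compares false, so undefined r is not predictive
}
```

A call such as `metaPredictsRuntime([0.9, 0.8, 0.4, 0.2], [0.85, 0.7, 0.5, 0.1])` would report a high `r`, since projects the builder rated highly also did well at runtime in that toy data.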

Two helpers that close the app-build/app-runtime recording gap when the caller isn't using SandboxHarness (e.g. agent-builder's publish hook does a preview-URL smoke test instead of a full test suite, and its scenarios-run invokes the sandbox chat endpoint directly).

- startAppRuntime(scenarioId): returns a fresh emitter on a new app-runtime Run parented to the latest build (or builder). Caller emits spans inside and calls endRun() with the verdict.
- recordShipMarker({ pass, score, notes }): lightweight app-build Run with caller-supplied outcome, for marker-style tracking without a harness.

The three frontier primitives that close the "does eval predict reality + how do we attribute what moved the needle" loop.

meta-eval/:
- OutcomeStore (InMemory + FileSystem NDJSON): deployment outcomes keyed on runId (retention, CSAT, revenue, task-success, anything product telemetry exports). Lifecycle is peer to TraceStore, arrives async.
- correlationStudy: joins traces ↔ outcomes, computes Pearson + Spearman + bootstrap 95% CI per (evalMetric, outcomeMetric) pair. Verdict strong/moderate/weak. This is the one number that makes the framework trusted.
- calibrationCurve: equal-width or equal-frequency binning of (eval vs reality), Expected Calibration Error + max bin gap. Shows *where* the eval is over/underconfident, not just a single correlation coefficient.

prm/:
- StepRubric + PrmGrader: applies per-step rubrics to every span in a trajectory, emits a JudgeVerdict span per (rubric × span) so existing pipelines (judgeAgreementView, etc.) see PRM verdicts natively.
- Built-in rubrics: outputLength, toolSuccess, toolNonRedundant, nonRefusal, toolIntentAlignment; cheap rule-based reference graders.
- exportTrainingData: NDJSON of (step-context, score) pairs for downstream reward-model fine-tuning. Framework-agnostic; plug into TRL/Unsloth/etc.
- prmBestOfN + prmEnsembleBestOfN: inference-time candidate ranking via PRM; ensemble uses Borda count so different graders' score scales don't dominate.

bisector.ts:
- bisect<T>: generic binary-search with caller-supplied halfway + runEval.
- commitBisect: ordered SHA list; narrows to adjacent (good, bad) pair.
- promptBisect: splits good/bad prompts into paragraphs, progressively replaces paragraphs to localize the offending change; returns the paragraph index that broke.
- Flags inputInconsistent when caller's good/bad premise is wrong.

267 tests pass; typecheck + 214KB ESM build clean. Published version bump to 0.6.0 (npm publish is a follow-up).
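
The commitBisect behavior described above (narrow an ordered SHA list to an adjacent good/bad pair, and flag an inconsistent premise) is a standard binary search. The sketch below is synchronous for brevity and uses a hypothetical name and return shape; a real eval callback would most likely be async, and the actual signature in bisector.ts is not shown in this PR text.

```typescript
// Hypothetical result shape, for illustration only.
type BisectResult = { good: string; bad: string } | { inputInconsistent: true };

function commitBisectSketch(
  shas: string[],                   // ordered oldest to newest
  isGood: (sha: string) => boolean, // caller-supplied eval, assumed synchronous here
): BisectResult {
  if (shas.length < 2) return { inputInconsistent: true };
  let lo = 0;
  let hi = shas.length - 1;
  // Check the caller's premise: the first commit is good and the last is bad.
  if (!isGood(shas[lo]) || isGood(shas[hi])) return { inputInconsistent: true };
  // Invariant: shas[lo] is good, shas[hi] is bad. Halve the gap until they are adjacent.
  while (hi - lo > 1) {
    const mid = lo + Math.floor((hi - lo) / 2);
    if (isGood(shas[mid])) lo = mid;
    else hi = mid;
  }
  return { good: shas[lo], bad: shas[hi] };
}
```

This costs about log2(n) eval runs instead of n, which matters when each eval is a full benchmark run. Like any bisection, it assumes a single good-to-bad transition in the list.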

… pre-registration

counterfactual.ts:
- runCounterfactual: records a meta-layer Run parented to the original, applies a mutation (swap-model / swap-tool-result / truncate-after / inject-system-message / custom) at turn N, delegates execution to a caller-supplied CounterfactualRunner, captures the paired score delta.
- attributeCounterfactuals: given a batch of CF results, ranks mutation kinds by mean absolute delta; surfaces which lever moves outcomes most.

cross-trace-diff.ts:
- crossTraceDiff: full LCS alignment of two trajectories (match / insert / delete / replace), per-step attribution using PRM verdicts when available (fallback: latency + token delta). Reports totalScoreDelta alongside prmDeltaSum so coverage gap is visible.

pre-registration.ts:
- signManifest: SHA-256 over canonicalized hypothesis; manifest becomes immutable.
- evaluateHypothesis: mechanical check of observed effect against pre-declared direction / minEffect / alpha / preRegisteredN. Tampered manifests throw. Rejection reasons are machine-tagged so you cannot silently re-interpret.

276 tests pass; typecheck + 220KB ESM build clean.
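
Signing a pre-registered hypothesis with SHA-256 over a canonical form, and rejecting tampered manifests, could be sketched as follows. The `Hypothesis` fields reuse the names mentioned above (direction, minEffect, alpha, preRegisteredN) plus an assumed `metric` field; the function names, the manifest shape, and the sorted-key canonicalization are illustrative assumptions, not the pre-registration.ts implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest types, for illustration only.
interface Hypothesis {
  metric: string;
  direction: "increase" | "decrease";
  minEffect: number;
  alpha: number;
  preRegisteredN: number;
}
interface SignedManifest { hypothesis: Hypothesis; signature: string; }

// Serialize with sorted keys so property order cannot change the hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

function signManifestSketch(hypothesis: Hypothesis): SignedManifest {
  const signature = createHash("sha256").update(canonicalize(hypothesis)).digest("hex");
  return { hypothesis, signature };
}

// Recompute the hash and throw if the hypothesis was edited after signing.
function assertUntampered(manifest: SignedManifest): void {
  const expected = signManifestSketch(manifest.hypothesis).signature;
  if (expected !== manifest.signature) throw new Error("manifest was modified after signing");
}
```

Note that a bare hash detects edits but does not prove who signed. Anyone able to rewrite both fields can re-sign, so in practice the signature would need to be recorded somewhere the experimenter cannot change it later.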

…arning + RM export

self-play.ts: runSelfPlay(proposer, scorer, targets). Scenarios survive when they produce a score *spread* across targets AND every target stays above a floor (rejects noise/gibberish). Survivors promoted to a Dataset with origin='self-play'. Framework-agnostic about how candidates are generated; caller supplies the proposer agent.

causal-attribution.ts: factorial-experiment variance decomposition. causalAttribution(cells) returns per-factor main-effect share + pairwise interaction share + residual. Moves from 'variant B wins' to 'model swap accounts for X%, prompt swap for Y%, interaction Z%.' Minimal 2-way; extends to N-way via the same group-mean math.

active-learning.ts: proposeSynthesisTargets analyzes Dataset + trace corpus, returns prioritized gaps: difficulty coverage, undersampled scenarios, high-variance scenarios, failure-class clusters. Caller feeds (neighbors, direction) into their LLM proposer for scenario synthesis. No LLM call inside the framework; this finds WHERE to author, not WHAT to write.

reward-model-export.ts: ExportedRewardModel wraps PRM training NDJSON with metadata (nTraces, nSamples, rubrics, meanReward) for handoff to a fine-tuning pipeline. loadScorerFromGrader returns a zero-deps InferenceScorer so customers embed PRM as an inference-time judge without fine-tuning. replayScorerOverCorpus validates a scorer against its training corpus; drift indicates overfitting.

286 tests pass; typecheck + 244KB ESM build clean.
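
The self-play survival rule above (a scenario must separate the targets, yet no target may collapse below a floor) can be written as a small predicate. The function name and both default thresholds are assumptions for illustration; the commit message does not state the actual values.

```typescript
// Hypothetical predicate; minSpread and floor defaults are illustrative only.
function scenarioSurvives(scoresByTarget: number[], minSpread = 0.2, floor = 0.1): boolean {
  if (scoresByTarget.length < 2) return false; // spread needs at least two targets
  const max = Math.max(...scoresByTarget);
  const min = Math.min(...scoresByTarget);
  // Discriminating: targets score differently enough to tell them apart.
  const discriminates = max - min >= minSpread;
  // Not gibberish: even the weakest target clears the floor.
  const answerable = min >= floor;
  return discriminates && answerable;
}
```

A scenario every target scores identically on carries no ranking signal, and one where some target scores near zero is more likely malformed than hard, so both are filtered out before promotion.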

Regulator-facing reports derived from the trace corpus. Each framework ships a GovernanceContext → GovernanceReport generator + a structured finding list with severity + control reference + remediation. Markdown renderer for external auditors; JSON for CI.

- nist-ai-rmf.ts: GOVERN-1.1/1.3, MEASURE-2.1/2.6/2.11, MANAGE-1.1. Derives findings from real framework state: missing red-team, missing outcomes, weak judge calibration, absent dataset hashes all surface as typed controls.
- soc2.ts: CC7.1–7.4 common-criteria evidence package. Failure rate, abort rate, incident events, distinct release fingerprints (codeSha/promptSha/modelFingerprint).
- eu-ai-act.ts: classifyEuAiRisk({ biometricPublic, annexIII, chatbot, generatesSyntheticMedia, ... }) → unacceptable|high|limited|minimal. Art. 5 prohibits outright; Art. 9/10/11/13/14/15 checked for high-risk deployments; Art. 52 transparency for limited-risk chatbots.

296 tests pass; typecheck + 244KB ESM build clean. All Tier 1 + Tier 2 + Tier 3 + governance frontier work shipped in this session.
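
The four-tier mapping in eu-ai-act.ts can be read as a precedence ladder. The sketch below is one plausible reading that uses only the four input flags named above; the ordering, the treatment of each flag, and the function name are assumptions. It is not the framework's `classifyEuAiRisk` and not legal guidance.

```typescript
type EuRisk = "unacceptable" | "high" | "limited" | "minimal";

// Only the four flags named in the commit message; the real input has more fields.
interface EuRiskInput {
  biometricPublic?: boolean;
  annexIII?: boolean;
  chatbot?: boolean;
  generatesSyntheticMedia?: boolean;
}

// Assumed precedence: the most severe matching tier wins.
function classifyRiskSketch(input: EuRiskInput): EuRisk {
  if (input.biometricPublic) return "unacceptable"; // assumed to map to the Art. 5 prohibition
  if (input.annexIII) return "high";                // Annex III use cases
  if (input.chatbot || input.generatesSyntheticMedia) return "limited"; // transparency duties
  return "minimal";
}
```

Checking tiers from most to least severe means a chatbot deployed in an Annex III use case is classified as high-risk instead of limited.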

drewstone added 11 commits April 20, 2026 17:03

chore: bump to 0.5.0 — builder-of-builders public surface

41d0b71

drewstone merged commit fb0dcbe into main Apr 21, 2026

drewstone deleted the feat/0.2-framework-additions branch April 21, 2026 02:21

drewstone mentioned this pull request May 8, 2026

chore(deps): clear 3 moderate Dependabot alerts (hono + postcss) #39

Merged

4 tasks

drewstone mentioned this pull request May 19, 2026

feat(analyst): registry + findings envelope over existing primitives #56

Merged

4 tasks

drewstone merged 11 commits into main from feat/0.2-framework-additions
