
feat: 0.2.0 — nine framework primitives for stellar evals#1

Merged
drewstone merged 11 commits into main from feat/0.2-framework-additions on Apr 21, 2026
@drewstone

Summary

Ships the nine primitives every hand-rolled eval in tax/legal/gtm had to reinvent, as framework with real regression coverage.

Capabilities added

  1. PromptRegistry — versioned prompts + SHA-256 content hash, idempotent re-register, loud error on conflict
  2. TraceStore (Memory + FileSystem NDJSON with rollover) — per-LLM-call log queryable by runId/scenarioId/role
  3. AntiSlopJudge — deterministic pattern check (banned phrases, hedging, apology, repetition, length); plugs into the judge array
  4. ArtifactValidator — pluggable file/JSON/blob validators with partial credit; built-ins for regex, keys, size, substrings; composable with weighted aggregation
  5. WorkspaceInspector — post-run state snapshot + fileExists/fileContains/rowCount/rowWhere assertions
  6. ExperimentTracker — run grouping + persistence + per-scenario/aggregate/config diff between runs
  7. PromptOptimizer — A/B variants with per-variant bootstrap CI + pairwise Mann-Whitney U; only flags winner as 'significant' when it beats every alternative
  8. DualAgentBench — proposer/critic convergence pattern with round history + aggregate convergence rate
  9. Statistics additions: pairedTTest (with Student-t CDF), wilcoxonSignedRank, cohensD
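
To make item 3 concrete, here is a minimal sketch of a deterministic banned-phrase check that counts every occurrence rather than stopping at the first match. The `SlopScore` shape, the `antiSlopScore` name, and the per-hit penalty are assumptions for illustration; the real AntiSlopJudge and JudgeScore types may differ.

```typescript
// Hypothetical result shape; the framework's JudgeScore may carry more fields.
interface SlopScore {
  dimension: "anti_slop";
  score: number; // 1 = clean, 0 = fully penalized
  hits: Record<string, number>; // occurrences per banned phrase
}

// Escape regex metacharacters so phrases are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function antiSlopScore(
  text: string,
  bannedPhrases: string[],
  penaltyPerHit = 0.1, // assumed weight, not taken from the PR
): SlopScore {
  const hits: Record<string, number> = {};
  let total = 0;
  for (const phrase of bannedPhrases) {
    // The "g" flag makes match() return every occurrence, not just the first.
    const re = new RegExp(escapeRegExp(phrase), "gi");
    const count = (text.match(re) ?? []).length;
    if (count > 0) {
      hits[phrase] = count;
      total += count;
    }
  }
  return {
    dimension: "anti_slop",
    score: Math.max(0, 1 - penaltyPerHit * total),
    hits,
  };
}
```

Because no LLM call is involved, the same input always produces the same score, which is what lets a check like this sit in a judge array next to model-based judges.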

Tests

11 files, 119 passing (up from 27). Every regression case names the specific bug it would catch.

Diff

| Before | After |
| --- | --- |
| 12 src files, 2,008 LOC | 21 src files, ~4,500 LOC |
| 3 test files, 27 tests | 11 test files, 119 tests |

API stability

Strictly additive. Every 0.1.0 export still works unchanged.

Migration plan (separate PRs)

Don't migrate tax/legal/gtm until their existing features are reproducible in <200 lines of domain config on top of 0.2.0 — bar set per Drew's 'strictly better first' directive. Film-agent will get migrated first as the proof case.

Every hand-rolled eval in the fleet (tax/legal/gtm, plus a dozen future
verticals) had to reinvent the same nine patterns. This release ships
them as framework, with real tests, so consumers write <200 lines of
domain config on top instead of 2,000+ lines of bespoke harness.

## What lands

1. PromptRegistry — versioned prompts with SHA-256 content hashing. get()
   throws on unknown keys (no silent defaults). Re-registering identical
   content is idempotent; conflicting content on the same version errors
   loudly so A/B history stays auditable.

2. TraceStore — pluggable per-LLM-call log (prompt, output, tokens, cost,
   timing). MemoryTraceStore for tests; FileSystemTraceStore for
   append-only NDJSON segments with rollover past 32 MB. Malformed lines
   skipped so one bad row cannot kill a whole query.

3. AntiSlopJudge — deterministic pattern-based quality check, no LLM
   call. Catches banned phrases, hedging, apology padding, n-gram
   repetition, and length bounds. Emits a JudgeScore with dimension
   'anti_slop' so it composes into BenchmarkRunner's judge array
   transparently. Counts every occurrence, not just first match.

4. ArtifactValidator — interface for grading a produced file/JSON/blob.
   Built-ins: regexMatch, jsonHasKeys (partial credit), byteLengthRange
   (proportional score), containsAll (case-insensitive default).
   composeValidators aggregates pass as AND, score as weighted mean,
   issues namespaced by validator.

5. WorkspaceInspector — post-run state snapshot + assertion library.
   fileExists, fileContains, rowCount (proportional score inside/outside
   range), rowWhere (predicate-based row matching). runAssertions
   aggregates for a snapshot.

6. ExperimentTracker — groups runs, persists via pluggable
   ExperimentStore, computes per-scenario diff + aggregate delta +
   config delta between two completed runs. Not MLflow; the 20% that
   actually ships. Timeline view for drift tracking.

7. PromptOptimizer — A/B test N prompt variants with statistical rigor.
   Per-variant bootstrap CI + pairwise Mann-Whitney U. Winner is only
   marked 'significant' when it beats every other variant at the
   threshold — no declaring winners on noise. Rejects <2 variants, empty
   scenarios, non-finite scores.

8. DualAgentBench — proposer/critic convergence pattern lifted from
   tax-agent + legal-agent. Proposer sees prior critique each round.
   Configurable max rounds + convergence threshold. Aggregate
   convergenceRate + avgRoundsToConverge on the report.

9. Expanded statistics — pairedTTest (for before/after with paired
   removal of inter-item variance; p-value from a Student-t CDF approximation),
   wilcoxonSignedRank (non-parametric alternative), cohensD (standardized
   effect size). Complements existing bootstrap CI, Mann-Whitney,
   inter-rater reliability.
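
The aggregation rule in item 4 (pass as AND, score as weighted mean, issues namespaced by validator) can be sketched as follows. The interfaces and the `weight` field are assumptions made for this example; only the aggregation semantics come from the description above.

```typescript
// Hypothetical shapes; the framework's ArtifactValidator interface may differ.
interface ValidationResult {
  pass: boolean;
  score: number; // 0..1
  issues: string[];
}

interface NamedValidator<T> {
  name: string;
  weight: number;
  validate(artifact: T): ValidationResult;
}

// Partial credit: score is the fraction of required keys that are present.
function jsonHasKeys(keys: string[], weight = 1): NamedValidator<Record<string, unknown>> {
  return {
    name: "jsonHasKeys",
    weight,
    validate(obj) {
      const missing = keys.filter((k) => !(k in obj));
      return {
        pass: missing.length === 0,
        score: keys.length === 0 ? 1 : (keys.length - missing.length) / keys.length,
        issues: missing.map((k) => `missing key "${k}"`),
      };
    },
  };
}

function composeValidators<T>(validators: NamedValidator<T>[]) {
  return (artifact: T): ValidationResult => {
    let pass = true;
    let weightedSum = 0;
    let totalWeight = 0;
    const issues: string[] = [];
    for (const v of validators) {
      const r = v.validate(artifact);
      pass = pass && r.pass; // AND across validators
      weightedSum += v.weight * r.score;
      totalWeight += v.weight;
      for (const issue of r.issues) issues.push(`${v.name}: ${issue}`); // namespaced
    }
    return { pass, score: totalWeight > 0 ? weightedSum / totalWeight : 0, issues };
  };
}
```

Keeping `pass` and `score` separate means a composite can fail outright while still reporting how close the artifact came.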

## Tests

11 test files, 119 passing (was 27). Every regression case names the
specific bug it would catch.

## Deps

Adds @types/node to devDependencies (required for FileSystemTraceStore).
No runtime deps change — framework is still dependency-lean.

## Bump: 0.1.0 -> 0.2.0 (strictly additive; no existing API changed)
…arness

Ships the evaluation platform spine. Every score, failure class, and metric is now
a view over a shared TraceSchema v1 corpus (Run + Span + Event + ToolCall +
Retrieval + JudgeVerdict + BudgetLedger + Artifact). Pipelines query traces; they
don't hook live agents. This unlocks retroactive re-scoring, cross-run diff,
and prod/eval observability parity.

Chassis:
- trace/schema.ts — TraceSchema v1 (OTEL-compatible wire, agent-extended kinds)
- trace/store.ts — InMemoryTraceStore + FileSystemTraceStore (NDJSON, 32MB rollover)
- trace/emitter.ts — hierarchical span builder with auto-parenting + within()
- trace/query.ts — typed query helpers (spans by kind, argHash, aggregateLlm)
- trace/redact.ts — DEFAULT_REDACTION_RULES (email/ssn/card/bearer/private-key)
- trace/otel.ts — exportRunAsOtlp for Jaeger/Honeycomb/Langfuse ingest

Producers:
- sandbox-harness.ts — SubprocessSandboxDriver + DockerSandboxDriver + vitest/
  pytest/jest parsers; emits SandboxSpan
- test-graded-scenario.ts — SWE-bench pattern: score = testsPassed/testsTotal
- budget-guard.ts — enforces tokens/wallMs/calls/usd; BudgetBreachError + ledger
- failure-taxonomy.ts — 17-class enum + rule-based classifier
- trajectory.ts — DFS chronological view over run spans
- tool-use-metrics.ts — error/duplicate/retry rates + optional selectionLabels
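
As an illustration of the budget-guard idea (enforce tokens/wallMs/calls/usd, throw on breach, keep a ledger), a minimal sketch might look like the following. The `BudgetGuard` class, its `charge` and `ledger` methods, and the error fields are assumptions for this example, not the actual budget-guard.ts API.

```typescript
type BudgetDim = "tokens" | "wallMs" | "calls" | "usd";

class BudgetBreachError extends Error {
  readonly dim: BudgetDim;
  readonly used: number;
  readonly limit: number;
  constructor(dim: BudgetDim, used: number, limit: number) {
    super(`budget breach on ${dim}: used ${used}, limit ${limit}`);
    this.name = "BudgetBreachError";
    this.dim = dim;
    this.used = used;
    this.limit = limit;
  }
}

class BudgetGuard {
  private readonly limits: Partial<Record<BudgetDim, number>>;
  private readonly used: Record<BudgetDim, number> = { tokens: 0, wallMs: 0, calls: 0, usd: 0 };

  constructor(limits: Partial<Record<BudgetDim, number>>) {
    this.limits = limits;
  }

  // Records the spend first, then throws if the dimension is over its limit,
  // so the ledger still reflects the charge that caused the breach.
  charge(dim: BudgetDim, amount: number): void {
    this.used[dim] += amount;
    const limit = this.limits[dim];
    if (limit !== undefined && this.used[dim] > limit) {
      throw new BudgetBreachError(dim, this.used[dim], limit);
    }
  }

  // Returns a copy so callers cannot mutate internal accounting.
  ledger(): Record<BudgetDim, number> {
    return { ...this.used };
  }
}
```

Dimensions without a configured limit are tracked but never throw, which keeps the ledger useful for reporting even when only one or two dimensions are enforced.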

Pipelines (views over traces):
- stuck-loop, tool-waste, budget-breach, failure-cluster, judge-agreement,
  first-divergence, regression

Auxiliary decision modules (also consumable over trace data):
- slo.ts (DEFAULT_AGENT_SLOS + critical/warning severity)
- baseline.ts (Welch's t-test + Cohen's d + per-side IQR stability)
- oracle.ts (textInSnapshot / urlContains / jsonShape / regex / notBlocked)
- pareto.ts, cost-tracker.ts, series-convergence.ts, state-continuity.ts
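
For readers unfamiliar with the statistics named for baseline.ts, the sketch below computes Welch's t statistic (with Welch-Satterthwaite degrees of freedom) and Cohen's d with a pooled standard deviation. Function names are assumptions, and the p-value step is omitted because it needs a Student-t CDF.

```typescript
function sampleMean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = sampleMean(xs);
  return xs.reduce((s, x) => s + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Welch's t-test does not assume equal variances between the two groups.
function welchT(a: number[], b: number[]): { t: number; df: number } {
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;
  const t = (sampleMean(a) - sampleMean(b)) / Math.sqrt(va + vb);
  const df = ((va + vb) * (va + vb)) / ((va * va) / (a.length - 1) + (vb * vb) / (b.length - 1));
  return { t, df };
}

// Cohen's d: mean difference in units of the pooled standard deviation.
function cohensDPooled(a: number[], b: number[]): number {
  const pooledVar =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  return (sampleMean(a) - sampleMean(b)) / Math.sqrt(pooledVar);
}
```

Reporting the effect size alongside the test statistic guards against treating a tiny but statistically detectable shift as a meaningful regression.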

169 tests pass. Typecheck + tsup build clean. Legacy LlmTrace removed; nothing
external consumed it yet.
…ion + behavior DSL

Phase-2 deliverable: makes the framework defensible for external customers.

New primitives:
- Dataset: versioned scenarios with SHA-256 content hash + provenance, split
  (train/dev/test/holdout), difficulty tiers, seeded-deterministic slice.
  `HoldoutLockedError` prevents post-release mutation; jsonl serialization is
  byte-stable for contamination-verifiable archives.
- ContaminationGuard: canary token checks, canaryLeakView over trace corpus,
  HoldoutAuditor wraps holdout dataset with purpose-tagged access logging.
- RedTeamBattery: 9-case default corpus across 8 categories (prompt-injection
  direct/indirect, jailbreak DAN/persona, PII, permission-escalation, data-exfil,
  policy-override). Per-case scorer checks forbidden strings, forbidden tools,
  refusal markers, and default PII regex rules.
- PowerAnalysis + FDR: requiredSampleSize (normal-approx, ~63 per arm for d=0.5
  @ two-sided alpha=0.05, power=0.8), bonferroni, benjaminiHochberg with monotonicity
  preserved. Acklam inverse-normal-CDF included; no deps.
- BehaviorDSL: expectAgent(store, runId).toCall('x').withArgs({q: /./})
  .times(3) / .toRefuse() / .toOutputMatch(re) / .toRespectBudget(dim)
  .toCompleteWithin({wallMs, toolCalls, llmTurns}) / .toNeverCall(tool).
  runExpectations collects results without throwing.
- JudgeCalibration: calibrateJudge (pearson + quadratic weighted κ + MAE +
  worst-5), positionalBias (A/B swap), verbosityBias (length-score correlation),
  selfPreference (in-family vs out-of-family mean delta).
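
The "~63 per arm" figure can be reproduced with the usual normal-approximation formula n = 2 (z_{1-alpha/2} + z_{power})^2 / d^2. The sketch below is an illustration only: it swaps in a bisection-based normal quantile over an erf approximation (Abramowitz and Stegun 7.1.26) instead of the Acklam algorithm the PR mentions, and the function names mirror the description rather than the actual signatures.

```typescript
// erf approximation, Abramowitz and Stegun 7.1.26 (absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Inverse normal CDF by bisection; slower than Acklam but easy to verify.
function normalQuantile(p: number): number {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// Per-arm sample size for a two-sided, two-sample comparison of means.
function requiredSampleSize(d: number, alpha = 0.05, power = 0.8): number {
  const zAlpha = normalQuantile(1 - alpha / 2);
  const zBeta = normalQuantile(power);
  const zSum = zAlpha + zBeta;
  return Math.ceil((2 * zSum * zSum) / (d * d));
}
```

With the defaults, `requiredSampleSize(0.5)` evaluates to 63, matching the figure quoted above; smaller effects grow the requirement quadratically.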

Fixed:
- PromptOptimizer now applies BH-FDR correction across n*(n-1)/2 pairwise
  tests before flagging "significant". Prior 0.2/0.3 behavior inflated
  false-positive rate past alpha as variants grew past 2.
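
A minimal sketch of the Benjamini-Hochberg adjustment described here, including the monotonicity step, is shown below. It is an illustration under assumed names, not the framework's `benjaminiHochberg` implementation. With n variants there are n*(n-1)/2 pairwise p-values, so 4 variants produce 6 tests to adjust.

```typescript
// Returns BH-adjusted p-values in the same order as the input.
function bhAdjust(pValues: number[]): number[] {
  const m = pValues.length;
  const adjusted = new Array<number>(m).fill(1);
  // Sort descending by p so we walk from the largest rank down to rank 1.
  const descending = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => b.p - a.p);
  let runningMin = 1;
  descending.forEach((entry, k) => {
    const rank = m - k; // 1-based rank in ascending order
    // Taking the running minimum enforces monotonicity of adjusted values.
    runningMin = Math.min(runningMin, (entry.p * m) / rank);
    adjusted[entry.index] = runningMin;
  });
  return adjusted;
}

// A comparison is flagged only if its adjusted p-value is at or below alpha.
function significantAfterFdr(pValues: number[], alpha = 0.05): boolean[] {
  return bhAdjust(pValues).map((q) => q <= alpha);
}
```

Comparing adjusted values against alpha controls the expected share of false discoveries across all pairwise tests, instead of letting each test spend the full alpha on its own.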

215 tests pass; typecheck + tsup build clean.
… + visual-diff

Closes the 0.4 roadmap. Every piece consumes the trace corpus; zero duplicate
accounting paths.

- ci-gate.ts: ThresholdContract → evaluateContract → renderMarkdownReport.
  Baseline vs candidate comparison (delegates to baseline.ts), SLO evaluation,
  per-breach reasons, GitHub-PR-ready markdown output. Drop-in for GH Actions.
- observability.ts: toLangfuseEnvelope (generations + scores, wire-compatible),
  toPrometheusText (8 counters + tool breakdown), replayTraceThroughJudge (the
  canonical retroactive re-scoring path — no agent re-execution needed).
- paraphrase.ts: paraphraseRobustness harness + 5 mutators (lowercase,
  sentence-reorder, typo, politeness-prefix, whitespace-collapse). Surfaces
  brittle prompts whose score depends on surface form.
- visual-diff.ts: pixel-delta scoring with configurable tolerance; maps to
  unchanged / changed / severely-changed. Zero deps — consumers pass decoded
  RGBA pixel arrays.
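
A minimal sketch of pixel-delta scoring over decoded RGBA arrays follows. The option names and the threshold values that separate unchanged, changed, and severely-changed are assumptions chosen for the example; visual-diff.ts may use different defaults and a different delta metric.

```typescript
type DiffVerdict = "unchanged" | "changed" | "severely-changed";

interface DiffOptions {
  channelTolerance: number; // per-channel delta (0..255) ignored as noise
  changedAt: number; // fraction of differing pixels that counts as "changed"
  severeAt: number; // fraction that counts as "severely-changed"
}

function visualDiff(
  a: ArrayLike<number>,
  b: ArrayLike<number>,
  opts: DiffOptions = { channelTolerance: 8, changedAt: 0.001, severeAt: 0.1 },
): { ratio: number; verdict: DiffVerdict } {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("inputs must be RGBA arrays of equal length");
  }
  const pixels = a.length / 4;
  let differing = 0;
  for (let px = 0; px < pixels; px++) {
    const offset = px * 4;
    let maxDelta = 0;
    // A pixel differs if any of its four channels moves past the tolerance.
    for (let c = 0; c < 4; c++) {
      maxDelta = Math.max(maxDelta, Math.abs(a[offset + c] - b[offset + c]));
    }
    if (maxDelta > opts.channelTolerance) differing++;
  }
  const ratio = pixels === 0 ? 0 : differing / pixels;
  const verdict: DiffVerdict =
    ratio >= opts.severeAt ? "severely-changed" : ratio >= opts.changedAt ? "changed" : "unchanged";
  return { ratio, verdict };
}
```

Taking already-decoded pixel arrays keeps image codecs out of the dependency tree; the caller decides how PNGs or screenshots are decoded.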

234 tests pass; typecheck + build (178KB ESM bundle) clean.

0.4.0 surface now covers: trace chassis, sandbox harness, 7 pipelines, failure
taxonomy, budget guard, trajectory, tool-use metrics, SLO, baseline, oracle,
pareto, cost tracker, state continuity, series convergence, dataset artifact,
contamination guard, red-team battery, power analysis + FDR, behavior DSL,
judge calibration, CI gate, Langfuse/Prometheus observability adapters,
paraphrase robustness, visual diff. OTEL export ships everywhere via shared
trace schema.
…l + correlation

The evaluation shape for agent-builder specifically. Every build is a three-layer
nested Run tree: builder (L0, the Forge chat), app-build (L1, the sandbox that
compiled + tested the generated scaffold), app-runtime (L2, domain scenarios
run against the generated agent). The framework-critical signal is the
meta-runtime correlation: does the builder's self-score predict real-world
success? If r < 0.4 the builder is optimizing for the wrong thing.

Schema (backward-compat, all optional):
- Run.parentRunId — ties app-build → builder, app-runtime → app-build
- Run.projectId / chatId — group across chats + sessions
- Run.layer — 'builder' | 'app-build' | 'app-runtime' | 'meta' | 'custom'
- RunFilter + matchesRun + all callers updated

builder-eval/:
- BuilderSession: startChat → ship → runAppScenario, auto-parents child Runs,
  records meta-score as a JudgeVerdict on the builder Run.
- resumeBuilderSession: reconstructs the latest builder/build/runtime runs
  from the trace store — chat-first UIs call this on project open.
- three-layer-eval.scoreProject: meta/build/runtime scores + complete flag.
- correlation.correlateLayers: Pearson + Spearman across meta↔build,
  meta↔runtime, build↔runtime; surfaces the meta-runtime predictive
  coefficient.
- ProjectRegistry: listProjects, projectTimeline, projectChats — the
  read-side surface a chat-first dashboard consumes.
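
The layer correlation can be illustrated with plain Pearson and Spearman coefficients over paired per-project scores. This is a sketch with assumed function names, not the `correlateLayers` implementation; Spearman is computed as Pearson over average ranks so ties are handled.

```typescript
// Pearson correlation; returns NaN when either series has zero variance.
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) throw new Error("need paired series of length >= 2");
  const mx = x.reduce((s, v) => s + v, 0) / x.length;
  const my = y.reduce((s, v) => s + v, 0) / y.length;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < x.length; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// 1-based ranks; tied values share the average of the ranks they span.
function averageRanks(xs: number[]): number[] {
  const sorted = xs.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out = new Array<number>(xs.length).fill(0);
  let start = 0;
  while (start < sorted.length) {
    let end = start;
    while (end + 1 < sorted.length && sorted[end + 1].v === sorted[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) out[sorted[k].i] = avg;
    start = end + 1;
  }
  return out;
}

function spearman(x: number[], y: number[]): number {
  return pearson(averageRanks(x), averageRanks(y));
}

// The 0.4 cut-off is the threshold quoted in the commit message.
function metaPredictsRuntime(metaScores: number[], runtimeScores: number[]): boolean {
  return pearson(metaScores, runtimeScores) >= 0.4;
}
```

Reporting both coefficients helps distinguish a relationship that is monotone but nonlinear (high Spearman, lower Pearson) from one that is simply weak.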

242 tests pass; typecheck + 189KB ESM build clean.

Two helpers that close the app-build/app-runtime recording gap when the
caller isn't using SandboxHarness (e.g. agent-builder's publish hook
does a preview-URL smoke test instead of a full test suite, and its
scenarios-run invokes the sandbox chat endpoint directly).

- startAppRuntime(scenarioId): returns a fresh emitter on a new
  app-runtime Run parented to the latest build (or builder). Caller
  emits spans inside and calls endRun() with the verdict.
- recordShipMarker({ pass, score, notes }): lightweight app-build Run
  with caller-supplied outcome — for marker-style tracking without a
  harness.

The three frontier primitives that close the "does eval predict reality +
how do we attribute what moved the needle" loop.

meta-eval/:
- OutcomeStore (InMemory + FileSystem NDJSON) — deployment outcomes keyed
  on runId: retention, CSAT, revenue, task-success, anything product
  telemetry exports. Lifecycle is peer to TraceStore, arrives async.
- correlationStudy: joins traces ↔ outcomes, computes Pearson + Spearman +
  bootstrap 95% CI per (evalMetric, outcomeMetric) pair. Verdict strong/
  moderate/weak. This is the one number that makes the framework trusted.
- calibrationCurve: equal-width or equal-frequency binning of (eval vs
  reality), Expected Calibration Error + max bin gap. Shows *where* the
  eval is over/underconfident, not just a single correlation coefficient.
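
To show what Expected Calibration Error and max bin gap measure, here is a minimal equal-width sketch. It assumes eval scores and observed outcomes are both normalized to [0, 1]; the function name and point shape are hypothetical rather than the `calibrationCurve` signature.

```typescript
interface CalibrationPoint {
  predicted: number; // eval score in [0, 1]
  observed: number; // real-world outcome in [0, 1]
}

function calibrationSummary(
  points: CalibrationPoint[],
  nBins = 10,
): { ece: number; maxGap: number } {
  const bins = Array.from({ length: nBins }, () => ({ n: 0, predicted: 0, observed: 0 }));
  for (const p of points) {
    // Equal-width binning on the predicted score; 1.0 falls into the last bin.
    const idx = Math.min(nBins - 1, Math.max(0, Math.floor(p.predicted * nBins)));
    const bin = bins[idx];
    bin.n += 1;
    bin.predicted += p.predicted;
    bin.observed += p.observed;
  }
  let ece = 0;
  let maxGap = 0;
  for (const bin of bins) {
    if (bin.n === 0) continue;
    const gap = Math.abs(bin.predicted / bin.n - bin.observed / bin.n);
    ece += (bin.n / points.length) * gap; // weight each gap by bin occupancy
    maxGap = Math.max(maxGap, gap);
  }
  return { ece, maxGap };
}
```

ECE summarizes average miscalibration, while the max bin gap points at the score range where the eval is least trustworthy.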

prm/:
- StepRubric + PrmGrader — applies per-step rubrics to every span in a
  trajectory, emits a JudgeVerdict span per (rubric × span) so existing
  pipelines (judgeAgreementView, etc.) see PRM verdicts natively.
- Built-in rubrics: outputLength, toolSuccess, toolNonRedundant,
  nonRefusal, toolIntentAlignment — cheap rule-based reference graders.
- exportTrainingData: NDJSON of (step-context, score) pairs for downstream
  reward-model fine-tuning. Framework-agnostic — plug into TRL/Unsloth/etc.
- prmBestOfN + prmEnsembleBestOfN: inference-time candidate ranking via
  PRM; ensemble uses Borda count so different graders' score scales don't
  dominate.
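
The Borda-count idea behind the ensemble ranking can be sketched as below: each grader contributes rank-based points, so a grader scoring on a 0-100 scale cannot outweigh one scoring on 0-1. The names and the tie rule (points equal the number of candidates strictly beaten) are assumptions for illustration, not the `prmEnsembleBestOfN` implementation.

```typescript
// scoresByGrader[g][c] is grader g's raw score for candidate c.
function bordaTotals(scoresByGrader: number[][]): number[] {
  if (scoresByGrader.length === 0) return [];
  const nCandidates = scoresByGrader[0].length;
  const totals = new Array<number>(nCandidates).fill(0);
  for (const scores of scoresByGrader) {
    if (scores.length !== nCandidates) throw new Error("every grader must score every candidate");
    for (let c = 0; c < nCandidates; c++) {
      // Points = how many other candidates this grader ranks strictly lower.
      let beaten = 0;
      for (let other = 0; other < nCandidates; other++) {
        if (scores[other] < scores[c]) beaten++;
      }
      totals[c] += beaten;
    }
  }
  return totals;
}

// Index of the candidate with the most Borda points (first one wins ties).
function bordaBestOfN(scoresByGrader: number[][]): number {
  const totals = bordaTotals(scoresByGrader);
  let best = 0;
  for (let c = 1; c < totals.length; c++) {
    if (totals[c] > totals[best]) best = c;
  }
  return best;
}
```

In the usage below, summing raw scores would favor the third candidate purely because one grader uses a larger scale; rank-based points pick the second.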

bisector.ts:
- bisect<T>: generic binary-search with caller-supplied halfway + runEval.
- commitBisect: ordered SHA list; narrows to adjacent (good, bad) pair.
- promptBisect: splits good/bad prompts into paragraphs, progressively
  replaces paragraphs to localize the offending change; returns the
  paragraph index that broke.
- Flags inputInconsistent when caller's good/bad premise is wrong.
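
The commit-bisection flow, including the premise check that sets `inputInconsistent`, might look like the sketch below. It is synchronous and uses a boolean `isGood` callback for brevity; the real `commitBisect` presumably wraps an async `runEval`, so treat the signature and result shape as assumptions.

```typescript
interface CommitBisectResult {
  lastGood: string | null;
  firstBad: string | null;
  inputInconsistent: boolean;
  evaluations: number; // how many times isGood was called
}

// shas must be ordered oldest to newest, with shas[0] believed good
// and the final entry believed bad.
function commitBisectSketch(shas: string[], isGood: (sha: string) => boolean): CommitBisectResult {
  let evaluations = 0;
  const check = (sha: string): boolean => {
    evaluations++;
    return isGood(sha);
  };
  // Verify the caller's premise before searching.
  if (shas.length < 2 || !check(shas[0]) || check(shas[shas.length - 1])) {
    return { lastGood: null, firstBad: null, inputInconsistent: true, evaluations };
  }
  let lo = 0; // known good
  let hi = shas.length - 1; // known bad
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (check(shas[mid])) lo = mid;
    else hi = mid;
  }
  // lo and hi are now adjacent: the last good and first bad commits.
  return { lastGood: shas[lo], firstBad: shas[hi], inputInconsistent: false, evaluations };
}
```

The search assumes a single good-to-bad transition; flaky evals would need repeated runs per commit before trusting the result.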

267 tests pass; typecheck + 214KB ESM build clean. Version bumped to
0.6.0 (npm publish is a follow-up).
… pre-registration

counterfactual.ts:
- runCounterfactual: records a meta-layer Run parented to the original,
  applies a mutation (swap-model / swap-tool-result / truncate-after /
  inject-system-message / custom) at turn N, delegates execution to a
  caller-supplied CounterfactualRunner, captures the paired score delta.
- attributeCounterfactuals: given a batch of CF results, ranks mutation
  kinds by mean absolute delta — surfaces which lever moves outcomes most.

cross-trace-diff.ts:
- crossTraceDiff: full LCS alignment of two trajectories (match / insert /
  delete / replace), per-step attribution using PRM verdicts when available
  (fallback: latency + token delta). Reports totalScoreDelta alongside
  prmDeltaSum so coverage gap is visible.
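
A minimal sketch of LCS-based trajectory alignment is shown below, using tool names as step keys. It emits match, insert, and delete operations only; collapsing an adjacent delete and insert into a replace, and attaching PRM or latency attribution per step, are left out. The `AlignOp` shape and function name are assumptions.

```typescript
interface AlignOp {
  kind: "match" | "insert" | "delete";
  a?: string; // step from the first trajectory
  b?: string; // step from the second trajectory
}

function lcsAlign(a: string[], b: string[]): AlignOp[] {
  const n = a.length;
  const m = b.length;
  // dp[i][j] = LCS length of a[i..] and b[j..].
  const dp: number[][] = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      dp[i][j] = a[i] === b[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  const ops: AlignOp[] = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (a[i] === b[j]) {
      ops.push({ kind: "match", a: a[i], b: b[j] });
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ kind: "delete", a: a[i] }); // step only in the first trajectory
      i++;
    } else {
      ops.push({ kind: "insert", b: b[j] }); // step only in the second trajectory
      j++;
    }
  }
  while (i < n) ops.push({ kind: "delete", a: a[i++] });
  while (j < m) ops.push({ kind: "insert", b: b[j++] });
  return ops;
}
```

Aligning before attributing means a score delta is charged to the step that actually diverged, not to whatever happens to sit at the same index.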

pre-registration.ts:
- signManifest: SHA-256 over canonicalized hypothesis — manifest becomes
  immutable.
- evaluateHypothesis: mechanical check of observed effect against
  pre-declared direction / minEffect / alpha / preRegisteredN. Tampered
  manifests throw. Rejection reasons are machine-tagged so you cannot
  silently re-interpret.

276 tests pass; typecheck + 220KB ESM build clean.
…arning + RM export

self-play.ts: runSelfPlay(proposer, scorer, targets). Scenarios survive
when they produce a score *spread* across targets AND every target stays
above a floor (rejects noise/gibberish). Survivors promoted to a Dataset
with origin='self-play'. Framework-agnostic about how candidates are
generated — caller supplies the proposer agent.
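
The survival rule above (a scenario must separate the targets while every target stays above a floor) can be sketched as a simple filter. The `minSpread` and `floor` defaults, the field names, and the function names are assumptions for illustration, not values from self-play.ts.

```typescript
interface ScoredScenario {
  scenarioId: string;
  scoresByTarget: number[]; // one score per target agent, assumed in [0, 1]
}

function survivesSelfPlay(s: ScoredScenario, minSpread = 0.2, floor = 0.1): boolean {
  if (s.scoresByTarget.length < 2) return false; // spread needs at least two targets
  const hi = Math.max(...s.scoresByTarget);
  const lo = Math.min(...s.scoresByTarget);
  // Spread shows the scenario discriminates; the floor rejects scenarios
  // that some target fails completely, which often signals noise or gibberish.
  return hi - lo >= minSpread && lo > floor;
}

function selectSurvivors(candidates: ScoredScenario[]): ScoredScenario[] {
  return candidates.filter((c) => survivesSelfPlay(c));
}
```

Scenarios that every target aces or every target fails carry no ranking signal, which is why spread rather than mean score drives survival.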

causal-attribution.ts: factorial-experiment variance decomposition.
causalAttribution(cells) returns per-factor main-effect share +
pairwise interaction share + residual — moves from 'variant B wins'
to 'model swap accounts for X%, prompt swap for Y%, interaction Z%.'
Minimal 2-way; extends to N-way via the same group-mean math.

active-learning.ts: proposeSynthesisTargets analyzes Dataset + trace
corpus, returns prioritized gaps — difficulty coverage, undersampled
scenarios, high-variance scenarios, failure-class clusters. Caller
feeds (neighbors, direction) into their LLM proposer for scenario
synthesis. No LLM call inside the framework; this finds WHERE to author,
not WHAT to write.

reward-model-export.ts: ExportedRewardModel wraps PRM training NDJSON
with metadata (nTraces, nSamples, rubrics, meanReward) for handoff to
a fine-tuning pipeline. loadScorerFromGrader returns a zero-deps
InferenceScorer so customers embed PRM as an inference-time judge
without fine-tuning. replayScorerOverCorpus validates a scorer against
its training corpus — drift indicates overfitting.

286 tests pass; typecheck + 244KB ESM build clean.

Regulator-facing reports derived from the trace corpus. Each framework
ships a GovernanceContext → GovernanceReport generator + a structured
finding list with severity + control reference + remediation. Markdown
renderer for external auditors; JSON for CI.

- nist-ai-rmf.ts: GOVERN-1.1/1.3, MEASURE-2.1/2.6/2.11, MANAGE-1.1.
  Derives findings from real framework state — missing red-team,
  missing outcomes, weak judge calibration, absent dataset hashes all
  surface as typed controls.
- soc2.ts: CC7.1–7.4 common-criteria evidence package. Failure rate,
  abort rate, incident events, distinct release fingerprints
  (codeSha/promptSha/modelFingerprint).
- eu-ai-act.ts: classifyEuAiRisk({ biometricPublic, annexIII, chatbot,
  generatesSyntheticMedia, ... }) → unacceptable|high|limited|minimal.
  Art. 5 prohibits outright; Art. 9/10/11/13/14/15 checked for high-risk
  deployments; Art. 52 transparency for limited-risk chatbots.

296 tests pass; typecheck + 244KB ESM build clean. All Tier 1 + Tier 2 +
Tier 3 + governance frontier work shipped in this session.