feat: 0.2.0 — nine framework primitives for stellar evals#1
Merged
Every hand-rolled eval in the fleet (tax/legal/gtm, plus a dozen future verticals) had to re-invent the same nine patterns. This release ships them as framework, with real tests, so consumers write <200 lines of domain config on top instead of 2,000+ lines of bespoke harness.

## What lands

1. PromptRegistry: versioned prompts with SHA-256 content hashing. get() throws on unknown keys (no silent defaults). Re-registering identical content is idempotent; conflicting content on the same version errors loudly so A/B history stays auditable.
2. TraceStore: pluggable per-LLM-call log (prompt, output, tokens, cost, timing). MemoryTraceStore for tests; FileSystemTraceStore for append-only NDJSON segments with rollover past 32 MB. Malformed lines are skipped so one bad row cannot kill a whole query.
3. AntiSlopJudge: deterministic pattern-based quality check, no LLM call. Catches banned phrases, hedging, apology padding, n-gram repetition, and length bounds. Emits a JudgeScore with dimension 'anti_slop' so it composes into BenchmarkRunner's judge array transparently. Counts every occurrence, not just the first match.
4. ArtifactValidator: interface for grading a produced file/JSON/blob. Built-ins: regexMatch, jsonHasKeys (partial credit), byteLengthRange (proportional score), containsAll (case-insensitive by default). composeValidators aggregates pass as AND, score as weighted mean, and namespaces issues by validator.
5. WorkspaceInspector: post-run state snapshot + assertion library. fileExists, fileContains, rowCount (proportional score inside/outside range), rowWhere (predicate-based row matching). runAssertions aggregates the results for a snapshot.
6. ExperimentTracker: groups runs, persists via pluggable ExperimentStore, computes per-scenario diff + aggregate delta + config delta between two completed runs. Not MLflow; the 20% that actually ships. Timeline view for drift tracking.
7. PromptOptimizer: A/B test N prompt variants with statistical rigor. Per-variant bootstrap CI + pairwise Mann-Whitney U. A winner is only marked 'significant' when it beats every other variant at the threshold, so no winners are declared on noise. Rejects <2 variants, empty scenarios, and non-finite scores.
8. DualAgentBench: proposer/critic convergence pattern lifted from tax-agent + legal-agent. Proposer sees the prior critique each round. Configurable max rounds + convergence threshold. Aggregate convergenceRate + avgRoundsToConverge on the report.
9. Expanded statistics: pairedTTest (before/after comparisons; pairing removes inter-item variance, and the p-value comes from a Student-t CDF approximation), wilcoxonSignedRank (non-parametric alternative), cohensD (standardized effect size). Complements the existing bootstrap CI, Mann-Whitney, and inter-rater reliability.

## Tests

11 test files, 119 passing (was 27). Every regression case names the specific bug it would catch.

## Deps

Adds @types/node to devDependencies (required for FileSystemTraceStore). No runtime deps change; the framework is still dependency-lean.

## Bump: 0.1.0 -> 0.2.0 (strictly additive; no existing API changed)
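The PromptRegistry rules described above (content hashing, idempotent re-registration, loud conflicts, no silent defaults) can be illustrated with a minimal sketch. The class name `PromptRegistrySketch`, the `PromptRecord` shape, and the `(key, version)` signatures are assumptions made for illustration; the actual API is not shown in this PR description.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; the real PromptRegistry types are not shown in this PR.
interface PromptRecord {
  key: string;
  version: string;
  content: string;
  sha256: string;
}

class PromptRegistrySketch {
  private readonly records = new Map<string, PromptRecord>();

  private static id(key: string, version: string): string {
    return `${key}@${version}`;
  }

  register(key: string, version: string, content: string): PromptRecord {
    const sha256 = createHash("sha256").update(content, "utf8").digest("hex");
    const id = PromptRegistrySketch.id(key, version);
    const existing = this.records.get(id);
    if (existing) {
      // Same content under the same version: idempotent, return the stored record.
      if (existing.sha256 === sha256) return existing;
      // Different content under the same version: refuse, so history stays auditable.
      throw new Error(`conflicting content for ${id}`);
    }
    const record: PromptRecord = { key, version, content, sha256 };
    this.records.set(id, record);
    return record;
  }

  get(key: string, version: string): PromptRecord {
    const record = this.records.get(PromptRegistrySketch.id(key, version));
    // No silent default: an unknown key is an error.
    if (!record) throw new Error(`unknown prompt ${key}@${version}`);
    return record;
  }
}
```

Keying on `key@version` and comparing hashes instead of raw strings keeps the conflict check cheap for long prompts; whether the framework does the same is not stated here.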
…arness

Ships the evaluation platform spine. Every score, failure class, and metric is now a view over a shared TraceSchema v1 corpus (Run + Span + Event + ToolCall + Retrieval + JudgeVerdict + BudgetLedger + Artifact). Pipelines query traces; they don't hook live agents. This unlocks retroactive re-scoring, cross-run diff, and prod/eval observability parity.

Chassis:
- trace/schema.ts: TraceSchema v1 (OTEL-compatible wire, agent-extended kinds)
- trace/store.ts: InMemoryTraceStore + FileSystemTraceStore (NDJSON, 32MB rollover)
- trace/emitter.ts: hierarchical span builder with auto-parenting + within()
- trace/query.ts: typed query helpers (spans by kind, argHash, aggregateLlm)
- trace/redact.ts: DEFAULT_REDACTION_RULES (email/ssn/card/bearer/private-key)
- trace/otel.ts: exportRunAsOtlp for Jaeger/Honeycomb/Langfuse ingest

Producers:
- sandbox-harness.ts: SubprocessSandboxDriver + DockerSandboxDriver + vitest/pytest/jest parsers; emits SandboxSpan
- test-graded-scenario.ts: SWE-bench pattern: score = testsPassed/testsTotal
- budget-guard.ts: enforces tokens/wallMs/calls/usd; BudgetBreachError + ledger
- failure-taxonomy.ts: 17-class enum + rule-based classifier
- trajectory.ts: DFS chronological view over run spans
- tool-use-metrics.ts: error/duplicate/retry rates + optional selectionLabels

Pipelines (views over traces):
- stuck-loop, tool-waste, budget-breach, failure-cluster, judge-agreement, first-divergence, regression

Auxiliary decision modules (also consumable over trace data):
- slo.ts (DEFAULT_AGENT_SLOS + critical/warning severity)
- baseline.ts (Welch's t-test + Cohen's d + per-side IQR stability)
- oracle.ts (textInSnapshot / urlContains / jsonShape / regex / notBlocked)
- pareto.ts, cost-tracker.ts, series-convergence.ts, state-continuity.ts

169 tests pass. Typecheck + tsup build clean. Legacy LlmTrace removed; nothing external consumed it yet.
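As an illustration of the budget-guard idea (a ledger across tokens/wallMs/calls/usd that raises on breach), here is a minimal sketch. Only the four dimension names and the `BudgetBreachError` name come from the commit message; `BudgetGuardSketch`, `charge`, and the error fields are hypothetical.

```typescript
type BudgetDim = "tokens" | "wallMs" | "calls" | "usd";
type Budget = Partial<Record<BudgetDim, number>>;

// Error name taken from the commit message; the fields are assumptions.
class BudgetBreachError extends Error {
  readonly dim: BudgetDim;
  readonly spent: number;
  readonly limit: number;

  constructor(dim: BudgetDim, spent: number, limit: number) {
    super(`budget breach on ${dim}: spent ${spent} > limit ${limit}`);
    this.name = "BudgetBreachError";
    this.dim = dim;
    this.spent = spent;
    this.limit = limit;
  }
}

class BudgetGuardSketch {
  // Running totals per dimension; this is the "ledger".
  readonly ledger: Record<BudgetDim, number> = { tokens: 0, wallMs: 0, calls: 0, usd: 0 };
  private readonly limits: Budget;

  constructor(limits: Budget) {
    this.limits = limits;
  }

  // Adds usage to the ledger and throws as soon as any limited dimension is exceeded.
  charge(usage: Budget): void {
    for (const dim of Object.keys(usage) as BudgetDim[]) {
      this.ledger[dim] += usage[dim] ?? 0;
      const limit = this.limits[dim];
      if (limit !== undefined && this.ledger[dim] > limit) {
        throw new BudgetBreachError(dim, this.ledger[dim], limit);
      }
    }
  }
}
```

In this sketch the breaching charge is still recorded before the error is thrown, so the ledger reflects what was actually spent.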
…ion + behavior DSL
Phase-2 deliverable: makes the framework defensible for external customers.
New primitives:
- Dataset: versioned scenarios with SHA-256 content hash + provenance, split
(train/dev/test/holdout), difficulty tiers, seeded-deterministic slice.
`HoldoutLockedError` prevents post-release mutation; jsonl serialization is
byte-stable for contamination-verifiable archives.
- ContaminationGuard: canary token checks, canaryLeakView over trace corpus,
HoldoutAuditor wraps holdout dataset with purpose-tagged access logging.
- RedTeamBattery: 9-case default corpus across 8 categories (prompt-injection
direct/indirect, jailbreak DAN/persona, PII, permission-escalation, data-exfil,
policy-override). Per-case scorer checks forbidden strings, forbidden tools,
refusal markers, and default PII regex rules.
- PowerAnalysis + FDR: requiredSampleSize (normal-approx, ~63 per arm for d=0.5
@ alpha=0.05 power=0.8), bonferroni, benjaminiHochberg with monotonicity
preserved. Acklam inverse-normal-CDF included; no deps.
- BehaviorDSL: expectAgent(store, runId).toCall('x').withArgs({q: /./})
.times(3) / .toRefuse() / .toOutputMatch(re) / .toRespectBudget(dim)
.toCompleteWithin({wallMs, toolCalls, llmTurns}) / .toNeverCall(tool).
runExpectations collects results without throwing.
- JudgeCalibration: calibrateJudge (pearson + quadratic weighted κ + MAE +
worst-5), positionalBias (A/B swap), verbosityBias (length-score correlation),
selfPreference (in-family vs out-of-family mean delta).
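The "~63 per arm" figure quoted for requiredSampleSize can be reproduced with the standard normal-approximation formula for a two-sided, two-sample comparison, n = 2(z_{1-α/2} + z_{power})² / d². The sketch below is illustrative only: it inverts the normal CDF by bisection over an Abramowitz-Stegun erf approximation rather than the Acklam approximation the commit mentions, and the function names are hypothetical.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Inverse normal CDF by bisection; a stand-in for Acklam, chosen for brevity.
function normalQuantile(p: number): number {
  if (!(p > 0 && p < 1)) throw new RangeError("p must be in (0, 1)");
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// Per-arm sample size for detecting a standardized effect d (two-sided test).
function requiredSampleSizeSketch(d: number, alpha: number, power: number): number {
  if (!(d > 0)) throw new RangeError("effect size d must be positive");
  const zAlpha = normalQuantile(1 - alpha / 2);
  const zPower = normalQuantile(power);
  return Math.ceil((2 * (zAlpha + zPower) ** 2) / (d * d));
}
```

With d = 0.5, alpha = 0.05, and power = 0.8 the quantiles are roughly 1.96 and 0.84, giving 2 × 2.80² / 0.25 ≈ 62.8, which rounds up to 63 per arm.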
Fixed:
- PromptOptimizer now applies BH-FDR correction across n*(n-1)/2 pairwise
tests before flagging "significant". Prior 0.2/0.3 behavior inflated
false-positive rate past alpha as variants grew past 2.
215 tests pass; typecheck + tsup build clean.
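To make the PromptOptimizer fix concrete, here is a minimal sketch of Benjamini-Hochberg adjustment with the monotonicity step, applied to the n*(n-1)/2 pairwise p-values. The function names are hypothetical and this is not the framework's implementation.

```typescript
// Number of pairwise comparisons among n variants.
function pairwiseTestCount(n: number): number {
  return (n * (n - 1)) / 2;
}

// Benjamini-Hochberg adjusted p-values, returned in the input order.
// Walking from the largest p-value down and carrying a running minimum
// enforces monotonicity: a smaller raw p never gets a larger adjusted p.
function benjaminiHochbergSketch(pValues: number[]): number[] {
  const m = pValues.length;
  const order = pValues.map((p, i) => ({ p, i })).sort((a, b) => a.p - b.p);
  const adjusted = new Array<number>(m).fill(0);
  let runningMin = 1;
  for (let rank = m; rank >= 1; rank--) {
    const { p, i } = order[rank - 1];
    runningMin = Math.min(runningMin, (p * m) / rank);
    adjusted[i] = runningMin;
  }
  return adjusted;
}

// A pair is flagged only if its adjusted p-value clears alpha.
function significantAfterFdr(pValues: number[], alpha: number): boolean[] {
  return benjaminiHochbergSketch(pValues).map((q) => q <= alpha);
}
```

For example, raw p-values [0.01, 0.04, 0.03, 0.005] adjust to approximately [0.02, 0.04, 0.04, 0.02]: the 0.03 entry is lifted to match its neighbor instead of dropping below a smaller raw value's adjusted score.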
… + visual-diff

Closes the 0.4 roadmap. Every piece consumes the trace corpus; zero duplicate accounting paths.

- ci-gate.ts: ThresholdContract → evaluateContract → renderMarkdownReport. Baseline vs candidate comparison (delegates to baseline.ts), SLO evaluation, per-breach reasons, GitHub-PR-ready markdown output. Drop-in for GH Actions.
- observability.ts: toLangfuseEnvelope (generations + scores, wire-compatible), toPrometheusText (8 counters + tool breakdown), replayTraceThroughJudge (the canonical retroactive re-scoring path; no agent re-execution needed).
- paraphrase.ts: paraphraseRobustness harness + 5 mutators (lowercase, sentence-reorder, typo, politeness-prefix, whitespace-collapse). Surfaces brittle prompts whose score depends on surface form.
- visual-diff.ts: pixel-delta scoring with configurable tolerance; maps to unchanged / changed / severely-changed. Zero deps: consumers pass decoded RGBA pixel arrays.

234 tests pass; typecheck + build (178KB ESM bundle) clean.

0.4.0 surface now covers: trace chassis, sandbox harness, 7 pipelines, failure taxonomy, budget guard, trajectory, tool-use metrics, SLO, baseline, oracle, pareto, cost tracker, state continuity, series convergence, dataset artifact, contamination guard, red-team battery, power analysis + FDR, behavior DSL, judge calibration, CI gate, Langfuse/Prometheus observability adapters, paraphrase robustness, visual diff. OTEL export ships everywhere via shared trace schema.
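The pixel-delta scoring in visual-diff.ts can be pictured with a small sketch over decoded RGBA arrays. The option names and the idea of thresholding on the fraction of changed pixels are assumptions for illustration; the commit only states that tolerance is configurable and that results map to the three verdicts.

```typescript
type DiffVerdict = "unchanged" | "changed" | "severely-changed";

// Hypothetical options; the real configuration surface is not shown in the commit.
interface VisualDiffOptions {
  channelTolerance: number; // per-channel delta (0-255) that is ignored
  changedAbove: number; // fraction of differing pixels that counts as "changed"
  severeAbove: number; // fraction that counts as "severely-changed"
}

interface VisualDiffResult {
  changedFraction: number;
  verdict: DiffVerdict;
}

function pixelDeltaSketch(
  a: ArrayLike<number>,
  b: ArrayLike<number>,
  opts: VisualDiffOptions,
): VisualDiffResult {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("inputs must be same-sized RGBA arrays");
  }
  const pixels = a.length / 4;
  let changed = 0;
  for (let p = 0; p < pixels; p++) {
    const offset = p * 4;
    // A pixel differs if any of its four channels moves by more than the tolerance.
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[offset + c] - b[offset + c]) > opts.channelTolerance) {
        changed++;
        break;
      }
    }
  }
  const changedFraction = pixels === 0 ? 0 : changed / pixels;
  let verdict: DiffVerdict = "unchanged";
  if (changedFraction > opts.severeAbove) verdict = "severely-changed";
  else if (changedFraction > opts.changedAbove) verdict = "changed";
  return { changedFraction, verdict };
}
```

Taking plain numeric arrays keeps the sketch dependency-free, in line with the "consumers pass decoded RGBA pixel arrays" note above.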
…l + correlation

The evaluation shape for agent-builder specifically. Every build is a three-layer nested Run tree: builder (L0, the Forge chat), app-build (L1, the sandbox that compiled + tested the generated scaffold), app-runtime (L2, domain scenarios run against the generated agent). The framework-critical signal is the meta-runtime correlation: does the builder's self-score predict real-world success? If r < 0.4 the builder is optimizing for the wrong thing.

Schema (backward-compat, all optional):
- Run.parentRunId: ties app-build → builder, app-runtime → app-build
- Run.projectId / chatId: group across chats + sessions
- Run.layer: 'builder' | 'app-build' | 'app-runtime' | 'meta' | 'custom'
- RunFilter + matchesRun + all callers updated

builder-eval/:
- BuilderSession: startChat → ship → runAppScenario, auto-parents child Runs, records meta-score as a JudgeVerdict on the builder Run.
- resumeBuilderSession: reconstructs the latest builder/build/runtime runs from the trace store; chat-first UIs call this on project open.
- three-layer-eval.scoreProject: meta/build/runtime scores + complete flag.
- correlation.correlateLayers: Pearson + Spearman across meta↔build, meta↔runtime, build↔runtime; surfaces the meta-runtime predictive coefficient.
- ProjectRegistry: listProjects, projectTimeline, projectChats; the read-side surface a chat-first dashboard consumes.

242 tests pass; typecheck + 189KB ESM build clean.
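The meta-runtime signal rests on two standard coefficients. The sketch below shows Pearson, Spearman (as Pearson over average ranks), and the r < 0.4 check from the message above; the function names are hypothetical and this is not the correlateLayers implementation.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Pearson product-moment correlation. Returns 0 when either side has no variance
// (a simplifying assumption of this sketch).
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) throw new Error("need two equal-length series");
  const mx = mean(x);
  const my = mean(y);
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < x.length; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  if (sxx === 0 || syy === 0) return 0;
  return sxy / Math.sqrt(sxx * syy);
}

// 1-based ranks; tied values share the average of the ranks they span.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((p, q) => p.v - q.v);
  const ranks = new Array<number>(values.length).fill(0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) ranks[order[k].i] = avg;
    start = end + 1;
  }
  return ranks;
}

function spearman(x: number[], y: number[]): number {
  return pearson(averageRanks(x), averageRanks(y));
}

// Threshold from the commit message: below 0.4 the builder self-score is not predictive.
function metaPredictsRuntime(metaScores: number[], runtimeScores: number[]): boolean {
  return pearson(metaScores, runtimeScores) >= 0.4;
}
```

Reporting both coefficients is useful because Spearman stays at 1 for any monotone relationship, while Pearson drops when the relationship is monotone but not linear.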
Two helpers that close the app-build/app-runtime recording gap when the
caller isn't using SandboxHarness (e.g. agent-builder's publish hook
does a preview-URL smoke test instead of a full test suite, and its
scenarios-run invokes the sandbox chat endpoint directly).
- startAppRuntime(scenarioId): returns a fresh emitter on a new
app-runtime Run parented to the latest build (or builder). Caller
emits spans inside and calls endRun() with the verdict.
- recordShipMarker({ pass, score, notes }): lightweight app-build Run
with caller-supplied outcome — for marker-style tracking without a
harness.
The three frontier primitives that close the "does eval predict reality + how do we attribute what moved the needle" loop.

meta-eval/:
- OutcomeStore (InMemory + FileSystem NDJSON): deployment outcomes keyed on runId: retention, CSAT, revenue, task-success, anything product telemetry exports. Lifecycle is peer to TraceStore, arrives async.
- correlationStudy: joins traces ↔ outcomes, computes Pearson + Spearman + bootstrap 95% CI per (evalMetric, outcomeMetric) pair. Verdict strong/moderate/weak. This is the one number that makes the framework trusted.
- calibrationCurve: equal-width or equal-frequency binning of (eval vs reality), Expected Calibration Error + max bin gap. Shows *where* the eval is over/underconfident, not just a single correlation coefficient.

prm/:
- StepRubric + PrmGrader: applies per-step rubrics to every span in a trajectory, emits a JudgeVerdict span per (rubric × span) so existing pipelines (judgeAgreementView, etc.) see PRM verdicts natively.
- Built-in rubrics: outputLength, toolSuccess, toolNonRedundant, nonRefusal, toolIntentAlignment. These are cheap rule-based reference graders.
- exportTrainingData: NDJSON of (step-context, score) pairs for downstream reward-model fine-tuning. Framework-agnostic; plug into TRL/Unsloth/etc.
- prmBestOfN + prmEnsembleBestOfN: inference-time candidate ranking via PRM; ensemble uses Borda count so different graders' score scales don't dominate.

bisector.ts:
- bisect<T>: generic binary-search with caller-supplied halfway + runEval.
- commitBisect: ordered SHA list; narrows to adjacent (good, bad) pair.
- promptBisect: splits good/bad prompts into paragraphs, progressively replaces paragraphs to localize the offending change; returns the paragraph index that broke.
- Flags inputInconsistent when caller's good/bad premise is wrong.

267 tests pass; typecheck + 214KB ESM build clean. Published version bump to 0.6.0 (npm publish is a follow-up).
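The commitBisect behavior described above (narrow an ordered SHA list to an adjacent good/bad pair, and flag a wrong premise) can be sketched as a plain binary search. The synchronous `isGood` callback and the result shape are assumptions for illustration; the real bisect<T> takes a caller-supplied halfway + runEval.

```typescript
// Hypothetical result shape for this sketch.
interface BisectSketchResult {
  good: string | null;
  bad: string | null;
  evaluations: number;
  inputInconsistent: boolean;
}

// `shas` is ordered oldest to newest; the premise is that the first is good and the last is bad.
function commitBisectSketch(
  shas: string[],
  isGood: (sha: string) => boolean,
): BisectSketchResult {
  if (shas.length < 2) throw new Error("need at least two commits");
  let lo = 0;
  let hi = shas.length - 1;

  // Verify the caller's premise before searching.
  const firstIsGood = isGood(shas[lo]);
  const lastIsGood = isGood(shas[hi]);
  let evaluations = 2;
  if (!firstIsGood || lastIsGood) {
    return { good: null, bad: null, evaluations, inputInconsistent: true };
  }

  // Invariant: shas[lo] is good and shas[hi] is bad.
  while (hi - lo > 1) {
    const mid = lo + Math.floor((hi - lo) / 2);
    evaluations++;
    if (isGood(shas[mid])) lo = mid;
    else hi = mid;
  }
  return { good: shas[lo], bad: shas[hi], evaluations, inputInconsistent: false };
}
```

Checking both endpoints up front costs two evaluations but avoids returning a confident answer when the regression is not actually inside the range.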
… pre-registration

counterfactual.ts:
- runCounterfactual: records a meta-layer Run parented to the original, applies a mutation (swap-model / swap-tool-result / truncate-after / inject-system-message / custom) at turn N, delegates execution to a caller-supplied CounterfactualRunner, captures the paired score delta.
- attributeCounterfactuals: given a batch of CF results, ranks mutation kinds by mean absolute delta; surfaces which lever moves outcomes most.

cross-trace-diff.ts:
- crossTraceDiff: full LCS alignment of two trajectories (match / insert / delete / replace), per-step attribution using PRM verdicts when available (fallback: latency + token delta). Reports totalScoreDelta alongside prmDeltaSum so coverage gap is visible.

pre-registration.ts:
- signManifest: SHA-256 over canonicalized hypothesis; manifest becomes immutable.
- evaluateHypothesis: mechanical check of observed effect against pre-declared direction / minEffect / alpha / preRegisteredN. Tampered manifests throw. Rejection reasons are machine-tagged so you cannot silently re-interpret.

276 tests pass; typecheck + 220KB ESM build clean.
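The ranking step in attributeCounterfactuals (order mutation kinds by mean absolute score delta) is simple enough to sketch directly. The mutation kind names come from the commit message; the `CfResultSketch` fields and function name are hypothetical.

```typescript
type MutationKind =
  | "swap-model"
  | "swap-tool-result"
  | "truncate-after"
  | "inject-system-message"
  | "custom";

// Hypothetical result shape: one paired (original, counterfactual) score per run.
interface CfResultSketch {
  kind: MutationKind;
  originalScore: number;
  counterfactualScore: number;
}

interface KindAttribution {
  kind: MutationKind;
  n: number;
  meanAbsDelta: number;
}

function attributeCounterfactualsSketch(results: CfResultSketch[]): KindAttribution[] {
  const sums = new Map<MutationKind, { total: number; n: number }>();
  for (const r of results) {
    const entry = sums.get(r.kind) ?? { total: 0, n: 0 };
    // Absolute delta: a lever matters whether it helps or hurts.
    entry.total += Math.abs(r.counterfactualScore - r.originalScore);
    entry.n += 1;
    sums.set(r.kind, entry);
  }
  const out: KindAttribution[] = [];
  sums.forEach((entry, kind) => {
    out.push({ kind, n: entry.n, meanAbsDelta: entry.total / entry.n });
  });
  // Largest mean absolute delta first.
  return out.sort((a, b) => b.meanAbsDelta - a.meanAbsDelta);
}
```

Using the absolute delta means a mutation that sometimes helps and sometimes hurts still ranks high, which matches the "which lever moves outcomes most" framing.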
…arning + RM export

self-play.ts: runSelfPlay(proposer, scorer, targets). Scenarios survive when they produce a score *spread* across targets AND every target stays above a floor (rejects noise/gibberish). Survivors promoted to a Dataset with origin='self-play'. Framework-agnostic about how candidates are generated; the caller supplies the proposer agent.

causal-attribution.ts: factorial-experiment variance decomposition. causalAttribution(cells) returns per-factor main-effect share + pairwise interaction share + residual. This moves from 'variant B wins' to 'model swap accounts for X%, prompt swap for Y%, interaction Z%.' Minimal 2-way; extends to N-way via the same group-mean math.

active-learning.ts: proposeSynthesisTargets analyzes Dataset + trace corpus, returns prioritized gaps: difficulty coverage, undersampled scenarios, high-variance scenarios, failure-class clusters. Caller feeds (neighbors, direction) into their LLM proposer for scenario synthesis. No LLM call inside the framework; this finds WHERE to author, not WHAT to write.

reward-model-export.ts: ExportedRewardModel wraps PRM training NDJSON with metadata (nTraces, nSamples, rubrics, meanReward) for handoff to a fine-tuning pipeline. loadScorerFromGrader returns a zero-deps InferenceScorer so customers embed PRM as an inference-time judge without fine-tuning. replayScorerOverCorpus validates a scorer against its training corpus; drift indicates overfitting.

286 tests pass; typecheck + 244KB ESM build clean.
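The variance decomposition behind causalAttribution can be illustrated with the textbook two-way sum-of-squares split. This sketch assumes a balanced design (the same number of observations in every cell), which is what makes the four shares add up to 1; the `CellObservation` shape and function names are hypothetical.

```typescript
// One observation from a two-factor experiment, e.g. a = model, b = prompt.
interface CellObservation {
  a: string;
  b: string;
  score: number;
}

interface AttributionShares {
  factorA: number;
  factorB: number;
  interaction: number;
  residual: number;
}

function meanOf(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function groupMeans(
  cells: CellObservation[],
  keyOf: (c: CellObservation) => string,
): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const c of cells) {
    const key = keyOf(c);
    const bucket = groups.get(key) ?? [];
    bucket.push(c.score);
    groups.set(key, bucket);
  }
  const means = new Map<string, number>();
  groups.forEach((scores, key) => means.set(key, meanOf(scores)));
  return means;
}

// Two-way sum-of-squares decomposition, assuming a balanced design.
function causalAttributionSketch(cells: CellObservation[]): AttributionShares {
  const grand = meanOf(cells.map((c) => c.score));
  const cellKey = (c: CellObservation) => `${c.a}|${c.b}`;
  const aMeans = groupMeans(cells, (c) => c.a);
  const bMeans = groupMeans(cells, (c) => c.b);
  const cellMeans = groupMeans(cells, cellKey);

  let ssA = 0, ssB = 0, ssAB = 0, ssRes = 0, ssTotal = 0;
  for (const c of cells) {
    const ma = aMeans.get(c.a)!;
    const mb = bMeans.get(c.b)!;
    const mc = cellMeans.get(cellKey(c))!;
    ssA += (ma - grand) ** 2; // main effect of factor A
    ssB += (mb - grand) ** 2; // main effect of factor B
    ssAB += (mc - ma - mb + grand) ** 2; // what the main effects do not explain
    ssRes += (c.score - mc) ** 2; // within-cell noise
    ssTotal += (c.score - grand) ** 2;
  }
  if (ssTotal === 0) return { factorA: 0, factorB: 0, interaction: 0, residual: 0 };
  return {
    factorA: ssA / ssTotal,
    factorB: ssB / ssTotal,
    interaction: ssAB / ssTotal,
    residual: ssRes / ssTotal,
  };
}
```

For a 2×2 grid with scores 0.5, 0.6, 0.7, 0.8 (model m1/m2 by prompt p1/p2, purely additive), the sketch attributes 80% to the model swap, 20% to the prompt swap, and nothing to interaction or residual.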
Regulator-facing reports derived from the trace corpus. Each framework
ships a GovernanceContext → GovernanceReport generator + a structured
finding list with severity + control reference + remediation. Markdown
renderer for external auditors; JSON for CI.
- nist-ai-rmf.ts: GOVERN-1.1/1.3, MEASURE-2.1/2.6/2.11, MANAGE-1.1.
Derives findings from real framework state — missing red-team,
missing outcomes, weak judge calibration, absent dataset hashes all
surface as typed controls.
- soc2.ts: CC7.1–7.4 common-criteria evidence package. Failure rate,
abort rate, incident events, distinct release fingerprints
(codeSha/promptSha/modelFingerprint).
- eu-ai-act.ts: classifyEuAiRisk({ biometricPublic, annexIII, chatbot,
generatesSyntheticMedia, ... }) → unacceptable|high|limited|minimal.
Art. 5 prohibits outright; Art. 9/10/11/13/14/15 checked for high-risk
deployments; Art. 52 transparency for limited-risk chatbots.
296 tests pass; typecheck + 244KB ESM build clean. All Tier 1 + Tier 2 +
Tier 3 + governance frontier work shipped in this session.
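A rough sketch of how the EU AI Act tiering could be expressed as a precedence check. Only the four flag names listed in the commit and the article numbers it cites are used; the mapping of each flag to a tier, the precedence order, and the function names are assumptions for illustration, not the logic in eu-ai-act.ts and not legal guidance.

```typescript
type EuAiRisk = "unacceptable" | "high" | "limited" | "minimal";

// Only the flags named in the commit message; the elided ones are not guessed at.
interface EuAiRiskFlagsSketch {
  biometricPublic?: boolean;
  annexIII?: boolean;
  chatbot?: boolean;
  generatesSyntheticMedia?: boolean;
}

// Assumed precedence: the most severe matching tier wins.
function classifyEuAiRiskSketch(flags: EuAiRiskFlagsSketch): EuAiRisk {
  if (flags.biometricPublic) return "unacceptable";
  if (flags.annexIII) return "high";
  if (flags.chatbot || flags.generatesSyntheticMedia) return "limited";
  return "minimal";
}

// Article references as cited in the commit message, keyed by tier.
function articlesForTierSketch(tier: EuAiRisk): string[] {
  switch (tier) {
    case "unacceptable":
      return ["Art. 5"];
    case "high":
      return ["Art. 9", "Art. 10", "Art. 11", "Art. 13", "Art. 14", "Art. 15"];
    case "limited":
      return ["Art. 52"];
    case "minimal":
      return [];
  }
}
```

Evaluating the most severe tier first means a system that is both a chatbot and an Annex III deployment is checked against the high-risk articles, not only the transparency one.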
Summary
Ships the nine primitives every hand-rolled eval in tax/legal/gtm had to reinvent, as framework with real regression coverage.
Capabilities added
Tests
11 files, 119 passing (up from 27). Every regression case names the specific bug it would catch.
Diff
API stability
Strictly additive. Every 0.1.0 export still works unchanged.
Migration plan (separate PRs)
Don't migrate tax/legal/gtm until their existing features are reproducible in <200 lines of domain config on top of 0.2.0 — bar set per Drew's 'strictly better first' directive. Film-agent will get migrated first as the proof case.