feat(0.7.2): extract muffled-gate scanner + CostTracker.recordVerdict#7

Merged
drewstone merged 5 commits into main from feat/muffled-gate-testing-util
Apr 24, 2026
Conversation

@drewstone
Contributor

What

Promote two reusable primitives out of starter-foundry so every agent-eval consumer gets them for free.

1. scanForMuffledGates + default finders

Test helper that greps consumer source for gate/measurement anti-patterns. The 5 default finders cover the common forms (fallback-to-pass, literal-true-pass, auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd). It supports per-file context-specific finders and an auto-derived scan across importers. A `// muffle-ok: <reason>` comment is the opt-out escape hatch.

Pattern is documented at starter-foundry/.evolve/patterns/muffled-gate.md (both gating + measurement layers). 10+ incidents in starter-foundry motivated this; same class hits every consumer.
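
To make the idea concrete, here is a minimal sketch of a line-based scanner in the same spirit. Everything in it is an assumption for illustration: `scanSource`, `SKETCH_FINDERS`, the `Finder` shape, and both regexes are hypothetical and are not the exports or patterns shipped in src/muffled-gate-scanner.ts. Only the finder names, the {file, line, pattern} finding shape, and the `muffle-ok: <reason>` opt-out come from this PR.

```typescript
// Hypothetical sketch: names and regexes are illustrative, not the package's API.
interface Finding {
  file: string;
  line: number; // 1-indexed line of the match
  pattern: string; // name of the finder that matched
}

interface Finder {
  name: string;
  regex: RegExp; // no `g` flag, so .test() stays stateless across lines
}

// Two illustrative finders. The real defaults may match differently.
const SKETCH_FINDERS: Finder[] = [
  // e.g. `verdict?.passed ?? true` or `result.ok || true`
  { name: "fallback-to-pass", regex: /(\?\?|\|\|)\s*true\b/ },
  // e.g. a hard-coded `passed: true` in a gate result
  { name: "literal-true-pass", regex: /\b(pass|passed)\s*:\s*true\b/ },
];

// A line is exempt only when the annotation carries a non-empty reason.
const MUFFLE_OK = /\/\/\s*muffle-ok:\s*\S+/;

function scanSource(
  file: string,
  source: string,
  finders: Finder[] = SKETCH_FINDERS,
): Finding[] {
  const findings: Finding[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    if (MUFFLE_OK.test(text)) return; // annotated opt-out
    for (const finder of finders) {
      if (finder.regex.test(text)) {
        findings.push({ file, line: index + 1, pattern: finder.name });
      }
    }
  });
  return findings;
}

// Usage: a consumer test would typically assert that the findings list is empty.
const sample = [
  "const passed = verdict?.passed ?? true;",
  "return { passed: true };",
  "const ok = cached ?? true; // muffle-ok: cache warm-up only",
  "return { passed: score >= 0.8 };",
].join("\n");

const sampleFindings = scanSource("example.ts", sample);
```

A regex-per-line design keeps the helper cheap enough to run in every consumer's test suite, at the cost of missing patterns that span several lines.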

2. CostTracker.recordVerdict(verdict, scenarioId, tags?)

Convenience wrapper over record + markOutcome for {usage, verdict}-shaped judge responses. Returns null + no-ops when verdict has no usage (compile-gate short-circuits don't spend). Starter-foundry's agent-eval-scaffold.mjs hand-rolls this 3-line pattern per seed; now one call.
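
The wrapper's behavior can be sketched as follows. This is a hedged illustration, not the code in src/cost-tracker.ts: the `Usage`, `JudgeVerdict`, and `CostEntry` shapes, the `"pass" | "fail"` verdict values, the `totalCostUsd` helper, and the signatures of `record` and `markOutcome` are all assumptions. Only the `recordVerdict(verdict, scenarioId, tags?)` signature and the return-null-when-no-usage rule come from this PR.

```typescript
// Hypothetical shapes for illustration; the package's real types may differ.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

interface JudgeVerdict {
  verdict: "pass" | "fail"; // assumed outcome values
  usage?: Usage; // absent when the judge was never called
}

interface CostEntry {
  id: number;
  scenarioId: string;
  usage: Usage;
  tags: string[];
  outcome?: "pass" | "fail";
}

class SketchCostTracker {
  private entries: CostEntry[] = [];

  record(usage: Usage, scenarioId: string, tags: string[] = []): CostEntry {
    const entry: CostEntry = { id: this.entries.length, scenarioId, usage, tags };
    this.entries.push(entry);
    return entry;
  }

  markOutcome(id: number, outcome: "pass" | "fail"): void {
    const entry = this.entries[id];
    if (!entry) throw new Error(`unknown entry ${id}`);
    entry.outcome = outcome;
  }

  // One call instead of the hand-rolled guard + record + markOutcome sequence.
  recordVerdict(verdict: JudgeVerdict, scenarioId: string, tags?: string[]): CostEntry | null {
    if (!verdict.usage) return null; // nothing was spent, so nothing is recorded
    const entry = this.record(verdict.usage, scenarioId, tags);
    this.markOutcome(entry.id, verdict.verdict);
    return entry;
  }

  totalCostUsd(): number {
    return this.entries.reduce((sum, e) => sum + e.usage.costUsd, 0);
  }
}

// Usage: a judged seed is recorded, a short-circuited one is a no-op.
const tracker = new SketchCostTracker();
const judged = tracker.recordVerdict(
  { verdict: "pass", usage: { inputTokens: 10, outputTokens: 5, costUsd: 0.02 } },
  "seed-1",
  ["judge"],
);
const shortCircuited = tracker.recordVerdict({ verdict: "fail" }, "seed-2");
```

Returning null rather than throwing lets callers pass every verdict through the same call site without adding their own guard.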

Impact

  • agent-eval becomes the canonical home for both primitives
  • starter-foundry can replace its hand-rolled copy with an import (follow-up PR)
  • any future consumer (BA, GTM agent, third-party) gets both for free

Test plan

  • pnpm build — clean
  • pnpm test — 336/336 (+12 new: 7 scanner + 4 recordVerdict + 1 absorbed)
  • No breaking changes; purely additive exports
  • No Co-Authored-By

…or fallbacks

Motivation: meta-analysis of starter-foundry's Gen 6→Round-0-post-Gen-9
arc surfaced 10+ incident-driven lessons about using this package. They
lived nowhere canonical (README/CLAUDE.md described a stale 0.2-era API
surface; the actual v0.7 builder-of-builders + sandbox harness exports
had zero usage docs). Two shipped bugs traced to the same driver
construct-vs-call cwd footgun. Consolidating into one authoritative
doc + closing the footgun at the source.

Changes:

- .claude/skills/agent-eval/SKILL.md (NEW, sole source of truth)
  - minimal builder-of-builders path
  - 4 footguns (cwd-in-constructor, fallback-to-pass,
    fidelity-without-compile-gate, blob-vs-files channel)
  - 3 rules (both gates, single-source dispatch, Phase 1.5 walks entry
    points)
  - three-layer eval contract (builder → app-build → app-runtime)
  - regression tests every consumer should carry
  - extend-don't-duplicate index over the 100+ exports
  - muffled-gate pattern catalog (7 sub-shapes from shipped bugs)

- README.md + CLAUDE.md → pointers to SKILL.md. No duplicated content.

- SubprocessSandboxDriver constructor now accepts `{cwd?, env?}` as
  FALLBACKS when HarnessConfig omits them. Per-call config always wins.
  Pre-0.7.1 the constructor took no declared args, so TS tolerated
  `new Driver({cwd})` and silently dropped the arg at runtime — the
  exact shape of the Gen 8b promoter + Round-0 runtime eval bugs in
  starter-foundry. 0.7.1 makes the natural misuse do the obvious thing.
  New type: `SubprocessDriverDefaults`. Zero breaking changes for
  code that already reads cwd from HarnessConfig (the documented path).

- tests/sandbox-harness.test.ts: +3 tests guarding the new defaults
  contract — default.cwd honored, per-call wins over default,
  defaults.env merges correctly.
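
A minimal sketch of the precedence rule described in the two items above. The class and type names are hypothetical stand-ins, not the package's implementation, and the key-level env behavior (per-call keys overriding default keys on conflict) is an assumption inferred from "per-call config always wins".

```typescript
// Hypothetical sketch of the fallback rule; not the package's implementation.
interface SketchDriverOptions {
  cwd?: string;
  env?: Record<string, string>;
}

interface SketchHarnessConfig {
  cwd?: string;
  env?: Record<string, string>;
}

class SketchDriver {
  private readonly defaults: SketchDriverOptions;

  constructor(defaults: SketchDriverOptions = {}) {
    this.defaults = defaults;
  }

  // Resolve the effective settings for one call.
  resolve(config: SketchHarnessConfig): { cwd: string | undefined; env: Record<string, string> } {
    return {
      // Per-call cwd wins; the constructor value is only a fallback.
      cwd: config.cwd ?? this.defaults.cwd,
      // Assumed merge order: defaults first, per-call keys override on conflict.
      env: { ...this.defaults.env, ...config.env },
    };
  }
}

// Usage
const driver = new SketchDriver({ cwd: "/work/app", env: { A: "1", B: "2" } });
const fromDefaults = driver.resolve({});
const perCall = driver.resolve({ cwd: "/tmp/run", env: { B: "override" } });
```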

322/322 tests pass (was 319; +3 new). typecheck clean.
Version: 0.7.0 → 0.7.1.
PR #4 shipped the same SubprocessSandboxDriver constructor-fallback fix
while this branch was open. Resolved:
- src/sandbox-harness.ts + tests/sandbox-harness.test.ts: take main's version (functionally equivalent; type named SubprocessSandboxDriverOptions instead of SubprocessDriverDefaults — main's name is better, already shipped)
- src/index.ts: export SubprocessSandboxDriverOptions (main forgot to export the new type)
- tests: also fix the env-merge test's printenv form — BSD printenv on macOS only prints the first matched var, making the test platform-flaky. Switch to env|grep which survives missing vars.

Net: keep SKILL.md + README/CLAUDE pointers + version bump + type re-export + macOS test fix. All 322 tests pass.
… helper

Two reusable primitives promoted out of starter-foundry so every
agent-eval consumer gets them for free:

1) scanForMuffledGates() + DEFAULT_FINDERS + UNIVERSAL_FINDERS
   (src/muffled-gate-scanner.ts, exported from index)

   Test helper that greps consumer source for gate/measurement
   anti-patterns and returns {file, line, pattern} findings. 5
   default finders (fallback-to-pass, literal-true-pass,
   auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd).
   Supports per-file context-specific finders + auto-derived scan
   across importers of a target string (e.g. '@tangle-network/agent-eval').
   `muffle-ok: <reason>` annotation is the opt-out escape hatch.

   Pattern documented at starter-foundry/.evolve/patterns/muffled-gate.md
   (both gating + measurement layers). 10+ incidents in starter-foundry
   motivated this; any agent-eval consumer hits the same class.

2) CostTracker.recordVerdict(verdict, scenarioId, tags?)
   (src/cost-tracker.ts)

   Convenience: record + markOutcome in one call from a
   {usage, verdict}-shaped judge response. Returns null + no-ops when
   verdict has no usage (e.g. compile-gate short-circuit) so callers
   don't need their own guard. Starter-foundry's agent-eval-scaffold.mjs
   hand-rolls this 3-line pattern per seed; now one call.

Tests: +12 (7 scanner + 4 recordVerdict + 1 absorbed). 336/336 pass.
Build clean. Version 0.7.1 → 0.7.2. No breaking changes; purely
additive exports.