Eval framework: live API runner, WORKFLOWS.md SSoT, interactive HTML dashboard by ArtyETH06 · Pull Request #71 · leadbay/leadclaw

ArtyETH06 · 2026-05-26T21:25:56Z

What this PR does

Replaces the fixture-based MCP eval framework with a live, skill-driven runner where WORKFLOWS.md is the only file you ever edit to add or change a test.

Key changes:

WORKFLOWS.md — SSoT format: each workflow row has a yaml expected contract block (required/forbidden calls, success criteria) and a yaml scenario block (prompt). The /eval skill reads these directly.
Deleted ~3,400 lines of TypeScript eval infrastructure replaced by the /eval skill.
packages/mcp/test/eval/helpers/gen-dashboard.py — self-contained Python script generating a single-file HTML dashboard at .context/evals/eval-report.html.
live-mcp-server.ts — thin stdio MCP bridge for real Leadbay API sessions.
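As a rough illustration of the SSoT format described above, the sketch below pulls the tagged yaml blocks out of a markdown document. It is not the /eval skill's actual parser. The `prompt` key and the sample content are assumptions; only the `yaml expected` / `yaml scenario` fence tags and the `required_calls` field come from this PR.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches FENCE + "yaml expected" or "yaml scenario", the body, then the closing FENCE.
BLOCK_RE = re.compile(
    re.escape(FENCE) + r"yaml[ \t]+(expected|scenario)[ \t]*\n(.*?)\n" + re.escape(FENCE),
    re.DOTALL,
)


def extract_blocks(markdown: str) -> list:
    """Return (kind, body) pairs for every tagged yaml block, in document order."""
    return [(m.group(1), m.group(2)) for m in BLOCK_RE.finditer(markdown)]


# Hypothetical WORKFLOWS.md fragment, for illustration only.
sample = "\n".join([
    "## Workflow 1",
    FENCE + "yaml expected",
    "required_calls: [leadbay_pull_leads]",
    FENCE,
    FENCE + "yaml scenario",
    "prompt: Show me new leads",
    FENCE,
])

blocks = extract_blocks(sample)
```

A real implementation would hand each body to a YAML parser; the sketch stops at extraction so it needs only the standard library.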

How to run evals

/eval --workflow 1
/eval --workflow 1,3,5
/eval

Prerequisites: .env.eval at repo root with LEADBAY_TOKEN=u.xxx and LEADBAY_REGION=us.
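A minimal sketch of a pre-flight check for the prerequisites above, assuming .env.eval is a plain KEY=VALUE file. The helper names `parse_env` and `missing_keys` are hypothetical and not part of the repo.

```python
REQUIRED_KEYS = ("LEADBAY_TOKEN", "LEADBAY_REGION")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_keys(env: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


env = parse_env("LEADBAY_TOKEN=u.xxx\nLEADBAY_REGION=us\n")
problems = missing_keys(env)  # an empty list means the file is usable
```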

For unattended runs: claude --dangerously-skip-permissions

Self-improving evals (new)

Add --improve to automatically fix any workflow scoring below 5/5:

/eval --workflow 5 --improve

Flow:

Runs eval as normal and scores MM (mission_match), IA (instruction_adherence), NF (no_fabrication), and TSF (tool_selection_fit)
If all 5/5 → prints ✓ and stops
If any < 5 → loads /relentless and immediately executes the self-improvement loop inline (no hand-off, same session):
- Edits packages/promptforge/prompts/<prompt_name>.md.tmpl
- Rebuilds with pnpm prompts:build
- Re-evals, reads the JSON result, iterates until 5/5
Regression guard: once target reaches 5/5, runs /eval --workflow <others> to confirm no regressions
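The control flow above can be sketched as a small loop. This is an illustration, not the skill's implementation: `run_eval` and `apply_fix` are hypothetical callables standing in for "run /eval and read the JSON scores" and "edit the .md.tmpl and rebuild", and the `max_iterations` cap is an assumption added so the sketch always terminates.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, int]  # e.g. {"MM": 5, "IA": 5, "NF": 5, "TSF": 4}


def improve_until_perfect(
    run_eval: Callable[[], Scores],
    apply_fix: Callable[[Scores], None],
    max_iterations: int = 10,
) -> Tuple[Scores, int]:
    """Re-run the eval after each fix until every score is 5 or the cap is hit."""
    scores = run_eval()
    iterations = 0
    while any(value < 5 for value in scores.values()) and iterations < max_iterations:
        apply_fix(scores)    # edit the prompt template and rebuild
        scores = run_eval()  # re-eval and read the new scores
        iterations += 1
    return scores, iterations
```

The regression guard would then be one more round of evals on the other workflows after this function returns all 5s.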

What gets improved: prompt .md.tmpl source files only — never .generated.ts directly.

Dashboard changes

Last session chip — groups all eval files within 60 minutes of the newest file.
🔄 Self-improve chip — filters to all runs containing self-improvement workflow entries.
[✓/✗] in phase gate output (was [x/ ]).
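The "Last session" grouping can be sketched as follows, assuming each eval file has already been mapped to a timestamp. The function name and the inclusive 60-minute cutoff are assumptions; gen-dashboard.py may differ in detail.

```python
from datetime import datetime, timedelta


def last_session(timestamps: list, window_minutes: int = 60) -> list:
    """Return all timestamps within `window_minutes` of the newest one, oldest first."""
    if not timestamps:
        return []
    newest = max(timestamps)
    cutoff = newest - timedelta(minutes=window_minutes)
    return sorted(t for t in timestamps if t >= cutoff)
```

Grouping by a time window rather than by a single newest file is what lets all iterations of one self-improvement run show up together.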

Validated in this session

Ran /eval --workflow 5 --improve end-to-end:

Detected TSF:4 (redundant async bulk_qualify call)
Loaded relentless, executed 5 iterations autonomously
Edited leadbay_qualify_top_n.md.tmpl — added wait_for_completion=true, BulkTracker resilience rule, exhausted=true status line format, explicit pull_leads justification
Reached MM:5 IA:5 NF:5 TSF:5 in iteration 5
Workflows 1 and 3 confirmed no regression

Changes since last review

Removed workflow 2b and the hard disambiguation rule from leadbay_followup_check_in — "best leads for today" routes naturally to Discover; context-sensitive routing should be learned from user memory, not hard-pinned
Moved gen-dashboard.py from repo root to packages/mcp/test/eval/helpers/
Removed committed plan/spec files (session artifacts, not repo artifacts)
Resolved merge conflict with main (leadbay_remove_leads_from_campaign fix)

… guard

- Add cli-session-runner.ts: drives multi-turn eval sessions via the claude CLI using OAuth subscription auth (no ANTHROPIC_API_KEY required). Parses stream-json events, strips mcp__ tool name prefixes, captures final text from result events with lastAssistantText fallback.
- Add fixture-mcp-server.ts: standalone MCP stdio server that patches node:https before importing LeadbayClient, serving all backend requests from EVAL_FIXTURES (base64 JSON). Enables realistic tool execution without a live backend.
- Rewrite mission-match-judge.ts: agentic per-criterion loop (SDK path) and single-shot with full evidence dump (CLI path). Adds per_criterion verdicts to evidence L3.
- Add CriterionVerdict to evidence.ts and per_criterion field to MCPEvidence.
- Add llm-judge-shared.ts: shared callJudgeAuto() that auto-selects SDK vs CLI judge backend, hasCLI() detection, makeAnthropicClientIfAvailable().
- Update run-eval.ts: auto-selects SDK runner (ANTHROPIC_API_KEY present) vs CLI runner (subscription-only). Adds backend_requests pyramid validation.
- Fix vitest.eval.config.ts: switched to pool:threads + singleThread:true for correct @leadbay/* workspace package resolution; removed broken vite aliases.
- Add widget-overdelivery-guard scenario + eval: verifies the daily check-in agent stops before drafting/sending outreach without explicit user consent.

Fixtures use correct /1.5 paths matching LeadbayClient's URL construction. All 251 existing tests continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Token counters were declared inside the try block but referenced after the finally in the return statement, causing a ReferenceError. Moved declarations before the try block. Added scoring rubric anchors to the CLI single-shot judge prompt so agents acknowledging tool errors are not penalised as fabrication (no_fabrication=5 when agent correctly says "I don't have data").

Shows mission_match/instruction_adherence/no_fabrication/tool_selection_fit scores, per-criterion pass/fail with reasons, tools called sequence, and duration — so failures are immediately visible without digging into transcripts.

Allows both call signatures: new LeadbayClient("https://...", "token") new LeadbayClient({ baseUrl: "https://...", bearer: "token" }) Fixes eval tests that use the options-object form.

…hs, skip routing classifier without API key

- Judge: clarify verdict() semantics: pass=true when confirmed, pass=false when absent; reasoning must agree with boolean; explicit examples for tool-call and byproduct criteria
- Judge: show reasoning for all criteria in console output (not only failures)
- Judge: add no_fabrication rubric note that rendering fixture data is not fabrication
- Scenarios: rewrite all 10 broken scenarios from defunct /v1/... paths to correct /1.5/... paths matching LeadbayClient.request() which prepends /1.5 to every call
- tool-routing-classifier: skip gracefully when ANTHROPIC_API_KEY is absent

Claude Code transparently handles auth (subscription or API key) for `claude -p` subprocesses, so no ANTHROPIC_API_KEY branching is needed. Removes ~800 lines of dead SDK runner and agentic judge code.

- llm-judge-shared: CLI-only callJudge, drop SDK client factory
- mission-match-judge: single-shot `claude -p` judge prompt; remove entire agentic multi-turn SDK loop (runAgenticJudgeSDK, evidence tools, buildAgentSystemPrompt). Rename buildFallbackPrompt → buildJudgePrompt (now the only path).
- drift-judge: remove Anthropic import and client? field
- run-eval: setupScenarioFixtures is a no-op; always use CLI runner; remove SDK runner import and ANTHROPIC_API_KEY check
- touchfiles: update GLOBAL_TOUCHFILES to cli-session-runner.ts
- eval files: rewrite daily-check-in + import-file evals from 150+ line inline SDK sessions to clean ~25 line runScenarioEval calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

session-runner.ts (SDK-based runner) and tool-routing-classifier.eval.ts (SDK-only test) are unreachable since the framework moved to CLI-only execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Move the no_fabrication rule above the scoring table with an explicit bulleted list of what is NOT fabrication: score bars, tool-response rendering, stop phrases, summarisation. The judge was consistently scoring 4/5 for rendering fixture-grounded markdown (▰❖▱ bars, emails, company names), which the rubric always intended to be a 5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

milstan

It does work, but tests fail. As a general rule, we don't want to commit code that fails tests. But this can be a separate PR.

I am inclined to push the architecture of the test framework a step further:

Test account instead of fake HTTP canned responses (not 100% sure about this, but I think the risk of canned responses drifting from the real implementation is greater than the risk of the test account being unrepresentative of user reality; the only concern is limiting the spend)
Single source of truth of supported/testable workflows (not both WORKFLOWS.md and test files)
I'd also challenge the configuration of the main and testing agent from your diagram. The framework as it is now mainly tests the MCP API surface, and its reasoning, prose, etc. I am inclined to think we can have it go more end-to-end and basically take a prompt as input. Even some session history (imagine we want to test how the MCP behaves in the middle of an already ongoing conversation: there is prior chat context, the user uses a prompt, what happens?). I think our tests could be: test(previous context - optional, user prompt)->scores.
The limitation I see in our current approach is that we don't know if the agent using our MCP has actually invoked the widget. Maybe we can collect session logs and see from there if it invoked them (but maybe it knows it's being run as a subagent and does not invoke them)
Generally we want to capture session logs, see what the agent did, how did it interpret the request, did it call the tools we wanted it to call (this is a deterministic test), did it present the result the right way, did it propose next steps the right way, did it record memory, did it leverage previously recorded memory...
We want the evaluator to strictly rely on the logs to produce the judgment and not look at the MCP code or run any functions itself.
Finally, and this is maybe the worst, I really see different behavior in different user agents (Claude chat behaves differently than Claude Cowork). Ideally we'd want to in fact emulate them. I would expect an emulator to exist but have not found one. So maybe we drop this point for now (but one of my previous points about usage of the native UI components might also be impossible without the emulator; we need to see if we can trick the model into thinking it is within a particular harness (Cowork, Claude chat, ...), pick a UI component, and catch that in the log)
We want it to store reports in a file, so we can easily collect them and pass them on to a fixing agent. Also make it output a nice recap HTML file with all the tests and scores, colors for which pass, etc. I think the triple (previous context, prompt, outcome logs) is what, if you give it to any coding agent that can run test(previous context - optional, user prompt)->scores, lets it optimize iteratively.

See also my inline comment. I wonder to which extent we might want to actually drop the .ts files totally from the /eval framework (I know!) and just make:

a big fat eval skill with IRON LAWS, DEADLY SINS, BYPRODUCT FILE GATES (modeled after gstack skills or my /relentless skill from the skills repo in leadbay; basically gates are files the model needs to produce in order to pass to the next step, which enforces its multi-step approach so it can't skip)
A file listing all the workflows, expected outcomes, expected tool calls, native UI invocations, everything that needs to be on the checklist.

It spawns agents to run them, it captures the log, it judges the log and produces an outcome table.

We can even make this script in a /relentless way: basically we would evaluate whether it evaluated well (whether the log does not contradict its judgment).

I'm crazy, I know.

milstan · 2026-05-27T02:39:49Z

-    "  instruction_adherence — did the agent follow the prompt's PHASES without skipping?",
-    "  no_fabrication — every claim must trace to a tool response in the ledger.",
-    "  tool_selection_fit — were the chosen tools the right ones for the user intent?",
+    "NO_FABRICATION RULE (read before scoring):",


I don't think calling prompts from .ts is generally a good idea. I think generating SKILL.md as we do in promptforge is OK, but I think we want to give this sort of prompt as a skill to the agent and let it evaluate. I think particularly for the /eval part of tests, we are better served with skills than with deterministic code, which does not age very well for this kind of thing.

Agree on the SKILL. I might do this in another PR, to keep this one just about the eval agent, and include the skill in a skill-specific PR.

…rce SSoT via audit

- Add eval scenarios + invariants for prospecting overview (#7), outreach drafting (#8), field sales tour (#10), and team prospecting (#11)
- Update WORKFLOWS.md: every Supported row now cites its eval file in the Tests column, so there is one place to see what's covered
- Add workflows-eval-coverage.test.ts audit that enforces eval coverage for all Supported rows; CI fails if a row is added without an eval file

WORKFLOWS.md is now the single source of truth for eval contracts. Required calls, forbidden calls, and success criteria live in fenced ```yaml expected blocks in the doc — no separate TypeScript invariant files needed. The workflows-parser.ts runtime reads these blocks; run-eval.ts derives invariants and the judge mission from them. Deletes 867 lines across 12 invariants/*.ts files and all inline mission objects in 19 scenario files. Adds a new audit test that asserts every Supported workflow row has a parseable expected block with non-empty required_calls and success_criteria. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
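To make the contract idea concrete, here is a hedged sketch of how required and forbidden calls from an expected block could be checked deterministically against a tool ledger. The function name, the `forbidden_calls` key name, and the return shape are assumptions; only `required_calls` is named in this PR.

```python
def check_invariants(tool_calls: list, required_calls: list, forbidden_calls: list) -> dict:
    """Compare the tools an agent actually called against the workflow contract."""
    called = set(tool_calls)
    missing = [tool for tool in required_calls if tool not in called]
    forbidden_hit = [tool for tool in forbidden_calls if tool in called]
    return {
        "passed": not missing and not forbidden_hit,
        "missing": missing,              # required tools that were never called
        "forbidden_hit": forbidden_hit,  # forbidden tools that were called
    }


# Ledger taken from the workflow 5 run reported in this PR's review.
ledger = ["leadbay_bulk_qualify_leads", "leadbay_pull_leads"]
result = check_invariants(
    ledger,
    required_calls=["leadbay_bulk_qualify_leads"],
    forbidden_calls=["hypothetical_forbidden_tool"],  # placeholder name
)
```

Because this check only looks at the ledger, it stays independent of the LLM judge and gives the same answer on every run.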

All 13 eval prompt files drop the invariants import/parameter. All 19 scenario files replace the inline mission object with a single workflow_id field. run-eval.ts wires workflow_id through the workflows-parser to derive invariants and the judge mission at runtime from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Evals use the .eval.ts extension, which is intentionally excluded from the normal test suite. The new config + script makes them runnable without remembering the flag:

EVAL=1 pnpm --filter @leadbay/mcp run test:eval
EVAL=1 npx vitest run --config vitest.eval.config.ts <file>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reads .context/evals/*.json run files and generates a self-contained dark-mode HTML report with per-scenario score bars, invariant results, per-criterion verdicts, tool call sequences, and judge reasoning.

Usage:

pnpm --filter @leadbay/mcp run eval:report # latest run
pnpm --filter @leadbay/mcp run eval:report -- --all # all runs
pnpm --filter @leadbay/mcp run eval:report -- --run <run_id>
pnpm --filter @leadbay/mcp run eval:report -- --output /path/to/report.html

Output: .context/evals/eval-report.html

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Option B: evals now run against a real Leadbay test account instead of canned HTTP fixtures. No scenario .ts files are needed: workflows are selected by ID and the contract comes from WORKFLOWS.md yaml blocks.

New files:

- live-mcp-server.ts: minimal stdio MCP server using real Leadbay auth
- live-session-runner.ts: CLI session runner without fixture machinery; accepts systemPrompt injected via --system-prompt
- run-workflow.ts: CLI script (LEADBAY_TOKEN=... eval:live --workflow 1,3)

WORKFLOWS.md: added yaml scenario blocks (trigger prompt per workflow).
workflows-parser.ts: parses scenario blocks, exports getWorkflowScenario().

Usage:

LEADBAY_TOKEN=<token> LEADBAY_REGION=us \
pnpm --filter @leadbay/mcp run eval:live --workflow 1

Verified: workflow #1 passes 5/5 mission_match against real account (SnapLock Industries, lens 39107, real leads from api-us.leadbay.app).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- .env.eval at repo root stores LEADBAY_TOKEN + LEADBAY_REGION (gitignored). Create it: printf 'LEADBAY_TOKEN=...\nLEADBAY_REGION=us\n' > .env.eval
- `eval` script loads it via dotenv-cli: pnpm --filter @leadbay/mcp run eval -- --workflow 2
- `eval:view` generates HTML report and opens it in browser
- report.ts prints xdg-open hint with absolute path after generation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y path

Deleted:

- helpers: cli-session-runner, fixture-mcp-server, backend-recorder, touchfiles, drift-judge, run-eval
- all 18 scenario files (scenarios/)
- all 13 prompt eval stubs (prompts/*.eval.ts)
- drift-detector.ts script
- vitest.eval.config.ts + test:eval npm script
- audit tests that checked for now-deleted .eval.ts files

The live runner (run-workflow.ts + live-session-runner.ts) against the real Leadbay API replaces all of this. WORKFLOWS.md yaml expected/scenario blocks are the only source of truth. 251 tests still pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…S.md All three hardcoded maps (WORKFLOW_PROMPT, WORKFLOW_NAME, ALL_WORKFLOW_IDS) removed from run-workflow.ts. Parser now reads workflow_name and prompt_name scalar fields from each yaml expected block; run-workflow.ts calls getAllWorkflowExpected() at startup to derive the workflow list. Adding a new eval now requires only a WORKFLOWS.md edit — no TypeScript files to touch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- live-session-runner: add readFileSync to fs import (renderFullLog was crashing with ReferenceError on every run)
- WORKFLOWS.md workflow #3: replace required leadbay_research_lead_by_id with leadbay_research_lead_by_name_fuzzy. The fuzzy lookup alone is a valid completion path for domain research; by_id is called only when the agent wants deeper detail after the fuzzy result

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ToolSearch and WebFetch were leaking through despite --allowedTools mcp__leadbay-live__* — the agent used them to answer from training data instead of calling real Leadbay tools, causing workflow #3 to show zero tool calls and score 1/5 across the board. Add explicit --disallowedTools list covering all Claude Code built-ins that could leak: ToolSearch, WebFetch, WebSearch, Bash, Read, Edit, Write, Glob, Grep, LS, Skill, LSP, Agent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
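A leak like this can also be caught after the fact by scanning the recorded tool calls. The sketch below is a hypothetical post-run check, not something this PR ships; it assumes the ledger still carries the full prefixed tool names and reuses the mcp__leadbay-live__* pattern from the commit message.

```python
from fnmatch import fnmatchcase

ALLOWED_PATTERN = "mcp__leadbay-live__*"


def leaked_tools(tool_calls: list, allowed_pattern: str = ALLOWED_PATTERN) -> list:
    """Return every called tool whose name falls outside the allowed pattern."""
    return [name for name in tool_calls if not fnmatchcase(name, allowed_pattern)]


# Illustrative ledger; the prefixed leadbay tool name is an assumed form.
calls = ["mcp__leadbay-live__leadbay_pull_leads", "WebFetch", "ToolSearch"]
leaks = leaked_tools(calls)
```

A non-empty result would be a reason to fail the run before the judge scores it, since the answer may have come from outside the Leadbay tools.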

…routing With prompt_name set, the system prompt overrides routing and the test always passes. Without it (prompt_name: ~), the agent uses only tool descriptions to route — exposing the real gap where 'reach out to today' fires pull_leads instead of pull_followups.

…routing pull_leads.md.tmpl: add 6 anti_triggers routing reach-out phrasings to pull_followups (reach out to, get back to, contact today, should I contact, reconnect with, re-engage). pull-followups.md.tmpl: add 8 triggers actively claiming the same phrasing class (reach out to today, should reach out to, get back to, contact today, reconnect with, re-engage, leads to contact, who should I ping). Fixes workflow 2b: 'Show me leads I should reach out to today' was misfiring to leadbay_pull_leads. Root cause: pull_leads had no anti-trigger for this semantic class; pull_followups had no trigger claiming it.

…hrasing pull_leads.md.tmpl: add 3 negative examples including the exact failing phrase 'Show me leads I should reach out to today' — gives LLM concrete evidence this phrase should NOT route here. pull-followups.md.tmpl: add 3 positive examples including the exact phrase — gives LLM concrete evidence this phrase SHOULD route here. The combination of anti_trigger entries + concrete examples provides strong bidirectional routing signal for the reach-out phrasing class.

… me new/today leads' 'show me leads' was too broad — matched re-engagement phrasings like 'Show me leads I should reach out to today'. Replaced with more specific triggers: 'show me new leads', 'show me today's leads', 'fresh leads', 'what's new today'. The scenario phrase no longer matches any pull_leads trigger, while pull_followups now claims it via 'reach out to today'.

…cenario Tool-description-only routing (prompt_name: ~) is architecturally insufficient — the model priors for 'show me leads' always win over hint text. Real failure target: 'Show me my best leads for today' misfires to pull_leads EVEN WITH the leadbay_followup_check_in system prompt. That's a prompt-body fix, not a tool-description fix. Scenario now uses prompt_name: leadbay_followup_check_in.

…k_in context Add explicit disambiguation rule to PHASE 1 of leadbay_followup_check_in: 'best leads', 'top leads', 'leads for today', 'show me my leads' in the follow-up workflow context means Monitor pipeline, not a fresh Discover batch. Fixes the misroute for 'Show me my best leads for today' → was pulling leadbay_pull_leads, should pull leadbay_pull_followups.

1. Restore 'show me leads' to pull_leads triggers (was narrowed too aggressively)
2. Tighten anti-triggers to specific phrases ('leads I should reach out to') instead of broad substrings ('reach out to' which matched discovery intent)
3. Update pull_leads short_description to say 'NEW leads' and mention pull_followups for known pipeline leads
4. Fix WORKFLOWS.md 2b row description to match actual scenario phrase

Anti-triggers use full phrases ('leads I should reach out to') rather than substrings ('reach out to') to avoid intercepting legitimate discovery intent. 'reach out to new leads' should still fire pull_leads; only re-engagement phrasings ('leads I should reach out to') route to pull_followups. Documents the architectural decision for future engineers.
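The scope difference between a broad substring and a full phrase can be shown with a toy matcher. This is only an illustration: real routing is done by the model reading tool descriptions, not by string matching, and the function name is hypothetical.

```python
def matches_anti_trigger(utterance: str, anti_triggers: list) -> bool:
    """Toy model: an anti-trigger fires when its phrase occurs in the utterance."""
    text = utterance.lower()
    return any(phrase.lower() in text for phrase in anti_triggers)


broad = ["reach out to"]                  # substring form, too greedy
narrow = ["leads I should reach out to"]  # full-phrase form

discovery = "reach out to new leads"
re_engagement = "Show me leads I should reach out to today"

broad_hits = (matches_anti_trigger(discovery, broad), matches_anti_trigger(re_engagement, broad))
narrow_hits = (matches_anti_trigger(discovery, narrow), matches_anti_trigger(re_engagement, narrow))
```

Under this toy model the broad form also intercepts the discovery request, while the full phrase only intercepts the re-engagement one, which is the behavior the commit aims for.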

- Add '🔄 Self-improve' chip filter showing all eval runs that are part of the relentless self-improvement loop (any run containing workflow-2b entries)
- Fix workflow_label regex to handle alphanumeric suffixes like 'workflow-2b'
- Fix data-workflow sanitization to use safe CSS-id characters
- Timestamp now reads from filename (not entry name) for correct display
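For illustration, a label regex that tolerates a letter suffix and a CSS-safe sanitizer might look like the sketch below. The exact pattern and the entry-name format used by gen-dashboard.py are not shown in this PR, so both are assumptions.

```python
import re

# Digits followed by an optional lowercase letter, e.g. "5" or "2b".
WORKFLOW_RE = re.compile(r"workflow-(\d+[a-z]?)")


def workflow_label(name: str):
    """Extract the workflow id from an entry name, or None if there is none."""
    match = WORKFLOW_RE.search(name)
    return match.group(1) if match else None


def css_safe(value: str) -> str:
    """Replace anything outside letters, digits, '_' and '-' with '-'."""
    return re.sub(r"[^A-Za-z0-9_-]", "-", value)
```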

…BulkTracker error Add resilience rule to PHASE 1: call bulk_qualify_leads with wait_for_completion=true by default. If BulkTracker-not-configured error occurs, skip retry and proceed directly to pull_leads. Fixes TSF:4 caused by redundant async-first call followed by synchronous retry.

…d/pending split Add precise format instruction to PHASE 3: '✓ N leads qualified · M still processing (lead IDs: X)'. Handles 3 cases: all done, mixed, all pending. Targets NF:4 deduction from unclear 7/3 framing in prior eval run.

Add explicit variants for the 4 status cases:

1. exhausted=true / all pre-qualified: 'All N leads already qualified · 0 still processing'
2. all newly qualified: 'N leads qualified'
3. mixed: 'N leads qualified · M still processing (IDs)'
4. all pending: '0 leads qualified · N still processing (IDs)'

Restores TSF:5 on the pre-qualified batch edge case.
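The four variants can be written down as one small formatter. This is a sketch of the format the prompt asks the agent to produce, not code from the repo; the function name and the "IDs: a, b" rendering of the ID list are assumptions.

```python
def status_line(qualified: int, pending_ids: list, exhausted: bool = False) -> str:
    """Render the qualification status line for the four cases listed above."""
    if exhausted:
        # Case 1: the whole batch was already qualified before this run.
        return f"All {qualified} leads already qualified · 0 still processing"
    if not pending_ids:
        # Case 2: everything was newly qualified.
        return f"{qualified} leads qualified"
    # Cases 3 and 4: mixed, or all pending (qualified == 0).
    ids = ", ".join(str(lead_id) for lead_id in pending_ids)
    return f"{qualified} leads qualified · {len(pending_ids)} still processing (IDs: {ids})"
```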

'All N/N leads already qualified' with actual count (e.g. '10/10') so the user can verify scope. Targets NF:4 deduction for missing count.

… render Explicit note that pull_leads is always needed after bulk_qualify because the qualification response does not contain the full lead data for the table. Addresses judge TSF concern about 'redundant' pull_leads call.

…l relentless iterations WORKFLOWS.md: add 'Self-improving evals' section documenting /eval --improve, what it fixes, regression guard, and --dangerously-skip-permissions note. gen-dashboard.py: 'Last run' → 'Last session' groups all eval files within 60 minutes of the newest file. Previously showed only 1 entry (the newest file); now shows all N iterations from a relentless self-improvement run.

…es in follow-up context Adds explicit rule: "best leads"/"top leads"/"leads for today" within the follow-up workflow always routes to leadbay_pull_followups, not leadbay_pull_leads. Fixes workflow 2b misrouting regression.

… block Keeps memory pointer within the 600-char truncation-safe window. Anti-triggers were pushing it to position 672 for leadbay_pull_leads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
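A check along these lines could guard the layout in a test. The sketch assumes the "window" means the first 600 characters of the generated description and that the pointer must fit inside it entirely; the function name and the sample pointer text are hypothetical.

```python
SAFE_WINDOW = 600  # truncation-safe length cited in the commit message


def pointer_within_window(description: str, pointer: str, window: int = SAFE_WINDOW) -> bool:
    """True when `pointer` occurs and ends within the first `window` characters."""
    start = description.find(pointer)
    return start != -1 and start + len(pointer) <= window


pointer = "[memory pointer]"  # placeholder text
late = "x" * 672 + pointer    # pointer pushed to position 672, as in the bug
early = "x" * 100 + pointer + "x" * 900

late_ok = pointer_within_window(late, pointer)
early_ok = pointer_within_window(early, pointer)
```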

… memory pointer and remove_leads fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Plans and specs are session artifacts — outcomes only belong in the repo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Proper home alongside the other eval helpers rather than repo root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Per Milan's review: "best leads for today" should route to Discover by default; routing should be learned from user behavior via memory, not hard-pinned. Removed the over-eager disambiguation rule from the followup skill and the 2b scenario from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… SKILL.md

Two pre-merge cleanups on PR #71:

1. `.gitignore`: `.context/` only caught the literal directory. Sibling workspaces and tooling spawn `.context-<id>` paths that the original pattern missed. `.context*/` covers both.
2. `.claude-plugin/.../leadbay_followup_check_in/SKILL.md` was out of sync with its `.tmpl` source. The prompt template's "discovery-sounding phrases" disambiguation rule had been added but the generated SKILL.md was not re-emitted, so the Claude Code skill surface disagreed with the MCP prompt surface. Re-ran `pnpm prompts:build`; this commit lands the regenerated file.

`pnpm -r typecheck` and `pnpm -r test` (257/257) still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

milstan · 2026-05-29T03:39:30Z

Review

Close to merge. I pushed a fix commit (aae5fd0) addressing items 1 & 2 below; item 3 is a verified-green report from running /eval against this branch.

Required (fix commit `aae5fd0` covers 1 & 2)

1. .gitignore — broaden .context/ to .context*/
.context/ only catches the literal directory. Conductor + tooling spawn sibling dirs (.context-<workspace>, .context-evals/, etc.) that the original pattern missed. The eval skill itself writes to .context/evals/ which is fine, but the gitignore should cover the pattern, not the one path.
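The difference between the two patterns can be approximated with a glob match on directory names. This is only an approximation of gitignore semantics on a single path component (`git check-ignore` is the authoritative check), and the directory names other than `.context` are hypothetical.

```python
from fnmatch import fnmatchcase


def ignored(dirname: str, pattern: str) -> bool:
    """Approximate a directory-only gitignore pattern on one path component."""
    return fnmatchcase(dirname, pattern.rstrip("/"))


dirs = [".context", ".context-evals", ".context-workspace1"]
old = [d for d in dirs if ignored(d, ".context/")]   # original pattern
new = [d for d in dirs if ignored(d, ".context*/")]  # broadened pattern
```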

2. leadbay_followup_check_in/SKILL.md was stale
packages/promptforge/prompts/leadbay_followup_check_in.md.tmpl had the new disambiguation rule but the matching .claude-plugin/.../leadbay_followup_check_in/SKILL.md was not re-emitted. Running pnpm prompts:build locally produced a non-empty diff. Without the regen, the MCP-prompt surface and the Claude Code skill surface disagree. Fixed by running prompts:build and committing the regenerated file.

(Side note on the SKILL.md surface in general — the generated file under .claude-plugin/.../skills/<name>/SKILL.md is emitted from the same .md.tmpl that produces the MCP prompt. So the canonical edit surface stays the template under packages/promptforge/prompts/; the SKILL.md is just a build artifact that happens to be checked in so .dxt consumers don't need a build step. The architecture is fine; the regen was just forgotten on one of the two prompts this PR touched.)

3. /eval skill ran green on this branch
Installed eval from leadbay/skills@main, wrote .env.eval, ran /eval --workflow 5 against the real Leadbay API:

PHASE 0 GATE — PASS  (env + WORKFLOWS.md found)
PHASE 1 GATE — PASS  (contract parsed)
PHASE 2 GATE — PASS  (system prompt LOADED, 8854 chars)
PHASE 3 GATE — PASS  (exit=0, 146s, 1 turn, 2 tool calls)
PHASE 4 GATE — PASS  (2/2 invariants)
PHASE 5 GATE — PASS  (judge JSON valid)

Result: PASS  MM:5 IA:5 NF:5 TSF:4

Tool ledger: leadbay_bulk_qualify_leads → leadbay_pull_leads. Output included the mandated status line, "Standouts from this batch" callout, and the canonical 3-column pull_leads table with score-bars and constructed-LinkedIn fallbacks. .context/evals/eval-report.html dashboard generated correctly.

The judge dinged tool_selection_fit to 4/5 because it didn't see the prompt body (correct per Iron Law #3) and flagged leadbay_pull_leads as not "strictly required by the mission." But the prompt's PHASE 3 explicitly mandates the pull for the table render. This is a small WORKFLOWS.md gap, not a product regression.

Nice-to-have (don't block merge)

a. Add leadbay_pull_leads to workflow 5 required_calls so the judge can confirm both calls per the prompt's PHASE 3 contract.

b. live-session-runner.ts findTsx() — walks 5 levels up from helpers/ looking for .bin/tsx. In a fresh pnpm install the workspace-root tsx sits 4 levels up, not 5. Worth confirming the PATH fallback resolves on a clean clone.
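The concern in (b) is easier to see with a sketch of a bounded upward search plus a PATH fallback. This is a Python rendering for illustration, not the TypeScript in live-session-runner.ts; the function names, the node_modules/.bin location, and the level counting are assumptions.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_bin(start: Path, name: str = "tsx", max_levels: int = 5) -> Optional[Path]:
    """Look for node_modules/.bin/<name> in `start` and up to `max_levels` parents."""
    current = start
    for _ in range(max_levels + 1):
        candidate = current / "node_modules" / ".bin" / name
        if candidate.is_file():
            return candidate
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None


def resolve_bin(start: Path, name: str = "tsx") -> Optional[str]:
    """Prefer the workspace binary, then fall back to whatever is on PATH."""
    found = find_bin(start, name)
    return str(found) if found else shutil.which(name)
```

If the level cap is off by one, `find_bin` returns None and everything depends on the PATH fallback, which is why a clean-clone check is worth doing.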

c. assembler.ts reorder — moving memoryPointer from after prefer_when to before anti_triggers ships unverified (no snapshot test on routing-block layout). If the layout is meant to be load-bearing, add a snapshot test for one known prompt's generated description.

d. packages/mcp/test/eval/README.md still describes the deleted vitest-based runner with EVAL=1 EVALS_ALL=1 .... Either update it to describe the /eval skill flow or delete it — keeping the old README produces a contradiction for anyone landing on test/eval/ cold.

Things I checked and they're fine

pnpm -r typecheck — green
pnpm -r test — 257/257 pass
workflows.test.ts still asserts every backticked leadbay_* identifier resolves
LeadbayClient constructor extension to {baseUrl, bearer, region} — backward-compatible
Routing additions to pull-leads / pull-followups for reach-out / contact / re-engage phrasing — right call, positive/negative examples paired correctly
~3,400 LOC fixture-based eval infrastructure removed — the live approach is more honest

Bottom line

Land it. The fix commit clears the two blockers; the third (/eval run) is verified green.

[Claude]

ArtyETH06 added the feature label May 26, 2026

ArtyETH06 self-assigned this May 26, 2026

ArtyETH06 marked this pull request as draft May 26, 2026 21:34

ArtyETH06 and others added 8 commits May 26, 2026 15:32

Print judge scorecard after each eval run

1105451

Shows mission_match/instruction_adherence/no_fabrication/tool_selection_fit scores, per-criterion pass/fail with reasons, tools called sequence, and duration — so failures are immediately visible without digging into transcripts.

LeadbayClient: accept options object {baseUrl, bearer} in constructor

7ad25b9

Allows both call signatures: new LeadbayClient("https://...", "token") new LeadbayClient({ baseUrl: "https://...", bearer: "token" }) Fixes eval tests that use the options-object form.

Remove dead SDK runner and routing classifier eval

bbf5108

session-runner.ts (SDK-based runner) and tool-routing-classifier.eval.ts (SDK-only test) are unreachable since the framework moved to CLI-only execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add eval framework README

7af26ed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ArtyETH06 marked this pull request as ready for review May 26, 2026 23:53

ArtyETH06 requested a review from milstan May 26, 2026 23:53

milstan requested changes May 27, 2026


ArtyETH06 marked this pull request as draft May 27, 2026 05:17

ArtyETH06 and others added 8 commits May 27, 2026 10:23

ArtyETH06 changed the title ~~Eval framework: subscription-only runner, agentic judge, overdelivery guard~~ Eval framework: live API runner, WORKFLOWS.md SSoT, interactive HTML dashboard May 27, 2026

ArtyETH06 and others added 2 commits May 27, 2026 12:56

ArtyETH06 marked this pull request as ready for review May 27, 2026 21:14

ArtyETH06 requested a review from milstan May 27, 2026 21:14

ArtyETH06 marked this pull request as draft May 28, 2026 17:40

ArtyETH06 added 16 commits May 28, 2026 10:42

fix(prompt): qualify_top_n — explicit N/N count in exhausted status line

84915db

'All N/N leads already qualified' with actual count (e.g. '10/10') so the user can verify scope. Targets NF:4 deduction for missing count.

ArtyETH06 marked this pull request as ready for review May 29, 2026 00:38

ArtyETH06 marked this pull request as draft May 29, 2026 00:38

ArtyETH06 and others added 2 commits May 28, 2026 17:49

fix(promptforge): move memory pointer before anti-triggers in routing…

557674f

… block Keeps memory pointer within the 600-char truncation-safe window. Anti-triggers were pushing it to position 672 for leadbay_pull_leads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

merge: resolve conflict in tool-descriptions.generated.ts — take both…

aa3a287

… memory pointer and remove_leads fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ArtyETH06 marked this pull request as ready for review May 29, 2026 00:56

milstan requested changes May 29, 2026


Comment thread WORKFLOWS.md Outdated

Comment thread docs/superpowers/plans/2026-05-28-relentless-eval-loop.md Outdated

Comment thread docs/superpowers/specs/2026-05-28-relentless-eval-loop-design.md Outdated

ArtyETH06 and others added 3 commits May 28, 2026 19:21

chore: remove disposable plans, specs, and byproduct scripts

69efccd

Plans and specs are session artifacts — outcomes only belong in the repo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(eval): move gen-dashboard.py to packages/mcp/test/eval/helpers/

5a1e1a1

Proper home alongside the other eval helpers rather than repo root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ArtyETH06 requested a review from milstan May 29, 2026 02:33

milstan merged commit b3d724f into main May 29, 2026
1 check passed

milstan merged 57 commits into main from ArtyETH06/eval-framework
