Eval framework: live API runner, WORKFLOWS.md SSoT, interactive HTML dashboard#71
Conversation
… guard - Add cli-session-runner.ts: drives multi-turn eval sessions via the claude CLI using OAuth subscription auth (no ANTHROPIC_API_KEY required). Parses stream-json events, strips mcp__ tool name prefixes, captures final text from result events with lastAssistantText fallback. - Add fixture-mcp-server.ts: standalone MCP stdio server that patches node:https before importing LeadbayClient, serving all backend requests from EVAL_FIXTURES (base64 JSON). Enables realistic tool execution without a live backend. - Rewrite mission-match-judge.ts: agentic per-criterion loop (SDK path) and single-shot with full evidence dump (CLI path). Adds per_criterion verdicts to evidence L3. - Add CriterionVerdict to evidence.ts and per_criterion field to MCPEvidence. - Add llm-judge-shared.ts: shared callJudgeAuto() that auto-selects SDK vs CLI judge backend, hasCLI() detection, makeAnthropicClientIfAvailable(). - Update run-eval.ts: auto-selects SDK runner (ANTHROPIC_API_KEY present) vs CLI runner (subscription-only). Adds backend_requests pyramid validation. - Fix vitest.eval.config.ts: switched to pool:threads + singleThread:true for correct @leadbay/* workspace package resolution; removed broken vite aliases. - Add widget-overdelivery-guard scenario + eval: verifies the daily check-in agent stops before drafting/sending outreach without explicit user consent. Fixtures use correct /1.5 paths matching LeadbayClient's URL construction. All 251 existing tests continue to pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Token counters were declared inside the try block but referenced after the finally in the return statement, causing a ReferenceError. Moved declarations before the try block. Added scoring rubric anchors to the CLI single-shot judge prompt so agents acknowledging tool errors are not penalised as fabrication (no_fabrication=5 when agent correctly says "I don't have data").
Shows mission_match/instruction_adherence/no_fabrication/tool_selection_fit scores, per-criterion pass/fail with reasons, tools called sequence, and duration — so failures are immediately visible without digging into transcripts.
Allows both call signatures:
new LeadbayClient("https://...", "token")
new LeadbayClient({ baseUrl: "https://...", bearer: "token" })
Fixes eval tests that use the options-object form.
…hs, skip routing classifier without API key - Judge: clarify verdict() semantics — pass=true when confirmed, pass=false when absent; reasoning must agree with boolean; explicit examples for tool-call and byproduct criteria - Judge: show reasoning for all criteria in console output (not only failures) - Judge: add no_fabrication rubric note that rendering fixture data is not fabrication - Scenarios: rewrite all 10 broken scenarios from defunct /v1/... paths to correct /1.5/... paths matching LeadbayClient.request() which prepends /1.5 to every call - tool-routing-classifier: skip gracefully when ANTHROPIC_API_KEY is absent
Claude Code transparently handles auth (subscription or API key) for `claude -p` subprocesses — no ANTHROPIC_API_KEY branching needed. Removes ~800 lines of dead SDK runner and agentic judge code. - llm-judge-shared: CLI-only callJudge, drop SDK client factory - mission-match-judge: single-shot `claude -p` judge prompt; remove entire agentic multi-turn SDK loop (runAgenticJudgeSDK, evidence tools, buildAgentSystemPrompt). Rename buildFallbackPrompt → buildJudgePrompt (now the only path). - drift-judge: remove Anthropic import and client? field - run-eval: setupScenarioFixtures is a no-op; always use CLI runner; remove SDK runner import and ANTHROPIC_API_KEY check - touchfiles: update GLOBAL_TOUCHFILES to cli-session-runner.ts - eval files: rewrite daily-check-in + import-file evals from 150+ line inline SDK sessions to clean ~25 line runScenarioEval calls Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
session-runner.ts (SDK-based runner) and tool-routing-classifier.eval.ts (SDK-only test) are unreachable since the framework moved to CLI-only execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the no_fabrication rule above the scoring table with an explicit bulleted list of what is NOT fabrication: score bars, tool-response rendering, stop phrases, summarisation. The judge was consistently scoring 4/5 for rendering fixture-grounded markdown (▰❖▱ bars, emails, company names), which the rubric always intended to be a 5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
It does work, but tests fail. As a general rule, we don't want to comit cone that fails tests. But this can be a separate PR.
I am inclined to push the architecture of the test framefowk a step further:
-
test account instead of Fake HTTL canned responses (not 100% sure about this, but I think the risk of canned responses drifting from real implementation is greater than the risk of test account being underepresentative of user reality - only concern is limiting the spend)
-
Single source of truth of supported/testable workflows (not both WORKFLOWS.md and test files)
-
I'd also challenge the configuration of the main and testing agent from your diagram. The frameworks as it is now, mainly tests the MCP API surface, and its resoning, prose etc. I am inclinded to think we can have it go more end-to-end basicly take prompt as input. Even some session history (imagine we want to test how the MCP behaves in the middle of a anready ongoing conversations, there is prior chat conext, the user uses a prompt -> what happens" I think our tests could be: test(previous context - optional, user prompt)->scores.
-
The limitation I see of our current approach, is that we don't know if the agent using our MCP has actually infovked the widget. Maybe we can collect session logs and see from there if it ivoked them (but maybe it knows it's being run as subagent and does not invoke them)
-
Generally we want to capture session logs, see what the agent did, how did it interpret the request, did it call the tools we wanted it to call (this is a deterministic test), did it present the result the right way, did it propose next steps the right way, did it record memory, did it leverage previously recorded memory...
-
We want the evaluator to strictly rely on the logs to produce the judgment and not look at the MCP code or run any functions itself.
-
Finally, and this is maybe the worst, I really see different behavior in different user agents (Claude chat behaves differently than Claude Cowrok). ideally we'd want to in fact emulate them. I would expect an emulator to exist but have not found one in fact. So maybe we drop this point for now (but one of my previous points about usage of the native UI components might also be impossible without the emulator - we need to see if we can trick the model into thinking it is within a particular harness COwork/ClaudeChat..., and pick a UI component and catch that in the log)
-
We want it to store reports in a file, so we can easilly collect them, and pass them onto a fixing agent. Also make it output a nice recap html file with all the tests and scores, colors which pass,e tc. I think the tripple (previous context, prompt, outcome logs) is what if you give to any coding tagent that can run test(previous context - optional, user prompt)->scores, it can optimize it iteratively.
See also my inline comment. I wonder to which extent we might want to actually drop the .ts files totally from the /eval framework (I know!) and just make:
- a big fat eval skill with IRON LAWS, DEADLY SINS, BYPRODUCT FILE GATES (modeled after gstack skills or my /relentless skill from skills repo in leadbay - basically gates are files the model needs to produce in order to pass to the next step, so that enforces its multi-step approach, can't skip)
- A file listing all the workflows, expected outcomes, expected tool calls, native UI invocations, everything thant needs to checklist.
It spawns agets to run them, it captures the log, it judges the log and produces outcome table.
We can even make this script in a /reletnless way - bascially we would evaluate it if evaluated well (if the log does not contractic its judgement).
I'm crazy, I know.
| " instruction_adherence — did the agent follow the prompt's PHASES without skipping?", | ||
| " no_fabrication — every claim must trace to a tool response in the ledger.", | ||
| " tool_selection_fit — were the chosen tools the right ones for the user intent?", | ||
| "NO_FABRICATION RULE (read before scoring):", |
There was a problem hiding this comment.
I dont calling prompts from .ts is generally a good idea. I think generating SKILL.md as we do in promtforge is OK, but I think we want to give this sort of prompts as skills to the agent and let it evaluate. I think particularly for the /eval part of tests, we are better served with skills that with deterministic code, that does not age very well for this kind of thing.
There was a problem hiding this comment.
Agree on the SKILL, might do this in another PR, to keep this one just about the eval agent, and include the skill in a skill specified PR.
…rce SSoT via audit - Add eval scenarios + invariants for prospecting overview (#7), outreach drafting (#8), field sales tour (#10), and team prospecting (#11) - Update WORKFLOWS.md: every Supported row now cites its eval file in the Tests column — one place to see what's covered - Add workflows-eval-coverage.test.ts audit that enforces eval coverage for all Supported rows; CI fails if a row is added without an eval file
WORKFLOWS.md is now the single source of truth for eval contracts. Required calls, forbidden calls, and success criteria live in fenced ```yaml expected blocks in the doc — no separate TypeScript invariant files needed. The workflows-parser.ts runtime reads these blocks; run-eval.ts derives invariants and the judge mission from them. Deletes 867 lines across 12 invariants/*.ts files and all inline mission objects in 19 scenario files. Adds a new audit test that asserts every Supported workflow row has a parseable expected block with non-empty required_calls and success_criteria. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 13 eval prompt files drop the invariants import/parameter. All 19 scenario files replace the inline mission object with a single workflow_id field. run-eval.ts wires workflow_id through the workflows-parser to derive invariants and the judge mission at runtime from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Evals use .eval.ts extension intentionally excluded from the normal test suite. The new config + script makes them runnable without remembering the flag: EVAL=1 pnpm --filter @leadbay/mcp run test:eval EVAL=1 npx vitest run --config vitest.eval.config.ts <file> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads .context/evals/*.json run files and generates a self-contained dark-mode HTML report with per-scenario score bars, invariant results, per-criterion verdicts, tool call sequences, and judge reasoning. Usage: pnpm --filter @leadbay/mcp run eval:report # latest run pnpm --filter @leadbay/mcp run eval:report -- --all # all runs pnpm --filter @leadbay/mcp run eval:report -- --run <run_id> pnpm --filter @leadbay/mcp run eval:report -- --output /path/to/report.html Output: .context/evals/eval-report.html Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Option B: evals now run against a real Leadbay test account instead of
canned HTTP fixtures. No scenario .ts files needed — workflows are
selected by ID and the contract comes from WORKFLOWS.md yaml blocks.
New files:
- live-mcp-server.ts: minimal stdio MCP server using real Leadbay auth
- live-session-runner.ts: CLI session runner without fixture machinery;
accepts systemPrompt injected via --system-prompt
- run-workflow.ts: CLI script — LEADBAY_TOKEN=... eval:live --workflow 1,3
WORKFLOWS.md: added yaml scenario blocks (trigger prompt per workflow).
workflows-parser.ts: parses scenario blocks, exports getWorkflowScenario().
Usage:
LEADBAY_TOKEN=<token> LEADBAY_REGION=us \
pnpm --filter @leadbay/mcp run eval:live --workflow 1
Verified: workflow #1 passes 5/5 mission_match against real account
(SnapLock Industries, lens 39107, real leads from api-us.leadbay.app).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .env.eval at repo root stores LEADBAY_TOKEN + LEADBAY_REGION (gitignored) Create it: echo "LEADBAY_TOKEN=...\nLEADBay_REGION=us" > .env.eval - `eval` script loads it via dotenv-cli: pnpm --filter @leadbay/mcp run eval -- --workflow 2 - `eval:view` generates HTML report and opens it in browser - report.ts prints xdg-open hint with absolute path after generation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y path Deleted: - helpers: cli-session-runner, fixture-mcp-server, backend-recorder, touchfiles, drift-judge, run-eval - all 18 scenario files (scenarios/) - all 13 prompt eval stubs (prompts/*.eval.ts) - drift-detector.ts script - vitest.eval.config.ts + test:eval npm script - audit tests that checked for now-deleted .eval.ts files The live runner (run-workflow.ts + live-session-runner.ts) against the real Leadbay API replaces all of this. WORKFLOWS.md yaml expected/scenario blocks are the only source of truth. 251 tests still pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…S.md All three hardcoded maps (WORKFLOW_PROMPT, WORKFLOW_NAME, ALL_WORKFLOW_IDS) removed from run-workflow.ts. Parser now reads workflow_name and prompt_name scalar fields from each yaml expected block; run-workflow.ts calls getAllWorkflowExpected() at startup to derive the workflow list. Adding a new eval now requires only a WORKFLOWS.md edit — no TypeScript files to touch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- live-session-runner: add readFileSync to fs import (renderFullLog was crashing with ReferenceError on every run) - WORKFLOWS.md workflow #3: replace required leadbay_research_lead_by_id with leadbay_research_lead_by_name_fuzzy — the fuzzy lookup alone is a valid completion path for domain research; by_id is called only when the agent wants deeper detail after the fuzzy result Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ToolSearch and WebFetch were leaking through despite --allowedTools mcp__leadbay-live__* — the agent used them to answer from training data instead of calling real Leadbay tools, causing workflow #3 to show zero tool calls and score 1/5 across the board. Add explicit --disallowedTools list covering all Claude Code built-ins that could leak: ToolSearch, WebFetch, WebSearch, Bash, Read, Edit, Write, Glob, Grep, LS, Skill, LSP, Agent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…routing With prompt_name set, the system prompt overrides routing and the test always passes. Without it (prompt_name: ~), the agent uses only tool descriptions to route — exposing the real gap where 'reach out to today' fires pull_leads instead of pull_followups.
…routing pull_leads.md.tmpl: add 6 anti_triggers routing reach-out phrasings to pull_followups (reach out to, get back to, contact today, should I contact, reconnect with, re-engage). pull-followups.md.tmpl: add 8 triggers actively claiming the same phrasing class (reach out to today, should reach out to, get back to, contact today, reconnect with, re-engage, leads to contact, who should I ping). Fixes workflow 2b: 'Show me leads I should reach out to today' was misfiring to leadbay_pull_leads. Root cause: pull_leads had no anti-trigger for this semantic class; pull_followups had no trigger claiming it.
…hrasing pull_leads.md.tmpl: add 3 negative examples including the exact failing phrase 'Show me leads I should reach out to today' — gives LLM concrete evidence this phrase should NOT route here. pull-followups.md.tmpl: add 3 positive examples including the exact phrase — gives LLM concrete evidence this phrase SHOULD route here. The combination of anti_trigger entries + concrete examples provides strong bidirectional routing signal for the reach-out phrasing class.
… me new/today leads' 'show me leads' was too broad — matched re-engagement phrasings like 'Show me leads I should reach out to today'. Replaced with more specific triggers: 'show me new leads', 'show me today's leads', 'fresh leads', 'what's new today'. The scenario phrase no longer matches any pull_leads trigger, while pull_followups now claims it via 'reach out to today'.
…cenario Tool-description-only routing (prompt_name: ~) is architecturally insufficient — the model priors for 'show me leads' always win over hint text. Real failure target: 'Show me my best leads for today' misfires to pull_leads EVEN WITH the leadbay_followup_check_in system prompt. That's a prompt-body fix, not a tool-description fix. Scenario now uses prompt_name: leadbay_followup_check_in.
…k_in context Add explicit disambiguation rule to PHASE 1 of leadbay_followup_check_in: 'best leads', 'top leads', 'leads for today', 'show me my leads' in the follow-up workflow context means Monitor pipeline, not a fresh Discover batch. Fixes the misroute for 'Show me my best leads for today' → was pulling leadbay_pull_leads, should pull leadbay_pull_followups.
1. Restore 'show me leads' to pull_leads triggers (was narrowed too aggressively)
2. Tighten anti-triggers to specific phrases ('leads I should reach out to')
instead of broad substrings ('reach out to' which matched discovery intent)
3. Update pull_leads short_description to say 'NEW leads' and mention
pull_followups for known pipeline leads
4. Fix WORKFLOWS.md 2b row description to match actual scenario phrase
Anti-triggers use full phrases ('leads I should reach out to') rather
than substrings ('reach out to') to avoid intercepting legitimate
discovery intent. 'reach out to new leads' should still fire pull_leads;
only re-engagement phrasings ('leads I should reach out to') route to
pull_followups. Documents the architectural decision for future engineers.
- Add '🔄 Self-improve' chip filter showing all eval runs that are part of the relentless self-improvement loop (any run containing workflow-2b entries) - Fix workflow_label regex to handle alphanumeric suffixes like 'workflow-2b' - Fix data-workflow sanitization to use safe CSS-id characters - Timestamp now reads from filename (not entry name) for correct display
…BulkTracker error Add resilience rule to PHASE 1: call bulk_qualify_leads with wait_for_completion=true by default. If BulkTracker-not-configured error occurs, skip retry and proceed directly to pull_leads. Fixes TSF:4 caused by redundant async-first call followed by synchronous retry.
…d/pending split Add precise format instruction to PHASE 3: '✓ N leads qualified · M still processing (lead IDs: X)'. Handles 3 cases: all done, mixed, all pending. Targets NF:4 deduction from unclear 7/3 framing in prior eval run.
Add explicit variants for the 4 status cases: 1. exhausted=true / all pre-qualified: 'All N leads already qualified · 0 still processing' 2. all newly qualified: 'N leads qualified' 3. mixed: 'N leads qualified · M still processing (IDs)' 4. all pending: '0 leads qualified · N still processing (IDs)' Restores TSF:5 on the pre-qualified batch edge case.
'All N/N leads already qualified' with actual count (e.g. '10/10') so the user can verify scope. Targets NF:4 deduction for missing count.
… render Explicit note that pull_leads is always needed after bulk_qualify because the qualification response does not contain the full lead data for the table. Addresses judge TSF concern about 'redundant' pull_leads call.
…l relentless iterations WORKFLOWS.md: add 'Self-improving evals' section documenting /eval --improve, what it fixes, regression guard, and --dangerously-skip-permissions note. gen-dashboard.py: 'Last run' → 'Last session' groups all eval files within 60 minutes of the newest file. Previously showed only 1 entry (the newest file); now shows all N iterations from a relentless self-improvement run.
…es in follow-up context Adds explicit rule: "best leads"/"top leads"/"leads for today" within the follow-up workflow always routes to leadbay_pull_followups, not leadbay_pull_leads. Fixes workflow 2b misrouting regression.
… block Keeps memory pointer within the 600-char truncation-safe window. Anti-triggers were pushing it to position 672 for leadbay_pull_leads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… memory pointer and remove_leads fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plans and specs are session artifacts — outcomes only belong in the repo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Proper home alongside the other eval helpers rather than repo root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per Milan's review: "best leads for today" should route to Discover by default; routing should be learned from user behavior via memory, not hard-pinned. Removed the over-eager disambiguation rule from the followup skill and the 2b scenario from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… SKILL.md Two pre-merge cleanups on PR #71: 1. `.gitignore` — `.context/` only caught the literal directory. Sibling workspaces and tooling spawn `.context-<id>` paths that the original pattern missed. `.context*/` covers both. 2. `.claude-plugin/.../leadbay_followup_check_in/SKILL.md` was out of sync with its `.tmpl` source. The prompt template's "discovery- sounding phrases" disambiguation rule had been added but the generated SKILL.md was not re-emitted, so the Claude Code skill surface disagreed with the MCP prompt surface. Re-ran `pnpm prompts:build`; this commit lands the regenerated file. `pnpm -r typecheck` and `pnpm -r test` (257/257) still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ReviewClose to merge. I pushed a fix commit (aae5fd0) addressing items 1 & 2 below; item 3 is a verified-green report from running Required (fix commit aae5fd0 covers 1 & 2)1. 2. (Side note on the SKILL.md surface in general — the generated file under 3. Tool ledger: The judge dinged Nice-to-have (don't block merge)a. Add b. c. d. Things I checked and they're fine
Bottom lineLand it. The fix commit clears the two blockers; the third ( [Claude] |
What this PR does
Replaces the fixture-based MCP eval framework with a live, skill-driven runner where WORKFLOWS.md is the only file you ever edit to add or change a test.
Key changes:
yaml expectedcontract block (required/forbidden calls, success criteria) and ayaml scenarioblock (prompt). The/evalskill reads these directly./evalskill.packages/mcp/test/eval/helpers/gen-dashboard.py— self-contained Python script generating a single-file HTML dashboard at.context/evals/eval-report.html.live-mcp-server.ts— thin stdio MCP bridge for real Leadbay API sessions.How to run evals
Prerequisites:
.env.evalat repo root withLEADBAY_TOKEN=u.xxxandLEADBAY_REGION=us.For unattended runs:
claude --dangerously-skip-permissionsSelf-improving evals (new)
Add
--improveto automatically fix any workflow scoring below 5/5:Flow:
/relentlessand immediately executes the self-improvement loop inline (no hand-off, same session):packages/promptforge/prompts/<prompt_name>.md.tmplpnpm prompts:build/eval --workflow <others>to confirm no regressionsWhat gets improved: prompt
.md.tmplsource files only — never.generated.tsdirectly.Dashboard changes
[✓/✗]in phase gate output (was[x/ ]).Validated in this session
Ran
/eval --workflow 5 --improveend-to-end:leadbay_qualify_top_n.md.tmpl— addedwait_for_completion=true, BulkTracker resilience rule, exhausted=true status line format, explicit pull_leads justificationChanges since last review
leadbay_followup_check_in— "best leads for today" routes naturally to Discover; context-sensitive routing should be learned from user memory, not hard-pinnedgen-dashboard.pyfrom repo root topackages/mcp/test/eval/helpers/leadbay_remove_leads_from_campaignfix)