fix(release): drop --provenance (repo is internal, not public) by milstan · Pull Request #8 · leadbay/leadclaw

milstan · 2026-04-21T06:29:00Z

Sigstore provenance requires public source repo; our first mcp-v0.2.0 publish hit 422 because the repo is internal. Dropping --provenance unblocks publishing. Re-add if/when the repo goes public.

Sigstore provenance requires the source repo to be public so anyone can verify the attestation chain. Our repo is internal, which made the first mcp-v0.2.0 publish fail with: 422 Unprocessable Entity Error verifying sigstore provenance bundle: Unsupported GitHub Actions source repository visibility: "internal". Only public source repositories are supported when publishing with provenance. Publishing proceeds without attestations. If/when the repo flips to public, re-add --provenance in both publish steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
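As a rough illustration of the rule described above, the publish step could gate `--provenance` on repository visibility instead of hard-coding it. This is only a sketch: `publishArgs` and `RepoVisibility` are hypothetical names, not part of the repository, and the actual change in this PR simply removes the flag from both publish steps.

```typescript
// Hypothetical helper, not taken from the leadclaw repository.
type RepoVisibility = "public" | "internal" | "private";

/**
 * Build the argument list for `npm publish`.
 * `--provenance` is added only for public repositories, because the
 * registry answered 422 when the source repo visibility was "internal".
 */
function publishArgs(visibility: RepoVisibility): string[] {
  const args = ["publish"];
  if (visibility === "public") {
    args.push("--provenance");
  }
  return args;
}

console.log(publishArgs("internal").join(" ")); // prints "publish"
console.log(publishArgs("public").join(" "));   // prints "publish --provenance"
```

With a gate like this, flipping the repo to public would restore attestations without another workflow edit, at the cost of having to supply the visibility to the publish step somehow.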

Provenance was dropped in the commit that merged as PR #8 because the repo was internal: sigstore requires a public source repo to verify the attestation chain. The repo is now public, so provenance publishes again. The next bump will ship with signed provenance to transparency.sigstore.dev. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…dashboard (#71) * Eval framework: subscription-only runner, agentic judge, overdelivery guard - Add cli-session-runner.ts: drives multi-turn eval sessions via the claude CLI using OAuth subscription auth (no ANTHROPIC_API_KEY required). Parses stream-json events, strips mcp__ tool name prefixes, captures final text from result events with lastAssistantText fallback. - Add fixture-mcp-server.ts: standalone MCP stdio server that patches node:https before importing LeadbayClient, serving all backend requests from EVAL_FIXTURES (base64 JSON). Enables realistic tool execution without a live backend. - Rewrite mission-match-judge.ts: agentic per-criterion loop (SDK path) and single-shot with full evidence dump (CLI path). Adds per_criterion verdicts to evidence L3. - Add CriterionVerdict to evidence.ts and per_criterion field to MCPEvidence. - Add llm-judge-shared.ts: shared callJudgeAuto() that auto-selects SDK vs CLI judge backend, hasCLI() detection, makeAnthropicClientIfAvailable(). - Update run-eval.ts: auto-selects SDK runner (ANTHROPIC_API_KEY present) vs CLI runner (subscription-only). Adds backend_requests pyramid validation. - Fix vitest.eval.config.ts: switched to pool:threads + singleThread:true for correct @leadbay/* workspace package resolution; removed broken vite aliases. - Add widget-overdelivery-guard scenario + eval: verifies the daily check-in agent stops before drafting/sending outreach without explicit user consent. Fixtures use correct /1.5 paths matching LeadbayClient's URL construction. All 251 existing tests continue to pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix token tracking scope bug + add no_fabrication rubric to CLI judge Token counters were declared inside the try block but referenced after the finally in the return statement, causing a ReferenceError. Moved declarations before the try block. 
Added scoring rubric anchors to the CLI single-shot judge prompt so agents acknowledging tool errors are not penalised as fabrication (no_fabrication=5 when agent correctly says "I don't have data"). * Print judge scorecard after each eval run Shows mission_match/instruction_adherence/no_fabrication/tool_selection_fit scores, per-criterion pass/fail with reasons, tools called sequence, and duration — so failures are immediately visible without digging into transcripts. * LeadbayClient: accept options object {baseUrl, bearer} in constructor Allows both call signatures: new LeadbayClient("https://...", "token") new LeadbayClient({ baseUrl: "https://...", bearer: "token" }) Fixes eval tests that use the options-object form. * Fix eval judge pass/fail inversion, migrate all scenarios to /1.5 paths, skip routing classifier without API key - Judge: clarify verdict() semantics — pass=true when confirmed, pass=false when absent; reasoning must agree with boolean; explicit examples for tool-call and byproduct criteria - Judge: show reasoning for all criteria in console output (not only failures) - Judge: add no_fabrication rubric note that rendering fixture data is not fabrication - Scenarios: rewrite all 10 broken scenarios from defunct /v1/... paths to correct /1.5/... paths matching LeadbayClient.request() which prepends /1.5 to every call - tool-routing-classifier: skip gracefully when ANTHROPIC_API_KEY is absent * Eval framework: CLI-only auth, single-shot judge, remove SDK paths Claude Code transparently handles auth (subscription or API key) for `claude -p` subprocesses — no ANTHROPIC_API_KEY branching needed. Removes ~800 lines of dead SDK runner and agentic judge code. - llm-judge-shared: CLI-only callJudge, drop SDK client factory - mission-match-judge: single-shot `claude -p` judge prompt; remove entire agentic multi-turn SDK loop (runAgenticJudgeSDK, evidence tools, buildAgentSystemPrompt). Rename buildFallbackPrompt → buildJudgePrompt (now the only path). 
- drift-judge: remove Anthropic import and client? field - run-eval: setupScenarioFixtures is a no-op; always use CLI runner; remove SDK runner import and ANTHROPIC_API_KEY check - touchfiles: update GLOBAL_TOUCHFILES to cli-session-runner.ts - eval files: rewrite daily-check-in + import-file evals from 150+ line inline SDK sessions to clean ~25 line runScenarioEval calls Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Remove dead SDK runner and routing classifier eval session-runner.ts (SDK-based runner) and tool-routing-classifier.eval.ts (SDK-only test) are unreachable since the framework moved to CLI-only execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Judge: strengthen no_fabrication rubric to prevent false deductions Move the no_fabrication rule above the scoring table with an explicit bulleted list of what is NOT fabrication: score bars, tool-response rendering, stop phrases, summarisation. The judge was consistently scoring 4/5 for rendering fixture-grounded markdown (▰❖▱ bars, emails, company names), which the rubric always intended to be a 5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add eval framework README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Eval coverage: add missing evals for workflows #7, #8, #10, #11; enforce SSoT via audit - Add eval scenarios + invariants for prospecting overview (#7), outreach drafting (#8), field sales tour (#10), and team prospecting (#11) - Update WORKFLOWS.md: every Supported row now cites its eval file in the Tests column — one place to see what's covered - Add workflows-eval-coverage.test.ts audit that enforces eval coverage for all Supported rows; CI fails if a row is added without an eval file * Consolidate eval specs into WORKFLOWS.md; delete 12 invariant files WORKFLOWS.md is now the single source of truth for eval contracts. 
Required calls, forbidden calls, and success criteria live in fenced ```yaml expected blocks in the doc — no separate TypeScript invariant files needed. The workflows-parser.ts runtime reads these blocks; run-eval.ts derives invariants and the judge mission from them. Deletes 867 lines across 12 invariants/*.ts files and all inline mission objects in 19 scenario files. Adds a new audit test that asserts every Supported workflow row has a parseable expected block with non-empty required_calls and success_criteria. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Migrate all eval prompts and scenarios to workflow_id pattern All 13 eval prompt files drop the invariants import/parameter. All 19 scenario files replace the inline mission object with a single workflow_id field. run-eval.ts wires workflow_id through the workflows-parser to derive invariants and the judge mission at runtime from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add vitest.eval.config.ts and test:eval script for running evals Evals use .eval.ts extension intentionally excluded from the normal test suite. The new config + script makes them runnable without remembering the flag: EVAL=1 pnpm --filter @leadbay/mcp run test:eval EVAL=1 npx vitest run --config vitest.eval.config.ts <file> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add HTML eval report generator Reads .context/evals/*.json run files and generates a self-contained dark-mode HTML report with per-scenario score bars, invariant results, per-criterion verdicts, tool call sequences, and judge reasoning. 
Usage: pnpm --filter @leadbay/mcp run eval:report # latest run pnpm --filter @leadbay/mcp run eval:report -- --all # all runs pnpm --filter @leadbay/mcp run eval:report -- --run <run_id> pnpm --filter @leadbay/mcp run eval:report -- --output /path/to/report.html Output: .context/evals/eval-report.html Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add live eval runner — test account, no fixtures Option B: evals now run against a real Leadbay test account instead of canned HTTP fixtures. No scenario .ts files needed — workflows are selected by ID and the contract comes from WORKFLOWS.md yaml blocks. New files: - live-mcp-server.ts: minimal stdio MCP server using real Leadbay auth - live-session-runner.ts: CLI session runner without fixture machinery; accepts systemPrompt injected via --system-prompt - run-workflow.ts: CLI script — LEADBAY_TOKEN=... eval:live --workflow 1,3 WORKFLOWS.md: added yaml scenario blocks (trigger prompt per workflow). workflows-parser.ts: parses scenario blocks, exports getWorkflowScenario(). Usage: LEADBAY_TOKEN=<token> LEADBAY_REGION=us \ pnpm --filter @leadbay/mcp run eval:live --workflow 1 Verified: workflow #1 passes 5/5 mission_match against real account (SnapLock Industries, lens 39107, real leads from api-us.leadbay.app). 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * QoL: single eval command, .env.eval for credentials, eval:view script - .env.eval at repo root stores LEADBAY_TOKEN + LEADBAY_REGION (gitignored) Create it: echo "LEADBAY_TOKEN=...\nLEADBAY_REGION=us" > .env.eval - `eval` script loads it via dotenv-cli: pnpm --filter @leadbay/mcp run eval -- --workflow 2 - `eval:view` generates HTML report and opens it in browser - report.ts prints xdg-open hint with absolute path after generation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Remove fixture-based eval infrastructure — live API runner is the only path Deleted: - helpers: cli-session-runner, fixture-mcp-server, backend-recorder, touchfiles, drift-judge, run-eval - all 18 scenario files (scenarios/) - all 13 prompt eval stubs (prompts/*.eval.ts) - drift-detector.ts script - vitest.eval.config.ts + test:eval npm script - audit tests that checked for now-deleted .eval.ts files The live runner (run-workflow.ts + live-session-runner.ts) against the real Leadbay API replaces all of this. WORKFLOWS.md yaml expected/scenario blocks are the only source of truth. 251 tests still pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Derive workflow_name, prompt_name, and ALL_WORKFLOW_IDS from WORKFLOWS.md All three hardcoded maps (WORKFLOW_PROMPT, WORKFLOW_NAME, ALL_WORKFLOW_IDS) removed from run-workflow.ts. Parser now reads workflow_name and prompt_name scalar fields from each yaml expected block; run-workflow.ts calls getAllWorkflowExpected() at startup to derive the workflow list. Adding a new eval now requires only a WORKFLOWS.md edit — no TypeScript files to touch. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix readFileSync missing import; relax workflow #3 required_calls - live-session-runner: add readFileSync to fs import (renderFullLog was crashing with ReferenceError on every run) - WORKFLOWS.md workflow #3: replace required leadbay_research_lead_by_id with leadbay_research_lead_by_name_fuzzy — the fuzzy lookup alone is a valid completion path for domain research; by_id is called only when the agent wants deeper detail after the fuzzy result Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Block non-Leadbay tools via --disallowedTools in eval runner ToolSearch and WebFetch were leaking through despite --allowedTools mcp__leadbay-live__* — the agent used them to answer from training data instead of calling real Leadbay tools, causing workflow #3 to show zero tool calls and score 1/5 across the board. Add explicit --disallowedTools list covering all Claude Code built-ins that could leak: ToolSearch, WebFetch, WebSearch, Bash, Read, Edit, Write, Glob, Grep, LS, Skill, LSP, Agent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Show token consumption in eval summary table Adds tokens_in / tokens_out columns per workflow row and a total tokens line at the bottom of the summary. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Track and display session + judge token consumption separately - llm-judge-shared: switch callClaudeCLI to --output-format json to capture input/output token counts from the judge call - mission-match-judge: thread tokens_in/tokens_out through to caller - run-workflow: show per-workflow session vs judge token columns in the summary table, plus totals broken out by session / judge / combined Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix session token tracking; add cache read column to summary - live-session-runner: wire totalTokensIn/Out into the returned cost object (was hardcoded 0); also capture cache_read_input_tokens - run-workflow: show session tokens as in/cache/out format so the large cache_read numbers are visible and not confused with new input Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Show grand total token count in terminal and dashboard - evidence: add token fields to EvalEntry (session in/cache/out, judge in/out) - run-workflow: populate token fields in collector; show grand total line in terminal summary (session + cache + judge) - report: show total tokens as a hoverable chip on each workflow card (hover reveals the session/cache/judge breakdown) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Suppress superpowers hooks in eval sessions The global ~/.claude/settings.json has PreToolUse/SessionStart hooks from claude-hook.js (superpowers) that inject "Checking for applicable skills now" into the agent, causing it to skip Leadbay MCP tools and answer from training data. Explicitly set all hook arrays to [] in the eval settings file so they override the global hooks for the duration of the eval session. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * WORKFLOWS.md: document eval-skill as the runner, add eval instructions Replace the "How this stays normative" section with full eval runner documentation — /eval skill usage, prerequisites, how to add a new eval. The skill reads this file directly at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * WORKFLOWS.md: drop Tests/Notes columns, replace with Prompt/Required/Forbidden/Scenario Table now shows the contract inline — no noise columns. Audit test: remove Tests-column path check (column is gone). * WORKFLOWS.md: eliminate table/yaml redundancy Table rows are now index-only (user story + prompt + scenario trigger). All required/forbidden calls and success criteria live once in the yaml expected blocks in the contracts section. Parser: switch from lastRowNum (table-row tracking) to sequential block counting — Nth yaml expected block = workflow #N. * Remove dead TS eval infrastructure replaced by eval skill workflows-parser.ts, run-workflow.ts, workflows-expected-blocks.test.ts were only used by each other. The skill parses WORKFLOWS.md directly at runtime — no TypeScript parser needed. * eval: delete report.ts script, dashboard now generated by /eval skill The /eval skill (v1.3.0) generates the HTML dashboard directly in Phase 7 by writing and executing a Python script. There is no longer a need for a standalone TypeScript report generator. Removes: - packages/mcp/test/eval/scripts/report.ts — 1009-line TS dashboard generator - eval:report and eval:view package.json scripts - eval script (run-workflow.ts-based runner, superseded by the skill) Updates WORKFLOWS.md to reference the dashboard file directly instead of the now-deleted pnpm script. * eval: move gen-dashboard.py to repo root so it's version-controlled Was living in .context/evals/ (gitignored). Moving to repo root makes dashboard improvements visible in PRs and persistent across clones. 
Also fixes: - JSON loader: handle top-level array files (not just {entries:[]} shape) - Last run filter: new chip filters entry list to most recent run file - Token grand total: include session_cache in sum (was being dropped) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: add relentless eval loop design spec Designs the /relentless + /eval self-improvement loop for MCP prompt quality. Workflow 2b (routing violation) is the deliberate failure target. * docs: sharpen relentless eval loop spec with confirmed routing gap Scenario updated from ambiguous phrasing to a confirmed structural failure: 'Show me leads I should reach out to today' reliably fires pull_leads (discovery) instead of pull_followups (Monitor) because pull_leads triggers on 'show me leads' + 'today' and has no anti-trigger for 'reach out to'. Fix target is tool-description routing, not just the prompt text. * docs: add relentless eval loop implementation plan * eval: add workflow 2b — routing stress-test for reach-out phrasing 'Show me leads I should reach out to today' reliably misfires to leadbay_pull_leads. Confirmed structural gap: pull_leads triggers on 'show me leads' + 'today'; pull_followups has no matching trigger for 'reach out to'. Used as the relentless loop's target failure. * eval: workflow 2b uses no system prompt — tests raw tool-description routing With prompt_name set, the system prompt overrides routing and the test always passes. Without it (prompt_name: ~), the agent uses only tool descriptions to route — exposing the real gap where 'reach out to today' fires pull_leads instead of pull_followups. * fix(routing): add reach-out/contact/re-engage phrasing class to tool routing pull_leads.md.tmpl: add 6 anti_triggers routing reach-out phrasings to pull_followups (reach out to, get back to, contact today, should I contact, reconnect with, re-engage). 
pull-followups.md.tmpl: add 8 triggers actively claiming the same phrasing class (reach out to today, should reach out to, get back to, contact today, reconnect with, re-engage, leads to contact, who should I ping). Fixes workflow 2b: 'Show me leads I should reach out to today' was misfiring to leadbay_pull_leads. Root cause: pull_leads had no anti-trigger for this semantic class; pull_followups had no trigger claiming it. * fix(routing): add concrete negative/positive examples for reach-out phrasing pull_leads.md.tmpl: add 3 negative examples including the exact failing phrase 'Show me leads I should reach out to today' — gives LLM concrete evidence this phrase should NOT route here. pull-followups.md.tmpl: add 3 positive examples including the exact phrase — gives LLM concrete evidence this phrase SHOULD route here. The combination of anti_trigger entries + concrete examples provides strong bidirectional routing signal for the reach-out phrasing class. * fix(routing): narrow pull_leads trigger from 'show me leads' to 'show me new/today leads' 'show me leads' was too broad — matched re-engagement phrasings like 'Show me leads I should reach out to today'. Replaced with more specific triggers: 'show me new leads', 'show me today's leads', 'fresh leads', 'what's new today'. The scenario phrase no longer matches any pull_leads trigger, while pull_followups now claims it via 'reach out to today'. * eval: workflow 2b — use system prompt + discovery-phrased ambiguous scenario Tool-description-only routing (prompt_name: ~) is architecturally insufficient — the model priors for 'show me leads' always win over hint text. Real failure target: 'Show me my best leads for today' misfires to pull_leads EVEN WITH the leadbay_followup_check_in system prompt. That's a prompt-body fix, not a tool-description fix. Scenario now uses prompt_name: leadbay_followup_check_in. 
* fix(prompt): disambiguate discovery-sounding phrases in followup_check_in context Add explicit disambiguation rule to PHASE 1 of leadbay_followup_check_in: 'best leads', 'top leads', 'leads for today', 'show me my leads' in the follow-up workflow context means Monitor pipeline, not a fresh Discover batch. Fixes the misroute for 'Show me my best leads for today' → was pulling leadbay_pull_leads, should pull leadbay_pull_followups. * fix(routing): address second-opinion findings 1. Restore 'show me leads' to pull_leads triggers (was narrowed too aggressively) 2. Tighten anti-triggers to specific phrases ('leads I should reach out to') instead of broad substrings ('reach out to' which matched discovery intent) 3. Update pull_leads short_description to say 'NEW leads' and mention pull_followups for known pipeline leads 4. Fix WORKFLOWS.md 2b row description to match actual scenario phrase * docs(routing): add comment explaining anti-trigger phrase specificity Anti-triggers use full phrases ('leads I should reach out to') rather than substrings ('reach out to') to avoid intercepting legitimate discovery intent. 'reach out to new leads' should still fire pull_leads; only re-engagement phrasings ('leads I should reach out to') route to pull_followups. Documents the architectural decision for future engineers. * fix(dashboard): add self-improve filter; fix workflow-2b label regex - Add '🔄 Self-improve' chip filter showing all eval runs that are part of the relentless self-improvement loop (any run containing workflow-2b entries) - Fix workflow_label regex to handle alphanumeric suffixes like 'workflow-2b' - Fix data-workflow sanitization to use safe CSS-id characters - Timestamp now reads from filename (not entry name) for correct display * fix(prompt): qualify_top_n — prefer wait_for_completion=true, handle BulkTracker error Add resilience rule to PHASE 1: call bulk_qualify_leads with wait_for_completion=true by default. 
If BulkTracker-not-configured error occurs, skip retry and proceed directly to pull_leads. Fixes TSF:4 caused by redundant async-first call followed by synchronous retry. * fix(prompt): qualify_top_n — explicit status line format for completed/pending split Add precise format instruction to PHASE 3: '✓ N leads qualified · M still processing (lead IDs: X)'. Handles 3 cases: all done, mixed, all pending. Targets NF:4 deduction from unclear 7/3 framing in prior eval run. * fix(prompt): qualify_top_n — cover exhausted=true in status line format Add explicit variants for the 4 status cases: 1. exhausted=true / all pre-qualified: 'All N leads already qualified · 0 still processing' 2. all newly qualified: 'N leads qualified' 3. mixed: 'N leads qualified · M still processing (IDs)' 4. all pending: '0 leads qualified · N still processing (IDs)' Restores TSF:5 on the pre-qualified batch edge case. * fix(prompt): qualify_top_n — explicit N/N count in exhausted status line 'All N/N leads already qualified' with actual count (e.g. '10/10') so the user can verify scope. Targets NF:4 deduction for missing count. * fix(prompt): qualify_top_n — clarify pull_leads is required for table render Explicit note that pull_leads is always needed after bulk_qualify because the qualification response does not contain the full lead data for the table. Addresses judge TSF concern about 'redundant' pull_leads call. * docs+dashboard: --improve docs in WORKFLOWS.md; Last session shows all relentless iterations WORKFLOWS.md: add 'Self-improving evals' section documenting /eval --improve, what it fixes, regression guard, and --dangerously-skip-permissions note. gen-dashboard.py: 'Last run' → 'Last session' groups all eval files within 60 minutes of the newest file. Previously showed only 1 entry (the newest file); now shows all N iterations from a relentless self-improvement run. 
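The "Last session" grouping described above (every eval file within 60 minutes of the newest one) can be sketched as follows. The real implementation lives in gen-dashboard.py and is not shown on this page; the `RunFile` shape and the `lastSession` name are assumptions made purely for illustration.

```typescript
// Hypothetical data shape; the actual dashboard script is Python.
interface RunFile {
  name: string;
  timestampMs: number; // run time in milliseconds since the epoch
}

const SESSION_WINDOW_MS = 60 * 60 * 1000; // 60 minutes

/** Return all run files within 60 minutes of the newest one, oldest first. */
function lastSession(files: RunFile[]): RunFile[] {
  if (files.length === 0) return [];
  const newest = Math.max(...files.map((f) => f.timestampMs));
  return files
    .filter((f) => newest - f.timestampMs <= SESSION_WINDOW_MS)
    .sort((a, b) => a.timestampMs - b.timestampMs);
}
```

Anchoring the window to the newest file, rather than to the current time, means an old self-improvement run still groups together when the dashboard is regenerated later.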
* fix(skill): followup_check_in — disambiguate discovery-sounding phrases in follow-up context Adds explicit rule: "best leads"/"top leads"/"leads for today" within the follow-up workflow always routes to leadbay_pull_followups, not leadbay_pull_leads. Fixes workflow 2b misrouting regression. * fix(promptforge): move memory pointer before anti-triggers in routing block Keeps memory pointer within the 600-char truncation-safe window. Anti-triggers were pushing it to position 672 for leadbay_pull_leads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: remove disposable plans, specs, and byproduct scripts Plans and specs are session artifacts — outcomes only belong in the repo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(eval): move gen-dashboard.py to packages/mcp/test/eval/helpers/ Proper home alongside the other eval helpers rather than repo root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * revert: drop workflow 2b and followup disambiguation rule Per Milan's review: "best leads for today" should route to Discover by default; routing should be learned from user behavior via memory, not hard-pinned. Removed the over-eager disambiguation rule from the followup skill and the 2b scenario from WORKFLOWS.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: broaden .gitignore to .context*/ + regen stale followup_check_in SKILL.md Two pre-merge cleanups on PR #71: 1. `.gitignore` — `.context/` only caught the literal directory. Sibling workspaces and tooling spawn `.context-<id>` paths that the original pattern missed. `.context*/` covers both. 2. `.claude-plugin/.../leadbay_followup_check_in/SKILL.md` was out of sync with its `.tmpl` source. The prompt template's "discovery- sounding phrases" disambiguation rule had been added but the generated SKILL.md was not re-emitted, so the Claude Code skill surface disagreed with the MCP prompt surface. 
Re-ran `pnpm prompts:build`; this commit lands the regenerated file. `pnpm -r typecheck` and `pnpm -r test` (257/257) still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: milstan <milstan@gmail.com>
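The squashed message above notes that LeadbayClient accepts both a positional and an options-object call signature. A minimal sketch of how such a dual constructor can be typed with overloads is below; `LeadbayClientSketch` and `LeadbayClientOptions` are hypothetical names, and the real class may be implemented differently.

```typescript
// Illustrative only: not the actual LeadbayClient source.
interface LeadbayClientOptions {
  baseUrl: string;
  bearer: string;
}

class LeadbayClientSketch {
  readonly baseUrl: string;
  readonly bearer: string;

  // Overload signatures: positional form and options-object form.
  constructor(baseUrl: string, bearer: string);
  constructor(options: LeadbayClientOptions);
  constructor(first: string | LeadbayClientOptions, bearer?: string) {
    if (typeof first === "string") {
      if (bearer === undefined) {
        throw new TypeError("bearer is required when baseUrl is a string");
      }
      this.baseUrl = first;
      this.bearer = bearer;
    } else {
      this.baseUrl = first.baseUrl;
      this.bearer = first.bearer;
    }
  }
}

// Both forms yield an equivalent client.
const positional = new LeadbayClientSketch("https://example.invalid", "token");
const viaOptions = new LeadbayClientSketch({ baseUrl: "https://example.invalid", bearer: "token" });
```

The overloads keep the two public signatures explicit for callers, while the single implementation signature normalises them into the same fields.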

milstan force-pushed the milstan/drop-provenance branch from 1340be9 to d0f1a12 on April 21, 2026 06:30

milstan merged commit 60e44ff into main Apr 21, 2026

ArtyETH06 mentioned this pull request May 27, 2026

Eval framework: live API runner, WORKFLOWS.md SSoT, interactive HTML dashboard #71

Merged

milstan merged 1 commit into main from milstan/drop-provenance
