fix(release): drop --provenance (repo is internal, not public) #8

Merged: milstan merged 1 commit into main from milstan/drop-provenance on Apr 21, 2026

milstan (Contributor) commented on Apr 21, 2026

Sigstore provenance requires a public source repo; our first mcp-v0.2.0 publish hit a 422 because the repo is internal. Dropping --provenance unblocks publishing. Re-add it if/when the repo goes public.

Sigstore provenance requires the source repo to be public so anyone can
verify the attestation chain. Our repo is internal, which made the first
mcp-v0.2.0 publish fail with:

  422 Unprocessable Entity
  Error verifying sigstore provenance bundle: Unsupported GitHub Actions
  source repository visibility: "internal". Only public source
  repositories are supported when publishing with provenance.

Publishing proceeds without attestations. If/when the repo flips to
public, re-add --provenance in both publish steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
milstan force-pushed the milstan/drop-provenance branch from 1340be9 to d0f1a12 on April 21, 2026 06:30
milstan merged commit 60e44ff into main on Apr 21, 2026
milstan added a commit that referenced this pull request Apr 21, 2026
Provenance was dropped in the commit that merged as PR #8 because the repo
was internal — sigstore requires a public source repo to verify the
attestation chain. Repo is now public, so provenance publishes again.

Next bump will ship with signed provenance to transparency.sigstore.dev.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ArtyETH06 added a commit that referenced this pull request May 27, 2026
…rce SSoT via audit

- Add eval scenarios + invariants for prospecting overview (#7),
  outreach drafting (#8), field sales tour (#10), and team prospecting (#11)
- Update WORKFLOWS.md: every Supported row now cites its eval file in
  the Tests column — one place to see what's covered
- Add workflows-eval-coverage.test.ts audit that enforces eval coverage
  for all Supported rows; CI fails if a row is added without an eval file
milstan added a commit that referenced this pull request May 29, 2026
…dashboard (#71)

* Eval framework: subscription-only runner, agentic judge, overdelivery guard

- Add cli-session-runner.ts: drives multi-turn eval sessions via the claude
  CLI using OAuth subscription auth (no ANTHROPIC_API_KEY required). Parses
  stream-json events, strips mcp__ tool name prefixes, captures final text
  from result events with lastAssistantText fallback.

- Add fixture-mcp-server.ts: standalone MCP stdio server that patches
  node:https before importing LeadbayClient, serving all backend requests
  from EVAL_FIXTURES (base64 JSON). Enables realistic tool execution without
  a live backend.

- Rewrite mission-match-judge.ts: agentic per-criterion loop (SDK path) and
  single-shot with full evidence dump (CLI path). Adds per_criterion verdicts
  to evidence L3.

- Add CriterionVerdict to evidence.ts and per_criterion field to MCPEvidence.

- Add llm-judge-shared.ts: shared callJudgeAuto() that auto-selects SDK vs
  CLI judge backend, hasCLI() detection, makeAnthropicClientIfAvailable().

- Update run-eval.ts: auto-selects SDK runner (ANTHROPIC_API_KEY present) vs
  CLI runner (subscription-only). Adds backend_requests pyramid validation.

- Fix vitest.eval.config.ts: switched to pool:threads + singleThread:true for
  correct @leadbay/* workspace package resolution; removed broken vite aliases.

- Add widget-overdelivery-guard scenario + eval: verifies the daily check-in
  agent stops before drafting/sending outreach without explicit user consent.
  Fixtures use correct /1.5 paths matching LeadbayClient's URL construction.
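
The mcp__ prefix stripping mentioned for cli-session-runner.ts can be sketched roughly as below. This is an illustration only: `stripMcpPrefix` is a hypothetical helper name, and it assumes tool names arrive in the `mcp__<server>__<tool>` shape that Claude Code uses for MCP tools. The repo's actual implementation may differ.

```typescript
// Hypothetical helper, not taken from the repo.
// Assumes MCP tool names look like "mcp__<server>__<tool>".
function stripMcpPrefix(toolName: string): string {
  const prefix = "mcp__";
  if (!toolName.startsWith(prefix)) return toolName; // built-in tool, leave as-is
  // Find the separator between the server name and the tool name.
  const sep = toolName.indexOf("__", prefix.length);
  return sep === -1 ? toolName : toolName.slice(sep + 2);
}

const bare = stripMcpPrefix("mcp__leadbay-live__leadbay_pull_leads");
// bare === "leadbay_pull_leads"
```

Names without the prefix (for example `Bash`) pass through unchanged, so the same function can be applied to every tool event.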

All 251 existing tests continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix token tracking scope bug + add no_fabrication rubric to CLI judge

Token counters were declared inside the try block but referenced after
the finally in the return statement, causing a ReferenceError. Moved
declarations before the try block.
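
A minimal sketch of the fixed shape, for reference. All names here (`runSession`, `SessionResult`, the token numbers) are hypothetical and only illustrate the scoping rule: `let`/`const` declared inside `try` are not visible after `finally`, so the counters have to be declared before the block.

```typescript
// Hypothetical names and numbers; only the scoping pattern matters.
interface SessionResult {
  text: string;
  tokensIn: number;
  tokensOut: number;
}

function runSession(fail: boolean): SessionResult {
  // Declared before try so the return statement after finally can see them.
  // Declaring them inside the try block would put them out of scope below.
  let tokensIn = 0;
  let tokensOut = 0;
  let text = "";
  try {
    if (fail) throw new Error("simulated session failure");
    tokensIn += 120;
    tokensOut += 45;
    text = "ok";
  } catch {
    text = "error";
  } finally {
    // Cleanup would go here; the counters are still in scope.
  }
  return { text, tokensIn, tokensOut };
}
```

On the failure path the counters simply keep their initial zeros instead of crashing the return.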

Added scoring rubric anchors to the CLI single-shot judge prompt so
agents acknowledging tool errors are not penalised as fabrication
(no_fabrication=5 when agent correctly says "I don't have data").

* Print judge scorecard after each eval run

Shows mission_match/instruction_adherence/no_fabrication/tool_selection_fit
scores, per-criterion pass/fail with reasons, tools called sequence, and
duration — so failures are immediately visible without digging into transcripts.

* LeadbayClient: accept options object {baseUrl, bearer} in constructor

Allows both call signatures:
  new LeadbayClient("https://...", "token")
  new LeadbayClient({ baseUrl: "https://...", bearer: "token" })

Fixes eval tests that use the options-object form.
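
One way to support both call signatures is a pair of constructor overloads. The sketch below uses a hypothetical stand-in class (`ClientSketch`), not the real LeadbayClient, and the URL is a placeholder; it only shows the dispatch on the first argument's type.

```typescript
// Hypothetical stand-in for illustration; not the real LeadbayClient.
interface ClientOptions {
  baseUrl: string;
  bearer: string;
}

class ClientSketch {
  readonly baseUrl: string;
  readonly bearer: string;

  constructor(baseUrl: string, bearer: string);
  constructor(options: ClientOptions);
  constructor(first: string | ClientOptions, bearer?: string) {
    if (typeof first === "string") {
      // Positional form: (baseUrl, bearer)
      if (bearer === undefined) {
        throw new TypeError("bearer is required with the positional form");
      }
      this.baseUrl = first;
      this.bearer = bearer;
    } else {
      // Options-object form: ({ baseUrl, bearer })
      this.baseUrl = first.baseUrl;
      this.bearer = first.bearer;
    }
  }
}

const positional = new ClientSketch("https://example.invalid", "token");
const viaOptions = new ClientSketch({ baseUrl: "https://example.invalid", bearer: "token" });
```

Both instances end up with the same fields, so callers can migrate to the options form gradually.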

* Fix eval judge pass/fail inversion, migrate all scenarios to /1.5 paths, skip routing classifier without API key

- Judge: clarify verdict() semantics — pass=true when confirmed, pass=false when absent;
  reasoning must agree with boolean; explicit examples for tool-call and byproduct criteria
- Judge: show reasoning for all criteria in console output (not only failures)
- Judge: add no_fabrication rubric note that rendering fixture data is not fabrication
- Scenarios: rewrite all 10 broken scenarios from defunct /v1/... paths to correct
  /1.5/... paths matching LeadbayClient.request() which prepends /1.5 to every call
- tool-routing-classifier: skip gracefully when ANTHROPIC_API_KEY is absent

* Eval framework: CLI-only auth, single-shot judge, remove SDK paths

Claude Code transparently handles auth (subscription or API key) for
`claude -p` subprocesses — no ANTHROPIC_API_KEY branching needed.
Removes ~800 lines of dead SDK runner and agentic judge code.

- llm-judge-shared: CLI-only callJudge, drop SDK client factory
- mission-match-judge: single-shot `claude -p` judge prompt; remove
  entire agentic multi-turn SDK loop (runAgenticJudgeSDK, evidence
  tools, buildAgentSystemPrompt). Rename buildFallbackPrompt →
  buildJudgePrompt (now the only path).
- drift-judge: remove Anthropic import and client? field
- run-eval: setupScenarioFixtures is a no-op; always use CLI runner;
  remove SDK runner import and ANTHROPIC_API_KEY check
- touchfiles: update GLOBAL_TOUCHFILES to cli-session-runner.ts
- eval files: rewrite daily-check-in + import-file evals from 150+
  line inline SDK sessions to clean ~25 line runScenarioEval calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove dead SDK runner and routing classifier eval

session-runner.ts (SDK-based runner) and tool-routing-classifier.eval.ts
(SDK-only test) are unreachable since the framework moved to CLI-only execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Judge: strengthen no_fabrication rubric to prevent false deductions

Move the no_fabrication rule above the scoring table with an explicit
bulleted list of what is NOT fabrication: score bars, tool-response
rendering, stop phrases, summarisation. The judge was consistently
scoring 4/5 for rendering fixture-grounded markdown (▰❖▱ bars, emails,
company names), which the rubric always intended to be a 5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add eval framework README

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Eval coverage: add missing evals for workflows #7, #8, #10, #11; enforce SSoT via audit

- Add eval scenarios + invariants for prospecting overview (#7),
  outreach drafting (#8), field sales tour (#10), and team prospecting (#11)
- Update WORKFLOWS.md: every Supported row now cites its eval file in
  the Tests column — one place to see what's covered
- Add workflows-eval-coverage.test.ts audit that enforces eval coverage
  for all Supported rows; CI fails if a row is added without an eval file

* Consolidate eval specs into WORKFLOWS.md; delete 12 invariant files

WORKFLOWS.md is now the single source of truth for eval contracts.
Required calls, forbidden calls, and success criteria live in fenced
```yaml expected blocks in the doc — no separate TypeScript invariant
files needed. The workflows-parser.ts runtime reads these blocks;
run-eval.ts derives invariants and the judge mission from them.

Deletes 867 lines across 12 invariants/*.ts files and all inline
mission objects in 19 scenario files. Adds a new audit test that
asserts every Supported workflow row has a parseable expected block
with non-empty required_calls and success_criteria.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Migrate all eval prompts and scenarios to workflow_id pattern

All 13 eval prompt files drop the invariants import/parameter.
All 19 scenario files replace the inline mission object with a
single workflow_id field. run-eval.ts wires workflow_id through
the workflows-parser to derive invariants and the judge mission
at runtime from WORKFLOWS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add vitest.eval.config.ts and test:eval script for running evals

Evals use .eval.ts extension intentionally excluded from the normal
test suite. The new config + script makes them runnable without
remembering the flag:

  EVAL=1 pnpm --filter @leadbay/mcp run test:eval
  EVAL=1 npx vitest run --config vitest.eval.config.ts <file>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add HTML eval report generator

Reads .context/evals/*.json run files and generates a self-contained
dark-mode HTML report with per-scenario score bars, invariant results,
per-criterion verdicts, tool call sequences, and judge reasoning.

Usage:
  pnpm --filter @leadbay/mcp run eval:report           # latest run
  pnpm --filter @leadbay/mcp run eval:report -- --all  # all runs
  pnpm --filter @leadbay/mcp run eval:report -- --run <run_id>
  pnpm --filter @leadbay/mcp run eval:report -- --output /path/to/report.html

Output: .context/evals/eval-report.html

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add live eval runner — test account, no fixtures

Option B: evals now run against a real Leadbay test account instead of
canned HTTP fixtures. No scenario .ts files needed — workflows are
selected by ID and the contract comes from WORKFLOWS.md yaml blocks.

New files:
- live-mcp-server.ts: minimal stdio MCP server using real Leadbay auth
- live-session-runner.ts: CLI session runner without fixture machinery;
  accepts systemPrompt injected via --system-prompt
- run-workflow.ts: CLI script — LEADBAY_TOKEN=... eval:live --workflow 1,3

WORKFLOWS.md: added yaml scenario blocks (trigger prompt per workflow).
workflows-parser.ts: parses scenario blocks, exports getWorkflowScenario().

Usage:
  LEADBAY_TOKEN=<token> LEADBAY_REGION=us \
    pnpm --filter @leadbay/mcp run eval:live --workflow 1

Verified: workflow #1 passes 5/5 mission_match against real account
(SnapLock Industries, lens 39107, real leads from api-us.leadbay.app).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* QoL: single eval command, .env.eval for credentials, eval:view script

- .env.eval at repo root stores LEADBAY_TOKEN + LEADBAY_REGION (gitignored)
  Create it: printf 'LEADBAY_TOKEN=...\nLEADBAY_REGION=us\n' > .env.eval
- `eval` script loads it via dotenv-cli:
  pnpm --filter @leadbay/mcp run eval -- --workflow 2
- `eval:view` generates HTML report and opens it in browser
- report.ts prints xdg-open hint with absolute path after generation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove fixture-based eval infrastructure — live API runner is the only path

Deleted:
- helpers: cli-session-runner, fixture-mcp-server, backend-recorder,
  touchfiles, drift-judge, run-eval
- all 18 scenario files (scenarios/)
- all 13 prompt eval stubs (prompts/*.eval.ts)
- drift-detector.ts script
- vitest.eval.config.ts + test:eval npm script
- audit tests that checked for now-deleted .eval.ts files

The live runner (run-workflow.ts + live-session-runner.ts) against the
real Leadbay API replaces all of this. WORKFLOWS.md yaml expected/scenario
blocks are the only source of truth. 251 tests still pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Derive workflow_name, prompt_name, and ALL_WORKFLOW_IDS from WORKFLOWS.md

All three hardcoded maps (WORKFLOW_PROMPT, WORKFLOW_NAME, ALL_WORKFLOW_IDS)
removed from run-workflow.ts. Parser now reads workflow_name and prompt_name
scalar fields from each yaml expected block; run-workflow.ts calls
getAllWorkflowExpected() at startup to derive the workflow list.

Adding a new eval now requires only a WORKFLOWS.md edit — no TypeScript
files to touch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix readFileSync missing import; relax workflow #3 required_calls

- live-session-runner: add readFileSync to fs import (renderFullLog was
  crashing with ReferenceError on every run)
- WORKFLOWS.md workflow #3: replace required leadbay_research_lead_by_id
  with leadbay_research_lead_by_name_fuzzy — the fuzzy lookup alone is a
  valid completion path for domain research; by_id is called only when
  the agent wants deeper detail after the fuzzy result

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Block non-Leadbay tools via --disallowedTools in eval runner

ToolSearch and WebFetch were leaking through despite --allowedTools
mcp__leadbay-live__* — the agent used them to answer from training data
instead of calling real Leadbay tools, causing workflow #3 to show
zero tool calls and score 1/5 across the board.

Add explicit --disallowedTools list covering all Claude Code built-ins
that could leak: ToolSearch, WebFetch, WebSearch, Bash, Read, Edit,
Write, Glob, Grep, LS, Skill, LSP, Agent.
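
A rough sketch of how such an argument list could be assembled. `buildClaudeArgs` is a hypothetical helper, and passing the deny list as one comma-joined value is an assumption about how the runner formats it; the tool names and the allow pattern are the ones named in this commit message.

```typescript
// Built-ins named in the commit message above.
const DISALLOWED_BUILTINS: string[] = [
  "ToolSearch", "WebFetch", "WebSearch", "Bash", "Read", "Edit",
  "Write", "Glob", "Grep", "LS", "Skill", "LSP", "Agent",
];

// Hypothetical helper: returns argv for a `claude -p` subprocess.
// Assumption: the deny list is passed as a single comma-joined value.
function buildClaudeArgs(prompt: string, allowedPattern: string): string[] {
  return [
    "-p", prompt,
    "--allowedTools", allowedPattern,
    "--disallowedTools", DISALLOWED_BUILTINS.join(","),
  ];
}

const evalArgs = buildClaudeArgs("Research this domain", "mcp__leadbay-live__*");
```

Keeping the deny list in one constant makes it easy to extend when a new built-in starts leaking into eval sessions.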

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Show token consumption in eval summary table

Adds tokens_in / tokens_out columns per workflow row and a total
tokens line at the bottom of the summary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Track and display session + judge token consumption separately

- llm-judge-shared: switch callClaudeCLI to --output-format json to
  capture input/output token counts from the judge call
- mission-match-judge: thread tokens_in/tokens_out through to caller
- run-workflow: show per-workflow session vs judge token columns in the
  summary table, plus totals broken out by session / judge / combined

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix session token tracking; add cache read column to summary

- live-session-runner: wire totalTokensIn/Out into the returned cost
  object (was hardcoded 0); also capture cache_read_input_tokens
- run-workflow: show session tokens as in/cache/out format so the
  large cache_read numbers are visible and not confused with new input

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Show grand total token count in terminal and dashboard

- evidence: add token fields to EvalEntry (session in/cache/out, judge in/out)
- run-workflow: populate token fields in collector; show grand total line
  in terminal summary (session + cache + judge)
- report: show total tokens as a hoverable chip on each workflow card
  (hover reveals the session/cache/judge breakdown)
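
The token accounting described above amounts to a small sum. The field names below are hypothetical (the commit does not list the actual EvalEntry field names); the sketch only shows the session in/cache/out format and a grand total that includes cache reads.

```typescript
// Hypothetical field names; the real EvalEntry shape may differ.
interface TokenUsage {
  sessionIn: number;
  sessionCache: number; // cache_read_input_tokens
  sessionOut: number;
  judgeIn: number;
  judgeOut: number;
}

// Session tokens rendered as in/cache/out so cache reads stay visible.
function formatSession(u: TokenUsage): string {
  return `${u.sessionIn}/${u.sessionCache}/${u.sessionOut}`;
}

// Grand total across session, cache, and judge.
function grandTotal(u: TokenUsage): number {
  return u.sessionIn + u.sessionCache + u.sessionOut + u.judgeIn + u.judgeOut;
}

const sampleUsage: TokenUsage = {
  sessionIn: 1200, sessionCache: 54000, sessionOut: 800, judgeIn: 3000, judgeOut: 400,
};
// formatSession(sampleUsage) === "1200/54000/800"
// grandTotal(sampleUsage) === 59400
```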

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Suppress superpowers hooks in eval sessions

The global ~/.claude/settings.json has PreToolUse/SessionStart hooks
from claude-hook.js (superpowers) that inject "Checking for applicable
skills now" into the agent, causing it to skip Leadbay MCP tools and
answer from training data.

Explicitly set all hook arrays to [] in the eval settings file so they
override the global hooks for the duration of the eval session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* WORKFLOWS.md: document eval-skill as the runner, add eval instructions

Replace the "How this stays normative" section with full eval runner
documentation — /eval skill usage, prerequisites, how to add a new
eval. The skill reads this file directly at runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* WORKFLOWS.md: drop Tests/Notes columns, replace with Prompt/Required/Forbidden/Scenario

Table now shows the contract inline — no noise columns.
Audit test: remove Tests-column path check (column is gone).

* WORKFLOWS.md: eliminate table/yaml redundancy

Table rows are now index-only (user story + prompt + scenario trigger).
All required/forbidden calls and success criteria live once in the
yaml expected blocks in the contracts section.

Parser: switch from lastRowNum (table-row tracking) to sequential
block counting — Nth yaml expected block = workflow #N.
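
Sequential block counting can be sketched as follows. This is not the repo's parser: `extractExpectedBlocks` is a hypothetical name, and the sketch assumes each contract is a fenced block whose info string is exactly `yaml expected`. Other fenced blocks (for example scenario blocks) are skipped and do not advance the counter.

```typescript
// Hypothetical sketch; assumes fences open with "yaml expected" as the info string.
const FENCE = "`".repeat(3);

function extractExpectedBlocks(markdown: string): Map<number, string> {
  const blocks = new Map<number, string>();
  let current: string[] | null = null; // lines of the block being read, if any
  let count = 0;
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (current === null) {
      if (trimmed === FENCE + "yaml expected") current = [];
    } else if (trimmed === FENCE) {
      count += 1; // the Nth expected block is workflow #N
      blocks.set(count, current.join("\n"));
      current = null;
    } else {
      current.push(line);
    }
  }
  return blocks;
}

const sampleDoc = [
  FENCE + "yaml expected", "workflow_name: a", FENCE,
  FENCE + "yaml scenario", "prompt: hello", FENCE,
  FENCE + "yaml expected", "workflow_name: b", FENCE,
].join("\n");
const parsed = extractExpectedBlocks(sampleDoc);
// parsed.get(2) === "workflow_name: b"
```

The trade-off of counting by position is that reordering blocks in the doc silently renumbers workflows, so the block order has to match the table order.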

* Remove dead TS eval infrastructure replaced by eval skill

workflows-parser.ts, run-workflow.ts, workflows-expected-blocks.test.ts
were only used by each other. The skill parses WORKFLOWS.md directly
at runtime — no TypeScript parser needed.

* eval: delete report.ts script, dashboard now generated by /eval skill

The /eval skill (v1.3.0) generates the HTML dashboard directly in Phase 7
by writing and executing a Python script. There is no longer a need for a
standalone TypeScript report generator.

Removes:
- packages/mcp/test/eval/scripts/report.ts — 1009-line TS dashboard generator
- eval:report and eval:view package.json scripts
- eval script (run-workflow.ts-based runner, superseded by the skill)

Updates WORKFLOWS.md to reference the dashboard file directly instead of
the now-deleted pnpm script.

* eval: move gen-dashboard.py to repo root so it's version-controlled

Was living in .context/evals/ (gitignored). Moving to repo root makes
dashboard improvements visible in PRs and persistent across clones.

Also fixes:
- JSON loader: handle top-level array files (not just {entries:[]} shape)
- Last run filter: new chip filters entry list to most recent run file
- Token grand total: include session_cache in sum (was being dropped)
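
The JSON loader fix boils down to accepting two shapes. gen-dashboard.py is Python; the sketch below expresses the same idea in TypeScript for consistency with the rest of the repo, and `normalizeEntries` is a hypothetical name.

```typescript
// Hypothetical TypeScript rendering of the loader logic (the real script is Python).
function normalizeEntries(parsed: unknown): unknown[] {
  // Shape 1: the file is a top-level array of entries.
  if (Array.isArray(parsed)) return parsed;
  // Shape 2: the file is an object with an `entries` array.
  if (typeof parsed === "object" && parsed !== null) {
    const entries = (parsed as { entries?: unknown }).entries;
    if (Array.isArray(entries)) return entries;
  }
  // Anything else contributes no entries.
  return [];
}

const fromArray = normalizeEntries(JSON.parse('[{"name":"workflow-1"}]'));
const fromObject = normalizeEntries(JSON.parse('{"entries":[{"name":"workflow-1"}]}'));
```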

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add relentless eval loop design spec

Designs the /relentless + /eval self-improvement loop for MCP prompt
quality. Workflow 2b (routing violation) is the deliberate failure target.

* docs: sharpen relentless eval loop spec with confirmed routing gap

Scenario updated from ambiguous phrasing to a confirmed structural failure:
'Show me leads I should reach out to today' reliably fires pull_leads
(discovery) instead of pull_followups (Monitor) because pull_leads
triggers on 'show me leads' + 'today' and has no anti-trigger for
'reach out to'. Fix target is tool-description routing, not just the
prompt text.

* docs: add relentless eval loop implementation plan

* eval: add workflow 2b — routing stress-test for reach-out phrasing

'Show me leads I should reach out to today' reliably misfires to
leadbay_pull_leads. Confirmed structural gap: pull_leads triggers on
'show me leads' + 'today'; pull_followups has no matching trigger for
'reach out to'. Used as the relentless loop's target failure.

* eval: workflow 2b uses no system prompt — tests raw tool-description routing

With prompt_name set, the system prompt overrides routing and the test
always passes. Without it (prompt_name: ~), the agent uses only tool
descriptions to route — exposing the real gap where 'reach out to today'
fires pull_leads instead of pull_followups.

* fix(routing): add reach-out/contact/re-engage phrasing class to tool routing

pull_leads.md.tmpl: add 6 anti_triggers routing reach-out phrasings to
pull_followups (reach out to, get back to, contact today, should I contact,
reconnect with, re-engage).

pull-followups.md.tmpl: add 8 triggers actively claiming the same phrasing
class (reach out to today, should reach out to, get back to, contact today,
reconnect with, re-engage, leads to contact, who should I ping).

Fixes workflow 2b: 'Show me leads I should reach out to today' was misfiring
to leadbay_pull_leads. Root cause: pull_leads had no anti-trigger for this
semantic class; pull_followups had no trigger claiming it.

* fix(routing): add concrete negative/positive examples for reach-out phrasing

pull_leads.md.tmpl: add 3 negative examples including the exact failing
phrase 'Show me leads I should reach out to today' — gives LLM concrete
evidence this phrase should NOT route here.

pull-followups.md.tmpl: add 3 positive examples including the exact
phrase — gives LLM concrete evidence this phrase SHOULD route here.

The combination of anti_trigger entries + concrete examples provides
strong bidirectional routing signal for the reach-out phrasing class.

* fix(routing): narrow pull_leads trigger from 'show me leads' to 'show me new/today leads'

'show me leads' was too broad — matched re-engagement phrasings like
'Show me leads I should reach out to today'. Replaced with more specific
triggers: 'show me new leads', 'show me today's leads', 'fresh leads',
'what's new today'. The scenario phrase no longer matches any pull_leads
trigger, while pull_followups now claims it via 'reach out to today'.

* eval: workflow 2b — use system prompt + discovery-phrased ambiguous scenario

Tool-description-only routing (prompt_name: ~) is architecturally insufficient
— the model priors for 'show me leads' always win over hint text. Real failure
target: 'Show me my best leads for today' misfires to pull_leads EVEN WITH the
leadbay_followup_check_in system prompt. That's a prompt-body fix, not a
tool-description fix. Scenario now uses prompt_name: leadbay_followup_check_in.

* fix(prompt): disambiguate discovery-sounding phrases in followup_check_in context

Add explicit disambiguation rule to PHASE 1 of leadbay_followup_check_in:
'best leads', 'top leads', 'leads for today', 'show me my leads' in the
follow-up workflow context means Monitor pipeline, not a fresh Discover batch.
Fixes the misroute for 'Show me my best leads for today' → was pulling
leadbay_pull_leads, should pull leadbay_pull_followups.

* fix(routing): address second-opinion findings

1. Restore 'show me leads' to pull_leads triggers (was narrowed too aggressively)
2. Tighten anti-triggers to specific phrases ('leads I should reach out to')
   instead of broad substrings ('reach out to' which matched discovery intent)
3. Update pull_leads short_description to say 'NEW leads' and mention
   pull_followups for known pipeline leads
4. Fix WORKFLOWS.md 2b row description to match actual scenario phrase

* docs(routing): add comment explaining anti-trigger phrase specificity

Anti-triggers use full phrases ('leads I should reach out to') rather
than substrings ('reach out to') to avoid intercepting legitimate
discovery intent. 'reach out to new leads' should still fire pull_leads;
only re-engagement phrasings ('leads I should reach out to') route to
pull_followups. Documents the architectural decision for future engineers.
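
To make the specificity argument concrete, here is a literal-matching illustration. The real routing is done by the model reading tool descriptions, not by string matching, so this is only an analogy; `mentions` is a hypothetical helper.

```typescript
// Illustration only: real routing is model-driven, not substring matching.
const FULL_PHRASE = "leads i should reach out to";
const BROAD_SUBSTRING = "reach out to";

function mentions(utterance: string, pattern: string): boolean {
  return utterance.toLowerCase().includes(pattern);
}

const discovery = "reach out to new leads";
const reEngagement = "Show me leads I should reach out to today";

// The broad substring intercepts discovery intent it should not touch.
const broadHitsDiscovery = mentions(discovery, BROAD_SUBSTRING); // true
// The full phrase leaves discovery alone and still catches re-engagement.
const phraseHitsDiscovery = mentions(discovery, FULL_PHRASE); // false
const phraseHitsReEngagement = mentions(reEngagement, FULL_PHRASE); // true
```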

* fix(dashboard): add self-improve filter; fix workflow-2b label regex

- Add '🔄 Self-improve' chip filter showing all eval runs that are part of
  the relentless self-improvement loop (any run containing workflow-2b entries)
- Fix workflow_label regex to handle alphanumeric suffixes like 'workflow-2b'
- Fix data-workflow sanitization to use safe CSS-id characters
- Timestamp now reads from filename (not entry name) for correct display

* fix(prompt): qualify_top_n — prefer wait_for_completion=true, handle BulkTracker error

Add resilience rule to PHASE 1: call bulk_qualify_leads with
wait_for_completion=true by default. If BulkTracker-not-configured
error occurs, skip retry and proceed directly to pull_leads.

Fixes TSF:4 caused by redundant async-first call followed by
synchronous retry.

* fix(prompt): qualify_top_n — explicit status line format for completed/pending split

Add precise format instruction to PHASE 3: '✓ N leads qualified · M still
processing (lead IDs: X)'. Handles 3 cases: all done, mixed, all pending.
Targets NF:4 deduction from unclear 7/3 framing in prior eval run.

* fix(prompt): qualify_top_n — cover exhausted=true in status line format

Add explicit variants for the 4 status cases:
1. exhausted=true / all pre-qualified: 'All N leads already qualified · 0 still processing'
2. all newly qualified: 'N leads qualified'
3. mixed: 'N leads qualified · M still processing (IDs)'
4. all pending: '0 leads qualified · N still processing (IDs)'

Restores TSF:5 on the pre-qualified batch edge case.
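
The four variants can be read as one small formatting function. This is a sketch of the intended output format, not code from the repo (the rule lives in the prompt text); `statusLine` is a hypothetical name and the "(lead IDs: ...)" rendering follows the wording of the earlier status-line commit.

```typescript
// Hypothetical formatter illustrating the four status-line cases.
function statusLine(qualified: number, pendingIds: string[], exhausted: boolean): string {
  // Case 1: exhausted=true, everything was already qualified.
  if (exhausted) {
    return `All ${qualified} leads already qualified · 0 still processing`;
  }
  // Case 2: all newly qualified, nothing pending.
  if (pendingIds.length === 0) {
    return `${qualified} leads qualified`;
  }
  // Cases 3 and 4: mixed, or all pending (qualified is 0).
  return `${qualified} leads qualified · ${pendingIds.length} still processing (lead IDs: ${pendingIds.join(", ")})`;
}

const mixed = statusLine(7, ["101", "102", "103"], false);
// mixed === "7 leads qualified · 3 still processing (lead IDs: 101, 102, 103)"
```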

* fix(prompt): qualify_top_n — explicit N/N count in exhausted status line

'All N/N leads already qualified' with actual count (e.g. '10/10')
so the user can verify scope. Targets NF:4 deduction for missing count.

* fix(prompt): qualify_top_n — clarify pull_leads is required for table render

Explicit note that pull_leads is always needed after bulk_qualify because
the qualification response does not contain the full lead data for the
table. Addresses judge TSF concern about 'redundant' pull_leads call.

* docs+dashboard: --improve docs in WORKFLOWS.md; Last session shows all relentless iterations

WORKFLOWS.md: add 'Self-improving evals' section documenting /eval --improve,
what it fixes, regression guard, and --dangerously-skip-permissions note.

gen-dashboard.py: 'Last run' → 'Last session' groups all eval files within
60 minutes of the newest file. Previously showed only 1 entry (the newest
file); now shows all N iterations from a relentless self-improvement run.
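
The "Last session" grouping is a time-window filter anchored on the newest file. gen-dashboard.py is Python; the sketch below shows the same logic in TypeScript. `RunFile` and `lastSession` are hypothetical names, and where the timestamp comes from (filename or file metadata) is left as an assumption.

```typescript
// Hypothetical TypeScript rendering of the grouping rule (the real script is Python).
interface RunFile {
  name: string;
  timestampMs: number; // assumed to be derived elsewhere, e.g. from the filename
}

function lastSession(files: RunFile[], windowMinutes: number = 60): RunFile[] {
  if (files.length === 0) return [];
  const newest = Math.max(...files.map((f) => f.timestampMs));
  const cutoff = newest - windowMinutes * 60 * 1000;
  // Keep every file within the window of the newest one.
  return files.filter((f) => f.timestampMs >= cutoff);
}

const minute = 60 * 1000;
const runs: RunFile[] = [
  { name: "iter-1.json", timestampMs: 0 },
  { name: "iter-2.json", timestampMs: 100 * minute },
  { name: "iter-3.json", timestampMs: 130 * minute },
];
const session = lastSession(runs);
// session contains iter-2.json and iter-3.json; iter-1.json is 130 minutes older than the newest file
```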

* fix(skill): followup_check_in — disambiguate discovery-sounding phrases in follow-up context

Adds explicit rule: "best leads"/"top leads"/"leads for today" within
the follow-up workflow always routes to leadbay_pull_followups, not
leadbay_pull_leads. Fixes workflow 2b misrouting regression.

* fix(promptforge): move memory pointer before anti-triggers in routing block

Keeps memory pointer within the 600-char truncation-safe window.
Anti-triggers were pushing it to position 672 for leadbay_pull_leads.
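
A check along these lines could guard against the pointer drifting out of the window again. It is a sketch under assumptions: `pointerSurvivesTruncation` and the marker text are hypothetical, and it treats the window as requiring the whole pointer to end at or before character 600.

```typescript
// Hypothetical guard; the marker text and the exact window semantics are assumptions.
const SAFE_WINDOW = 600;

function pointerSurvivesTruncation(description: string, pointer: string): boolean {
  const start = description.indexOf(pointer);
  // The pointer must exist and end inside the truncation-safe window.
  return start !== -1 && start + pointer.length <= SAFE_WINDOW;
}

const marker = "[memory pointer]";
const pointerFirst = marker + " " + "x".repeat(700);
const pointerLate = "x".repeat(672) + marker;
// pointerSurvivesTruncation(pointerFirst, marker) === true
// pointerSurvivesTruncation(pointerLate, marker) === false
```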

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove disposable plans, specs, and byproduct scripts

Plans and specs are session artifacts — outcomes only belong in the repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(eval): move gen-dashboard.py to packages/mcp/test/eval/helpers/

Proper home alongside the other eval helpers rather than repo root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* revert: drop workflow 2b and followup disambiguation rule

Per Milan's review: "best leads for today" should route to Discover by
default; routing should be learned from user behavior via memory, not
hard-pinned. Removed the over-eager disambiguation rule from the
followup skill and the 2b scenario from WORKFLOWS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: broaden .gitignore to .context*/ + regen stale followup_check_in SKILL.md

Two pre-merge cleanups on PR #71:

1. `.gitignore` — `.context/` only caught the literal directory. Sibling
   workspaces and tooling spawn `.context-<id>` paths that the original
   pattern missed. `.context*/` covers both.

2. `.claude-plugin/.../leadbay_followup_check_in/SKILL.md` was out of
   sync with its `.tmpl` source. The prompt template's "discovery-
   sounding phrases" disambiguation rule had been added but the
   generated SKILL.md was not re-emitted, so the Claude Code skill
   surface disagreed with the MCP prompt surface. Re-ran
   `pnpm prompts:build`; this commit lands the regenerated file.

`pnpm -r typecheck` and `pnpm -r test` (257/257) still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: milstan <milstan@gmail.com>