Number531 · May 12, 2026
diff --git a/...egal-mcp-refactored/docs/runbooks/citation-verifier-ab-postmortem-2026-05-12.md b/...egal-mcp-refactored/docs/runbooks/citation-verifier-ab-postmortem-2026-05-12.md
@@ -0,0 +1,108 @@
+# Citation Verifier A/B Harness — Postmortem (2026-05-12)
+
+## TL;DR
+
+The A/B harness ran end-to-end against all 393 footnotes (zero crashes, $9.75 cost, ~38 min wall-clock). **However, post-run forensic review by three parallel Explore agents found that the harness's methodology produced an INVALID production comparison.** The headline "90.5% Exa vs 61.7% Anthropic" number does not support any conclusion about the production `EXA_WEB_TOOLS=true` config.
+
+This postmortem documents what went wrong, what's salvageable, and what the actual unanswered question is.
+
+## What the harness was supposed to measure
+
+**Production question:** Does today's `EXA_WEB_TOOLS=true` config (enabled 2026-04-18, never empirically validated end-to-end) deliver citation-verification quality equivalent to that of the originally-validated `EXA_WEB_TOOLS=false` Anthropic config?
+
+**Intended methodology:** Run all 393 footnotes from session 2026-03-07-1772900028 through both tool paths in isolation; compare per-category confirm rates and disagreement patterns.
+
+## What the harness actually measured
+
+Three structural asymmetries between harness and production were identified after the run:
+
+### 1. LLM semantic judgment present on one arm, absent on the other (CRITICAL)
+
+- **Production verifier**: Always runs Haiku/Sonnet to read `results[].summary` / `results[].highlights` and apply semantic judgment ("does this result substantiate the cited claim?")
+- **Harness Exa arm**: Counts `results.length > 0` → CONFIRMED. No semantic judgment.
+- **Harness Anthropic arm**: Haiku-as-judge ("respond CONFIRMED if results match"). Semantic judgment applied.
+
+This means the 29pp gap (90.5% Exa vs 61.7% Anthropic) is **largely or entirely the difference between counting existence vs applying judgment** — not between Exa-as-tool and Anthropic-as-tool.
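+
+A minimal sketch of the asymmetry and of a matched alternative. The `SearchResult` shape, the `Judge` callback, and both function names are hypothetical stand-ins for illustration, not the harness's actual code:
+
+```typescript
+type Verdict = "CONFIRMED" | "UNCONFIRMED";
+
+// Hypothetical result shape; field names mirror the ones mentioned above.
+interface SearchResult {
+  summary?: string;
+  highlights?: string[];
+}
+
+// Stand-in for the LLM judge (Haiku/Sonnet in production).
+type Judge = (claim: string, result: SearchResult) => boolean;
+
+// What the harness Exa arm did: any result at all counts as confirmation.
+function existenceVerdict(results: SearchResult[]): Verdict {
+  return results.length > 0 ? "CONFIRMED" : "UNCONFIRMED";
+}
+
+// Matched alternative: both arms pass their results through the same judge.
+function judgedVerdict(claim: string, results: SearchResult[], judge: Judge): Verdict {
+  return results.some((r) => judge(claim, r)) ? "CONFIRMED" : "UNCONFIRMED";
+}
+
+// Toy judge for this example only: a naive keyword match on the summary.
+const keywordJudge: Judge = (claim, r) =>
+  (r.summary ?? "").toLowerCase().indexOf(claim.toLowerCase()) !== -1;
+
+const offTopic: SearchResult[] = [{ summary: "Unrelated press release" }];
+console.log(existenceVerdict(offTopic)); // CONFIRMED
+console.log(judgedVerdict("form 10-k", offTopic, keywordJudge)); // UNCONFIRMED
+```
+
+The same off-topic result set yields opposite verdicts depending only on whether a judge is applied, which is the effect the aggregate gap conflates with the choice of tool.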
+
+### 2. A3 `additionalQueries` not forwarded (MATERIAL)
+
+- **Production**: When `EXA_ADDITIONAL_QUERIES=true` (current default), `BaseWebSearchClient.executeExaSearch` forwards orchestrator-authored Deep search variations to Exa. These are designed to be axis-distinct and disambiguating.
+- **Harness**: Never forwards `additionalQueries`. Exa runs with its server-side auto-expansion only.
+
+This means the harness's Exa arm is running a **degraded Exa configuration** — vanilla Exa without the A3 query variations that production uses.
+
+### 3. SEC company resolution missing on Exa arm (MATERIAL)
+
+- **Production**: `SECWebSearchClient.searchSECFilingsWeb()` resolves company name → ticker/CIK before constructing the Exa query (e.g., `site:sec.gov "AAPL" ("Form 10-K")`).
+- **Harness**: Uses raw accession number or first 120 chars of footnote body as the Exa query.
+
+This means harness queries for SEC citations are less precise than production's.
+
+## Additional bug found
+
+### `fetch_document` verdict logic bug (URL_VERIFIED reversal)
+
+The harness reported **0/34 Exa CONFIRMED on URL_VERIFIED footnotes**. Agent 2's trace forensics found that 24/34 had "crawl: success" in the Exa response, meaning Exa actually retrieved content but the harness's verdict code did not convert "crawl success" into CONFIRMED. The "has content" check was too strict: it required `highlights.length > 0` OR a non-empty `text` OR a non-empty `summary`, but many `/contents` responses returned `text === ''` and an empty `highlights` array while `status === 'OK'`, with non-empty raw page data elsewhere in the response.
+
+This is a fixable code bug.
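+
+A hedged sketch of the check described above and of one possible relaxation. The `ContentsResult` shape contains only the fields named in this section, and both function names are hypothetical; the relaxed rule is an assumption about what a fix could look like, not the patch itself:
+
+```typescript
+// Hypothetical shape of one `/contents` result; only the fields named above.
+interface ContentsResult {
+  status: string;
+  text?: string;
+  summary?: string;
+  highlights?: string[];
+}
+
+// The strict check: requires highlights, extracted text, or a summary.
+function hasContentStrict(r: ContentsResult): boolean {
+  return (r.highlights?.length ?? 0) > 0 || Boolean(r.text) || Boolean(r.summary);
+}
+
+// Possible relaxation (assumption): a successful crawl counts as retrieval
+// evidence even when the extracted fields are empty.
+function crawlSucceeded(r: ContentsResult): boolean {
+  return r.status === "OK" || hasContentStrict(r);
+}
+
+const emptyButOk: ContentsResult = { status: "OK", text: "", highlights: [] };
+console.log(hasContentStrict(emptyButOk)); // false
+console.log(crawlSucceeded(emptyButOk)); // true
+```
+
+Whether a successful crawl alone should map to CONFIRMED, or should instead be handed to the same judge used on the other arm, is a separate decision from fixing the check.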
+
+## Anthropic-arm production-fidelity issue
+
+10 Anthropic errors were all URL_VERIFIED footnotes. Root cause: Anthropic `web_fetch_20260209` has an internal domain allowlist that blocks `treasury.gov`, `fcc.gov`, `courtlistener.com` (per the trace's error messages). This is not a harness bug — it's a real Anthropic-tool limitation that **also affects production** when `EXA_WEB_TOOLS=false`.
+
+In other words: if the original Anthropic-validated config were restored today, those 10 URLs would fail there too. The production "validated" baseline may have had its own holes.
+
+## What we can NOT conclude from this run
+
+- ❌ **Cannot conclude** that Exa over-confirms in production (90.5% reflects raw existence-counting, not production behavior)
+- ❌ **Cannot conclude** that Anthropic under-confirms in production (61.7% reflects strict Haiku judgment but with degraded production fidelity)
+- ❌ **Cannot recommend** `EXA_WEB_TOOLS` rollback or persistence based on this data
+- ❌ **Cannot conclude** that URL_VERIFIED handling is broken on either side (harness verdict logic was the bug, not the tool path)
+
+## What we CAN conclude
+
+- ✅ The harness framework + fixture (393 footnotes across 7 categories, run on two arms) is reusable infrastructure for future Exa/citation experiments
+- ✅ Anthropic `web_fetch_20260209` has hard-coded domain blocks on key government domains — this is a documented production-fidelity limitation regardless of which config we run
+- ✅ The Exa arm's underlying `/search` and `/contents` calls work correctly (no auth issues, no rate limits hit, costs match expectations)
+- ✅ The Anthropic arm's tool-choice forced invocation works correctly (after the `allowed_callers: ['direct']` fix during smoke testing)
+- ✅ The 105/105 STATUTORY agreement shows that the regex-only paths behave identically on both arms: when the methodology is symmetric, results align
+
+## What the actual unanswered question requires
+
+To validly compare the `EXA_WEB_TOOLS=true` config vs the validated baseline, the right method is **not** an isolated harness — it's running the **actual production verifier subagent** in both configs and comparing certificate outputs:
+
+1. Run the citation-websearch-verifier subagent on the 2026-03-07 fixture session with `EXA_WEB_TOOLS=true`. A certificate for that session already exists (`qa-outputs/citation-verification-certificate.md`), but it was produced under the run-of-record Anthropic config, because `EXA_WEB_TOOLS=false` was the default at the time.
+2. Re-run the same subagent against the same fixture with the opposite config flag (`EXA_WEB_TOOLS=false`)
+3. Compare per-footnote CONFIRMED rates from the two certificate files
+
+This requires invoking the live production agent loop (the SDK orchestrator dispatching to the citation-websearch-verifier subagent), which the isolated harness deliberately avoided to keep costs low and methodology simple. Estimated cost: ~$0.02/memo × 2 modes = ~$0.04, plus ~10 min wall-clock. That is cheaper and more truthful than the harness run.
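+
+The comparison in step 3 could be sketched as follows. The `Certificate` type (footnote id mapped to verdict) is a hypothetical parsed form; how the certificate markdown is parsed into it is left out, and the footnote ids and verdicts are made-up sample data:
+
+```typescript
+type Verdict = "CONFIRMED" | "UNCONFIRMED" | "ERROR";
+
+// Hypothetical parsed certificate: footnote id -> verdict.
+type Certificate = Record<string, Verdict>;
+
+// Fraction of footnotes in a certificate that were CONFIRMED.
+function confirmRate(cert: Certificate): number {
+  const ids = Object.keys(cert);
+  if (ids.length === 0) return 0;
+  const confirmed = ids.filter((id) => cert[id] === "CONFIRMED").length;
+  return confirmed / ids.length;
+}
+
+// Footnote ids present in both certificates whose verdicts differ.
+function disagreements(a: Certificate, b: Certificate): string[] {
+  return Object.keys(a).filter((id) => id in b && a[id] !== b[id]);
+}
+
+// Made-up sample data, not results from any real run.
+const anthropicRun: Certificate = { fn1: "CONFIRMED", fn2: "UNCONFIRMED", fn3: "CONFIRMED" };
+const exaRun: Certificate = { fn1: "CONFIRMED", fn2: "CONFIRMED", fn3: "CONFIRMED" };
+
+console.log(confirmRate(anthropicRun).toFixed(3)); // 0.667
+console.log(confirmRate(exaRun).toFixed(3)); // 1.000
+console.log(disagreements(anthropicRun, exaRun)); // [ 'fn2' ]
+```
+
+Because both certificates come from the same production verifier, any difference in these numbers is attributable to the config flag rather than to the judgment layer.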
+
+## Recommendations
+
+1. **DO NOT roll back to `EXA_WEB_TOOLS=false`** based on this run. The data does not support that recommendation.
+2. **Fix the `fetch_document` verdict logic** in the harness (small bug, cheap to fix) if anyone re-runs.
+3. **Document the harness's three asymmetries** (this file).
+4. **For the real production question**, run a side-by-side certificate comparison (see above) using the actual production verifier — not an isolated harness.
+5. **Preserve all artifacts** from this run:
+   - `docs/runbooks/citation-verifier-ab-trace-2026-05-12.json` (raw per-footnote data, valid as raw measurements; invalid as a production comparison)
+   - `docs/runbooks/citation-verifier-ab-report-2026-05-12.md` (markdown — note: the verdict in this file is INVALID per this postmortem)
+   - `/tmp/cv-ab-smoke-trace-backup-*.json` (smoke test, 100% agreement)
+   - `/tmp/cv-ab-flawed-trace-backup.json` + `/tmp/cv-ab-flawed-report-backup.md` (pre-postmortem snapshots)
+   - This postmortem (intellectual honesty record)
+
+## What this cost
+
+- $9.75 in API spend
+- ~38 min wall-clock
+- 4 new test files + 1 fixture + 2 runbook outputs + this postmortem (~1,500 LoC + ~400KB data)
+- Three Explore-agent forensic reviews (no per-agent dollar cost)
+
+## What this taught us
+
+The harness's framework is sound. The methodology had three asymmetry blind spots that only surfaced under forensic review. **The lesson is to apply the same LLM judgment layer on both arms before comparing tools** — otherwise the comparison measures "with judgment vs without," not "tool A vs tool B."
+
+This pattern is worth remembering for future Exa-vs-X experiments: any LLM-mediated production workflow requires matched LLM mediation in the test harness, or the comparison is structurally incommensurable.
+
+---
+
+**Status:** the citation-verifier-ab harness ships as **tooling** (mergeable like PR #116), but its specific 2026-05-12 verdict is **superseded by this postmortem**. Production config (`EXA_WEB_TOOLS=true`) remains unchanged. The originally asked question (does the production Exa config deliver quality equivalent to the validated Anthropic baseline?) remains **open and unanswered**; addressing it requires a different methodology.
diff --git a/super-legal-mcp-refactored/docs/runbooks/citation-verifier-ab-report-2026-05-12.md b/super-legal-mcp-refactored/docs/runbooks/citation-verifier-ab-report-2026-05-12.md
@@ -0,0 +1,52 @@
+> # ⚠️ THIS REPORT'S VERDICT IS INVALID
+>
+> Post-run forensic review by 3 Explore agents identified three structural methodology issues, plus one verdict-logic bug, that make the 90.5% vs 61.7% comparison incommensurable. See **[`citation-verifier-ab-postmortem-2026-05-12.md`](citation-verifier-ab-postmortem-2026-05-12.md)** for the full analysis.
+>
+> Summary of issues:
+> 1. The Anthropic arm applies Haiku-as-judge semantic matching; the Exa arm counts `results.length > 0`. The ~29pp gap is largely this asymmetry, not Exa vs Anthropic.
+> 2. The Exa arm never forwards `additionalQueries` (A3); production forwards them when `EXA_ADDITIONAL_QUERIES=true`. Harness measures a degraded Exa config.
+> 3. SEC queries use raw text; production uses ticker/CIK resolution.
+> 4. The `fetch_document` verdict logic incorrectly reports UNCONFIRMED on successful crawls that return an empty `highlights` array.
+>
+> The numbers below are accurate measurements of what the harness did. They are NOT a valid production comparison. Do not act on this verdict.
+
+# Citation Verifier A/B Report — Exa vs Anthropic Tool Path
+
+**Date:** 2026-05-12T04:19:55.368Z
+**Fixture:** ../../../super-legal-mcp-refactored/reports/2026-03-07-1772900028/consolidated-footnotes.md
+**Footnotes:** 393
+**Arms:** exa, anthropic
+**Verdict:** **NOT_VIABLE**
+
+---
+
+## Aggregate
+
+| Arm | Confirmed | Unconfirmed | Error | Skip | Confirm Rate | Total Cost | Mean Latency |
+|---|---|---|---|---|---|---|---|
+| Exa | 324 | 34 | 0 | 35 | 0.905 | $3.668 | 19217ms |
+| Anthropic | 221 | 127 | 10 | 35 | 0.617 | $6.081 | 2573ms |
+
+**Agreement:** AGREE=240 DISAGREE=153 (rate: 0.611)
+
+## Decision Rule
+
+| Criterion | Value | Threshold | Pass |
+|---|---|---|---|
+| overall_rate_gap | 0.288 | ≤ 0.05 | ✗ |
+| agreement_rate | 0.611 | ≥ 0.85 | ✗ |
+| error_rate_exa | 0 | ≤ 0.05 | ✓ |
+| error_rate_anthropic | 0.025 | ≤ 0.05 | ✓ |
+| statutory_match | 0 | 0.00 | ✓ |
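+
+The criterion values can be reproduced from the Aggregate counts. The denominators below are inferred from the reported numbers (they match to three decimals) rather than taken from the harness source, and the helper names are hypothetical:
+
+```typescript
+// Counts copied from the Aggregate table.
+interface ArmCounts {
+  confirmed: number;
+  unconfirmed: number;
+  error: number;
+  skip: number;
+}
+
+const exa: ArmCounts = { confirmed: 324, unconfirmed: 34, error: 0, skip: 35 };
+const anthropic: ArmCounts = { confirmed: 221, unconfirmed: 127, error: 10, skip: 35 };
+const agree = 240;
+const disagree = 153;
+
+function total(c: ArmCounts): number {
+  return c.confirmed + c.unconfirmed + c.error + c.skip;
+}
+
+// Assumption: the confirm rate excludes skipped footnotes from the denominator.
+function confirmRate(c: ArmCounts): number {
+  return c.confirmed / (total(c) - c.skip);
+}
+
+// Assumption: the error rate is taken over all footnotes, skips included.
+function errorRate(c: ArmCounts): number {
+  return c.error / total(c);
+}
+
+const gap = Math.abs(confirmRate(exa) - confirmRate(anthropic));
+const agreementRate = agree / (agree + disagree);
+
+console.log(gap.toFixed(3), gap <= 0.05); // 0.288 false
+console.log(agreementRate.toFixed(3), agreementRate >= 0.85); // 0.611 false
+console.log(errorRate(anthropic).toFixed(3), errorRate(anthropic) <= 0.05); // 0.025 true
+```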
+
+## Per-Category Breakdown
+
+| Category | n | Exa confirmed | Exa unconfirmed | Exa error | Anthropic confirmed | Anthropic unconfirmed | Anthropic error | Agree | Disagree |
+|---|---|---|---|---|---|---|---|---|---|
+| STATUTORY | 105 | 105 | 0 | 0 | 105 | 0 | 0 | 105 | 0 |
+| SEC | 50 | 50 | 0 | 0 | 19 | 31 | 0 | 19 | 31 |
+| GOV | 36 | 36 | 0 | 0 | 11 | 25 | 0 | 11 | 25 |
+| CASE_LAW | 26 | 26 | 0 | 0 | 14 | 12 | 0 | 14 | 12 |
+| OTHER | 107 | 107 | 0 | 0 | 52 | 55 | 0 | 52 | 55 |
+| SKIP | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 0 |
+| URL_VERIFIED | 34 | 0 | 34 | 0 | 20 | 4 | 10 | 4 | 30 |