@@ -0,0 +1,108 @@
# Citation Verifier A/B Harness — Postmortem (2026-05-12)

## TL;DR

The A/B harness ran end-to-end against all 393 footnotes (393/393 processed, zero crashes, $9.75 cost, ~38 min wall-clock). **However, post-run forensic review by three parallel Explore agents found that the harness's methodology produced an INVALID production comparison.** The headline "90.5% Exa vs 61.7% Anthropic" number does not support any conclusion about the production `EXA_WEB_TOOLS=true` config.

This postmortem documents what went wrong, what's salvageable, and what the actual unanswered question is.

## What the harness was supposed to measure

**Production question:** Does today's `EXA_WEB_TOOLS=true` config (enabled 2026-04-18, never empirically validated end-to-end) deliver citation-verification quality equivalent to the originally validated `EXA_WEB_TOOLS=false` Anthropic config?

**Intended methodology:** Run all 393 footnotes from session 2026-03-07-1772900028 through both tool paths in isolation; compare per-category confirm rates and disagreement patterns.

## What the harness actually measured

Three structural asymmetries between harness and production were identified after the run:

### 1. LLM semantic judgment present on one arm, absent on the other (CRITICAL)

- **Production verifier**: Always runs Haiku/Sonnet to read `results[].summary` / `results[].highlights` and apply semantic judgment ("does this result substantiate the cited claim?")
- **Harness Exa arm**: Counts `results.length > 0` → CONFIRMED. No semantic judgment.
- **Harness Anthropic arm**: Haiku-as-judge ("respond CONFIRMED if results match"). Semantic judgment applied.

This means the ~29pp gap (90.5% Exa vs 61.7% Anthropic, 28.8pp exactly) is **largely or entirely the difference between counting existence and applying judgment**, not the difference between Exa-as-tool and Anthropic-as-tool.
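
One way to make the two arms commensurable is to route both through the same verdict function, so that the only variable left is the tool that produced the results. The following is a minimal sketch under stated assumptions: `SearchResult`, `Judge`, `existenceVerdict`, `judgedVerdict`, and `keywordJudge` are hypothetical names, and the judge is a stand-in for the Haiku/Sonnet call rather than the production prompt.

```typescript
// Hypothetical shapes: these names are illustrative, not the harness's real types.
type Verdict = "CONFIRMED" | "UNCONFIRMED";

interface SearchResult {
  summary?: string;
  highlights?: string[];
}

// Stand-in for the LLM judgment step (Haiku/Sonnet in production).
type Judge = (claim: string, evidence: string) => boolean;

// What the harness Exa arm effectively did: any result counts as confirmation.
function existenceVerdict(results: SearchResult[]): Verdict {
  return results.length > 0 ? "CONFIRMED" : "UNCONFIRMED";
}

// Symmetric alternative: every arm passes its results through the same judge.
function judgedVerdict(claim: string, results: SearchResult[], judge: Judge): Verdict {
  const confirmed = results.some((r) => {
    const evidence = [r.summary ?? "", ...(r.highlights ?? [])].join("\n");
    return evidence.trim().length > 0 && judge(claim, evidence);
  });
  return confirmed ? "CONFIRMED" : "UNCONFIRMED";
}

// Toy judge for demonstration only: a real judge would be an LLM call.
const keywordJudge: Judge = (claim, evidence) =>
  evidence.toLowerCase().indexOf(claim.toLowerCase()) !== -1;
```

With an off-topic result, `existenceVerdict` returns CONFIRMED while `judgedVerdict` returns UNCONFIRMED, which is the shape of the asymmetry described above.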

### 2. A3 `additionalQueries` not forwarded (MATERIAL)

- **Production**: When `EXA_ADDITIONAL_QUERIES=true` (current default), `BaseWebSearchClient.executeExaSearch` forwards orchestrator-authored Deep search variations to Exa. These are designed to be axis-distinct and disambiguating.
- **Harness**: Never forwards `additionalQueries`. Exa runs with its server-side auto-expansion only.

This means the harness's Exa arm is running a **degraded Exa configuration** — vanilla Exa without the A3 query variations that production uses.
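
To illustrate what forwarding would look like on the harness side, here is a hedged sketch of a request builder. Only the `additionalQueries` field name comes from this postmortem; `ExaSearchRequest`, `buildExaRequest`, and the `forwardAdditionalQueries` parameter are hypothetical, and the real payload built by `BaseWebSearchClient.executeExaSearch` may differ.

```typescript
// Hypothetical request shape: only `additionalQueries` is named in the postmortem.
interface ExaSearchRequest {
  query: string;
  additionalQueries?: string[];
}

// Hypothetical builder: forward orchestrator-authored variations only when the
// flag (mirroring EXA_ADDITIONAL_QUERIES) is on and there is something to send.
function buildExaRequest(
  query: string,
  variations: string[],
  forwardAdditionalQueries: boolean,
): ExaSearchRequest {
  const cleaned = variations
    .map((v) => v.trim())
    // Drop empties, copies of the primary query, and duplicate variations.
    .filter((v, i, all) => v.length > 0 && v !== query && all.indexOf(v) === i);
  if (!forwardAdditionalQueries || cleaned.length === 0) {
    return { query };
  }
  return { query, additionalQueries: cleaned };
}
```

A harness arm that always calls this with `forwardAdditionalQueries` set to `false` reproduces the degraded configuration described above.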

### 3. SEC company resolution missing on Exa arm (MATERIAL)

- **Production**: `SECWebSearchClient.searchSECFilingsWeb()` resolves company name → ticker/CIK before constructing the Exa query (e.g., `site:sec.gov "AAPL" ("Form 10-K")`).
- **Harness**: Uses raw accession number or first 120 chars of footnote body as the Exa query.

This means harness queries for SEC citations are less precise than production's.
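
The difference between the two query styles can be sketched as follows. This is an illustration only: the `TICKERS` table, `harnessSecQuery`, and `resolvedSecQuery` are hypothetical, and the actual resolution inside `SECWebSearchClient.searchSECFilingsWeb()` is not shown in this document. The query format follows the example quoted above.

```typescript
// Hypothetical lookup: production resolves names through its own ticker/CIK step.
const TICKERS: Record<string, string> = { "apple inc.": "AAPL" };

// Harness behaviour as described: raw accession number, else the first
// 120 characters of the footnote body.
function harnessSecQuery(footnoteBody: string, accession?: string): string {
  return accession ?? footnoteBody.slice(0, 120);
}

// Production-style query, following the example format quoted in the postmortem.
// Returns undefined when the company cannot be resolved.
function resolvedSecQuery(company: string, form: string): string | undefined {
  const ticker = TICKERS[company.trim().toLowerCase()];
  return ticker ? `site:sec.gov "${ticker}" ("Form ${form}")` : undefined;
}
```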

## Additional bug found

### `fetch_document` verdict logic bug (URL_VERIFIED reversal)

The harness reported **0/34 Exa CONFIRMED on URL_VERIFIED footnotes**. Agent 2's trace forensics found that 24/34 had "crawl: success" in the Exa response: Exa actually retrieved content, but the harness's verdict code did not convert "crawl success" into CONFIRMED. The "has content" check was too strict. It required `highlights.length > 0` OR a non-empty `text` OR a non-empty `summary`, but many `/contents` responses returned `text === ''` and an empty `highlights` array while `status === 'OK'`, with non-empty raw page data elsewhere in the response.

This is a fixable code bug.
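
A possible shape for the fix, as a sketch rather than the harness's actual code: `ContentsResult`, `hasContentStrict`, and `wasRetrieved` are hypothetical names, and the response fields are limited to the ones mentioned above.

```typescript
// Hypothetical response shape based on the fields named in the postmortem.
interface ContentsResult {
  status?: string; // "OK" on a successful crawl, per the trace
  text?: string;
  summary?: string;
  highlights?: string[];
}

// The too-strict check: a successful crawl with empty text and highlights fails.
function hasContentStrict(r: ContentsResult): boolean {
  return (r.highlights?.length ?? 0) > 0 || !!r.text || !!r.summary;
}

// Relaxed check: a successful crawl status also counts as "retrieved".
function wasRetrieved(r: ContentsResult): boolean {
  return r.status === "OK" || hasContentStrict(r);
}
```

Whether retrieval alone should map to CONFIRMED is a separate decision. Given asymmetry 1, it may be safer to treat `wasRetrieved` as a precondition for judgment rather than as a verdict.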

## Anthropic-arm production-fidelity issue

All 10 Anthropic errors occurred on URL_VERIFIED footnotes. Root cause: Anthropic `web_fetch_20260209` has an internal domain allowlist that excludes `treasury.gov`, `fcc.gov`, and `courtlistener.com` (per the trace's error messages). This is not a harness bug. It is a real Anthropic-tool limitation that **also affects production** when `EXA_WEB_TOOLS=false`.

In other words: if the original Anthropic-validated config were restored today, those 10 URLs would fail there too. The production "validated" baseline may have had its own holes.
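
A harness could flag such URLs up front so that these errors are reported as a known tool limitation rather than mixed into the verdict counts. The sketch below is hypothetical: the three domains are the ones named in the trace's error messages, while `hostnameOf`, `isLikelyBlocked`, and the assumption that subdomains are blocked too are illustrative.

```typescript
// Domains reported as blocked in the trace's error messages (per the postmortem).
const BLOCKED_DOMAINS = ["treasury.gov", "fcc.gov", "courtlistener.com"];

// Extract the hostname with a regex so the sketch has no runtime dependencies.
function hostnameOf(url: string): string | undefined {
  const match = /^https?:\/\/([^\/?#:]+)/i.exec(url);
  return match?.[1]?.toLowerCase();
}

// Assumption: a block on a domain also applies to its subdomains.
function isLikelyBlocked(url: string): boolean {
  const host = hostnameOf(url);
  if (host === undefined) return false;
  return BLOCKED_DOMAINS.some(
    (d) => host === d || host.slice(-(d.length + 1)) === "." + d,
  );
}
```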

## What we can NOT conclude from this run

- ❌ **Cannot conclude** that Exa over-confirms in production (90.5% reflects raw existence-counting, not production behavior)
- ❌ **Cannot conclude** that Anthropic under-confirms in production (61.7% reflects strict Haiku judgment but with degraded production fidelity)
- ❌ **Cannot recommend** `EXA_WEB_TOOLS` rollback or persistence based on this data
- ❌ **Cannot conclude** that URL_VERIFIED handling is broken on either side (harness verdict logic was the bug, not the tool path)

## What we CAN conclude

- ✅ The harness framework + fixture (393 footnotes × 7 categories × dual-arm) is reusable infrastructure for future Exa/citation experiments
- ✅ Anthropic `web_fetch_20260209` has hard-coded domain blocks on key government domains — this is a documented production-fidelity limitation regardless of which config we run
- ✅ The Exa arm's underlying `/search` and `/contents` calls work correctly (no auth issues, no rate limits hit, costs match expectations)
- ✅ The Anthropic arm's tool-choice forced invocation works correctly (after the `allowed_callers: ['direct']` fix during smoke testing)
- ✅ The 105/105 STATUTORY agreement proves regex-only paths execute identically — when the methodology is symmetric, results align

## What the actual unanswered question requires

To validly compare the `EXA_WEB_TOOLS=true` config vs the validated baseline, the right method is **not** an isolated harness — it's running the **actual production verifier subagent** in both configs and comparing certificate outputs:

1. Set `EXA_WEB_TOOLS=true` and run the citation-websearch-verifier subagent on the 2026-03-07 fixture session. Note that `qa-outputs/citation-verification-certificate.md` already exists for that session, but it reflects only the run-of-record Anthropic config, since `EXA_WEB_TOOLS=false` was the default at the time.
2. Re-run the same subagent against the same fixture with the OPPOSITE config flag
3. Compare per-footnote CONFIRMED rates from the two certificate files

This requires invoking the live production agent loop (the SDK orchestrator dispatching to the citation-websearch-verifier subagent), which the isolated harness deliberately avoided to keep costs low and methodology simple. Cost: ~$0.02/memo × 2 modes = ~$0.04, plus ~10 min wall-clock. That is cheaper and more truthful than the harness was.
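
The comparison in step 3 can be sketched as below, assuming each certificate has already been parsed into a map from footnote id to verdict. Parsing the markdown certificate is out of scope here, and `Certificate`, `confirmRate`, and `compareCertificates` are hypothetical names, not existing tooling.

```typescript
type Verdict = "CONFIRMED" | "UNCONFIRMED" | "ERROR" | "SKIP";

// Hypothetical parsed form of a certificate: footnote id -> verdict.
type Certificate = Record<string, Verdict>;

interface Comparison {
  confirmRateA: number;
  confirmRateB: number;
  agree: number;
  disagree: string[]; // footnote ids whose verdicts differ
}

// Confirm rate over scored footnotes (SKIP excluded from the denominator).
function confirmRate(cert: Certificate): number {
  const scored = Object.keys(cert).filter((id) => cert[id] !== "SKIP");
  if (scored.length === 0) return 0;
  return scored.filter((id) => cert[id] === "CONFIRMED").length / scored.length;
}

// Compare only footnotes present in both certificates.
function compareCertificates(a: Certificate, b: Certificate): Comparison {
  const shared = Object.keys(a).filter((id) => id in b);
  const disagree = shared.filter((id) => a[id] !== b[id]);
  return {
    confirmRateA: confirmRate(a),
    confirmRateB: confirmRate(b),
    agree: shared.length - disagree.length,
    disagree,
  };
}
```

Because both certificates come from the same production verifier, the per-footnote `disagree` list isolates the effect of the config flag rather than the effect of the judgment layer.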

## Recommendations

1. **DO NOT roll back to `EXA_WEB_TOOLS=false`** based on this run. The data does not support that recommendation.
2. **Fix the `fetch_document` verdict logic** in the harness (small bug, cheap to fix) if anyone re-runs.
3. **Document the harness's three asymmetries** (this file).
4. **For the real production question**, run a side-by-side certificate comparison (see above) using the actual production verifier — not an isolated harness.
5. **Preserve all artifacts** from this run:
   - `docs/runbooks/citation-verifier-ab-trace-2026-05-12.json` (raw per-footnote data, valid as raw measurements; invalid as a production comparison)
   - `docs/runbooks/citation-verifier-ab-report-2026-05-12.md` (markdown; note: the verdict in this file is INVALID per this postmortem)
   - `/tmp/cv-ab-smoke-trace-backup-*.json` (smoke test, 100% agreement)
   - `/tmp/cv-ab-flawed-trace-backup.json` + `/tmp/cv-ab-flawed-report-backup.md` (pre-postmortem snapshots)
   - This postmortem (intellectual honesty record)

## What this cost

- $9.75 in API spend
- ~38 min wall-clock
- 4 new test files + 1 fixture + 2 runbook outputs + this postmortem (~1,500 LoC + ~400KB data)
- Three Explore-agent forensic reviews (no per-agent dollar cost)

## What this taught us

The harness's framework is sound. The methodology had three asymmetry blind spots that only surfaced under forensic review. **The lesson is to apply the same LLM judgment layer on both arms before comparing tools** — otherwise the comparison measures "with judgment vs without," not "tool A vs tool B."

This pattern is worth remembering for future Exa-vs-X experiments: any LLM-mediated production workflow requires matched LLM mediation in the test harness, or the comparison is structurally incommensurable.

---

**Status:** the citation-verifier-ab harness ships as **tooling** (mergeable like PR #116), but its specific 2026-05-12 verdict is **superseded by this postmortem**. Production config (`EXA_WEB_TOOLS=true`) remains unchanged. The originally asked question (does the production Exa config deliver quality equivalent to the validated Anthropic baseline?) remains **open and unanswered**; answering it requires a different methodology.
@@ -0,0 +1,52 @@
> # ⚠️ THIS REPORT'S VERDICT IS INVALID
>
> Post-run forensic review by 3 Explore agents identified three structural methodology issues and one verdict-logic bug that make the 90.5% vs 61.7% comparison incommensurable. See **[`citation-verifier-ab-postmortem-2026-05-12.md`](citation-verifier-ab-postmortem-2026-05-12.md)** for the full analysis.
>
> Summary of issues:
> 1. The Anthropic arm applies Haiku-as-judge semantic matching; the Exa arm counts `results.length > 0`. The ~29pp gap is largely this asymmetry, not Exa vs Anthropic.
> 2. The Exa arm never forwards `additionalQueries` (A3); production forwards them when `EXA_ADDITIONAL_QUERIES=true`. Harness measures a degraded Exa config.
> 3. SEC queries use raw text; production uses ticker/CIK resolution.
> 4. `fetch_document` verdict logic incorrectly reports UNCONFIRMED on successful crawls that return an empty `highlights` array.
>
> The numbers below are accurate measurements of what the harness did. They are NOT a valid production comparison. Do not act on this verdict.

# Citation Verifier A/B Report — Exa vs Anthropic Tool Path

**Date:** 2026-05-12T04-19-55-368Z
**Fixture:** ../../../super-legal-mcp-refactored/reports/2026-03-07-1772900028/consolidated-footnotes.md
**Footnotes:** 393
**Arms:** exa, anthropic
**Verdict:** **NOT_VIABLE**

---

## Aggregate

| Arm | Confirmed | Unconfirmed | Error | Skip | Confirm Rate | Total Cost | Mean Latency |
|---|---|---|---|---|---|---|---|
| Exa | 324 | 34 | 0 | 35 | 0.905 | $3.668 | 19217ms |
| Anthropic | 221 | 127 | 10 | 35 | 0.617 | $6.081 | 2573ms |

**Agreement:** AGREE=240 DISAGREE=153 (rate: 0.611)

## Decision Rule

| Criterion | Value | Threshold | Pass |
|---|---|---|---|
| overall_rate_gap | 0.288 | ≤ 0.05 | ✗ |
| agreement_rate | 0.611 | ≥ 0.85 | ✗ |
| error_rate_exa | 0 | ≤ 0.05 | ✓ |
| error_rate_anthropic | 0.025 | ≤ 0.05 | ✓ |
| statutory_match | 0 | 0.00 | ✓ |
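
The first four criteria can be reproduced from the Aggregate counts. The sketch below is a reconstruction, not the harness's code: the denominators are inferred from the reported values (324/358 and 221/358 give 0.905 and 0.617; 10/393 gives 0.025), and `ArmCounts`, `Check`, and `evaluate` are hypothetical names. The `statutory_match` criterion is omitted because its definition is not given in this report.

```typescript
// Counts per arm, as laid out in the Aggregate table.
interface ArmCounts {
  confirmed: number;
  unconfirmed: number;
  error: number;
  skip: number;
}

interface Check {
  criterion: string;
  value: number;
  pass: boolean;
}

function total(c: ArmCounts): number {
  return c.confirmed + c.unconfirmed + c.error + c.skip;
}

// Assumption: SKIP is excluded from the confirm-rate denominator.
function confirmRate(c: ArmCounts): number {
  return c.confirmed / (total(c) - c.skip);
}

// Assumption: the error rate is taken over all footnotes, including SKIP.
function errorRate(c: ArmCounts): number {
  return c.error / total(c);
}

function evaluate(exa: ArmCounts, anthropic: ArmCounts, agree: number): Check[] {
  const gap = Math.abs(confirmRate(exa) - confirmRate(anthropic));
  const agreement = agree / total(exa);
  const errExa = errorRate(exa);
  const errAnthropic = errorRate(anthropic);
  return [
    { criterion: "overall_rate_gap", value: gap, pass: gap <= 0.05 },
    { criterion: "agreement_rate", value: agreement, pass: agreement >= 0.85 },
    { criterion: "error_rate_exa", value: errExa, pass: errExa <= 0.05 },
    { criterion: "error_rate_anthropic", value: errAnthropic, pass: errAnthropic <= 0.05 },
  ];
}

// Counts taken from the Aggregate table in this report.
const checks = evaluate(
  { confirmed: 324, unconfirmed: 34, error: 0, skip: 35 },
  { confirmed: 221, unconfirmed: 127, error: 10, skip: 35 },
  240,
);
```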

## Per-Category Breakdown

| Category | n | Exa CFM | Exa UNC | Exa ERR | An CFM | An UNC | An ERR | Agree | Disagree |
|---|---|---|---|---|---|---|---|---|---|
| STATUTORY | 105 | 105 | 0 | 0 | 105 | 0 | 0 | 105 | 0 |
| SEC | 50 | 50 | 0 | 0 | 19 | 31 | 0 | 19 | 31 |
| GOV | 36 | 36 | 0 | 0 | 11 | 25 | 0 | 11 | 25 |
| CASE_LAW | 26 | 26 | 0 | 0 | 14 | 12 | 0 | 14 | 12 |
| OTHER | 107 | 107 | 0 | 0 | 52 | 55 | 0 | 52 | 55 |
| SKIP | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 0 |
| URL_VERIFIED | 34 | 0 | 34 | 0 | 20 | 4 | 10 | 4 | 30 |