Merged
@@ -91,6 +91,104 @@ LIMIT 10;

Expected: `divergence` is 0 or within ±2 for every row. Larger = investigate.

## Detecting cert confabulation (added 2026-05-12 from PR #130 findings)

PR #130 surfaced that the verifier model can write a certificate claiming verification methods (e.g., `fetch_document`, `exa_web_search`) that were never actually invoked at the tool level. Haiku in deep mode did this completely (0 tool calls, 17 method-label confabulations); Sonnet partially (12 tool calls but 42 "structural" / "reporter knowledge" pattern confirmations).

The `subagent_tool_usage` hook event already counts real tool invocations per category. A cert that claims more tool-based confirmations than telemetry recorded is **confabulating** — a regulator-facing data-integrity risk.

### Cross-check query — telemetry vs cert claims (per session)

<!-- noqa:07 — schema-doc-validator parses string literals like 'fetch_document' inside length('X') as column refs; false positive -->
<!-- noqa:05 -->
```sql
-- Compares claimed cert methods against actual telemetry counts.
-- Run for any session that ran citation-websearch-verifier in deep mode.
--
-- A row where claimed_X > actual_X indicates the verifier wrote method
-- attributions in the cert that the tool-call telemetry doesn't support.
WITH telemetry AS (
SELECT
s.session_key,
s.id AS session_id,
-- Pull tool_counts from the subagent_tool_usage event in hook_audit_log
-- (logged at SubagentStop with cumulative per-subagent counts)
(h.event_data->'tool_counts'->>'exaWebSearches')::int AS actual_exa_searches,
(h.event_data->'tool_counts'->>'fetchDocumentCalls')::int AS actual_fetch_docs,
(h.event_data->'tool_counts'->>'mcpCalls')::int AS actual_mcp_calls,
(h.event_data->'tool_counts'->>'totalToolCalls')::int AS total_tool_calls
FROM sessions s
JOIN hook_audit_log h ON h.session_id = s.id
WHERE h.event_type = 'SubagentStop'
AND h.agent_type = 'citation-websearch-verifier'
AND h.event_data ? 'tool_counts'
),
cert_claims AS (
-- Count method-column appearances in the cert text. Crude but effective:
-- substring-counts of method-name tokens in reports.content.
SELECT
r.session_id,
-- Each method-name appearance roughly = one claimed verification
(length(r.content) - length(replace(r.content, 'fetch_document', '')))
/ length('fetch_document') AS claimed_fetch_docs,
(length(r.content) - length(replace(r.content, 'exa_web_search', '')))
/ length('exa_web_search') AS claimed_exa_searches,
(length(r.content) - length(replace(r.content, 'lookup_citation', '')))
/ length('lookup_citation') AS claimed_lookup_citation,
(length(r.content) - length(replace(r.content, 'search_sec_filings', '')))
/ length('search_sec_filings') AS claimed_search_sec
FROM reports r
WHERE r.report_type = 'qa'
AND r.report_key = 'citation-verification-certificate'
)
SELECT
t.session_key,
t.actual_fetch_docs, c.claimed_fetch_docs,
t.actual_exa_searches, c.claimed_exa_searches,
t.actual_mcp_calls, c.claimed_lookup_citation + c.claimed_search_sec AS claimed_mcp_total,
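-- Confabulation flag: claimed > actual + 1 (the +1 absorbs legend/header mentions).
-- Only the first method that exceeds the tolerance is reported per row.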
CASE
WHEN c.claimed_fetch_docs > t.actual_fetch_docs + 1 THEN 'fetch_document'
WHEN c.claimed_exa_searches > t.actual_exa_searches + 1 THEN 'exa_web_search'
WHEN c.claimed_lookup_citation + c.claimed_search_sec > t.actual_mcp_calls + 1 THEN 'mcp'
ELSE NULL
END AS confabulation_method
FROM telemetry t
JOIN cert_claims c ON c.session_id = t.session_id
WHERE t.total_tool_calls IS NOT NULL
ORDER BY t.session_key DESC
LIMIT 20;
```

**Interpretation:**
- `confabulation_method IS NULL` → cert claims match telemetry (good)
- `confabulation_method = 'fetch_document'` etc. → cert claims more method-X invocations than telemetry recorded. **Investigate.** The +1 tolerance handles minor counting noise (method name appearing in legend/header).
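
The substring-count heuristic and the +1 tolerance can be hard to reason about inside SQL, so here is a minimal JavaScript sketch of the same logic for experimenting on a cert excerpt. `countToken` and `flagMethod` are hypothetical helper names, and the sample cert text and telemetry counts are made up for illustration. One deliberate difference: a missing telemetry count is treated as 0 here, whereas in the SQL a NULL count never flags.

```javascript
// Sketch only: mirrors the SQL heuristic; not project code.

// Count non-overlapping occurrences of `token` in `text`, equivalent to
// (length(text) - length(replace(text, token, ''))) / length(token) in the SQL.
function countToken(text, token) {
  return text.split(token).length - 1;
}

// Return the first method whose claimed count exceeds actual + tolerance, else null.
// A missing actual count is treated as 0 (stricter than the SQL's NULL handling).
function flagMethod(claimed, actual, tolerance = 1) {
  for (const method of Object.keys(claimed)) {
    if (claimed[method] > (actual[method] ?? 0) + tolerance) return method;
  }
  return null;
}

// Illustrative cert excerpt (not real data).
const cert = [
  "| fn 1 | fetch_document |",
  "| fn 2 | fetch_document |",
  "| fn 3 | exa_web_search |",
].join("\n");

const claimed = {
  fetch_document: countToken(cert, "fetch_document"),
  exa_web_search: countToken(cert, "exa_web_search"),
};

console.log(claimed, flagMethod(claimed, { fetch_document: 0, exa_web_search: 1 }));
```

Because the count is purely textual, any mention of a method name in a legend or header inflates the claimed count, which is what the tolerance is meant to absorb.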

### Tier-3 health check addition

Add to the `infrastructure-health --tier 3` sweep when `CITATION_DEEP_VERIFICATION=true` is observed in `/health.feature_flags`:

```bash
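# Run cross-check query against last 24h of deep-mode sessions.
# Note: the cross-check query as written does not expose created_at and ends
# with LIMIT 20. Before wrapping it here, add h.created_at to its telemetry CTE
# and final SELECT, and drop the inner LIMIT, or the outer filter below fails.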
psql -d super_legal -c "$(cat <<'SQL'
SELECT session_key, confabulation_method, actual_fetch_docs, claimed_fetch_docs
FROM (<query above>) AS audit
WHERE confabulation_method IS NOT NULL
AND created_at > NOW() - INTERVAL '24 hours';
SQL
)"
```

If query returns rows → WARNING (deep mode is confabulating; escalate). If empty → PASSED.

### Proposed Prometheus alert (future work)

Not yet wired — `CitationVerifierMethodConfabulation` would fire when cert claims diverge from `subagent_tool_usage` telemetry. Requires either:
- DB-query-backed alert (Prometheus doesn't natively query Postgres; would need an exporter), OR
- Hook-side computation: at SubagentStop, parse the cert, compare to telemetry, emit `citation_verifier_confabulation_total{method}` counter

Tracked as P1 follow-up from PR #130; ~10-min implementation in `hookDBBridge.persistState()`.
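
A rough sketch of the hook-side option, assuming the claimed counts have already been extracted from the cert text and that the telemetry object uses the `tool_counts` keys shown in the query above. `confabulationIncrements` is a hypothetical function, not the actual `hookDBBridge` code, and the returned label sets would still need to be passed to whatever metrics client the project uses.

```javascript
// Hypothetical sketch of the SubagentStop comparison; not the real hook code.
const TOLERANCE = 1; // same +1 tolerance as the cross-check query

// claimed: counts keyed by cert method name.
// toolCounts: event_data.tool_counts from the SubagentStop event.
// Returns one label set per method that should increment
// citation_verifier_confabulation_total{method}.
function confabulationIncrements(claimed, toolCounts) {
  if (!toolCounts) return []; // no telemetry captured: cannot judge either way
  const pairs = [
    ["fetch_document", claimed.fetch_document ?? 0, toolCounts.fetchDocumentCalls ?? 0],
    ["exa_web_search", claimed.exa_web_search ?? 0, toolCounts.exaWebSearches ?? 0],
    [
      "mcp",
      (claimed.lookup_citation ?? 0) + (claimed.search_sec_filings ?? 0),
      toolCounts.mcpCalls ?? 0,
    ],
  ];
  return pairs
    .filter(([, claimedCount, actualCount]) => claimedCount > actualCount + TOLERANCE)
    .map(([method]) => ({ method }));
}

// Illustrative values only.
console.log(
  confabulationIncrements(
    { fetch_document: 17 },
    { fetchDocumentCalls: 0, exaWebSearches: 0, mcpCalls: 0 }
  )
);
```

Returning label sets instead of calling a metrics library keeps the comparison testable on its own.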

## Alert response runbook

### `CitationVerifierConfirmationRateLow` (WARNING, <90% 1h)
@@ -110,6 +110,101 @@ LIMIT 20;

Useful when a regulator question is "show me the source for footnote ^N" — this is the queryable join.

## (f) Cert-vs-telemetry method confabulation check (added 2026-05-12 from PR #130)

PR #130 surfaced that the verifier model can write a certificate claiming tool-based verification methods (e.g., `fetch_document`, `exa_web_search`) that were never actually invoked. This check compares the cert's method-column claims against the authoritative `subagent_tool_usage` telemetry from the SubagentStop hook.

A row where `claimed > actual` indicates **method confabulation** — the cert attributes verifications to tools that didn't fire. This is a regulator-facing data-integrity risk (EU AI Act Art. 13 transparency: the audit trail must reflect what actually happened).

<!-- noqa:07 — schema-doc-validator parses string literals inside length('X')/replace() as column refs; false positive -->
<!-- noqa:05 -->
<!-- noqa:04 -->
```sql
-- Cert-claims vs telemetry-counts mismatch detector.
-- Run for any session that ran citation-websearch-verifier (any mode).
-- Most relevant in deep mode where tool-invocation is expected for most footnotes.
WITH telemetry AS (
SELECT
h.session_id,
(h.event_data->'tool_counts'->>'exaWebSearches')::int AS actual_exa,
(h.event_data->'tool_counts'->>'fetchDocumentCalls')::int AS actual_fetch,
(h.event_data->'tool_counts'->>'mcpCalls')::int AS actual_mcp,
(h.event_data->'tool_counts'->>'totalToolCalls')::int AS total_calls,
h.created_at
FROM hook_audit_log h
WHERE h.session_id = $1
AND h.event_type = 'SubagentStop'
AND h.agent_type = 'citation-websearch-verifier'
AND h.event_data ? 'tool_counts'
ORDER BY h.created_at DESC
LIMIT 1
),
cert_claims AS (
SELECT
r.session_id,
(length(r.content) - length(replace(r.content, 'fetch_document', '')))
/ length('fetch_document') AS claimed_fetch,
(length(r.content) - length(replace(r.content, 'exa_web_search', '')))
/ length('exa_web_search') AS claimed_exa,
(length(r.content) - length(replace(r.content, 'lookup_citation', '')))
/ length('lookup_citation')
+ (length(r.content) - length(replace(r.content, 'search_sec_filings', '')))
/ length('search_sec_filings') AS claimed_mcp,
r.word_count AS cert_word_count
FROM reports r
WHERE r.session_id = $1
AND r.report_type = 'qa'
AND r.report_key = 'citation-verification-certificate'
)
SELECT
c.claimed_fetch, t.actual_fetch, (c.claimed_fetch - t.actual_fetch) AS fetch_gap,
c.claimed_exa, t.actual_exa, (c.claimed_exa - t.actual_exa) AS exa_gap,
c.claimed_mcp, t.actual_mcp, (c.claimed_mcp - t.actual_mcp) AS mcp_gap,
t.total_calls,
c.cert_word_count,
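CASE
-- No telemetry row (LEFT JOIN miss): every gap is NULL, so the check cannot be evaluated.
WHEN t.total_calls IS NULL THEN 'INCONCLUSIVE'
WHEN (c.claimed_fetch - t.actual_fetch) > 2
OR (c.claimed_exa - t.actual_exa) > 2
OR (c.claimed_mcp - t.actual_mcp) > 2
THEN 'CONFABULATION_SUSPECTED'
ELSE 'OK'
END AS verdict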
FROM cert_claims c -- noqa: 04 — CTE alias, not a real table
LEFT JOIN telemetry t ON t.session_id = c.session_id; -- noqa: 04
```

**Interpretation:**

| Result | Meaning |
|---|---|
| `verdict = 'OK'`, all gaps ≤ 2 | Cert claims match telemetry within counting noise (method name appearing in legend/header sections). No confabulation. |
| `verdict = 'CONFABULATION_SUSPECTED'`, fetch_gap > 2 | Cert attributes more `fetch_document` verifications than actually fired. **Investigate.** Likely model confabulated to fill the cert's method-column format. |
| `total_calls IS NULL` | `subagent_tool_usage` hook didn't fire (pre-T2 image, or session pre-dates SubagentStop hook telemetry capture). Cannot validate; mark inconclusive. |
| `claimed_fetch = 0, actual_fetch > 0` | Cert doesn't claim any fetch_document usage, but tools were called. May indicate tool failure handling — tool calls were made but cert decided not to attribute (e.g., all returned errors). Worth investigating separately. |
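
The table above can be read as a small decision procedure. A minimal sketch, assuming a result row shaped like the query output; `classifyRow`, the `INCONCLUSIVE` label, and the `unattributedFetch` field are hypothetical names chosen for illustration, not existing project code.

```javascript
// Sketch of the interpretation table as code; names are illustrative.
function classifyRow(row, threshold = 2) {
  // Missing telemetry: gaps cannot be computed, so the check is inconclusive.
  if (row.total_calls == null) {
    return { verdict: "INCONCLUSIVE", unattributedFetch: false };
  }
  const gaps = [
    row.claimed_fetch - row.actual_fetch,
    row.claimed_exa - row.actual_exa,
    row.claimed_mcp - row.actual_mcp,
  ];
  const verdict = gaps.some((gap) => gap > threshold)
    ? "CONFABULATION_SUSPECTED"
    : "OK";
  // Last table row: tools fired but the cert attributes nothing to them.
  const unattributedFetch = row.claimed_fetch === 0 && row.actual_fetch > 0;
  return { verdict, unattributedFetch };
}

// Illustrative row only.
console.log(
  classifyRow({
    claimed_fetch: 17, actual_fetch: 0,
    claimed_exa: 4, actual_exa: 3,
    claimed_mcp: 0, actual_mcp: 4,
    total_calls: 7,
  })
);
```

Negative gaps (more real calls than claims) never trigger the verdict; they only surface through the separate `unattributedFetch` hint.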

### Forensic output rendering (added to Section 11 of diagnostic report)

When generating session diagnostics for any deep-mode session OR any session where `confabulation_check.verdict = 'CONFABULATION_SUSPECTED'`, include this block:

```
### 11.6 Cert-vs-Telemetry Confabulation Audit

Verdict: CONFABULATION_SUSPECTED (or OK)

Method | Cert claims | Actual telemetry | Gap
---------- | ----------- | ---------------- | ---
fetch_doc | 17 | 0 | 17 ⚠
exa_search | 4 | 3 | 1
mcp | 0 | 4 | -4

Interpretation: cert attributes 17 fetch_document verifications, but subagent_tool_usage hook
recorded zero such invocations. The verifier model wrote method labels matching the expected
cert format without actually invoking the tools. This is regulator-facing data-integrity risk
— escalate to dev team for prompt-hardening review.
```
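
One way the table portion of this block could be rendered from the check's counts is sketched below. `renderAudit` is a hypothetical helper, the row values are illustrative, and the free-text interpretation paragraph is left to the report generator.

```javascript
// Sketch: render the 11.6 header and table from per-method counts.
// rows: [{ method, claimed, actual }]; gaps above `threshold` get a warning mark.
function renderAudit(verdict, rows, threshold = 2) {
  const header = ["Method", "Cert claims", "Actual telemetry", "Gap"];
  const body = rows.map(({ method, claimed, actual }) => {
    const gap = claimed - actual;
    return [
      method,
      String(claimed),
      String(actual),
      gap > threshold ? `${gap} ⚠` : String(gap),
    ];
  });
  // Each column is as wide as its widest cell (header included).
  const widths = header.map((title, i) =>
    Math.max(title.length, ...body.map((cells) => cells[i].length))
  );
  const line = (cells) =>
    cells.map((cell, i) => cell.padEnd(widths[i])).join(" | ").trimEnd();
  return [
    "### 11.6 Cert-vs-Telemetry Confabulation Audit",
    "",
    `Verdict: ${verdict}`,
    "",
    line(header),
    line(widths.map((width) => "-".repeat(width))),
    ...body.map(line),
  ].join("\n");
}

console.log(
  renderAudit("CONFABULATION_SUSPECTED", [
    { method: "fetch_doc", claimed: 17, actual: 0 },
    { method: "exa_search", claimed: 4, actual: 3 },
    { method: "mcp", claimed: 0, actual: 4 },
  ])
);
```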

This is the operator-facing manifestation of the P1 finding from PR #130.

## Output format

In the session-diagnostics report (Section 11), produce:
27 changes: 26 additions & 1 deletion super-legal-mcp-refactored/docs/feature-flags.md
@@ -575,11 +575,36 @@ Flags deeper in the tree have no effect when their parent is OFF. For example, `
- Duration: 1-5 min
- Agent only confirms sources exist (HTTP 200/401/403 = confirmed) without evaluating content

**Cost differential: 338x** between modes. Source Existence mode is the recommended starting point for initial G5 rollout.
**Cost differential: 338x** between modes (per agent-file estimate). **Measured 4.4x** on 65-footnote test (PR [#130](https://github.com/Number531/Legal-API/pull/130)) — actual ratio dominated by cache-read cost (3x flat between models) rather than work multiplier. Source Existence mode is the recommended starting point for initial G5 rollout.

#### Production readiness status (2026-05-12)

| Mode | Validation | Status |
|---|---|---|
| **Existence** (`false`, default) | PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119) — production-fidelity A/B on unlabeled 467-footnote Project Nexus fixture | ✅ **Production-validated** at 96.8% (Exa arm) / 96.1% (Anthropic arm), both PASS gate |
| **Deep** (`true`) | PR [#130](https://github.com/Number531/Legal-API/pull/130) — Sonnet-vs-Haiku A/B on **labeled** 65-footnote "A/B SUBSET" fixture | ⚠️ **NOT production-validated.** Sonnet-deep mechanically functions (gate checks pass, 96.7% confirmation rate) but tool-invocation rigor was lower than expected (12 real tool calls for 65 footnotes; 42 confirmations used "structural" / "reporter knowledge" patterns). Fixture's `# HAIKU/SONNET DEEP-MODE A/B SUBSET` header may have signaled "test environment" and biased model behavior toward shortcutting. Haiku-deep confabulated entirely (zero real verification tool calls; cert claimed `fetch_document` / `exa_web_search` methods 17 times — see PR #130 for forensic detail). |

#### Pre-flip checklist (before setting `CITATION_DEEP_VERIFICATION=true` in production)

Required validation steps — do NOT enable deep mode without completing these:

1. **Re-run the PR #130 harness against the unlabeled production fixture** (Project Nexus 393-footnote `reports/2026-03-07-1772900028/consolidated-footnotes.md`, NOT the labeled "A/B SUBSET" sample). Estimated cost: ~$15 (Sonnet-deep × 393 footnotes prorated). Time: ~30 min.
- Use `test/sdk/citation-verifier-model-ab-driver.mjs` with `--arms sonnet`
- Override the fixture path or use a clean unlabeled copy
2. **Verify tool-invocation rate matches prompt expectation.** The verifier prompt instructs "10-15 `fetch_document` calls per turn" — confirm `subagent_tool_usage.tool_counts` reflects real invocation, not pattern-knowledge shortcutting.
3. **Check cert↔telemetry method alignment.** Cross-reference cert method-column claims against `subagent_tool_usage` event counts. Discrepancies = confabulation risk. See `.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md` § "Detecting cert confabulation" for the query.
4. **Recalibrate alert thresholds.** Existing `CitationVerifierConfirmationRateLow` / `Critical` alerts in `prometheus/alerts.yml` filter by `{mode="source_existence"}`. Deep mode runs would be silently un-alerted. Either:
- Clone the alert rules with `{mode="full_content"}` filter at thresholds calibrated against the deep-mode baseline measured in step 1, OR
- Generalize the existing rules to fire on any mode
5. **Cost monitoring.** Deep mode at ~$6.76/memo × N memos/month is materially different from existence mode at ~$0.02/memo. Confirm cost dashboards trend this before enabling.
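
To make the cost step concrete, the per-memo figures quoted in step 5 can be projected to a monthly total. This is a back-of-envelope sketch: `monthlyCost` is a hypothetical helper, the volume of 100 memos/month is an arbitrary illustration, and the per-memo inputs are the estimates from this document, not measured billing data.

```javascript
// Per-memo estimates quoted in step 5 (USD); estimates, not measured spend.
const COST_PER_MEMO = { deep: 6.76, existence: 0.02 };

// Project monthly cost for each mode and the ratio between the estimates.
function monthlyCost(memosPerMonth) {
  return {
    deep: (COST_PER_MEMO.deep * memosPerMonth).toFixed(2),
    existence: (COST_PER_MEMO.existence * memosPerMonth).toFixed(2),
    ratio: Math.round(COST_PER_MEMO.deep / COST_PER_MEMO.existence),
  };
}

// 100 memos/month is a made-up volume for illustration.
console.log(monthlyCost(100));
```

The ratio of the two estimates is where the 338x figure comes from; a dashboard comparison should use measured per-session cost once it is available.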

**Rollback path.** If deep mode is enabled and the rigor concern materializes (cert confabulation detected, or unexpected cost spike), setting `CITATION_DEEP_VERIFICATION=false` in `flags.env` reverts to existence mode with no schema or code change needed. The change takes effect on the next session, when the verifier subagent re-resolves model + strategy at module load.

**Files:**
- `src/config/legalSubagents/agents/citation-websearch-verifier.js` — lines 19-334 (model selection, strategy selection, duration estimates)
- `test/sdk/citation-websearch-verifier.test.js` — dual-mode tests
- `test/sdk/citation-verifier-model-ab-driver.mjs` — deep-mode A/B harness (PR #130)
- `docs/runbooks/citation-verifier-model-ab-2026-05-12-CORRECTED.md` — PR #130 final report with full forensic detail

---
