diff --git a/.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md b/.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md
index e58591f1d..9e4351257 100644
--- a/.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md
+++ b/.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md
@@ -91,6 +91,104 @@ LIMIT 10;

 Expected: `divergence` is 0 or within ±2 for every row. Larger = investigate.

+## Detecting cert confabulation (added 2026-05-12 from PR #130 findings)
+
+PR #130 surfaced that the verifier model can write a certificate claiming verification methods (e.g., `fetch_document`, `exa_web_search`) that were never actually invoked at the tool level. Haiku in deep mode did this completely (0 tool calls, 17 method-label confabulations); Sonnet partially (12 tool calls but 42 "structural" / "reporter knowledge" pattern confirmations).
+
+The `subagent_tool_usage` hook event already counts real tool invocations per category. A cert that claims more tool-based confirmations than telemetry recorded is **confabulating** — a regulator-facing data-integrity risk.
+
+### Cross-check query — telemetry vs cert claims (per session)
+
+
+
+```sql
+-- Compares claimed cert methods against actual telemetry counts.
+-- Run for any session that ran citation-websearch-verifier in deep mode.
+--
+-- A row where claimed_X > actual_X indicates the verifier wrote method
+-- attributions in the cert that the tool-call telemetry doesn't support.
+WITH telemetry AS (
+  SELECT
+    s.session_key,
+    s.id AS session_id,
+    -- Pull tool_counts from the subagent_tool_usage event in hook_audit_log
+    -- (logged at SubagentStop with cumulative per-subagent counts)
+    (h.event_data->'tool_counts'->>'exaWebSearches')::int AS actual_exa_searches,
+    (h.event_data->'tool_counts'->>'fetchDocumentCalls')::int AS actual_fetch_docs,
+    (h.event_data->'tool_counts'->>'mcpCalls')::int AS actual_mcp_calls,
+    (h.event_data->'tool_counts'->>'totalToolCalls')::int AS total_tool_calls
+  FROM sessions s
+  JOIN hook_audit_log h ON h.session_id = s.id
+  WHERE h.event_type = 'SubagentStop'
+    AND h.agent_type = 'citation-websearch-verifier'
+    AND h.event_data ? 'tool_counts'
+),
+cert_claims AS (
+  -- Count method-column appearances in the cert text. Crude but effective:
+  -- substring-counts of method-name tokens in reports.content.
+  SELECT
+    r.session_id,
+    -- Each method-name appearance roughly = one claimed verification
+    (length(r.content) - length(replace(r.content, 'fetch_document', '')))
+      / length('fetch_document') AS claimed_fetch_docs,
+    (length(r.content) - length(replace(r.content, 'exa_web_search', '')))
+      / length('exa_web_search') AS claimed_exa_searches,
+    (length(r.content) - length(replace(r.content, 'lookup_citation', '')))
+      / length('lookup_citation') AS claimed_lookup_citation,
+    (length(r.content) - length(replace(r.content, 'search_sec_filings', '')))
+      / length('search_sec_filings') AS claimed_search_sec
+  FROM reports r
+  WHERE r.report_type = 'qa'
+    AND r.report_key = 'citation-verification-certificate'
+)
+SELECT
+  t.session_key,
+  t.actual_fetch_docs, c.claimed_fetch_docs,
+  t.actual_exa_searches, c.claimed_exa_searches,
+  t.actual_mcp_calls, c.claimed_lookup_citation + c.claimed_search_sec AS claimed_mcp_total,
+  -- Confabulation flag: claimed > actual
+  CASE
+    WHEN c.claimed_fetch_docs > t.actual_fetch_docs + 1 THEN 'fetch_document'
+    WHEN c.claimed_exa_searches > t.actual_exa_searches + 1 THEN
+      'exa_web_search'
+    WHEN c.claimed_lookup_citation + c.claimed_search_sec > t.actual_mcp_calls + 1 THEN 'mcp'
+    ELSE NULL
+  END AS confabulation_method
+FROM telemetry t
+JOIN cert_claims c ON c.session_id = t.session_id
+WHERE t.total_tool_calls IS NOT NULL
+ORDER BY t.session_key DESC
+LIMIT 20;
+```
+
+**Interpretation:**
+- `confabulation_method IS NULL` → cert claims match telemetry (good)
+- `confabulation_method = 'fetch_document'` etc. → cert claims more method-X invocations than telemetry recorded. **Investigate.** The +1 tolerance handles minor counting noise (method name appearing in legend/header).
+
+### Tier-3 health check addition
+
+Add to the `infrastructure-health --tier 3` sweep when `CITATION_DEEP_VERIFICATION=true` is observed in `/health.feature_flags`:
+
+```bash
+# Run cross-check query against last 24h of deep-mode sessions
+# The empty parentheses are a placeholder for the cross-check query above.
+psql -d super_legal -c "$(cat <<'SQL'
+SELECT session_key, confabulation_method, actual_fetch_docs, claimed_fetch_docs
+FROM () AS audit
+WHERE confabulation_method IS NOT NULL
+  AND created_at > NOW() - INTERVAL '24 hours';
+SQL
+)"
+```
+
+If query returns rows → WARNING (deep mode is confabulating; escalate). If empty → PASSED.
+
+### Proposed Prometheus alert (future work)
+
+Not yet wired — `CitationVerifierMethodConfabulation` would fire when cert claims diverge from `subagent_tool_usage` telemetry. Requires either:
+- DB-query-backed alert (Prometheus doesn't natively query Postgres; would need an exporter), OR
+- Hook-side computation: at SubagentStop, parse the cert, compare to telemetry, emit `citation_verifier_confabulation_total{method}` counter
+
+Tracked as P1 follow-up from PR #130; ~10-min implementation in `hookDBBridge.persistState()`.
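The hook-side option can be sketched in JavaScript as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `countToken` and `detectConfabulation` are hypothetical names, the cert is assumed to be available as a plain string, and `toolCounts` is assumed to have the same shape as the `tool_counts` object read by the cross-check query. The counting mirrors the query's substring approach and its +1 tolerance.

```javascript
// Hypothetical sketch: helper names are illustrative, not part of the codebase.
const TOLERANCE = 1; // mirrors the +1 tolerance used in the cross-check query

// Count non-overlapping occurrences of a token, equivalent to the SQL
// length()/replace() trick used in the cert_claims CTE.
function countToken(text, token) {
  return text.split(token).length - 1;
}

// Returns the list of methods whose claimed count exceeds telemetry by more
// than TOLERANCE. `toolCounts` is assumed to look like event_data.tool_counts.
function detectConfabulation(certText, toolCounts) {
  const claimed = {
    fetch_document: countToken(certText, 'fetch_document'),
    exa_web_search: countToken(certText, 'exa_web_search'),
    mcp:
      countToken(certText, 'lookup_citation') +
      countToken(certText, 'search_sec_filings'),
  };
  const actual = {
    fetch_document: toolCounts.fetchDocumentCalls ?? 0,
    exa_web_search: toolCounts.exaWebSearches ?? 0,
    mcp: toolCounts.mcpCalls ?? 0,
  };
  return Object.keys(claimed).filter(
    (method) => claimed[method] > actual[method] + TOLERANCE,
  );
}
```

Each returned method name would correspond to one increment of `citation_verifier_confabulation_total{method}`; how that counter is registered and exported is left open here.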
+
 ## Alert response runbook

 ### `CitationVerifierConfirmationRateLow` (WARNING, <90% 1h)
diff --git a/.claude/skills/session-diagnostics/references/citation-verifier-forensics.md b/.claude/skills/session-diagnostics/references/citation-verifier-forensics.md
index 19708b9e2..3080e690c 100644
--- a/.claude/skills/session-diagnostics/references/citation-verifier-forensics.md
+++ b/.claude/skills/session-diagnostics/references/citation-verifier-forensics.md
@@ -110,6 +110,101 @@ LIMIT 20;

 Useful when a regulator question is "show me the source for footnote ^N" — this is the queryable join.

+## (f) Cert-vs-telemetry method confabulation check (added 2026-05-12 from PR #130)
+
+PR #130 surfaced that the verifier model can write a certificate claiming tool-based verification methods (e.g., `fetch_document`, `exa_web_search`) that were never actually invoked. This check compares the cert's method-column claims against the authoritative `subagent_tool_usage` telemetry from the SubagentStop hook.
+
+A row where `claimed > actual` indicates **method confabulation** — the cert attributes verifications to tools that didn't fire. This is a regulator-facing data-integrity risk (EU AI Act Art. 13 transparency: the audit trail must reflect what actually happened).
+
+
+
+
+```sql
+-- Cert-claims vs telemetry-counts mismatch detector.
+-- Run for any session that ran citation-websearch-verifier (any mode).
+-- Most relevant in deep mode where tool-invocation is expected for most footnotes.
+WITH telemetry AS (
+  SELECT
+    h.session_id,
+    (h.event_data->'tool_counts'->>'exaWebSearches')::int AS actual_exa,
+    (h.event_data->'tool_counts'->>'fetchDocumentCalls')::int AS actual_fetch,
+    (h.event_data->'tool_counts'->>'mcpCalls')::int AS actual_mcp,
+    (h.event_data->'tool_counts'->>'totalToolCalls')::int AS total_calls,
+    h.created_at
+  FROM hook_audit_log h
+  WHERE h.session_id = $1
+    AND h.event_type = 'SubagentStop'
+    AND h.agent_type = 'citation-websearch-verifier'
+    AND h.event_data ? 'tool_counts'
+  ORDER BY h.created_at DESC
+  LIMIT 1
+),
+cert_claims AS (
+  SELECT
+    r.session_id,
+    (length(r.content) - length(replace(r.content, 'fetch_document', '')))
+      / length('fetch_document') AS claimed_fetch,
+    (length(r.content) - length(replace(r.content, 'exa_web_search', '')))
+      / length('exa_web_search') AS claimed_exa,
+    (length(r.content) - length(replace(r.content, 'lookup_citation', '')))
+      / length('lookup_citation')
+    + (length(r.content) - length(replace(r.content, 'search_sec_filings', '')))
+      / length('search_sec_filings') AS claimed_mcp,
+    r.word_count AS cert_word_count
+  FROM reports r
+  WHERE r.session_id = $1
+    AND r.report_type = 'qa'
+    AND r.report_key = 'citation-verification-certificate'
+)
+SELECT
+  c.claimed_fetch, t.actual_fetch, (c.claimed_fetch - t.actual_fetch) AS fetch_gap,
+  c.claimed_exa, t.actual_exa, (c.claimed_exa - t.actual_exa) AS exa_gap,
+  c.claimed_mcp, t.actual_mcp, (c.claimed_mcp - t.actual_mcp) AS mcp_gap,
+  t.total_calls,
+  c.cert_word_count,
+  CASE
+    -- No telemetry row: gaps are NULL and would otherwise fall through to 'OK'.
+    WHEN t.total_calls IS NULL THEN 'INCONCLUSIVE'
+    WHEN (c.claimed_fetch - t.actual_fetch) > 2
+      OR (c.claimed_exa - t.actual_exa) > 2
+      OR (c.claimed_mcp - t.actual_mcp) > 2
+    THEN 'CONFABULATION_SUSPECTED'
+    ELSE 'OK'
+  END AS verdict
+FROM cert_claims c  -- noqa: 04 — CTE alias, not a real table
+LEFT JOIN telemetry t ON t.session_id = c.session_id;  -- noqa: 04
+```
+
+**Interpretation:**
+
+| Result | Meaning |
+|---|---|
+| `verdict = 'OK'`, all gaps ≤ 2 | Cert claims match telemetry within counting noise (method
name appearing in legend/header sections). No confabulation. |
+| `verdict = 'CONFABULATION_SUSPECTED'`, fetch_gap > 2 | Cert attributes more `fetch_document` verifications than actually fired. **Investigate.** Likely model confabulated to fill the cert's method-column format. |
+| `total_calls IS NULL` | `subagent_tool_usage` hook didn't fire (pre-T2 image, or session pre-dates SubagentStop hook telemetry capture). Cannot validate; mark inconclusive. |
+| `claimed_fetch = 0, actual_fetch > 0` | Cert doesn't claim any fetch_document usage, but tools were called. May indicate tool failure handling — tool calls were made but cert decided not to attribute (e.g., all returned errors). Worth investigating separately. |
+
+### Forensic output rendering (added to Section 11 of diagnostic report)
+
+When generating session diagnostics for any deep-mode session OR any session where `confabulation_check.verdict = 'CONFABULATION_SUSPECTED'`, include this block:
+
+```
+### 11.6 Cert-vs-Telemetry Confabulation Audit
+
+Verdict: CONFABULATION_SUSPECTED (or OK)
+
+Method     | Cert claims | Actual telemetry | Gap
+---------- | ----------- | ---------------- | ---
+fetch_doc  | 17          | 0                | 17 ⚠
+exa_search | 4           | 3                | 1
+mcp        | 0           | 4                | -4
+
+Interpretation: cert attributes 17 fetch_document verifications, but subagent_tool_usage hook
+recorded zero such invocations. The verifier model wrote method labels matching the expected
+cert format without actually invoking the tools. This is a regulator-facing data-integrity risk
+— escalate to dev team for prompt-hardening review.
+```
+
+This is the operator-facing manifestation of the P1 finding from PR #130.
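The verdict logic behind this block can be sketched in JavaScript as below. It is an illustration under stated assumptions, not the diagnostic tool's actual code: `buildAudit` is a hypothetical helper, the gap threshold of 2 is taken from the query above, and `INCONCLUSIVE` is an assumed label for the "cannot validate" case in the interpretation table.

```javascript
// Hypothetical sketch: buildAudit is an illustrative name, not an existing function.
const GAP_THRESHOLD = 2; // same threshold the SQL check uses for each gap

// `claimed` maps method name -> count found in the cert text.
// `actual` maps method name -> telemetry count, or is null when no
// subagent_tool_usage event was captured for the session.
function buildAudit(claimed, actual) {
  if (actual == null) {
    // Assumed label for the "cannot validate" case.
    return { verdict: 'INCONCLUSIVE', rows: [] };
  }
  const rows = Object.keys(claimed).map((method) => {
    const actualCount = actual[method] ?? 0;
    const gap = claimed[method] - actualCount;
    return {
      method,
      claimed: claimed[method],
      actual: actualCount,
      gap,
      flagged: gap > GAP_THRESHOLD,
    };
  });
  const verdict = rows.some((row) => row.flagged)
    ? 'CONFABULATION_SUSPECTED'
    : 'OK';
  return { verdict, rows };
}

// Numbers taken from the sample 11.6 block above.
const audit = buildAudit(
  { fetch_doc: 17, exa_search: 4, mcp: 0 },
  { fetch_doc: 0, exa_search: 3, mcp: 4 },
);
console.log(audit.verdict); // CONFABULATION_SUSPECTED
```

A negative gap, such as the `mcp` row, never trips the flag on its own; it corresponds to the "tools were called but not attributed" case, which the interpretation table treats as a separate investigation.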
+
 ## Output format

 In the session-diagnostics report (Section 11), produce:
diff --git a/super-legal-mcp-refactored/docs/feature-flags.md b/super-legal-mcp-refactored/docs/feature-flags.md
index 96f161cda..93c765120 100644
--- a/super-legal-mcp-refactored/docs/feature-flags.md
+++ b/super-legal-mcp-refactored/docs/feature-flags.md
@@ -575,11 +575,36 @@ Flags deeper in the tree have no effect when their parent is OFF. For example, `
 - Duration: 1-5 min
 - Agent only confirms sources exist (HTTP 200/401/403 = confirmed) without evaluating content

-**Cost differential: 338x** between modes. Source Existence mode is the recommended starting point for initial G5 rollout.
+**Cost differential: 338x** between modes (per agent-file estimate). **Measured 4.4x** on 65-footnote test (PR [#130](https://github.com/Number531/Legal-API/pull/130)) — actual ratio dominated by cache-read cost (3x flat between models) rather than work multiplier. Source Existence mode is the recommended starting point for initial G5 rollout.
+
+#### Production readiness status (2026-05-12)
+
+| Mode | Validation | Status |
+|---|---|---|
+| **Existence** (`false`, default) | PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119) — production-fidelity A/B on unlabeled 467-footnote Project Nexus fixture | ✅ **Production-validated** at 96.8% (Exa arm) / 96.1% (Anthropic arm), both PASS gate |
+| **Deep** (`true`) | PR [#130](https://github.com/Number531/Legal-API/pull/130) — Sonnet-vs-Haiku A/B on **labeled** 65-footnote "A/B SUBSET" fixture | ⚠️ **NOT production-validated.** Sonnet-deep mechanically functions (gate checks pass, 96.7% confirmation rate) but tool-invocation rigor was lower than expected (12 real tool calls for 65 footnotes; 42 confirmations used "structural" / "reporter knowledge" patterns).
Fixture's `# HAIKU/SONNET DEEP-MODE A/B SUBSET` header may have signaled "test environment" and biased model behavior toward shortcutting. Haiku-deep confabulated entirely (zero real verification tool calls; cert claimed `fetch_document` / `exa_web_search` methods 17 times — see PR #130 for forensic detail). |
+
+#### Pre-flip checklist (before setting `CITATION_DEEP_VERIFICATION=true` in production)
+
+Required validation steps — do NOT enable deep mode without completing these:
+
+1. **Re-run the PR #130 harness against the unlabeled production fixture** (Project Nexus 393-footnote `reports/2026-03-07-1772900028/consolidated-footnotes.md`, NOT the labeled "A/B SUBSET" sample). Estimated cost: ~$15 (Sonnet-deep × 393 footnotes prorated). Time: ~30 min.
+   - Use `test/sdk/citation-verifier-model-ab-driver.mjs` with `--arms sonnet`
+   - Override the fixture path or use a clean unlabeled copy
+2. **Verify tool-invocation rate matches prompt expectation.** The verifier prompt instructs "10-15 `fetch_document` calls per turn" — confirm `subagent_tool_usage.tool_counts` reflects real invocation, not pattern-knowledge shortcutting.
+3. **Check cert↔telemetry method alignment.** Cross-reference cert method-column claims against `subagent_tool_usage` event counts. Discrepancies = confabulation risk. See `.claude/skills/infrastructure-health/references/citation-verifier-telemetry.md` § "Detecting cert confabulation" for the query.
+4. **Recalibrate alert thresholds.** Existing `CitationVerifierConfirmationRateLow` / `Critical` alerts in `prometheus/alerts.yml` filter by `{mode="source_existence"}`. Deep mode runs would be silently un-alerted. Either:
+   - Clone the alert rules with `{mode="full_content"}` filter at thresholds calibrated against the deep-mode baseline measured in step 1, OR
+   - Generalize the existing rules to fire on any mode
+5. **Cost monitoring.** Deep mode at ~$6.76/memo × N memos/month is materially different from existence mode at ~$0.02/memo.
Confirm that cost dashboards track this spend before enabling.
+
+**Rollback path.** If deep mode is enabled and the rigor concern materializes (cert confabulation detected, or unexpected cost spike), `CITATION_DEEP_VERIFICATION=false` in `flags.env` instantly reverts to existence mode with no schema or code change needed. The verifier subagent re-resolves model + strategy at module load on next session.

 **Files:**
 - `src/config/legalSubagents/agents/citation-websearch-verifier.js` — lines 19-334 (model selection, strategy selection, duration estimates)
 - `test/sdk/citation-websearch-verifier.test.js` — dual-mode tests
+- `test/sdk/citation-verifier-model-ab-driver.mjs` — deep-mode A/B harness (PR #130)
+- `docs/runbooks/citation-verifier-model-ab-2026-05-12-CORRECTED.md` — PR #130 final report with full forensic detail

 ---