28 changes: 28 additions & 0 deletions super-legal-mcp-refactored/CHANGELOG.md
@@ -4,6 +4,34 @@ All notable changes to the Super Legal MCP Server are documented in this file.

## [Unreleased]

### Added — G5 citation-verifier observability T1+T2 (v6.8.6, v6.8.7, PRs [#122](https://github.com/Number531/Legal-API/pull/122) + this PR)

Two-tier observability remediation closing the regulator-facing gap (T1) and ops/SLO gap (T2) on the G5 citation-verifier subagent. Validated against the just-shipped production-fidelity A/B baseline (Exa 96.8% / Anthropic 96.1%, 2026-05-12).

#### v6.8.6 T1 — Regulator persistence (PR [#122](https://github.com/Number531/Legal-API/pull/122))

- **`citation_verdicts` table** (dual-path: migration `015_*.sql` + `CITATION_VERDICTS_DDL` in postgres.js + `ensureHookSchema()` call). Junction table FK ON DELETE CASCADE to reports + sessions; UNIQUE (report_id, footnote_id) for idempotent upsert; 3 indexes.
- **`certificateParser.js`** promoted from `test/sdk/_lib/` to `src/utils/`. Test harness still imports from `_lib/` (PR #119 fixtures preserved).
- **Fire-and-forget persistReport hook** — when `reportType==='qa' && reportKey==='citation-verification-certificate'`, parses the certificate and writes per-footnote verdicts via a batch INSERT (a single round-trip for ~500 footnotes). Mirrors the Wave 2 `citation_source_links` pattern.
- **Audit endpoint extension** — `/api/session/:sessionKey/audit-report` returns `citation_verification_certificate` (full markdown + summary stats: confirmation rate, confirmed/unconfirmed/error/skip/pass_with_note/paywalled counts) and `citation_verdicts` (per-footnote array). `report_version` 1.0 → 1.1. Access logged to `access_log`.
- **WORM bundle inclusion** — `client-audit-export` ships `citation_verdicts__csv.gz` + `citation_verification_certificate__csv.gz` in regulator-handoff (session-scoped + range mode).
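
The idempotent batch write described in the bullets above can be sketched as a single multi-row statement. This is a minimal illustration, not the shipped code: the helper name `buildVerdictUpsert` and the `verdict` / `source_url` columns are assumptions; only `report_id`, `footnote_id`, and the `UNIQUE (report_id, footnote_id)` constraint come from this changelog.

```javascript
// Sketch only: `verdict` and `source_url` are hypothetical column names.
// Builds one parameterized multi-row INSERT so N footnotes cost one round-trip.
function buildVerdictUpsert(reportId, verdicts) {
  if (verdicts.length === 0) return null; // nothing to write; avoid invalid SQL
  const cols = ['report_id', 'footnote_id', 'verdict', 'source_url'];
  const params = [];
  const tuples = verdicts.map((v, i) => {
    const base = i * cols.length;
    params.push(reportId, v.footnoteId, v.verdict, v.sourceUrl ?? null);
    // Placeholders are numbered consecutively across all rows: $1..$4, $5..$8, ...
    return `(${cols.map((_, j) => `$${base + j + 1}`).join(', ')})`;
  });
  const text =
    `INSERT INTO citation_verdicts (${cols.join(', ')}) VALUES ${tuples.join(', ')} ` +
    'ON CONFLICT (report_id, footnote_id) DO UPDATE SET ' +
    'verdict = EXCLUDED.verdict, source_url = EXCLUDED.source_url';
  return { text, params };
}

const upsert = buildVerdictUpsert(42, [
  { footnoteId: 1, verdict: 'CONFIRMED', sourceUrl: 'https://example.com/a' },
  { footnoteId: 2, verdict: 'UNCONFIRMED' },
]);
console.log(upsert.text);
console.log(upsert.params.length); // → 8
```

Because the conflict target matches the unique constraint, re-running the hook for the same report overwrites rows instead of duplicating them, which is what makes a fire-and-forget retry safe.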

#### v6.8.7 T2 — Telemetry + alerts (this PR)

- **4 Prometheus metrics** in sdkMetrics.js: `citation_verifier_confirmation_rate_pct` (Gauge, `mode` label), `citation_verifier_confirmed_total` + `citation_verifier_unconfirmed_total` (Counters, `mode` label), `citation_verifier_errors_total` (Counter, `reason` label). At most 11 series total (3 `mode`-labelled metrics × 2 modes + 5 reasons); bounded enums prevent explosion.
- **Recording site** — `hookDBBridge.persistState()` immediately after JSON.parse of state file, before agent_states INSERT. Source: `state_data.verification_results` (in-hand; no race with T1's fire-and-forget verdict INSERT).
- **Structured log emission** — `logInfo('citation_verifier_completed', {...})` with full counts, mode, duration_ms, turns_used, tool-call counts.
- **3 alert rules** in `prometheus/alerts.yml`: `CitationVerifierConfirmationRateLow` (rate<90% sustained 1h, WARN), `CitationVerifierConfirmationRateCritical` (rate<80% sustained 30m, CRIT), `CitationVerifierErrorSpike` (>50 errors in 15m, WARN).
- **Documentation** — new §9.2 in `docs/metrics-catalog.md` with full metric inventory, mode-label semantics, cardinality budget, baseline values, alert thresholds.

#### Bundled fix (T2 PR)

- **`access_log` SELECT column bug** (pre-existing) — the audit-report endpoint queried non-existent columns `actor`/`action`/`accessed_at`; corrected to the real columns from ACCESS_LOG_DDL (`requester`/`purpose_code`/`created_at` etc.). Previously the `.catch(() => ({ rows: [] }))` silently swallowed the error, so `access_log` has been empty in audit reports since Wave 3 shipped. The fix lets T1's new INSERTs actually show up in regulator bundles.
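
The failure mode here (a fallback that hides a broken query) can be avoided by logging before degrading. The sketch below is an assumption-laden illustration, not code from this PR: `queryOrEmpty` and `fakePool` are hypothetical names, and `pool` is assumed to be any object exposing an async `query(text, params)` method.

```javascript
// Sketch: degrade to empty rows on failure, but surface the error first.
// `warn` is injectable so the behaviour can be observed in tests.
async function queryOrEmpty(pool, text, params, label, warn = console.warn) {
  try {
    return await pool.query(text, params);
  } catch (err) {
    warn(`[audit-report] ${label} query failed: ${err.message}`);
    return { rows: [] };
  }
}

// Stand-in pool that fails the way a wrong column name would.
const fakePool = {
  async query() {
    throw new Error('column "actor" does not exist');
  },
};

queryOrEmpty(fakePool, 'SELECT actor FROM access_log WHERE session_id = $1', [1], 'access_log')
  .then(({ rows }) => console.log(rows.length)); // → 0, after one warning is logged
```

The endpoint still returns a usable response, but a schema mismatch now leaves a trace instead of an indefinitely empty section.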

#### Risk

T1 = 2/10 (pattern shipped 4 times prior, near-zero base rate of failure). T2 = 1/10 (pure additive metrics + alerts). Combined = 3/10. No schema migrations beyond T1's `citation_verdicts`. No flag flips. No hot-path code in T2 (single guarded conditional in persistState).

### Added — Citation-verifier A/B test harnesses (test-only, 2026-05-12, PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119))

Two harnesses that empirically validate the production `EXA_WEB_TOOLS=true` config (live in `flags.env` since 2026-04-18, PR #76, but never directly measured against the Anthropic baseline). No production code touched; pure `test/sdk/` additions plus a runbook report. Closes the open audit item from the Exa April 2026 plan.
28 changes: 28 additions & 0 deletions super-legal-mcp-refactored/docs/metrics-catalog.md
@@ -194,6 +194,34 @@ COMPARE arms:
- `docs/runbooks/exa-a3-ab-staging.md` — operator runbook (440 lines, decision tree, 4 failure modes)
- `docs/feature-flags.md` §39, §40 — flag definitions

### 9.2 G5 Citation-Verifier Observability (v6.8.7 T2)

Four metrics covering the `citation-websearch-verifier` subagent — the G5 verifier that independently re-verifies every footnote against live web sources before final synthesis (Anthropic `WebSearch`/`WebFetch` + Exa MCP tools when `EXA_WEB_TOOLS=true`). Emitted once per `SubagentStop` in `hookDBBridge.persistState()`, sourced from `state_data.verification_results` in the agent's own state file.

| Metric | Type | Labels | Source field |
|---|---|---|---|
| `citation_verifier_confirmation_rate_pct` | Gauge (0-100) | `mode` (2 bounded values) | `(confirmed + confirmed_paywalled) / total × 100`, where `total = confirmed + confirmed_paywalled + unconfirmed + errors` |
| `citation_verifier_confirmed_total` | Counter | `mode` | `verification_results.confirmed + confirmed_paywalled` |
| `citation_verifier_unconfirmed_total` | Counter | `mode` | `verification_results.unconfirmed` |
| `citation_verifier_errors_total` | Counter | `reason` (5 bounded: timeout/http_error/tool_failure/parse_error/unknown; `persistState()` currently always records `unknown`) | `verification_results.errors` |

**Mode label**: `source_existence` (Haiku, default) or `full_content` (Sonnet, when `CITATION_DEEP_VERIFICATION=true`).
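
As a worked example of the table's arithmetic and the mode fallback, the sketch below mirrors the calculation in `hookDBBridge.persistState()`. The wrapper name `summarizeVerification` and the sample counts are hypothetical.

```javascript
// Mirrors the persistState() arithmetic; sample input below is made up.
function summarizeVerification(stateData) {
  const vr = stateData.verification_results || {};
  // Any value other than 'full_content' falls back to the default mode.
  const mode = stateData.verification_mode === 'full_content' ? 'full_content' : 'source_existence';
  const confirmed = (Number(vr.confirmed) || 0) + (Number(vr.confirmed_paywalled) || 0);
  const unconfirmed = Number(vr.unconfirmed) || 0;
  const errors = Number(vr.errors) || 0;
  const total = confirmed + unconfirmed + errors;
  const ratePct = total > 0 ? (confirmed / total) * 100 : 0;
  return { mode, confirmed, unconfirmed, errors, total, ratePct };
}

const summary = summarizeVerification({
  verification_mode: 'something_else',
  verification_results: { confirmed: 180, confirmed_paywalled: 10, unconfirmed: 8, errors: 2 },
});
console.log(summary.mode, summary.total, summary.ratePct.toFixed(1));
// → source_existence 200 95.0
```

Note that a run with zero footnotes reports a rate of 0, which is below both confirmation-rate alert thresholds.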

**Cardinality**: at most 11 series (3 `mode`-labelled metrics × 2 modes + 5 `reason` values). Bounded enums prevent series explosion.
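
To make the bound concrete, the sketch below clamps label values to the enums listed above and enumerates every label combination the four metrics can produce. The helper names (`clampMode`, `clampReason`, `enumerateSeries`) are illustrative assumptions; the label values and metric names are taken from the table. Enumerating them yields 3 × 2 + 5 = 11 combinations.

```javascript
const MODES = ['source_existence', 'full_content'];
const REASONS = ['timeout', 'http_error', 'tool_failure', 'parse_error', 'unknown'];
const REASON_SET = new Set(REASONS);

// Unknown inputs collapse onto a fixed default, so callers cannot mint new series.
const clampMode = (mode) => (mode === 'full_content' ? 'full_content' : 'source_existence');
const clampReason = (reason) => (REASON_SET.has(reason) ? reason : 'unknown');

function enumerateSeries() {
  const modeMetrics = [
    'citation_verifier_confirmation_rate_pct',
    'citation_verifier_confirmed_total',
    'citation_verifier_unconfirmed_total',
  ];
  const series = [];
  for (const name of modeMetrics) {
    for (const mode of MODES) series.push(`${name}{mode="${mode}"}`);
  }
  for (const reason of REASONS) {
    series.push(`citation_verifier_errors_total{reason="${reason}"}`);
  }
  return series;
}

console.log(clampReason('ECONNRESET')); // → unknown
console.log(enumerateSeries().length); // → 11
```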

**Companion structured log**: `sdkLogger.logInfo('citation_verifier_completed', {...})` with full counts, mode, duration_ms, turns_used, tool-call counts. Filter via Cloud Logging `jsonPayload.event="citation_verifier_completed"`.

**Production baseline (2026-05-12 A/B PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119))**: Exa arm 96.8% / Anthropic arm 96.1% on the 467-footnote citation-verifier fixture, both PASS production gate.

**Alert rules** (in `prometheus/alerts.yml`):
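- `CitationVerifierConfirmationRateLow` — rate < 90% sustained 1h (WARN; `mode="source_existence"` only)
- `CitationVerifierConfirmationRateCritical` — rate < 80% sustained 30m (CRIT; `mode="source_existence"` only)
- `CitationVerifierErrorSpike` — `increase(citation_verifier_errors_total[15m]) > 50` sustained 5m (WARN)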
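
The "sustained" qualifier on these rules corresponds to the `for:` clause: a brief dip below the threshold does not fire the alert. The sketch below is a simplified model of that behaviour, not Prometheus itself; `firesAt` and the sample data are hypothetical, and evaluation timestamps are given in milliseconds.

```javascript
// Simplified model of `for:`: the alert fires once the condition has held on
// every consecutive evaluation for at least `forMs`.
function firesAt(samples, threshold, forMs) {
  let pendingSince = null;
  for (const { t, value } of samples) {
    if (value < threshold) {
      if (pendingSince === null) pendingSince = t; // enters pending state
      if (t - pendingSince >= forMs) return t; // held long enough: fires
    } else {
      pendingSince = null; // condition cleared; pending state resets
    }
  }
  return null;
}

const MIN = 60_000;
// Rate stays at 88% for a full hour.
const dip = [0, 15, 30, 45, 60].map((m) => ({ t: m * MIN, value: 88 }));
// Rate recovers at the 30-minute mark, then dips again.
const blip = [
  { t: 0, value: 88 },
  { t: 30 * MIN, value: 95 },
  { t: 60 * MIN, value: 88 },
];
console.log(firesAt(dip, 90, 60 * MIN)); // → 3600000
console.log(firesAt(blip, 90, 60 * MIN)); // → null
```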

**Cross-references**:
- T1 (v6.8.6, PR [#122](https://github.com/Number531/Legal-API/pull/122)) — `citation_verdicts` table for per-footnote verdict persistence; metrics emission and verdict-table population are independent paths (state_data vs. parsed cert), enabling reconciliation checks.
- `docs/runbooks/citation-verifier-subagent-ab-report-2026-05-12.md` — production-fidelity A/B validation methodology.
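
Because the metrics path and the verdict-table path are independent, a reconciliation check can compare them. The following is a hedged sketch only: `reconcile` is a hypothetical helper, and the verdict strings (`CONFIRMED`, `CONFIRMED_PAYWALLED`, `UNCONFIRMED`, `ERROR`) are placeholder labels rather than the actual values stored in `citation_verdicts`.

```javascript
// Sketch: compare state-file counts against per-footnote verdict rows.
// Verdict labels here are placeholders, not the real schema values.
function reconcile(verificationResults, verdictRows) {
  const fromRows = { confirmed: 0, unconfirmed: 0, errors: 0 };
  for (const row of verdictRows) {
    if (row.verdict === 'CONFIRMED' || row.verdict === 'CONFIRMED_PAYWALLED') fromRows.confirmed += 1;
    else if (row.verdict === 'UNCONFIRMED') fromRows.unconfirmed += 1;
    else if (row.verdict === 'ERROR') fromRows.errors += 1;
  }
  const fromState = {
    confirmed:
      (Number(verificationResults.confirmed) || 0) +
      (Number(verificationResults.confirmed_paywalled) || 0),
    unconfirmed: Number(verificationResults.unconfirmed) || 0,
    errors: Number(verificationResults.errors) || 0,
  };
  // Names of the counters on which the two paths disagree.
  const mismatches = Object.keys(fromState).filter((k) => fromState[k] !== fromRows[k]);
  return { fromState, fromRows, mismatches };
}

const check = reconcile(
  { confirmed: 1, confirmed_paywalled: 1, unconfirmed: 1, errors: 0 },
  [{ verdict: 'CONFIRMED' }, { verdict: 'CONFIRMED_PAYWALLED' }, { verdict: 'ERROR' }],
);
console.log(check.mismatches); // → [ 'unconfirmed', 'errors' ]
```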

---

## 10. Document Conversion Metrics
30 changes: 30 additions & 0 deletions super-legal-mcp-refactored/prometheus/alerts.yml
@@ -138,3 +138,33 @@ groups:
          summary: "Tool envelope shape drift detected (1m TTL)"
          description: "A tool's response envelope no longer matches its zod schema in hookDBBridge.js. Likely cause: SDK upgrade or upstream API field rename. Update the schema (not the test mock) — see docs/testing-integration-tests.md § Production canaries. Short TTL because silent data loss starts immediately on drift."

      # v6.8.7 T2: G5 citation-verifier observability alerts.
      # Baseline established 2026-05-12 (PRs #118+#119): Exa 96.8% / Anthropic 96.1%.
      # 90% WARN floor gives ~7pp margin; 80% CRIT triggers only on genuine degradation.
      - alert: CitationVerifierConfirmationRateLow
        expr: citation_verifier_confirmation_rate_pct{mode="source_existence"} < 90
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "G5 citation-verifier confirmation rate below 90% (1h sustained)"
          description: "Rate: {{ $value | printf \"%.1f\" }}%. Likely Exa/WebFetch tool degradation OR mass URL breakage. Check claude_tool_duration_ms{tool_name=~\"exa_.*|WebFetch\"} for upstream issues; cross-check `event=citation_verifier_completed` in Cloud Logging."

      - alert: CitationVerifierConfirmationRateCritical
        expr: citation_verifier_confirmation_rate_pct{mode="source_existence"} < 80
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "G5 citation-verifier confirmation rate CRITICAL — below 80% (30m sustained)"
          description: "Rate: {{ $value | printf \"%.1f\" }}%. This invalidates the production verifier gate (96.1-96.8% baseline). The Aperture verification claim no longer holds — escalate. Likely root causes: (1) Exa API outage, (2) mass URL rot in memo input, (3) verifier prompt regression."

      - alert: CitationVerifierErrorSpike
        expr: increase(citation_verifier_errors_total[15m]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "G5 citation-verifier error spike (>50 errors in 15m, reason {{ $labels.reason }})"
          description: "Errors: {{ $value | printf \"%.0f\" }}. Top suspects: Exa API rate limit, network instability, malformed certificate JSON. Check sdkLogger for citation_verifier_completed events and Exa A3 metrics (claude_exa_ab_*)."

8 changes: 6 additions & 2 deletions super-legal-mcp-refactored/src/server/dbFrontendRouter.js
@@ -1347,9 +1347,13 @@ export function createDbFrontendRouter() {
     ).catch(() => ({ rows: [] }));

     // Access log (Wave 3)
+    // v6.8.7 T2 fix: column names corrected to match ACCESS_LOG_DDL (postgres.js:255).
+    // Previously selected non-existent `actor`/`action`/`accessed_at`; the .catch
+    // silently returned [] so audit-report has shown empty access_log since shipped.
+    // Real columns: requester, resource_type, resource_key, purpose_code, ip_address, created_at.
     const { rows: accessLog } = await pool.query(
-      `SELECT actor, resource_type, resource_key, action, accessed_at
-       FROM access_log WHERE session_id = $1 ORDER BY accessed_at ASC`,
+      `SELECT requester, resource_type, resource_key, purpose_code, ip_address, created_at
+       FROM access_log WHERE session_id = $1 ORDER BY created_at ASC`,
       [session.id]
     ).catch(() => ({ rows: [] }));

41 changes: 41 additions & 0 deletions super-legal-mcp-refactored/src/utils/hookDBBridge.js
@@ -643,6 +643,47 @@ async function persistState(pool, sessionCache, input, result, sessionDir) {
  const stateKey = filename.replace(/\.json$/, '');
  const agentType = extractAgentType(stateKey);

  // v6.8.7 T2: G5 citation-verifier metrics + structured log emission.
  // Emitted BEFORE the DB INSERT so metrics fire even if persistence fails.
  // Source: state_data.verification_results (in-hand). No DB round-trip,
  // no race with T1's fire-and-forget citation_verdicts INSERT.
  if (agentType === 'citation-websearch-verifier' && stateData?.verification_results) {
    try {
      const vr = stateData.verification_results;
      const m = stateData.metrics || {};
      const mode = stateData.verification_mode === 'full_content' ? 'full_content' : 'source_existence';
      const confirmed = (Number(vr.confirmed) || 0) + (Number(vr.confirmed_paywalled) || 0);
      const unconfirmed = Number(vr.unconfirmed) || 0;
      const errors = Number(vr.errors) || 0;
      const total = confirmed + unconfirmed + errors;
      const ratePct = total > 0 ? (confirmed / total) * 100 : 0;

      const metricsMod = await import('./sdkMetrics.js');
      metricsMod.recordCitationVerifierRate(ratePct, mode);
      metricsMod.recordCitationVerifierConfirmed(confirmed, mode);
      metricsMod.recordCitationVerifierUnconfirmed(unconfirmed, mode);
      metricsMod.recordCitationVerifierError(errors, 'unknown');

      const { logInfo } = await import('./sdkLogger.js');
      logInfo('citation_verifier_completed', {
        session_id: sessionId,
        agent_type: agentType,
        mode,
        total_footnotes: total,
        confirmed,
        unconfirmed,
        errors,
        confirmation_rate_pct: Number(ratePct.toFixed(2)),
        duration_ms: input?.duration_ms || null,
        turns_used: m.turns_used || null,
        websearch_calls: m.websearch_calls || null,
        webfetch_calls: m.webfetch_calls || null,
      });
    } catch (err) {
      console.warn('[CitationVerifierMetrics] non-fatal:', err.message);
    }
  }

  const compactionSummary = typeof stateData.compaction_summary === 'object'
    ? JSON.stringify(stateData.compaction_summary)
    : stateData.compaction_summary || null;
58 changes: 58 additions & 0 deletions super-legal-mcp-refactored/src/utils/sdkMetrics.js
@@ -261,6 +261,36 @@ const exaSummaryTypeAnomaly = new client.Counter({
  labelNames: ['actual_type', 'domain']
});

// v6.8.7 T2: G5 citation-verifier observability metrics.
// Recorded once per SubagentStop for agent_type === 'citation-websearch-verifier'.
// Source-of-truth: state_data.verification_results from citation-websearch-verifier-state.json
// (in-hand at persistState time — no DB round-trip, no race with T1's fire-and-forget
// citation_verdicts INSERT). Cardinality budget: 11 series (2 modes × 3 series + 5 reasons).
// Production baseline (2026-05-12 A/B PRs #118+#119): Exa 96.8% / Anthropic 96.1%.
const citationVerifierConfirmationRate = new client.Gauge({
  name: 'citation_verifier_confirmation_rate_pct',
  help: 'G5 citation-verifier confirmation rate as percentage (0-100), updated per SubagentStop',
  labelNames: ['mode'] // 'source_existence' | 'full_content'
});

const citationVerifierConfirmed = new client.Counter({
  name: 'citation_verifier_confirmed_total',
  help: 'Cumulative footnotes confirmed by G5 (includes PASS_WITH_NOTE/paywalled)',
  labelNames: ['mode']
});

const citationVerifierUnconfirmed = new client.Counter({
  name: 'citation_verifier_unconfirmed_total',
  help: 'Cumulative footnotes unconfirmed by G5',
  labelNames: ['mode']
});

const citationVerifierErrors = new client.Counter({
  name: 'citation_verifier_errors_total',
  help: 'Cumulative G5 verification errors by reason',
  labelNames: ['reason'] // 'timeout' | 'http_error' | 'tool_failure' | 'parse_error' | 'unknown'
});

// Wave 4.5: KG build lifecycle metrics
const kgBuildTotal = new client.Counter({
  name: 'claude_kg_build_total',
@@ -580,6 +610,34 @@ export function recordExaSummaryAnomaly(actualType, domain = 'unknown') {
  exaSummaryTypeAnomaly.inc({ actual_type: actualType || 'unknown', domain });
}

// v6.8.7 T2: G5 citation-verifier recording functions.
// Bulk increments (`counter.inc(count)`) — single call, no loop.
// Label cardinality bounded by Set validation on reason.
const _CV_VALID_REASONS = new Set(['timeout', 'http_error', 'tool_failure', 'parse_error', 'unknown']);

export function recordCitationVerifierRate(ratePct, mode = 'source_existence') {
  const safeMode = mode === 'full_content' ? 'full_content' : 'source_existence';
  citationVerifierConfirmationRate.labels({ mode: safeMode }).set(Number(ratePct) || 0);
}

export function recordCitationVerifierConfirmed(count, mode = 'source_existence') {
  if (!(count > 0)) return;
  const safeMode = mode === 'full_content' ? 'full_content' : 'source_existence';
  citationVerifierConfirmed.labels({ mode: safeMode }).inc(count);
}

export function recordCitationVerifierUnconfirmed(count, mode = 'source_existence') {
  if (!(count > 0)) return;
  const safeMode = mode === 'full_content' ? 'full_content' : 'source_existence';
  citationVerifierUnconfirmed.labels({ mode: safeMode }).inc(count);
}

export function recordCitationVerifierError(count, reason = 'unknown') {
  if (!(count > 0)) return;
  const safeReason = _CV_VALID_REASONS.has(reason) ? reason : 'unknown';
  citationVerifierErrors.labels({ reason: safeReason }).inc(count);
}

export function recordError(code, path = 'unknown') {
  errorCounter.inc({ code, path });
}