diff --git a/.claude/skills/client-audit-export/SKILL.md b/.claude/skills/client-audit-export/SKILL.md
index 955e1fe6b..5fe1b0e48 100644
--- a/.claude/skills/client-audit-export/SKILL.md
+++ b/.claude/skills/client-audit-export/SKILL.md
@@ -95,7 +95,7 @@ Compatible with `shasum -a 256 -c manifest.txt` for verification.
 - **`encrypted_value` exclusion** — hard-coded in `range-query.py`'s SELECT clause. Adding a new PII column requires explicit allow-list change here AND update to `references/art-13-fields.md`.
 - **Manifest regeneration** — manifest is rebuilt every run. Operators verifying chain-of-custody check the file hash against the manifest, not the upload date.
 - **WORM upload** — files land in the per-client WORM bucket with Object Lock. Cannot be modified or deleted by anyone (including project owner) until lock period elapses.
-- **A3 query capture verification (v7.6.1)** — `range-query.py` counts `hook_audit_log` rows with `event_data ? 'exa_a3'` over the export window and writes the count into the manifest. Zero rows during a window with active A3 traffic = forensic gap; investigate `hookDBBridge` wiring (PR #114 defect 4.3.2). The full `event_data.exa_a3` JSONB (additional_queries[], query_count, ab_arm, ab_outcome) is included in the standard `hook_audit_log__csv.gz` export, satisfying EU AI Act Art. 12 query-reconstruction requirements.
+- **A3 query capture verification (v7.6.1)** — `range-query.py` counts `hook_audit_log` rows with `event_data ? 'exa_a3'` over the export window and writes the count into the manifest. **Interpretation depends on flag state during the window**: (a) if `EXA_ADDITIONAL_QUERIES=true` was active and ≥1 session ran, zero rows = forensic gap → investigate `hookDBBridge` wiring (PR #114 defect 4.3.2); (b) if `EXA_ADDITIONAL_QUERIES=false` during the window, zero rows is expected and no action is needed (do not file incident).
When non-zero, the full `event_data.exa_a3` JSONB (`additional_queries[]`, `query_count`, `ab_arm`, `ab_outcome`, `otel_trace_id`) is included in the standard `hook_audit_log__csv.gz` export, satisfying EU AI Act Art. 12 query-reconstruction requirements.
 
 ## Output report
 
diff --git a/.claude/skills/deploy/SKILL.md b/.claude/skills/deploy/SKILL.md
index e787e4864..fabc83fc3 100644
--- a/.claude/skills/deploy/SKILL.md
+++ b/.claude/skills/deploy/SKILL.md
@@ -37,6 +37,16 @@ Report these to the user:
 - Health status and feature flags (from script output)
 - Any warnings or issues encountered
 
+### Production flag verification
+
+`curl http://34.26.70.60:3001/health | jq '.feature_flags'` should show the Exa production state established in `flags.env`:
+
+- `EXA_WEB_TOOLS=true` — production-locked since 2026-04-18 (PR [#76](https://github.com/Number531/Legal-API/pull/76)); validated 2026-05-12 via PRs [#118](https://github.com/Number531/Legal-API/pull/118)/[#119](https://github.com/Number531/Legal-API/pull/119) (96.8% Exa vs 96.1% Anthropic citation-verifier rate, both PASS gate).
+- `EXA_ADDITIONAL_QUERIES=true` — all-treatment rollout in production `flags.env` since 2026-05-11.
+- `EXA_ADDITIONAL_QUERIES_AB_SAMPLE=0.0` — all-treatment (set to `0.5` only for staging A/B windows; see `docs/runbooks/exa-a3-ab-staging.md`).
+
+If any of these flip to `false` after a deploy, the deploy regressed an env-var; do not advance traffic until reconciled.
+
 ## Pre-Deploy Checks
 
 Before running the script, verify these prerequisites. The script validates Docker but cannot fix auth or project issues.
diff --git a/.claude/skills/subagent-scaffold/SKILL.md b/.claude/skills/subagent-scaffold/SKILL.md
index 32d852bc9..2d15b0c50 100644
--- a/.claude/skills/subagent-scaffold/SKILL.md
+++ b/.claude/skills/subagent-scaffold/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: subagent-scaffold
-description: Generate a new Claude Agent SDK subagent across all 7 mandatory wiring files.
Mirrors the equity-analyst canonical template — agent file in legalSubagents/agents/, index.js import + registration tuple, _promptConstants.js CAPABILITY constant, domainMcpServers.js SUBAGENT_DOMAIN_MAP entry, hookSSEBridge.js classifyAgent map, optional p0GateHook.js RESEARCH_AGENTS Set, catalogDisplay/agentClassifications.js + agentDisplayMeta.js. Triggers — subagent scaffold, new subagent, generate agent, /subagent-scaffold. Supports flags — --name , --phase research|synthesis|qa, --domains , --keywords , --a3-eligible (auto-include EXA_ADDITIONAL_QUERIES_GUIDANCE for subagents that use Exa-routable tools).
+description: Generate a new Claude Agent SDK subagent across all 7 mandatory wiring files. Mirrors the equity-analyst canonical template — agent file in legalSubagents/agents/, index.js import + registration tuple, _promptConstants.js CAPABILITY constant, domainMcpServers.js SUBAGENT_DOMAIN_MAP entry, hookSSEBridge.js classifyAgent map, optional p0GateHook.js RESEARCH_AGENTS Set, catalogDisplay/agentClassifications.js + agentDisplayMeta.js. Triggers — subagent scaffold, new subagent, generate agent, /subagent-scaffold. Supports flags — --name , --phase research|synthesis|qa, --domains , --keywords , --a3-eligible (RECOMMENDED for --phase research; auto-includes EXA_ADDITIONAL_QUERIES_GUIDANCE — pre-wires the orchestrator query-variation prompt for Exa-routable tools).
 ---
 
 # Subagent Scaffold — Generate a New Agent SDK Subagent
diff --git a/super-legal-mcp-refactored/company-strategy/enterprise-necessities.md b/super-legal-mcp-refactored/company-strategy/enterprise-necessities.md
index c97c1e916..3129127c8 100644
--- a/super-legal-mcp-refactored/company-strategy/enterprise-necessities.md
+++ b/super-legal-mcp-refactored/company-strategy/enterprise-necessities.md
@@ -763,7 +763,9 @@ All clients use the BaseHybridClient pattern: native API first, automatic fallba
 | `ENABLE_GEMINI_FILTERING` | false | Gemini-based content filtering |
 | `PTAB_PERMISSIVE_MODE` | false | Lenient PTAB API error handling |
 | `ENHANCED_SUMMARY_QUERIES` | true | Enhanced summary generation |
-| `EXA_WEB_TOOLS` | true | Exa-powered web search tools in agent context |
+| `EXA_WEB_TOOLS` | true | Exa-powered web search tools (`fetch_document`, `exa_web_search`) replacing Anthropic `WebFetch`/`WebSearch`. **Production-locked since 2026-04-18 (PR [#76](https://github.com/Number531/Legal-API/pull/76)); validated 2026-05-12 via production-fidelity A/B (PRs [#118](https://github.com/Number531/Legal-API/pull/118)/[#119](https://github.com/Number531/Legal-API/pull/119)): 96.8% Exa vs 96.1% Anthropic on 467-footnote citation-verifier fixture, both PASS gate.** |
+| `EXA_ADDITIONAL_QUERIES` | true (production all-treatment since 2026-05-11) | Orchestrator-authored `additionalQueries` forwarded to Exa /search across 20 high-traffic MCP tools (v7.1.0 → v7.6.2) |
+| `EXA_ADDITIONAL_QUERIES_AB_SAMPLE` | 0.0 | A/B split for staging quality-lift measurement (set to 0.5 for balanced split) |
 | `RAW_SOURCE_ARCHIVE` | true | Content-addressed raw source capture + SHA-256 hashing for audit traceability (v6.0.0) |
 | `PROMPT_INJECTION_DETECTION` | true | Regex-based injection detection on tool outputs (v6.0.0) |
 | `SLA_TELEMETRY` | true | Per-tool latency histograms + 7-day SLA dashboard (v6.0.0) |
diff --git a/super-legal-mcp-refactored/company-strategy/gtm-buyer-intelligence.md
b/super-legal-mcp-refactored/company-strategy/gtm-buyer-intelligence.md
index d318350c8..dc2ab3ee6 100644
--- a/super-legal-mcp-refactored/company-strategy/gtm-buyer-intelligence.md
+++ b/super-legal-mcp-refactored/company-strategy/gtm-buyer-intelligence.md
@@ -506,7 +506,7 @@ Every PE fund, investment bank, and law firm should ask these 10 questions of an
 | **5** | **Can you show the complete audit trail for how a specific conclusion was reached?** | No. Conversational interface with no disclosed provenance chain. | Partial. Shows source documents but no full provenance chain through intermediate reasoning. | Yes. Session-level traceability: agent → database queried → API response → fact registry → section draft → QA score → remediation → final memorandum. |
 | **6** | **Does your system test for completeness — not just accuracy of what it produces?** | No disclosed completeness testing. BigLaw Bench tests answer quality on provided tasks only. | No disclosed mechanism for detecting gaps in its own research scope. | Phase 2 research review includes explicit completeness checks. QA "Completeness" dimension (10 points) scores whether all material issues are addressed. |
 | **7** | **What is your citation validation methodology?** | BigLaw Bench reports 68% source reliability — 32% unreliable. No disclosed methodology for improvement. | Links to Westlaw sources. No disclosed programmatic citation validation independent of the model. | Three-layer: (1) Bluebook standards in agent prompts, (2) programmatic Python validation independent of LLM, (3) QA "Citation Quality" dimension (12 points — highest-weighted single dimension). |
-| **8** | **How many regulatory databases does your system query directly — not via web search?** | Not disclosed. No disclosed direct API integrations with government databases. | Westlaw + Practical Law (Thomson Reuters proprietary). No disclosed government database integrations.
| 50+ database integrations via 134 domain-specific tools: SEC EDGAR, CourtListener, USPTO, FDA, EPA, Federal Register, GovInfo, BLS, ClinicalTrials.gov, USAspending, SAM.gov, ECB, ECHR, EUR-Lex, EPO, and more. |
+| **8** | **How many regulatory databases does your system query directly — not via web search?** | Not disclosed. No disclosed direct API integrations with government databases. | Westlaw + Practical Law (Thomson Reuters proprietary). No disclosed government database integrations. | 50+ database integrations via 134 domain-specific tools: SEC EDGAR, CourtListener, USPTO, FDA, EPA, Federal Register, GovInfo, BLS, ClinicalTrials.gov, USAspending, SAM.gov, ECB, ECHR, EUR-Lex, EPO, and more. Web-source citations (state-court rules, agency enforcement bulletins) route through Exa's MCP tools, blind A/B-validated 2026-05-12 at 96.8% confirmation (358/370 Exa-arm footnotes from the 467-footnote production fixture) vs 96.1% for the Anthropic arm (PRs [#118](https://github.com/Number531/Legal-API/pull/118)/[#119](https://github.com/Number531/Legal-API/pull/119)), both PASS production gate. |
 | **9** | **Does your system enforce least-privilege access for internal components?** | Not disclosed. | Not disclosed. | Yes. 25 domain-scoped MCP servers partition the 134-tool catalog. Each specialist agent receives only tools relevant to its domain (84-93% reduction). |
 | **10** | **Will you submit to an independent, blinded evaluation against human expert work product?** | Not disclosed. BigLaw Bench is self-designed and self-scored. | Participated in VLAIR (third-party) but tests task completion, not memorandum-grade output. | Yes. Independent blind evaluation on roadmap: law school professors design rubric, retired M&A partners score anonymized output. Architecture is built to survive this test.
|
 
diff --git a/super-legal-mcp-refactored/company-strategy/gtm-positioning-strategy.md b/super-legal-mcp-refactored/company-strategy/gtm-positioning-strategy.md
index 1a9a6ddfe..8bc0d4b95 100644
--- a/super-legal-mcp-refactored/company-strategy/gtm-positioning-strategy.md
+++ b/super-legal-mcp-refactored/company-strategy/gtm-positioning-strategy.md
@@ -135,6 +135,7 @@ USER QUERY + DOCUMENTS
 | Words per memorandum | 100,000+ | vs. 10-30 pages from traditional firms (varies by deal complexity) |
 | Citations per memorandum | 523+ unique | Bluebook 22nd Edition format |
 | Citation verification rate | 99%+ | Against live databases; uncertainties explicitly tagged for user review |
+| Citation websearch verifier (G5) | 96.8% confirmed (Exa) / 96.1% (Anthropic), both PASS production gate | Independent blind A/B on 467-footnote production fixture, 2026-05-12 (PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119)). Exa MCP tools (`fetch_document`, `exa_web_search`) are the production-default verifier path since 2026-04-18. |
 | Time to delivery | 2 hours 47 minutes | vs.
6-8 weeks for equivalent manual diligence |
 | Client configurability | 11 surfaces | Domain selection, database selection, agent roster, depth parameters, QA thresholds, certification floors, risk tolerance, deliverable format, remediation cycles, tool-level toggles, model routing |
 | Subscription pricing | $400K+/month | Deal infrastructure, not per-seat SaaS |
diff --git a/super-legal-mcp-refactored/company-strategy/gtm-sales-playbook.md b/super-legal-mcp-refactored/company-strategy/gtm-sales-playbook.md
index 01a3de8c2..10d6dbb3f 100644
--- a/super-legal-mcp-refactored/company-strategy/gtm-sales-playbook.md
+++ b/super-legal-mcp-refactored/company-strategy/gtm-sales-playbook.md
@@ -517,7 +517,7 @@ These four objections come up in nearly every compliance-conscious buyer convers
 
 **Q1: "How do you handle EU AI Act?"**
 
-> Articles 12-15 are mapped row-by-row to shipping artifacts. Article 12 logging is `hook_audit_log` + `access_log`. Article 13 transparency is the audit-export endpoint at `/api/session/:sessionKey/audit-report`. Article 14 human oversight is the admin router (`/halt`, `/override`, `/legal-hold`) with everything written to `human_interventions`. Article 15 reproducibility is the byte-replay envelope on every code execution — `system_prompt_hash + python_code + git_sha + sdk_version + container_id + anthropic_request_id`. We don't claim "AI Act ready"; we ship the artifacts. Bring your auditor.
+> Articles 12-15 are mapped row-by-row to shipping artifacts. Article 12 logging is `hook_audit_log` + `access_log`. Article 13 transparency is the audit-export endpoint at `/api/session/:sessionKey/audit-report`. Article 14 human oversight is the admin router (`/halt`, `/override`, `/legal-hold`) with everything written to `human_interventions`. Article 15 reproducibility is the byte-replay envelope on every code execution — `system_prompt_hash + python_code + git_sha + sdk_version + container_id + anthropic_request_id`.
Citation reproducibility extends through the verification layer itself: every footnote is checked against live regulatory/government databases via Exa MCP tools (production-default since 2026-04-18; blind A/B-validated 2026-05-12 at 96.8% confirmation on 370 footnotes — PRs [#118](https://github.com/Number531/Legal-API/pull/118)/[#119](https://github.com/Number531/Legal-API/pull/119)). We don't claim "AI Act ready"; we ship the artifacts. Bring your auditor.
 
 **Q2: "What about GDPR data deletion?"**
 
diff --git a/super-legal-mcp-refactored/company-strategy/system-design.md b/super-legal-mcp-refactored/company-strategy/system-design.md
index 0b1a499e1..86294c055 100644
--- a/super-legal-mcp-refactored/company-strategy/system-design.md
+++ b/super-legal-mcp-refactored/company-strategy/system-design.md
@@ -371,7 +371,7 @@ The P0 agent runs in a **dedicated agentQuery** before the main orchestrator, wi
 | `withWrite` | Read, Grep, Glob, Write, Edit |
 | `withWriteAndWeb` | Read, Grep, Glob, Write, Edit, WebFetch*, WebSearch* |
 
-*When EXA_WEB_TOOLS=true: WebFetch → fetch_document, WebSearch → exa_web_search (Exa-powered MCP tools)
+*Production config (EXA_WEB_TOOLS=true since 2026-04-18, PR [#76](https://github.com/Number531/Legal-API/pull/76)): WebFetch → fetch_document, WebSearch → exa_web_search (Exa-powered MCP tools). Validated 2026-05-12 (PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119)) — Exa arm 358/370 = 96.8% vs Anthropic arm 340/354 = 96.1% on 467-footnote citation-verifier fixture; both PASS production gate.
 
 ### 4.3 Agent Model Selection
 
@@ -495,7 +495,7 @@ The `SessionStart` hook performs similar recovery on session resume, checking fo
 
 ## 6.
Tool & Domain Architecture
 
-### 6.1 Tool Inventory (148 Tools across 30 Domains, +2 with EXA_WEB_TOOLS)
+### 6.1 Tool Inventory (150 Tools across 30 Domains — Exa MCP tools production-default since 2026-04-18)
 
 | Domain | Tool Count | Primary API | Example Tools |
 |--------|-----------|-------------|---------------|
@@ -528,8 +528,8 @@ The `SessionStart` hook performs similar recovery on session resume, checking fo
 | `analysis` | 1 | Exa comprehensive | comprehensive_legal_entity_analysis |
 | `filing` | 1 | Internal | draft_legal_filing |
 | `state-statutes` | 1 | Exa web search | search_state_statute |
-| `direct-fetch` | 1 | Exa `/contents` | fetch_document (conditional: EXA_WEB_TOOLS) |
-| `exa-search` | 1 | Exa search API | exa_web_search (conditional: EXA_WEB_TOOLS) |
+| `direct-fetch` | 1 | Exa `/contents` | fetch_document (production-default; replaces WebFetch) |
+| `exa-search` | 1 | Exa search API | exa_web_search (production-default; replaces WebSearch) |
 | `code-execution` | 1 | Anthropic sandbox | run_python_analysis (conditional) |
 
 ### 6.2 Hybrid Client Pattern
@@ -560,7 +560,7 @@ Behind `SCOPED_MCP_SERVERS=false` (default OFF):
 
 | Mode | MCP Servers | Tools Per Agent | Tool Name Pattern |
 |------|-------------|----------------|-------------------|
-| **Monolithic** (default) | 1 (`super-legal-tools`) | ~98 (all, +2 with EXA_WEB_TOOLS) | `mcp__super-legal-tools__search_sec_filings` |
+| **Monolithic** (default) | 1 (`super-legal-tools`) | ~150 (includes fetch_document + exa_web_search as production-default tools) | `mcp__super-legal-tools__search_sec_filings` |
 | **Scoped** | 25+ domain servers | 4-21 (per agent) | `mcp__sec__search_sec_filings` |
 
 **Subagent-to-Domain Mapping** (when scoped):
@@ -1880,7 +1880,9 @@ These invariants are why the v6.8.5 audit-export endpoint can claim byte-faithfu
 | `CITATION_CHAT` | `true` | Session-scoped RAG Q&A with Anthropic Citations API (requires EMBEDDING_PERSISTENCE) |
 | `KNOWLEDGE_GRAPH` | `true` | 10-phase KG
extraction, provenance chains, force-graph visualization, graph Q&A (requires EMBEDDING_PERSISTENCE + HOOK_DB_PERSISTENCE) |
 | `AUTH_ENABLED` | `true` | Cookie-based authentication with bcrypt password hashing |
-| `EXA_WEB_TOOLS` | `true` | Exa-powered fetch_document + exa_web_search replacing WebFetch/WebSearch |
+| `EXA_WEB_TOOLS` | `true` | Exa-powered fetch_document + exa_web_search replacing WebFetch/WebSearch. **Production-locked since 2026-04-18 (PR [#76](https://github.com/Number531/Legal-API/pull/76)); validated 2026-05-12 via PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119) — 96.8% Exa vs 96.1% Anthropic on 467-footnote fixture, both PASS gate.** |
+| `EXA_ADDITIONAL_QUERIES` | `false` (production: `true`, all-treatment) | Orchestrator-authored `additionalQueries` forwarded to Exa /search across 20 high-traffic MCP tools (v7.1.0 → v7.6.2). All-treatment rollout in production `flags.env` since 2026-05-11. |
+| `EXA_ADDITIONAL_QUERIES_AB_SAMPLE` | `0.0` | A/B split fraction for staging quality-lift measurement (set to `0.5` for balanced split). |
 | `PROMPT_ENHANCEMENT` | `true` | Intake research pre-phase for short queries (< 1000 chars) |
 | `RAW_SOURCE_ARCHIVE` | `true` | Content-addressed raw source capture + SHA-256 hashing for audit traceability (v6.0.0) |
 | `PROMPT_INJECTION_DETECTION` | `true` | Regex-based injection detection on tool outputs — OWASP LLM Top 10 (v6.0.0) |
@@ -2485,7 +2487,7 @@ When `DOCUMENT_PROCESSING=true`, two sequential `agentQuery()` calls run (P0 + m
 | 1 | ~~**PostgreSQL Migration**~~ | **SHIPPED** (v4.0.0, Issue #30) | 5-tier DB enhancements | Cross-session analytics, ACID storage. Hook-to-DB bridge with 6 tables, 16+ indexes, frontend query router. |
 | 2 | **JSON Structured Reports** | [#10](https://github.com/Number531/Legal-API/issues/10) | ~3,000 lines, 18 files | Zod schemas for all 42 subagent outputs.
Enables frontend rendering of structured findings, machine-readable risk tables, and API consumption of research results. |
 | 3 | **Document Processing (P0 Enable)** | [#8](https://github.com/Number531/Legal-API/issues/8) | Flag flip + integration testing | `DOCUMENT_PROCESSING=true` — P0 pre-wave subagent for client document upload and extraction. Code complete, empirical evidence in spec that prompt-only steering fails without enforcement gate. |
-| 4 | ~~**G5 Citation WebSearch Verification**~~ | **SHIPPED** (v3.7.4, flag now `true` by default) | 60 tests passing | Independent websearch verification of every footnote before final synthesis. Dual-mode (existence haiku / full-content sonnet). Tiered hybrid strategy (WebFetch -> Exa MCP -> Anthropic WebSearch). W5-004 tag downgrade pipeline. |
+| 4 | ~~**G5 Citation WebSearch Verification**~~ | **SHIPPED** (v3.7.4, flag now `true` by default; Exa-primary since 2026-04-18) | 60 tests passing + production-fidelity A/B validation (2026-05-12, PRs [#118](https://github.com/Number531/Legal-API/pull/118)/[#119](https://github.com/Number531/Legal-API/pull/119): 96.8% Exa vs 96.1% Anthropic, both PASS) | Independent websearch verification of every footnote before final synthesis. Dual-mode (existence haiku / full-content sonnet). Tiered hybrid strategy with **Exa MCP tools (`fetch_document`, `exa_web_search`) as the production-default verifier path** — Anthropic `WebSearch`/`WebFetch` retained as SDK-level fallback only. W5-004 tag downgrade pipeline. |
 | 5 | ~~**Database Enhancements (5-Tier)**~~ | **SHIPPED** (v4.0.0, Issue #30) | 5 tiers implemented | Hook-to-DB bridge persisting sessions, agent audit, gate checks, tool calls, code execution, remediation tracking. `HOOK_DB_PERSISTENCE=true`. |
 | 6 | **Files API Chart Extraction** | Merged to main (v4.1.0) | Feature-flagged | Charts extracted from code execution sandbox via `files-api-2025-04-14` beta. Persisted to `reports/{session}/charts/`.
Two flags: `FILES_API_CHART_EXTRACTION`, `CHART_PERSISTENCE`. |
 
diff --git a/super-legal-mcp-refactored/docs/feature-flags.md b/super-legal-mcp-refactored/docs/feature-flags.md
index 855a6988d..96f161cda 100644
--- a/super-legal-mcp-refactored/docs/feature-flags.md
+++ b/super-legal-mcp-refactored/docs/feature-flags.md
@@ -946,12 +946,15 @@ Enables session authentication for the frontend. When `true`, users must authent
 | Property | Value |
 |----------|-------|
 | **Env var** | `EXA_WEB_TOOLS` |
-| **Default** | `false` |
+| **Default** | `false` (new deployments) |
+| **Production** | `true` (since 2026-04-18, PR [#76](https://github.com/Number531/Legal-API/pull/76), commit `0f37daea`) |
 | **Type** | Boolean |
 | **Category** | Capabilities |
-| **Status** | Active — disabled by default |
+| **Status** | Active — production-locked |
+
+Enables Exa-powered web search tools in the agent context. When `true`, the agent's `WebSearch` and `WebFetch` capabilities are routed through the Exa MCP tools (`exa_web_search`, `fetch_document`) instead of Anthropic's server-executed `web_search_20260209` / `web_fetch_20260209`.
 
-Enables Exa-powered web search tools in the agent context. When `true`, agents can use Exa's deep search and content extraction APIs alongside the standard websearch tools.
+**Production validation (2026-05-12)**: Production-fidelity A/B harness (PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119)) invoked the actual production `citation-websearch-verifier` subagent via `agentQuery()` across a 467-footnote production fixture. Result: Exa arm 358/370 = **96.8%** vs Anthropic arm 340/354 = **96.1%** (gap +0.7pp; both PASS the production gate). Verdict `NEEDS_INVESTIGATION` reflects category-level flag-templating asymmetries, not signal loss.
Trace + report at `docs/runbooks/citation-verifier-subagent-ab-report-2026-05-12.md`; methodology forensics from the predecessor flawed harness at `docs/runbooks/citation-verifier-ab-postmortem-2026-05-12.md`.
 
 ---
 
diff --git a/super-legal-mcp-refactored/docs/runbooks/exa-a3-ab-staging.md b/super-legal-mcp-refactored/docs/runbooks/exa-a3-ab-staging.md
index 6df948947..40ce150dc 100644
--- a/super-legal-mcp-refactored/docs/runbooks/exa-a3-ab-staging.md
+++ b/super-legal-mcp-refactored/docs/runbooks/exa-a3-ab-staging.md
@@ -115,6 +115,8 @@ Confirm via boot log (server prints featureFlags state at startup):
 
 **Critical**: do NOT set this in production yet. Staging only.
 
+> **Update 2026-05-12 — production rollout approved.** Following the production-fidelity citation-verifier A/B validation (PRs [#118](https://github.com/Number531/Legal-API/pull/118) + [#119](https://github.com/Number531/Legal-API/pull/119)) showing parity between the Exa and Anthropic tool paths (96.8% vs 96.1%, both PASS gate), operations approved an all-treatment rollout of `EXA_ADDITIONAL_QUERIES=true` + `EXA_ADDITIONAL_QUERIES_AB_SAMPLE=0.0` in `flags.env` (effective 2026-05-11). The staging A/B step in §5–§8 below remains valid for future feature flips or coverage extensions, but is **not a current blocker** for new operators. See `flags.env` for the live production state.
+
 ---
 
 ## 5. Memo selection (Day 1, ~1 hour)
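
Reviewer sketch, not part of the patch: the two operator checks this diff documents can be illustrated in a few lines of Python. `flag_mismatches` compares a `feature_flags` object (such as the one returned by the `/health` call in the deploy skill) against the production values listed there, and `interpret_a3_row_count` encodes the flag-dependent reading of the A3 manifest count from the client-audit-export skill. The function names, the return labels, the assumption that `/health` reports these flags as JSON booleans and numbers keyed by env-var name, and the `no_traffic` outcome (flag on but zero sessions, a case the patch does not address) are all hypothetical.

```python
from typing import Any, Dict, Mapping, Tuple

# Expected production values, copied from the "Production flag verification"
# section this diff adds to .claude/skills/deploy/SKILL.md.
EXPECTED_FLAGS: Dict[str, Any] = {
    "EXA_WEB_TOOLS": True,
    "EXA_ADDITIONAL_QUERIES": True,
    "EXA_ADDITIONAL_QUERIES_AB_SAMPLE": 0.0,
}


def _matches(expected: Any, actual: Any) -> bool:
    """Strict comparison so that False never passes for 0.0 and vice versa."""
    if isinstance(expected, bool):
        return actual is expected
    is_number = isinstance(actual, (int, float)) and not isinstance(actual, bool)
    return is_number and float(actual) == float(expected)


def flag_mismatches(observed: Mapping[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {flag: (expected, observed)} for each flag that differs or is missing.

    ASSUMPTION: `observed` is the parsed `feature_flags` JSON object and uses
    the env-var names as keys; the diff does not show the real payload shape.
    """
    return {
        name: (expected, observed.get(name))
        for name, expected in EXPECTED_FLAGS.items()
        if not _matches(expected, observed.get(name))
    }


def interpret_a3_row_count(row_count: int, flag_enabled: bool, sessions_in_window: int) -> str:
    """Classify the manifest's `exa_a3` row count for one export window.

    Labels are hypothetical:
      "captured"      -> non-zero rows, A3 queries were logged
      "forensic_gap"  -> flag on, at least one session, zero rows (case a)
      "expected_zero" -> flag off, zero rows (case b, no incident)
      "no_traffic"    -> flag on but no sessions; ASSUMED benign, not in the patch
    """
    if row_count < 0 or sessions_in_window < 0:
        raise ValueError("counts must be non-negative")
    if row_count > 0:
        return "captured"
    if not flag_enabled:
        return "expected_zero"
    return "forensic_gap" if sessions_in_window >= 1 else "no_traffic"
```

An empty dict from `flag_mismatches` means the deploy kept the documented flag state; any entry names the regressed flag with its expected and observed values. Only a `forensic_gap` result from `interpret_a3_row_count` corresponds to the "investigate `hookDBBridge` wiring" path described in the audit-export skill.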