Number531 · Number531 · May 11, 2026 · May 9, 2026 · May 9, 2026 · May 9, 2026
diff --git a/super-legal-mcp-refactored/.dockerignore b/super-legal-mcp-refactored/.dockerignore
@@ -10,9 +10,24 @@ test/sdk/
 test/unit/
 test/integration/
 test/parity/
+test/fixtures/
+test/smoke/
+test/chaos/
+test/isolated/
 docs/
 scripts/
 docker-compose.yml
 *.md
 !prompts/**/*.md
 .claude/
+
+# Defense-in-depth (Wave 1.18): block dead-code recurrences
+**/__tests__/**
+**/*.backup*
+**/*.bak
+**/*.orig
+kg_audit.*
+staging-health-*.ndjson
+WTF-IS-THIS-*.md
+toPdfViaTempFile*
+*.tsbuildinfo
diff --git a/super-legal-mcp-refactored/docs/metrics-catalog.md b/super-legal-mcp-refactored/docs/metrics-catalog.md
@@ -1,6 +1,6 @@
 # Metrics Catalog
 
-**Version**: v6.8.5 | **Date**: 2026-05-06 | **Source**: `src/utils/sdkMetrics.js`, `src/config/alertingRules.js`, `prometheus/alerts.yml`
+**Version**: v7.6.0 | **Date**: 2026-05-10 | **Source**: `src/utils/sdkMetrics.js`, `src/config/alertingRules.js`, `prometheus/alerts.yml`
 
 This catalog enumerates every Prometheus metric, OTel span, and alert rule emitted by the Super Legal MCP server. Operators reference it when building Grafana dashboards, debugging production incidents, or extending the observability surface.
 
@@ -17,6 +17,7 @@ This catalog enumerates every Prometheus metric, OTel span, and alert rule emitt
 7. [Auto-Reconciliation Metrics (v6.7.0)](#7-auto-reconciliation-metrics-v670)
 8. [Knowledge Graph & Embedding Metrics](#8-knowledge-graph--embedding-metrics)
 9. [Subagent & API Client Metrics](#9-subagent--api-client-metrics)
+   - 9.1. [Exa A3 — additionalQueries forwarding & A/B sampling (v7.1.0–v7.6.0)](#91-exa-a3--additionalqueries-forwarding--ab-sampling-v710v760)
 10. [Document Conversion Metrics](#10-document-conversion-metrics)
 11. [Token Usage Counters](#11-token-usage-counters)
 12. [Wave 3 Observability Errors](#12-wave-3-observability-errors)
@@ -144,6 +145,57 @@ Surfaces failures that the `hookDBBridge` wrapper try/catch silently swallowed p
 
 ---
 
+### 9.1 Exa A3 — additionalQueries forwarding & A/B sampling (v7.1.0–v7.6.0)
+
+Six metrics covering the [Exa April 2026 plan §4.3 Avenue A3](pending-updates/Exa-April-2026-updates.md) — orchestrator-authored query variations forwarded to Exa's `/search` Deep API. All metrics emit only when `EXA_ADDITIONAL_QUERIES=true`; A/B-arm metrics additionally require `EXA_ADDITIONAL_QUERIES_AB_SAMPLE > 0`. Both flags default off — zero-traffic baseline is expected and correct.
+
+**Forwarding metric** (v7.1.0, [PR #106](https://github.com/Number531/Legal-API/pull/106)):
+
+| Metric | Type | Labels | Buckets | Purpose |
+|---|---|---|---|---|
+| `claude_exa_additional_queries_count` | Histogram | `domain` | `[1, 2, 3, 4, 5]` | Variation count per Exa `/search` call when A3 forwarding fires. Bucket cap matches Exa's hard limit (5 entries). Zero observations means no adopter is passing the param yet — base plumbing inert by design. Used to track per-domain Layer 3 adopter rollout. |
+
+**A/B sampling metrics** (v7.6.0, [PR #110](https://github.com/Number531/Legal-API/pull/110)):
+
+Each eligible call is randomly assigned to `treatment` (additionalQueries forwarded as authored) or `control` (additionalQueries withheld; Exa auto-expansion fires). Operators compare arms in Grafana to validate quality lift before flipping `EXA_ADDITIONAL_QUERIES` to all-treatment.
+
+| Metric | Type | Labels | Buckets | Purpose |
+|---|---|---|---|---|
+| `claude_exa_ab_sample_assignments_total` | Counter | `arm`, `domain` | — | Total A/B sample assignments. Used to verify traffic balance — should converge to the configured `EXA_ADDITIONAL_QUERIES_AB_SAMPLE` ratio over time. |
+| `claude_exa_ab_result_count` | Histogram | `arm`, `domain` | `[1, 5, 10, 20, 50, 100]` | Result count per Exa call by A/B arm. Primary quality-lift signal — treatment arm should show ≥15% higher P50 than control for rollout. |
+| `claude_exa_ab_unique_urls` | Histogram | `arm`, `domain` | `[1, 5, 10, 20, 50, 100]` | Deduplicated URL count per call. Breadth signal — treatment arm collapses redundant variations into one Deep call, so ratio of `unique_urls / result_count` is the diversity proxy. |
+| `claude_exa_ab_summary_chars` | Histogram | `arm`, `domain` | `[100, 500, 1000, 5000, 10000, 50000]` | Total summary text characters returned per call. Content-depth signal — high values with few unique URLs imply Exa returned more content per source (Deep fan-out working). |
+| `claude_exa_ab_latency_ms` | Histogram | `arm`, `domain` | `[100, 500, 1000, 2500, 5000, 10000, 30000, 60000]` | Wall-clock latency per Exa call, by arm. Cost dimension: the treatment latency penalty must stay ≤20% of control for rollout, per the runbook decision tree. |
+
+**Operator decision tree** (per [`docs/runbooks/exa-a3-ab-staging.md`](runbooks/exa-a3-ab-staging.md)):
+
+```
+SET EXA_ADDITIONAL_QUERIES=true, EXA_ADDITIONAL_QUERIES_AB_SAMPLE=0.5
+WAIT 24-48h for natural traffic
+COMPARE arms:
+  IF treatment lift ≥15% on result_count, unique_urls, summary_chars
+   AND treatment latency penalty ≤20%
+   → roll out (set AB_SAMPLE=0.0; treatment becomes 100%)
+  ELIF lift ≤5% across all 3 outcome dims
+   → refine variations, re-run
+  ELIF treatment regresses on any outcome dim
+   → rollback (set EXA_ADDITIONAL_QUERIES=false)
+```
+
+**Distinctness signal (not a metric)**: `src/utils/exaQueryValidator.js:78-103` runs a Jaccard-similarity check between caller-supplied variations and the primary query at validation time. Variations that paraphrase the primary emit a structured warning log (`exa_query_distinctness_jaccard`). This is a log event, not a Prometheus counter, so operators inspect the logs when investigating low quality lift.
+
+**Cardinality budget**: 5 arm-labeled metrics × ~8 domains × 2 arms, plus 1 forwarding metric × ~8 domains (no `arm` label), gives ≈ 88 series, under the 6 × 8 × 2 = 96 upper bound. Negligible vs. the ~900-series cap of `claude_tool_duration_ms`.
+
+**Files**:
+- `src/utils/sdkMetrics.js:197-249` — metric definitions
+- `src/utils/sdkMetrics.js:537-565` — `recordExaAdditionalQueriesCount`, `recordExaAbAssignment`, `recordExaAbOutcome` helpers
+- `src/api-clients/BaseWebSearchClient.js` — sampling + emit sites (v7.6.0 added `_ab_arm` tag on results)
+- `src/utils/exaQueryValidator.js` — Jaccard distinctness logic
+- `docs/runbooks/exa-a3-ab-staging.md` — operator runbook (440 lines, decision tree, 4 failure modes)
+- `docs/feature-flags.md` §39, §40 — flag definitions
+
+---
+
 ## 10. Document Conversion Metrics
 
 | Metric | Type | Labels | Buckets | Purpose |
@@ -295,11 +347,16 @@ If any series count exceeds 5000 in production, audit label sources before the P
 | `recordPersistenceFailure(hook, reason)` | `claude_hook_persistence_failures_total` |
 | `setCircuitBreakerState(hook, state)` | `claude_hook_circuit_breaker_state` |
 | `classifyPersistenceFailure(err)` | (helper — returns bounded `reason` enum) |
+| `recordExaAdditionalQueriesCount(count, domain)` | `claude_exa_additional_queries_count` (v7.1.0) |
+| `recordExaAbAssignment(arm, domain)` | `claude_exa_ab_sample_assignments_total` (v7.6.0) |
+| `recordExaAbOutcome({arm, domain, resultCount, uniqueUrls, summaryChars, latencyMs})` | 4 outcome histograms — `claude_exa_ab_{result_count,unique_urls,summary_chars,latency_ms}` (v7.6.0) |
 
 ---
 
 **Reference docs**:
 - Operator audit-export runbook: `docs/runbooks/v6.8.5-audit-export.md`
 - Reconciliation runbook: `docs/runbooks/v6.7.0-session-reconciliation.md`
-- Feature flag registry: `docs/feature-flags.md` §31a (OTel sampler)
+- Exa A3 A/B staging runbook: `docs/runbooks/exa-a3-ab-staging.md` (v7.6.0)
+- Feature flag registry: `docs/feature-flags.md` §31a (OTel sampler), §39 (`EXA_ADDITIONAL_QUERIES`), §40 (`EXA_ADDITIONAL_QUERIES_AB_SAMPLE`)
 - System design v6.8.5: `company-strategy/system-design.md` §14c
+- Exa A3 plan: `docs/pending-updates/Exa-April-2026-updates.md` §4.3
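Reviewer sketch (not part of the patch): the operator decision tree added in §9.1 of `docs/metrics-catalog.md` can be read as a small pure function. The version below is a minimal sketch under stated assumptions. `decideRollout` and its field names are hypothetical and do not exist in the repository; each input is assumed to be a precomputed relative change of treatment over control (for example `0.18` for +18%); and the `'undecided'` result for cases the tree does not cover is likewise an assumption.

```javascript
// Hypothetical helper, not part of the repository: encodes the catalog's
// operator decision tree. Inputs are relative changes of treatment vs.
// control (0.18 means +18%). Branches are checked in the order the catalog
// lists them.
function decideRollout({ resultCountLift, uniqueUrlsLift, summaryCharsLift, latencyPenalty }) {
  const lifts = [resultCountLift, uniqueUrlsLift, summaryCharsLift];

  // IF lift >= 15% on all three outcome dims AND latency penalty <= 20%:
  // roll out (catalog: set AB_SAMPLE=0.0 so treatment becomes 100%).
  if (lifts.every((lift) => lift >= 0.15) && latencyPenalty <= 0.20) {
    return 'rollout';
  }

  // ELIF lift <= 5% across all three outcome dims: refine variations, re-run.
  if (lifts.every((lift) => lift <= 0.05)) {
    return 'refine';
  }

  // ELIF treatment regresses on any outcome dim:
  // roll back (catalog: set EXA_ADDITIONAL_QUERIES=false).
  if (lifts.some((lift) => lift < 0)) {
    return 'rollback';
  }

  // Assumption: the catalog does not say what to do here (for example,
  // lifts between 5% and 15%), so the sketch reports it explicitly.
  return 'undecided';
}

console.log(decideRollout({
  resultCountLift: 0.18,
  uniqueUrlsLift: 0.22,
  summaryCharsLift: 0.16,
  latencyPenalty: 0.12,
})); // prints "rollout"
```

Because the branches are evaluated in the order written, a uniform regression (all three lifts negative) satisfies the "≤5% across all 3" condition first and lands in `refine` rather than `rollback`. Whether the runbook intends that ordering is not stated in the catalog.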
diff --git a/super-legal-mcp-refactored/migration-tools/shadow-mode-proxy.js b/super-legal-mcp-refactored/migration-tools/shadow-mode-proxy.js