291 changes: 291 additions & 0 deletions super-legal-mcp-refactored/CHANGELOG.md

Large diffs are not rendered by default.

@@ -0,0 +1,142 @@
# Exa A3 Phase A — Six-Improvement Implementation Plan

**Status**: Active — implementation begins 2026-05-09
**Predecessors**: PR #106 (v7.1.0), PR #107 (v7.2.0), PR #108 (v7.3.0 — open)
**Target**: production rollout of orchestrator-authored Exa Deep variations

## Empirical findings driving this plan

1. **LLM adoption test (24 trials, Sonnet 4.6)**: 100% adoption from schema descriptions alone, averaging 2.9 variations per call. The trials forced `tool_choice`, which likely inflates this figure; the real-world rate is probably closer to ~70–90%.
2. **Quality observations**: variation [1] often echoes primary query (~50% of trials); intra-tool diversity across repeats is low (model is deterministic for fixed prompts); inter-axis distinctness is strongest in `search_opinions`, weakest in `search_federal_register`.
3. **Live smoke test**: 4/4 covered tools forward 3/3 variations to live Exa via web-search fallback path.
4. **Critical gap**: zero empirical signal on whether variations actually *improve result quality* vs. Exa's auto-expansion. Plumbing fires; quality lift is unmeasured.

## Sequenced PRs

### PR #108 amendment — schema tweaks + validator telemetry

**Items**: #1 (schema descriptions), #5 (Jaccard distinctness telemetry)
**Effort**: ~1 hour
**Blocks**: nothing
**Goal**: address the variation-1-echoes-primary defect observed in adoption test.

**Files**:
- `src/tools/toolDefinitions.js` — 5 description fields (exa_web_search + 4 per-domain)
- `src/utils/exaQueryValidator.js` — add `_distinctnessScore(primary, variations)` returning Jaccard similarity for each variation; log warning when first variation has >0.5 token overlap with primary
- `test/sdk/exa-content-strategy.test.js` — extend with 3 distinctness tests
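
The distinctness check described above can be sketched as follows. This is a minimal illustration, not the contents of `exaQueryValidator.js`: the helper names (`tokenize`, `jaccard`, `lowDistinctnessIndexes`) and the tokenization rule (lowercase, split on non-alphanumerics) are assumptions. It flags any variation above the 0.5 threshold rather than only the first one.

```javascript
// Hypothetical helpers; the real validator may tokenize and report differently.
function tokenize(text) {
  // Lowercase, split on runs of non-alphanumerics, drop empty tokens.
  return new Set(String(text).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  for (const token of setA) {
    if (setB.has(token)) intersection += 1;
  }
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

// Indexes of variations whose overlap with the primary exceeds the threshold.
function lowDistinctnessIndexes(primary, variations, threshold = 0.5) {
  return variations
    .map((variation, index) => ({ score: jaccard(primary, variation), index }))
    .filter(({ score }) => score > threshold)
    .map(({ index }) => index);
}

const flagged = lowDistinctnessIndexes('insider trading penalties', [
  'insider trading penalties SEC',       // near-restatement of the primary
  'scienter standard tipper liability',  // opens a different axis
]);
console.log(flagged);
```

A telemetry-only caller would log a warning for each flagged index and still forward every variation, which matches the non-blocking intent stated in the plan.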

**Schema description rewrite pattern**:
- Lead with the anti-pattern: "Each variation MUST open an axis the primary does NOT address. Do NOT restate, expand, or annotate the primary."
- Worked example inline: GOOD vs BAD variations for the domain
- Keep the existing axis hint list
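
One way the rewritten description might look inside a tool definition is sketched below. The field shape (`type: 'array'` of strings) and the GOOD/BAD example queries are illustrative assumptions, not text from `toolDefinitions.js`; only the two anti-pattern sentences are taken from the pattern above. The existing axis hint list would be appended after the examples.

```javascript
// Hypothetical schema fragment; the example queries are invented for illustration.
const additionalQueriesField = {
  type: 'array',
  items: { type: 'string' },
  description: [
    // Lead with the anti-pattern, as the rewrite pattern requires.
    'Each variation MUST open an axis the primary does NOT address.',
    'Do NOT restate, expand, or annotate the primary.',
    // Worked example inline (assumed wording for a securities-law tool).
    'GOOD (primary "Rule 10b-5 scienter standard"): "tipper-tippee liability personal benefit test".',
    'BAD (primary "Rule 10b-5 scienter standard"): "Rule 10b-5 scienter standard case law".',
  ].join(' '),
};

console.log(additionalQueriesField.description);
```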

### PR #109 — agentQuery adoption test

**Items**: #3
**Effort**: ~30 minutes
**Blocks**: nothing (insurance step before staging)
**Goal**: confirm the Messages-API adoption rate (100%) holds when the model is wrapped in actual subagent context.

**Approach**:
- Test rig: call `agentQuery({...})` from `@anthropic-ai/claude-agent-sdk` with:
  - System prompt loaded from actual `legalSubagents/agents/securities-researcher.js` (or equivalent)
  - Tool list = full production toolDefinitions slice for that subagent's domain
  - PreToolUse hook captures `tool_use.input.additionalQueries`
- 5 trials per subagent × 3 subagents (securities, case-law, regulatory) = 15 trials
- Tally adoption rate, compare to Messages-API result (100%)
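
The tallying step could look roughly like this. The trial record shape (`subagent`, `toolInputs`) is an assumption about what the hook would capture, and the sketch deliberately leaves out the SDK call itself. A trial counts as adopted when at least one captured tool input carries a non-empty `additionalQueries` array.

```javascript
// Hypothetical trial shape: { subagent: string, toolInputs: object[] }.
function tallyAdoption(trials, threshold = 0.8) {
  const bySubagent = {};
  let adopted = 0;
  for (const trial of trials) {
    const used = trial.toolInputs.some(
      (input) => Array.isArray(input.additionalQueries) && input.additionalQueries.length > 0
    );
    if (!bySubagent[trial.subagent]) {
      bySubagent[trial.subagent] = { trials: 0, adopted: 0 };
    }
    bySubagent[trial.subagent].trials += 1;
    if (used) {
      bySubagent[trial.subagent].adopted += 1;
      adopted += 1;
    }
  }
  const rate = trials.length === 0 ? 0 : adopted / trials.length;
  return { rate, passes: rate >= threshold, bySubagent };
}

const report = tallyAdoption([
  { subagent: 'securities', toolInputs: [{ query: 'q1', additionalQueries: ['a', 'b'] }] },
  { subagent: 'securities', toolInputs: [{ query: 'q2' }] },
  { subagent: 'case-law', toolInputs: [{ query: 'q3', additionalQueries: ['c'] }] },
  { subagent: 'case-law', toolInputs: [{ query: 'q4', additionalQueries: ['d'] }] },
  { subagent: 'regulatory', toolInputs: [{ query: 'q5', additionalQueries: ['e'] }] },
]);
console.log(report);
```

Keeping the per-subagent breakdown alongside the overall rate makes it easier to see whether a shortfall is concentrated in one subagent prompt.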

**Files**:
- `test/sdk/llm-additional-queries-adoption-agentquery.mjs` — NEW

**Acceptance**: adoption rate ≥80% across 15 agentQuery trials.

### PR #110 — A/B sampling flag

**Items**: #2
**Effort**: ~1 day
**Blocks**: production rollout (no quality data without it)
**Goal**: empirically measure whether `additionalQueries` improves result quality vs. Exa auto-expansion.

**Architecture**:
- New flag `EXA_ADDITIONAL_QUERIES_AB_SAMPLE: 0.0` (default 0.0 = no sampling, all forwarding follows main flag)
- Range: `0.0 → 1.0` (fraction of eligible calls routed to control arm with `additionalQueries` *withheld*)
- Stratified by `domain` label (so each domain gets balanced A/B coverage)
- Per-call decision: at executeExaSearch, if `EXA_ADDITIONAL_QUERIES=true` AND `Math.random() < EXA_ADDITIONAL_QUERIES_AB_SAMPLE`, drop additionalQueries and tag the result with `_ab_arm: 'control'`. Else `_ab_arm: 'treatment'`.
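
The per-call decision above might be sketched as follows. The function name `assignAbArm` and the injectable `random` parameter are assumptions added for testability. Calls that are not eligible (main flag off, or no variations supplied) are left untagged here, which is one reading of "eligible calls"; stratification by `domain` is not shown.

```javascript
// Hypothetical helper; the real logic is planned for BaseWebSearchClient.js.
function assignAbArm(options, flags, random = Math.random) {
  const hasVariations =
    Array.isArray(options.additionalQueries) && options.additionalQueries.length > 0;
  const eligible = flags.EXA_ADDITIONAL_QUERIES === true && hasVariations;
  if (!eligible) {
    // Assumption: ineligible calls carry no arm tag at all.
    return { options, arm: null };
  }
  if (random() < flags.EXA_ADDITIONAL_QUERIES_AB_SAMPLE) {
    // Control arm: withhold the variations so Exa auto-expansion runs alone.
    const { additionalQueries, ...rest } = options;
    return { options: rest, arm: 'control' };
  }
  return { options, arm: 'treatment' };
}

const flags = { EXA_ADDITIONAL_QUERIES: true, EXA_ADDITIONAL_QUERIES_AB_SAMPLE: 0.5 };
const call = { domain: 'legislative', additionalQueries: ['v1', 'v2'] };

const control = assignAbArm(call, flags, () => 0.1);   // 0.1 < 0.5
const treatment = assignAbArm(call, flags, () => 0.9); // 0.9 >= 0.5
console.log(control.arm, treatment.arm);
```

With the default sample of `0.0` the comparison `random() < 0` is never true, so every eligible call stays in the treatment arm and behavior follows the main flag.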

**Metrics** (added to `src/utils/sdkMetrics.js`):
- `claude_exa_ab_arm_total` (Counter, labels: `arm`, `domain`) — population balance check
- `claude_exa_result_count` (Histogram, labels: `arm`, `domain`) — primary outcome
- `claude_exa_unique_urls` (Histogram, labels: `arm`, `domain`) — diversity of returned set
- `claude_exa_summary_chars` (Histogram, labels: `arm`, `domain`) — content depth
- `claude_exa_latency_ms` (Histogram, labels: `arm`, `domain`) — cost dimension
- Optional downstream: `claude_citation_validator_pass_rate` (Counter, labels: `arm`, `domain`) — wired only if hook can correlate session→arm

**Acceptance**: 100 calls per arm per domain → tabulated comparison report. Decision rule: ship treatment if its `unique_urls` and `result_count` each exceed control by ≥10% and latency regresses by no more than 20%.
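
As a rough sketch, the decision rule could be encoded like this. The input shape (per-arm means plus a `calls` count for one domain) and the function name are assumptions, and the rule is read as requiring both quality metrics to clear the 10% lift independently.

```javascript
// Hypothetical summary shape per arm: { calls, uniqueUrls, resultCount, latencyMs }.
function shouldShipTreatment(control, treatment, minCalls = 100) {
  // Relative change of the treatment mean against the control mean.
  const lift = (key) => (treatment[key] - control[key]) / control[key];
  const enoughData = control.calls >= minCalls && treatment.calls >= minCalls;
  const qualityOk = lift('uniqueUrls') >= 0.10 && lift('resultCount') >= 0.10;
  const latencyOk = lift('latencyMs') <= 0.20;
  return enoughData && qualityOk && latencyOk;
}

const controlArm = { calls: 100, uniqueUrls: 10, resultCount: 10, latencyMs: 1000 };
const treatmentArm = { calls: 100, uniqueUrls: 12, resultCount: 12, latencyMs: 1100 };
console.log(shouldShipTreatment(controlArm, treatmentArm));
```

A production report would likely add a significance test per domain; this sketch only covers the threshold comparison stated in the acceptance criterion.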

**Files**:
- `src/config/featureFlags.js` — add flag
- `src/api-clients/BaseWebSearchClient.js` — sampling logic
- `src/utils/sdkMetrics.js` — register the new metrics listed above
- `test/sdk/exa-ab-sampling.test.js` — NEW (10+ tests covering flag-off/on, distribution, label correctness)
- Grafana dashboard JSON (separate, not in this PR)

### PR #111 — coverage extension to top-10 tools

**Items**: #4
**Effort**: ~1 day
**Blocks**: meaningful A/B signal on staging (with only 4 tools covered, ~30% of memo tool calls exercise the feature)
**Goal**: extend the same 4-edit pattern to the next 10 high-traffic tools.

**Tools to cover** (subject to explore-agent confirmation):
- ClinicalTrials: `search_clinical_trials`
- Congress: `search_congressional_records`, `search_legislation`
- USPTO: `search_patents`, `search_patent_applications`
- EPA: `search_epa_facilities`, `search_epa_violations`
- FDA: `search_fda_recalls`, `search_drug_approvals`
- USAspending: `search_federal_contracts`

**Per-tool edit pattern (same as PR #108)**:
1. `toolDefinitions.js` — add `additionalQueries` field with domain-specific axis guidance
2. `toolImplementations.js` — forward `args.additionalQueries` if wrapper strips args
3. `<Domain>WebSearchClient.<method>Web` — destructure + spread to `executeExaSearch`
4. e2e + fallback test additions
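
Step 3 of the pattern relies on a conditional object spread so the key is absent, rather than `undefined`, when the caller supplies no variations. The sketch below demonstrates the idiom with a stand-in client; `makeClient` and `searchExampleWeb` are hypothetical names, and the methods are synchronous here only to keep the example short (the real client methods are `async`).

```javascript
// Stand-in for a WebSearchClient; executeExaSearch just records what it receives.
function makeClient() {
  const calls = [];
  return {
    calls,
    executeExaSearch(query, limit, options) {
      calls.push({ query, limit, options });
      return [];
    },
    searchExampleWeb(args = {}) {
      const { query, limit = 10, additionalQueries } = args; // destructure
      return this.executeExaSearch(query, limit, {
        domain: 'example',
        // Spreading `false` is a no-op, so the key only appears when provided.
        ...(additionalQueries !== undefined && { additionalQueries }),
      });
    },
  };
}

const client = makeClient();
client.searchExampleWeb({ query: 'recall notices' });
client.searchExampleWeb({ query: 'recall notices', additionalQueries: ['injury reports'] });
console.log(client.calls.map((c) => Object.keys(c.options)));
```

Omitting the key entirely keeps the request options identical to the pre-A3 shape when the feature is unused, which is what the flag-off zero-degradation test checks for.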

**Acceptance**: all 10 tools pass the same 5-layer plumbing trace + flag-off zero-degradation test.

### PR #112 — skill template updates

**Items**: #6
**Effort**: ~30 minutes
**Blocks**: nothing
**Goal**: future tool integrations inherit A3 support automatically.

**Templates**:
- `.claude/skills/api-integration/templates/HybridClient.js.hbs`
- `.claude/skills/api-integration/templates/WebSearchClient.js.hbs`
- `.claude/skills/api-integration/templates/toolDefinitions.snippet.hbs`
- `.claude/skills/api-integration/templates/test-e2e.test.js.hbs`
- `.claude/skills/subagent-scaffold/templates/...`

**Edits**: insert the additionalQueries inputSchema field, destructure pattern, and 2 e2e/fallback test blocks as default scaffolding.

## Critical-path summary

```
PR #108 amend (1h) ─┐
PR #109 (30m)       │
                    ├─→ PR #110 (1d) ──→ staging memo run ──→ production rollout
PR #111 (1d)        │
PR #112 (30m)      ─┘
```

PR #110 is the gate. Without A/B sampling data, production rollout is blind. PR #111 makes that data statistically meaningful by extending the population.

## Risk register

| Risk | Mitigation |
|---|---|
| Schema tweaks lift adoption but not quality | A/B data (#2) catches this — control arm shows if Exa auto-expansion was already optimal |
| AgentQuery path adoption <80% | Add light subagent-prompt nudge (deferred PR #113) |
| Coverage extension breaks unrelated tools | Same e2e + fallback test pattern enforced per tool |
| A/B sampling biased by retry/cache layers | Sampling decision moved upstream of cache lookup; sample assignment logged for replay |
| Skill template changes affect existing scaffolds | Templates only used at generation time — no impact on existing code |
15 changes: 13 additions & 2 deletions super-legal-mcp-refactored/src/api-clients/BaseHybridClient.js
@@ -170,14 +170,25 @@ export class BaseHybridClient extends BaseWebSearchClient {
websearchArgs = null,
startPublishedDate,
endPublishedDate,
category
category,
// A3 (Exa April 2026 plan §4.3) — orchestrator-authored Deep variations.
// Prefer options.additionalQueries (explicit), fall back to args.additionalQueries
// so per-tool MCP wrappers that pass `additionalQueries` inside `args` flow through
// even when the per-domain hybrid client builds a separate `websearchArgs` object.
additionalQueries: optsAdditionalQueries
} = options;
const additionalQueries = optsAdditionalQueries ?? (args && args.additionalQueries);

// Forward Exa-specific options to websearch args if provided
if (websearchArgs && (startPublishedDate || endPublishedDate || category)) {
if (websearchArgs && (startPublishedDate || endPublishedDate || category || additionalQueries)) {
if (startPublishedDate) websearchArgs.startPublishedDate = startPublishedDate;
if (endPublishedDate) websearchArgs.endPublishedDate = endPublishedDate;
if (category) websearchArgs.category = category;
// A3: forward orchestrator-authored variations to the WebSearchClient method.
// The WebSearchClient method passes these to executeExaSearch, which
// validates + forwards to Exa request body when EXA_ADDITIONAL_QUERIES flag
// is enabled. Inert when flag is off (additive contract preserved).
if (additionalQueries) websearchArgs.additionalQueries = additionalQueries;
}

this.log(`executeHybrid called`, { methodName, strategy, args });
@@ -9,7 +9,7 @@ import { ContentStrategy } from './ContentStrategy.js';
import { extractFromSummary, fallbackToTextParsing, sanitizeData } from './schemas/SchemaValidator.js';
import { featureFlags } from '../config/featureFlags.js';
import { recordExaAdditionalQueriesCount } from '../utils/sdkMetrics.js';
import { validateAdditionalQueries } from '../utils/exaQueryValidator.js';
import { validateAdditionalQueries, warnOnLowDistinctness } from '../utils/exaQueryValidator.js';

export class BaseWebSearchClient extends SearchQualityMixin {
constructor(rateLimiter, exaApiKey, contentStrategy = null) {
@@ -231,6 +231,11 @@ export class BaseWebSearchClient extends SearchQualityMixin {
// D9 (Exa April 2026 plan §5.5.5): observe variation count for adoption tracking.
// Domain label defaults to 'unknown' when caller didn't pass it; non-blocking.
recordExaAdditionalQueriesCount(validated.length, domain || 'unknown');
// A3 distinctness telemetry (PR #108 amendment): Jaccard-similarity check
// between `query` and each variation. Logs a warning when a variation
// is a likely paraphrase of the primary (>0.5 token overlap) — surfaces
// low-quality orchestrator authorship without blocking the call.
warnOnLowDistinctness(query, validated, domain || 'unknown');
}
}

@@ -192,7 +192,8 @@ export class CPSCWebSearchClient extends BaseWebSearchClient {
product_category,
limit = 10,
include_snippet = false,
include_text = false
include_text = false,
additionalQueries // A3 (Exa April 2026 plan §4.3)
} = args;

// Validate inputs
@@ -229,7 +230,8 @@ export class CPSCWebSearchClient extends BaseWebSearchClient {
summaryQuery: 'CPSC recall hazard injury defect safety remedy repair consumer product recall number manufacturer',
numSentences: 4,
includeDomains: this.domains,
includeFullText: include_text
includeFullText: include_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

// Process results with permissive mapping
@@ -12,11 +12,13 @@ export class ClinicalTrialsWebSearchClient extends BaseWebSearchClient {
}

async searchClinicalTrials(args = {}) {
const { additionalQueries } = args; // A3 (Exa April 2026 plan §4.3)
const terms = [args.query, args.condition, args.intervention, args.sponsor].filter(Boolean).join(' ') || 'clinical trial';
const query = `site:clinicaltrials.gov ${terms} trial study`;
const results = await this.executeExaSearch(query, args.limit || 10, {
domain: 'clinical_trials',
includeDomains: this.domains
includeDomains: this.domains,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});
return {
content: [{ type: 'text', text: JSON.stringify({
@@ -97,13 +97,14 @@ export class CongressGovWebSearchClient extends BaseWebSearchClient {
}

async searchCongressionalRecordWeb(args = {}) {
const { query, chamber } = args;
const { query, chamber, additionalQueries } = args; // A3 additionalQueries
const chamberTerm = chamber ? ` ${chamber}` : '';
const exaQuery = `site:congress.gov/congressional-record "Congressional Record"${chamberTerm} ${query || ''}`;
const results = await this.executeExaSearch(exaQuery, args.limit || 25, {
domain: 'legislative',
includeDomains: this.domains,
summaryQuery: 'Congressional Record debate floor statement vote proceedings'
summaryQuery: 'Congressional Record debate floor statement vote proceedings',
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});
return {
content: [{ type: 'text', text: JSON.stringify({
@@ -51,7 +51,8 @@ export class CourtListenerWebSearchClient extends BaseWebSearchClient {
limit,
include_snippet = args.include_text ?? true, // Backward compatibility
include_full_text = false,
include_text // Capture for backward compatibility
include_text, // Capture for backward compatibility
additionalQueries // A3 (Exa April 2026 plan §4.3) — orchestrator-authored Deep variations
} = args;

// Smart limit based on content type
@@ -81,7 +82,8 @@ export class CourtListenerWebSearchClient extends BaseWebSearchClient {
summaryQuery: 'holding precedent citation court judge opinion dissent concurrence reversed affirmed decision ruling',
numSentences: 7,
includeDomains: this.clDomains,
includeFullText: include_full_text
includeFullText: include_full_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

// Filter to opinion pages or storage PDFs, apply optional date window
11 changes: 7 additions & 4 deletions super-legal-mcp-refactored/src/api-clients/EPAWebSearchClient.js
@@ -51,7 +51,8 @@ export class EPAWebSearchClient extends BaseWebSearchClient {
compliance_status,
violations_last_3_years,
limit = 3,
include_full_text = false
include_full_text = false,
additionalQueries // A3 (Exa April 2026 plan §4.3) — orchestrator-authored Deep variations
} = args;

// Validate that at least one location/identifier is provided
@@ -118,7 +119,8 @@ export class EPAWebSearchClient extends BaseWebSearchClient {
summaryQuery: summaryQuery,
numSentences: 6,
includeDomains: ['epa.gov'], // Wildcard to include all EPA subdomains (www, echo, enviro, etc.)
includeFullText: include_full_text
includeFullText: include_full_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

// Map to facility summary using highlights
@@ -207,7 +209,7 @@ export class EPAWebSearchClient extends BaseWebSearchClient {
*/
async searchViolationsWeb(args) {
if (!args || typeof args !== 'object') args = {};
const { facility_id, program, date_after, date_before, limit = 15 } = args;
const { facility_id, program, date_after, date_before, limit = 15, additionalQueries } = args; // A3 additionalQueries
if (!facility_id) {
throw new Error(
'facility_id is required for EPA violation searches. ' +
@@ -241,7 +243,8 @@ export class EPAWebSearchClient extends BaseWebSearchClient {
summaryQuery: summaryQuery,
numSentences: 7,
includeDomains: ['echo.epa.gov', 'www.epa.gov'],
includeFullText: false
includeFullText: false,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

const top = results.find(r => (r.url || '').includes('echo.epa.gov')) || results[0];
28 changes: 16 additions & 12 deletions super-legal-mcp-refactored/src/api-clients/FDAWebSearchClient.js
@@ -339,7 +339,8 @@ export class FDAWebSearchClient extends BaseWebSearchClient {
sort,
count,
include_snippet = false,
include_text = false
include_text = false,
additionalQueries // A3 (Exa April 2026 plan §4.3) — orchestrator-authored Deep variations
} = args;

const validatedLimit = validateLimit(limit, 10);
@@ -377,7 +378,8 @@ export class FDAWebSearchClient extends BaseWebSearchClient {
summaryQuery: summaryQuery,
numSentences: 4,
includeDomains: this.fdaDomains,
includeFullText: include_text
includeFullText: include_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

// Process results with permissive mapping
@@ -1096,14 +1098,15 @@ export class FDAWebSearchClient extends BaseWebSearchClient {
*/
async search510kWeb(args) {
if (!args || typeof args !== 'object') args = {};

const {
search = '',
limit = 5,
include_snippet = false,
include_text = false,
date_after,
date_before

const {
search = '',
limit = 5,
include_snippet = false,
include_text = false,
date_after,
date_before,
additionalQueries // A3 (Exa April 2026 plan §4.3)
} = args;

const validatedLimit = validateLimit(limit, 10);
@@ -1132,9 +1135,10 @@ export class FDAWebSearchClient extends BaseWebSearchClient {
summaryQuery: summaryQuery,
numSentences: 4,
includeDomains: this.fdaDomains,
includeFullText: include_text
includeFullText: include_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

const processedResults = results
.filter(r => this.isFDADomain(r.url))
.map(r => this.mapFDAResultPermissive(r, '510k', include_text, include_snippet));
@@ -127,7 +127,8 @@ export class FederalRegisterWebSearchClient extends BaseWebSearchClient {
date_before,
limit = 10,
include_text = false,
include_snippet = false
include_snippet = false,
additionalQueries // A3 (Exa April 2026 plan §4.3) — orchestrator-authored Deep variations
} = args;

// No validation required - buildFederalRegisterQuery provides smart fallbacks
@@ -177,7 +178,8 @@ export class FederalRegisterWebSearchClient extends BaseWebSearchClient {
summaryQuery: summaryQuery,
numSentences: 5,
includeDomains: this.federalRegisterDomains,
includeFullText: include_text
includeFullText: include_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

// Permissive mapping - no filtering, all results processed
@@ -123,7 +123,8 @@ export class PTABWebSearchClient extends BaseWebSearchClient {
status,
limit,
include_snippet = false,
include_text = false
include_text = false,
additionalQueries // A3 (Exa April 2026 plan §4.3)
} = args;

// Smart default limits aligned with USPTO/EPA
@@ -151,7 +152,8 @@ export class PTABWebSearchClient extends BaseWebSearchClient {
domain: 'patent',
summaryQuery: 'PTAB Patent Trial and Appeal Board IPR PGR CBM institution decision final written decision petitioner patent owner proceeding number status',
numSentences: 6,
includeFullText: include_snippet || include_text
includeFullText: include_snippet || include_text,
...(additionalQueries !== undefined && { additionalQueries }) // A3 forwarding
});

let structuredResults;