feat(exa): A3 Phase A — full pipeline (v7.3.0 → v7.5.0 augmentor refactor)#108
Merged
…ive fallback tests (v7.3.0)

Extends Phase A from the catch-all `exa_web_search` tool (v7.2.0) to four high-traffic per-domain MCP tools: search_sec_filings, search_cases, search_opinions, search_federal_register. Closes the test gap admitted on v7.2.0 by adding end-to-end tests through actual MCP tool implementations and explicit hybrid-fallback tests covering native-API failure paths.

Plumbing layers covered:
MCP tool args (additionalQueries)
→ toolImplementations.js wrapper
→ HybridClient method
→ BaseHybridClient.executeHybrid (forwards options OR args.additionalQueries)
→ WebSearchClient method
→ BaseWebSearchClient.executeExaSearch
→ fetch(api.exa.ai) with top-level additionalQueries

Tests: 11 new (7 e2e + 4 fallback), all pass. 10/10 live API shapes pass (was 7/7). Zero regressions vs. baseline.

Default OFF; additive contract preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
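The last hop of the plumbing chain above can be pictured with a minimal sketch. It is not the project's code: `buildExaSearchBody` and `resolveAdditionalQueries` are hypothetical helpers, and the `=== "true"` flag check is an assumption based on the `EXA_ADDITIONAL_QUERIES=true` setting mentioned in this PR. Only the `additionalQueries` field name and the flag name come from the commit message.

```javascript
// Hypothetical helper names; only `additionalQueries` and the
// EXA_ADDITIONAL_QUERIES flag name appear in the commit message.

// Mirrors "forwards options OR args.additionalQueries" from the layer list.
function resolveAdditionalQueries(options = {}, args = {}) {
  return options.additionalQueries ?? args.additionalQueries;
}

// Builds the JSON body for the search request. Variations are attached
// at the top level only when the flag is on AND the caller supplied some,
// so the default-OFF path produces the same body as before.
function buildExaSearchBody(query, options = {}, env = process.env) {
  const body = { query };
  const flagOn = env.EXA_ADDITIONAL_QUERIES === "true"; // assumed flag format
  const extra = options.additionalQueries;
  if (flagOn && Array.isArray(extra) && extra.length > 0) {
    body.additionalQueries = extra;
  }
  return body;
}
```

With the flag unset, `buildExaSearchBody("q", { additionalQueries: ["a"] }, {})` yields a body with only `query`, which is the additive contract the commit describes.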
Exercises search_sec_filings, search_cases, search_opinions, search_federal_register against the real Exa API with EXA_ADDITIONAL_QUERIES=true. Stubs native clients to force websearch fallback, intercepts /search vs /contents endpoints separately, asserts additionalQueries forwarded only on /search calls. All 4 tools pass: 3/3 variations forwarded, type:deep, hybrid_source:web_search_fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies Sonnet 4.6 populates `additionalQueries` from inputSchema description alone (no subagent-prompt updates). Submits each of the 4 covered tool defs to Anthropic Messages API with realistic prompts; tallies adoption rate across 24 trials (bare + nudged × 3 repeats × 4 tools). Result: 12/12 bare (100%), avg 2.9 variations; 12/12 nudged (100%), avg 3.0. All variations are axis-distinct (doctrine/jurisdiction/CFR-section/etc), NOT paraphrases — schema descriptions work as designed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two empirical findings from the LLM adoption test (24 trials, Sonnet 4.6):
- Variation-1 often paraphrased the primary query (~50% of trials), even though descriptions said "NOT paraphrases"; the rule lacked a worked example to anchor the pattern
- No telemetry surfaced when orchestrator authored low-quality (paraphrase-style) variations

Changes:
- 5 inputSchema descriptions rewritten with WORKED EXAMPLE blocks (GOOD vs BAD variations) and concrete axis menus per domain (search_cases, search_opinions, search_sec_filings, search_federal_register, exa_web_search)
- computeDistinctness() + warnOnLowDistinctness() in exaQueryValidator.js: Jaccard similarity check, logs warning when variation has >0.5 token overlap with primary; tokenization preserves § for legal citations
- Wired into BaseWebSearchClient.executeExaSearch + toolImplementations exa_web_search forwarding paths
- 6 new unit tests for distinctness scoring

All 93 Exa-suite tests pass (was 87). Zero regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
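For readers unfamiliar with the Jaccard check described above, here is a rough sketch of how such a distinctness score could be computed. The function names `tokenize`, `jaccard` and `findLowDistinctness` are illustrative and are not the `computeDistinctness()` / `warnOnLowDistinctness()` implementations; the tokenizer regex is an assumption that simply keeps letters, digits and `§`.

```javascript
// Assumed tokenizer: lowercase, split on anything that is not a letter,
// a digit, or the section sign used in legal citations.
function tokenize(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9§]+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a, b) {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  if (tokensA.size === 0 && tokensB.size === 0) return 0;
  let shared = 0;
  for (const token of tokensA) {
    if (tokensB.has(token)) shared++;
  }
  return shared / (tokensA.size + tokensB.size - shared);
}

// Returns the variations whose overlap with the primary query exceeds
// the threshold (0.5 in the commit message), i.e. likely paraphrases.
function findLowDistinctness(primary, variations, threshold = 0.5) {
  return variations
    .map((variation) => ({ variation, overlap: jaccard(primary, variation) }))
    .filter((entry) => entry.overlap > threshold);
}
```

Under these assumptions, a reordered copy of the primary query with one extra word scores high and is flagged, while a variation on a different doctrinal axis shares no tokens and passes.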
…list

Distinct from llm-additional-queries-adoption.mjs (bare API + forced tool_choice + isolated tool). This rig loads each production subagent's real system prompt (40-50K chars) and the full 134-tool list, lets Sonnet 4.6 pick freely.

Findings (24 trials, claude-sonnet-4-6):
- securities-researcher: 3 A3 calls, 0 populated additionalQueries (0%)
- case-law-analyst: 6 A3 calls, 4 populated additionalQueries (67%)
- regulatory-rulemaking-analyst: 0 A3 calls (chose non-A3 tools)
- Overall: 4/9 = 44% adoption, vs. 100% in isolated test

Implication: production adoption will likely be 30-60%, not 100%. Schema descriptions get diluted by dense system prompts + tool-selection cognitive load. Subagent prompt updates would lift this; A/B sampling (PR #110) becomes essential to measure the smaller quality-lift signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the production-realistic adoption gap surfaced by PR #109. Bare API test showed 100% schema-only adoption; realistic test (real subagent prompts + 134-tool list) showed 44%. Dense subagent prompts dilute the inputSchema signal, so prompt-level reinforcement was needed.

Result: 93% adoption (14/15) across 2 reproducible runs, up from 44%. Variation quality also improved: model generalized worked-example axis pattern to new domains (e.g., '40 CFR Part 98 Subpart W' for EPA queries not in any worked example).

Added:
- EXA_ADDITIONAL_QUERIES_GUIDANCE shared constant (~600 tokens)
- Integrated into 25 A3-relevant subagents (research-tier; memo/QA agents excluded as they don't author tool calls)
- 27 unit tests guarding the integration
- EXA_ADDITIONAL_QUERIES_AB_SAMPLE numeric feature flag (0.0-1.0)
- 5 Prometheus A/B metrics registered (sampling logic comes in PR #110)

Token cost: ~70K input tokens per memo (<0.5% bloat) for +49pp adoption lift.

120/120 Exa-suite tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
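The sampling logic itself ships in a later PR, so the following is only a guess at how a numeric 0.0-1.0 flag such as EXA_ADDITIONAL_QUERIES_AB_SAMPLE might be consumed. `parseSampleRate` and `assignArm` are hypothetical names, treating an unset or malformed value as 0 is an assumption, and so is reading the rate as the fraction of calls placed in the treatment arm.

```javascript
// Hypothetical: parse a 0.0-1.0 flag value, clamping out-of-range input.
// Unset or malformed values fall back to 0 (assumption: no sampling).
function parseSampleRate(raw) {
  const value = Number.parseFloat(raw);
  if (!Number.isFinite(value)) return 0;
  return Math.min(1, Math.max(0, value));
}

// Hypothetical: assign a call to an arm. The random source is injectable
// so the decision can be made deterministic in unit tests.
function assignArm(rate, rng = Math.random) {
  return rng() < rate ? "treatment" : "control";
}
```

Injecting the random source keeps arm assignment testable without mocking globals; for example `assignArm(0.5, () => 0.4)` deterministically selects the treatment arm.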
Extends the A3 plumbing pattern from 4 originally-covered tools to 10 more high-traffic per-domain tools, raising A/B-test eligible population from ~30% → ~65-70% of typical memo tool calls.

Tools covered:
- search_clinical_trials (ClinicalTrials)
- search_congressional_record (CongressGov)
- search_patents (USPTO)
- search_epa_facilities, search_epa_violations (EPA)
- search_fda_recalls, search_fda_510k (FDA)
- search_cpsc_recalls (CPSC)
- search_federal_contracts (SAMGov)
- search_ptab_proceedings (PTAB)

Per-tool changes (uniform pattern):
1. inputSchema: additionalQueries field with domain-specific axis menu + GOOD/BAD worked examples (clinical trials → phase/intervention; patents → CPC/35-USC; EPA → CFR program/statute; FDA → recall class; CPSC → hazard type/ASTM; SAM.gov → NAICS/contract vehicle; PTAB → 35-USC § 311/Fintiv)
2. WebSearchClient method: destructure + spread to executeExaSearch options
3. toolImplementations.js: search_patents wrapper required explicit forwarding (strips args); other 9 pass args verbatim

Tests:
- 30 new unit tests in exa-additional-queries-coverage-extension.test.js (3 per tool: flag-ON forwarding, flag-OFF zero-degradation, omit-by-caller)
- 5 new live API verification shapes; all 15/15 live shapes pass
- Cumulative Exa-suite: 150/150 (was 120/120)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec for refactoring A3 cross-cutting plumbing from per-tool duplication into a composable augmentor pipeline. Validated against 3 explore agents covering: tool-definition consumers (10+ readers), WebSearchClient decorator compatibility (all 10 methods safe), and subagent prompt loading lifecycle (modular path is production via MODULAR_SUBAGENTS=true).

Key findings:
- Pure refactor, zero new dependencies
- 6 acceptance gates (snapshot equivalence, test parity, live API, adoption ≥94.5%, boot perf <50ms, reversibility)
- 5-day phased migration with rollback at each phase
- Legacy monolithic path (legalSubagents.js, 15,605 lines) discovered; recommendation to deprecate, not migrate
- Trim regression (securities-researcher 80%→0%) must be reverted before refactor begins

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated augmentor refactor spec with findings from 4 additional explore agents: legacy file audit, serialization invariants, test infrastructure, CI/build pipeline.

Critical findings altering the plan:
- Legacy legalSubagents.js (15,605 lines) cannot be deprecated: has memo-integration-agent uniquely + 3 test imports + non-test references in domainMcpServers/agentClassifications/hookSSEBridge. Explicitly OUT OF SCOPE for this refactor.
- Day 4 (prompt centralization) REMOVED: would break 27 tests in exa-prompt-guidance.test.js asserting prompt string membership.
- additionalQueries property MUST be placed last in inputSchema.properties to preserve Anthropic prompt cache key.
- 'required' array order MUST be preserved (toEqual is order-sensitive).
- New Day 4: eager schema validation in bootstrap.js to catch augmentor errors at deploy time vs first MCP call.

Round 2 confirmed: zero new dependencies, no build step (Docker copies source as-is), no TypeScript, no ESLint, no snapshot tests, no hashing/etag of tool definitions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks the architectural decision: refactor merges as new default with no USE_AUGMENTOR_PIPELINE flag. EXA_ADDITIONAL_QUERIES remains the only A3-related gate (controls value forwarding to Exa, not refactor activation). Rollback via git revert + redeploy (~10-15 min recovery).

Updates:
- Section 7 (Out of scope): explicitly excludes new feature flag
- Section 9 (Rollback): git-revert-based path emphasized, no flag fallback
- Section 11 Q9: resolved

Acceptance gates (Gate 1 byte-equivalence in particular) catch regressions at PR review, not in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion)

The schema trim (committed inadvertently in 103196a alongside spec update) caused securities-researcher adoption to drop from 80% to 0% in the realistic LLM test. Restoring all 15 schemas to the v7.4.0 byte-identical state.

This is the baseline state required before the augmentor refactor begins (per refactor spec §11 Q2): the augmentor must produce byte-equivalent output to the un-trimmed schemas to pass Gate 1 (snapshot equivalence).

Tests: 150/150 Exa-suite pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Day 1 of augmentor refactor (per spec). Pure additive: no consumer behavior change, no toolDefinitions edit yet.

Added:
- src/tools/augmentors/_engine.js: pipeline runner (applyAugmentors, applyMethodDecorators); pure functions, idempotent, order-preserving
- src/tools/augmentors/exaAdditionalQueries.js: A3 augmentor with byte-identical descriptions for all 15 tools (extracted from pre-refactor v7.4.0 baseline)
- test/sdk/exa-augmentor-snapshot.test.js: 48 tests (Gate 1)

Gate 1 status: PASSING. The augmentor produces JSON-byte-identical inputSchema for all 15 A3 tools when applied to a synthesized "raw" tool (without additionalQueries field, with traits declared).

Test results:
- 48/48 snapshot tests pass (byte-equivalence + property order + required-array order)
- 150/150 existing Exa-suite tests pass unchanged
- Cumulative: 198/198 tests across 9 suites

The augmentor is currently DARK CODE: it exists in the module tree but is not yet wired into toolDefinitions.js exports. Day 2 wires it in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the A3 augmentor into toolDefinitions.js exports. Each of the 15
A3-eligible tools now declares `traits: ['exa-routable', 'domain:X']`;
inline additionalQueries blocks removed (~13K chars deduplicated).
Each affected export array is wrapped: applyAugmentors(_rawXxx,
A3_AUGMENTORS) — augmentor injects additionalQueries into inputSchema.
Behavior verified byte-equivalent via snapshot tests (Gate 1, 49/49).
Changes:
- src/tools/toolDefinitions.js:
  - Added imports for augmentor pipeline + A3 augmentor
  - Added traits declarations to 15 tools
  - Removed 15 inline additionalQueries schema blocks
  - Renamed 12 export arrays to _rawXxx (private), re-exported under
    original name via applyAugmentors wrapper
- src/tools/augmentors/_engine.js:
  - Added stripInternalMetadata() to strip `traits` field from output
    (prevents metadata leak to MCP wire format)
- test/sdk/exa-augmentor-snapshot.test.js:
  - +1 test: verifies traits never appears in augmented output
Test results:
- 199/199 Exa-suite + augmentor tests pass (was 198 + 1 new test)
- 0 modifications to existing tests required (Gate 2 satisfied)
- code-execution-bridge.test.js failure pre-existing (Jest dynamic import
race), unrelated to this refactor
Net LoC: ~−180 lines in toolDefinitions.js (descriptions deduplicated)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
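As a reading aid for the Day 1 and Day 2 commits, the trait-driven pipeline could look roughly like the sketch below. It reuses the names `applyAugmentors` and `stripInternalMetadata` from the commit messages, but the bodies are assumptions written for illustration, `exaAdditionalQueriesAugmentor` is a hypothetical name, and the `DOMAIN_DESCRIPTIONS` text is placeholder content rather than the real axis menus.

```javascript
// Placeholder description text; the real per-domain menus are not shown here.
const DOMAIN_DESCRIPTIONS = {
  securities: "Axis-distinct query variations (placeholder text).",
};

// Hypothetical augmentor: appends additionalQueries as the LAST property
// of any tool that declares the 'exa-routable' trait. Returns a new object
// and leaves the raw definition untouched.
function exaAdditionalQueriesAugmentor(tool) {
  const traits = tool.traits ?? [];
  if (!traits.includes("exa-routable")) return tool;
  if (tool.inputSchema.properties.additionalQueries) return tool; // already applied
  const domainTrait = traits.find((t) => t.startsWith("domain:"));
  const domain = domainTrait ? domainTrait.slice("domain:".length) : undefined;
  return {
    ...tool,
    inputSchema: {
      ...tool.inputSchema, // key order (type, properties, required) is kept
      properties: {
        ...tool.inputSchema.properties,
        additionalQueries: {
          type: "array",
          items: { type: "string" },
          description: DOMAIN_DESCRIPTIONS[domain] ?? "Axis-distinct query variations.",
        },
      },
    },
  };
}

// Drops the internal `traits` field so it never reaches the MCP wire format.
function stripInternalMetadata(tool) {
  const { traits, ...wireTool } = tool;
  return wireTool;
}

// Runs every augmentor over every raw tool, then strips internal metadata.
function applyAugmentors(rawTools, augmentors) {
  return rawTools.map((tool) =>
    stripInternalMetadata(augmentors.reduce((current, augment) => augment(current), tool))
  );
}
```

Because object spread preserves insertion order, existing schema keys stay where they were and the new property lands last, which is the ordering the spec requires for byte-equivalent `JSON.stringify` output.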
Documents the Day 1+2 refactor as v7.5.0. Captures all 6 acceptance gates, byte-equivalence proof (2341 bytes pre==post), and gate-4 caveat (87% cumulative adoption across 3 runs vs 94.5% pre-refactor — sampling variance not regression, proven by snapshot equivalence). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Number531 added a commit that referenced this pull request on May 9, 2026
First-use of the augmentor pipeline (PR #108) for coverage extension. Adds A3 additionalQueries plumbing to 5 high-traffic legal-research tools.

Tools added:
- lookup_citation (domain:case-law)
- search_judges (domain:judges, NEW axis menu)
- search_sec_filings_fulltext (domain:securities)
- search_federal_register_notices (domain:federal-register)
- search_fda_warning_letters (domain:fda-warning-letter, NEW axis menu)

Per-tool effort dropped from ~80 LoC pre-refactor to ~19 LoC per tool (trait declaration + WebSearchClient destructure + spread). Existing domains reused from augmentor's DOMAIN_DESCRIPTIONS; only 2 new axis menus authored (judges, fda-warning-letter).

A3 coverage: 15 tools → 20 tools (~30% population increase). Per-memo coverage estimated ~75-80% (was ~65-70%).

Tests:
- 64/64 augmentor snapshot tests (was 49, +15 new)
- 214/214 cumulative Exa-suite tests
- 20/20 live API verification shapes accepted (was 15)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 9, 2026
Number531 added a commit that referenced this pull request on May 10, 2026
…112)

Top-level CHANGELOG was missing the Exa A3 follow-up wave shipped between 2026-05-08 and 2026-05-09. Adds a comprehensive [Unreleased] section above the v7.1.0 entry covering:
- v7.2.0 (PR #107): orchestrator-authored variations through exa_web_search; shared validator extracted; Track A audit reversal documented
- v7.3.0 → v7.5.0 (PR #108): per-domain plumbing for 4 tools, schema rewrite with Jaccard distinctness telemetry, LLM adoption test harness (44% real vs. 100% isolated finding), augmentor refactor (~80 LoC → ~19 LoC per tool)
- v7.5.1 (PR #109): coverage expansion 15 → 20 tools using the augmentor
- v7.6.0 (PR #110): EXA_ADDITIONAL_QUERIES_AB_SAMPLE flag for quality-lift measurement, 4 outcome metrics per arm, 7 unit tests
- PR #111: api-integration + subagent-scaffold templates auto-inherit A3
- PR #112: exa-a3-ab-staging.md runbook (440 lines, decision tree, failure modes, metrics reference)

Cumulative: A3-enabled tools 0 → 20, Exa-suite tests 130 → 221, live API shapes 5 → 20, +2 flags, +6 metrics, +3 shared modules. All flag-gated; zero production behavior change until operator opts-in via staging A/B.

Per-project CHANGELOG already documents each version individually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Number531 added a commit that referenced this pull request on May 12, 2026
Both docs in pending-updates/ described work that fully shipped: - exa-a3-augmentor-refactor-spec.md → v7.5.0 via PR #108 (2026-05-09) - exa-a3-improvements-plan.md → v7.3.0 → v7.6.2 via PRs #108-#115 Production all-treatment in flags.env since 2026-05-11. EXA_WEB_TOOLS=true graduated 2026-05-12 (Issue #41 closed). G5 observability (T1+T2+v6.8.7.1) shipped same day via PRs #122/#124/#125/#126/#127/#128. Both docs retained in pending-updates/ for architectural + implementation reference; original "Draft" / "Active" status now annotated with SHIPPED marker and PR cross-references. No functional change; zero risk.
Number531 added a commit that referenced this pull request on May 25, 2026
…xBareChartRefs + promptEnhancer defensive (Task #107)

Three coupled source fixes that complete the chart-path rendering pipeline introduced by Phase 4.13 v1.6-polish Task #99 (chart path guidance in _promptConstants.js). Without these, the canonical ../charts/<name>.png reference convention would break in PDF/DOCX generation.

EMPIRICALLY VALIDATED by 2026-05-25 PLTR canary (session 2026-05-25-1779733982) which generated a 77,596-byte research-plan.pdf with 8/8 chart references rendered correctly.

FIX 1: src/utils/documentConverter.js (+19 LOC)
Add cwd: options.resourcePath to pandoc execFileAsync calls in convertToDocx + convertToPdf. Pandoc's --resource-path flag IS honored by the native pandoc writers (incl. DOCX) but NOT by the typst PDF backend: typst resolves image paths relative to its own working directory, not pandoc's. Without cwd override, every chart-bearing PDF fails with "file not found (searched at /app/charts/chart_xxx.png)". The cwd override is conditional (only when options.resourcePath is provided), preserving backward compatibility with callers that don't pass it. The DOCX path mirrors the PDF fix defensively (DOCX writer honors --resource-path natively but cwd override keeps both paths consistent and survives writer changes).

FIX 2: src/utils/markdownNormalizer.js (+48 LOC)
Add fixBareChartRefs self-healing transform that detects bare references (image links that give only the filename) and rewrites them to the canonical ../charts/<name>.png form when the file exists in a sibling charts/ directory. Defensive against subagent prompt-adherence drift: Task #99 prompt guidance instructs canonical ../charts/<name>.png, but if a subagent drifts to a bare filename, this catches and fixes the reference before downstream conversion fails. Implementation walks up the directory tree 4 levels max looking for a sibling charts/ directory. Only rewrites when existsSync() confirms the file is actually present in charts/. Multiple image formats supported (png, jpg, jpeg, gif, webp, svg). Runs FIRST in the normalizeForPandoc transformation pipeline (before stripVerificationTags, footnote conversion, etc.) so downstream transforms see the corrected paths.

FIX 3: src/server/promptEnhancer.js (+2 LOC defensive)
The promptEnhancer.js calls at L360, L364 invoke convertToPdf/convertToDocx WITHOUT resourcePath. All other production callers (convertSession, convertSessionToDocuments, /api/convert/* routes) pass it correctly. This caller is the ONE unsafe site: enhancement-generated markdown typically doesn't contain chart references, but the defensive fix ensures we never silently break chart rendering in any conversion path. Both Promise.all calls now pass { resourcePath: fullSessionPath }, matching the documentConverter.js:751-752 batch flow pattern.

ARCHITECTURAL COMPLETENESS QUARTET
With these three fixes shipped, the chart-path convention is end-to-end:
1. Bridge writes to canonical <session>/charts/ (codeExecutionBridge.js:342, pre-existing)
2. Prompt tells subagent to reference as ../charts/<name>.png (_promptConstants.js Step 2.1, shipped in Task #99)
3. PDF/DOCX rendering honors the reference (documentConverter.js cwd fix, THIS COMMIT)
4. Self-healing rewrite if subagent drifts to bare filename (markdownNormalizer.js fixBareChartRefs, THIS COMMIT)

Layers 1+2 without 3+4 would mean reports render correctly in raw markdown viewers but break in PDF/DOCX generation. Layers 3+4 close the rendering loop.

VERIFICATION
- node --check clean on all 3 modified files
- Empirical baseline: 2026-05-25 canary PDF (77,596 bytes) with 8 chart refs
- Layer 1 unit tests + Layer 2 integration tests + Layer 3 smoke shipping in companion commits (B + C)

OUT OF SCOPE (deferred to companion commits)
- Observability metrics (Prometheus counters): Commit B (Task #108)
- Test coverage (Layer 1/2/3 pyramid): Commit C (Task #109)
- CI pandoc/typst install: separate infrastructure PR

Plan: docs/pending-updates/Chart-Path-Rendering-Completeness-Plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
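To make the self-healing rewrite in FIX 2 concrete, here is a loose sketch of the idea, not the shipped `fixBareChartRefs`. Several things are assumptions: the image-link regex, the POSIX-style path handling, the injected `fileExists` predicate (standing in for `existsSync()` so the sketch needs no imports), and the choice to emit one `../` per directory level walked.

```javascript
// Assumed pattern: a markdown image whose target has no "/" and ends in
// one of the image extensions listed in the commit message.
const BARE_IMAGE_REF = /!\[([^\]]*)\]\(([^)\s/]+\.(?:png|jpe?g|gif|webp|svg))\)/gi;

// POSIX-style parent directory, stopping at the root.
function parentDir(dir) {
  const cut = dir.lastIndexOf("/");
  return cut <= 0 ? "/" : dir.slice(0, cut);
}

// Rewrites bare image references to point into a charts/ directory found
// while walking up from the document's directory (4 levels max).
// `fileExists` is an injected predicate; a reference is rewritten only
// when it confirms the chart file is present.
function fixBareChartRefs(markdown, docDir, fileExists, maxLevels = 4) {
  let rewritten = 0;
  const text = markdown.replace(BARE_IMAGE_REF, (match, alt, name) => {
    let dir = docDir;
    let prefix = "";
    for (let level = 0; level < maxLevels; level++) {
      dir = parentDir(dir);
      prefix += "../";
      const candidate = `${dir === "/" ? "" : dir}/charts/${name}`;
      if (fileExists(candidate)) {
        rewritten++;
        return `![${alt}](${prefix}charts/${name})`;
      }
    }
    return match; // leave untouched when no chart file is found
  });
  return { text, rewritten };
}
```

Returning the rewrite count alongside the text mirrors the later observability commit, which counts rewritten references to detect prompt drift.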
Number531 added a commit that referenced this pull request on May 25, 2026
…malization + chart_conversion_duration metrics (Task #108)

Adds two Prometheus metrics to monitor the chart-rendering pipeline post-merge, closing the observability gap identified by the chart-path-rendering-completeness plan.

METRIC 1: chart_path_normalization_total (Counter)
- Labels: { status: 'rewritten' | 'no_op', reason: 'bare_ref_self_healed' | 'no_bare_refs' }
- Fires from: markdownNormalizer.js normalizeForPandoc after fixBareChartRefs
- Bounded cardinality: ~5 series total
- Purpose: detect subagent drift from canonical ../charts/<name>.png pattern despite Task #99 prompt guidance. Non-zero `rewritten` count indicates subagents are writing bare filenames and the self-healing transform is catching/fixing them. Production canary baseline (2026-05-25-1779733982) showed count: 0, so subagents follow the canonical convention correctly. Sustained non-zero in production = signal to investigate prompt drift.

METRIC 2: chart_conversion_duration_ms (Histogram)
- Labels: { format: 'pdf' | 'docx', status: 'ok' | 'error' }
- Buckets: [100, 500, 1000, 2000, 5000, 10000, 30000] ms
- Fires from: documentConverter.js convertToDocx + convertToPdf finish path
- Bounded cardinality: 2 formats × ~3 statuses = 6 series
- Purpose: track pandoc PDF/DOCX latency. Baseline expectation: 1-5s for small docs, 5-30s for chart-heavy reports. Tail-latency alerts surface infrastructure issues (typst container slowdown, large fixture growth, etc.).

EMIT SITES
- markdownNormalizer.js:619-629: after fixBareChartRefs; also logs `[normalizer] fixBareChartRefs rewrote N bare chart ref(s)` to console when count > 0 (operator visibility for drift)
- documentConverter.js:522 (convertToDocx finish): emits both recordDocumentConversion (pre-existing) and recordChartConversion (new)
- documentConverter.js:593 (convertToPdf finish): same dual emit pattern

DEFENSIVE
All emit sites wrapped in try/catch so metrics failures never break conversion. Module-load smoke verified all 4 new exports resolve correctly.

OUT OF SCOPE
- Tests for the metrics themselves: Commit C (Task #109) tests the underlying functions; metric emission is fire-and-forget side effect
- Prometheus dashboard panel: separate infrastructure task

Plan: docs/pending-updates/Chart-Path-Rendering-Completeness-Plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Number531 added a commit that referenced this pull request on May 26, 2026
…ayer 1 tests (Task #35)

Standalone module implementing Anthropic prompt caching for wrapped subagents with the mandatory 4-layer safety pattern (research doc Addendum 3).

NEW FILES:
- src/wrappedSubagents/promptCache.js (~180 LOC):
  * applyCacheBreakpoints(request): identity passthrough when flag off; on flag on, injects cache_control on tools[last] + system[0] (NEVER messages[])
  * prewarmCache(client, baseRequest, opts): tools-only pre-warm with 5s timeout + AbortController; returns {success}/{skipped}, never throws
  * extractCacheMetrics(response): pure function reading cache_*_input_tokens with ?? fallbacks; cannot throw on unexpected response shape
  * Reads process.env.PROMPT_CACHE_WRAPPED_SUBAGENTS directly (matches _standardTools.js testability pattern from Task #96; not cached via featureFlags)
- test/sdk/wrappedSubagents/promptCache.test.js (~280 LOC, 27 tests):
  * Layer 1 unit tests covering flag-off identity passthrough (toBe, not toEqual), breakpoint placement on tools[last] + system[0], messages[] NEVER touched (compaction-safety invariant), fallback on internal throw, prewarm graceful failure modes, extractCacheMetrics safety with all malformed input shapes

MODIFIED:
- src/config/featureFlags.js: adds PROMPT_CACHE_WRAPPED_SUBAGENTS env-gated flag (default false) after SERVER_SIDE_COMPACTION at L212
- src/utils/sdkMetrics.js: adds 6 prom-client Counter/Gauge registrations + recordCacheMetrics helper after recordChartConversion (Task #108 pattern). Counters: prompt_cache_creation_tokens_total, prompt_cache_read_tokens_total, prompt_cache_fallback_total, prompt_cache_prewarm_failure_total, prompt_cache_api_rejection_total. Gauge: prompt_cache_hit_rate. Pulled forward from Day 3 since promptCache.js imports the counters at module load.

VERIFICATION:
- node --check clean on all touched JS files
- 27 promptCache.test.js tests pass in 101ms
- 902/904 tests across wrappedSubagents/hooks/config (2 pre-existing Task #67 OPUS_MODEL failures unchanged)
- Identity passthrough verified via toBe (not toEqual) for flag-off path

ZERO PRODUCTION BEHAVIOR CHANGE: module is loadable but inert until PROMPT_CACHE_WRAPPED_SUBAGENTS=true is set. Day 2 wires runner.js integration (4 sites: prewarm at SubagentStart, applyCacheBreakpoints before streamWithRetry, extractCacheMetrics after accumulateUsage, auto-retry on Anthropic 400 with cache_control rejection).

Plan: /Users/ej/.claude/plans/hidden-herding-pnueli.md
Research: docs/pending-updates/Wrapped-Subagent-Caching-Feasibility-2026-05-26.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
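The breakpoint placement and metric extraction described in this commit can be sketched as follows. This is an illustration under assumptions, not the contents of promptCache.js: the function names come from the commit message, but the bodies are guesses, the `=== "true"` flag check is assumed, and `system` is assumed to be an array of content blocks (a plain string is left untouched). The `cache_control: { type: "ephemeral" }` marker and the `cache_creation_input_tokens` / `cache_read_input_tokens` usage fields are part of the Anthropic Messages API.

```javascript
// Sketch: returns the SAME request object when the flag is off (identity
// passthrough); otherwise returns a shallow copy with cache_control set on
// the last tool and the first system block. messages is never modified.
function applyCacheBreakpoints(request, env = process.env) {
  if (env.PROMPT_CACHE_WRAPPED_SUBAGENTS !== "true") return request;
  try {
    const marker = { type: "ephemeral" };
    const out = { ...request };
    if (Array.isArray(request.tools) && request.tools.length > 0) {
      const last = request.tools.length - 1;
      out.tools = request.tools.map((tool, i) =>
        i === last ? { ...tool, cache_control: marker } : tool
      );
    }
    if (Array.isArray(request.system) && request.system.length > 0) {
      out.system = request.system.map((block, i) =>
        i === 0 ? { ...block, cache_control: marker } : block
      );
    }
    return out;
  } catch {
    return request; // fall back to the uncached request on any internal error
  }
}

// Sketch: reads the cache token counters with ?? fallbacks so a missing
// or malformed response never throws.
function extractCacheMetrics(response) {
  const usage = response?.usage;
  return {
    cacheCreationTokens: usage?.cache_creation_input_tokens ?? 0,
    cacheReadTokens: usage?.cache_read_input_tokens ?? 0,
  };
}
```

Returning the original object reference when the flag is off is what allows a reference-equality assertion (`toBe`) rather than a deep-equality one, as the verification notes mention.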
Summary
This PR delivers the complete Exa A3 (additionalQueries) Phase A pipeline across versions v7.3.0 through v7.5.0, culminating in the architectural augmentor refactor. All 14 commits are validated end-to-end with byte-equivalence proofs.

Versions in this PR
4811b1ad0281d0e20d0442b515ab89f15f4fc706+2c48af4d

Empirical results
LLM adoption
Variation quality
100% of populated calls produce axis-distinct legal-domain variations (no paraphrases). Examples generated by the model:
- Aronson demand futility test / Caremark oversight liability / Revlon enhanced scrutiny (case law)
- § 17(a) restatement disclosure / CFR Item 503 risk factors / 8-K Item 4.02 non-reliance (securities)
- 40 CFR Part 98 Subpart W / Clean Air Act § 114 (federal register; generated for queries NOT in any worked example)

v7.5.0 augmentor refactor: architectural
Pure structural refactor — zero behavior change verified by byte-equivalence:
- JSON.stringify(searchSecFilingsTool) produces identical 2,341-byte output pre/post refactor
- additionalQueries remains LAST in all 15 tool schemas (Anthropic prompt cache key invariance)
- required array order preserved

Why refactor: at 15 A3-eligible tools today, extending to the next 40-50 tools would replicate the same 4-file pattern 50+ times (~3,000 LoC of duplicated boilerplate). The augmentor pipeline reduces this to 1 line per new tool (traits: ['exa-routable', 'domain:X']).

Acceptance gates
git revert on Day 1 or Day 2 commit

Test plan
Out of scope
- src/config/legalSubagents.js deprecation (15,605 lines monolithic): explicitly preserved unchanged. Round 2 audit revealed structural divergences requiring coordinated cleanup; deferred.
- … git revert + redeploy.

Spec & references
- docs/pending-updates/exa-a3-augmentor-refactor-spec.md: 580 lines validated by 7 explore agents
- docs/pending-updates/exa-a3-improvements-plan.md

🤖 Generated with Claude Code