44 changes: 44 additions & 0 deletions super-legal-mcp-refactored/CHANGELOG.md
@@ -4,6 +4,50 @@ All notable changes to the Super Legal MCP Server are documented in this file.

## [Unreleased]

### Added — Avenue A v2: Structured-output enforcement on code-execution bridge (PR forthcoming)

Eliminates the "missing envelope on turn 1 → corrective retry on turn 2" pattern observed in production renders (PR #134 L4 logs: 1-of-5 phases retried). Adds Anthropic's `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement on the model's text-block output, validated empirically against the `code_execution_20260120` tool + streaming + `pause_turn` continuations.

**Two envelope schema variants** (`src/tools/codeExecutionBridge.js`):
- `ENVELOPE_SCHEMA_GENERAL` — non-xlsx callers (MCP gateway, Agent SDK subagents, `run_python_analysis`): enforces `{ success, analysis, data, charts, stderr }` shape on text output. L5 verified: turn_count=1, `envelope_source='text'`, output_config-enforced JSON in text block.
- `ENVELOPE_SCHEMA_XLSX` (Option A — audit-only): enforces ONLY audit metadata (`audit_results`, `sheets`, `phase_sources`, `named_ranges_count`) in text. The b64_xlsx binary payload continues via stdout (legacy path) to avoid the MAX_TOKENS budget cliff that broke the first attempt (text_len=62106 on phase3 LBO). Bridge merges audit-from-text + b64-from-stdout in `selectEnvelopeWithFallback()`.
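
The Option A merge can be pictured with a minimal sketch. The helper name `mergeXlsxEnvelope` and the sample values are hypothetical and exist only for illustration; the bridge's real logic lives in `selectEnvelopeWithFallback()` and records a more detailed source label.

```javascript
// Hypothetical sketch of the xlsx-mode merge: audit metadata from the text
// block overlays the stdout envelope that carries the b64 payload.
function mergeXlsxEnvelope(textEnvelope, stdoutEnvelope) {
  // stdout is required in xlsx mode because it is the only carrier of b64_xlsx.
  if (!stdoutEnvelope) return { envelope: null, source: 'none' };
  if (!textEnvelope) return { envelope: stdoutEnvelope, source: 'stdout' };
  // Shallow merge: text-side keys win, stdout-only keys are preserved.
  return { envelope: { ...stdoutEnvelope, ...textEnvelope }, source: 'merged' };
}

// Illustrative inputs (not taken from a real render).
const fromStdout = {
  b64_xlsx: 'UEsDBA==',
  xlsx_filename: 'deal.xlsx',
  audit_results: { status: 'UNKNOWN' },
};
const fromText = {
  audit_results: { status: 'PASS' },
  sheets: ['LBO'],
  named_ranges_count: 46,
};

const merged = mergeXlsxEnvelope(fromText, fromStdout);
console.log(merged.source, merged.envelope.audit_results.status, merged.envelope.b64_xlsx);
// prints: merged PASS UEsDBA==
```

Because the spread is shallow, a text-side `audit_results` replaces the stdout-side `audit_results` wholesale rather than being combined key by key.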

**Anthropic structured-output schema constraints discovered empirically (L2/L4)**:
- All `type: 'object'` nodes MUST set `additionalProperties: false` (400 error otherwise)
- `minimum` / `maximum` not supported on numeric types
- `minLength` / `maxLength` likely also unsupported (stripped defensively)
- `enum` works on strings (confirmed)
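
One way to apply these constraints mechanically is a small pre-flight sanitizer. The sketch below is an assumption, not code from this PR: `sanitizeSchema` and `UNSUPPORTED_KEYWORDS` are hypothetical names, and the keyword list simply mirrors the bullets above.

```javascript
// Hypothetical pre-flight sanitizer for a JSON-schema draft.
// Keywords listed above as unsupported (or defensively stripped).
const UNSUPPORTED_KEYWORDS = ['minimum', 'maximum', 'minLength', 'maxLength'];

function sanitizeSchema(node) {
  if (Array.isArray(node)) return node.map(sanitizeSchema);
  if (node === null || typeof node !== 'object') return node;
  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (UNSUPPORTED_KEYWORDS.includes(key)) continue;
    if (key === 'properties' && value && typeof value === 'object') {
      // Property names are data, not keywords: keep every name and
      // sanitize each subschema individually.
      out.properties = Object.fromEntries(
        Object.entries(value).map(([name, sub]) => [name, sanitizeSchema(sub)])
      );
    } else {
      out[key] = sanitizeSchema(value);
    }
  }
  // Every object node must explicitly forbid additional properties.
  if (out.type === 'object') out.additionalProperties = false;
  return out;
}

// Illustrative draft schema (not one of the bridge's schemas).
const draft = {
  type: 'object',
  properties: {
    status: { type: 'string', enum: ['PASS', 'FAIL'], minLength: 1 },
    count: { type: 'integer', minimum: 0 },
    checks: { type: 'object' },
  },
};
const clean = sanitizeSchema(draft);
console.log(JSON.stringify(clean, null, 2));
```

The function returns a new object and leaves the draft untouched, so the original schema stays available for prompt-level documentation.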

**Feature flag**: `STRUCTURED_OUTPUT_ENFORCEMENT` (default `false`). Flag off → byte-identical to pre-Avenue-A-v2 (verified: L3 flag-off + flag-on both 197/0/2). Flag on → API-level JSON enforcement engages.
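
The flag-off guarantee follows from how the parameter is attached: a conditional spread adds no key at all when the flag is false. A minimal sketch, assuming a hypothetical helper `buildRequestParams` and a placeholder model name (the bridge inlines the spread at its call sites instead):

```javascript
// Hypothetical helper showing the conditional-spread pattern.
function buildRequestParams(baseParams, flagOn, schema) {
  return {
    ...baseParams,
    // When flagOn is false this spreads an empty object, so no
    // `output_config` key exists on the result (not even `undefined`).
    ...(flagOn ? { output_config: { format: schema } } : {}),
  };
}

// Placeholder values for illustration only.
const base = { model: 'example-model', max_tokens: 1024 };
const schema = {
  type: 'json_schema',
  schema: { type: 'object', properties: {}, additionalProperties: false },
};

const off = buildRequestParams(base, false, schema);
const on = buildRequestParams(base, true, schema);

console.log('output_config' in off); // prints: false
console.log(on.output_config.format.type); // prints: json_schema
```

With the flag off, the serialized request is identical to the one built without the helper, which is the property the flag-off test run relies on.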

**Cross-caller benefit** (verified via Phase 1 Explore): the bridge is shared infrastructure. The fix benefits:
- All 5 xlsx templates (full-deal-workbook, lbo-focused, valuation-only, tax-memo-workbook, session-models) via `src/utils/xlsxRenderer/index.js:208` (single-turn) + `multiTurnOrchestrator.js:105` (multi-turn)
- MCP tool gateway (`run_python_analysis` via `toolImplementations.js:958`) when caller provides `constraints.output_format`
- Agent SDK subagents (tax analyst, antitrust analyst, etc.) via `agentSdkToolAdapter.js:167`
- NOT affected: Skills' parallel native path at `legacyStreamHandler.js:78` (registers `code_execution_20250825` directly, bypasses bridge)

**L4 empirical comparison** (full-deal-workbook live renders, paired against same seeded session inputs):

| | Baseline (flag OFF) | Treatment v1 (full schema) | **Option A (audit-only schema)** |
|---|---|---|---|
| `success` | true | **false** (phase3 LBO failed at max_tokens cliff) | ✅ **true** |
| Wall time | 8.9 min (532s) | 19.2 min (1151s) | ✅ **7.8 min (465s)** |
| Phases delivered | 5/5 | 4/5 | ✅ **5/5** |
| MAX_TOKENS hits | 0 | 1 | ✅ **0** |
| Named ranges | 32 | n/a | ✅ **46** |

Option A matches baseline reliability + adds API-level structured-output enforcement on the audit envelope. Per-phase wall times match baseline (no b64-in-text verbose overhead).

**Files modified**:
- `src/tools/codeExecutionBridge.js` — schemas, output_config injection (2 call sites: 572, 603), extractResults (parsed_output extraction), selectEnvelopeWithFallback (priority + merge logic)
- `src/config/featureFlags.js` — STRUCTURED_OUTPUT_ENFORCEMENT flag (default false)
- `flags.env` — STRUCTURED_OUTPUT_ENFORCEMENT=false (prod default)
- `test/sdk/code-execution-bridge.test.js` — 3 new Avenue A v2 source-level assertions (10 assertions total)

**Test suite**: 197/0/2 maintained on both flag states. Bridge tests gain 3 new Avenue A v2 source-level assertions (10 assertions total).

**Rollback**: trivial flag flip (set STRUCTURED_OUTPUT_ENFORCEMENT=false in flags.env, redeploy). No schema/migration/data implications.

### Changed — Avenue B Phase 1: `full-deal-workbook` sensitivity isolation (Issue #100, PR forthcoming)

The `full-deal-workbook` template's phase split has been rebalanced. Phase count is **unchanged at 5** — only the sheet routing within `phase4` and `phase5` changes:
5 changes: 5 additions & 0 deletions super-legal-mcp-refactored/flags.env
@@ -15,6 +15,11 @@ FILES_API_CHART_EXTRACTION=true
CHART_PERSISTENCE=true
# XLSX renderer — Phase 1B foundation (default false; flip after Phase 0 probes pass + staging validation)
XLSX_RENDERER=false
# Avenue A v2 — code-execution-bridge envelope JSON schema enforcement via output_config.
# Default false (existing prompt-level + corrective-retry path). When true, bridge
# enforces envelope shape at the Anthropic API level via output_config + parsed_output
# extraction. Target: eliminate the ~80% turn-1-envelope-miss rate.
STRUCTURED_OUTPUT_ENFORCEMENT=false
# Phase 7 — operational caps (per-process; multi-pod multiplies)
XLSX_RENDER_CONCURRENCY=10
XLSX_RENDER_MAX_QUEUE=50
9 changes: 9 additions & 0 deletions super-legal-mcp-refactored/src/config/featureFlags.js
@@ -45,6 +45,15 @@ export const featureFlags = {
// Default false; flip true after Phase 0 probes pass and staging validation completes.
// Plan: docs/pending-updates/excel-code-execution.md v4.5
XLSX_RENDERER: envBool(process.env.XLSX_RENDERER, false),
// Avenue A v2 (Issue forthcoming) — enforce envelope JSON shape at the Anthropic
// API level via output_config: { format: { type: 'json_schema', schema: {...} } }.
// When off (default), bridge uses existing prompt-level + corrective-retry path.
// When on, code-execution-bridge passes output_config to the API call and updates
// extractResults to read parsed_output / final text block before falling back to
// stdout. Target: eliminate the ~80% turn-1-envelope-miss rate observed in L4
// live render logs (PR #134 validation).
// Plan: /Users/ej/.claude/plans/glittery-toasting-stardust.md
STRUCTURED_OUTPUT_ENFORCEMENT: envBool(process.env.STRUCTURED_OUTPUT_ENFORCEMENT, false),
// Files API chart extraction — native PNG retrieval via file_id references
// Requires files-api-2025-04-14 beta header. Falls back to base64-through-stdout when off.
FILES_API_CHART_EXTRACTION: envBool(process.env.FILES_API_CHART_EXTRACTION, true),
232 changes: 228 additions & 4 deletions super-legal-mcp-refactored/src/tools/codeExecutionBridge.js
@@ -64,6 +64,133 @@ const MAX_PAUSE_CONTINUATIONS = 5; // Per Anthropic docs: recommended limit for
// both addressed by streaming.
const OVERALL_TIMEOUT_MS = Number(process.env.CODE_EXECUTION_TIMEOUT_MS) || 1_200_000;

// ─── Avenue A v2: Structured-Output Envelope Schemas ────────────────────────
//
// When STRUCTURED_OUTPUT_ENFORCEMENT=true, the bridge passes one of these as
// `output_config.format` to enforce envelope JSON shape at the Anthropic API
// level (not just prompt level). Picks variant based on
// constraints?.output_format?.includes('b64_xlsx') — matches existing detection
// at line 613.
//
// SDK type signature: matches JSONOutputFormat at
// node_modules/@anthropic-ai/sdk/resources/messages/messages.d.ts:506-513
// (only `type: 'json_schema'` is supported in SDK 0.86.1; no `'json_object'` mode).
//
// Plan: /Users/ej/.claude/plans/glittery-toasting-stardust.md "Schema definitions".

// Anthropic structured output requires `additionalProperties: false` on every
// `type: 'object'` node (L2 finding 2026-05-15; the API returns an explicit 400:
// "For 'object' type, 'additionalProperties: true' is not supported. Please set
// 'additionalProperties' to false").
//
// For genuinely variable sub-structures (audit_results.checks varies per phase),
// the sub-object is declared as `{ type: 'object', additionalProperties: false }`
// with NO properties listed, so the model is forced to emit `{}` for that field.
// The phase-specific keys are then expected to live in `data` (top-level
// passthrough fields not validated by output_config) or in a permissive
// `extras` field. NOTE: neither schema below declares an `extras` field yet.
// The model emits the envelope as a STRING in text/parsed_output and the bridge
// re-parses it via tryParseJSON; the schema validates shape only, and the
// downstream JSONB is opaque.
//
// Trade-off: a tighter shape on the envelope contract, a looser one on the
// inner data. Acceptable because the bridge's existing post-extraction
// validation (isCompleteXlsxB64, etc.) handles the actual semantic checks.

// Anthropic structured-output schema constraints (verified empirically via L2/L4):
// - All `type: 'object'` nodes MUST set `additionalProperties: false`
// - `minimum` / `maximum` NOT supported on numeric types
// - `minLength` / `maxLength` likely also NOT supported (stripped defensively)
// - `enum` works on strings (L2 confirmed)
// Schema validates SHAPE only; semantic validation (isCompleteXlsxB64, magic
// bytes, value ranges) happens in downstream code after envelope extraction.
//
// Option A (post-L4 pivot, 2026-05-15): The xlsx schema enforces ONLY the audit
// metadata in the model's text-block output. b64_xlsx (multi-KB binary payload)
// stays in the bash_code_execution_tool_result.stdout path where it has no
// token-budget pressure. This separates the two data classes by their natural
// transport: small structured metadata via text/parsed_output (output_config
// enforced); large binary payload via stdout (legacy path). The bridge merges
// both at the envelope-construction step. L4 prior attempt (treatment 9600202,
// schema with b64_xlsx required in text) hit max_tokens=32K cliff on phase3
// LBO (text_len=62106) → corrupt envelope → render failure. Option A removes
// the b64_xlsx field from text-enforcement; b64 retained in stdout.

const ENVELOPE_SCHEMA_XLSX = {
type: 'json_schema',
schema: {
type: 'object',
required: ['audit_results'],
properties: {
audit_results: {
type: 'object',
required: ['status'],
properties: {
status: { type: 'string', enum: ['PASS', 'FAIL', 'UNKNOWN'] },
checks: { type: 'object', additionalProperties: false },
warnings: { type: 'array', items: { type: 'string' } },
},
additionalProperties: false,
},
phase_sources: {
type: 'array',
items: {
type: 'object',
properties: {
sheet: { type: 'string' },
cell: { type: 'string' },
source_url: { type: 'string' },
citation: { type: 'string' },
},
additionalProperties: false,
},
},
sheets: { type: 'array', items: { type: 'string' } },
named_ranges_count: { type: 'integer' },
},
additionalProperties: false,
},
};

const ENVELOPE_SCHEMA_GENERAL = {
type: 'json_schema',
schema: {
type: 'object',
required: ['success'],
properties: {
success: { type: 'boolean' },
analysis: { type: 'string' },
data: { type: 'object', additionalProperties: false }, // no properties listed, so only `{}` validates here (see the additionalProperties note above)
mean: { type: 'number' }, // common analytical output, hoisted to top-level for accessibility
methodology: { type: 'string' },
charts: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
data: { type: 'string' },
file_id: { type: 'string' },
source: { type: 'string' },
disk_path: { type: 'string' },
},
additionalProperties: false,
},
},
stderr: { type: ['string', 'null'] },
},
additionalProperties: false,
},
};

// Helper: returns the right schema variant based on caller's intent.
// Mirrors the existing detection logic at line 613 (needsEnvelope check).
function getEnvelopeSchema(constraints) {
if (constraints?.output_format?.includes('b64_xlsx')) {
return ENVELOPE_SCHEMA_XLSX;
}
return ENVELOPE_SCHEMA_GENERAL;
}

// ─── Feature Flag ───────────────────────────────────────────────────────────

/**
@@ -498,6 +625,14 @@ export async function runPythonAnalysis(
// it (cross-phase reuse). Inner pause_turn continuations already
// do this via the same `containerId` variable a few lines below.
...(containerId ? { container: containerId } : {}),
// Avenue A v2: enforce envelope JSON shape at the API level when flag on.
// SDK 0.86.1 OutputConfig (messages.d.ts:1908) accepts json_schema format;
// SDK auto-parses into response.parsed_output. Bridge's extractResults
// reads parsed_output / final text block first (see line ~944).
// Flag off → byte-identical to pre-Avenue-A-v2 behavior.
...(featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT
? { output_config: { format: getEnvelopeSchema(constraints) } }
: {}),
messages
}, {
signal: controller.signal,
@@ -526,6 +661,11 @@ export async function runPythonAnalysis(
system: cachedSystem,
tools: [{ type: CODE_EXECUTION_TOOL_TYPE, name: 'code_execution' }],
...(containerId ? { container: containerId } : {}),
// Avenue A v2: also enforce on pause_turn continuations so the schema
// applies consistently through the iteration loop.
...(featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT
? { output_config: { format: getEnvelopeSchema(constraints) } }
: {}),
messages: [
{ role: 'user', content: userContent },
{ role: 'assistant', content: response.content }
@@ -587,7 +727,8 @@ export async function runPythonAnalysis(
if (extracted.executionSuccess && extracted.outputs.length > 0) {
finalResult.success = true;
finalResult.raw_output = extracted.outputs[extracted.outputs.length - 1];
finalResult.data = selectBestJSON(extracted.outputs);
// Avenue A v2: envelope-priority order — parsed_output → final text → stdout.
finalResult.data = selectEnvelopeWithFallback(extracted, finalResult, constraints);
finalResult.charts = extracted.charts;
finalResult.stderr = extracted.stderr;
await downloadFilesApiCharts(client, extracted.fileIds, finalResult);
@@ -673,7 +814,7 @@ export async function runPythonAnalysis(
if (extracted.outputs.length > 0) {
finalResult.success = true;
finalResult.raw_output = extracted.outputs[extracted.outputs.length - 1];
finalResult.data = selectBestJSON(extracted.outputs);
finalResult.data = selectEnvelopeWithFallback(extracted, finalResult, constraints);
finalResult.charts = extracted.charts;
finalResult.stderr = extracted.stderr;
await downloadFilesApiCharts(client, extracted.fileIds, finalResult);
@@ -686,7 +827,7 @@ export async function runPythonAnalysis(
if (extracted.outputs.length > 0) {
finalResult.success = true;
finalResult.raw_output = extracted.outputs[extracted.outputs.length - 1];
finalResult.data = selectBestJSON(extracted.outputs);
finalResult.data = selectEnvelopeWithFallback(extracted, finalResult, constraints);
finalResult.charts = extracted.charts;
finalResult.stderr = extracted.stderr;
await downloadFilesApiCharts(client, extracted.fileIds, finalResult);
@@ -868,9 +1009,25 @@ export function extractResults(response) {
text: '',
executionSuccess: false,
stderr: null,
fileIds: []
fileIds: [],
// Avenue A v2: when output_config enforces envelope shape, SDK populates
// response.parsed_output. Bridge prefers this over stdout extraction.
// Also tracks which path won — useful for telemetry + L2 diagnosis.
envelopeFromParsedOutput: null,
envelopeFromText: null,
envelopeSource: null, // set to 'parsed_output' below when present; the final winner is recorded on finalResult.envelope_source by selectEnvelopeWithFallback
};

// Avenue A v2 Path 1: SDK's auto-parsed envelope (highest-priority when present).
// Only consulted when STRUCTURED_OUTPUT_ENFORCEMENT is on AND output_config was
// passed. Otherwise response.parsed_output is undefined and we fall through.
if (featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT
&& response?.parsed_output
&& typeof response.parsed_output === 'object') {
results.envelopeFromParsedOutput = response.parsed_output;
results.envelopeSource = 'parsed_output';
}

for (const block of response.content) {
if (block.type === 'text') {
results.text += block.text;
@@ -1105,6 +1262,73 @@ function selectBestJSON(outputs) {
return best;
}

/**
* Avenue A v2 + Option A: envelope extraction with MERGE for xlsx mode.
*
* For NON-xlsx mode (GENERAL schema) — simple priority:
* 1. extracted.envelopeFromParsedOutput (SDK auto-parsed text via output_config)
* 2. Final text block parsed as JSON (output_config enforced, SDK didn't pre-parse)
* 3. selectBestJSON(extracted.outputs) (stdout-based, legacy path)
*
* For XLSX mode (audit-only schema in text + b64 in stdout) — MERGE strategy:
* - Audit metadata from text/parsed_output (output_config-enforced when flag on)
* - b64_xlsx / xlsx_filename / workbook_size_bytes from stdout (legacy path)
* - Merged envelope feeds existing downstream consumers unchanged
* - Falls back to pure-stdout when text/parsed_output empty (flag off OR
* output_config didn't enforce OR audit envelope failed to parse)
*
* Mutates finalResult.envelope_source ∈
* {'parsed_output','text','stdout','merged:parsed_output+stdout',
* 'merged:text+stdout','none'} so callers + telemetry can observe which path
* won. Returns the chosen/merged object (or null).
*/
function selectEnvelopeWithFallback(extracted, finalResult, constraints) {
const isXlsxMode = !!(constraints?.output_format?.includes('b64_xlsx'));

// Try text-side envelope first (parsed_output → text-block JSON-parse)
let textEnvelope = null;
let textSource = null;
if (extracted.envelopeFromParsedOutput) {
textEnvelope = extracted.envelopeFromParsedOutput;
textSource = 'parsed_output';
} else if (featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT && extracted.text) {
const parsedText = tryParseJSON(extracted.text);
if (parsedText && typeof parsedText === 'object' && !Array.isArray(parsedText)) {
textEnvelope = parsedText;
textSource = 'text';
}
}

// For NON-xlsx mode: text envelope is sufficient if present, else stdout
if (!isXlsxMode) {
if (textEnvelope) {
finalResult.envelope_source = textSource;
return textEnvelope;
}
const fromStdout = selectBestJSON(extracted.outputs);
finalResult.envelope_source = fromStdout ? 'stdout' : 'none';
return fromStdout;
}

// For XLSX mode: stdout is REQUIRED (contains b64_xlsx). Merge text + stdout
// when both present; fall back to stdout-only when text empty.
const fromStdout = selectBestJSON(extracted.outputs);
if (textEnvelope && fromStdout) {
// MERGE: text-side audit metadata overlays stdout's b64-bearing envelope.
// Stdout's b64_xlsx/xlsx_filename/workbook_size_bytes are preserved;
// text-side audit_results/sheets/etc. take precedence where present.
finalResult.envelope_source = `merged:${textSource}+stdout`;
return { ...fromStdout, ...textEnvelope };
}
if (fromStdout) {
// Text envelope missing — pure stdout path (existing pre-Avenue-A-v2 behavior).
finalResult.envelope_source = 'stdout';
return fromStdout;
}
// No stdout AND no text-envelope — bridge will detect missing b64 downstream.
finalResult.envelope_source = 'none';
return null;
}

/**
* Infer model_id from task content via keyword overlap with catalog model names.
* Used as Option E fallback when caller doesn't pass explicit model_id (Day 6.G).