diff --git a/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md b/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md
new file mode 100644
index 000000000..ccfad0b53
--- /dev/null
+++ b/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md
@@ -0,0 +1,643 @@
+# Anthropic Claude Agent SDK: Best Practices Research
+
+**Created**: 2026-02-22
+**Topic**: Claude Agent SDK patterns — subagent tool access, code execution integration, tool scoping, input_examples, and agent-to-agent delegation
+**Versions covered**: @anthropic-ai/claude-agent-sdk 0.2.x, @anthropic-ai/sdk 0.74.x, advanced-tool-use-2025-11-20 beta
+**Primary sources**: platform.claude.com/docs/en/agent-sdk, code.claude.com/docs, anthropic.com/engineering (all verified February 2026)
+
+---
+
+## Table of Contents
+
+1. [Overview and Key Findings](#overview-and-key-findings)
+2. [How Subagents Access Tools: Direct vs. Delegation](#how-subagents-access-tools-direct-vs-delegation)
+3. [Code Execution Integration Patterns](#code-execution-integration-patterns)
+4. [Tool Scoping Best Practices](#tool-scoping-best-practices)
+5. [input_examples on Tool Definitions](#input_examples-on-tool-definitions)
+6. [Agent-to-Agent Delegation vs. Direct Tool Invocation](#agent-to-agent-delegation-vs-direct-tool-invocation)
+7. [SDK API Reference Highlights](#sdk-api-reference-highlights)
+8. [Relevance to Super-Legal Architecture](#relevance-to-super-legal-architecture)
+9. [References](#references)
+
+---
+
+## Overview and Key Findings
+
+Anthropic consolidated and significantly expanded the Claude Agent SDK documentation between mid-2025 and February 2026; during that period the Claude Code SDK was renamed to the Claude Agent SDK. The documentation is prescriptive about recommended patterns. Key findings:
+
+| Question | Anthropic's Answer |
+|:---------|:-----------------|
+| Direct tool access vs. delegation? | **Direct tool access is canonical.** Subagents call tools themselves rather than recommending delegation to another agent. |
+| Code execution invocation pattern? | **Programmatic Tool Calling** (`allowed_callers: ["code_execution_20260120"]`) — Claude writes Python that calls tools in a sandbox. |
+| Tool scoping best practices? | Use `tools` allowlist on each `AgentDefinition`. Omitting the field inherits all tools (not recommended for focused agents). |
+| `input_examples` pattern? | Top-level field on tool definitions, array of valid input objects. Improves accuracy 72% → 90% on complex params (Anthropic internal data). |
+| Agent-to-agent vs. direct invocation? | Avoid recursive delegation. Orchestrator delegates to workers; workers call tools directly. No subagent-to-subagent spawning. |
+
+---
+
+## How Subagents Access Tools: Direct vs. Delegation
+
+### The Canonical Pattern: Subagents Call Tools Directly
+
+Anthropic's documentation is unambiguous. A subagent is a scoped agent with its own context window and a restricted tool set. When a task is delegated to it, the subagent **executes the task autonomously using its own tools** — it does not recommend that the orchestrator delegate to a different agent.
+
+From the [Agent SDK subagents page](https://platform.claude.com/docs/en/agent-sdk/subagents):
+
+> "Subagents maintain separate context from the main agent, preventing information overload and keeping interactions focused."
+
+> "A `doc-reviewer` subagent might only have access to Read and Grep tools, ensuring it can analyze but never accidentally modify your documentation files."
+
+The `tools` field on `AgentDefinition` is an explicit allowlist. If a subagent has `tools: ["Read", "Grep", "Glob"]`, it will **use those tools directly**. It will not forward work to another agent.
+
+### How Subagent Invocation Works (Task Tool)
+
+The orchestrator (main agent) invokes subagents via the `Task` tool. The orchestrator must have `Task` in its `allowedTools`. The subagent does not have `Task` in its tools — this is explicitly called out:
+
+> **"Subagents cannot spawn their own subagents. Don't include `Task` in a subagent's `tools` array."**
+> — [platform.claude.com/docs/en/agent-sdk/subagents](https://platform.claude.com/docs/en/agent-sdk/subagents)
+
+This is the clearest statement of the pattern: the delegation chain is exactly **one level deep**. The orchestrator delegates once; the worker executes directly.
+
+### TypeScript Example (Canonical Pattern)
+
+```typescript
+import { query } from "@anthropic-ai/claude-agent-sdk";
+
+for await (const message of query({
+  prompt: "Review the authentication module for security issues",
+  options: {
+    // Orchestrator has Task to delegate, plus its own tools
+    allowedTools: ["Read", "Grep", "Glob", "Task"],
+    agents: {
+      "code-reviewer": {
+        description: "Expert code review specialist. Use for quality, security, and maintainability reviews.",
+        prompt: "You are a code review specialist. Analyze code quality and suggest improvements.",
+        // Subagent calls these tools DIRECTLY — never delegates further
+        tools: ["Read", "Grep", "Glob"],
+        model: "sonnet"
+      }
+    }
+  }
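+})) {
+  // Print the final result once the orchestrator finishes
+  if ("result" in message) console.log(message.result);
+}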
+```
+
+### Multi-Agent Research System (Anthropic Internal Implementation)
+
+Anthropic's [published account of their multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) confirms the pattern:
+
+- A **LeadResearcher orchestrator** decomposes the query and spawns 3–5 subagents in parallel
+- Each subagent **independently performs web searches** and evaluates tool results
+- Subagents **do not delegate to other subagents** — they execute tool calls directly
+- Results flow back to the orchestrator for synthesis
+
+> "The lead agent spins up 3-5 subagents in parallel... subagents use 3+ tools in parallel... independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher."
+
+### Avoid Recursive Delegation
+
+The [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) document establishes this as a design principle: **do not have agents recommend delegation to other agents**. The value of a subagent is that it actually performs the work. An agent that says "you should use a different agent for this" is not following the delegation model — it has simply returned an unhelpful response.
+
+---
+
+## Code Execution Integration Patterns
+
+### Programmatic Tool Calling (November 2025 Beta, Now Stable)
+
+Released under beta header `advanced-tool-use-2025-11-20`, Programmatic Tool Calling is now documented as a first-class pattern on [platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling](https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling).
+
+The pattern: instead of Claude invoking tools one at a time through API round-trips, Claude **writes Python code** that calls tools as functions. The code runs in a sandboxed container. Intermediate results do not enter Claude's context window — only the final code output does.
+
+**IMPORTANT — Two distinct code execution tool types exist:**
+
+| Tool Type | Purpose | Beta Header |
+|:----------|:--------|:------------|
+| `code_execution_20250825` | **General-purpose** code execution (Python, bash, file ops). Used by Skills' native path (`src/server/legacyStreamHandler.js:78`). | None (GA) |
+| `code_execution_20260120` | **PTC-capable** code execution (Python, bash, file ops) — supports the `allowed_callers` pattern when used WITH custom tools. Can also be used as a plain code-execution tool (without `allowed_callers`). **This is what `codeExecutionBridge.js` uses** (verified at line 30). | None |
+
+These are **separate tool types**, not versions of the same tool.
+
+**CORRECTION (2026-05-16, Avenue A v2 audit)**: an earlier version of this doc stated "Super-Legal bridge correctly uses `code_execution_20250825`." That was incorrect: the bridge uses `code_execution_20260120` (verified at `codeExecutionBridge.js:30`). The bridge does NOT use PTC's `allowed_callers` pattern (zero grep matches anywhere in `src/`), so it consumes `code_execution_20260120` as a plain code-execution tool without PTC features. The PTC-specific restriction on `strict: true` tool definitions therefore does not apply to the bridge's use of this tool.
+
+**PTC-compatible models (as of February 2026):**
+
+| Model | PTC Tool Version |
+|:------|:------------|
+| claude-opus-4-6 | `code_execution_20260120` |
+| claude-sonnet-4-6 | `code_execution_20260120` |
+| claude-sonnet-4-5-20250929 | `code_execution_20260120` |
+| claude-opus-4-5-20251101 | `code_execution_20260120` |
+
+### The `allowed_callers` Field
+
+This is the mechanism that enables programmatic tool calling. It is added to a **user-defined tool's** definition, not to the code execution tool itself.
+
+```json
+{
+  "name": "query_database",
+  "description": "Execute a SQL query. Returns JSON rows.",
+  "input_schema": { ... },
+  "allowed_callers": ["code_execution_20260120"]
+}
+```
+
+**Possible values:**
+- `["direct"]`: Claude can call this tool only through the normal API round-trip (default if the field is omitted)
+- `["code_execution_20260120"]`: Callable only from within a code execution sandbox
+- `["direct", "code_execution_20260120"]`: Callable both ways
+
+> **Tip from official docs**: "Choose either `["direct"]` or `["code_execution_20260120"]` for each tool rather than enabling both, as this provides clearer guidance to Claude for how best to use the tool."
+
+### Programmatic Tool Calling Request Structure
+
+```typescript
+import { Anthropic } from "@anthropic-ai/sdk";
+const anthropic = new Anthropic();
+
+const response = await anthropic.messages.create({
+  model: "claude-opus-4-6",
+  max_tokens: 4096,
+  messages: [{
+    role: "user",
+    content: "Query sales for West, East, Central regions, then find highest revenue"
+  }],
+  tools: [
+    {
+      // Step 1: Include the code execution server tool
+      type: "code_execution_20260120",
+      name: "code_execution"
+    },
+    {
+      // Step 2: User-defined tool with allowed_callers pointing to code execution
+      name: "query_database",
+      description: "Execute a SQL query. Returns JSON array of row objects.",
+      input_schema: {
+        type: "object",
+        properties: { sql: { type: "string", description: "SQL query to execute" } },
+        required: ["sql"]
+      },
+      allowed_callers: ["code_execution_20260120"]
+    }
+  ]
+});
+```
+
+When Claude responds, it writes Python code like:
+
+```python
+# Claude-generated code running in the sandbox
+results = {}
+for region in ["West", "East", "Central"]:
+    data = await query_database(f"SELECT SUM(revenue) FROM sales WHERE region='{region}'")
+    results[region] = data[0]["sum"]
+
+top = max(results.items(), key=lambda x: x[1])
+print(f"Top region: {top[0]} with ${top[1]:,}")
+```
+
+The tool calls are `await`-ed — Claude writes async Python automatically. Intermediate query results never enter the context window; only the `print()` output does.
+
+### When to Use Programmatic Tool Calling
+
+**Good use cases (from official docs):**
+- Large data processing where you only need aggregates or summaries
+- Multi-step workflows with 3+ dependent tool calls
+- Operations requiring filtering, sorting, or transformation of results
+- Tasks where intermediate data should not influence Claude's reasoning
+- Parallel operations across many items (e.g., checking 50 endpoints)
+
+**Less ideal:**
+- Single tool calls with simple responses
+- Tools that need immediate user feedback
+- Very fast operations where code execution overhead would outweigh the benefit
+
+### Constraints
+
+- `strict: true` on tool definitions is **not** supported with programmatic calling
+- `tool_choice` forcing is **not** supported with programmatic calling
+- `disable_parallel_tool_use: true` is **not** supported with programmatic calling
+- MCP connector tools cannot currently be called programmatically
+- Not covered by Zero Data Retention (ZDR) arrangements
+
+---
+
+## Tool Scoping Best Practices
+
+### The Official Guidance: Explicit Allowlists Per Agent
+
+From the [subagents documentation](https://platform.claude.com/docs/en/agent-sdk/subagents):
+
+```typescript
+// Official tool restriction pattern
+agents: {
+  "code-analyzer": {
+    description: "Static code analysis and architecture review",
+    prompt: "Analyze code structure without making changes.",
+    // Read-only tools: no Edit, Write, or Bash access
+    tools: ["Read", "Grep", "Glob"]
+  }
+}
+```
+
+The docs list canonical tool combinations by role:
+
+| Use Case | Tools | Description |
+|:---------|:------|:------------|
+| Read-only analysis | `Read`, `Grep`, `Glob` | Can examine code but not modify or execute |
+| Test execution | `Bash`, `Read`, `Grep` | Can run commands and analyze output |
+| Code modification | `Read`, `Edit`, `Write`, `Grep`, `Glob` | Full read/write without command execution |
+| Full access | (omit `tools` field) | Inherits all tools from parent |
+
+### Omitting the `tools` Field
+
+If `tools` is **omitted** from an `AgentDefinition`, the subagent **inherits all available tools** from the parent. This is the default but is not recommended for focused agents. The documentation consistently shows `tools` being specified explicitly for every meaningful subagent.
+
+### MCP Tool Scoping with `allowedTools`
+
+For MCP tools, the naming convention is `mcp__<server-name>__<tool-name>`. Wildcard patterns are supported:
+
+```typescript
+allowedTools: [
+  "mcp__github__*",           // All tools from the github server
+  "mcp__db__query",           // Only the query tool from db server
+  "mcp__slack__send_message"  // Only send_message from slack
+]
+```
+
+When the MCP tool search feature is active (`ENABLE_TOOL_SEARCH` environment variable), tools marked with `defer_loading: true` are not preloaded; they are discovered on demand. Tool search activates automatically when MCP tool descriptions exceed 10% of the context window.
+
+### The Tool Flooding Problem (Anthropic Research Finding)
+
+From the [advanced tool use blog post](https://www.anthropic.com/engineering/advanced-tool-use):
+
+> Loading hundreds of tool definitions upfront "wastes 85% of context on unused tool definitions" in typical agent runs.
+
+The recommended fix: the Tool Search Tool (part of the `advanced-tool-use-2025-11-20` beta) allows Claude to **discover tools on-demand**. In the Agent SDK, this is surfaced as the `ENABLE_TOOL_SEARCH` environment variable.
+
+For the Super-Legal architecture, this validates the SCOPED_MCP_SERVERS approach: giving each subagent only the tools it needs is exactly aligned with Anthropic's documented best practice.
+
+### `settingSources: []` Pattern
+
+From the [TypeScript SDK reference](https://platform.claude.com/docs/en/agent-sdk/typescript):
+
+> "When `settingSources` is omitted or undefined, the SDK does **not** load any filesystem settings. This provides isolation for SDK applications."
+
+The Super-Legal codebase already sets `settingSources: []` in `agentQuery`, which is the correct pattern for SDK-only applications that define everything programmatically.
+
+---
+
+## input_examples on Tool Definitions
+
+### Feature Origin and Documentation
+
+`input_examples` is part of the `advanced-tool-use-2025-11-20` beta, but it is also documented as a **standard field** in the [main tool use implementation guide](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) with no beta requirement for basic use.
+
+From the official docs, tool definitions now support four fields:
+
+| Parameter | Description |
+|:----------|:------------|
+| `name` | Tool name (regex: `^[a-zA-Z0-9_-]{1,64}$`) |
+| `description` | Detailed description of what the tool does and when to use it |
+| `input_schema` | JSON Schema object defining expected parameters |
+| `input_examples` | **(Optional)** Array of example input objects |
+
+### Why input_examples Matters
+
+From the [advanced tool use documentation](https://www.anthropic.com/engineering/advanced-tool-use):
+
+> "JSON schemas define what's structurally valid, but can't express usage patterns: when to include optional parameters, which combinations make sense, or what conventions your API expects."
+
+> "Tool use examples improved accuracy from 72% to 90% on complex parameter handling."
+
+The improvement comes from demonstrating:
+- When to include optional parameters
+- What realistic values look like (not schema placeholders like `"string"`)
+- Parameter correlations (e.g., if field A is set, field B has a specific format)
+- Domain-specific conventions not capturable in JSON Schema
+
+### Official Format
+
+```python
+# Official example from platform.claude.com
+tools=[
+    {
+        "name": "get_weather",
+        "description": "Get the current weather in a given location",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "location": {
+                    "type": "string",
+                    "description": "The city and state, e.g. San Francisco, CA",
+                },
+                "unit": {
+                    "type": "string",
+                    "enum": ["celsius", "fahrenheit"],
+                    "description": "The unit of temperature",
+                },
+            },
+            "required": ["location"],
+        },
+        "input_examples": [
+            {"location": "San Francisco, CA", "unit": "fahrenheit"},
+            {"location": "Tokyo, Japan", "unit": "celsius"},
+            {
+                "location": "New York, NY"  # Demonstrates that 'unit' is optional
+            },
+        ],
+    }
+]
+```
+
+### Constraints on input_examples
+
+- Each example **must be valid** according to the tool's `input_schema` — invalid examples return a 400 error
+- **Not supported for server-side tools** — only user-defined tools can have input examples
+- **Token cost**: ~20–50 tokens for simple examples, ~100–200 tokens for complex nested objects
+- 1–5 examples per tool is the recommended range
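+
+To make the first constraint concrete, here is a sketch of a definition that should be rejected, reusing the `get_weather` schema from the official example above. The example input is invented for illustration: `"kelvin"` is not in the `unit` enum, so by the rule above the request is expected to fail with a 400 rather than silently ignore the example.
+
+```json
+{
+  "name": "get_weather",
+  "description": "Get the current weather in a given location",
+  "input_schema": {
+    "type": "object",
+    "properties": {
+      "location": { "type": "string" },
+      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
+    },
+    "required": ["location"]
+  },
+  "input_examples": [
+    { "location": "Paris, France", "unit": "kelvin" }
+  ]
+}
+```
+
+Changing `"kelvin"` to `"celsius"`, or dropping `unit` entirely, makes the example schema-valid again.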
+
+### When to Add input_examples (Official Guidance)
+
+The docs are explicit that descriptions take priority:
+
+> "Prioritize descriptions, but consider using `input_examples` for complex tools. Clear descriptions are most important, but for tools with complex inputs, nested objects, or format-sensitive parameters, you can use the `input_examples` field."
+
+Add `input_examples` when:
+- Tools have complex nested structures where valid JSON does not mean correct usage
+- Many optional parameters with non-obvious inclusion patterns
+- Domain-specific conventions not captured in schemas (e.g., date formats, code conventions)
+- The tool has caused consistent parameter formatting errors in testing
+
+Do **not** add `input_examples` just to add them — the token cost is real.
+
+### Best Practices for input_examples Content
+
+From the [advanced tool use blog post](https://www.anthropic.com/engineering/advanced-tool-use):
+
+1. **Use realistic data** — real city names, plausible prices, not `"string"` or `"value"`
+2. **Show variety** — minimal, partial, and full specification patterns
+3. **Keep it concise** — 1–5 examples per tool
+4. **Focus on ambiguity** — correct usage that isn't obvious from the schema alone
+
+---
+
+## Agent-to-Agent Delegation vs. Direct Tool Invocation
+
+### Anthropic's Position: One Level of Delegation
+
+The architecture Anthropic recommends across all documentation is:
+
+```
+Orchestrator (main agent)
+  ├── Has: Task tool + its own tools
+  ├── Delegates to: Subagent A (via Task tool)
+  │     └── Has: scoped tool subset, calls tools directly
+  ├── Delegates to: Subagent B (via Task tool)
+  │     └── Has: different scoped tool subset, calls tools directly
+  └── Synthesizes results
+```
+
+There is **no documented pattern** for:
+- A subagent recommending that the orchestrator delegate to a different agent
+- A subagent spawning another subagent
+- An agent using natural language to suggest "you should ask agent X"
+
+The "do not include `Task` in a subagent's tools" constraint is structural enforcement of this one-level design.
+
+### Agent Teams (Separate Concept)
+
+For workflows that exceed the single-level delegation pattern, Anthropic documents [agent teams](https://code.claude.com/docs/en/agent-teams) as a separate concept:
+
+> "Subagents work within a single session; agent teams coordinate across separate sessions."
+
+Agent teams use the `--agent` flag to run Claude Code instances as workers in their own sessions. This is a different architecture from SDK subagents and is used for "tasks that need sustained parallelism or exceed your context window."
+
+The Super-Legal architecture uses the single-session subagent model, not agent teams.
+
+### Direct Invocation as Default
+
+From the [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) canonical guide:
+
+> "[Agents] are typically just LLMs using tools based on environmental feedback in a loop."
+
+The default is direct tool use. Delegation (via the orchestrator-workers pattern) is an advanced pattern applied when:
+- Tasks cannot be predicted upfront (the orchestrator must break them down dynamically)
+- Context isolation is needed (exploration output should not pollute synthesis context)
+- Parallelization is desired (multiple subagents working simultaneously)
+
+When none of these conditions apply, direct tool invocation by a single agent is simpler and preferred.
+
+### What "Delegation" Means in the SDK
+
+In the SDK, "delegation" means: the orchestrator calls the `Task` tool with a `subagent_type` and a `prompt`. The named subagent then executes autonomously. The orchestrator does not manage the subagent's tool calls — those happen inside the subagent's context.
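+
+As a rough sketch of what that call looks like, the orchestrator's delegation shows up as a `tool_use` block along these lines. Only `subagent_type` and `prompt` come from the description above; the block's `id` is omitted, the prompt text is invented for illustration, and the real block may carry additional fields.
+
+```json
+{
+  "type": "tool_use",
+  "name": "Task",
+  "input": {
+    "subagent_type": "code-reviewer",
+    "prompt": "Review the authentication module for security issues and report findings."
+  }
+}
+```
+
+The `subagent_type` value is expected to match a key under `agents` in the `query()` options, which is why the `description` on that entry determines whether the orchestrator picks it.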
+
+A subagent "recommending delegation" (i.e., returning text like "you should use the financial-analyst agent for this") is **not the expected behavior**. If the description field correctly describes when to use the subagent, the orchestrator will invoke it directly without the subagent needing to recommend it.
+
+### Practical Implication for Super-Legal
+
+If the Super-Legal orchestrator is currently seeing subagents respond with "I recommend delegating this to agent X," that is a prompt design issue — either:
+1. The subagent is being invoked for a task outside its scope (fix: improve description matching)
+2. The subagent's system prompt is instructing it to recommend delegation (fix: remove that instruction; subagents should execute directly)
+3. The subagent lacks the tools it needs to complete the task (fix: scope the correct tools to the agent)
+
+---
+
+## SDK API Reference Highlights
+
+### AgentDefinition Fields (TypeScript)
+
+```typescript
+type AgentDefinition = {
+  description: string;       // Required: when to use this agent (Claude reads this)
+  tools?: string[];          // Optional: explicit allowlist; inherits all if omitted
+  prompt: string;            // Required: the agent's system prompt
+  model?: "sonnet" | "opus" | "haiku" | "inherit"; // Optional: defaults to main model
+}
+```
+
+Note: there is no `maxThinkingTokens` on `AgentDefinition`. The `maxThinkingTokens` issue (Issue #25) is at the `query()` / `agentQuery` level, not the subagent level.
+
+### Options Fields Relevant to Tool Scoping
+
+```typescript
+type Options = {
+  allowedTools?: string[];           // Orchestrator tool allowlist
+  disallowedTools?: string[];        // Orchestrator tool denylist
+  agents?: Record<string, AgentDefinition>; // Subagent definitions
+  mcpServers?: Record<string, McpServerConfig>; // MCP servers
+  settingSources?: SettingSource[];  // [] = no filesystem settings (recommended)
+  betas?: SdkBeta[];                // e.g., ["context-1m-2025-08-07"]
+  maxThinkingTokens?: number;       // CAUTION: breaks hooks (Issue #25)
+  model?: string;                   // Override model for this query
+}
+```
+
+### SdkBeta Type
+
+As of the current SDK documentation, only one beta is exposed as a typed `SdkBeta` value:
+
+```typescript
+type SdkBeta = "context-1m-2025-08-07";
+```
+
+Other betas (like `interleaved-thinking-2025-05-14`, `effort-2025-11-24`) are passed at the Messages API level, not through the `betas` SDK option.
+
+### Hook Events
+
+Confirmed available hook events in current SDK:
+
+```typescript
+type HookEvent =
+  | "PreToolUse"
+  | "PostToolUse"
+  | "PostToolUseFailure"
+  | "Notification"
+  | "UserPromptSubmit"
+  | "SessionStart"
+  | "SessionEnd"
+  | "Stop"
+  | "SubagentStart"
+  | "SubagentStop"
+  | "PreCompact"
+  | "PermissionRequest";
+```
+
+`SubagentStart` and `SubagentStop` are the hooks used by the Super-Legal `hookSSEBridge.js`.
+
+---
+
+## Relevance to Super-Legal Architecture
+
+### What the Research Validates
+
+1. **SCOPED_MCP_SERVERS approach is correct.** Anthropic explicitly recommends scoping tools per agent. The `buildScopedTools()` pattern is aligned with `AgentDefinition.tools`.
+
+2. **`settingSources: []` is correct.** The default behavior (no filesystem settings) is the intended SDK pattern for programmatic applications.
+
+3. **One-level delegation is correct.** Super-Legal's orchestrator → subagent → direct tool calls is exactly the documented architecture. Subagents should not delegate further.
+
+4. **`output_config: { format: ... }` (SDK 0.72+) is the current structured output API.** The migration from `output_format` was correct.
+
+### What Could Be Improved
+
+1. **`input_examples` on complex tool definitions.** Super-Legal's MCP tool definitions (e.g., SEC search, CourtListener hybrid search) have complex optional parameters that would benefit from `input_examples`. The 72% → 90% accuracy improvement is significant for legal research tools where parameter correctness matters.
+
+   Priority targets: tools with `startPublishedDate`/`endPublishedDate` optional params, tools with `category` enum params, tools with complex nested filter objects.
+
+2. **Code execution bridge (`run_python_analysis`) is architecturally correct as-is.** The bridge uses `code_execution_20260120` via the Messages API as a plain code-execution tool, without `allowed_callers` (see the correction note under Code Execution Integration Patterns), which is sufficient for straight Python analysis. PTC only comes into play when custom tools opt in through `allowed_callers`, letting Claude's sandbox code call them as async functions. Since MCP connector tools cannot currently be called programmatically, PTC does not replace the bridge's two-phase architecture (gather data via MCP → pass as JSON → execute Python). The bridge would only benefit from PTC if non-MCP data-fetching tools were added as `allowed_callers`-eligible client tools.
+
+3. **`allowed_callers` is a Messages API field, not an Agent SDK field.** The Agent SDK's `agentQuery` path does not expose `allowed_callers` in its tool definitions API. If PTC were adopted in the future, it would need to use the Messages API directly (which is what `codeExecutionBridge.js` already does).
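+
+To illustrate item 1, a tool definition carrying `input_examples` might look like the sketch below. Everything in it other than the `startPublishedDate`, `endPublishedDate`, and `category` parameter names is an assumption: the tool name, the descriptions, the enum values, and the ISO 8601 date format are hypothetical and not taken from the Super-Legal codebase.
+
+```json
+{
+  "name": "search_legal_sources",
+  "description": "Hypothetical example tool. Searches published legal sources, optionally filtered by date range and category.",
+  "input_schema": {
+    "type": "object",
+    "properties": {
+      "query": { "type": "string", "description": "Search terms" },
+      "startPublishedDate": { "type": "string", "description": "Earliest publication date, ISO 8601 (assumed format)" },
+      "endPublishedDate": { "type": "string", "description": "Latest publication date, ISO 8601 (assumed format)" },
+      "category": { "type": "string", "enum": ["case_law", "regulation", "news"] }
+    },
+    "required": ["query"]
+  },
+  "input_examples": [
+    { "query": "insider trading enforcement actions" },
+    { "query": "SEC climate disclosure rule", "category": "regulation" },
+    { "query": "securities fraud class action", "startPublishedDate": "2024-01-01", "endPublishedDate": "2024-12-31", "category": "case_law" }
+  ]
+}
+```
+
+The three examples follow the minimal, partial, and full pattern recommended earlier, and they show the two date bounds being supplied together.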
+
+### What Remains Blocked
+
+1. **`maxThinkingTokens` in `agentQuery` (Issue #25).** This is still broken as of SDK 0.2.47. The research confirms that `maxThinkingTokens` is a valid option in the `Options` type, but the SDK bug prevents it from working without breaking hooks. No resolution has been published.
+
+2. **Programmatic Tool Calling via Agent SDK.** The `allowed_callers` field is part of the Messages API tool definition, not the `AgentDefinition` type in the Agent SDK. You cannot use Programmatic Tool Calling from within an Agent SDK `agentQuery` call today — it requires using the Messages API directly (which is what `codeExecutionBridge.js` already does).
+
+---
+
+## References
+
+All sources verified as accessible on 2026-02-22.
+
+### Primary (Official Anthropic Documentation)
+
+- [Agent SDK Overview](https://platform.claude.com/docs/en/agent-sdk/overview) — Claude Code SDK renamed to Claude Agent SDK; overview of capabilities
+- [Agent SDK: Subagents](https://platform.claude.com/docs/en/agent-sdk/subagents) — Programmatic subagent definition, tool restrictions, `Task` tool requirement, no nested subagents rule
+- [Agent SDK: MCP Integration](https://platform.claude.com/docs/en/agent-sdk/mcp) — MCP tool naming convention, `allowedTools` patterns, ENABLE_TOOL_SEARCH
+- [Agent SDK: Custom Tools](https://platform.claude.com/docs/en/agent-sdk/custom-tools) — `createSdkMcpServer`, `tool()` helper, `allowedTools` with MCP prefix
+- [Agent SDK TypeScript Reference](https://platform.claude.com/docs/en/agent-sdk/typescript) — Full `Options` type, `AgentDefinition`, `HookEvent`, `SettingSource`, `SdkBeta`
+- [Tool Use Implementation Guide](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) — `input_examples` field specification, best practices, tool runner beta
+- [Programmatic Tool Calling](https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling) — `allowed_callers`, code execution sandbox, model compatibility table
+- [Claude Code Sub-Agents](https://code.claude.com/docs/en/sub-agents) — Filesystem-based subagents, tool restriction patterns, hooks in subagents, `memory` field
+
+### Anthropic Engineering Blog
+
+- [Building Agents with the Claude Agent SDK](https://claude.com/blog/building-agents-with-the-claude-agent-sdk) — Context management, parallelization, tool design hierarchy
+- [Introducing Advanced Tool Use](https://www.anthropic.com/engineering/advanced-tool-use) — Tool Search Tool, Programmatic Tool Calling, Tool Use Examples; beta `advanced-tool-use-2025-11-20`
+- [How We Built Our Multi-Agent Research System](https://www.anthropic.com/engineering/multi-agent-research-system) — Orchestrator-worker pattern, direct tool access by subagents, no recursive delegation
+- [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) — Initializer/coding agent pattern, purposeful tool access
+- [Writing Effective Tools for AI Agents](https://www.anthropic.com/engineering/writing-tools-for-agents) — Tool consolidation, naming conventions, response quality, token efficiency
+- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) — Canonical agent patterns: augmented LLM, orchestrator-workers, direct tool use as default
+- [Effective Context Engineering for AI Agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — Context management for agents
+
+### Third-Party Analysis (Non-Authoritative)
+
+- [Anthropic Just Shipped the Fix for Tool Definition Bloat](https://medium.com/@DebaA/anthropic-just-shipped-the-fix-for-tool-definition-bloat-77464c8dbec9) (Medium, November 2025) — Summary of advanced tool use features
+- [Your AI Agent Wastes 95% of Its Brain on Tools](https://medium.com/genaius/your-ai-agent-wastes-95-of-its-brain-on-tools-anthropic-just-showed-the-fix-96fbe597136b) (Medium, November 2025) — Context window impact analysis
+- [Claude Agent SDK Best Practices for AI Agent Development](https://skywork.ai/blog/claude-agent-sdk-best-practices-ai-agents-2025/) (Skywork, 2025) — Community summary; treat as unofficial
+
+### GitHub
+
+- [anthropics/claude-agent-sdk-typescript](https://github.com/anthropics/claude-agent-sdk-typescript) — TypeScript SDK source and CHANGELOG
+- [anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) — Python SDK source and CHANGELOG
+- [claude-cookbooks: programmatic_tool_calling_ptc.ipynb](https://github.com/anthropics/claude-cookbooks/blob/main/tool_use/programmatic_tool_calling_ptc.ipynb) — Official cookbook example
+
+---
+
+## §11 — Anthropic structured-output empirical constraints (Avenue A v2 findings, 2026-05-16)
+
+Avenue A v2 (PR #135) added `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement to the code-execution bridge. The implementation surfaced several **undocumented schema constraints** that the Anthropic API enforces. Cataloged here for future maintainers.
+
+### Verified-compatible feature combination
+
+The following combination works on Messages API:
+
+- `output_config: { format: { type: 'json_schema', schema } }` (SDK 0.86.1 `MessageCreateParams.output_config`, type at `node_modules/@anthropic-ai/sdk/resources/messages/messages.d.ts:1908`)
+- + `tools: [{ type: 'code_execution_20260120', name: 'code_execution' }]`
+- + `client.messages.stream(...).finalMessage()` (streaming API)
+- + `pause_turn` continuations (server-side sampling iteration limit handling)
+- + `cache_control: { type: 'ephemeral' }` on system prompt
+- + `betas: ['context-1m-2025-08-07', 'files-api-2025-04-14']`
+
+**Verified via L2 + L4 + L5 in PR #135** — no API rejection, no behavioral interaction with `pause_turn`, no cache invalidation.
+
+### Schema constraints discovered (API rejects with 400 if violated)
+
+These constraints are NOT documented in the SDK type definitions or in this codebase's local docs as of 2026-05-16. They were discovered through trial-and-error against `code_execution_20260120` + streaming. The pattern: API rejects with 400 `invalid_request_error` and a specific message naming the offending property.
+
+| Constraint | Required value | API error message if violated |
+|---|---|---|
+| `additionalProperties` on `type: 'object'` nodes | **MUST be `false`** (NOT `true`, NOT omitted) | `"For 'object' type, 'additionalProperties: true' is not supported. Please set 'additionalProperties' to false"` |
+| `minimum` on `type: 'integer'` or `type: 'number'` | **NOT supported** — strip entirely | `"For 'integer' type, property 'minimum' is not supported"` |
+| `maximum` on numeric types | Likely **NOT supported** (stripped defensively in PR #135) | (untested but symmetric with `minimum`) |
+| `minLength` on `type: 'string'` | Likely **NOT supported** (stripped defensively) | (untested but suspected based on `minimum` pattern) |
+| `maxLength` on strings | Likely **NOT supported** | (same) |
+| `enum` on `type: 'string'` | **Supported** | (verified — `status: { type: 'string', enum: ['PASS', 'FAIL', 'UNKNOWN'] }` works in L2) |
+
+**Implication**: Anthropic's structured-output validator appears to implement a **strict subset of JSON Schema** that enforces shape but not value ranges. Per the table above, schemas can constrain **types, required fields, and string enums**; numeric bounds are rejected, and string-length bounds are assumed to be rejected as well. Semantic validation (value bounds, string length, base64 magic bytes) must be done downstream in application code.
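+
+The constraints in the table can be applied mechanically before a schema is sent. The following is a minimal sketch, not the bridge's actual implementation: `sanitizeSchema` and `UNSUPPORTED_KEYWORDS` are hypothetical names, and the keyword list simply mirrors the table (only `minimum` is confirmed as rejected; the other three are stripped defensively).
+
+```javascript
+// Hypothetical helper, not part of codeExecutionBridge.js.
+// Keywords the table above lists as rejected or defensively stripped.
+const UNSUPPORTED_KEYWORDS = ['minimum', 'maximum', 'minLength', 'maxLength'];
+
+// Returns a deep copy of `node` with unsupported keywords removed and
+// `additionalProperties: false` forced onto every `type: 'object'` node.
+function sanitizeSchema(node) {
+  if (Array.isArray(node)) return node.map((item) => sanitizeSchema(item));
+  if (node === null || typeof node !== 'object') return node;
+
+  const out = {};
+  for (const [key, value] of Object.entries(node)) {
+    if (key === 'properties' && value !== null && typeof value === 'object') {
+      // Property names are data, not keywords: keep every name and
+      // sanitize each subschema, even if a field happens to be called "minimum".
+      out.properties = Object.fromEntries(
+        Object.entries(value).map(([name, sub]) => [name, sanitizeSchema(sub)])
+      );
+    } else if (!UNSUPPORTED_KEYWORDS.includes(key)) {
+      out[key] = sanitizeSchema(value);
+    }
+  }
+  if (out.type === 'object') out.additionalProperties = false;
+  return out;
+}
+```
+
+Any bounds removed this way still need to be checked in application code after the envelope is parsed.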
+
+### Where this matters in the bridge
+
+The bridge's `ENVELOPE_SCHEMA_XLSX` and `ENVELOPE_SCHEMA_GENERAL` constants (`codeExecutionBridge.js`, around lines 70-167) are written defensively against these constraints. The header comment above the schema definitions documents the constraints inline so that future schema edits respect them.
+
+### The b64-in-text architectural constraint (Option A pivot)
+
+The L4 v1 attempt enforced the FULL envelope (including `b64_xlsx` field) in the model's text-block output. **This broke `code_execution_20260120` + xlsx renders**:
+
+- `output_config` forces the model to emit JSON-formatted text in the response's `text` block(s)
+- For xlsx callers, that envelope includes a multi-KB base64 binary (`b64_xlsx`)
+- Output tokens in the `text` block consume the per-turn `max_tokens` budget (32K default)
+- A 25KB workbook = ~33KB base64 = ~8K-10K output tokens just for the b64 string, plus prose narration
+- Observed: phase3 (LBO sheet) hit `stop_reason='max_tokens'` with `text_len=62106` mid-base64 → truncated b64 → corrupt envelope → retry loop → render failure
+
+**The architectural fix (Option A, shipped in PR #135)**: scope down the xlsx schema so that text-block enforcement covers ONLY the audit metadata (`audit_results`, `sheets`, `phase_sources`, `named_ranges_count`). The `b64_xlsx` binary payload continues via stdout (`bash_code_execution_tool_result.content.stdout`), the legacy path, where it has no token-budget pressure. The bridge merges audit-from-text + b64-from-stdout in `selectEnvelopeWithFallback()`.
+
+**Lesson for future structured-output work**: don't force large binary payloads into the text channel. Use the channel that's naturally sized for the data class (stdout for large/binary; text for small/structured).
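+
+The sizing arithmetic behind this lesson can be sketched as follows. This is an illustration only: `base64Length` is exact for padded base64, but the characters-per-token ratio and the budget fraction are assumptions chosen to match the rough 8K-10K figure above, and `chooseChannel` is a hypothetical helper, not bridge code.
+
+```javascript
+// Exact length of padded base64 for a payload of `byteCount` bytes.
+function base64Length(byteCount) {
+  return 4 * Math.ceil(byteCount / 3);
+}
+
+// Rough token estimate. ASSUMPTION: ~3.5 base64 characters per token,
+// which is not a documented tokenizer property.
+function estimateBase64Tokens(byteCount, charsPerToken = 3.5) {
+  return Math.ceil(base64Length(byteCount) / charsPerToken);
+}
+
+// Hypothetical policy: route a payload through stdout once its estimated
+// token cost exceeds a fraction of the per-turn max_tokens budget.
+function chooseChannel(byteCount, maxTokens, budgetFraction = 0.25) {
+  const tokens = estimateBase64Tokens(byteCount);
+  return tokens > maxTokens * budgetFraction ? 'stdout' : 'text';
+}
+
+console.log(base64Length(25000));         // 33336 characters
+console.log(chooseChannel(25000, 32000)); // 'stdout'
+console.log(chooseChannel(300, 32000));   // 'text'
+```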
+
+### `code_execution_20260120` without PTC features
+
+The bridge uses `code_execution_20260120` as a **plain code-execution tool**, NOT as a PTC tool. It does not set `allowed_callers` on any custom tool. The empirical finding: when the new tool is used without PTC features, the per-tool PTC restrictions (e.g., "`strict: true` on tool inputs not supported with PTC") **do not apply**. The bridge's `output_config` works fine because PTC isn't engaged.
+
+If a future bridge change DOES adopt PTC (`allowed_callers` on custom tools), re-verify `output_config` compatibility — Anthropic's docs may carve out PTC-specific restrictions.
+
+### Operational queries
+
+Post-Avenue-A-v2 telemetry counter `claude_xlsx_render_turn1_envelope_success_total` (added in PR #135 follow-up) provides the production observability surface:
+
+```promql
+# Turn-1 success rate by flag state — efficacy gauge
+sum by (structured_output) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome="first_turn"}[1h]))
+ / sum by (structured_output) (rate(claude_xlsx_render_turn1_envelope_success_total[1h]))
+
+# Envelope-source distribution post-flag-flip
+sum by (envelope_source) (rate(claude_xlsx_render_turn1_envelope_success_total{structured_output="on"}[1h]))
+```
+
+Target after `STRUCTURED_OUTPUT_ENFORCEMENT=true` deployment: `structured_output="on"` rate ≥ `structured_output="off"` baseline rate (validates Avenue A v2 doesn't regress retry behavior).
diff --git a/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md b/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md
new file mode 100644
index 000000000..c98f12cd6
--- /dev/null
+++ b/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md
@@ -0,0 +1,446 @@
+# Anthropic Code Execution Container API Reference
+
+**Created:** 2026-02-25
+**Topic:** Code execution container lifecycle, parameter syntax, and fresh container creation
+**API Version:** `anthropic-version: 2023-06-01`
+**Beta string:** `code-execution-2025-08-25` (current), `code-execution-2025-05-22` (legacy)
+**Tool type:** `code_execution_20250825` (current), `code_execution_20250522` (legacy Python-only)
+**Sources:** Official Anthropic documentation fetched 2026-02-25
+
+---
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Tool Definition and Beta Header](#tool-definition-and-beta-header)
+3. [Container Parameter — Exact Type Definition](#container-parameter--exact-type-definition)
+4. [Creating a Fresh Container (Our Overflow Problem)](#creating-a-fresh-container-our-overflow-problem)
+5. [Container Reuse (How to Opt-In)](#container-reuse-how-to-opt-in)
+6. [Container Lifecycle](#container-lifecycle)
+7. [Response Shape — container.id](#response-shape--containerid)
+8. [Error Codes — Container-Specific](#error-codes--container-specific)
+9. [Current codeExecutionBridge.js Behavior](#current-codeexecutionbridgejs-behavior)
+10. [Fix for 200K Token Overflow](#fix-for-200k-token-overflow)
+11. [Model Compatibility](#model-compatibility)
+12. [Key Findings Summary](#key-findings-summary)
+13. [References](#references)
+
+---
+
+## Overview
+
+The Anthropic code execution tool runs Python and Bash in a secure, sandboxed container (Linux x86_64, Python 3.11.12, 5GiB RAM, 5GiB disk, no network). Each `messages.create()` call operates within a container. By default, **omitting the `container` parameter allocates a brand-new container** for that request. Passing a prior container ID via the `container` parameter **opts into reuse**, which maintains filesystem state across calls.
+
+The 200K token overflow in our test suite was initially suspected to come from sequential test calls accumulating stdout and state inside a **reused container**: containers persist for 30 days, and passing the same container ID inherits all prior filesystem state. The analysis below rules this out. The bridge already omits `container` on every top-level `runPythonAnalysis()` call, so each invocation gets a fresh container.
+
+**The overflow is a context-window problem: the `messages` array grows across error-retry turns. The fix is to cap turns or truncate retry output, not to change the `container` parameter.**
+
+---
+
+## Tool Definition and Beta Header
+
+### Tool type
+
+```json
+{
+  "type": "code_execution_20250825",
+  "name": "code_execution"
+}
+```
+
+No additional parameters required on the tool object itself.
+
+### Beta header
+
+For basic code execution (no Skills, no Files API), **no beta header is required** as of the current docs. The tool activates via the tool type alone:
+
+```bash
+curl https://api.anthropic.com/v1/messages \
+    --header "x-api-key: $ANTHROPIC_API_KEY" \
+    --header "anthropic-version: 2023-06-01" \
+    --header "content-type: application/json" \
+    --data '{
+        "model": "claude-sonnet-4-5-20250929",
+        "max_tokens": 4096,
+        "messages": [{"role": "user", "content": "..."}],
+        "tools": [{"type": "code_execution_20250825", "name": "code_execution"}]
+    }'
+```
+
+If using the SDK's `client.beta.messages.create()` path (which our bridge does), the beta header string `"code-execution-2025-08-25"` is passed in `betas: [...]`. This is consistent with current bridge behavior.
+
+> **Note (from current codeExecutionBridge.js, line 18):**
+> ```js
+> const CODE_EXECUTION_BETA = 'code-execution-2025-08-25';
+> ```
+> This is correct. No change needed.
+
+### Legacy version
+
+| Component | Legacy | Current |
+|-----------|--------|---------|
+| Beta header | `code-execution-2025-05-22` | `code-execution-2025-08-25` |
+| Tool type | `code_execution_20250522` | `code_execution_20250825` |
+| Capabilities | Python only | Bash + file operations |
+| Response type | `code_execution_result` | `bash_code_execution_result` + `text_editor_code_execution_result` |
+
+---
+
+## Container Parameter — Exact Type Definition
+
+Source: Python SDK `src/anthropic/types/beta/message_create_params.py` and TypeScript SDK `src/resources/beta/messages/messages.ts`, fetched 2026-02-25.
+
+### Python SDK type alias
+
+```python
+Container: TypeAlias = Union[BetaContainerParams, str]
+```
+
+Used in `MessageCreateParamsBase` as:
+
+```python
+container: Optional[Container]
+# "Container identifier for reuse across requests."
+```
+
+### BetaContainerParams TypedDict (Python)
+
+```python
+class BetaContainerParams(TypedDict, total=False):
+    id: Optional[str]         # Container ID from a previous response
+    skills: Optional[Iterable[BetaSkillParams]]  # Agent Skills to load
+```
+
+### TypeScript SDK interface
+
+```typescript
+// In BetaMessageCreateParamsBase:
+container?: BetaContainerParams | string | null;
+
+// BetaContainerParams interface:
+export interface BetaContainerParams {
+  id?: string | null;
+  skills?: Array<BetaSkillParams> | null;
+}
+```
+
+### Valid shapes for the `container` parameter
+
+| Shape | Effect |
+|-------|--------|
+| Omitted / `undefined` | **Fresh container allocated for this request** |
+| `null` | No container specified (treated as fresh) |
+| `"container_id_string"` | **Reuses the named container** — inherits all filesystem state |
+| `{ id: "container_id_string" }` | Reuses the named container (object form) |
+| `{ skills: [...] }` | Creates new container with Agent Skills loaded |
+| `{ id: "...", skills: [...] }` | Reuses container AND loads Skills |
+
+> **Critical:** There is no `"type": "new"` or `"force_fresh": true` creation config object. The only way to force a fresh container is to **omit the `container` parameter** or pass `null`.
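+
+Since omission is the only "fresh container" signal, request construction can make the choice explicit. A minimal sketch, assuming a hypothetical `buildCreateParams` helper (it is not part of the SDK or the bridge):
+
+```javascript
+// Hypothetical helper: returns request params that either reuse
+// `containerId` or guarantee a fresh container by omitting the key.
+function buildCreateParams(baseParams, containerId) {
+  // Drop any stale `container` carried in baseParams so that omission
+  // is a deliberate decision rather than an accident.
+  const { container: _stale, ...rest } = baseParams;
+  return containerId ? { ...rest, container: containerId } : rest;
+}
+
+const base = {
+  model: 'claude-sonnet-4-5-20250929',
+  max_tokens: 4096,
+  tools: [{ type: 'code_execution_20250825', name: 'code_execution' }],
+  messages: [{ role: 'user', content: '...' }],
+};
+
+const freshParams = buildCreateParams(base);                     // no `container` key
+const reuseParams = buildCreateParams(base, 'container_id_string'); // string form
+```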
+
+---
+
+## Creating a Fresh Container (Our Overflow Problem)
+
+### The problem
+
+Our test suite runs multiple sequential `runPythonAnalysis()` calls. Because the Anthropic sandbox persists containers for 30 days, and because the `pause_turn` continuation loop at lines 128–139 of `codeExecutionBridge.js` passes `container: containerId`, if a `containerId` from a prior test is inadvertently reused, accumulated stdout overwhelms the 200K token context limit.
+
+However, looking at the current code more carefully:
+
+- The outer multi-turn loop (lines 102–203) calls `messages.create()` without a `container` parameter on turn 1 (line 106). This correctly allocates a **fresh container per top-level call**.
+- The `containerId` is only used inside the `pause_turn` inner loop (line 134) to re-bind to the **same container within a single call**. This is intentional — continuations for a paused turn should stay in the same container.
+- The `messages` array (line 92) is rebuilt fresh for each `runPythonAnalysis()` invocation. No container ID leaks between invocations.
+
+### Root cause of the 200K overflow in tests
+
+The sandbox documentation states: "Containers are scoped to the workspace of the API key." All `runPythonAnalysis()` calls from the same API key that do NOT pass a `container` parameter will receive **a new container each time** — but the container is not immediately destroyed. The 200K limit is a **context window** limit on the Messages API turn, not a container state limit. Sequential test calls that each produce large stdout outputs will overflow the 200K limit when the full conversation history (including all prior `bash_code_execution_tool_result` blocks) is passed back on subsequent turns.
+
+### Solution options
+
+**Option A (Recommended): Truncate prior outputs before error-retry turns**
+
+The outer `for` loop appends to `messages` on error retry (lines 169–175). For very large outputs, truncate or summarize prior outputs before appending them. For tests, the simplest fix is to cap turns at 1 (Option B).
+
+**Option B: Explicit single-turn mode for testing**
+
+Set `MAX_TURNS = 1` in test runs via env var:
+
+```js
+const MAX_TURNS = parseInt(process.env.CODE_EXECUTION_MAX_TURNS || '3', 10);
+```
+
+**Option C: The container parameter does NOT solve this**
+
+Since the 200K overflow is a context window issue (too many tokens in the `messages` array passed to `messages.create()`), not a container state issue, explicitly forcing a new container via omitting `container` won't help. Each new call already gets a fresh container.
+
+---
+
+## Container Reuse (How to Opt-In)
+
+From the official docs example:
+
+```javascript
+// First request — no container parameter = fresh container
+const response1 = await anthropic.messages.create({
+  model: "claude-sonnet-4-5-20250929",
+  max_tokens: 4096,
+  messages: [{ role: "user", content: "Write a random number to /tmp/number.txt" }],
+  tools: [{ type: "code_execution_20250825", name: "code_execution" }]
+});
+
+// Extract container ID from response
+const containerId = response1.container.id;
+
+// Second request — pass container ID to reuse filesystem state
+const response2 = await anthropic.messages.create({
+  container: containerId,  // string form
+  // OR: container: { id: containerId },  // object form
+  model: "claude-sonnet-4-5-20250929",
+  max_tokens: 4096,
+  messages: [{ role: "user", content: "Read /tmp/number.txt and square the number" }],
+  tools: [{ type: "code_execution_20250825", name: "code_execution" }]
+});
+```
+
+In curl:
+
+```bash
+# First request
+curl https://api.anthropic.com/v1/messages \
+    --header "x-api-key: $ANTHROPIC_API_KEY" \
+    --header "anthropic-version: 2023-06-01" \
+    --header "content-type: application/json" \
+    --data '{ "model": "...", "tools": [{"type": "code_execution_20250825", "name": "code_execution"}], ... }' \
+    > response1.json
+
+CONTAINER_ID=$(jq -r '.container.id' response1.json)
+
+# Second request — reuse container
+curl https://api.anthropic.com/v1/messages \
+    --header "x-api-key: $ANTHROPIC_API_KEY" \
+    --header "anthropic-version: 2023-06-01" \
+    --header "content-type: application/json" \
+    --data "{\"container\": \"$CONTAINER_ID\", ...}"
+```
+
+---
+
+## Container Lifecycle
+
+| Property | Value |
+|----------|-------|
+| Expiration | **30 days after creation** |
+| Scope | Per API key workspace |
+| Creation | Automatic on first request without container ID |
+| Deletion | No public API to explicitly delete a container |
+| Inactivity timeout | Not documented (internal Anthropic policy) |
+| Max containers | Not documented |
+
+### Container error codes
+
+When a container is unavailable or expired, tool results return error shapes:
+
+```json
+{
+  "type": "bash_code_execution_tool_result",
+  "tool_use_id": "srvtoolu_...",
+  "content": {
+    "type": "bash_code_execution_tool_result_error",
+    "error_code": "container_expired"
+  }
+}
+```
+
+Full error code table:
+
+| Error Code | Condition |
+|------------|-----------|
+| `unavailable` | Tool temporarily unavailable |
+| `execution_time_exceeded` | Exceeded maximum execution time limit |
+| `container_expired` | Container is older than 30 days or was recycled |
+| `invalid_tool_input` | Invalid parameters to the tool |
+| `too_many_requests` | Rate limit exceeded |
+| `file_not_found` | (text_editor only) File does not exist |
+| `string_not_found` | (text_editor only) `old_str` not found in str_replace |
+
+---
+
+## Response Shape — container.id
+
+The API response object includes a `container` field at the top level of the messages response:
+
+```typescript
+// Response type (BetaMessage)
+{
+  id: string;
+  type: "message";
+  role: "assistant";
+  content: ContentBlock[];
+  model: string;
+  stop_reason: "end_turn" | "max_tokens" | "stop_sequence" | "tool_use" | "pause_turn";
+  container: {
+    id: string;          // Container identifier for reuse
+    expires_at: string;  // ISO 8601 datetime, 30 days from creation
+    skills: BetaSkill[] | null;  // Skills loaded (null if none)
+  };
+  usage: { input_tokens: number; output_tokens: number; ... };
+}
+```
+
+> **Note:** `response.container` is always present when code execution runs. Access the ID via `response.container?.id` — the `?.` guard is defensive for non-code-execution requests that lack the field.
+
+---
+
+## Error Codes — Container-Specific
+
+See the error table above. The `container_expired` error means the container referenced by the passed `container` ID no longer exists. Recovery: omit `container` on the next call to get a fresh one.
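+
+The recovery path can be sketched as below, assuming the error arrives in the tool-result shape shown under "Container error codes". `hasContainerExpired` and `createWithContainerFallback` are hypothetical names, and `sendRequest` stands in for whatever performs the call (for example `(p) => client.beta.messages.create(p)`).
+
+```javascript
+// True if any tool-result block reports `container_expired`.
+// ASSUMPTION: the error uses the *_tool_result / *_tool_result_error shape above.
+function hasContainerExpired(response) {
+  return (response.content ?? []).some(
+    (block) =>
+      block.type?.endsWith('_tool_result') &&
+      block.content?.type?.endsWith('_tool_result_error') &&
+      block.content?.error_code === 'container_expired'
+  );
+}
+
+// Sends the request; if a reused container turned out to be expired,
+// retries once without `container` so a fresh one is allocated.
+async function createWithContainerFallback(sendRequest, params) {
+  const response = await sendRequest(params);
+  if (!params.container || !hasContainerExpired(response)) return response;
+  const { container: _expired, ...freshParams } = params;
+  return sendRequest(freshParams);
+}
+```
+
+Any files written to the expired container are lost, so the retried request must not depend on prior filesystem state.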
+
+---
+
+## Current codeExecutionBridge.js Behavior
+
+Reviewing `/Users/ej/Super-Legal/super-legal-mcp-refactored/src/tools/codeExecutionBridge.js`:
+
+| Aspect | Current Behavior | Correct per Docs? |
+|--------|-----------------|-------------------|
+| Fresh container per invocation | Yes — `container` parameter omitted on turn 1 (line 106) | Correct |
+| pause_turn continuation binds to same container | Yes — `container: containerId` on line 134 | Correct |
+| containerId scoped to single invocation | Yes — declared inside outer `for` loop scope | Correct |
+| Beta string | `code-execution-2025-08-25` | Correct |
+| Tool type | `code_execution_20250825` | Correct |
+| Uses `client.beta.messages.create()` | Yes | Correct |
+| Response parsing for `bash_code_execution_tool_result` | Yes, primary type | Correct |
+| Container ID extraction | `response.container?.id` (line 118) | Correct |
+
+### The `pause_turn` continuation (lines 124–140)
+
+```javascript
+// Capture container ID for explicit binding in continuations.
+const containerId = response.container?.id;
+
+let pauseCount = 0;
+while (response.stop_reason === 'pause_turn' && pauseCount < MAX_PAUSE_CONTINUATIONS) {
+  pauseCount++;
+  response = await client.beta.messages.create({
+    model: DEFAULT_MODEL,
+    betas: [CODE_EXECUTION_BETA],
+    max_tokens: MAX_TOKENS,
+    system: SYSTEM_PROMPT,
+    tools: [{ type: CODE_EXECUTION_TOOL_TYPE, name: 'code_execution' }],
+    ...(containerId ? { container: containerId } : {}),  // Bind to same container
+    messages: [
+      { role: 'user', content: userPrompt },
+      { role: 'assistant', content: response.content }
+    ]
+  });
+}
+```
+
+This is correct per Anthropic docs — a `pause_turn` continuation must reuse the same container to access the paused execution state.
+
+---
+
+## Fix for 200K Token Overflow
+
+The 200K overflow in sequential tests is caused by the **messages array growing across error-retry turns** (outer loop lines 169–175), not by container state accumulation. Each new `runPythonAnalysis()` call gets a fresh container.
+
+### Minimal fix for test isolation
+
+Add a configurable max turns env override and cap stdout in the messages passed back on retry:
+
+```javascript
+// In codeExecutionBridge.js — proposed change for test stability
+const MAX_TURNS = parseInt(process.env.CODE_EXECUTION_MAX_TURNS || '3', 10);
+const MAX_RETRY_STDOUT_CHARS = 2000; // Cap error context passed back to Claude
+
+// When retrying on error, truncate stderr/stdout in the retry message:
+messages.push({
+  role: 'user',
+  content: `The code produced an error:\n\`\`\`\n${(extracted.stderr ?? '').slice(0, MAX_RETRY_STDOUT_CHARS)}\n\`\`\`\n\nPlease fix and retry.`
+});
+```
+
+For tests specifically, set `CODE_EXECUTION_MAX_TURNS=1` in the test environment to prevent multi-turn accumulation.
+
+### If the overflow is actually from container state (cross-invocation)
+
+If investigation reveals the overflow comes from container state being reused across multiple `runPythonAnalysis()` calls (which should not happen given the current code), the explicit fix is to confirm `container` is never passed on the outer call. Current code on line 106 confirms this is already correct:
+
+```javascript
+let response = await client.beta.messages.create({
+  model: DEFAULT_MODEL,
+  betas: [CODE_EXECUTION_BETA],
+  max_tokens: MAX_TOKENS,
+  system: SYSTEM_PROMPT,
+  tools: [{ type: CODE_EXECUTION_TOOL_TYPE, name: 'code_execution' }],
+  messages    // No container parameter — fresh container every time
+});
+```
+
+---
+
+## Model Compatibility
+
+| Model | Tool Version |
+|-------|-------------|
+| `claude-opus-4-6` | `code_execution_20250825` |
+| `claude-sonnet-4-6` | `code_execution_20250825` |
+| `claude-sonnet-4-5-20250929` | `code_execution_20250825` |
+| `claude-opus-4-5-20251101` | `code_execution_20250825` |
+| `claude-haiku-4-5-20251001` | `code_execution_20250825` |
+| Claude Haiku 3.5 (deprecated) | `code_execution_20250825` |
+
+> Our bridge defaults to `claude-sonnet-4-5-20250929` via `CODE_EXECUTION_MODEL` env var. This is supported.
+
+---
+
+## Key Findings Summary
+
+1. **There is no "force fresh container" creation config object.** The only way to get a fresh container is to omit the `container` parameter entirely (or pass `null`). Our bridge already does this correctly on each top-level invocation.
+
+2. **`container` accepts either a plain string (container ID) or a `BetaContainerParams` object `{ id?, skills? }`.** The object form is used for Agent Skills. For basic code execution with no Skills, the string form or omission is sufficient.
+
+3. **Container expiry is 30 days** from creation. The `container_expired` error code is returned if the container has expired.
+
+4. **There is no public API to explicitly delete containers.** They expire automatically.
+
+5. **The `pause_turn` continuation must reuse the same container** — our bridge does this correctly via `container: containerId` in the inner while loop.
+
+6. **The 200K token overflow in sequential tests is a context window problem, not a container state problem.** The `messages` array grows with each error-retry turn. Fix: cap `MAX_TURNS=1` for tests, or truncate stdout/stderr in retry messages.
+
+7. **Beta header `code-execution-2025-08-25` is correct** for the current `code_execution_20250825` tool type. No change needed.
+
+8. **`response.container.id` is the correct path** to extract the container ID from a response. The `expires_at` field is also present for TTL-aware management.
+
+---
+
+## References
+
+- [Code execution tool - Claude API Docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool) — Official documentation, fetched 2026-02-25
+- [Agent Skills API guide](https://platform.claude.com/docs/en/build-with-claude/skills-guide) — Container parameter for Skills integration, fetched 2026-02-25
+- [Python SDK message_create_params.py](https://github.com/anthropics/anthropic-sdk-python/blob/main/src/anthropic/types/beta/message_create_params.py) — Container type alias definition
+- [Python SDK beta_container_params.py](https://raw.githubusercontent.com/anthropics/anthropic-sdk-python/main/src/anthropic/types/beta/beta_container_params.py) — BetaContainerParams TypedDict
+- [Python SDK beta_container.py](https://raw.githubusercontent.com/anthropics/anthropic-sdk-python/main/src/anthropic/types/beta/beta_container.py) — BetaContainer response type
+- [TypeScript SDK messages.ts (beta)](https://github.com/anthropics/anthropic-sdk-typescript/blob/main/src/resources/beta/messages/messages.ts) — TypeScript interface for container parameter
+- [Anthropic code execution with MCP (engineering blog)](https://www.anthropic.com/engineering/code-execution-with-mcp) — Architecture context
+
+---
+
+## Addendum — Avenue A v2 compatibility verified (2026-05-16, PR #135)
+
+**`output_config` (JSON-schema structured output enforcement) is compatible with `code_execution_20260120` + streaming + `pause_turn` continuations** when used WITHOUT PTC's `allowed_callers` pattern.
+
+Verified empirically via PR #135's L2 + L4 + L5 layered validation against `claude-sonnet-4-6`:
+
+- L2 smoke ($0.05 single API call): `output_config + code_execution_20260120 + streaming` accepted without error
+- L4 paired live render ($1.50): 5/5 phases delivered with `output_config` engaged on every API call (initial + pause_turn continuations); `envelope_source` distribution showed `text`, `merged:text+stdout`, and `stdout` paths all engaging as expected
+- L5 cross-caller MCP test ($0.05): non-xlsx caller (`run_python_analysis` MCP tool) succeeded with `turn_count=1, envelope_source='text'`
+
+**Key empirical constraint**: large binary payloads (`b64_xlsx` ~25KB) should not be enforced into the model's `text` block. The base64 string plus narration can exhaust the `max_tokens=32000` budget and truncate the envelope. The bridge's xlsx schema (`ENVELOPE_SCHEMA_XLSX` in `codeExecutionBridge.js`) was scoped down to enforce ONLY the audit metadata in text; binary continues via `bash_code_execution_tool_result.content.stdout` (the natural channel for large output). See `docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md` §11 for the full schema-constraint catalog discovered during PR #135.
+
+**No documented incompatibility** between `output_config` and:
+- Container reuse (`container: <containerId>` param on subsequent calls): verified compatible
+- `pause_turn` server-side iteration limit (5 continuations max in bridge): verified compatible — schema enforcement carries through the iteration loop
+- `cache_control: { type: 'ephemeral' }` on system prompt: verified compatible — `output_config` is a separate param, doesn't affect cache key
+
+**Container lifecycle implications**: `output_config` enforcement happens at the response-shape layer (model's text-block output), AFTER all code-execution + bash + text_editor tool cycles complete within the turn. Container lifecycle (creation, file persistence, expiry) is unaffected.
diff --git a/super-legal-mcp-refactored/src/utils/sdkMetrics.js b/super-legal-mcp-refactored/src/utils/sdkMetrics.js
index d9aa95984..5f04c8a72 100644
--- a/super-legal-mcp-refactored/src/utils/sdkMetrics.js
+++ b/super-legal-mcp-refactored/src/utils/sdkMetrics.js
@@ -427,7 +427,7 @@ export function observeRenderPhaseDuration(templateId, phase, seconds) {
 // Phase 9: per-phase failure counter. Surfaces "which phase fails most
 // often for which template, with what reason" — actionable for prompt
 // engineering OR phase-split adjustment.
-// Cardinality: 5 templates × 3 phases × 6 reasons = 90 series.
+// Cardinality: 5 templates × 3-6 phases × 6 reasons = ~90-180 series.
 export const xlsxRenderPhaseFailures = new client.Counter({
   name: 'claude_xlsx_render_phase_failures_total',
   help: 'Multi-turn render phase failures by template, phase, and bounded reason',
@@ -437,6 +437,48 @@ export function recordXlsxRenderPhaseFailure(templateId, phase, reason) {
   try { xlsxRenderPhaseFailures.inc({ template_id: templateId, phase, reason }); } catch {}
 }
 
+// Avenue A v2 efficacy gauge (PR forthcoming, 2026-05-16): per-phase turn-1
+// envelope outcome observed by the multi-turn orchestrator after each
+// runAnalysis() call. Discriminates:
+//   - structured_output: 'on'|'off' — STRUCTURED_OUTPUT_ENFORCEMENT flag state
+//     at observation time (lets operators measure efficacy across flag flips)
+//   - envelope_source: 'parsed_output'|'text'|'stdout'|'merged'|'none' — which
+//     extraction path won (set by codeExecutionBridge.js:selectEnvelopeWithFallback)
+//   - turn_outcome: 'first_turn'|'retry' — whether the envelope arrived on
+//     turn 1 (the Avenue A v2 target) or required a turn-2+ corrective retry
+//
+// Operational queries (added to PR body + ops dashboard):
+//   sum by (structured_output) (rate(...{turn_outcome="first_turn"}[1h]))
+//    / sum by (structured_output) (rate(...[1h]))
+//   → turn-1 success rate per flag state. Target: structured_output="on"
+//     rate ≥ baseline ("off") rate; ideally substantially higher.
+//
+//   sum by (envelope_source) (rate(...{structured_output="on"}[1h]))
+//   → confirms text/parsed_output/merged paths are actually engaging when
+//     the flag is on (vs. the bridge silently falling through to stdout).
+//
+// Cardinality: 5 templates × 3-6 phases × 2 flag states × 5 sources × 2 outcomes
+//            = ~300-600 series (well within Prometheus budget).
+export const xlsxRenderTurn1EnvelopeSuccess = new client.Counter({
+  name: 'claude_xlsx_render_turn1_envelope_success_total',
+  help: 'Per-phase turn-1 envelope outcome (Avenue A v2 efficacy gauge). Labels: '
+      + 'template_id, phase, structured_output (on|off), envelope_source '
+      + '(parsed_output|text|stdout|merged|none|merged:<source>+stdout), '
+      + 'turn_outcome (first_turn|retry).',
+  labelNames: ['template_id', 'phase', 'structured_output', 'envelope_source', 'turn_outcome'],
+});
+export function recordXlsxRenderTurn1Envelope(templateId, phase, structuredOutputOn, envelopeSource, turnCount) {
+  try {
+    xlsxRenderTurn1EnvelopeSuccess.inc({
+      template_id: templateId,
+      phase,
+      structured_output: structuredOutputOn ? 'on' : 'off',
+      envelope_source: envelopeSource || 'none',
+      turn_outcome: (turnCount > 1) ? 'retry' : 'first_turn',
+    });
+  } catch { /* metric is best-effort; never throws */ }
+}
+
 // Phase 8 (closes Phase 7 Issue #2 LLM-noncompliance gap): every render
 // where the sandbox produced ZERO BLUE-colored cells. The Phase 7 prompt
 // enrichment didn't change LLM behavior in live testing — the LLM still
diff --git a/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js b/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js
index 1593cc157..e9c65c35c 100644
--- a/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js
+++ b/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js
@@ -112,11 +112,28 @@ export async function renderMultiTurn({ template, inputs, sessionId, runAnalysis
 
     // Per-phase metrics — best-effort, never throws.
     try {
-      const { observeRenderPhaseDuration, recordXlsxRenderPhaseFailure } = await import('../sdkMetrics.js');
+      const {
+        observeRenderPhaseDuration,
+        recordXlsxRenderPhaseFailure,
+        recordXlsxRenderTurn1Envelope,
+      } = await import('../sdkMetrics.js');
       observeRenderPhaseDuration(template.id, phaseKey, phaseDurationSec);
       if (!phaseResult.success) {
         recordXlsxRenderPhaseFailure(template.id, phaseKey, classifyPhaseFailure(phaseResult));
       }
+      // Avenue A v2 efficacy counter: record the turn-1 envelope outcome per phase.
+      // Inputs sourced from bridge's finalResult (envelope_source set by
+      // codeExecutionBridge.js:selectEnvelopeWithFallback; turn_count rolled
+      // through the bridge's per-turn loop). featureFlags read lazily so test
+      // env overrides take effect.
+      const { featureFlags } = await import('../../config/featureFlags.js');
+      recordXlsxRenderTurn1Envelope(
+        template.id,
+        phaseKey,
+        !!featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT,
+        phaseResult.envelope_source,
+        phaseResult.turn_count || 1,
+      );
     } catch (mErr) {
       logError('xlsx_phase_metric_emit_failed', { error: mErr.message, phaseKey });
     }
diff --git a/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js b/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js
index e32b9ead0..b84e19dea 100644
--- a/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js
+++ b/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js
@@ -802,6 +802,55 @@ async function testT17_MultiTurnHappyPath() {
   assert(xr.rows[0]?.render_status === 'completed', 'T17: DB render_status=completed');
   assert(xr.rows[0]?.s === 'PASS', 'T17: DB audit_results.status=PASS');
 
+  // Avenue A v2 telemetry — confirm xlsxRenderTurn1EnvelopeSuccess counter
+  // emitted for every phase. The orchestrator increments after each
+  // runAnalysis() return (multiTurnOrchestrator.js:113-135). Counter labels
+  // {template_id, phase, structured_output, envelope_source, turn_outcome}
+  // are observable via the Prometheus client's internal registry.
+  const { xlsxRenderTurn1EnvelopeSuccess } = await import('../../src/utils/sdkMetrics.js');
+  const counterValues = await xlsxRenderTurn1EnvelopeSuccess.get();
+  const fdwSeries = counterValues.values.filter(
+    (v) => v.labels.template_id === 'full-deal-workbook'
+  );
+  assert(
+    fdwSeries.length >= expectedPhaseCount,
+    `T17 (Avenue A v2 telemetry): xlsxRenderTurn1EnvelopeSuccess emitted for `
+      + `≥${expectedPhaseCount} phases (got ${fdwSeries.length} series)`,
+  );
+  // Each phase should have at least one observation
+  const phasesObserved = new Set(fdwSeries.map((v) => v.labels.phase));
+  assert(
+    phasesObserved.size === expectedPhaseCount,
+    `T17 (Avenue A v2 telemetry): all ${expectedPhaseCount} phases observed `
+      + `(got ${phasesObserved.size}: ${[...phasesObserved].sort().join(',')})`,
+  );
+  // turn_outcome labels are bounded enum {first_turn, retry}
+  const turnOutcomes = new Set(fdwSeries.map((v) => v.labels.turn_outcome));
+  for (const outcome of turnOutcomes) {
+    assert(
+      outcome === 'first_turn' || outcome === 'retry',
+      `T17 (Avenue A v2 telemetry): turn_outcome bounded enum (got '${outcome}')`,
+    );
+  }
+  // structured_output labels are bounded enum {on, off}
+  const flagStates = new Set(fdwSeries.map((v) => v.labels.structured_output));
+  for (const state of flagStates) {
+    assert(
+      state === 'on' || state === 'off',
+      `T17 (Avenue A v2 telemetry): structured_output bounded enum (got '${state}')`,
+    );
+  }
+  // envelope_source labels are bounded enum (incl. merged:* compound forms)
+  const sources = [...new Set(fdwSeries.map((v) => v.labels.envelope_source))];
+  const validSources = sources.filter(
+    (s) => ['parsed_output', 'text', 'stdout', 'merged', 'none'].includes(s)
+        || s.startsWith('merged:'),
+  );
+  assert(
+    sources.length === validSources.length,
+    `T17 (Avenue A v2 telemetry): envelope_source bounded enum (got '${sources.join(',')}')`,
+  );
+
   await seeded.cleanup();
 }