diff --git a/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md b/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md new file mode 100644 index 000000000..ccfad0b53 --- /dev/null +++ b/super-legal-mcp-refactored/docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md @@ -0,0 +1,643 @@ +# Anthropic Claude Agent SDK: Best Practices Research + +**Created**: 2026-02-22 +**Topic**: Claude Agent SDK patterns — subagent tool access, code execution integration, tool scoping, input_examples, and agent-to-agent delegation +**Versions covered**: @anthropic-ai/claude-agent-sdk 0.2.x, @anthropic-ai/sdk 0.74.x, advanced-tool-use-2025-11-20 beta +**Primary sources**: platform.claude.com/docs/en/agent-sdk, code.claude.com/docs, anthropic.com/engineering (all verified February 2026) + +--- + +## Table of Contents + +1. [Overview and Key Findings](#overview-and-key-findings) +2. [How Subagents Access Tools: Direct vs. Delegation](#how-subagents-access-tools-direct-vs-delegation) +3. [Code Execution Integration Patterns](#code-execution-integration-patterns) +4. [Tool Scoping Best Practices](#tool-scoping-best-practices) +5. [input_examples on Tool Definitions](#input_examples-on-tool-definitions) +6. [Agent-to-Agent Delegation vs. Direct Tool Invocation](#agent-to-agent-delegation-vs-direct-tool-invocation) +7. [SDK API Reference Highlights](#sdk-api-reference-highlights) +8. [Relevance to Super-Legal Architecture](#relevance-to-super-legal-architecture) +9. [References](#references) + +--- + +## Overview and Key Findings + +Anthropic has consolidated and significantly expanded the Claude Agent SDK documentation between mid-2025 and February 2026 (the Claude Code SDK was renamed to the Claude Agent SDK). The documentation is authoritative and prescriptive about patterns. 
Key findings: + +| Question | Anthropic's Answer | +|:---------|:-----------------| +| Direct tool access vs. delegation? | **Direct tool access is canonical.** Subagents call tools themselves; they do not recommend delegation to another agent. | +| Code execution invocation pattern? | **Programmatic Tool Calling** (`allowed_callers: ["code_execution_20260120"]`) — Claude writes Python that calls tools in a sandbox. | +| Tool scoping best practices? | Use `tools` allowlist on each `AgentDefinition`. Omitting the field inherits all tools (not recommended for focused agents). | +| `input_examples` pattern? | Top-level field on tool definitions, array of valid input objects. Improves accuracy 72% → 90% on complex params (Anthropic internal data). | +| Agent-to-agent vs. direct invocation? | Avoid recursive delegation. Orchestrator delegates to workers; workers call tools directly. No subagent-to-subagent spawning. | + +--- + +## How Subagents Access Tools: Direct vs. Delegation + +### The Canonical Pattern: Subagents Call Tools Directly + +Anthropic's documentation is unambiguous. A subagent is a scoped agent with its own context window and a restricted tool set. When a task is delegated to it, the subagent **executes the task autonomously using its own tools** — it does not recommend that the orchestrator delegate to a different agent. + +From the [Agent SDK subagents page](https://platform.claude.com/docs/en/agent-sdk/subagents): + +> "Subagents maintain separate context from the main agent, preventing information overload and keeping interactions focused." + +> "A `doc-reviewer` subagent might only have access to Read and Grep tools, ensuring it can analyze but never accidentally modify your documentation files." + +The `tools` field on `AgentDefinition` is an explicit allowlist. If a subagent has `tools: ["Read", "Grep", "Glob"]`, it will **use those tools directly**. It will not forward work to another agent. 
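The allowlist semantics described above can be sketched in plain TypeScript. This is a minimal model under stated assumptions, not the SDK's implementation: `AgentDefinitionSketch`, `effectiveTools`, and `canCall` are hypothetical names, and the sketch assumes an omitted `tools` field simply copies the parent's tool list.

```typescript
// Hypothetical stand-in for the SDK's AgentDefinition (only the fields needed here).
type AgentDefinitionSketch = {
  description: string;
  prompt: string;
  tools?: string[]; // explicit allowlist; omitted means "inherit from parent"
};

// Resolve the set of tools a subagent can call directly.
function effectiveTools(agent: AgentDefinitionSketch, parentTools: string[]): string[] {
  // Assumption: an omitted `tools` field inherits every parent tool.
  if (agent.tools === undefined) return parentTools.slice();
  return agent.tools.slice();
}

// True when the tool is in the subagent's effective tool set.
function canCall(agent: AgentDefinitionSketch, parentTools: string[], tool: string): boolean {
  return effectiveTools(agent, parentTools).indexOf(tool) !== -1;
}

const parentTools = ["Read", "Grep", "Glob", "Edit", "Bash"];

const docReviewer: AgentDefinitionSketch = {
  description: "Reviews documentation without modifying it",
  prompt: "You are a documentation reviewer.",
  tools: ["Read", "Grep", "Glob"],
};

const unscoped: AgentDefinitionSketch = {
  description: "General helper",
  prompt: "You are a general helper.",
};

console.log(canCall(docReviewer, parentTools, "Edit")); // false
console.log(canCall(unscoped, parentTools, "Edit")); // true
```

The scoped reviewer cannot reach `Edit`, while the unscoped agent inherits it from the parent, which is why the explicit allowlist is the safer default for focused agents.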
+ +### How Subagent Invocation Works (Task Tool) + +The orchestrator (main agent) invokes subagents via the `Task` tool. The orchestrator must have `Task` in its `allowedTools`. The subagent does not have `Task` in its tools — this is explicitly called out: + +> **"Subagents cannot spawn their own subagents. Don't include `Task` in a subagent's `tools` array."** +> — [platform.claude.com/docs/en/agent-sdk/subagents](https://platform.claude.com/docs/en/agent-sdk/subagents) + +This is the clearest statement of the pattern: the delegation chain is exactly **one level deep**. The orchestrator delegates once; the worker executes directly. + +### TypeScript Example (Canonical Pattern) + +```typescript +import { query } from "@anthropic-ai/claude-agent-sdk"; + +for await (const message of query({ + prompt: "Review the authentication module for security issues", + options: { + // Orchestrator has Task to delegate, plus its own tools + allowedTools: ["Read", "Grep", "Glob", "Task"], + agents: { + "code-reviewer": { + description: "Expert code review specialist. Use for quality, security, and maintainability reviews.", + prompt: "You are a code review specialist. Analyze code quality and suggest improvements.", + // Subagent calls these tools DIRECTLY — never delegates further + tools: ["Read", "Grep", "Glob"], + model: "sonnet" + } + } + } +})) +``` + +### Multi-Agent Research System (Anthropic Internal Implementation) + +Anthropic's [published account of their multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) confirms the pattern: + +- A **LeadResearcher orchestrator** decomposes the query and spawns 3–5 subagents in parallel +- Each subagent **independently performs web searches** and evaluates tool results +- Subagents **do not delegate to other subagents** — they execute tool calls directly +- Results flow back to the orchestrator for synthesis + +> "The lead agent spins up 3-5 subagents in parallel... 
subagents use 3+ tools in parallel... independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher." + +### Avoid Recursive Delegation + +The [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) document establishes this as a design principle: **do not have agents recommend delegation to other agents**. The value of a subagent is that it actually performs the work. An agent that says "you should use a different agent for this" is not following the delegation model — it has simply returned an unhelpful response. + +--- + +## Code Execution Integration Patterns + +### Programmatic Tool Calling (November 2025 Beta, Now Stable) + +Released under beta header `advanced-tool-use-2025-11-20`, Programmatic Tool Calling is now documented as a first-class pattern on [platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling](https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling). + +The pattern: instead of Claude invoking tools one at a time through API round-trips, Claude **writes Python code** that calls tools as functions. The code runs in a sandboxed container. Intermediate results do not enter Claude's context window — only the final code output does. + +**IMPORTANT — Two distinct code execution tool types exist:** + +| Tool Type | Purpose | Beta Header | +|:----------|:--------|:------------| +| `code_execution_20250825` | **General-purpose** code execution (Python, bash, file ops). Used by Skills' native path (`src/server/legacyStreamHandler.js:78`). | None (GA) | +| `code_execution_20260120` | **PTC-capable** code execution (Python, bash, file ops) — supports the `allowed_callers` pattern when used WITH custom tools. Can also be used as a plain code-execution tool (without `allowed_callers`). **This is what `codeExecutionBridge.js` uses** (verified at line 30). 
| None | + +These are **separate tool types**, not versions of the same tool. **CORRECTION (2026-05-16, Avenue A v2 audit)**: an earlier version of this doc stated "Super-Legal bridge correctly uses `code_execution_20250825`." That was incorrect — the bridge actually uses `code_execution_20260120` (verified at `codeExecutionBridge.js:30`). The bridge does NOT use PTC's `allowed_callers` pattern (verified — zero grep matches anywhere in `src/`), so it consumes `code_execution_20260120` as a plain code-execution tool without PTC features. PTC-specific restrictions on `strict: true` on tool inputs therefore don't apply to the bridge's use of this tool. + +**PTC-compatible models (as of February 2026):** + +| Model | PTC Tool Version | +|:------|:------------| +| claude-opus-4-6 | `code_execution_20260120` | +| claude-sonnet-4-6 | `code_execution_20260120` | +| claude-sonnet-4-5-20250929 | `code_execution_20260120` | +| claude-opus-4-5-20251101 | `code_execution_20260120` | + +### The `allowed_callers` Field + +This is the mechanism that enables programmatic tool calling. It is added to a **user-defined tool's** definition, not to the code execution tool itself. + +```json +{ + "name": "query_database", + "description": "Execute a SQL query. Returns JSON rows.", + "input_schema": { ... }, + "allowed_callers": ["code_execution_20260120"] +} +``` + +**Possible values:** +- `["direct"]` — Only Claude can call this tool through the normal API round-trip (default if field omitted) +- `["code_execution_20260120"]` — Only callable from within a code execution sandbox +- `["direct", "code_execution_20260120"]` — Callable both ways + +> **Tip from official docs**: "Choose either `["direct"]` or `["code_execution_20260120"]` for each tool rather than enabling both, as this provides clearer guidance to Claude for how best to use the tool." 
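The three `allowed_callers` configurations, and the tip about choosing one call path per tool, can be checked mechanically before a request is sent. The following is a minimal sketch: `ToolSketch`, `resolveCallers`, and `lintCallers` are hypothetical helpers, and only the caller values themselves come from the text above.

```typescript
// The two caller values named in the documentation.
type Caller = "direct" | "code_execution_20260120";

// Hypothetical, trimmed-down tool definition (real definitions also carry
// a description and an input_schema).
type ToolSketch = {
  name: string;
  allowed_callers?: Caller[];
};

// Omitting the field is equivalent to ["direct"].
function resolveCallers(tool: ToolSketch): Caller[] {
  return tool.allowed_callers === undefined ? ["direct"] : tool.allowed_callers.slice();
}

// Flag tools that enable both call paths, which the docs' tip advises against.
function lintCallers(tools: ToolSketch[]): string[] {
  const warnings: string[] = [];
  for (const tool of tools) {
    const callers = resolveCallers(tool);
    const direct = callers.indexOf("direct") !== -1;
    const programmatic = callers.indexOf("code_execution_20260120") !== -1;
    if (direct && programmatic) {
      warnings.push(`${tool.name}: enables both direct and programmatic calls`);
    }
  }
  return warnings;
}

const toolDefinitions: ToolSketch[] = [
  { name: "query_database", allowed_callers: ["code_execution_20260120"] },
  { name: "send_email" }, // defaults to ["direct"]
  { name: "lookup_case", allowed_callers: ["direct", "code_execution_20260120"] },
];

console.log(resolveCallers(toolDefinitions[1]));
console.log(lintCallers(toolDefinitions));
```

A check like this is advisory only: enabling both callers is permitted by the API, so the sketch reports a warning rather than rejecting the definition.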
+ +### Programmatic Tool Calling Request Structure + +```typescript +import { Anthropic } from "@anthropic-ai/sdk"; +const anthropic = new Anthropic(); + +const response = await anthropic.messages.create({ + model: "claude-opus-4-6", + max_tokens: 4096, + messages: [{ + role: "user", + content: "Query sales for West, East, Central regions, then find highest revenue" + }], + tools: [ + { + // Step 1: Include the code execution server tool + type: "code_execution_20260120", + name: "code_execution" + }, + { + // Step 2: User-defined tool with allowed_callers pointing to code execution + name: "query_database", + description: "Execute a SQL query. Returns JSON array of row objects.", + input_schema: { + type: "object", + properties: { sql: { type: "string", description: "SQL query to execute" } }, + required: ["sql"] + }, + allowed_callers: ["code_execution_20260120"] + } + ] +}); +``` + +When Claude responds, it writes Python code like: + +```python +# Claude-generated code running in the sandbox +results = {} +for region in ["West", "East", "Central"]: + data = await query_database(f"SELECT SUM(revenue) FROM sales WHERE region='{region}'") + results[region] = data[0]["sum"] + +top = max(results.items(), key=lambda x: x[1]) +print(f"Top region: {top[0]} with ${top[1]:,}") +``` + +The tool calls are `await`-ed — Claude writes async Python automatically. Intermediate query results never enter the context window; only the `print()` output does. 
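The context-window effect described above can be illustrated with a toy model. Nothing below calls the Anthropic API: `queryDatabase`, `directCalling`, and `programmaticCalling` are hypothetical stand-ins with fabricated-for-illustration row data, and the sketch only compares how much text each approach would hand back to the model.

```typescript
type SalesRow = { region: string; revenue: number };

// Stand-in for a tool that returns many rows per call (synthetic data).
function queryDatabase(region: string): SalesRow[] {
  const rows: SalesRow[] = [];
  for (let i = 1; i <= 100; i++) {
    rows.push({ region: region, revenue: i * region.length });
  }
  return rows;
}

// Direct calling: every tool result is appended to the model's context.
function directCalling(regions: string[]): string[] {
  const context: string[] = [];
  for (const region of regions) {
    context.push(JSON.stringify(queryDatabase(region)));
  }
  return context;
}

// Programmatic calling: rows stay inside the sandbox; only the printed
// summary is returned to the model.
function programmaticCalling(regions: string[]): string[] {
  let best = { region: "", total: -1 };
  for (const region of regions) {
    const total = queryDatabase(region).reduce((sum, row) => sum + row.revenue, 0);
    if (total > best.total) best = { region: region, total: total };
  }
  return [`Top region: ${best.region} with $${best.total}`];
}

const regions = ["West", "East", "Central"];
const directContext = directCalling(regions);
const sandboxContext = programmaticCalling(regions);

// Characters that would reach the context window under each approach.
console.log(directContext.join("").length, sandboxContext.join("").length);
console.log(sandboxContext[0]);
```

In this model the direct path returns three full result sets, while the programmatic path returns a single summary line, which mirrors the trade-off the documentation describes for aggregate-style tasks.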
+ +### When to Use Programmatic Tool Calling + +**Good use cases (from official docs):** +- Large data processing where you only need aggregates or summaries +- Multi-step workflows with 3+ dependent tool calls +- Operations requiring filtering, sorting, or transformation of results +- Tasks where intermediate data should not influence Claude's reasoning +- Parallel operations across many items (e.g., checking 50 endpoints) + +**Less ideal:** +- Single tool calls with simple responses +- Tools that need immediate user feedback +- Very fast operations where code execution overhead would outweigh the benefit + +### Constraints + +- `strict: true` on tool definitions is **not** supported with programmatic calling +- `tool_choice` forcing is **not** supported with programmatic calling +- `disable_parallel_tool_use: true` is **not** supported with programmatic calling +- MCP connector tools cannot currently be called programmatically +- Not covered by Zero Data Retention (ZDR) arrangements + +--- + +## Tool Scoping Best Practices + +### The Official Guidance: Explicit Allowlists Per Agent + +From the [subagents documentation](https://platform.claude.com/docs/en/agent-sdk/subagents): + +```typescript +// Official tool restriction pattern +agents: { + "code-analyzer": { + description: "Static code analysis and architecture review", + prompt: "Analyze code structure without making changes.", + // Read-only tools: no Edit, Write, or Bash access + tools: ["Read", "Grep", "Glob"] + } +} +``` + +The docs list canonical tool combinations by role: + +| Use Case | Tools | Description | +|:---------|:------|:------------| +| Read-only analysis | `Read`, `Grep`, `Glob` | Can examine code but not modify or execute | +| Test execution | `Bash`, `Read`, `Grep` | Can run commands and analyze output | +| Code modification | `Read`, `Edit`, `Write`, `Grep`, `Glob` | Full read/write without command execution | +| Full access | (omit `tools` field) | Inherits all tools from parent | + +### 
Omitting the `tools` Field + +If `tools` is **omitted** from an `AgentDefinition`, the subagent **inherits all available tools** from the parent. This is the default but is not recommended for focused agents. The documentation consistently shows `tools` being specified explicitly for every meaningful subagent. + +### MCP Tool Scoping with `allowedTools` + +For MCP tools, the naming convention is `mcp__<server>__<tool>`. Wildcard patterns are supported: + +```typescript +allowedTools: [ + "mcp__github__*", // All tools from the github server + "mcp__db__query", // Only the query tool from db server + "mcp__slack__send_message" // Only send_message from slack +] +``` + +When the MCP tool search feature is active (ENABLE_TOOL_SEARCH env var), tools marked with `defer_loading: true` are not preloaded — they are discovered on-demand. This auto-activates when MCP tool descriptions exceed 10% of context window. + +### The Tool Flooding Problem (Anthropic Research Finding) + +From the [advanced tool use blog post](https://www.anthropic.com/engineering/advanced-tool-use): + +> Loading hundreds of tool definitions upfront "wastes 85% of context on unused tool definitions" in typical agent runs. + +The recommended fix: the Tool Search Tool (part of the `advanced-tool-use-2025-11-20` beta) allows Claude to **discover tools on-demand**. In the Agent SDK, this is surfaced as the `ENABLE_TOOL_SEARCH` environment variable. + +For the Super-Legal architecture, this validates the SCOPED_MCP_SERVERS approach: giving each subagent only the tools it needs is exactly aligned with Anthropic's documented best practice. + +### `settingSources: []` Pattern + +From the [TypeScript SDK reference](https://platform.claude.com/docs/en/agent-sdk/typescript): + +> "When `settingSources` is omitted or undefined, the SDK does **not** load any filesystem settings. This provides isolation for SDK applications." 
+ +The Super-Legal codebase already sets `settingSources: []` in `agentQuery`, which is the correct pattern for SDK-only applications that define everything programmatically. + +--- + +## input_examples on Tool Definitions + +### Feature Origin and Documentation + +`input_examples` is part of the `advanced-tool-use-2025-11-20` beta, but it is also documented as a **standard field** in the [main tool use implementation guide](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) with no beta requirement for basic use. + +From the official docs, tool definitions now support four fields: + +| Parameter | Description | +|:----------|:------------| +| `name` | Tool name (regex: `^[a-zA-Z0-9_-]{1,64}$`) | +| `description` | Detailed description of what the tool does and when to use it | +| `input_schema` | JSON Schema object defining expected parameters | +| `input_examples` | **(Optional)** Array of example input objects | + +### Why input_examples Matters + +From the [advanced tool use documentation](https://www.anthropic.com/engineering/advanced-tool-use): + +> "JSON schemas define what's structurally valid, but can't express usage patterns: when to include optional parameters, which combinations make sense, or what conventions your API expects." + +> "Tool use examples improved accuracy from 72% to 90% on complex parameter handling." 
+ +The improvement comes from demonstrating: +- When to include optional parameters +- What realistic values look like (not schema placeholders like `"string"`) +- Parameter correlations (e.g., if field A is set, field B has a specific format) +- Domain-specific conventions not capturable in JSON Schema + +### Official Format + +```python +# Official example from platform.claude.com +tools=[ + { + "name": "get_weather", + "description": "Get the current weather in a given location", + "input_schema": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city and state, e.g. San Francisco, CA", + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "The unit of temperature", + }, + }, + "required": ["location"], + }, + "input_examples": [ + {"location": "San Francisco, CA", "unit": "fahrenheit"}, + {"location": "Tokyo, Japan", "unit": "celsius"}, + { + "location": "New York, NY" # Demonstrates that 'unit' is optional + }, + ], + } +] +``` + +### Constraints on input_examples + +- Each example **must be valid** according to the tool's `input_schema` — invalid examples return a 400 error +- **Not supported for server-side tools** — only user-defined tools can have input examples +- **Token cost**: ~20–50 tokens for simple examples, ~100–200 tokens for complex nested objects +- 1–5 examples per tool is the recommended range + +### When to Add input_examples (Official Guidance) + +The docs are explicit that descriptions take priority: + +> "Prioritize descriptions, but consider using `input_examples` for complex tools. Clear descriptions are most important, but for tools with complex inputs, nested objects, or format-sensitive parameters, you can use the `input_examples` field." 
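Because an example that violates the tool's `input_schema` is rejected with a 400, a local pre-flight check can catch mismatches before the request is sent. The sketch below is deliberately minimal and hypothetical: `validateExample` handles only flat objects with `string`, `number`, or `boolean` properties plus `required` and string `enum`, so it is not a substitute for a full JSON Schema validator.

```typescript
// Hypothetical, minimal subset of JSON Schema used for this sketch.
type PropertySchema = { type: "string" | "number" | "boolean"; enum?: string[] };
type InputSchema = {
  type: "object";
  properties: Record<string, PropertySchema>;
  required?: string[];
};

// Return a list of problems; an empty list means the example passed this check.
function validateExample(schema: InputSchema, example: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in example)) errors.push(`missing required property '${key}'`);
  }
  for (const key of Object.keys(example)) {
    const prop = schema.properties[key];
    if (prop === undefined) continue; // undeclared keys are allowed by default in JSON Schema
    const value = example[key];
    if (typeof value !== prop.type) {
      errors.push(`property '${key}' should be of type ${prop.type}`);
    } else if (typeof value === "string" && prop.enum !== undefined && prop.enum.indexOf(value) === -1) {
      errors.push(`property '${key}' is not one of the allowed values`);
    }
  }
  return errors;
}

// Schema mirroring the get_weather example shown earlier.
const weatherSchema: InputSchema = {
  type: "object",
  properties: {
    location: { type: "string" },
    unit: { type: "string", enum: ["celsius", "fahrenheit"] },
  },
  required: ["location"],
};

console.log(validateExample(weatherSchema, { location: "New York, NY" }));
console.log(validateExample(weatherSchema, { unit: "kelvin" }));
```

Running a check like this over every `input_examples` entry in a test suite keeps schema edits and examples from drifting apart.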
+ +Add `input_examples` when: +- Tools have complex nested structures where valid JSON does not mean correct usage +- Many optional parameters with non-obvious inclusion patterns +- Domain-specific conventions not captured in schemas (e.g., date formats, code conventions) +- The tool has caused consistent parameter formatting errors in testing + +Do **not** add `input_examples` just to add them — the token cost is real. + +### Best Practices for input_examples Content + +From the [advanced tool use blog post](https://www.anthropic.com/engineering/advanced-tool-use): + +1. **Use realistic data** — real city names, plausible prices, not `"string"` or `"value"` +2. **Show variety** — minimal, partial, and full specification patterns +3. **Keep it concise** — 1–5 examples per tool +4. **Focus on ambiguity** — correct usage that isn't obvious from the schema alone + +--- + +## Agent-to-Agent Delegation vs. Direct Tool Invocation + +### Anthropic's Position: One Level of Delegation + +The architecture Anthropic recommends across all documentation is: + +``` +Orchestrator (main agent) + ├── Has: Task tool + its own tools + ├── Delegates to: Subagent A (via Task tool) + │ └── Has: scoped tool subset, calls tools directly + ├── Delegates to: Subagent B (via Task tool) + │ └── Has: different scoped tool subset, calls tools directly + └── Synthesizes results +``` + +There is **no documented pattern** for: +- A subagent recommending that the orchestrator delegate to a different agent +- A subagent spawning another subagent +- An agent using natural language to suggest "you should ask agent X" + +The "do not include `Task` in a subagent's tools" constraint is structural enforcement of this one-level design. 
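The one-level rule can be enforced with a small configuration check before a query is started. This is a sketch under stated assumptions: `AgentSketch`, `OptionsSketch`, and `findDelegationViolations` are hypothetical names rather than SDK exports, and an omitted `allowedTools` list is left unchecked because its default behavior is not covered here.

```typescript
// Hypothetical, trimmed-down shapes modeled on the SDK's option names.
type AgentSketch = { description: string; prompt: string; tools?: string[] };
type OptionsSketch = {
  allowedTools?: string[];
  agents?: Record<string, AgentSketch>;
};

// Report configurations that break the orchestrator -> worker -> tools chain.
function findDelegationViolations(options: OptionsSketch): string[] {
  const problems: string[] = [];
  const agents = options.agents ?? {};
  const names = Object.keys(agents);

  // The orchestrator needs Task to invoke any subagent at all.
  if (
    names.length > 0 &&
    options.allowedTools !== undefined &&
    options.allowedTools.indexOf("Task") === -1
  ) {
    problems.push("orchestrator defines subagents but its allowedTools lacks Task");
  }

  // Subagents must not list Task, since they cannot spawn their own subagents.
  for (const name of names) {
    const tools = agents[name].tools ?? [];
    if (tools.indexOf("Task") !== -1) {
      problems.push(`subagent '${name}' lists Task in its tools`);
    }
  }
  return problems;
}

const reviewed = findDelegationViolations({
  allowedTools: ["Read", "Grep", "Glob", "Task"],
  agents: {
    "code-reviewer": {
      description: "Expert code review specialist.",
      prompt: "You are a code review specialist.",
      tools: ["Read", "Grep", "Glob"],
    },
  },
});
console.log(reviewed.length); // 0
```

A check like this fits naturally in a startup assertion or unit test, so a misconfigured agent map fails fast instead of surfacing as an unhelpful subagent response at runtime.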
+ +### Agent Teams (Separate Concept) + +For workflows that exceed the single-level delegation pattern, Anthropic documents [agent teams](https://code.claude.com/docs/en/agent-teams) as a separate concept: + +> "Subagents work within a single session; agent teams coordinate across separate sessions." + +Agent teams use the `--agent` flag to run Claude Code instances as workers in their own sessions. This is a different architecture from SDK subagents and is used for "tasks that need sustained parallelism or exceed your context window." + +The Super-Legal architecture uses the single-session subagent model, not agent teams. + +### Direct Invocation as Default + +From the [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) canonical guide: + +> "[Agents] are typically just LLMs using tools based on environmental feedback in a loop." + +The default is direct tool use. Delegation (via the orchestrator-workers pattern) is an advanced pattern applied when: +- Tasks cannot be predicted upfront (the orchestrator must break them down dynamically) +- Context isolation is needed (exploration output should not pollute synthesis context) +- Parallelization is desired (multiple subagents working simultaneously) + +When none of these conditions apply, direct tool invocation by a single agent is simpler and preferred. + +### What "Delegation" Means in the SDK + +In the SDK, "delegation" means: the orchestrator calls the `Task` tool with a `subagent_type` and a `prompt`. The named subagent then executes autonomously. The orchestrator does not manage the subagent's tool calls — those happen inside the subagent's context. + +A subagent "recommending delegation" (i.e., returning text like "you should use the financial-analyst agent for this") is **not the expected behavior**. If the `description` field correctly describes when to use the subagent, the orchestrator will invoke it directly without the subagent needing to recommend it. 
+ +### Practical Implication for Super-Legal + +If the Super-Legal orchestrator is currently seeing subagents respond with "I recommend delegating this to agent X," that is a prompt design issue — either: +1. The subagent is being invoked for a task outside its scope (fix: improve description matching) +2. The subagent's system prompt is instructing it to recommend delegation (fix: remove that instruction; subagents should execute directly) +3. The subagent lacks the tools it needs to complete the task (fix: scope the correct tools to the agent) + +--- + +## SDK API Reference Highlights + +### AgentDefinition Fields (TypeScript) + +```typescript +type AgentDefinition = { + description: string; // Required: when to use this agent (Claude reads this) + tools?: string[]; // Optional: explicit allowlist; inherits all if omitted + prompt: string; // Required: the agent's system prompt + model?: "sonnet" | "opus" | "haiku" | "inherit"; // Optional: defaults to main model +} +``` + +Note: there is no `maxThinkingTokens` on `AgentDefinition`. The `maxThinkingTokens` issue (Issue #25) is at the `query()` / `agentQuery` level, not the subagent level. 
+ +### Options Fields Relevant to Tool Scoping + +```typescript +type Options = { + allowedTools?: string[]; // Orchestrator tool allowlist + disallowedTools?: string[]; // Orchestrator tool denylist + agents?: Record<string, AgentDefinition>; // Subagent definitions + mcpServers?: Record<string, McpServerConfig>; // MCP servers + settingSources?: SettingSource[]; // [] = no filesystem settings (recommended) + betas?: SdkBeta[]; // e.g., ["context-1m-2025-08-07"] + maxThinkingTokens?: number; // CAUTION: breaks hooks (Issue #25) + model?: string; // Override model for this query +} +``` + +### SdkBeta Type + +As of the current SDK documentation, only one beta is exposed as a typed `SdkBeta` value: + +```typescript +type SdkBeta = "context-1m-2025-08-07"; +``` + +Other betas (like `interleaved-thinking-2025-05-14`, `effort-2025-11-24`) are passed at the Messages API level, not through the `betas` SDK option. + +### Hook Events + +Confirmed available hook events in current SDK: + +```typescript +type HookEvent = + | "PreToolUse" + | "PostToolUse" + | "PostToolUseFailure" + | "Notification" + | "UserPromptSubmit" + | "SessionStart" + | "SessionEnd" + | "Stop" + | "SubagentStart" + | "SubagentStop" + | "PreCompact" + | "PermissionRequest"; +``` + +`SubagentStart` and `SubagentStop` are the hooks used by the Super-Legal `hookSSEBridge.js`. + +--- + +## Relevance to Super-Legal Architecture + +### What the Research Validates + +1. **SCOPED_MCP_SERVERS approach is correct.** Anthropic explicitly recommends scoping tools per agent. The `buildScopedTools()` pattern is aligned with `AgentDefinition.tools`. + +2. **`settingSources: []` is correct.** The default behavior (no filesystem settings) is the intended SDK pattern for programmatic applications. + +3. **One-level delegation is correct.** Super-Legal's orchestrator → subagent → direct tool calls is exactly the documented architecture. Subagents should not delegate further. + +4. **`output_config: { format: ... 
}` (SDK 0.72+) is the current structured output API.** The migration from `output_format` was correct. + +### What Could Be Improved + +1. **`input_examples` on complex tool definitions.** Super-Legal's MCP tool definitions (e.g., SEC search, CourtListener hybrid search) have complex optional parameters that would benefit from `input_examples`. The 72% → 90% accuracy improvement is significant for legal research tools where parameter correctness matters. + + Priority targets: tools with `startPublishedDate`/`endPublishedDate` optional params, tools with `category` enum params, tools with complex nested filter objects. + +2. **Code execution bridge (`run_python_analysis`) is architecturally correct as-is.** The bridge uses `code_execution_20260120` via the Messages API as a plain code-execution tool, without `allowed_callers`, which is sufficient for straight Python analysis. PTC is the separate `allowed_callers` pattern, supported by the same tool type, in which Claude's sandbox code calls custom tools as async functions; the bridge does not engage it. Since MCP tools cannot be called from sandboxes, PTC does not replace the bridge's two-phase architecture (gather data via MCP → pass as JSON → execute Python). The bridge would only benefit from PTC if non-MCP data-fetching tools were added as `allowed_callers`-eligible client tools. + +3. **`allowed_callers` is a Messages API field, not an Agent SDK field.** The Agent SDK's `agentQuery` path does not expose `allowed_callers` in its tool definitions API. If PTC were adopted in the future, it would need to use the Messages API directly (which is what `codeExecutionBridge.js` already does). + +### What Remains Blocked + +1. **`maxThinkingTokens` in `agentQuery` (Issue #25).** This is still broken as of SDK 0.2.47. The research confirms that `maxThinkingTokens` is a valid option in the `Options` type, but the SDK bug prevents it from working without breaking hooks. No resolution has been published. + +2. 
**Programmatic Tool Calling via Agent SDK.** The `allowed_callers` field is part of the Messages API tool definition, not the `AgentDefinition` type in the Agent SDK. You cannot use Programmatic Tool Calling from within an Agent SDK `agentQuery` call today — it requires using the Messages API directly (which is what `codeExecutionBridge.js` already does). + +--- + +## References + +All sources verified as accessible on 2026-02-22. + +### Primary (Official Anthropic Documentation) + +- [Agent SDK Overview](https://platform.claude.com/docs/en/agent-sdk/overview) — Claude Code SDK renamed to Claude Agent SDK; overview of capabilities +- [Agent SDK: Subagents](https://platform.claude.com/docs/en/agent-sdk/subagents) — Programmatic subagent definition, tool restrictions, `Task` tool requirement, no nested subagents rule +- [Agent SDK: MCP Integration](https://platform.claude.com/docs/en/agent-sdk/mcp) — MCP tool naming convention, `allowedTools` patterns, ENABLE_TOOL_SEARCH +- [Agent SDK: Custom Tools](https://platform.claude.com/docs/en/agent-sdk/custom-tools) — `createSdkMcpServer`, `tool()` helper, `allowedTools` with MCP prefix +- [Agent SDK TypeScript Reference](https://platform.claude.com/docs/en/agent-sdk/typescript) — Full `Options` type, `AgentDefinition`, `HookEvent`, `SettingSource`, `SdkBeta` +- [Tool Use Implementation Guide](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) — `input_examples` field specification, best practices, tool runner beta +- [Programmatic Tool Calling](https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling) — `allowed_callers`, code execution sandbox, model compatibility table +- [Claude Code Sub-Agents](https://code.claude.com/docs/en/sub-agents) — Filesystem-based subagents, tool restriction patterns, hooks in subagents, `memory` field + +### Anthropic Engineering Blog + +- [Building Agents with the Claude Agent 
SDK](https://claude.com/blog/building-agents-with-the-claude-agent-sdk) — Context management, parallelization, tool design hierarchy +- [Introducing Advanced Tool Use](https://www.anthropic.com/engineering/advanced-tool-use) — Tool Search Tool, Programmatic Tool Calling, Tool Use Examples; beta `advanced-tool-use-2025-11-20` +- [How We Built Our Multi-Agent Research System](https://www.anthropic.com/engineering/multi-agent-research-system) — Orchestrator-worker pattern, direct tool access by subagents, no recursive delegation +- [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) — Initializer/coding agent pattern, purposeful tool access +- [Writing Effective Tools for AI Agents](https://www.anthropic.com/engineering/writing-tools-for-agents) — Tool consolidation, naming conventions, response quality, token efficiency +- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) — Canonical agent patterns: augmented LLM, orchestrator-workers, direct tool use as default +- [Effective Context Engineering for AI Agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — Context management for agents + +### Third-Party Analysis (Non-Authoritative) + +- [Anthropic Just Shipped the Fix for Tool Definition Bloat](https://medium.com/@DebaA/anthropic-just-shipped-the-fix-for-tool-definition-bloat-77464c8dbec9) (Medium, November 2025) — Summary of advanced tool use features +- [Your AI Agent Wastes 95% of Its Brain on Tools](https://medium.com/genaius/your-ai-agent-wastes-95-of-its-brain-on-tools-anthropic-just-showed-the-fix-96fbe597136b) (Medium, November 2025) — Context window impact analysis +- [Claude Agent SDK Best Practices for AI Agent Development](https://skywork.ai/blog/claude-agent-sdk-best-practices-ai-agents-2025/) (Skywork, 2025) — Community summary; treat as unofficial + +### GitHub + +- 
[anthropics/claude-agent-sdk-typescript](https://github.com/anthropics/claude-agent-sdk-typescript) — TypeScript SDK source and CHANGELOG +- [anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) — Python SDK source and CHANGELOG +- [claude-cookbooks: programmatic_tool_calling_ptc.ipynb](https://github.com/anthropics/claude-cookbooks/blob/main/tool_use/programmatic_tool_calling_ptc.ipynb) — Official cookbook example + +--- + +## §11 — Anthropic structured-output empirical constraints (Avenue A v2 findings, 2026-05-16) + +Avenue A v2 (PR #135) added `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement to the code-execution bridge. The implementation surfaced several **undocumented schema constraints** that the Anthropic API enforces. Cataloged here for future maintainers. + +### Verified-compatible feature combination + +The following combination works on Messages API: + +- `output_config: { format: { type: 'json_schema', schema } }` (SDK 0.86.1 `MessageCreateParams.output_config`, type at `node_modules/@anthropic-ai/sdk/resources/messages/messages.d.ts:1908`) +- + `tools: [{ type: 'code_execution_20260120', name: 'code_execution' }]` +- + `client.messages.stream(...).finalMessage()` (streaming API) +- + `pause_turn` continuations (server-side sampling iteration limit handling) +- + `cache_control: { type: 'ephemeral' }` on system prompt +- + `betas: ['context-1m-2025-08-07', 'files-api-2025-04-14']` + +**Verified via L2 + L4 + L5 in PR #135** — no API rejection, no behavioral interaction with `pause_turn`, no cache invalidation. + +### Schema constraints discovered (API rejects with 400 if violated) + +These constraints are NOT documented in the SDK type definitions or in this codebase's local docs as of 2026-05-16. They were discovered through trial-and-error against `code_execution_20260120` + streaming. 
The pattern: the API rejects with a 400 `invalid_request_error` and a specific message naming the offending property. + +| Constraint | Required value | API error message if violated | +|---|---|---| +| `additionalProperties` on `type: 'object'` nodes | **MUST be `false`** (NOT `true`, NOT omitted) | `"For 'object' type, 'additionalProperties: true' is not supported. Please set 'additionalProperties' to false"` | +| `minimum` on `type: 'integer'` or `type: 'number'` | **NOT supported**; strip entirely | `"For 'integer' type, property 'minimum' is not supported"` | +| `maximum` on numeric types | Likely **NOT supported** (stripped defensively in PR #135) | (untested but symmetric with `minimum`) | +| `minLength` on `type: 'string'` | Likely **NOT supported** (stripped defensively) | (untested but suspected based on `minimum` pattern) | +| `maxLength` on strings | Likely **NOT supported** | (same) | +| `enum` on `type: 'string'` | **Supported** | (verified: `status: { type: 'string', enum: ['PASS', 'FAIL', 'UNKNOWN'] }` works in L2) | + +**Implication**: Anthropic's structured-output validator appears to implement a **strict subset of JSON Schema** that enforces shape but not value ranges. Based on the cases tested here, schemas validate **types, required fields, and `enum` membership only**. Semantic validation (value bounds, string length, base64 magic bytes) must be done downstream in application code. + +### Where this matters in the bridge + +The bridge's `ENVELOPE_SCHEMA_XLSX` and `ENVELOPE_SCHEMA_GENERAL` constants (`codeExecutionBridge.js`, near lines 70-167) are written defensively against these constraints. The schema-construction header comment documents them inline so future schema edits respect the constraints. + +### The b64-in-text architectural constraint (Option A pivot) + +The L4 v1 attempt enforced the FULL envelope (including `b64_xlsx` field) in the model's text-block output.
**This broke `code_execution_20260120` + xlsx renders**: + +- `output_config` forces the model to emit JSON-formatted text in the response's `text` block(s) +- For xlsx callers, that envelope includes a multi-KB base64 binary (`b64_xlsx`) +- Output tokens in the `text` block consume the per-turn `max_tokens` budget (32K default) +- A 25KB workbook = ~33KB base64 = ~8K-10K output tokens just for the b64 string, plus prose narration +- Observed: phase3 (LBO sheet) hit `stop_reason='max_tokens'` with `text_len=62106` mid-base64 → truncated b64 → corrupt envelope → retry loop → render failure + +**The architectural fix (Option A, shipped in PR #135)**: scope-down the xlsx schema to enforce ONLY the audit metadata (`audit_results`, `sheets`, `phase_sources`, `named_ranges_count`) in the text-block enforcement. The `b64_xlsx` binary payload continues via stdout (`bash_code_execution_tool_result.content.stdout`) — the legacy path, where it has no token-budget pressure. The bridge merges audit-from-text + b64-from-stdout in `selectEnvelopeWithFallback()`. + +**Lesson for future structured-output work**: don't force large binary payloads into the text channel. Use the channel that's naturally sized for the data class (stdout for large/binary; text for small/structured). + +### `code_execution_20260120` without PTC features + +The bridge uses `code_execution_20260120` as a **plain code-execution tool**, NOT as a PTC tool. It does not set `allowed_callers` on any custom tool. The empirical finding: when the new tool is used without PTC features, the per-tool PTC restrictions (e.g., "`strict: true` on tool inputs not supported with PTC") **do not apply**. The bridge's `output_config` works fine because PTC isn't engaged. + +If a future bridge change DOES adopt PTC (`allowed_callers` on custom tools), re-verify `output_config` compatibility — Anthropic's docs may carve out PTC-specific restrictions. 
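The schema constraints cataloged earlier in §11 could be applied mechanically before a schema is handed to `output_config`. The following is a minimal sketch only: `sanitizeSchemaForStructuredOutput` and `UNSUPPORTED_KEYWORDS` are hypothetical names, not the bridge's actual code, and the keyword list simply mirrors the table above (two verified entries, the rest stripped defensively).

```javascript
// Hypothetical helper, not part of codeExecutionBridge.js.
// Keywords the table above lists as rejected or defensively stripped.
const UNSUPPORTED_KEYWORDS = new Set(['minimum', 'maximum', 'minLength', 'maxLength']);

function sanitizeSchemaForStructuredOutput(node) {
  // Arrays (e.g. `required`, `enum`) are walked element by element.
  if (Array.isArray(node)) return node.map(sanitizeSchemaForStructuredOutput);
  // Primitives (strings, numbers, booleans, null) pass through unchanged.
  if (node === null || typeof node !== 'object') return node;

  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (UNSUPPORTED_KEYWORDS.has(key)) continue; // drop value-range keywords
    if (key === 'properties' && value && typeof value === 'object') {
      // Keys under `properties` are property NAMES, not schema keywords,
      // so a field literally called "minimum" must survive.
      out.properties = Object.fromEntries(
        Object.entries(value).map(([name, sub]) => [name, sanitizeSchemaForStructuredOutput(sub)]),
      );
    } else {
      out[key] = sanitizeSchemaForStructuredOutput(value);
    }
  }
  // Every object node must carry additionalProperties: false.
  if (out.type === 'object') out.additionalProperties = false;
  return out;
}
```

The sketch returns a new object and leaves the input untouched. It does not handle union types such as `type: ['object', 'null']`, and any bounds it strips still need to be checked in application code after the response is parsed.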
+ +### Operational queries + +Post-Avenue-A-v2 telemetry counter `claude_xlsx_render_turn1_envelope_success_total` (added in PR #135 follow-up) provides the production observability surface: + +```promql +# Turn-1 success rate by flag state — efficacy gauge +sum by (structured_output) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome="first_turn"}[1h])) + / sum by (structured_output) (rate(claude_xlsx_render_turn1_envelope_success_total[1h])) + +# Envelope-source distribution post-flag-flip +sum by (envelope_source) (rate(claude_xlsx_render_turn1_envelope_success_total{structured_output="on"}[1h])) +``` + +Target after `STRUCTURED_OUTPUT_ENFORCEMENT=true` deployment: `structured_output="on"` rate ≥ `structured_output="off"` baseline rate (validates Avenue A v2 doesn't regress retry behavior). diff --git a/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md b/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md new file mode 100644 index 000000000..c98f12cd6 --- /dev/null +++ b/super-legal-mcp-refactored/docs/code-execution-enhancements/container-lifecycle-api-reference-02-2026.md @@ -0,0 +1,446 @@ +# Anthropic Code Execution Container API Reference + +**Created:** 2026-02-25 +**Topic:** Code execution container lifecycle, parameter syntax, and fresh container creation +**API Version:** `anthropic-version: 2023-06-01` +**Beta string:** `code-execution-2025-08-25` (current), `code-execution-2025-05-22` (legacy) +**Tool type:** `code_execution_20250825` (current), `code_execution_20250522` (legacy Python-only) +**Sources:** Official Anthropic documentation fetched 2026-02-25 + +--- + +## Table of Contents + +1. [Overview](#overview) +2. [Tool Definition and Beta Header](#tool-definition-and-beta-header) +3. [Container Parameter — Exact Type Definition](#container-parameter--exact-type-definition) +4. 
[Creating a Fresh Container (Our Overflow Problem)](#creating-a-fresh-container-our-overflow-problem) +5. [Container Reuse (How to Opt-In)](#container-reuse-how-to-opt-in) +6. [Container Lifecycle](#container-lifecycle) +7. [Response Shape — container.id](#response-shape--containerid) +8. [Error Codes — Container-Specific](#error-codes--container-specific) +9. [Current codeExecutionBridge.js Behavior](#current-codeexecutionbridgejs-behavior) +10. [Fix for 200K Token Overflow](#fix-for-200k-token-overflow) +11. [Model Compatibility](#model-compatibility) +12. [Key Findings Summary](#key-findings-summary) +13. [References](#references) + +--- + +## Overview + +The Anthropic code execution tool runs Python and Bash in a secure, sandboxed container (Linux x86_64, Python 3.11.12, 5GiB RAM, 5GiB disk, no network). Each `messages.create()` call operates within a container. By default, **omitting the `container` parameter allocates a brand-new container** for that request. Passing a prior container ID via the `container` parameter **opts into reuse**, which maintains filesystem state across calls. + +The 200K token overflow in our test suite was first attributed to sequential test calls accumulating stdout and state inside a **reused container**: containers persist for 30 days, and passing the same container ID inherits all prior filesystem state. The analysis below shows that the bridge already allocates a fresh container on every `runPythonAnalysis()` call, and that the overflow is a **context-window** problem caused by the `messages` array growing across error-retry turns. + +**The fix: cap `MAX_TURNS` for tests, or truncate stdout/stderr in retry messages. Omitting the `container` parameter, which the bridge already does on each top-level call, remains the only way to force a fresh container.** + +--- + +## Tool Definition and Beta Header + +### Tool type + +```json +{ + "type": "code_execution_20250825", + "name": "code_execution" +} +``` + +No additional parameters required on the tool object itself. + +### Beta header + +For basic code execution (no Skills, no Files API), **no beta header is required** as of the current docs.
The tool activates via the tool type alone: + +```bash +curl https://api.anthropic.com/v1/messages \ + --header "x-api-key: $ANTHROPIC_API_KEY" \ + --header "anthropic-version: 2023-06-01" \ + --header "content-type: application/json" \ + --data '{ + "model": "claude-sonnet-4-5-20250929", + "max_tokens": 4096, + "messages": [{"role": "user", "content": "..."}], + "tools": [{"type": "code_execution_20250825", "name": "code_execution"}] + }' +``` + +If using the SDK's `client.beta.messages.create()` path (which our bridge does), the beta header string `"code-execution-2025-08-25"` is passed in `betas: [...]`. This is consistent with current bridge behavior. + +> **Note (from current codeExecutionBridge.js, line 18):** +> ```js +> const CODE_EXECUTION_BETA = 'code-execution-2025-08-25'; +> ``` +> This is correct. No change needed. + +### Legacy version + +| Component | Legacy | Current | +|-----------|--------|---------| +| Beta header | `code-execution-2025-05-22` | `code-execution-2025-08-25` | +| Tool type | `code_execution_20250522` | `code_execution_20250825` | +| Capabilities | Python only | Bash + file operations | +| Response type | `code_execution_result` | `bash_code_execution_result` + `text_editor_code_execution_result` | + +--- + +## Container Parameter — Exact Type Definition + +Source: Python SDK `src/anthropic/types/beta/message_create_params.py` and TypeScript SDK `src/resources/beta/messages/messages.ts`, fetched 2026-02-25. + +### Python SDK type alias + +```python +Container: TypeAlias = Union[BetaContainerParams, str] +``` + +Used in `MessageCreateParamsBase` as: + +```python +container: Optional[Container] +# "Container identifier for reuse across requests." 
+``` + +### BetaContainerParams TypedDict (Python) + +```python +class BetaContainerParams(TypedDict, total=False): + id: Optional[str] # Container ID from a previous response + skills: Optional[Iterable[BetaSkillParams]] # Agent Skills to load +``` + +### TypeScript SDK interface + +```typescript +// In BetaMessageCreateParamsBase: +container?: BetaContainerParams | string | null; + +// BetaContainerParams interface: +export interface BetaContainerParams { + id?: string | null; + skills?: Array<BetaSkillParams> | null; +} +``` + +### Valid shapes for the `container` parameter + +| Shape | Effect | +|-------|--------| +| Omitted / `undefined` | **Fresh container allocated for this request** | +| `null` | No container specified (treated as fresh) | +| `"container_id_string"` | **Reuses the named container**; inherits all filesystem state | +| `{ id: "container_id_string" }` | Reuses the named container (object form) | +| `{ skills: [...] }` | Creates new container with Agent Skills loaded | +| `{ id: "...", skills: [...] }` | Reuses container AND loads Skills | + +> **Critical:** There is no `"type": "new"` or `"force_fresh": true` creation config object. The only way to force a fresh container is to **omit the `container` parameter** or pass `null`. + +--- + +## Creating a Fresh Container (Our Overflow Problem) + +### The problem + +Our test suite runs multiple sequential `runPythonAnalysis()` calls. The initial hypothesis was that, because the Anthropic sandbox persists containers for 30 days and the `pause_turn` continuation loop at lines 128–139 of `codeExecutionBridge.js` passes `container: containerId`, a `containerId` from a prior test could be inadvertently reused and its accumulated stdout could overwhelm the 200K token context limit. + +However, looking at the current code more carefully: + +- The outer multi-turn loop (lines 102–203) calls `messages.create()` without a `container` parameter on turn 1 (line 106). This correctly allocates a **fresh container per top-level call**.
+- The `containerId` is only used inside the `pause_turn` inner loop (line 134) to re-bind to the **same container within a single call**. This is intentional — continuations for a paused turn should stay in the same container. +- The `messages` array (line 92) is rebuilt fresh for each `runPythonAnalysis()` invocation. No container ID leaks between invocations. + +### Root cause of the 200K overflow in tests + +The sandbox documentation states: "Containers are scoped to the workspace of the API key." All `runPythonAnalysis()` calls from the same API key that do NOT pass a `container` parameter will receive **a new container each time** — but the container is not immediately destroyed. The 200K limit is a **context window** limit on the Messages API turn, not a container state limit. Sequential test calls that each produce large stdout outputs will overflow the 200K limit when the full conversation history (including all prior `bash_code_execution_tool_result` blocks) is passed back on subsequent turns. + +### Solution options + +**Option A (Recommended): Reset messages array between error-retry turns** + +The outer `for` loop adds to `messages` on error retry (lines 169–175). For very large outputs, truncate or summarize prior outputs before appending. The simplest fix for tests: cap `MAX_TURNS = 1` to prevent accumulation. + +**Option B: Explicit single-turn mode for testing** + +Set `MAX_TURNS = 1` in test runs via env var: + +```js +const MAX_TURNS = parseInt(process.env.CODE_EXECUTION_MAX_TURNS || '3'); +``` + +**Option C: The container parameter does NOT solve this** + +Since the 200K overflow is a context window issue (too many tokens in the `messages` array passed to `messages.create()`), not a container state issue, explicitly forcing a new container via omitting `container` won't help. Each new call already gets a fresh container. 
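Option B's env override could be hardened against malformed values. A bare `parseInt` returns `NaN` for input like `abc` and accepts `0`, either of which would make the retry loop misbehave. The sketch below is an assumption about how the override might be read; `readMaxTurns` is a hypothetical helper name, and only the `CODE_EXECUTION_MAX_TURNS` variable and the default of 3 come from the text above.

```javascript
// Hypothetical helper; the bridge reads the variable inline.
// Returns a positive integer turn cap, falling back when the env value
// is missing, non-numeric, or less than 1.
function readMaxTurns(env, fallback = 3) {
  const raw = env.CODE_EXECUTION_MAX_TURNS;
  // Explicit radix 10 avoids surprises with prefixed strings such as "0x10".
  const parsed = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(parsed) && parsed >= 1 ? parsed : fallback;
}

// Usage in the bridge would look like:
//   const MAX_TURNS = readMaxTurns(process.env);
```

Taking the environment as a parameter keeps the helper testable without mutating `process.env`.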
+ +--- + +## Container Reuse (How to Opt-In) + +From the official docs example: + +```javascript +// First request — no container parameter = fresh container +const response1 = await anthropic.messages.create({ + model: "claude-sonnet-4-5-20250929", + max_tokens: 4096, + messages: [{ role: "user", content: "Write a random number to /tmp/number.txt" }], + tools: [{ type: "code_execution_20250825", name: "code_execution" }] +}); + +// Extract container ID from response +const containerId = response1.container.id; + +// Second request — pass container ID to reuse filesystem state +const response2 = await anthropic.messages.create({ + container: containerId, // string form + // OR: container: { id: containerId }, // object form + model: "claude-sonnet-4-5-20250929", + max_tokens: 4096, + messages: [{ role: "user", content: "Read /tmp/number.txt and square the number" }], + tools: [{ type: "code_execution_20250825", name: "code_execution" }] +}); +``` + +In curl: + +```bash +# First request +curl https://api.anthropic.com/v1/messages \ + --header "x-api-key: $ANTHROPIC_API_KEY" \ + --header "anthropic-version: 2023-06-01" \ + --header "content-type: application/json" \ + --data '{ "model": "...", "tools": [{"type": "code_execution_20250825", "name": "code_execution"}], ... 
}' \ + > response1.json + +CONTAINER_ID=$(jq -r '.container.id' response1.json) + +# Second request — reuse container +curl https://api.anthropic.com/v1/messages \ + --header "x-api-key: $ANTHROPIC_API_KEY" \ + --header "anthropic-version: 2023-06-01" \ + --header "content-type: application/json" \ + --data "{\"container\": \"$CONTAINER_ID\", ...}" +``` + +--- + +## Container Lifecycle + +| Property | Value | +|----------|-------| +| Expiration | **30 days after creation** | +| Scope | Per API key workspace | +| Creation | Automatic on first request without container ID | +| Deletion | No public API to explicitly delete a container | +| Inactivity timeout | Not documented (internal Anthropic policy) | +| Max containers | Not documented | + +### Container error codes + +When a container is unavailable or expired, tool results return error shapes: + +```json +{ + "type": "bash_code_execution_tool_result", + "tool_use_id": "srvtoolu_...", + "content": { + "type": "bash_code_execution_tool_result_error", + "error_code": "container_expired" + } +} +``` + +Full error code table: + +| Error Code | Condition | +|------------|-----------| +| `unavailable` | Tool temporarily unavailable | +| `execution_time_exceeded` | Exceeded maximum execution time limit | +| `container_expired` | Container is older than 30 days or was recycled | +| `invalid_tool_input` | Invalid parameters to the tool | +| `too_many_requests` | Rate limit exceeded | +| `file_not_found` | (text_editor only) File does not exist | +| `string_not_found` | (text_editor only) `old_str` not found in str_replace | + +--- + +## Response Shape — container.id + +The API response object includes a `container` field at the top level of the messages response: + +```typescript +// Response type (BetaMessage) +{ + id: string; + type: "message"; + role: "assistant"; + content: ContentBlock[]; + model: string; + stop_reason: "end_turn" | "max_tokens" | "stop_sequence" | "tool_use" | "pause_turn"; + container: { + id: 
string; // Container identifier for reuse + expires_at: string; // ISO 8601 datetime, 30 days from creation + skills: BetaSkill[] | null; // Skills loaded (null if none) + }; + usage: { input_tokens: number; output_tokens: number; ... }; +} +``` + +> **Note:** `response.container` is always present when code execution runs. Access the ID via `response.container?.id` — the `?.` guard is defensive for non-code-execution requests that lack the field. + +--- + +## Error Codes — Container-Specific + +See the error table above. The `container_expired` error means the container referenced by the passed `container` ID no longer exists. Recovery: omit `container` on the next call to get a fresh one. + +--- + +## Current codeExecutionBridge.js Behavior + +Reviewing `/Users/ej/Super-Legal/super-legal-mcp-refactored/src/tools/codeExecutionBridge.js`: + +| Aspect | Current Behavior | Correct per Docs? | +|--------|-----------------|-------------------| +| Fresh container per invocation | Yes — `container` parameter omitted on turn 1 (line 106) | Correct | +| pause_turn continuation binds to same container | Yes — `container: containerId` on line 134 | Correct | +| containerId scoped to single invocation | Yes — declared inside outer `for` loop scope | Correct | +| Beta string | `code-execution-2025-08-25` | Correct | +| Tool type | `code_execution_20250825` | Correct | +| Uses `client.beta.messages.create()` | Yes | Correct | +| Response parsing for `bash_code_execution_tool_result` | Yes, primary type | Correct | +| Container ID extraction | `response.container?.id` (line 118) | Correct | + +### The `pause_turn` continuation (lines 124–140) + +```javascript +// Capture container ID for explicit binding in continuations. 
+const containerId = response.container?.id; + +let pauseCount = 0; +while (response.stop_reason === 'pause_turn' && pauseCount < MAX_PAUSE_CONTINUATIONS) { + pauseCount++; + response = await client.beta.messages.create({ + model: DEFAULT_MODEL, + betas: [CODE_EXECUTION_BETA], + max_tokens: MAX_TOKENS, + system: SYSTEM_PROMPT, + tools: [{ type: CODE_EXECUTION_TOOL_TYPE, name: 'code_execution' }], + ...(containerId ? { container: containerId } : {}), // Bind to same container + messages: [ + { role: 'user', content: userPrompt }, + { role: 'assistant', content: response.content } + ] + }); +} +``` + +This is correct per Anthropic docs — a `pause_turn` continuation must reuse the same container to access the paused execution state. + +--- + +## Fix for 200K Token Overflow + +The 200K overflow in sequential tests is caused by the **messages array growing across error-retry turns** (outer loop lines 169–175), not by container state accumulation. Each new `runPythonAnalysis()` call gets a fresh container. + +### Minimal fix for test isolation + +Add a configurable max turns env override and cap stdout in the messages passed back on retry: + +```javascript +// In codeExecutionBridge.js — proposed change for test stability +const MAX_TURNS = parseInt(process.env.CODE_EXECUTION_MAX_TURNS || '3', 10); +const MAX_RETRY_STDOUT_CHARS = 2000; // Cap error context passed back to Claude + +// When retrying on error, truncate stderr/stdout in the retry message: +messages.push({ + role: 'user', + content: `The code produced an error:\n\`\`\`\n${extracted.stderr?.slice(0, MAX_RETRY_STDOUT_CHARS)}\n\`\`\`\n\nPlease fix and retry.` +}); +``` + +For tests specifically, set `CODE_EXECUTION_MAX_TURNS=1` in the test environment to prevent multi-turn accumulation. 
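One caveat with the `slice(0, MAX_RETRY_STDOUT_CHARS)` truncation in the proposed change: Python tracebacks put the exception type and message at the end, so keeping only the head of a long stderr can drop the most useful part. A variant that keeps both ends might look like the sketch below. `truncateMiddle` and its marker text are hypothetical, not something the bridge currently defines.

```javascript
// Hypothetical helper: keep the head and tail of a long string and drop
// the middle, so the final traceback line survives truncation.
function truncateMiddle(text, maxChars) {
  if (typeof text !== 'string') return ''; // tolerate undefined stderr
  if (text.length <= maxChars) return text;

  const marker = '\n...[truncated]...\n';
  // If the budget cannot even fit the marker, fall back to a plain head cut.
  if (maxChars <= marker.length) return text.slice(0, maxChars);

  const keep = maxChars - marker.length;
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  // slice(-0) would return the whole string, so guard the zero-tail case.
  return text.slice(0, head) + marker + (tail > 0 ? text.slice(-tail) : '');
}
```

In the retry message this would replace `extracted.stderr?.slice(0, MAX_RETRY_STDOUT_CHARS)` with `truncateMiddle(extracted.stderr, MAX_RETRY_STDOUT_CHARS)`; the result never exceeds the character cap.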
+ +### If the overflow is actually from container state (cross-invocation) + +If investigation reveals the overflow comes from container state being reused across multiple `runPythonAnalysis()` calls (which should not happen given the current code), the explicit fix is to confirm `container` is never passed on the outer call. Current code on line 106 confirms this is already correct: + +```javascript +let response = await client.beta.messages.create({ + model: DEFAULT_MODEL, + betas: [CODE_EXECUTION_BETA], + max_tokens: MAX_TOKENS, + system: SYSTEM_PROMPT, + tools: [{ type: CODE_EXECUTION_TOOL_TYPE, name: 'code_execution' }], + messages // No container parameter — fresh container every time +}); +``` + +--- + +## Model Compatibility + +| Model | Tool Version | +|-------|-------------| +| `claude-opus-4-6` | `code_execution_20250825` | +| `claude-sonnet-4-6` | `code_execution_20250825` | +| `claude-sonnet-4-5-20250929` | `code_execution_20250825` | +| `claude-opus-4-5-20251101` | `code_execution_20250825` | +| `claude-haiku-4-5-20251001` | `code_execution_20250825` | +| Claude Haiku 3.5 (deprecated) | `code_execution_20250825` | + +> Our bridge defaults to `claude-sonnet-4-5-20250929` via `CODE_EXECUTION_MODEL` env var. This is supported. + +--- + +## Key Findings Summary + +1. **There is no "force fresh container" creation config object.** The only way to get a fresh container is to omit the `container` parameter entirely (or pass `null`). Our bridge already does this correctly on each top-level invocation. + +2. **`container` accepts either a plain string (container ID) or a `BetaContainerParams` object `{ id?, skills? }`.** The object form is used for Agent Skills. For basic code execution with no Skills, the string form or omission is sufficient. + +3. **Container expiry is 30 days** from creation. The `container_expired` error code is returned if the container has expired. + +4. 
**There is no public API to explicitly delete containers.** They expire automatically. + +5. **The `pause_turn` continuation must reuse the same container** — our bridge does this correctly via `container: containerId` in the inner while loop. + +6. **The 200K token overflow in sequential tests is a context window problem, not a container state problem.** The messages array grows with each error-retry turn across multiple turns. Fix: cap `MAX_TURNS=1` for tests, or truncate stdout in retry messages. + +7. **Beta header `code-execution-2025-08-25` is correct** for the current `code_execution_20250825` tool type. No change needed. + +8. **`response.container.id` is the correct path** to extract the container ID from a response. The `expires_at` field is also present for TTL-aware management. + +--- + +## References + +- [Code execution tool - Claude API Docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool) — Official documentation, fetched 2026-02-25 +- [Agent Skills API guide](https://platform.claude.com/docs/en/build-with-claude/skills-guide) — Container parameter for Skills integration, fetched 2026-02-25 +- [Python SDK message_create_params.py](https://github.com/anthropics/anthropic-sdk-python/blob/main/src/anthropic/types/beta/message_create_params.py) — Container type alias definition +- [Python SDK beta_container_params.py](https://raw.githubusercontent.com/anthropics/anthropic-sdk-python/main/src/anthropic/types/beta/beta_container_params.py) — BetaContainerParams TypedDict +- [Python SDK beta_container.py](https://raw.githubusercontent.com/anthropics/anthropic-sdk-python/main/src/anthropic/types/beta/beta_container.py) — BetaContainer response type +- [TypeScript SDK messages.ts (beta)](https://github.com/anthropics/anthropic-sdk-typescript/blob/main/src/resources/beta/messages/messages.ts) — TypeScript interface for container parameter +- [Anthropic code execution with MCP (engineering 
blog)](https://www.anthropic.com/engineering/code-execution-with-mcp) — Architecture context + +--- + +## Addendum — Avenue A v2 compatibility verified (2026-05-16, PR #135) + +**`output_config` (JSON-schema structured output enforcement) is compatible with `code_execution_20260120` + streaming + `pause_turn` continuations** when used WITHOUT PTC's `allowed_callers` pattern. + +Verified empirically via PR #135's L2 + L4 + L5 layered validation against `claude-sonnet-4-6`: + +- L2 smoke ($0.05 single API call): `output_config + code_execution_20260120 + streaming` accepted without error +- L4 paired live render ($1.50): 5/5 phases delivered with `output_config` engaged on every API call (initial + pause_turn continuations); `envelope_source` distribution showed `text`, `merged:text+stdout`, and `stdout` paths all engaging as expected +- L5 cross-caller MCP test ($0.05): non-xlsx caller (`run_python_analysis` MCP tool) succeeded with `turn_count=1, envelope_source='text'` + +**Key empirical constraint**: large binary payloads (`b64_xlsx` ~25KB) cannot be enforced into the model's `text` block — they exceed `max_tokens=32000` budget. The bridge's xlsx schema (`ENVELOPE_SCHEMA_XLSX` in `codeExecutionBridge.js`) was scoped down to enforce ONLY the audit metadata in text; binary continues via `bash_code_execution_tool_result.content.stdout` (the natural channel for large output). See `docs/code-execution-enhancements/anthropic-sdk-best-practices-research.md` §11 for the full schema-constraint catalog discovered during PR #135. 
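The size arithmetic behind this constraint can be checked with a few lines. Base64 emits 4 characters per 3 input bytes, rounded up to a full group. The characters-per-token ratio is an assumption rather than a documented figure, so it is left as a parameter; the 8K-10K token estimate quoted in §11 implies roughly 3.3 to 4 characters per token. All function names below are hypothetical.

```javascript
// Length in characters of the padded base64 encoding of `byteCount` bytes.
function base64Length(byteCount) {
  return Math.ceil(byteCount / 3) * 4;
}

// Rough token estimate. `charsPerToken` is an ASSUMED ratio supplied by the
// caller; base64 text does not tokenize at a fixed, documented rate.
function estimateTokens(charCount, charsPerToken) {
  return Math.ceil(charCount / charsPerToken);
}

// Would a binary payload, emitted as base64 in the text block, fit in the
// per-turn output budget alongside `reservedTokens` of other output
// (audit metadata, prose narration)?
function fitsInTextBudget(byteCount, maxTokens, charsPerToken, reservedTokens = 0) {
  const payloadTokens = estimateTokens(base64Length(byteCount), charsPerToken);
  return payloadTokens + reservedTokens <= maxTokens;
}

// Example: a 25 KiB workbook encodes to 34136 base64 characters.
const workbookChars = base64Length(25 * 1024);
```

A check like this only bounds the payload itself. The observed failure also included narration in the same text block, which is what `reservedTokens` is meant to model.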
+ +**No documented incompatibility** between `output_config` and: +- Container reuse (`container: ` param on subsequent calls): verified compatible +- `pause_turn` server-side iteration limit (5 continuations max in bridge): verified compatible — schema enforcement carries through the iteration loop +- `cache_control: { type: 'ephemeral' }` on system prompt: verified compatible — `output_config` is a separate param, doesn't affect cache key + +**Container lifecycle implications**: `output_config` enforcement happens at the response-shape layer (model's text-block output), AFTER all code-execution + bash + text_editor tool cycles complete within the turn. Container lifecycle (creation, file persistence, expiry) is unaffected. diff --git a/super-legal-mcp-refactored/src/utils/sdkMetrics.js b/super-legal-mcp-refactored/src/utils/sdkMetrics.js index d9aa95984..5f04c8a72 100644 --- a/super-legal-mcp-refactored/src/utils/sdkMetrics.js +++ b/super-legal-mcp-refactored/src/utils/sdkMetrics.js @@ -427,7 +427,7 @@ export function observeRenderPhaseDuration(templateId, phase, seconds) { // Phase 9: per-phase failure counter. Surfaces "which phase fails most // often for which template, with what reason" — actionable for prompt // engineering OR phase-split adjustment. -// Cardinality: 5 templates × 3 phases × 6 reasons = 90 series. +// Cardinality: 5 templates × 3-6 phases × 6 reasons = ~90-180 series. export const xlsxRenderPhaseFailures = new client.Counter({ name: 'claude_xlsx_render_phase_failures_total', help: 'Multi-turn render phase failures by template, phase, and bounded reason', @@ -437,6 +437,48 @@ export function recordXlsxRenderPhaseFailure(templateId, phase, reason) { try { xlsxRenderPhaseFailures.inc({ template_id: templateId, phase, reason }); } catch {} } +// Avenue A v2 efficacy gauge (PR forthcoming, 2026-05-16): per-phase turn-1 +// envelope outcome observed by the multi-turn orchestrator after each +// runAnalysis() call. 
Discriminates: +// - structured_output: 'on'|'off' — STRUCTURED_OUTPUT_ENFORCEMENT flag state +// at observation time (lets operators measure efficacy across flag flips) +// - envelope_source: 'parsed_output'|'text'|'stdout'|'merged'|'none' — which +// extraction path won (set by codeExecutionBridge.js:selectEnvelopeWithFallback) +// - turn_outcome: 'first_turn'|'retry' — whether the envelope arrived on +// turn 1 (the Avenue A v2 target) or required a turn-2+ corrective retry +// +// Operational queries (added to PR body + ops dashboard): +// sum by (structured_output) (rate(...{turn_outcome="first_turn"}[1h])) +// / sum by (structured_output) (rate(...[1h])) +// → turn-1 success rate per flag state. Target: structured_output="on" +// rate ≥ baseline ("off") rate; ideally substantially higher. +// +// sum by (envelope_source) (rate(...{structured_output="on"}[1h])) +// → confirms text/parsed_output/merged paths are actually engaging when +// the flag is on (vs. the bridge silently falling through to stdout). +// +// Cardinality: 5 templates × 3-6 phases × 2 flag states × 5 sources × 2 outcomes +// = ~300-600 series (well within Prometheus budget). +export const xlsxRenderTurn1EnvelopeSuccess = new client.Counter({ + name: 'claude_xlsx_render_turn1_envelope_success_total', + help: 'Per-phase turn-1 envelope outcome (Avenue A v2 efficacy gauge). Labels: ' + + 'template_id, phase, structured_output (on|off), envelope_source ' + + '(parsed_output|text|stdout|merged|none|merged:+stdout), ' + + 'turn_outcome (first_turn|retry).', + labelNames: ['template_id', 'phase', 'structured_output', 'envelope_source', 'turn_outcome'], +}); +export function recordXlsxRenderTurn1Envelope(templateId, phase, structuredOutputOn, envelopeSource, turnCount) { + try { + xlsxRenderTurn1EnvelopeSuccess.inc({ + template_id: templateId, + phase, + structured_output: structuredOutputOn ? 'on' : 'off', + envelope_source: envelopeSource || 'none', + turn_outcome: (turnCount > 1) ? 
'retry' : 'first_turn', + }); + } catch { /* metric is best-effort; never throws */ } +} + // Phase 8 (closes Phase 7 Issue #2 LLM-noncompliance gap): every render // where the sandbox produced ZERO BLUE-colored cells. The Phase 7 prompt // enrichment didn't change LLM behavior in live testing — the LLM still diff --git a/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js b/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js index 1593cc157..e9c65c35c 100644 --- a/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js +++ b/super-legal-mcp-refactored/src/utils/xlsxRenderer/multiTurnOrchestrator.js @@ -112,11 +112,28 @@ export async function renderMultiTurn({ template, inputs, sessionId, runAnalysis // Per-phase metrics — best-effort, never throws. try { - const { observeRenderPhaseDuration, recordXlsxRenderPhaseFailure } = await import('../sdkMetrics.js'); + const { + observeRenderPhaseDuration, + recordXlsxRenderPhaseFailure, + recordXlsxRenderTurn1Envelope, + } = await import('../sdkMetrics.js'); observeRenderPhaseDuration(template.id, phaseKey, phaseDurationSec); if (!phaseResult.success) { recordXlsxRenderPhaseFailure(template.id, phaseKey, classifyPhaseFailure(phaseResult)); } + // Avenue A v2 efficacy gauge — observe turn-1 envelope outcome per phase. + // Inputs sourced from bridge's finalResult (envelope_source set by + // codeExecutionBridge.js:selectEnvelopeWithFallback; turn_count rolled + // through the bridge's per-turn loop). featureFlags read lazily so test + // env overrides take effect. 
+ const { featureFlags } = await import('../../config/featureFlags.js'); + recordXlsxRenderTurn1Envelope( + template.id, + phaseKey, + !!featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT, + phaseResult.envelope_source, + phaseResult.turn_count || 1, + ); } catch (mErr) { logError('xlsx_phase_metric_emit_failed', { error: mErr.message, phaseKey }); } diff --git a/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js b/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js index e32b9ead0..b84e19dea 100644 --- a/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js +++ b/super-legal-mcp-refactored/test/sdk/xlsx-renderer-integration.test.js @@ -802,6 +802,55 @@ async function testT17_MultiTurnHappyPath() { assert(xr.rows[0]?.render_status === 'completed', 'T17: DB render_status=completed'); assert(xr.rows[0]?.s === 'PASS', 'T17: DB audit_results.status=PASS'); + // Avenue A v2 telemetry — confirm xlsxRenderTurn1EnvelopeSuccess counter + // emitted for every phase. The orchestrator increments after each + // runAnalysis() return (multiTurnOrchestrator.js:113-135). Counter labels + // {template_id, phase, structured_output, envelope_source, turn_outcome} + // are observable via the Prometheus client's internal registry. 
+ const { xlsxRenderTurn1EnvelopeSuccess } = await import('../../src/utils/sdkMetrics.js'); + const counterValues = await xlsxRenderTurn1EnvelopeSuccess.get(); + const fdwSeries = counterValues.values.filter( + (v) => v.labels.template_id === 'full-deal-workbook' + ); + assert( + fdwSeries.length >= expectedPhaseCount, + `T17 (Avenue A v2 telemetry): xlsxRenderTurn1EnvelopeSuccess emitted for ` + + `≥${expectedPhaseCount} phases (got ${fdwSeries.length} series)`, + ); + // Each phase should have at least one observation + const phasesObserved = new Set(fdwSeries.map((v) => v.labels.phase)); + assert( + phasesObserved.size === expectedPhaseCount, + `T17 (Avenue A v2 telemetry): all ${expectedPhaseCount} phases observed ` + + `(got ${phasesObserved.size}: ${[...phasesObserved].sort().join(',')})`, + ); + // turn_outcome labels are bounded enum {first_turn, retry} + const turnOutcomes = new Set(fdwSeries.map((v) => v.labels.turn_outcome)); + for (const outcome of turnOutcomes) { + assert( + outcome === 'first_turn' || outcome === 'retry', + `T17 (Avenue A v2 telemetry): turn_outcome bounded enum (got '${outcome}')`, + ); + } + // structured_output labels are bounded enum {on, off} + const flagStates = new Set(fdwSeries.map((v) => v.labels.structured_output)); + for (const state of flagStates) { + assert( + state === 'on' || state === 'off', + `T17 (Avenue A v2 telemetry): structured_output bounded enum (got '${state}')`, + ); + } + // envelope_source labels are bounded enum (incl. merged:* compound forms) + const sources = [...new Set(fdwSeries.map((v) => v.labels.envelope_source))]; + const validSources = sources.filter( + (s) => ['parsed_output', 'text', 'stdout', 'none'].includes(s) + || s.startsWith('merged:'), + ); + assert( + sources.length === validSources.length, + `T17 (Avenue A v2 telemetry): envelope_source bounded enum (got '${sources.join(',')}')`, + ); + await seeded.cleanup(); }