Summary
Sonnet 4.6 (claude-sonnet-4-6) was tested as orchestrator for the Super Legal research pipeline. Three comparative tests using identical queries (Netflix/WBD $82.7B M&A due diligence) revealed a significant quality regression in research plan generation.
Test Conditions
All tests used the Agent SDK path (agentQuery) with maxThinkingTokens: 16000 and the deprecated type: 'enabled' thinking mode. Adaptive thinking (type: 'adaptive') could not be tested because the Agent SDK does not support it yet; this is blocked on anthropics/claude-agent-sdk-typescript#25.
Results
| Metric | Sonnet 4.5 | Sonnet 4.6 | Sonnet 4.6 (effort: high) |
| --- | --- | --- | --- |
| Research plan lines | 428 | 145 | 145 |
| Specialists assigned | 13 | 7 | 8 |
| Specialist prompt detail | 8-12 focus items with case citations | 1-2 sentences | 1-2 sentences |
| Cross-reference patterns | 10 mapped | None | None |
| Thinking block (first) | 447 words, structured markdown | 133 words, flat | 180 words, flat |
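To put the numeric rows in relative terms, the drop from the Sonnet 4.5 baseline can be computed directly from the table values. This is a minimal sketch: `percentDrop` and the `rows` layout are illustrative names, not part of the pipeline, and the numbers are copied from the table above.

```typescript
// Hypothetical helper: relative drop from a baseline, as a percentage
// rounded to one decimal place.
function percentDrop(baseline: number, observed: number): number {
  if (baseline <= 0) {
    throw new Error("baseline must be positive");
  }
  return Math.round(((baseline - observed) / baseline) * 1000) / 10;
}

// [metric, Sonnet 4.5, Sonnet 4.6, Sonnet 4.6 (effort: high)]
const rows: Array<[string, number, number, number]> = [
  ["Research plan lines", 428, 145, 145],
  ["Specialists assigned", 13, 7, 8],
  ["Thinking block words (first)", 447, 133, 180],
];

for (const [metric, base, v46, v46High] of rows) {
  console.log(
    `${metric}: 4.6 down ${percentDrop(base, v46)}%, ` +
      `4.6 (effort: high) down ${percentDrop(base, v46High)}%`
  );
}
```

On these figures the research plan is roughly two thirds shorter under both 4.6 configurations, and the first thinking block shrinks by about 70% (default) and 60% (effort: high).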
Thinking Block Analysis
Sonnet 4.5 produces structured thinking with markdown headers (## Transaction Summary, ## My Approach, ### Complexity Assessment, ### Domains Identified), explicitly references system instructions, and maps domains to specialist types before writing the research plan.
Sonnet 4.6 (both configurations) produces flat paragraphs with no structure, no complexity assessment, and no reference to system instructions. The original 4.6 test partially compensated by using the mcp__super-legal-tools__think tool for extended reasoning (258 words). The 4.6 (effort: high) test did not use this tool and instead used TodoWrite for procedural task tracking.
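The "structured" versus "flat" distinction can be checked mechanically. The sketch below is one possible way to do that, not the method used to produce the table: `summarizeThinking` is a hypothetical helper, the word-counting rule (whitespace-separated tokens containing a letter or digit) is an assumption, and the sample strings are invented rather than taken from the logs.

```typescript
// Hypothetical helper: summarize a thinking block by word count and by the
// markdown headers it contains.
interface ThinkingSummary {
  words: number;
  headers: string[];
  structured: boolean;
}

function summarizeThinking(text: string): ThinkingSummary {
  // Assumed counting rule: tokens with at least one letter or digit,
  // so bare "##" markers are not counted as words.
  const words = text
    .split(/\s+/)
    .filter((token) => /[A-Za-z0-9]/.test(token)).length;

  // ATX-style headers: 1-6 '#' characters at line start, then whitespace.
  const headers = text
    .split(/\r?\n/)
    .filter((line) => /^#{1,6}\s+\S/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim());

  return { words, headers, structured: headers.length > 0 };
}

// Invented samples for illustration only.
const structuredSample =
  "## Transaction Summary\nNetflix acquires WBD.\n### Domains Identified\nAntitrust";
const flatSample = "I need to plan the research.";

console.log(summarizeThinking(structuredSample));
console.log(summarizeThinking(flatSample));
```

Running a check like this over the first thinking block of each test would make the comparison repeatable for future model versions.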
Impact
The research plan directly drives specialist prompts, which determine research quality. 4.5's detailed prompts include specific case law citations (e.g., *United States v. AT&T Inc.*, D.D.C. 2018), enumerated focus areas (8-12 per specialist), cross-reference instructions, and key authorities. 4.6's 1-2 sentence prompts produce less focused specialist research.
Root Cause
Model-level behavior difference. Both models used identical maxThinkingTokens: 16000 through the same Agent SDK path. The quality gap is intrinsic to Sonnet 4.6's thinking behavior, not a configuration issue.
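Blocked On
anthropics/claude-agent-sdk-typescript#25: Agent SDK support for type: 'adaptive' + effort: 'high', which may produce deeper thinking output.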
Action Taken
- Reverted orchestrator to Sonnet 4.5 (v3.4.0)
- Increased budget_tokens from 4096 → 8192 on Messages API paths
- Sonnet 4.6 available via SDK_MODEL=claude-sonnet-4-6 for future testing
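The actions above can be sketched as a small configuration helper. This is an illustration under stated assumptions, not the project's actual code: the function names and the default model ID `claude-sonnet-4-5` are assumed, and only `SDK_MODEL`, `claude-sonnet-4-6`, and the 8192 budget come from the notes above. The `{ type: "enabled", budget_tokens }` shape follows the Messages API extended-thinking parameter.

```typescript
// Assumed default: the ID used here for Sonnet 4.5 is illustrative.
const DEFAULT_ORCHESTRATOR_MODEL = "claude-sonnet-4-5";

type Env = Record<string, string | undefined>;

// Hypothetical helper: SDK_MODEL overrides the default orchestrator model.
function resolveOrchestratorModel(env: Env): string {
  const override = env["SDK_MODEL"]?.trim();
  return override ? override : DEFAULT_ORCHESTRATOR_MODEL;
}

// Messages API extended-thinking parameter (manual budget mode).
interface EnabledThinking {
  type: "enabled";
  budget_tokens: number;
}

// Hypothetical helper: validate and build the thinking parameter.
function buildThinking(budgetTokens: number, maxTokens: number): EnabledThinking {
  // The Messages API requires a budget of at least 1024 tokens,
  // and the budget must stay below max_tokens.
  if (!Number.isInteger(budgetTokens) || budgetTokens < 1024) {
    throw new Error("budget_tokens must be an integer >= 1024");
  }
  if (budgetTokens >= maxTokens) {
    throw new Error("budget_tokens must be less than max_tokens");
  }
  return { type: "enabled", budget_tokens: budgetTokens };
}

console.log(resolveOrchestratorModel({}));
console.log(resolveOrchestratorModel({ SDK_MODEL: "claude-sonnet-4-6" }));
console.log(buildThinking(8192, 16000));
```

In a Node process the first helper would typically be called as `resolveOrchestratorModel(process.env)`, so a single environment variable switches the orchestrator back to Sonnet 4.6 for a retest.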
Raw Logs
Full SSE thinking logs for all three tests are in docs/Review-Dead-Code/WTF-IS-THIS-THINKING.md.