Sonnet 4.6 orchestrator produces 3x less detailed research plans than Sonnet 4.5

## Summary

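Sonnet 4.6 (`claude-sonnet-4-6`) was tested as orchestrator for the Super Legal research pipeline. Three comparative tests using the same query (Netflix/WBD $82.7B M&A due diligence) revealed a quality regression in research plan generation: 4.6 plans were about one third the length of the 4.5 plan (145 vs. 428 lines), assigned roughly half as many specialists, and contained no cross-reference patterns.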

## Test Conditions

All tests used the Agent SDK path (`agentQuery`) with `maxThinkingTokens: 16000` in the deprecated `type: 'enabled'` thinking mode. Adaptive thinking (`type: 'adaptive'`) could not be tested because the Agent SDK does not support it; this is blocked on [anthropics/claude-agent-sdk-typescript#25](https://github.com/anthropics/claude-agent-sdk-typescript/issues/25).

## Results

| Metric | Sonnet 4.5 | Sonnet 4.6 | Sonnet 4.6 (effort: high) |
|--------|-----------|-----------|--------------------------|
| Research plan lines | 428 | 145 | 145 |
| Specialists assigned | 13 | 7 | 8 |
| Specialist prompt detail | 8-12 focus items with case citations | 1-2 sentences | 1-2 sentences |
| Cross-reference patterns | 10 mapped | None | None |
| Thinking block (first) | 447 words, structured markdown | 133 words, flat | 180 words, flat |
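
The "3x" figure in the title can be recomputed from the table. The sketch below uses only the numbers reported above; the `RunMetrics` type and `ratio` helper are illustrative names, not part of the pipeline.

```typescript
// Figures copied from the Results table. Type and helper names are
// illustrative and do not exist in the Super Legal codebase.
interface RunMetrics {
  planLines: number;
  specialists: number;
  firstThinkingWords: number;
}

const sonnet45: RunMetrics = { planLines: 428, specialists: 13, firstThinkingWords: 447 };
const sonnet46: RunMetrics = { planLines: 145, specialists: 7, firstThinkingWords: 133 };

// Ratio of baseline to candidate, rounded to two decimal places.
function ratio(baseline: number, candidate: number): number {
  return Math.round((baseline / candidate) * 100) / 100;
}

const planRatio = ratio(sonnet45.planLines, sonnet46.planLines);
const thinkingRatio = ratio(sonnet45.firstThinkingWords, sonnet46.firstThinkingWords);
const specialistRatio = ratio(sonnet45.specialists, sonnet46.specialists);

console.log({ planRatio, thinkingRatio, specialistRatio });
```

By this measure the plan is just under 3x shorter, while the specialist count drops by a little under 2x.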

## Thinking Block Analysis

**Sonnet 4.5** produces structured thinking with markdown headers (`## Transaction Summary`, `## My Approach`, `### Complexity Assessment`, `### Domains Identified`), explicitly references system instructions, and maps domains to specialist types before writing the research plan.

**Sonnet 4.6** (both configurations) produces flat paragraphs with no structure, no complexity assessment, and no reference to system instructions. The original 4.6 test partially compensated by using the `mcp__super-legal-tools__think` tool for extended reasoning (258 words). The 4.6 `effort: high` test did not use this tool and instead used `TodoWrite` for procedural task tracking.
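
To make the "structured vs. flat" comparison repeatable across future runs, a thinking block can be summarized by word count and markdown heading count. This is a minimal sketch; `summarizeThinking` is a hypothetical helper, and its word count (whitespace-separated tokens, heading markers included) may not match how the figures in this issue were measured.

```typescript
// Hypothetical helper: summarizes the length and structure of one thinking block.
interface ThinkingSummary {
  words: number;       // whitespace-separated tokens, including "##" markers
  headings: number;    // ATX-style markdown headings ("#" through "######")
  structured: boolean; // true when at least one heading is present
}

function summarizeThinking(text: string): ThinkingSummary {
  const trimmed = text.trim();
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // A heading line starts with 1-6 "#" characters followed by whitespace and text.
  const headings = text
    .split("\n")
    .filter((line) => /^#{1,6}\s+\S/.test(line.trim())).length;
  return { words, headings, structured: headings > 0 };
}

// Made-up sample text, shaped like the 4.5 thinking described above.
const structuredSample = [
  "## Transaction Summary",
  "Netflix acquires WBD.",
  "### Complexity Assessment",
  "High.",
].join("\n");

const flatSample = "The user wants due diligence on a merger. I will plan the research.";

console.log(summarizeThinking(structuredSample));
console.log(summarizeThinking(flatSample));
```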

## Impact

The research plan directly drives specialist prompts, which determine research quality. 4.5's detailed prompts include specific case law citations (e.g., `*United States v. AT&T Inc.*, D.D.C. 2018`), enumerated focus areas (8-12 per specialist), cross-reference instructions, and key authorities. 4.6's 1-2 sentence prompts produce less focused specialist research.

## Root Cause

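The regression appears to be a model-level behavior difference. Both models used identical `maxThinkingTokens: 16000` through the same Agent SDK path, so the quality gap is attributable to Sonnet 4.6's thinking behavior in `type: 'enabled'` mode rather than to a configuration difference between the runs. Adaptive thinking remains untested (see Blocked On).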

## Blocked On

- [anthropics/claude-agent-sdk-typescript#25](https://github.com/anthropics/claude-agent-sdk-typescript/issues/25): adaptive thinking support in the Agent SDK. Once resolved, Sonnet 4.6 can be tested with `type: 'adaptive'` and `effort: 'high'`, which may produce deeper thinking output.

## Action Taken

- Reverted orchestrator to Sonnet 4.5 (v3.4.0)
- Increased `budget_tokens` from 4096 to 8192 on Messages API paths
- Sonnet 4.6 available via `SDK_MODEL=claude-sonnet-4-6` for future testing
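
The `SDK_MODEL` override could be resolved along these lines. This is a sketch under stated assumptions, not the pipeline's actual code: `resolveOrchestratorModel` is a hypothetical function, and the default ID `claude-sonnet-4-5` is assumed, since this issue names only the 4.6 model ID.

```typescript
// Assumed default: this issue does not state the exact Sonnet 4.5 model ID.
const DEFAULT_ORCHESTRATOR_MODEL = "claude-sonnet-4-5";

// Hypothetical resolver: returns the SDK_MODEL override when it is set and
// non-empty, otherwise falls back to the default orchestrator model.
function resolveOrchestratorModel(env: Record<string, string | undefined>): string {
  const override = env["SDK_MODEL"]?.trim();
  return override ? override : DEFAULT_ORCHESTRATOR_MODEL;
}

// In Node.js this would typically be called as resolveOrchestratorModel(process.env).
const defaultModel = resolveOrchestratorModel({});
const testModel = resolveOrchestratorModel({ SDK_MODEL: "claude-sonnet-4-6" });

console.log({ defaultModel, testModel });
```

Passing the environment as an argument keeps the resolver testable without mutating `process.env`.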

## Raw Logs

Full SSE thinking logs for all three tests are in `docs/Review-Dead-Code/WTF-IS-THIS-THINKING.md`.
