Sonnet 4.6 orchestrator produces 3x less detailed research plans than Sonnet 4.5 #3

@Number531

Summary

Sonnet 4.6 (claude-sonnet-4-6) was tested as orchestrator for the Super Legal research pipeline. Three comparative tests using identical queries (Netflix/WBD $82.7B M&A due diligence) revealed a significant quality regression in research plan generation.

Test Conditions

All tests used the Agent SDK path (agentQuery) with maxThinkingTokens: 16000 and the deprecated type: 'enabled' thinking mode. Adaptive thinking (type: 'adaptive') could not be tested because the Agent SDK does not support it; that work is blocked on anthropics/claude-agent-sdk-typescript#25.

Results

| Metric | Sonnet 4.5 | Sonnet 4.6 | Sonnet 4.6 (effort: high) |
| --- | --- | --- | --- |
| Research plan lines | 428 | 145 | 145 |
| Specialists assigned | 13 | 7 | 8 |
| Specialist prompt detail | 8-12 focus items with case citations | 1-2 sentences | 1-2 sentences |
| Cross-reference patterns | 10 mapped | None | None |
| Thinking block (first) | 447 words, structured markdown | 133 words, flat | 180 words, flat |
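
To make the size of the gap concrete, the numeric rows of the table can be turned into ratios against the Sonnet 4.5 baseline. This is a minimal sketch: the figures are copied from the table above, while the dictionary keys and the `ratio_vs_baseline` helper are names invented here for illustration.

```python
# Figures copied from the results table; only the ratio arithmetic is new.
# Keys such as "sonnet_4_6_high" are illustrative labels, not SDK identifiers.
results = {
    "research_plan_lines": {"sonnet_4_5": 428, "sonnet_4_6": 145, "sonnet_4_6_high": 145},
    "specialists_assigned": {"sonnet_4_5": 13, "sonnet_4_6": 7, "sonnet_4_6_high": 8},
    "first_thinking_block_words": {"sonnet_4_5": 447, "sonnet_4_6": 133, "sonnet_4_6_high": 180},
}


def ratio_vs_baseline(metric: dict, baseline: str = "sonnet_4_5") -> dict:
    """Return baseline / value for every non-baseline configuration, rounded to 2 places."""
    base = metric[baseline]
    return {
        name: round(base / value, 2)
        for name, value in metric.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, metric in results.items():
        print(name, ratio_vs_baseline(metric))
```

The research plan ratio works out to roughly 2.95 for both 4.6 configurations, which is where the "3x" in the title comes from.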

Thinking Block Analysis

Sonnet 4.5 produces structured thinking with markdown headers (## Transaction Summary, ## My Approach, ### Complexity Assessment, ### Domains Identified), explicitly references system instructions, and maps domains to specialist types before writing the research plan.

Sonnet 4.6 (both configurations) produces flat paragraphs with no structure, no complexity assessment, and no reference to system instructions. The original 4.6 test partially compensated by using the mcp__super-legal-tools__think tool for extended reasoning (258 words). The 4.6 (effort: high) test did not use this tool; it used TodoWrite for procedural task tracking instead.
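
The "structured" versus "flat" distinction above could be checked mechanically when comparing future runs. The sketch below is one rough way to do it, assuming a thinking block is available as a plain string; `summarize_thinking` is a hypothetical helper, the word-counting rule is an assumption (the issue does not say how its word counts were obtained), and the sample strings are made up rather than taken from the logs.

```python
import re

# A markdown ATX header: 1-6 '#' characters at line start, then a space and text.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)
# A rough word pattern; allows internal apostrophes and hyphens.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")


def summarize_thinking(text: str) -> dict:
    """Count words and markdown headers in a thinking block and label its style."""
    headers = len(HEADER_RE.findall(text))
    words = len(WORD_RE.findall(text))
    return {
        "words": words,
        "headers": headers,
        # Any markdown header counts as "structured"; this threshold is an assumption.
        "style": "structured" if headers > 0 else "flat",
    }


if __name__ == "__main__":
    # Made-up samples for illustration only, not excerpts from the test logs.
    structured = "## Transaction Summary\nLarge media merger.\n### Complexity Assessment\nHigh."
    flat = "This is a large media merger so I will plan the research now."
    print(summarize_thinking(structured))
    print(summarize_thinking(flat))
```

A check like this only captures surface structure; it says nothing about whether the thinking references system instructions or maps domains to specialists.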

Impact

The research plan directly drives specialist prompts, which determine research quality. 4.5's detailed prompts include specific case law citations (e.g., *United States v. AT&T Inc.*, D.D.C. 2018), enumerated focus areas (8-12 per specialist), cross-reference instructions, and key authorities. 4.6's 1-2 sentence prompts produce less focused specialist research.

Root Cause

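Most likely a model-level behavior difference. Both models ran with identical maxThinkingTokens: 16000 through the same Agent SDK path, so the gap does not come from a configuration difference between the runs. It appears intrinsic to Sonnet 4.6's thinking behavior in type: 'enabled' mode; adaptive thinking remains untested.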

Blocked On

  • anthropics/claude-agent-sdk-typescript#25: Adaptive thinking support in the Agent SDK. Once resolved, Sonnet 4.6 can be tested with type: 'adaptive' + effort: 'high', which may produce deeper thinking output.

Action Taken

  • Reverted orchestrator to Sonnet 4.5 (v3.4.0)
  • Increased budget_tokens from 4096 to 8192 on Messages API paths
  • Sonnet 4.6 available via SDK_MODEL=claude-sonnet-4-6 for future testing
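
For readers reproducing this setup, the model override and the raised thinking budget might look roughly like the sketch below. It is not the project's actual code: `resolve_model`, `thinking_params`, the `max_tokens` default, and the Sonnet 4.5 identifier used as the fallback are all assumptions made for illustration. Only the `SDK_MODEL` variable, the `claude-sonnet-4-6` identifier, and the 8192 budget come from this issue. The `thinking` dictionary follows the Messages API shape for extended thinking, where `budget_tokens` must be at least 1024 and smaller than `max_tokens`.

```python
import os

# Assumed identifier for Sonnet 4.5; the issue does not state the exact string.
DEFAULT_MODEL = "claude-sonnet-4-5"


def resolve_model(env=None) -> str:
    """Return SDK_MODEL if set and non-empty, otherwise the default orchestrator model."""
    env = os.environ if env is None else env
    return env.get("SDK_MODEL") or DEFAULT_MODEL


def thinking_params(budget_tokens: int = 8192, max_tokens: int = 16000) -> dict:
    """Build the thinking-related request fields for a Messages API call.

    max_tokens=16000 is a placeholder chosen for this sketch, not a project setting.
    """
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    if max_tokens <= budget_tokens:
        raise ValueError("max_tokens must be greater than budget_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


if __name__ == "__main__":
    print(resolve_model())
    print(thinking_params())
```

Running with `SDK_MODEL=claude-sonnet-4-6` set in the environment would select Sonnet 4.6 for a test run without changing the default.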

Raw Logs

Full SSE thinking logs for all three tests are in docs/Review-Dead-Code/WTF-IS-THIS-THINKING.md.

Labels: bug, model-evaluation
