Per-turn reasoning cleanup causes server-side prompt cache bust on every Claude thinking model turn #125

@expiren

Summary

Magic Context's reasoning cleanup and stripClearedReasoning stages change the conversation body on every turn for Claude thinking models, causing a complete server-side prompt cache miss (hit rate 0%, read=0) even when the account, endpoint, fingerprint, and system prompt are all identical.

Environment

  • Magic Context: @cortexkit/opencode-magic-context@0.21.8
  • Provider: Antigravity proxy (Google Cloud Code Assist) with implicit prefix caching
  • Model: claude-opus-4-6-thinking (any Claude thinking model)
  • Plugin: @expiren/opencode-antigravity-auth@1.6.49

Root Cause

Claude thinking models generate thinking blocks in every assistant response. On the next turn, Magic Context's (MC's) Phase 1 transform pipeline runs:

  1. reasoning replay → clears N thinking blocks (N grows each turn)
  2. stripClearedReasoning → strips the cleared parts
  3. sentinel replay → neutralizes stripped messages

Because N grows as thinking blocks from the latest response are cleared, the conversation body sent on one turn differs from the body sent on the previous turn. Google's implicit prefix cache is keyed on an exact prefix hash, so a change to messages that were already sent invalidates the cached prefix and, in the case observed here, results in a complete cache miss.
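To illustrate why rewriting earlier messages hurts a prefix-keyed cache, here is a minimal sketch that counts how many leading messages two consecutive requests share. The `Msg` shape and `sharedPrefixLength` helper are hypothetical and exist only for this example; how the provider actually maps a shared prefix to cached tokens is an assumption, not something confirmed by the logs.

```typescript
// Hypothetical message shape for illustration; not MC's or the provider's real types.
type Msg = { role: "user" | "assistant"; content: string };

// Number of leading messages that serialize identically in both requests.
function sharedPrefixLength(prev: Msg[], next: Msg[]): number {
  const limit = Math.min(prev.length, next.length);
  let i = 0;
  while (i < limit && JSON.stringify(prev[i]) === JSON.stringify(next[i])) {
    i++;
  }
  return i;
}

// Request body sent on the previous turn.
const turn1: Msg[] = [
  { role: "user", content: "question 1" },
  { role: "assistant", content: "[thinking A] answer 1" },
  { role: "user", content: "question 2" },
];

// Append-only next turn: earlier messages are resent unchanged.
const appendOnly: Msg[] = [
  ...turn1,
  { role: "assistant", content: "[thinking B] answer 2" },
  { role: "user", content: "question 3" },
];

// Next turn where an already-sent assistant message has its thinking removed.
const rewritten: Msg[] = appendOnly.map((m, i) =>
  i === 1 ? { ...m, content: "answer 1" } : m
);

const appendOnlyShared = sharedPrefixLength(turn1, appendOnly); // 3: whole previous body reusable
const rewrittenShared = sharedPrefixLength(turn1, rewritten);   // 1: diverges at the rewritten message
```

In the append-only case the entire previous request is still a prefix of the new one. Once an earlier message is rewritten, the two requests diverge at that message, and nothing after it can match.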

Evidence from MC Logs

Three consecutive transforms on the same account, no account switch, <2 minutes apart:

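| Field                                 | Transform 1 | Transform 2                    | Transform 3 |
|---------------------------------------|-------------|--------------------------------|-------------|
| reasoning replay: cleared=            | 8           | 8                              | 10          |
| reasoning cleanup:                    | (none)      | cleared=2 watermark=2437→2467  | (none)      |
| stripClearedReasoning: strippedParts= | 8           | 8                              | 10          |
| sentinel replay: neutralized=         | 3           | 3                              | 7           |
| Output messages                       | 44          | 40                             | 41          |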

Transform 2 has reasoning cleanup: cleared=2 watermark=2437→2467 — two new thinking blocks from the previous assistant response were cleared, advancing the watermark. Transform 3 then strips 10 parts (was 8) because the watermark advanced.

Corresponding Plugin Cache Stats

These are from the Antigravity plugin's debug output for the same session, same account (idx=18, no account switch):

Request 1: Cache HIT  read=148055 total=148996 hitRate=99%   ← previous turn's prefix matched
Request 2: Cache MISS read=0      total=149615 hitRate=0%    ← complete miss after MC changed the body

Total tokens grew by only 619 (148996 → 149615), roughly a single user message, yet read dropped from 148055 to 0. The entire prefix was invalidated because MC's reasoning cleanup changed the conversation content.
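The arithmetic behind these figures can be checked with a short sketch. The numbers are copied from the two log lines above; the `hitRatePercent` helper and the assumption that the plugin reports `read / total` truncated to a whole percent are mine, not taken from the plugin's source.

```typescript
// Values copied from the two cache-stat log lines in this report.
const request1 = { read: 148055, total: 148996 };
const request2 = { read: 0, total: 149615 };

// Assumption: hitRate = read / total, truncated to a whole percent.
function hitRatePercent(r: { read: number; total: number }): number {
  return r.total === 0 ? 0 : Math.floor((r.read / r.total) * 100);
}

// Prompt growth between the two requests.
const growth = request2.total - request1.total; // 619

const rate1 = hitRatePercent(request1); // 99
const rate2 = hitRatePercent(request2); // 0
```

A prompt that grew by about 0.4% going from a 99% hit rate to 0% is hard to explain by new content alone, which is consistent with the earlier part of the body having changed.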

Impact

  • Each Claude thinking model turn whose body was changed by reasoning cleanup suffers a complete cache miss (~150K tokens re-processed uncached)
  • This wastes significant compute quota on the Antigravity proxy
  • Cache warmup probes become ineffective (probe seeds the cache, MC immediately invalidates it on the next turn)
  • With the observed alternating HIT/MISS pattern, the average hit rate cannot exceed ~50%, because every other turn is guaranteed to miss

Expected Behavior

The result of reasoning stripping should be stable across turns: if thinking blocks are always stripped to the same sentinel structure, regardless of how many new blocks were added, the prefix hash remains stable and the server-side cache is reused.

Possible Fixes

  1. Stable sentinel replacement: Replace all thinking blocks with a fixed-content sentinel (e.g., { text: "." }) so the stripped result is identical regardless of which/how many blocks were cleared. The count of sentinels and their content must be deterministic from turn to turn.

  2. Watermark-stable stripping: Apply reasoning cleanup at a fixed watermark position rather than advancing it each turn, so parts already stripped remain in the same sentinel form.

  3. One-time strip at generation time: Strip thinking blocks immediately when the assistant response is received (before it enters OpenCode's history), rather than re-stripping on every subsequent turn. This way the history content is already clean and doesn't change.
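Fix 1 could look roughly like the following sketch. The `Part` and `Message` shapes, the `SENTINEL` value, and `stripReasoning` are hypothetical names for illustration and are not Magic Context's actual types or API. The point is that each message's stripped form depends only on that message, never on a watermark or on how many blocks were cleared so far.

```typescript
// Hypothetical shapes for illustration; not Magic Context's real types.
type Part = { type: "reasoning" | "text"; text: string };
type Message = { role: "user" | "assistant"; parts: Part[] };

// Fixed-content sentinel, as suggested in fix 1.
const SENTINEL: Part = { type: "text", text: "." };

// Replace every reasoning part with the same sentinel.
// The output for a message depends only on that message.
function stripReasoning(history: Message[]): Message[] {
  return history.map((m) => ({
    role: m.role,
    parts: m.parts.map((p) => (p.type === "reasoning" ? { ...SENTINEL } : { ...p })),
  }));
}

const serialize = (msgs: Message[]): string[] => msgs.map((m) => JSON.stringify(m));

const historyTurn1: Message[] = [
  { role: "user", parts: [{ type: "text", text: "question 1" }] },
  {
    role: "assistant",
    parts: [
      { type: "reasoning", text: "thinking A" },
      { type: "text", text: "answer 1" },
    ],
  },
  { role: "user", parts: [{ type: "text", text: "question 2" }] },
];

// Next turn: a new assistant response (with new thinking) and a new user message.
const historyTurn2: Message[] = [
  ...historyTurn1,
  {
    role: "assistant",
    parts: [
      { type: "reasoning", text: "thinking B" },
      { type: "text", text: "answer 2" },
    ],
  },
  { role: "user", parts: [{ type: "text", text: "question 3" }] },
];

const sent1 = serialize(stripReasoning(historyTurn1));
const sent2 = serialize(stripReasoning(historyTurn2));

// Everything sent on turn 1 reappears unchanged at the start of turn 2.
const prefixStable = sent1.every((s, i) => s === sent2[i]);

// Stripping an already-stripped history changes nothing.
const once = stripReasoning(historyTurn1);
const idempotent = JSON.stringify(stripReasoning(once)) === JSON.stringify(once);
```

Because the transform is applied uniformly to every message on every turn, new thinking blocks only affect the newly appended messages, and the previously sent prefix serializes identically.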

Reproduction

  1. Use any Claude thinking model (e.g., claude-opus-4-6-thinking) with Magic Context enabled
  2. Send 3+ messages in a conversation
  3. Observe MC logs: reasoning replay: cleared=N where N increases each turn
  4. Observe provider cache stats: an alternating HIT/MISS pattern, with a complete MISS (read=0) on every other turn

Related

The message.updated events with hasUsageTokens=false being counted as cache BUSTs in MC diagnostics is a separate but related issue — it inflates the BUST count in MC's own metrics for Antigravity Claude models.
