From 85cb369abeb510ec0cd1e38c5abbea10ecccf6fd Mon Sep 17 00:00:00 2001
From: Number531 <120485065+Number531@users.noreply.github.com>
Date: Fri, 24 Apr 2026 14:34:18 -0400
Subject: [PATCH] docs(sdk): align stale 0.2.97 references with v6.3.0 bump
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up to #79 (SDK bump PR). The v6.3.0 CHANGELOG entry listed these
two docs under "Updated — Documentation" but they weren't included in
the main PR due to bundling concerns with unrelated pending changes.
Landing them here so the CHANGELOG claim matches repo state.

- docs/pending-updates/execution-gke-migration.md — single-line surgical
  update to the package.json version reference (line 1474). Unrelated
  pending content in the working tree was preserved and not bundled.
- docs/citation-chat-router-sdk-alignment.md — previously untracked
  research doc (created 2026-03-17, last updated 2026-03-19). Committing
  as a new file with the version reference already bumped to 0.2.119.

No functional changes. Docs-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../citation-chat-router-sdk-alignment.md     | 541 ++++++++++++++++++
 .../execution-gke-migration.md                |   2 +-
 2 files changed, 542 insertions(+), 1 deletion(-)
 create mode 100644 super-legal-mcp-refactored/docs/citation-chat-router-sdk-alignment.md

diff --git a/super-legal-mcp-refactored/docs/citation-chat-router-sdk-alignment.md b/super-legal-mcp-refactored/docs/citation-chat-router-sdk-alignment.md
new file mode 100644
index 000000000..02a017733
--- /dev/null
+++ b/super-legal-mcp-refactored/docs/citation-chat-router-sdk-alignment.md
@@ -0,0 +1,541 @@
+# Citation Chat Router — Claude Agent SDK Alignment Research
+
+**Created**: 2026-03-17
+**Updated**: 2026-03-19 (version alignment, endpoint path correction)
+**Topic**: Agent SDK vs Messages API for interactive conversational chat
+**Versions**: `@anthropic-ai/claude-agent-sdk` 0.2.119, `@anthropic-ai/sdk` ^0.86.1
+**Sources verified as of**: March 2026
+
+---
+
+## Table of Contents
+
+1. [Executive Summary](#1-executive-summary)
+2. [Claude Agent SDK — Conversational Patterns](#2-claude-agent-sdk--conversational-patterns)
+3. [Messages API Streaming — Current Recommended Path](#3-messages-api-streaming--current-recommended-path)
+4. [Anthropic Cookbook Coverage](#4-anthropic-cookbook-coverage)
+5. [Agent SDK vs Messages API — Official Distinction](#5-agent-sdk-vs-messages-api--official-distinction)
+6. [Prompt Caching for Multi-Turn Citation Context](#6-prompt-caching-for-multi-turn-citation-context)
+7. [Design Alignment Assessment for citationChatRouter.js](#7-design-alignment-assessment-for-citationchatrouterjs)
+8. [Recommended Implementation Pattern](#8-recommended-implementation-pattern)
+9. [References](#9-references)
+
+---
+
+## 1. Executive Summary
+
+**The proposed citationChatRouter.js design — an Express router using the Messages API directly with streaming — is fully aligned with Anthropic's recommended patterns for interactive conversational use cases.** The Claude Agent SDK is explicitly not the right tool for this feature. The Messages API with `client.messages.stream()` (not `client.beta.messages.stream()`) is the current stable recommended path for multi-turn chat. Prompt caching with a 10K-token citation context block will yield approximately 90% cost reduction on cache hits and is straightforward to implement.
+
+Key findings:
+
+| Question | Answer |
+|----------|--------|
+| Should a chat router use Agent SDK? | No — Agent SDK is for autonomous tool-executing agents |
+| Is `.stream()` still on beta path? | No — `client.messages.stream()` is the stable path for chat. Note: the existing `/api/stream` endpoint uses `anthropic.beta.messages.stream()` for `context_management` beta features, but the chat router does not need those. |
+| Is there an official Agent SDK "chat mode"? | V2 session API (`unstable_v2_createSession`) exists but is marked unstable and has higher overhead than Messages API |
+| Should citation context be prompt-cached? | Yes — inject as system prompt block with `cache_control: { type: "ephemeral" }` |
+| Is Express SSE streaming compatible? | Yes — pipe `content_block_delta` text deltas directly to `res.write()` |
+
+---
+
+## 2. Claude Agent SDK — Conversational Patterns
+
+**Source**: [TypeScript V2 interface preview](https://platform.claude.com/docs/en/agent-sdk/typescript-v2-preview), [Agent SDK overview](https://platform.claude.com/docs/en/agent-sdk/overview), [TypeScript SDK reference (V1)](https://platform.claude.com/docs/en/agent-sdk/typescript)
+**Verified**: March 2026
+
+### 2.1 What the Agent SDK Is Designed For
+
+The Agent SDK gives you Claude with **built-in autonomous tool execution**. The overview page opens with:
+
+> "Build AI agents that autonomously read files, run commands, search the web, edit code, and more."
+
+Every example in the overview is an autonomous task: "Find and fix the bug in auth.py", "Find all TODO comments", "Use the code-reviewer agent to review this codebase". The tool list (`Read`, `Write`, `Edit`, `Bash`, `Glob`, `Grep`, `WebSearch`, `WebFetch`, `AskUserQuestion`) makes this purpose explicit.
+
+### 2.2 The V1 `query()` Mode
+
+The primary V1 API is `query()`, an async generator over `SDKMessage` events. For multi-turn conversations in V1 you must either:
+
+- Use `options.resume` with a session ID to continue a previous session
+- Pass `prompt` as an `AsyncIterable<SDKUserMessage>` and coordinate yields yourself
+
+The docs describe the V1 multi-turn approach (feeding messages via `AsyncIterable`) as requiring "yield coordination" and "restructuring" even for basic multi-turn use. That does not suit a request/response HTTP router, where each turn is a separate HTTP request.
+
+### 2.3 The V2 Session API (Unstable Preview)
+
+The V2 API (`unstable_v2_createSession`, `unstable_v2_resumeSession`, `session.send()`, `session.stream()`) provides a cleaner multi-turn interface:
+
+```typescript
+// V2 multi-turn (unstable preview)
+import { unstable_v2_createSession } from "@anthropic-ai/claude-agent-sdk";
+
+await using session = unstable_v2_createSession({ model: "claude-opus-4-6" });
+
+await session.send("What is 5 + 3?");
+for await (const msg of session.stream()) {
+  if (msg.type === "assistant") {
+    const text = msg.message.content
+      .filter((block) => block.type === "text")
+      .map((block) => block.text)
+      .join("");
+    console.log(text);
+  }
+}
+```
+
+**Critical caveat**: The V2 page carries a prominent warning:
+
+> "The V2 interface is an **unstable preview**. APIs may change based on feedback before becoming stable."
+
+**Additional concern**: Session forking, some advanced streaming input patterns, and other V1 features are unavailable in V2.
+
+**Why V2 sessions are not the right choice for a citation chat router**:
+
+1. Sessions are **process-backed** — they spawn a Claude Code subprocess. This has substantially higher per-session overhead than an HTTP call to the Messages API.
+2. The Agent SDK is designed for single-session long-running autonomous work. Using it for short request/response chat turns per HTTP request is architecturally mismatched.
+3. The context injection model (citation chunks injected per-turn) is cleanly handled by the Messages API `system` parameter but requires awkward pre-turn `send()` calls in the Session API.
+4. V2 is marked unstable with no stability commitment date.
+
+### 2.4 Official Guidance: Agent SDK vs Client SDK
+
+The Agent SDK overview page includes an explicit comparison tab titled "Agent SDK vs Client SDK":
+
+> **Client SDK**: You send prompts and implement tool execution yourself. Gives you direct API access.
+> **Agent SDK**: Gives you Claude with built-in tool execution. Claude handles tools autonomously.
+
+The docs show this contrast:
+
+```typescript
+// Client SDK (Messages API): You control the loop
+let response = await client.messages.create({ ...params });
+while (response.stop_reason === "tool_use") {
+  const result = yourToolExecutor(response.tool_use);
+  response = await client.messages.create({ tool_result: result, ...params });
+}
+
+// Agent SDK: Claude handles tools autonomously
+for await (const message of query({ prompt: "Fix the bug in auth.py" })) {
+  console.log(message);
+}
+```
+
+A citation chat router has **no tools to execute**. It takes user questions, injects document context, and returns text. The Messages API is the correct abstraction.
+
+---
+
+## 3. Messages API Streaming — Current Recommended Path
+
+**Source**: [Streaming Messages docs](https://platform.claude.com/docs/en/docs/build-with-claude/streaming)
+**Verified**: March 2026
+
+### 3.1 `.stream()` Has Graduated from Beta
+
+`client.messages.stream()` is documented on the main (non-beta) streaming page with examples using `client.messages.stream()` (not `client.beta.messages.stream()`). The canonical TypeScript example:
+
+```typescript
+import Anthropic from "@anthropic-ai/sdk";
+
+const client = new Anthropic();
+
+await client.messages
+  .stream({
+    messages: [{ role: "user", content: "Hello" }],
+    model: "claude-opus-4-6",
+    max_tokens: 1024
+  })
+  .on("text", (text) => {
+    console.log(text);
+  });
+```
+
+This is `@anthropic-ai/sdk` 0.86.1's stable API surface. The `client.beta.messages.stream()` path is not mentioned in current docs for standard text streaming. The beta path is now used only for beta-gated features (e.g., code execution with `code_execution_20250825` tool — as noted in the project memory for `codeExecutionBridge.js`).
+
+**Confirmed in project memory (2026-02-25)**: `codeExecutionBridge.js` was migrated away from `client.beta.messages.create` to `client.messages.create` (standard path) precisely because "the beta path silently reused containers" and had undocumented behavior. The principle generalizes: prefer the standard path.
+
+### 3.2 SSE Event Flow
+
+Each stream emits these events in order:
+
+1. `message_start` — contains a `Message` object with empty `content`
+2. For each content block: `content_block_start` → one or more `content_block_delta` → `content_block_stop`
+3. One or more `message_delta` — top-level changes, cumulative token counts
+4. `message_stop`
+
+Plus optional `ping` events (keep-alive) and `error` events.
+
+**Text delta wire format**:
+```
+event: content_block_delta
+data: {"type": "content_block_delta","index": 0,"delta": {"type": "text_delta", "text": "ello frien"}}
+```
+
+**Tool use delta wire format** (when tools are present):
+```
+event: content_block_delta
+data: {"type": "content_block_delta","index": 1,"delta": {"type": "input_json_delta","partial_json": "{\"location\": \"San Fra"}}
+```
+
+### 3.3 TypeScript Stream Helpers
+
+The SDK's `MessageStream` object (returned by `.stream()`) exposes these methods:
+
+| Method | Description |
+|--------|-------------|
+| `.on("text", cb)` | Fires on each text delta with the text string |
+| `.on("streamEvent", cb)` | Fires on each raw `MessageStreamEvent` |
+| `.on("finalMessage", cb)` | Fires once with complete accumulated `Message` |
+| `.on("error", cb)` | Fires on stream errors |
+| `.finalMessage()` | Promise resolving to complete `Message` |
+| `.finalText()` | Promise resolving to complete text string |
+| `.abort()` | Cancel the stream |
+
+For Express SSE delivery, use `.on("text", ...)` to pipe deltas and `.on("finalMessage", ...)` to send the `done` event:
+
+```typescript
+// Express route handler pattern
+res.setHeader("Content-Type", "text/event-stream");
+res.setHeader("Cache-Control", "no-cache");
+res.setHeader("Connection", "keep-alive");
+
+const stream = client.messages.stream({
+  model: "claude-sonnet-4-6",
+  max_tokens: 2048,
+  system: buildSystemWithCitations(CHAT_SYSTEM_INSTRUCTIONS, citationBlock),
+  messages: conversationHistory
+});
+
+stream.on("text", (text) => {
+  res.write(`data: ${JSON.stringify({ type: "text_delta", text })}\n\n`);
+});
+
+stream.on("finalMessage", () => {
+  res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
+  res.end();
+});
+
+stream.on("error", (err) => {
+  res.write(`data: ${JSON.stringify({ type: "error", message: err.message })}\n\n`);
+  res.end();
+});
+```
+
+---
+
+## 4. Anthropic Cookbook Coverage
+
+**Source**: [anthropics/anthropic-cookbook](https://github.com/anthropics/anthropic-cookbook)
+**Verified**: March 2026 (35.2k stars, MIT licensed)
+
+### 4.1 Available RAG Examples
+
+| Example | Path | Relevance |
+|---------|------|-----------|
+| RAG with Pinecone | `third_party/Pinecone/rag_using_pinecone.ipynb` | Vector DB retrieval + Claude Q&A |
+| Wikipedia search | `third_party/Wikipedia/wikipedia-search-cookbook.ipynb` | Real-time knowledge retrieval |
+| Embeddings (Voyage AI) | `third_party/VoyageAI/how_to_create_embeddings.md` | Generating embeddings for RAG |
+| PDF upload + summarization | `misc/pdf_upload_summarization.ipynb` | Document Q&A |
+| Customer service agent | `tool_use/customer_service_agent.ipynb` | Conversational tool-use agent |
+
+### 4.2 Gap: No Node.js / Express Examples
+
+> "While the code examples are primarily written in Python, the concepts can be adapted to any language that supports interaction with the Claude API."
+
+Nearly all cookbook examples are Jupyter notebooks (the repository is 95.9% notebooks, 4.1% Python scripts). There are **no** Express/Node.js server integration examples in the official cookbook. This is a documentation gap but does not indicate incompatibility; it simply means the adapter pattern must be implemented directly from the streaming SSE documentation.
+
+### 4.3 RAG Pattern from Cookbook
+
+The Pinecone cookbook establishes the canonical pattern:
+
+1. Embed user query → vector similarity search → retrieve top-k chunks
+2. Concatenate chunks into context block
+3. Inject context into Claude system prompt or user message
+4. Stream Claude's response
+
+For a citation-based Q&A router, the retrieval step is replaced by explicit citation injection (citations already known from the report), making this simpler than general RAG.
+
+---
+
+## 5. Agent SDK vs Messages API — Official Distinction
+
+**Source**: [Agent SDK overview — "Compare the Agent SDK to other Claude tools"](https://platform.claude.com/docs/en/agent-sdk/overview)
+**Verified**: March 2026
+
+### 5.1 The Official Decision Matrix
+
+From the "Agent SDK vs Client SDK" comparison tab:
+
+| Use when | Tool choice |
+|----------|-------------|
+| Your use case is text generation: chatbots, summarization, classification, content generation, Q&A | **Messages API (Client SDK)** |
+| You want autonomous tool execution (file read/write, shell commands, web search) without implementing a tool loop | **Agent SDK** |
+| CI/CD pipelines, production automation, custom applications | Agent SDK |
+| Interactive development, one-off tasks | CLI |
+
+Both "chatbots" and "Q&A" are explicitly listed as Messages API use cases. A citation Q&A router maps directly to this category.
+
+### 5.2 The Tool Loop Distinction
+
+The defining architectural difference: with the Messages API you **implement the tool loop yourself** if you need tools. With the Agent SDK, Claude handles tools autonomously.
+
+For a citation chat router:
+- **No tools are needed** — citations are pre-fetched and injected as context
+- The "tool loop" concern is irrelevant
+- The Messages API is the minimal correct abstraction
+
+### 5.3 Sessions in Agent SDK vs Stateless HTTP
+
+The Agent SDK session model is designed for **long-running stateful sessions** where Claude accumulates file-system context across many turns of autonomous work. It stores session transcripts to disk (`listSessions()`, `getSessionMessages()` return filesystem-backed data).
+
+A chat router serving many concurrent users needs **stateless per-request handling** where conversation history is managed by the caller (client sends full history on each turn, or server stores it in Redis/memory). The Messages API is inherently stateless per-request, which is the correct model for an HTTP router.
+
+---
+
+## 6. Prompt Caching for Multi-Turn Citation Context
+
+**Source**: [Prompt caching docs](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching), [Automatic prompt caching announcement (Feb 2026)](https://medium.com/ai-software-engineer/anthropic-just-fixed-the-biggest-hidden-cost-in-ai-agents-using-automatic-prompt-caching-9d47c95903c5)
+**Verified**: March 2026
+
+### 6.1 Why This Matters for Citation Chat
+
+A citation context block of ~10K tokens, sent on every turn of a multi-turn conversation, accumulates significant cost. With Sonnet 4.6 at $3/MTok input:
+
+- Without caching, 10 turns × 10K context tokens = 100K tokens = $0.30 per conversation
+- With caching (90% savings on cache hits): one cache write at 1.25x ($0.0375 per 10K) + 9 cache reads at 0.1x ($0.003 each, $0.027 in total) = $0.0645 total, a **~79% reduction**
+
+For Sonnet 4.6 specifically, the minimum cacheable prompt length is **2,048 tokens**, meaning a 10K-token citation block comfortably qualifies.
+
+### 6.2 Workspace Isolation Update (February 5, 2026)
+
+**Important operational note**: As of February 5, 2026, prompt caching uses workspace-level isolation (not organization-level). Caches are isolated per workspace. This means cache hits only occur within the same API workspace — verify the citation chat router uses the same `ANTHROPIC_API_KEY` workspace as other services if cross-service cache sharing is desired.
+
+### 6.3 Three Caching Strategies for Citation Chat
+
+**Strategy A — Static citation block in system prompt (recommended)**
+
+When the same set of citations is used throughout a conversation, inject them once as a cached system prompt block:
+
+```typescript
+// Build system prompt with cached citation context
+function buildSystemWithCitations(
+  baseInstructions: string,
+  citationText: string // ~10K tokens of citation content
+): Array<{ type: "text"; text: string; cache_control?: object }> {
+  return [
+    {
+      type: "text",
+      text: baseInstructions
+      // No breakpoint here, but this block is part of the cached prefix: keep it identical across turns or the citation cache will miss
+    },
+    {
+      type: "text",
+      text: citationText,
+      cache_control: { type: "ephemeral" } // Cache the large citation block
+    }
+  ];
+}
+
+// Usage in Messages API call
+const response = await client.messages.create({
+  model: "claude-sonnet-4-6",
+  max_tokens: 2048,
+  system: buildSystemWithCitations(CHAT_SYSTEM_INSTRUCTIONS, citationBlock),
+  messages: conversationHistory
+});
+```
+
+Cache hit behavior:
+- First call: `cache_creation_input_tokens` = ~10K, `cache_read_input_tokens` = 0
+- Subsequent calls (within 5 min of the previous call; each cache hit refreshes the TTL): `cache_read_input_tokens` = ~10K, billed at 10% of the base input price
+
+**Strategy B — Automatic caching for growing conversation history**
+
+Use top-level `cache_control` to progressively cache the conversation history as it grows:
+
+```typescript
+const response = await client.messages.create({
+  model: "claude-sonnet-4-6",
+  max_tokens: 2048,
+  cache_control: { type: "ephemeral" }, // Automatic: caches up to last cacheable block
+  system: citationSystemPrompt,
+  messages: conversationHistory
+});
+```
+
+Cache behavior advances automatically:
+- Turn 1: `system + user[1]` written to cache
+- Turn 2: `system + user[1]` read from cache; `asst[1] + user[2]` written
+- Turn 3: everything through `user[2]` read from cache; only `asst[2] + user[3]` written
+
+**Strategy C — Combined (best for long conversations with static citations)**
+
+Cache the citation block explicitly, let automatic caching handle conversation history:
+
+```typescript
+const response = await client.messages.create({
+  model: "claude-sonnet-4-6",
+  max_tokens: 2048,
+  cache_control: { type: "ephemeral" }, // Auto-cache conversation turns
+  system: [
+    { type: "text", text: CHAT_SYSTEM_INSTRUCTIONS },
+    {
+      type: "text",
+      text: citationBlock,
+      cache_control: { type: "ephemeral" } // Explicit cache for citations
+    }
+  ],
+  messages: conversationHistory
+});
+```
+
+> Note: Maximum 4 cache breakpoints per request. Combined strategy uses 2, leaving headroom.
+
+### 6.4 Extended TTL for Citation Cache
+
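+If the gap between turns can exceed 5 minutes (common for document review sessions), use the 1-hour TTL. The default 5-minute TTL is refreshed on every cache hit, so total conversation length alone does not cause a miss: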
+
+```typescript
+cache_control: { type: "ephemeral", ttl: "1h" }
+// Cost: 2x base input price for cache write, still 0.1x for cache reads
+// Break-even vs. no caching: two or more cache reads within the hour (2 + 0.1n < 1 + n holds for n >= 2)
+```
+
+### 6.5 Monitoring Cache Performance
+
+```typescript
+// After each API call, log cache stats
+const { usage } = response;
+console.log({
+  cache_created: usage.cache_creation_input_tokens,
+  cache_read: usage.cache_read_input_tokens,
+  uncached_input: usage.input_tokens,
+  total_effective_input:
+    (usage.cache_read_input_tokens ?? 0) +
+    (usage.cache_creation_input_tokens ?? 0) +
+    usage.input_tokens
+});
+```
+
+---
+
+## 7. Design Alignment Assessment for citationChatRouter.js
+
+The proposed design (Express router, Messages API direct, streaming) maps cleanly to all official patterns:
+
+### 7.1 Aligned Design Decisions
+
+| Decision | Alignment |
+|----------|-----------|
+| Express router module (not Agent SDK) | Correct — chatbots and Q&A are explicitly Messages API use cases |
+| `client.messages.stream()` (not beta path) | Correct — stable API, matches current documentation |
+| SSE streaming to frontend | Correct — standard pattern, matches Anthropic's SSE event format |
+| Stateless per-request (caller manages history) | Correct — matches Messages API's stateless design |
+| Citation context injected as system prompt | Correct — matches the large-context caching cookbook pattern |
+| No tool execution needed | Correct — citations are pre-injected, no tool loop required |
+
+### 7.2 Potential Gaps to Address
+
+| Gap | Recommendation |
+|-----|----------------|
+| No prompt caching on citation block | Add `cache_control: { type: "ephemeral" }` to the citation system prompt block; this cuts citation-context input cost by roughly 79% over a 10-turn conversation (Section 6.1) |
+| Conversation history management | Define whether client sends full history each turn or server maintains session store. Full-history-from-client is simpler and stateless. |
+| SSE keepalive for long responses | Anthropic sends `ping` events; ensure the Express SSE transport does not time out on long model turns |
+| `message_stop` vs `done` event naming | Decide whether to forward Anthropic's `message_stop` event directly or wrap in a custom `done` event for frontend compatibility |
+| Error handling in SSE stream | Anthropic may send `error` events mid-stream (e.g., `overloaded_error`). The `.on("error", ...)` handler must write an error SSE event before closing the connection. |
+
+### 7.3 What Does Not Apply
+
+| Agent SDK Feature | Applicability |
+|-------------------|---------------|
+| `query()` / V1 agent loop | Not needed — no autonomous tool execution |
+| `unstable_v2_createSession()` | Not needed — adds subprocess overhead for no benefit in a chat router |
+| Hooks (`PreToolUse`, `PostToolUse`, `SubagentStart`) | Not needed — citation chat has no tools or subagents |
+| MCP servers | Not needed — citations are already retrieved before the chat turn |
+| `settingSources`, `permissionMode` | Not applicable — Messages API, not Agent SDK |
+| Session resume/fork | Not applicable — use stateless conversation history pattern |
+
+---
+
+## 8. Recommended Implementation Pattern
+
+Based on the research, the minimal correct implementation for a citation chat router in this codebase:
+
+```typescript
+// citationChatRouter.js — recommended structure
+import Anthropic from "@anthropic-ai/sdk";
+import express from "express";
+
+const router = express.Router();
+const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
+
+const CITATION_CHAT_SYSTEM = `You are a legal research assistant...
+Answer questions using only the provided citation context.`;
+
+router.post("/chat", async (req, res) => {
+  const { messages, citationContext } = req.body;
+  // citationContext: string of ~10K tokens from retrieved citations
+
+  // Set SSE headers
+  res.setHeader("Content-Type", "text/event-stream");
+  res.setHeader("Cache-Control", "no-cache");
+  res.setHeader("Connection", "keep-alive");
+
+  try {
+    const stream = client.messages.stream({
+      model: process.env.CHAT_MODEL ?? "claude-sonnet-4-6",
+      max_tokens: 2048,
+      system: [
+        { type: "text", text: CITATION_CHAT_SYSTEM },
+        {
+          type: "text",
+          text: citationContext,
+          cache_control: { type: "ephemeral" } // Cache the citation block
+        }
+      ],
+      messages // Full conversation history from client
+    });
+
+    stream.on("text", (text) => {
+      res.write(`data: ${JSON.stringify({ type: "text_delta", text })}\n\n`);
+    });
+
+    stream.on("finalMessage", (message) => {
+      res.write(`data: ${JSON.stringify({ type: "done", stop_reason: message.stop_reason })}\n\n`);
+      res.end();
+    });
+
+    stream.on("error", (err) => {
+      res.write(`data: ${JSON.stringify({ type: "error", message: err.message })}\n\n`);
+      res.end();
+    });
+
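+    res.on("close", () => { if (!res.writableEnded) stream.abort(); }); // res, not req: see note 1
+
+  } catch (err) {
+    if (!res.headersSent) {
+      res.status(500).json({ error: err.message });
+    }
+  }
+});
+
+export default router;
+```
+
+Key implementation notes:
+
+1. `res.on("close", ...)` aborts the Anthropic stream when the client disconnects mid-response, which prevents runaway token spend. Listen on `res` rather than `req`: since Node.js 16 the request's `close` event fires once the request has completed, not when the connection drops. The `writableEnded` check skips the abort after a normal `res.end()`.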
+2. The `cache_control` on `citationContext` block requires the context to exceed 2,048 tokens for Sonnet 4.6 to qualify for caching
+3. `messages` array is passed in full from the client each turn — no server-side session storage required
+4. The default `claude-sonnet-4-6` matches the existing orchestrator model; the `CHAT_MODEL` environment variable read in the handler lets the chat model be configured independently
+
+---
+
+## 9. References
+
+- [Agent SDK overview](https://platform.claude.com/docs/en/agent-sdk/overview) — official "Compare the Agent SDK to other Claude tools" section, March 2026
+- [TypeScript V2 interface preview](https://platform.claude.com/docs/en/agent-sdk/typescript-v2-preview) — unstable session API documentation, March 2026
+- [TypeScript SDK reference (V1)](https://platform.claude.com/docs/en/agent-sdk/typescript) — `query()` function, `Options` type, `SDKMessage` types, March 2026
+- [Streaming Messages](https://platform.claude.com/docs/en/docs/build-with-claude/streaming) — SSE event format, `content_block_delta` wire format, TypeScript stream helpers, March 2026
+- [Prompt caching](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching) — multi-turn caching strategy, workspace isolation update (Feb 5 2026), TTL options, pricing table, March 2026
+- [anthropic-cookbook on GitHub](https://github.com/anthropics/anthropic-cookbook) — RAG examples, Pinecone integration, embedding patterns (Python only), March 2026
+- [Anthropic Just Fixed the Biggest Hidden Cost in AI Agents (Medium, Feb 2026)](https://medium.com/ai-software-engineer/anthropic-just-fixed-the-biggest-hidden-cost-in-ai-agents-using-automatic-prompt-caching-9d47c95903c5) — automatic prompt caching announcement and workspace isolation change
+- [Prompt caching — Anthropic blog](https://www.anthropic.com/news/prompt-caching) — original announcement, up to 90% cost reduction, 85% latency reduction
+- [Streaming Messages API reference (docs.anthropic.com)](https://docs.anthropic.com/en/api/messages-streaming) — raw SSE protocol specification
diff --git a/super-legal-mcp-refactored/docs/pending-updates/execution-gke-migration.md b/super-legal-mcp-refactored/docs/pending-updates/execution-gke-migration.md
index ffae5ed56..e6b8d6e9e 100644
--- a/super-legal-mcp-refactored/docs/pending-updates/execution-gke-migration.md
+++ b/super-legal-mcp-refactored/docs/pending-updates/execution-gke-migration.md
@@ -1471,7 +1471,7 @@ ARMO's March 31, 2026 analysis ([armosec.io/blog/sandboxing-ai-agents-gke-worklo
 - `src/utils/circuitBreaker.js` — reusable for the new router client
 - `src/api-clients/BaseHybridClient.js` L48–57 (metrics), L377–425 (`parallelStrategy` pattern to copy for shadow mode)
 - `Dockerfile` — `node:22-slim`, has `python3` in APT deps, non-root `app` user
-- `package.json` — `@anthropic-ai/sdk` ^0.86.1, `@google/genai` ^1.45.0, `@google/generative-ai` ^0.21.0, `@google-cloud/secret-manager` ^6.1.1, `@anthropic-ai/claude-agent-sdk` 0.2.97
+- `package.json` — `@anthropic-ai/sdk` ^0.86.1, `@google/genai` ^1.45.0, `@google/generative-ai` ^0.21.0, `@google-cloud/secret-manager` ^6.1.1, `@anthropic-ai/claude-agent-sdk` 0.2.119
 
 ### External docs
 - [GKE Agent Sandbox](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox) — bootstrap & operation