diff --git a/super-legal-mcp-refactored/CHANGELOG.md b/super-legal-mcp-refactored/CHANGELOG.md index ac36a988a..77b7e3c11 100644 --- a/super-legal-mcp-refactored/CHANGELOG.md +++ b/super-legal-mcp-refactored/CHANGELOG.md @@ -4,6 +4,66 @@ All notable changes to the Super Legal MCP Server are documented in this file. ## [Unreleased] +### Operations — `STRUCTURED_OUTPUT_ENFORCEMENT` pre-flip readiness: alerts + playbook + cohort plan + feature-flag truth-doc registration + dashboard panels + skill doc updates (PR #140) + +Closes the operational surface for the bridge observability v2 work (PRs #135-#139). Zero source-code changes — documentation, YAML config, and Grafana JSON only. Unblocks the `STRUCTURED_OUTPUT_ENFORCEMENT=true` production flip by closing 6 distinct operator-readiness gaps under one coherent PR. + +**Six coordinated changes**: + +1. **Feature-flag truth doc registration** (`docs/feature-flags.md`): Documents that `STRUCTURED_OUTPUT_ENFORCEMENT` (PR #135) and `XLSX_RENDERER` (PR #100 era) were both added to production code but NEVER registered in the truth doc — accumulated documentation debt closed here. Bumps doc to v4.2 (41 flags total). Flag #42 entry includes an explicit 8-item pre-flip prerequisite checklist that MUST be ✅ before flipping to `true` in production. + +2. **`flags.env` prerequisite comment block** (`flags.env:18-30`): Warns operators reading the env file directly that flipping `STRUCTURED_OUTPUT_ENFORCEMENT=true` without the checklist is forbidden; cross-links to the truth doc + runbooks. + +3. 
**Six Prometheus alerts** (`prometheus/alerts.yml`): + - 4 alerts on `claude_code_bridge_envelope_outcome_total` (PR #138 counter): `CodeBridgeEnvelopeStdoutFallbackHigh`, `CodeBridgeEnvelopeNoneCritical`, `CodeBridgeTurn1SuccessLow`, `CodeBridgeUnknownCallerCategoryEmitting` + - 2 alerts on `claude_xlsx_render_turn1_envelope_success_total` (PR #136 counter): `XlsxRenderTurn1RegressionByTemplate`, `XlsxRenderTurn1RegressionByPhase` + - Each alert includes `runbook_url` annotation linking to the playbook's per-alert anchor + - Labels: `severity`, `team`, `component`, `flag` for routing + filtering + +4. **Envelope-decision on-call playbook** (`docs/runbooks/envelope-decision-debug-playbook.md`, NEW): 6-section drill-down: triage decision tree, per-alert response procedures with kubectl + PromQL + SQL + Cloud Logging queries, cross-surface quick reference, rollback procedure (~5min target), escalation contacts (placeholder for team), seed-empty known-false-positives table. + +5. **Cohort rollout plan** (`docs/runbooks/structured-output-enforcement-rollout.md`, NEW): Pre-flip 8-item checklist + rollback drill (rehearse in staging) + 3 stages (single-client canary 24h → 25% cohort 48h → 100% fleet 7-day) with per-stage pass/fail criteria + rollback triggers + per-stage rollback procedures + entry/exit log template + aborted-rollout recovery procedure + roles & responsibilities. + +6. **Grafana dashboard panels** (`grafana/claude-sdk-dashboard.json`): 5 new panels visualizing both counters. 3 panels for PR #138 counter (caller-category mix, envelope-source distribution, turn-1 success gauge with threshold band at 0.70). 2 panels for PR #136 counter (per-template turn-1 success, per-phase turn-1 success heatmap). Dashboard goes 6 → 11 panels. 
+ +**Plus 4 minor skill doc enhancements** identified by post-PR-138 audit (bundled here since they're operator-facing documentation): +- `.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — new V7 check probing `/metrics` for the new counter registration +- `.claude/skills/infrastructure-health/references/postgresql.md` — note `code_executions.envelope_source` + `access_log.event_data` JSONB columns + indexes +- `.claude/skills/client-audit-export/SKILL.md` — note regulator handoff bundles now include both new columns via `SELECT *` +- `.claude/skills/client-offboarding/SKILL.md` — note Phase 2 archive captures `event_data` JSONB inline as quoted JSON in CSV + +**Verification**: +- L1 (doc consistency): `STRUCTURED_OUTPUT_ENFORCEMENT` + `XLSX_RENDERER` now appear in `docs/feature-flags.md` Quick Reference table + detailed entry sections +- L2 (PromQL syntax): all 6 alerts pass `promtool check rules prometheus/alerts.yml` syntax validation +- L3 (Grafana JSON): dashboard JSON parses cleanly via `jq` — verified 11 panels enumerable, all 5 new panels include datasource + targets + fieldConfig +- L4 (staging soak ≥ 3 days at flag=true) — deferred to operations team execution; cannot be performed in this PR + +**Files modified (11 total across PR #140 + companion PR #141)**: + +In PR #140 commit (7 files in `super-legal-mcp-refactored/` worktree): +- `docs/feature-flags.md` — flag #41 (XLSX_RENDERER) + flag #42 (STRUCTURED_OUTPUT_ENFORCEMENT) entries; header bump v4.1 → v4.2; Dependency Tree section updated to include both new flags +- `flags.env` — prerequisite comment block above line 22 +- `prometheus/alerts.yml` — 6 new alert rules +- `docs/runbooks/envelope-decision-debug-playbook.md` (NEW) +- `docs/runbooks/structured-output-enforcement-rollout.md` (NEW) +- `grafana/claude-sdk-dashboard.json` — 5 new panels +- `CHANGELOG.md` (this entry) + +In companion PR #141 commit (4 files in `.claude/skills/`, outside this worktree boundary): +- 
`.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — V7 check +- `.claude/skills/infrastructure-health/references/postgresql.md` — 2 table rows enriched +- `.claude/skills/client-audit-export/SKILL.md` — 2 table rows enriched +- `.claude/skills/client-offboarding/SKILL.md` — Step 6.5 paragraph enriched + +PR #141 exists as a separate commit because the 4 skill docs live at the project root (`/Users/ej/Super-Legal/.claude/skills/`) which is outside the xlsx-renderer worktree's boundary; same logical scope, separated only by git-worktree mechanics. + +**Honest limits acknowledged**: +- L4 staging soak (≥ 3 days at flag=true) cannot be performed in this PR — requires operations execution per the runbook +- §5 of the debug playbook leaves escalation contacts as placeholder TBD (team-specific info) +- §6 of the debug playbook seeds the known-false-positives table empty (populated by on-call as real incidents occur) + +**Plan**: `/Users/ej/.claude/plans/twinkling-glittering-comet.md` + ### Added — Bridge observability v2: envelope_source DB persistence + generic Prometheus counter + access_log JSONB enrichment + structured envelope-decision logging (PR #138) Closes 4 verified observability/auditability gaps identified by background Explore agents on 2026-05-16 after PR #137 merged. Pre-PR-138, `envelope_source` (set on every bridge return at `selectEnvelopeWithFallback`) was visible only via the multi-turn xlsx orchestrator's Prometheus counter — single-turn xlsx, MCP gateway, and Agent SDK subagent callers produced envelope outcomes that never reached any dashboard or queryable schema. This PR ships a single coherent change (~220 LOC + 4 migration files) that closes all four under one operational umbrella so `STRUCTURED_OUTPUT_ENFORCEMENT=true` can be flipped for full-fleet production rollout with confidence. 
diff --git a/super-legal-mcp-refactored/docs/feature-flags.md b/super-legal-mcp-refactored/docs/feature-flags.md index d85f1c194..7089aa05a 100644 --- a/super-legal-mcp-refactored/docs/feature-flags.md +++ b/super-legal-mcp-refactored/docs/feature-flags.md @@ -2,10 +2,10 @@ ## Super-Legal MCP Server — Single Source of Truth -**Version:** 4.1 -**Date:** 2026-05-10 +**Version:** 4.2 +**Date:** 2026-05-16 **Source:** `src/config/featureFlags.js` -**Total flags:** 39 (33 boolean + 4 numeric/string + 2 dead code; +4 since v4.0 — `EXA_ADDITIONAL_QUERIES`, `EXA_ADDITIONAL_QUERIES_AB_SAMPLE`, `FMP_ENABLED`, `ALLOW_FULL_TRANSCRIPT`) +**Total flags:** 41 (35 boolean + 4 numeric/string + 2 dead code; +2 since v4.1 — `XLSX_RENDERER` [PR #100 era, never registered], `STRUCTURED_OUTPUT_ENFORCEMENT` [PR #135 Avenue A v2, never registered]) All feature flags are environment-variable-controlled via the `envBool()` helper. Set `FLAG_NAME=true` or `FLAG_NAME=false` in your environment or `.env` file. No code changes required for any toggle. 
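As a rough illustration of the toggle semantics described above, a helper in the spirit of `envBool()` might look like the sketch below. This is an assumption-laden example, not the code in `src/config/featureFlags.js`: the signature, the injectable `env` argument, and the case-insensitive match on `'true'` are all hypothetical.

```javascript
// Hypothetical sketch of an envBool-style helper. The real helper lives in
// src/config/featureFlags.js and may parse values differently.
function envBool(name, defaultValue = false, env = process.env) {
  const raw = env[name];
  // Unset or empty variables fall back to the documented default.
  if (raw === undefined || raw === '') return defaultValue;
  // Assumption: only the literal string "true" (any case) enables a flag.
  return String(raw).trim().toLowerCase() === 'true';
}

// Example: read a flag with the default documented in the Quick Reference table.
const structuredOutputEnforcement = envBool('STRUCTURED_OUTPUT_ENFORCEMENT', false);
```

Under these assumptions, an unset variable and `FLAG_NAME=false` behave identically for a flag whose default is `false`, which is consistent with the "no code changes required" claim above.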
@@ -57,6 +57,8 @@ All feature flags are environment-variable-controlled via the `envBool()` helper | 38 | [`ALLOW_FULL_TRANSCRIPT`](#38-allow_full_transcript) | `false` | Active | Capabilities | | 39 | [`EXA_ADDITIONAL_QUERIES`](#39-exa_additional_queries) | `false` | Active (v7.1.0) | Search | | 40 | [`EXA_ADDITIONAL_QUERIES_AB_SAMPLE`](#40-exa_additional_queries_ab_sample) | `0.0` (numeric) | Active (v7.6.0) | Search — measurement | +| 41 | [`XLSX_RENDERER`](#41-xlsx_renderer) | `false` | Active — staging | Capabilities (workbook deliverable) | +| 42 | [`STRUCTURED_OUTPUT_ENFORCEMENT`](#42-structured_output_enforcement) | `false` | Active — pre-flip readiness | API / Bridge envelope enforcement | --- @@ -77,7 +79,9 @@ USE_AGENT_SDK ────────> gates Agent SDK multi-turn path │ │ └── CODE_EXECUTION_BRIDGE ──> code-execution domain │ ├── CODE_EXECUTION_BRIDGE ──> agent prompt content (5 agents) │ │ ├── FILES_API_CHART_EXTRACTION ──> file_id download from containers - │ │ └── CHART_PERSISTENCE ──> charts to disk + markdown embedding + │ │ ├── CHART_PERSISTENCE ──> charts to disk + markdown embedding + │ │ └── STRUCTURED_OUTPUT_ENFORCEMENT ──> API-level envelope JSON-schema enforcement (PR #135/Avenue A v2; flag #42) + │ │ └── (no effect when CODE_EXECUTION_BRIDGE=false — bridge never called) │ ├── DOCUMENT_PROCESSING ──> P0 pre-wave phase │ └── CITATION_WEBSEARCH_VERIFICATION ──> G5 phase │ └── CITATION_DEEP_VERIFICATION ──> G5 model + depth @@ -105,6 +109,12 @@ ACCESS_AUDIT ──────> (independent, but useful only with AUTH_ENABLED SESSION_RECONCILIATION ──> (requires HOOK_DB_PERSISTENCE) hourly auto-rebuild loop for partial sessions (KG + artifacts) +XLSX_RENDERER ──────> (flag #41) session-grain xlsx workbook deliverable post-manifest.finalize + ├── Requires: CODE_EXECUTION_BRIDGE=true (bridge invocations per phase) + ├── Recommends: HOOK_DB_PERSISTENCE=true (xlsx_renders table for state machine + audit) + └── Interacts with: STRUCTURED_OUTPUT_ENFORCEMENT (when both 
ON, xlsx renders use + API-level envelope enforcement; independent flips supported) + STRUCTURED_OUTPUTS ──> (independent) JSON schema on API requests SKILLS_ENABLED ──────> (independent) custom skills + beta headers EXTENDED_CONTEXT ────> (independent) 1M context for Messages API path @@ -1473,6 +1483,133 @@ Captures every SSE event flowing through `ctx.send()` in `streamContext.js` into --- +### 41. XLSX_RENDERER + +| Attribute | Value | +|-----------|-------| +| **Default** | `false` | +| **Type** | boolean | +| **Category** | Capabilities (workbook deliverable) | +| **Purpose** | Enables session-grain XLSX workbook generation as a post-processor after `manifest.finalize` | +| **GitHub** | [#100](https://github.com/Number531/Legal-API/issues/100) — multi-turn orchestrator + phase isolation | +| **Documentation debt** | Flag has existed in production code since v4.5 era but was never registered in this doc until v4.2 (2026-05-16) | + +**Behavior ON:** +1. After `manifest.finalize()`, the xlsx renderer post-processor runs against the session +2. Routes to single-turn (`session-models` template) or multi-turn orchestrator (`full-deal-workbook`, `lbo-focused`, `valuation-only`, `tax-memo-workbook`) based on template registry +3. Each render goes through `gather → composePhaseSpec → runPythonAnalysis → selectEnvelopeWithFallback → reconcile → persist` +4. `xlsx_renders` table populated with audit_status, sheet_count, warnings_count, node_audit_ran (generated columns from migration 018) +5. Per-phase OTel spans + per-render Prometheus metrics +6. 202-style async polling endpoint exposed via `/api/db/sessions/:sessionKey/xlsx-render` + +**Behavior OFF:** +1. `manifest.finalize()` returns without invoking the renderer +2. No xlsx_renders rows written +3. 
No code-execution bridge calls from xlsx path (MCP-direct calls via `run_python_analysis` still work) + +**Templates registered** (in `src/config/xlsxTemplates/index.js`): +| Template | Type | Phases | Estimated wall time | +|---|---|---|---| +| `session-models` | single-turn | 1 | ~180s | +| `full-deal-workbook` | multi-turn | 5 (phase1-5) | ~530s | +| `lbo-focused` | multi-turn | 4 | ~400s | +| `valuation-only` | multi-turn | 4 | ~350s | +| `tax-memo-workbook` | multi-turn | 4 | ~400s | + +**Files:** +- `src/utils/xlsxRenderer/index.js` — single-turn entry point (line 208 invokes bridge) +- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` — multi-turn phase dispatcher (line 105 invokes bridge per phase) +- `src/config/xlsxTemplates/` — 5 template definitions +- `src/db/postgres.js` — `xlsx_renders` table + 4 generated columns +- `migrations/017_xlsx-renders.{up,down}.sql`, `migrations/018_xlsx-renders-generated-columns.{up,down}.sql` + +**Cost:** Each render = N × bridge calls (N = phase count). Multi-turn full-deal-workbook ≈ 5 bridge calls = 5 × Anthropic API + Files API charges. Persistence cost negligible. + +**Rollback:** Flip `XLSX_RENDERER=false` in prod env. In-flight renders complete (already in async pipeline); new renders skip the post-processor. No data loss. + +**Interaction with flag #42**: when `STRUCTURED_OUTPUT_ENFORCEMENT=true` is also flipped, all xlsx renders use API-level envelope enforcement on turn 1. Independent flags but commonly flipped together (PR #135 + this flag share the production-readiness boundary). + +--- + +### 42. 
STRUCTURED_OUTPUT_ENFORCEMENT + +| Attribute | Value | +|-----------|-------| +| **Default** | `false` | +| **Type** | boolean | +| **Category** | API / Bridge envelope enforcement | +| **Purpose** | Enables Anthropic `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement on the code-execution bridge's text-block output | +| **GitHub** | PR #135 (Avenue A v2 — structured output), PR #138 (observability v2), PR #139 (real-bug follow-up), PR #140 (operator-runbook readiness — this doc registration) | +| **Documentation debt** | Flag has existed in production code since PR #135 (2026-05-12) but was never registered in this doc until v4.2 (2026-05-16) | +| **Prerequisites for production flip** | **ALL ITEMS MUST BE ✅ BEFORE FLIPPING.** See **Pre-flip checklist** below. | + +**Behavior ON:** +1. Bridge passes `output_config: { format: { type: 'json_schema', schema: ENVELOPE_SCHEMA_XLSX or ENVELOPE_SCHEMA_GENERAL } }` to the Anthropic Messages API on both initial and `pause_turn` continuation API calls +2. `extractResults()` reads `response.parsed_output` (SDK auto-parsed envelope) → falls back to final text block JSON parse → falls back to existing stdout extraction +3. `selectEnvelopeWithFallback()` picks the winning extraction path; merges audit-from-text with b64-from-stdout for xlsx callers (Option A schema is audit-only to avoid b64-in-text max_tokens cliff) +4. Per-call `envelope_source` set on `finalResult` ∈ {`parsed_output`, `text`, `stdout`, `none`, `merged:parsed_output+stdout`, `merged:text+stdout`} +5. Emits to Prometheus counter `claude_code_bridge_envelope_outcome_total` + Cloud Logging event `envelope_decision` + OTel span attribute + persisted to `code_executions.envelope_source` DB column + +**Behavior OFF:** +1. No `output_config` parameter passed to API +2. Bridge uses prompt-level envelope instructions + corrective-retry path (existing pre-PR-135 behavior) +3. `extractResults()` reads stdout only +4. 
`envelope_source` set to `'stdout'` or `'none'` depending on whether b64 envelope is parseable from stdout + +**Target efficacy** (per PR #135 plan — these are PLAN TARGETS, not yet validated by measured fleet data): +- Pre-flag-flip baseline (PR #134 L4 logs, MEASURED): ~80% of phases need turn-2 corrective retry +- Post-flag-flip TARGETS (PR #135 plan): + - ≥95% turn-1 envelope success + - ~50% wall-time reduction on multi-turn renders + - ~50% LLM token cost reduction per successful render +- Validation status: PR #135 L4 paired comparison (N=2) confirmed Option A correctness (5/5 phases delivered, audit_status=PASS) but DID NOT statistically validate the ~50% figures — see pre-flip prerequisite #6 ("L4 paired comparison data N≥10 collected showing turn-1 success rate ≥ baseline"). Statistical confidence for the ~50% claims requires the staging soak data per prerequisite #5. +- **Actual measured efficacy will be filled in here after Stage 3 7-day observation completes.** + +### Pre-flip checklist (BLOCKING) + +| # | Prerequisite | Status | Owner | Verified by | +|---|---|---|---|---| +| 1 | Prometheus alerts wired on `claude_code_bridge_envelope_outcome_total` + `claude_xlsx_render_turn1_envelope_success_total` (6 alerts total) | ⬜ | Operations | Merged via PR #140 (this PR) — `prometheus/alerts.yml` | +| 2 | Envelope-decision debug playbook published | ⬜ | Operations | `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140) | +| 3 | Cohort rollout plan + rollback drill published | ⬜ | Operations | `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140) | +| 4 | Grafana dashboard panels live | ⬜ | Operations | `grafana/claude-sdk-dashboard.json` (PR #140) | +| 5 | Staging soak ≥ 3 days at `STRUCTURED_OUTPUT_ENFORCEMENT=true` with all 6 alerts silent | ⬜ | Operations | Staging Prometheus query log | +| 6 | L4 paired comparison data N≥10 collected showing turn-1 success rate ≥ baseline | ⬜ | Operations | Staging Prometheus query log | +| 7 | PR 
#138 + PR #139 deployed to production for ≥ 7 days at flag=false (baselines established) | ⬜ | Deployment | Production deploy logs + counter baseline values | +| 8 | On-call rotation aware (announcement posted, runbook URLs in channel topic) | ⬜ | Operations | Team channel announcement timestamp | + +Until all 8 items are ✅, **DO NOT flip `STRUCTURED_OUTPUT_ENFORCEMENT=true` in production.** Staging is fair game for soak testing (item #5). + +### Rollback (one-line) + +```bash +# In production CONTAINER_ENV: +STRUCTURED_OUTPUT_ENFORCEMENT=false +# Deploy → 30-min observation. envelope_source="text" rate should drop to 0 within 5 min. +``` + +See `docs/runbooks/envelope-decision-debug-playbook.md` §4 "Rollback procedure" for full procedure including verification queries. + +**Files:** +- `src/tools/codeExecutionBridge.js` — schemas, output_config injection, selectEnvelopeWithFallback, sdkLogger emission, recordCodeBridgeEnvelope counter, OTel span attrs +- `src/utils/sdkMetrics.js` — `codeBridgeEnvelopeOutcome` Counter (PR #138) + `xlsxRenderTurn1EnvelopeSuccess` Counter (PR #136) +- `src/utils/hookDBBridge.js` — INSERT extended to persist envelope_source as code_executions.$25 column +- `src/middleware/accessAudit.js` — getEventData callback for opportunistic event_data enrichment +- `src/server/adminRouter.js` — code_execution_json endpoint enriches access_log.event_data with `{execution_id, envelope_source, success}` +- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` + `xlsxRenderer/index.js` + `tools/toolImplementations.js` — caller_category propagation +- `migrations/019_code-executions-envelope-source.{up,down}.sql`, `migrations/020_access-log-event-data.{up,down}.sql` +- `prometheus/alerts.yml` — 6 alerts (PR #140) +- `grafana/claude-sdk-dashboard.json` — 5 panels (PR #140) +- `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140), `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140) + +**Cost impact:** +- Flag OFF (MEASURED, current 
production state): zero direct cost; ~80% of bridge calls incur turn-2 corrective retry cost (per PR #134 L4 logs) +- Flag ON (TARGET, not yet validated): zero direct API cost increase (`output_config` is an existing Anthropic API parameter, no surcharge); plan target is ~50% reduction in turn-2 retry cost ≈ ~50% reduction in LLM token spend per successful render. Actual measured impact will be quantified during cohort rollout per `docs/runbooks/structured-output-enforcement-rollout.md` Stage 3 observation window. + +**Interaction with flag #41**: independent flag, but commonly flipped together post-staging. `XLSX_RENDERER=true` alone enables xlsx renders with the existing prompt-level + corrective-retry path; adding `STRUCTURED_OUTPUT_ENFORCEMENT=true` adds API-level enforcement on top. + +--- + ## Dead Code Flags These are exported from `featureFlags.js` but never consumed at runtime: diff --git a/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md new file mode 100644 index 000000000..248aeeefc --- /dev/null +++ b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md @@ -0,0 +1,429 @@ +# Envelope-Decision Debug Playbook — On-Call Reference + +**Version**: 1.0 (PR #140, 2026-05-16) +**Audience**: On-call engineers + ops + platform team +**Triggered by**: any of 6 alerts in `prometheus/alerts.yml` (anchors below) +**Related**: [feature-flags.md §42](../feature-flags.md#42-structured_output_enforcement), [structured-output-enforcement-rollout.md](structured-output-enforcement-rollout.md), [anthropic-sdk-best-practices-research.md §13](../code-execution-enhancements/anthropic-sdk-best-practices-research.md#§13) + +--- + +## §1 — Triage decision tree + +``` +Alert fired? 
+│ +├── Critical severity ("CodeBridgeEnvelopeNoneCritical") +│ └── §2 alert-2 procedure → consider ROLLBACK per §4 +│ +├── Warning + flag=STRUCTURED_OUTPUT_ENFORCEMENT +│ ├── 3+ flag-tagged alerts firing concurrently → ROLLBACK per §4 +│ └── Single flag-tagged alert → §2 per-alert procedure (no immediate rollback) +│ +├── Warning, no flag tag (e.g., CodeBridgeUnknownCallerCategoryEmitting) +│ └── §2 per-alert procedure → file engineering ticket for next sprint +│ +└── No alert but envelope_source distribution looks suspicious + └── §3 cross-surface query patterns for self-diagnosis +``` + +--- + +## §2 — Per-alert response procedures + +### Alert 1 — `CodeBridgeEnvelopeStdoutFallbackHigh` + +**Anchor**: `alert-1-stdout-fallback-high` +**Severity**: warning +**TTL**: 30m +**Trigger**: `envelope_source=stdout` rate > 80% over 1h **when `STRUCTURED_OUTPUT_ENFORCEMENT=on`** + +**What it means**: Text-channel enforcement is silently failing. The bridge is falling through to the legacy stdout path despite the flag being on. This defeats the whole point of Avenue A v2. + +**Likely causes** (in order of probability): +1. Anthropic API rejecting the `output_config` schema and silently degrading to plain Messages API (no JSON enforcement) +2. Prompt regression causing model to ignore the schema and emit envelope to stdout instead of text +3. Feature flag mis-read at request time (e.g., `featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT` evaluates `false` despite env var being `true`) +4. Schema validation rejecting valid envelopes (e.g., new field added to envelope without schema update) + +**Investigation steps**: + +```bash +# 1. Confirm flag is actually on in the running container (replace <pod-name> with a running bridge pod) +kubectl exec -n super-legal <pod-name> -- env | grep STRUCTURED_OUTPUT_ENFORCEMENT +# Expect: STRUCTURED_OUTPUT_ENFORCEMENT=true + +# 2. 
Cloud Logging — find the smoking gun (text=missing, stdout=present, flag=on) +gcloud logging read 'jsonPayload.event="envelope_decision" + AND jsonPayload.structured_output_enabled=true + AND jsonPayload.envelope_source="stdout"' --limit=20 --format=json +# Each result shows xlsx_mode + text_envelope_present (should be FALSE) + +# stdout_envelope_present (should be TRUE) + +# 3. SQL — distribution by hour for trend +psql -c " +SELECT date_trunc('hour', created_at) AS hr, + envelope_source, + COUNT(*) AS n +FROM code_executions +WHERE created_at > NOW() - INTERVAL '24h' +GROUP BY 1, 2 +ORDER BY 1 DESC, 2; +" +# Expected post-flag-flip: envelope_source='text' OR 'parsed_output' should dominate +# Anomaly: 'stdout' suddenly dominant + +# 4. Check Anthropic API error logs for 400 responses +gcloud logging read 'severity>=WARNING jsonPayload.message=~"400.*output_config|schema"' --limit=10 +``` + +**Actions**: +- If cause #1 (API rejection): file Anthropic support ticket with request_id; consider rolling back flag temporarily +- If cause #2 (prompt regression): revert recent bridge prompt PRs; document in §6 +- If cause #3 (flag mis-read): check container env injection in deploy config +- If cause #4 (schema rejection): inspect `ENVELOPE_SCHEMA_XLSX` / `ENVELOPE_SCHEMA_GENERAL` constants for recent changes + +--- + +### Alert 2 — `CodeBridgeEnvelopeNoneCritical` + +**Anchor**: `alert-2-envelope-none-critical` +**Severity**: **critical** +**TTL**: 15m +**Trigger**: `envelope_source=none` rate > 5% over 1h + +**What it means**: Bridge is producing no extractable envelope from EITHER text OR stdout. Downstream consumers (xlsx renderer, MCP gateway, agents) receive null data. **Silent data loss in progress.** + +**Likely causes** (in order of probability): +1. Anthropic sandbox failure (containers crashing, API returning empty responses) +2. Model refusing every request (`stop_reason='refusal'` dominates) +3. 
Container reuse pattern hitting state limit (PR #125 era issue, may reoccur with PTC) +4. Network/timeout issue cutting responses before envelope emission + +**Investigation steps**: + +```bash +# 1. Per-stop-reason distribution +psql -c " +SELECT stop_reason, COUNT(*) AS n, + ROUND(100.0*COUNT(*)/SUM(COUNT(*)) OVER (), 2) AS pct +FROM code_executions +WHERE created_at > NOW() - INTERVAL '1h' +GROUP BY 1 ORDER BY 2 DESC; +" +# Expected: end_turn dominates (>90%) +# Anomaly: refusal > 5%, or stop_reason=NULL > 5%, or max_tokens > 5% + +# 2. Per-container failure clustering +psql -c " +SELECT container_id, COUNT(*) AS n, + COUNT(*) FILTER (WHERE envelope_source = 'none') AS failures +FROM code_executions +WHERE created_at > NOW() - INTERVAL '1h' +GROUP BY 1 HAVING COUNT(*) > 5 ORDER BY failures DESC LIMIT 10; +" +# Anomaly: a few container_ids account for most failures (state leak) + +# 3. Cloud Logging — bridge error traces +gcloud logging read 'jsonPayload.event=~"code_execution_failure|envelope_decision" + jsonPayload.envelope_source="none"' --limit=20 --format=json +``` + +**Actions**: +- **If 3+ envelope alerts firing concurrently OR this alert sustained > 15m → ROLLBACK per §4** (cohort plan rollback trigger) +- If sandbox failure: file Anthropic support; consider temporary `STRUCTURED_OUTPUT_ENFORCEMENT=false` while investigating +- If refusal spike: inspect recent prompt changes for refusal-triggering content +- If container state leak: confirm PR #125 fix is deployed (`container_id` not reused across requests) + +--- + +### Alert 3 — `CodeBridgeTurn1SuccessLow` + +**Anchor**: `alert-3-turn1-success-low` +**Severity**: warning +**TTL**: 24h +**Trigger**: turn-1 success rate < 70% for any caller_category over 24h + +**What it means**: Specific caller path is failing on turn 1 too often. Forces turn-2 corrective retry, doubling wall time + token cost. + +**Likely causes**: +1. 
Schema rejection for that caller's data shape (e.g., new envelope field not in schema) +2. Prompt-level instructions not being followed by model for that caller's task style +3. Per-caller-specific data complexity (large prompts truncated mid-envelope) + +**Investigation steps**: + +```bash +# 1. Per-caller-category baseline comparison +# (Compare against pre-flag-flip baseline established per §42 prerequisite #7) +# PromQL: +sum by (caller_category, turn_outcome) ( + rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[24h]) +) + +# 2. SQL — find specific failed phases +psql -c " +SELECT id, agent_type, envelope_source, turn_count, stop_reason, created_at +FROM code_executions +WHERE created_at > NOW() - INTERVAL '24h' + AND turn_count > 1 +ORDER BY created_at DESC LIMIT 50; +" + +# 3. Examine recent prompts/data for the affected caller (last hour; --freshness replaces a relative timestamp, which the logging filter syntax does not accept) +gcloud logging read 'jsonPayload.event="envelope_decision" + AND jsonPayload.text_envelope_present=false' --freshness=1h --limit=20 +``` + +**Actions**: +- If a specific caller_category dominates: investigate that caller's spec composition +- If across-the-board: investigate model regression (Sonnet 4.6 update? prompt cache invalidation?) +- File ticket — not immediate rollback unless combined with other alerts (see §1) + +--- + +### Alert 4 — `CodeBridgeUnknownCallerCategoryEmitting` + +**Anchor**: `alert-4-unknown-caller` +**Severity**: warning (data hygiene) +**TTL**: 15m +**Trigger**: `caller_category=unknown` series emitting at all + +**What it means**: A new bridge invocation site was added without passing `callerCategory`. This defeats per-caller observability. 
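The failure mode above can be sketched as a label-normalization step: any call site that omits `callerCategory` (or passes an unrecognized value) ends up under the `unknown` label, which is the series this alert watches. The snippet below is a minimal illustration under stated assumptions. The category names, `normalizeCallerCategory`, and `recordOutcome` are hypothetical and do not come from `codeExecutionBridge.js` or `sdkMetrics.js`.

```javascript
// Hypothetical category names for illustration only; the real set is
// whatever the bridge and the counter help text in sdkMetrics.js define.
const KNOWN_CALLER_CATEGORIES = new Set([
  'xlsx_multi_turn',
  'xlsx_single_turn',
  'mcp_gateway',
  'agent_sdk_subagent',
]);

// Map a missing or unrecognized category to the 'unknown' label.
function normalizeCallerCategory(callerCategory) {
  return KNOWN_CALLER_CATEGORIES.has(callerCategory) ? callerCategory : 'unknown';
}

// Tally outcomes per label in a plain Map (a stand-in for a Prometheus counter).
function recordOutcome(counts, callerCategory) {
  const label = normalizeCallerCategory(callerCategory);
  counts.set(label, (counts.get(label) || 0) + 1);
  return label;
}

// A call site that forgets callerCategory lands in the 'unknown' bucket.
const counts = new Map();
recordOutcome(counts, 'mcp_gateway');
recordOutcome(counts, undefined);
```

In this sketch a single forgotten argument is enough to create the `unknown` series, which is why the alert fires on any emission rather than on a rate threshold.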
+ +**Investigation steps**: + +```bash +# Grep for runPythonAnalysis call sites that DON'T pass callerCategory +grep -rn "runPythonAnalysis(" src/ | grep -v "callerCategory" +# Expected: no output. The 3 known sites (multiTurnOrchestrator:97, xlsxRenderer/index:209, toolImplementations:962) pass callerCategory and are filtered out by grep -v, provided callerCategory is on the same line as the call +# Anomaly: any call site appears in the results +``` + +**Actions**: +- Identify the new call site +- Patch to add `callerCategory: '<category>'` (must match feature-flags.md §42 enum) +- If a new caller category is needed (not one of the existing 5), update both the counter help text in `sdkMetrics.js` AND the feature-flags.md enum docs + +--- + +### Alert 5 — `XlsxRenderTurn1RegressionByTemplate` + +**Anchor**: `alert-5-template-regression` +**Severity**: warning +**TTL**: 24h +**Trigger**: per-template turn-1 success rate < 70% over 24h + +**What it means**: A specific xlsx template is regressing. Compare to pre-flag-flip baseline. + +**Investigation steps**: + +```promql +# 1. PromQL — compare on vs off for the affected template (replace <template_id> with the template named in the alert) +sum by (structured_output) ( + rate(claude_xlsx_render_turn1_envelope_success_total{ + template_id="<template_id>", turn_outcome="first_turn" + }[24h]) +) / sum by (structured_output) ( + rate(claude_xlsx_render_turn1_envelope_success_total{ + template_id="<template_id>" + }[24h]) +) +# If structured_output="on" rate < structured_output="off" rate: structured output enforcement is the regression cause for this template +``` + +```sql +-- 2. SQL — recent failed renders for the template (replace <template_id> as above) +SELECT id, render_status, audit_status, sheet_count, warnings_count, created_at +FROM xlsx_renders +WHERE template_id = '<template_id>' + AND created_at > NOW() - INTERVAL '24h' + AND (render_status = 'failed' OR audit_status = 'FAIL') +ORDER BY created_at DESC LIMIT 20; +``` + +**Actions**: +- Inspect recent template changes (`src/config/xlsxTemplates/