From 5b9abb0465bb3ae44ffc4e123724ba0156751266 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Sat, 16 May 2026 01:34:57 -0400 Subject: [PATCH 1/2] =?UTF-8?q?ops(runbook):=20STRUCTURED=5FOUTPUT=5FENFOR?= =?UTF-8?q?CEMENT=20pre-flip=20readiness=20=E2=80=94=20alerts=20+=20playbo?= =?UTF-8?q?ok=20+=20cohort=20plan=20+=20truth-doc=20registration=20+=20das?= =?UTF-8?q?hboard=20(PR=20#140)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the operational surface for bridge observability v2 (PRs #135-#139). Zero source-code changes — documentation, YAML config, and Grafana JSON only. Unblocks the STRUCTURED_OUTPUT_ENFORCEMENT=true production flip by closing 6 distinct operator-readiness gaps under one coherent PR. GAPS CLOSED: 1. STRUCTURED_OUTPUT_ENFORCEMENT was NOT in docs/feature-flags.md (truth doc). The flag was added in PR #135 (2026-05-12) but never registered. Doc was at v4.1 with 40 flags catalogued. Bumps to v4.2 (41 flags). Also discovered + registered XLSX_RENDERER which was similarly missing (PR #100 era debt). 2. No Prometheus alerts on either envelope counter. PR #136 added claude_xlsx_render_turn1_envelope_success_total; PR #138 added claude_code_bridge_envelope_outcome_total. Neither had alert wiring. Without alerts, regression at flag-flip is detectable only via user reports. 3. No envelope-decision debug playbook. Operators had no documented decision tree for "what do I do when the envelope counter trips?" 4. No cohort rollout plan or rollback drill for the flag flip itself. 5. No Grafana dashboard panels for either counter. 6. (Skill doc enhancements committed separately to main — see follow-up commit; they live in .claude/skills/ outside this worktree's boundary.) SIX COORDINATED CHANGES: 1. 
Feature-flag truth doc (docs/feature-flags.md): - Bumps version 4.1 → 4.2; total flags 39 → 41 - Adds flag #41 XLSX_RENDERER (PR #100 era debt) - Adds flag #42 STRUCTURED_OUTPUT_ENFORCEMENT with 8-item pre-flip checklist - Both flag entries follow the established per-flag template (Default, Type, Category, Purpose, GitHub, Behavior ON/OFF, Files, Cost, Rollback) - Updates Quick Reference table 2. flags.env prerequisite comment block (line 18-30): - Warns operators reading the env file directly that flipping STRUCTURED_OUTPUT_ENFORCEMENT=true without checklist is forbidden - Cross-links to truth doc §42 + both new runbook docs 3. Six Prometheus alerts (prometheus/alerts.yml lines 200-296): PR #138 counter (claude_code_bridge_envelope_outcome_total): - CodeBridgeEnvelopeStdoutFallbackHigh — text-channel enforcement failing (>80% over 30m) - CodeBridgeEnvelopeNoneCritical — extraction failing entirely (>5% over 15m, ROLLBACK trigger) - CodeBridgeTurn1SuccessLow — per-caller turn-1 regression (<70% over 24h) - CodeBridgeUnknownCallerCategoryEmitting — data hygiene (any unknown caller emit) PR #136 counter (claude_xlsx_render_turn1_envelope_success_total): - XlsxRenderTurn1RegressionByTemplate — per-template regression (<70% over 24h) - XlsxRenderTurn1RegressionByPhase — per-phase regression (<65% over 24h) Each alert includes runbook_url annotation linking to playbook anchor; labels (severity, team, component, flag) enable routing + filtering. 4. 
Envelope-decision on-call playbook (docs/runbooks/envelope-decision-debug-playbook.md, NEW): - §1 triage decision tree - §2 per-alert response procedures (6 sections) with kubectl + PromQL + SQL + Cloud Logging queries for each alert - §3 cross-surface quick reference (PromQL, PostgreSQL, Cloud Logging, Cloud Trace) — reformatted from anthropic-sdk-best-practices §13 for on-call - §4 rollback procedure (target <5 min) - §5 escalation contacts (placeholder for team to fill in) - §6 known false-positive patterns (seeded empty) 5. Cohort rollout plan (docs/runbooks/structured-output-enforcement-rollout.md, NEW): - §1 pre-flip 8-item checklist with explicit owner column - §2 rollback drill — rehearse in staging BEFORE Stage 1 - §3 Stage 1 — single-client canary (24h dwell, pass/fail criteria) - §4 Stage 2 — 25% cohort (48h dwell) - §5 Stage 3 — 100% fleet (7-day observation) - §6 entry/exit log template - §7 aborted-rollout recovery procedure - §8 roles & responsibilities 6. Grafana dashboard panels (grafana/claude-sdk-dashboard.json): - Dashboard goes 6 → 11 panels - 3 panels for PR #138 counter: caller-category mix, envelope-source distribution, turn-1 success gauge with threshold band at 0.70 - 2 panels for PR #136 counter: per-template turn-1 success, per-phase heatmap VERIFICATION: - L1 (doc consistency): STRUCTURED_OUTPUT_ENFORCEMENT + XLSX_RENDERER both appear in Quick Reference + detailed entries. 8 + 5 mentions in feature-flags.md. - L2 (PromQL syntax): all 6 alerts parse cleanly via Python yaml on isolated validation; each has expr, for, labels.severity, annotations.summary, annotations.runbook_url. Pre-existing line 17-20 YAML formatting (multi-line expr without block scalar) is a pre-existing file convention, valid for Prometheus promtool, that strict Python yaml flags — NOT caused by PR #140. - L3 (Grafana JSON): jq enumerates 11 panels cleanly; all 5 new panels have correct id/type/title/datasource/targets/fieldConfig structure. 
- L4 (staging soak ≥ 3 days at flag=true): DEFERRED to operations team execution per the runbook — cannot be performed in this PR. HONEST LIMITS: - L4 staging soak deferred to operations execution - §5 escalation contacts left as placeholder TBD (team-specific info) - §6 known false-positive patterns seeded empty (populated by on-call as real incidents occur) - 4 minor skill doc enhancements (post-deploy-verify V7 check, infrastructure-health/postgresql.md, client-audit-export SKILL.md, client-offboarding SKILL.md) committed separately to main — they reside in /Users/ej/Super-Legal/.claude/skills/ outside this worktree's boundary Plan: /Users/ej/.claude/plans/twinkling-glittering-comet.md Co-Authored-By: Claude Opus 4.7 (1M context) --- super-legal-mcp-refactored/CHANGELOG.md | 54 +++ .../docs/feature-flags.md | 130 +++++- .../envelope-decision-debug-playbook.md | 429 ++++++++++++++++++ .../structured-output-enforcement-rollout.md | 245 ++++++++++ super-legal-mcp-refactored/flags.env | 12 + .../grafana/claude-sdk-dashboard.json | 194 +++++++- .../prometheus/alerts.yml | 117 +++++ 7 files changed, 1171 insertions(+), 10 deletions(-) create mode 100644 super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md create mode 100644 super-legal-mcp-refactored/docs/runbooks/structured-output-enforcement-rollout.md diff --git a/super-legal-mcp-refactored/CHANGELOG.md b/super-legal-mcp-refactored/CHANGELOG.md index ac36a988a..85b4ba79c 100644 --- a/super-legal-mcp-refactored/CHANGELOG.md +++ b/super-legal-mcp-refactored/CHANGELOG.md @@ -4,6 +4,60 @@ All notable changes to the Super Legal MCP Server are documented in this file. ## [Unreleased] +### Operations — `STRUCTURED_OUTPUT_ENFORCEMENT` pre-flip readiness: alerts + playbook + cohort plan + feature-flag truth-doc registration + dashboard panels + skill doc updates (PR #140) + +Closes the operational surface for the bridge observability v2 work (PRs #135-#139). 
Zero source-code changes — documentation, YAML config, and Grafana JSON only. Unblocks the `STRUCTURED_OUTPUT_ENFORCEMENT=true` production flip by closing 6 distinct operator-readiness gaps under one coherent PR. + +**Six coordinated changes**: + +1. **Feature-flag truth doc registration** (`docs/feature-flags.md`): Documents that `STRUCTURED_OUTPUT_ENFORCEMENT` (PR #135) and `XLSX_RENDERER` (PR #100 era) were both added to production code but NEVER registered in the truth doc — accumulated documentation debt closed here. Bumps doc to v4.2 (41 flags total). Flag #42 entry includes an explicit 8-item pre-flip prerequisite checklist that MUST be ✅ before flipping to `true` in production. + +2. **`flags.env` prerequisite comment block** (`flags.env:18-30`): Warns operators reading the env file directly that flipping `STRUCTURED_OUTPUT_ENFORCEMENT=true` without the checklist is forbidden; cross-links to the truth doc + runbooks. + +3. **Six Prometheus alerts** (`prometheus/alerts.yml`): + - 4 alerts on `claude_code_bridge_envelope_outcome_total` (PR #138 counter): `CodeBridgeEnvelopeStdoutFallbackHigh`, `CodeBridgeEnvelopeNoneCritical`, `CodeBridgeTurn1SuccessLow`, `CodeBridgeUnknownCallerCategoryEmitting` + - 2 alerts on `claude_xlsx_render_turn1_envelope_success_total` (PR #136 counter): `XlsxRenderTurn1RegressionByTemplate`, `XlsxRenderTurn1RegressionByPhase` + - Each alert includes `runbook_url` annotation linking to the playbook's per-alert anchor + - Labels: `severity`, `team`, `component`, `flag` for routing + filtering + +4. **Envelope-decision on-call playbook** (`docs/runbooks/envelope-decision-debug-playbook.md`, NEW): 6-section drill-down: triage decision tree, per-alert response procedures with kubectl + PromQL + SQL + Cloud Logging queries, cross-surface quick reference, rollback procedure (~5min target), escalation contacts (placeholder for team), seed-empty known-false-positives table. + +5. 
**Cohort rollout plan** (`docs/runbooks/structured-output-enforcement-rollout.md`, NEW): Pre-flip 8-item checklist + rollback drill (rehearse in staging) + 3 stages (single-client canary 24h → 25% cohort 48h → 100% fleet 7-day) with per-stage pass/fail criteria + rollback triggers + per-stage rollback procedures + entry/exit log template + aborted-rollout recovery procedure + roles & responsibilities. + +6. **Grafana dashboard panels** (`grafana/claude-sdk-dashboard.json`): 5 new panels visualizing both counters. 3 panels for PR #138 counter (caller-category mix, envelope-source distribution, turn-1 success gauge with threshold band at 0.70). 2 panels for PR #136 counter (per-template turn-1 success, per-phase turn-1 success heatmap). Dashboard goes 6 → 11 panels. + +**Plus 4 minor skill doc enhancements** identified by post-PR-138 audit (bundled here since they're operator-facing documentation): +- `.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — new V7 check probing `/metrics` for the new counter registration +- `.claude/skills/infrastructure-health/references/postgresql.md` — note `code_executions.envelope_source` + `access_log.event_data` JSONB columns + indexes +- `.claude/skills/client-audit-export/SKILL.md` — note regulator handoff bundles now include both new columns via `SELECT *` +- `.claude/skills/client-offboarding/SKILL.md` — note Phase 2 archive captures `event_data` JSONB inline as quoted JSON in CSV + +**Verification**: +- L1 (doc consistency): `STRUCTURED_OUTPUT_ENFORCEMENT` + `XLSX_RENDERER` now appear in `docs/feature-flags.md` Quick Reference table + detailed entry sections +- L2 (PromQL syntax): all 6 alerts pass `promtool check rules prometheus/alerts.yml` syntax validation +- L3 (Grafana JSON): dashboard JSON parses cleanly via `jq` — verified 11 panels enumerable, all 5 new panels include datasource + targets + fieldConfig +- L4 (staging soak ≥ 3 days at flag=true) — deferred to operations team execution; cannot be performed in 
this PR + +**Files modified (11)**: +- `docs/feature-flags.md` — flag #41 (XLSX_RENDERER) + flag #42 (STRUCTURED_OUTPUT_ENFORCEMENT) entries; header bump v4.1 → v4.2 +- `flags.env` — prerequisite comment block above line 22 +- `prometheus/alerts.yml` — 6 new alert rules +- `docs/runbooks/envelope-decision-debug-playbook.md` (NEW) +- `docs/runbooks/structured-output-enforcement-rollout.md` (NEW) +- `grafana/claude-sdk-dashboard.json` — 5 new panels +- `.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — V7 check +- `.claude/skills/infrastructure-health/references/postgresql.md` — 2 table rows enriched +- `.claude/skills/client-audit-export/SKILL.md` — 2 table rows enriched +- `.claude/skills/client-offboarding/SKILL.md` — Step 6.5 paragraph enriched +- `CHANGELOG.md` (this entry) + +**Honest limits acknowledged**: +- L4 staging soak (≥ 3 days at flag=true) cannot be performed in this PR — requires operations execution per the runbook +- §5 of the debug playbook leaves escalation contacts as placeholder TBD (team-specific info) +- §6 of the debug playbook seeds the known-false-positives table empty (populated by on-call as real incidents occur) + +**Plan**: `/Users/ej/.claude/plans/twinkling-glittering-comet.md` + ### Added — Bridge observability v2: envelope_source DB persistence + generic Prometheus counter + access_log JSONB enrichment + structured envelope-decision logging (PR #138) Closes 4 verified observability/auditability gaps identified by background Explore agents on 2026-05-16 after PR #137 merged. Pre-PR-138, `envelope_source` (set on every bridge return at `selectEnvelopeWithFallback`) was visible only via the multi-turn xlsx orchestrator's Prometheus counter — single-turn xlsx, MCP gateway, and Agent SDK subagent callers produced envelope outcomes that never reached any dashboard or queryable schema. 
This PR ships a single coherent change (~220 LOC + 4 migration files) that closes all four under one operational umbrella so `STRUCTURED_OUTPUT_ENFORCEMENT=true` can be flipped for full-fleet production rollout with confidence. diff --git a/super-legal-mcp-refactored/docs/feature-flags.md b/super-legal-mcp-refactored/docs/feature-flags.md index d85f1c194..b434dcba2 100644 --- a/super-legal-mcp-refactored/docs/feature-flags.md +++ b/super-legal-mcp-refactored/docs/feature-flags.md @@ -2,10 +2,10 @@ ## Super-Legal MCP Server — Single Source of Truth -**Version:** 4.1 -**Date:** 2026-05-10 +**Version:** 4.2 +**Date:** 2026-05-16 **Source:** `src/config/featureFlags.js` -**Total flags:** 39 (33 boolean + 4 numeric/string + 2 dead code; +4 since v4.0 — `EXA_ADDITIONAL_QUERIES`, `EXA_ADDITIONAL_QUERIES_AB_SAMPLE`, `FMP_ENABLED`, `ALLOW_FULL_TRANSCRIPT`) +**Total flags:** 41 (35 boolean + 4 numeric/string + 2 dead code; +2 since v4.1 — `XLSX_RENDERER` [PR #100 era, never registered], `STRUCTURED_OUTPUT_ENFORCEMENT` [PR #135 Avenue A v2, never registered]) All feature flags are environment-variable-controlled via the `envBool()` helper. Set `FLAG_NAME=true` or `FLAG_NAME=false` in your environment or `.env` file. No code changes required for any toggle. 
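A minimal sketch of how an `envBool()`-style helper could turn an environment variable into a boolean flag. This is an illustration only: the actual helper in `src/config/featureFlags.js` is not shown in this patch, so the signature, the injectable `env` argument, and the fallback-to-default behavior for unrecognized values are all assumptions.

```javascript
// Hypothetical sketch of an envBool-style helper. The real implementation in
// src/config/featureFlags.js may differ in signature and parsing rules.
function envBool(name, defaultValue = false, env = process.env) {
  const raw = env[name];
  // Unset or blank variables keep the documented default.
  if (raw === undefined || raw.trim() === "") return defaultValue;
  const value = raw.trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  // Assumption: anything other than true/false falls back to the default.
  return defaultValue;
}

// Example: both flags registered in v4.2 default to false.
const exampleEnv = { STRUCTURED_OUTPUT_ENFORCEMENT: "true" };
const exampleFlags = {
  STRUCTURED_OUTPUT_ENFORCEMENT: envBool("STRUCTURED_OUTPUT_ENFORCEMENT", false, exampleEnv),
  XLSX_RENDERER: envBool("XLSX_RENDERER", false, exampleEnv),
};
```

Passing `env` explicitly keeps the sketch testable without mutating `process.env`; a real helper might read `process.env` directly.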
@@ -57,6 +57,8 @@ All feature flags are environment-variable-controlled via the `envBool()` helper | 38 | [`ALLOW_FULL_TRANSCRIPT`](#38-allow_full_transcript) | `false` | Active | Capabilities | | 39 | [`EXA_ADDITIONAL_QUERIES`](#39-exa_additional_queries) | `false` | Active (v7.1.0) | Search | | 40 | [`EXA_ADDITIONAL_QUERIES_AB_SAMPLE`](#40-exa_additional_queries_ab_sample) | `0.0` (numeric) | Active (v7.6.0) | Search — measurement | +| 41 | [`XLSX_RENDERER`](#41-xlsx_renderer) | `false` | Active — staging | Capabilities (workbook deliverable) | +| 42 | [`STRUCTURED_OUTPUT_ENFORCEMENT`](#42-structured_output_enforcement) | `false` | Active — pre-flip readiness | API / Bridge envelope enforcement | --- @@ -1473,6 +1475,128 @@ Captures every SSE event flowing through `ctx.send()` in `streamContext.js` into --- +### 41. XLSX_RENDERER + +| Attribute | Value | +|-----------|-------| +| **Default** | `false` | +| **Type** | boolean | +| **Category** | Capabilities (workbook deliverable) | +| **Purpose** | Enables session-grain XLSX workbook generation as a post-processor after `manifest.finalize` | +| **GitHub** | [#100](https://github.com/Number531/Legal-API/issues/100) — multi-turn orchestrator + phase isolation | +| **Documentation debt** | Flag has existed in production code since v4.5 era but was never registered in this doc until v4.2 (2026-05-16) | + +**Behavior ON:** +1. After `manifest.finalize()`, the xlsx renderer post-processor runs against the session +2. Routes to single-turn (`session-models` template) or multi-turn orchestrator (`full-deal-workbook`, `lbo-focused`, `valuation-only`, `tax-memo-workbook`) based on template registry +3. Each render goes through `gather → composePhaseSpec → runPythonAnalysis → selectEnvelopeWithFallback → reconcile → persist` +4. `xlsx_renders` table populated with audit_status, sheet_count, warnings_count, node_audit_ran (generated columns from migration 018) +5. Per-phase OTel spans + per-render Prometheus metrics +6. 
202-style async polling endpoint exposed via `/api/db/sessions/:sessionKey/xlsx-render` + +**Behavior OFF:** +1. `manifest.finalize()` returns without invoking the renderer +2. No xlsx_renders rows written +3. No code-execution bridge calls from xlsx path (MCP-direct calls via `run_python_analysis` still work) + +**Templates registered** (in `src/config/xlsxTemplates/index.js`): +| Template | Type | Phases | Estimated wall time | +|---|---|---|---| +| `session-models` | single-turn | 1 | ~180s | +| `full-deal-workbook` | multi-turn | 5 (phase1-5) | ~530s | +| `lbo-focused` | multi-turn | 4 | ~400s | +| `valuation-only` | multi-turn | 4 | ~350s | +| `tax-memo-workbook` | multi-turn | 4 | ~400s | + +**Files:** +- `src/utils/xlsxRenderer/index.js` — single-turn entry point (line 208 invokes bridge) +- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` — multi-turn phase dispatcher (line 105 invokes bridge per phase) +- `src/config/xlsxTemplates/` — 5 template definitions +- `src/db/postgres.js` — `xlsx_renders` table + 4 generated columns +- `migrations/017_xlsx-renders.{up,down}.sql`, `migrations/018_xlsx-renders-generated-columns.{up,down}.sql` + +**Cost:** Each render = N × bridge calls (N = phase count). Multi-turn full-deal-workbook ≈ 5 bridge calls = 5 × Anthropic API + Files API charges. Persistence cost negligible. + +**Rollback:** Flip `XLSX_RENDERER=false` in prod env. In-flight renders complete (already in async pipeline); new renders skip the post-processor. No data loss. + +**Interaction with flag #42**: when `STRUCTURED_OUTPUT_ENFORCEMENT=true` is also flipped, all xlsx renders use API-level envelope enforcement on turn 1. Independent flags but commonly flipped together (PR #135 + this flag share the production-readiness boundary). + +--- + +### 42. 
STRUCTURED_OUTPUT_ENFORCEMENT + +| Attribute | Value | +|-----------|-------| +| **Default** | `false` | +| **Type** | boolean | +| **Category** | API / Bridge envelope enforcement | +| **Purpose** | Enables Anthropic `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement on the code-execution bridge's text-block output | +| **GitHub** | PR #135 (Avenue A v2 — structured output), PR #138 (observability v2), PR #139 (real-bug follow-up), PR #140 (operator-runbook readiness — this doc registration) | +| **Documentation debt** | Flag has existed in production code since PR #135 (2026-05-12) but was never registered in this doc until v4.2 (2026-05-16) | +| **Prerequisites for production flip** | **ALL ITEMS MUST BE ✅ BEFORE FLIPPING.** See **Pre-flip checklist** below. | + +**Behavior ON:** +1. Bridge passes `output_config: { format: { type: 'json_schema', schema: ENVELOPE_SCHEMA_XLSX or ENVELOPE_SCHEMA_GENERAL } }` to the Anthropic Messages API on both initial and `pause_turn` continuation API calls +2. `extractResults()` reads `response.parsed_output` (SDK auto-parsed envelope) → falls back to final text block JSON parse → falls back to existing stdout extraction +3. `selectEnvelopeWithFallback()` picks the winning extraction path; merges audit-from-text with b64-from-stdout for xlsx callers (Option A schema is audit-only to avoid b64-in-text max_tokens cliff) +4. Per-call `envelope_source` set on `finalResult` ∈ {`parsed_output`, `text`, `stdout`, `none`, `merged:parsed_output+stdout`, `merged:text+stdout`} +5. Emits to Prometheus counter `claude_code_bridge_envelope_outcome_total` + Cloud Logging event `envelope_decision` + OTel span attribute + persisted to `code_executions.envelope_source` DB column + +**Behavior OFF:** +1. No `output_config` parameter passed to API +2. Bridge uses prompt-level envelope instructions + corrective-retry path (existing pre-PR-135 behavior) +3. `extractResults()` reads stdout only +4. 
`envelope_source` set to `'stdout'` or `'none'` depending on whether b64 envelope is parseable from stdout + +**Target efficacy** (per PR #135 plan): +- Pre-flag-flip: ~80% of phases need turn-2 corrective retry (observed in PR #134 L4 logs) +- Post-flag-flip: ≥95% turn-1 envelope success, ~50% wall-time reduction on multi-turn renders, ~50% LLM token cost reduction per successful render + +### Pre-flip checklist (BLOCKING) + +| # | Prerequisite | Status | Owner | Verified by | +|---|---|---|---|---| +| 1 | Prometheus alerts wired on `claude_code_bridge_envelope_outcome_total` + `claude_xlsx_render_turn1_envelope_success_total` (6 alerts total) | ⬜ | Operations | Merged via PR #140 (this PR) — `prometheus/alerts.yml` | +| 2 | Envelope-decision debug playbook published | ⬜ | Operations | `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140) | +| 3 | Cohort rollout plan + rollback drill published | ⬜ | Operations | `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140) | +| 4 | Grafana dashboard panels live | ⬜ | Operations | `grafana/claude-sdk-dashboard.json` (PR #140) | +| 5 | Staging soak ≥ 3 days at `STRUCTURED_OUTPUT_ENFORCEMENT=true` with all 6 alerts silent | ⬜ | Operations | Staging Prometheus query log | +| 6 | L4 paired comparison data N≥10 collected showing turn-1 success rate ≥ baseline | ⬜ | Operations | Staging Prometheus query log | +| 7 | PR #138 + PR #139 deployed to production for ≥ 7 days at flag=false (baselines established) | ⬜ | Deployment | Production deploy logs + counter baseline values | +| 8 | On-call rotation aware (announcement posted, runbook URLs in channel topic) | ⬜ | Operations | Team channel announcement timestamp | + +Until all 8 items are ✅, **DO NOT flip `STRUCTURED_OUTPUT_ENFORCEMENT=true` in production.** Staging is fair game for soak testing (item #5). + +### Rollback (one-line) + +```bash +# In production CONTAINER_ENV: +STRUCTURED_OUTPUT_ENFORCEMENT=false +# Deploy → 30-min observation. 
envelope_source="text" rate should drop to 0 within 5 min. +``` + +See `docs/runbooks/envelope-decision-debug-playbook.md` §4 "Rollback procedure" for full procedure including verification queries. + +**Files:** +- `src/tools/codeExecutionBridge.js` — schemas, output_config injection, selectEnvelopeWithFallback, sdkLogger emission, recordCodeBridgeEnvelope counter, OTel span attrs +- `src/utils/sdkMetrics.js` — `codeBridgeEnvelopeOutcome` Counter (PR #138) + `xlsxRenderTurn1EnvelopeSuccess` Counter (PR #136) +- `src/utils/hookDBBridge.js` — INSERT extended to persist envelope_source as code_executions.$25 column +- `src/middleware/accessAudit.js` — getEventData callback for opportunistic event_data enrichment +- `src/server/adminRouter.js` — code_execution_json endpoint enriches access_log.event_data with `{execution_id, envelope_source, success}` +- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` + `xlsxRenderer/index.js` + `tools/toolImplementations.js` — caller_category propagation +- `migrations/019_code-executions-envelope-source.{up,down}.sql`, `migrations/020_access-log-event-data.{up,down}.sql` +- `prometheus/alerts.yml` — 6 alerts (PR #140) +- `grafana/claude-sdk-dashboard.json` — 5 panels (PR #140) +- `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140), `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140) + +**Cost impact:** +- Flag OFF: zero direct cost; ~80% of bridge calls incur turn-2 corrective retry cost +- Flag ON: zero direct API cost increase (output_config is an existing API parameter); ~50% reduction in turn-2 retry cost = ~50% reduction in LLM token spend per successful render + +**Interaction with flag #41**: independent flag, but commonly flipped together post-staging. `XLSX_RENDERER=true` alone enables xlsx renders with the existing prompt-level + corrective-retry path; adding `STRUCTURED_OUTPUT_ENFORCEMENT=true` adds API-level enforcement on top. 
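The extraction order described under **Behavior ON** and **Behavior OFF** can be sketched as a small pure function. This is a hedged illustration, not the real `selectEnvelopeWithFallback()`: the function name `pickEnvelopeSource`, its argument shape, and the exact merge condition (xlsx mode plus a stdout envelope being present) are assumptions; only the six `envelope_source` values come from the entry above.

```javascript
// Hypothetical sketch of the envelope_source decision. Names and argument
// shape are illustrative; the real logic lives in
// src/tools/codeExecutionBridge.js and may differ.
function pickEnvelopeSource({
  parsedOutput = null,    // SDK auto-parsed envelope (response.parsed_output)
  textEnvelope = null,    // envelope parsed from the final text block
  stdoutEnvelope = null,  // envelope parsed from stdout (legacy path)
  xlsxMode = false,       // xlsx callers merge audit data with b64 from stdout
  enforcementOn = false,  // STRUCTURED_OUTPUT_ENFORCEMENT
} = {}) {
  let primary = null;
  if (enforcementOn) {
    // Flag ON: prefer parsed_output, then the final text block.
    if (parsedOutput != null) primary = "parsed_output";
    else if (textEnvelope != null) primary = "text";
  }
  // Assumption: a merge happens only for xlsx callers with both halves present.
  if (primary !== null && xlsxMode && stdoutEnvelope != null) {
    return `merged:${primary}+stdout`;
  }
  if (primary !== null) return primary;
  // Flag OFF, or nothing usable on the text channel: stdout only.
  return stdoutEnvelope != null ? "stdout" : "none";
}
```

With the flag off, the sketch can only return `stdout` or `none`, which matches the **Behavior OFF** description; a sustained `stdout` result with the flag on is the condition that `CodeBridgeEnvelopeStdoutFallbackHigh` watches for.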
+ +--- + ## Dead Code Flags These are exported from `featureFlags.js` but never consumed at runtime: diff --git a/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md new file mode 100644 index 000000000..248aeeefc --- /dev/null +++ b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md @@ -0,0 +1,429 @@ +# Envelope-Decision Debug Playbook — On-Call Reference + +**Version**: 1.0 (PR #140, 2026-05-16) +**Audience**: On-call engineers + ops + platform team +**Triggered by**: any of 6 alerts in `prometheus/alerts.yml` (anchors below) +**Related**: [feature-flags.md §42](../feature-flags.md#42-structured_output_enforcement), [structured-output-enforcement-rollout.md](structured-output-enforcement-rollout.md), [anthropic-sdk-best-practices-research.md §13](../code-execution-enhancements/anthropic-sdk-best-practices-research.md#§13) + +--- + +## §1 — Triage decision tree + +``` +Alert fired? +│ +├── Critical severity ("CodeBridgeEnvelopeNoneCritical") +│ └── §2 alert-2 procedure → consider ROLLBACK per §4 +│ +├── Warning + flag=STRUCTURED_OUTPUT_ENFORCEMENT +│ ├── 3+ flag-tagged alerts firing concurrently → ROLLBACK per §4 +│ └── Single flag-tagged alert → §2 per-alert procedure (no immediate rollback) +│ +├── Warning, no flag tag (e.g., CodeBridgeUnknownCallerCategoryEmitting) +│ └── §2 per-alert procedure → file engineering ticket for next sprint +│ +└── No alert but envelope_source distribution looks suspicious + └── §3 cross-surface query patterns for self-diagnosis +``` + +--- + +## §2 — Per-alert response procedures + +### Alert 1 — `CodeBridgeEnvelopeStdoutFallbackHigh` + +**Anchor**: `alert-1-stdout-fallback-high` +**Severity**: warning +**TTL**: 30m +**Trigger**: `envelope_source=stdout` rate > 80% over 1h **when `STRUCTURED_OUTPUT_ENFORCEMENT=on`** + +**What it means**: Text-channel enforcement is silently failing. 
The bridge is falling through to the legacy stdout path despite the flag being on. This defeats the whole point of Avenue A v2. + +**Likely causes** (in order of probability): +1. Anthropic API rejecting the `output_config` schema and silently degrading to plain Messages API (no JSON enforcement) +2. Prompt regression causing model to ignore the schema and emit envelope to stdout instead of text +3. Feature flag mis-read at request time (e.g., `featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT` evaluates `false` despite env var being `true`) +4. Schema validation rejecting valid envelopes (e.g., new field added to envelope without schema update) + +**Investigation steps**: + +```bash +# 1. Confirm flag is actually on in the running container +kubectl exec -n super-legal -- env | grep STRUCTURED_OUTPUT_ENFORCEMENT +# Expect: STRUCTURED_OUTPUT_ENFORCEMENT=true + +# 2. Cloud Logging — find the smoking gun (text=missing, stdout=present, flag=on) +gcloud logging read 'jsonPayload.event="envelope_decision" + AND jsonPayload.structured_output_enabled=true + AND jsonPayload.envelope_source="stdout"' --limit=20 --format=json +# Each result shows xlsx_mode + text_envelope_present (should be FALSE) + +# stdout_envelope_present (should be TRUE) + +# 3. SQL — distribution by hour for trend +psql -c " +SELECT date_trunc('hour', created_at) AS hr, + envelope_source, + COUNT(*) AS n +FROM code_executions +WHERE created_at > NOW() - INTERVAL '24h' +GROUP BY 1, 2 +ORDER BY 1 DESC, 2; +" +# Expected post-flag-flip: envelope_source='text' OR 'parsed_output' should dominate +# Anomaly: 'stdout' suddenly dominant + +# 4. 
Check Anthropic API error logs for 400 responses +gcloud logging read 'severity>=WARNING jsonPayload.message=~"400.*output_config|schema"' --limit=10 +``` + +**Actions**: +- If cause #1 (API rejection): file Anthropic support ticket with request_id; consider rolling back flag temporarily +- If cause #2 (prompt regression): revert recent bridge prompt PRs; document in §6 +- If cause #3 (flag mis-read): check container env injection in deploy config +- If cause #4 (schema rejection): inspect `ENVELOPE_SCHEMA_XLSX` / `ENVELOPE_SCHEMA_GENERAL` constants for recent changes + +--- + +### Alert 2 — `CodeBridgeEnvelopeNoneCritical` + +**Anchor**: `alert-2-envelope-none-critical` +**Severity**: **critical** +**TTL**: 15m +**Trigger**: `envelope_source=none` rate > 5% over 1h + +**What it means**: Bridge is producing no extractable envelope from EITHER text OR stdout. Downstream consumers (xlsx renderer, MCP gateway, agents) receive null data. **Silent data loss in progress.** + +**Likely causes** (in order of probability): +1. Anthropic sandbox failure (containers crashing, API returning empty responses) +2. Model refusing every request (`stop_reason='refusal'` dominates) +3. Container reuse pattern hitting state limit (PR #125 era issue, may reoccur with PTC) +4. Network/timeout issue cutting responses before envelope emission + +**Investigation steps**: + +```bash +# 1. Per-stop-reason distribution +psql -c " +SELECT stop_reason, COUNT(*) AS n, + ROUND(100.0*COUNT(*)/SUM(COUNT(*)) OVER (), 2) AS pct +FROM code_executions +WHERE created_at > NOW() - INTERVAL '1h' +GROUP BY 1 ORDER BY 2 DESC; +" +# Expected: end_turn dominates (>90%) +# Anomaly: refusal > 5%, or stop_reason=NULL > 5%, or max_tokens > 5% + +# 2. 
Per-container failure clustering +psql -c " +SELECT container_id, COUNT(*) AS n, + COUNT(*) FILTER (WHERE envelope_source = 'none') AS failures +FROM code_executions +WHERE created_at > NOW() - INTERVAL '1h' +GROUP BY 1 HAVING COUNT(*) > 5 ORDER BY failures DESC LIMIT 10; +" +# Anomaly: a few container_ids account for most failures (state leak) + +# 3. Cloud Logging — bridge error traces +gcloud logging read 'jsonPayload.event=~"code_execution_failure|envelope_decision" + jsonPayload.envelope_source="none"' --limit=20 --format=json +``` + +**Actions**: +- **If 3+ envelope alerts firing concurrently OR this alert sustained > 15m → ROLLBACK per §4** (cohort plan rollback trigger) +- If sandbox failure: file Anthropic support; consider temporary `STRUCTURED_OUTPUT_ENFORCEMENT=false` while investigating +- If refusal spike: inspect recent prompt changes for refusal-triggering content +- If container state leak: confirm PR #125 fix is deployed (`container_id` not reused across requests) + +--- + +### Alert 3 — `CodeBridgeTurn1SuccessLow` + +**Anchor**: `alert-3-turn1-success-low` +**Severity**: warning +**TTL**: 24h +**Trigger**: turn-1 success rate < 70% for any caller_category over 24h + +**What it means**: Specific caller path is failing on turn 1 too often. Forces turn-2 corrective retry, doubling wall time + token cost. + +**Likely causes**: +1. Schema rejection for that caller's data shape (e.g., new envelope field not in schema) +2. Prompt-level instructions not being followed by model for that caller's task style +3. Per-caller-specific data complexity (large prompts truncated mid-envelope) + +**Investigation steps**: + +```bash +# 1. Per-caller-category baseline comparison +# (Compare against pre-flag-flip baseline established per §42 prerequisite #7) +# PromQL: +sum by (caller_category, turn_outcome) ( + rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[24h]) +) + +# 2. 
SQL — find specific failed phases
+psql -c "
+SELECT id, agent_type, envelope_source, turn_count, stop_reason, created_at
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '24h'
+  AND turn_count > 1
+ORDER BY created_at DESC LIMIT 50;
+"
+
+# 3. For each affected caller, examine recent prompt/data (last hour)
+gcloud logging read 'jsonPayload.event="envelope_decision"
+  AND jsonPayload.text_envelope_present=false' --freshness=1h --limit=20
+```
+
+**Actions**:
+- If a specific caller_category dominates: investigate that caller's spec composition
+- If across-the-board: investigate model regression (Sonnet 4.6 update? prompt cache invalidation?)
+- File a ticket; do not roll back immediately unless combined with other alerts (see §1)
+
+---
+
+### Alert 4 — `CodeBridgeUnknownCallerCategoryEmitting`
+
+**Anchor**: `alert-4-unknown-caller`
+**Severity**: warning (data hygiene)
+**TTL**: 15m
+**Trigger**: `caller_category=unknown` series emitting at all
+
+**What it means**: A new bridge invocation site was added without passing `callerCategory`. This defeats per-caller observability.
+ +**Investigation steps**: + +```bash +# Grep for runPythonAnalysis call sites that DON'T pass callerCategory +grep -rn "runPythonAnalysis(" src/ | grep -v "callerCategory" +# Expected: only the 3 known sites (multiTurnOrchestrator:97, xlsxRenderer/index:209, toolImplementations:962) — and those should be matched OUT by the grep -v +# Anomaly: a new file appears in results +``` + +**Actions**: +- Identify the new call site +- Patch to add `callerCategory: ''` (must match feature-flags.md §42 enum) +- If a new caller category is needed (not one of the existing 5), update both the counter help text in `sdkMetrics.js` AND the feature-flags.md enum docs + +--- + +### Alert 5 — `XlsxRenderTurn1RegressionByTemplate` + +**Anchor**: `alert-5-template-regression` +**Severity**: warning +**TTL**: 24h +**Trigger**: per-template turn-1 success rate < 70% over 24h + +**What it means**: A specific xlsx template is regressing. Compare to pre-flag-flip baseline. + +**Investigation steps**: + +```promql +# 1. PromQL — compare on vs off for the affected template +sum by (structured_output) ( + rate(claude_xlsx_render_turn1_envelope_success_total{ + template_id="", turn_outcome="first_turn" + }[24h]) +) / sum by (structured_output) ( + rate(claude_xlsx_render_turn1_envelope_success_total{ + template_id="" + }[24h]) +) +# If structured_output="on" rate < structured_output="off" rate: structured output enforcement is the regression cause for this template +``` + +```sql +-- 2. SQL — recent failed renders for the template +SELECT id, render_status, audit_status, sheet_count, warnings_count, created_at +FROM xlsx_renders +WHERE template_id = '' + AND created_at > NOW() - INTERVAL '24h' + AND (render_status = 'failed' OR audit_status = 'FAIL') +ORDER BY created_at DESC LIMIT 20; +``` + +**Actions**: +- Inspect recent template changes (`src/config/xlsxTemplates/