diff --git a/super-legal-mcp-refactored/CHANGELOG.md b/super-legal-mcp-refactored/CHANGELOG.md
index ac36a988a..77b7e3c11 100644
--- a/super-legal-mcp-refactored/CHANGELOG.md
+++ b/super-legal-mcp-refactored/CHANGELOG.md
@@ -4,6 +4,66 @@ All notable changes to the Super Legal MCP Server are documented in this file.
 
 ## [Unreleased]
 
+### Operations — `STRUCTURED_OUTPUT_ENFORCEMENT` pre-flip readiness: alerts + playbook + cohort plan + feature-flag truth-doc registration + dashboard panels + skill doc updates (PR #140)
+
+Closes the operational surface for the bridge observability v2 work (PRs #135-#139). Zero source-code changes — documentation, YAML config, and Grafana JSON only. Unblocks the `STRUCTURED_OUTPUT_ENFORCEMENT=true` production flip by closing 6 distinct operator-readiness gaps under one coherent PR.
+
+**Six coordinated changes**:
+
+1. **Feature-flag truth doc registration** (`docs/feature-flags.md`): Registers `STRUCTURED_OUTPUT_ENFORCEMENT` (PR #135) and `XLSX_RENDERER` (PR #100 era), both of which shipped in production code but were NEVER registered in the truth doc; this closes that documentation debt. Bumps the doc to v4.2 (41 flags total). The flag #42 entry includes an explicit 8-item pre-flip prerequisite checklist; every item MUST be ✅ before flipping to `true` in production.
+
+2. **`flags.env` prerequisite comment block** (`flags.env:18-30`): Warns operators reading the env file directly that flipping `STRUCTURED_OUTPUT_ENFORCEMENT=true` without the checklist is forbidden; cross-links to the truth doc + runbooks.
+
+3. **Six Prometheus alerts** (`prometheus/alerts.yml`):
+   - 4 alerts on `claude_code_bridge_envelope_outcome_total` (PR #138 counter): `CodeBridgeEnvelopeStdoutFallbackHigh`, `CodeBridgeEnvelopeNoneCritical`, `CodeBridgeTurn1SuccessLow`, `CodeBridgeUnknownCallerCategoryEmitting`
+   - 2 alerts on `claude_xlsx_render_turn1_envelope_success_total` (PR #136 counter): `XlsxRenderTurn1RegressionByTemplate`, `XlsxRenderTurn1RegressionByPhase`
+   - Each alert includes `runbook_url` annotation linking to the playbook's per-alert anchor
+   - Labels: `severity`, `team`, `component`, `flag` for routing + filtering
+
+4. **Envelope-decision on-call playbook** (`docs/runbooks/envelope-decision-debug-playbook.md`, NEW): 6-section drill-down: triage decision tree, per-alert response procedures with kubectl + PromQL + SQL + Cloud Logging queries, cross-surface quick reference, rollback procedure (~5min target), escalation contacts (placeholder, to be filled in by the team), known-false-positives table (seeded empty).
+
+5. **Cohort rollout plan** (`docs/runbooks/structured-output-enforcement-rollout.md`, NEW): Pre-flip 8-item checklist + rollback drill (rehearse in staging) + 3 stages (single-client canary 24h → 25% cohort 48h → 100% fleet 7-day) with per-stage pass/fail criteria + rollback triggers + per-stage rollback procedures + entry/exit log template + aborted-rollout recovery procedure + roles & responsibilities.
+
+6. **Grafana dashboard panels** (`grafana/claude-sdk-dashboard.json`): 5 new panels visualizing both counters. 3 panels for PR #138 counter (caller-category mix, envelope-source distribution, turn-1 success gauge with threshold band at 0.70). 2 panels for PR #136 counter (per-template turn-1 success, per-phase turn-1 success heatmap). Dashboard goes 6 → 11 panels.
+
+**Plus 4 minor skill doc enhancements** identified by post-PR-138 audit (bundled here since they're operator-facing documentation):
+- `.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — new V7 check probing `/metrics` for the new counter registration
+- `.claude/skills/infrastructure-health/references/postgresql.md` — note `code_executions.envelope_source` + `access_log.event_data` JSONB columns + indexes
+- `.claude/skills/client-audit-export/SKILL.md` — note regulator handoff bundles now include both new columns via `SELECT *`
+- `.claude/skills/client-offboarding/SKILL.md` — note Phase 2 archive captures `event_data` JSONB inline as quoted JSON in CSV
+
+**Verification**:
+- L1 (doc consistency): `STRUCTURED_OUTPUT_ENFORCEMENT` + `XLSX_RENDERER` now appear in `docs/feature-flags.md` Quick Reference table + detailed entry sections
+- L2 (PromQL syntax): all 6 alerts pass `promtool check rules prometheus/alerts.yml` syntax validation
+- L3 (Grafana JSON): dashboard JSON parses cleanly via `jq` — verified 11 panels enumerable, all 5 new panels include datasource + targets + fieldConfig
+- L4 (staging soak ≥ 3 days at flag=true) — deferred to operations team execution; cannot be performed in this PR
+
+**Files modified (11 total across PR #140 + companion PR #141)**:
+
+In PR #140 commit (7 files in `super-legal-mcp-refactored/` worktree):
+- `docs/feature-flags.md` — flag #41 (XLSX_RENDERER) + flag #42 (STRUCTURED_OUTPUT_ENFORCEMENT) entries; header bump v4.1 → v4.2; Dependency Tree section updated to include both new flags
+- `flags.env` — prerequisite comment block above line 22
+- `prometheus/alerts.yml` — 6 new alert rules
+- `docs/runbooks/envelope-decision-debug-playbook.md` (NEW)
+- `docs/runbooks/structured-output-enforcement-rollout.md` (NEW)
+- `grafana/claude-sdk-dashboard.json` — 5 new panels
+- `CHANGELOG.md` (this entry)
+
+In companion PR #141 commit (4 files in `.claude/skills/`, outside this worktree boundary):
+- `.claude/skills/post-deploy-verify/scripts/verify-tier2.sh` — V7 check
+- `.claude/skills/infrastructure-health/references/postgresql.md` — 2 table rows enriched
+- `.claude/skills/client-audit-export/SKILL.md` — 2 table rows enriched
+- `.claude/skills/client-offboarding/SKILL.md` — Step 6.5 paragraph enriched
+
+PR #141 exists as a separate commit because the 4 skill docs live at the project root (`/Users/ej/Super-Legal/.claude/skills/`) which is outside the xlsx-renderer worktree's boundary; same logical scope, separated only by git-worktree mechanics.
+
+**Honest limits acknowledged**:
+- L4 staging soak (≥ 3 days at flag=true) cannot be performed in this PR — requires operations execution per the runbook
+- §5 of the debug playbook leaves escalation contacts as placeholder TBD (team-specific info)
+- §6 of the debug playbook seeds the known-false-positives table empty (populated by on-call as real incidents occur)
+
+**Plan**: `/Users/ej/.claude/plans/twinkling-glittering-comet.md`
+
 ### Added — Bridge observability v2: envelope_source DB persistence + generic Prometheus counter + access_log JSONB enrichment + structured envelope-decision logging (PR #138)
 
 Closes 4 verified observability/auditability gaps identified by background Explore agents on 2026-05-16 after PR #137 merged. Pre-PR-138, `envelope_source` (set on every bridge return at `selectEnvelopeWithFallback`) was visible only via the multi-turn xlsx orchestrator's Prometheus counter — single-turn xlsx, MCP gateway, and Agent SDK subagent callers produced envelope outcomes that never reached any dashboard or queryable schema. This PR ships a single coherent change (~220 LOC + 4 migration files) that closes all four under one operational umbrella so `STRUCTURED_OUTPUT_ENFORCEMENT=true` can be flipped for full-fleet production rollout with confidence.
diff --git a/super-legal-mcp-refactored/docs/feature-flags.md b/super-legal-mcp-refactored/docs/feature-flags.md
index d85f1c194..7089aa05a 100644
--- a/super-legal-mcp-refactored/docs/feature-flags.md
+++ b/super-legal-mcp-refactored/docs/feature-flags.md
@@ -2,10 +2,10 @@
 
 ## Super-Legal MCP Server — Single Source of Truth
 
-**Version:** 4.1
-**Date:** 2026-05-10
+**Version:** 4.2
+**Date:** 2026-05-16
 **Source:** `src/config/featureFlags.js`
-**Total flags:** 39 (33 boolean + 4 numeric/string + 2 dead code; +4 since v4.0 — `EXA_ADDITIONAL_QUERIES`, `EXA_ADDITIONAL_QUERIES_AB_SAMPLE`, `FMP_ENABLED`, `ALLOW_FULL_TRANSCRIPT`)
+**Total flags:** 41 (35 boolean + 4 numeric/string + 2 dead code; +2 since v4.1 — `XLSX_RENDERER` [PR #100 era, never registered], `STRUCTURED_OUTPUT_ENFORCEMENT` [PR #135 Avenue A v2, never registered])
 
 All feature flags are environment-variable-controlled via the `envBool()` helper. Set `FLAG_NAME=true` or `FLAG_NAME=false` in your environment or `.env` file. No code changes required for any toggle.
 
@@ -57,6 +57,8 @@ All feature flags are environment-variable-controlled via the `envBool()` helper
 | 38 | [`ALLOW_FULL_TRANSCRIPT`](#38-allow_full_transcript) | `false` | Active | Capabilities |
 | 39 | [`EXA_ADDITIONAL_QUERIES`](#39-exa_additional_queries) | `false` | Active (v7.1.0) | Search |
 | 40 | [`EXA_ADDITIONAL_QUERIES_AB_SAMPLE`](#40-exa_additional_queries_ab_sample) | `0.0` (numeric) | Active (v7.6.0) | Search — measurement |
+| 41 | [`XLSX_RENDERER`](#41-xlsx_renderer) | `false` | Active — staging | Capabilities (workbook deliverable) |
+| 42 | [`STRUCTURED_OUTPUT_ENFORCEMENT`](#42-structured_output_enforcement) | `false` | Active — pre-flip readiness | API / Bridge envelope enforcement |
 
 ---
 
@@ -77,7 +79,9 @@ USE_AGENT_SDK ────────> gates Agent SDK multi-turn path
   │     │     └── CODE_EXECUTION_BRIDGE ──> code-execution domain
   │     ├── CODE_EXECUTION_BRIDGE ──> agent prompt content (5 agents)
   │     │     ├── FILES_API_CHART_EXTRACTION ──> file_id download from containers
-  │     │     └── CHART_PERSISTENCE ──> charts to disk + markdown embedding
+  │     │     ├── CHART_PERSISTENCE ──> charts to disk + markdown embedding
+  │     │     └── STRUCTURED_OUTPUT_ENFORCEMENT ──> API-level envelope JSON-schema enforcement (PR #135/Avenue A v2; flag #42)
+  │     │           └── (no effect when CODE_EXECUTION_BRIDGE=false — bridge never called)
   │     ├── DOCUMENT_PROCESSING ──> P0 pre-wave phase
   │     └── CITATION_WEBSEARCH_VERIFICATION ──> G5 phase
   │           └── CITATION_DEEP_VERIFICATION ──> G5 model + depth
@@ -105,6 +109,12 @@ ACCESS_AUDIT ──────> (independent, but useful only with AUTH_ENABLED
 SESSION_RECONCILIATION ──> (requires HOOK_DB_PERSISTENCE) hourly auto-rebuild loop
                            for partial sessions (KG + artifacts)
 
+XLSX_RENDERER ──────> (flag #41) session-grain xlsx workbook deliverable post-manifest.finalize
+                       ├── Requires: CODE_EXECUTION_BRIDGE=true (bridge invocations per phase)
+                       ├── Recommends: HOOK_DB_PERSISTENCE=true (xlsx_renders table for state machine + audit)
+                       └── Interacts with: STRUCTURED_OUTPUT_ENFORCEMENT (when both ON, xlsx renders use
+                           API-level envelope enforcement; independent flips supported)
+
 STRUCTURED_OUTPUTS ──> (independent) JSON schema on API requests
 SKILLS_ENABLED ──────> (independent) custom skills + beta headers
 EXTENDED_CONTEXT ────> (independent) 1M context for Messages API path
@@ -1473,6 +1483,133 @@ Captures every SSE event flowing through `ctx.send()` in `streamContext.js` into
 
 ---
 
+### 41. XLSX_RENDERER
+
+| Attribute | Value |
+|-----------|-------|
+| **Default** | `false` |
+| **Type** | boolean |
+| **Category** | Capabilities (workbook deliverable) |
+| **Purpose** | Enables session-grain XLSX workbook generation as a post-processor after `manifest.finalize` |
+| **GitHub** | [#100](https://github.com/Number531/Legal-API/issues/100) — multi-turn orchestrator + phase isolation |
+| **Documentation debt** | Flag has existed in production code since v4.5 era but was never registered in this doc until v4.2 (2026-05-16) |
+
+**Behavior ON:**
+1. After `manifest.finalize()`, the xlsx renderer post-processor runs against the session
+2. Routes to single-turn (`session-models` template) or multi-turn orchestrator (`full-deal-workbook`, `lbo-focused`, `valuation-only`, `tax-memo-workbook`) based on template registry
+3. Each render goes through `gather → composePhaseSpec → runPythonAnalysis → selectEnvelopeWithFallback → reconcile → persist`
+4. `xlsx_renders` table populated with audit_status, sheet_count, warnings_count, node_audit_ran (generated columns from migration 018)
+5. Per-phase OTel spans + per-render Prometheus metrics
+6. 202-style async polling endpoint exposed via `/api/db/sessions/:sessionKey/xlsx-render`
+
+**Behavior OFF:**
+1. `manifest.finalize()` returns without invoking the renderer
+2. No xlsx_renders rows written
+3. No code-execution bridge calls from xlsx path (MCP-direct calls via `run_python_analysis` still work)
+
+**Templates registered** (in `src/config/xlsxTemplates/index.js`):
+| Template | Type | Phases | Estimated wall time |
+|---|---|---|---|
+| `session-models` | single-turn | 1 | ~180s |
+| `full-deal-workbook` | multi-turn | 5 (phase1-5) | ~530s |
+| `lbo-focused` | multi-turn | 4 | ~400s |
+| `valuation-only` | multi-turn | 4 | ~350s |
+| `tax-memo-workbook` | multi-turn | 4 | ~400s |
+
+**Files:**
+- `src/utils/xlsxRenderer/index.js` — single-turn entry point (line 208 invokes bridge)
+- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` — multi-turn phase dispatcher (line 105 invokes bridge per phase)
+- `src/config/xlsxTemplates/` — 5 template definitions
+- `src/db/postgres.js` — `xlsx_renders` table + 4 generated columns
+- `migrations/017_xlsx-renders.{up,down}.sql`, `migrations/018_xlsx-renders-generated-columns.{up,down}.sql`
+
+**Cost:** Each render = N × bridge calls (N = phase count). Multi-turn full-deal-workbook ≈ 5 bridge calls = 5 × Anthropic API + Files API charges. Persistence cost negligible.
+
+**Rollback:** Flip `XLSX_RENDERER=false` in prod env. In-flight renders complete (already in async pipeline); new renders skip the post-processor. No data loss.
+
+**Interaction with flag #42**: when `STRUCTURED_OUTPUT_ENFORCEMENT=true` is also flipped, all xlsx renders use API-level envelope enforcement on turn 1. Independent flags but commonly flipped together (PR #135 + this flag share the production-readiness boundary).
+
+---
+
+### 42. STRUCTURED_OUTPUT_ENFORCEMENT
+
+| Attribute | Value |
+|-----------|-------|
+| **Default** | `false` |
+| **Type** | boolean |
+| **Category** | API / Bridge envelope enforcement |
+| **Purpose** | Enables Anthropic `output_config: { format: { type: 'json_schema', schema: {...} } }` enforcement on the code-execution bridge's text-block output |
+| **GitHub** | PR #135 (Avenue A v2 — structured output), PR #138 (observability v2), PR #139 (real-bug follow-up), PR #140 (operator-runbook readiness — this doc registration) |
+| **Documentation debt** | Flag has existed in production code since PR #135 (2026-05-12) but was never registered in this doc until v4.2 (2026-05-16) |
+| **Prerequisites for production flip** | **ALL ITEMS MUST BE ✅ BEFORE FLIPPING.** See **Pre-flip checklist** below. |
+
+**Behavior ON:**
+1. Bridge passes `output_config: { format: { type: 'json_schema', schema: ENVELOPE_SCHEMA_XLSX or ENVELOPE_SCHEMA_GENERAL } }` to the Anthropic Messages API on both initial and `pause_turn` continuation API calls
+2. `extractResults()` reads `response.parsed_output` (SDK auto-parsed envelope) → falls back to final text block JSON parse → falls back to existing stdout extraction
+3. `selectEnvelopeWithFallback()` picks the winning extraction path; merges audit-from-text with b64-from-stdout for xlsx callers (Option A schema is audit-only to avoid b64-in-text max_tokens cliff)
+4. Per-call `envelope_source` set on `finalResult` ∈ {`parsed_output`, `text`, `stdout`, `none`, `merged:parsed_output+stdout`, `merged:text+stdout`}
+5. Emits to Prometheus counter `claude_code_bridge_envelope_outcome_total` + Cloud Logging event `envelope_decision` + OTel span attribute + persisted to `code_executions.envelope_source` DB column
+
+**Behavior OFF:**
+1. No `output_config` parameter passed to API
+2. Bridge uses prompt-level envelope instructions + corrective-retry path (existing pre-PR-135 behavior)
+3. `extractResults()` reads stdout only
+4. `envelope_source` set to `'stdout'` or `'none'` depending on whether b64 envelope is parseable from stdout
+
+**Target efficacy** (per PR #135 plan — these are PLAN TARGETS, not yet validated by measured fleet data):
+- Pre-flag-flip baseline (PR #134 L4 logs, MEASURED): ~80% of phases need turn-2 corrective retry
+- Post-flag-flip TARGETS (PR #135 plan):
+  - ≥95% turn-1 envelope success
+  - ~50% wall-time reduction on multi-turn renders
+  - ~50% LLM token cost reduction per successful render
+- Validation status: PR #135 L4 paired comparison (N=2) confirmed Option A correctness (5/5 phases delivered, audit_status=PASS) but DID NOT statistically validate the ~50% figures — see pre-flip prerequisite #6 ("L4 paired comparison data N≥10 collected showing turn-1 success rate ≥ baseline"). Statistical confidence for the ~50% claims requires the staging soak data per prerequisite #5.
+- **Actual measured efficacy will be filled in here after Stage 3 7-day observation completes.**
+
+### Pre-flip checklist (BLOCKING)
+
+| # | Prerequisite | Status | Owner | Verified by |
+|---|---|---|---|---|
+| 1 | Prometheus alerts wired on `claude_code_bridge_envelope_outcome_total` + `claude_xlsx_render_turn1_envelope_success_total` (6 alerts total) | ⬜ | Operations | Merged via PR #140 (this PR) — `prometheus/alerts.yml` |
+| 2 | Envelope-decision debug playbook published | ⬜ | Operations | `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140) |
+| 3 | Cohort rollout plan + rollback drill published | ⬜ | Operations | `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140) |
+| 4 | Grafana dashboard panels live | ⬜ | Operations | `grafana/claude-sdk-dashboard.json` (PR #140) |
+| 5 | Staging soak ≥ 3 days at `STRUCTURED_OUTPUT_ENFORCEMENT=true` with all 6 alerts silent | ⬜ | Operations | Staging Prometheus query log |
+| 6 | L4 paired comparison data N≥10 collected showing turn-1 success rate ≥ baseline | ⬜ | Operations | Staging Prometheus query log |
+| 7 | PR #138 + PR #139 deployed to production for ≥ 7 days at flag=false (baselines established) | ⬜ | Deployment | Production deploy logs + counter baseline values |
+| 8 | On-call rotation aware (announcement posted, runbook URLs in channel topic) | ⬜ | Operations | Team channel announcement timestamp |
+
+Until all 8 items are ✅, **DO NOT flip `STRUCTURED_OUTPUT_ENFORCEMENT=true` in production.** Staging is fair game for soak testing (item #5).
+
+### Rollback (one-line)
+
+```bash
+# In production CONTAINER_ENV:
+STRUCTURED_OUTPUT_ENFORCEMENT=false
+# Deploy → 30-min observation. With the flag off only "stdout" and "none" are emitted, so envelope_source="text", "parsed_output", and "merged:*" rates should drop to 0 within 5 min.
+```
+
+See `docs/runbooks/envelope-decision-debug-playbook.md` §4 "Rollback procedure" for full procedure including verification queries.
+
+**Files:**
+- `src/tools/codeExecutionBridge.js` — schemas, output_config injection, selectEnvelopeWithFallback, sdkLogger emission, recordCodeBridgeEnvelope counter, OTel span attrs
+- `src/utils/sdkMetrics.js` — `codeBridgeEnvelopeOutcome` Counter (PR #138) + `xlsxRenderTurn1EnvelopeSuccess` Counter (PR #136)
+- `src/utils/hookDBBridge.js` — INSERT extended to persist `envelope_source` into the `code_executions.envelope_source` column (bind parameter `$25`)
+- `src/middleware/accessAudit.js` — getEventData callback for opportunistic event_data enrichment
+- `src/server/adminRouter.js` — code_execution_json endpoint enriches access_log.event_data with `{execution_id, envelope_source, success}`
+- `src/utils/xlsxRenderer/multiTurnOrchestrator.js` + `xlsxRenderer/index.js` + `tools/toolImplementations.js` — caller_category propagation
+- `migrations/019_code-executions-envelope-source.{up,down}.sql`, `migrations/020_access-log-event-data.{up,down}.sql`
+- `prometheus/alerts.yml` — 6 alerts (PR #140)
+- `grafana/claude-sdk-dashboard.json` — 5 panels (PR #140)
+- `docs/runbooks/envelope-decision-debug-playbook.md` (PR #140), `docs/runbooks/structured-output-enforcement-rollout.md` (PR #140)
+
+**Cost impact:**
+- Flag OFF (MEASURED, current production state): zero direct cost; ~80% of bridge calls incur turn-2 corrective retry cost (per PR #134 L4 logs)
+- Flag ON (TARGET, not yet validated): zero direct API cost increase (`output_config` is an existing Anthropic API parameter, no surcharge); plan target is ~50% reduction in turn-2 retry cost ≈ ~50% reduction in LLM token spend per successful render. Actual measured impact will be quantified during cohort rollout per `docs/runbooks/structured-output-enforcement-rollout.md` Stage 3 observation window.
+
+**Interaction with flag #41**: independent flag, but commonly flipped together post-staging. `XLSX_RENDERER=true` alone enables xlsx renders with the existing prompt-level + corrective-retry path; adding `STRUCTURED_OUTPUT_ENFORCEMENT=true` adds API-level enforcement on top.
+
+---
+
 ## Dead Code Flags
 
 These are exported from `featureFlags.js` but never consumed at runtime:
diff --git a/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md
new file mode 100644
index 000000000..248aeeefc
--- /dev/null
+++ b/super-legal-mcp-refactored/docs/runbooks/envelope-decision-debug-playbook.md
@@ -0,0 +1,429 @@
+# Envelope-Decision Debug Playbook — On-Call Reference
+
+**Version**: 1.0 (PR #140, 2026-05-16)
+**Audience**: On-call engineers + ops + platform team
+**Triggered by**: any of 6 alerts in `prometheus/alerts.yml` (anchors below)
+**Related**: [feature-flags.md §42](../feature-flags.md#42-structured_output_enforcement), [structured-output-enforcement-rollout.md](structured-output-enforcement-rollout.md), [anthropic-sdk-best-practices-research.md §13](../code-execution-enhancements/anthropic-sdk-best-practices-research.md#§13)
+
+---
+
+## §1 — Triage decision tree
+
+```
+Alert fired?
+│
+├── Critical severity ("CodeBridgeEnvelopeNoneCritical")
+│   └── §2 alert-2 procedure → consider ROLLBACK per §4
+│
+├── Warning + flag=STRUCTURED_OUTPUT_ENFORCEMENT
+│   ├── 3+ flag-tagged alerts firing concurrently → ROLLBACK per §4
+│   └── Single flag-tagged alert → §2 per-alert procedure (no immediate rollback)
+│
+├── Warning, no flag tag (e.g., CodeBridgeUnknownCallerCategoryEmitting)
+│   └── §2 per-alert procedure → file engineering ticket for next sprint
+│
+└── No alert but envelope_source distribution looks suspicious
+    └── §3 cross-surface query patterns for self-diagnosis
+```
+
+---
+
+## §2 — Per-alert response procedures
+
+### Alert 1 — `CodeBridgeEnvelopeStdoutFallbackHigh`
+
+**Anchor**: `alert-1-stdout-fallback-high`
+**Severity**: warning
+**TTL**: 30m
+**Trigger**: `envelope_source=stdout` rate > 80% over 1h **when `STRUCTURED_OUTPUT_ENFORCEMENT=on`**
+
+**What it means**: Text-channel enforcement is silently failing. The bridge is falling through to the legacy stdout path despite the flag being on. This defeats the whole point of Avenue A v2.
+
+**Likely causes** (in order of probability):
+1. Anthropic API rejecting the `output_config` schema and silently degrading to plain Messages API (no JSON enforcement)
+2. Prompt regression causing model to ignore the schema and emit envelope to stdout instead of text
+3. Feature flag mis-read at request time (e.g., `featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT` evaluates `false` despite env var being `true`)
+4. Schema validation rejecting valid envelopes (e.g., new field added to envelope without schema update)
+
+**Investigation steps**:
+
+```bash
+# 1. Confirm flag is actually on in the running container
+kubectl exec -n super-legal <pod> -- env | grep STRUCTURED_OUTPUT_ENFORCEMENT
+# Expect: STRUCTURED_OUTPUT_ENFORCEMENT=true
+
+# 2. Cloud Logging — find the smoking gun (text=missing, stdout=present, flag=on)
+gcloud logging read 'jsonPayload.event="envelope_decision"
+  AND jsonPayload.structured_output_enabled=true
+  AND jsonPayload.envelope_source="stdout"' --limit=20 --format=json
+# Each result shows xlsx_mode + text_envelope_present (should be FALSE) +
+# stdout_envelope_present (should be TRUE)
+
+# 3. SQL — distribution by hour for trend
+psql -c "
+SELECT date_trunc('hour', created_at) AS hr,
+       envelope_source,
+       COUNT(*) AS n
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '24h'
+GROUP BY 1, 2
+ORDER BY 1 DESC, 2;
+"
+# Expected post-flag-flip: envelope_source='text' OR 'parsed_output' should dominate
+# Anomaly: 'stdout' suddenly dominant
+
+# 4. Check Anthropic API error logs for 400 responses
+gcloud logging read 'severity>=WARNING AND jsonPayload.message=~"400.*(output_config|schema)"' --limit=10
+```
+
+**Actions**:
+- If cause #1 (API rejection): file Anthropic support ticket with request_id; consider rolling back flag temporarily
+- If cause #2 (prompt regression): revert recent bridge prompt PRs; document in §6
+- If cause #3 (flag mis-read): check container env injection in deploy config
+- If cause #4 (schema rejection): inspect `ENVELOPE_SCHEMA_XLSX` / `ENVELOPE_SCHEMA_GENERAL` constants for recent changes
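+
+A minimal PromQL sketch for checking the current stdout-fallback ratio against the 80% trigger. It assumes the counter carries the `envelope_source` and `structured_output` labels used by the queries elsewhere in this playbook; the authoritative expression is the rule in `prometheus/alerts.yml`, which may differ.
+
+```promql
+# Share of flag-on bridge calls that fell back to stdout over the last hour (assumed label set)
+sum(rate(claude_code_bridge_envelope_outcome_total{structured_output="on", envelope_source="stdout"}[1h]))
+ / sum(rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[1h]))
+# A result above 0.80 matches the trigger condition described for this alert
+```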
+
+---
+
+### Alert 2 — `CodeBridgeEnvelopeNoneCritical`
+
+**Anchor**: `alert-2-envelope-none-critical`
+**Severity**: **critical**
+**TTL**: 15m
+**Trigger**: `envelope_source=none` rate > 5% over 1h
+
+**What it means**: Bridge is producing no extractable envelope from EITHER text OR stdout. Downstream consumers (xlsx renderer, MCP gateway, agents) receive null data. **Silent data loss in progress.**
+
+**Likely causes** (in order of probability):
+1. Anthropic sandbox failure (containers crashing, API returning empty responses)
+2. Model refusing every request (`stop_reason='refusal'` dominates)
+3. Container reuse pattern hitting state limit (PR #125 era issue, may reoccur with PTC)
+4. Network/timeout issue cutting responses before envelope emission
+
+**Investigation steps**:
+
+```bash
+# 1. Per-stop-reason distribution
+psql -c "
+SELECT stop_reason, COUNT(*) AS n,
+       ROUND(100.0*COUNT(*)/SUM(COUNT(*)) OVER (), 2) AS pct
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '1h'
+GROUP BY 1 ORDER BY 2 DESC;
+"
+# Expected: end_turn dominates (>90%)
+# Anomaly: refusal > 5%, or stop_reason=NULL > 5%, or max_tokens > 5%
+
+# 2. Per-container failure clustering
+psql -c "
+SELECT container_id, COUNT(*) AS n,
+       COUNT(*) FILTER (WHERE envelope_source = 'none') AS failures
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '1h'
+GROUP BY 1 HAVING COUNT(*) > 5 ORDER BY failures DESC LIMIT 10;
+"
+# Anomaly: a few container_ids account for most failures (state leak)
+
+# 3. Cloud Logging — bridge error traces
+gcloud logging read 'jsonPayload.event=~"code_execution_failure|envelope_decision"
+  jsonPayload.envelope_source="none"' --limit=20 --format=json
+```
+
+**Actions**:
+- **If 3+ envelope alerts firing concurrently OR this alert sustained > 15m → ROLLBACK per §4** (cohort plan rollback trigger)
+- If sandbox failure: file Anthropic support; consider temporary `STRUCTURED_OUTPUT_ENFORCEMENT=false` while investigating
+- If refusal spike: inspect recent prompt changes for refusal-triggering content
+- If container state leak: confirm PR #125 fix is deployed (`container_id` not reused across requests)
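+
+To see which caller paths are losing data, the `none` share can be broken down per caller. This is a sketch that assumes the `caller_category` and `envelope_source` labels shown in §3; it is not the alert's own rule expression.
+
+```promql
+# Fraction of calls per caller_category that produced no extractable envelope (last 1h)
+sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total{envelope_source="none"}[1h]))
+ / sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total[1h]))
+```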
+
+---
+
+### Alert 3 — `CodeBridgeTurn1SuccessLow`
+
+**Anchor**: `alert-3-turn1-success-low`
+**Severity**: warning
+**TTL**: 24h
+**Trigger**: turn-1 success rate < 70% for any caller_category over 24h
+
+**What it means**: Specific caller path is failing on turn 1 too often. Forces turn-2 corrective retry, doubling wall time + token cost.
+
+**Likely causes**:
+1. Schema rejection for that caller's data shape (e.g., new envelope field not in schema)
+2. Prompt-level instructions not being followed by model for that caller's task style
+3. Per-caller-specific data complexity (large prompts truncated mid-envelope)
+
+**Investigation steps**:
+
+```bash
+# 1. Per-caller-category baseline comparison
+# (Compare against pre-flag-flip baseline established per §42 prerequisite #7)
+# PromQL:
+sum by (caller_category, turn_outcome) (
+  rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[24h])
+)
+
+# 2. SQL — find specific failed phases
+psql -c "
+SELECT id, agent_type, envelope_source, turn_count, stop_reason, created_at
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '24h'
+  AND turn_count > 1
+ORDER BY created_at DESC LIMIT 50;
+"
+
+# 3. Per-affected-caller examine recent prompt/data
+gcloud logging read 'jsonPayload.event="envelope_decision"
+  AND jsonPayload.text_envelope_present=false' \
+  --freshness=1h --limit=20
+```
+
+**Actions**:
+- If a specific caller_category dominates: investigate that caller's spec composition
+- If across-the-board: investigate model regression (Sonnet 4.6 update? prompt cache invalidation?)
+- File ticket — not immediate rollback unless combined with other alerts (see §1)
+
+---
+
+### Alert 4 — `CodeBridgeUnknownCallerCategoryEmitting`
+
+**Anchor**: `alert-4-unknown-caller`
+**Severity**: warning (data hygiene)
+**TTL**: 15m
+**Trigger**: `caller_category=unknown` series emitting at all
+
+**What it means**: A new bridge invocation site was added without passing `callerCategory`. Defeats per-caller observability.
+
+**Investigation steps**:
+
+```bash
+# Grep for runPythonAnalysis call sites that DON'T pass callerCategory
+grep -rn "runPythonAnalysis(" src/ | grep -v "callerCategory"
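+# Expected: no output. The 3 known sites (multiTurnOrchestrator:97, xlsxRenderer/index:209, toolImplementations:962) pass callerCategory and are filtered out by grep -v, provided the argument sits on the same line as the call
+# Anomaly: any line in the results (a call site, possibly in a new file, that does not pass callerCategory)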
+```
+
+**Actions**:
+- Identify the new call site
+- Patch to add `callerCategory: '<new_category>'` (must match feature-flags.md §42 enum)
+- If a new caller category is needed (not one of the existing 5), update both the counter help text in `sdkMetrics.js` AND the feature-flags.md enum docs
+
+---
+
+### Alert 5 — `XlsxRenderTurn1RegressionByTemplate`
+
+**Anchor**: `alert-5-template-regression`
+**Severity**: warning
+**TTL**: 24h
+**Trigger**: per-template turn-1 success rate < 70% over 24h
+
+**What it means**: A specific xlsx template is regressing. Compare to pre-flag-flip baseline.
+
+**Investigation steps**:
+
+```promql
+# 1. PromQL — compare on vs off for the affected template
+sum by (structured_output) (
+  rate(claude_xlsx_render_turn1_envelope_success_total{
+    template_id="<affected_template>", turn_outcome="first_turn"
+  }[24h])
+) / sum by (structured_output) (
+  rate(claude_xlsx_render_turn1_envelope_success_total{
+    template_id="<affected_template>"
+  }[24h])
+)
+# If structured_output="on" rate < structured_output="off" rate: structured output enforcement is the regression cause for this template
+```
+
+```sql
+-- 2. SQL — recent failed renders for the template
+SELECT id, render_status, audit_status, sheet_count, warnings_count, created_at
+FROM xlsx_renders
+WHERE template_id = '<affected>'
+  AND created_at > NOW() - INTERVAL '24h'
+  AND (render_status = 'failed' OR audit_status = 'FAIL')
+ORDER BY created_at DESC LIMIT 20;
+```
+
+**Actions**:
+- Inspect recent template changes (`src/config/xlsxTemplates/<template>.js`)
+- Inspect schema variants in `ENVELOPE_SCHEMA_XLSX` for mismatch with template's audit_results shape
+- Consider a narrower rollback first (flip `STRUCTURED_OUTPUT_ENFORCEMENT=false` for the affected client only, if the deployment allows it) before a fleet-wide rollback
+
+---
+
+### Alert 6 — `XlsxRenderTurn1RegressionByPhase`
+
+**Anchor**: `alert-6-phase-regression`
+**Severity**: warning
+**TTL**: 24h
+**Trigger**: per-template + per-phase turn-1 success rate < 65% over 24h
+
+**What it means**: A specific PHASE within a template is regressing. Often more diagnostic than template-level alert because it pinpoints the data shape issue.
+
+**Investigation steps**:
+
+```sql
+-- 1. envelope_source distribution for the affected phase
+SELECT envelope_source, COUNT(*) AS n
+FROM code_executions
+WHERE created_at > NOW() - INTERVAL '24h'
+  AND agent_type LIKE '%<phase>%'  -- e.g., '%phase3%'
+GROUP BY 1 ORDER BY 2 DESC;
+```
+
+```bash
+# 2. Cloud Logging — phase-specific decision context
+gcloud logging read 'jsonPayload.event="envelope_decision"
+  AND jsonPayload.xlsx_mode=true' \
+  --freshness=24h --limit=50 --format=json \
+  | jq '.[] | select(.jsonPayload.text_envelope_present == false)'
+```
+
+**Actions**:
+- Phase-3 LBO is the highest-risk phase per L4 logs; expected to be first to trip
+- If phase3: inspect LBO sheet composition logic in `xlsxTemplates/full-deal-workbook.js` for envelope shape changes
+- Otherwise: same template-level investigation as Alert 5
+
+---
+
+## §3 — Cross-surface query patterns (quick reference)
+
+Reformatted from [`anthropic-sdk-best-practices-research.md §13`](../code-execution-enhancements/anthropic-sdk-best-practices-research.md#§13) for on-call use.
+
+### Prometheus (fleet aggregates, time-series)
+
+```promql
+# Turn-1 success rate fleet-wide
+sum(rate(claude_code_bridge_envelope_outcome_total{turn_outcome="first_turn"}[1h]))
+ / sum(rate(claude_code_bridge_envelope_outcome_total[1h]))
+
+# Per-caller-category extraction-path mix
+sum by (caller_category, envelope_source) (
+  rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[1h])
+)
+
+# Per-template per-phase turn-1 success (PR #136 counter)
+sum by (template_id, phase) (
+  rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome="first_turn"}[1h])
+) / sum by (template_id, phase) (
+  rate(claude_xlsx_render_turn1_envelope_success_total[1h])
+)
+```
+
+### PostgreSQL (per-execution forensics, indexed)
+
+```sql
+-- All stdout-fallback executions in last hour (indexed via idx_code_exec_envelope_source)
+SELECT id, session_id, agent_type, envelope_source, stop_reason, created_at
+FROM code_executions
+WHERE envelope_source = 'stdout' AND created_at > NOW() - INTERVAL '1h';
+
+-- Distribution of envelope_source values for a session
+SELECT envelope_source, COUNT(*) FROM code_executions
+WHERE session_id = $1 GROUP BY 1;
+
+-- Compliance JOIN: which extraction path produced the artifact user X accessed?
+SELECT a.requester, a.created_at, a.event_data->>'envelope_source' AS path,
+       ce.id, ce.agent_type
+FROM access_log a
+JOIN code_executions ce ON ce.id = (a.event_data->>'execution_id')::uuid
+WHERE a.requester = $1 AND a.resource_type = 'code_execution_json'
+ORDER BY a.created_at DESC LIMIT 50;
+
+-- GIN-indexed containment query (any access of fallback-extracted artifacts)
+SELECT * FROM access_log
+WHERE event_data @> '{"envelope_source": "stdout"}'
+  AND created_at > NOW() - INTERVAL '7d';
+```
+
+### Cloud Logging (structured event search)
+
+```
+# All envelope decisions in a specific session
+jsonPayload.event="envelope_decision"
+jsonPayload.session_id="2026-05-16-XXXX"
+
+# Smoking gun for Alert 1 (text path failing despite flag=on)
+jsonPayload.event="envelope_decision"
+jsonPayload.structured_output_enabled=true
+jsonPayload.envelope_source="stdout"
+jsonPayload.text_envelope_present=false
+
+# All none-source decisions (Alert 2 context)
+jsonPayload.event="envelope_decision"
+jsonPayload.envelope_source="none"
+```
+
+### Cloud Trace (span attribute filter)
+
+```
+# Bridge spans filtered by extraction path
+span.name="code_execution_bridge.runPythonAnalysis"
+attributes.envelope_source="stdout"
+
+# Bridge spans for a specific caller
+span.name="code_execution_bridge.runPythonAnalysis"
+attributes.caller_category="xlsx_multi_turn"
+```
+
+---
+
+## §4 — Rollback procedure
+
+### Trigger conditions (any one fires → execute):
+- `CodeBridgeEnvelopeNoneCritical` sustained > 15m
+- 3 of 6 envelope alerts firing concurrently
+- User-reported envelope corruption or "render failed" complaints exceed pre-flip baseline by 2×
+- `audit_status=FAIL` rate on `xlsx_renders` exceeds pre-flip baseline by 50%
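+
+The "3 of 6" condition can be checked in Prometheus instead of by eye in Alertmanager. A minimal sketch, assuming the six alert names from `prometheus/alerts.yml` and Prometheus's built-in `ALERTS` series. The inner `count by (alertname)` collapses multiple firing instances of one alert (for example, several templates tripping Alert 5) so that each alert counts once:
+
+```promql
+# Number of distinct envelope alerts currently firing; the trigger is met at >= 3
+count(
+  count by (alertname) (
+    ALERTS{
+      alertstate="firing",
+      alertname=~"CodeBridge(EnvelopeStdoutFallbackHigh|EnvelopeNoneCritical|Turn1SuccessLow|UnknownCallerCategoryEmitting)|XlsxRenderTurn1RegressionBy(Template|Phase)"
+    }
+  )
+)
+```
+
+An empty result means none of the six alerts is firing.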
+
+### Execution (target: < 5 minutes total)
+
+```bash
+# 1. Flip flag off in production CONTAINER_ENV (per-client or fleet-wide)
+# Edit deployment config, set:
+STRUCTURED_OUTPUT_ENFORCEMENT=false
+# Redeploy (image is unchanged; only env injection differs)
+
+# 2. Verify deploy succeeded
+kubectl get pods -n super-legal -o wide
+kubectl exec -n super-legal <pod> -- env | grep STRUCTURED_OUTPUT_ENFORCEMENT
+# Expect: STRUCTURED_OUTPUT_ENFORCEMENT=false
+
+# 3. Wait 5 min, then verify envelope_source=text rate drops to 0
+# PromQL:
+sum(rate(claude_code_bridge_envelope_outcome_total{envelope_source=~"text|parsed_output|merged:.*"}[5m]))
+# Expected post-rollback: → 0 within 5 min (no text-path emissions when flag=off)
+
+# 4. Confirm alerts clear within their TTLs
+# CodeBridgeEnvelopeNoneCritical: 15m
+# CodeBridgeEnvelopeStdoutFallbackHigh: 30m (will fire briefly during transition then resolve)
+# CodeBridgeTurn1SuccessLow: 24h dwell — won't immediately clear, mark as known
+```
+
+### Post-rollback follow-up
+
+- Document the incident in §6 (known false-positive patterns)
+- File engineering ticket with full timeline + queries used
+- Do NOT re-attempt flag flip until root cause is fixed AND staging soak completes (3 days minimum)
+
+---
+
+## §5 — Escalation contacts
+
+> **TODO** (team to fill in): Slack channel handles, on-call rotation tool, escalation matrix.
+
+Placeholder structure:
+- **Primary on-call**: <team_channel> + PagerDuty rotation `super-legal-platform`
+- **Anthropic API outage**: support ticket via `console.anthropic.com` (mention `request_id`)
+- **Customer-impact escalation**: <client-success channel>
+- **CTO escalation**: critical incidents only, > 1h sustained customer impact
+
+---
+
+## §6 — Known false-positive patterns
+
+> Populated as on-call team encounters and documents real false-positives. Seeded empty by PR #140.
+
+| Date | Alert | Pattern | Resolution |
+|---|---|---|---|
+| _TBD_ | _TBD_ | _TBD_ | _TBD_ |
+
+When adding entries: include alert name, what triggered it, why it was a false positive, what threshold/expr change (if any) was made.
diff --git a/super-legal-mcp-refactored/docs/runbooks/structured-output-enforcement-rollout.md b/super-legal-mcp-refactored/docs/runbooks/structured-output-enforcement-rollout.md
new file mode 100644
index 000000000..93377bd2f
--- /dev/null
+++ b/super-legal-mcp-refactored/docs/runbooks/structured-output-enforcement-rollout.md
@@ -0,0 +1,245 @@
+# `STRUCTURED_OUTPUT_ENFORCEMENT=true` Cohort Rollout Plan
+
+**Version**: 1.0 (PR #140, 2026-05-16)
+**Audience**: Deployment + ops + on-call team
+**Related**: [feature-flags.md §42](../feature-flags.md#42-structured_output_enforcement), [envelope-decision-debug-playbook.md](envelope-decision-debug-playbook.md)
+**Status**: Pre-flip — waiting on §1 prerequisites
+
+---
+
+## Why this plan exists
+
+`STRUCTURED_OUTPUT_ENFORCEMENT=true` flips on Anthropic-API-level JSON-schema enforcement for code-execution-bridge envelopes. This is a behavior-changing flag affecting every bridge call across xlsx renders (4 templates × N phases each), MCP gateway invocations, and Agent SDK subagent code-execution paths.
+
+Engineering surface is complete (PRs #135 → #139); observability surface is complete (PR #140). What remains is the **operational discipline** to flip the flag safely:
+
+- A 3-stage cohort plan with explicit dwell times and pass/fail criteria
+- A rehearsed rollback drill (staging) before any production flip
+- Documented rollback triggers and ownership
+
+---
+
+## §1 — Pre-flip checklist (BLOCKING)
+
+All 8 items MUST be ✅ before entering Stage 1.
+
+| # | Prerequisite | How to verify | Status |
+|---|---|---|---|
+| 1 | Prometheus alerts wired (6 alerts in `prometheus/alerts.yml`) | `promtool check rules prometheus/alerts.yml` returns 0 errors; alerts visible in Alertmanager UI | ⬜ |
+| 2 | Envelope-decision debug playbook published | `docs/runbooks/envelope-decision-debug-playbook.md` exists on main + on-call channel topic links to it | ⬜ |
+| 3 | This rollout plan published | `docs/runbooks/structured-output-enforcement-rollout.md` exists on main (you are reading it) | ⬜ |
+| 4 | Grafana dashboard panels live | `grafana/claude-sdk-dashboard.json` 5 new panels imported; panels render with data when flag=false in staging | ⬜ |
+| 5 | Staging soak ≥ 3 days at `STRUCTURED_OUTPUT_ENFORCEMENT=true` with all 6 alerts silent | Staging Prometheus query log shows zero alert fires over 72h | ⬜ |
+| 6 | L4 paired comparison data N≥10 collected | Staging Prometheus shows turn-1 success rate (structured_output=on) ≥ baseline (structured_output=off) across N≥10 paired sessions | ⬜ |
+| 7 | PR #138 + PR #139 deployed to production for ≥ 7 days at flag=false (baselines established) | Production Prometheus has 7+ days of `claude_code_bridge_envelope_outcome_total{structured_output="off"}` data | ⬜ |
+| 8 | On-call rotation aware (announcement posted, runbook URLs in channel topic) | Slack channel `#super-legal-ops` topic includes runbook URL; announcement message timestamped within 7 days of Stage 1 entry | ⬜ |
+
+**Sign-off** (who marks each item ✅):
+- #1, #3, #4, #5, #6: Operations
+- #2: Engineering + Operations (engineering writes, ops verifies on-call accessibility)
+- #7: Deployment
+- #8: Ops manager
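+
+Item 5 ("zero alert fires over 72h") can be verified with a single query against the staging Prometheus. A sketch, assuming the built-in `ALERTS` series is retained for at least 72h in staging and that the alert names match `prometheus/alerts.yml`:
+
+```promql
+# Samples recorded in the firing state over the last 72h, per alert.
+# Any row returned means that alert fired during the soak and item 5 is not met.
+sum by (alertname) (
+  count_over_time(
+    ALERTS{
+      alertstate="firing",
+      alertname=~"CodeBridge(EnvelopeStdoutFallbackHigh|EnvelopeNoneCritical|Turn1SuccessLow|UnknownCallerCategoryEmitting)|XlsxRenderTurn1RegressionBy(Template|Phase)"
+    }[72h]
+  )
+)
+```
+
+An empty result over the full 72h window is the passing state.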
+
+---
+
+## §2 — Rollback drill (REHEARSE IN STAGING BEFORE STAGE 1)
+
+This drill must run end-to-end successfully in staging before Stage 1 or any other production flip.
+
+### Drill procedure
+
+```bash
+# Step 1: enable flag in staging
+# Edit staging CONTAINER_ENV:
+STRUCTURED_OUTPUT_ENFORCEMENT=true
+# Deploy to staging
+# Wait 5 min for first bridge calls to land with new flag
+
+# Step 2: confirm flag is active
+kubectl exec -n super-legal-staging <pod> -- env | grep STRUCTURED_OUTPUT_ENFORCEMENT
+# Expect: STRUCTURED_OUTPUT_ENFORCEMENT=true
+
+# Step 3: confirm text-path is being chosen (envelope_source=text|parsed_output)
+# Wait 15 min for traffic, then:
+# PromQL:
+sum by (envelope_source) (
+  rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[15m])
+)
+# Expect: text or parsed_output > 0; stdout may also be > 0 for xlsx (merge path)
+
+# Step 4: execute rollback (per envelope-decision-debug-playbook.md §4)
+# Edit staging CONTAINER_ENV:
+STRUCTURED_OUTPUT_ENFORCEMENT=false
+# Deploy to staging
+# Wait 5 min
+
+# Step 5: confirm rollback effective
+sum by (envelope_source) (
+  rate(claude_code_bridge_envelope_outcome_total[5m])
+)
+# Expect: envelope_source=text|parsed_output|merged:* rates → 0 within 5 min
+#         envelope_source=stdout|none dominate (legacy paths)
+
+# Step 6: confirm alerts cleared
+# Alertmanager UI: no firing alerts in staging within 30 min
+```
+
+### Drill pass criteria
+
+- Steps 1-6 all complete without manual intervention beyond `kubectl` / deploy
+- Step 5 verification PromQL returns expected values within 5 min of flag flip
+- No customer-impact incidents during the drill
+
+### Drill fail → halt
+
+If the drill fails to roll back cleanly:
+- File engineering ticket
+- Do NOT enter Stage 1 until the rollback mechanism is debugged
+- Most common cause: flag is read at module load time instead of per-request; verify `featureFlags.STRUCTURED_OUTPUT_ENFORCEMENT` is read inside `runPythonAnalysis` (it is per PR #138 design, but verify in source)
+
+---
+
+## §3 — Stage 1: Single-client canary (24h dwell)
+
+### Cohort selection
+
+Pick the client meeting ALL criteria:
+- Avg renders/day < 5 over last 7 days
+- No active deal-close in next 48h (check with client-success)
+- Customer aware of canary status (advance notice email; opt-in preferred)
+
+### Procedure
+
+```bash
+# 1. Identify target client deployment
+gcloud run services list --region=us-east1 | grep super-legal-<client>
+
+# 2. Flip flag in that client's CONTAINER_ENV ONLY
+gcloud run services update super-legal-<client> \
+  --region=us-east1 \
+  --update-env-vars=STRUCTURED_OUTPUT_ENFORCEMENT=true
+
+# 3. Verify deploy
+gcloud run services describe super-legal-<client> \
+  --region=us-east1 --format="value(spec.template.spec.containers[0].env)" | grep STRUCTURED
+
+# 4. Begin 24h observation window
+```
+
+### Pass criteria (all must hold for 24h)
+
+- Zero critical alerts (`CodeBridgeEnvelopeNoneCritical` never fires)
+- ≥ 50 bridge calls observed in the 24h window (sufficient sample)
+- Turn-1 success rate no more than 5% below the pre-flip 7-day baseline (per §1 #7)
+- Zero customer-reported issues
+- `audit_status=FAIL` rate on `xlsx_renders` for this client ≤ pre-flip baseline
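+
+The turn-1 comparison against the pre-flip baseline can be sketched as two queries run at the end of the 24h window. This assumes they run against a Prometheus scope that already isolates the canary client; the source does not name a per-client label, so any selector added for that purpose is deployment-specific. The `offset 24h` skips the observation window so the baseline covers only the preceding 7 days at flag=off:
+
+```promql
+# Canary turn-1 success over the 24h observation window (flag on)
+sum(rate(claude_code_bridge_envelope_outcome_total{turn_outcome="first_turn", structured_output="on"}[24h]))
+  / sum(rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[24h]))
+
+# Pre-flip baseline: the 7 days of flag=off data that ended when the window began
+sum(rate(claude_code_bridge_envelope_outcome_total{turn_outcome="first_turn", structured_output="off"}[7d] offset 24h))
+  / sum(rate(claude_code_bridge_envelope_outcome_total{structured_output="off"}[7d] offset 24h))
+```
+
+Record both values in the §6 stage log so the pass/fail decision is auditable.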
+
+### Fail criteria (any one → ROLLBACK Stage 1)
+
+- `CodeBridgeEnvelopeNoneCritical` fires
+- 3+ envelope alerts fire concurrently
+- Customer reports envelope corruption or render failure
+- Turn-1 success rate degrades > 5% vs baseline
+
+### Stage 1 rollback (per-client)
+
+```bash
+gcloud run services update super-legal-<client> \
+  --region=us-east1 \
+  --update-env-vars=STRUCTURED_OUTPUT_ENFORCEMENT=false
+# Then follow envelope-decision-debug-playbook.md §4 verification steps
+```
+
+---
+
+## §4 — Stage 2: 25% cohort (48h dwell)
+
+### Cohort selection
+
+After Stage 1 passes 24h:
+- Stratified random sample of clients: 25% of fleet, stratified by avg renders/day (quartiles)
+- Exclude any client with active deal-close in next 72h
+- Notify all selected clients (status page update)
+
+### Procedure
+
+Same per-client `gcloud run services update` pattern as Stage 1, applied to each selected client.
+
+### Pass criteria (all must hold for 48h)
+
+- All Stage 1 criteria scaled to the cohort
+- Aggregate per-template turn-1 success rate no more than 3% below baseline
+- No envelope alerts fire across the cohort
+- No new patterns appear in `envelope-decision-debug-playbook.md` §6 false-positives
+
+### Fail criteria
+
+- Any Stage 1 rollback trigger fires for any cohort member
+- 2+ clients show concurrent degradation
+
+### Stage 2 rollback (cohort-wide)
+
+Loop through cohort clients, set `STRUCTURED_OUTPUT_ENFORCEMENT=false` for each. Status page update.
+
+---
+
+## §5 — Stage 3: 100% fleet (7-day observation)
+
+### Procedure
+
+After Stage 2 passes 48h:
+- Flip remaining clients (75% of fleet)
+- Begin 7-day observation window
+- Update `docs/feature-flags.md` §42 status to "Active in production (flipped <DATE>, cohort: 100%)"
+
+### Pass criteria (7-day observation)
+
+- Fleet-wide turn-1 success rate ≥ baseline (target ≥0.85 absolute per §42 efficacy expectation)
+- N≥100 paired-comparison data points showing rate improvement
+- Zero unresolved incidents in `envelope-decision-debug-playbook.md` §6
+- `claude_code_bridge_envelope_outcome_total{envelope_source="text"}` rate steady-state > 0 for all caller categories (proves text-path enforcement is engaging fleet-wide)
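+
+The last criterion is awkward to check with a plain `> 0` filter, because a caller category that never emits `envelope_source="text"` has no series at all and drops out of the result. A sketch that instead lists the categories with traffic but no text-path emissions, using only labels that appear elsewhere in this plan:
+
+```promql
+# Caller categories that are receiving bridge calls...
+(sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total[1h])) > 0)
+unless
+# ...minus those with a non-zero text-path rate
+(sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total{envelope_source="text"}[1h])) > 0)
+```
+
+An empty result means every active caller category is producing text-path envelopes. Any category returned needs investigation before Stage 3 is marked as passed.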
+
+### Long-term monitoring (post-Stage 3)
+
+- Quarterly review of `envelope-decision-debug-playbook.md` §6 false-positives
+- Recalibrate per-template `estimated_seconds` based on observed wall-time improvements (PR #138 plan deferred item)
+- Re-evaluate Avenue B Phase 2 (LBO sheet decomposition) — likely unnecessary per Avenue A v2 efficacy (PR #138 plan hypothesis)
+- After 60 days: consider removing the corrective-retry path in `codeExecutionBridge.js:613-652` (PR #138 plan deferred item)
+
+---
+
+## §6 — Stage entry/exit log (operations to fill in)
+
+Track actual rollout progression here. Initial state: pre-stage (no client flipped).
+
+| Stage | Entered | Exit (pass/fail) | Cohort size | Notes |
+|---|---|---|---|---|
+| Pre-flip checklist (§1) | — | — | — | All 8 items pending verification |
+| Rollback drill (§2) | — | — | staging | — |
+| Stage 1 — canary | — | — | 1 client | — |
+| Stage 2 — 25% | — | — | TBD | — |
+| Stage 3 — 100% | — | — | full fleet | — |
+
+---
+
+## §7 — Aborted-rollout recovery procedure
+
+If any stage fails AND rollback is executed successfully:
+
+1. **Do NOT re-attempt the flag flip immediately.** Root-cause the failure first.
+2. Update `envelope-decision-debug-playbook.md` §6 with the false-positive or real-positive pattern
+3. If real-positive (genuine bug): file engineering ticket; require fix + new PR + ≥ 3-day staging re-soak before re-entering Stage 1
+4. If false-positive (alert too sensitive): tune alert threshold in `prometheus/alerts.yml`; require 1-day staging re-soak before re-entering Stage 1
+5. Repeat from §2 (rollback drill) for each re-entry attempt
+
+---
+
+## §8 — Roles & responsibilities
+
+| Role | Responsibilities |
+|---|---|
+| **Operations lead** | §1 checklist verification, Stage 1/2/3 entry/exit decisions, §6 log maintenance |
+| **On-call engineer** | Alert response per debug playbook, rollback execution if triggered |
+| **Engineering lead** | Root-cause analysis for any failures, fix PRs, code review of rollout-plan revisions |
+| **Client success** | Pre-flip notification to canary client, post-flip status page updates, customer escalation routing |
+| **Deployment** | `gcloud run services update` execution, deploy verification queries |
diff --git a/super-legal-mcp-refactored/flags.env b/super-legal-mcp-refactored/flags.env
index 3484eee6b..6479ccd1e 100644
--- a/super-legal-mcp-refactored/flags.env
+++ b/super-legal-mcp-refactored/flags.env
@@ -19,6 +19,18 @@ XLSX_RENDERER=false
 # Default false (existing prompt-level + corrective-retry path). When true, bridge
 # enforces envelope shape at the Anthropic API level via output_config + parsed_output
 # extraction. Target: eliminate the ~80% turn-1-envelope-miss rate.
+#
+# ⚠️  PRODUCTION FLIP PREREQUISITES (BLOCKING): see docs/feature-flags.md §42
+#     1. Prometheus alerts wired (prometheus/alerts.yml — 6 alerts)
+#     2. Envelope-decision debug playbook (docs/runbooks/envelope-decision-debug-playbook.md)
+#     3. Cohort rollout plan (docs/runbooks/structured-output-enforcement-rollout.md)
+#     4. Grafana dashboard panels (grafana/claude-sdk-dashboard.json)
+#     5. Staging soak ≥ 3 days at flag=true with all 6 alerts silent
+#     6. L4 paired comparison data N≥10
+#     7. PR #138 + #139 deployed at flag=false for ≥ 7 days (baselines)
+#     8. On-call rotation announcement posted
+# DO NOT flip to true in production until ALL 8 are ✅.
+# Staging is fair game for soak testing (item #5).
 STRUCTURED_OUTPUT_ENFORCEMENT=false
 # Phase 7 — operational caps (per-process; multi-pod multiplies)
 XLSX_RENDER_CONCURRENCY=10
diff --git a/super-legal-mcp-refactored/grafana/claude-sdk-dashboard.json b/super-legal-mcp-refactored/grafana/claude-sdk-dashboard.json
index 0639093ad..f84d31def 100644
--- a/super-legal-mcp-refactored/grafana/claude-sdk-dashboard.json
+++ b/super-legal-mcp-refactored/grafana/claude-sdk-dashboard.json
@@ -16,7 +16,12 @@
         "type": "timeseries",
         "title": "Request Latency (P50/P95/P99)",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 0,
+          "y": 0
+        },
         "targets": [
           {
             "refId": "A",
@@ -46,7 +51,12 @@
         "type": "timeseries",
         "title": "Tool Error Rate by Tool",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 12,
+          "y": 0
+        },
         "targets": [
           {
             "refId": "A",
@@ -66,7 +76,12 @@
         "type": "timeseries",
         "title": "Structured Output Success Rate",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 0,
+          "y": 8
+        },
         "targets": [
           {
             "refId": "A",
@@ -86,7 +101,12 @@
         "type": "timeseries",
         "title": "Token Usage (Input/Output/Cached)",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 12,
+          "y": 8
+        },
         "targets": [
           {
             "refId": "A",
@@ -116,7 +136,12 @@
         "type": "timeseries",
         "title": "Circuit Breaker Trips (15m)",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 0,
+          "y": 16
+        },
         "targets": [
           {
             "refId": "A",
@@ -136,7 +161,12 @@
         "type": "timeseries",
         "title": "Thinking Blocks (rate)",
         "datasource": "${DS_PROMETHEUS}",
-        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 12,
+          "y": 16
+        },
         "targets": [
           {
             "refId": "A",
@@ -150,8 +180,158 @@
           },
           "overrides": []
         }
+      },
+      {
+        "id": 7,
+        "type": "timeseries",
+        "title": "PR #138 — Code-Bridge Caller-Category Mix (rate)",
+        "datasource": "${DS_PROMETHEUS}",
+        "description": "Rate of bridge calls per caller_category. Confirms all 3 production callers (xlsx_multi_turn, xlsx_single_turn, mcp_gateway) are emitting and that 'unknown' stays at 0.",
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 0,
+          "y": 24
+        },
+        "targets": [
+          {
+            "refId": "A",
+            "expr": "sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total[5m]))",
+            "legendFormat": "{{caller_category}}"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "unit": "ops"
+          },
+          "overrides": []
+        }
+      },
+      {
+        "id": 8,
+        "type": "timeseries",
+        "title": "PR #138 — Envelope-Source Distribution (flag=on)",
+        "datasource": "${DS_PROMETHEUS}",
+        "description": "When STRUCTURED_OUTPUT_ENFORCEMENT=on, what extraction path won? Healthy: text or parsed_output dominates. Anomaly: stdout dominant (see envelope-decision-debug-playbook.md alert 1).",
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 12,
+          "y": 24
+        },
+        "targets": [
+          {
+            "refId": "A",
+            "expr": "sum by (envelope_source) (rate(claude_code_bridge_envelope_outcome_total{structured_output=\"on\"}[5m]))",
+            "legendFormat": "{{envelope_source}}"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "unit": "ops"
+          },
+          "overrides": []
+        }
+      },
+      {
+        "id": 9,
+        "type": "gauge",
+        "title": "PR #138 — Turn-1 Success Rate by Caller (1h)",
+        "datasource": "${DS_PROMETHEUS}",
+        "description": "Per-caller turn-1 envelope success. Alert 3 threshold: 0.70 sustained 24h. Pre-flag-flip baseline target ≥0.70 absolute; post-flip target ≥0.85.",
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 0,
+          "y": 32
+        },
+        "targets": [
+          {
+            "refId": "A",
+            "expr": "sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total{turn_outcome=\"first_turn\"}[1h])) / (sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total[1h])) + 0.0001)",
+            "legendFormat": "{{caller_category}}"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "unit": "percentunit",
+            "min": 0,
+            "max": 1,
+            "thresholds": {
+              "mode": "absolute",
+              "steps": [
+                {
+                  "color": "red",
+                  "value": null
+                },
+                {
+                  "color": "orange",
+                  "value": 0.70
+                },
+                {
+                  "color": "green",
+                  "value": 0.85
+                }
+              ]
+            }
+          },
+          "overrides": []
+        }
+      },
+      {
+        "id": 10,
+        "type": "timeseries",
+        "title": "PR #136 — Per-Template Turn-1 Success (1h)",
+        "datasource": "${DS_PROMETHEUS}",
+        "description": "Per-xlsx-template turn-1 envelope success. Alert 5 threshold: 0.70 sustained 24h. Compare across flag states to validate STRUCTURED_OUTPUT_ENFORCEMENT efficacy per template.",
+        "gridPos": {
+          "h": 8,
+          "w": 12,
+          "x": 12,
+          "y": 32
+        },
+        "targets": [
+          {
+            "refId": "A",
+            "expr": "sum by (template_id) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome=\"first_turn\"}[1h])) / (sum by (template_id) (rate(claude_xlsx_render_turn1_envelope_success_total[1h])) + 0.0001)",
+            "legendFormat": "{{template_id}}"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "unit": "percentunit",
+            "min": 0,
+            "max": 1
+          },
+          "overrides": []
+        }
+      },
+      {
+        "id": 11,
+        "type": "heatmap",
+        "title": "PR #136 — Per-Phase Turn-1 Success Heatmap (1h)",
+        "datasource": "${DS_PROMETHEUS}",
+        "description": "Per-template + per-phase turn-1 envelope success. Alert 6 threshold: 0.65 sustained 24h. Phase-3 LBO is the highest-risk phase per L4 logs — expected first to trip if structured output destabilizes.",
+        "gridPos": {
+          "h": 10,
+          "w": 24,
+          "x": 0,
+          "y": 40
+        },
+        "targets": [
+          {
+            "refId": "A",
+            "expr": "sum by (template_id, phase) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome=\"first_turn\"}[1h])) / (sum by (template_id, phase) (rate(claude_xlsx_render_turn1_envelope_success_total[1h])) + 0.0001)",
+            "legendFormat": "{{template_id}} / {{phase}}"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "unit": "percentunit"
+          },
+          "overrides": []
+        }
       }
     ]
   }
 }
-
diff --git a/super-legal-mcp-refactored/prometheus/alerts.yml b/super-legal-mcp-refactored/prometheus/alerts.yml
index 4a377538d..2481ba3d8 100644
--- a/super-legal-mcp-refactored/prometheus/alerts.yml
+++ b/super-legal-mcp-refactored/prometheus/alerts.yml
@@ -197,6 +197,123 @@ groups:
           summary: "XLSX phase 3 (comprehensive audit) failing >10% of renders"
           description: "Aggregated workbook fails final audit. Likely cause: earlier phases emit BLUE-discipline or formula-whitelist violations. Check claude_xlsx_no_blue_cells_total + phase_audits in xlsx_renders.audit_results."
 
+      # ─── PR #140 — STRUCTURED_OUTPUT_ENFORCEMENT pre-flip readiness alerts ───
+      # All 6 alerts on the two envelope counters (PR #136 + PR #138). Wired
+      # BEFORE flipping STRUCTURED_OUTPUT_ENFORCEMENT=true so operators get
+      # immediate regression signal at the flag-flip moment. See:
+      #   docs/runbooks/envelope-decision-debug-playbook.md (per-alert response)
+      #   docs/runbooks/structured-output-enforcement-rollout.md (cohort plan)
+      #   docs/feature-flags.md §42 (prerequisite checklist)
+
+      # Alert 1/6 (PR #138 counter) — operators see if text-channel enforcement
+      # silently fails (extraction falls back to stdout). Fires at flag=ON when
+      # output_config doesn't actually enforce. At flag=OFF this is the steady-
+      # state (stdout is the default), so this alert is meaningful ONLY when
+      # STRUCTURED_OUTPUT_ENFORCEMENT=true.
+      - alert: CodeBridgeEnvelopeStdoutFallbackHigh
+        expr: |
+          (sum(rate(claude_code_bridge_envelope_outcome_total{envelope_source="stdout",structured_output="on"}[1h])) /
+           (sum(rate(claude_code_bridge_envelope_outcome_total{structured_output="on"}[1h])) + 0.0001)) > 0.80
+        for: 30m
+        labels:
+          severity: warning
+          team: platform
+          component: code-execution-bridge
+          flag: STRUCTURED_OUTPUT_ENFORCEMENT
+        annotations:
+          summary: "Code-bridge envelope falling back to stdout > 80% (30m, flag=on)"
+          description: "{{ $value | humanizePercentage }} of bridge calls with STRUCTURED_OUTPUT_ENFORCEMENT=on are using the stdout path instead of text/parsed_output. Indicates output_config enforcement is not engaging. Check Cloud Logging for `event=envelope_decision structured_output_enabled=true envelope_source=stdout` for the smoking gun. Likely causes: (1) Anthropic API rejecting schema silently, (2) prompt regression suppressing text envelope, (3) bridge feature flag mis-read at request time."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-1-stdout-fallback-high"
+
+      # Alert 2/6 (PR #138 counter) — extraction failing entirely. Critical
+      # because envelope=none means downstream consumers get no data. Tight
+      # 15m TTL because silent data loss is unacceptable.
+      - alert: CodeBridgeEnvelopeNoneCritical
+        expr: |
+          (sum(rate(claude_code_bridge_envelope_outcome_total{envelope_source="none"}[1h])) /
+           (sum(rate(claude_code_bridge_envelope_outcome_total[1h])) + 0.0001)) > 0.05
+        for: 15m
+        labels:
+          severity: critical
+          team: platform
+          component: code-execution-bridge
+          flag: STRUCTURED_OUTPUT_ENFORCEMENT
+        annotations:
+          summary: "Code-bridge envelope extraction failing > 5% (15m)"
+          description: "{{ $value | humanizePercentage }} of bridge calls produced envelope_source=none (no extractable envelope from text OR stdout). Downstream consumers receive null data. Investigate: (1) Anthropic sandbox failure rate, (2) stop_reason distribution in code_executions, (3) recent prompt changes. ROLLBACK TRIGGER per cohort plan if sustained > 15m."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-2-envelope-none-critical"
+
+      # Alert 3/6 (PR #138 counter) — per-caller-category regression. 24h
+      # window matches the cohort-rollout dwell time (Stage 1 = 24h). If
+      # turn-1 success drops below 70% for any caller during Stage 1, abort.
+      - alert: CodeBridgeTurn1SuccessLow
+        expr: |
+          sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total{turn_outcome="first_turn"}[1h])) /
+          (sum by (caller_category) (rate(claude_code_bridge_envelope_outcome_total[1h])) + 0.0001) < 0.70
+        for: 24h
+        labels:
+          severity: warning
+          team: platform
+          component: code-execution-bridge
+          flag: STRUCTURED_OUTPUT_ENFORCEMENT
+        annotations:
+          summary: "Caller {{ $labels.caller_category }} turn-1 envelope success < 70% (24h sustained)"
+          description: "Turn-1 rate {{ $value | printf \"%.2f\" }} for caller_category={{ $labels.caller_category }}. Pre-flag-flip baseline target ≥0.70 absolute. Indicates schema rejection, prompt regression, or per-caller-specific shape mismatch. ROLLBACK TRIGGER per cohort plan if 3 of 6 envelope alerts fire concurrently."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-3-turn1-success-low"
+
+      # Alert 4/6 (PR #138 counter) — data hygiene. If `caller_category=unknown`
+      # ever fires in production, a caller path was added without setting the
+      # parameter. Catches accidental regression where a new bridge invocation
+      # site bypasses the established 3-caller pattern.
+      - alert: CodeBridgeUnknownCallerCategoryEmitting
+        expr: sum(rate(claude_code_bridge_envelope_outcome_total{caller_category="unknown"}[15m])) > 0
+        for: 15m
+        labels:
+          severity: warning
+          team: platform
+          component: code-execution-bridge
+        annotations:
+          summary: "Code-bridge caller_category=unknown emitting (data hygiene)"
+          description: "A bridge invocation is being made without passing callerCategory. Defeats per-caller observability. Grep for `runPythonAnalysis(` call sites that don't pass callerCategory; the 3 known callers (multiTurnOrchestrator.js, xlsxRenderer/index.js, toolImplementations.js) all set it explicitly. New caller paths must add their own caller_category label per the §42 enum."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-4-unknown-caller"
+
+      # Alert 5/6 (PR #136 counter) — per-template regression detector. Same
+      # 70% floor as Alert 3 but template-scoped instead of caller-scoped.
+      # Phase-3 LBO is the highest-risk template per L4 logs; expected to be
+      # the first template to trip if structured output destabilizes.
+      - alert: XlsxRenderTurn1RegressionByTemplate
+        expr: |
+          sum by (template_id) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome="first_turn"}[1h])) /
+          (sum by (template_id) (rate(claude_xlsx_render_turn1_envelope_success_total[1h])) + 0.0001) < 0.70
+        for: 24h
+        labels:
+          severity: warning
+          team: platform
+          component: xlsx-renderer
+          flag: STRUCTURED_OUTPUT_ENFORCEMENT
+        annotations:
+          summary: "Template {{ $labels.template_id }} turn-1 envelope success < 70% (24h)"
+          description: "Turn-1 rate {{ $value | printf \"%.2f\" }} for template_id={{ $labels.template_id }}. Compare to pre-flag-flip baseline (from 7-day pre-deploy data per §42 prerequisite #7). If this template's baseline was ≥0.70 with the flag OFF and is now <0.70 with it ON, structured output enforcement is the likely regression vector for this template."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-5-template-regression"
+
+      # Alert 6/6 (PR #136 counter) — per-phase regression. Lower 65% floor
+      # (vs. 70% in Alerts 3 and 5) to allow for phase variability (e.g., phase3
+      # LBO commonly at 70-75% baseline; 65% leaves margin for normal variance).
+      - alert: XlsxRenderTurn1RegressionByPhase
+        expr: |
+          sum by (template_id, phase) (rate(claude_xlsx_render_turn1_envelope_success_total{turn_outcome="first_turn"}[1h])) /
+          (sum by (template_id, phase) (rate(claude_xlsx_render_turn1_envelope_success_total[1h])) + 0.0001) < 0.65
+        for: 24h
+        labels:
+          severity: warning
+          team: platform
+          component: xlsx-renderer
+          flag: STRUCTURED_OUTPUT_ENFORCEMENT
+        annotations:
+          summary: "Phase {{ $labels.phase }} of {{ $labels.template_id }} turn-1 < 65% (24h)"
+          description: "Turn-1 rate {{ $value | printf \"%.2f\" }} on {{ $labels.template_id }}/{{ $labels.phase }}. Phase-level regression often indicates structured-output enforcement is rejecting envelopes specific to one phase's data shape. Query: SELECT envelope_source, COUNT(*) FROM code_executions WHERE created_at > NOW() - INTERVAL '24h' AND agent_type LIKE '%{{ $labels.template_id }}%' GROUP BY 1; check envelope_source distribution for the affected phase."
+          runbook_url: "docs/runbooks/envelope-decision-debug-playbook.md#alert-6-phase-regression"
+
       # v6.8.7 T2: G5 citation-verifier observability alerts.
       # Baseline established 2026-05-12 (PRs #118+#119): Exa 96.8% / Anthropic 96.1%.
       # 90% WARN floor gives ~7pp margin; 80% CRIT triggers only on genuine degradation.