Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions .claude/skills/post-deploy-verify/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
---
name: post-deploy-verify
description: >
Run a tiered post-deployment verification checklist after every deploy completes.
Tier 1 (Critical, ~2 min): /health 200, DB connectivity, COMMIT_SHA wired, flags
propagated. Tier 2 (Integration, ~5 min): § 8.4.X V1-V4 protocol — FMP tools,
M46-M58 code-execution models, Cloud Trace SubagentStart, citation verifier,
container env audit, bridge_metadata.git_sha. Tier 3 (Deep, ~10 min): metrics
baseline, reconciliation status, circuit breakers, Prometheus alerts, OTel
sampler. Read-only — never auto-rolls back. Triggers: "verify deploy",
"post-deploy check", "verify staging", "§ 8.4.X verification", "deploy validation",
"/post-deploy-verify". Supports: --tier 1|2|3|all, --url <base_url>.
---

# Post-Deploy Verify

## Workflow

Execute `scripts/verify.sh` from the skill directory (or invoke `/post-deploy-verify`):

```bash
bash /Users/ej/Super-Legal/.claude/skills/post-deploy-verify/scripts/verify.sh
bash /Users/ej/Super-Legal/.claude/skills/post-deploy-verify/scripts/verify.sh --tier 1
bash /Users/ej/Super-Legal/.claude/skills/post-deploy-verify/scripts/verify.sh --tier 2
bash /Users/ej/Super-Legal/.claude/skills/post-deploy-verify/scripts/verify.sh --tier 3 --url http://34.26.70.60:3001
```

The script:
1. Resolves base URL (--url arg → `super-legal-mcp-refactored/scripts/.staging-ip` → `http://localhost:3001`)
2. Pre-flight: validates `which curl`, `which python3`, `which gcloud`, `which jq`
3. Dispatches to selected tier(s); each tier outputs JSON
4. `format-report.py` aggregates → markdown report
5. Exit code: 0 PASSED, 1 WARNING, 2 FAILED

## Tier 1 — Critical (~2 min)

| Check | Source | Pass criteria |
|---|---|---|
| `GET /health` returns 200 | curl | HTTP 200 in ≤10s |
| `dependencies.database.status` = `ok` | `/health` JSON | DB connectivity |
| `dependencies.database.latency_ms` < 50 | `/health` JSON | Pool healthy |
| `dependencies.circuit_breaker.state` = `CLOSED` | `/health` JSON | Anthropic API healthy |
| Container env has `COMMIT_SHA` set | `gcloud compute ssh + docker exec env` | Reproducibility wired |
| `COMMIT_SHA` matches `git rev-parse HEAD` | shell | No drift |
| `feature_flags` block matches `flags.env` | `/health.feature_flags` vs `flags.env` parse | Flags propagated |

Fails immediately on any CRITICAL — operator should investigate before proceeding.

## Tier 2 — § 8.4.X V1-V4 + Container Env Audit (~5 min)

Embeds the verification protocol from `super-legal-mcp-refactored/docs/pending-updates/equity-analyst-update.md` § 8.4.X.

| Check | Pass criteria |
|---|---|
| **V1**: FMP tool invocations in last 5 min | When `FMP_ENABLED=true` + memo run: ≥1 row in `hook_audit_log` with `tool_name LIKE 'mcp__equities__%'`. Otherwise: WARNING "no recent invocations" |
| **V2**: Code-execution models M46–M58 | When FMP active + memo run: ≥1 row in `code_executions` with `model_id IN ('M46',...,'M58')`. Otherwise: WARNING |
| **V3**: Cloud Trace SubagentStart | When FMP active + memo run + sampler ≥0.1: ≥1 span in last hour. (Manual check via Cloud Console; skill prints filter URL) |
| **V4**: Citation verifier accepts FMP | Latest memo's `qa-outputs/citation-verification-certificate.md` shows FMP URLs as `CONFIRMED`. (Manual file inspection; skill prints query path) |
| **`bridge_metadata.git_sha` not 'unknown'** | All recent code-execution audit rows have real git SHA |
| **Container env audit** | `OTEL_TRACES_SAMPLER`, `OTEL_TRACES_SAMPLER_ARG`, `FMP_ENABLED`, `COMMIT_SHA`, `BCRYPT_ROUNDS` all present in container env |

## Tier 3 — Metrics + Reconciliation + Trace (~10 min)

| Check | Source | Pass criteria |
|---|---|---|
| Reconciliation enabled, 0 backlog | `/health.reconciliation` | `enabled=true`, `pending_kg=0`, `pending_artifacts=0`, `stuck_kg=0` |
| No new `claude_hook_persistence_failures_total` since deploy | `/metrics` | Counter delta = 0 since deploy |
| All circuit breakers CLOSED | `/metrics` `claude_hook_circuit_breaker_state` | All values = 0 |
| Memory baseline | `/health.performance` or `.memory` | RSS within ±20% of `references/baselines.json` |
| Prometheus alerts not firing | `/metrics` evaluation | All thresholds within bounds |

## Output Format

```
## Post-Deploy Verification Report
Timestamp: <ISO> | Target: <url> | Tier: <level>
Container: <name> | COMMIT_SHA: <short sha>

### Overall: PASSED ✓ | WARNING ⚠ | FAILED ✗

### Tier 1: Critical (<elapsed>)
✓ /health 200 OK in <ms>ms
✓ Database OK, latency <ms>ms
✓ Anthropic circuit_breaker: CLOSED (<n>/<threshold> failures)
✓ COMMIT_SHA in container = <sha> (matches deployed)
✓ flags.env propagated (<m>/<n> flags match)

### Tier 2: § 8.4.X V1-V4 + Container Env (<elapsed>)
✓ V1 FMP tools: <n> invocations in last 5min
✓ V2 Code models M46-M58: M46(<n>), M48(<n>), M50(<n>)
⚠ V3 Cloud Trace: skill prints filter URL — manual check required
⚠ V4 Citation verifier: skill prints qa-outputs path — manual check required
✓ Container env: all required vars present
✓ bridge_metadata.git_sha: all recent rows = <sha>

### Tier 3: Metrics + Reconciliation (<elapsed>)
✓ Reconciliation: enabled, 0 backlog
✓ Hook persistence: 0 failures since <timestamp>
✓ All circuit breakers CLOSED
⚠ Memory: RSS <mb>MB (baseline <mb>MB; ±X.X% — watch trend)
✓ Prometheus alerts: 0 firing

### Issues detected
NONE | <list>

### Raw Signals
[Full /health JSON + query results in collapsible blocks]
```

## Pre-flight Checks

```bash
which curl # required
which python3
which gcloud # for Tier 1 container env probe + Tier 2 SSH
which jq # for /health JSON parsing
```

## Read-Only Guarantee

Never auto-rolls back, never invokes admin endpoints, never executes DML.
All remediation suggestions are printed as commands the operator can run manually.

## Troubleshooting

| Failure | Fix |
|---|---|
| Tier 1 `/health` returns 502/503 | Container starting or unhealthy. Wait 60s and retry. If persistent, check `/deploy` Step 8.5 (post-IP container restart) |
| Tier 1 `COMMIT_SHA = 'unknown'` | `--build-arg COMMIT_SHA` missed during last build. See `deploy/SKILL.md`. Re-deploy or `docker exec` to set env post-hoc (not persisted) |
| Tier 2 V1 returns 0 rows but `FMP_ENABLED=true` | Either no memo run since deploy (run a test memo) OR `FMP_API_KEY` invalid/rate-limited (check `claude_api_client_results_total{fetch_source}` distribution — `exa_fallback` dominating = key issue) |
| Tier 2 container env missing vars | Re-run `/deploy` and ensure deploy.sh's CONTAINER_ENV plumbing is current (lines 80-95) |
| Tier 3 reconciliation stuck | See `session-diagnostics` Pattern 14 for forensic SQL |

See `references/failure-patterns.md` for 15+ documented failure modes.

## Known Constraints

- **V3 Cloud Trace check is manual** — gcloud trace logs read API requires interactive auth and project scoping; skill prints the filter URL operator can paste into Cloud Console
- **V4 Citation verifier check is manual** — requires reading the latest memo's `qa-outputs/` directory; skill prints the path
- **Baselines drift over time** — `baselines.json` is regenerated post-deploy. If you bumped flags or memory limits, regenerate baselines before declaring memory drift
32 changes: 32 additions & 0 deletions .claude/skills/post-deploy-verify/references/baselines.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"_meta": {
"version": "v7.0.1",
"captured_at": "2026-05-07",
"captured_from": "super-legal-staging-bzx4 (34.26.70.60:3001)",
"notes": "Baseline captured at deploy time. Regenerate after major releases or when memory limits change. ±20% drift is acceptable; >20% warrants investigation."
},
"memory": {
"rss_mb": 145,
"heap_used_mb": 60,
"heap_total_mb": 63
},
"performance": {
"uptime_seconds_at_capture": 33,
"active_streams": 0,
"background_tasks": 0
},
"database": {
"expected_latency_ms_max": 50,
"pool_max_connections": 15
},
"circuit_breaker": {
"expected_state": "CLOSED",
"expected_failures": 0,
"threshold": 3
},
"reconciliation": {
"expected_enabled": true,
"expected_pending_kg": 0,
"expected_pending_artifacts": 0
}
}
127 changes: 127 additions & 0 deletions .claude/skills/post-deploy-verify/references/failure-patterns.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# Post-Deploy Failure Patterns

Known failure modes seen in v6.x and v7.0.x deploys, with detection signal and remediation. Operator-only — read-only skill never auto-applies fixes.

## Tier 1 — Critical

### P1: `/health` returns 502/503

**Signal**: Tier 1 fast-fail with `fatal:true`.
**Cause**: Container starting, OOMed, or instance terminated.
**Remediation**:
```bash
gcloud compute instances list --filter="name~super-legal-staging" --format="table(name,zone,status,networkInterfaces[0].accessConfigs[0].natIP)"
gcloud compute ssh <instance> --zone=us-east1-c --command="docker ps -a && docker logs --tail 200 \$(docker ps -q | head -1)"
```

### P2: `COMMIT_SHA = 'unknown'` in container

**Signal**: Tier 1 FAILED on COMMIT_SHA.
**Cause**: `deploy.sh` missed `--build-arg COMMIT_SHA=$(git rev-parse HEAD)` plumbing.
**Remediation**: Re-run deploy after confirming `deploy.sh:54-62` carries the build arg.

### P3: DB latency > 100ms

**Signal**: Tier 1 FAILED on DB latency.
**Cause**: Pool saturation, Cloud SQL hot spot, or network partition.
**Remediation**: Query the Postgres system catalog `pg_stat_activity` grouping by `state`. If `idle in transaction` > 5, investigate stuck client. If pool exhausted, restart container.

### P4: Circuit breaker OPEN

**Signal**: Tier 1 FAILED on circuit breaker.
**Cause**: Anthropic API failures exceeded threshold (3 consecutive).
**Remediation**: Wait 60s for half-open probe; check `https://status.anthropic.com`. If sustained, container restart resets state.

### P5: Required env var missing

**Signal**: Tier 1 FAILED on missing env (e.g., `OTEL_TRACES_SAMPLER`).
**Cause**: `deploy.sh` `CONTAINER_ENV` array doesn't include the var.
**Remediation**: Update deploy.sh, redeploy. Do NOT manually `docker exec ... export` — overwritten on restart.

## Tier 2 — Integration / § 8.4.X

### P6: V1 returns 0 FMP invocations

**Signal**: Tier 2 WARNING on V1.
**Cause**: Either FMP_API_KEY broken, or no equity-research memo run in last 5 min.
**Remediation**: Run a test memo via the frontend. If still 0 after memo completes, check `claude_api_client_results_total{client="fmp"}` distribution.

### P7: V2 missing M46-M58 invocations

**Signal**: Tier 2 WARNING on V2.
**Cause**: `CODE_EXECUTION_BRIDGE=false` or model catalog deploy missed.
**Remediation**: Verify `code_executions` table has rows for any model post-deploy. If empty, code-execution bridge is dead — check `claude-sdk-server` logs for `[CODE-EXEC]` errors.

### P8: V3 Cloud Trace empty

**Signal**: Tier 2 WARNING on V3 (no SubagentStart spans).
**Cause**: `OTEL_TRACES_SAMPLER_ARG` too low (e.g., 0.01) or `OTEL_ENABLED=false`.
**Remediation**: Bump sampler to 1.0 for verification window, redeploy. After window, return to 0.1.

### P9: V4 citation verifier rejected FMP

**Signal**: Latest memo's `citation-verification-certificate.md` shows FMP URLs as REJECTED.
**Cause**: Citation verifier regex doesn't match `financialmodelingprep.com` URL pattern.
**Remediation**: Check `src/utils/citationWebsearchVerifier.js` regex; FMP entries should match `^https?://(www\.)?financialmodelingprep\.com/`.

### P10: `bridge_metadata.git_sha = 'unknown'`

**Signal**: Tier 2 FAILED on bridge_metadata git_sha probe.
**Cause**: Same as P2 — `COMMIT_SHA` build arg never propagated to runtime.
**Remediation**: Re-run deploy with build arg. Replay envelope is required for EU AI Act Art. 15 audit trail.

## Tier 3 — Deep / Reconciliation

### P11: Reconciliation `pending_kg > 0`

**Signal**: Tier 3 FAILED on reconciliation backlog.
**Cause**: KG worker stuck or DB write contention.
**Remediation**:
```sql
SELECT session_key, kg_build_attempts, kg_build_last_error, last_kg_build_attempt_at
FROM sessions WHERE kg_status = 'pending' AND kg_build_attempts >= 3
ORDER BY created_at DESC LIMIT 20;
```
If errors are real, fix root cause. If stuck due to deploy interruption, sessions self-recover after `SESSION_RECONCILIATION` worker tick.

### P12: `claude_hook_persistence_failures_total` non-zero

**Signal**: Tier 3 FAILED on hook persistence.
**Cause**: DB unreachable during hook fire OR row-too-large.
**Remediation**:
```sql
SELECT event_type, count(*) FROM hook_audit_log
WHERE persisted = false AND created_at > NOW() - INTERVAL '1 hour'
GROUP BY event_type;
```

### P13: Memory RSS drift > 50%

**Signal**: Tier 3 FAILED on memory baseline.
**Cause**: Leak, pool re-sizing, or legitimate workload increase.
**Remediation**: Capture heap dump via `kill -USR2 <pid>`, regenerate `baselines.json` if drift is intentional.

### P14: Transcript event count < 100 for completed session (Pattern 10)

**Signal**: Tier 3 FAILED on `t3-transcript-event-rate.sql`.
**Cause**: Transcript flush broken — `TRANSCRIPT_DB_PERSISTENCE=false` OR queue saturated.
**Remediation**: Confirm flag is true via `/health.feature_flags.TRANSCRIPT_DB_PERSISTENCE`. Check `claude_hook_persistence_failures_total{event_type="transcript"}`. Sessions completed during disabled flag are NOT recoverable.

### P15: OTel sampler not propagated

**Signal**: Tier 3 FAILED on `OTEL_ENABLED` or sampler arg mismatch.
**Cause**: Container env missing or `flags.env` drifted from `--container-env` array.
**Remediation**: Check `gcloud compute ssh ... -- docker exec <container> env | grep OTEL`. Redeploy if missing.

## Cross-tier patterns

### Static IP race (deploy.sh step 7)

Not a verification failure but precedes Tier 1 P1. If verify-tier1 sees DB unreachable from container despite `/health` 200 from frontend — container is on ephemeral IP not whitelisted by Cloud SQL.

**Detection**:
```bash
gcloud compute instances describe <instance> --zone=us-east1-c \
--format="value(networkInterfaces[0].accessConfigs[0].natIP)"
```
If not `34.26.70.60`, see `deploy/SKILL.md` § "Static IP assignment race".
42 changes: 42 additions & 0 deletions .claude/skills/post-deploy-verify/references/tier1-checks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Tier 1 — Critical Checks (~2 min)

Tier 1 fails fast on any CRITICAL — operator should investigate before Tier 2/3.

| Check | Pass criteria | Severity on fail |
|---|---|---|
| `GET /health` HTTP code | 200 | FAILED (script exits early with `fatal:true`) |
| `dependencies.database.status` | `ok` | FAILED |
| `dependencies.database.latency_ms` | `< 50` | WARNING if 50-100, FAILED if >100 |
| `dependencies.circuit_breaker.state` | `CLOSED` | FAILED if OPEN; WARNING if missing |
| Container env `COMMIT_SHA` | set, not 'unknown' | FAILED if 'unknown' or missing |
| `COMMIT_SHA` matches `git rev-parse HEAD` | exact match | WARNING if drift |
| Required env vars present | all 7 v7.0.x flags (OTEL_ENABLED, OTEL_TRACES_SAMPLER, OTEL_TRACES_SAMPLER_ARG, TRANSCRIPT_DB_PERSISTENCE, SESSION_RECONCILIATION, HOOK_DB_PERSISTENCE, FMP_ENABLED) | FAILED if any missing |

## Fast-fail behavior

If `/health` returns non-200, script exits immediately with `fatal:true` in the JSON output. No further Tier 1 checks run, and Tier 2/3 are not invoked.

This avoids cascading false-negatives when the container is unreachable.

## Assumptions

- `gcloud` CLI installed + authenticated (Tier 1 container env audit). If not, those checks emit WARNING and skip.
- `jq` installed for JSON parsing (pre-flight).
- `curl` installed (pre-flight).

## Why these checks

- **`/health` 200**: container is responsive
- **DB connectivity + latency**: Cloud SQL whitelist intact, pool not saturated
- **Circuit breaker state**: Anthropic API healthy enough to make tool calls
- **`COMMIT_SHA != 'unknown'`**: deploy.sh's `--build-arg` propagation worked (compliance/audit requirement)
- **Required env vars**: deploy.sh's `--container-env` plumbing carried v7.0.x flags through

## Manual recovery hints

| Failure | Likely cause | Fix |
|---|---|---|
| `/health` 502/503 | Container unhealthy or starting | Wait 60s; check `gcloud compute ssh` + `docker logs` |
| `COMMIT_SHA = 'unknown'` | deploy.sh missed `--build-arg COMMIT_SHA=$(git rev-parse HEAD)` | Re-deploy with the fix in deploy.sh:54-62 |
| Missing env var | `--container-env` plumbing didn't include it | Update deploy.sh CONTAINER_ENV array; redeploy |
| DB latency >100ms | Pool exhaustion or Cloud SQL hot spot | Check `pg_pool` metrics; consider `pg_stat_activity` |
28 changes: 28 additions & 0 deletions .claude/skills/post-deploy-verify/references/tier2-checks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Tier 2 — § 8.4.X V1-V4 + Container Env Audit (~5 min)

Embeds the verification protocol from `super-legal-mcp-refactored/docs/pending-updates/equity-analyst-update.md` § 8.4.X.

## V1: FMP tool invocations
SQL: `scripts/queries/v1-fmp-tool-invocations.sql` — counts `mcp__equities__%` tool calls in last 5 min.

## V2: Code-execution models M46-M58
SQL: `scripts/queries/v2-code-execution-models.sql` — per-model invocation/success/duration distribution.

## V3: Cloud Trace SubagentStart
Manual: filter Cloud Trace for `attribute.agent_type='equity-analyst' AND attributes.stage='research_support'`. Skill prints filter URL.

## V4: Citation verifier accepts FMP
Manual: grep latest memo's `qa-outputs/citation-verification-certificate.md` for `financialmodelingprep.com` entries. Skill prints path.

## Container env audit
Required: `OTEL_TRACES_SAMPLER`, `OTEL_TRACES_SAMPLER_ARG`, `FMP_ENABLED`, `COMMIT_SHA`. Optional: `FMP_API_KEY`, `BCRYPT_ROUNDS`.

## bridge_metadata.git_sha probe
SQL: `scripts/queries/t3-bridge-metadata-git-sha.sql`. Pass: single row with real SHA. Fail: 'unknown' = COMMIT_SHA build arg missed.

## Why these checks

- **V1/V2**: confirms FMP routing is live (when FMP_ENABLED=true). Distinguishes "no recent memo" from "FMP_API_KEY broken" via `claude_api_client_results_total{fetch_source}` distribution.
- **V3**: confirms OTel sampler is sampling enough that equity-analyst spans land in Cloud Trace
- **V4**: confirms citation_websearch_verifier accepts FMP URL patterns (regex match)
- **bridge_metadata.git_sha**: confirms regulator-replay envelope is intact (EU AI Act Art. 15)
Loading