diff --git a/.claude/skills/api-integration/SKILL.md b/.claude/skills/api-integration/SKILL.md
index 006d40843..f08ea2d6c 100644
--- a/.claude/skills/api-integration/SKILL.md
+++ b/.claude/skills/api-integration/SKILL.md
@@ -7,12 +7,13 @@ description: Build and integrate a new API client into the Super Legal MCP platf
 
 ## Overview
 
-This skill integrates a new API data source into the Super Legal MCP platform following the canonical pattern used by all 36 existing clients. The process produces a fully operational hybrid client with native-first routing, Exa two-phase fallback (search + /contents enrichment), circuit breaker protection, caching, observability, and frontend catalog display.
+This skill integrates a new API data source into the Super Legal MCP platform following the canonical pattern used by all 38 existing clients. The process produces a fully operational hybrid client with native-first routing, Exa two-phase fallback (search + /contents enrichment), circuit breaker protection, caching, observability, and frontend catalog display.
 
-**Current platform state** (update these counts after each integration):
-- API clients: 36
-- Base MCP domains: 33 (+conditional: code-execution, direct-fetch, exa-search)
-- Tool schemas: 149+
+**Current platform state** (v7.0.1; update these counts after each integration):
+- API clients: 38 (+FMP equity-research gated by FMP_ENABLED, +DirectFetch)
+- Base MCP domains: 34 (+conditional: code-execution, direct-fetch, exa-search, equities)
+- Tool schemas: 197 (161 base + 36 FMP equity tools when FMP_ENABLED=true)
+- For high-precision integrations (live financial data, regulatory APIs), follow the FMP-derived empirical-first methodology — see Phase 1.5 (Empirical Capture & Probing) and Phase 1.6 (Endpoint Classification) below.
 - Production entry point: `Dockerfile:59` → `bootstrap.js` → `claude-sdk-server.js` → `clientRegistry.js`
 - `EnhancedLegalMcpServer.js` is legacy/local-dev only — do NOT wire new clients there
 
diff --git a/.claude/skills/client-backup-restore/SKILL.md b/.claude/skills/client-backup-restore/SKILL.md
index 71b0dc052..3f4d1c12f 100644
--- a/.claude/skills/client-backup-restore/SKILL.md
+++ b/.claude/skills/client-backup-restore/SKILL.md
@@ -187,10 +187,23 @@ gcloud sql backups restore {backup_id} \
 
 | Type | What's included | Size estimate | Duration |
 |---|---|---|---|
-| `full` | Database + reports directory | ~150-300 MB | 3-5 min |
-| `database-only` | Cloud SQL export (all tables + data) | ~30-80 MB | 1-2 min |
+| `full` | Database + reports directory | ~200-400 MB (v7.0.0+: includes transcript_events ~700KB-1MB/session) | 3-6 min |
+| `database-only` | Cloud SQL export (all tables + data) | ~30-100 MB (larger on v7.0.0+ deployments due to transcript_events) | 1-3 min |
 | `reports-only` | Reports directory (sessions, raw sources) | ~100-250 MB | 2-4 min |
 
+**v7.0.0+ database-only scope** — Cloud SQL export captures all tables, including the new compliance/observability tables introduced in v7.0.0:
+- `transcript_events` — full SSE event history per session (~700KB-1MB per session × N sessions). **Largest growth vector** at 10K+ sessions.
+- `code_executions` (now with 13+ reproducibility columns: model_id, llm_name, anthropic_request_id, anthropic_message_id, system_prompt_hash, python_code, container_id, tool_use_id, stop_reason, turn_count, pause_count, refusal_detected, etc.) — required for byte-replay envelope per EU AI Act Art. 15
+- `code_execution_inputs` — data lineage junction (small table, 1-5 rows per execution)
+- `citation_source_links` — citation→source bridge with confidence scores (1 row per matched citation)
+- `hook_audit_log` — now includes `bridge_metadata` JSONB column with `git_sha + sdk_version + container_id + system_prompt_hash` (regulator-replay envelope)
+
+Restore verification (Phase 4) should confirm these row counts post-restore for v7.0.0+ deployments:
+- `SELECT COUNT(*) FROM transcript_events` matches pre-backup count
+- `SELECT COUNT(*) FROM code_executions WHERE model_id IS NOT NULL` matches pre-backup count (NULL model_id = pre-v6.8.4 row, allowed)
+- `SELECT COUNT(*) FROM citation_source_links` matches pre-backup count
+- `SELECT event_data->'bridge_metadata' IS NOT NULL FROM hook_audit_log WHERE tool_name='run_python_analysis'` — bridge_metadata preserved on restore
+
 ## Storage Locations
 
 All backups stored in the client's WORM bucket:
diff --git a/.claude/skills/client-offboarding/SKILL.md b/.claude/skills/client-offboarding/SKILL.md
index 49c76378d..c7abe3594 100644
--- a/.claude/skills/client-offboarding/SKILL.md
+++ b/.claude/skills/client-offboarding/SKILL.md
@@ -55,7 +55,12 @@ bash /Users/ej/Super-Legal/.claude/skills/client-offboarding/scripts/offboard-cl
 
 **Step 7**: Verify archives — checks that archive files exist in GCS and have non-zero size. Reports checksums.
 
-**Step 6.5**: Archive Wave 3 audit tables as dedicated CSV artifacts — `access_log` (EU AI Act Article 12 read-side evidence) and `human_interventions` (EU AI Act Article 14 operator governance evidence) exported via `psql COPY TO STDOUT` + gzip to `gs://super-legal-worm-{client_id}/archive/{table}-{date}.csv.gz`. Cloud SQL's native `gcloud sql export csv` doesn't support table-level `--query` filtering, so the script uses psql directly against the connection string resolved from Secret Manager. Runs AFTER archive verification (Step 7) and BEFORE any destructive deletion (Phase 3) — these tables must survive the DB drop as standalone legal records. Legacy clients predating Wave 3 (tables don't exist) gracefully skip via `2>/dev/null || warn`. v6.5.1+ instances also have `hook_audit_log WHERE event_type = 'KGBuild'` entries — include in archive for KG build audit trail. Requires v6.6.0+ for complete telemetry (background task tracking, pool survival).
+**Step 6.5**: Archive compliance audit tables as dedicated CSV artifacts. **Wave 3 tables**: `access_log` (EU AI Act Article 12 read-side evidence) and `human_interventions` (EU AI Act Article 14 operator governance evidence). **v7.0.0 tables** (added in this scope):
+- `transcript_events` — full SSE event history per session (~700KB-1MB per session × N sessions; may be largest archive file). Required for byte-faithful session-reload audit if regulator queries any session history.
+- `citation_source_links` — citation→raw-source bridge with confidence scores. Required for hallucination audit (any citation with `confidence < 0.85` flagged for QA review at session-time should be reproducible from this archive).
+- `code_execution_inputs` — data lineage junction linking code executions to upstream subagent reports/embeddings/KG nodes. Required for EU AI Act Art. 15 reproducibility chain ("which subagent's output drove this DCF result?").
+
+All exported via `psql COPY TO STDOUT` + gzip to `gs://super-legal-worm-{client_id}/archive/{table}-{date}.csv.gz`. Cloud SQL's native `gcloud sql export csv` doesn't support table-level `--query` filtering, so the script uses psql directly against the connection string resolved from Secret Manager. Runs AFTER archive verification (Step 7) and BEFORE any destructive deletion (Phase 3) — these tables must survive the DB drop as standalone legal records. Legacy clients predating each table gracefully skip via `2>/dev/null || warn`. v6.5.1+ instances also have `hook_audit_log WHERE event_type = 'KGBuild'` entries — include in archive for KG build audit trail. **v7.0.0+ instances** have `hook_audit_log` rows with `bridge_metadata` JSONB (`git_sha + sdk_version + container_id + system_prompt_hash`) — these are the regulator-replay envelope and MUST be preserved in the audit log archive. Requires v6.6.0+ for complete telemetry (background task tracking, pool survival).
 
 ### Phase 3: Resource Deletion (DESTRUCTIVE — requires --confirm)
 
diff --git a/.claude/skills/code-execution-models/SKILL.md b/.claude/skills/code-execution-models/SKILL.md
index bd0365d6c..292913ccc 100644
--- a/.claude/skills/code-execution-models/SKILL.md
+++ b/.claude/skills/code-execution-models/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: code-execution-models
-description: Add new financial models to the code execution sandbox catalog. Use when the user asks to "add a model", "create a financial model", "add [model name] to code execution", "new analysis model", or wants to expand the PE/IB/M&A quantitative analysis toolkit. The sandbox runs Claude-generated Python (pandas, numpy, scipy, sklearn, matplotlib, seaborn) via the Anthropic code_execution_20260120 tool to produce structured JSON results, charts (PNG), and formatted tables. Currently 45 models across 13 categories. Also use when the user says "/code-execution-models".
+description: Add new financial models to the code execution sandbox catalog. Use when the user asks to "add a model", "create a financial model", "add [model name] to code execution", "new analysis model", or wants to expand the PE/IB/M&A quantitative analysis toolkit. The sandbox runs Claude-generated Python (pandas, numpy, scipy, sklearn, matplotlib, seaborn) via the Anthropic code_execution_20260120 tool to produce structured JSON results, charts (PNG), and formatted tables. Currently 56 models across 13 categories (M46–M55, M58 added in v7.0.0 for FMP equity research, gated by FMP_ENABLED). Also use when the user says "/code-execution-models".
 ---
 
 # Code Execution Models — Add Financial Analysis Model
@@ -159,7 +159,7 @@ Append to the `CODE_EXECUTION_MODELS` array after the last entry:
 }
 ```
 
-**Format guidelines** (match existing 45 models):
+**Format guidelines** (match existing 56 models):
 - `description`: 100-300 words, business-context-rich, mentions charts/tables produced
 - `methodology`: cites specific standards (ASC, IRC, academic papers) with thresholds
 - `outputFormat`: explicitly states chart types and table formats to generate
diff --git a/.claude/skills/deploy/SKILL.md b/.claude/skills/deploy/SKILL.md
index c8786096e..e787e4864 100644
--- a/.claude/skills/deploy/SKILL.md
+++ b/.claude/skills/deploy/SKILL.md
@@ -137,6 +137,42 @@ gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-
 gcloud compute ssh $INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
 ```
 
+### Variant: MIG instance replacement mid-retries
+
+**Observed on**: 2026-05-06 v7.0.1 deploy.
+
+**Symptom**: Step 7 retries 5x with `Could not fetch resource: super-legal-staging-XXXX` even though the script logs `IP is RESERVED` and proceeds to `Attempt 1/5: Assigning ...`. The log line keeps showing the SAME instance name across all 5 attempts. Meanwhile, `gcloud compute instances list` reveals a DIFFERENT instance name is actually running.
+
+**Root cause**: The MIG terminated the instance the script was targeting (e.g., `super-legal-staging-0239`) and rolled forward to a new one (e.g., `super-legal-staging-bzx4`) DURING step 7's retry budget. The script captured the original instance name in step 6 and did not re-resolve it on each retry. Every `add-access-config` call hits a deleted resource.
+
+**Detection between retries**:
+```bash
+gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)'
+```
+If this returns a different instance name than what the script's log shows, the variant has triggered.
+
+**Manual recovery on the new instance**:
+```bash
+NEW_INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
+gcloud compute instances delete-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
+sleep 10
+gcloud compute instances add-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
+sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts
+gcloud compute ssh $NEW_INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
+```
+
+Wait 60s, then verify via `curl http://34.26.70.60:3001/health`.
+
+**Future deploy.sh hardening** (not yet implemented): Step 7's retry loop should re-resolve the instance name on each attempt:
+```bash
+for attempt in 1 2 3 4 5; do
+  INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
+  gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet 2>err && break
+  sleep 30
+done
+```
+Filed as v7.0.x follow-up code change.
+
 ### Docker push transient broken pipe
 
 **Observed on**: 2026-04-27 v6.7.0 deploy.
diff --git a/.claude/skills/infrastructure-health/SKILL.md b/.claude/skills/infrastructure-health/SKILL.md
index 176ffea4c..eb45a07e7 100644
--- a/.claude/skills/infrastructure-health/SKILL.md
+++ b/.claude/skills/infrastructure-health/SKILL.md
@@ -2,7 +2,7 @@
 name: infrastructure-health
 description: >
   Tiered infrastructure health monitoring for Super Legal MCP platform. Monitors GCE instances,
-  PostgreSQL/pgvector, Anthropic API circuit breakers, 36 API clients, Gemini embedding
+  PostgreSQL/pgvector, Anthropic API circuit breakers, 38 API clients (incl. FMP equity-research, gated), Gemini embedding
   service, memory trends, EPO OAuth tokens, Prometheus alerts, session hygiene, API key
   expiration, Docker image drift, and dependency vulnerabilities. Triggers on: "infrastructure
   health", "health check", "infra status", "system health", "check infrastructure", "run health checks",
@@ -135,7 +135,7 @@ Read these subskill references:
 - [references/dependency-vulnerabilities.md](references/dependency-vulnerabilities.md) — npm audit
 
 ### Execution
-1. Fetch `/metrics` and check for circuit breaker trips, high error rates. Wave 4 metrics to verify: `claude_subagent_duration_ms`, `claude_api_client_results_total` (check for `outcome="zero_results"`), `claude_document_conversion_duration_ms`, `claude_document_conversion_errors_total`, `claude_embedding_duration_ms`, `claude_gate_check_results_total`, `claude_kg_build_total` (check for `status="error"` or `status="skipped_breaker"`), `claude_kg_build_duration_ms`
+1. Fetch `/metrics` and check for circuit breaker trips, high error rates. Wave 4 metrics to verify: `claude_subagent_duration_ms`, `claude_api_client_results_total` (check for `outcome="zero_results"` and `fetch_source` distribution — `exa_fallback` dominating for FMP tools indicates `FMP_API_KEY` issues), `claude_document_conversion_duration_ms`, `claude_document_conversion_errors_total`, `claude_embedding_duration_ms`, `claude_gate_check_results_total`, `claude_kg_build_total` (check for `status="error"` or `status="skipped_breaker"`), `claude_kg_build_duration_ms`. **v7.0.0 metrics to verify**: `claude_hook_persistence_failures_total` (any non-`unknown` reason = data loss vector), `claude_hook_circuit_breaker_state` (any value ≥2 = persistence skipping), `claude_code_execution_failures_total` by reason, `claude_hook_invocations_total` (success path counter — should grow during active sessions), `claude_tool_invocations_v2_total` (replaces deprecated v1; verify both still emitting during dual-emission window). **OTel sampler check**: container env `OTEL_TRACES_SAMPLER_ARG` — `1.0` indicates verification window, `0.1` is steady-state. See `references/prometheus-alerts.md` for full alert rule + remediation table.
 2. Run `scripts/pg-health.sh` for session hygiene and table sizes
 3. Calculate days until SAM_GOV_API_KEY expiry (set 2026-02-11, 90-day lifetime → ~2026-05-12)
 4. Run `scripts/docker-drift.sh` (requires gcloud auth — skip gracefully if unavailable)
diff --git a/.claude/skills/infrastructure-health/references/postgresql.md b/.claude/skills/infrastructure-health/references/postgresql.md
index 3d8623615..b069f853c 100644
--- a/.claude/skills/infrastructure-health/references/postgresql.md
+++ b/.claude/skills/infrastructure-health/references/postgresql.md
@@ -1,7 +1,10 @@
 # PostgreSQL Health — Subskill Reference
 
+**Version**: v7.0.1 (2026-05-06)
+
 ## Connection
-- Pool max: `PG_POOL_MAX` env (default: 10)
+- Pool max: `PG_POOL_MAX` env (default: **15** — bumped from 10 in v7.0.0 for 33% burst margin during simultaneous live stream + 3-rebuild reconciliation + transcript flush)
+- `statement_timeout`: 120,000 ms (preserved — extending was found unnecessary and risky during v6.8.0 audit)
 - Connection string: `PG_CONNECTION_STRING` or `DATABASE_URL`
 - Extension: pgvector (required when `EMBEDDING_PERSISTENCE=true`)
 
@@ -10,11 +13,15 @@
 |-------|---------|----------------|
 | sessions | Session tracking | 1 row per pipeline run |
 | reports | Report versions | ~10-20 rows per session |
-| hook_audit_log | Agent activity audit | 100-500 rows per session |
+| hook_audit_log | Agent activity audit | 100-500 rows per session; v7.0.0 adds `bridge_metadata` JSONB + `tool_use_id` columns |
+| code_executions | Per-`run_python_analysis` execution audit | 1 row per code execution; **v7.0.0 adds reproducibility columns** (model_id, llm_name, anthropic_request_id, anthropic_message_id, input/output/cache tokens, system_prompt_hash, python_code, container_id, tool_use_id, stop_reason, turn_count, pause_count, refusal_detected) |
+| code_execution_inputs | **v7.0.0** — data lineage junction linking each code execution to upstream subagent reports/embeddings/KG nodes | 1-5 rows per code execution |
+| transcript_events | **v7.0.0** — full-fidelity SSE event capture (`migrations/012_transcript-events.up.sql`); buffered batch insert | ~4,000-6,000 rows per 30-50 min session; **~700KB-1MB storage per session** |
+| citation_source_links | **v7.0.0** — citation→source bridge with fuzzy matching (URL exact / URL fuzzy / title fuzzy / embedding cosine) + confidence score | 1 row per memo footnote matched |
 | report_embeddings | pgvector embeddings | ~50-100 chunks per report |
 | agent_states | Agent lifecycle | ~40 rows per session |
 | source_writes | Wave 3 WAL — raw source persistence reconciliation | 1 row per raw source capture; hourly reconciler |
-| access_log | Wave 3 — EU AI Act Art. 12 read-side audit | 1 row per `/api/sessions/:id/*` read (fire-and-forget) |
+| access_log | Wave 3 — EU AI Act Art. 12 read-side audit | 1 row per `/api/sessions/:id/*` read (fire-and-forget); v7.0.0 audit-export reads also logged |
 | human_interventions | Wave 3 — EU AI Act Art. 14 operator governance audit | 0-5 rows per session (admin actions only) |
 | pii_mappings | Wave 3 — GDPR Art. 17 pseudonymization backing store | 0-N per session when PII detected |
 
diff --git a/.claude/skills/infrastructure-health/references/prometheus-alerts.md b/.claude/skills/infrastructure-health/references/prometheus-alerts.md
index a49a4c590..0e734d157 100644
--- a/.claude/skills/infrastructure-health/references/prometheus-alerts.md
+++ b/.claude/skills/infrastructure-health/references/prometheus-alerts.md
@@ -1,33 +1,160 @@
 # Prometheus Alert Review — Subskill Reference
 
+**Version**: v7.0.1 (2026-05-06) | **Source**: `super-legal-mcp-refactored/prometheus/alerts.yml`, `src/utils/sdkMetrics.js`, `src/config/alertingRules.js`
+
 ## Metrics Endpoint
-`GET /metrics` on the server (same port 3001, or METRICS_PORT if configured)
 
-## Key Alerts (from prometheus/alerts.yml)
-| Alert | Condition | Duration | Severity |
-|-------|-----------|----------|----------|
-| ClaudeToolErrorRateHigh | Tool error rate >5% | 5m | warning |
-| ClaudeLatencyRegression | P95 latency >10s | 10m | warning |
-| StructuredOutputValidationFailure | Output failures >2% | 5m | critical |
-| CircuitBreakerTripping | >3 trips in 15m | 1m | critical |
-| RateLimitExhaustion | Rate limit errors >10/min | 5m | warning |
+`GET /metrics` on the server (port 3001 in production, or `METRICS_PORT` if configured). Prometheus exposition format (`text/plain; version=0.0.4`). Authentication: none (network-layer ACL — Prometheus scrapes from same VPC).
+
+For full metric inventory see `super-legal-mcp-refactored/docs/metrics-catalog.md` (33 metrics across 12 categories). This reference focuses on what to scrape during Tier 3 health checks.
+
+## Alert Rules (13 total)
+
+### Tool & latency alerts (5, pre-v7.0.0)
+
+| Alert | Condition | Duration | Severity | Remediation |
+|---|---|---|---|---|
+| `ClaudeToolErrorRateHigh` | `rate(claude_tool_invocations_v2_total{status="error"}[5m]) / rate(claude_tool_invocations_v2_total[5m]) > 0.05` | 5m | warning | Identify failing tool via `{tool_name}` label. Check native API health via api-client-sweep |
+| `ClaudeLatencyRegression` | `histogram_quantile(0.95, claude_request_duration_ms_bucket) > 10000` | 10m | warning | Check Anthropic API circuit breaker, network latency, recent SDK upgrade |
+| `StructuredOutputValidationFailure` | `rate(claude_structured_output_failures_total[5m]) / rate(claude_structured_output_attempts_total[5m]) > 0.02` | 5m | critical | Schema validation rejecting LLM output. Usually transient. If persistent, check tool schema drift |
+| `CircuitBreakerTripping` | `increase(claude_circuit_breaker_trips_total[15m]) > 3` | 1m | critical | Correlate `{domain}` label with API client sweep. Check upstream API status |
+| `RateLimitExhaustion` | `sum(rate(claude_errors_total{code="RATE_LIMIT_ERROR"}[5m])) > 10` | 5m | warning | Anthropic-side rate limit. Check session concurrency; consider rpm/tpm bump |
+
+**Note (v7.0.0/v7.0.1)**: `ClaudeToolErrorRateHigh` was migrated to `claude_tool_invocations_v2_total` per W5.6. Legacy `claude_tool_invocations_total` is in 7-day dual-emission window; will be removed in v7.0.x.
+
+### Hook persistence alerts (3, v7.0.0 — CRITICAL data loss vectors)
+
+| Alert | Condition | Duration | Severity | Remediation |
+|---|---|---|---|---|
+| `HookPersistenceFailures` | `sum by (hook, reason) (rate(claude_hook_persistence_failures_total{reason!="unknown"}[5m])) > 0` | 5m | warning | Check DB pool health, CircuitBreaker state, recent deploys. Per-hook + per-reason labels exposed |
+| `HookCircuitBreakerOpen` | `max by (hook) (claude_hook_circuit_breaker_state) >= 2` | 2m | critical | Persistence is being skipped — rows are being lost. Likely DB connectivity. 2m threshold absorbs cold-start churn during rolling deploys |
+| `HookEnvelopeShapeDrift` | `sum(rate(claude_hook_persistence_failures_total{reason="envelope_shape_drift"}[5m])) > 0` | 1m TTL | critical | SDK upgrade or upstream API field rename. Update the schema (not the test mock). Short TTL because silent data loss starts immediately on drift |
+
+### Reconciliation alerts (5, v6.7.0)
+
+| Alert | Condition | Duration | Severity | Remediation |
+|---|---|---|---|---|
+| `ReconciliationKgBacklog` | `claude_reconciliation_pending_sessions{type="kg"} > 50` | 10m | warning | Check `/health.reconciliation`, `kg_build_last_error` distribution, kgBreaker state |
+| `ReconciliationKgCritical` | `claude_reconciliation_pending_sessions{type="kg"} > 100` | 5m | critical | Loop draining slower than ingest rate. Investigate immediately — possible KG extractor regression or pool exhaustion |
+| `ReconciliationArtifactsBacklog` | `claude_reconciliation_pending_sessions{type="artifacts"} > 50` | 10m | warning | Check `artifacts_build_last_error`; investigate document conversion pipeline |
+| `ReconciliationScanSlow` | `histogram_quantile(0.95, sum(rate(claude_reconciliation_scan_duration_ms_bucket[1h])) by (le)) > 900000` | 15m | warning | P95 >15min — likely 15-min Promise.race timeouts firing. Check `kg_build_last_error` for `'kg_build_timeout_15min'` |
+| `ReconciliationScanErrors` | `rate(claude_reconciliation_scans_total{status="error"}[1h]) > 0.0003` | 30m | warning | Loop throwing — check Cloud Logging for `'[SessionReconciliation] Scan failed'` |
 
 ## Key Metrics to Scrape
+
+### Pre-v7.0.0 (still active)
+
+```
+claude_circuit_breaker_trips_total{domain}
+claude_tool_invocations_total{tool, status} # DEPRECATED — removed in v7.0.x
+claude_tokens_input_total{model}
+claude_tokens_output_total{model}
+claude_errors_total{code, path}
+claude_request_duration_ms_bucket
+```
+
+### v7.0.0 additions (5 new)
+
+```
+claude_tool_invocations_v2_total{tool_name, status} # bounded enum
+claude_hook_persistence_failures_total{hook, reason} # 10-value reason enum
+claude_hook_circuit_breaker_state{hook} # 0=closed, 1=half-open, 2=open
+claude_code_execution_failures_total{reason} # refusal_detected | timeout | api_error | container_error | envelope_parse_error
+claude_hook_invocations_total{hook} # success path counter
+```
+
+### Reconciliation (v6.7.0)
+
 ```
-claude_circuit_breaker_trips_total # Counter by domain
-claude_tool_invocations_total{status="error"} # Tool failure counts
-claude_tokens_input_total # Token consumption
-claude_tokens_output_total
-claude_errors_total{code="..."} # Error breakdown
+claude_reconciliation_scans_total{status}
+claude_reconciliation_rebuilds_total{type, status}
+claude_reconciliation_scan_duration_ms_bucket
+claude_reconciliation_pending_sessions{type}
 ```
 
 ## Check Method
-Fetch `/metrics` and parse Prometheus text format. Look for:
-1. Any `circuit_breaker_trips_total` value > 0 since last check
-2. Error rate: `tool_invocations{status=error}` / `tool_invocations{status=success}`
-3. Token consumption trends (cost monitoring)
-
-## Remediation
-- **Tool error rate high**: Identify which tool via `{tool=...}` label. Check if the corresponding native API is down.
-- **Structured output failures**: Schema validation rejecting LLM output. Usually transient. If persistent, check if tool schema changed.
-- **Circuit breaker trips**: Correlate `{domain=...}` label with API client sweep results.
+
+Fetch `/metrics` and parse Prometheus text format. For each Tier 3 sweep:
+
+1. **Circuit breaker trips** — any `claude_circuit_breaker_trips_total > 0` since last check
+2. **Tool error rate (v2)** — `claude_tool_invocations_v2_total{status="error"}` / total > 0.05
+3. **Hook persistence failures** — any `claude_hook_persistence_failures_total{reason!="unknown"}` increasing
+4. **Hook circuit breaker** — any `claude_hook_circuit_breaker_state >= 2` (open) per hook
+5. **Envelope shape drift** — any `claude_hook_persistence_failures_total{reason="envelope_shape_drift"}` > 0 (CRITICAL — fix immediately)
+6. **Reconciliation backlog** — `claude_reconciliation_pending_sessions{type="kg"}` > 50 sustained
+7. **Code execution failures** — `claude_code_execution_failures_total` by reason — `refusal_detected` is informational, others are operational
+8. **Token consumption trends** — input/output/cache token counters for cost monitoring
+
+## v7.0.0 Table Health Probes
+
+For Tier 3 deeper checks, probe the new tables via PostgreSQL:
+
+```sql
+-- transcript_events: row count per recent session, FK integrity
+SELECT s.session_key, COUNT(t.id) AS event_count
+FROM sessions s LEFT JOIN transcript_events t ON s.id = t.session_id
+WHERE s.created_at > NOW() - INTERVAL '24 hours'
+GROUP BY s.session_key
+HAVING COUNT(t.id) = 0 -- sessions with no events = persistence broken
+ORDER BY MAX(s.created_at) DESC LIMIT 10;
+
+-- citation_source_links: confidence distribution
+SELECT matched_via, COUNT(*) AS total,
+       SUM(CASE WHEN confidence < 0.85 THEN 1 ELSE 0 END) AS low_confidence
+FROM citation_source_links
+WHERE created_at > NOW() - INTERVAL '24 hours'
+GROUP BY matched_via;
+
+-- code_execution_inputs: lineage row count per execution
+SELECT ce.id, ce.agent_type, COUNT(cei.id) AS lineage_rows
+FROM code_executions ce LEFT JOIN code_execution_inputs cei ON ce.id = cei.execution_id
+WHERE ce.created_at > NOW() - INTERVAL '24 hours'
+GROUP BY ce.id, ce.agent_type
+HAVING COUNT(cei.id) = 0; -- executions with no lineage = CAPABILITY constants not wired
+
+-- bridge_metadata.git_sha = 'unknown' = COMMIT_SHA build arg missing
+SELECT event_data->'bridge_metadata'->>'git_sha' AS git_sha, COUNT(*)
+FROM hook_audit_log
+WHERE tool_name='run_python_analysis' AND created_at > NOW() - INTERVAL '24 hours'
+GROUP BY git_sha;
+```
+
+## OTel Sampler Tuning
+
+Container env: `OTEL_TRACES_SAMPLER=parentbased_traceidratio` + `OTEL_TRACES_SAMPLER_ARG=`.
+
+| Rate | Use case |
+|---|---|
+| `1.0` (100%) | First-light verification deploys, post-incident triage. Cloud Trace cost scales linearly. **Current value during v7.0.1 verification window.** |
+| `0.1` (10%) | Production steady-state default. Bounds Cloud Trace cost; statistical visibility into 1-in-10 sessions. |
+| `0.01` (1%) | Very high traffic deployments where 10% overwhelms Cloud Trace quota. |
+
+**Action**: when v7.0.1 verification completes (FMP first-light + § 8.4.X V1–V4 pass), reduce `OTEL_TRACES_SAMPLER_ARG` from `1.0` back to `0.1` for cost discipline. This is a `flags.env` flip + redeploy.
+
+## FMP_ENABLED Health Probe (v7.0.0)
+
+If `FMP_ENABLED=true` in container env:
+
+1. Verify `FMP_API_KEY` is set in container env (`docker exec env | grep FMP_API_KEY`)
+2. Probe rate limiter remaining: `claude_api_client_results_total{tool_name=~"mcp__equities__.*"}` should show `fetch_source="fmp_native"` outcomes
+3. If `fetch_source="exa_fallback"` dominates, FMP_API_KEY is invalid or rate-limited
+4. § 8.4.X V1–V4 verification protocol — see `super-legal-mcp-refactored/docs/pending-updates/equity-analyst-update.md`
+
+## Remediation Quick Reference
+
+- **Tool error rate high (v2)**: Identify tool via `{tool_name}` label. Check native API in api-client-sweep.
+- **Hook persistence failures**: Check DB pool, CircuitBreaker state, latest deploys. Reasons map to specific causes (e.g., `connection_timeout` → DB unreachable, `envelope_shape_drift` → SDK upgrade)
+- **HookCircuitBreakerOpen**: Persistence skipped. Check DB connectivity. If DB is healthy, check for poison-pill events causing repeated failures
+- **HookEnvelopeShapeDrift**: SDK or upstream API changed. Update zod schema in `src/schemas/toolEnvelopes.js`, NOT the test mock
+- **Structured output failures**: Schema validation rejecting LLM output. Usually transient. If persistent, check tool schema drift
+- **Circuit breaker trips**: Correlate `{domain}` label with api-client-sweep results
+- **Reconciliation backlog**: Loop detecting partial sessions but not draining. Check `kg_build_last_error`, `artifacts_build_last_error`, kgBreaker state
+- **Reconciliation scan slow**: 15-min Promise.race timeouts firing on rebuilds. Check for KG extractor regression
+- **`bridge_metadata.git_sha = 'unknown'`**: COMMIT_SHA build arg missed during last `docker build`. Verify `deploy.sh:54-62` passes `--build-arg COMMIT_SHA=$(git rev-parse HEAD)` and `Dockerfile` has matching `ARG COMMIT_SHA=unknown` + `ENV COMMIT_SHA=${COMMIT_SHA}`
+
+## Reference docs
+
+- Full metric inventory: `super-legal-mcp-refactored/docs/metrics-catalog.md`
+- Audit-export endpoint runbook: `super-legal-mcp-refactored/docs/runbooks/v6.8.5-audit-export.md`
+- Reconciliation runbook: `super-legal-mcp-refactored/docs/runbooks/v6.7.0-session-reconciliation.md`
+- Feature flag registry: `super-legal-mcp-refactored/docs/feature-flags.md` §31a/b (OTel sampler, COMMIT_SHA)
diff --git a/.claude/skills/session-diagnostics/SKILL.md b/.claude/skills/session-diagnostics/SKILL.md
index aacce6356..a49357e00 100644
--- a/.claude/skills/session-diagnostics/SKILL.md
+++ b/.claude/skills/session-diagnostics/SKILL.md
@@ -103,6 +103,70 @@ See `references/failure-patterns.md` for the full catalog. Summary:
 | 7 | Subagent crash (SubagentStart with no matching SubagentStop) | CRITICAL |
 | 8 | Empty session (0 reports rows) | INFO |
 | 9 | Hook audit gaps (audit_log row count < 0.5x reports count) | WARNING |
+| 10 | **Transcript replay gap** (v7.0.0): `transcript_events` row count = 0 for completed session OR session exists but flush failed (missing late events) | CRITICAL |
+| 11 | **Citation low-confidence** (v7.0.0): `citation_source_links.confidence < 0.85` for >20% of citations — fuzzy matches flagged for QA, possible hallucinated citations | WARNING |
+| 12 | **Code-execution traceability NULL** (v7.0.0): `code_executions` rows with NULL `model_id`, `anthropic_request_id`, `python_code`, or `system_prompt_hash` — regulator audit gap, EU AI Act Art. 15 byte-replay envelope broken | CRITICAL |
+| 13 | **bridge_metadata corruption / missing git_sha** (v7.0.0): `hook_audit_log` rows where `event_data->'bridge_metadata'` is malformed OR `bridge_metadata.git_sha = 'unknown'` indicating COMMIT_SHA build arg missed | WARNING |
+| 14 | **Reconciliation stall** (v6.7.0): sessions with `kg_status='building'` >10 minutes OR `kg_build_attempts >= 5` (kgBreaker retry budget exhausted, parked for manual intervention) | CRITICAL |
+| 15 | **FMP equity-analyst routing failure** (v7.0.0): orchestrator dispatched equity-analyst (visible in `hook_audit_log` SubagentStart with `agent_type='equity-analyst'`) but `claude_api_client_results_total{tool_name=~"mcp__equities__.*"}` shows `fetch_source="exa_fallback"` instead of `fmp_native` (FMP_API_KEY invalid or rate-limited); OR M46–M55/M58 model rows show `success=false` | WARNING |
+
+### v7.0.0 Diagnostic Queries (operator copy-paste)
+
+```sql
+-- Pattern 10: Transcript replay gap
+SELECT s.session_key, s.status, COUNT(t.id) AS event_count
+FROM sessions s
+LEFT JOIN transcript_events t ON s.id = t.session_id
+WHERE s.session_key = '' AND s.status='complete'
+GROUP BY s.session_key, s.status;
+-- Expected: event_count > 0 (typical 4,000-6,000 events per memo session)
+
+-- Pattern 11: Citation low-confidence distribution
+SELECT matched_via, COUNT(*) AS total,
+       SUM(CASE WHEN confidence < 0.85 THEN 1 ELSE 0 END) AS low_conf
+FROM citation_source_links csl
+JOIN reports r ON csl.report_id = r.id
+JOIN sessions s ON r.session_id = s.id
+WHERE s.session_key = ''
+GROUP BY matched_via;
+
+-- Pattern 12: Code-execution traceability NULL check
+SELECT COUNT(*) AS total,
+       SUM(CASE WHEN model_id IS NULL THEN 1 ELSE 0 END) AS missing_model_id,
+       SUM(CASE WHEN python_code IS NULL THEN 1 ELSE 0 END) AS missing_code,
+       SUM(CASE WHEN system_prompt_hash IS NULL THEN 1 ELSE 0 END) AS missing_prompt_hash,
+       SUM(CASE WHEN anthropic_request_id IS NULL THEN 1 ELSE 0 END) AS missing_request_id
+FROM code_executions ce
+JOIN sessions s ON ce.session_id = s.id
+WHERE s.session_key = '';
+
+-- Pattern 13: bridge_metadata.git_sha = 'unknown' indicates COMMIT_SHA build arg missed
+SELECT event_data->'bridge_metadata'->>'git_sha' AS git_sha, COUNT(*)
+FROM hook_audit_log hal
+JOIN sessions s ON hal.session_id = s.id
+WHERE s.session_key = '' AND tool_name='run_python_analysis'
+GROUP BY git_sha;
+
+-- Pattern 14: Reconciliation stall (kg + artifacts pipelines)
+SELECT session_key, status,
+       kg_status, kg_build_attempts, kg_breaker_skipped_count,
+       last_kg_build_attempt_at, kg_build_last_error,
+       artifacts_status, artifacts_build_attempts,
+       last_artifacts_build_attempt_at, artifacts_build_last_error,
+       updated_at
+FROM sessions
+WHERE session_key = '';
+
+-- Pattern 15: FMP routing — V1 + V2 from § 8.4.X
+SELECT tool_name, COUNT(*) FROM hook_audit_log hal
+JOIN sessions s ON hal.session_id = s.id
+WHERE s.session_key = '' AND tool_name LIKE 'mcp__equities__%'
+GROUP BY tool_name;
+SELECT model_id, COUNT(*), SUM(CASE WHEN success THEN 0 ELSE 1 END) AS failed
+FROM code_executions ce JOIN sessions s ON ce.session_id = s.id
+WHERE s.session_key = '' AND model_id IN ('M46','M47','M48','M49','M50','M51','M52','M53','M54','M55','M58')
+GROUP BY model_id;
+```
 
 ## Pre-flight Checks
 