From e64712e455bce27f1ad7a74c179afed23474a2b8 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 03:14:29 -0400 Subject: [PATCH 01/11] docs(kg): plan + audit for risk-layer fix (#231) Root-cause + remediation plan for the KG dropping the entire risk node layer on current-format sessions: risk-summary.json never persisted + Phase 7 parser keyed on the legacy risk_categories schema. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../pending-updates/kg-risk-layer-fix-231.md | 36 +++++++++++++++++++ 1 file changed, 36 insertions(+) create mode 100644 super-legal-mcp-refactored/docs/pending-updates/kg-risk-layer-fix-231.md diff --git a/super-legal-mcp-refactored/docs/pending-updates/kg-risk-layer-fix-231.md b/super-legal-mcp-refactored/docs/pending-updates/kg-risk-layer-fix-231.md new file mode 100644 index 000000000..5ead87c3a --- /dev/null +++ b/super-legal-mcp-refactored/docs/pending-updates/kg-risk-layer-fix-231.md @@ -0,0 +1,36 @@ +# KG Risk-Layer Fix — Plan + Audit (issue #231) + +## Problem +The KG drops the entire `risk` node layer for current-format sessions. Two independent breaks: +1. **Persistence** — `review-outputs/risk-summary.json` is never written to the `reports` table (live hook `hookDBBridge.js:1445` and backfill `walkMarkdown` persist `.md` only). KG `CRITICAL_REPORTS` gate (`hookDBBridge.js:1359`) then times out on `risk-summary`. +2. **Parser schema drift** — Phase 7 (`kgPhases6to8.js:380`) keys on `parsed.risk_categories || parsed.categories`; the current producer emits `exposure_by_category` with **string** exposures (`"$433.75M"`) and **string** probability (`"8% fail"`). Even when present (May-27 session), 0 risk nodes result. + +Cardinal (May) worked because it emitted `risk-summary-narrative.md` (Markdown → `.md` persist path → Phase 7 Path B regex). 
+ +## Plan (additive, non-destructive) + +### Fix #1 — Persistence (scoped allowlist, NOT all `.json`) +- `src/utils/hookDBBridge.js:1445` — persist `risk-summary.json` via a `JSON_REPORT_FILENAMES` allowlist (in `hookDBBridgeConfig.js`). **Do not broaden to all `.json`** (would pull in `*-state.json`, `banker-*.json`, `entities.json`). +- `scripts/backfill-local-to-db.mjs` — `walkMarkdown` additionally ingests files whose basename ∈ the same allowlist; also scan `review-outputs/` for `*-state.json` (pre-existing gap). +- `persistReport` is already content-agnostic; `extractReportKey` already strips `.json` → `review/risk-summary`. No change. + +### Fix #2 — Parser (modularized, unit-tested) +- Extract the JSON risk-block parsing into a **pure exported function** `buildRiskBlocksFromJson(content)` in `kgPhases6to8.js` (refactor commit: byte-equivalent for the legacy schema). +- Extend it to accept `exposure_by_category` + string exposure fields (`weighted_exposure`, `exposure_low/high`) and string `probability` (passed through; already contains `%`). +- **Ordering constraint (from node-creation loop @ line 442 `amounts.slice(0,5)`):** the synth block must list `Exposure:` BEFORE `Mitigation:` so real exposure `$` amounts lead the extracted array (mitigation prose contains `$1,237,262,000`). +- Phase 13 (`kgPhase13ProbabilisticValue.js`) intentionally untouched — it requires numeric `p10/p50/p90` and correctly skips string-only findings. + +### Tests (`node --test`, matching kg-phase13 style) +- `test/sdk/kg-phase7-risk-parser.test.js` — unit tests on `buildRiskBlocksFromJson` for: legacy `risk_categories` numeric, `categories` alias, current `exposure_by_category` + string exposures, exposure-before-mitigation ordering, malformed JSON → `[]`. + +## Audit of the plan +| Dimension | Finding | +|---|---| +| **Blast radius** | Confined to observability/storage: `reports` table + KG `risk`/`closing_condition` nodes. 
Generative pipeline reads `risk-summary.json` from **disk** (`_promptConstants.js:3066`), unaffected. `documentConverter` already excludes `risk-summary.json` (no garbage DOCX). | +| **Best practices** | Allowlist (not blanket `.json`) avoids polluting `reports`. Parser extracted to a pure, testable function. Both persist paths fail-soft (`hookDBBridge.js:6`). | +| **Modularity** | One pure function (`buildRiskBlocksFromJson`) shared intent; one config constant (`JSON_REPORT_FILENAMES`) shared by live hook + backfill. No new tables, no schema migration. | +| **Seamless integration** | Legacy numeric schema preserved (refactor is byte-equivalent); new schema added as fallthrough. Existing 23 kg-phase13 tests must stay green. | +| **Anti-recurrence** | Producer↔consumer drift is the root cause → add the parser unit test built from the real `risk-summary.json` shape as the contract anchor. | + +## Remediation for affected sessions +After fix: ingest `risk-summary.json` + re-run KG (upsert) + reapply embeddings for 2026-06-16 (and later 2026-06-08, 2026-05-27). From c3105bad0539c42de354e3af07d98afc02ad8fc0 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 03:14:29 -0400 Subject: [PATCH 02/11] refactor(kg): extract buildRiskBlocksFromJson pure helper Move the Phase 7 risk-summary JSON parsing into an exported, side-effect-free function so the schema handling is unit-testable in isolation. Byte-equivalent behavior for the legacy risk_categories/categories schema (existing 33 kg-phase tests stay green); no functional change. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/utils/knowledgeGraph/kgPhases6to8.js | 97 +++++++++++-------- 1 file changed, 54 insertions(+), 43 deletions(-) diff --git a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js index ea6d3120e..1384a1cf4 100644 --- a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js +++ b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js @@ -348,6 +348,57 @@ async function phase6_dealStructure(pool, sessionId, evolutionLog, resolver) { console.log(`[KG] Phase 6: ${condCount} conditions, ${entityCount} entities, ${milestoneCount} milestones`); } +/** + * Parse a risk-summary JSON document into uniform { title, block } risk blocks + * that the Phase 7 node-creation loop consumes identically to Markdown-extracted + * blocks. Returns [] for non-JSON or unparseable content (caller then falls back + * to the Markdown regex path). + * + * Extracted from phase7_riskAndFacts and exported so the risk-summary schema + * handling is unit-testable in isolation (test/sdk/kg-phase7-risk-parser.test.js). 
+ */ +export function buildRiskBlocksFromJson(content) { + const blocks = []; + const trimmed = (content || '').trim(); + if (!trimmed.startsWith('{') && !trimmed.startsWith('[')) return blocks; + let parsed; + try { + parsed = JSON.parse(trimmed); + } catch (err) { + console.warn('[KG Phase 6 risk] JSON parse failed, falling back to markdown:', err.message); + return blocks; + } + const categories = parsed.risk_categories || parsed.categories || []; + for (const cat of categories) { + const catName = cat.category || cat.name || 'Uncategorized'; + for (const finding of (cat.findings || [])) { + const fid = finding.id || ''; + const title = (finding.finding || finding.title || finding.name || '').toString(); + if (!title || title.length < 5) continue; + const exposureBits = []; + if (finding.p50 != null) exposureBits.push(`$${(finding.p50 / 1e9).toFixed(2)}B (p50)`); + if (finding.p10 != null && finding.p10 !== finding.p50) exposureBits.push(`$${(finding.p10 / 1e9).toFixed(2)}B (p10)`); + if (finding.p90 != null && finding.p90 !== finding.p50) exposureBits.push(`$${(finding.p90 / 1e9).toFixed(2)}B (p90)`); + if (finding.probability_weighted != null) exposureBits.push(`$${(finding.probability_weighted / 1e9).toFixed(2)}B (probability-weighted)`); + if (finding.npv_at_8pct != null) exposureBits.push(`NPV $${(finding.npv_at_8pct / 1e9).toFixed(2)}B`); + if (finding.dcf_present_value != null) exposureBits.push(`DCF PV $${(finding.dcf_present_value / 1e9).toFixed(2)}B`); + const probPct = finding.probability != null ? `${Math.round(finding.probability * 100)}%` : ''; + const synthBlock = [ + `**${fid ? fid + ': ' : ''}${title}**`, + `Category: ${catName}`, + `Severity: ${finding.severity || cat.severity || 'UNCLASSIFIED'}`, + `Exposure: ${exposureBits.join(', ') || 'unquantified'}`, + probPct ? `Probability: ${probPct}` : '', + finding.source ? `Source: ${finding.source}` : '', + finding.notes ? `Notes: ${finding.notes}` : '', + finding.correlation_note ? 
`Correlation: ${finding.correlation_note}` : '', + ].filter(Boolean).join('\n'); + blocks.push({ title: `${fid ? fid + ': ' : ''}${title}`, block: synthBlock }); + } + } + return blocks; +} + async function phase7_riskAndFacts(pool, sessionId, evolutionLog, resolver, tNumberMap) { let riskCount = 0, factCount = 0; @@ -370,49 +421,9 @@ async function phase7_riskAndFacts(pool, sessionId, evolutionLog, resolver, tNum // Two source formats supported: // - JSON (e.g., risk-summary.json with risk_categories[].findings[]) — code-execution output // - Markdown (e.g., risk-summary-narrative.md with **Title** + $exposure prose blocks) — LLM output - const riskBlocks = []; - - // Path A: detect JSON content (Cardinal-style risk-summary.json) - const trimmed = content.trim(); - if (trimmed.startsWith('{') || trimmed.startsWith('[')) { - try { - const parsed = JSON.parse(trimmed); - const categories = parsed.risk_categories || parsed.categories || []; - for (const cat of categories) { - const catName = cat.category || cat.name || 'Uncategorized'; - for (const finding of (cat.findings || [])) { - // Synthesize a markdown-equivalent block from the JSON finding so the - // downstream regex-based property extractors still work identically. - // Format: **: ** \n exposure $... probability ...% notes... 
- const fid = finding.id || ''; - const title = (finding.finding || finding.title || finding.name || '').toString(); - if (!title || title.length < 5) continue; - const exposureBits = []; - if (finding.p50 != null) exposureBits.push(`$${(finding.p50 / 1e9).toFixed(2)}B (p50)`); - if (finding.p10 != null && finding.p10 !== finding.p50) exposureBits.push(`$${(finding.p10 / 1e9).toFixed(2)}B (p10)`); - if (finding.p90 != null && finding.p90 !== finding.p50) exposureBits.push(`$${(finding.p90 / 1e9).toFixed(2)}B (p90)`); - if (finding.probability_weighted != null) exposureBits.push(`$${(finding.probability_weighted / 1e9).toFixed(2)}B (probability-weighted)`); - if (finding.npv_at_8pct != null) exposureBits.push(`NPV $${(finding.npv_at_8pct / 1e9).toFixed(2)}B`); - if (finding.dcf_present_value != null) exposureBits.push(`DCF PV $${(finding.dcf_present_value / 1e9).toFixed(2)}B`); - const probPct = finding.probability != null ? `${Math.round(finding.probability * 100)}%` : ''; - const synthBlock = [ - `**${fid ? fid + ': ' : ''}${title}**`, - `Category: ${catName}`, - `Severity: ${finding.severity || cat.severity || 'UNCLASSIFIED'}`, - `Exposure: ${exposureBits.join(', ') || 'unquantified'}`, - probPct ? `Probability: ${probPct}` : '', - finding.source ? `Source: ${finding.source}` : '', - finding.notes ? `Notes: ${finding.notes}` : '', - finding.correlation_note ? `Correlation: ${finding.correlation_note}` : '', - ].filter(Boolean).join('\n'); - riskBlocks.push({ title: `${fid ? fid + ': ' : ''}${title}`, block: synthBlock }); - } - } - } catch (err) { - // JSON parse failed; fall through to markdown path - console.warn('[KG Phase 6 risk] JSON parse failed, falling back to markdown:', err.message); - } - } + // Path A: JSON content (risk-summary.json) — extracted to buildRiskBlocksFromJson() + // so the schema handling is unit-testable in isolation. 
+ const riskBlocks = buildRiskBlocksFromJson(content); // Path B: markdown regex (fallback; also runs when JSON path extracted nothing) if (riskBlocks.length === 0) { From 187e8a1c7fe21ea3144f925db68907e0c3258ec4 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:06:52 -0400 Subject: [PATCH 03/11] feat(kg): parse exposure_by_category risk schema + string exposures The current risk-aggregator emits risk-summary.json keyed on exposure_by_category with string exposures ("$433.75M") and string probability ("8% fail"); the parser only handled the legacy risk_categories numeric schema, yielding 0 risk nodes. - Add exposure_by_category as the primary category source (legacy keys preserved). - Synthesize $-amounts from weighted_exposure/exposure_low/high when no numeric p10/p50/p90 bits exist (legacy numeric path unchanged). - Pass string probability through (already carries % for the downstream regex). - Emit Exposure BEFORE Mitigation so amounts.slice(0,5) leads with real exposures, not the RRTF figure embedded in mitigation prose. Fixes #231 (parser half). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/utils/knowledgeGraph/kgPhases6to8.js | 29 ++++++++++++++++--- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js index 1384a1cf4..684591e71 100644 --- a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js +++ b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js @@ -368,7 +368,10 @@ export function buildRiskBlocksFromJson(content) { console.warn('[KG Phase 6 risk] JSON parse failed, falling back to markdown:', err.message); return blocks; } - const categories = parsed.risk_categories || parsed.categories || []; + // Schema support (first non-empty wins): + // - exposure_by_category — CURRENT producer (risk-aggregator.js), string exposures + // - risk_categories / categories — LEGACY, numeric p10/p50/p90 distributions + const categories = parsed.exposure_by_category || parsed.risk_categories || parsed.categories || []; for (const cat of categories) { const catName = cat.category || cat.name || 'Uncategorized'; for (const finding of (cat.findings || [])) { @@ -376,20 +379,38 @@ export function buildRiskBlocksFromJson(content) { const title = (finding.finding || finding.title || finding.name || '').toString(); if (!title || title.length < 5) continue; const exposureBits = []; + // Legacy numeric distribution fields (base units → $B). 
if (finding.p50 != null) exposureBits.push(`$${(finding.p50 / 1e9).toFixed(2)}B (p50)`); if (finding.p10 != null && finding.p10 !== finding.p50) exposureBits.push(`$${(finding.p10 / 1e9).toFixed(2)}B (p10)`); if (finding.p90 != null && finding.p90 !== finding.p50) exposureBits.push(`$${(finding.p90 / 1e9).toFixed(2)}B (p90)`); if (finding.probability_weighted != null) exposureBits.push(`$${(finding.probability_weighted / 1e9).toFixed(2)}B (probability-weighted)`); if (finding.npv_at_8pct != null) exposureBits.push(`NPV $${(finding.npv_at_8pct / 1e9).toFixed(2)}B`); if (finding.dcf_present_value != null) exposureBits.push(`DCF PV $${(finding.dcf_present_value / 1e9).toFixed(2)}B`); - const probPct = finding.probability != null ? `${Math.round(finding.probability * 100)}%` : ''; + // Current schema: pre-formatted string exposures (already contain "$..."). Only + // when no numeric bits were produced, so legacy numeric behavior is unchanged. + if (exposureBits.length === 0) { + if (finding.weighted_exposure) exposureBits.push(String(finding.weighted_exposure)); + if (finding.exposure_low) exposureBits.push(`low ${finding.exposure_low}`); + if (finding.exposure_high) exposureBits.push(`high ${finding.exposure_high}`); + } + // Probability: numeric legacy (0–1) → "NN%"; string current (e.g. "8% fail") + // passed through verbatim (already carries a "%" for the downstream regex). + let probStr = ''; + if (typeof finding.probability === 'number') probStr = `${Math.round(finding.probability * 100)}%`; + else if (typeof finding.probability === 'string' && finding.probability.trim()) probStr = finding.probability.trim(); const synthBlock = [ `**${fid ? fid + ': ' : ''}${title}**`, `Category: ${catName}`, `Severity: ${finding.severity || cat.severity || 'UNCLASSIFIED'}`, + // Exposure MUST precede Mitigation: the node-creation loop takes + // amounts.slice(0,5) in block order, and mitigation prose carries its own + // "$" figures (e.g. the RRTF). 
Leading with exposure keeps exposure_amounts + // accurate. `Exposure: ${exposureBits.join(', ') || 'unquantified'}`, - probPct ? `Probability: ${probPct}` : '', - finding.source ? `Source: ${finding.source}` : '', + probStr ? `Probability: ${probStr}` : '', + finding.section_reference ? `Section: ${finding.section_reference}` : '', + finding.source || finding.source_report ? `Source: ${finding.source || finding.source_report}` : '', + finding.mitigation ? `Mitigation: ${finding.mitigation}` : '', finding.notes ? `Notes: ${finding.notes}` : '', finding.correlation_note ? `Correlation: ${finding.correlation_note}` : '', ].filter(Boolean).join('\n'); From cc08cbb4c15c08e219953cf6cade08ca5cc184a0 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:06:52 -0400 Subject: [PATCH 04/11] test(kg): unit tests for risk-summary parser (legacy + current schema) Contract anchor for #231: pins both the legacy risk_categories numeric schema and the current exposure_by_category string schema, plus the exposure-before-mitigation ordering, malformed-JSON fallback, and precedence. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../test/sdk/kg-phase7-risk-parser.test.js | 137 ++++++++++++++++++ 1 file changed, 137 insertions(+) create mode 100644 super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js diff --git a/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js b/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js new file mode 100644 index 000000000..880d1bf65 --- /dev/null +++ b/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js @@ -0,0 +1,137 @@ +/** + * Unit tests for buildRiskBlocksFromJson (KG Phase 7 risk parser). + * + * Contract anchor for issue #231: the risk-aggregator emits the + * `exposure_by_category` schema with STRING exposures; the parser must produce + * risk blocks whose synth text yields correct downstream extraction. 
Also pins + * the LEGACY `risk_categories`/`categories` numeric schema so it never regresses. + */ +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { buildRiskBlocksFromJson } from '../../src/utils/knowledgeGraph/kgPhases6to8.js'; + +// Mirrors the downstream node-creation loop extractors (kgPhases6to8.js ~431). +const AMOUNT_RE = /\$[\d,.]+[BMK]?/g; +const PROB_RE = /(\d{1,3})[\-–]?(\d{1,3})?%/; + +test('current schema: exposure_by_category with string exposures → risk blocks', () => { + const content = JSON.stringify({ + exposure_by_category: [ + { + category: 'Regulatory/Antitrust', + findings: [ + { + id: 'R-ANT-001', + finding: 'HSR Second Request probable; behavioral-remedy clearance base case.', + severity: 'HIGH', + probability: '8% fail / 65–80% Second Request', + weighted_exposure: '$433.75M (break EL, marked) / $621.71M (headline)', + exposure_low: '$0 (clears)', + exposure_high: '$1.237B RRTF', + mitigation: '$1,237,262,000 regulatory reverse-termination fee (Fox pays).', + source_report: 'antitrust-competition-analyst-report.md', + section_reference: '§IV.A', + }, + ], + }, + ], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1, 'one finding → one block'); + const { title, block } = blocks[0]; + assert.match(title, /^R-ANT-001: /); + + // Downstream amount extraction must LEAD with real exposures, not the + // mitigation RRTF figure (exposure-before-mitigation ordering constraint). + const amounts = block.match(AMOUNT_RE) || []; + assert.ok(amounts.length >= 2, 'multiple exposure amounts extracted'); + assert.equal(amounts[0], '$433.75M', 'first amount is the weighted exposure, not mitigation'); + const exposureIdx = block.indexOf('Exposure:'); + const mitigationIdx = block.indexOf('Mitigation:'); + assert.ok(exposureIdx < mitigationIdx, 'Exposure line precedes Mitigation line'); + + // Probability regex picks up the leading "8%". 
+ const prob = block.match(PROB_RE); + assert.ok(prob && prob[0] === '8%', 'probability "8%" extracted from string'); + + assert.match(block, /Severity: HIGH/); + assert.match(block, /Section: §IV\.A/); +}); + +test('current schema: exposure_low/high used when weighted_exposure absent', () => { + const content = JSON.stringify({ + exposure_by_category: [ + { category: 'Privacy', findings: [ + { id: 'P-1', finding: 'VPPA class-action exposure tied to ACR data.', severity: 'MEDIUM-HIGH', + exposure_low: '$0', exposure_high: '$113M' }, + ] }, + ], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1); + const amounts = blocks[0].block.match(AMOUNT_RE) || []; + assert.ok(amounts.includes('$113M'), 'exposure_high amount present'); +}); + +test('legacy schema: risk_categories with numeric p10/p50/p90 (unchanged behavior)', () => { + const content = JSON.stringify({ + risk_categories: [ + { category: 'Financial', findings: [ + { id: 'F-1', finding: 'Delivered-value erosion vs headline.', severity: 'HIGH', + p10: 1.0e9, p50: 2.09e9, p90: 2.33e9, probability: 0.65 }, + ] }, + ], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1); + const { block } = blocks[0]; + assert.match(block, /\$2\.09B \(p50\)/, 'numeric p50 rendered as $B'); + const prob = block.match(PROB_RE); + assert.ok(prob && prob[0] === '65%', 'numeric probability 0.65 → 65%'); +}); + +test('legacy "categories" alias still works', () => { + const content = JSON.stringify({ + categories: [{ category: 'X', findings: [{ id: 'C-1', finding: 'Some risk finding here.', p50: 5e8 }] }], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1); + assert.match(blocks[0].block, /\$0\.50B \(p50\)/); +}); + +test('malformed JSON → [] (caller falls back to markdown)', () => { + assert.deepEqual(buildRiskBlocksFromJson('{not valid json'), []); +}); + +test('non-JSON content → []', () => { + 
assert.deepEqual(buildRiskBlocksFromJson('## Risk Narrative\n**Some risk** $10M'), []); +}); + +test('empty / missing categories → []', () => { + assert.deepEqual(buildRiskBlocksFromJson('{}'), []); + assert.deepEqual(buildRiskBlocksFromJson(JSON.stringify({ exposure_by_category: [] })), []); + assert.deepEqual(buildRiskBlocksFromJson(''), []); + assert.deepEqual(buildRiskBlocksFromJson(null), []); +}); + +test('findings with too-short titles are skipped', () => { + const content = JSON.stringify({ + exposure_by_category: [{ category: 'X', findings: [ + { id: 'S-1', finding: 'abc' }, // < 5 chars → skipped + { id: 'S-2', finding: 'A real finding title' }, // kept + ] }], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1); + assert.match(blocks[0].title, /^S-2: /); +}); + +test('exposure_by_category takes precedence over legacy keys when both present', () => { + const content = JSON.stringify({ + exposure_by_category: [{ category: 'New', findings: [{ id: 'N-1', finding: 'New schema finding wins.', weighted_exposure: '$5M' }] }], + risk_categories: [{ category: 'Old', findings: [{ id: 'O-1', finding: 'Legacy finding ignored.', p50: 9e9 }] }], + }); + const blocks = buildRiskBlocksFromJson(content); + assert.equal(blocks.length, 1); + assert.match(blocks[0].title, /^N-1: /); +}); From f811ac9cfbff52edbe45db935883766228d7b59f Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:07:55 -0400 Subject: [PATCH 05/11] feat(hookdb): persist risk-summary.json as a review report MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The live PostToolUse hook only persisted .md files, so the risk-aggregator's risk-summary.json never reached the reports table and the KG risk layer + the CRITICAL_REPORTS gate (risk-summary) silently failed. 
- Add JSON_REPORT_FILENAMES allowlist (hookDBBridgeConfig.js) — exact basenames, NOT all .json, so *-state.json / banker-*.json / entities.json are excluded. - Broaden the persist gate to the allowlist. persistReport is content-agnostic; extractReportType maps /review-outputs/ → review and extractReportKey strips .json → report_key 'risk-summary'. Fail-soft (all DB writes are try/caught). Fixes #231 (live-persistence half). Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/config/hookDBBridgeConfig.js | 15 +++++++++++++++ .../src/utils/hookDBBridge.js | 7 ++++++- 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/super-legal-mcp-refactored/src/config/hookDBBridgeConfig.js b/super-legal-mcp-refactored/src/config/hookDBBridgeConfig.js index 6ccdd587f..fd003607f 100644 --- a/super-legal-mcp-refactored/src/config/hookDBBridgeConfig.js +++ b/super-legal-mcp-refactored/src/config/hookDBBridgeConfig.js @@ -80,6 +80,21 @@ export const REPORT_TYPE_MATCHERS = [ export const REPORT_TYPE_DEFAULT = 'document'; +/** + * Non-Markdown report deliverables that MUST be persisted to the `reports` + * table, keyed by exact basename. The risk-aggregator emits structured JSON + * (`risk-summary.json`) consumed by KG Phase 7/13 and the executive-summary + * synthesizer; without this allowlist it is never stored and the KG risk layer + * is silently empty (issue #231). + * + * SCOPED BY DESIGN: persistence gates broaden to these exact filenames, NOT to + * all `.json`, so state sidecars (`*-state.json`), banker context/metadata + * (`banker-*.json`), and `entities.json` are never mis-ingested as reports. 
+ */ +export const JSON_REPORT_FILENAMES = new Set([ + 'risk-summary.json', +]); + // ============================================================ // AGENT TYPE CLASSIFICATION (State Key → agent_type) // ============================================================ diff --git a/super-legal-mcp-refactored/src/utils/hookDBBridge.js b/super-legal-mcp-refactored/src/utils/hookDBBridge.js index ac9e23ba6..3fa93a32c 100644 --- a/super-legal-mcp-refactored/src/utils/hookDBBridge.js +++ b/super-legal-mcp-refactored/src/utils/hookDBBridge.js @@ -35,6 +35,7 @@ import { STATE_FILE_DIR_DEFAULT, AUDIT_SKIP_TOOLS, P0_EXCLUDED_SUFFIXES, + JSON_REPORT_FILENAMES, } from '../config/hookDBBridgeConfig.js'; export const backgroundTasks = new Set(); @@ -1442,7 +1443,11 @@ async function persistHookEvent(pool, sessionCache, hookName, input, result) { const filePath = tool_input?.file_path || ''; if (tool_name === 'Write' && filePath.includes('/reports/')) { - if (filePath.endsWith('.md')) { + // Persist Markdown reports, plus the allowlisted non-Markdown deliverables + // (e.g. risk-summary.json) that downstream KG/synthesis consume. Scoped to + // exact basenames so state/context JSON sidecars are never mis-ingested (#231). 
+ const baseName = filePath.split('/').pop() || ''; + if (filePath.endsWith('.md') || JSON_REPORT_FILENAMES.has(baseName)) { await persistReport(pool, sessionCache, input, result); } const isExcluded = P0_EXCLUDED_SUFFIXES.some(suffix => filePath.endsWith(suffix)); From 0256eb8eaf7dde8e5634b23f3e7aefc7afdef2df Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:08:59 -0400 Subject: [PATCH 06/11] fix(backfill): ingest risk-summary.json + review-outputs state files Bring local backfill to parity with live persistence for recovery of unpersisted sessions: - walkMarkdown now also ingests allowlisted JSON deliverables (risk-summary.json) so the KG risk layer can be rebuilt (#231); scoped by basename, state sidecars excluded. - Scan review-outputs/ (not just qa-outputs/) for *-state.json, so fact-validator / coverage-gap-analyzer / risk-aggregator states are captured. - (Carries the banker_intake/banker_qa report-type matchers added during session recovery so banker artifacts get their canonical types.) Co-Authored-By: Claude Opus 4.8 (1M context) --- .../scripts/backfill-local-to-db.mjs | 349 ++++++++++++++++++ 1 file changed, 349 insertions(+) create mode 100644 super-legal-mcp-refactored/scripts/backfill-local-to-db.mjs diff --git a/super-legal-mcp-refactored/scripts/backfill-local-to-db.mjs b/super-legal-mcp-refactored/scripts/backfill-local-to-db.mjs new file mode 100644 index 000000000..8f2012e06 --- /dev/null +++ b/super-legal-mcp-refactored/scripts/backfill-local-to-db.mjs @@ -0,0 +1,349 @@ +#!/usr/bin/env node +/** + * Backfill Local-to-DB — Upload a fully-completed local session directory + * into PostgreSQL so the staging frontend can display it. + * + * What it does: + * 1. UPDATE sessions row (status, sdk_session_id, counts) from session-summary.json + * 2. INSERT/UPSERT reports for every non-.pandoc .md file (uses extractReportType/Key) + * 3. 
+ INSERT/UPSERT charts/*.png as report_artifacts via persistSessionArtifacts() + * 4. INSERT/UPSERT agent_states from *-state-*.json files + * 5. INSERT session_metrics from counter totals + * + * After running, hit: + * POST /api/admin/sessions//rebuild-artifacts (regenerate PDFs/DOCX) + * POST /api/admin/sessions//rebuild-kg (regenerate entities + KG) + * + * report_embeddings + citation_source_links are NOT populated by this script: the + * setImmediate() side-effect path inside persistReport() runs only from the live + * hook. We replicate the same INSERT here but skip the side effect; + * rebuild-kg will resynthesise citation_source_links and KG. + * + * Usage: + * node scripts/backfill-local-to-db.mjs [--dry-run] [--dir=/abs/path] + */ + +import 'dotenv/config'; +import pg from 'pg'; +import { promises as fs } from 'fs'; +import path from 'path'; +import { createHash } from 'crypto'; +import { fileURLToPath } from 'url'; + +const __dirname = path.dirname(fileURLToPath(import.meta.url)); + +const args = process.argv.slice(2); +const sessionKey = args.find(a => !a.startsWith('--')); +const dryRun = args.includes('--dry-run'); +const sessionDirArg = args.find(a => a.startsWith('--dir='))?.split('=')[1]; + +if (!sessionKey || !/^\d{4}-\d{2}-\d{2}-\d+$/.test(sessionKey)) { + console.error('Usage: node scripts/backfill-local-to-db.mjs [--dry-run] [--dir=/abs/path]'); + process.exit(1); +} + +const sessionDir = sessionDirArg + ? 
path.resolve(sessionDirArg) + : path.resolve(__dirname, '..', 'reports', sessionKey); + +const REPORT_TYPE_MATCHERS = [ + { match: '/documents/', type: 'extraction' }, + { match: '/specialist-reports/', type: 'specialist' }, + { match: '/section-reports/', type: 'section' }, + { match: '/review-outputs/', type: 'review' }, + { match: '/qa-outputs/', type: 'qa' }, + { match: '/remediation-outputs/', type: 'remediation' }, + // Banker Q&A workflow (mirrors src/config/hookDBBridgeConfig.js:76-78) — must + // precede 'final-memorandum'/generic so the KG banker phases (1b/1c) can find + // these by their canonical report_type. Specific filename matches. + { match: 'banker-questions-presented', type: 'banker_intake' }, + { match: 'banker-question-answers', type: 'banker_qa' }, + { match: 'final-memorandum', type: 'final' }, + { match: 'executive-summary', type: 'synthesis' }, + { match: 'research-plan', type: 'synthesis' }, + { match: 'consolidated-footnotes', type: 'synthesis' }, +]; +const REPORT_TYPE_DEFAULT = 'document'; + +// Non-Markdown deliverables persisted as reports by exact basename — mirrors +// JSON_REPORT_FILENAMES in src/config/hookDBBridgeConfig.js so local backfill +// matches live persistence (#231). Scoped: state/context JSON are NOT reports. 
+const JSON_REPORT_FILENAMES = new Set(['risk-summary.json']); + +const AGENT_TYPE_MATCHERS = [ + { match: 'section-writer', type: 'section-writer' }, + { match: 'qa-diagnostic', type: 'qa-diagnostic' }, + { match: 'qa-certifier', type: 'qa-certifier' }, + { match: 'synthesis', type: 'synthesis' }, + { match: 'executive-summary', type: 'executive-summary' }, + { match: 'citation-validator', type: 'citation-validator' }, + { match: 'remediation-wave', type: 'remediation' }, + { match: 'risk-aggregator', type: 'risk-aggregator' }, + { match: 'orchestrator', type: 'orchestrator' }, + { match: 'document-processing', type: 'document-processing' }, + { match: 'research-review', type: 'research-review' }, + { match: 'assembly', type: 'assembly' }, + { match: 'intake-research', type: 'intake-research-analyst' }, +]; + +function extractReportType(filePath) { + for (const { match, type } of REPORT_TYPE_MATCHERS) { + if (filePath.includes(match)) return type; + } + return REPORT_TYPE_DEFAULT; +} + +function extractReportKey(filePath) { + const filename = filePath.split('/').pop() || 'unknown'; + return filename.replace(/\.pandoc\.md$|\.md$|\.json$/, ''); +} + +function extractAgentType(stateKey) { + for (const { match, type } of AGENT_TYPE_MATCHERS) { + if (stateKey.includes(match)) return type; + } + return 'unknown'; +} + +async function walkMarkdown(dir) { + const out = []; + async function recurse(d) { + let entries; + try { entries = await fs.readdir(d, { withFileTypes: true }); } + catch { return; } + for (const e of entries) { + const full = path.join(d, e.name); + if (e.isDirectory()) { + if (['documents', 'charts', 'wrapped-subagent-transcripts'].includes(e.name)) continue; + await recurse(full); + } else if (e.isFile() && ((e.name.endsWith('.md') && !e.name.endsWith('.pandoc.md')) || JSON_REPORT_FILENAMES.has(e.name))) { + out.push(full); + } + } + } + await recurse(dir); + return out; +} + +async function main() { + const exists = await 
fs.access(sessionDir).then(() => true, () => false); + if (!exists) { + console.error(`Session dir not found: ${sessionDir}`); + process.exit(1); + } + + const pool = new pg.Pool({ connectionString: process.env.PG_CONNECTION_STRING }); + + // ── 0. Resolve sessions.id ── + const sessRes = await pool.query( + `SELECT id, status FROM sessions WHERE session_key = $1`, + [sessionKey], + ); + if (sessRes.rows.length === 0) { + console.error(`Session ${sessionKey} not found in DB — create the sessions row first (e.g. via a live run intake) or insert manually.`); + process.exit(1); + } + const sessionId = sessRes.rows[0].id; + console.log(`[Backfill] Session ${sessionKey} → id=${sessionId} (current status=${sessRes.rows[0].status})`); + + // ── 1. Read session-summary.json ── + let summary = {}; + try { + summary = JSON.parse(await fs.readFile(path.join(sessionDir, 'session-summary.json'), 'utf8')); + } catch (err) { + console.warn(`[Backfill] No session-summary.json (${err.message}) — counters skipped`); + } + + // ── 2. 
UPDATE sessions ── + const finalMemoPath = path.join(sessionDir, 'final-memorandum.md'); + let finalWordCount = null; + let sectionCount = null; + try { + const text = await fs.readFile(finalMemoPath, 'utf8'); + finalWordCount = text.split(/\s+/).filter(Boolean).length; + } catch { /* skip */ } + try { + const sectionDir = path.join(sessionDir, 'section-reports'); + const entries = await fs.readdir(sectionDir); + sectionCount = entries.filter(f => f.endsWith('.md') && !f.endsWith('.pandoc.md')).length; + } catch { /* skip */ } + + if (!dryRun) { + await pool.query( + `UPDATE sessions + SET status = 'completed', + sdk_session_id = COALESCE($2, sdk_session_id), + word_count = COALESCE($3, word_count), + section_count = COALESCE($4, section_count), + metadata = metadata || $5::jsonb, + updated_at = NOW() + WHERE id = $1`, + [ + sessionId, + summary.sdk_session_id || null, + finalWordCount, + sectionCount, + JSON.stringify({ + backfilled_from_local: true, + backfilled_at: new Date().toISOString(), + local_summary: summary, + }), + ], + ); + } + console.log(`[Backfill] sessions UPDATE: status=completed, word_count=${finalWordCount}, section_count=${sectionCount}`); + + // ── 3. 
INSERT reports ── + const mdFiles = await walkMarkdown(sessionDir); + console.log(`[Backfill] Found ${mdFiles.length} markdown files`); + let reportsInserted = 0; + for (const file of mdFiles) { + const content = await fs.readFile(file, 'utf8'); + const reportType = extractReportType(file); + const reportKey = extractReportKey(file); + const wordCount = content.split(/\s+/).filter(Boolean).length; + const contentHash = createHash('sha256').update(content).digest('hex'); + if (dryRun) { + console.log(` [dry] ${reportType}/${reportKey} (${wordCount}w)`); + continue; + } + await pool.query( + `INSERT INTO reports (session_id, report_type, report_key, content, + content_hash, word_count, file_path) + VALUES ($1, $2, $3, $4, $5, $6, $7) + ON CONFLICT (session_id, report_type, report_key) + DO UPDATE SET content = EXCLUDED.content, + content_hash = EXCLUDED.content_hash, + word_count = EXCLUDED.word_count, + file_path = EXCLUDED.file_path, + updated_at = NOW()`, + [sessionId, reportType, reportKey, content, contentHash, wordCount, file], + ); + reportsInserted++; + } + console.log(`[Backfill] reports upserted: ${reportsInserted}`); + + // ── 4. INSERT agent_states from *-state-*.json ── + let stateFiles = []; + try { + const rootEntries = await fs.readdir(sessionDir); + for (const e of rootEntries) { + if (e.endsWith('-state.json') || (/-state-.*\.json$/.test(e))) { + stateFiles.push(path.join(sessionDir, e)); + } + } + // qa-outputs and review-outputs may also hold state files (e.g. + // fact-validator-state.json, coverage-gap-analyzer-state.json live in + // review-outputs per STATE_FILE_DIR_MAP) — scan both subdirs. 
+ for (const sub of ['qa-outputs', 'review-outputs']) { + const subEntries = await fs.readdir(path.join(sessionDir, sub)).catch(() => []); + for (const e of subEntries) { + if (e.endsWith('-state.json')) stateFiles.push(path.join(sessionDir, sub, e)); + } + } + } catch { /* none */ } + + let statesInserted = 0; + for (const file of stateFiles) { + try { + const raw = await fs.readFile(file, 'utf8'); + const data = JSON.parse(raw); + const stateKey = path.basename(file, '.json'); + const agentType = extractAgentType(stateKey); + if (dryRun) { + console.log(` [dry] state ${agentType}/${stateKey}`); + continue; + } + await pool.query( + `INSERT INTO agent_states (session_id, agent_type, state_key, state_data, file_path) + VALUES ($1, $2, $3, $4, $5) + ON CONFLICT (session_id, agent_type, state_key) + DO UPDATE SET state_data = EXCLUDED.state_data, + file_path = EXCLUDED.file_path, + updated_at = NOW()`, + [sessionId, agentType, stateKey, JSON.stringify(data), file], + ); + statesInserted++; + } catch (err) { + console.warn(` [warn] state file ${file}: ${err.message}`); + } + } + console.log(`[Backfill] agent_states upserted: ${statesInserted}`); + + // ── 5. 
INSERT session_metrics ── + const c = summary.counters || {}; + const u = summary.tool_usage_breakdown || {}; + if (!dryRun && Object.keys(c).length > 0) { + await pool.query( + `INSERT INTO session_metrics (session_id, total_events, agents_started, agents_stopped, + tool_calls, tool_failures, compactions, total_duration_ms, + gate_checks_passed, gate_checks_total) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) + ON CONFLICT (session_id) DO UPDATE SET + total_events = EXCLUDED.total_events, + agents_started = EXCLUDED.agents_started, + agents_stopped = EXCLUDED.agents_stopped, + tool_calls = EXCLUDED.tool_calls, + tool_failures = EXCLUDED.tool_failures, + compactions = EXCLUDED.compactions, + total_duration_ms = EXCLUDED.total_duration_ms, + gate_checks_passed = EXCLUDED.gate_checks_passed, + gate_checks_total = EXCLUDED.gate_checks_total`, + [ + sessionId, + (u.totalToolCalls || 0) + (c.subagentsStarted || 0) * 2, + c.subagentsStarted || 0, + c.subagentsStopped || 0, + u.totalToolCalls || 0, + c.toolFailures || 0, + c.compactions || 0, + summary.duration_ms || null, + c.gateChecks?.passed || 0, + (c.gateChecks?.passed || 0) + (c.gateChecks?.failed || 0), + ], + ); + console.log(`[Backfill] session_metrics upserted`); + } + + // ── 6. 
INSERT charts as report_artifacts (PDFs/DOCX skipped — rebuild-artifacts will regenerate) ── + const chartsDir = path.join(sessionDir, 'charts'); + let chartsInserted = 0; + try { + const chartFiles = await fs.readdir(chartsDir); + for (const cf of chartFiles) { + if (!cf.endsWith('.png')) continue; + const chartPath = path.join(chartsDir, cf); + const data = await fs.readFile(chartPath); + const filePath = `charts/${cf}`; + if (dryRun) { + console.log(` [dry] chart ${cf} (${data.length} bytes)`); + continue; + } + await pool.query( + `INSERT INTO report_artifacts (session_id, file_name, file_path, category, mime_type, file_size, file_data, source) + VALUES ($1, $2, $3, 'chart', 'image/png', $4, $5, 'local_backfill') + ON CONFLICT (session_id, file_path) DO UPDATE SET + file_data = EXCLUDED.file_data, + file_size = EXCLUDED.file_size`, + [sessionId, cf, filePath, data.length, data], + ); + chartsInserted++; + } + } catch (err) { + console.warn(`[Backfill] charts dir issue: ${err.message}`); + } + console.log(`[Backfill] charts upserted: ${chartsInserted}`); + + await pool.end(); + console.log(`\n[Backfill] DONE${dryRun ? 
' (dry run)' : ''}.`); + if (!dryRun) { + console.log(`\nNext steps (admin endpoints, require admin JWT):`); + console.log(` curl -X POST https:///api/admin/sessions/${sessionKey}/rebuild-artifacts -H "Authorization: Bearer $TOKEN"`); + console.log(` curl -X POST https:///api/admin/sessions/${sessionKey}/rebuild-kg -H "Authorization: Bearer $TOKEN"`); + } +} + +main().catch(err => { + console.error(err); + process.exit(1); +}); From c64bf5ce84e16f9b4dbe9322157173283a2dd451 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:09:15 -0400 Subject: [PATCH 07/11] chore(scripts): add restore-unpersisted-session recovery tool One-command recovery for a completed local session directory that never got a sessions row: bootstraps the row (idempotent), derives transaction_name from banker-deal-context.json, delegates to backfill-local-to-db.mjs, and optionally fires the admin rebuild endpoints. Closes the gap that backfill-local-to-db.mjs aborts when no sessions row exists. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../scripts/restore-unpersisted-session.mjs | 137 ++++++++++++++++++ 1 file changed, 137 insertions(+) create mode 100644 super-legal-mcp-refactored/scripts/restore-unpersisted-session.mjs diff --git a/super-legal-mcp-refactored/scripts/restore-unpersisted-session.mjs b/super-legal-mcp-refactored/scripts/restore-unpersisted-session.mjs new file mode 100644 index 000000000..e8003a4b9 --- /dev/null +++ b/super-legal-mcp-refactored/scripts/restore-unpersisted-session.mjs @@ -0,0 +1,137 @@ +#!/usr/bin/env node +/** + * Restore Unpersisted Session — One-command recovery for a completed local + * session directory that never got a `sessions` row in PostgreSQL. + * + * Why this exists: + * `backfill-local-to-db.mjs` is an UPSERT keyed on an EXISTING sessions row — + * it aborts at step 0 if the row is missing. A session that died before any + * live persistence has no row, so the backfill can never run. 
This wrapper + * bootstraps the row (idempotent), then delegates to the existing, tested + * backfill, then optionally fires the admin rebuilds. + * + * Pipeline: + * 1. INSERT sessions row (session_key + status + sdk_session_id + transaction_name + * derived from banker-deal-context.json/session-summary.json) — ON CONFLICT DO NOTHING + * 2. exec scripts/backfill-local-to-db.mjs --dir= [--dry-run] + * 3. (optional) POST rebuild-artifacts + rebuild-kg when --host and --token given + * + * Usage: + * node scripts/restore-unpersisted-session.mjs [--dir=/abs/path] [--dry-run] + * [--host=https://staging.example] [--token=] + * + * Idempotent: re-running re-upserts reports/states/charts and leaves the row intact. + */ + +import 'dotenv/config'; +import pg from 'pg'; +import { promises as fs } from 'fs'; +import path from 'path'; +import { fileURLToPath } from 'url'; +import { execFileSync } from 'child_process'; + +const __dirname = path.dirname(fileURLToPath(import.meta.url)); + +const args = process.argv.slice(2); +const sessionKey = args.find(a => !a.startsWith('--')); +const dryRun = args.includes('--dry-run'); +const dirArg = args.find(a => a.startsWith('--dir='))?.slice('--dir='.length); // slice, not split('='): values may themselves contain '=' +const host = args.find(a => a.startsWith('--host='))?.slice('--host='.length); +const token = args.find(a => a.startsWith('--token='))?.slice('--token='.length); + +if (!sessionKey || !/^\d{4}-\d{2}-\d{2}-\d+$/.test(sessionKey)) { + console.error('Usage: node scripts/restore-unpersisted-session.mjs [--dir=/abs/path] [--dry-run] [--host=...] [--token=...]'); + process.exit(1); +} + +const sessionDir = dirArg + ? path.resolve(dirArg) + : path.resolve(__dirname, '..', 'reports', sessionKey); + +async function readJson(file) { + try { return JSON.parse(await fs.readFile(file, 'utf8')); } + catch { return null; } +} + +/** Build a human-readable transaction_name from whatever the session left behind.
*/ +function deriveTransactionName(deal, summary) { + const d = deal?.deal || deal || {}; + if (d.target && d.acquirer) return `${d.acquirer} / ${d.target}`; + if (d.target) return d.target; + return summary?.transaction_name || null; +} + +async function main() { + if (!(await fs.access(sessionDir).then(() => true, () => false))) { + console.error(`Session dir not found: ${sessionDir}`); + process.exit(1); + } + if (!process.env.PG_CONNECTION_STRING) { + console.error('PG_CONNECTION_STRING not set (.env) — cannot reach the database.'); + process.exit(1); + } + + const summary = await readJson(path.join(sessionDir, 'session-summary.json')); + const deal = await readJson(path.join(sessionDir, 'banker-deal-context.json')); + const transactionName = deriveTransactionName(deal, summary); + const sdkSessionId = summary?.sdk_session_id || null; + + // ── 1. Bootstrap the sessions row (idempotent) ── + const pool = new pg.Pool({ connectionString: process.env.PG_CONNECTION_STRING }); + const existing = await pool.query('SELECT id, status, transaction_name FROM sessions WHERE session_key = $1', [sessionKey]); + if (existing.rows.length > 0) { + const row = existing.rows[0]; + console.log(`[Restore] sessions row already present (id=${row.id}, status=${row.status}) — skipping bootstrap`); + // The row pre-exists (e.g. an errored live run), so the INSERT was skipped and + // transaction_name was never set. Backfill's UPDATE doesn't touch it either, so + // fill it here — non-destructively, only when currently NULL. 
+ if (transactionName && !row.transaction_name) { + if (dryRun) { + console.log(`[Restore] [dry] would set transaction_name=${JSON.stringify(transactionName)} on existing row (currently NULL)`); + } else { + await pool.query( + `UPDATE sessions SET transaction_name = COALESCE(transaction_name, $2), updated_at = NOW() + WHERE id = $1 AND transaction_name IS NULL`, + [row.id, transactionName], + ); + console.log(`[Restore] transaction_name set to ${JSON.stringify(transactionName)} on existing row`); + } + } + } else if (dryRun) { + console.log(`[Restore] [dry] would INSERT sessions row: key=${sessionKey}, transaction_name=${JSON.stringify(transactionName)}, sdk_session_id=${sdkSessionId}`); + } else { + const ins = await pool.query( + `INSERT INTO sessions (session_key, status, sdk_session_id, transaction_name, metadata) + VALUES ($1, 'in_progress', $2, $3, $4::jsonb) + ON CONFLICT (session_key) DO NOTHING + RETURNING id`, + [sessionKey, sdkSessionId, transactionName, + JSON.stringify({ bootstrapped_by: 'restore-unpersisted-session', bootstrapped_at: new Date().toISOString() })], + ); + console.log(`[Restore] sessions row bootstrapped (id=${ins.rows[0]?.id}, transaction_name=${JSON.stringify(transactionName)})`); + } + await pool.end(); + + // ── 2. Delegate to the existing, tested backfill ── + const backfill = path.resolve(__dirname, 'backfill-local-to-db.mjs'); + const backfillArgs = [backfill, sessionKey, `--dir=${sessionDir}`, ...(dryRun ? ['--dry-run'] : [])]; + console.log(`\n[Restore] → node ${backfillArgs.join(' ')}\n`); + execFileSync(process.execPath, backfillArgs, { stdio: 'inherit' }); + + // ── 3. 
Optional admin rebuilds ── + if (host && token && !dryRun) { + for (const ep of ['rebuild-artifacts', 'rebuild-kg']) { + const url = `${host.replace(/\/$/, '')}/api/admin/sessions/${sessionKey}/${ep}`; + console.log(`\n[Restore] POST ${url}`); + execFileSync('curl', ['-fsS', '-X', 'POST', url, '-H', `Authorization: Bearer ${token}`], { stdio: 'inherit' }); + } + } else if (!dryRun) { + console.log(`\n[Restore] Skipped admin rebuilds (pass --host and --token to run them). Manual:`); + console.log(` curl -X POST /api/admin/sessions/${sessionKey}/rebuild-artifacts -H "Authorization: Bearer $TOKEN"`); + console.log(` curl -X POST /api/admin/sessions/${sessionKey}/rebuild-kg -H "Authorization: Bearer $TOKEN"`); + } + + console.log(`\n[Restore] DONE${dryRun ? ' (dry run — no writes)' : ''}.`); +} + +main().catch(err => { console.error(err); process.exit(1); }); From 03b658d079865a1545f96d8dc6fb9b0e4ec7ed06 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:09:15 -0400 Subject: [PATCH 08/11] chore(scripts): add rebuild-kg-local recovery tool Local, faithful reproduction of POST /api/admin/sessions/:key/rebuild-kg for when no server / admin JWT is available: entity-synthesis + citation-synthesis pre-steps then buildSessionKnowledgeGraph, honoring BANKER_QA_OUTPUT for the banker KG phases. Upsert-only (mirrors the endpoint); --clean is opt-in. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../scripts/rebuild-kg-local.mjs | 124 ++++++++++++++++++ 1 file changed, 124 insertions(+) create mode 100644 super-legal-mcp-refactored/scripts/rebuild-kg-local.mjs diff --git a/super-legal-mcp-refactored/scripts/rebuild-kg-local.mjs b/super-legal-mcp-refactored/scripts/rebuild-kg-local.mjs new file mode 100644 index 000000000..dcb83842e --- /dev/null +++ b/super-legal-mcp-refactored/scripts/rebuild-kg-local.mjs @@ -0,0 +1,124 @@ +#!/usr/bin/env node +/** + * Rebuild KG (local) — Faithful local reproduction of the + * POST /api/admin/sessions/:key/rebuild-kg endpoint (adminRouter.js), for when + * no server / admin JWT is available. Same three steps, same fail-soft order: + * 1. entity-synthesis pre-step (synthesize entities.json if absent) + * 2. citation-synthesis pre-step (rebuild consolidated-footnotes if truncated) + * 3. buildSessionKnowledgeGraph (nodes + edges; banker phases 1b/1c when + * BANKER_QA_OUTPUT=true) + * + * Mirrors the endpoint's UPSERT semantics — it does NOT delete existing KG rows + * (identical to the production endpoint). 
Run with the flag so banker nodes fire: + * BANKER_QA_OUTPUT=true node scripts/rebuild-kg-local.mjs + */ + +import 'dotenv/config'; +import pg from 'pg'; + +const sessionKey = process.argv.find(a => /^\d{4}-\d{2}-\d{2}-\d+$/.test(a)); +const clean = process.argv.includes('--clean'); +if (!sessionKey) { + console.error('Usage: BANKER_QA_OUTPUT=true node scripts/rebuild-kg-local.mjs [--clean]'); + console.error(' --clean: delete this session\'s derived KG (kg_nodes/edges/evolution/provenance,'); + console.error(' NOT kg_messages user-chat) before rebuilding, for a pristine graph.'); + process.exit(1); +} + +async function main() { + const pool = new pg.Pool({ connectionString: process.env.PG_CONNECTION_STRING }); + + const { featureFlags } = await import('../src/config/featureFlags.js'); + console.log(`[KG] BANKER_QA_OUTPUT = ${featureFlags.BANKER_QA_OUTPUT}`); + + const sess = await pool.query('SELECT id FROM sessions WHERE session_key=$1', [sessionKey]); + if (sess.rows.length === 0) { console.error('Session not found'); process.exit(1); } + const sessionId = sess.rows[0].id; + console.log(`[KG] session ${sessionKey} → ${sessionId}`); + + const before = await pool.query( + 'SELECT (SELECT count(*) FROM kg_nodes WHERE session_id=$1) nodes, (SELECT count(*) FROM kg_edges WHERE session_id=$1) edges', + [sessionId]); + console.log(`[KG] before: nodes=${before.rows[0].nodes} edges=${before.rows[0].edges}`); + + // ── 0. optional clean: clear session-scoped derived KG (NOT kg_messages) ── + if (clean) { + // kg_edges cascades from kg_nodes, but delete explicitly + first for clarity. + // kg_evolution/kg_provenance only SET NULL on node/edge delete, so their rows + // would linger — delete them by session_id to avoid stale-build accumulation. 
+ for (const tbl of ['kg_provenance', 'kg_evolution', 'kg_edges', 'kg_nodes']) { + const r = await pool.query(`DELETE FROM ${tbl} WHERE session_id = $1`, [sessionId]); + console.log(`[KG] --clean: deleted ${r.rowCount} from ${tbl}`); + } + } + + // ── 1. entity-synthesis pre-step ── + let entitiesSource = 'native'; + try { + const existing = await pool.query( + `SELECT 1 FROM report_artifacts WHERE session_id=$1 AND file_name='entities.json' LIMIT 1`, [sessionId]); + if (existing.rows.length === 0) { + const { synthesizeEntitiesJson, persistSynthesizedEntities } = await import('../src/utils/entitySynthesis.js'); + const { payload, audit } = await synthesizeEntitiesJson(pool, sessionId, sessionKey); + if (payload.entities.length > 0) { + await persistSynthesizedEntities(pool, sessionId, payload); + entitiesSource = 'synthesized'; + console.log(`[KG] entities synthesized: ${payload.entities.length} (T1=${audit.tier1_count} T2=${audit.tier2_count} T3=${audit.tier3_count} T4=${audit.tier4_count})`); + } else { + console.warn('[KG] entity synthesis yielded 0 — legacy fallback will apply'); + } + } else { + console.log('[KG] entities.json already present — skipping synthesis'); + } + } catch (e) { console.warn(`[KG] entity synthesis failed (continuing): ${e.message}`); } + + // ── 2. 
citation-synthesis pre-step ── + let citationsSource = 'native'; + try { + const cf = await pool.query(`SELECT content FROM reports WHERE session_id=$1 AND report_key='consolidated-footnotes'`, [sessionId]); + const cfContent = cf.rows[0]?.content || ''; + const { + countFootnotesAcrossSectionFiles, isConsolidatedFootnotesTruncated, + synthesizeConsolidatedFootnotes, persistConsolidatedFootnotes, + } = await import('../src/utils/citationSynthesis.js'); + const sectionFnCount = await countFootnotesAcrossSectionFiles(pool, sessionId); + if (isConsolidatedFootnotesTruncated(cfContent, sectionFnCount)) { + const { markdown, citationMapMd, audit } = await synthesizeConsolidatedFootnotes(pool, sessionId, sessionKey); + if (audit.total_footnotes >= 50) { + await persistConsolidatedFootnotes(pool, sessionId, sessionKey, markdown, citationMapMd); + citationsSource = 'synthesized'; + console.log(`[KG] citations synthesized: ${audit.total_footnotes} from ${audit.source_section_count} sections`); + } else { + console.warn(`[KG] citation synthesis yielded ${audit.total_footnotes} (<50) — keeping existing`); + } + } else { + console.log(`[KG] consolidated-footnotes healthy (section fn count=${sectionFnCount}) — skipping`); + } + } catch (e) { console.warn(`[KG] citation synthesis failed (continuing): ${e.message}`); } + + // ── 3. 
build KG ── + const { buildSessionKnowledgeGraph } = await import('../src/utils/knowledgeGraphExtractor.js'); + const result = await buildSessionKnowledgeGraph(pool, sessionId, sessionKey); + console.log(`[KG] build result:`, JSON.stringify(result)); + + const after = await pool.query( + 'SELECT (SELECT count(*) FROM kg_nodes WHERE session_id=$1) nodes, (SELECT count(*) FROM kg_edges WHERE session_id=$1) edges', + [sessionId]); + console.log(`[KG] after: nodes=${after.rows[0].nodes} edges=${after.rows[0].edges}`); + + // node-type + banker breakdown + const types = await pool.query( + `SELECT node_type, count(*) FROM kg_nodes WHERE session_id=$1 GROUP BY node_type ORDER BY 2 DESC`, [sessionId]); + console.log('[KG] node types:'); console.table(types.rows); + const banker = await pool.query( + `SELECT count(*) FROM kg_nodes WHERE session_id=$1 AND (node_type ILIKE '%question%' OR canonical_key ILIKE '%bq%' OR canonical_key ILIKE '%question%' OR properties::text ILIKE '%banker%')`, [sessionId]); + console.log(`[KG] banker-related nodes: ${banker.rows[0].count}`); + const edgeTypes = await pool.query( + `SELECT edge_type, count(*) FROM kg_edges WHERE session_id=$1 GROUP BY edge_type ORDER BY 2 DESC`, [sessionId]); + console.log('[KG] edge types:'); console.table(edgeTypes.rows); + + console.log(`\n[KG] DONE — entities=${entitiesSource}, citations=${citationsSource}`); + await pool.end(); +} + +main().catch(e => { console.error(e); process.exit(1); }); From 968b874148b1c6788016a0cdf80501c73739e684 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:16:03 -0400 Subject: [PATCH 09/11] fix(kg): derive risk exposure_amounts/probability structurally (audit #1/#2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adversarial review found the node loop re-regexed the rendered block, so mitigation $-figures leaked into exposure_amounts (R-ANT-001 captured the $1,237,262,000 
RRTF) and the title's % masqueraded as probability (65-80% instead of the 8% fail prob; 53.8%→8% decimal truncation). - buildRiskBlocksFromJson now emits structured exposureAmounts (from exposure bits only) and probability (first %-token of the probability field, else null). - The node loop prefers these; markdown Path B (no structured fields) still falls back to whole-block regex — unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/utils/knowledgeGraph/kgPhases6to8.js | 22 ++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js index 684591e71..3af7cc444 100644 --- a/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js +++ b/super-legal-mcp-refactored/src/utils/knowledgeGraph/kgPhases6to8.js @@ -414,7 +414,13 @@ export function buildRiskBlocksFromJson(content) { finding.notes ? `Notes: ${finding.notes}` : '', finding.correlation_note ? `Correlation: ${finding.correlation_note}` : '', ].filter(Boolean).join('\n'); - blocks.push({ title: `${fid ? fid + ': ' : ''}${title}`, block: synthBlock }); + // Structured extraction so the downstream node loop need NOT re-regex the + // rendered block (which would pull "$" figures out of mitigation prose and + // a "%" out of the title). exposureAmounts come ONLY from the exposure bits; + // probability is the first %-token of the probability field, else null (#231). + const exposureAmounts = (exposureBits.join(' ').match(/\$[\d,.]+[BMK]?/g) || []).slice(0, 5); + const probability = (probStr.match(/(\d{1,3})[\-–]?(\d{1,3})?%/) || [])[0] || null; + blocks.push({ title: `${fid ? 
fid + ': ' : ''}${title}`, block: synthBlock, exposureAmounts, probability }); } } return blocks; @@ -459,9 +465,15 @@ async function phase7_riskAndFacts(pool, sessionId, evolutionLog, resolver, tNum } // Unified node-creation loop (consumes both JSON-synthesized and markdown-extracted blocks) - for (const { title, block } of riskBlocks) { - const amounts = block.match(/\$[\d,.]+[BMK]?/g) || []; - const probs = block.match(/(\d{1,3})[\-–]?(\d{1,3})?%/); + for (const riskBlock of riskBlocks) { + const { title, block } = riskBlock; + // Prefer structured values from the JSON parser (exposureAmounts/probability) + // so mitigation "$" figures and title "%"s never contaminate them. Markdown + // Path B blocks carry neither field → fall back to whole-block regex (#231). + const amounts = riskBlock.exposureAmounts ?? (block.match(/\$[\d,.]+[BMK]?/g) || []); + const probValue = riskBlock.probability !== undefined + ? riskBlock.probability + : ((block.match(/(\d{1,3})[\-–]?(\d{1,3})?%/) || [])[0] ?? null); const mitigation = block.match(/(?:mitigat|recommend|address|escrow|protect|hedge|covenant)[^.]*\.[^.]*\./i); const consequence = block.match(/(?:consequence|impact|result|exposure|cost|loss|failure)[:\s]*([^.]+\.[^.]*\.)/i); const entities = block.match(/\b(?:SoftBank|ADIA|DigitalBridge|DataBank|Switch|Marc Ganzi|Vantage|Vertical Bridge|Zayo|CFIUS|FCC|IRS|SEC)\b/gi); @@ -472,7 +484,7 @@ async function phase7_riskAndFacts(pool, sessionId, evolutionLog, resolver, tNum canonical_key: `risk:${title.slice(0, 80).toLowerCase().replace(/[^a-z0-9]+/g, '-')}`, properties: { exposure_amounts: amounts.slice(0, 5), - probability: probs ? probs[0] : null, + probability: probValue, mitigation: mitigation ? mitigation[0].trim().slice(0, 400) : null, consequence: consequence ? consequence[1]?.trim().slice(0, 400) || consequence[0]?.trim().slice(0, 400) : null, entities_involved: entities ? 
[...new Set(entities.map(e => e.trim()))] : [], From d572da4af169567dbe9a1469c8c531f5b6bd69c6 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:16:03 -0400 Subject: [PATCH 10/11] test(kg): real-shape regression tests for exposure/probability extraction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pins audit findings #1/#2 with faithful R-ANT-001 shapes: mitigation RRTF figure must not enter exposureAmounts; probability comes from the probability field not a title %; non-quantified probability → null; legacy numeric path unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../test/sdk/kg-phase7-risk-parser.test.js | 60 +++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js b/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js index 880d1bf65..913d34893 100644 --- a/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js +++ b/super-legal-mcp-refactored/test/sdk/kg-phase7-risk-parser.test.js @@ -135,3 +135,63 @@ test('exposure_by_category takes precedence over legacy keys when both present', assert.equal(blocks.length, 1); assert.match(blocks[0].title, /^N-1: /); }); + +// ── Regression anchors for adversarial-audit findings #1 and #2 (real-data shapes) ── + +test('structured exposureAmounts exclude mitigation $-figures (audit #1)', () => { + // Faithful R-ANT-001 replica: mitigation prose carries the $1,237,262,000 RRTF; + // it must NOT appear in exposure_amounts even though exposure bits number < 5. 
+ const content = JSON.stringify({ + exposure_by_category: [{ category: 'Regulatory/Antitrust', findings: [{ + id: 'R-ANT-001', + finding: 'HSR Second Request probable (65–80%); H2-2027 behavioral-remedy clearance base case; ~8% deal-fail.', + severity: 'HIGH', + probability: '8% fail / 65–80% Second Request', + weighted_exposure: '$433.75M (break EL, marked) / $621.71M (headline)', + exposure_low: '$0 (clears)', + exposure_high: '$1.237B RRTF', + mitigation: '$1,237,262,000 regulatory reverse-termination fee (Fox pays).', + }] }], + }); + const block = buildRiskBlocksFromJson(content)[0]; + assert.deepEqual(block.exposureAmounts, ['$433.75M', '$621.71M', '$0', '$1.237B'], + 'exposureAmounts derive only from exposure fields'); + assert.ok(!block.exposureAmounts.includes('$1,237,262,000'), + 'mitigation RRTF figure must not leak into exposureAmounts'); +}); + +test('structured probability is the finding probability, not a title % (audit #2)', () => { + // finding text contains "(65–80%)"; probability field leads with "8% fail". 
+ const content = JSON.stringify({ + exposure_by_category: [{ category: 'X', findings: [{ + id: 'R-ANT-001', + finding: 'HSR Second Request probable (65–80%); ~8% deal-fail base case.', + probability: '8% fail / 65–80% Second Request', + weighted_exposure: '$433.75M', + }] }], + }); + const block = buildRiskBlocksFromJson(content)[0]; + assert.equal(block.probability, '8%', 'probability comes from the probability field, not the title'); +}); + +test('non-quantified probability string → null (no stray title %)', () => { + const content = JSON.stringify({ + exposure_by_category: [{ category: 'X', findings: [{ + id: 'F-SYN-002', + finding: 'Synergies cover only 53.8% of premium; EPS dilutive.', + probability: 'Base case (realized)', + weighted_exposure: '$139.5M', + }] }], + }); + const block = buildRiskBlocksFromJson(content)[0]; + assert.equal(block.probability, null, 'no % in probability field → null, not "8%" from "53.8%"'); +}); + +test('legacy numeric probability still yields structured NN%', () => { + const content = JSON.stringify({ + risk_categories: [{ category: 'F', findings: [{ id: 'F-1', finding: 'Numeric risk finding.', p50: 2.09e9, probability: 0.65 }] }], + }); + const block = buildRiskBlocksFromJson(content)[0]; + assert.equal(block.probability, '65%'); + assert.deepEqual(block.exposureAmounts, ['$2.09B']); +}); From 2755cff21f71f8079dcf3b839d9982277271c4a8 Mon Sep 17 00:00:00 2001 From: Number531 <120485065+Number531@users.noreply.github.com> Date: Wed, 17 Jun 2026 11:17:26 -0400 Subject: [PATCH 11/11] docs(changelog): record KG risk-layer fix (#231) Co-Authored-By: Claude Opus 4.8 (1M context) --- super-legal-mcp-refactored/CHANGELOG.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/super-legal-mcp-refactored/CHANGELOG.md b/super-legal-mcp-refactored/CHANGELOG.md index 984361eaa..cc00d7593 100644 --- a/super-legal-mcp-refactored/CHANGELOG.md +++ b/super-legal-mcp-refactored/CHANGELOG.md @@ -4,6 +4,16 @@ All notable changes to the 
Super Legal MCP Server are documented in this file. ## [Unreleased] +### Fixed — KG risk layer empty for current-format sessions (#231, 2026-06-17) + +The knowledge graph dropped the entire `risk` node layer on every current-format session. Two independent breaks, both required for a risk node to exist: + +- **Persistence** — the risk-aggregator emits structured `review-outputs/risk-summary.json` (`risk-aggregator.js:25`), but the live PostToolUse hook (`hookDBBridge.js`) and the local backfill (`scripts/backfill-local-to-db.mjs`) only persisted `.md`, so it never reached the `reports` table and the KG `CRITICAL_REPORTS` gate (`risk-summary`) timed out. Fixed via a scoped `JSON_REPORT_FILENAMES` allowlist (exact basenames — `*-state.json` / `banker-*.json` / `entities.json` excluded) consulted by both persist paths; `persistReport` is content-agnostic so JSON content stores cleanly as `report_type=review, report_key=risk-summary`. +- **Parser schema drift** — Phase 7 (`kgPhases6to8.js`) keyed on the legacy `risk_categories` schema; the current producer emits `exposure_by_category` with **string** exposures (`"$433.75M"`) and **string** probability (`"8% fail"`). The JSON parser was extracted to a pure, exported `buildRiskBlocksFromJson()` (unit-tested) and extended to the current schema with the legacy numeric path preserved. +- **Extraction quality (adversarial-audit remediation)** — the node loop previously re-regexed the rendered block, leaking mitigation `$`-figures into `exposure_amounts` (the `$1,237,262,000` RRTF) and pulling a stray title `%` as the probability. The parser now emits structured `exposureAmounts`/`probability`; the node loop prefers them, with the Markdown fallback path unchanged. + +Verified on the real `2026-06-16` (Fox/Roku) `risk-summary.json`: 11 risk blocks (was 0), zero mitigation-figure leaks, correct per-finding probabilities. 
Tests: `test/sdk/kg-phase7-risk-parser.test.js` (13 cases, legacy + current schema + the two audit regressions). Backfill also now scans `review-outputs/` for `*-state.json`. Plan + audit: `docs/pending-updates/kg-risk-layer-fix-231.md`. Recovery tooling: `scripts/restore-unpersisted-session.mjs`, `scripts/rebuild-kg-local.mjs`. + ## [8.1.0] - 2026-06-08 — Forced banker intake phase + deterministic phase harness (live-validated on staging) ### Added — Forced banker intake phase + deterministic phase harness (2026-06-08)