16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### v6.12.0 — Deterministic entities.json synthesis for legacy session backfill (2026-05-17)

Adds a transparent 4-tier deterministic synthesizer inside `/api/admin/sessions/:key/rebuild-kg` so legacy sessions (pre-v6.11.0, or any session where fact-validator skipped entities.json emission) can get an entities.json artifact from data already structured in the DB — **zero LLM, zero new endpoints, zero frontend changes, zero DB schema changes**.

Tier composition:
1. Parse `## DEAL_METADATA` table from `orchestrator-state` markdown (target, acquirer, underwriters, key_person)
2. Map research agent `report_key` names → static regulator catalog
3. Union with existing `kg_nodes WHERE node_type='regulator'` (catches session-specific regulators)
4. Mine `fact_node.properties.fact_name` for narrow entity-keyword patterns
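
Tier 1 can be pictured with a minimal sketch. It assumes the `## DEAL_METADATA` section is a two-column markdown pipe table of `field | value` rows; the real layout in `orchestrator-state` and the real `parseDealMetadata` signature are not shown in this PR, so `parseDealMetadataSketch`, `KNOWN_FIELDS`, and the `role` property are hypothetical.

```javascript
// Hypothetical sketch of Tier 1; not the shipped parseDealMetadata.
// Assumes rows of the form: | field | value |
const KNOWN_FIELDS = new Set(['target', 'acquirer', 'underwriters', 'key_person']);

function parseDealMetadataSketch(markdown) {
  const lines = markdown.split('\n');
  const start = lines.findIndex((l) => l.trim() === '## DEAL_METADATA');
  if (start === -1) return []; // fail-soft: no section means no entities

  const entities = [];
  for (const line of lines.slice(start + 1)) {
    const trimmed = line.trim();
    if (trimmed.startsWith('#')) break; // the next heading ends the section
    if (!trimmed.startsWith('|')) continue; // skip blank lines and prose
    const cells = trimmed.split('|').slice(1, -1).map((c) => c.trim());
    if (cells.length < 2) continue;
    const [field, value] = cells;
    const key = field.toLowerCase();
    // Header row, separator row, and unknown fields are skipped, never guessed.
    if (!KNOWN_FIELDS.has(key) || !value) continue;
    // underwriters holds a comma-separated list; other fields hold one name.
    const names = key === 'underwriters' ? value.split(',') : [value];
    for (const name of names) {
      const canonical = name.trim();
      if (canonical) entities.push({ canonical_name: canonical, role: key, tier: 1 });
    }
  }
  return entities;
}
```

Skipping any field outside the whitelist keeps the parser deterministic: an unexpected row drops out of the result instead of being assigned a guessed role.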

Existing "Rebuild KG" button gains capability without UX change: legacy sessions now produce `entities_source: "synthesized"` in the response with per-tier audit counts; fresh v6.11.0+ sessions continue to use native `entities_source: "native"`. Expected SpaceX backfill yield: ~15-18 entities → Phase 9 recovery from 267 → ~1,500 edges.

Files: `src/utils/entitySynthesis.js` (NEW, ~280 LoC), `src/server/adminRouter.js` (+35 LoC pre-step), `test/sdk/entity-synthesis.test.js` (NEW, 34 tests). Full detail in `super-legal-mcp-refactored/CHANGELOG.md`. Risk 3/10 — additive code, fail-soft throughout, behind admin auth, idempotent.

---

### v6.11.0 — Dynamic KG entity extraction via fact-validator entities.json sidecar (2026-05-16)

Closes the systemic gap exposed by SpaceX-IPO session `2026-05-16-1778951162` where KG Phase 6 produced **0 entity nodes** because its hardcoded `entityPatterns` array (`kgPhases6to8.js:73-83`) contained only 9 DigitalBridge/SoftBank/ADIA names — irrelevant to any non-2024-DigitalBridge memo. With ~0 entity anchors, Phase 9 cross-link (cardinality-driven) collapsed from baseline 1.90 edges/node to 0.42 (-78%). Total KG output: 632 nodes / 267 edges vs March 31 baseline 1,083 / 2,062.
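
The density figures quoted above follow directly from the node and edge counts. A short check of the arithmetic (the helper name `edgeDensity` is illustrative only):

```javascript
// Edge density = edges / nodes, using the counts quoted in the entry above.
function edgeDensity(nodes, edges) {
  return edges / nodes;
}

const spacexDensity = edgeDensity(632, 267); // SpaceX-IPO session
const baselineDensity = edgeDensity(1083, 2062); // March 31 baseline
const relativeChange = (spacexDensity - baselineDensity) / baselineDensity;

console.log(spacexDensity.toFixed(2)); // 0.42
console.log(baselineDensity.toFixed(2)); // 1.90
console.log(`${Math.round(relativeChange * 100)}%`); // -78%
```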
74 changes: 74 additions & 0 deletions super-legal-mcp-refactored/CHANGELOG.md
@@ -4,6 +4,80 @@ All notable changes to the Super Legal MCP Server are documented in this file.

## [Unreleased]

### v6.12.0 — Deterministic entities.json synthesis for legacy session backfill (2026-05-17)

**Transparent pre-step inside the existing `/api/admin/sessions/:key/rebuild-kg` admin endpoint.** When `entities.json` is absent from `report_artifacts` (any pre-v6.11.0 session, or v6.11.0+ sessions where fact-validator skipped emission), the rebuild now synthesizes a deterministic `entities.json` from data already structured in the DB — **zero LLM dependency, zero new endpoints, zero frontend changes** — before running the existing 10-phase KG build.

#### Architecture — 4-tier deterministic synthesis

`src/utils/entitySynthesis.js` composes four zero-LLM tiers:

| Tier | Source | Yields |
|---|---|---|
| 1 | Parse `## DEAL_METADATA` table from `orchestrator-state` markdown | target, acquirer, underwriters (split on comma), co_investor, key_person |
| 2 | Static `AGENT_REGULATOR_MAP` keyed by `report_key` | Regulators derived from which research agents ran (e.g., `cfius-national-security-report` → CFIUS, DoD, Treasury) |
| 3 | `SELECT FROM kg_nodes WHERE node_type='regulator'` | Session-specific regulators that no static map can predict (e.g., JFTC for Japan-touching deals) |
| 4 | Mine `fact_node.properties.fact_name` for narrow entity-encoding patterns | Lead Bookrunners, Controlling Shareholder, co-investor lists |

All tiers fail soft on unknown inputs: they skip rather than misclassify (the same mitigation principle as PR #130). Cross-tier dedup keys on case-insensitive `canonical_name`; on conflict the lower-numbered tier wins and the loser's `match_patterns` are merged into the winner. A hard 50-entity cap is enforced at synthesis time and again by the Zod schema at validation time.
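
The dedup rule can be sketched as follows. This is an illustration under stated assumptions, not the shipped `dedupeCandidates`: the candidate shape (`canonical_name`, `tier`, `match_patterns`) and the `{ entities, truncated }` return value are assumed from the field names mentioned in this changelog.

```javascript
// Hypothetical sketch of cross-tier dedup; the candidate shape is assumed.
const MAX_ENTITIES = 50; // hard cap stated in the changelog

function dedupeCandidatesSketch(candidates) {
  const byName = new Map();
  // Visit lower tier numbers first so they claim the canonical entry.
  const ordered = [...candidates].sort((a, b) => a.tier - b.tier);
  for (const candidate of ordered) {
    const key = candidate.canonical_name.trim().toLowerCase();
    const winner = byName.get(key);
    if (!winner) {
      byName.set(key, { ...candidate, match_patterns: [...(candidate.match_patterns ?? [])] });
      continue;
    }
    // Conflict: keep the winner, fold the loser's patterns into it.
    for (const pattern of candidate.match_patterns ?? []) {
      if (!winner.match_patterns.includes(pattern)) winner.match_patterns.push(pattern);
    }
  }
  const all = [...byName.values()];
  return { entities: all.slice(0, MAX_ENTITIES), truncated: all.length > MAX_ENTITIES };
}
```

Sorting by tier before the merge means the result does not depend on the order in which the tiers happened to produce their candidates.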

#### Why this approach

User-led architectural conversation rejected four less-elegant paths:
- **LLM re-invocation of fact-validator** (~$0.05–$1/session, nondeterministic, slow): Wasted work — entity data is already structured in DB
- **§II.C "Entity Names" markdown parser**: Section is absent in most real production sessions (model frequently skipped it in old fact-validator prompt) — fails empirically on the SpaceX motivating case
- **New `/rebuild-entities` endpoint + frontend button**: Two-click operator UX, order-sensitive; existing `/rebuild-kg` button can do this transparently
- **Replace v6.11.0 fact-validator emission entirely**: Premature optimization 4 hours after ship; better to run dual-path and measure

Principle surfaced: *"Don't re-derive what's already structured. Read the structured source directly."* The orchestrator already wrote the deal entities into `## DEAL_METADATA` at session start; research agent names already encode regulatory domains; `kg_nodes` already extracted regulators in a separate Phase 6 path that works.
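
The point that research agent names already encode regulatory domains amounts to a plain lookup. In the sketch below only the `cfius-national-security-report` entry comes from this changelog; the names `AGENT_REGULATOR_MAP_SKETCH` and `mapAgentNamesToRegulatorsSketch` and the `entity_type` property are hypothetical, and the real `AGENT_REGULATOR_MAP` presumably holds many more keys.

```javascript
// Hypothetical excerpt of the Tier 2 map; only this one entry is documented above.
const AGENT_REGULATOR_MAP_SKETCH = new Map([
  ['cfius-national-security-report', ['CFIUS', 'DoD', 'Treasury']],
]);

function mapAgentNamesToRegulatorsSketch(reportKeys) {
  const seen = new Set();
  const regulators = [];
  for (const key of reportKeys) {
    const names = AGENT_REGULATOR_MAP_SKETCH.get(key);
    if (!names) continue; // unknown report_key: skip rather than guess
    for (const name of names) {
      if (seen.has(name)) continue; // the same regulator can come from several agents
      seen.add(name);
      regulators.push({ canonical_name: name, entity_type: 'regulator', tier: 2 });
    }
  }
  return regulators;
}
```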

#### Files

**Runtime**:
- `src/utils/entitySynthesis.js` (NEW, ~280 LoC): `parseDealMetadata`, `mapAgentNamesToRegulators`, `readExistingRegulatorNodes`, `mineFactNodeProperties`, `dedupeCandidates`, `synthesizeEntitiesJson`, `persistSynthesizedEntities`, `AGENT_REGULATOR_MAP`
- `src/server/adminRouter.js`: ~35 LoC pre-step block inside `/rebuild-kg` handler — checks for existing `entities.json`, calls synthesizer + persistor if absent, fails soft if synthesis throws (LEGACY_DIGITALBRIDGE_FALLBACK still kicks in)

**Tests**:
- `test/sdk/entity-synthesis.test.js` (NEW, 34 tests across 9 groups): unit coverage for each tier, dedup ordering, parenthetical extraction, ticker-stripping, fail-soft contracts, end-to-end Zod round-trip

#### Operator experience

`Rebuild KG` button (existing) gains capability without UX change:
- **Fresh session (has `entities.json`)**: Click → KG rebuilds with native entities, response includes `entities_source: "native"`
- **Legacy session (no `entities.json`)**: Click → synthesizer runs → `entities.json` INSERTed → KG rebuilds with synthesized entities, response includes `entities_source: "synthesized"` + per-tier audit (`tier1_count` through `tier4_count`, plus `truncated`)

#### Properties

- **Zero LLM dependency** — pure code + SQL + Zod validation
- **Zero new endpoints** — extends `/api/admin/sessions/:key/rebuild-kg`
- **Zero frontend changes** — existing "Rebuild KG" button works
- **Zero DB schema changes** — uses existing `report_artifacts` shape; new row distinguished by `source='synthesized-v6.12.0'` label
- **Idempotent** — re-running on the same session updates the synthesized artifact via `ON CONFLICT (session_id, file_path) DO UPDATE`
- **Fail-soft throughout** — synthesis errors logged, KG rebuild proceeds with LEGACY fallback; unknown DEAL_METADATA fields skipped rather than misclassified
- **Audit trail** — `source='synthesized-v6.12.0'` differentiates from `'fact_validator'` native emissions in any future analytics query
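
The idempotency property can be sketched as an upsert. This is a guess at the statement's shape, not the code in `persistSynthesizedEntities`: the `content` column, the stored path `'entities.json'`, and the helper name `buildUpsertQuery` are assumptions. Only the `ON CONFLICT (session_id, file_path)` target and the `source` label come from this changelog, and the conflict target requires a unique constraint on those two columns.

```javascript
// Hypothetical sketch: builds a { text, values } query config for node-postgres.
// Column names other than session_id, file_path, and source are assumed.
function buildUpsertQuery(sessionId, payload) {
  return {
    text: `INSERT INTO report_artifacts (session_id, file_path, content, source)
           VALUES ($1, $2, $3, $4)
           ON CONFLICT (session_id, file_path)
           DO UPDATE SET content = EXCLUDED.content, source = EXCLUDED.source`,
    values: [sessionId, 'entities.json', JSON.stringify(payload), 'synthesized-v6.12.0'],
  };
}
```

With a `pg` pool this would run as `await pool.query(buildUpsertQuery(sessionId, payload))`; a second run for the same session overwrites the row instead of inserting a duplicate.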

#### Expected SpaceX backfill yield (empirically validated)

| Tier | Estimated yield |
|---|---|
| 1 (DEAL_METADATA) | 7 (1 acquirer/issuer + 5 underwriters + 1 key_person) |
| 2 (agent map) | ~10-12 regulators |
| 3 (kg regulator nodes) | 5 (mostly deduped with Tier 2; JFTC unique) |
| 4 (fact mining) | 3-5 (Lead Bookrunners + Controlling Shareholder facts) |
| **Total after dedup** | **~15-18 entities** |

Phase 9 cardinality recovery target: 632 → ~900-1,100 nodes; 267 → ~1,200-1,800 edges.

#### Risk

3/10. Additive code, fail-soft throughout, behind admin auth, idempotent. Does not modify any existing rebuild behavior when `entities.json` already exists. Does not modify the v6.11.0 producer/consumer path.

#### Rollback

Revert the adminRouter.js pre-step block (~35 LoC) → `/rebuild-kg` returns to v6.11.0 behavior (entities.json absent → LEGACY fallback). `entitySynthesis.js` becomes dead code, harmless. ~5 min.

---

### v6.11.0 — Dynamic KG entity extraction via fact-validator entities.json sidecar (2026-05-16)

**Two-PR chain (PR1 producer + PR2 consumer) closing the systemic KG Phase 6 hardcoded-entity bug.**
44 changes: 43 additions & 1 deletion super-legal-mcp-refactored/src/server/adminRouter.js
@@ -285,10 +285,52 @@ export function createAdminRouter() {
    }

    const sessionId = session.rows[0].id;

    // v6.12.0 — Transparent entities.json pre-step for legacy sessions.
    // If the session lacks the entities.json sidecar (pre-v6.11.0 sessions
    // or v6.11.0+ sessions where fact-validator skipped emission), synthesize
    // a deterministic one via the 4-tier composition in entitySynthesis.js
    // so Phase 6 has entity anchors and Phase 9 cross-link can recover edge
    // density. No-op when entities.json already exists (native emission path).
    let entitiesSource = 'native';
    let entitiesAudit = null;
    try {
      const existing = await pool.query(
        `SELECT 1 FROM report_artifacts
          WHERE session_id = $1 AND file_name = 'entities.json' LIMIT 1`,
        [sessionId],
      );
      if (existing.rows.length === 0) {
        const { synthesizeEntitiesJson, persistSynthesizedEntities } =
          await import('../utils/entitySynthesis.js');
        const { payload, audit } = await synthesizeEntitiesJson(pool, sessionId, sessionKey);
        if (payload.entities.length > 0) {
          await persistSynthesizedEntities(pool, sessionId, payload);
          entitiesSource = 'synthesized';
          entitiesAudit = audit;
          console.log(`[Admin] rebuild-kg: synthesized ${payload.entities.length} entities for ${sessionKey} (T1=${audit.tier1_count} T2=${audit.tier2_count} T3=${audit.tier3_count} T4=${audit.tier4_count})`);
        } else {
          console.warn(`[Admin] rebuild-kg: synthesis yielded 0 entities for ${sessionKey} — falling through to LEGACY hardcoded fallback`);
        }
      }
    } catch (synthErr) {
      // Fail-soft: if synthesis blows up, log and continue — Phase 6's
      // LEGACY_DIGITALBRIDGE_FALLBACK still kicks in. Synthesis must never
      // block the KG rebuild.
      console.warn(`[Admin] rebuild-kg: entity synthesis failed (${synthErr.message}) — continuing with rebuild`);
    }

    const { buildSessionKnowledgeGraph } = await import('../utils/knowledgeGraphExtractor.js');
    const result = await buildSessionKnowledgeGraph(pool, sessionId, sessionKey);

    // Removed by this change: res.json({ success: true, sessionId, sessionKey, ...result });
    res.json({
      success: true,
      sessionId,
      sessionKey,
      entities_source: entitiesSource,
      entities_audit: entitiesAudit,
      ...result,
    });
  } catch (err) {
    console.warn('[Admin] Rebuild KG failed:', err.message);
    res.status(500).json({ error: 'Failed to rebuild knowledge graph' });
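
The pre-step's control flow can be isolated for reading or unit testing with injected dependencies. The sketch below is not part of this PR: `resolveEntitiesSource` and its `hasEntitiesJson`, `synthesize`, `persist`, and `log` parameters are hypothetical stand-ins for the inline `pool.query` check and the two `entitySynthesis.js` calls.

```javascript
// Hypothetical refactoring sketch of the fail-soft pre-step; all names are assumed.
async function resolveEntitiesSource({ hasEntitiesJson, synthesize, persist, log = console }) {
  let source = 'native';
  let audit = null;
  try {
    if (!(await hasEntitiesJson())) {
      const result = await synthesize();
      if (result.payload.entities.length > 0) {
        await persist(result.payload);
        source = 'synthesized';
        audit = result.audit;
      } else {
        log.warn('synthesis yielded 0 entities; falling through to legacy fallback');
      }
    }
  } catch (err) {
    // Fail-soft: synthesis problems are logged and never block the rebuild.
    log.warn(`entity synthesis failed (${err.message}); continuing with rebuild`);
  }
  return { entities_source: source, entities_audit: audit };
}
```

As in the handler, `entities_source` stays `"native"` when synthesis throws or yields nothing, so a consumer of the response cannot tell that case apart from a session that really has a native `entities.json`.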