
v6.12.0: Deterministic entities.json synthesis for legacy session backfill #148

Merged
Number531 merged 1 commit into main from feat/entity-synthesis-backfill on May 17, 2026


@Number531

Owner

Summary

Transparent 4-tier deterministic synthesizer inside the existing /api/admin/sessions/:key/rebuild-kg admin endpoint. When entities.json is absent from report_artifacts (any pre-v6.11.0 session, or v6.11.0+ sessions where fact-validator skipped emission), the rebuild now synthesizes a deterministic entities.json from data already structured in the DB — zero LLM, zero new endpoints, zero frontend changes, zero DB schema changes — before running the existing 10-phase KG build.

Architectural decisions (user-led)

User-led 8-question architectural review rejected four less-elegant paths:

| Rejected | Why |
|---|---|
| LLM re-invocation of fact-validator | Wasted work: entity data is already structured in DB |
| §II.C "Entity Names" markdown parser | Section is absent in most real production sessions (empirically failed on SpaceX) |
| New /rebuild-entities endpoint + button | Two-click operator UX; existing button can do this transparently |
| Replace v6.11.0 fact-validator emission entirely | Premature optimization 4 hours after ship |

Principle surfaced: "Don't re-derive what's already structured. Read the structured source directly."

Tier composition

  1. Tier 1: Parse ## DEAL_METADATA table from orchestrator-state markdown (target, acquirer, underwriters, key_person)
  2. Tier 2: Static AGENT_REGULATOR_MAP keyed by report_key (architectural knowledge: which agents imply which regulators)
  3. Tier 3: SELECT FROM kg_nodes WHERE node_type='regulator' (catches session-specific regulators like JFTC)
  4. Tier 4: Mine fact_node.properties.fact_name for narrow entity-keyword patterns (Lead Bookrunners, Controlling Shareholder, etc.)
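
As an illustration of Tier 1, here is a minimal sketch of a fail-soft DEAL_METADATA parser. The function name `parseDealMetadata` and the two-column `| field | value |` table layout are assumptions made for this example; the actual code in entitySynthesis.js may differ.

```javascript
// Hypothetical sketch, not the code from entitySynthesis.js.
// Assumes the section holds a two-column markdown table: | field | value |.
function parseDealMetadata(markdown) {
  if (typeof markdown !== 'string') return {};            // fail-soft on bad input
  const lines = markdown.split(/\r?\n/);
  const start = lines.findIndex((l) => /^##\s+DEAL_METADATA\s*$/.test(l.trim()));
  if (start === -1) return {};                             // heading absent: tier yields nothing
  const wanted = new Set(['target', 'acquirer', 'underwriters', 'key_person']);
  const result = {};
  for (const raw of lines.slice(start + 1)) {
    const line = raw.trim();
    if (line.startsWith('#')) break;                       // next heading ends the section
    if (!line.startsWith('|')) continue;                   // ignore prose and blank lines
    // Rows without a trailing pipe lose their last cell here and are skipped below.
    const cells = line.split('|').slice(1, -1).map((c) => c.trim());
    if (cells.length < 2) continue;
    if (cells.every((c) => /^:?-+:?$/.test(c))) continue;  // table separator row
    const key = cells[0].toLowerCase();
    if (wanted.has(key) && cells[1]) result[key] = cells[1]; // unknown fields are skipped
  }
  return result;
}
```

Unknown rows and malformed lines are skipped rather than guessed at, which matches the "skip rather than misclassify" behavior described for the tiers.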

All tiers fail-soft on unknown inputs. Cross-tier dedup by case-insensitive canonical_name; higher-tier wins, loser's match_patterns merge. 50-cap enforced at synthesis + via Zod schema.
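
The dedup rule above can be sketched as follows. The entity shape (`canonical_name`, `match_patterns`, a numeric `tier` where 1 is highest) and the name `mergeTiers` are assumptions for illustration; only the case-insensitive key, the higher-tier-wins rule, the pattern merge, and the cap of 50 come from this PR.

```javascript
// Hypothetical sketch of cross-tier dedup; entity shape and function name are assumed.
const MAX_ENTITIES = 50; // cap stated in the PR

function mergeTiers(entities) {
  // Fail-soft: drop entries without a usable canonical_name.
  const valid = entities.filter(
    (e) => e && typeof e.canonical_name === 'string' && e.canonical_name.trim() !== ''
  );
  // Lower tier number = higher priority, so the winner is always seen first.
  valid.sort((a, b) => a.tier - b.tier);
  const byName = new Map();
  for (const e of valid) {
    const key = e.canonical_name.trim().toLowerCase();
    const patterns = Array.isArray(e.match_patterns) ? e.match_patterns : [];
    const winner = byName.get(key);
    if (!winner) {
      byName.set(key, { ...e, match_patterns: [...new Set(patterns)] });
    } else {
      // Loser's patterns merge into the winner; the loser's other fields are discarded.
      winner.match_patterns = [...new Set([...winner.match_patterns, ...patterns])];
    }
  }
  return [...byName.values()].slice(0, MAX_ENTITIES);
}
```

Sorting by tier before inserting into the `Map` means the cap also truncates from the lowest-priority end, so Tier 1 entities are the last to be dropped.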

Operator experience

Rebuild KG button (existing) gains capability without UX change:

  • Fresh session (has entities.json) → response: entities_source: "native"
  • Legacy session (no entities.json) → synthesizer runs → response: entities_source: "synthesized" + per-tier audit counts
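
The two outcomes above could be produced by a pre-step along these lines. This is a sketch under assumptions: `resolveEntities`, `loadArtifact`, `synthesize`, and `saveArtifact` are hypothetical names injected for the example, and the `null` source on failure is a placeholder, since the PR does not state what the response reports when synthesis fails.

```javascript
// Hypothetical pre-step sketch; all helper names are stand-ins, not adminRouter.js code.
async function resolveEntities(sessionKey, { loadArtifact, synthesize, saveArtifact, log = console }) {
  const existing = await loadArtifact(sessionKey, 'entities.json');
  if (existing) return { entities_source: 'native' };      // fresh session: nothing to do

  try {
    const { entities, audit } = await synthesize(sessionKey);
    await saveArtifact(sessionKey, 'entities.json', JSON.stringify({ entities }));
    return { entities_source: 'synthesized', entities_audit: audit };
  } catch (err) {
    // Fail-soft: log and let the rebuild continue with its existing fallback.
    log.warn(`entity synthesis failed for ${sessionKey}: ${err.message}`);
    return { entities_source: null };                      // assumed placeholder value
  }
}
```

Because the function never throws, the caller can run the existing KG build unconditionally afterwards.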

Test plan

  • 34/34 unit tests passing (test/sdk/entity-synthesis.test.js)
  • 85/85 entities-ecosystem regression suite passing (entity-synthesis + entities-json-schema + fact-validator-entities + kg-phase6-entities)
  • DEAL_METADATA format verified across 4 recent production sessions (SpaceX-2026-05, SpaceX-2026-04, DigitalBridge-2026-03-31, DigitalBridge-2026-03-13) — all 4 have the heading + table
  • Post-deploy: trigger /rebuild-kg against SpaceX session 2026-05-16-1778951162 → expect entities_source: "synthesized", entities_audit.final_count ≈ 15-18, Phase 9 edge count recovery from 267 → ~1,200-1,800
  • Post-deploy: trigger /rebuild-kg against a v6.11.0+ fresh session → expect entities_source: "native", no synthesis log lines

Files

Runtime:

  • super-legal-mcp-refactored/src/utils/entitySynthesis.js (NEW, ~280 LoC)
  • super-legal-mcp-refactored/src/server/adminRouter.js (+35 LoC pre-step in /rebuild-kg handler)

Tests:

  • super-legal-mcp-refactored/test/sdk/entity-synthesis.test.js (NEW, 34 tests across 9 groups)

Docs:

  • CHANGELOG.md (root, v6.12.0 entry)
  • super-legal-mcp-refactored/CHANGELOG.md (canonical detailed entry)

Risk

3/10. Additive code, fail-soft throughout, behind requireAdmin auth, idempotent via ON CONFLICT (session_id, file_path) DO UPDATE. Does not modify any existing rebuild behavior when entities.json already exists. Does not modify the v6.11.0 producer/consumer path. Synthesis failure logs and falls through to existing LEGACY_DIGITALBRIDGE_FALLBACK (i.e., v6.11.0-without-synthesis behavior).
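
To show why the write is idempotent, here is a sketch of an upsert on the (session_id, file_path) key. The table name and conflict columns come from this PR; the `content` column, the PostgreSQL-style `$1` placeholders, and the helper name `buildArtifactUpsert` are assumptions. The returned object is shaped like the `{ text, values }` query config that node-postgres accepts.

```javascript
// Hypothetical sketch; the "content" column name is assumed, not taken from the schema.
function buildArtifactUpsert(sessionId, filePath, content) {
  return {
    text:
      'INSERT INTO report_artifacts (session_id, file_path, content) ' +
      'VALUES ($1, $2, $3) ' +
      // Re-running the rebuild overwrites the same row instead of inserting a duplicate.
      'ON CONFLICT (session_id, file_path) DO UPDATE SET content = EXCLUDED.content',
    values: [sessionId, filePath, content],
  };
}
```

Running the same rebuild twice leaves exactly one entities.json row per session, which is what makes a retry after a partial failure safe.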

Rollback

Revert the adminRouter.js pre-step block (~35 LoC) → /rebuild-kg returns to v6.11.0 behavior. entitySynthesis.js becomes dead code, harmless. ~5 min.

🤖 Generated with Claude Code

…backfill

Transparent 4-tier deterministic synthesizer inside /api/admin/sessions/:key/
rebuild-kg. When entities.json is absent from report_artifacts, synthesizes
from data already structured in the DB before running the existing 10-phase
KG build. Zero LLM, zero new endpoints, zero frontend changes, zero schema.

Tiers (all fail-soft, skip rather than misclassify):
  1. Parse ## DEAL_METADATA from orchestrator-state markdown
  2. Static map: research agent report_keys → regulator catalog
  3. Union with kg_nodes WHERE node_type='regulator' (session-specific)
  4. Mine fact_node.fact_name for narrow entity-keyword patterns

Dedup case-insensitive on canonical_name; higher-tier wins, loser's
match_patterns merge into winner. 50-cap enforced at synthesis + Zod.

Operator UX unchanged — existing "Rebuild KG" button gains capability:
legacy sessions get entities_source: "synthesized" + per-tier audit;
fresh v6.11.0+ sessions continue entities_source: "native". Failure of
synthesis is logged but never blocks rebuild — LEGACY fallback still
kicks in.

Expected SpaceX backfill yield: ~15-18 entities. Phase 9 recovery target:
267 → ~1,500 edges.

Files:
  src/utils/entitySynthesis.js                 (NEW, ~280 LoC)
  src/server/adminRouter.js                    (+35 LoC pre-step)
  test/sdk/entity-synthesis.test.js            (NEW, 34 tests, 9 groups)
  CHANGELOG.md + super-legal-mcp-refactored/CHANGELOG.md

Tests: 85/85 passing across 4 entities-ecosystem files (entity-synthesis,
entities-json-schema, fact-validator-entities, kg-phase6-entities).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Number531

Owner Author

Tracking issue: #149
