fix(kg): deterministic raw-source provenance (NUL strip, flag resolution, embedding/Phase-4b race)#197
Draft
Number531 wants to merge 2 commits into
Draft
fix(kg): deterministic raw-source provenance (NUL strip, flag resolution, embedding/Phase-4b race)#197Number531 wants to merge 2 commits into
Number531 wants to merge 2 commits into
Conversation
…tion, embedding/Phase-4b race Durable backend correction so kg/raw-sources provenance populates reliably on live sessions instead of depending on a one-off backfill + the frontend's semantic fallback. Root causes (all confirmed): a fire-and-forget embedding queue racing the post-session KG build, RAW_SOURCE_EMBEDDING silently inert in dev (flags.env never loaded before the hoisted featureFlags import), and a NUL-byte INSERT that silently dropped sources. A. SourceChunker.chunk() strips NUL (0x00) from chunk content/header — Postgres rejects 0x00; 14/367 Cardinal sources were silently dropped. Mirrors the kgPhase4cNodeEmbeddings precedent. + source-chunker-nul.test.js. B. loadFlagsEnv.js (first import) loads flags.env before featureFlags evaluates, fixing the dev hoisting gap. Boot '[Startup] Feature flags' banner + validateFlags() warns on RAW_SOURCE_EMBEDDING prerequisites (ARCHIVE/HOOK_DB/GEMINI||GOOGLE) and the archive-on/embedding-off divergence. + validate-flags.test.js. C. hookDBBridge SessionEnd: stabilization poll waits for source_chunk_embeddings to drain (count>0 & stable, 60s cap, proceed-on-cap) before the KG build, so Phase 4b sees populated embeddings. Reconciliation safety-net re-runs Phase 4b for built-but-unlinked sessions, gated by one-shot kg_phase4b_repaired marker. D. scripts/run-phase4b.mjs + repairSourceProvenance(): idempotent DELETE + re-run of Phase 4b for one session (no full-rebuild provenance duplication). Hardened phase4b provenance write: truncate source_key to varchar(200) and strip NUL (previously dropped ~6 rows/run silently). E. kg_edges composite indexes (session_id, source_id)/(session_id, target_id) for the fallback neighbor CTE — boot path (plain, pre-listen) + migration 023 (CONCURRENTLY) for live deploys. G. Commit scripts/backfill-source-chunk-embeddings.mjs as the historical-session backfill tool. Rollout: RAW_SOURCE_EMBEDDING stays ON; RAW_SOURCE_EMBEDDING=false is the instant rollback. Route-contract hardening (linkage_mode, threshold) is a separate PR on frontend-revamp. 424/424 KG unit tests pass; Cardinal repaired (171 source_chunk provenance rows, idempotent re-run verified). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ase4b repair
Addresses gaps found in self-review of 684b293a:
- CRITICAL: reconciliation could burn the one-shot kg_phase4b_repaired marker when
the embedding service was unavailable (e.g. RAW_SOURCE_EMBEDDING=true but
EMBEDDING_PERSISTENCE=false → initEmbeddingService never ran → embedDocuments
null → Phase 4b wrote 0 rows → marker set → never retried). Two-part fix:
(a) broaden the boot init gate to also initialize the embedding service when
RAW_SOURCE_EMBEDDING is on (not only EMBEDDING_PERSISTENCE);
(b) probe the embedding service before the reconciliation repair pass and skip
WITHOUT setting the marker when it's unavailable, so the session is retried
once the service is healthy. The marker is only set when the service is
confirmed live (a true zero-match terminal state).
- MEDIUM: backfill-source-chunk-embeddings.mjs had a raw NUL byte embedded in the
stripNul regex literal (made the file non-UTF8). Replaced with the /\0/g escape.
424/424 KG unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Number531
added a commit
that referenced
this pull request
Jun 2, 2026
Renumber 022_kg-nodes-embedding-hnsw → 025 to clear a number collision with main's 022_artifact-source-width (8.0.x wrapped-subagents line) and #197's reserved 023/024. Two differently-named NNN_ migrations produce NO git conflict, so the collision is invisible to conflict review — node-pg-migrate silently skips one on fresh/production deploys. Content is idempotent (CREATE INDEX IF NOT EXISTS) so the renumber is data-safe. Add scripts/check-migration-collisions.mjs + .github/workflows/migration-lint.yml: a CI guard that fails when two migrations share a numeric prefix, running against the PR merge-result so a feature branch colliding with main is caught before merge. This is the SECOND occurrence of the class on this branch (prior: 011→022 rename), so the systemic guard converts an invisible production-migration-skip into a loud red check for all future cross-branch merges. See docs/pending-updates/Banker-Merge-Risk.md §3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this accomplishes
Makes the KG raw-source provenance pipeline deterministic and self-healing, so the
GET /api/db/sessions/:sessionKey/kg/raw-sources/:nodeIdroute (and the frontend Evidence Trail / Findings deep-links that depend on it) reliably resolves to real source attribution instead of silently returning nothing.Two commits:
fix(kg): deterministic raw-source provenance — NUL strip, flag resolution, embedding/Phase-4b racefix(kg): review follow-ups — embedding-service availability before phase4b repairWhy it was performed
The frontend team reported
kg/raw-sourcesreturning{"sources":[]}for the Cardinal reference session. Investigation found three independent latent defects that, combined, meant explicitkg_provenance.source_hashlinks were never produced for any session in the DB:setTimeoutwhose delay is for report writes, not embeddings. The raw-source embedding queue is fire-and-forget. So Phase 4b consistently ran against an emptysource_chunk_embeddingsand wrote zero provenance.RAW_SOURCE_EMBEDDINGinert at runtime: the flag istrueinflags.envbut the dev entrypoint loaded only.env, andfeatureFlagsis a hoisted static import — so the flag resolvedfalseand the embedding dispatcher ran as a no-op stub.invalid byte sequence for encoding "UTF8": 0x00, swallowed by a fire-and-forget.catch().End objective
Future live sessions self-populate
source_chunk_embeddingsand explicitkg_provenance.source_hashwith no manual backfill, and the route uses the preferred deterministic path. A reconciliation safety-net repairs any session that still raced.Changes (file-level)
src/utils/rawSource/SourceChunker.js— strip NUL (\0) from chunk content/header (single chokepoint for all raw-source chunks; mirrors the existingkgPhase4cNodeEmbeddings.jsprecedent). +test/sdk/source-chunker-nul.test.js.src/server/loadFlagsEnv.js(new) +claude-sdk-server.js— first-imported side-effect loader soflags.envlands inprocess.envbefore the hoistedfeatureFlagsimport evaluates (fixes the dev hoisting gap);[Startup] Feature flagsbanner; broadened embedding-service init gate to also fire whenRAW_SOURCE_EMBEDDINGis on.src/config/featureFlags.js—validateFlags()dependency-validation (warns whenRAW_SOURCE_EMBEDDING=truebutARCHIVE/HOOK_DB/(GEMINI||GOOGLE)_API_KEYmissing, and on the archive-on/embedding-off divergence). +test/sdk/validate-flags.test.js. Note: this branch'sfeatureFlags.jswas rebased onto currentmain— it preservesmain'sOPUS_MODEL='claude-opus-4-8'and all wrapped-subagent helpers; onlyvalidateFlags()is added.src/utils/hookDBBridge.js—source_chunk_embeddingsstabilization poll before the KG build (count>0 AND stable across two checks, 60s cap, proceed-on-cap), so Phase 4b sees populated embeddings.src/utils/sessionReconciliation.js— safety-net: re-run Phase 4b forbuiltsessions that have embeddings but nosource_chunkprovenance, gated by a one-shotkg_phase4b_repairedmarker + an embedding-service availability probe (so the marker isn't burned when the service is down).src/utils/knowledgeGraph/repairSourceProvenance.js(new) — idempotent DELETE-then-rerun of Phase 4b for one session (reused by the runner + reconciliation).src/utils/knowledgeGraph/kgPhases1to5.js— Phase 4b provenance write hardened: truncatesource_keytovarchar(200)and strip NUL (previously dropped ~6 rows/run silently).src/db/postgres.js+migrations/023_kg-edges-composite-idx.js+migrations/024_sessions-kg-phase4b-repaired.{up,down}.sql—kg_edges(session_id, source_id)/(session_id, target_id)composite indexes (boot path plain; migration usesCREATE INDEX CONCURRENTLY) + thekg_phase4b_repairedcolumn.scripts/run-phase4b.mjs+scripts/backfill-source-chunk-embeddings.mjs(new) — standalone Phase 4b runner + historical-session backfill tool (for sessions predating the live fixes).Risks
RAW_SOURCE_EMBEDDING=falseis the instant rollback; dispatcher has built-in backpressure (maxQueueDepth=50, concurrency=2); fire-and-forget (never blocks sessions). Watch Gemini spend +source_chunk_embeddingsgrowth post-deploy.kg_phase4b_repairedmarker + embedding-service probe (skip without burning marker if service down).loadFlagsEnvchanges local-dev flag resolution (enables flags.env content)CREATE INDEXlocks largekg_edgesCONCURRENTLY+IF NOT EXISTS.Viability / verification
source-chunker-nul,validate-flags) vianode --test.main;featureFlags.jsconflict resolved preserving main'sOPUS_MODEL=4.8+ wrapped-subagent helpers (onlyvalidateFlagsadded).source_chunk_embeddingsbackfill,kg_edgesindexes,kg_phase4b_repairedcolumn, 2,055 Cardinal chunk rows + 171 explicit provenance rows) are already applied to the shared DB and validated; this PR lands the code that produces them going forward.CI note
This branch does not restore
.github/workflows/kg-tests.yml(that workflow was removed onmain). The two new tests (source-chunker-nul.test.js,validate-flags.test.js) need to be wired intomain's current test setup before merge.Context
Extracted from the
v6.14/banker-qa-phase-1worktree where they were authored out-of-scope; relocated to this isolated branch offmainso the banker PR (#178) stays scoped. Related: #188 (Tier-3 verification), #23/#41 (tool/flag history).🤖 Generated with Claude Code