Skip to content

feat: tag PreCompact checkpoints#285

Merged
EtanHey merged 1 commit into
mainfrom
feat/phase-1-fm6-precompact-tagging
May 16, 2026
Merged

feat: tag PreCompact checkpoints#285
EtanHey merged 1 commit into
mainfrom
feat/phase-1-fm6-precompact-tagging

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented May 15, 2026

Summary

  • add chunk_origin tagging for PreCompact checkpoints without repurposing the overloaded source column
  • backfill existing checkpoint-shaped rows during VectorStore schema init and log the migration count in schema_migrations
  • exclude checkpoint chunks from default search, add include_checkpoints, and expose brain_resume(session_id=None, lookback_days=7) for explicit recovery

Verification

  • pytest tests/test_precompact_chunk_origin.py tests/test_search_filter_params.py tests/test_think_recall_integration.py::TestMCPToolCount -q → 29 passed
  • pytest tests/test_precompact_chunk_origin.py tests/test_search_filter_params.py tests/test_hybrid_search.py tests/test_watcher_bridge.py tests/test_chunk_lifecycle.py tests/test_3tool_aliases.py -q → 135 passed
  • ./scripts/run_tests.sh → 1860 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated pytest 32 passed; bun 1 pass; regression shell passed
  • pre-push hook reran ./scripts/run_tests.sh with the same passing counts
  • snapshot backfill on /tmp/brainlayer-fm6-precompact-snapshot.db → 51 checkpoint rows tagged, 0 manual checkpoint-storage note false positives
  • touched-file ruff check and ruff format --check passed

Notes

  • Whole-repo ruff check / ruff format --check still report pre-existing script lint/format debt outside this PR; touched files are clean.
  • cr review --plain could not start locally because the CLI is not connected to the repository's CodeRabbit organization; requesting CodeRabbit on the PR instead.

Note

Medium Risk
Introduces a schema migration/backfill (chunks.chunk_origin) and changes default search/KG retrieval behavior to exclude checkpoint content, which could affect existing workflows and query results; adds additional retry/overfetch logic around SQLite reads that may impact performance under load.

Overview
Adds first-class chunk_origin classification (new chunks.chunk_origin column + chunk_origin.py detection helpers) and tags incoming data across ingestion paths (store, watcher/queue drain, upserts, chunk updates), including a one-time backfill migration recorded in schema_migrations.

Changes default retrieval to exclude precompact_checkpoint chunks across hybrid/text/vector search, exact chunk-id lookups, and KG fact queries unless include_checkpoints=true, with KNN overfetch/caching tweaks to avoid starving results when checkpoints are filtered.

Exposes an explicit brain_resume MCP tool to fetch recent PreCompact checkpoints (optionally filtered by session_id, with lock-resilient retries) and wires the new parameter/tool through MCP schemas and tests.

Reviewed by Cursor Bugbot for commit 8c758a1. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Add PreCompact checkpoint tagging, filtering, and brain_resume tool to VectorStore search

  • Introduces a chunk_origin column to the chunks table with a backfill migration that detects and tags existing PreCompact checkpoint content using content-marker heuristics in chunk_origin.py.
  • All write paths (store.py, vector_store.py, drain.py, watcher_bridge.py) now compute and persist chunk_origin on insert or update.
  • Vector, binary, and hybrid KNN searches exclude precompact_checkpoint chunks by default, overfetching to compensate and truncating to the requested result count; an include_checkpoints flag opts back in.
  • Adds a new brain_resume MCP tool in search_handler.py that queries recent PreCompact checkpoints filtered by lookback_days and optional session_id, with retry logic on SQLite busy/locked errors.
  • KG fact queries and exact chunk-id lookups also respect the checkpoint exclusion by default.
  • Risk: existing databases undergo a one-time backfill migration on first open; checkpoint count caching (keyed on PRAGMA data_version) adds reads on each KNN query when no cached value exists.

Macroscope summarized 8c758a1.

Summary by CodeRabbit

  • New Features

    • Automatic chunk-origin tagging and persistence, plus a one-time migration/backfill to populate origin data
    • Search controls: new include_checkpoints flag (defaults to exclude) and overfetch logic so exclusions don’t starve results
    • New brain_resume tool to return recent session checkpoint content
    • Queue and ingestion now propagate chunk-origin metadata
  • Tests

    • Extensive end-to-end coverage for detection, storage, migration, search filtering/overfetch, resume, and edge cases

Review Change Stack

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

Walkthrough

This PR introduces chunk-origin classification to identify and filter PreCompact checkpoint content throughout the brainlayer system. New constants and detection logic classify chunks by origin; schema migrations backfill existing data; storage and event pipelines propagate origin metadata; search operations exclude checkpoints by default while allowing explicit inclusion; and a new brain_resume MCP tool retrieves recent checkpoints for session recovery.

Changes

PreCompact Checkpoint Origin System

Layer / File(s) Summary
Chunk origin classification and detection
src/brainlayer/chunk_origin.py
New module defines origin string constants (user_explicit, agent_explicit, auto_summary, precompact_checkpoint, unknown), pattern matchers for PreCompact content, is_precompact_checkpoint_content() helper checking prefixes and wrapped markers, and detect_chunk_origin() returning valid explicit origins, precompact classification, or unknown.
SQLite busy helper
src/brainlayer/_helpers.py
Adds APSW import and _is_sqlite_busy_error() to classify busy/locked errors for retry logic.
Vector store schema initialization and migration
src/brainlayer/vector_store.py
Schema startup adds chunk_origin TEXT DEFAULT 'unknown' to chunks, conditionally ALTERs existing databases, performs one-time backfill migration (2026_05_16_fm6_chunk_origin) classifying legacy chunks via content heuristics, detects column presence in readonly mode, and creates idx_chunks_chunk_origin.
Chunk insertion and upsert with origin
src/brainlayer/store.py, src/brainlayer/vector_store.py
store_memory() extends INSERT to include chunk_origin computed via detect_chunk_origin(content). VectorStore.upsert_chunks includes origin in conflict-update CASE logic, preserving existing non-unknown origins and recomputing when content changes. update_chunk recomputes origin on content change. get_chunk returns origin when available and code invalidates checkpoint-count cache on relevant writes.
Event pipeline chunk origin propagation
src/brainlayer/drain.py, src/brainlayer/watcher_bridge.py, src/brainlayer/queue_io.py
drain computes origin in _apply_store, _apply_watcher, _apply_hook. watcher_bridge derives origin from cleaned content and wires it into both arbitrated queue (enqueue_watcher_chunk) and direct SQLite insert paths. enqueue_watcher_chunk accepts optional chunk_origin and includes it in the queue payload.
Search repository filtering and overfetch logic
src/brainlayer/search_repo.py
Adds include_checkpoints: bool = False to SearchMixin.search, _binary_search, and hybrid_search. Implements checkpoint-exclusion SQL clause, cached checkpoint-count lookup keyed by PRAGMA data_version with retry-on-busy, and _effective_knn_k() to increase KNN overfetch when checkpoints are excluded. Appends checkpoint-exclusion clauses across embedding KNN, text LIKE, binary KNN, and hybrid FTS5 paths, truncates KNN results to n_results, and threads flag into hybrid cache-key.
KG search checkpoint exclusion
src/brainlayer/kg_repo.py
KGMixin.kg_search gains include_checkpoints and, when excluding checkpoints and _has_chunk_origin is true, LEFT JOINs the source chunk and filters out KG facts sourced from precompact checkpoints; kg_hybrid_search forwards the flag.
MCP tool registration and schema
src/brainlayer/mcp/__init__.py
Imports _brain_resume, adds include_checkpoints boolean (default False) to brain_search and brain_recall (search-mode) input schemas, registers new read-only idempotent brain_resume tool with session_id and lookback_days inputs, and routes calls forwarding include_checkpoints.
Search handler checkpoint filtering and resume
src/brainlayer/mcp/search_handler.py
Adds _utcnow_iso(). _exact_chunk_lookup_result, _brain_search, _brain_recall, and _search gain include_checkpoints parameter; exact lookup rejects precompact chunks when include_checkpoints=False. Implements _escape_like_pattern, a resume-query retry wrapper that uses _is_sqlite_busy_error, and _brain_resume(session_id, lookback_days) which fetches recent precompact checkpoint chunks (optionally scoped to a session), formats a markdown-like response, returns “No PreCompact checkpoints found.” when empty, and surfaces errors in structured outputs.
Comprehensive test coverage
tests/test_precompact_chunk_origin.py, tests/test_search_filter_params.py, tests/test_think_recall_integration.py, tests/test_tool_annotations.py
New tests validate detection across formats, migration/backfill correctness, upsert/update origin recomputation, watcher/drain propagation, default-exclude search with optional inclusion, vector/binary overfetch behavior when filtering discards neighbors, cache invalidation, readonly legacy DB handling without chunk_origin, and brain_resume behaviors including retry, wildcard escaping, and error embedding. brain_search schema tests updated and MCP tool-count/annotations updated to include brain_resume.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • EtanHey/brainlayer#79: Overlaps hybrid-search caching/keying and filter adjustments in search_repo.py.
  • EtanHey/brainlayer#263: Modifies MCP search flow and exact lookup routing, adjacent to _brain_search/_search changes.
  • EtanHey/brainlayer#214: Changes MCP→search plumbing and shared handler signatures similar to this PR's include_checkpoints threading.

Poem

🐰 I sniffed the chunks both near and far,
Marked PreCompact checkpoints like a star,
I routed, filtered, and hit resume,
Hopped through code to find their room,
A rabbit's cheer for ordered charm!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: tag PreCompact checkpoints' accurately describes the main change: introducing a chunk_origin tagging system for PreCompact checkpoint detection and filtering.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phase-1-fm6-precompact-tagging

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@coderabbitai review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@cursor @BugBot review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@codex review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 15, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Comment thread src/brainlayer/mcp/search_handler.py Outdated
Comment thread src/brainlayer/vector_store.py
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c615e4f76

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/search_repo.py Outdated
Comment on lines +361 to +362
if not include_checkpoints:
where_clauses.append("COALESCE(c.chunk_origin, 'unknown') != 'precompact_checkpoint'")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Over-fetch KNN results before dropping checkpoints

When checkpoint chunks are among the nearest k neighbors, this new predicate can discard the entire sqlite-vec candidate set because k is still only n_results unless one of the older filters triggers over-fetch. I reproduced this with 20 precompact_checkpoint vectors closer than 3 normal vectors: search(query_embedding=..., n_results=5) returned no rows even though normal chunks matched just outside the initial KNN window; the same issue applies to the binary KNN path below. Please treat the default checkpoint exclusion like the other post-KNN filters and increase effective_k before applying it.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c0529bb: default checkpoint exclusion now triggers KNN over-fetch for both float and binary vector search paths. Added regressions for checkpoint-nearest-neighbor windows.

Comment thread src/brainlayer/mcp/search_handler.py
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from 3c615e4 to c0529bb Compare May 15, 2026 22:11
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@coderabbitai review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@codex review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@cursor @BugBot re-review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 15, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0529bba23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1324 to +1328
WHEN excluded.chunk_origin IS NOT NULL AND excluded.chunk_origin != 'unknown'
THEN excluded.chunk_origin
WHEN chunks.chunk_origin IS NULL
THEN COALESCE(excluded.chunk_origin, 'unknown')
ELSE chunks.chunk_origin
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recompute chunk_origin when upserted content changes

When an indexed source is reprocessed with the same stable chunk id (for example the source_file:index ids produced by index_new), this conflict handler updates content but preserves an existing precompact_checkpoint origin whenever the new content classifies as unknown. If a previous checkpoint-shaped chunk is replaced by normal content, the row remains tagged as a checkpoint and is silently filtered out of default search even though its content is no longer a checkpoint; the origin needs to track the newly upserted content in this case.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 97033ec: upsert_chunks now recomputes chunk_origin when a stable chunk_id receives different content, so a checkpoint row can become unknown/normal content instead of staying checkpoint-tagged. Added regression: test_upsert_chunks_recomputes_origin_when_stable_chunk_content_changes.

Comment thread src/brainlayer/search_repo.py Outdated
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from c0529bb to 97033ec Compare May 15, 2026 22:28
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@cursor @BugBot re-review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@codex review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@coderabbitai review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 15, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97033ec905

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/search_repo.py Outdated
checkpoint_count = 0
if checkpoint_count <= 0:
return n_results
return min(n_results + checkpoint_count, 1000)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid starving normal results when checkpoints exceed the cap

When a DB has more than 1000 checkpoint vectors that rank ahead of a normal chunk for a query, this cap still asks sqlite-vec for only the first 1000 neighbors and the later chunk_origin != 'precompact_checkpoint' filter removes all of them, so default vector/hybrid search can return no normal results even though matching non-checkpoint chunks exist. I reproduced this with 1001 precompact_checkpoint rows at the query vector plus one nearby unknown row; search(..., n_results=1) returned an empty result. Since these checkpoints are expected to accumulate over time, the overfetch needs to avoid a fixed cap or apply the exclusion before/inside candidate selection.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 05d215f: checkpoint-aware KNN widening no longer has the fixed 1000 cap for checkpoint filtering, so a DB with 1001+ nearer checkpoint vectors still fetches far enough to return normal chunks. Added regression: test_vector_search_does_not_starve_normal_results_after_many_checkpoints.

Comment thread src/brainlayer/mcp/search_handler.py
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from 97033ec to 05d215f Compare May 15, 2026 22:42
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Comment thread src/brainlayer/mcp/search_handler.py Outdated
Comment thread src/brainlayer/mcp/search_handler.py
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce112cc42e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 581 to +584
n_results=num_results,
entity_name=entity_name,
project_filter=normalized_project,
include_checkpoints=include_checkpoints,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter checkpoint-derived KG facts

When an entity is detected, include_checkpoints is only forwarded to the chunk search leg of kg_hybrid_search; both the SQL fact path above and kg_hybrid_search() still read KG relations without joining source_chunk_id back to chunks.chunk_origin. In the default include_checkpoints=False case, any facts extracted from PreCompact checkpoint chunks can still be returned for entity searches even though checkpoint chunks themselves are hidden, so checkpoint pollution remains visible through facts.

Useful? React with 👍 / 👎.

Comment on lines +695 to +696
where_clauses = [
"COALESCE(chunk_origin, 'unknown') = ?",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard brain_resume on legacy readonly databases

brain_resume unconditionally references chunk_origin, but VectorStore._init_readonly_db() explicitly supports opening existing databases without running migrations and records _has_chunk_origin=False when that column is absent. In that readonly/pre-migration context this query fails with no such column: chunk_origin and the new resume tool returns an error instead of falling back to content-pattern detection or an empty result.

Useful? React with 👍 / 👎.

@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from ce112cc to f5e58df Compare May 15, 2026 23:37
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@coderabbitai review
@codex review
@cursor @BugBot re-review

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

@EtanHey Sure, I'll kick off a review of PR #285!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5e58df8df

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/store.py Outdated
content[:200], # summary = first 200 chars
json.dumps(tags) if tags else None,
float(importance) if importance is not None else None,
detect_chunk_origin(content),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Invalidate checkpoint-count cache after storing checkpoints

When a long-lived VectorStore has already populated _checkpoint_count_cache (for example after any prior default vector search), this path can now insert a PreCompact checkpoint but only clears the hybrid cache; it never calls _invalidate_checkpoint_count_cache(). If the new checkpoint has an embedding (or is embedded before another invalidation), the next default KNN search still uses the stale lower k, sqlite-vec can return the checkpoint as the top candidate, and the checkpoint filter then drops it, yielding zero or too few normal memories until another code path happens to invalidate the count.

Useful? React with 👍 / 👎.

Comment thread src/brainlayer/store.py Outdated
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from f5e58df to 0199ae3 Compare May 15, 2026 23:51
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 15, 2026

@coderabbitai review
@codex review
@cursor @BugBot re-review

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

@EtanHey Sure, kicking off another review of PR #285! I'll go through all the changes in detail.

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0199ae3565

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +136 to +138
cached_count = getattr(self, "_checkpoint_count_cache", None)
if cached_count is not None:
checkpoint_count = cached_count
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Refresh checkpoint counts after external writes

When a long-lived MCP VectorStore has already cached 0 checkpoints, checkpoints inserted by another process (for example the watcher/drain writer) never invalidate this in-memory value, so subsequent vector searches keep using k = n_results while the SQL filter removes the new nearest checkpoint rows. In that common multi-process path, normal memories just behind the checkpoint rows can disappear from default search until the server restarts; the count needs a TTL/DB-change check or should be recomputed instead of being cached indefinitely.

Useful? React with 👍 / 👎.

Comment thread src/brainlayer/search_repo.py Outdated
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from 0199ae3 to 46778ae Compare May 16, 2026 00:06
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 16, 2026

@coderabbitai review
@codex review
@cursor @BugBot re-review

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

@EtanHey Sure, I'll kick off another review of PR #285!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🚀

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/vector_store.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/brainlayer/mcp/search_handler.py (1)

331-375: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add retry logic to _kg_facts_sql for transient SQLite locks.

The function executes one-shot queries without retrying on SQLITE_BUSY errors. The catch-all except Exception as e: block masks these lock errors and returns [], causing intermittent loss of KG facts during concurrent MCP writes. This violates the guideline requiring retry logic on SQLITE_BUSY errors in src/brainlayer/**/*.py.

Implement retry logic similar to _execute_resume_query_with_retry or convert the function to async with explicit retry handling like _search.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/brainlayer/mcp/search_handler.py` around lines 331 - 375, The
_kg_facts_sql function currently performs one-shot cursor.execute calls and
swallows all exceptions, causing intermittent loss on SQLITE_BUSY; update
_kg_facts_sql to retry the DB queries on transient SQLite lock errors
(sqlite3.OperationalError with "database is locked" or error code SQLITE_BUSY)
using the same pattern as _execute_resume_query_with_retry (or the retry/backoff
logic in _search): wrap the cursor.execute blocks that fetch the entity id and
the facts_raw query in a retry loop with a small exponential backoff and a
capped max attempts, only re-raising or returning [] after retries are
exhausted, and keep existing behavior for non-lock exceptions. Ensure you
reference and update the calls that use cursor.execute in _kg_facts_sql so both
the id lookup and the SELECT for facts are retried.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/brainlayer/_helpers.py`:
- Around line 70-76: The current _is_sqlite_busy_error only checks
apsw.BusyError and two message strings, which can miss SQLITE_LOCKED paths;
update _is_sqlite_busy_error to also treat apsw.LockedError, and inspect the
APSW/SQLite result code on the exception (e.g., check getattr(exc, "result",
None) or any APSW result_code attribute) for SQLITE_BUSY and SQLITE_LOCKED (use
the apsw constants) and finally fall back to the existing casefold message
checks; reference the function _is_sqlite_busy_error and the apsw exception
classes (apsw.BusyError, apsw.LockedError) and the SQLite result codes
(SQLITE_BUSY, SQLITE_LOCKED) so retries trigger correctly under write
contention.

In `@src/brainlayer/mcp/search_handler.py`:
- Around line 184-185: The check for checkpoint chunks currently returns None
which allows the caller to continue into the generic search path; instead,
short-circuit by returning the same empty-result value the explicit chunk_id=
path uses (i.e., an empty list/collection of matches) so a checkpoint chunk-id
resolves as a clean miss; update the branch in search_handler.py that checks
include_checkpoints and CHUNK_ORIGIN_PRECOMPACT_CHECKPOINT (the
chunk.get("chunk_origin") check) to return that empty-result type rather than
None.
- Around line 770-772: The except block currently builds error_result =
_error_result(f"Resume error: {exc}") and returns (error_result.content, {})
which strips the isError flag; change the handler to return the
_error_result(...) directly (i.e., return _error_result(f"Resume error: {exc}"))
so the MCP caller receives the full error object; update the except in
search_handler (the block that references error_result and _error_result) to
mirror other handlers in this file that return _error_result(...) directly.

---

Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 331-375: The _kg_facts_sql function currently performs one-shot
cursor.execute calls and swallows all exceptions, causing intermittent loss on
SQLITE_BUSY; update _kg_facts_sql to retry the DB queries on transient SQLite
lock errors (sqlite3.OperationalError with "database is locked" or error code
SQLITE_BUSY) using the same pattern as _execute_resume_query_with_retry (or the
retry/backoff logic in _search): wrap the cursor.execute blocks that fetch the
entity id and the facts_raw query in a retry loop with a small exponential
backoff and a capped max attempts, only re-raising or returning [] after retries
are exhausted, and keep existing behavior for non-lock exceptions. Ensure you
reference and update the calls that use cursor.execute in _kg_facts_sql so both
the id lookup and the SELECT for facts are retried.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e7d73754-e91c-461a-8008-0c8bfe53c87a

📥 Commits

Reviewing files that changed from the base of the PR and between cf6b5f7 and 46778ae.

📒 Files selected for processing (15)
  • src/brainlayer/_helpers.py
  • src/brainlayer/chunk_origin.py
  • src/brainlayer/drain.py
  • src/brainlayer/kg_repo.py
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/queue_io.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/store.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/watcher_bridge.py
  • tests/test_precompact_chunk_origin.py
  • tests/test_search_filter_params.py
  • tests/test_think_recall_integration.py
  • tests/test_tool_annotations.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • src/brainlayer/_helpers.py
  • tests/test_think_recall_integration.py
  • src/brainlayer/queue_io.py
  • src/brainlayer/chunk_origin.py
  • tests/test_tool_annotations.py
  • src/brainlayer/store.py
  • src/brainlayer/watcher_bridge.py
  • tests/test_search_filter_params.py
  • src/brainlayer/drain.py
  • src/brainlayer/kg_repo.py
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/mcp/search_handler.py
  • tests/test_precompact_chunk_origin.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/_helpers.py
  • src/brainlayer/queue_io.py
  • src/brainlayer/chunk_origin.py
  • src/brainlayer/store.py
  • src/brainlayer/watcher_bridge.py
  • src/brainlayer/drain.py
  • src/brainlayer/kg_repo.py
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/mcp/search_handler.py
🪛 OpenGrep (1.20.0)
src/brainlayer/vector_store.py

[ERROR] 233-233: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)

🔇 Additional comments (32)
tests/test_precompact_chunk_origin.py (1)

25-830: LGTM!

tests/test_search_filter_params.py (1)

76-95: LGTM!

Also applies to: 202-273, 296-297, 448-484

tests/test_think_recall_integration.py (1)

249-255: LGTM!

tests/test_tool_annotations.py (1)

1-1: LGTM!

Also applies to: 14-15, 30-31, 51-52, 74-76

src/brainlayer/_helpers.py (1)

7-8: LGTM!

src/brainlayer/chunk_origin.py (1)

1-55: LGTM!

src/brainlayer/drain.py (1)

22-22: LGTM!

Also applies to: 141-142, 195-196, 233-234

src/brainlayer/store.py (1)

39-39: LGTM!

Also applies to: 147-147, 153-154, 171-171, 195-196

src/brainlayer/vector_store.py (9)

11-11: LGTM!

Also applies to: 37-41, 83-84, 103-104


109-118: LGTM!


178-189: LGTM!


230-234: LGTM!

Note: The static analysis SQL injection warning at line 233 is a false positive — col and typ values originate from a hardcoded list (lines 193–230), not user input.


236-279: LGTM!

Previous review concerns addressed: INSERT OR IGNORE handles concurrent migration attempts (line 271), and _checkpoint_wal_full is called before (line 242) and after (line 278) the bulk backfill.


300-300: LGTM!


1326-1327: LGTM!

Also applies to: 1340-1349, 1365-1365, 1374-1374


1394-1395: LGTM!

Also applies to: 1412-1412


1456-1462: LGTM!

Also applies to: 1485-1485

src/brainlayer/watcher_bridge.py (1)

22-22: LGTM!

Also applies to: 272-272, 288-288, 297-298, 313-313

src/brainlayer/kg_repo.py (2)

979-1055: LGTM!


1086-1091: LGTM!

src/brainlayer/search_repo.py (10)

126-130: LGTM!


132-141: LGTM!


143-182: LGTM!


420-427: LGTM!


443-458: LGTM!


504-511: LGTM!


766-800: LGTM!


921-921: LGTM!

Also applies to: 959-959


976-1017: LGTM!


1077-1088: LGTM!

src/brainlayer/mcp/__init__.py (1)

42-45: LGTM!

Also applies to: 509-549, 826-830, 1234-1267, 1345-1346

src/brainlayer/queue_io.py (1)

73-104: LGTM!

Comment thread src/brainlayer/_helpers.py
Comment thread src/brainlayer/mcp/search_handler.py Outdated
Comment thread src/brainlayer/mcp/search_handler.py Outdated
@EtanHey EtanHey force-pushed the feat/phase-1-fm6-precompact-tagging branch from 46778ae to 8c758a1 Compare May 16, 2026 00:25
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 16, 2026

@coderabbitai review
@codex review
@cursor @BugBot re-review

Updated PR-A FM6 head to 8c758a1 after addressing latest review findings: APSW LockedError handling, exact checkpoint chunk-id no-fallback behavior, brain_resume MCP error preservation, and migration/runtime wrapped-marker parity.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

@EtanHey Sure! I'll kick off a fresh review of PR #285 against the updated head (8c758a1dd6ed5517a52983b9ef2f868dea66c332). The fixes you've described—APSW LockedError handling, exact checkpoint chunk-id no-fallback behavior, brain_resume MCP error preservation, and migration/runtime wrapped-marker parity—all sound like solid improvements. I'll cover those areas closely in the review.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c758a1dd6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +243 to +247
cursor.execute(
"""
UPDATE chunks
SET chunk_origin = ?
WHERE COALESCE(chunk_origin, ?) != ?
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle LockedError during checkpoint backfill

When two BrainLayer processes open the same database for the first time after this change, this new backfill can hit an APSW LockedError/database table is locked on chunks; _init_db_with_retry() only retries apsw.BusyError, so a transient table lock here bypasses the startup backoff and fails MCP/store initialization instead of retrying. Please route the migration/update path through the same busy/locked classification used for reads, or catch LockedError in the init retry loop.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.

if not getattr(self, "_has_chunk_origin", True):
return None
column = f"{table_alias}.chunk_origin" if table_alias else "chunk_origin"
return f"COALESCE({column}, 'unknown') != 'precompact_checkpoint'"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded magic string instead of imported constant in exclusion clause

Low Severity

_checkpoint_exclusion_clause hardcodes the string 'precompact_checkpoint' in the SQL fragment even though CHUNK_ORIGIN_PRECOMPACT_CHECKPOINT is already imported in the same file and used correctly in _checkpoint_filtered_knn_k. Similarly, kg_repo.py uses the literal "precompact_checkpoint" in checkpoint_params.append(...) without importing or referencing the constant. If the constant value ever changes, these hardcoded strings will silently diverge.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.


if checkpoint_count <= 0:
return n_results
return n_results + checkpoint_count
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded KNN k growth with checkpoint count

Medium Severity

_checkpoint_filtered_knn_k returns n_results + checkpoint_count without any upper bound. For databases accumulating many checkpoint chunks over time, every default search (which excludes checkpoints) will set the KNN k parameter to n_results + total_checkpoints, causing sqlite-vec to scan and rank an unbounded number of vectors. A database with thousands of checkpoints would make every simple 5-result search perform a multi-thousand-vector KNN scan.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.

@EtanHey EtanHey merged commit fc9e6de into main May 16, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant