feat: tag PreCompact checkpoints#285
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThis PR introduces chunk-origin classification to identify and filter PreCompact checkpoint content throughout the brainlayer system. New constants and detection logic classify chunks by origin; schema migrations backfill existing data; storage and event pipelines propagate origin metadata; search operations exclude checkpoints by default while allowing explicit inclusion; and a new ChangesPreCompact Checkpoint Origin System
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
@codex review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c615e4f76
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not include_checkpoints: | ||
| where_clauses.append("COALESCE(c.chunk_origin, 'unknown') != 'precompact_checkpoint'") |
There was a problem hiding this comment.
Over-fetch KNN results before dropping checkpoints
When checkpoint chunks are among the nearest k neighbors, this new predicate can discard the entire sqlite-vec candidate set because k is still only n_results unless one of the older filters triggers over-fetch. I reproduced this with 20 precompact_checkpoint vectors closer than 3 normal vectors: search(query_embedding=..., n_results=5) returned no rows even though normal chunks matched just outside the initial KNN window; the same issue applies to the binary KNN path below. Please treat the default checkpoint exclusion like the other post-KNN filters and increase effective_k before applying it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in c0529bb: default checkpoint exclusion now triggers KNN over-fetch for both float and binary vector search paths. Added regressions for checkpoint-nearest-neighbor windows.
3c615e4 to
c0529bb
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@coderabbitai review |
|
@codex review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0529bba23
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| WHEN excluded.chunk_origin IS NOT NULL AND excluded.chunk_origin != 'unknown' | ||
| THEN excluded.chunk_origin | ||
| WHEN chunks.chunk_origin IS NULL | ||
| THEN COALESCE(excluded.chunk_origin, 'unknown') | ||
| ELSE chunks.chunk_origin |
There was a problem hiding this comment.
Recompute chunk_origin when upserted content changes
When an indexed source is reprocessed with the same stable chunk id (for example the source_file:index ids produced by index_new), this conflict handler updates content but preserves an existing precompact_checkpoint origin whenever the new content classifies as unknown. If a previous checkpoint-shaped chunk is replaced by normal content, the row remains tagged as a checkpoint and is silently filtered out of default search even though its content is no longer a checkpoint; the origin needs to track the newly upserted content in this case.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 97033ec: upsert_chunks now recomputes chunk_origin when a stable chunk_id receives different content, so a checkpoint row can become unknown/normal content instead of staying checkpoint-tagged. Added regression: test_upsert_chunks_recomputes_origin_when_stable_chunk_content_changes.
c0529bb to
97033ec
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@codex review |
|
@coderabbitai review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97033ec905
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| checkpoint_count = 0 | ||
| if checkpoint_count <= 0: | ||
| return n_results | ||
| return min(n_results + checkpoint_count, 1000) |
There was a problem hiding this comment.
Avoid starving normal results when checkpoints exceed the cap
When a DB has more than 1000 checkpoint vectors that rank ahead of a normal chunk for a query, this cap still asks sqlite-vec for only the first 1000 neighbors and the later chunk_origin != 'precompact_checkpoint' filter removes all of them, so default vector/hybrid search can return no normal results even though matching non-checkpoint chunks exist. I reproduced this with 1001 precompact_checkpoint rows at the query vector plus one nearby unknown row; search(..., n_results=1) returned an empty result. Since these checkpoints are expected to accumulate over time, the overfetch needs to avoid a fixed cap or apply the exclusion before/inside candidate selection.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 05d215f: checkpoint-aware KNN widening no longer has the fixed 1000 cap for checkpoint filtering, so a DB with 1001+ nearer checkpoint vectors still fetches far enough to return normal chunks. Added regression: test_vector_search_does_not_starve_normal_results_after_many_checkpoints.
97033ec to
05d215f
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce112cc42e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| n_results=num_results, | ||
| entity_name=entity_name, | ||
| project_filter=normalized_project, | ||
| include_checkpoints=include_checkpoints, |
There was a problem hiding this comment.
Filter checkpoint-derived KG facts
When an entity is detected, include_checkpoints is only forwarded to the chunk search leg of kg_hybrid_search; both the SQL fact path above and kg_hybrid_search() still read KG relations without joining source_chunk_id back to chunks.chunk_origin. In the default include_checkpoints=False case, any facts extracted from PreCompact checkpoint chunks can still be returned for entity searches even though checkpoint chunks themselves are hidden, so checkpoint pollution remains visible through facts.
Useful? React with 👍 / 👎.
| where_clauses = [ | ||
| "COALESCE(chunk_origin, 'unknown') = ?", |
There was a problem hiding this comment.
Guard brain_resume on legacy readonly databases
brain_resume unconditionally references chunk_origin, but VectorStore._init_readonly_db() explicitly supports opening existing databases without running migrations and records _has_chunk_origin=False when that column is absent. In that readonly/pre-migration context this query fails with no such column: chunk_origin and the new resume tool returns an error instead of falling back to content-pattern detection or an empty result.
Useful? React with 👍 / 👎.
ce112cc to
f5e58df
Compare
|
@coderabbitai review |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5e58df8df
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| content[:200], # summary = first 200 chars | ||
| json.dumps(tags) if tags else None, | ||
| float(importance) if importance is not None else None, | ||
| detect_chunk_origin(content), |
There was a problem hiding this comment.
Invalidate checkpoint-count cache after storing checkpoints
When a long-lived VectorStore has already populated _checkpoint_count_cache (for example after any prior default vector search), this path can now insert a PreCompact checkpoint but only clears the hybrid cache; it never calls _invalidate_checkpoint_count_cache(). If the new checkpoint has an embedding (or is embedded before another invalidation), the next default KNN search still uses the stale lower k, sqlite-vec can return the checkpoint as the top candidate, and the checkpoint filter then drops it, yielding zero or too few normal memories until another code path happens to invalidate the count.
Useful? React with 👍 / 👎.
f5e58df to
0199ae3
Compare
|
@coderabbitai review |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0199ae3565
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cached_count = getattr(self, "_checkpoint_count_cache", None) | ||
| if cached_count is not None: | ||
| checkpoint_count = cached_count |
There was a problem hiding this comment.
Refresh checkpoint counts after external writes
When a long-lived MCP VectorStore has already cached 0 checkpoints, checkpoints inserted by another process (for example the watcher/drain writer) never invalidate this in-memory value, so subsequent vector searches keep using k = n_results while the SQL filter removes the new nearest checkpoint rows. In that common multi-process path, normal memories just behind the checkpoint rows can disappear from default search until the server restarts; the count needs a TTL/DB-change check or should be recomputed instead of being cached indefinitely.
Useful? React with 👍 / 👎.
0199ae3 to
46778ae
Compare
|
@coderabbitai review |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
✅ Actions performedReview triggered.
|
|
Codex Review: Didn't find any major issues. 🚀 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/brainlayer/mcp/search_handler.py (1)
331-375:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd retry logic to
_kg_facts_sqlfor transient SQLite locks.The function executes one-shot queries without retrying on
SQLITE_BUSYerrors. The catch-allexcept Exception as e:block masks these lock errors and returns[], causing intermittent loss of KG facts during concurrent MCP writes. This violates the guideline requiring retry logic onSQLITE_BUSYerrors insrc/brainlayer/**/*.py.Implement retry logic similar to
_execute_resume_query_with_retryor convert the function to async with explicit retry handling like_search.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/brainlayer/mcp/search_handler.py` around lines 331 - 375, The _kg_facts_sql function currently performs one-shot cursor.execute calls and swallows all exceptions, causing intermittent loss on SQLITE_BUSY; update _kg_facts_sql to retry the DB queries on transient SQLite lock errors (sqlite3.OperationalError with "database is locked" or error code SQLITE_BUSY) using the same pattern as _execute_resume_query_with_retry (or the retry/backoff logic in _search): wrap the cursor.execute blocks that fetch the entity id and the facts_raw query in a retry loop with a small exponential backoff and a capped max attempts, only re-raising or returning [] after retries are exhausted, and keep existing behavior for non-lock exceptions. Ensure you reference and update the calls that use cursor.execute in _kg_facts_sql so both the id lookup and the SELECT for facts are retried.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/_helpers.py`:
- Around line 70-76: The current _is_sqlite_busy_error only checks
apsw.BusyError and two message strings, which can miss SQLITE_LOCKED paths;
update _is_sqlite_busy_error to also treat apsw.LockedError, and inspect the
APSW/SQLite result code on the exception (e.g., check getattr(exc, "result",
None) or any APSW result_code attribute) for SQLITE_BUSY and SQLITE_LOCKED (use
the apsw constants) and finally fall back to the existing casefold message
checks; reference the function _is_sqlite_busy_error and the apsw exception
classes (apsw.BusyError, apsw.LockedError) and the SQLite result codes
(SQLITE_BUSY, SQLITE_LOCKED) so retries trigger correctly under write
contention.
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 184-185: The check for checkpoint chunks currently returns None
which allows the caller to continue into the generic search path; instead,
short-circuit by returning the same empty-result value the explicit chunk_id=
path uses (i.e., an empty list/collection of matches) so a checkpoint chunk-id
resolves as a clean miss; update the branch in search_handler.py that checks
include_checkpoints and CHUNK_ORIGIN_PRECOMPACT_CHECKPOINT (the
chunk.get("chunk_origin") check) to return that empty-result type rather than
None.
- Around line 770-772: The except block currently builds error_result =
_error_result(f"Resume error: {exc}") and returns (error_result.content, {})
which strips the isError flag; change the handler to return the
_error_result(...) directly (i.e., return _error_result(f"Resume error: {exc}"))
so the MCP caller receives the full error object; update the except in
search_handler (the block that references error_result and _error_result) to
mirror other handlers in this file that return _error_result(...) directly.
---
Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 331-375: The _kg_facts_sql function currently performs one-shot
cursor.execute calls and swallows all exceptions, causing intermittent loss on
SQLITE_BUSY; update _kg_facts_sql to retry the DB queries on transient SQLite
lock errors (sqlite3.OperationalError with "database is locked" or error code
SQLITE_BUSY) using the same pattern as _execute_resume_query_with_retry (or the
retry/backoff logic in _search): wrap the cursor.execute blocks that fetch the
entity id and the facts_raw query in a retry loop with a small exponential
backoff and a capped max attempts, only re-raising or returning [] after retries
are exhausted, and keep existing behavior for non-lock exceptions. Ensure you
reference and update the calls that use cursor.execute in _kg_facts_sql so both
the id lookup and the SELECT for facts are retried.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e7d73754-e91c-461a-8008-0c8bfe53c87a
📒 Files selected for processing (15)
src/brainlayer/_helpers.pysrc/brainlayer/chunk_origin.pysrc/brainlayer/drain.pysrc/brainlayer/kg_repo.pysrc/brainlayer/mcp/__init__.pysrc/brainlayer/mcp/search_handler.pysrc/brainlayer/queue_io.pysrc/brainlayer/search_repo.pysrc/brainlayer/store.pysrc/brainlayer/vector_store.pysrc/brainlayer/watcher_bridge.pytests/test_precompact_chunk_origin.pytests/test_search_filter_params.pytests/test_think_recall_integration.pytests/test_tool_annotations.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Cursor Bugbot
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
src/brainlayer/_helpers.pytests/test_think_recall_integration.pysrc/brainlayer/queue_io.pysrc/brainlayer/chunk_origin.pytests/test_tool_annotations.pysrc/brainlayer/store.pysrc/brainlayer/watcher_bridge.pytests/test_search_filter_params.pysrc/brainlayer/drain.pysrc/brainlayer/kg_repo.pysrc/brainlayer/mcp/__init__.pysrc/brainlayer/vector_store.pysrc/brainlayer/search_repo.pysrc/brainlayer/mcp/search_handler.pytests/test_precompact_chunk_origin.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/_helpers.pysrc/brainlayer/queue_io.pysrc/brainlayer/chunk_origin.pysrc/brainlayer/store.pysrc/brainlayer/watcher_bridge.pysrc/brainlayer/drain.pysrc/brainlayer/kg_repo.pysrc/brainlayer/mcp/__init__.pysrc/brainlayer/vector_store.pysrc/brainlayer/search_repo.pysrc/brainlayer/mcp/search_handler.py
🪛 OpenGrep (1.20.0)
src/brainlayer/vector_store.py
[ERROR] 233-233: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.
(coderabbit.sql-injection.python-fstring-execute)
🔇 Additional comments (32)
tests/test_precompact_chunk_origin.py (1)
25-830: LGTM!tests/test_search_filter_params.py (1)
76-95: LGTM!Also applies to: 202-273, 296-297, 448-484
tests/test_think_recall_integration.py (1)
249-255: LGTM!tests/test_tool_annotations.py (1)
1-1: LGTM!Also applies to: 14-15, 30-31, 51-52, 74-76
src/brainlayer/_helpers.py (1)
7-8: LGTM!src/brainlayer/chunk_origin.py (1)
1-55: LGTM!src/brainlayer/drain.py (1)
22-22: LGTM!Also applies to: 141-142, 195-196, 233-234
src/brainlayer/store.py (1)
39-39: LGTM!Also applies to: 147-147, 153-154, 171-171, 195-196
src/brainlayer/vector_store.py (9)
11-11: LGTM!Also applies to: 37-41, 83-84, 103-104
109-118: LGTM!
178-189: LGTM!
230-234: LGTM!Note: The static analysis SQL injection warning at line 233 is a false positive —
colandtypvalues originate from a hardcoded list (lines 193–230), not user input.
236-279: LGTM!Previous review concerns addressed:
INSERT OR IGNOREhandles concurrent migration attempts (line 271), and_checkpoint_wal_fullis called before (line 242) and after (line 278) the bulk backfill.
300-300: LGTM!
1326-1327: LGTM!Also applies to: 1340-1349, 1365-1365, 1374-1374
1394-1395: LGTM!Also applies to: 1412-1412
1456-1462: LGTM!Also applies to: 1485-1485
src/brainlayer/watcher_bridge.py (1)
22-22: LGTM!Also applies to: 272-272, 288-288, 297-298, 313-313
src/brainlayer/kg_repo.py (2)
979-1055: LGTM!
1086-1091: LGTM!src/brainlayer/search_repo.py (10)
126-130: LGTM!
132-141: LGTM!
143-182: LGTM!
420-427: LGTM!
443-458: LGTM!
504-511: LGTM!
766-800: LGTM!
921-921: LGTM!Also applies to: 959-959
976-1017: LGTM!
1077-1088: LGTM!src/brainlayer/mcp/__init__.py (1)
42-45: LGTM!Also applies to: 509-549, 826-830, 1234-1267, 1345-1346
src/brainlayer/queue_io.py (1)
73-104: LGTM!
46778ae to
8c758a1
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@coderabbitai review Updated PR-A FM6 head to 8c758a1 after addressing latest review findings: APSW LockedError handling, exact checkpoint chunk-id no-fallback behavior, brain_resume MCP error preservation, and migration/runtime wrapped-marker parity. |
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c758a1dd6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cursor.execute( | ||
| """ | ||
| UPDATE chunks | ||
| SET chunk_origin = ? | ||
| WHERE COALESCE(chunk_origin, ?) != ? |
There was a problem hiding this comment.
Handle LockedError during checkpoint backfill
When two BrainLayer processes open the same database for the first time after this change, this new backfill can hit an APSW LockedError/database table is locked on chunks; _init_db_with_retry() only retries apsw.BusyError, so a transient table lock here bypasses the startup backoff and fails MCP/store initialization instead of retrying. Please route the migration/update path through the same busy/locked classification used for reads, or catch LockedError in the init retry loop.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.
| if not getattr(self, "_has_chunk_origin", True): | ||
| return None | ||
| column = f"{table_alias}.chunk_origin" if table_alias else "chunk_origin" | ||
| return f"COALESCE({column}, 'unknown') != 'precompact_checkpoint'" |
There was a problem hiding this comment.
Hardcoded magic string instead of imported constant in exclusion clause
Low Severity
_checkpoint_exclusion_clause hardcodes the string 'precompact_checkpoint' in the SQL fragment even though CHUNK_ORIGIN_PRECOMPACT_CHECKPOINT is already imported in the same file and used correctly in _checkpoint_filtered_knn_k. Similarly, kg_repo.py uses the literal "precompact_checkpoint" in checkpoint_params.append(...) without importing or referencing the constant. If the constant value ever changes, these hardcoded strings will silently diverge.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.
|
|
||
| if checkpoint_count <= 0: | ||
| return n_results | ||
| return n_results + checkpoint_count |
There was a problem hiding this comment.
Unbounded KNN k growth with checkpoint count
Medium Severity
_checkpoint_filtered_knn_k returns n_results + checkpoint_count without any upper bound. For databases accumulating many checkpoint chunks over time, every default search (which excludes checkpoints) will set the KNN k parameter to n_results + total_checkpoints, causing sqlite-vec to scan and rank an unbounded number of vectors. A database with thousands of checkpoints would make every simple 5-result search perform a multi-thousand-vector KNN scan.
Reviewed by Cursor Bugbot for commit 8c758a1. Configure here.


Summary
chunk_origintagging for PreCompact checkpoints without repurposing the overloadedsourcecolumnschema_migrationsinclude_checkpoints, and exposebrain_resume(session_id=None, lookback_days=7)for explicit recoveryVerification
pytest tests/test_precompact_chunk_origin.py tests/test_search_filter_params.py tests/test_think_recall_integration.py::TestMCPToolCount -q→ 29 passedpytest tests/test_precompact_chunk_origin.py tests/test_search_filter_params.py tests/test_hybrid_search.py tests/test_watcher_bridge.py tests/test_chunk_lifecycle.py tests/test_3tool_aliases.py -q→ 135 passed./scripts/run_tests.sh→ 1860 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated pytest 32 passed; bun 1 pass; regression shell passed./scripts/run_tests.shwith the same passing counts/tmp/brainlayer-fm6-precompact-snapshot.db→ 51 checkpoint rows tagged, 0 manual checkpoint-storage note false positivesruff checkandruff format --checkpassedNotes
ruff check/ruff format --checkstill report pre-existing script lint/format debt outside this PR; touched files are clean.cr review --plaincould not start locally because the CLI is not connected to the repository's CodeRabbit organization; requesting CodeRabbit on the PR instead.Note
Medium Risk
Introduces a schema migration/backfill (
chunks.chunk_origin) and changes default search/KG retrieval behavior to exclude checkpoint content, which could affect existing workflows and query results; adds additional retry/overfetch logic around SQLite reads that may impact performance under load.Overview
Adds first-class
chunk_originclassification (newchunks.chunk_origincolumn +chunk_origin.pydetection helpers) and tags incoming data across ingestion paths (store, watcher/queue drain, upserts, chunk updates), including a one-time backfill migration recorded inschema_migrations.Changes default retrieval to exclude
precompact_checkpointchunks across hybrid/text/vector search, exact chunk-id lookups, and KG fact queries unlessinclude_checkpoints=true, with KNN overfetch/caching tweaks to avoid starving results when checkpoints are filtered.Exposes an explicit
brain_resumeMCP tool to fetch recent PreCompact checkpoints (optionally filtered bysession_id, with lock-resilient retries) and wires the new parameter/tool through MCP schemas and tests.Reviewed by Cursor Bugbot for commit 8c758a1. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add PreCompact checkpoint tagging, filtering, and
brain_resumetool to VectorStore searchchunk_origincolumn to thechunkstable with a backfill migration that detects and tags existing PreCompact checkpoint content using content-marker heuristics inchunk_origin.py.store.py,vector_store.py,drain.py,watcher_bridge.py) now compute and persistchunk_originon insert or update.precompact_checkpointchunks by default, overfetching to compensate and truncating to the requested result count; aninclude_checkpointsflag opts back in.brain_resumeMCP tool insearch_handler.pythat queries recent PreCompact checkpoints filtered bylookback_daysand optionalsession_id, with retry logic on SQLite busy/locked errors.PRAGMA data_version) adds reads on each KNN query when no cached value exists.Macroscope summarized 8c758a1.
Summary by CodeRabbit
New Features
Tests