fix: DB lock resilience — retry + queue for MCP writes and reads#65
Conversation
brain_store already had JSONL queue fallback but brain_update and brain_search had zero protection against BusyError from enrichment contention. Now: - brain_update retries 4x with exponential backoff before failing - brain_search retries 3x on BusyError (WAL reads shouldn't block but do during checkpoint/exclusive lock) - _queue_store enforces 100-item max, drops oldest on overflow - All retry delays exposed as module-level vars for test patching 10 new tests in test_write_queue.py covering queue, flush, retry, and max-size behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds DB-lock resilience to MCP handlers: search and update operations now retry on APSW BusyError-like conditions with exponential backoff; write operations buffer to a bounded on-disk queue when locks occur. New tests exercise queueing, flushing, and retry behaviors. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant SearchHandler as Search Handler
participant Store as Vector Store / APSW
Client->>SearchHandler: _search(query)
loop retry (up to configured attempts)
SearchHandler->>Store: hybrid_search(query)
alt Success
Store-->>SearchHandler: results
SearchHandler-->>Client: return results
else BusyError / "locked"/"busy"
Store-->>SearchHandler: BusyError
SearchHandler->>SearchHandler: log warning
SearchHandler->>SearchHandler: sleep (exponential backoff)
end
end
alt retries exhausted
SearchHandler-->>Client: raise error
end
sequenceDiagram
participant Client
participant StoreHandler as Store Handler
participant DB as APSW/DB
participant FileQueue as File Queue (pending-stores.jsonl)
Client->>StoreHandler: brain_update(item)
loop retry (up to configured attempts)
StoreHandler->>DB: attempt write/update
alt Success
DB-->>StoreHandler: OK
StoreHandler-->>Client: success response
else BusyError / "locked"/"busy"
DB-->>StoreHandler: BusyError
StoreHandler->>StoreHandler: log warning
StoreHandler->>StoreHandler: sleep (exponential backoff)
end
end
alt retries exhausted
StoreHandler->>FileQueue: append item to queue (enforce max size)
FileQueue-->>StoreHandler: queued
StoreHandler-->>Client: queued response
end
sequenceDiagram
participant Operator
participant StoreHandler as Store Handler
participant FileQueue as File Queue (pending-stores.jsonl)
participant DB as APSW/DB
Operator->>StoreHandler: _flush_pending_stores()
StoreHandler->>FileQueue: read queued items
alt file exists
FileQueue-->>StoreHandler: items list
loop each item
StoreHandler->>DB: attempt write
alt Success
DB-->>StoreHandler: OK
else Failure
StoreHandler->>FileQueue: keep item
end
end
alt all succeeded
StoreHandler->>FileQueue: delete file
else failures remain
FileQueue-->>StoreHandler: retain file with failures
end
else file missing
FileQueue-->>StoreHandler: no-op
end
StoreHandler-->>Operator: flush complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/brainlayer/mcp/search_handler.py (1)
244-247:⚠️ Potential issue | 🟠 MajorRetry scope is incomplete for DB lock resilience.
_searchretries onlyhybrid_search, but lock failures at Line 246 (store.count()) or Line 302 (enrich_results_with_session_context) still bypass retry and return an error immediately. Wrap the full DB-read sequence in the same retry loop.🔧 Suggested refactor
- store = _get_vector_store() - - if store.count() == 0: - empty = {"query": query, "total": 0, "results": []} - return ( - [TextContent(type="text", text="Knowledge base is empty. Run 'brainlayer index' to populate it.")], - empty, - ) + store = _get_vector_store() normalized_project = _normalize_project_name(project) loop = asyncio.get_running_loop() model = _get_embedding_model() query_embedding = await loop.run_in_executor(None, model.embed_query, query) @@ - results = None + results = None for attempt in range(_RETRY_MAX_ATTEMPTS): try: + if store.count() == 0: + empty = {"query": query, "total": 0, "results": []} + return ( + [TextContent(type="text", text="Knowledge base is empty. Run 'brainlayer index' to populate it.")], + empty, + ) results = store.hybrid_search( query_embedding=query_embedding, query_text=query, @@ sentiment_filter=sentiment, entity_id=entity_id, ) + results = store.enrich_results_with_session_context(results) break except (apsw.BusyError, Exception) as e: is_lock = isinstance(e, apsw.BusyError) or "locked" in str(e).lower() or "busy" in str(e).lower() if is_lock and attempt < _RETRY_MAX_ATTEMPTS - 1: delay = _retry_delay * (2**attempt) logger.warning("Search BusyError (attempt %d/%d), retrying in %.2fs", attempt + 1, _RETRY_MAX_ATTEMPTS, delay) await asyncio.sleep(delay) continue raise # Non-lock error or retries exhausted @@ - results = store.enrich_results_with_session_context(results)Also applies to: 268-297, 302-303
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/mcp/search_handler.py` around lines 244 - 247, The retry scope is too narrow: expand the retry loop inside _search to encompass the entire DB-read sequence including the call to _get_vector_store(), store.count(), hybrid_search(...), and enrich_results_with_session_context(...), so any transient DB/lock errors are retried; adjust the retry wrapper (where hybrid_search is currently retried) to surround from obtaining the store through building the final results and return the same empty/normal response after retries or propagate the final error if retries exhaust.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/mcp/store_handler.py`:
- Around line 201-214: The read/trim/rewrite block around path (using
path.read_text(), _QUEUE_MAX_SIZE, lines, trimmed, path.write_text()) is not
concurrency-safe and can drop buffered writes; protect this section with an
inter-process lock (e.g., acquire a file lock on the queue file or a companion
lock file using fcntl/portalocker) around the read/trim/write sequence and
perform the write atomically by writing to a temporary file and then replacing
the original with os.replace (or Path.replace) so concurrent processes cannot
clobber each other; ensure the lock is reliably released in a finally block and
keep the existing warning log when trimming occurs.
- Around line 111-112: _brain_update currently calls the shared singleton
_get_vector_store(), causing contention; change _brain_update so that each retry
attempt opens its own DB connection/object instead of reusing
_get_vector_store() (e.g., create a new VectorStore/connection instance inside
the retry loop), run the operation, and always close/cleanup that connection in
a finally block; ensure the SQLITE_BUSY retry logic recreates a fresh
per-attempt connection so concurrent workers do not share a single connection.
In `@tests/test_write_queue.py`:
- Around line 67-73: Replace the weak cap assertion in tests/test_write_queue.py
by asserting exactly 100 items (change "assert len(lines) <= 100" to "assert
len(lines) == 100") and add a check that the first retained entry is the
expected boundary item: parse the first line (e.g., first_item =
json.loads(lines[0])) and assert first_item["content"] equals the item at index
(total_created - 100) (or hardcode "item-5" if total_created is 105), using the
existing variables lines and last_item to locate the check.
- Around line 211-214: The test_update_retries_before_failing currently allows
non-dict error-like returns to pass; update the assertions to explicitly verify
"not an error" for both dict and CallToolResult-like objects: reference the
variables result and call_count and assert call_count == 3, then assert that if
result is a dict its "isError" key is falsy and if result has an isError
attribute (e.g., a CallToolResult) that attribute is False — use
isinstance(result, dict) with result.get("isError") and hasattr(result,
"isError") / getattr(result, "isError") checks to enforce actual success
semantics.
---
Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 244-247: The retry scope is too narrow: expand the retry loop
inside _search to encompass the entire DB-read sequence including the call to
_get_vector_store(), store.count(), hybrid_search(...), and
enrich_results_with_session_context(...), so any transient DB/lock errors are
retried; adjust the retry wrapper (where hybrid_search is currently retried) to
surround from obtaining the store through building the final results and return
the same empty/normal response after retries or propagate the final error if
retries exhaust.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.pytests/test_write_queue.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests with
pytest
Files:
tests/test_write_queue.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Typer CLI framework for command-line interface implementation insrc/brainlayer/
Handle concurrency by retrying onSQLITE_BUSYerrors; each worker should use its own database connection
Export brain graph as JSON for Next.js dashboard usingbrainlayer brain-exportcommand
Export Obsidian vault with Markdown files including backlinks and tags usingbrainlayer export-obsidiancommand
Files:
src/brainlayer/mcp/store_handler.pysrc/brainlayer/mcp/search_handler.py
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP server with 8 tools:
brain_search,brain_store,brain_recall,brain_entity,brain_expand,brain_update,brain_digest,brain_get_person(legacybrainlayer_*aliases still work)
Files:
src/brainlayer/mcp/store_handler.pysrc/brainlayer/mcp/search_handler.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
ruff check src/for linting andruff format src/for code formatting
Files:
src/brainlayer/mcp/store_handler.pysrc/brainlayer/mcp/search_handler.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
Applied to files:
tests/test_write_queue.pysrc/brainlayer/mcp/store_handler.pysrc/brainlayer/mcp/search_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with 8 tools: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` (legacy `brainlayer_*` aliases still work)
Applied to files:
src/brainlayer/mcp/store_handler.pysrc/brainlayer/mcp/search_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`
Applied to files:
src/brainlayer/mcp/store_handler.py
🧬 Code graph analysis (2)
tests/test_write_queue.py (2)
src/brainlayer/mcp/store_handler.py (3)
_flush_pending_stores(217-268)_queue_store(188-214)_store(271-371)src/brainlayer/mcp/search_handler.py (1)
_search(216-391)
src/brainlayer/mcp/store_handler.py (2)
src/brainlayer/mcp/_shared.py (3)
_get_vector_store(16-24)_error_result(145-147)_get_embedding_model(27-34)src/brainlayer/vector_store.py (1)
update_chunk(577-612)
| store = _get_vector_store() | ||
|
|
There was a problem hiding this comment.
Use a per-operation DB connection in the retry path.
At Line 111, _brain_update still relies on _get_vector_store() (shared singleton). Under concurrent workers, this keeps lock contention concentrated on one connection and undermines the busy-retry design.
As per coding guidelines: "Handle concurrency by retrying on SQLITE_BUSY errors; each worker should use its own database connection".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/store_handler.py` around lines 111 - 112, _brain_update
currently calls the shared singleton _get_vector_store(), causing contention;
change _brain_update so that each retry attempt opens its own DB
connection/object instead of reusing _get_vector_store() (e.g., create a new
VectorStore/connection instance inside the retry loop), run the operation, and
always close/cleanup that connection in a finally block; ensure the SQLITE_BUSY
retry logic recreates a fresh per-attempt connection so concurrent workers do
not share a single connection.
| # Enforce max size — read, trim oldest, rewrite | ||
| try: | ||
| lines = path.read_text().strip().splitlines() | ||
| if len(lines) > _QUEUE_MAX_SIZE: | ||
| trimmed = lines[-_QUEUE_MAX_SIZE:] | ||
| path.write_text("\n".join(trimmed) + "\n") | ||
| logger.warning( | ||
| "Pending store queue trimmed: %d -> %d (dropped %d oldest)", | ||
| len(lines), | ||
| _QUEUE_MAX_SIZE, | ||
| len(lines) - _QUEUE_MAX_SIZE, | ||
| ) | ||
| except Exception: | ||
| pass # Non-critical — queue still works, just unbounded |
There was a problem hiding this comment.
Queue trimming is not concurrency-safe and can drop buffered writes.
Lines 201-206 perform read/trim/rewrite without an inter-process lock or atomic replace boundary. Concurrent _queue_store calls can clobber each other and silently lose queued memories.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/store_handler.py` around lines 201 - 214, The
read/trim/rewrite block around path (using path.read_text(), _QUEUE_MAX_SIZE,
lines, trimmed, path.write_text()) is not concurrency-safe and can drop buffered
writes; protect this section with an inter-process lock (e.g., acquire a file
lock on the queue file or a companion lock file using fcntl/portalocker) around
the read/trim/write sequence and perform the write atomically by writing to a
temporary file and then replacing the original with os.replace (or Path.replace)
so concurrent processes cannot clobber each other; ensure the lock is reliably
released in a finally block and keep the existing warning log when trimming
occurs.
| # Should succeed after retries | ||
| assert call_count == 3 | ||
| assert not isinstance(result, dict) or not result.get("isError") | ||
|
|
There was a problem hiding this comment.
test_update_retries_before_failing can pass even when update returns an error.
The assertion at Line 213 is effectively always true for non-dict returns, including an error CallToolResult. This can produce false positives.
✅ Assert real success semantics
# Should succeed after retries
assert call_count == 3
- assert not isinstance(result, dict) or not result.get("isError")
+ from mcp.types import CallToolResult
+ assert not isinstance(result, CallToolResult)
+ assert any("archived" in t.text.lower() for t in result)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Should succeed after retries | |
| assert call_count == 3 | |
| assert not isinstance(result, dict) or not result.get("isError") | |
| # Should succeed after retries | |
| assert call_count == 3 | |
| from mcp.types import CallToolResult | |
| assert not isinstance(result, CallToolResult) | |
| assert any("archived" in t.text.lower() for t in result) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_write_queue.py` around lines 211 - 214, The
test_update_retries_before_failing currently allows non-dict error-like returns
to pass; update the assertions to explicitly verify "not an error" for both dict
and CallToolResult-like objects: reference the variables result and call_count
and assert call_count == 3, then assert that if result is a dict its "isError"
key is falsy and if result has an isError attribute (e.g., a CallToolResult)
that attribute is False — use isinstance(result, dict) with
result.get("isError") and hasattr(result, "isError") / getattr(result,
"isError") checks to enforce actual success semantics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/brainlayer/mcp/search_handler.py (2)
244-256: 🧹 Nitpick | 🔵 TrivialStore obtained outside retry loop may limit resilience.
_get_vector_store()at Line 244 retrieves the singleton connection before the retry loop (Lines 273-300). If the connection itself becomes stale or problematic, retries will reuse the same connection. This is less severe than the update path since reads are less contention-prone, but consider whether obtaining a fresh store inside the loop would improve resilience.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/mcp/search_handler.py` around lines 244 - 256, The code calls _get_vector_store() once before the retry loop, which can reuse a stale connection; move the call to _get_vector_store() inside the retry loop used for reads so each retry obtains a fresh store instance, then use that local store for store.count() and subsequent search operations (keep the existing normalized_project, model/embed logic intact) to improve resilience when the underlying vector store connection becomes problematic.
8-25: 🧹 Nitpick | 🔵 TrivialConsider moving imports before module-level variables.
Module-level variables (Lines 8-10) are declared before imports (Lines 12-25). While this works, the conventional pattern is imports first. This is a minor style nit.
♻️ Proposed reordering
import apsw from mcp.types import TextContent +from ._shared import ( + _build_compact_result, + _error_result, + _extract_file_path, + _get_embedding_model, + _get_vector_store, + _memory_to_dict, + _normalize_project_name, + _query_has_regression_signal, + _query_signals_current_context, + _query_signals_recall, + _query_signals_think, + logger, +) + # Retry settings for DB lock resilience on reads _RETRY_MAX_ATTEMPTS = 3 _retry_delay = 0.1 # base delay in seconds (exposed for test patching) - -from ._shared import ( - ... -)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/mcp/search_handler.py` around lines 8 - 25, Move the import block that brings in functions like _build_compact_result, _error_result, _extract_file_path, _get_embedding_model, _get_vector_store, _memory_to_dict, _normalize_project_name, _query_has_regression_signal, _query_signals_current_context, _query_signals_recall, _query_signals_think, and logger so it appears before the module-level constants _RETRY_MAX_ATTEMPTS and _retry_delay; update search_handler.py by placing the entire from ._shared import (...) section at the top of the file (above the _RETRY_MAX_ATTEMPTS and _retry_delay declarations) to follow the conventional imports-first ordering.src/brainlayer/mcp/store_handler.py (1)
25-56:⚠️ Potential issue | 🟠 MajorAdd retry logic to
_brain_digestfor DB lock resilience.The
_brain_digestfunction callsdigest_content, which performs database writes viastore.upsert_chunks()and entity operations. However, unlike_brain_updateand_search,_brain_digestlacks retry logic forBusyError. Per the coding guideline, all database operations should retry onSQLITE_BUSYerrors. Wrap the digest operation in the same retry pattern used by_brain_update(lines 109–178): a loop with exponential backoff for up to 4 attempts.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/mcp/store_handler.py` around lines 25 - 56, _brain_digest currently calls digest_content (which does store.upsert_chunks and entity writes) without retrying on database BusyError; wrap the call to digest_content (the lambda passed to loop.run_in_executor) in the same retry loop used by _brain_update: attempt up to 4 times, catch BusyError specifically, on BusyError sleep with exponential backoff (e.g., 0.1 * 2**attempt) and retry, and only re-raise or return an error after exhausting attempts; keep using loop.run_in_executor with model.embed_query and preserve the existing ValueError and generic Exception handling for non-BusyError failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 291-292: The except clause currently uses the redundant form
"except (apsw.BusyError, Exception) as e"; change it to a single generic catch
like "except Exception as e" (keeping the existing is_lock logic that inspects
isinstance(e, apsw.BusyError) or checks "locked"/"busy" in str(e). Update the
handler in search_handler.py where is_lock is set so it mirrors store_handler's
simpler pattern and remove the redundant tuple in the except clause while
retaining apsw.BusyError usage in the isinstance check.
In `@src/brainlayer/mcp/store_handler.py`:
- Around line 169-170: The except clause is redundant: replace "except
(apsw.BusyError, Exception) as e" with "except Exception as e" and keep the
existing is_lock_error computation (which still checks isinstance(e,
apsw.BusyError) or "locked"/"busy" in str(e).lower()) so intent remains clear
while removing the redundant tuple; update the except block around the
is_lock_error assignment in the same function to reflect this change.
In `@tests/test_write_queue.py`:
- Around line 244-249: The test mock original_search_results is missing the
required "ids" key that hybrid_search returns; update the mock in
tests/test_write_queue.py (variable original_search_results) to include an "ids"
entry with the same nested-list shape as "documents"/"metadatas" (e.g.,
[["doc_id_1"]] or matching chunk ids like [["c1"]]) so the test accurately
mirrors hybrid_search's return structure and avoids false positives.
---
Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 244-256: The code calls _get_vector_store() once before the retry
loop, which can reuse a stale connection; move the call to _get_vector_store()
inside the retry loop used for reads so each retry obtains a fresh store
instance, then use that local store for store.count() and subsequent search
operations (keep the existing normalized_project, model/embed logic intact) to
improve resilience when the underlying vector store connection becomes
problematic.
- Around line 8-25: Move the import block that brings in functions like
_build_compact_result, _error_result, _extract_file_path, _get_embedding_model,
_get_vector_store, _memory_to_dict, _normalize_project_name,
_query_has_regression_signal, _query_signals_current_context,
_query_signals_recall, _query_signals_think, and logger so it appears before the
module-level constants _RETRY_MAX_ATTEMPTS and _retry_delay; update
search_handler.py by placing the entire from ._shared import (...) section at
the top of the file (above the _RETRY_MAX_ATTEMPTS and _retry_delay
declarations) to follow the conventional imports-first ordering.
In `@src/brainlayer/mcp/store_handler.py`:
- Around line 25-56: _brain_digest currently calls digest_content (which does
store.upsert_chunks and entity writes) without retrying on database BusyError;
wrap the call to digest_content (the lambda passed to loop.run_in_executor) in
the same retry loop used by _brain_update: attempt up to 4 times, catch
BusyError specifically, on BusyError sleep with exponential backoff (e.g., 0.1 *
2**attempt) and retry, and only re-raise or return an error after exhausting
attempts; keep using loop.run_in_executor with model.embed_query and preserve
the existing ValueError and generic Exception handling for non-BusyError
failures.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.pytests/test_write_queue.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Typer CLI framework for command-line interface implementation insrc/brainlayer/
Handle concurrency by retrying onSQLITE_BUSYerrors; each worker should use its own database connection
Export brain graph as JSON for Next.js dashboard usingbrainlayer brain-exportcommand
Export Obsidian vault with Markdown files including backlinks and tags usingbrainlayer export-obsidiancommand
Files:
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.py
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP server with 8 tools:
brain_search,brain_store,brain_recall,brain_entity,brain_expand,brain_update,brain_digest,brain_get_person(legacybrainlayer_*aliases still work)
Files:
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
ruff check src/for linting andruff format src/for code formatting
Files:
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests with
pytest
Files:
tests/test_write_queue.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
Applied to files:
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.pytests/test_write_queue.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with 8 tools: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` (legacy `brainlayer_*` aliases still work)
Applied to files:
src/brainlayer/mcp/search_handler.pysrc/brainlayer/mcp/store_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Use APSW with sqlite-vec for vector storage implementation in `vector_store.py`
Applied to files:
src/brainlayer/mcp/store_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`
Applied to files:
src/brainlayer/mcp/store_handler.py
🧬 Code graph analysis (1)
src/brainlayer/mcp/search_handler.py (1)
src/brainlayer/search_repo.py (1)
hybrid_search(333-553)
🔇 Additional comments (9)
src/brainlayer/mcp/store_handler.py (3)
19-22: LGTM — Retry configuration is well-exposed for testing.The module-level retry settings with sensible defaults and exposed
_retry_delayfor test patching follow good practices for testable retry logic.
109-112: Use a per-operation DB connection in the retry path.
_get_vector_store()returns a shared singleton. Under concurrent workers, retrying on the same connection keeps lock contention concentrated and undermines the busy-retry design. As per coding guidelines: "Handle concurrency by retrying onSQLITE_BUSYerrors; each worker should use its own database connection".Consider creating a fresh connection inside each retry attempt or providing a factory that returns new connections for retry scenarios.
[duplicate_comment, raise_major_issue]
203-216: Queue trimming is not concurrency-safe.The read/trim/rewrite sequence at Lines 205-208 lacks inter-process synchronization. Concurrent
_queue_storecalls can clobber each other's writes, silently losing queued memories. Consider using a file lock (e.g.,fcntlorportalocker) and atomic replace via a temp file.[duplicate_comment, raise_critical_issue]
tests/test_write_queue.py (4)
1-7: LGTM — Good test module structure and documentation.Clear docstring explaining test coverage scope and the use of
tmp_pathfixtures for isolation.
67-73: Tighten the queue cap assertion.
assert len(lines) <= 100can hide trimming regressions. The test should verify exact cap behavior and the first retained item.[duplicate_comment, raise_minor_issue]
✅ Proposed fix
lines = pending_path.read_text().strip().splitlines() # Should be capped at 100, with oldest dropped - assert len(lines) <= 100 + assert len(lines) == 100 + # Verify oldest items were dropped (items 0-4) + first_item = json.loads(lines[0]) + assert first_item["content"] == "item-5" # The newest items should be kept last_item = json.loads(lines[-1]) assert last_item["content"] == "item-104"
211-214: Strengthen the success assertion.The assertion
not isinstance(result, dict) or not result.get("isError")can pass for errorCallToolResultobjects. The test should explicitly verify successful archive behavior.[duplicate_comment, raise_minor_issue]
✅ Proposed fix
# Should succeed after retries assert call_count == 3 - assert not isinstance(result, dict) or not result.get("isError") + # Result should be a list of TextContent, not a CallToolResult error + assert isinstance(result, list) + assert len(result) > 0 + # Verify the response indicates successful archive + result_text = result[0].text + assert "archived" in result_text.lower()
90-93: The patch path is correct.The patch at line 91 targets
"brainlayer.store.store_memory", which is the correct location. When_flush_pending_storesexecutesfrom ..store import store_memory(a local import), Python resolves it tobrainlayer.store.store_memory—the definition location—and that's the appropriate target for mocking. This is standard practice and not fragile.src/brainlayer/mcp/search_handler.py (2)
270-300: LGTM — Retry logic for search is well-implemented.The retry loop with exponential backoff, clear logging, and proper re-raise on exhaustion follows good resilience patterns. The comment explaining WAL checkpoint blocking is helpful context.
263-265: LGTM — Good AIDEV-NOTE documentation.The comment explains the rationale for changing the default source filter behavior, which helps future maintainers understand the design decision.
| call_count = 0 | ||
| original_search_results = { | ||
| "documents": [["test doc"]], | ||
| "metadatas": [[{"project": "test", "content_type": "note", "chunk_id": "c1"}]], | ||
| "distances": [[0.1]], | ||
| } |
There was a problem hiding this comment.
Test mock missing required ids key in search results.
The mock original_search_results (Lines 245-249) is missing the ids key that hybrid_search returns (per the relevant code snippet showing the return structure). This may cause the test to pass despite incomplete mock data.
✅ Proposed fix
original_search_results = {
+ "ids": [["c1"]],
"documents": [["test doc"]],
"metadatas": [[{"project": "test", "content_type": "note", "chunk_id": "c1"}]],
"distances": [[0.1]],
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_write_queue.py` around lines 244 - 249, The test mock
original_search_results is missing the required "ids" key that hybrid_search
returns; update the mock in tests/test_write_queue.py (variable
original_search_results) to include an "ids" entry with the same nested-list
shape as "documents"/"metadatas" (e.g., [["doc_id_1"]] or matching chunk ids
like [["c1"]]) so the test accurately mirrors hybrid_search's return structure
and avoids false positives.
…er assertions - Simplify `except (BusyError, Exception)` to `except Exception` - Queue trim uses tempfile + atomic rename (POSIX safe) - Test assertions now verify exact queue cap and success response content Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
BusyErrorinstead of failing immediatelyBusyError(WAL reads shouldn't block, but do during enrichment checkpoint)Problem
brain_storealready had JSONL queue fallback for DB locks, butbrain_updateandbrain_searchhad zero protection — they just crashed with "database is locked" when enrichment was running. Knowledge was silently lost.Test plan
test_write_queue.py— queue, flush, retry, max-sizetest_db_lock_resilience.py)ruff checkclean🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests