Skip to content

fix: DB lock resilience — retry + queue for MCP writes and reads#65

Merged
EtanHey merged 3 commits into
mainfrom
fix/write-queue-db-lock
Feb 28, 2026
Merged

fix: DB lock resilience — retry + queue for MCP writes and reads#65
EtanHey merged 3 commits into
mainfrom
fix/write-queue-db-lock

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented Feb 28, 2026

Summary

  • brain_update now retries 4x with exponential backoff on BusyError instead of failing immediately
  • brain_search now retries 3x on BusyError (WAL reads shouldn't block, but do during enrichment checkpoint)
  • _queue_store now enforces 100-item max size, dropping oldest on overflow
  • All retry delays are module-level variables for easy test patching

Problem

brain_store already had JSONL queue fallback for DB locks, but brain_update and brain_search had zero protection — they just crashed with "database is locked" when enrichment was running. Knowledge was silently lost.

Test plan

  • 10 new tests in test_write_queue.py — queue, flush, retry, max-size
  • 15 lock-related tests pass (including existing test_db_lock_resilience.py)
  • 658 full suite tests pass, 0 failures
  • ruff check clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Search operations now retry with exponential backoff when the database is busy/locked.
    • Write operations retry on lock contention and buffer failed write requests to a bounded queue, trimming oldest entries when full.
  • Tests

    • Added comprehensive tests covering write-queue behavior, retry/backoff handling, queue eviction, flush behavior, and DB lock scenarios.

brain_store already had JSONL queue fallback but brain_update and
brain_search had zero protection against BusyError from enrichment
contention. Now:

- brain_update retries 4x with exponential backoff before failing
- brain_search retries 3x on BusyError (WAL reads shouldn't block
  but do during checkpoint/exclusive lock)
- _queue_store enforces 100-item max, drops oldest on overflow
- All retry delays exposed as module-level vars for test patching

10 new tests in test_write_queue.py covering queue, flush, retry,
and max-size behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 28, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 20 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 71eca21 and e50e54e.

📒 Files selected for processing (3)
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
  • tests/test_write_queue.py
📝 Walkthrough

Walkthrough

Adds DB-lock resilience to MCP handlers: search and update operations now retry on APSW BusyError-like conditions with exponential backoff; write operations buffer to a bounded on-disk queue when locks occur. New tests exercise queueing, flushing, and retry behaviors.

Changes

Cohort / File(s) Summary
Search retry
src/brainlayer/mcp/search_handler.py
Added apsw import, _RETRY_MAX_ATTEMPTS and _retry_delay constants, and a retry loop around store.hybrid_search that retries on BusyError or "locked"/"busy" messages with exponential backoff and warning logs; rethrows non-retryable errors or when exhausted.
Store retry & queueing
src/brainlayer/mcp/store_handler.py
Added apsw, _RETRY_MAX_ATTEMPTS, _retry_delay, and _QUEUE_MAX_SIZE. _brain_update now retries on DB lock conditions with exponential backoff; _queue_store enforces a bounded queue (trims oldest entries, logs on trimming) and _store integrates buffering and flush behavior for lock scenarios.
Tests for queue & lock behavior
tests/test_write_queue.py
New comprehensive tests covering _queue_store, _flush_pending_stores, queue max-size eviction, partial/complete flush behavior, _store buffering under BusyError, _brain_update retry/backoff behavior, and _search read retries using mocks.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SearchHandler as Search Handler
    participant Store as Vector Store / APSW
    
    Client->>SearchHandler: _search(query)
    loop retry (up to configured attempts)
        SearchHandler->>Store: hybrid_search(query)
        alt Success
            Store-->>SearchHandler: results
            SearchHandler-->>Client: return results
        else BusyError / "locked"/"busy"
            Store-->>SearchHandler: BusyError
            SearchHandler->>SearchHandler: log warning
            SearchHandler->>SearchHandler: sleep (exponential backoff)
        end
    end
    alt retries exhausted
        SearchHandler-->>Client: raise error
    end
Loading
sequenceDiagram
    participant Client
    participant StoreHandler as Store Handler
    participant DB as APSW/DB
    participant FileQueue as File Queue (pending-stores.jsonl)
    
    Client->>StoreHandler: brain_update(item)
    loop retry (up to configured attempts)
        StoreHandler->>DB: attempt write/update
        alt Success
            DB-->>StoreHandler: OK
            StoreHandler-->>Client: success response
        else BusyError / "locked"/"busy"
            DB-->>StoreHandler: BusyError
            StoreHandler->>StoreHandler: log warning
            StoreHandler->>StoreHandler: sleep (exponential backoff)
        end
    end
    alt retries exhausted
        StoreHandler->>FileQueue: append item to queue (enforce max size)
        FileQueue-->>StoreHandler: queued
        StoreHandler-->>Client: queued response
    end
Loading
sequenceDiagram
    participant Operator
    participant StoreHandler as Store Handler
    participant FileQueue as File Queue (pending-stores.jsonl)
    participant DB as APSW/DB

    Operator->>StoreHandler: _flush_pending_stores()
    StoreHandler->>FileQueue: read queued items
    alt file exists
        FileQueue-->>StoreHandler: items list
        loop each item
            StoreHandler->>DB: attempt write
            alt Success
                DB-->>StoreHandler: OK
            else Failure
                StoreHandler->>FileQueue: keep item
            end
        end
        alt all succeeded
            StoreHandler->>FileQueue: delete file
        else failures remain
            FileQueue-->>StoreHandler: retain file with failures
        end
    else file missing
        FileQueue-->>StoreHandler: no-op
    end
    StoreHandler-->>Operator: flush complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
Hops and retries, soft and spry,
BusyErrors pass me by,
I queue the writes, trim when full,
Backoff steps keep pace and pull,
Tomorrow the DB will reply. 🌱

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: DB lock resilience — retry + queue for MCP writes and reads' clearly and concisely describes the main change: adding retry mechanisms and queueing to handle database lock conditions in both write and read operations.
Docstring Coverage ✅ Passed Docstring coverage is 84.21% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/write-queue-db-lock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/brainlayer/mcp/search_handler.py (1)

244-247: ⚠️ Potential issue | 🟠 Major

Retry scope is incomplete for DB lock resilience.

_search retries only hybrid_search, but lock failures at Line 246 (store.count()) or Line 302 (enrich_results_with_session_context) still bypass retry and return an error immediately. Wrap the full DB-read sequence in the same retry loop.

🔧 Suggested refactor
-        store = _get_vector_store()
-
-        if store.count() == 0:
-            empty = {"query": query, "total": 0, "results": []}
-            return (
-                [TextContent(type="text", text="Knowledge base is empty. Run 'brainlayer index' to populate it.")],
-                empty,
-            )
+        store = _get_vector_store()

         normalized_project = _normalize_project_name(project)
         loop = asyncio.get_running_loop()
         model = _get_embedding_model()
         query_embedding = await loop.run_in_executor(None, model.embed_query, query)
@@
-        results = None
+        results = None
         for attempt in range(_RETRY_MAX_ATTEMPTS):
             try:
+                if store.count() == 0:
+                    empty = {"query": query, "total": 0, "results": []}
+                    return (
+                        [TextContent(type="text", text="Knowledge base is empty. Run 'brainlayer index' to populate it.")],
+                        empty,
+                    )
                 results = store.hybrid_search(
                     query_embedding=query_embedding,
                     query_text=query,
@@
                     sentiment_filter=sentiment,
                     entity_id=entity_id,
                 )
+                results = store.enrich_results_with_session_context(results)
                 break
             except (apsw.BusyError, Exception) as e:
                 is_lock = isinstance(e, apsw.BusyError) or "locked" in str(e).lower() or "busy" in str(e).lower()
                 if is_lock and attempt < _RETRY_MAX_ATTEMPTS - 1:
                     delay = _retry_delay * (2**attempt)
                     logger.warning("Search BusyError (attempt %d/%d), retrying in %.2fs", attempt + 1, _RETRY_MAX_ATTEMPTS, delay)
                     await asyncio.sleep(delay)
                     continue
                 raise  # Non-lock error or retries exhausted
@@
-        results = store.enrich_results_with_session_context(results)

Also applies to: 268-297, 302-303

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/search_handler.py` around lines 244 - 247, The retry scope
is too narrow: expand the retry loop inside _search to encompass the entire
DB-read sequence including the call to _get_vector_store(), store.count(),
hybrid_search(...), and enrich_results_with_session_context(...), so any
transient DB/lock errors are retried; adjust the retry wrapper (where
hybrid_search is currently retried) to surround from obtaining the store through
building the final results and return the same empty/normal response after
retries or propagate the final error if retries exhaust.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/mcp/store_handler.py`:
- Around line 201-214: The read/trim/rewrite block around path (using
path.read_text(), _QUEUE_MAX_SIZE, lines, trimmed, path.write_text()) is not
concurrency-safe and can drop buffered writes; protect this section with an
inter-process lock (e.g., acquire a file lock on the queue file or a companion
lock file using fcntl/portalocker) around the read/trim/write sequence and
perform the write atomically by writing to a temporary file and then replacing
the original with os.replace (or Path.replace) so concurrent processes cannot
clobber each other; ensure the lock is reliably released in a finally block and
keep the existing warning log when trimming occurs.
- Around line 111-112: _brain_update currently calls the shared singleton
_get_vector_store(), causing contention; change _brain_update so that each retry
attempt opens its own DB connection/object instead of reusing
_get_vector_store() (e.g., create a new VectorStore/connection instance inside
the retry loop), run the operation, and always close/cleanup that connection in
a finally block; ensure the SQLITE_BUSY retry logic recreates a fresh
per-attempt connection so concurrent workers do not share a single connection.

In `@tests/test_write_queue.py`:
- Around line 67-73: Replace the weak cap assertion in tests/test_write_queue.py
by asserting exactly 100 items (change "assert len(lines) <= 100" to "assert
len(lines) == 100") and add a check that the first retained entry is the
expected boundary item: parse the first line (e.g., first_item =
json.loads(lines[0])) and assert first_item["content"] equals the item at index
(total_created - 100) (or hardcode "item-5" if total_created is 105), using the
existing variables lines and last_item to locate the check.
- Around line 211-214: The test_update_retries_before_failing currently allows
non-dict error-like returns to pass; update the assertions to explicitly verify
"not an error" for both dict and CallToolResult-like objects: reference the
variables result and call_count and assert call_count == 3, then assert that if
result is a dict its "isError" key is falsy and if result has an isError
attribute (e.g., a CallToolResult) that attribute is False — use
isinstance(result, dict) with result.get("isError") and hasattr(result,
"isError") / getattr(result, "isError") checks to enforce actual success
semantics.

---

Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 244-247: The retry scope is too narrow: expand the retry loop
inside _search to encompass the entire DB-read sequence including the call to
_get_vector_store(), store.count(), hybrid_search(...), and
enrich_results_with_session_context(...), so any transient DB/lock errors are
retried; adjust the retry wrapper (where hybrid_search is currently retried) to
surround from obtaining the store through building the final results and return
the same empty/normal response after retries or propagate the final error if
retries exhaust.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c59ddc6 and c0fd916.

📒 Files selected for processing (3)
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
  • tests/test_write_queue.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run tests with pytest

Files:

  • tests/test_write_queue.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Typer CLI framework for command-line interface implementation in src/brainlayer/
Handle concurrency by retrying on SQLITE_BUSY errors; each worker should use its own database connection
Export brain graph as JSON for Next.js dashboard using brainlayer brain-export command
Export Obsidian vault with Markdown files including backlinks and tags using brainlayer export-obsidian command

Files:

  • src/brainlayer/mcp/store_handler.py
  • src/brainlayer/mcp/search_handler.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement MCP server with 8 tools: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person (legacy brainlayer_* aliases still work)

Files:

  • src/brainlayer/mcp/store_handler.py
  • src/brainlayer/mcp/search_handler.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use ruff check src/ for linting and ruff format src/ for code formatting

Files:

  • src/brainlayer/mcp/store_handler.py
  • src/brainlayer/mcp/search_handler.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection

Applied to files:

  • tests/test_write_queue.py
  • src/brainlayer/mcp/store_handler.py
  • src/brainlayer/mcp/search_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with 8 tools: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` (legacy `brainlayer_*` aliases still work)

Applied to files:

  • src/brainlayer/mcp/store_handler.py
  • src/brainlayer/mcp/search_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`

Applied to files:

  • src/brainlayer/mcp/store_handler.py
🧬 Code graph analysis (2)
tests/test_write_queue.py (2)
src/brainlayer/mcp/store_handler.py (3)
  • _flush_pending_stores (217-268)
  • _queue_store (188-214)
  • _store (271-371)
src/brainlayer/mcp/search_handler.py (1)
  • _search (216-391)
src/brainlayer/mcp/store_handler.py (2)
src/brainlayer/mcp/_shared.py (3)
  • _get_vector_store (16-24)
  • _error_result (145-147)
  • _get_embedding_model (27-34)
src/brainlayer/vector_store.py (1)
  • update_chunk (577-612)

Comment on lines +111 to +112
store = _get_vector_store()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use a per-operation DB connection in the retry path.

At Line 111, _brain_update still relies on _get_vector_store() (shared singleton). Under concurrent workers, this keeps lock contention concentrated on one connection and undermines the busy-retry design.

As per coding guidelines: "Handle concurrency by retrying on SQLITE_BUSY errors; each worker should use its own database connection".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/store_handler.py` around lines 111 - 112, _brain_update
currently calls the shared singleton _get_vector_store(), causing contention;
change _brain_update so that each retry attempt opens its own DB
connection/object instead of reusing _get_vector_store() (e.g., create a new
VectorStore/connection instance inside the retry loop), run the operation, and
always close/cleanup that connection in a finally block; ensure the SQLITE_BUSY
retry logic recreates a fresh per-attempt connection so concurrent workers do
not share a single connection.

Comment thread src/brainlayer/mcp/store_handler.py Outdated
Comment on lines +201 to +214
# Enforce max size — read, trim oldest, rewrite
try:
lines = path.read_text().strip().splitlines()
if len(lines) > _QUEUE_MAX_SIZE:
trimmed = lines[-_QUEUE_MAX_SIZE:]
path.write_text("\n".join(trimmed) + "\n")
logger.warning(
"Pending store queue trimmed: %d -> %d (dropped %d oldest)",
len(lines),
_QUEUE_MAX_SIZE,
len(lines) - _QUEUE_MAX_SIZE,
)
except Exception:
pass # Non-critical — queue still works, just unbounded
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Queue trimming is not concurrency-safe and can drop buffered writes.

Lines 201-206 perform read/trim/rewrite without an inter-process lock or atomic replace boundary. Concurrent _queue_store calls can clobber each other and silently lose queued memories.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/store_handler.py` around lines 201 - 214, The
read/trim/rewrite block around path (using path.read_text(), _QUEUE_MAX_SIZE,
lines, trimmed, path.write_text()) is not concurrency-safe and can drop buffered
writes; protect this section with an inter-process lock (e.g., acquire a file
lock on the queue file or a companion lock file using fcntl/portalocker) around
the read/trim/write sequence and perform the write atomically by writing to a
temporary file and then replacing the original with os.replace (or Path.replace)
so concurrent processes cannot clobber each other; ensure the lock is reliably
released in a finally block and keep the existing warning log when trimming
occurs.

Comment thread tests/test_write_queue.py
Comment thread tests/test_write_queue.py Outdated
Comment on lines +211 to +214
# Should succeed after retries
assert call_count == 3
assert not isinstance(result, dict) or not result.get("isError")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

test_update_retries_before_failing can pass even when update returns an error.

The assertion at Line 213 is effectively always true for non-dict returns, including an error CallToolResult. This can produce false positives.

✅ Assert real success semantics
         # Should succeed after retries
         assert call_count == 3
-        assert not isinstance(result, dict) or not result.get("isError")
+        from mcp.types import CallToolResult
+        assert not isinstance(result, CallToolResult)
+        assert any("archived" in t.text.lower() for t in result)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Should succeed after retries
assert call_count == 3
assert not isinstance(result, dict) or not result.get("isError")
# Should succeed after retries
assert call_count == 3
from mcp.types import CallToolResult
assert not isinstance(result, CallToolResult)
assert any("archived" in t.text.lower() for t in result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_write_queue.py` around lines 211 - 214, The
test_update_retries_before_failing currently allows non-dict error-like returns
to pass; update the assertions to explicitly verify "not an error" for both dict
and CallToolResult-like objects: reference the variables result and call_count
and assert call_count == 3, then assert that if result is a dict its "isError"
key is falsy and if result has an isError attribute (e.g., a CallToolResult)
that attribute is False — use isinstance(result, dict) with
result.get("isError") and hasattr(result, "isError") / getattr(result,
"isError") checks to enforce actual success semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/brainlayer/mcp/search_handler.py (2)

244-256: 🧹 Nitpick | 🔵 Trivial

Store obtained outside retry loop may limit resilience.

_get_vector_store() at Line 244 retrieves the singleton connection before the retry loop (Lines 273-300). If the connection itself becomes stale or problematic, retries will reuse the same connection. This is less severe than the update path since reads are less contention-prone, but consider whether obtaining a fresh store inside the loop would improve resilience.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/search_handler.py` around lines 244 - 256, The code calls
_get_vector_store() once before the retry loop, which can reuse a stale
connection; move the call to _get_vector_store() inside the retry loop used for
reads so each retry obtains a fresh store instance, then use that local store
for store.count() and subsequent search operations (keep the existing
normalized_project, model/embed logic intact) to improve resilience when the
underlying vector store connection becomes problematic.

8-25: 🧹 Nitpick | 🔵 Trivial

Consider moving imports before module-level variables.

Module-level variables (Lines 8-10) are declared before imports (Lines 12-25). While this works, the conventional pattern is imports first. This is a minor style nit.

♻️ Proposed reordering
 import apsw
 from mcp.types import TextContent

+from ._shared import (
+    _build_compact_result,
+    _error_result,
+    _extract_file_path,
+    _get_embedding_model,
+    _get_vector_store,
+    _memory_to_dict,
+    _normalize_project_name,
+    _query_has_regression_signal,
+    _query_signals_current_context,
+    _query_signals_recall,
+    _query_signals_think,
+    logger,
+)
+
 # Retry settings for DB lock resilience on reads
 _RETRY_MAX_ATTEMPTS = 3
 _retry_delay = 0.1  # base delay in seconds (exposed for test patching)
-
-from ._shared import (
-    ...
-)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/search_handler.py` around lines 8 - 25, Move the import
block that brings in functions like _build_compact_result, _error_result,
_extract_file_path, _get_embedding_model, _get_vector_store, _memory_to_dict,
_normalize_project_name, _query_has_regression_signal,
_query_signals_current_context, _query_signals_recall, _query_signals_think, and
logger so it appears before the module-level constants _RETRY_MAX_ATTEMPTS and
_retry_delay; update search_handler.py by placing the entire from ._shared
import (...) section at the top of the file (above the _RETRY_MAX_ATTEMPTS and
_retry_delay declarations) to follow the conventional imports-first ordering.
src/brainlayer/mcp/store_handler.py (1)

25-56: ⚠️ Potential issue | 🟠 Major

Add retry logic to _brain_digest for DB lock resilience.

The _brain_digest function calls digest_content, which performs database writes via store.upsert_chunks() and entity operations. However, unlike _brain_update and _search, _brain_digest lacks retry logic for BusyError. Per the coding guideline, all database operations should retry on SQLITE_BUSY errors. Wrap the digest operation in the same retry pattern used by _brain_update (lines 109–178): a loop with exponential backoff for up to 4 attempts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/store_handler.py` around lines 25 - 56, _brain_digest
currently calls digest_content (which does store.upsert_chunks and entity
writes) without retrying on database BusyError; wrap the call to digest_content
(the lambda passed to loop.run_in_executor) in the same retry loop used by
_brain_update: attempt up to 4 times, catch BusyError specifically, on BusyError
sleep with exponential backoff (e.g., 0.1 * 2**attempt) and retry, and only
re-raise or return an error after exhausting attempts; keep using
loop.run_in_executor with model.embed_query and preserve the existing ValueError
and generic Exception handling for non-BusyError failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 291-292: The except clause currently uses the redundant form
"except (apsw.BusyError, Exception) as e"; change it to a single generic catch
like "except Exception as e" (keeping the existing is_lock logic that inspects
isinstance(e, apsw.BusyError) or checks "locked"/"busy" in str(e). Update the
handler in search_handler.py where is_lock is set so it mirrors store_handler's
simpler pattern and remove the redundant tuple in the except clause while
retaining apsw.BusyError usage in the isinstance check.

In `@src/brainlayer/mcp/store_handler.py`:
- Around line 169-170: The except clause is redundant: replace "except
(apsw.BusyError, Exception) as e" with "except Exception as e" and keep the
existing is_lock_error computation (which still checks isinstance(e,
apsw.BusyError) or "locked"/"busy" in str(e).lower()) so intent remains clear
while removing the redundant tuple; update the except block around the
is_lock_error assignment in the same function to reflect this change.

In `@tests/test_write_queue.py`:
- Around line 244-249: The test mock original_search_results is missing the
required "ids" key that hybrid_search returns; update the mock in
tests/test_write_queue.py (variable original_search_results) to include an "ids"
entry with the same nested-list shape as "documents"/"metadatas" (e.g.,
[["doc_id_1"]] or matching chunk ids like [["c1"]]) so the test accurately
mirrors hybrid_search's return structure and avoids false positives.

---

Outside diff comments:
In `@src/brainlayer/mcp/search_handler.py`:
- Around line 244-256: The code calls _get_vector_store() once before the retry
loop, which can reuse a stale connection; move the call to _get_vector_store()
inside the retry loop used for reads so each retry obtains a fresh store
instance, then use that local store for store.count() and subsequent search
operations (keep the existing normalized_project, model/embed logic intact) to
improve resilience when the underlying vector store connection becomes
problematic.
- Around line 8-25: Move the import block that brings in functions like
_build_compact_result, _error_result, _extract_file_path, _get_embedding_model,
_get_vector_store, _memory_to_dict, _normalize_project_name,
_query_has_regression_signal, _query_signals_current_context,
_query_signals_recall, _query_signals_think, and logger so it appears before the
module-level constants _RETRY_MAX_ATTEMPTS and _retry_delay; update
search_handler.py by placing the entire from ._shared import (...) section at
the top of the file (above the _RETRY_MAX_ATTEMPTS and _retry_delay
declarations) to follow the conventional imports-first ordering.

In `@src/brainlayer/mcp/store_handler.py`:
- Around line 25-56: _brain_digest currently calls digest_content (which does
store.upsert_chunks and entity writes) without retrying on database BusyError;
wrap the call to digest_content (the lambda passed to loop.run_in_executor) in
the same retry loop used by _brain_update: attempt up to 4 times, catch
BusyError specifically, on BusyError sleep with exponential backoff (e.g., 0.1 *
2**attempt) and retry, and only re-raise or return an error after exhausting
attempts; keep using loop.run_in_executor with model.embed_query and preserve
the existing ValueError and generic Exception handling for non-BusyError
failures.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0fd916 and 71eca21.

📒 Files selected for processing (3)
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
  • tests/test_write_queue.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Typer CLI framework for command-line interface implementation in src/brainlayer/
Handle concurrency by retrying on SQLITE_BUSY errors; each worker should use its own database connection
Export brain graph as JSON for Next.js dashboard using brainlayer brain-export command
Export Obsidian vault with Markdown files including backlinks and tags using brainlayer export-obsidian command

Files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement MCP server with 8 tools: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person (legacy brainlayer_* aliases still work)

Files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use ruff check src/ for linting and ruff format src/ for code formatting

Files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run tests with pytest

Files:

  • tests/test_write_queue.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection

Applied to files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
  • tests/test_write_queue.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with 8 tools: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` (legacy `brainlayer_*` aliases still work)

Applied to files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/mcp/store_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Use APSW with sqlite-vec for vector storage implementation in `vector_store.py`

Applied to files:

  • src/brainlayer/mcp/store_handler.py
📚 Learning: 2026-02-28T09:42:45.668Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.668Z
Learning: Applies to src/brainlayer/vector_store.py : Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`

Applied to files:

  • src/brainlayer/mcp/store_handler.py
🧬 Code graph analysis (1)
src/brainlayer/mcp/search_handler.py (1)
src/brainlayer/search_repo.py (1)
  • hybrid_search (333-553)
🔇 Additional comments (9)
src/brainlayer/mcp/store_handler.py (3)

19-22: LGTM — Retry configuration is well-exposed for testing.

The module-level retry settings with sensible defaults and exposed _retry_delay for test patching follow good practices for testable retry logic.


109-112: Use a per-operation DB connection in the retry path.

_get_vector_store() returns a shared singleton. Under concurrent workers, retrying on the same connection keeps lock contention concentrated and undermines the busy-retry design. As per coding guidelines: "Handle concurrency by retrying on SQLITE_BUSY errors; each worker should use its own database connection".

Consider creating a fresh connection inside each retry attempt or providing a factory that returns new connections for retry scenarios.

[duplicate_comment, raise_major_issue]


203-216: Queue trimming is not concurrency-safe.

The read/trim/rewrite sequence at Lines 205-208 lacks inter-process synchronization. Concurrent _queue_store calls can clobber each other's writes, silently losing queued memories. Consider using a file lock (e.g., fcntl or portalocker) and atomic replace via a temp file.

[duplicate_comment, raise_critical_issue]

tests/test_write_queue.py (4)

1-7: LGTM — Good test module structure and documentation.

Clear docstring explaining test coverage scope and the use of tmp_path fixtures for isolation.


67-73: Tighten the queue cap assertion.

assert len(lines) <= 100 can hide trimming regressions. The test should verify exact cap behavior and the first retained item.

[duplicate_comment, raise_minor_issue]

✅ Proposed fix
         lines = pending_path.read_text().strip().splitlines()
         # Should be capped at 100, with oldest dropped
-        assert len(lines) <= 100
+        assert len(lines) == 100
+        # Verify oldest items were dropped (items 0-4)
+        first_item = json.loads(lines[0])
+        assert first_item["content"] == "item-5"
         # The newest items should be kept
         last_item = json.loads(lines[-1])
         assert last_item["content"] == "item-104"

211-214: Strengthen the success assertion.

The assertion not isinstance(result, dict) or not result.get("isError") can pass for error CallToolResult objects. The test should explicitly verify successful archive behavior.

[duplicate_comment, raise_minor_issue]

✅ Proposed fix
         # Should succeed after retries
         assert call_count == 3
-        assert not isinstance(result, dict) or not result.get("isError")
+        # Result should be a list of TextContent, not a CallToolResult error
+        assert isinstance(result, list)
+        assert len(result) > 0
+        # Verify the response indicates successful archive
+        result_text = result[0].text
+        assert "archived" in result_text.lower()

90-93: The patch path is correct.

The patch at line 91 targets "brainlayer.store.store_memory", which is the correct location. When _flush_pending_stores executes from ..store import store_memory (a local import), Python resolves it to brainlayer.store.store_memory—the definition location—and that's the appropriate target for mocking. This is standard practice and not fragile.

src/brainlayer/mcp/search_handler.py (2)

270-300: LGTM — Retry logic for search is well-implemented.

The retry loop with exponential backoff, clear logging, and proper re-raise on exhaustion follows good resilience patterns. The comment explaining WAL checkpoint blocking is helpful context.


263-265: LGTM — Good AIDEV-NOTE documentation.

The comment explains the rationale for changing the default source filter behavior, which helps future maintainers understand the design decision.

Comment thread src/brainlayer/mcp/search_handler.py Outdated
Comment thread src/brainlayer/mcp/store_handler.py Outdated
Comment thread tests/test_write_queue.py
Comment on lines +244 to +249
call_count = 0
original_search_results = {
"documents": [["test doc"]],
"metadatas": [[{"project": "test", "content_type": "note", "chunk_id": "c1"}]],
"distances": [[0.1]],
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test mock missing required ids key in search results.

The mock original_search_results (Lines 245-249) is missing the ids key that hybrid_search returns (per the relevant code snippet showing the return structure). This may cause the test to pass despite incomplete mock data.

✅ Proposed fix
         original_search_results = {
+            "ids": [["c1"]],
             "documents": [["test doc"]],
             "metadatas": [[{"project": "test", "content_type": "note", "chunk_id": "c1"}]],
             "distances": [[0.1]],
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_write_queue.py` around lines 244 - 249, The test mock
original_search_results is missing the required "ids" key that hybrid_search
returns; update the mock in tests/test_write_queue.py (variable
original_search_results) to include an "ids" entry with the same nested-list
shape as "documents"/"metadatas" (e.g., [["doc_id_1"]] or matching chunk ids
like [["c1"]]) so the test accurately mirrors hybrid_search's return structure
and avoids false positives.

…er assertions

- Simplify `except (BusyError, Exception)` to `except Exception`
- Queue trim uses tempfile + atomic rename (POSIX safe)
- Test assertions now verify exact queue cap and success response content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey EtanHey merged commit 9537f76 into main Feb 28, 2026
5 checks passed
EtanHey added a commit that referenced this pull request Mar 14, 2026


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EtanHey added a commit that referenced this pull request Mar 14, 2026
)

* Harden search validation and backfill coverage

* docs: update test count to 929 and add CHANGELOG entries for PRs #65-#78

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant