fix(writer): single-writer pidfile mutex + retry/timeout/throttle hardening#309
Conversation
📝 WalkthroughWalkthroughThis PR adds pidfile-based exclusive writer coordination to VectorStore, detects recency intent to fetch and boost 7-day recent chunks, makes SQLite busy-timeout and init-retry budget environment-configurable, skips ambiguous single-token KG entity matches, and updates tests/fixtures to open VectorStore in readonly mode. ChangesVectorStore Writer Coordination and Search Enhancements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@greptileai review |
|
@codex review |
|
@coderabbitai review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 17e33ea01b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| FROM chunks | ||
| WHERE created_at >= datetime('now', '-7 days') {" ".join(recent_extra)} | ||
| ORDER BY created_at DESC |
There was a problem hiding this comment.
Preserve entity scoping in recency-intent fallback
This fallback query pulls recent rows directly from chunks without applying entity_id, even though hybrid_search documents entity_id as a hard per-entity scope. Because these rows are injected into fts_ranks/keyword_data and the later FTS-only post-filter does not re-check entity membership, queries containing recency terms (e.g. “recent”, “latest”) can return chunks from other entities.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in acbf191. The recency-intent fallback now preserves entity_id and related active filters before injecting recent rows into ranking data; covered by a regression test.
| try: | ||
| pidfile.unlink() | ||
| except FileNotFoundError: |
There was a problem hiding this comment.
Make stale pidfile cleanup atomic before unlink
The stale-lock path unlinks based on a prior PID read/liveness check, but that check and unlink() are not atomic. If two writers race while recovering a stale pidfile, one process can create a fresh pidfile after the other has read the old PID, and the second process can then unlink that fresh lock and proceed, allowing concurrent writers to initialize the same DB.
Useful? React with 👍 / 👎.
|
PR-alpha CI is green and PR-beta #306 remains blocked on this landing first. Etan/admin: please merge with rebase only, no squash, when acceptable so beta can rebase onto the pidfile mutex on origin/main. |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
brainlayer/src/brainlayer/vector_store.py
Line 314 in acbf191
_INIT_MAX_RETRIES is now env-configurable, but when BRAINLAYER_INIT_MAX_RETRIES is set to 0 (or any negative value), the retry loop never executes and last_err stays None; this line then raises TypeError: exceptions must derive from BaseException instead of a meaningful DB-init failure. That turns a recoverable configuration mistake into a hard startup crash with misleading diagnostics.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _ingest_keyword_rows(fts_results, fts_ranks) | ||
| _ingest_keyword_rows(trigram_fts_results, trigram_ranks) | ||
|
|
||
| recency_intent = any(term in query_text.lower() for term in _RECENCY_QUERY_TERMS) |
There was a problem hiding this comment.
Match recency intent on whole terms only
recency_intent is computed with substring checks (term in query_text.lower()), so the term "current" also matches unrelated words like "concurrent"/"recurrent". In those cases hybrid search incorrectly enters the recency fallback path and applies the extra ≤7-day score multiplier, which can reorder results toward recent chunks even when the user did not ask for recency.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 1ece80a. This shares the same whole-term recency-intent helper and regression coverage as the Cursor finding.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1ece80ab63
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| # Retry settings for DB init under contention (multiple MCP instances + enrichment) | ||
| _INIT_MAX_RETRIES = 5 | ||
| _INIT_MAX_RETRIES = _int_env("BRAINLAYER_INIT_MAX_RETRIES", 10) |
There was a problem hiding this comment.
Clamp init retry env to at least one attempt
BRAINLAYER_INIT_MAX_RETRIES is read directly into _INIT_MAX_RETRIES without bounds checking, so setting it to 0 or a negative value makes _init_db_with_retry() run zero iterations and then execute raise last_err while last_err is still None, which raises TypeError and prevents writer startup entirely. This turns a retry-tuning env var into a hard boot failure under misconfiguration.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in c661a16. The retry budget is clamped to at least one attempt at env-read time and inside _init_db_with_retry, with a regression test proving a zero runtime budget re-raises the original BusyError.
| summary, tags, importance, intent, | ||
| created_at, source, sender, language, decay_score | ||
| FROM chunks | ||
| WHERE created_at >= datetime('now', '-7 days') {" ".join(recent_extra)} |
There was a problem hiding this comment.
Compare timestamps as datetimes in recency fallback
The recency fallback filters with created_at >= datetime('now', '-7 days'), but created_at values are ISO strings like YYYY-MM-DDTHH:MM:SSZ while datetime() returns YYYY-MM-DD HH:MM:SS; SQLite then performs a lexical TEXT comparison, so same-day boundary rows (e.g. ...T00:00:00Z) are treated as newer than the cutoff even when they are actually older than 7 days. This widens the fallback window and can pull stale chunks into “latest/recent” results.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4378c17. The recency fallback now compares datetime(created_at) against datetime(now, -7 days), with a boundary regression test.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4378c1718c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| with self._PIDFILE_REFS_LOCK: | ||
| refs = self._PIDFILE_REFS.get(pidfile, 0) | ||
| if refs <= 1: | ||
| self._PIDFILE_REFS.pop(pidfile, None) | ||
| should_unlink = True |
There was a problem hiding this comment.
Keep pidfile ownership until release unlink finishes
Clearing _PIDFILE_REFS before the unlink step opens a race where one thread can call close() while another thread in the same process opens a new writer for the same DB: the new open goes through _handle_existing_writer_pidfile() (same PID) and marks itself acquired, then the closing thread still unlinks the pidfile, leaving an active writer with no lockfile. At that point a third process can create a fresh pidfile and start writing concurrently, violating the single-writer contract under concurrent open/close in one process.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in e00434a. Added a RED regression for the same-process close/open interleaving, then kept pidfile ownership under _PIDFILE_REFS_LOCK until the release path finishes the owned-file unlink and ref removal. Focused test and pre-push gate are green.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/brainlayer/vector_store.py (1)
2027-2038:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlways release the writer pidfile even if DB close fails.
If
self.conn.close()raises,_release_writer_pidfile()is skipped and the process can leave a stale writer lock behind.Suggested fix
def close(self) -> None: """Close database connections.""" # Close thread-local read connection if it exists if hasattr(self, "_local"): read_conn = getattr(self._local, "read_conn", None) if read_conn is not None: read_conn.close() self._local.read_conn = None - if hasattr(self, "conn"): - self.conn.close() - self._release_writer_pidfile() + try: + if hasattr(self, "conn"): + self.conn.close() + finally: + self._release_writer_pidfile()As per coding guidelines
**/*.py: "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/brainlayer/vector_store.py` around lines 2027 - 2038, The close method can raise while closing DB connections and skip calling _release_writer_pidfile, leaving a stale writer lock; update close(self) to ensure _release_writer_pidfile() is always executed by wrapping connection-close logic in a try/finally (or similar) so that even if getattr(self, "conn").close() or closing self._local.read_conn raises, _release_writer_pidfile() runs in the finally block; handle/propagate exceptions as appropriate but do not allow them to prevent _release_writer_pidfile() from executing (reference: close, _release_writer_pidfile, conn, and _local.read_conn).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/search_repo.py`:
- Around line 1527-1541: The new recency query that builds recent_rows (the
cursor.execute block selecting from chunks and using candidate_fetch_count) must
be wrapped in SQLITE_BUSY retry/backoff logic: catch
sqlite3.OperationalError/SQLITE_BUSY around the cursor.execute call, retry with
exponential backoff (short sleeps between attempts) for a limited number of
attempts, and only raise after retries are exhausted; ensure this uses the
worker's own DB connection/cursor used elsewhere in this module (i.e., the same
cursor variable) to respect the guideline that each worker uses its own
connection. Replace the direct list(cursor.execute(...)) call with a retry loop
that logs or sleeps on SQLITE_BUSY and then executes the same query and
parameters (including recent_extra and recent_params) on success.
- Around line 1473-1539: The recency fallback rebuilds filters but forgot to
apply content_type_filter and date_to, so add checks for content_type_filter
(append "AND content_type = ?" and push content_type_filter into recent_params)
and for date_to (append a created_at upper-bound clause like "AND
datetime(created_at) <= datetime(?)" and push date_to into recent_params) before
executing the cursor query in the method that builds recent_extra (the block
referencing recent_extra, recent_params and the cursor.execute SELECT). Ensure
the new clauses are appended in the same manner/order as the other filters so
the final parameter list [*recent_params, min(candidate_fetch_count, 25)]
remains aligned.
In `@src/brainlayer/vector_store.py`:
- Around line 177-203: The race occurs between _handle_existing_writer_pidfile
and _release_writer_pidfile where refs and pidfile unlink can interleave, so
modify both functions to serialize ref checks/updates and unlink decisions using
the existing _PIDFILE_REFS_LOCK: in _handle_existing_writer_pidfile (after
reading other_pid and before incrementing self._PIDFILE_REFS for pidfile)
acquire _PIDFILE_REFS_LOCK to re-check the current ref count and the pidfile
existence and then increment and set
_writer_pidfile_acquired/_writer_pidfile_path_value and register atexit
atomically; likewise, when deciding to unlink a stale pidfile (the
os.path.samestat branch), acquire _PIDFILE_REFS_LOCK first and re-check that no
new refs were added and the on-disk pid still matches before calling
pidfile.unlink(); ensure _release_writer_pidfile also uses _PIDFILE_REFS_LOCK to
decrement refs and only unlink if the ref count reaches zero and the pidfile
still refers to our pid to prevent removing a pidfile re-created by the same
process.
In `@tests/test_hybrid_search.py`:
- Around line 434-521: The test suite adds recency fallback checks for
entity/sentiment and boundary behavior but omits parity tests for
content_type_filter and date_to; add two new tests mirroring the existing
patterns (e.g., follow test_recency_intent_fallback_preserves_entity_filter and
test_recency_intent_fallback_compares_created_at_as_datetime) that: (1) insert
recent chunks with different content_type values, call store.hybrid_search with
content_type_filter set and assert the non-matching content_type chunk is
excluded, and (2) insert chunks around a date_to boundary, monkeypatch
store.search same as
test_recency_intent_fallback_compares_created_at_as_datetime, call hybrid_search
with date_to set and assert chunks after date_to are excluded; reference test
function names and store.hybrid_search/search/_insert_chunk to locate where to
add them.
In `@tests/test_writer_pidfile_mutex.py`:
- Around line 235-242: The test test_init_retries_10_with_extended_backoff is
fragile because VectorStore._INIT_MAX_RETRIES can be preseeded by the
environment; make it deterministic by explicitly setting
VectorStore._INIT_MAX_RETRIES to 10 at the start of the test (or use monkeypatch
to set that attribute) before computing delays and asserting values, then
restore the original value at the end or use a fixture to isolate the change so
external BRAINLAYER_INIT_MAX_RETRIES/env settings cannot affect the assertion.
---
Outside diff comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2027-2038: The close method can raise while closing DB connections
and skip calling _release_writer_pidfile, leaving a stale writer lock; update
close(self) to ensure _release_writer_pidfile() is always executed by wrapping
connection-close logic in a try/finally (or similar) so that even if
getattr(self, "conn").close() or closing self._local.read_conn raises,
_release_writer_pidfile() runs in the finally block; handle/propagate exceptions
as appropriate but do not allow them to prevent _release_writer_pidfile() from
executing (reference: close, _release_writer_pidfile, conn, and
_local.read_conn).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef9f0b9f-6f95-469d-90a4-8f125fca111d
📒 Files selected for processing (11)
hooks/brainlayer-prompt-search.pyscripts/launchd/com.brainlayer.enrichment.plistsrc/brainlayer/drain.pysrc/brainlayer/search_repo.pysrc/brainlayer/vector_store.pytests/test_engine.pytests/test_eval_baselines.pytests/test_hybrid_search.pytests/test_think_recall_integration.pytests/test_vector_store.pytests/test_writer_pidfile_mutex.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: Cursor Bugbot
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
tests/test_think_recall_integration.pytests/test_vector_store.pysrc/brainlayer/drain.pyhooks/brainlayer-prompt-search.pytests/test_engine.pytests/test_hybrid_search.pysrc/brainlayer/vector_store.pysrc/brainlayer/search_repo.pytests/test_writer_pidfile_mutex.pytests/test_eval_baselines.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/drain.pysrc/brainlayer/vector_store.pysrc/brainlayer/search_repo.py
🪛 OpenGrep (1.21.0)
src/brainlayer/search_repo.py
[ERROR] 1528-1540: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.
(coderabbit.sql-injection.python-fstring-execute)
🔇 Additional comments (8)
hooks/brainlayer-prompt-search.py (1)
44-46: LGTM!Also applies to: 544-545
src/brainlayer/search_repo.py (1)
67-69: LGTM!Also applies to: 84-86, 1680-1681
src/brainlayer/drain.py (1)
37-41: LGTM!Also applies to: 71-74
scripts/launchd/com.brainlayer.enrichment.plist (1)
55-56: LGTM!tests/test_engine.py (1)
348-350: LGTM!Also applies to: 393-395
tests/test_eval_baselines.py (1)
68-70: LGTM!Also applies to: 689-691
tests/test_think_recall_integration.py (1)
21-23: LGTM!tests/test_vector_store.py (1)
14-16: LGTM!
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aeef3d2f7e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if recency_intent and not date_from: | ||
| recent_extra = [] |
There was a problem hiding this comment.
Respect date_to in recency-intent fallback
Gate the recency fallback on both date bounds or apply date_to inside the fallback query. Right now hybrid_search enters the fallback whenever recency_intent is true and date_from is unset, so requests that set only date_to (for example, “latest notes before 2024-01-01”) still inject last-7-days chunks from chunks; those injected rows are treated as FTS-only candidates and the later post-filter does not enforce date_to, so results can violate the caller’s explicit upper date bound.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d858e34. The recency fallback applies datetime(created_at) <= datetime(?) for date_to, with a regression test.
|
@codex review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@coderabbitai review |
|
@codex review |
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 36e9cc2168
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not pidfile_dir.is_absolute(): | ||
| pidfile_dir = Path("/tmp") / pidfile_dir | ||
| pidfile_dir = pidfile_dir.resolve() |
There was a problem hiding this comment.
Keep relative pidfile overrides under /tmp
Relative BRAINLAYER_WRITER_PIDFILE_DIR values can escape /tmp because the code appends the raw relative path and then calls resolve(). For example, ../locks becomes /locks, so writer startup may fail with permissions (or write lockfiles outside the intended sandbox) instead of honoring the documented “relative paths live under /tmp” contract.
Useful? React with 👍 / 👎.
| with self._PIDFILE_REFS_LOCK: | ||
| if self._PIDFILE_REFS.get(pidfile, 0) > 0: | ||
| ref_pid = self._PIDFILE_REF_PIDS.get(pidfile) | ||
| if ref_pid is not None and ref_pid != pid: | ||
| self._PIDFILE_REFS.pop(pidfile, None) | ||
| self._PIDFILE_REF_PIDS.pop(pidfile, None) | ||
| else: | ||
| owner_pid, owner_start_time = self._read_writer_pidfile_owner(pidfile) | ||
| if owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time): | ||
| raise WriterInUseError( | ||
| f"pidfile ref mismatch for {self.db_path}; refusing to clear active refs" | ||
| ) | ||
| self._PIDFILE_REFS[pidfile] += 1 |
There was a problem hiding this comment.
🟢 Low brainlayer/vector_store.py:155
When _read_writer_pidfile_owner() returns (None, None) due to a corrupt or empty pidfile, the condition owner_pid != pid evaluates to True because None != some_int, causing WriterInUseError to be raised even though the in-memory _PIDFILE_REFS state correctly reflects that this process owns the lock. Consider handling owner_pid is None as a valid case where the current process can reclaim ownership.
owner_pid, owner_start_time = self._read_writer_pidfile_owner(pidfile)
- if owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time):
+ if owner_pid is None:
+ # pidfile corrupted but in-memory refs valid; reclaim ownership
+ pass
+ elif owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time):
raise WriterInUseError(
f"pidfile ref mismatch for {self.db_path}; refusing to clear active refs"
)🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/vector_store.py around lines 155-167:
When `_read_writer_pidfile_owner()` returns `(None, None)` due to a corrupt or empty pidfile, the condition `owner_pid != pid` evaluates to `True` because `None != some_int`, causing `WriterInUseError` to be raised even though the in-memory `_PIDFILE_REFS` state correctly reflects that this process owns the lock. Consider handling `owner_pid is None` as a valid case where the current process can reclaim ownership.
Evidence trail:
src/brainlayer/vector_store.py lines 150-172 (REVIEWED_COMMIT): `_acquire_writer_pidfile` method with the condition at line 163. src/brainlayer/vector_store.py lines 231-244 (REVIEWED_COMMIT): `_read_writer_pidfile_owner` returning `(None, None)` for empty files (line 235) and for OSError/ValueError (line 244).
There was a problem hiding this comment.
I am leaving this unchanged intentionally. An unreadable or corrupt pidfile while in-memory writer refs exist is a mutex-integrity violation, not a safe reclaim path: treating owner_pid=None as valid could drop protection while a writer is still active. The current behavior raises and preserves the active refs instead of clearing them. Regression coverage: test_pidfile_ref_mismatch_does_not_clear_existing_refs.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/brainlayer/search_repo.py (1)
1532-1554:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAppend recency-fallback ranks after existing FTS hits.
Line 1553 assigns recent-only rows rank
i, restarting at0even whenfts_ranksalready has exact lexical matches. That lets fallback rows contribute the same RRF weight as the top FTS hit, and the later 2x recency boost can push non-matching recent chunks above genuinely matching results.Suggested fix
+ recent_rank_offset = len(fts_ranks) for i, row in enumerate(recent_rows): chunk_id = row[0] - fts_ranks.setdefault(chunk_id, i) + fts_ranks.setdefault(chunk_id, recent_rank_offset + i) keyword_data.setdefault(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/brainlayer/search_repo.py` around lines 1532 - 1554, The recency fallback loop resets ranks by using enumerate(recent_rows) starting at 0, allowing recent-only rows to get the same rank as top FTS hits; change the enumeration to start after existing FTS ranks (e.g., use enumerate(recent_rows, start=len(fts_ranks)) or start=max(fts_ranks.values(), default=0)+1) so chunk_id assignments in the loop that sets fts_ranks (the block iterating recent_rows and using fts_ranks.setdefault) append ranks rather than colliding with existing FTS ranks; ensure you do not overwrite existing fts_ranks and add a short comment explaining the choice.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2112-2122: The current finally block always calls
self._release_writer_pidfile(), which can drop the writer pidfile even if
read_conn.close() or self.conn.close() fails; instead, remove the unconditional
finally call and only invoke self._release_writer_pidfile() after both
connection closes complete successfully (i.e., call it after read_conn.close()
and self.conn.close() without wrapping those in a finally that always releases
the pidfile); if any close raises, do not release the pidfile and re-raise the
exception so the single-writer guard remains held; update the failing close test
to expect the pidfile to be retained rather than removed.
---
Outside diff comments:
In `@src/brainlayer/search_repo.py`:
- Around line 1532-1554: The recency fallback loop resets ranks by using
enumerate(recent_rows) starting at 0, allowing recent-only rows to get the same
rank as top FTS hits; change the enumeration to start after existing FTS ranks
(e.g., use enumerate(recent_rows, start=len(fts_ranks)) or
start=max(fts_ranks.values(), default=0)+1) so chunk_id assignments in the loop
that sets fts_ranks (the block iterating recent_rows and using
fts_ranks.setdefault) append ranks rather than colliding with existing FTS
ranks; ensure you do not overwrite existing fts_ranks and add a short comment
explaining the choice.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1ab4c20e-a711-4c76-a8fd-983278b6d842
📒 Files selected for processing (7)
hooks/brainlayer-prompt-search.pysrc/brainlayer/drain.pysrc/brainlayer/search_repo.pysrc/brainlayer/vector_store.pytests/test_adaptive_injection.pytests/test_hybrid_search.pytests/test_writer_pidfile_mutex.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
tests/test_adaptive_injection.pysrc/brainlayer/drain.pyhooks/brainlayer-prompt-search.pysrc/brainlayer/search_repo.pytests/test_hybrid_search.pysrc/brainlayer/vector_store.pytests/test_writer_pidfile_mutex.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/drain.pysrc/brainlayer/search_repo.pysrc/brainlayer/vector_store.py
🔇 Additional comments (3)
hooks/brainlayer-prompt-search.py (1)
46-46: LGTM!Also applies to: 776-776
tests/test_adaptive_injection.py (1)
158-196: LGTM!tests/test_writer_pidfile_mutex.py (1)
29-521: Please run the fullpytestsuite before merging this mutex change.This adds good focused coverage, but the repo rule for risky DB/concurrency work is stricter than the targeted tests called out in the PR description. I would not treat the behavior change as safely validated until the full 929-test suite passes.
As per coding guidelines
**/*.py: "Run pytest before claiming behavior changed safely; current test suite has 929 tests".
| try: | ||
| # Close thread-local read connection if it exists | ||
| if hasattr(self, "_local"): | ||
| read_conn = getattr(self._local, "read_conn", None) | ||
| if read_conn is not None: | ||
| read_conn.close() | ||
| self._local.read_conn = None | ||
| if hasattr(self, "conn"): | ||
| self.conn.close() | ||
| finally: | ||
| self._release_writer_pidfile() |
There was a problem hiding this comment.
Do not release the writer pidfile when writer shutdown fails.
Lines 2112-2122 drop the pidfile in finally, so a failing read_conn.close() or self.conn.close() can release the single-writer guard while the writable connection is still alive. That breaks the mutex guarantee this PR is adding. Keep the pidfile until the writer connection has actually closed, and update the companion close-failure test to expect retention instead of removal.
Suggested fix
def close(self) -> None:
"""Close database connections."""
- try:
- # Close thread-local read connection if it exists
- if hasattr(self, "_local"):
- read_conn = getattr(self._local, "read_conn", None)
- if read_conn is not None:
- read_conn.close()
- self._local.read_conn = None
- if hasattr(self, "conn"):
- self.conn.close()
- finally:
- self._release_writer_pidfile()
+ if hasattr(self, "_local"):
+ read_conn = getattr(self._local, "read_conn", None)
+ if read_conn is not None:
+ read_conn.close()
+ self._local.read_conn = None
+ if hasattr(self, "conn"):
+ self.conn.close()
+ self._release_writer_pidfile()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/brainlayer/vector_store.py` around lines 2112 - 2122, The current finally
block always calls self._release_writer_pidfile(), which can drop the writer
pidfile even if read_conn.close() or self.conn.close() fails; instead, remove
the unconditional finally call and only invoke self._release_writer_pidfile()
after both connection closes complete successfully (i.e., call it after
read_conn.close() and self.conn.close() without wrapping those in a finally that
always releases the pidfile); if any close raises, do not release the pidfile
and re-raise the exception so the single-writer guard remains held; update the
failing close test to expect the pidfile to be retained rather than removed.
There was a problem hiding this comment.
I investigated this and am not applying the suggestion. PR-alpha explicitly requires pidfile unlink on close/atexit, and this PR already fixed the stale-lock failure mode where close exceptions skipped _release_writer_pidfile(). If APSW close raises while the process is unwinding, retaining the pidfile can permanently block future writers. _release_writer_pidfile() is idempotent and only unlinks when the pidfile still names this PID. Regression coverage: test_close_releases_pidfile_when_connection_close_raises.
There was a problem hiding this comment.
@EtanHey, understood — thank you for the thorough explanation. The PID-checked idempotency of _release_writer_pidfile() and the risk of a permanently stranded pidfile from a dying process tip the balance in favour of the finally block here. I was treating this like a generic lock where retention-on-failure is always safer, but the process-scoped, PID-validated pidfile design means a failed-close process that hangs onto its pidfile is strictly worse than releasing it. The existing test test_close_releases_pidfile_when_connection_close_raises confirms the behaviour is intentional and covered.
I'll withdraw the suggestion.
✏️ Learnings added
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 309
File: src/brainlayer/vector_store.py:2114-2124
Timestamp: 2026-05-21T20:41:49.909Z
Learning: In `src/brainlayer/vector_store.py`, `VectorStore.close()` intentionally releases the writer pidfile in a `finally` block (always, even if `read_conn.close()` or `self.conn.close()` raises). This is by design: the pidfile is PID-checked and idempotent, so releasing it during process unwind is safe, and retaining it after a failed close would permanently block future writers. Do not suggest moving `_release_writer_pidfile()` outside the `finally` block. Covered by `test_close_releases_pidfile_when_connection_close_raises`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8afb4387a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| timeout_ms = int(os.environ.get("BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS", "30000")) | ||
| except (TypeError, ValueError): | ||
| return 30000 | ||
| return timeout_ms if timeout_ms > 0 else 30000 |
There was a problem hiding this comment.
Clamp busy-timeout override to 32-bit APSW range
_drain_busy_timeout_ms() accepts any positive integer, but apsw.Connection.setbusytimeout() expects a C int; values above 2_147_483_647 raise OverflowError and abort drain startup. A misconfigured BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS (for example 3000000000) therefore turns a tuning knob into a hard failure path instead of falling back safely.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in c668da8. Added a RED regression for BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS=3000000000 causing APSW OverflowError, then clamped invalid, non-positive, and >C-int values back to the 30000ms default. Verified with the focused drain timeout tests, the affected 68-test PR-alpha slice, and the full pre-push gate.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Review update:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c668da8c0e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| query_embedding = model.embed_query(prompt) | ||
|
|
||
| store = VectorStore(Path(db_path)) | ||
| store = VectorStore(Path(db_path), readonly=True) |
There was a problem hiding this comment.
Preserve writable fallback when opening prompt search store
Opening VectorStore in read-only mode here causes prompt search to hard-fail when the DB file or its parent directory does not exist yet (fresh install, cleaned state, or custom path). In that scenario VectorStore(..., readonly=True) raises immediately, and the fallback path also opens SQLite read-only, so the hook returns an error instead of gracefully returning no memories. Before this change, the writable open path created the DB directory/file and allowed the hook to continue.
Useful? React with 👍 / 👎.
|
CI/review gates are green on head c668da8. Attempted required merge command: GitHub blocked it with base branch policy and suggested @EtanHey please admin-merge PR #309 with rebase merge (no squash). |
PR #309 added a pidfile mutex on VectorStore RW init to gate multiple enrich processes. Unintended consequence: the FastAPI daemon's lifespan also opens VectorStore RW at startup (daemon.py:83) — if the enrich supervisor is alive (which it normally is), the daemon fails to start entirely with WriterInUseError. That breaks: - brainlayer search CLI (DaemonClient hangs trying to start daemon) - brainlayer stats CLI (same) - any dashboard frontend on localhost:3000 etc. Empirically observed: enrich supervisor PID 14909 holds the pidfile, brainlayer-daemon refuses to bootstrap, /tmp/brainlayer.sock never appears, all CLI operations time out at the DAEMON_STARTUP_TIMEOUT. Fix: catch WriterInUseError on the lifespan VectorStore() call and fall back to readonly mode. In readonly mode: - Search endpoints work (the common case) - Stats/context/dashboard endpoints work - Write endpoints (/digest, /store, /entity PATCH, /backlog/items POST/PATCH/DELETE) will return 500 on the SQL layer Operator can restart the daemon (with no live writer) to regain RW mode for write endpoints. Or eventually move write endpoints to the arbitrated queue path (post-Phase-4 work). Empirically verified by smoke-test: - Direct lifespan invocation against current production state (enrich supervisor PID 14909 holds pidfile) - Pre-fix: WriterInUseError raised, lifespan fails to enter, daemon exits - Post-fix: warning logged, lifespan enters readonly, "LIFESPAN OK" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ience Etan-mandate 2026-05-22: "IMPROVE BRAINBAR AND MAKE IT SHOW GRAPH AND INJESTIONS ALREADY, WITHOUT DEGRATION!" + addendum: "AUTO MERGE AFTER /pr-loop, not sloppy." PR #314 wired the runtime database + InjectionStore so the "Warming memory…" placeholder cleared and BrainBarGraphTab + BrainBarInjectionTab get non-nil dependencies. But two failure modes remained: 1. When KG queries failed (typically transient ReadOnly / busy / locked from writer-pidfile contention with the Python enrich-supervisor + drain — PR #309), `KGViewModel.loadGraph` silently set nodes=[]/ edges=[] and returned false. The UI had no way to distinguish "empty graph" from "queries failing," so the canvas blanked. 2. Same pattern in `InjectionStore.refresh`: failures only `NSLog`'d and the panel held stale or empty events with no user-visible signal. This change adds a small `DegradationState` enum (`.healthy` / `.degraded(reason:)`) published on both view-models. `loadGraph` retries once on a 200ms backoff before reporting degraded, then keeps last-known- good nodes/edges so the user sees prior data rather than a blank canvas. `InjectionStore.refresh` flips state on failure and clears it on the next successful poll cycle. A small `DegradationBadge` (orange capsule with an exclamation-triangle icon) overlays the top-right of each tab when its state is degraded, with the underlying error in the SwiftUI .help tooltip. Per Etan's mandate: degraded ≠ hidden. Tests cover the enum surface, KG load-failure → degraded transition, KG success → healthy transition, and InjectionStore initial-healthy state. 387/387 tests pass, swift build clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
VectorStoreDB init; readonly opens do not touch the writer lock.Writer Mutex Contract
/tmp/brainlayer-writer-${SHA256_RESOLVED_DB_PATH_16}-${RESOLVED_DB_BASENAME}.pid.BRAINLAYER_WRITER_PIDFILE_DIR=/path/to/dir; relative override values are normalized under/tmp.O_CREAT | O_EXCL, takes an exclusiveflockbefore writing, and records PID plus process start-time metadata.WriterInUseError: another writer is using $DB_PATH (pid $PID).close()/atexit.Retry / Timeout / Throttle Math
_INIT_MAX_RETRIES:5 -> 10, overrideBRAINLAYER_INIT_MAX_RETRIES, clamped to at least one attempt.0.5 + 1 + 2 + 4 + 8 + 16 + 30 + 30 + 30 + 30 = 151.5ssleep budget.200ms -> 30000ms, overrideBRAINLAYER_DRAIN_BUSY_TIMEOUT_MS; invalid, non-positive, or APSW-overflowing env falls back to 30000.ThrottleInterval:30 -> 60inscripts/launchd/com.brainlayer.enrichment.plist; installed user plist was also updated locally to 60.RED -> GREEN
pytest tests/test_writer_pidfile_mutex.pycollected 7 required tests and all 7 failed before implementation.test_enrichment_plist_throttle_interval_at_least_60sfailed againstThrottleInterval=30before restoring the template change.pytest tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py tests/test_hybrid_search.py tests/test_db_lock_resilience.py-> 68 passed.Enrich-Rate Bench
Baseline supplied by PR-alpha prompt:
2026-05-21T16:22:12Z: 213 enrichments / 77 min = ~166/hr.Fresh >=30 min live sample command:
Fresh sample observed during PR-alpha implementation:
2026-05-21T18:03:20.907164+00:00to2026-05-21T18:33:20.907301+00:00.Test Plan
pytest tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py tests/test_hybrid_search.py tests/test_db_lock_resilience.py-> 68 passed.ruff check src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py src/brainlayer/search_repo.py tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py-> passed.ruff format --check src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py src/brainlayer/search_repo.py tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py-> 5 files already formatted.pytest tests/test_eval_baselines.py -m live-> 34 passed, 5 xfailed, 2 xpassed.cr review --plainpassed earlier on staged diff; a later local CLI attempt hung and was stopped, with hosted CodeRabbit running on the pushed commits.Notes
pytestunder system Python 3.13 was blocked earlier by optional dependency drift (deepchecksmissing,numbaincompatible with NumPy 2.4). The repo pre-push hook uses its Python 3.12 env and passed the configured gate.Note
Add single-writer pidfile mutex and recency-intent search fallback to
VectorStoreandhybrid_searchVectorStoreinstances now acquire an exclusive pidfile lock on init; a second writer raisesWriterInUseError. Same-process reuse is ref-counted and the lock is released oncloseeven if connection teardown fails.hybrid_searchdetects recency intent in query text and supplements FTS results with the most-recent matching chunks via a SQL fallback, retrying up to 3 times on SQLite busy errors with exponential backoff. Chunks ≤7 days old receive an extra 2.0× score boost when recency intent is detected.BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS(default 30 s, was 200 ms).technologyandtoolare skipped during entity span matching in the prompt-search hook.ThrottleIntervalis raised from 30 s to 60 s.VectorStoreinstances concurrently will now fail withWriterInUseErrorunless re-entrant acquisition is detected correctly.Macroscope summarized c668da8.
Note
Medium Risk
Introduces a new cross-process single-writer gate for writable
VectorStoreopens and changes search fallback/ranking behavior; misconfiguration or unexpected multi-writer usage could now raiseWriterInUseErroror alter result ordering.Overview
Adds a single-writer mutex for writable DB access. Writable
VectorStoreinitialization now acquires a pidfile-based lock (with PID/start-time validation, symlink-safe path hashing, ref-counting, and guaranteed release onclose()/atexit) and raisesWriterInUseErrorwhen another live writer owns the database.Hardens read paths and retrieval behavior under contention. Hybrid prompt search and multiple integration tests now open
VectorStore(..., readonly=True);VectorStoreinit retry budget/backoff is increased and made configurable;drain’s APSW busy timeout is increased to 30s (env-configurable); the enrichment LaunchAgent throttle is raised to 60s.Improves search quality edge cases.
hybrid_searchgains a “recency intent” fallback that supplements lexical candidates with the most recent 7 days (respecting existing filters and retrying on busy), plus an extra boost for ≤7-day results when recency words are present; prompt entity injection skips ambiguous single-token matches fortechnology/toolentity types.Reviewed by Cursor Bugbot for commit c668da8. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Release Notes
New Features
Improvements