Skip to content

fix(writer): single-writer pidfile mutex + retry/timeout/throttle hardening#309

Merged
EtanHey merged 14 commits into
mainfrom
fix/single-writer-arbitration-option-a
May 21, 2026
Merged

fix(writer): single-writer pidfile mutex + retry/timeout/throttle hardening#309
EtanHey merged 14 commits into
mainfrom
fix/single-writer-arbitration-option-a

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented May 21, 2026

Summary

  • Add a pidfile-guarded single-writer mutex before writable VectorStore DB init; readonly opens do not touch the writer lock.
  • Harden lock ownership for basename collisions, symlink aliases, forked/inherited refs, close/open races, PID reuse, relative pidfile dirs, and empty pidfile write races.
  • Increase init retry budget, raise drain busy timeout, and raise enrichment LaunchAgent throttle.
  • Keep prompt/live retrieval read-only under the new mutex and harden recency fallback filter parity/retry behavior.

Writer Mutex Contract

  • Default pidfile path: /tmp/brainlayer-writer-${SHA256_RESOLVED_DB_PATH_16}-${RESOLVED_DB_BASENAME}.pid.
  • Override: BRAINLAYER_WRITER_PIDFILE_DIR=/path/to/dir; relative override values are normalized under /tmp.
  • RW init uses O_CREAT | O_EXCL, takes an exclusive flock before writing, and records PID plus process start-time metadata.
  • Existing live owner raises WriterInUseError: another writer is using $DB_PATH (pid $PID).
  • Dead, invalid, or PID-reused owners are reclaimed after verifying the opened pidfile is still the path target.
  • Same-process writer opens are ref-counted; release keeps ownership until unlink completes and always runs from close()/atexit.

Retry / Timeout / Throttle Math

  • _INIT_MAX_RETRIES: 5 -> 10, override BRAINLAYER_INIT_MAX_RETRIES, clamped to at least one attempt.
  • Backoff sleeps: 0.5 + 1 + 2 + 4 + 8 + 16 + 30 + 30 + 30 + 30 = 151.5s sleep budget.
  • With APSW busy timeout at 30s per attempt, worst-case wall time remains under the 600s cap.
  • Drain busy timeout: 200ms -> 30000ms, override BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS; invalid, non-positive, or APSW-overflowing env falls back to 30000.
  • Enrichment LaunchAgent ThrottleInterval: 30 -> 60 in scripts/launchd/com.brainlayer.enrichment.plist; installed user plist was also updated locally to 60.

RED -> GREEN

  • Initial required RED: pytest tests/test_writer_pidfile_mutex.py collected 7 required tests and all 7 failed before implementation.
  • Initial plist RED: test_enrichment_plist_throttle_interval_at_least_60s failed against ThrottleInterval=30 before restoring the template change.
  • Review-fix REDs were added and watched fail before fixes for basename collisions, symlink aliases, stale unlink races, same-process atexit reuse, release ownership race, close failure release, inherited refs, locked-before-write creation, PID-reuse owner metadata, relative pidfile dir normalization, hook readonly opens, recency filter parity/date handling/busy retry, retry budget clamp, and drain timeout env validation.
  • Latest focused GREEN: pytest tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py tests/test_hybrid_search.py tests/test_db_lock_resilience.py -> 68 passed.

Enrich-Rate Bench

Baseline supplied by PR-alpha prompt:

  • Pre-PR-alpha post-T0, 2026-05-21T16:22:12Z: 213 enrichments / 77 min = ~166/hr.

Fresh >=30 min live sample command:

PYTHONPATH=src python3 - <<'PY'
from datetime import datetime, timezone, timedelta
from brainlayer.paths import get_db_path
from brainlayer.vector_store import VectorStore
start = datetime.now(timezone.utc) - timedelta(minutes=30)
end = datetime.now(timezone.utc)
store = VectorStore(get_db_path(), readonly=True)
count = 0
try:
    for _, enriched_at in store.conn.cursor().execute(
        "SELECT id, enriched_at FROM chunks WHERE enriched_at IS NOT NULL AND enriched_at NOT LIKE 'skipped:%'"
    ):
        if not enriched_at:
            continue
        try:
            dt = datetime.fromisoformat(str(enriched_at).replace('Z', '+00:00'))
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        if start <= dt <= end:
            count += 1
finally:
    store.close()
minutes = (end - start).total_seconds() / 60
print(start.isoformat(), end.isoformat(), count, f"{count / minutes * 60:.1f}/hr")
PY

Fresh sample observed during PR-alpha implementation:

  • Window: 2026-05-21T18:03:20.907164+00:00 to 2026-05-21T18:33:20.907301+00:00.
  • Count: 207 enrichments / 30.0 min = 414.0/hr.

Test Plan

  • pytest tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py tests/test_hybrid_search.py tests/test_db_lock_resilience.py -> 68 passed.
  • ruff check src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py src/brainlayer/search_repo.py tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py -> passed.
  • ruff format --check src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py src/brainlayer/search_repo.py tests/test_writer_pidfile_mutex.py tests/test_adaptive_injection.py -> 5 files already formatted.
  • pytest tests/test_eval_baselines.py -m live -> 34 passed, 5 xfailed, 2 xpassed.
  • Latest pre-push gate -> 2090 passed, 9 skipped, 75 deselected, 1 xfailed, 102 warnings; MCP registration 3 passed; isolated eval/hook routing 32 passed; Bun 1 passed; FTS determinism passed.
  • cr review --plain passed earlier on staged diff; a later local CLI attempt hung and was stopped, with hosted CodeRabbit running on the pushed commits.

Notes

  • Plain local pytest under system Python 3.13 was blocked earlier by optional dependency drift (deepchecks missing, numba incompatible with NumPy 2.4). The repo pre-push hook uses its Python 3.12 env and passed the configured gate.

Note

Add single-writer pidfile mutex and recency-intent search fallback to VectorStore and hybrid_search

  • Read-write VectorStore instances now acquire an exclusive pidfile lock on init; a second writer raises WriterInUseError. Same-process reuse is ref-counted and the lock is released on close even if connection teardown fails.
  • hybrid_search detects recency intent in query text and supplements FTS results with the most-recent matching chunks via a SQL fallback, retrying up to 3 times on SQLite busy errors with exponential backoff. Chunks ≤7 days old receive an extra 2.0× score boost when recency intent is detected.
  • The drain connection busy timeout is now configurable via BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS (default 30 s, was 200 ms).
  • Single-token entity matches for types technology and tool are skipped during entity span matching in the prompt-search hook.
  • The launchd enrichment job ThrottleInterval is raised from 30 s to 60 s.
  • Risk: any process that previously opened multiple read-write VectorStore instances concurrently will now fail with WriterInUseError unless re-entrant acquisition is detected correctly.

Macroscope summarized c668da8.


Note

Medium Risk
Introduces a new cross-process single-writer gate for writable VectorStore opens and changes search fallback/ranking behavior; misconfiguration or unexpected multi-writer usage could now raise WriterInUseError or alter result ordering.

Overview
Adds a single-writer mutex for writable DB access. Writable VectorStore initialization now acquires a pidfile-based lock (with PID/start-time validation, symlink-safe path hashing, ref-counting, and guaranteed release on close()/atexit) and raises WriterInUseError when another live writer owns the database.

Hardens read paths and retrieval behavior under contention. Hybrid prompt search and multiple integration tests now open VectorStore(..., readonly=True); VectorStore init retry budget/backoff is increased and made configurable; drain’s APSW busy timeout is increased to 30s (env-configurable); the enrichment LaunchAgent throttle is raised to 60s.

Improves search quality edge cases. hybrid_search gains a “recency intent” fallback that supplements lexical candidates with the most recent 7 days (respecting existing filters and retrying on busy), plus an extra boost for ≤7-day results when recency words are present; prompt entity injection skips ambiguous single-token matches for technology/tool entity types.

Reviewed by Cursor Bugbot for commit c668da8. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

Release Notes

  • New Features

    • Automatic recency prioritization: queries containing "current," "latest," "recent," "today," or "this week" now automatically retrieve results from the last 7 days with enhanced ranking when no explicit date range is specified.
  • Improvements

    • Enhanced disambiguation for single-token entity type matching in search
    • Improved database concurrency protection and configurable connection timeout behavior

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

This PR adds pidfile-based exclusive writer coordination to VectorStore, detects recency intent to fetch and boost 7-day recent chunks, makes SQLite busy-timeout and init-retry budget environment-configurable, skips ambiguous single-token KG entity matches, and updates tests/fixtures to open VectorStore in readonly mode.

Changes

VectorStore Writer Coordination and Search Enhancements

Layer / File(s) Summary
Entity span filtering & hook readonly open
hooks/brainlayer-prompt-search.py
Adds AMBIGUOUS_SINGLE_TOKEN_ENTITY_TYPES and skips single-token KG matches for those types; constructs VectorStore(..., readonly=True) in hybrid search hook.
Recency-intent detection & recent-chunk fallback
src/brainlayer/search_repo.py, tests/test_hybrid_search.py
Adds regex helpers and _has_recency_intent(), retrieves last-7-days chunks when recency intent is detected without date_from, merges them into pre-RRF ranks/keyword data, and applies extra 2.0x boost to results with age_days <= 7; includes new tests and helper adjustments.
VectorStore writer pidfile mutex and init/retry behavior
src/brainlayer/vector_store.py, tests/test_writer_pidfile_mutex.py
Introduces WriterInUseError, pidfile acquisition/release with fcntl.flock, deterministic pidfile naming, PID liveness/start-time checks, cross-process ref-counting, readonly init skip, and env-driven init retry/backoff tuning. Comprehensive pidfile tests added.
Drain busy-timeout, retry caps, and launchd throttle
src/brainlayer/drain.py, scripts/launchd/com.brainlayer.enrichment.plist, tests/test_writer_pidfile_mutex.py
Adds _drain_busy_timeout_ms() reading BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS (fallback 30000 ms) and applies it via conn.setbusytimeout(). Makes init retry budget environment-configurable with capped exponential backoff. Updates launchd ThrottleInterval to 60s and adds tests asserting behavior.
Test fixtures & readonly openings
tests/test_engine.py, tests/test_eval_baselines.py, tests/test_think_recall_integration.py, tests/test_vector_store.py, tests/test_adaptive_injection.py
Updates multiple fixtures and adds a unit test to assert prompt_search.run_hybrid_search opens VectorStore with readonly=True, converting integration fixtures to non-mutating readonly DB opens.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I nibble pidfiles neat and small,

I hop through recent results and call,
I slow the drain and tune the sleep,
Quiet tests now safely peep—
A rabbit hums: locks kept, searches leap.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly summarizes the main changes: implementation of a single-writer pidfile mutex, plus improvements to retry/timeout/throttle behavior. It aligns with the primary objectives in this comprehensive PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/single-writer-arbitration-option-a

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@greptileai review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@cursor @BugBot review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@codex review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@coderabbitai review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17e33ea01b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1521 to +1523
FROM chunks
WHERE created_at >= datetime('now', '-7 days') {" ".join(recent_extra)}
ORDER BY created_at DESC
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve entity scoping in recency-intent fallback

This fallback query pulls recent rows directly from chunks without applying entity_id, even though hybrid_search documents entity_id as a hard per-entity scope. Because these rows are injected into fts_ranks/keyword_data and the later FTS-only post-filter does not re-check entity membership, queries containing recency terms (e.g. “recent”, “latest”) can return chunks from other entities.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in acbf191. The recency-intent fallback now preserves entity_id and related active filters before injecting recent rows into ranking data; covered by a regression test.

Comment on lines +177 to +179
try:
pidfile.unlink()
except FileNotFoundError:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make stale pidfile cleanup atomic before unlink

The stale-lock path unlinks based on a prior PID read/liveness check, but that check and unlink() are not atomic. If two writers race while recovering a stale pidfile, one process can create a fresh pidfile after the other has read the old PID, and the second process can then unlink that fresh lock and proceed, allowing concurrent writers to initialize the same DB.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in acbf191 and further hardened in 4378c17. Stale cleanup now opens and locks the pidfile, verifies the opened fd still matches the path with samestat before unlink, and ignores FileNotFoundError races.

Comment thread src/brainlayer/vector_store.py Outdated
Comment thread src/brainlayer/drain.py
Comment thread src/brainlayer/search_repo.py
Comment thread src/brainlayer/search_repo.py Outdated
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

PR-alpha CI is green and PR-beta #306 remains blocked on this landing first. Etan/admin: please merge with rebase only, no squash, when acceptable so beta can rebase onto the pidfile mutex on origin/main.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

raise last_err # type: ignore[misc]

P2 Badge Guard zero retry budget before re-raising init error

_INIT_MAX_RETRIES is now env-configurable, but when BRAINLAYER_INIT_MAX_RETRIES is set to 0 (or any negative value), the retry loop never executes and last_err stays None; this line then raises TypeError: exceptions must derive from BaseException instead of a meaningful DB-init failure. That turns a recoverable configuration mistake into a hard startup crash with misleading diagnostics.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/search_repo.py Outdated
_ingest_keyword_rows(fts_results, fts_ranks)
_ingest_keyword_rows(trigram_fts_results, trigram_ranks)

recency_intent = any(term in query_text.lower() for term in _RECENCY_QUERY_TERMS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match recency intent on whole terms only

recency_intent is computed with substring checks (term in query_text.lower()), so the term "current" also matches unrelated words like "concurrent"/"recurrent". In those cases hybrid search incorrectly enters the recency fallback path and applies the extra ≤7-day score multiplier, which can reorder results toward recent chunks even when the user did not ask for recency.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1ece80a. This shares the same whole-term recency-intent helper and regression coverage as the Cursor finding.

Comment thread src/brainlayer/vector_store.py Outdated
Comment thread src/brainlayer/vector_store.py
Comment thread src/brainlayer/vector_store.py Outdated
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ece80ab63

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/vector_store.py Outdated

# Retry settings for DB init under contention (multiple MCP instances + enrichment)
_INIT_MAX_RETRIES = 5
_INIT_MAX_RETRIES = _int_env("BRAINLAYER_INIT_MAX_RETRIES", 10)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp init retry env to at least one attempt

BRAINLAYER_INIT_MAX_RETRIES is read directly into _INIT_MAX_RETRIES without bounds checking, so setting it to 0 or a negative value makes _init_db_with_retry() run zero iterations and then execute raise last_err while last_err is still None, which raises TypeError and prevents writer startup entirely. This turns a retry-tuning env var into a hard boot failure under misconfiguration.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c661a16. The retry budget is clamped to at least one attempt at env-read time and inside _init_db_with_retry, with a regression test proving a zero runtime budget re-raises the original BusyError.

Comment thread src/brainlayer/search_repo.py Outdated
summary, tags, importance, intent,
created_at, source, sender, language, decay_score
FROM chunks
WHERE created_at >= datetime('now', '-7 days') {" ".join(recent_extra)}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compare timestamps as datetimes in recency fallback

The recency fallback filters with created_at >= datetime('now', '-7 days'), but created_at values are ISO strings like YYYY-MM-DDTHH:MM:SSZ while datetime() returns YYYY-MM-DD HH:MM:SS; SQLite then performs a lexical TEXT comparison, so same-day boundary rows (e.g. ...T00:00:00Z) are treated as newer than the cutoff even when they are actually older than 7 days. This widens the fallback window and can pull stale chunks into “latest/recent” results.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4378c17. The recency fallback now compares datetime(created_at) against datetime(now, -7 days), with a boundary regression test.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4378c1718c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/vector_store.py Outdated
Comment on lines +241 to +245
with self._PIDFILE_REFS_LOCK:
refs = self._PIDFILE_REFS.get(pidfile, 0)
if refs <= 1:
self._PIDFILE_REFS.pop(pidfile, None)
should_unlink = True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep pidfile ownership until release unlink finishes

Clearing _PIDFILE_REFS before the unlink step opens a race where one thread can call close() while another thread in the same process opens a new writer for the same DB: the new open goes through _handle_existing_writer_pidfile() (same PID) and marks itself acquired, then the closing thread still unlinks the pidfile, leaving an active writer with no lockfile. At that point a third process can create a fresh pidfile and start writing concurrently, violating the single-writer contract under concurrent open/close in one process.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e00434a. Added a RED regression for the same-process close/open interleaving, then kept pidfile ownership under _PIDFILE_REFS_LOCK until the release path finishes the owned-file unlink and ref removal. Focused test and pre-push gate are green.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e00434a and hardened further in d5e694c. Added regression coverage that close/open same-process races keep pidfile ownership through unlink and that mismatched same-process refs are preserved instead of cleared.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/brainlayer/vector_store.py (1)

2027-2038: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always release the writer pidfile even if DB close fails.

If self.conn.close() raises, _release_writer_pidfile() is skipped and the process can leave a stale writer lock behind.

Suggested fix
 def close(self) -> None:
     """Close database connections."""
     # Close thread-local read connection if it exists
     if hasattr(self, "_local"):
         read_conn = getattr(self._local, "read_conn", None)
         if read_conn is not None:
             read_conn.close()
             self._local.read_conn = None
-    if hasattr(self, "conn"):
-        self.conn.close()
-    self._release_writer_pidfile()
+    try:
+        if hasattr(self, "conn"):
+            self.conn.close()
+    finally:
+        self._release_writer_pidfile()

As per coding guidelines **/*.py: "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/brainlayer/vector_store.py` around lines 2027 - 2038, The close method
can raise while closing DB connections and skip calling _release_writer_pidfile,
leaving a stale writer lock; update close(self) to ensure
_release_writer_pidfile() is always executed by wrapping connection-close logic
in a try/finally (or similar) so that even if getattr(self, "conn").close() or
closing self._local.read_conn raises, _release_writer_pidfile() runs in the
finally block; handle/propagate exceptions as appropriate but do not allow them
to prevent _release_writer_pidfile() from executing (reference: close,
_release_writer_pidfile, conn, and _local.read_conn).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/brainlayer/search_repo.py`:
- Around line 1527-1541: The new recency query that builds recent_rows (the
cursor.execute block selecting from chunks and using candidate_fetch_count) must
be wrapped in SQLITE_BUSY retry/backoff logic: catch
sqlite3.OperationalError/SQLITE_BUSY around the cursor.execute call, retry with
exponential backoff (short sleeps between attempts) for a limited number of
attempts, and only raise after retries are exhausted; ensure this uses the
worker's own DB connection/cursor used elsewhere in this module (i.e., the same
cursor variable) to respect the guideline that each worker uses its own
connection. Replace the direct list(cursor.execute(...)) call with a retry loop
that logs or sleeps on SQLITE_BUSY and then executes the same query and
parameters (including recent_extra and recent_params) on success.
- Around line 1473-1539: The recency fallback rebuilds filters but forgot to
apply content_type_filter and date_to, so add checks for content_type_filter
(append "AND content_type = ?" and push content_type_filter into recent_params)
and for date_to (append a created_at upper-bound clause like "AND
datetime(created_at) <= datetime(?)" and push date_to into recent_params) before
executing the cursor query in the method that builds recent_extra (the block
referencing recent_extra, recent_params and the cursor.execute SELECT). Ensure
the new clauses are appended in the same manner/order as the other filters so
the final parameter list [*recent_params, min(candidate_fetch_count, 25)]
remains aligned.

In `@src/brainlayer/vector_store.py`:
- Around line 177-203: The race occurs between _handle_existing_writer_pidfile
and _release_writer_pidfile where refs and pidfile unlink can interleave, so
modify both functions to serialize ref checks/updates and unlink decisions using
the existing _PIDFILE_REFS_LOCK: in _handle_existing_writer_pidfile (after
reading other_pid and before incrementing self._PIDFILE_REFS for pidfile)
acquire _PIDFILE_REFS_LOCK to re-check the current ref count and the pidfile
existence and then increment and set
_writer_pidfile_acquired/_writer_pidfile_path_value and register atexit
atomically; likewise, when deciding to unlink a stale pidfile (the
os.path.samestat branch), acquire _PIDFILE_REFS_LOCK first and re-check that no
new refs were added and the on-disk pid still matches before calling
pidfile.unlink(); ensure _release_writer_pidfile also uses _PIDFILE_REFS_LOCK to
decrement refs and only unlink if the ref count reaches zero and the pidfile
still refers to our pid to prevent removing a pidfile re-created by the same
process.

In `@tests/test_hybrid_search.py`:
- Around line 434-521: The test suite adds recency fallback checks for
entity/sentiment and boundary behavior but omits parity tests for
content_type_filter and date_to; add two new tests mirroring the existing
patterns (e.g., follow test_recency_intent_fallback_preserves_entity_filter and
test_recency_intent_fallback_compares_created_at_as_datetime) that: (1) insert
recent chunks with different content_type values, call store.hybrid_search with
content_type_filter set and assert the non-matching content_type chunk is
excluded, and (2) insert chunks around a date_to boundary, monkeypatch
store.search same as
test_recency_intent_fallback_compares_created_at_as_datetime, call hybrid_search
with date_to set and assert chunks after date_to are excluded; reference test
function names and store.hybrid_search/search/_insert_chunk to locate where to
add them.

In `@tests/test_writer_pidfile_mutex.py`:
- Around line 235-242: The test test_init_retries_10_with_extended_backoff is
fragile because VectorStore._INIT_MAX_RETRIES can be preseeded by the
environment; make it deterministic by explicitly setting
VectorStore._INIT_MAX_RETRIES to 10 at the start of the test (or use monkeypatch
to set that attribute) before computing delays and asserting values, then
restore the original value at the end or use a fixture to isolate the change so
external BRAINLAYER_INIT_MAX_RETRIES/env settings cannot affect the assertion.

---

Outside diff comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2027-2038: The close method can raise while closing DB connections
and skip calling _release_writer_pidfile, leaving a stale writer lock; update
close(self) to ensure _release_writer_pidfile() is always executed by wrapping
connection-close logic in a try/finally (or similar) so that even if
getattr(self, "conn").close() or closing self._local.read_conn raises,
_release_writer_pidfile() runs in the finally block; handle/propagate exceptions
as appropriate but do not allow them to prevent _release_writer_pidfile() from
executing (reference: close, _release_writer_pidfile, conn, and
_local.read_conn).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef9f0b9f-6f95-469d-90a4-8f125fca111d

📥 Commits

Reviewing files that changed from the base of the PR and between be11d44 and aeef3d2.

📒 Files selected for processing (11)
  • hooks/brainlayer-prompt-search.py
  • scripts/launchd/com.brainlayer.enrichment.plist
  • src/brainlayer/drain.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/vector_store.py
  • tests/test_engine.py
  • tests/test_eval_baselines.py
  • tests/test_hybrid_search.py
  • tests/test_think_recall_integration.py
  • tests/test_vector_store.py
  • tests/test_writer_pidfile_mutex.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • tests/test_think_recall_integration.py
  • tests/test_vector_store.py
  • src/brainlayer/drain.py
  • hooks/brainlayer-prompt-search.py
  • tests/test_engine.py
  • tests/test_hybrid_search.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/search_repo.py
  • tests/test_writer_pidfile_mutex.py
  • tests/test_eval_baselines.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/drain.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/search_repo.py
🪛 OpenGrep (1.21.0)
src/brainlayer/search_repo.py

[ERROR] 1528-1540: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)

🔇 Additional comments (8)
hooks/brainlayer-prompt-search.py (1)

44-46: LGTM!

Also applies to: 544-545

src/brainlayer/search_repo.py (1)

67-69: LGTM!

Also applies to: 84-86, 1680-1681

src/brainlayer/drain.py (1)

37-41: LGTM!

Also applies to: 71-74

scripts/launchd/com.brainlayer.enrichment.plist (1)

55-56: LGTM!

tests/test_engine.py (1)

348-350: LGTM!

Also applies to: 393-395

tests/test_eval_baselines.py (1)

68-70: LGTM!

Also applies to: 689-691

tests/test_think_recall_integration.py (1)

21-23: LGTM!

tests/test_vector_store.py (1)

14-16: LGTM!

Comment thread src/brainlayer/search_repo.py Outdated
Comment thread src/brainlayer/search_repo.py Outdated
Comment thread src/brainlayer/vector_store.py
Comment thread tests/test_hybrid_search.py
Comment thread tests/test_writer_pidfile_mutex.py Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aeef3d2f7e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1474 to +1475
if recency_intent and not date_from:
recent_extra = []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Respect date_to in recency-intent fallback

Gate the recency fallback on both date bounds or apply date_to inside the fallback query. Right now hybrid_search enters the fallback whenever recency_intent is true and date_from is unset, so requests that set only date_to (for example, “latest notes before 2024-01-01”) still inject last-7-days chunks from chunks; those injected rows are treated as FTS-only candidates and the later post-filter does not enforce date_to, so results can violate the caller’s explicit upper date bound.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d858e34. The recency fallback applies datetime(created_at) <= datetime(?) for date_to, with a regression test.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@codex review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@cursor @BugBot re-review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Comment thread src/brainlayer/vector_store.py
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@coderabbitai review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@cursor @BugBot re-review

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@codex review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36e9cc2168

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +143 to +145
if not pidfile_dir.is_absolute():
pidfile_dir = Path("/tmp") / pidfile_dir
pidfile_dir = pidfile_dir.resolve()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep relative pidfile overrides under /tmp

Relative BRAINLAYER_WRITER_PIDFILE_DIR values can escape /tmp because the code appends the raw relative path and then calls resolve(). For example, ../locks becomes /locks, so writer startup may fail with permissions (or write lockfiles outside the intended sandbox) instead of honoring the documented “relative paths live under /tmp” contract.

Useful? React with 👍 / 👎.

Comment on lines +155 to +167
with self._PIDFILE_REFS_LOCK:
if self._PIDFILE_REFS.get(pidfile, 0) > 0:
ref_pid = self._PIDFILE_REF_PIDS.get(pidfile)
if ref_pid is not None and ref_pid != pid:
self._PIDFILE_REFS.pop(pidfile, None)
self._PIDFILE_REF_PIDS.pop(pidfile, None)
else:
owner_pid, owner_start_time = self._read_writer_pidfile_owner(pidfile)
if owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time):
raise WriterInUseError(
f"pidfile ref mismatch for {self.db_path}; refusing to clear active refs"
)
self._PIDFILE_REFS[pidfile] += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low brainlayer/vector_store.py:155

When _read_writer_pidfile_owner() returns (None, None) due to a corrupt or empty pidfile, the condition owner_pid != pid evaluates to True because None != some_int, causing WriterInUseError to be raised even though the in-memory _PIDFILE_REFS state correctly reflects that this process owns the lock. Consider handling owner_pid is None as a valid case where the current process can reclaim ownership.

                    owner_pid, owner_start_time = self._read_writer_pidfile_owner(pidfile)
-                    if owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time):
+                    if owner_pid is None:
+                        # pidfile corrupted but in-memory refs valid; reclaim ownership
+                        pass
+                    elif owner_pid != pid or not self._pidfile_owner_matches(owner_pid, owner_start_time):
                         raise WriterInUseError(
                             f"pidfile ref mismatch for {self.db_path}; refusing to clear active refs"
                         )
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/vector_store.py around lines 155-167:

When `_read_writer_pidfile_owner()` returns `(None, None)` due to a corrupt or empty pidfile, the condition `owner_pid != pid` evaluates to `True` because `None != some_int`, causing `WriterInUseError` to be raised even though the in-memory `_PIDFILE_REFS` state correctly reflects that this process owns the lock. Consider handling `owner_pid is None` as a valid case where the current process can reclaim ownership.

Evidence trail:
src/brainlayer/vector_store.py lines 150-172 (REVIEWED_COMMIT): `_acquire_writer_pidfile` method with the condition at line 163. src/brainlayer/vector_store.py lines 231-244 (REVIEWED_COMMIT): `_read_writer_pidfile_owner` returning `(None, None)` for empty files (line 235) and for OSError/ValueError (line 244).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaving this unchanged intentionally. An unreadable or corrupt pidfile while in-memory writer refs exist is a mutex-integrity violation, not a safe reclaim path: treating owner_pid=None as valid could drop protection while a writer is still active. The current behavior raises and preserves the active refs instead of clearing them. Regression coverage: test_pidfile_ref_mismatch_does_not_clear_existing_refs.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/brainlayer/search_repo.py (1)

1532-1554: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Append recency-fallback ranks after existing FTS hits.

Line 1553 assigns recent-only rows rank i, restarting at 0 even when fts_ranks already has exact lexical matches. That lets fallback rows contribute the same RRF weight as the top FTS hit, and the later 2x recency boost can push non-matching recent chunks above genuinely matching results.

Suggested fix
+            recent_rank_offset = len(fts_ranks)
             for i, row in enumerate(recent_rows):
                 chunk_id = row[0]
-                fts_ranks.setdefault(chunk_id, i)
+                fts_ranks.setdefault(chunk_id, recent_rank_offset + i)
                 keyword_data.setdefault(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/brainlayer/search_repo.py` around lines 1532 - 1554, The recency fallback
loop resets ranks by using enumerate(recent_rows) starting at 0, allowing
recent-only rows to get the same rank as top FTS hits; change the enumeration to
start after existing FTS ranks (e.g., use enumerate(recent_rows,
start=len(fts_ranks)) or start=max(fts_ranks.values(), default=0)+1) so chunk_id
assignments in the loop that sets fts_ranks (the block iterating recent_rows and
using fts_ranks.setdefault) append ranks rather than colliding with existing FTS
ranks; ensure you do not overwrite existing fts_ranks and add a short comment
explaining the choice.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2112-2122: The current finally block always calls
self._release_writer_pidfile(), which can drop the writer pidfile even if
read_conn.close() or self.conn.close() fails; instead, remove the unconditional
finally call and only invoke self._release_writer_pidfile() after both
connection closes complete successfully (i.e., call it after read_conn.close()
and self.conn.close() without wrapping those in a finally that always releases
the pidfile); if any close raises, do not release the pidfile and re-raise the
exception so the single-writer guard remains held; update the failing close test
to expect the pidfile to be retained rather than removed.

---

Outside diff comments:
In `@src/brainlayer/search_repo.py`:
- Around line 1532-1554: The recency fallback loop resets ranks by using
enumerate(recent_rows) starting at 0, allowing recent-only rows to get the same
rank as top FTS hits; change the enumeration to start after existing FTS ranks
(e.g., use enumerate(recent_rows, start=len(fts_ranks)) or
start=max(fts_ranks.values(), default=0)+1) so chunk_id assignments in the loop
that sets fts_ranks (the block iterating recent_rows and using
fts_ranks.setdefault) append ranks rather than colliding with existing FTS
ranks; ensure you do not overwrite existing fts_ranks and add a short comment
explaining the choice.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ab4c20e-a711-4c76-a8fd-983278b6d842

📥 Commits

Reviewing files that changed from the base of the PR and between aeef3d2 and 36e9cc2.

📒 Files selected for processing (7)
  • hooks/brainlayer-prompt-search.py
  • src/brainlayer/drain.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/vector_store.py
  • tests/test_adaptive_injection.py
  • tests/test_hybrid_search.py
  • tests/test_writer_pidfile_mutex.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • tests/test_adaptive_injection.py
  • src/brainlayer/drain.py
  • hooks/brainlayer-prompt-search.py
  • src/brainlayer/search_repo.py
  • tests/test_hybrid_search.py
  • src/brainlayer/vector_store.py
  • tests/test_writer_pidfile_mutex.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/drain.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/vector_store.py
🔇 Additional comments (3)
hooks/brainlayer-prompt-search.py (1)

46-46: LGTM!

Also applies to: 776-776

tests/test_adaptive_injection.py (1)

158-196: LGTM!

tests/test_writer_pidfile_mutex.py (1)

29-521: Please run the full pytest suite before merging this mutex change.

This adds good focused coverage, but the repo rule for risky DB/concurrency work is stricter than the targeted tests called out in the PR description. I would not treat the behavior change as safely validated until the full 929-test suite passes.

As per coding guidelines **/*.py: "Run pytest before claiming behavior changed safely; current test suite has 929 tests".

Comment on lines +2112 to +2122
try:
# Close thread-local read connection if it exists
if hasattr(self, "_local"):
read_conn = getattr(self._local, "read_conn", None)
if read_conn is not None:
read_conn.close()
self._local.read_conn = None
if hasattr(self, "conn"):
self.conn.close()
finally:
self._release_writer_pidfile()
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not release the writer pidfile when writer shutdown fails.

Lines 2112-2122 drop the pidfile in finally, so a failing read_conn.close() or self.conn.close() can release the single-writer guard while the writable connection is still alive. That breaks the mutex guarantee this PR is adding. Keep the pidfile until the writer connection has actually closed, and update the companion close-failure test to expect retention instead of removal.

Suggested fix
     def close(self) -> None:
         """Close database connections."""
-        try:
-            # Close thread-local read connection if it exists
-            if hasattr(self, "_local"):
-                read_conn = getattr(self._local, "read_conn", None)
-                if read_conn is not None:
-                    read_conn.close()
-                    self._local.read_conn = None
-            if hasattr(self, "conn"):
-                self.conn.close()
-        finally:
-            self._release_writer_pidfile()
+        if hasattr(self, "_local"):
+            read_conn = getattr(self._local, "read_conn", None)
+            if read_conn is not None:
+                read_conn.close()
+                self._local.read_conn = None
+        if hasattr(self, "conn"):
+            self.conn.close()
+        self._release_writer_pidfile()
As per coding guidelines `**/*.py`: "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/brainlayer/vector_store.py` around lines 2112 - 2122, The current finally
block always calls self._release_writer_pidfile(), which can drop the writer
pidfile even if read_conn.close() or self.conn.close() fails; instead, remove
the unconditional finally call and only invoke self._release_writer_pidfile()
after both connection closes complete successfully (i.e., call it after
read_conn.close() and self.conn.close() without wrapping those in a finally that
always releases the pidfile); if any close raises, do not release the pidfile
and re-raise the exception so the single-writer guard remains held; update the
failing close test to expect the pidfile to be retained rather than removed.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this and am not applying the suggestion. PR-alpha explicitly requires pidfile unlink on close/atexit, and this PR already fixed the stale-lock failure mode where close exceptions skipped _release_writer_pidfile(). If APSW close raises while the process is unwinding, retaining the pidfile can permanently block future writers. _release_writer_pidfile() is idempotent and only unlinks when the pidfile still names this PID. Regression coverage: test_close_releases_pidfile_when_connection_close_raises.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EtanHey, understood — thank you for the thorough explanation. The PID-checked idempotency of _release_writer_pidfile() and the risk of a permanently stranded pidfile from a dying process tip the balance in favour of the finally block here. I was treating this like a generic lock where retention-on-failure is always safer, but the process-scoped, PID-validated pidfile design means a failed-close process that hangs onto its pidfile is strictly worse than releasing it. The existing test test_close_releases_pidfile_when_connection_close_raises confirms the behaviour is intentional and covered.

I'll withdraw the suggestion.


✏️ Learnings added
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 309
File: src/brainlayer/vector_store.py:2114-2124
Timestamp: 2026-05-21T20:41:49.909Z
Learning: In `src/brainlayer/vector_store.py`, `VectorStore.close()` intentionally releases the writer pidfile in a `finally` block (always, even if `read_conn.close()` or `self.conn.close()` raises). This is by design: the pidfile is PID-checked and idempotent, so releasing it during process unwind is safe, and retaining it after a failed close would permanently block future writers. Do not suggest moving `_release_writer_pidfile()` outside the `finally` block. Covered by `test_close_releases_pidfile_when_connection_close_raises`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8afb4387a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/drain.py Outdated
timeout_ms = int(os.environ.get("BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS", "30000"))
except (TypeError, ValueError):
return 30000
return timeout_ms if timeout_ms > 0 else 30000
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp busy-timeout override to 32-bit APSW range

_drain_busy_timeout_ms() accepts any positive integer, but apsw.Connection.setbusytimeout() expects a C int; values above 2_147_483_647 raise OverflowError and abort drain startup. A misconfigured BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS (for example 3000000000) therefore turns a tuning knob into a hard failure path instead of falling back safely.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c668da8. Added a RED regression for BRAINLAYER_DRAIN_BUSY_TIMEOUT_MS=3000000000 causing APSW OverflowError, then clamped invalid, non-positive, and >C-int values back to the 30000ms default. Verified with the focused drain timeout tests, the affected 68-test PR-alpha slice, and the full pre-push gate.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

Review update:

  • Fixed CodeRabbit outside-diff recency rank collision in e035747: added RED regression test_recency_intent_fallback_appends_ranks_after_fts_hits, then appended fallback ranks after existing FTS ranks.
  • Fixed drain busy-timeout overflow in c668da8 with RED regression test_drain_busy_timeout_overflow_env_falls_back_to_30s.
  • Latest local gate after c668da8: 2090 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated eval/hook routing 32 passed; Bun 1 passed; FTS determinism passed.

@cursor @BugBot re-review latest head c668da8.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

@codex review latest head c668da8.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c668da8c0e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

query_embedding = model.embed_query(prompt)

store = VectorStore(Path(db_path))
store = VectorStore(Path(db_path), readonly=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve writable fallback when opening prompt search store

Opening VectorStore in read-only mode here causes prompt search to hard-fail when the DB file or its parent directory does not exist yet (fresh install, cleaned state, or custom path). In that scenario VectorStore(..., readonly=True) raises immediately, and the fallback path also opens SQLite read-only, so the hook returns an error instead of gracefully returning no memories. Before this change, the writable open path created the DB directory/file and allowed the hook to continue.

Useful? React with 👍 / 👎.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 21, 2026

CI/review gates are green on head c668da8.

Attempted required merge command:
gh pr merge 309 --rebase

GitHub blocked it with base branch policy and suggested --admin. Per PR-alpha instructions and no-squash/no-admin-self-merge rule, requesting Etan admin-merge with rebase merge only.

@EtanHey please admin-merge PR #309 with rebase merge (no squash).

@EtanHey EtanHey merged commit 634c222 into main May 21, 2026
7 checks passed
EtanHey added a commit that referenced this pull request May 22, 2026
PR #309 added a pidfile mutex on VectorStore RW init to gate multiple
enrich processes. Unintended consequence: the FastAPI daemon's
lifespan also opens VectorStore RW at startup (daemon.py:83) — if
the enrich supervisor is alive (which it normally is), the daemon
fails to start entirely with WriterInUseError. That breaks:

  - brainlayer search CLI (DaemonClient hangs trying to start daemon)
  - brainlayer stats CLI (same)
  - any dashboard frontend on localhost:3000 etc.

Empirically observed: enrich supervisor PID 14909 holds the pidfile,
brainlayer-daemon refuses to bootstrap, /tmp/brainlayer.sock never
appears, all CLI operations time out at the DAEMON_STARTUP_TIMEOUT.

Fix: catch WriterInUseError on the lifespan VectorStore() call and
fall back to readonly mode. In readonly mode:
  - Search endpoints work (the common case)
  - Stats/context/dashboard endpoints work
  - Write endpoints (/digest, /store, /entity PATCH, /backlog/items
    POST/PATCH/DELETE) will return 500 on the SQL layer

Operator can restart the daemon (with no live writer) to regain RW
mode for write endpoints. Or eventually move write endpoints to the
arbitrated queue path (post-Phase-4 work).

Empirically verified by smoke-test:
  - Direct lifespan invocation against current production state
    (enrich supervisor PID 14909 holds pidfile)
  - Pre-fix: WriterInUseError raised, lifespan fails to enter, daemon exits
  - Post-fix: warning logged, lifespan enters readonly, "LIFESPAN OK"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EtanHey added a commit that referenced this pull request May 23, 2026
…ience

Etan-mandate 2026-05-22: "IMPROVE BRAINBAR AND MAKE IT SHOW GRAPH AND
INJESTIONS ALREADY, WITHOUT DEGRATION!" + addendum: "AUTO MERGE AFTER
/pr-loop, not sloppy."

PR #314 wired the runtime database + InjectionStore so the "Warming
memory…" placeholder cleared and BrainBarGraphTab + BrainBarInjectionTab
get non-nil dependencies. But two failure modes remained:

1. When KG queries failed (typically transient ReadOnly / busy / locked
   from writer-pidfile contention with the Python enrich-supervisor +
   drain — PR #309), `KGViewModel.loadGraph` silently set nodes=[]/
   edges=[] and returned false. The UI had no way to distinguish "empty
   graph" from "queries failing," so the canvas blanked.
2. Same pattern in `InjectionStore.refresh`: failures only `NSLog`'d and
   the panel held stale or empty events with no user-visible signal.

This change adds a small `DegradationState` enum (`.healthy` /
`.degraded(reason:)`) published on both view-models. `loadGraph` retries
once on a 200ms backoff before reporting degraded, then keeps last-known-
good nodes/edges so the user sees prior data rather than a blank canvas.
`InjectionStore.refresh` flips state on failure and clears it on the next
successful poll cycle.

A small `DegradationBadge` (orange capsule with an exclamation-triangle
icon) overlays the top-right of each tab when its state is degraded, with
the underlying error in the SwiftUI .help tooltip. Per Etan's mandate:
degraded ≠ hidden.

Tests cover the enum surface, KG load-failure → degraded transition, KG
success → healthy transition, and InjectionStore initial-healthy state.
387/387 tests pass, swift build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant