fix(daemon): fall back to readonly when writer pidfile is held#311
Conversation
PR #309 added a pidfile mutex on VectorStore RW init to gate multiple enrich processes. Unintended consequence: the FastAPI daemon's lifespan also opens VectorStore RW at startup (daemon.py:83) — if the enrich supervisor is alive (which it normally is), the daemon fails to start entirely with WriterInUseError. That breaks: - brainlayer search CLI (DaemonClient hangs trying to start daemon) - brainlayer stats CLI (same) - any dashboard frontend on localhost:3000 etc. Empirically observed: enrich supervisor PID 14909 holds the pidfile, brainlayer-daemon refuses to bootstrap, /tmp/brainlayer.sock never appears, all CLI operations time out at the DAEMON_STARTUP_TIMEOUT. Fix: catch WriterInUseError on the lifespan VectorStore() call and fall back to readonly mode. In readonly mode: - Search endpoints work (the common case) - Stats/context/dashboard endpoints work - Write endpoints (/digest, /store, /entity PATCH, /backlog/items POST/PATCH/DELETE) will return 500 on the SQL layer Operator can restart the daemon (with no live writer) to regain RW mode for write endpoints. Or eventually move write endpoints to the arbitrated queue path (post-Phase-4 work). Empirically verified by smoke-test: - Direct lifespan invocation against current production state (enrich supervisor PID 14909 holds pidfile) - Pre-fix: WriterInUseError raised, lifespan fails to enter, daemon exits - Post-fix: warning logged, lifespan enters readonly, "LIFESPAN OK" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThe daemon startup logic now attempts to initialize the vector store in read-write mode. If another live writer holds the mutex lock, raising ChangesVector Store Readonly Fallback
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/daemon.py`:
- Around line 92-97: The warning message emitted in the pidfile mutex fallback
(the logger.warning call that describes entering READONLY mode) incorrectly
lists /backlog/items POST/PATCH/DELETE as blocked; remove those backlog
endpoints from the degraded-mode message so it only advertises write endpoints
that actually touch vector_store (for example keep /digest, /store, /entity
PATCH, etc.), and if you intend to gate backlog mutations separately, add a
distinct check/gate for them (see the code around the pidfile mutex/READONLY
logic in daemon.py where logger.warning is called).
- Around line 89-99: The startup path currently constructs VectorStore using
DEFAULT_DB_PATH directly; instead, call get_db_path() once from paths.py and
store the result (e.g., db_path = get_db_path()) and use that db_path for both
VectorStore(...) calls (the initial read-write attempt and the readonly fallback
in the WriterInUseError handler) so the daemon honors repository/CLI overrides;
update references to DEFAULT_DB_PATH to use the db_path variable and keep
existing exception handling around WriterInUseError and logging unchanged.
- Around line 88-99: When VectorStore is opened in readonly fallback (catching
WriterInUseError in the VectorStore(DEFAULT_DB_PATH) block), set a
module-level/deamon-level flag (e.g., readonly_degraded = True) alongside
assigning vector_store so the daemon state explicitly records the degraded mode;
then update the write-route handlers api_digest, api_store, and
api_update_entity to check that flag at the top of each handler and immediately
return an HTTP 503 response (with a clear message) before any write-heavy
processing or DB calls if readonly_degraded is set. Ensure the flag name and
checks are consistent with the VectorStore/WriterInUseError code paths so the
handlers short-circuit reliably whenever the store was opened with
readonly=True.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 83faec47-4603-46a0-a1be-1d4334207333
📒 Files selected for processing (1)
src/brainlayer/daemon.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Cursor Bugbot
- GitHub Check: Macroscope - Correctness Check
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
src/brainlayer/daemon.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/daemon.py
| try: | ||
| vector_store = VectorStore(DEFAULT_DB_PATH) | ||
| logger.info(f"Loaded vector store (READ-WRITE): {vector_store.count()} chunks") | ||
| except WriterInUseError as exc: | ||
| logger.warning( | ||
| "Another writer holds the pidfile mutex (%s); falling back to READONLY mode. " | ||
| "Search/stats/context endpoints work; write endpoints (/digest, /store, /entity PATCH, " | ||
| "/backlog/items POST/PATCH/DELETE) will return 500 until daemon is restarted with no live writer.", | ||
| exc, | ||
| ) | ||
| vector_store = VectorStore(DEFAULT_DB_PATH, readonly=True) | ||
| logger.info(f"Loaded vector store (READONLY fallback): {vector_store.count()} chunks") |
There was a problem hiding this comment.
Reject local write routes explicitly after readonly fallback.
This only changes how VectorStore is opened. api_digest, api_store, and api_update_entity still enter their normal write-heavy code paths and will fail only after they reach SQLite, so the daemon is still doing concurrent write work while another writer owns the mutex. Persist a readonly/degraded flag here and short-circuit those handlers with a 503 before any write-side processing starts. As per coding guidelines, "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/brainlayer/daemon.py` around lines 88 - 99, When VectorStore is opened in
readonly fallback (catching WriterInUseError in the VectorStore(DEFAULT_DB_PATH)
block), set a module-level/deamon-level flag (e.g., readonly_degraded = True)
alongside assigning vector_store so the daemon state explicitly records the
degraded mode; then update the write-route handlers api_digest, api_store, and
api_update_entity to check that flag at the top of each handler and immediately
return an HTTP 503 response (with a clear message) before any write-heavy
processing or DB calls if readonly_degraded is set. Ensure the flag name and
checks are consistent with the VectorStore/WriterInUseError code paths so the
handlers short-circuit reliably whenever the store was opened with
readonly=True.
| vector_store = VectorStore(DEFAULT_DB_PATH) | ||
| logger.info(f"Loaded vector store (READ-WRITE): {vector_store.count()} chunks") | ||
| except WriterInUseError as exc: | ||
| logger.warning( | ||
| "Another writer holds the pidfile mutex (%s); falling back to READONLY mode. " | ||
| "Search/stats/context endpoints work; write endpoints (/digest, /store, /entity PATCH, " | ||
| "/backlog/items POST/PATCH/DELETE) will return 500 until daemon is restarted with no live writer.", | ||
| exc, | ||
| ) | ||
| vector_store = VectorStore(DEFAULT_DB_PATH, readonly=True) | ||
| logger.info(f"Loaded vector store (READONLY fallback): {vector_store.count()} chunks") |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Resolve the DB path with get_db_path() in this startup path.
The new open path still uses DEFAULT_DB_PATH directly for both the RW and readonly attempts, which bypasses the repo's required resolver and can desync the daemon from callers that honor DB path overrides. Please fetch the path once via paths.py:get_db_path() and use that for both initializations. As per coding guidelines, "Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/brainlayer/daemon.py` around lines 89 - 99, The startup path currently
constructs VectorStore using DEFAULT_DB_PATH directly; instead, call
get_db_path() once from paths.py and store the result (e.g., db_path =
get_db_path()) and use that db_path for both VectorStore(...) calls (the initial
read-write attempt and the readonly fallback in the WriterInUseError handler) so
the daemon honors repository/CLI overrides; update references to DEFAULT_DB_PATH
to use the db_path variable and keep existing exception handling around
WriterInUseError and logging unchanged.
| logger.warning( | ||
| "Another writer holds the pidfile mutex (%s); falling back to READONLY mode. " | ||
| "Search/stats/context endpoints work; write endpoints (/digest, /store, /entity PATCH, " | ||
| "/backlog/items POST/PATCH/DELETE) will return 500 until daemon is restarted with no live writer.", | ||
| exc, | ||
| ) |
There was a problem hiding this comment.
Don't advertise backlog mutations as blocked by readonly mode.
/backlog/items POST/PATCH/DELETE do not touch vector_store; they go through _supabase_mutate, so this warning tells operators those endpoints are unavailable even though they should still work. Please remove them from the degraded-mode message unless you also add a separate gate for them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/brainlayer/daemon.py` around lines 92 - 97, The warning message emitted
in the pidfile mutex fallback (the logger.warning call that describes entering
READONLY mode) incorrectly lists /backlog/items POST/PATCH/DELETE as blocked;
remove those backlog endpoints from the degraded-mode message so it only
advertises write endpoints that actually touch vector_store (for example keep
/digest, /store, /entity PATCH, etc.), and if you intend to gate backlog
mutations separately, add a distinct check/gate for them (see the code around
the pidfile mutex/READONLY logic in daemon.py where logger.warning is called).
Why
PR #309 (single-writer pidfile mutex) inadvertently broke
brainlayer-daemon(FastAPI). At startup, daemon.py:83 opensVectorStore(DEFAULT_DB_PATH)in RW mode, but the new pidfile mutex raisesWriterInUseErrorif any other writer (enrich supervisor, drain, etc.) holds the lock — which is normal production state.Symptom:
brainlayer searchCLI hangs ~17-36 min,brainlayer statshangs,/tmp/brainlayer.socknever appears.Discovered during Phase 4a eval baseline.json generation tonight — the runner needed
brainlayer searchto work.Fix
Catch
WriterInUseErrorin the lifespan handler and fall back to readonly mode. Log a clear warning explaining what works and what doesn't.Smoke-test evidence (empirical)
Direct lifespan invocation while enrich supervisor PID 14909 holds the pidfile:
Test plan
Sequencing
Independent of Phase 4a/4b. Ships standalone as a small bugfix.
🤖 Generated with Claude Code by orcClaude-successor s:42
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Note
Medium Risk
Changes daemon startup behavior to continue running in a degraded (readonly) mode when a writer lock is held; this can surface as 500s on write endpoints until restart. Risk is limited to initialization/control flow and logging, with no schema or query logic changes.
Overview
Fixes
brainlayer-daemonstartup failures introduced by the single-writer pidfile mutex by catchingWriterInUseErrorduringVectorStoreinitialization.On lock contention, the daemon now falls back to
VectorStore(..., readonly=True)and logs a clear warning that read endpoints (search/stats/context/dashboard) remain available while write endpoints (e.g.POST /digest,POST /store,PATCH /entity, backlog mutations) will fail until restarted without another active writer.Reviewed by Cursor Bugbot for commit 6383adf. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fall back to readonly
VectorStorewhen writer pidfile is held on daemon startupIn daemon.py, the
lifespanstartup now wrapsVectorStoreinitialization in a try/except. IfWriterInUseErroris raised, the daemon logs a warning and re-opens the store withreadonly=Trueinstead of failing. Behavioral Change: the daemon can now start successfully when another process holds the writer lock, but will serve read-only traffic in that state.Macroscope summarized 6383adf.
Summary by CodeRabbit