feat(enrich): long-lived supervisor replaces respawn-init storm#306
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR introduces a new ChangesEnrichment Supervisor Mode Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
@codex review |
✅ Actions performedReview triggered.
|
|
You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a899484cf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| logger.exception("Enrich supervisor cycle failed; continuing") | ||
| continue |
There was a problem hiding this comment.
Back off after failed supervisor cycle
When enrich_fn raises, the supervisor immediately continues without any sleep, so a persistent failure path (for example API outage or DB write error) will spin the loop as fast as possible, flooding logs and burning CPU while repeatedly retrying. This is a regression from the previous launchd shell loop behavior, which had a delay between attempts, and it can destabilize production during incidents instead of degrading gracefully.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in ed786c4. Supervisor cycle exceptions now sleep via the same idle-poll helper before retrying, with a persistent-failure regression test.
| since_hours=since_hours | ||
| if since_hours != DEFAULT_REALTIME_ENRICH_SINCE_HOURS | ||
| else DEFAULT_ENRICH_SUPERVISOR_SINCE_HOURS, |
There was a problem hiding this comment.
Preserve explicit --since-hours in supervisor mode
This remapping treats since_hours == DEFAULT_REALTIME_ENRICH_SINCE_HOURS as "option not provided" and replaces it with DEFAULT_ENRICH_SUPERVISOR_SINCE_HOURS, so users cannot intentionally request the realtime default window (currently 8760h) in supervisor mode. Passing --since-hours 8760 is silently changed to 87600, which can unexpectedly widen scope and alter enrichment workload.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in ed786c4. --since-hours now defaults to None at the CLI boundary, so omitted supervisor runs use 87600h while explicit values like 8760 are preserved.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/enrichment_controller.py`:
- Around line 137-195: The supervisor tight-loop on exceptions is caused by the
except Exception branch in run_enrich_supervisor skipping any sleep/backoff;
modify that except block in run_enrich_supervisor (the one that currently
increments stats and logger.exception) to call the same idle-poll/backoff helper
_sleep_or_wait_for_stop(stop_event, idle_poll_seconds, sleep_fn) (or a small
exponential backoff using sleep_fn) before continuing so repeated failures yield
CPU/log backoff; keep the existing stats increments and error recording. Also
add a regression test (e.g., similar to
test_supervisor_logs_gemini_error_and_continues) that forces enrich_fn to raise
on every cycle with max_cycles > 1 and asserts the provided sleep_fn was invoked
between failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e959b4d0-118e-43ff-bd47-bbca7008d924
📒 Files selected for processing (6)
scripts/launchd/com.brainlayer.enrichment.plistsrc/brainlayer/cli/__init__.pysrc/brainlayer/enrichment_controller.pytests/test_cli_enrich.pytests/test_enrich_supervisor.pytests/test_enrichment_controller.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
tests/test_cli_enrich.pytests/test_enrichment_controller.pysrc/brainlayer/cli/__init__.pysrc/brainlayer/enrichment_controller.pytests/test_enrich_supervisor.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/cli/__init__.pysrc/brainlayer/enrichment_controller.py
🔇 Additional comments (11)
scripts/launchd/com.brainlayer.enrichment.plist (1)
9-17: LGTM!tests/test_enrichment_controller.py (1)
1230-1233: LGTM!src/brainlayer/enrichment_controller.py (1)
36-38: LGTM!Also applies to: 109-119, 121-125, 128-134
tests/test_enrich_supervisor.py (1)
1-213: LGTM!src/brainlayer/cli/__init__.py (4)
906-906: LGTM!
911-916: LGTM!
944-976: LGTM!
1020-1021: LGTM!tests/test_cli_enrich.py (3)
3-4: LGTM!
33-61: LGTM!
63-88: LGTM!
| finally: | ||
| signal.signal(signal.SIGTERM, previous_sigterm) | ||
| signal.signal(signal.SIGINT, previous_sigint) |
There was a problem hiding this comment.
🟢 Low cli/__init__.py:967
signal.getsignal() returns None if the previous handler wasn't installed from Python, but signal.signal() raises TypeError when passed None. The finally block at lines 967-969 unconditionally restores the previous handlers, so if either previous_sigterm or previous_sigint is None, the cleanup itself crashes.
- signal.signal(signal.SIGTERM, previous_sigterm)
- signal.signal(signal.SIGINT, previous_sigint)🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/cli/__init__.py around lines 967-969:
`signal.getsignal()` returns `None` if the previous handler wasn't installed from Python, but `signal.signal()` raises `TypeError` when passed `None`. The `finally` block at lines 967-969 unconditionally restores the previous handlers, so if either `previous_sigterm` or `previous_sigint` is `None`, the cleanup itself crashes.
Evidence trail:
src/brainlayer/cli/__init__.py lines 954-969 (viewed at REVIEWED_COMMIT): `previous_sigterm = signal.getsignal(signal.SIGTERM)` (line 954), `previous_sigint = signal.getsignal(signal.SIGINT)` (line 955), and unconditional restore in `finally` block at lines 967-969. Python docs (https://docs.python.org/3/library/signal.html): getsignal() 'returned value may be a callable Python object, or one of the special values signal.SIG_IGN, signal.SIG_DFL or None'; signal.signal() handler must be 'a callable Python object taking two arguments, or one of the special values signal.SIG_IGN or signal.SIG_DFL'.
|
@coderabbitai review |
|
@codex review |
✅ Actions performedReview triggered.
|
Summary
Testing
|
| stop_event: threading.Event | None = None, | ||
| vector_store_cls=None, | ||
| enrich_fn=None, | ||
| sleep_fn=time.sleep, |
There was a problem hiding this comment.
🟡 Medium brainlayer/enrichment_controller.py:148
run_enrich_supervisor returns EnrichmentSupervisorResult with exit_code=0 even when failed > 0 or failed_cycles > 0, causing the CLI to exit with status 0 on failure. Consider setting exit_code=1 when any failures occur so the caller can detect the supervisor encountered errors.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/enrichment_controller.py around line 148:
`run_enrich_supervisor` returns `EnrichmentSupervisorResult` with `exit_code=0` even when `failed > 0` or `failed_cycles > 0`, causing the CLI to exit with status 0 on failure. Consider setting `exit_code=1` when any failures occur so the caller can detect the supervisor encountered errors.
Evidence trail:
src/brainlayer/enrichment_controller.py lines 110-119 (EnrichmentSupervisorResult dataclass with exit_code=0 default), lines 138-198 (run_enrich_supervisor function — exit_code never modified), lines 176-184 (failed_cycles incremented but exit_code untouched), lines 186-191 (failed accumulated but exit_code untouched). src/brainlayer/cli/__init__.py line 973 (`raise typer.Exit(result.exit_code)` — CLI uses exit_code directly).
ed786c4 to
e96848f
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e96848f497
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| db_path = Path(db_path) | ||
| stats = EnrichmentSupervisorResult() | ||
| store = vector_store_cls(db_path) |
There was a problem hiding this comment.
Avoid holding writable VectorStore for full supervisor life
run_enrich_supervisor creates one writable VectorStore before the loop and only closes it on shutdown, so the writer pidfile stays owned for the entire LaunchAgent lifetime. In this codebase, writable VectorStore initialization acquires that pidfile and competing writable opens raise WriterInUseError, so this change turns lock contention from temporary (per one-shot cycle) into effectively permanent while supervisor mode is running. That can block other processes that still need writable VectorStore access (e.g., maintenance/bootstrap paths) until enrichment is stopped.
Useful? React with 👍 / 👎.
Status
DRAFT: PR-alpha (
fix/single-writer-arbitration-option-a) is not merged or visible on GitHub yet. This PR intentionally stays draft until it can be rebased on the pidfile mutex base and the real single-instance path can be validated.Summary
run_enrich_supervisor()as a long-lived realtime enrichment loop around one writableVectorStore(db_path).brainlayer enrich --mode realtime --supervisorwith SIGTERM/SIGINT stop-event handling.while truerespawn loop with a direct supervisor invocation.Architecture
Single-init proof
Focused test:
pytest tests/test_enrich_supervisor.py::test_supervisor_logs_single_vectorstore_initialized_line -qRelevant assertion:
caplogrecords exactly oneVectorStore initialized for enrich supervisorline across three supervisor cycles.Error survival proof
Focused test:
pytest tests/test_enrich_supervisor.py::test_supervisor_logs_gemini_error_and_continues -qThis injects
RuntimeError("gemini 503")on cycle 1 and asserts cycle 2 still enriches successfully, withfailed == 1andenriched == 1.SIGTERM graceful-shutdown proof
Focused test:
pytest tests/test_cli_enrich.py::test_cli_enrich_supervisor_handles_sigterm_gracefully -qThis sends a real
SIGTERMto the process while the CLI handler is installed, asserts the supervisor stop event is set, and exits with code 0.Before/after enrich rate
Reproducible bench commands
Test plan
pytest tests/test_enrich_supervisor.py -qinitially failed 6/6 on missingrun_enrich_supervisor.pytest tests/test_cli_enrich.py::test_cli_enrich_supervisor_routes_to_controller -qinitially failed on unknown--supervisor.pytest tests/test_enrich_supervisor.py::test_enrichment_launchagent_runs_supervisor_not_shell_respawn_loop -qinitially failed on/bin/zsh -lc while trueargs.pytest tests/test_enrich_supervisor.py tests/test_cli_enrich.py tests/test_enrichment_controller.py::test_enrichment_plist_invokes_cli_enrich_entrypoint -q-> 18 passed.ruff check src tests-> passed.ruff format --check src tests-> passed.Known dependency
The actual pidfile single-instance enforcement remains owned by PR-alpha. Until that merges, this PR proves constructor-error propagation with a
WriterInUseError-shaped fake and keeps the PR draft.Note
Replace shell-loop respawn with a long-lived
run_enrich_supervisorloop in the enrichment LaunchAgentrun_enrich_supervisorto enrichment_controller.py: a persistent loop that reuses a singleVectorStore, sleeps when the queue is idle, continues on per-cycle errors, and returns anEnrichmentSupervisorResultwith aggregated stats.--supervisorflag to theenrichCLI command; when used with--mode realtime, the CLI runs the supervisor loop, installs SIGTERM/SIGINT handlers for graceful shutdown, and exits with the supervisor's exit code.__BRAINLAYER_BIN__ enrich --mode realtime --supervisordirectly, delegating restart-on-crash to launchd instead of awhile trueshell loop.BRAINLAYER_ENRICH_IDLE_POLL_SECONDSenvironment variable, defaulting to 30 seconds.📊 Macroscope summarized 7a89948. 3 files reviewed, 1 issue evaluated, 0 issues filtered, 1 comment posted
🗂️ Filtered Issues
Summary by CodeRabbit
New Features
Tests
Note
Medium Risk
Changes how realtime enrichment runs under
launchdby introducing a long-lived supervisor loop with signal handling and new defaults, which could impact background job stability and throughput if misconfigured. Test coverage is added, but behavior changes are operationally significant.Overview
Switches realtime enrichment from a
launchd-driven shellwhile truerespawn loop to a first-class--supervisormode that keeps a singleVectorStoreopen and repeatedly runs realtime enrichment, sleeping when idle and continuing past per-cycle errors.Updates
brainlayer enrichto support--supervisor(with SIGTERM/SIGINT graceful shutdown and explicitsince_hoursdefaulting behavior), adds supervisor defaults/tuning (BRAINLAYER_ENRICH_IDLE_POLL_SECONDS) inenrichment_controller, and revises the enrichment LaunchAgent plist to invoke the supervisor directly; adds regression tests covering CLI routing, supervisor lifecycle/error handling, and plist shape.Reviewed by Cursor Bugbot for commit e96848f. Bugbot is set up for automated code reviews on this repo. Configure here.