feat: R81b meta-noise filter + enrichment prompt additions #225
📝 Walkthrough
Adds process-level stderr diagnostic markers for enrichment runtime and prompt loading, strengthens prompt instructions for meta-research and chunking, implements optional meta-noise filtering in hybrid search, and adds tests covering prompt content, concurrency for prompt signature emission, and meta-noise filtering behavior.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/enrichment_controller.py`:
- Around line 463-467: The direct os.write() of the "ENRICHMENT_RUNTIME_LOADED"
marker in enrich_realtime() must be made best-effort: wrap the os.write(...)
call in a try/except OSError block and catch/log the exception at debug level
(include the exception message and that the stderr write failed) so an OSError
on FD 2 won’t abort enrich_realtime(); apply the same guard pattern around the
analogous os.write(...) call in the pipeline enrichment module (the call
referenced at enrichment.py:409) so both emitters are fault-tolerant.
In `@src/brainlayer/pipeline/enrichment.py`:
- Around line 403-412: Protect the unsynchronized check/set in
_emit_prompt_signature_once by introducing a module-level threading.Lock
(similar to existing _groq_rate_lock), acquire it before checking
_prompt_signature_emitted, perform a double-check inside the lock, set
_prompt_signature_emitted=True and call os.write() while still holding the lock,
and add try/except around os.write() to catch and log/ignore OSError so the call
won’t crash; update the same pattern at the other emission sites referenced by
the comment (the calls around lines where build_prompt/build_external_prompt are
invoked) so all prompt-signature emissions use the new locked, exception-safe
logic.
In `@src/brainlayer/search_repo.py`:
- Around line 107-110: The post-filter _contains_meta_noise currently does a
case-sensitive substring check against META_NOISE_PATTERNS so variants like
"Brain_Search" slip through; update _contains_meta_noise to normalize comparison
(e.g., use content.casefold() and compare against a pre-normalized, casefolded
META_NOISE_PATTERNS or use re.search with re.IGNORECASE) and also make the
SQL-level filter case-insensitive by wrapping fields with LOWER(...) and comparing
to lowercased patterns (SQLite, which this repo uses, has no ILIKE operator); ensure any
other places that reference META_NOISE_PATTERNS for SQL construction or
filtering (the other post-filter/code paths noted around the same sections) are
updated to use the same case-insensitive approach so both SQL and Python
post-filtering behave consistently.
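A minimal sketch of the case-insensitive matching described above, covering both the Python post-filter and a SQL predicate. The pattern values, the `chunks` table, and the helper names are assumptions for illustration; the real `META_NOISE_PATTERNS` and query construction in search_repo.py are not shown in this review. The SQL side uses `instr(LOWER(...), ?)` rather than `LIKE`, because `_` is a single-character wildcard in `LIKE` and would need escaping in patterns such as `brain_search(`.

```python
import sqlite3

# Hypothetical patterns for illustration only; the real META_NOISE_PATTERNS
# lives in search_repo.py and is not shown in this review.
META_NOISE_PATTERNS = ("brain_search(", "brain_entity(")

# Normalize once so every comparison is made on casefolded text.
_NORMALIZED_PATTERNS = tuple(p.casefold() for p in META_NOISE_PATTERNS)


def contains_meta_noise(content: str) -> bool:
    """Return True if any pattern occurs in content, ignoring case."""
    folded = content.casefold()
    return any(p in folded for p in _NORMALIZED_PATTERNS)


def fetch_without_meta_noise(conn: sqlite3.Connection) -> list[str]:
    """SQL-level counterpart: match lowercased patterns against LOWER(content).

    Assumes a hypothetical `chunks(content TEXT)` table. SQLite's LOWER()
    only folds ASCII, which is enough for these ASCII patterns.
    """
    clauses = " AND ".join(
        "instr(LOWER(content), ?) = 0" for _ in META_NOISE_PATTERNS
    )
    params = [p.lower() for p in META_NOISE_PATTERNS]
    rows = conn.execute(
        f"SELECT content FROM chunks WHERE {clauses} ORDER BY rowid", params
    )
    return [row[0] for row in rows]
```

Keeping one normalized pattern tuple as the source for both paths is what makes the SQL filter and the post-filter agree on variants like `Brain_Search`.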
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a840f78c-ac92-4913-9ad4-c59c860ea1a3
📒 Files selected for processing (5)
src/brainlayer/enrichment_controller.py
src/brainlayer/pipeline/enrichment.py
src/brainlayer/search_repo.py
tests/test_enrichment_entity_schema.py
tests/test_hybrid_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Use `paths.py:get_db_path()` for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
src/brainlayer/enrichment_controller.py
tests/test_hybrid_search.py
src/brainlayer/search_repo.py
tests/test_enrichment_entity_schema.py
src/brainlayer/pipeline/enrichment.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve `ai_code`, `stack_trace`, and `user_message` verbatim; skip `noise` entries entirely and summarize `build_log` and `dir_listing` entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var
Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history
Implement `brain_supersede` with safety gate for personal data (journals, notes, health/finance); use soft-delete for `brain_archive` with timestamp
Add `supersedes` parameter to `brain_store` for atomic store-and-replace operations
Run linting and formatting with: `ruff check src/ && ruff format src/`
Run tests with `pytest`
Use `PRAGMA wal_checkpoint(FULL)` before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/enrichment_controller.py
src/brainlayer/search_repo.py
src/brainlayer/pipeline/enrichment.py
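The rate guideline above ("default 0.2 = 12 RPM") reads as requests per second converted to requests per minute. That interpretation is an assumption inferred from the numbers, and the helper name below is hypothetical; only the environment variable name and the default come from the guideline.

```python
import os
from collections.abc import Mapping

DEFAULT_ENRICH_RATE = 0.2  # assumed unit: requests per second (0.2 * 60 = 12 RPM)


def enrich_rate_rpm(environ: Mapping[str, str] = os.environ) -> float:
    """Read BRAINLAYER_ENRICH_RATE and express it as requests per minute."""
    rate_per_second = float(environ.get("BRAINLAYER_ENRICH_RATE", DEFAULT_ENRICH_RATE))
    return rate_per_second * 60.0
```

Passing the environment as a parameter keeps the conversion easy to check without mutating the process environment.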
🧠 Learnings (8)
📚 Learning: 2026-04-02T23:32:14.543Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T23:32:14.543Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
src/brainlayer/pipeline/enrichment.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.
Applied to files:
src/brainlayer/enrichment_controller.py
src/brainlayer/pipeline/enrichment.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle columns: `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-04T23:24:03.159Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-04T23:24:03.159Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-04T15:22:02.740Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:22:02.740Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-04T15:21:39.570Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:169-169
Timestamp: 2026-04-04T15:21:39.570Z
Learning: In EtanHey/brainlayer, `hooks/brainlayer-prompt-search.py` reads `entity_type` directly from existing rows in `kg_entities` (read-only). `contracts/entity-types.yaml` defines the write-side schema only and is not authoritative for what `entity_type` values exist in the DB. The DB already stores `technology` (72 entities), `project` (24), and `tool` (1) as valid `entity_type` values, so `INJECT_TYPES` in the hook should match these DB values, not the contract file.
Applied to files:
tests/test_enrichment_entity_schema.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Pipeline architecture: Extract → Classify → Chunk → Embed → Index, with post-processing for enrichment, brain graph, and Obsidian export
Applied to files:
src/brainlayer/pipeline/enrichment.py
🔇 Additional comments (4)
src/brainlayer/search_repo.py (1)
683-684: Nice cache-key isolation for `filter_meta_noise`.
Including `filter_meta_noise` in the cache key avoids filtered/unfiltered cross-contamination from `_hybrid_cache`.
Also applies to: 717-717
tests/test_hybrid_search.py (1)
230-263: Strong coverage for default-vs-opt-out meta-noise behavior.
This test captures both expected defaults and explicit override behavior in one place.
src/brainlayer/pipeline/enrichment.py (1)
304-308: Prompt additions are aligned and actionable.
The meta-research detection and short conversational chunk guidance are clear, and rubric wording remains explicit.
Also applies to: 315-316, 396-399
tests/test_enrichment_entity_schema.py (1)
100-119: Good prompt-contract regression checks.
These assertions lock in the newly required meta-research and rubric directives with clear intent.
    if mode == "realtime":
        os.write(
            2,
            b"ENRICHMENT_RUNTIME_LOADED mode=realtime prompt=r81 truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
        )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify unguarded stderr marker writes in enrichment paths.
rg -n -C3 'def _emit_enrichment_start|os\.write\(' src/brainlayer/enrichment_controller.py src/brainlayer/pipeline/enrichment.py
Repository: EtanHey/brainlayer
Length of output: 1299
🏁 Script executed:
#!/bin/bash
# Get full context of _emit_enrichment_start and its callers
# First, check the function definition and return behavior
sed -n '462,480p' src/brainlayer/enrichment_controller.py
# Then find where _emit_enrichment_start is called
echo "=== Callers of _emit_enrichment_start ==="
rg -n '_emit_enrichment_start' src/brainlayer/enrichment_controller.py
# Check if there are other unguarded os.write calls in the codebase
echo "=== All os.write calls in brainlayer ==="
rg -n 'os\.write\(' src/brainlayer/ --type py
Repository: EtanHey/brainlayer
Length of output: 1095
🏁 Script executed:
#!/bin/bash
# Check enrich_realtime to see what happens if _emit_enrichment_start fails
# Search for the function definition and how it handles _emit_enrichment_start
rg -B5 -A20 'def enrich_realtime' src/brainlayer/enrichment_controller.py
Repository: EtanHey/brainlayer
Length of output: 894
🏁 Script executed:
#!/bin/bash
# Check if os.write can actually raise OSError and under what conditions
# Also verify if any existing guard patterns are used in the codebase
rg -B2 -A2 'try:' src/brainlayer/enrichment_controller.py | head -50
Repository: EtanHey/brainlayer
Length of output: 1582
🏁 Script executed:
#!/bin/bash
# Check exact current state of lines 462-470
sed -n '462,470p' src/brainlayer/enrichment_controller.py
Repository: EtanHey/brainlayer
Length of output: 418
🏁 Script executed:
#!/bin/bash
# Check if logger is already available in enrichment_controller.py
head -30 src/brainlayer/enrichment_controller.py | rg -n 'import|logger'
Repository: EtanHey/brainlayer
Length of output: 516
🏁 Script executed:
#!/bin/bash
# Check if there's any error handling pattern around _emit_enrichment_start calls
sed -n '515,530p' src/brainlayer/enrichment_controller.py
Repository: EtanHey/brainlayer
Length of output: 705
Guard realtime stderr marker emission to prevent enrichment failure.
Line 464 writes directly to FD 2 without error handling. An OSError from a broken/unavailable stderr—common in containerized or CI environments—will interrupt the entire enrich_realtime() call before any chunks are processed. This diagnostic marker should be best-effort.
Wrap the os.write() call in a try-except block that logs the failure at debug level:
Proposed fix
 def _emit_enrichment_start(mode: str, limit: int) -> bool:
     if mode == "realtime":
-        os.write(
-            2,
-            b"ENRICHMENT_RUNTIME_LOADED mode=realtime prompt=r81 truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
-        )
+        try:
+            os.write(
+                2,
+                b"ENRICHMENT_RUNTIME_LOADED mode=realtime prompt=r81 truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
+            )
+        except OSError:
+            logger.debug("Failed to emit ENRICHMENT_RUNTIME_LOADED marker", exc_info=True)
     return _emit_enrichment_event(
Note: A similar unguarded os.write() call exists at src/brainlayer/pipeline/enrichment.py:409. Apply the same guard pattern there.
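The failure mode and the guard can be exercised in isolation. The sketch below is not the repository's code: `emit_marker` is a hypothetical helper that takes the file descriptor as a parameter, so a closed pipe end can stand in for an unavailable stderr.

```python
import logging
import os

logger = logging.getLogger(__name__)


def emit_marker(fd: int, marker: bytes) -> bool:
    """Best-effort write of a diagnostic marker; never raises OSError.

    Returns True if the write went through, False if it was swallowed.
    """
    try:
        os.write(fd, marker)
    except OSError:
        # Diagnostic output is optional, so log quietly and carry on.
        logger.debug("Failed to emit diagnostic marker", exc_info=True)
        return False
    return True


def demo_closed_fd() -> bool:
    """Write to a descriptor that was just closed, simulating a broken stderr."""
    read_end, write_end = os.pipe()
    os.close(write_end)
    try:
        return emit_marker(write_end, b"ENRICHMENT_RUNTIME_LOADED\n")
    finally:
        os.close(read_end)
```

Without the `try`/`except`, the same `os.write()` call on the closed descriptor raises `OSError` and would propagate into the caller.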
def _emit_prompt_signature_once() -> None:
    """Write a single prompt signature line per process for daemon verification."""
    global _prompt_signature_emitted
    if _prompt_signature_emitted:
        return
    _prompt_signature_emitted = True
    os.write(
        2,
        b"ENRICHMENT_PROMPT_LOADED truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
    )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify call sites and parallel paths for prompt building and signature emission.
rg -n -C3 '_emit_prompt_signature_once|build_external_prompt\(|build_prompt\(|ThreadPoolExecutor|executor\.submit\(' src/brainlayer/pipeline/enrichment.py src/brainlayer/enrichment_controller.py
Repository: EtanHey/brainlayer
Length of output: 10848
🏁 Script executed:
#!/bin/bash
# Check for any existing synchronization around _prompt_signature_emitted
# and verify the actual execution context
# 1. Check the full context of the global variable and function
sed -n '40,50p' src/brainlayer/pipeline/enrichment.py
sed -n '403,420p' src/brainlayer/pipeline/enrichment.py
# 2. Check if there are any threading locks defined in the module
rg -n "threading\.(Lock|RLock|Semaphore|Condition)" src/brainlayer/pipeline/enrichment.py
# 3. Verify the execution flow in _enrich_one to see if it calls build_prompt
rg -n -A20 "^def _enrich_one" src/brainlayer/pipeline/enrichment.py | head -40
Repository: EtanHey/brainlayer
Length of output: 2005
🏁 Script executed:
#!/bin/bash
# Check enrichment_controller to confirm ThreadPoolExecutor calls paths with _emit_prompt_signature_once
sed -n '540,560p' src/brainlayer/enrichment_controller.py
sed -n '340,360p' src/brainlayer/enrichment_controller.py
Repository: EtanHey/brainlayer
Length of output: 1528
Add lock to protect prompt-signature emission from race conditions.
_prompt_signature_emitted uses an unsynchronized check-then-set at lines 405–407, and os.write() at line 409 has no error handling. Multiple threads call build_prompt() and build_external_prompt() via ThreadPoolExecutor in both enrichment.py and enrichment_controller.py, so the race can emit duplicate markers, and an unhandled OSError from os.write() would propagate to the caller.
The module already uses threading.Lock() for _groq_rate_lock (line 99); apply the same pattern here with a double-check inside the lock and exception handling for os.write().
Proposed fix
 logger = logging.getLogger(__name__)
 _prompt_signature_emitted = False
+_prompt_signature_lock = threading.Lock()
@@
 def _emit_prompt_signature_once() -> None:
     """Write a single prompt signature line per process for daemon verification."""
     global _prompt_signature_emitted
     if _prompt_signature_emitted:
         return
-    _prompt_signature_emitted = True
-    os.write(
-        2,
-        b"ENRICHMENT_PROMPT_LOADED truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
-    )
+    with _prompt_signature_lock:
+        if _prompt_signature_emitted:
+            return
+        try:
+            os.write(
+                2,
+                b"ENRICHMENT_PROMPT_LOADED truncation=8000 split=4800/3200 rubrics=epistemic_level,debt_impact,sentiment_label\n",
+            )
+        except OSError:
+            logger.debug("Failed to emit ENRICHMENT_PROMPT_LOADED marker", exc_info=True)
+        finally:
+            _prompt_signature_emitted = True
Also applies to call sites: enrichment.py:444, 515
- ON CONFLICT(id): The id is uuid5(etype, name.lower()), so it is deterministic. Conflict should be on id, not (entity_type, name), so retries idempotently UPDATE the same row instead of failing with PRIMARY KEY violation. Fixes UNIQUE constraint errors during enrichment runs.
- ruff format: test_enrichment_v2.py line wrapping per --check requirements.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Add `META_NOISE_PATTERNS` to exclude literal MCP/meta-noise chunk content from `hybrid_search` by default, with explicit opt-out support
- Detect `brain_search(...)`/`brain_entity(...)` tool invocations, setting low importance and adding the `meta-research` tag
Tests
Note
Add meta-noise filter to `hybrid_search` and extend enrichment prompts with meta-research detection
- Adds `filter_meta_noise: bool = True` to `SearchMixin.hybrid_search` in search_repo.py, excluding chunks matching tool-transcript and QA-table patterns via SQL predicates and a post-filter; callers can opt out with `filter_meta_noise=False`.
- Extends `ENRICHMENT_PROMPT` in enrichment.py with rubrics for meta-research detection, short/conversational chunks, `epistemic_level`, `debt_impact`, and `sentiment_label`.
- Upserts entities into `kg_entities` (refreshing `updated_at` on conflict) and inserts entity-to-chunk links into `kg_entity_chunks`, ignoring duplicates; failures are logged and do not interrupt enrichment.
- Emits stderr markers from `build_prompt`/`build_external_prompt` and `_emit_enrichment_start` (realtime mode); write failures are swallowed and logged at debug.
- `hybrid_search` excludes meta-noise chunks by default, which may reduce result counts for callers that previously received tool-transcript content.
Macroscope summarized bcd32e0.
Summary by CodeRabbit
New Features
Bug Fixes
Tests