Skip to content

feat: BGE-M3 re-embedding script + enrichment throughput boost#211

Merged
EtanHey merged 3 commits into
mainfrom
feat/bgem3-reembed-and-enrichment-boost
Apr 6, 2026
Merged

feat: BGE-M3 re-embedding script + enrichment throughput boost#211
EtanHey merged 3 commits into
mainfrom
feat/bgem3-reembed-and-enrichment-boost

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented Apr 5, 2026

Summary

  • BGE-M3 re-embedding script (scripts/reembed_bgem3.py): re-embed all 301K chunks from bge-large-en-v1.5 to BAAI/bge-m3 for improved multilingual + retrieval quality
    • Resumable via checkpoint file (saves every 1000 chunks)
    • --test flag processes 100 chunks for quick verification
    • Uses APSW + sqlite-vec for production DB, falls back to sqlite3 for test DBs
    • Updates both chunk_vectors (float32) and chunk_vectors_binary (quantized)
    • Verified: 100 chunks on real DB at ~3.3 chunks/s on MPS
  • Enrichment throughput boost: default limit 25→500, since_hours 24→8760
    • With LaunchAgent changes (limit 50→500, StartInterval 3600→600s), backlog clears in ~9h vs 22 days
  • 7 tests: script interface, embedding correctness, checkpoint resume, binary vectors

Test plan

  • 7/7 unit tests pass (isolated test DBs, sqlite3 fallback path)
  • --test flag verified on real 8GB production DB (100 chunks, APSW + sqlite-vec)
  • Embeddings verified: 1024-dim, unit-normalized, non-zero binary vectors
  • Checkpoint resume verified: pre-seeded checkpoint correctly skips processed chunks
  • Lint clean (ruff check + format)
  • Full 301K run overnight after merge

🤖 Generated with Claude Code

Note

Add BGE-M3 re-embedding CLI script and increase enrich_realtime default throughput

  • Adds reembed_bgem3.py, a standalone CLI that loads BAAI/bge-m3 via SentenceTransformers, generates 1024-dim float32 embeddings for all chunks rows, and writes them to chunk_vectors and chunk_vectors_binary in SQLite.
  • Supports resumable processing via a JSON checkpoint file, batch processing, and retry with exponential backoff on SQLITE_BUSY.
  • Uses APSW + sqlite-vec (vec_quantize_binary) when available, falling back to stdlib sqlite3 with a pure-Python sign-bit quantizer for binary embeddings.
  • Increases enrich_realtime defaults in enrichment_controller.py from limit=25, since_hours=24 to limit=500, since_hours=8760.
  • Behavioral Change: enrich_realtime called without explicit arguments now processes up to 500 candidates over the last year instead of 25 over the last day.

Macroscope summarized 6218941.

Summary by CodeRabbit

  • New Features

    • Added a CLI to re-run embeddings with BGE-M3, updating both float and binary vector stores and supporting resumable checkpoints.
  • Improvements

    • Increased default enrichment scope: per-run limit raised to 500 and historical lookback extended to ~1 year.
  • Tests

    • Added end-to-end tests for the re-embedding CLI: help/test flags, progress output, checkpoint resumability, and vector updates.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 5, 2026

📝 Walkthrough

Walkthrough

Adds a new CLI script to re-embed chunks with the BAAI/bge-m3 model (float + binary vectors, JSON checkpointed resumability), raises enrich_realtime() defaults for broader selection, and adds tests for the script and adjusted defaults.

Changes

Cohort / File(s) Summary
BGE‑M3 Re-embedding Script
scripts/reembed_bgem3.py
New CLI script that re-embeds rows from chunks using BAAI/bge-m3, writes float32 embeddings to chunk_vectors and quantized binary blobs to chunk_vectors_binary, supports APSW+sqlite-vec or sqlite3 fallback quantization, processes in batches with retries, and maintains a JSON checkpoint for resumability.
Re-embedding Test Suite
tests/test_reembed_bgem3.py
New tests creating a temporary SQLite DB and seed vectors, validate script presence/import and CLI flags (--test, --db, --batch-size, --checkpoint-every), run the script in limited/test mode, assert checkpoint creation and resumability, and confirm binary vectors are regenerated (128 bytes, non-zero).
Enrichment Controller Parameters
src/brainlayer/enrichment_controller.py
Adjusted enrich_realtime() defaults: limit 25 → 500 and since_hours 24 → 8760; minor reformatting of RATE_LIMITS["realtime"] environment parsing without behavior change.
Tests Updated
tests/test_enrichment_controller.py
Updated a single test assertion to expect limit=500 and since_hours=8760 when calling store.get_enrichment_candidates.

Sequence Diagram

sequenceDiagram
    participant User as "User"
    participant Script as "Reembed Script"
    participant Model as "BAAI/bge‑m3 Model"
    participant DB as "SQLite DB"
    participant Checkpoint as "Checkpoint File"

    User->>Script: invoke CLI (--db, --test, --checkpoint-file, ...)
    Script->>Checkpoint: load or init checkpoint (processed_ids)
    Script->>DB: query candidate `chunks` (ORDER BY id, optional LIMIT)
    DB-->>Script: return chunk rows

    loop per batch
        Script->>Model: request embeddings for batch
        Model-->>Script: return float vectors
        Script->>Script: quantize to binary (sqlite‑vec or fallback)
        Script->>DB: delete existing vector rows for batch
        Script->>DB: insert float vectors into `chunk_vectors`
        Script->>DB: insert binary blobs into `chunk_vectors_binary`
        DB-->>Script: commit or retry on busy
        Script->>Checkpoint: append processed_ids and persist periodically
        Script->>User: emit progress, ETA
    end

    Script->>Checkpoint: save final checkpoint
    Script-->>User: exit with summary
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble bytes and stitch each chunk, 🐇
BGE‑M3 hums while vectors thunk,
Checkpoints mark the hops I made,
Tests keep watch so none will fade,
New embeddings spring from trunk.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the two main changes: introducing a BGE-M3 re-embedding script and increasing enrichment throughput defaults.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/bgem3-reembed-and-enrichment-boost

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread scripts/reembed_bgem3.py
Comment thread scripts/reembed_bgem3.py Outdated
[
sys.executable,
"-c",
f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low tests/test_reembed_bgem3.py:49

test_script_imports claims to check for syntax errors, but spec_from_file_location followed by module_from_spec only creates a module object—it never parses or compiles the file. Without calling spec.loader.exec_module(mod), the test passes even if reembed_bgem3.py contains invalid Python syntax. Consider adding spec.loader.exec_module(mod) to actually validate the file compiles, or update the comment to reflect what the test actually verifies.

Suggested change
f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)",
f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)",
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file tests/test_reembed_bgem3.py around line 49:

`test_script_imports` claims to check for syntax errors, but `spec_from_file_location` followed by `module_from_spec` only creates a module object—it never parses or compiles the file. Without calling `spec.loader.exec_module(mod)`, the test passes even if `reembed_bgem3.py` contains invalid Python syntax. Consider adding `spec.loader.exec_module(mod)` to actually validate the file compiles, or update the comment to reflect what the test actually verifies.

Evidence trail:
tests/test_reembed_bgem3.py lines 43-57 (REVIEWED_COMMIT) - shows the test code with `spec_from_file_location` and `module_from_spec` but no `exec_module` call, plus the comment claiming syntax validation. Python docs at https://docs.python.org/3/library/importlib.html ("Importing a source file directly" example) confirms that `exec_module(module)` is required to actually parse/compile the module code.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 110-112: The finally block currently only closes sqlite3
connections (conn.close()); add explicit closing for APSW by checking backend ==
"apsw" and calling the APSW connection's close method (e.g., apsw_conn.close()
or conn.close() if APSW connection is stored in conn) to ensure the APSW DB
handle is released and the lock cleared; update the finally block in
scripts/reembed_bgem3.py to close the APSW connection using the correct variable
(apsw_conn or conn) when backend == "apsw".
- Around line 193-196: The finally block in reembed_batch isn't closing APSW
connections, so add explicit conn.close() when backend == "apsw" (similar to
get_chunks_to_process) to release DB locks; update the finally clause in the
reembed_batch function to close conn for both "sqlite3" and "apsw" backends (or
call conn.close() unconditionally if safe) and ensure any variable names (conn)
match the ones used in the function.
- Around line 144-176: The bulk DB update loop (the for batch_start ...
generating embeddings and calling conn.execute and commit) must not run while
enrichment workers may be writing and must handle SQLITE_BUSY retries; add a
pre-run interactive safety check (e.g., check_safe_to_run()) and call a WAL
checkpoint (execute "PRAGMA wal_checkpoint(TRUNCATE)" on conn) before entering
the batch loop, then wrap DB operations in a retry wrapper that catches
sqlite3.OperationalError with "database is locked"/SQLITE_BUSY, uses exponential
backoff and a max attempt count around the conn.execute calls and the commit
(affecting the inner loop that uses serialize_f32 + conn.execute and the final
commit guarded by backend == "sqlite3"), and only raise after exhausting
retries. Ensure the retry wrapper is applied to the block that inserts into
chunk_vectors and chunk_vectors_binary as well as the commit to avoid transient
SQLITE_BUSY failures.
- Around line 65-71: The current except clause `except (ImportError, Exception)`
is too broad and may hide real errors; change it to only catch ImportError when
falling back to the stdlib `sqlite3` (i.e., use `except ImportError:`), and for
other exceptions log the exception details (via `logging.exception` or `print`
to stderr) and re-raise or handle appropriately; update the block around the
`sqlite3.connect(db_path)` / `conn` creation so you still set PRAGMA
journal_mode and busy_timeout but do not swallow non-import-related errors.
- Around line 27-28: Replace the hardcoded DEFAULT_DB Path with the project's
path resolver by importing get_db_path from paths.py and assigning DEFAULT_DB =
get_db_path() (or Path(get_db_path()) if it returns a string); update any use of
DEFAULT_DB in scripts/reembed_bgem3.py accordingly and remove the hardcoded
Path.home() construction so the script uses paths.get_db_path() consistently.

In `@tests/test_reembed_bgem3.py`:
- Around line 175-188: The subprocess.run invocation currently discards the
completion status; capture its result (e.g., assign to a variable like result)
when invoking the script with SCRIPT_PATH, db_path and checkpoint_path and
immediately assert result.returncode == 0 before proceeding to any database
state checks (same pattern used in test_checkpoint_created) so the test fails
fast if the script errored.
- Around line 108-121: The test invokes subprocess.run but doesn't capture or
assert the process exit status, which can leave flaky checkpoints; change the
call to assign the result (e.g., result = subprocess.run(...)) and then add an
assertion like assert result.returncode == 0 to ensure the script (invoked via
sys.executable and SCRIPT_PATH with --db set to db_path and --checkpoint-file
set to checkpoint_path) completed successfully; keep capture_output=True and
timeout as-is so you can still inspect result.stdout/result.stderr on failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 214a43d1-016d-4434-9c70-5d11e0b96f78

📥 Commits

Reviewing files that changed from the base of the PR and between 5ecc318 and 5fcf21d.

📒 Files selected for processing (3)
  • scripts/reembed_bgem3.py
  • src/brainlayer/enrichment_controller.py
  • tests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/enrichment_controller.py
  • scripts/reembed_bgem3.py
  • tests/test_reembed_bgem3.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback via enrichment_controller.py, configurable via BRAINLAYER_ENRICH_BACKEND environment variable
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM)

Files:

  • src/brainlayer/enrichment_controller.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_reembed_bgem3.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/enrichment_controller.py
🔇 Additional comments (6)
scripts/reembed_bgem3.py (2)

200-257: LGTM on the main function structure.

Good design with checkpoint-based resumability, progress reporting, and clear CLI interface. The final save_checkpoint call at line 257 ensures state is persisted even if the last batch didn't trigger a periodic save.


36-50: No issues found — quantize_binary() bit ordering matches sqlite-vec's vec_quantize_binary().

The fallback implementation uses byte_val |= 1 << j to set bit j for the j-th float's sign, which correctly implements LSB-first packing (element index N → bit position N). This matches sqlite-vec's behavior exactly, so binary vector search results are consistent whether quantized via the fallback or via the sqlite-vec extension.

tests/test_reembed_bgem3.py (3)

12-34: LGTM on test fixture.

The _create_test_db helper correctly sets up a minimal database with zero-initialized embeddings, enabling verification that re-embedding actually updates the values.


40-71: LGTM on CLI interface tests.

Good coverage of script existence, syntax validation (without loading heavy dependencies), and help flag verification.


132-168: LGTM on resumability test.

Properly seeds a partial checkpoint and verifies the script:

  1. Recognizes and reports skipped chunks
  2. Processes only remaining chunks
  3. Updates checkpoint to include all processed IDs
src/brainlayer/enrichment_controller.py (1)

385-386: No issues found. The parameter defaults change (limit: 25→500, since_hours: 24→8760) in enrich_realtime() is intentional per commit 5fcf21d ("BGE-M3 re-embedding script + enrichment throughput boost"). The commit explicitly documents this as a targeted backlog-clearing optimization: a backlog of 27K unenriched chunks clears in ~9 hours instead of 22 days. While _brain_digest in store_handler.py:49 will inherit the new since_hours default when not explicitly passed, this is the intended behavior and requires no changes.

Comment thread scripts/reembed_bgem3.py Outdated
Comment thread scripts/reembed_bgem3.py Outdated
Comment thread scripts/reembed_bgem3.py Outdated
Comment thread scripts/reembed_bgem3.py Outdated
Comment thread scripts/reembed_bgem3.py
Comment thread tests/test_reembed_bgem3.py Outdated
Comment thread tests/test_reembed_bgem3.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
scripts/reembed_bgem3.py (1)

255-259: 🧹 Nitpick | 🔵 Trivial

Add a warning about concurrent enrichment workers.

The script performs bulk writes that can conflict with enrichment daemon operations, causing WAL bloat or lock contention. Consider adding a startup warning:

     print(f"BGE-M3 Re-embedding {'(TEST MODE — 100 chunks)' if args.test else ''}")
+    print("⚠️  Ensure enrichment workers are stopped before running on production DB.")
+    print("   Run: launchctl unload ~/Library/LaunchAgents/com.brainlayer.enrich.plist")
     print(f"  DB: {db_path}")

Based on learnings: "Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 255 - 259, Add a startup warning to
the printed startup info near the existing prints (the block that prints BGE-M3
Re-embedding, DB: db_path, Checkpoint: checkpoint_path, Batch size:
args.batch_size). The warning should explicitly tell users to stop enrichment
workers and checkpoint the WAL before running this bulk re-embedding (e.g.,
"WARNING: stop enrichment workers and checkpoint WAL before running – bulk
writes can conflict with enrichment daemons and cause WAL bloat or lock
contention"). Place this message unconditionally (or visible in test and normal
runs) so it appears when the script starts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 169-170: Replace the non-blocking WAL checkpoint call
conn.execute("PRAGMA wal_checkpoint(PASSIVE)") with a blocking checkpoint such
as conn.execute("PRAGMA wal_checkpoint(FULL)") (or TRUNCATE) to ensure the WAL
is fully checkpointed before the bulk reembed work; locate the PRAGMA
wal_checkpoint invocation in scripts/reembed_bgem3.py and update that call
accordingly, and if applicable add brief comments or surrounding logic to
stop/ensure enrichment workers are halted before invoking the FULL checkpoint.
- Around line 186-209: The batch writer (_write_batch) must run inside an
explicit transaction so APSW's autocommit doesn't partially persist rows if a
retry occurs; wrap the body of _write_batch in a transaction by issuing a BEGIN
(e.g., conn.execute("BEGIN")) before the loop and COMMIT after the loop, and on
any exception ensure you ROLLBACK (conn.execute("ROLLBACK")) and re-raise so
_retry_on_busy can retry; keep the existing sqlite3-only conn.commit()
conditional or remove it if you always use explicit BEGIN/COMMIT for both
backends, and leave the logic using ids, embeddings, serialize_f32,
quantize_binary, use_vec_quantize, and chunk_vectors / chunk_vectors_binary
unchanged.

In `@src/brainlayer/enrichment_controller.py`:
- Around line 40-42: The enrichment rate fallback currently uses "5.0" for the
BRAINLAYER_ENRICH_RATE env var which is incorrect; update the default value used
in the "realtime" dict entry to "0.2" (so
float(os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2"))) and adjust the inline
comment to reflect the 0.2 RPS (12 RPM) default; ensure the change is made where
the "realtime" key is defined in enrichment_controller.py so behavior reverts to
the intended default.
- Around line 387-388: The default for enrich_realtime currently expanding to
since_hours=8760 is too broad; change the enrich_realtime signature to a much
smaller safe default (e.g., since_hours=24) and/or require callers to pass an
explicit since_hours; then update src/brainlayer/mcp/store_handler.py so its
call enrich_realtime(store=store, limit=limit) supplies an explicit since_hours
(or wraps the call in the MCP one-write-at-a-time lock) to avoid inheriting a
1-year window and to enforce the concurrency constraint around write-heavy work
such as brain_digest.

---

Duplicate comments:
In `@scripts/reembed_bgem3.py`:
- Around line 255-259: Add a startup warning to the printed startup info near
the existing prints (the block that prints BGE-M3 Re-embedding, DB: db_path,
Checkpoint: checkpoint_path, Batch size: args.batch_size). The warning should
explicitly tell users to stop enrichment workers and checkpoint the WAL before
running this bulk re-embedding (e.g., "WARNING: stop enrichment workers and
checkpoint WAL before running – bulk writes can conflict with enrichment daemons
and cause WAL bloat or lock contention"). Place this message unconditionally (or
visible in test and normal runs) so it appears when the script starts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e1c12e95-d5f5-450c-a5d7-8bf618397771

📥 Commits

Reviewing files that changed from the base of the PR and between 5fcf21d and 34337a7.

📒 Files selected for processing (3)
  • scripts/reembed_bgem3.py
  • src/brainlayer/enrichment_controller.py
  • tests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/enrichment_controller.py
  • scripts/reembed_bgem3.py
  • tests/test_reembed_bgem3.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback via enrichment_controller.py, configurable via BRAINLAYER_ENRICH_BACKEND environment variable
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM)

Files:

  • src/brainlayer/enrichment_controller.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_reembed_bgem3.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : Store and resolve database path using `paths.py:get_db_path()` — supports environment variable override or defaults to canonical path `~/.local/share/brainlayer/brainlayer.db`

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Never run bulk database operations while enrichment is writing; this causes WAL bloat and potential freeze

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : All database connections must retry on `SQLITE_BUSY`; each worker uses its own connection

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Drop FTS triggers before bulk deletes on `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches

Applied to files:

  • scripts/reembed_bgem3.py
🔇 Additional comments (9)
scripts/reembed_bgem3.py (3)

34-41: LGTM — DB path resolution now uses brainlayer.paths.get_db_path() when available.

This correctly addresses the coding guideline requirement with a graceful fallback for standalone execution.


139-153: LGTM — Retry logic for SQLITE_BUSY with exponential backoff is correctly implemented.

The broad Exception catch combined with string/type checking handles both sqlite3 and APSW busy errors appropriately.


107-124: LGTM — Connection cleanup in finally block is now correctly implemented.

The APSW/sqlite3 connection is properly closed regardless of which backend is used.

tests/test_reembed_bgem3.py (6)

12-34: LGTM — Test fixture correctly uses regular tables for extension-free testing.

The comment at Line 17 clearly documents the intentional deviation from production vec0 schema, and the fallback quantize_binary code path is exercised.


40-56: LGTM — Lightweight syntax validation without loading heavy dependencies.

The importlib.util.spec_from_file_location approach correctly validates script syntax without triggering side effects.


72-101: LGTM — Return code assertion and embedding verification are properly implemented.

Past review feedback regarding result.returncode == 0 assertion has been addressed.


103-131: LGTM — Checkpoint validation test correctly asserts script success and checkpoint structure.


133-169: LGTM — Resumability test correctly validates checkpoint-based skip behavior.

The pre-seeded checkpoint approach effectively tests that already-processed chunks are skipped on re-run.


171-199: LGTM — Binary vector regeneration test validates length and non-zero content.

The 128-byte assertion correctly corresponds to 1024 bits quantized to bytes.

Comment thread scripts/reembed_bgem3.py
Comment on lines +169 to +170
# WAL checkpoint before bulk work
conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Consider using PRAGMA wal_checkpoint(FULL) instead of PASSIVE.

PASSIVE is non-blocking but may leave WAL data uncheckpointed if readers hold the database. For a bulk operation processing ~301K chunks, FULL (or TRUNCATE) ensures complete WAL checkpoint before starting, reducing WAL bloat risk during the long run.

     # WAL checkpoint before bulk work
-    conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
+    conn.execute("PRAGMA wal_checkpoint(FULL)")

Based on learnings: "Before bulk database operations: stop enrichment workers, checkpoint WAL with PRAGMA wal_checkpoint(FULL)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 169 - 170, Replace the non-blocking
WAL checkpoint call conn.execute("PRAGMA wal_checkpoint(PASSIVE)") with a
blocking checkpoint such as conn.execute("PRAGMA wal_checkpoint(FULL)") (or
TRUNCATE) to ensure the WAL is fully checkpointed before the bulk reembed work;
locate the PRAGMA wal_checkpoint invocation in scripts/reembed_bgem3.py and
update that call accordingly, and if applicable add brief comments or
surrounding logic to stop/ensure enrichment workers are halted before invoking
the FULL checkpoint.

Comment thread scripts/reembed_bgem3.py
Comment on lines +186 to +209
def _write_batch():
for chunk_id, embedding in zip(ids, embeddings):
emb_bytes = serialize_f32(embedding.tolist())

conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
conn.execute(
"INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
(chunk_id, emb_bytes),
)
conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,))
if use_vec_quantize:
conn.execute(
"INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))",
(chunk_id, emb_bytes),
)
else:
conn.execute(
"INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)",
(chunk_id, quantize_binary(emb_bytes)),
)
if backend == "sqlite3":
conn.commit()

_retry_on_busy(_write_batch)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Transaction inconsistency between APSW and sqlite3 backends may cause partial writes.

With APSW (autocommit by default), each conn.execute() commits immediately. If the batch fails mid-iteration and retries, some chunks will already be committed while others aren't, but the checkpoint will record all IDs as processed.

The sqlite3 backend accumulates changes until commit(), so retries work correctly.

Proposed fix — wrap batch in explicit transaction
         def _write_batch():
+            if backend == "apsw":
+                conn.execute("BEGIN IMMEDIATE")
             for chunk_id, embedding in zip(ids, embeddings):
                 emb_bytes = serialize_f32(embedding.tolist())

                 conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
                 conn.execute(
                     "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
                     (chunk_id, emb_bytes),
                 )
                 conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,))
                 if use_vec_quantize:
                     conn.execute(
                         "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))",
                         (chunk_id, emb_bytes),
                     )
                 else:
                     conn.execute(
                         "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)",
                         (chunk_id, quantize_binary(emb_bytes)),
                     )
-            if backend == "sqlite3":
-                conn.commit()
+            conn.execute("COMMIT") if backend == "apsw" else conn.commit()

         _retry_on_busy(_write_batch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 186 - 209, The batch writer
(_write_batch) must run inside an explicit transaction so APSW's autocommit
doesn't partially persist rows if a retry occurs; wrap the body of _write_batch
in a transaction by issuing a BEGIN (e.g., conn.execute("BEGIN")) before the
loop and COMMIT after the loop, and on any exception ensure you ROLLBACK
(conn.execute("ROLLBACK")) and re-raise so _retry_on_busy can retry; keep the
existing sqlite3-only conn.commit() conditional or remove it if you always use
explicit BEGIN/COMMIT for both backends, and leave the logic using ids,
embeddings, serialize_f32, quantize_binary, use_vec_quantize, and chunk_vectors
/ chunk_vectors_binary unchanged.

Comment on lines +40 to +42
"realtime": float(
os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")
), # 300 RPM default (AI Pro verified 500+ RPM Apr 2026)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Restore the default enrichment fallback to 0.2 RPS.

At Line 41, the fallback "5.0" (300 RPM) conflicts with the project’s configured default and materially changes cost/rate behavior when BRAINLAYER_ENRICH_RATE is unset.

Proposed fix
 RATE_LIMITS = {
     "realtime": float(
-        os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")
-    ),  # 300 RPM default (AI Pro verified 500+ RPM Apr 2026)
+        os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2")
+    ),  # 12 RPM default; override via BRAINLAYER_ENRICH_RATE

As per coding guidelines: src/brainlayer/*enrichment*.py: Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/enrichment_controller.py` around lines 40 - 42, The enrichment
rate fallback currently uses "5.0" for the BRAINLAYER_ENRICH_RATE env var which
is incorrect; update the default value used in the "realtime" dict entry to
"0.2" (so float(os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2"))) and adjust the
inline comment to reflect the 0.2 RPS (12 RPM) default; ensure the change is
made where the "realtime" key is defined in enrichment_controller.py so behavior
reverts to the intended default.

Comment on lines +387 to +388
limit: int = 500,
since_hours: int = 8760,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The new realtime defaults broaden MCP enrich scope too aggressively.

At Line 388, since_hours=8760 is now the implicit default. src/brainlayer/mcp/store_handler.py calls enrich_realtime(store=store, limit=limit) without since_hours, so routine MCP enrich requests inherit a 1-year window, increasing DB contention risk on write-heavy paths.

Safer default patch
 def enrich_realtime(
     store,
-    limit: int = 500,
-    since_hours: int = 8760,
+    limit: int = 25,
+    since_hours: int = 24,
     rate_per_second: float | None = None,

As per coding guidelines: **/*.py: Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
limit: int = 500,
since_hours: int = 8760,
limit: int = 25,
since_hours: int = 24,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/enrichment_controller.py` around lines 387 - 388, The default
for enrich_realtime currently expanding to since_hours=8760 is too broad; change
the enrich_realtime signature to a much smaller safe default (e.g.,
since_hours=24) and/or require callers to pass an explicit since_hours; then
update src/brainlayer/mcp/store_handler.py so its call
enrich_realtime(store=store, limit=limit) supplies an explicit since_hours (or
wraps the call in the MCP one-write-at-a-time lock) to avoid inheriting a 1-year
window and to enforce the concurrency constraint around write-heavy work such as
brain_digest.

EtanHey and others added 3 commits April 6, 2026 04:09
- Add scripts/reembed_bgem3.py: re-embed all chunks with BAAI/bge-m3
  - Resumable via checkpoint file (saves every 1000 chunks)
  - --test flag for 100-chunk verification run
  - Uses APSW + sqlite-vec for real DB, falls back to sqlite3 for tests
  - Updates both chunk_vectors (float) and chunk_vectors_binary (quantized)
  - Verified: 100 chunks on real DB at ~3.3 chunks/s on MPS

- Bump enrichment defaults: limit 25→500, since_hours 24→8760
  - Matches LaunchAgent plist changes (limit 50→500, interval 3600→600s)
  - Backlog of 27K unenriched chunks clears in ~9 hours vs 22 days

- 7 tests covering: script interface, embedding correctness, checkpoint
  resume, binary vector generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…st asserts

- Close APSW connections in finally blocks (not just sqlite3)
- Use brainlayer.paths.get_db_path() instead of hardcoded path
- Narrow except clause to ImportError only (was catching all exceptions)
- Add SQLITE_BUSY retry with exponential backoff for batch writes
- WAL checkpoint before bulk operations
- Assert subprocess returncode in test_checkpoint_created and test_binary_vectors_updated
- ruff format enrichment_controller.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nce_hours=8760)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EtanHey EtanHey force-pushed the feat/bgem3-reembed-and-enrichment-boost branch from 67e730d to 6218941 Compare April 6, 2026 01:09
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (5)
src/brainlayer/enrichment_controller.py (2)

40-42: ⚠️ Potential issue | 🟠 Major

Rate limit default violates coding guidelines.

The fallback "5.0" (300 RPM) conflicts with the documented project default of 0.2 (12 RPM). This materially changes cost and rate behavior when BRAINLAYER_ENRICH_RATE is unset.

Proposed fix
 RATE_LIMITS = {
     "realtime": float(
-        os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")
-    ),  # 300 RPM default (AI Pro verified 500+ RPM Apr 2026)
+        os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2")
+    ),  # 12 RPM default; override via BRAINLAYER_ENRICH_RATE

As per coding guidelines: src/brainlayer/*enrichment*.py: Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/enrichment_controller.py` around lines 40 - 42, The default
enrichment rate fallback is incorrect: change the env var fallback from "5.0" to
"0.2" for BRAINLAYER_ENRICH_RATE used when building the "realtime" value in
enrichment_controller.py (look for the "realtime":
float(os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")) expression) so the
default is 0.2 (12 RPM); also update or remove the misleading inline comment
about 300 RPM to reflect the new default.

385-392: ⚠️ Potential issue | 🟠 Major

Aggressive defaults silently affect MCP callers.

src/brainlayer/mcp/store_handler.py calls enrich_realtime(store=store, limit=limit) without since_hours, so it will silently inherit since_hours=8760 (1 year). This broadens the enrichment window from 24 hours to 365 days for all MCP mode=enrich requests, increasing DB contention on write-heavy paths.

Consider one of these approaches:

  1. Keep conservative defaults (limit=25, since_hours=24) and require explicit parameters for broader scans
  2. Update store_handler.py to pass explicit since_hours=24 to preserve existing MCP behavior
  3. Document this as a breaking change if intentional
Option 2: Preserve MCP behavior

In src/brainlayer/mcp/store_handler.py:

-            enrich_result = await loop.run_in_executor(None, lambda: enrich_realtime(store=store, limit=limit))
+            enrich_result = await loop.run_in_executor(None, lambda: enrich_realtime(store=store, limit=limit, since_hours=24))

As per coding guidelines: Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/enrichment_controller.py` around lines 385 - 392, The
enrich_realtime function currently uses aggressive defaults (limit=500,
since_hours=8760) which cause callers like the MCP handler (store_handler
calling enrich_realtime(store, limit=limit)) to unintentionally scan a year; to
fix either (A) change enrich_realtime signature defaults to conservative values
(e.g., limit=25, since_hours=24) so callers get safe behavior by default, or (B)
update the MCP caller that invokes enrich_realtime to pass an explicit
since_hours=24 (and adjust limit if needed) to preserve existing MCP behavior;
while making this change, ensure the MCP one-write-at-a-time concurrency
constraint around brain_digest is respected so enrichment does not run in
parallel with other write-heavy MCP work.
scripts/reembed_bgem3.py (3)

186-209: ⚠️ Potential issue | 🟠 Major

Wrap each batch write in an explicit transaction (especially for APSW).

The retry wrapper re-runs _write_batch, but Line 186-209 has no explicit transaction boundary. A mid-batch failure can leave partially-applied row changes before retry, which weakens consistency guarantees.

Suggested fix
             def _write_batch():
-                for chunk_id, embedding in zip(ids, embeddings):
-                    emb_bytes = serialize_f32(embedding.tolist())
+                try:
+                    conn.execute("BEGIN IMMEDIATE")
+                    for chunk_id, embedding in zip(ids, embeddings):
+                        emb_bytes = serialize_f32(embedding.tolist())
 
-                    conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
-                    conn.execute(
-                        "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
-                        (chunk_id, emb_bytes),
-                    )
-                    conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,))
-                    if use_vec_quantize:
+                        conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
                         conn.execute(
-                            "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))",
+                            "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
                             (chunk_id, emb_bytes),
                         )
-                    else:
+                        conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,))
+                        if use_vec_quantize:
+                            conn.execute(
+                                "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))",
+                                (chunk_id, emb_bytes),
+                            )
+                        else:
+                            conn.execute(
+                                "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)",
+                                (chunk_id, quantize_binary(emb_bytes)),
+                            )
+                    conn.execute("COMMIT")
+                except Exception:
+                    conn.execute("ROLLBACK")
+                    raise
-                        conn.execute(
-                            "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)",
-                            (chunk_id, quantize_binary(emb_bytes)),
-                        )
-                if backend == "sqlite3":
-                    conn.commit()
#!/bin/bash
set -euo pipefail
# Verify whether batch writer has explicit transaction boundaries around the loop.
rg -n -C3 'def _write_batch|BEGIN|COMMIT|ROLLBACK|_retry_on_busy' scripts/reembed_bgem3.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 186 - 209, The batch writer
_write_batch currently performs multiple conn.execute calls without an explicit
transaction boundary, so wrap the entire loop in a BEGIN/COMMIT (and ROLLBACK on
exception) transaction so retries via _retry_on_busy don't leave
partially-applied rows; specifically, at the start of _write_batch call
conn.execute("BEGIN") (or appropriate begin_transaction on APSW), run the loop
of conn.execute(...) calls, then conn.execute("COMMIT") on success and
conn.execute("ROLLBACK") in an exception handler before re-raising, and remove
the per-backend conn.commit() inside the loop so commit occurs once per batch.

255-259: ⚠️ Potential issue | 🟠 Major

Enforce a single-writer safety gate before starting bulk re-embedding.

This script is a write-heavy bulk job, but startup currently does not enforce/confirm the one-write-at-a-time constraint with enrichment/other MCP writers. Add an explicit guard (or required --force-single-writer acknowledgment) before processing.

As per coding guidelines: "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 255 - 259, Startup does not enforce
the one-write-at-a-time constraint before the bulk re-embed run; add a required
acknowledgment flag and an explicit single-writer gate: add a
--force-single-writer boolean CLI arg (alongside existing args.* like args.test
and args.batch_size) and, in the script's main/startup path before processing
(the area printing DB: db_path and Checkpoint: checkpoint_path), check for
either a held lock (e.g., create/try-acquire a sentinel lock file or DB advisory
lock) or require the presence of args.force_single_writer=True, otherwise abort
with a clear message instructing the user to rerun with --force-single-writer;
ensure the lock is released on exit/failure so other MCP writers can resume.

169-170: ⚠️ Potential issue | 🟠 Major

Use a blocking WAL checkpoint before this bulk write pass.

Line 170 uses PASSIVE, which can leave WAL pages uncheckpointed during long bulk updates; this increases lock pressure and WAL growth risk.

Suggested fix
-    conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
+    conn.execute("PRAGMA wal_checkpoint(FULL)")

Based on learnings: "Before bulk database operations: stop enrichment workers, checkpoint WAL with PRAGMA wal_checkpoint(FULL)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 169 - 170, The WAL checkpoint
currently uses PASSIVE which can leave pages uncheckpointed during long bulk
updates; change the call in the bulk-write preparation to perform a blocking
checkpoint by invoking conn.execute("PRAGMA wal_checkpoint(FULL)") instead of
PASSIVE, and ensure any enrichment workers are stopped before this checkpoint so
the FULL checkpoint can complete without interference (look for the conn.execute
call and surrounding bulk-write preparation logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 237-243: The CLI currently allows zero or negative values for
--batch-size and --checkpoint-every which can cause range/ZeroDivisionError;
update the parser.add_argument calls that define "--batch-size" and
"--checkpoint-every" to validate strictly positive integers at parse time (e.g.,
provide a custom type/validator function or use argparse's choices to enforce
value > 0) so parsing fails with a clear error message if a non-positive value
is supplied; ensure the validator is referenced in the same parser.add_argument
calls that currently set default=8 and default=1000 respectively.

In `@tests/test_reembed_bgem3.py`:
- Around line 43-56: The test_script_imports currently only creates a module
spec and module object but never executes it, so import-time errors are missed;
update the subprocess.run "-c" snippet used by test_script_imports to call
spec.loader.exec_module(mod) (and guard that spec and spec.loader are present)
so the module at SCRIPT_PATH is actually executed during the test; ensure the
invocation still captures output via subprocess.run to surface any import
errors.
- Around line 72-190: The tests repeatedly invoke the end-to-end script via
subprocess (using SCRIPT_PATH) which is slow and flaky; create a shared pytest
fixture (e.g., run_reembed_once) that sets up _create_test_db and
checkpoint_path, runs subprocess once with the "--test" and "--checkpoint-file"
args, captures result and stdout/stderr, reads/returns the DB connection and
checkpoint JSON; then refactor test_test_flag_runs_limited_chunks,
test_checkpoint_created, test_resumable, and test_binary_vectors_updated to
accept that fixture and use its returned items (result, output, db_path/conn,
checkpoint data) instead of calling subprocess in each test so the heavy run is
executed only once. Ensure the fixture exposes processed_ids and parsed JSON to
verify assertions about model, processed count and resumed behavior.

---

Duplicate comments:
In `@scripts/reembed_bgem3.py`:
- Around line 186-209: The batch writer _write_batch currently performs multiple
conn.execute calls without an explicit transaction boundary, so wrap the entire
loop in a BEGIN/COMMIT (and ROLLBACK on exception) transaction so retries via
_retry_on_busy don't leave partially-applied rows; specifically, at the start of
_write_batch call conn.execute("BEGIN") (or appropriate begin_transaction on
APSW), run the loop of conn.execute(...) calls, then conn.execute("COMMIT") on
success and conn.execute("ROLLBACK") in an exception handler before re-raising,
and remove the per-backend conn.commit() inside the loop so commit occurs once
per batch.
- Around line 255-259: Startup does not enforce the one-write-at-a-time
constraint before the bulk re-embed run; add a required acknowledgment flag and
an explicit single-writer gate: add a --force-single-writer boolean CLI arg
(alongside existing args.* like args.test and args.batch_size) and, in the
script's main/startup path before processing (the area printing DB: db_path and
Checkpoint: checkpoint_path), check for either a held lock (e.g.,
create/try-acquire a sentinel lock file or DB advisory lock) or require the
presence of args.force_single_writer=True, otherwise abort with a clear message
instructing the user to rerun with --force-single-writer; ensure the lock is
released on exit/failure so other MCP writers can resume.
- Around line 169-170: The WAL checkpoint currently uses PASSIVE which can leave
pages uncheckpointed during long bulk updates; change the call in the bulk-write
preparation to perform a blocking checkpoint by invoking conn.execute("PRAGMA
wal_checkpoint(FULL)") instead of PASSIVE, and ensure any enrichment workers are
stopped before this checkpoint so the FULL checkpoint can complete without
interference (look for the conn.execute call and surrounding bulk-write
preparation logic).

In `@src/brainlayer/enrichment_controller.py`:
- Around line 40-42: The default enrichment rate fallback is incorrect: change
the env var fallback from "5.0" to "0.2" for BRAINLAYER_ENRICH_RATE used when
building the "realtime" value in enrichment_controller.py (look for the
"realtime": float(os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")) expression)
so the default is 0.2 (12 RPM); also update or remove the misleading inline
comment about 300 RPM to reflect the new default.
- Around line 385-392: The enrich_realtime function currently uses aggressive
defaults (limit=500, since_hours=8760) which cause callers like the MCP handler
(store_handler calling enrich_realtime(store, limit=limit)) to unintentionally
scan a year; to fix either (A) change enrich_realtime signature defaults to
conservative values (e.g., limit=25, since_hours=24) so callers get safe
behavior by default, or (B) update the MCP caller that invokes enrich_realtime
to pass an explicit since_hours=24 (and adjust limit if needed) to preserve
existing MCP behavior; while making this change, ensure the MCP
one-write-at-a-time concurrency constraint around brain_digest is respected so
enrichment does not run in parallel with other write-heavy MCP work.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0746b000-2cd1-426b-8e0a-22bd992129c6

📥 Commits

Reviewing files that changed from the base of the PR and between 67e730d and 6218941.

📒 Files selected for processing (4)
  • scripts/reembed_bgem3.py
  • src/brainlayer/enrichment_controller.py
  • tests/test_enrichment_controller.py
  • tests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • tests/test_enrichment_controller.py
  • src/brainlayer/enrichment_controller.py
  • scripts/reembed_bgem3.py
  • tests/test_reembed_bgem3.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_enrichment_controller.py
  • tests/test_reembed_bgem3.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback via enrichment_controller.py, configurable via BRAINLAYER_ENRICH_BACKEND environment variable
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM)

Files:

  • src/brainlayer/enrichment_controller.py
🧠 Learnings (23)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 embeddings model in `embeddings.py` for text vectorization
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 for text embeddings
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment backend priority: Groq (primary/cloud) → Gemini (fallback) → Ollama (offline last-resort), configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work

Applied to files:

  • src/brainlayer/enrichment_controller.py
  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 embeddings model in `embeddings.py` for text vectorization

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 for text embeddings

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : Store and resolve database path using `paths.py:get_db_path()` — supports environment variable override or defaults to canonical path `~/.local/share/brainlayer/brainlayer.db`

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Never run bulk database operations while enrichment is writing; this causes WAL bloat and potential freeze

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : All database connections must retry on `SQLITE_BUSY`; each worker uses its own connection

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Drop FTS triggers before bulk deletes on `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec storage via APSW in `vector_store.py` for vector operations

Applied to files:

  • scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec with APSW for vector storage and retrieval

Applied to files:

  • scripts/reembed_bgem3.py
🔇 Additional comments (1)
tests/test_enrichment_controller.py (1)

838-838: Test correctly reflects the updated defaults.

The assertion update is consistent with the source code changes in enrichment_controller.py. The test properly validates that chunk_ids are passed through to get_enrichment_candidates along with the new default values.

Note: The appropriateness of these new default values (limit=500, since_hours=8760) is addressed in the review of enrichment_controller.py.

Comment thread scripts/reembed_bgem3.py
Comment on lines +237 to +243
"--batch-size",
type=int,
default=8,
help="Embedding batch size (default: 8, BGE-M3 needs more VRAM than bge-large)",
)
parser.add_argument("--checkpoint-every", type=int, default=1000, help="Save checkpoint every N chunks")
parser.add_argument(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate --batch-size and --checkpoint-every as strictly positive integers.

Line 214 can raise ZeroDivisionError when --checkpoint-every=0, and Line 177 can fail when --batch-size<=0 (range step cannot be zero). Add argparse validation at parse time.

Suggested fix
 def main():
+    def _positive_int(value: str) -> int:
+        ivalue = int(value)
+        if ivalue <= 0:
+            raise argparse.ArgumentTypeError("must be > 0")
+        return ivalue
+
     parser = argparse.ArgumentParser(description="Re-embed chunks with BAAI/bge-m3")
@@
-        type=int,
+        type=_positive_int,
@@
-    parser.add_argument("--checkpoint-every", type=int, default=1000, help="Save checkpoint every N chunks")
+    parser.add_argument("--checkpoint-every", type=_positive_int, default=1000, help="Save checkpoint every N chunks")

Also applies to: 214-214, 177-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/reembed_bgem3.py` around lines 237 - 243, The CLI currently allows
zero or negative values for --batch-size and --checkpoint-every which can cause
range/ZeroDivisionError; update the parser.add_argument calls that define
"--batch-size" and "--checkpoint-every" to validate strictly positive integers
at parse time (e.g., provide a custom type/validator function or use argparse's
choices to enforce value > 0) so parsing fails with a clear error message if a
non-positive value is supplied; ensure the validator is referenced in the same
parser.add_argument calls that currently set default=8 and default=1000
respectively.

Comment on lines +43 to +56
def test_script_imports(self):
"""Script should be importable without side effects."""
result = subprocess.run(
[
sys.executable,
"-c",
f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)",
],
capture_output=True,
text=True,
timeout=30,
)
# Just checks syntax is valid, not full import (which would load torch)
assert result.returncode == 0, f"Script has syntax errors: {result.stderr}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

test_script_imports does not actually import/execute the module.

The current -c snippet only constructs a module object; it never executes spec.loader.exec_module(...), so import-time errors can be missed.

Suggested fix
-                f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)",
+                f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_reembed_bgem3.py` around lines 43 - 56, The test_script_imports
currently only creates a module spec and module object but never executes it, so
import-time errors are missed; update the subprocess.run "-c" snippet used by
test_script_imports to call spec.loader.exec_module(mod) (and guard that spec
and spec.loader are present) so the module at SCRIPT_PATH is actually executed
during the test; ensure the invocation still captures output via subprocess.run
to surface any import errors.

Comment on lines +72 to +190
def test_test_flag_runs_limited_chunks(self, tmp_path):
"""--test flag should process only 100 chunks (or all if fewer)."""
db_path = _create_test_db(tmp_path)
checkpoint_path = tmp_path / "checkpoint.json"

result = subprocess.run(
[
sys.executable,
str(SCRIPT_PATH),
"--db",
str(db_path),
"--test",
"--checkpoint-file",
str(checkpoint_path),
],
capture_output=True,
text=True,
timeout=300,
)
assert result.returncode == 0, f"Script failed: {result.stderr}"
assert "Processed" in result.stdout or "processed" in result.stdout.lower()

# Verify embeddings were updated (not all zeros anymore)
conn = sqlite3.connect(str(db_path))
row = conn.execute("SELECT embedding FROM chunk_vectors WHERE chunk_id = 'chunk_0'").fetchone()
conn.close()
assert row is not None
emb_floats = struct.unpack(f"{1024}f", row[0])
# At least some values should be non-zero after re-embedding
assert any(v != 0.0 for v in emb_floats), "Embeddings should be updated from zeros"

def test_checkpoint_created(self, tmp_path):
"""Script should create a checkpoint file tracking progress."""
db_path = _create_test_db(tmp_path)
checkpoint_path = tmp_path / "checkpoint.json"

result = subprocess.run(
[
sys.executable,
str(SCRIPT_PATH),
"--db",
str(db_path),
"--test",
"--checkpoint-file",
str(checkpoint_path),
],
capture_output=True,
text=True,
timeout=300,
)
assert result.returncode == 0, f"Script failed: {result.stderr}"
assert checkpoint_path.exists(), "Checkpoint file should be created"

import json

data = json.loads(checkpoint_path.read_text())
assert "processed_ids" in data
assert "model" in data
assert data["model"] == "BAAI/bge-m3"
assert len(data["processed_ids"]) == 5 # all 5 test chunks

def test_resumable(self, tmp_path):
"""Script should skip already-processed chunks on resume."""
db_path = _create_test_db(tmp_path)
checkpoint_path = tmp_path / "checkpoint.json"

import json

# Pre-seed checkpoint with 3 of 5 chunks
checkpoint_data = {
"model": "BAAI/bge-m3",
"processed_ids": ["chunk_0", "chunk_1", "chunk_2"],
"last_updated": "2026-04-05T00:00:00",
}
checkpoint_path.write_text(json.dumps(checkpoint_data))

result = subprocess.run(
[
sys.executable,
str(SCRIPT_PATH),
"--db",
str(db_path),
"--test",
"--checkpoint-file",
str(checkpoint_path),
],
capture_output=True,
text=True,
timeout=300,
)
assert result.returncode == 0, f"Script failed: {result.stderr}"
# Should mention skipping or resuming
output = result.stdout.lower()
assert "skip" in output or "resum" in output or "already" in output

# Checkpoint should now have all 5
data = json.loads(checkpoint_path.read_text())
assert len(data["processed_ids"]) == 5

def test_binary_vectors_updated(self, tmp_path):
"""Binary vectors should be regenerated from new float embeddings."""
db_path = _create_test_db(tmp_path)
checkpoint_path = tmp_path / "checkpoint.json"

result = subprocess.run(
[
sys.executable,
str(SCRIPT_PATH),
"--db",
str(db_path),
"--test",
"--checkpoint-file",
str(checkpoint_path),
],
capture_output=True,
text=True,
timeout=300,
)
assert result.returncode == 0, f"Script failed: {result.stderr}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Reduce repeated subprocess end-to-end runs in this test module.

Multiple tests each launch the full script path with long timeouts, which makes this suite expensive and less stable. Consider a shared pytest fixture that performs one run and reuses the DB/checkpoint state for multiple assertions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_reembed_bgem3.py` around lines 72 - 190, The tests repeatedly
invoke the end-to-end script via subprocess (using SCRIPT_PATH) which is slow
and flaky; create a shared pytest fixture (e.g., run_reembed_once) that sets up
_create_test_db and checkpoint_path, runs subprocess once with the "--test" and
"--checkpoint-file" args, captures result and stdout/stderr, reads/returns the
DB connection and checkpoint JSON; then refactor
test_test_flag_runs_limited_chunks, test_checkpoint_created, test_resumable, and
test_binary_vectors_updated to accept that fixture and use its returned items
(result, output, db_path/conn, checkpoint data) instead of calling subprocess in
each test so the heavy run is executed only once. Ensure the fixture exposes
processed_ids and parsed JSON to verify assertions about model, processed count
and resumed behavior.

@EtanHey EtanHey merged commit db39b3f into main Apr 6, 2026
6 checks passed
@EtanHey EtanHey deleted the feat/bgem3-reembed-and-enrichment-boost branch April 6, 2026 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant