feat: BGE-M3 re-embedding script + enrichment throughput boost#211
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughAdds a new CLI script to re-embed Changes
Sequence DiagramsequenceDiagram
participant User as "User"
participant Script as "Reembed Script"
participant Model as "BAAI/bge‑m3 Model"
participant DB as "SQLite DB"
participant Checkpoint as "Checkpoint File"
User->>Script: invoke CLI (--db, --test, --checkpoint-file, ...)
Script->>Checkpoint: load or init checkpoint (processed_ids)
Script->>DB: query candidate `chunks` (ORDER BY id, optional LIMIT)
DB-->>Script: return chunk rows
loop per batch
Script->>Model: request embeddings for batch
Model-->>Script: return float vectors
Script->>Script: quantize to binary (sqlite‑vec or fallback)
Script->>DB: delete existing vector rows for batch
Script->>DB: insert float vectors into `chunk_vectors`
Script->>DB: insert binary blobs into `chunk_vectors_binary`
DB-->>Script: commit or retry on busy
Script->>Checkpoint: append processed_ids and persist periodically
Script->>User: emit progress, ETA
end
Script->>Checkpoint: save final checkpoint
Script-->>User: exit with summary
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| [ | ||
| sys.executable, | ||
| "-c", | ||
| f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)", |
There was a problem hiding this comment.
🟢 Low tests/test_reembed_bgem3.py:49
test_script_imports claims to check for syntax errors, but spec_from_file_location followed by module_from_spec only creates a module object—it never parses or compiles the file. Without calling spec.loader.exec_module(mod), the test passes even if reembed_bgem3.py contains invalid Python syntax. Consider adding spec.loader.exec_module(mod) to actually validate the file compiles, or update the comment to reflect what the test actually verifies.
| f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)", | |
| f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)", |
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file tests/test_reembed_bgem3.py around line 49:
`test_script_imports` claims to check for syntax errors, but `spec_from_file_location` followed by `module_from_spec` only creates a module object—it never parses or compiles the file. Without calling `spec.loader.exec_module(mod)`, the test passes even if `reembed_bgem3.py` contains invalid Python syntax. Consider adding `spec.loader.exec_module(mod)` to actually validate the file compiles, or update the comment to reflect what the test actually verifies.
Evidence trail:
tests/test_reembed_bgem3.py lines 43-57 (REVIEWED_COMMIT) - shows the test code with `spec_from_file_location` and `module_from_spec` but no `exec_module` call, plus the comment claiming syntax validation. Python docs at https://docs.python.org/3/library/importlib.html ("Importing a source file directly" example) confirms that `exec_module(module)` is required to actually parse/compile the module code.
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 110-112: The finally block currently only closes sqlite3
connections (conn.close()); add explicit closing for APSW by checking backend ==
"apsw" and calling the APSW connection's close method (e.g., apsw_conn.close()
or conn.close() if APSW connection is stored in conn) to ensure the APSW DB
handle is released and the lock cleared; update the finally block in
scripts/reembed_bgem3.py to close the APSW connection using the correct variable
(apsw_conn or conn) when backend == "apsw".
- Around line 193-196: The finally block in reembed_batch isn't closing APSW
connections, so add explicit conn.close() when backend == "apsw" (similar to
get_chunks_to_process) to release DB locks; update the finally clause in the
reembed_batch function to close conn for both "sqlite3" and "apsw" backends (or
call conn.close() unconditionally if safe) and ensure any variable names (conn)
match the ones used in the function.
- Around line 144-176: The bulk DB update loop (the for batch_start ...
generating embeddings and calling conn.execute and commit) must not run while
enrichment workers may be writing and must handle SQLITE_BUSY retries; add a
pre-run interactive safety check (e.g., check_safe_to_run()) and call a WAL
checkpoint (execute "PRAGMA wal_checkpoint(TRUNCATE)" on conn) before entering
the batch loop, then wrap DB operations in a retry wrapper that catches
sqlite3.OperationalError with "database is locked"/SQLITE_BUSY, uses exponential
backoff and a max attempt count around the conn.execute calls and the commit
(affecting the inner loop that uses serialize_f32 + conn.execute and the final
commit guarded by backend == "sqlite3"), and only raise after exhausting
retries. Ensure the retry wrapper is applied to the block that inserts into
chunk_vectors and chunk_vectors_binary as well as the commit to avoid transient
SQLITE_BUSY failures.
- Around line 65-71: The current except clause `except (ImportError, Exception)`
is too broad and may hide real errors; change it to only catch ImportError when
falling back to the stdlib `sqlite3` (i.e., use `except ImportError:`), and for
other exceptions log the exception details (via `logging.exception` or `print`
to stderr) and re-raise or handle appropriately; update the block around the
`sqlite3.connect(db_path)` / `conn` creation so you still set PRAGMA
journal_mode and busy_timeout but do not swallow non-import-related errors.
- Around line 27-28: Replace the hardcoded DEFAULT_DB Path with the project's
path resolver by importing get_db_path from paths.py and assigning DEFAULT_DB =
get_db_path() (or Path(get_db_path()) if it returns a string); update any use of
DEFAULT_DB in scripts/reembed_bgem3.py accordingly and remove the hardcoded
Path.home() construction so the script uses paths.get_db_path() consistently.
In `@tests/test_reembed_bgem3.py`:
- Around line 175-188: The subprocess.run invocation currently discards the
completion status; capture its result (e.g., assign to a variable like result)
when invoking the script with SCRIPT_PATH, db_path and checkpoint_path and
immediately assert result.returncode == 0 before proceeding to any database
state checks (same pattern used in test_checkpoint_created) so the test fails
fast if the script errored.
- Around line 108-121: The test invokes subprocess.run but doesn't capture or
assert the process exit status, which can leave flaky checkpoints; change the
call to assign the result (e.g., result = subprocess.run(...)) and then add an
assertion like assert result.returncode == 0 to ensure the script (invoked via
sys.executable and SCRIPT_PATH with --db set to db_path and --checkpoint-file
set to checkpoint_path) completed successfully; keep capture_output=True and
timeout as-is so you can still inspect result.stdout/result.stderr on failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 214a43d1-016d-4434-9c70-5d11e0b96f78
📒 Files selected for processing (3)
scripts/reembed_bgem3.pysrc/brainlayer/enrichment_controller.pytests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Macroscope - Correctness Check
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
src/brainlayer/enrichment_controller.pyscripts/reembed_bgem3.pytests/test_reembed_bgem3.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package insrc/brainlayer/
All scripts and CLI must usepaths.py:get_db_path()for resolving database path instead of hardcoding
Implement chunk lifecycle management with columnssuperseded_by,aggregated_into,archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes onchunkstable and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic onSQLITE_BUSYerrors; each worker must use its own database connection
Useruff check src/ && ruff format src/for linting and formatting
Files:
src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback viaenrichment_controller.py, configurable viaBRAINLAYER_ENRICH_BACKENDenvironment variable
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable with default 0.2 (12 RPM)
Files:
src/brainlayer/enrichment_controller.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
pytestfor testing
Files:
tests/test_reembed_bgem3.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
src/brainlayer/enrichment_controller.py
🔇 Additional comments (6)
scripts/reembed_bgem3.py (2)
200-257: LGTM on the main function structure.Good design with checkpoint-based resumability, progress reporting, and clear CLI interface. The final
save_checkpointcall at line 257 ensures state is persisted even if the last batch didn't trigger a periodic save.
36-50: No issues found —quantize_binary()bit ordering matchessqlite-vec'svec_quantize_binary().The fallback implementation uses
byte_val |= 1 << jto set bit j for the j-th float's sign, which correctly implements LSB-first packing (element index N → bit position N). This matches sqlite-vec's behavior exactly, so binary vector search results are consistent whether quantized via the fallback or via the sqlite-vec extension.tests/test_reembed_bgem3.py (3)
12-34: LGTM on test fixture.The
_create_test_dbhelper correctly sets up a minimal database with zero-initialized embeddings, enabling verification that re-embedding actually updates the values.
40-71: LGTM on CLI interface tests.Good coverage of script existence, syntax validation (without loading heavy dependencies), and help flag verification.
132-168: LGTM on resumability test.Properly seeds a partial checkpoint and verifies the script:
- Recognizes and reports skipped chunks
- Processes only remaining chunks
- Updates checkpoint to include all processed IDs
src/brainlayer/enrichment_controller.py (1)
385-386: No issues found. The parameter defaults change (limit: 25→500, since_hours: 24→8760) inenrich_realtime()is intentional per commit 5fcf21d ("BGE-M3 re-embedding script + enrichment throughput boost"). The commit explicitly documents this as a targeted backlog-clearing optimization: a backlog of 27K unenriched chunks clears in ~9 hours instead of 22 days. While_brain_digestinstore_handler.py:49will inherit the newsince_hoursdefault when not explicitly passed, this is the intended behavior and requires no changes.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
scripts/reembed_bgem3.py (1)
255-259: 🧹 Nitpick | 🔵 TrivialAdd a warning about concurrent enrichment workers.
The script performs bulk writes that can conflict with enrichment daemon operations, causing WAL bloat or lock contention. Consider adding a startup warning:
print(f"BGE-M3 Re-embedding {'(TEST MODE — 100 chunks)' if args.test else ''}") + print("⚠️ Ensure enrichment workers are stopped before running on production DB.") + print(" Run: launchctl unload ~/Library/LaunchAgents/com.brainlayer.enrich.plist") print(f" DB: {db_path}")Based on learnings: "Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/reembed_bgem3.py` around lines 255 - 259, Add a startup warning to the printed startup info near the existing prints (the block that prints BGE-M3 Re-embedding, DB: db_path, Checkpoint: checkpoint_path, Batch size: args.batch_size). The warning should explicitly tell users to stop enrichment workers and checkpoint the WAL before running this bulk re-embedding (e.g., "WARNING: stop enrichment workers and checkpoint WAL before running – bulk writes can conflict with enrichment daemons and cause WAL bloat or lock contention"). Place this message unconditionally (or visible in test and normal runs) so it appears when the script starts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 169-170: Replace the non-blocking WAL checkpoint call
conn.execute("PRAGMA wal_checkpoint(PASSIVE)") with a blocking checkpoint such
as conn.execute("PRAGMA wal_checkpoint(FULL)") (or TRUNCATE) to ensure the WAL
is fully checkpointed before the bulk reembed work; locate the PRAGMA
wal_checkpoint invocation in scripts/reembed_bgem3.py and update that call
accordingly, and if applicable add brief comments or surrounding logic to
stop/ensure enrichment workers are halted before invoking the FULL checkpoint.
- Around line 186-209: The batch writer (_write_batch) must run inside an
explicit transaction so APSW's autocommit doesn't partially persist rows if a
retry occurs; wrap the body of _write_batch in a transaction by issuing a BEGIN
(e.g., conn.execute("BEGIN")) before the loop and COMMIT after the loop, and on
any exception ensure you ROLLBACK (conn.execute("ROLLBACK")) and re-raise so
_retry_on_busy can retry; keep the existing sqlite3-only conn.commit()
conditional or remove it if you always use explicit BEGIN/COMMIT for both
backends, and leave the logic using ids, embeddings, serialize_f32,
quantize_binary, use_vec_quantize, and chunk_vectors / chunk_vectors_binary
unchanged.
In `@src/brainlayer/enrichment_controller.py`:
- Around line 40-42: The enrichment rate fallback currently uses "5.0" for the
BRAINLAYER_ENRICH_RATE env var which is incorrect; update the default value used
in the "realtime" dict entry to "0.2" (so
float(os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2"))) and adjust the inline
comment to reflect the 0.2 RPS (12 RPM) default; ensure the change is made where
the "realtime" key is defined in enrichment_controller.py so behavior reverts to
the intended default.
- Around line 387-388: The default for enrich_realtime currently expanding to
since_hours=8760 is too broad; change the enrich_realtime signature to a much
smaller safe default (e.g., since_hours=24) and/or require callers to pass an
explicit since_hours; then update src/brainlayer/mcp/store_handler.py so its
call enrich_realtime(store=store, limit=limit) supplies an explicit since_hours
(or wraps the call in the MCP one-write-at-a-time lock) to avoid inheriting a
1-year window and to enforce the concurrency constraint around write-heavy work
such as brain_digest.
---
Duplicate comments:
In `@scripts/reembed_bgem3.py`:
- Around line 255-259: Add a startup warning to the printed startup info near
the existing prints (the block that prints BGE-M3 Re-embedding, DB: db_path,
Checkpoint: checkpoint_path, Batch size: args.batch_size). The warning should
explicitly tell users to stop enrichment workers and checkpoint the WAL before
running this bulk re-embedding (e.g., "WARNING: stop enrichment workers and
checkpoint WAL before running – bulk writes can conflict with enrichment daemons
and cause WAL bloat or lock contention"). Place this message unconditionally (or
visible in test and normal runs) so it appears when the script starts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e1c12e95-d5f5-450c-a5d7-8bf618397771
📒 Files selected for processing (3)
scripts/reembed_bgem3.pysrc/brainlayer/enrichment_controller.pytests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
src/brainlayer/enrichment_controller.pyscripts/reembed_bgem3.pytests/test_reembed_bgem3.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package insrc/brainlayer/
All scripts and CLI must usepaths.py:get_db_path()for resolving database path instead of hardcoding
Implement chunk lifecycle management with columnssuperseded_by,aggregated_into,archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes onchunkstable and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic onSQLITE_BUSYerrors; each worker must use its own database connection
Useruff check src/ && ruff format src/for linting and formatting
Files:
src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback viaenrichment_controller.py, configurable viaBRAINLAYER_ENRICH_BACKENDenvironment variable
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable with default 0.2 (12 RPM)
Files:
src/brainlayer/enrichment_controller.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
pytestfor testing
Files:
tests/test_reembed_bgem3.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : Store and resolve database path using `paths.py:get_db_path()` — supports environment variable override or defaults to canonical path `~/.local/share/brainlayer/brainlayer.db`
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Never run bulk database operations while enrichment is writing; this causes WAL bloat and potential freeze
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : All database connections must retry on `SQLITE_BUSY`; each worker uses its own connection
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Drop FTS triggers before bulk deletes on `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Applied to files:
scripts/reembed_bgem3.py
🔇 Additional comments (9)
scripts/reembed_bgem3.py (3)
34-41: LGTM — DB path resolution now usesbrainlayer.paths.get_db_path()when available.This correctly addresses the coding guideline requirement with a graceful fallback for standalone execution.
139-153: LGTM — Retry logic for SQLITE_BUSY with exponential backoff is correctly implemented.The broad
Exceptioncatch combined with string/type checking handles both sqlite3 and APSW busy errors appropriately.
107-124: LGTM — Connection cleanup infinallyblock is now correctly implemented.The APSW/sqlite3 connection is properly closed regardless of which backend is used.
tests/test_reembed_bgem3.py (6)
12-34: LGTM — Test fixture correctly uses regular tables for extension-free testing.The comment at Line 17 clearly documents the intentional deviation from production vec0 schema, and the fallback
quantize_binarycode path is exercised.
40-56: LGTM — Lightweight syntax validation without loading heavy dependencies.The
importlib.util.spec_from_file_locationapproach correctly validates script syntax without triggering side effects.
72-101: LGTM — Return code assertion and embedding verification are properly implemented.Past review feedback regarding
result.returncode == 0assertion has been addressed.
103-131: LGTM — Checkpoint validation test correctly asserts script success and checkpoint structure.
133-169: LGTM — Resumability test correctly validates checkpoint-based skip behavior.The pre-seeded checkpoint approach effectively tests that already-processed chunks are skipped on re-run.
171-199: LGTM — Binary vector regeneration test validates length and non-zero content.The 128-byte assertion correctly corresponds to 1024 bits quantized to bytes.
| # WAL checkpoint before bulk work | ||
| conn.execute("PRAGMA wal_checkpoint(PASSIVE)") |
There was a problem hiding this comment.
Consider using PRAGMA wal_checkpoint(FULL) instead of PASSIVE.
PASSIVE is non-blocking but may leave WAL data uncheckpointed if readers hold the database. For a bulk operation processing ~301K chunks, FULL (or TRUNCATE) ensures complete WAL checkpoint before starting, reducing WAL bloat risk during the long run.
# WAL checkpoint before bulk work
- conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
+ conn.execute("PRAGMA wal_checkpoint(FULL)")Based on learnings: "Before bulk database operations: stop enrichment workers, checkpoint WAL with PRAGMA wal_checkpoint(FULL)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/reembed_bgem3.py` around lines 169 - 170, Replace the non-blocking
WAL checkpoint call conn.execute("PRAGMA wal_checkpoint(PASSIVE)") with a
blocking checkpoint such as conn.execute("PRAGMA wal_checkpoint(FULL)") (or
TRUNCATE) to ensure the WAL is fully checkpointed before the bulk reembed work;
locate the PRAGMA wal_checkpoint invocation in scripts/reembed_bgem3.py and
update that call accordingly, and if applicable add brief comments or
surrounding logic to stop/ensure enrichment workers are halted before invoking
the FULL checkpoint.
| def _write_batch(): | ||
| for chunk_id, embedding in zip(ids, embeddings): | ||
| emb_bytes = serialize_f32(embedding.tolist()) | ||
|
|
||
| conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,)) | ||
| conn.execute( | ||
| "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)", | ||
| (chunk_id, emb_bytes), | ||
| ) | ||
| conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,)) | ||
| if use_vec_quantize: | ||
| conn.execute( | ||
| "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))", | ||
| (chunk_id, emb_bytes), | ||
| ) | ||
| else: | ||
| conn.execute( | ||
| "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)", | ||
| (chunk_id, quantize_binary(emb_bytes)), | ||
| ) | ||
| if backend == "sqlite3": | ||
| conn.commit() | ||
|
|
||
| _retry_on_busy(_write_batch) |
There was a problem hiding this comment.
Transaction inconsistency between APSW and sqlite3 backends may cause partial writes.
With APSW (autocommit by default), each conn.execute() commits immediately. If the batch fails mid-iteration and retries, some chunks will already be committed while others aren't, but the checkpoint will record all IDs as processed.
The sqlite3 backend accumulates changes until commit(), so retries work correctly.
Proposed fix — wrap batch in explicit transaction
def _write_batch():
+ if backend == "apsw":
+ conn.execute("BEGIN IMMEDIATE")
for chunk_id, embedding in zip(ids, embeddings):
emb_bytes = serialize_f32(embedding.tolist())
conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
conn.execute(
"INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
(chunk_id, emb_bytes),
)
conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,))
if use_vec_quantize:
conn.execute(
"INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))",
(chunk_id, emb_bytes),
)
else:
conn.execute(
"INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)",
(chunk_id, quantize_binary(emb_bytes)),
)
- if backend == "sqlite3":
- conn.commit()
+ conn.execute("COMMIT") if backend == "apsw" else conn.commit()
_retry_on_busy(_write_batch)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/reembed_bgem3.py` around lines 186 - 209, The batch writer
(_write_batch) must run inside an explicit transaction so APSW's autocommit
doesn't partially persist rows if a retry occurs; wrap the body of _write_batch
in a transaction by issuing a BEGIN (e.g., conn.execute("BEGIN")) before the
loop and COMMIT after the loop, and on any exception ensure you ROLLBACK
(conn.execute("ROLLBACK")) and re-raise so _retry_on_busy can retry; keep the
existing sqlite3-only conn.commit() conditional or remove it if you always use
explicit BEGIN/COMMIT for both backends, and leave the logic using ids,
embeddings, serialize_f32, quantize_binary, use_vec_quantize, and chunk_vectors
/ chunk_vectors_binary unchanged.
| "realtime": float( | ||
| os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0") | ||
| ), # 300 RPM default (AI Pro verified 500+ RPM Apr 2026) |
There was a problem hiding this comment.
Restore the default enrichment fallback to 0.2 RPS.
At Line 41, the fallback "5.0" (300 RPM) conflicts with the project’s configured default and materially changes cost/rate behavior when BRAINLAYER_ENRICH_RATE is unset.
Proposed fix
RATE_LIMITS = {
"realtime": float(
- os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")
- ), # 300 RPM default (AI Pro verified 500+ RPM Apr 2026)
+ os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2")
+ ), # 12 RPM default; override via BRAINLAYER_ENRICH_RATEAs per coding guidelines: src/brainlayer/*enrichment*.py: Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable with default 0.2 (12 RPM).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/enrichment_controller.py` around lines 40 - 42, The enrichment
rate fallback currently uses "5.0" for the BRAINLAYER_ENRICH_RATE env var which
is incorrect; update the default value used in the "realtime" dict entry to
"0.2" (so float(os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2"))) and adjust the
inline comment to reflect the 0.2 RPS (12 RPM) default; ensure the change is
made where the "realtime" key is defined in enrichment_controller.py so behavior
reverts to the intended default.
| limit: int = 500, | ||
| since_hours: int = 8760, |
There was a problem hiding this comment.
The new realtime defaults broaden MCP enrich scope too aggressively.
At Line 388, since_hours=8760 is now the implicit default. src/brainlayer/mcp/store_handler.py calls enrich_realtime(store=store, limit=limit) without since_hours, so routine MCP enrich requests inherit a 1-year window, increasing DB contention risk on write-heavy paths.
Safer default patch
def enrich_realtime(
store,
- limit: int = 500,
- since_hours: int = 8760,
+ limit: int = 25,
+ since_hours: int = 24,
rate_per_second: float | None = None,As per coding guidelines: **/*.py: Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| limit: int = 500, | |
| since_hours: int = 8760, | |
| limit: int = 25, | |
| since_hours: int = 24, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/enrichment_controller.py` around lines 387 - 388, The default
for enrich_realtime currently expanding to since_hours=8760 is too broad; change
the enrich_realtime signature to a much smaller safe default (e.g.,
since_hours=24) and/or require callers to pass an explicit since_hours; then
update src/brainlayer/mcp/store_handler.py so its call
enrich_realtime(store=store, limit=limit) supplies an explicit since_hours (or
wraps the call in the MCP one-write-at-a-time lock) to avoid inheriting a 1-year
window and to enforce the concurrency constraint around write-heavy work such as
brain_digest.
- Add scripts/reembed_bgem3.py: re-embed all chunks with BAAI/bge-m3 - Resumable via checkpoint file (saves every 1000 chunks) - --test flag for 100-chunk verification run - Uses APSW + sqlite-vec for real DB, falls back to sqlite3 for tests - Updates both chunk_vectors (float) and chunk_vectors_binary (quantized) - Verified: 100 chunks on real DB at ~3.3 chunks/s on MPS - Bump enrichment defaults: limit 25→500, since_hours 24→8760 - Matches LaunchAgent plist changes (limit 50→500, interval 3600→600s) - Backlog of 27K unenriched chunks clears in ~9 hours vs 22 days - 7 tests covering: script interface, embedding correctness, checkpoint resume, binary vector generation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…st asserts - Close APSW connections in finally blocks (not just sqlite3) - Use brainlayer.paths.get_db_path() instead of hardcoded path - Narrow except clause to ImportError only (was catching all exceptions) - Add SQLITE_BUSY retry with exponential backoff for batch writes - WAL checkpoint before bulk operations - Assert subprocess returncode in test_checkpoint_created and test_binary_vectors_updated - ruff format enrichment_controller.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nce_hours=8760) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
67e730d to
6218941
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (5)
src/brainlayer/enrichment_controller.py (2)
40-42:⚠️ Potential issue | 🟠 MajorRate limit default violates coding guidelines.
The fallback
"5.0"(300 RPM) conflicts with the documented project default of0.2(12 RPM). This materially changes cost and rate behavior whenBRAINLAYER_ENRICH_RATEis unset.Proposed fix
RATE_LIMITS = { "realtime": float( - os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0") - ), # 300 RPM default (AI Pro verified 500+ RPM Apr 2026) + os.environ.get("BRAINLAYER_ENRICH_RATE", "0.2") + ), # 12 RPM default; override via BRAINLAYER_ENRICH_RATEAs per coding guidelines:
src/brainlayer/*enrichment*.py: Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable with default 0.2 (12 RPM).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/enrichment_controller.py` around lines 40 - 42, The default enrichment rate fallback is incorrect: change the env var fallback from "5.0" to "0.2" for BRAINLAYER_ENRICH_RATE used when building the "realtime" value in enrichment_controller.py (look for the "realtime": float(os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")) expression) so the default is 0.2 (12 RPM); also update or remove the misleading inline comment about 300 RPM to reflect the new default.
385-392:⚠️ Potential issue | 🟠 MajorAggressive defaults silently affect MCP callers.
src/brainlayer/mcp/store_handler.pycallsenrich_realtime(store=store, limit=limit)withoutsince_hours, so it will silently inheritsince_hours=8760(1 year). This broadens the enrichment window from 24 hours to 365 days for all MCPmode=enrichrequests, increasing DB contention on write-heavy paths.Consider one of these approaches:
- Keep conservative defaults (
limit=25,since_hours=24) and require explicit parameters for broader scans- Update
store_handler.pyto pass explicitsince_hours=24to preserve existing MCP behavior- Document this as a breaking change if intentional
Option 2: Preserve MCP behavior
In
src/brainlayer/mcp/store_handler.py:- enrich_result = await loop.run_in_executor(None, lambda: enrich_realtime(store=store, limit=limit)) + enrich_result = await loop.run_in_executor(None, lambda: enrich_realtime(store=store, limit=limit, since_hours=24))As per coding guidelines: Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/brainlayer/enrichment_controller.py` around lines 385 - 392, The enrich_realtime function currently uses aggressive defaults (limit=500, since_hours=8760) which cause callers like the MCP handler (store_handler calling enrich_realtime(store, limit=limit)) to unintentionally scan a year; to fix either (A) change enrich_realtime signature defaults to conservative values (e.g., limit=25, since_hours=24) so callers get safe behavior by default, or (B) update the MCP caller that invokes enrich_realtime to pass an explicit since_hours=24 (and adjust limit if needed) to preserve existing MCP behavior; while making this change, ensure the MCP one-write-at-a-time concurrency constraint around brain_digest is respected so enrichment does not run in parallel with other write-heavy MCP work.scripts/reembed_bgem3.py (3)
186-209:⚠️ Potential issue | 🟠 MajorWrap each batch write in an explicit transaction (especially for APSW).
The retry wrapper re-runs
_write_batch, but Line 186-209 has no explicit transaction boundary. A mid-batch failure can leave partially-applied row changes before retry, which weakens consistency guarantees.Suggested fix
def _write_batch(): - for chunk_id, embedding in zip(ids, embeddings): - emb_bytes = serialize_f32(embedding.tolist()) + try: + conn.execute("BEGIN IMMEDIATE") + for chunk_id, embedding in zip(ids, embeddings): + emb_bytes = serialize_f32(embedding.tolist()) - conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,)) - conn.execute( - "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)", - (chunk_id, emb_bytes), - ) - conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,)) - if use_vec_quantize: + conn.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,)) conn.execute( - "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))", + "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)", (chunk_id, emb_bytes), ) - else: + conn.execute("DELETE FROM chunk_vectors_binary WHERE chunk_id = ?", (chunk_id,)) + if use_vec_quantize: + conn.execute( + "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, vec_quantize_binary(?))", + (chunk_id, emb_bytes), + ) + else: + conn.execute( + "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)", + (chunk_id, quantize_binary(emb_bytes)), + ) + conn.execute("COMMIT") + except Exception: + conn.execute("ROLLBACK") + raise - conn.execute( - "INSERT INTO chunk_vectors_binary (chunk_id, embedding) VALUES (?, ?)", - (chunk_id, quantize_binary(emb_bytes)), - ) - if backend == "sqlite3": - conn.commit()#!/bin/bash set -euo pipefail # Verify whether batch writer has explicit transaction boundaries around the loop. rg -n -C3 'def _write_batch|BEGIN|COMMIT|ROLLBACK|_retry_on_busy' scripts/reembed_bgem3.py🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/reembed_bgem3.py` around lines 186 - 209, The batch writer _write_batch currently performs multiple conn.execute calls without an explicit transaction boundary, so wrap the entire loop in a BEGIN/COMMIT (and ROLLBACK on exception) transaction so retries via _retry_on_busy don't leave partially-applied rows; specifically, at the start of _write_batch call conn.execute("BEGIN") (or appropriate begin_transaction on APSW), run the loop of conn.execute(...) calls, then conn.execute("COMMIT") on success and conn.execute("ROLLBACK") in an exception handler before re-raising, and remove the per-backend conn.commit() inside the loop so commit occurs once per batch.
255-259:⚠️ Potential issue | 🟠 MajorEnforce a single-writer safety gate before starting bulk re-embedding.
This script is a write-heavy bulk job, but startup currently does not enforce/confirm the one-write-at-a-time constraint with enrichment/other MCP writers. Add an explicit guard (or required
--force-single-writeracknowledgment) before processing.As per coding guidelines: "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/reembed_bgem3.py` around lines 255 - 259, Startup does not enforce the one-write-at-a-time constraint before the bulk re-embed run; add a required acknowledgment flag and an explicit single-writer gate: add a --force-single-writer boolean CLI arg (alongside existing args.* like args.test and args.batch_size) and, in the script's main/startup path before processing (the area printing DB: db_path and Checkpoint: checkpoint_path), check for either a held lock (e.g., create/try-acquire a sentinel lock file or DB advisory lock) or require the presence of args.force_single_writer=True, otherwise abort with a clear message instructing the user to rerun with --force-single-writer; ensure the lock is released on exit/failure so other MCP writers can resume.
169-170:⚠️ Potential issue | 🟠 MajorUse a blocking WAL checkpoint before this bulk write pass.
Line 170 uses
PASSIVE, which can leave WAL pages uncheckpointed during long bulk updates; this increases lock pressure and WAL growth risk.Suggested fix
- conn.execute("PRAGMA wal_checkpoint(PASSIVE)") + conn.execute("PRAGMA wal_checkpoint(FULL)")Based on learnings: "Before bulk database operations: stop enrichment workers, checkpoint WAL with
PRAGMA wal_checkpoint(FULL)."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/reembed_bgem3.py` around lines 169 - 170, The WAL checkpoint currently uses PASSIVE which can leave pages uncheckpointed during long bulk updates; change the call in the bulk-write preparation to perform a blocking checkpoint by invoking conn.execute("PRAGMA wal_checkpoint(FULL)") instead of PASSIVE, and ensure any enrichment workers are stopped before this checkpoint so the FULL checkpoint can complete without interference (look for the conn.execute call and surrounding bulk-write preparation logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/reembed_bgem3.py`:
- Around line 237-243: The CLI currently allows zero or negative values for
--batch-size and --checkpoint-every which can cause range/ZeroDivisionError;
update the parser.add_argument calls that define "--batch-size" and
"--checkpoint-every" to validate strictly positive integers at parse time (e.g.,
provide a custom type/validator function or use argparse's choices to enforce
value > 0) so parsing fails with a clear error message if a non-positive value
is supplied; ensure the validator is referenced in the same parser.add_argument
calls that currently set default=8 and default=1000 respectively.
In `@tests/test_reembed_bgem3.py`:
- Around line 43-56: The test_script_imports currently only creates a module
spec and module object but never executes it, so import-time errors are missed;
update the subprocess.run "-c" snippet used by test_script_imports to call
spec.loader.exec_module(mod) (and guard that spec and spec.loader are present)
so the module at SCRIPT_PATH is actually executed during the test; ensure the
invocation still captures output via subprocess.run to surface any import
errors.
- Around line 72-190: The tests repeatedly invoke the end-to-end script via
subprocess (using SCRIPT_PATH) which is slow and flaky; create a shared pytest
fixture (e.g., run_reembed_once) that sets up _create_test_db and
checkpoint_path, runs subprocess once with the "--test" and "--checkpoint-file"
args, captures result and stdout/stderr, reads/returns the DB connection and
checkpoint JSON; then refactor test_test_flag_runs_limited_chunks,
test_checkpoint_created, test_resumable, and test_binary_vectors_updated to
accept that fixture and use its returned items (result, output, db_path/conn,
checkpoint data) instead of calling subprocess in each test so the heavy run is
executed only once. Ensure the fixture exposes processed_ids and parsed JSON to
verify assertions about model, processed count and resumed behavior.
---
Duplicate comments:
In `@scripts/reembed_bgem3.py`:
- Around line 186-209: The batch writer _write_batch currently performs multiple
conn.execute calls without an explicit transaction boundary, so wrap the entire
loop in a BEGIN/COMMIT (and ROLLBACK on exception) transaction so retries via
_retry_on_busy don't leave partially-applied rows; specifically, at the start of
_write_batch call conn.execute("BEGIN") (or appropriate begin_transaction on
APSW), run the loop of conn.execute(...) calls, then conn.execute("COMMIT") on
success and conn.execute("ROLLBACK") in an exception handler before re-raising,
and remove the per-backend conn.commit() inside the loop so commit occurs once
per batch.
- Around line 255-259: Startup does not enforce the one-write-at-a-time
constraint before the bulk re-embed run; add a required acknowledgment flag and
an explicit single-writer gate: add a --force-single-writer boolean CLI arg
(alongside existing args.* like args.test and args.batch_size) and, in the
script's main/startup path before processing (the area printing DB: db_path and
Checkpoint: checkpoint_path), check for either a held lock (e.g.,
create/try-acquire a sentinel lock file or DB advisory lock) or require the
presence of args.force_single_writer=True, otherwise abort with a clear message
instructing the user to rerun with --force-single-writer; ensure the lock is
released on exit/failure so other MCP writers can resume.
- Around line 169-170: The WAL checkpoint currently uses PASSIVE which can leave
pages uncheckpointed during long bulk updates; change the call in the bulk-write
preparation to perform a blocking checkpoint by invoking conn.execute("PRAGMA
wal_checkpoint(FULL)") instead of PASSIVE, and ensure any enrichment workers are
stopped before this checkpoint so the FULL checkpoint can complete without
interference (look for the conn.execute call and surrounding bulk-write
preparation logic).
In `@src/brainlayer/enrichment_controller.py`:
- Around line 40-42: The default enrichment rate fallback is incorrect: change
the env var fallback from "5.0" to "0.2" for BRAINLAYER_ENRICH_RATE used when
building the "realtime" value in enrichment_controller.py (look for the
"realtime": float(os.environ.get("BRAINLAYER_ENRICH_RATE", "5.0")) expression)
so the default is 0.2 (12 RPM); also update or remove the misleading inline
comment about 300 RPM to reflect the new default.
- Around line 385-392: The enrich_realtime function currently uses aggressive
defaults (limit=500, since_hours=8760) which cause callers like the MCP handler
(store_handler calling enrich_realtime(store, limit=limit)) to unintentionally
scan a year; to fix either (A) change enrich_realtime signature defaults to
conservative values (e.g., limit=25, since_hours=24) so callers get safe
behavior by default, or (B) update the MCP caller that invokes enrich_realtime
to pass an explicit since_hours=24 (and adjust limit if needed) to preserve
existing MCP behavior; while making this change, ensure the MCP
one-write-at-a-time concurrency constraint around brain_digest is respected so
enrichment does not run in parallel with other write-heavy MCP work.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0746b000-2cd1-426b-8e0a-22bd992129c6
📒 Files selected for processing (4)
scripts/reembed_bgem3.pysrc/brainlayer/enrichment_controller.pytests/test_enrichment_controller.pytests/test_reembed_bgem3.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
tests/test_enrichment_controller.pysrc/brainlayer/enrichment_controller.pyscripts/reembed_bgem3.pytests/test_reembed_bgem3.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
pytestfor testing
Files:
tests/test_enrichment_controller.pytests/test_reembed_bgem3.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package insrc/brainlayer/
All scripts and CLI must usepaths.py:get_db_path()for resolving database path instead of hardcoding
Implement chunk lifecycle management with columnssuperseded_by,aggregated_into,archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes onchunkstable and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic onSQLITE_BUSYerrors; each worker must use its own database connection
Useruff check src/ && ruff format src/for linting and formatting
Files:
src/brainlayer/enrichment_controller.py
src/brainlayer/*enrichment*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/*enrichment*.py: Use Groq as primary enrichment backend with Gemini fallback viaenrichment_controller.py, configurable viaBRAINLAYER_ENRICH_BACKENDenvironment variable
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable with default 0.2 (12 RPM)
Files:
src/brainlayer/enrichment_controller.py
🧠 Learnings (23)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 embeddings model in `embeddings.py` for text vectorization
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 for text embeddings
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable with default 0.2 (12 RPM)
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/*enrichment*.py : Use Groq as primary enrichment backend with Gemini fallback via `enrichment_controller.py`, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment backend priority: Groq (primary/cloud) → Gemini (fallback) → Ollama (offline last-resort), configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
src/brainlayer/enrichment_controller.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Applied to files:
src/brainlayer/enrichment_controller.pyscripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 embeddings model in `embeddings.py` for text vectorization
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/embeddings.py : Use bge-large-en-v1.5 for text embeddings
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : Store and resolve database path using `paths.py:get_db_path()` — supports environment variable override or defaults to canonical path `~/.local/share/brainlayer/brainlayer.db`
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Never run bulk database operations while enrichment is writing; this causes WAL bloat and potential freeze
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : All database connections must retry on `SQLITE_BUSY`; each worker uses its own connection
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Drop FTS triggers before bulk deletes on `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec storage via APSW in `vector_store.py` for vector operations
Applied to files:
scripts/reembed_bgem3.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec with APSW for vector storage and retrieval
Applied to files:
scripts/reembed_bgem3.py
🔇 Additional comments (1)
tests/test_enrichment_controller.py (1)
838-838: Test correctly reflects the updated defaults.The assertion update is consistent with the source code changes in
enrichment_controller.py. The test properly validates thatchunk_idsare passed through toget_enrichment_candidatesalong with the new default values.Note: The appropriateness of these new default values (
limit=500,since_hours=8760) is addressed in the review ofenrichment_controller.py.
| "--batch-size", | ||
| type=int, | ||
| default=8, | ||
| help="Embedding batch size (default: 8, BGE-M3 needs more VRAM than bge-large)", | ||
| ) | ||
| parser.add_argument("--checkpoint-every", type=int, default=1000, help="Save checkpoint every N chunks") | ||
| parser.add_argument( |
There was a problem hiding this comment.
Validate --batch-size and --checkpoint-every as strictly positive integers.
Line 214 can raise ZeroDivisionError when --checkpoint-every=0, and Line 177 can fail when --batch-size<=0 (range step cannot be zero). Add argparse validation at parse time.
Suggested fix
def main():
+ def _positive_int(value: str) -> int:
+ ivalue = int(value)
+ if ivalue <= 0:
+ raise argparse.ArgumentTypeError("must be > 0")
+ return ivalue
+
parser = argparse.ArgumentParser(description="Re-embed chunks with BAAI/bge-m3")
@@
- type=int,
+ type=_positive_int,
@@
- parser.add_argument("--checkpoint-every", type=int, default=1000, help="Save checkpoint every N chunks")
+ parser.add_argument("--checkpoint-every", type=_positive_int, default=1000, help="Save checkpoint every N chunks")Also applies to: 214-214, 177-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/reembed_bgem3.py` around lines 237 - 243, The CLI currently allows
zero or negative values for --batch-size and --checkpoint-every which can cause
range/ZeroDivisionError; update the parser.add_argument calls that define
"--batch-size" and "--checkpoint-every" to validate strictly positive integers
at parse time (e.g., provide a custom type/validator function or use argparse's
choices to enforce value > 0) so parsing fails with a clear error message if a
non-positive value is supplied; ensure the validator is referenced in the same
parser.add_argument calls that currently set default=8 and default=1000
respectively.
| def test_script_imports(self): | ||
| """Script should be importable without side effects.""" | ||
| result = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| "-c", | ||
| f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)", | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=30, | ||
| ) | ||
| # Just checks syntax is valid, not full import (which would load torch) | ||
| assert result.returncode == 0, f"Script has syntax errors: {result.stderr}" |
There was a problem hiding this comment.
test_script_imports does not actually import/execute the module.
The current -c snippet only constructs a module object; it never executes spec.loader.exec_module(...), so import-time errors can be missed.
Suggested fix
- f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec)",
+ f"import importlib.util; spec = importlib.util.spec_from_file_location('reembed', '{SCRIPT_PATH}'); mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_reembed_bgem3.py` around lines 43 - 56, The test_script_imports
currently only creates a module spec and module object but never executes it, so
import-time errors are missed; update the subprocess.run "-c" snippet used by
test_script_imports to call spec.loader.exec_module(mod) (and guard that spec
and spec.loader are present) so the module at SCRIPT_PATH is actually executed
during the test; ensure the invocation still captures output via subprocess.run
to surface any import errors.
| def test_test_flag_runs_limited_chunks(self, tmp_path): | ||
| """--test flag should process only 100 chunks (or all if fewer).""" | ||
| db_path = _create_test_db(tmp_path) | ||
| checkpoint_path = tmp_path / "checkpoint.json" | ||
|
|
||
| result = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| str(SCRIPT_PATH), | ||
| "--db", | ||
| str(db_path), | ||
| "--test", | ||
| "--checkpoint-file", | ||
| str(checkpoint_path), | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=300, | ||
| ) | ||
| assert result.returncode == 0, f"Script failed: {result.stderr}" | ||
| assert "Processed" in result.stdout or "processed" in result.stdout.lower() | ||
|
|
||
| # Verify embeddings were updated (not all zeros anymore) | ||
| conn = sqlite3.connect(str(db_path)) | ||
| row = conn.execute("SELECT embedding FROM chunk_vectors WHERE chunk_id = 'chunk_0'").fetchone() | ||
| conn.close() | ||
| assert row is not None | ||
| emb_floats = struct.unpack(f"{1024}f", row[0]) | ||
| # At least some values should be non-zero after re-embedding | ||
| assert any(v != 0.0 for v in emb_floats), "Embeddings should be updated from zeros" | ||
|
|
||
| def test_checkpoint_created(self, tmp_path): | ||
| """Script should create a checkpoint file tracking progress.""" | ||
| db_path = _create_test_db(tmp_path) | ||
| checkpoint_path = tmp_path / "checkpoint.json" | ||
|
|
||
| result = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| str(SCRIPT_PATH), | ||
| "--db", | ||
| str(db_path), | ||
| "--test", | ||
| "--checkpoint-file", | ||
| str(checkpoint_path), | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=300, | ||
| ) | ||
| assert result.returncode == 0, f"Script failed: {result.stderr}" | ||
| assert checkpoint_path.exists(), "Checkpoint file should be created" | ||
|
|
||
| import json | ||
|
|
||
| data = json.loads(checkpoint_path.read_text()) | ||
| assert "processed_ids" in data | ||
| assert "model" in data | ||
| assert data["model"] == "BAAI/bge-m3" | ||
| assert len(data["processed_ids"]) == 5 # all 5 test chunks | ||
|
|
||
| def test_resumable(self, tmp_path): | ||
| """Script should skip already-processed chunks on resume.""" | ||
| db_path = _create_test_db(tmp_path) | ||
| checkpoint_path = tmp_path / "checkpoint.json" | ||
|
|
||
| import json | ||
|
|
||
| # Pre-seed checkpoint with 3 of 5 chunks | ||
| checkpoint_data = { | ||
| "model": "BAAI/bge-m3", | ||
| "processed_ids": ["chunk_0", "chunk_1", "chunk_2"], | ||
| "last_updated": "2026-04-05T00:00:00", | ||
| } | ||
| checkpoint_path.write_text(json.dumps(checkpoint_data)) | ||
|
|
||
| result = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| str(SCRIPT_PATH), | ||
| "--db", | ||
| str(db_path), | ||
| "--test", | ||
| "--checkpoint-file", | ||
| str(checkpoint_path), | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=300, | ||
| ) | ||
| assert result.returncode == 0, f"Script failed: {result.stderr}" | ||
| # Should mention skipping or resuming | ||
| output = result.stdout.lower() | ||
| assert "skip" in output or "resum" in output or "already" in output | ||
|
|
||
| # Checkpoint should now have all 5 | ||
| data = json.loads(checkpoint_path.read_text()) | ||
| assert len(data["processed_ids"]) == 5 | ||
|
|
||
| def test_binary_vectors_updated(self, tmp_path): | ||
| """Binary vectors should be regenerated from new float embeddings.""" | ||
| db_path = _create_test_db(tmp_path) | ||
| checkpoint_path = tmp_path / "checkpoint.json" | ||
|
|
||
| result = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| str(SCRIPT_PATH), | ||
| "--db", | ||
| str(db_path), | ||
| "--test", | ||
| "--checkpoint-file", | ||
| str(checkpoint_path), | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=300, | ||
| ) | ||
| assert result.returncode == 0, f"Script failed: {result.stderr}" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Reduce repeated subprocess end-to-end runs in this test module.
Multiple tests each launch the full script path with long timeouts, which makes this suite expensive and less stable. Consider a shared pytest fixture that performs one run and reuses the DB/checkpoint state for multiple assertions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_reembed_bgem3.py` around lines 72 - 190, The tests repeatedly
invoke the end-to-end script via subprocess (using SCRIPT_PATH) which is slow
and flaky; create a shared pytest fixture (e.g., run_reembed_once) that sets up
_create_test_db and checkpoint_path, runs subprocess once with the "--test" and
"--checkpoint-file" args, captures result and stdout/stderr, reads/returns the
DB connection and checkpoint JSON; then refactor
test_test_flag_runs_limited_chunks, test_checkpoint_created, test_resumable, and
test_binary_vectors_updated to accept that fixture and use its returned items
(result, output, db_path/conn, checkpoint data) instead of calling subprocess in
each test so the heavy run is executed only once. Ensure the fixture exposes
processed_ids and parsed JSON to verify assertions about model, processed count
and resumed behavior.
Summary
scripts/reembed_bgem3.py): re-embed all 301K chunks from bge-large-en-v1.5 to BAAI/bge-m3 for improved multilingual + retrieval quality--testflag processes 100 chunks for quick verificationchunk_vectors(float32) andchunk_vectors_binary(quantized)Test plan
--testflag verified on real 8GB production DB (100 chunks, APSW + sqlite-vec)🤖 Generated with Claude Code
Note
Add BGE-M3 re-embedding CLI script and increase
enrich_realtimedefault throughputBAAI/bge-m3via SentenceTransformers, generates 1024-dim float32 embeddings for allchunksrows, and writes them tochunk_vectorsandchunk_vectors_binaryin SQLite.SQLITE_BUSY.vec_quantize_binary) when available, falling back to stdlibsqlite3with a pure-Python sign-bit quantizer for binary embeddings.enrich_realtimedefaults in enrichment_controller.py fromlimit=25, since_hours=24tolimit=500, since_hours=8760.enrich_realtimecalled without explicit arguments now processes up to 500 candidates over the last year instead of 25 over the last day.Macroscope summarized 6218941.
Summary by CodeRabbit
New Features
Improvements
Tests