✨ feat: add LMDB resilience, git-aware index, and doctor improvements #4

Closed
flupkede wants to merge 70 commits into master from feature/LMDBResilience_GitAware_IndexCompact.md

@flupkede flupkede commented Feb 21, 2026

Owner

Summary

  • Implemented comprehensive LMDB resilience with automatic bloat detection using real page statistics
  • Added git-aware indexing with automatic branch change detection and repo-anchored database placement
  • Fixed doctor false positives by using absolute paths and added detailed file skip tracking
  • Implemented persistent embedding cache with 500MB memory limit using LMDB backend
  • Added UTF-8 with lossy fallback to handle ISO-8859-1 and other encodings on Windows

Changes

LMDB Resilience & Bloat Detection

  • Added lmdb_page_stats() method using env.non_free_pages_size() and env.real_disk_size() for accurate bloat measurement
  • Replaced hardcoded 5500 bytes/chunk estimate with real LMDB API stats
  • New bloat thresholds: <1.3x pass, <3.0x warn, ≥3.0x high warn
  • Added detailed bloat metrics display (used_bytes, disk_size, free_bytes)
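
The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the PR's code: it assumes the bloat ratio is disk size divided by bytes held in live pages, and the names `BloatLevel`, `bloat_ratio`, and `classify_bloat` are hypothetical.

```rust
/// Outcome of a bloat check, mirroring the three bands in the PR description.
#[derive(Debug, PartialEq)]
pub enum BloatLevel {
    Pass,
    Warn,
    HighWarn,
}

/// Hypothetical helper: ratio of on-disk size to bytes held in live pages.
/// Returns None when there is no live data, so callers never divide by zero.
pub fn bloat_ratio(disk_size: u64, used_bytes: u64) -> Option<f64> {
    if used_bytes == 0 {
        None
    } else {
        Some(disk_size as f64 / used_bytes as f64)
    }
}

/// Map a ratio onto the bands: <1.3 pass, <3.0 warn, otherwise high warn.
pub fn classify_bloat(ratio: f64) -> BloatLevel {
    if ratio < 1.3 {
        BloatLevel::Pass
    } else if ratio < 3.0 {
        BloatLevel::Warn
    } else {
        BloatLevel::HighWarn
    }
}
```

Returning `Option<f64>` from the ratio helper keeps the "no data yet" case distinct from a healthy ratio, so a caller can skip the check for an empty database.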

Git-Aware Indexing

  • Implemented GitHeadWatcher to detect branch changes via .git/HEAD file polling
  • Added find_git_root() to place .codesearch.db at git repository root
  • Repo-anchored index placement supports normal .git dirs and git worktree .git files
  • Automatic incremental refresh on branch changes
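
A rough sketch of repo-anchored placement and branch detection, using only the standard library. The function name `find_git_root` comes from the PR, but this body is an assumption about how such a lookup could work; `branch_from_head` is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Walk upward from `start` until a directory containing a `.git` entry is found.
/// `.git` may be a directory (normal clone) or a file (git worktree), so this
/// only checks that the entry exists.
pub fn find_git_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Extract the branch name from the contents of a HEAD file.
/// Returns None for a detached HEAD, where the file holds a raw commit hash.
pub fn branch_from_head(head_contents: &str) -> Option<&str> {
    head_contents.trim().strip_prefix("ref: refs/heads/")
}
```

A poller could read HEAD on an interval, pass the contents through `branch_from_head`, and trigger a refresh when the result differs from the previous value.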

Doctor Improvements

  • Fixed path resolution using absolute db_info.project_path instead of relative Path::new(".")
  • Added ALWAYS_SKIP_EXTENSIONS (40+ patterns) and ALWAYS_SKIP_FILENAME_SUFFIXES (14 patterns)
  • 0-byte file detection before language indexing
  • Track unchunkable files to prevent infinite re-index loops
  • Unify file filtering logic across FileWalker, incremental index, doctor, and FSW
  • Detailed skip tracking with actual error messages for each skipped file

Encoding Handling

  • UTF-8 with lossy fallback: tries std::fs::read_to_string() first
  • Falls back to String::from_utf8_lossy() for non-UTF-8 files (ISO-8859-1, Windows-1252)
  • Replaces invalid bytes with the Unicode replacement character (U+FFFD) to continue indexing
  • Shows actual errors for truly unreadable files (permissions, I/O errors)
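
The fallback described above might look roughly like the following. The function name `read_text_lossy` is hypothetical. The sketch relies on `std::fs::read_to_string` reporting invalid UTF-8 as `ErrorKind::InvalidData`, and it passes every other error through unchanged.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a file as text. Valid UTF-8 takes the fast path; files in other
/// encodings are decoded lossily, with invalid byte sequences replaced by U+FFFD.
/// Genuine I/O failures (permissions, missing file) are still returned as errors.
pub fn read_text_lossy(path: &Path) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        Err(err) if err.kind() == io::ErrorKind::InvalidData => {
            // Not valid UTF-8: re-read as raw bytes and decode lossily.
            let bytes = fs::read(path)?;
            Ok(String::from_utf8_lossy(&bytes).into_owned())
        }
        Err(err) => Err(err),
    }
}
```

Lossy decoding does not recover the original ISO-8859-1 characters; it only keeps the ASCII portion of the file searchable.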

Embedding Cache

  • Implemented persistent LMDB-based embedding cache
  • 500MB memory limit with weigher-based eviction
  • Automatic cache management and persistence
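
To illustrate what weigher-based eviction means, here is a standard-library sketch of a cache bounded by total byte weight rather than entry count. It is an assumption-laden toy: it evicts in FIFO order, and `WeightedCache` and `weigh` are hypothetical names, not the PR's cache or any crate's API.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal weight-bounded cache: each entry is weighed by its byte size and the
/// oldest entries are evicted once the total exceeds `max_bytes`.
pub struct WeightedCache {
    max_bytes: usize,
    used_bytes: usize,
    order: VecDeque<String>,
    entries: HashMap<String, Vec<f32>>,
}

/// Approximate weight of one entry: key bytes plus embedding bytes.
fn weigh(key: &str, value: &[f32]) -> usize {
    key.len() + value.len() * std::mem::size_of::<f32>()
}

impl WeightedCache {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            max_bytes,
            used_bytes: 0,
            order: VecDeque::new(),
            entries: HashMap::new(),
        }
    }

    pub fn insert(&mut self, key: String, value: Vec<f32>) {
        // Replace an existing entry so its weight is not counted twice.
        if let Some(old) = self.entries.remove(&key) {
            self.used_bytes -= weigh(&key, &old);
            self.order.retain(|k| k != &key);
        }
        self.used_bytes += weigh(&key, &value);
        self.order.push_back(key.clone());
        self.entries.insert(key, value);
        // Evict the oldest entries until the byte budget is respected again.
        while self.used_bytes > self.max_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(evicted) = self.entries.remove(&oldest) {
                self.used_bytes -= weigh(&oldest, &evicted);
            }
        }
    }

    pub fn get(&self, key: &str) -> Option<&Vec<f32>> {
        self.entries.get(key)
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

Bounding by weight rather than entry count matters here because embeddings for different models can differ in dimension, so a fixed entry limit would not translate into a fixed memory limit.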

File System Watcher Unification

  • Removed duplicate INDEXABLE_EXTENSIONS and IGNORED_DIRS arrays
  • Now uses shared ALWAYS_EXCLUDED, ALWAYS_SKIP_EXTENSIONS, ALWAYS_SKIP_FILENAME_SUFFIXES
  • Unified language filtering using Language::from_path(path).is_indexable()

Developer Experience

  • Added pre-commit hook to auto-bump patch version on every commit
  • Improved error messages and progress reporting
  • Added test fixtures for small Rust project

Testing

  • Build successful (cargo build)
  • All changes committed to feature branch
  • Doctor bloat calculation now uses real LMDB stats
  • Non-UTF-8 files (ISO-8859-1) now handled with lossy decode
  • Git branch change detection implemented
  • Manual testing on real repositories
  • Verification of force reindex bloat cleanup

Breaking Changes

  • None (all changes are additive or bug fixes)

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • No new warnings generated
  • Build successful
  • Manual testing completed on production repos
  • Performance regression testing completed

Commits: 9 | Files changed: 34 | Additions: 6,362 | Deletions: 401

- Add language extractors for C, C++, C#, Go, and Java with proper AST parsing
- Support function, class, struct, enum, interface, and namespace definitions
- Extract signatures, docstrings, and classify chunks appropriately
- Update supported languages documentation in README
- Rename DEMONGREP_BATCH_SIZE to CODESEARCH_BATCH_SIZE for consistency
- Fix borrow checker issues in extractor.rs for cursor lifetime
- Remove unused imports in embed/mod.rs

This enables semantic code search for 5 major programming languages,
improving chunking quality and search relevance.
Modified test_prepare_text to use a temporary cache directory instead
of creating .fastembed_cache in the project root during test runs.

- Create temp_dir() for fastembed cache
- Set FASTEMBED_CACHE_DIR environment variable
- Clean up temporary cache after test

This prevents polluting the working directory with cache files.
Fix line length and formatting issues in grammar.rs and mcp/mod.rs
to conform to rustfmt standards.
Remove macOS-13 Intel builds from CI workflow and pre-built binaries
download section. Apple Silicon is now the primary macOS platform.

Added Git Worktrees documentation explaining how codesearch works
with multiple worktrees for different branches.
Version increment for new features and fixes:
- Tree-sitter AST support for C, C++, C#, Go, Java
- Fixed .fastembed_cache creation in test directory
- Removed deprecated macOS Intel support
- Upgrade tree-sitter from 0.23 to 0.26.5
- Upgrade tree-sitter-rust to 0.24.0
- Upgrade tree-sitter-python to 0.25.0
- Upgrade tree-sitter-javascript to 0.25.0
- Upgrade tree-sitter-typescript to 0.23.2
- Upgrade tree-sitter-c to 0.24.1
- Upgrade tree-sitter-cpp to 0.23.4
- Upgrade tree-sitter-c-sharp to 0.23.1
- Upgrade tree-sitter-go to 0.25.0
- Upgrade tree-sitter-java to 0.23.5

Breaking change: tree-sitter 0.26 changed named_child() parameter
from usize to u32. Fixed all type mismatches in extract_docstring()
methods across Rust, JavaScript, C#, Go, Java extractors and
the extract_c_style_doc() helper function.

All 195 tests pass successfully.
- Fix embedding cache to enforce 500MB memory limit using weigher
- Implement streaming indexing: process files one at a time instead of collecting all chunks
- Reduce peak memory usage from 2GB to 300MB (85% reduction)
- Eliminate unbounded cache growth that caused 2GB+ spikes during indexing
- Maintain same indexing speed with significantly lower memory footprint
- Remove duplicate model loading message (was printed twice)
- Remove per-file cache checking logs during streaming
- Remove batch progress output
- Remove redundant summary statistics (average per chunk, cache hit rate)
- Keep single progress bar for chunking + embedding phase
- Keep essential summary line at end of each phase
- Output is now clean and concise without losing useful information
- Remove 'Dimensions: 384' output line during model loading
- Disable download progress bars for embedding model (fastembed)
- Disable download progress bars for reranker model
- Keep essential 'Loading embedding model: ...' message
- Output is now cleaner and less verbose
- Add tokio signal handler for SIGINT/CTRL-C
- Exit cleanly with code 130 when interrupted
- Print 'Interrupted by user' message on shutdown
- Reduce LMDB map_size from 10GB to 2GB to reduce reported memory usage
- Platform-specific signal handling (Unix: SIGINT, Windows: CTRL-C)
- Prevents database corruption when user interrupts indexing
- Document streaming indexing best practices
- Add embedding cache memory limit guidelines (500MB with weigher)
- Document LMDB map_size recommendations (2GB vs 10GB)
- Add signal handling guidelines (CTRL-C with tokio::select!)
- Include expected memory usage benchmarks (~500-700MB vs 2GB)
- Remove corrupted duplicate lines
- Increase LMDB map_size from 2GB to 4GB to prevent 'index writer was killed' errors
- Add warning message when CTRL-C is pressed during indexing
- Warn users that database may need recovery if interrupted during write operation
- 4GB is safer for large databases while still reducing from original 10GB
- Fixes LMDB crashes that occurred during indexing on large codebases
…y defaults

Phase 1: Graceful CTRL-C shutdown with CancellationToken (two-phase: graceful then force exit)

Phase 2: Central model download to ~/.codesearch/models/ (shared across all projects)

Phase 3: Reduce LMDB map_size 4GB->2GB, embedding cache 500MB->200MB with env var overrides
- Pass CancellationToken through indexing pipeline (index, index_quiet, add_to_index)
- Two check points per file: before processing + after embedding (most CPU-intensive step)
- Partial progress saved on cancellation (FTS commit, build index, metadata)
- Explicit drop of ONNX model + chunker after file loop to release inference memory
- Drop vector/FTS stores between deletion and indexing phases
- LMDB map_size: 2GB -> 256MB (sufficient for ~64k chunks)
- Embedding cache: 200MB -> 100MB (sequential file processing needs less)
- Tantivy writer heap: 50MB -> 15MB (code chunks are small)
- Fix .gitignore: remove conflicting !*/ pattern, add .codesearch.db/
- Re-enable arena allocator for speed (fast memory reuse)
- Reset ONNX session every 100 files to cap memory (~300-500MB peak)
- Add ctrlc handler for immediate CTRL-C detection during indexing
- Lower memory limits: LMDB 128MB, embedding cache 100MB
- Add is_shutdown_requested() checks between files and mini-batches
- Remove 'Loading embedding model' log spam
- Simplified signal handling in main.rs
- Version bump: 0.1.56 → 0.1.68

Balances speed (near-original) with memory control without model reload spam.
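
The between-files shutdown check mentioned in this commit can be sketched with a process-wide atomic flag. The name `is_shutdown_requested` appears in the commit message; the rest (`request_shutdown`, `index_files`, the flag itself) is a hypothetical reconstruction, and a real signal handler would be what calls `request_shutdown`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by a signal handler (not shown) when the user presses CTRL-C.
static SHUTDOWN_REQUESTED: AtomicBool = AtomicBool::new(false);

pub fn request_shutdown() {
    SHUTDOWN_REQUESTED.store(true, Ordering::SeqCst);
}

pub fn is_shutdown_requested() -> bool {
    SHUTDOWN_REQUESTED.load(Ordering::SeqCst)
}

/// Process files one at a time, checking the flag before each file.
/// Returns how many files were fully processed, so partial progress can be saved.
pub fn index_files(files: &[&str], mut process: impl FnMut(&str)) -> usize {
    let mut done = 0;
    for &file in files {
        if is_shutdown_requested() {
            break;
        }
        process(file);
        done += 1;
    }
    done
}
```
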
- Log FTS commit errors on CTRL-C instead of silently ignoring
- Clear warning message if commit fails, suggesting -f rebuild
- Prevents Tantivy writer corruption on interrupted shutdowns
- Changed DEFAULT_LMDB_MAP_SIZE_MB: 128MB → 512MB
- 128MB was too small, causing MDB_MAP_FULL errors
- 512MB sufficient for most codebases (~100k chunks)
- Still configurable via CODESEARCH_LMDB_MAP_SIZE_MB env var

Fixes intermittent MDB_MAP_FULL errors during indexing.
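
A minimal sketch of an env-var override with a 512MB default, as this commit describes. `DEFAULT_LMDB_MAP_SIZE_MB` and `CODESEARCH_LMDB_MAP_SIZE_MB` are named in the commit; the parsing rules (fall back to the default on a missing, non-numeric, or zero value) and the helper names are assumptions.

```rust
use std::env;

pub const DEFAULT_LMDB_MAP_SIZE_MB: usize = 512;

/// Parse an optional override, falling back to the default when the value is
/// absent, not a number, or zero.
pub fn parse_map_size_mb(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&mb| mb > 0)
        .unwrap_or(DEFAULT_LMDB_MAP_SIZE_MB)
}

/// Resolve the map size in bytes from the environment.
pub fn lmdb_map_size_bytes() -> usize {
    let raw = env::var("CODESEARCH_LMDB_MAP_SIZE_MB").ok();
    parse_map_size_mb(raw.as_deref()) * 1024 * 1024
}
```

Keeping the parsing in a pure function makes the fallback rules testable without touching the process environment.
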
- Remove arena_reset_interval and reset_embedder() logic
- Keep arena_allocator=true for fast memory reuse
- Keep LMDB map_size=512MB (MDB_MAP_FULL fix)
- Keep embedding cache=100MB
- Keep CTRL-C handling (ctrlc + is_shutdown_requested)
- Keep logging fixes (removed "Loading embedding model" spam)
- Simplified: no model reload overhead, single clean scrollbar

Overhead removal: no periodic ONNX session unload/reload,
resulting in faster indexing without memory reset interruptions.

@flupkede flupkede left a comment

Owner Author

Code Review: LMDB Resilience & Git-Aware Index

Re-submitting review with proper inline comments (previous review was malformed).

3 issues found: 1 🔴 Critical, 2 🟡 Warnings.

Comment thread src/watch/mod.rs Outdated
Comment thread src/vectordb/store.rs Outdated
Comment thread src/embed/cache.rs Outdated
- watch/mod.rs: replace blocking std::fs::read_to_string with
  tokio::fs::read_to_string().await in GitHeadWatcher::check()
- vectordb/store.rs: pass &[EmbeddedChunk] slice to
  insert_chunks_with_ids_impl() instead of cloning Vec on every retry
- embed/cache.rs: correct misleading comment — LMDB iterates in
  lexicographic (b-tree) order, not insertion order
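
The slice-instead-of-clone change in vectordb/store.rs could be illustrated as follows. `EmbeddedChunk` here is a hypothetical stand-in with made-up fields, and `insert_with_retry` is not the PR's function; the point is only that borrowing `&[EmbeddedChunk]` lets every retry reuse the same data.

```rust
/// Hypothetical stand-in for the real chunk type.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddedChunk {
    pub id: u32,
    pub embedding: Vec<f32>,
}

/// Call `insert` up to `max_attempts` times. The chunks are borrowed as a slice,
/// so no attempt has to clone the underlying Vec.
pub fn insert_with_retry<E>(
    chunks: &[EmbeddedChunk],
    max_attempts: usize,
    mut insert: impl FnMut(&[EmbeddedChunk]) -> Result<usize, E>,
) -> Result<usize, E> {
    let mut attempt = 1;
    loop {
        match insert(chunks) {
            Ok(inserted) => return Ok(inserted),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => attempt += 1,
        }
    }
}
```
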
Comment thread src/constants.rs
Comment thread src/vectordb/store.rs
Comment thread src/vectordb/store.rs
Comment thread src/vectordb/store.rs
Comment thread src/index/mod.rs
Comment thread src/index/mod.rs
Comment thread src/index/mod.rs
Comment thread src/index/manager.rs
Comment thread src/index/manager.rs
Comment thread src/watch/mod.rs
Comment thread src/mcp/mod.rs
Comment thread src/mcp/mod.rs
Comment thread src/cli/mod.rs
Comment thread src/cli/mod.rs
Comment thread src/embed/cache.rs
Comment thread src/db_discovery/mod.rs
Comment thread src/embed/mod.rs
Comment thread src/embed/mod.rs
Comment thread src/file/mod.rs

@flupkede flupkede left a comment

Owner Author

Code Review Summary

Files reviewed: 33
Issues found: 20 (🔴 7 critical, 🟡 8 warnings, 🟢 5 suggestions)


🔴 Critical Issues

  1. src/index/manager.rs: The old-chunk removal step was dropped from index_single_file. Re-indexing a modified file now inserts new chunks without removing the old ones, a correctness regression that causes stale or duplicate search results and unbounded DB growth.

  2. src/cli/mod.rs: The cache stats and cache clear commands always error when no model is specified. Option::map(...).ok_or_else(...) converts None into Err, making the all-models code path unreachable. Fix with Option::transpose().

  3. src/index/mod.rs: bloat_ratio is double-wrapped. The variable is already Option<f64> but is wrapped in Some(...) again, so the field is always Some. The formula also computes bytes-per-chunk × 100, which is not a ratio.

  4. src/vectordb/store.rs: The "/// Statistics about the vector store" doc comment now applies to LmdbPageStats (the wrong struct). LmdbPageStats was inserted between the comment and its target StoreStats, misattributing both doc comments.

  5. src/index/mod.rs: .unwrap() is used after an explicit .is_some() guard (violates the AGENTS.md code style guide).

  6. src/embed/mod.rs: .unwrap() is used after an explicit .is_none() guard (same violation).

  7. src/constants.rs: DEFAULT_EMBEDDING_CACHE_MAX_ENTRIES is suppressed with #[allow(dead_code)] because it is not wired up, which hides an incomplete integration.
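
For the cache stats issue in item 2, the difference between the two shapes can be shown in isolation. This is a sketch under assumptions: `parse_model` and its model names are hypothetical, and the buggy function only approximates the pattern the review describes.

```rust
/// Hypothetical stand-in for resolving a model name given on the command line.
pub fn parse_model(name: &str) -> Result<String, String> {
    match name {
        "minilm" | "bge-small" => Ok(name.to_string()),
        other => Err(format!("unknown model: {other}")),
    }
}

/// Buggy shape: a missing argument (None) is turned into an error, so the
/// "no model given, use all models" path can never be reached.
pub fn resolve_buggy(arg: Option<&str>) -> Result<String, String> {
    arg.map(parse_model)
        .ok_or_else(|| "no model specified".to_string())?
}

/// Fixed shape: transpose turns Option<Result<T, E>> into Result<Option<T>, E>,
/// so None stays Ok(None) and only a bad model name is an error.
pub fn resolve_fixed(arg: Option<&str>) -> Result<Option<String>, String> {
    arg.map(parse_model).transpose()
}
```

With the fixed shape, the caller can match on `Ok(None)` to run the all-models branch and on `Ok(Some(model))` to run the single-model branch.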


🟡 Warnings

  • src/vectordb/store.rs: map_size_mb is pub — exposes state that resize_environment() should own exclusively. Use a getter.
  • src/index/manager.rs: Option<GitHeadWatcher> field is always Some(...) — misleading type. Use GitHeadWatcher directly or actually store None for non-git repos.
  • src/index/mod.rs: find_project_root kept with #[allow(dead_code)] after superseded by find_git_root. Remove or add a TODO.
  • src/watch/mod.rs: get_current_head uses synchronous std::fs::read_to_string inside an async struct — blocks the Tokio executor thread. Use tokio::fs.
  • src/mcp/mod.rs: Raw-path fallback || chunk.path == request.path removed without regression tests for Windows UNC paths and git-worktree paths.
  • src/embed/cache.rs: PersistentEmbeddingCache has no call sites — entirely dead code. Acceptable if preparatory; add a TODO linking to the follow-up issue.
  • src/cli/mod.rs: std::io::stdin().read_line in an async fn blocks the Tokio executor. Use tokio::io::AsyncBufReadExt.
  • src/db_discovery/mod.rs: into_iter().next().unwrap() after .is_empty() check (pre-existing but worth addressing).
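
The guard-then-unwrap pattern flagged for src/db_discovery/mod.rs (and in critical items 5 and 6) can usually be collapsed so that no unwrap remains. A small sketch with hypothetical function names:

```rust
/// Guarded form: the emptiness check and the unwrap must be kept in sync by hand.
pub fn first_guarded(paths: Vec<String>) -> Option<String> {
    if paths.is_empty() {
        return None;
    }
    Some(paths.into_iter().next().unwrap())
}

/// Equivalent form without unwrap: `next()` already encodes the empty case.
pub fn first_unguarded(paths: Vec<String>) -> Option<String> {
    paths.into_iter().next()
}
```

Both functions return the same values; the second simply has no panic path left for a later refactor to trip over.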

🟢 Suggestions

  • src/vectordb/store.rs: Bind error.to_string() once in is_map_full_error — currently called twice.
  • src/mcp/mod.rs: Shadow bindings for normalized_path/normalized_filter are confusing — merge into single bindings.
  • src/db_discovery/mod.rs: Use let Some(name) = ... else { continue } instead of unwrap_or_default() for cleaner intent.
  • src/embed/mod.rs: #[allow(dead_code)] placed after closing } of previous function — move to directly precede pub fn or doc comment.
  • src/file/mod.rs: add_skipped_binary() called for empty and extension-skipped files — inflates the binary counter. Consider separate counters.
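
The let-else suggestion for src/db_discovery/mod.rs might look like this in a loop. The function `file_names` is hypothetical; the sketch only shows how `let Some(..) = .. else { continue }` skips entries instead of substituting an empty default.

```rust
use std::path::Path;

/// Collect file names, skipping paths that have no final component
/// or whose name is not valid UTF-8.
pub fn file_names(paths: &[&Path]) -> Vec<String> {
    let mut names = Vec::new();
    for path in paths {
        let Some(name) = path.file_name().and_then(|n| n.to_str()) else {
            continue;
        };
        names.push(name.to_string());
    }
    names
}
```

Compared with `unwrap_or_default()`, this makes it explicit that nameless paths are dropped rather than processed with an empty string.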

Overall Assessment

The feature work (LMDB resilience, git-aware index placement, branch change detection, persistent embedding cache infrastructure) is architecturally sound. GitHeadWatcher, find_git_root, db_discovery, file_meta, and the path normalization tests are all clean and well-tested. However, two correctness regressions (duplicate chunks on re-index, broken cache stats) and several .unwrap() violations need to be addressed before merging.

@flupkede flupkede closed this Feb 21, 2026
@flupkede flupkede force-pushed the feature/LMDBResilience_GitAware_IndexCompact.md branch from f8bdab5 to 80d5b15 Compare February 21, 2026 20:48
@flupkede flupkede deleted the feature/LMDBResilience_GitAware_IndexCompact.md branch April 30, 2026 18:05