✨ feat: add LMDB resilience, git-aware index, and doctor improvements #4
flupkede wants to merge 70 commits into
- Add language extractors for C, C++, C#, Go, and Java with proper AST parsing
- Support function, class, struct, enum, interface, and namespace definitions
- Extract signatures, docstrings, and classify chunks appropriately
- Update supported languages documentation in README
- Rename DEMONGREP_BATCH_SIZE to CODESEARCH_BATCH_SIZE for consistency
- Fix borrow checker issues in extractor.rs for cursor lifetime
- Remove unused imports in embed/mod.rs

This enables semantic code search for 5 major programming languages, improving chunking quality and search relevance.

Modified test_prepare_text to use a temporary cache directory instead of creating .fastembed_cache in the project root during test runs.
- Create temp_dir() for fastembed cache
- Set FASTEMBED_CACHE_DIR environment variable
- Clean up temporary cache after test

This prevents polluting the working directory with cache files.

Fix line length and formatting issues in grammar.rs and mcp/mod.rs to conform to rustfmt standards.
Remove macOS-13 Intel builds from CI workflow and pre-built binaries download section. Apple Silicon is now the primary macOS platform. Added Git Worktrees documentation explaining how codesearch works with multiple worktrees for different branches.
Version increment for new features and fixes:
- Tree-sitter AST support for C, C++, C#, Go, Java
- Fixed .fastembed_cache creation in test directory
- Removed deprecated macOS Intel support

- Upgrade tree-sitter from 0.23 to 0.26.5
- Upgrade tree-sitter-rust to 0.24.0
- Upgrade tree-sitter-python to 0.25.0
- Upgrade tree-sitter-javascript to 0.25.0
- Upgrade tree-sitter-typescript to 0.23.2
- Upgrade tree-sitter-c to 0.24.1
- Upgrade tree-sitter-cpp to 0.23.4
- Upgrade tree-sitter-c-sharp to 0.23.1
- Upgrade tree-sitter-go to 0.25.0
- Upgrade tree-sitter-java to 0.23.5

Breaking change: tree-sitter 0.26 changed the named_child() parameter from usize to u32. Fixed all type mismatches in extract_docstring() methods across the Rust, JavaScript, C#, Go, and Java extractors and the extract_c_style_doc() helper function. All 195 tests pass successfully.

Feature/upgrade tree sitter
- Fix embedding cache to enforce 500MB memory limit using weigher
- Implement streaming indexing: process files one at a time instead of collecting all chunks
- Reduce peak memory usage from 2GB to 300MB (85% reduction)
- Eliminate unbounded cache growth that caused 2GB+ spikes during indexing
- Maintain same indexing speed with significantly lower memory footprint

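The weight-based limit described in this commit can be illustrated with a minimal, std-only sketch. It is not the project's cache: the type name `BoundedEmbeddingCache`, the FIFO eviction policy, and the choice to weigh entries by the byte size of their `f32` data are all assumptions made for illustration. The real implementation presumably delegates weighing and eviction to its cache library.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical size-bounded cache: the summed weight of all entries
/// (bytes of f32 data) never exceeds `max_bytes`.
pub struct BoundedEmbeddingCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Vec<f32>>,
    /// Keys in insertion order, oldest first (FIFO eviction is an assumption).
    order: VecDeque<String>,
}

impl BoundedEmbeddingCache {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            max_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// The "weigher": approximate memory cost of one embedding.
    fn weigh(embedding: &[f32]) -> usize {
        embedding.len() * std::mem::size_of::<f32>()
    }

    pub fn insert(&mut self, key: String, embedding: Vec<f32>) {
        let weight = Self::weigh(&embedding);
        if weight > self.max_bytes {
            return; // an entry larger than the whole budget is never admitted
        }
        // Replacing an existing key first releases its old weight.
        if let Some(old) = self.entries.remove(&key) {
            self.used_bytes -= Self::weigh(&old);
            self.order.retain(|k| k != &key);
        }
        // Evict oldest entries until the new one fits.
        while self.used_bytes + weight > self.max_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(evicted) = self.entries.remove(&oldest) {
                self.used_bytes -= Self::weigh(&evicted);
            }
        }
        self.used_bytes += weight;
        self.order.push_back(key.clone());
        self.entries.insert(key, embedding);
    }

    pub fn get(&self, key: &str) -> Option<&[f32]> {
        self.entries.get(key).map(|v| v.as_slice())
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Bounding by weight rather than by entry count is what keeps memory predictable when embeddings differ in size; an entry-count limit alone would not cap the byte total.
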
- Remove duplicate model loading message (was printed twice)
- Remove per-file cache checking logs during streaming
- Remove batch progress output
- Remove redundant summary statistics (average per chunk, cache hit rate)
- Keep single progress bar for chunking + embedding phase
- Keep essential summary line at end of each phase
- Output is now clean and concise without losing useful information

- Remove 'Dimensions: 384' output line during model loading
- Disable download progress bars for embedding model (fastembed)
- Disable download progress bars for reranker model
- Keep essential 'Loading embedding model: ...' message
- Output is now cleaner and less verbose

- Add tokio signal handler for SIGINT/CTRL-C
- Exit cleanly with code 130 when interrupted
- Print 'Interrupted by user' message on shutdown
- Reduce LMDB map_size from 10GB to 2GB to reduce reported memory usage
- Platform-specific signal handling (Unix: SIGINT, Windows: CTRL-C)
- Prevents database corruption when user interrupts indexing

- Document streaming indexing best practices
- Add embedding cache memory limit guidelines (500MB with weigher)
- Document LMDB map_size recommendations (2GB vs 10GB)
- Add signal handling guidelines (CTRL-C with tokio::select!)
- Include expected memory usage benchmarks (~500-700MB vs 2GB)
- Remove corrupted duplicate lines

- Increase LMDB map_size from 2GB to 4GB to prevent 'index writer was killed' errors
- Add warning message when CTRL-C is pressed during indexing
- Warn users that database may need recovery if interrupted during write operation
- 4GB is safer for large databases while still reducing from original 10GB
- Fixes LMDB crashes that occurred during indexing on large codebases

…y defaults
Phase 1: Graceful CTRL-C shutdown with CancellationToken (two-phase: graceful then force exit)
Phase 2: Central model download to ~/.codesearch/models/ (shared across all projects)
Phase 3: Reduce LMDB map_size 4GB->2GB, embedding cache 500MB->200MB with env var overrides

- Pass CancellationToken through indexing pipeline (index, index_quiet, add_to_index)
- Two check points per file: before processing + after embedding (most CPU-intensive step)
- Partial progress saved on cancellation (FTS commit, build index, metadata)
- Explicit drop of ONNX model + chunker after file loop to release inference memory
- Drop vector/FTS stores between deletion and indexing phases
- LMDB map_size: 2GB -> 256MB (sufficient for ~64k chunks)
- Embedding cache: 200MB -> 100MB (sequential file processing needs less)
- Tantivy writer heap: 50MB -> 15MB (code chunks are small)
- Fix .gitignore: remove conflicting !*/ pattern, add .codesearch.db/

- Re-enable arena allocator for speed (fast memory reuse)
- Reset ONNX session every 100 files to cap memory (~300-500MB peak)
- Add ctrlc handler for immediate CTRL-C detection during indexing
- Lower memory limits: LMDB 128MB, embedding cache 100MB
- Add is_shutdown_requested() checks between files and mini-batches
- Remove 'Loading embedding model' log spam
- Simplified signal handling in main.rs
- Version bump: 0.1.56 → 0.1.68

Balances speed (near-original) with memory control without model reload spam.

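The `is_shutdown_requested()` checks between files can be pictured with a small std-only sketch. Only the function name `is_shutdown_requested` comes from the commit message; the global `AtomicBool`, `request_shutdown()`, and `index_files()` are hypothetical stand-ins for whatever the signal handler and indexing loop actually look like.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide flag; a CTRL-C handler would set it (assumed design).
static SHUTDOWN_REQUESTED: AtomicBool = AtomicBool::new(false);

/// Called from the signal handler in this sketch.
pub fn request_shutdown() {
    SHUTDOWN_REQUESTED.store(true, Ordering::SeqCst);
}

pub fn is_shutdown_requested() -> bool {
    SHUTDOWN_REQUESTED.load(Ordering::SeqCst)
}

/// Processes files one at a time and stops before the next file once a
/// shutdown was requested. Returns how many files were fully processed,
/// so the caller can still commit that partial progress.
pub fn index_files<F: FnMut(&str)>(files: &[&str], mut process: F) -> usize {
    let mut processed = 0;
    for &file in files {
        if is_shutdown_requested() {
            break;
        }
        process(file);
        processed += 1;
    }
    processed
}
```

Checking the flag only at file boundaries means a file is never left half-written by this loop, at the cost of finishing the current file before reacting to the interrupt.
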
- Log FTS commit errors on CTRL-C instead of silently ignoring
- Clear warning message if commit fails, suggesting -f rebuild
- Prevents Tantivy writer corruption on interrupted shutdowns

- Changed DEFAULT_LMDB_MAP_SIZE_MB: 128MB → 512MB
- 128MB was too small, causing MDB_MAP_FULL errors
- 512MB sufficient for most codebases (~100k chunks)
- Still configurable via CODESEARCH_LMDB_MAP_SIZE_MB env var

Fixes intermittent MDB_MAP_FULL errors during indexing.

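A minimal sketch of how a default plus an environment override might be resolved. `DEFAULT_LMDB_MAP_SIZE_MB` and `CODESEARCH_LMDB_MAP_SIZE_MB` are the names from the commit message; the helper functions and the rule that unparsable or zero values fall back to the default are assumptions, not the project's confirmed behavior.

```rust
/// Default from the commit message above.
pub const DEFAULT_LMDB_MAP_SIZE_MB: usize = 512;

/// Resolves the map size in MB from an optional raw override value.
/// Falling back to the default on unparsable or zero input is an assumption.
pub fn lmdb_map_size_mb(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&mb| mb > 0)
        .unwrap_or(DEFAULT_LMDB_MAP_SIZE_MB)
}

/// Reads the override from the environment and converts it to bytes.
pub fn lmdb_map_size_bytes_from_env() -> usize {
    let raw = std::env::var("CODESEARCH_LMDB_MAP_SIZE_MB").ok();
    lmdb_map_size_mb(raw.as_deref()) * 1024 * 1024
}
```

Keeping the parsing in a function that takes `Option<&str>` makes the fallback rules testable without mutating process-wide environment state.
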
- Remove arena_reset_interval and reset_embedder() logic
- Keep arena_allocator=true for fast memory reuse
- Keep LMDB map_size=512MB (MDB_MAP_FULL fix)
- Keep embedding cache=100MB
- Keep CTRL-C handling (ctrlc + is_shutdown_requested)
- Keep logging fixes (removed "Loading embedding model" spam)
- Simplified: no model reload overhead, single clean scrollbar

Overhead removal: no periodic ONNX session unload/reload, resulting in faster indexing without memory reset interruptions.

flupkede left a comment
Code Review: LMDB Resilience & Git-Aware Index
Re-submitting review with proper inline comments (previous review was malformed).
3 issues found: 1 🔴 Critical, 2 🟡 Warnings.
- watch/mod.rs: replace blocking std::fs::read_to_string with tokio::fs::read_to_string().await in GitHeadWatcher::check()
- vectordb/store.rs: pass &[EmbeddedChunk] slice to insert_chunks_with_ids_impl() instead of cloning Vec on every retry
- embed/cache.rs: correct misleading comment: LMDB iterates in lexicographic (b-tree) order, not insertion order

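The slice-instead-of-clone fix can be sketched as follows. `EmbeddedChunk` is named in the commit message, but its fields here are invented, and `InsertError` and `insert_with_retry()` are hypothetical; the real `insert_chunks_with_ids_impl()` signature is not shown on this page.

```rust
/// Stand-in for the real chunk type; the fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddedChunk {
    pub id: u32,
    pub embedding: Vec<f32>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum InsertError {
    MapFull,
}

/// Retries an insert up to `max_attempts` times. Every attempt borrows the
/// same slice, so no per-retry `Vec` clone is needed.
pub fn insert_with_retry<F>(
    chunks: &[EmbeddedChunk],
    max_attempts: usize,
    mut try_insert: F,
) -> Result<usize, InsertError>
where
    F: FnMut(&[EmbeddedChunk]) -> Result<usize, InsertError>,
{
    let mut last_err = InsertError::MapFull;
    for _ in 0..max_attempts {
        match try_insert(chunks) {
            Ok(inserted) => return Ok(inserted),
            // Real code would presumably grow the LMDB map before retrying.
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```

Because the callee only reads the chunks, a shared borrow is enough, and the retry loop costs no extra allocations regardless of how many attempts it takes.
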
flupkede left a comment
Code Review Summary
Files reviewed: 33
Issues found: 20 (🔴 7 critical, 🟡 8 warnings, 🟢 5 suggestions)
🔴 Critical Issues
- `src/index/manager.rs`: Old chunk removal step was dropped from `index_single_file`. Re-indexing a modified file now inserts duplicate chunks without removing the old ones. This is a correctness regression causing stale/duplicate search results and unbounded DB growth.
- `src/cli/mod.rs`: `cache stats` and `cache clear` commands always error when no model is specified. `Option::map(...).ok_or_else(...)` converts `None` → `Err`, making the all-models code path unreachable. Fix with `Option::transpose()`.
- `src/index/mod.rs`: `bloat_ratio` is double-wrapped: the variable is already `Option<f64>` but wrapped in `Some(...)` again, so the field is always `Some`. The formula also computes bytes-per-chunk × 100 (not a ratio).
- `src/vectordb/store.rs`: `/// Statistics about the vector store` doc comment now applies to `LmdbPageStats` (wrong struct). `LmdbPageStats` was inserted between the comment and its target `StoreStats`, misattributing both doc comments.
- `src/index/mod.rs`: `.unwrap()` used after explicit `.is_some()` guard (violates AGENTS.md code style guide).
- `src/embed/mod.rs`: `.unwrap()` used after explicit `.is_none()` guard (same violation).
- `src/constants.rs`: `DEFAULT_EMBEDDING_CACHE_MAX_ENTRIES` suppressed with `#[allow(dead_code)]` because it is not wired up, which hides an incomplete integration.
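To make the `cache stats` / `cache clear` finding concrete, here is a minimal sketch of the two shapes. `ModelType`, `parse_model()`, and both `select_model*` functions are hypothetical; only the `Option::map(...).ok_or_else(...)` versus `Option::transpose()` contrast comes from the review.

```rust
/// Hypothetical model identifiers for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ModelType {
    Small,
    Large,
}

pub fn parse_model(name: &str) -> Result<ModelType, String> {
    match name {
        "small" => Ok(ModelType::Small),
        "large" => Ok(ModelType::Large),
        other => Err(format!("unknown model: {other}")),
    }
}

/// Buggy shape: `ok_or_else` turns "no model given" into an error,
/// so the "all models" path (`Ok(None)`) can never be reached.
pub fn select_model_buggy(arg: Option<&str>) -> Result<Option<ModelType>, String> {
    let model = arg
        .map(parse_model)
        .ok_or_else(|| "no model specified".to_string())??;
    Ok(Some(model))
}

/// Fixed shape: `transpose` turns `Option<Result<T, E>>` into
/// `Result<Option<T>, E>`, keeping `None` as a valid "all models" value.
pub fn select_model(arg: Option<&str>) -> Result<Option<ModelType>, String> {
    arg.map(parse_model).transpose()
}
```

With the fixed shape, a missing argument stays `Ok(None)` and only a present-but-invalid model name becomes an error.
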
🟡 Warnings
- `src/vectordb/store.rs`: `map_size_mb` is `pub`, which exposes state that `resize_environment()` should own exclusively. Use a getter.
- `src/index/manager.rs`: `Option<GitHeadWatcher>` field is always `Some(...)`, which makes the type misleading. Use `GitHeadWatcher` directly or actually store `None` for non-git repos.
- `src/index/mod.rs`: `find_project_root` kept with `#[allow(dead_code)]` after being superseded by `find_git_root`. Remove or add a TODO.
- `src/watch/mod.rs`: `get_current_head` uses synchronous `std::fs::read_to_string` inside an async struct, which blocks the Tokio executor thread. Use `tokio::fs`.
- `src/mcp/mod.rs`: Raw-path fallback `|| chunk.path == request.path` removed without regression tests for Windows UNC paths and git-worktree paths.
- `src/embed/cache.rs`: `PersistentEmbeddingCache` has no call sites, so it is entirely dead code. Acceptable if preparatory; add a TODO linking to the follow-up issue.
- `src/cli/mod.rs`: `std::io::stdin().read_line` in an `async fn` blocks the Tokio executor. Use `tokio::io::AsyncBufReadExt`.
- `src/db_discovery/mod.rs`: `into_iter().next().unwrap()` after `.is_empty()` check (pre-existing but worth addressing).
🟢 Suggestions
- `src/vectordb/store.rs`: Bind `error.to_string()` once in `is_map_full_error`; it is currently called twice.
- `src/mcp/mod.rs`: Shadow bindings for `normalized_path` / `normalized_filter` are confusing. Merge into single bindings.
- `src/db_discovery/mod.rs`: Use `let Some(name) = ... else { continue }` instead of `unwrap_or_default()` for cleaner intent.
- `src/embed/mod.rs`: `#[allow(dead_code)]` placed after closing `}` of previous function. Move it to directly precede `pub fn` or the doc comment.
- `src/file/mod.rs`: `add_skipped_binary()` called for empty and extension-skipped files, which inflates the binary counter. Consider separate counters.
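Two of the suggestions above lend themselves to a short sketch. The function name `is_map_full_error` is from the review, but its parameter type and the substrings it matches are assumptions; `database_dir_names()` is a hypothetical function that only exists to show the `let ... else` pattern.

```rust
use std::path::Path;

/// Formats the error once and reuses the string for every check.
/// The matched substrings are assumptions, not the project's actual list.
pub fn is_map_full_error(error: &dyn std::fmt::Display) -> bool {
    let message = error.to_string();
    message.contains("MDB_MAP_FULL") || message.contains("MapFull")
}

/// Collects directory names, skipping paths without a usable file name
/// via `let ... else` rather than defaulting to an empty string.
pub fn database_dir_names(paths: &[&Path]) -> Vec<String> {
    let mut names = Vec::new();
    for path in paths {
        let Some(name) = path.file_name().and_then(|n| n.to_str()) else {
            continue;
        };
        names.push(name.to_string());
    }
    names
}
```

The `let ... else` form makes the skip explicit at the point of failure, whereas `unwrap_or_default()` would silently carry an empty name into later logic.
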
Overall Assessment
The feature work (LMDB resilience, git-aware index placement, branch change detection, persistent embedding cache infrastructure) is architecturally sound. GitHeadWatcher, find_git_root, db_discovery, file_meta, and the path normalization tests are all clean and well-tested. However, two correctness regressions (duplicate chunks on re-index, broken cache stats) and several .unwrap() violations need to be addressed before merging.
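For readers unfamiliar with the git-aware index placement mentioned in this assessment, a minimal sketch of the idea follows. The name `find_git_root` and the `.codesearch.db` directory name appear in this PR, but the body below is a guess at the approach, and `index_db_path()` and its fallback to the start directory are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Walks from `start` up through its ancestors and returns the first
/// directory containing a `.git` entry. `exists()` is true both for a
/// regular `.git` directory and for the `.git` file that a git worktree
/// uses. Pass an absolute path so every ancestor is visited.
pub fn find_git_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Chooses where the index would live: at the git root when there is one,
/// otherwise next to `start` (the fallback is an assumption).
pub fn index_db_path(start: &Path) -> PathBuf {
    find_git_root(start)
        .unwrap_or_else(|| start.to_path_buf())
        .join(".codesearch.db")
}
```

Anchoring the database at the repository root means running the tool from any subdirectory resolves to the same index instead of creating one per working directory.
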
f8bdab5 to 80d5b15
Summary
Changes
LMDB Resilience & Bloat Detection
- `lmdb_page_stats()` method using `env.non_free_pages_size()` and `env.real_disk_size()` for accurate bloat measurement
- `5500 bytes/chunk` estimate with real LMDB API stats

Git-Aware Indexing
- `GitHeadWatcher` to detect branch changes via `.git/HEAD` file polling
- `find_git_root()` to place `.codesearch.db` at git repository root
- `.git` dirs and git worktree `.git` files

Doctor Improvements
- `db_info.project_path` instead of relative `Path::new(".")`
- `ALWAYS_SKIP_EXTENSIONS` (40+ patterns) and `ALWAYS_SKIP_FILENAME_SUFFIXES` (14 patterns)

Encoding Handling
- `std::fs::read_to_string()` first
- `String::from_utf8_lossy()` for non-UTF-8 files (ISO-8859-1, Windows-1252)
- `�` to continue indexing

Embedding Cache

File System Watcher Unification
- `INDEXABLE_EXTENSIONS` and `IGNORED_DIRS` arrays
- `ALWAYS_EXCLUDED`, `ALWAYS_SKIP_EXTENSIONS`, `ALWAYS_SKIP_FILENAME_SUFFIXES`
- `Language::from_path(path).is_indexable()`

Developer Experience
Testing
Breaking Changes
Checklist
Commits: 9 | Files changed: 34 | Additions: 6,362 | Deletions: 401
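The fallback listed under Encoding Handling (try `std::fs::read_to_string()` first, then `String::from_utf8_lossy()` for non-UTF-8 files) can be sketched as below. The function name `read_source_lossy` is hypothetical, and matching on `ErrorKind::InvalidData` is an assumption about how the PR detects invalid UTF-8.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads a file as UTF-8; if the bytes are not valid UTF-8, re-reads them
/// and substitutes U+FFFD for each invalid sequence so that indexing can
/// continue. Other I/O errors are returned unchanged.
pub fn read_source_lossy(path: &Path) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        // `read_to_string` reports invalid UTF-8 as `InvalidData`.
        Err(err) if err.kind() == io::ErrorKind::InvalidData => {
            let bytes = fs::read(path)?;
            Ok(String::from_utf8_lossy(&bytes).into_owned())
        }
        Err(err) => Err(err),
    }
}
```

Lossy decoding does not recover the original characters of an ISO-8859-1 or Windows-1252 file; it only keeps the ASCII portion searchable, which is usually most of a source file.
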