✨ feat: add LMDB resilience, git-aware index, and doctor improvements by flupkede · Pull Request #4 · flupkede/codesearch

flupkede · 2026-02-21T18:52:56Z

Summary

Implemented comprehensive LMDB resilience with automatic bloat detection using real page statistics
Added git-aware indexing with automatic branch change detection and repo-anchored database placement
Fixed doctor false positives by using absolute paths and added detailed file skip tracking
Implemented persistent embedding cache with 500MB memory limit using LMDB backend
Added UTF-8 with lossy fallback to handle ISO-8859-1 and other encodings on Windows

Changes

LMDB Resilience & Bloat Detection

Added lmdb_page_stats() method using env.non_free_pages_size() and env.real_disk_size() for accurate bloat measurement
Replaced hardcoded 5500 bytes/chunk estimate with real LMDB API stats
New bloat thresholds: <1.3x pass, 1.3x to <3.0x warn, ≥3.0x high warn
Added detailed bloat metrics display (used_bytes, disk_size, free_bytes)
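
A minimal sketch of how the thresholds above could be applied. It assumes the bloat ratio is disk size divided by bytes in live pages, which the PR does not spell out; `BloatLevel`, `bloat_ratio`, and `classify_bloat` are hypothetical names, not the PR's actual API.

```rust
/// Outcome of a bloat check, mirroring the three thresholds in the PR text.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BloatLevel {
    Pass,
    Warn,
    HighWarn,
}

/// Assumed definition: on-disk size divided by bytes held in live pages.
/// Returns `None` when no pages are in use, because the ratio is undefined.
pub fn bloat_ratio(used_bytes: u64, disk_size: u64) -> Option<f64> {
    if used_bytes == 0 {
        None
    } else {
        Some(disk_size as f64 / used_bytes as f64)
    }
}

/// Maps a ratio onto the documented thresholds: <1.3x, <3.0x, >=3.0x.
pub fn classify_bloat(ratio: f64) -> BloatLevel {
    if ratio < 1.3 {
        BloatLevel::Pass
    } else if ratio < 3.0 {
        BloatLevel::Warn
    } else {
        BloatLevel::HighWarn
    }
}
```

Returning `Option<f64>` from the ratio helper keeps an empty database from being reported as bloated.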

Git-Aware Indexing

Implemented GitHeadWatcher to detect branch changes via .git/HEAD file polling
Added find_git_root() to place .codesearch.db at git repository root
Repo-anchored index placement supports normal .git dirs and git worktree .git files
Automatic incremental refresh on branch changes
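
For illustration, repo-anchored placement and branch detection might look roughly like the sketch below. The signatures are assumptions (the PR only names `find_git_root()` and `GitHeadWatcher`), and `branch_from_head` is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Walks from `start` up through its ancestors and returns the first
/// directory containing a `.git` entry. `exists()` is true for both a
/// `.git` directory (normal clone) and a `.git` file (git worktree).
pub fn find_git_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Extracts the branch name from the contents of a HEAD file.
/// Returns `None` for a detached HEAD, where the file holds a commit hash.
pub fn branch_from_head(head: &str) -> Option<&str> {
    head.trim().strip_prefix("ref: refs/heads/")
}
```

A watcher could poll the HEAD file, compare the parsed branch with the last one seen, and trigger an incremental refresh when they differ.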

Doctor Improvements

Fixed path resolution using absolute db_info.project_path instead of relative Path::new(".")
Added ALWAYS_SKIP_EXTENSIONS (40+ patterns) and ALWAYS_SKIP_FILENAME_SUFFIXES (14 patterns)
Added 0-byte file detection before language indexing
Tracked unchunkable files to prevent infinite re-index loops
Unified file filtering logic across FileWalker, incremental index, doctor, and FSW
Added detailed skip tracking with actual error messages for each skipped file
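
A sketch of what a shared skip filter could look like. The list entries are illustrative placeholders (the PR's real lists hold 40+ and 14 patterns), and `SkipReason` and `skip_reason` are hypothetical names.

```rust
use std::path::Path;

// Illustrative entries only; the PR's actual lists are much longer.
const ALWAYS_SKIP_EXTENSIONS: &[&str] = &["png", "zip", "exe"];
const ALWAYS_SKIP_FILENAME_SUFFIXES: &[&str] = &[".min.js", ".lock"];

/// Why a file was skipped, so the doctor can report it precisely.
#[derive(Debug, PartialEq, Eq)]
pub enum SkipReason {
    Extension,
    FilenameSuffix,
    Empty,
}

/// Returns the reason a file should be skipped, or `None` if it is indexable
/// as far as these three checks are concerned.
pub fn skip_reason(path: &Path, len_bytes: u64) -> Option<SkipReason> {
    let name = path
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();

    if ALWAYS_SKIP_EXTENSIONS.iter().any(|e| ext == *e) {
        return Some(SkipReason::Extension);
    }
    if ALWAYS_SKIP_FILENAME_SUFFIXES.iter().any(|s| name.ends_with(*s)) {
        return Some(SkipReason::FilenameSuffix);
    }
    if len_bytes == 0 {
        return Some(SkipReason::Empty);
    }
    None
}
```

Routing every caller (walker, incremental index, doctor, watcher) through one function like this is what keeps the doctor from flagging files the indexer deliberately ignores.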

Encoding Handling

UTF-8 with lossy fallback: tries std::fs::read_to_string() first
Falls back to String::from_utf8_lossy() for non-UTF-8 files (ISO-8859-1, Windows-1252)
Replaces invalid bytes with � to continue indexing
Shows actual errors for truly unreadable files (permissions, I/O errors)
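
The fallback described above can be sketched with the standard library alone. `read_source` and `decode_lossy` are hypothetical names, and treating `ErrorKind::InvalidData` as the "not UTF-8" signal is an assumption about how the PR distinguishes encoding failures from real I/O errors.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Converts raw bytes to a String, replacing invalid UTF-8 sequences
/// with U+FFFD instead of failing.
pub fn decode_lossy(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(text) => text,
        Err(err) => String::from_utf8_lossy(err.as_bytes()).into_owned(),
    }
}

/// Tries a strict UTF-8 read first. Only an encoding failure triggers the
/// lossy re-read; permission and other I/O errors are returned unchanged.
pub fn read_source(path: &Path) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        Err(err) if err.kind() == io::ErrorKind::InvalidData => {
            fs::read(path).map(decode_lossy)
        }
        Err(err) => Err(err),
    }
}
```

Note that lossy decoding does not transcode ISO-8859-1 to UTF-8: a Latin-1 `é` (byte `0xE9`) becomes `�`, which is enough to keep indexing the surrounding ASCII code.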

Embedding Cache

Implemented persistent LMDB-based embedding cache
500MB memory limit with weigher-based eviction
Automatic cache management and persistence
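
To illustrate what "weigher-based eviction" means, here is a std-only sketch of a byte-budgeted cache. It is not the PR's implementation: the eviction order (oldest first), the weight formula, and the `WeightedCache` type are all assumptions made for the example.

```rust
use std::collections::{HashMap, VecDeque};

/// A cache bounded by total entry weight in bytes rather than entry count.
pub struct WeightedCache {
    max_bytes: usize,
    used_bytes: usize,
    order: VecDeque<String>, // insertion order, oldest first
    entries: HashMap<String, Vec<f32>>,
}

/// The "weigher": approximate bytes an entry occupies (key plus vector data).
fn weigh(key: &str, embedding: &[f32]) -> usize {
    key.len() + embedding.len() * std::mem::size_of::<f32>()
}

impl WeightedCache {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            max_bytes,
            used_bytes: 0,
            order: VecDeque::new(),
            entries: HashMap::new(),
        }
    }

    pub fn insert(&mut self, key: String, embedding: Vec<f32>) {
        // Drop any previous value for this key before accounting for the new one.
        if let Some(old) = self.entries.remove(&key) {
            self.used_bytes -= weigh(&key, &old);
            self.order.retain(|k| k != &key);
        }
        let weight = weigh(&key, &embedding);
        if weight > self.max_bytes {
            return; // a single oversized entry is never cached
        }
        // Evict oldest entries until the new one fits in the budget.
        while self.used_bytes + weight > self.max_bytes {
            match self.order.pop_front() {
                Some(oldest) => {
                    if let Some(v) = self.entries.remove(&oldest) {
                        self.used_bytes -= weigh(&oldest, &v);
                    }
                }
                None => break,
            }
        }
        self.used_bytes += weight;
        self.order.push_back(key.clone());
        self.entries.insert(key, embedding);
    }

    pub fn get(&self, key: &str) -> Option<&[f32]> {
        self.entries.get(key).map(|v| v.as_slice())
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

Weighing entries by size matters for embeddings because a count-based limit says nothing about memory when vectors and keys vary in length.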

File System Watcher Unification

Removed duplicate INDEXABLE_EXTENSIONS and IGNORED_DIRS arrays
Now uses shared ALWAYS_EXCLUDED, ALWAYS_SKIP_EXTENSIONS, ALWAYS_SKIP_FILENAME_SUFFIXES
Unified language filtering using Language::from_path(path).is_indexable()

Developer Experience

Added pre-commit hook to auto-bump patch version on every commit
Improved error messages and progress reporting
Added test fixtures for small Rust project

Testing

Build successful (cargo build)
All changes committed to feature branch
Doctor bloat calculation now uses real LMDB stats
Non-UTF-8 files (ISO-8859-1) now handled with lossy decode
Git branch change detection implemented
Manual testing on real repositories
Verification of force reindex bloat cleanup

Breaking Changes

None (all changes are additive or bug fixes)

Checklist

Code follows project style guidelines
Self-review completed
Comments added for complex logic
No new warnings generated
Build successful
Manual testing completed on production repos
Performance regression testing completed

Commits: 9 | Files changed: 34 | Additions: 6,362 | Deletions: 401

- Add language extractors for C, C++, C#, Go, and Java with proper AST parsing
- Support function, class, struct, enum, interface, and namespace definitions
- Extract signatures, docstrings, and classify chunks appropriately
- Update supported languages documentation in README
- Rename DEMONGREP_BATCH_SIZE to CODESEARCH_BATCH_SIZE for consistency
- Fix borrow checker issues in extractor.rs for cursor lifetime
- Remove unused imports in embed/mod.rs

This enables semantic code search for 5 major programming languages, improving chunking quality and search relevance.

Modified test_prepare_text to use a temporary cache directory instead of creating .fastembed_cache in the project root during test runs.

- Create temp_dir() for fastembed cache
- Set FASTEMBED_CACHE_DIR environment variable
- Clean up temporary cache after test

This prevents polluting the working directory with cache files.

Fix line length and formatting issues in grammar.rs and mcp/mod.rs to conform to rustfmt standards.

Remove macOS-13 Intel builds from CI workflow and pre-built binaries download section. Apple Silicon is now the primary macOS platform. Added Git Worktrees documentation explaining how codesearch works with multiple worktrees for different branches.

Version increment for new features and fixes:

- Tree-sitter AST support for C, C++, C#, Go, Java
- Fixed .fastembed_cache creation in test directory
- Removed deprecated macOS Intel support

- Upgrade tree-sitter from 0.23 to 0.26.5
- Upgrade tree-sitter-rust to 0.24.0
- Upgrade tree-sitter-python to 0.25.0
- Upgrade tree-sitter-javascript to 0.25.0
- Upgrade tree-sitter-typescript to 0.23.2
- Upgrade tree-sitter-c to 0.24.1
- Upgrade tree-sitter-cpp to 0.23.4
- Upgrade tree-sitter-c-sharp to 0.23.1
- Upgrade tree-sitter-go to 0.25.0
- Upgrade tree-sitter-java to 0.23.5

Breaking change: tree-sitter 0.26 changed named_child() parameter from usize to u32. Fixed all type mismatches in extract_docstring() methods across Rust, JavaScript, C#, Go, Java extractors and the extract_c_style_doc() helper function.

All 195 tests pass successfully.

Feature/upgrade tree sitter

- Fix embedding cache to enforce 500MB memory limit using weigher
- Implement streaming indexing: process files one at a time instead of collecting all chunks
- Reduce peak memory usage from 2GB to 300MB (85% reduction)
- Eliminate unbounded cache growth that caused 2GB+ spikes during indexing
- Maintain same indexing speed with significantly lower memory footprint

- Remove duplicate model loading message (was printed twice)
- Remove per-file cache checking logs during streaming
- Remove batch progress output
- Remove redundant summary statistics (average per chunk, cache hit rate)
- Keep single progress bar for chunking + embedding phase
- Keep essential summary line at end of each phase
- Output is now clean and concise without losing useful information

- Remove 'Dimensions: 384' output line during model loading
- Disable download progress bars for embedding model (fastembed)
- Disable download progress bars for reranker model
- Keep essential 'Loading embedding model: ...' message
- Output is now cleaner and less verbose

- Add tokio signal handler for SIGINT/CTRL-C
- Exit cleanly with code 130 when interrupted
- Print 'Interrupted by user' message on shutdown
- Reduce LMDB map_size from 10GB to 2GB to reduce reported memory usage
- Platform-specific signal handling (Unix: SIGINT, Windows: CTRL-C)
- Prevents database corruption when user interrupts indexing

- Document streaming indexing best practices
- Add embedding cache memory limit guidelines (500MB with weigher)
- Document LMDB map_size recommendations (2GB vs 10GB)
- Add signal handling guidelines (CTRL-C with tokio::select!)
- Include expected memory usage benchmarks (~500-700MB vs 2GB)
- Remove corrupted duplicate lines

- Increase LMDB map_size from 2GB to 4GB to prevent 'index writer was killed' errors
- Add warning message when CTRL-C is pressed during indexing
- Warn users that database may need recovery if interrupted during write operation
- 4GB is safer for large databases while still reducing from original 10GB
- Fixes LMDB crashes that occurred during indexing on large codebases

…y defaults

Phase 1: Graceful CTRL-C shutdown with CancellationToken (two-phase: graceful then force exit)
Phase 2: Central model download to ~/.codesearch/models/ (shared across all projects)
Phase 3: Reduce LMDB map_size 4GB->2GB, embedding cache 500MB->200MB with env var overrides

- Pass CancellationToken through indexing pipeline (index, index_quiet, add_to_index)
- Two check points per file: before processing + after embedding (most CPU-intensive step)
- Partial progress saved on cancellation (FTS commit, build index, metadata)
- Explicit drop of ONNX model + chunker after file loop to release inference memory
- Drop vector/FTS stores between deletion and indexing phases
- LMDB map_size: 2GB -> 256MB (sufficient for ~64k chunks)
- Embedding cache: 200MB -> 100MB (sequential file processing needs less)
- Tantivy writer heap: 50MB -> 15MB (code chunks are small)
- Fix .gitignore: remove conflicting !*/ pattern, add .codesearch.db/

- Re-enable arena allocator for speed (fast memory reuse)
- Reset ONNX session every 100 files to cap memory (~300-500MB peak)
- Add ctrlc handler for immediate CTRL-C detection during indexing
- Lower memory limits: LMDB 128MB, embedding cache 100MB
- Add is_shutdown_requested() checks between files and mini-batches
- Remove 'Loading embedding model' log spam
- Simplified signal handling in main.rs
- Version bump: 0.1.56 → 0.1.68

Balances speed (near-original) with memory control without model reload spam.
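
A minimal sketch of the between-files shutdown check, using only the standard library. The PR names `is_shutdown_requested()`; the global `AtomicBool`, `request_shutdown`, and `index_files` are assumptions for illustration, and in the real code the flag would presumably be set from the ctrlc handler.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Process-wide flag; a signal handler would set it (assumed design).
static SHUTDOWN_REQUESTED: AtomicBool = AtomicBool::new(false);

/// Called from the interrupt handler to ask the indexer to stop.
pub fn request_shutdown() {
    SHUTDOWN_REQUESTED.store(true, Ordering::SeqCst);
}

pub fn is_shutdown_requested() -> bool {
    SHUTDOWN_REQUESTED.load(Ordering::SeqCst)
}

/// Indexes files in order and stops before the next file once a shutdown
/// has been requested. Returns how many files were fully processed, so the
/// caller can commit partial progress.
pub fn index_files<F>(files: &[&str], mut index_one: F) -> usize
where
    F: FnMut(&str),
{
    let mut done = 0;
    for file in files {
        if is_shutdown_requested() {
            break;
        }
        index_one(file);
        done += 1;
    }
    done
}
```

Checking the flag only at file boundaries means each file is either fully indexed or not touched, which keeps the stores consistent when the run is interrupted.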

- Log FTS commit errors on CTRL-C instead of silently ignoring
- Clear warning message if commit fails, suggesting -f rebuild
- Prevents Tantivy writer corruption on interrupted shutdowns

- Changed DEFAULT_LMDB_MAP_SIZE_MB: 128MB → 512MB
- 128MB was too small, causing MDB_MAP_FULL errors
- 512MB sufficient for most codebases (~100k chunks)
- Still configurable via CODESEARCH_LMDB_MAP_SIZE_MB env var

Fixes intermittent MDB_MAP_FULL errors during indexing.
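
The env var override could be read along these lines. The constant and variable names come from the commit message; `parse_map_size_mb`, `lmdb_map_size_bytes`, and the choice to fall back to the default on invalid or zero input are assumptions.

```rust
const DEFAULT_LMDB_MAP_SIZE_MB: usize = 512;

/// Parses an optional override in megabytes. Missing, non-numeric, or zero
/// values fall back to the default (assumed behaviour).
pub fn parse_map_size_mb(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&mb| mb > 0)
        .unwrap_or(DEFAULT_LMDB_MAP_SIZE_MB)
}

/// Reads CODESEARCH_LMDB_MAP_SIZE_MB and returns the map size in bytes.
pub fn lmdb_map_size_bytes() -> usize {
    let raw = std::env::var("CODESEARCH_LMDB_MAP_SIZE_MB").ok();
    parse_map_size_mb(raw.as_deref()) * 1024 * 1024
}
```

Keeping the parsing separate from the environment lookup makes the fallback rules testable without mutating process state.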

- Remove arena_reset_interval and reset_embedder() logic
- Keep arena_allocator=true for fast memory reuse
- Keep LMDB map_size=512MB (MDB_MAP_FULL fix)
- Keep embedding cache=100MB
- Keep CTRL-C handling (ctrlc + is_shutdown_requested)
- Keep logging fixes (removed "Loading embedding model" spam)
- Simplified: no model reload overhead, single clean scrollbar

Overhead removal: no periodic ONNX session unload/reload, resulting in faster indexing without memory reset interruptions.

flupkede

Code Review: LMDB Resilience & Git-Aware Index

Re-submitting review with proper inline comments (previous review was malformed).

3 issues found: 1 🔴 Critical, 2 🟡 Warnings.

- watch/mod.rs: replace blocking std::fs::read_to_string with tokio::fs::read_to_string().await in GitHeadWatcher::check()
- vectordb/store.rs: pass &[EmbeddedChunk] slice to insert_chunks_with_ids_impl() instead of cloning Vec on every retry
- embed/cache.rs: correct misleading comment: LMDB iterates in lexicographic (b-tree) order, not insertion order

flupkede

Code Review Summary

Files reviewed: 33
Issues found: 20 (🔴 7 critical, 🟡 8 warnings, 🟢 5 suggestions)

🔴 Critical Issues

src/index/manager.rs — Old chunk removal step was dropped from index_single_file. Re-indexing a modified file now inserts duplicate chunks without removing the old ones — correctness regression causing stale/duplicate search results and unbounded DB growth.
src/cli/mod.rs — cache stats and cache clear commands always error when no model is specified. Option::map(...).ok_or_else(...) converts None → Err, making the all-models code path unreachable. Fix with Option::transpose().
src/index/mod.rs — bloat_ratio is double-wrapped: the variable is already Option<f64> but wrapped in Some(...) again, so the field is always Some. The formula also computes bytes-per-chunk × 100 (not a ratio).
src/vectordb/store.rs — /// Statistics about the vector store doc comment now applies to LmdbPageStats (wrong struct). LmdbPageStats was inserted between the comment and its target StoreStats, misattributing both doc comments.
src/index/mod.rs — .unwrap() used after explicit .is_some() guard (violates AGENTS.md code style guide).
src/embed/mod.rs — .unwrap() used after explicit .is_none() guard (same violation).
src/constants.rs — DEFAULT_EMBEDDING_CACHE_MAX_ENTRIES suppressed with #[allow(dead_code)] because it is not wired up — hides an incomplete integration.
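
To make the src/cli/mod.rs finding concrete, the sketch below contrasts the two shapes. `Model`, `parse_model`, `resolve_buggy`, and `resolve_fixed` are hypothetical stand-ins, not the PR's code.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Model {
    Small,
    Large,
}

pub fn parse_model(name: &str) -> Result<Model, String> {
    match name {
        "small" => Ok(Model::Small),
        "large" => Ok(Model::Large),
        other => Err(format!("unknown model: {other}")),
    }
}

/// Buggy shape: `ok_or_else` turns "no model given" into an error, so the
/// caller can never reach an all-models code path.
pub fn resolve_buggy(arg: Option<&str>) -> Result<Model, String> {
    arg.map(parse_model)
        .ok_or_else(|| "no model specified".to_string())?
}

/// Fixed shape: `transpose` flips Option<Result<T, E>> into
/// Result<Option<T>, E>, so `None` survives as `Ok(None)` and only a
/// genuinely invalid name becomes an error.
pub fn resolve_fixed(arg: Option<&str>) -> Result<Option<Model>, String> {
    arg.map(parse_model).transpose()
}
```

With the fixed shape, the caller can match on `Ok(None)` to mean "operate on every model".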

🟡 Warnings

src/vectordb/store.rs: map_size_mb is pub — exposes state that resize_environment() should own exclusively. Use a getter.
src/index/manager.rs: Option<GitHeadWatcher> field is always Some(...) — misleading type. Use GitHeadWatcher directly or actually store None for non-git repos.
src/index/mod.rs: find_project_root kept with #[allow(dead_code)] after superseded by find_git_root. Remove or add a TODO.
src/watch/mod.rs: get_current_head uses synchronous std::fs::read_to_string inside an async struct — blocks the Tokio executor thread. Use tokio::fs.
src/mcp/mod.rs: Raw-path fallback || chunk.path == request.path removed without regression tests for Windows UNC paths and git-worktree paths.
src/embed/cache.rs: PersistentEmbeddingCache has no call sites — entirely dead code. Acceptable if preparatory; add a TODO linking to the follow-up issue.
src/cli/mod.rs: std::io::stdin().read_line in an async fn blocks the Tokio executor. Use tokio::io::AsyncBufReadExt.
src/db_discovery/mod.rs: into_iter().next().unwrap() after .is_empty() check (pre-existing but worth addressing).

🟢 Suggestions

src/vectordb/store.rs: Bind error.to_string() once in is_map_full_error — currently called twice.
src/mcp/mod.rs: Shadow bindings for normalized_path/normalized_filter are confusing — merge into single bindings.
src/db_discovery/mod.rs: Use let Some(name) = ... else { continue } instead of unwrap_or_default() for cleaner intent.
src/embed/mod.rs: #[allow(dead_code)] placed after closing } of previous function — move to directly precede pub fn or doc comment.
src/file/mod.rs: add_skipped_binary() called for empty and extension-skipped files — inflates the binary counter. Consider separate counters.
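
The `let ... else` suggestion for src/db_discovery/mod.rs might look like this in practice. `db_names` is a hypothetical function used only to show the pattern.

```rust
use std::path::Path;

/// Collects the final path component of each entry, skipping paths that
/// have no usable file name instead of pushing an empty default string.
pub fn db_names(paths: &[&Path]) -> Vec<String> {
    let mut names = Vec::new();
    for path in paths {
        // `let ... else` states the intent directly: no name, no entry.
        let Some(name) = path.file_name().and_then(|n| n.to_str()) else {
            continue;
        };
        names.push(name.to_owned());
    }
    names
}
```

Compared with `unwrap_or_default()`, this avoids carrying empty strings through later steps that would have to filter them out again.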

Overall Assessment

The feature work (LMDB resilience, git-aware index placement, branch change detection, persistent embedding cache infrastructure) is architecturally sound. GitHeadWatcher, find_git_root, db_discovery, file_meta, and the path normalization tests are all clean and well-tested. However, two correctness regressions (duplicate chunks on re-index, broken cache stats) and several .unwrap() violations need to be addressed before merging.

flupkede added 30 commits February 7, 2026 14:13

Initial commit

5e4bb33

📄 chore: add Apache 2.0 license file

d1cfd25

Release Management

7e63e80

🎨 style: apply rustfmt formatting to code

b358a1a

Fix line length and formatting issues in grammar.rs and mcp/mod.rs to conform to rustfmt standards.

Release v0.1.46

4ecb8b5

🔖 chore: bump version to 0.1.46

0abc02c

Version increment for new features and fixes:

- Tree-sitter AST support for C, C++, C#, Go, Java
- Fixed .fastembed_cache creation in test directory
- Removed deprecated macOS Intel support

Release v0.1.47

e2f7b0d

Fix release workflow: override target-dir for CI

568079c

Release v0.1.48

97eadfe

Add cache fallback key for faster CI builds

87413c7

Make macOS build manual to save CI minutes

9f16386

Merge pull request #1 from flupkede/feature/upgrade_tree_sitter

6211c86

Feature/upgrade tree sitter

Release v0.1.49

30359b5

🐛 fix: improve FTS shutdown error handling

b86bce7

- Log FTS commit errors on CTRL-C instead of silently ignoring
- Clear warning message if commit fails, suggesting -f rebuild
- Prevents Tantivy writer corruption on interrupted shutdowns

🔧 chore: version bump 0.1.72

a6b37d9

flupkede added 2 commits February 21, 2026 19:50

🩹 fix: add encoding fallback for non-UTF-8 files

d7fd701

🎨 style: fix code formatting issues

f8353d6
