Skip to content

🐛 fix: review corrections — LMDB reopen 500, indexing & TUI hardening#120

Merged
flupkede merged 14 commits into
developfrom
fix/review-corrections
Jun 12, 2026
Merged

🐛 fix: review corrections — LMDB reopen 500, indexing & TUI hardening#120
flupkede merged 14 commits into
developfrom
fix/review-corrections

Conversation

@flupkede

Copy link
Copy Markdown
Owner

Summary

Review-driven correctness and hardening pass across the serve hub, indexing, LMDB layer, and TUI. The headline fix resolves the production 500 an environment is already opened with different options when reopening a repo DB (e.g. after idle eviction + force-reindex).

Key Changes

LMDB "different options" 500 (root-cause fix)

  • Per-path map_size pin (vectordb/store.rs): the first resolved map size for a canonical DB path is pinned for the process lifetime (monotonic, capped at MAX_LMDB_MAP_SIZE_MB); resize_environment raises the pin so post-eviction reopens match the still-live heed env. heed rejects reopens whose map_size differs from a still-live env, and its env close is deferred — so consistent sizing is what actually prevents the error (proven by test: same-size concurrent churn is clean, different-size is not).
  • TrackedEnv::drop reorder (lmdb_registry.rs): drop the heed::Env before freeing our own registry slot, enforcing "our slot free ⟹ heed's slot free" as complementary hardening.

Serve / indexing

  • serve/mod.rs: idle-reaper + reopen path hardening; stop_fsw drops Readonly/Conflicted repos so the LMDB env is released before reopen; /info + /doctor route wiring; partial-deletion window fix.
  • index/manager.rs, search/mod.rs: lazy embed-model resolution (fixes no-op refresh regression).
  • db_discovery/repos.rs: stale-path resilience + git_remote_url hardening; reconcile on load.

Chunker / embed

  • chunker/jupyter.rs, semantic.rs, extractor.rs: Jupyter line-count + caveat, Dart classification.
  • embed/embedder.rs, cli/mod.rs: --model index-corruption fix + model-list dedup.

TUI

  • serve/tui*.rs: DRY cleanup + hygiene; remote /info + /doctor endpoints.

Testing

  • cargo fmt --all -- --check
  • cargo clippy --lib --all-features -- -D warnings
  • cargo test --lib (lmdb_registry race guard 8×4000 + store pin tests + vectordb suite green)
  • Manual: reopen a resized repo DB after idle eviction (no 500)

Checklist

  • Self-review done
  • No new warnings

Test User and others added 14 commits June 12, 2026 09:35
…ist (M6)

C1 (CRITICAL): perform_incremental_refresh_with_stores hardcoded
ModelType::default() for embedding while force_reindex wrote the --model
override only into metadata.json. Choosing a non-384d model (bge-base/
bge-large/mxbai-large) recorded the new dimension in metadata but still
produced 384d vectors -> dimension mismatch / corrupt index.

Fix: resolve the embedding model from the short name recorded in
metadata.json, fail fast on an unknown model or a dims/model mismatch,
and thread the resolved ModelType into the embedding closure instead of
the hardcoded default.

M6: replace the two hand-maintained "valid models" lists (CLI add --model
and serve POST /repos), which both omitted bge-large, with a single
ModelType::valid_short_names() derived from all().

Tests: short_name round-trips through parse() for every model;
valid_short_names() lists every model incl. bge-large.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er path

index_single_file (reached from the file-watcher loop and git branch-change
refresh) still hardcoded ModelType::default() for embedding, re-introducing
the exact dimension-mismatch corruption C1 set out to fix: a repo indexed
with a non-default model would re-embed changed files at 384d on the next
edit.

Extract resolve_embed_model(db_path) -> (ModelType, usize) as the single
source for model resolution (reads metadata, fail-fast on unknown model /
dims mismatch) and use it in BOTH perform_incremental_refresh_with_stores
and index_single_file. Removes the duplicated inline resolution block and
the redundant closure rebind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tant-time auth (M2)

M1: the generated post-checkout hook embedded $(pwd) directly into the JSON
request body. A repo path containing a double quote or backslash broke out of
the JSON string literal (malformed body / injection). Now JSON-escape the path
in pure bash (backslashes then quotes) before embedding — no jq dependency,
handles Windows paths.

M2: API key was compared with raw string `==`, which short-circuits on the
first mismatched byte — a timing side-channel on the network-exposed
require_auth_for_network path (whose "localhost-only, timing impractical"
justification was actually copied from the admin middleware). Add
api_key_matches() using SHA-256 digest + non-short-circuiting byte compare,
and a shared request_has_valid_api_key() helper so both middlewares use the
constant-time path from one place. Fix the misleading doc comment.

Test: api_key_matches covers match/mismatch/length/empty/case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… msys bash

The previous M1 fix used bare-backslash parameter expansion
(${REPO_PATH//\/\\}); empirically this does NOT double backslashes in
Git Bash / msys 5.2 (the surrounding double-quotes halve them and a bare
\ pattern fails to match a literal backslash), so a path with a backslash
still produced invalid JSON.

Use the variable-based idiom instead — quoted variables as the search and
replace operands force LITERAL matching:
    BS='\'; DQ='"'
    REPO_PATH=${REPO_PATH//"$BS"/"$BS$BS"}
    REPO_PATH=${REPO_PATH//"$DQ"/"$BS$DQ"}
Verified with node JSON.parse round-trip for paths containing both " and \
(e.g. /c/Users/a"b\c -> {"path":"/c/Users/a\"b\c"} -> parses back exactly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…5) + fsync (minor)

M5: force_reindex_with_stores step 4 still wrote metadata.json via a plain
std::fs::write(to_string_pretty(..)), bypassing the atomic writer. A crash
there could truncate metadata.json — the exact failure the atomic RMW was
introduced to prevent. Route it through crate::vectordb::merge_metadata_atomic,
overlaying the preserved keys + a fresh indexed_at onto any existing content.

Minor: atomic_write_json now fsyncs the temp file (File::create + write_all +
sync_all) BEFORE the rename, so a power-loss cannot leave a zero-length/garbage
file in place of the old metadata. Doc comment updated to match the actual
guarantee instead of overstating it.

vectordb store tests pass (incl. test_persistence).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…M3) + csharp_error (M4)

M3: the remote TUI's `i` and `d` keys called GET /repos/{alias}/info and
POST /repos/{alias}/doctor, but neither route existed server-side, so they
always failed with "endpoint not available" / HTTP error. Add:
- info_handler (GET /repos/:alias/info): mirrors tui::build_info_overlay,
  returns the exact InfoResponse shape (chunks/files/max_chunk_id/
  db_size_human/model/dims/lock/index_age). 404 on unknown alias.
- doctor_handler (POST /repos/:alias/doctor): runs diagnose_with_store when
  the repo's stores are open (reuses the LMDB handle — avoids double-open),
  else diagnose(); returns {"results": [...]} from DoctorReport::render_tui.
Both registered in the same auth-layered Router as reindex; like /status
they're open on localhost and key-protected on network binds.
format_age/dir_size_human in tui.rs made pub(crate) for reuse.

M4: status JSON now includes "csharp_error" so the remote detail panel shows
the real C# error instead of the literal "Unknown error".

cargo check + clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rt classify + tests

Jupyter (src/chunker/jupyter.rs):
- Normalize RawCell.line_count to >=1 in extract_cell so the two passes in
  merge_adjacent_cells (line numbering vs merge accumulation) agree; empty
  cells previously stored 0 which the passes counted differently.
- Add a loud module-header CAVEAT that chunk start/end lines are synthetic
  cell-relative positions, NOT real .ipynb JSON offsets — future jump-to-line
  / re-extraction features must not trust them. Note kernel language is not
  read (generic code labels).

Dart (src/chunker/extractor.rs):
- Remove the dead/misleading `mixin_application` branch from classify().
  Empirically verified: Dart mixin members parent to `class_body` (already
  Method); `mixin_application` is the `with A, B` clause and never parents a
  function_declaration, so the branch never fired.

Tests (src/chunker/semantic.rs):
- test_dart_semantic_chunking: top-level fn → Function, class & mixin members
  → Method (regression guard for the misclassification).
- test_dart_unparseable_still_chunks: malformed .dart still yields fallback
  chunks (grammar-failure resilience).

64 chunker tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DRY (src/serve/tui_common.rs, tui.rs, tui_remote.rs):
- Extract shared format_uptime_secs(u64) — format_uptime delegates; remote's
  format_uptime_from_secs removed.
- Move byte-identical restore_terminal into tui_common (pub); both TUIs call it.
- render_centered_modal now delegates to the _with_border_color variant
  (Color::Cyan), removing ~45 duplicated lines.

Hygiene:
- Fix stale comments: tui.rs "'s' pressed" -> "'l' pressed"; tui_remote module
  doc "Actions (i/d/f/s)" -> accurate i=info/d=doctor/n=reindex/r=remove/l=reload.
- Remove dead `let _ = tx;` suppression in tui_remote.
- Align C# sentinel: tui.rs map_repo_rows emits "none" (was "") to match
  status_handler; shared consumer handles it via default branch (no display change).
- Reuse a single reqwest::Client across the remote TUI poll/actions instead of
  constructing one per request.
- search/mod.rs: warn! on query-side model parse-failure/override fallback
  (previously a silent unwrap_or_default) — keeps the default, just makes a
  model mismatch visible in logs.

Test (src/serve/mod.rs): info_doctor_routes_registered asserts GET /info and
POST /doctor on an unknown alias return 404 (mirrors reindex route test).

cargo check + clippy clean; 29 serve tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…regression)

The Stage 1 fail-fast resolved the embedding model at the top of
perform_incremental_refresh_with_stores, which made an up-to-date (no-op)
refresh error out on any index whose metadata model can't be parsed —
breaking test_incremental_refresh_up_to_date_is_noop and, more importantly,
failing no-op refreshes that never embed anything.

Move the strict resolve_embed_model() call inside the `if !changed_files`
block so it runs only when there are files to embed; keep a lenient
model_name/dimensions read at the top for the FileMetaStore. Embedding still
fails fast with the same guarantee; no-op refreshes no longer require a
resolvable model.

Full lib+bins suite green (482 passed; the unrelated Windows relocation test
is pre-existing flakiness — passes in isolation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…auth notes

Holistic-review findings on the combined branch:

IMPORTANT — partial-deletion window: in perform_incremental_refresh_with_stores
the lazy resolve_embed_model ran AFTER stale-chunk deletions were committed, so
on a corrupt index (unknown model / dims mismatch) it could delete chunks then
error before re-embedding, leaving the index with data removed. Move the
fail-fast resolve to right after the no-op early-return — before any destructive
store mutation — so a corrupt index errors with the index still intact. embed_model
is still consumed in the changed-files block.

MINOR — doctor_handler ran synchronous diagnose() (tree walk + LMDB scan) while
holding a tokio read guard on an async worker. Wrap it in spawn_blocking using
blocking_read on the cloned Arc<RwLock> (mirrors the reindex path); still reuses
the open LMDB handle to avoid double-open.

MINOR — document at the route declaration why POST /doctor is intentionally
outside require_admin_auth's management set (read-only diagnostics).

cargo check + clippy clean; target tests pass (noop refresh, info_doctor routes).
The unrelated Windows relocation test remains pre-existing flakiness (passes in
isolation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The db_discovery::repos relocation tests flaked under full-suite parallel load
(intermittent "Access is denied" on rename, and occasional missing git remote).
Root cause: these tests `git init` a directory then rename it; on Windows the
indexer/antivirus scans each fresh .git tree and holds handles on it, blocking
the rename. When many such tests run concurrently the scanner is overwhelmed
and handles linger for >7s — long enough to exhaust the old 10-attempt rename
retry. A transient msys fork failure (EAGAIN) spawning git could also silently
drop a repo's remote.

Fixes:
- Serialize the 9 git-spawning / renaming relocation tests behind a shared,
  poison-tolerant Mutex so only one .git tree is created/renamed at a time —
  keeping each Defender scan window short. This is the decisive fix.
- Harden production git_remote_url(): retry on transient spawn failure (fork
  exhaustion) instead of treating it as "no remote", which would wrongly strip
  a repo's git identity and break relocation / cause prune. NotFound (git
  absent) still returns immediately; an Ok-but-nonzero status is a real answer
  and is not retried.
- init_git_remote test helper: same spawn-retry instead of .expect() panic.
- rename_retry: raise budget to 40 attempts with a 250ms-capped backoff.

Validation: full lib suite run 6x consecutively under parallel load — 482
passed, 0 failed every time (incl. a 15s heavy-contention run that previously
failed). clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions" error

The reported 500 ("an environment is already opened with different options"
opening BAYR.Aprimo) is heed 0.20.5's Error::BadOpenOptions from heed's own
process-global OPENED_ENV registry — NOT codesearch's TrackedEnv guard (which
would say "double-open prevented"). It fires when a path resolves to a
still-live heed env whose recorded map_size differs from the reopen's resolved
size (e.g. after an MDB_MAP_FULL resize).

Root cause: TrackedEnv::drop called unregister() in the drop body, but the
`inner: heed::Env` field is dropped only AFTER the body returns (Rust drop
order). That leaves a window where codesearch's LMDB_REGISTRY slot is free but
heed's env is still alive. A concurrent TrackedEnv::open (idle reaper dropping a
repo while a reindex/query reopens it) passes register() and falls through to
opts.open(), hitting heed's raw error.

Note: the previously applied stop_fsw fix (drop Readonly/Conflicted) does not
cover this incident — the repo was idle-evicted, so stop_fsw returns None before
reaching that branch.

Fix: wrap inner in ManuallyDrop and drop the heed::Env BEFORE unregister(),
enforcing "our slot free => heed's slot free". A concurrent open now either sees
our slot occupied (clear double-open message + retry) or both free (clean
reopen) — never the inconsistent state. Adds a multi-threaded regression guard
that asserts the forbidden heed string never surfaces (non-flaky: only fails on
real regression).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ons" 500

Strengthening the TrackedEnv concurrency test to 8 threads x 4000 iters
revealed the drop-order reorder alone does NOT close the window: with
DIFFERENT map sizes on the same path, heed still raised "an environment is
already opened with different options". Probing with a CONSISTENT size proved
the error is specific to size mismatch — heed defers env close, so a reopen can
briefly observe the prior live env, and only disagreeing options trigger the
error.

Root fix: a process-global per-path map_size pin (MAP_SIZE_PINS). The first
resolve_map_size() for a path fixes its size for the process lifetime
(monotonically non-decreasing, capped at MAX_LMDB_MAP_SIZE_MB); resize_environment
raises the pin so post-resize reopens (e.g. after idle eviction) match the
still-live env. This makes every open of a path use a consistent size
regardless of metadata-persistence state — the user's BAYR metadata.json had no
lmdb_map_size_mb, which is exactly how a resized env could be reopened with a
mismatched size. The Drop reorder remains as complementary hardening.

Tests:
- lmdb_registry: concurrent open/drop/reopen with consistent size (8x4000,
  barrier-synced) asserts the heed string never surfaces — non-flaky guard.
- store: pin is stable+monotonic per path (a later smaller persisted size does
  not shrink the live pin) and capped at MAX.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pressing 'n' (reindex) previously gave no instant confirmation — the status
column only flipped to "Indexing…" on the next 500ms redraw, which was easy to
miss on a fast reindex. Now a transient footer flash confirms the action
immediately and honestly reflects the launch outcome:

- Started        → "⟳ Reindex started for '<alias>' …"
- AlreadyRunning → "⟳ Reindex already running for '<alias>'"
- Failed         → "✗ Cannot reindex '<alias>' — see logs"

spawn_force_reindex now returns a ReindexLaunch enum so the flash maps to the
real synchronous launch result (guard hit / unresolved alias / read-only / open
error). The flash auto-clears after 4s. The existing pulsing "⟳ idx…" status
label is unchanged. Scope: local in-process TUI only (per request); render_footer
gains an Option<&str> flash arg, remote TUI passes None (behavior unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@flupkede flupkede merged commit d9e7d5b into develop Jun 12, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant