feat(indexing): port CspIndex orchestrator (fromPath/fromGit/search/findRelated/save/load)#17
Conversation
…indRelated/save/load)
There was a problem hiding this comment.
Code Review
This pull request ports the indexing functionality from Python, introducing the CspIndex class for hybrid search, directory and Git repository indexing, and persistence helpers, along with comprehensive tests. The reviewer feedback highlights several critical improvements: optimizing file size calculation in _computeFileSizes by using statSync instead of reading entire files into memory, fixing a filtering bug in _getSelectorVector where empty filter results incorrectly disable filtering, avoiding potential process hangs in runGitClone by ignoring stdout instead of piping it without consumption, and cleaning up unused imports and helper functions.
There was a problem hiding this comment.
4 issues found across 6 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
- _computeFileSizes: use statSync().size instead of readFileSync().length to avoid loading file content into memory (P1, gemini) and to report true UTF-8 byte counts instead of UTF-16 code units (P3, cubic). - _getSelectorVector: return [] (not null) when filters are set but match no chunks, so search() honors the empty filter instead of falling back to an unfiltered search (P1, gemini/cubic). - search(): early-return [] for zero-match selectors and topK<=0. - resolveDirectory: preserve filesystem-root trailing separators so paths like 'C:\' or '/' survive normalization on Windows/POSIX (P2, cubic). - runGitClone: set stdout to 'ignore' (not 'pipe') so the OS pipe buffer can't fill and deadlock git clone (P2, gemini/cubic). stdin/stderr unchanged. - fromGit: tolerate rm() failures during finally so they never mask the inner error. - Drop now-unused readFileSync import and readFileSyncSafe helper. Tests: - Add regression test for filters-with-zero-matches returning []. - Add regression test for topK<=0 returning [].
amondnet
left a comment
There was a problem hiding this comment.
Applied all 9 bot-review findings (5 from gemini-code-assist, 4 from cubic). Pushed in fde496f.
Summary:
- _computeFileSizes:
statSync().sizeinstead ofreadFileSync().length— no in-memory file content, correct UTF-8 byte counts. - _getSelectorVector: distinguishes 'no filter' (returns null) from 'filter matches nothing' (returns []).
search()early-returns [] for the empty-selector case so an unmatched filter never silently falls back to an unfiltered search. Intentional divergence from semble's Python, which has the same latent bug. - resolveDirectory: trailing-separator stripping now skips filesystem-root paths (
/on POSIX,C:\\on Windows) so we don't mutateC:\\→C:. - runGitClone:
stdio: ['ignore', 'ignore', 'pipe']— stdout no longer piped, preventing OS pipe-buffer deadlock when verbose git hooks emit output. Keptspawn+ manual timeout as per the original code-review decision (NOT reverted toexecFile). - fromGit: rm errors during finally are swallowed so they never mask the inner error.
- Cleanup: removed unused
readFileSyncimport andreadFileSyncSafehelper. - Tests: added regressions for filters-with-zero-matches → [] and topK<=0 → [].
Applied 9, deferred 0.
Summary
Port
src/semble/index/{create,index,types}.pyto TypeScript:src/indexing/types.ts—PersistencePathclass withchunks,bm25Index,semanticIndex,metadatafields,nonExisting()instance method, and staticfromPath(base)factory matching semble's layout (chunks.json,bm25_index/,semantic_index/,metadata.json).src/indexing/create.ts—createIndexFromPath(path, options)walks files, enforcesMAX_FILE_BYTES = 1_000_000, chunks each file, builds the BM25 index overtokenize(enrichForBm25(chunk)), and embeds chunks into aSelectableBasicBackend. Throws when no chunks are produced.src/indexing/index.ts—CspIndexclass with:static fromPath(path | URL, options)→ resolves directory, throws if missing or not a dirstatic fromGit(url, options)→ clones viaspawn('git', ['clone', '--depth', '1', ...])withGIT_CLONE_TIMEOUT_MS(fromCSP_CLONE_TIMEOUT, default 60s), manual SIGTERM on timeout, tmpdir cleanup infinallystatic loadFromDisk(path)→ readsmetadata.json+chunks.jsonand reconstructs from disksearch(query, options)andfindRelated(source, options)→ delegated to../search.ts, with rerank defaulting toContentType.Code in this._contentsave(path)→ mkdir -p + writes all 4 artifactsDEFAULT_CONTENT,ALL_CONTENT, andGIT_CLONE_TIMEOUT_MSto match the spec../types.ts,../tokens.ts,../search.ts,../stats.ts,../chunking/chunk-source.ts,./dense.ts,./sparse.ts,./files.ts,./file-walker.ts) marked// TODO(integration): import from …so the unit compiles and tests pass before the dependent units land. (Stubs are not committed in this PR.)Tests
bun test src/indexing/types.test.ts src/indexing/create.test.ts src/indexing/index.test.ts— 20/20 pass.PersistencePath: layout,nonExisting()empty/partial/full casescreateIndexFromPath: chunk/bm25/semantic build, throws when empty, extension override,MAX_FILE_BYTESskip, subdir descentCspIndex.stats: empty + populated language distributionCspIndex.search('')/ empty index →[]CspIndex.findRelated: excludes the source chunk, acceptsChunkorSearchResultsave→loadFromDiskroundtrip preserves chunks + language statsfromPaththrows on non-existent path / file (not directory)Code-review fixes applied during review
fromGitusedexecFilewith a non-existentstdiooption.child_process.execFiledoes not acceptstdio(that'sspawn-only), so the stdin-redirection request was silently dropped. Replaced with a dedicatedrunGitClonehelper usingspawn('git', args, { stdio: ['ignore', 'pipe', 'pipe'] })and a manualsetTimeout+kill('SIGTERM')so the child is reliably torn down on timeout — matching semble'ssubprocess.run(..., stdin=subprocess.DEVNULL, timeout=…)semantics._computeFileSizesusedpath.joininstead ofpath.resolve.join('/root', '/abs/path')returns/root/abs/path, but semble'sroot / chunk.file_pathfollows pathlib rules where an absolute right-hand operand wins. Switched toresolve(root, chunk.filePath)so future absolute-path scenarios behave identically to the upstream.Notes
filePath,startLine,endLine) perCLAUDE.md.~/.csp/savings.jsonl(not~/.semble/).noUncheckedIndexedAccess+exactOptionalPropertyTypeshonored — optional spread guards onextensions/selector.Summary by cubic
Ported the
CspIndexindexing orchestrator to TypeScript for local and git-based indexing with hybrid (BM25 + semantic) search, plus a save/load disk format. Adds tests, enforces a 1 MB per-file cap, and improves clone reliability, filter handling, and path/size accuracy.New Features
CspIndexwithfromPath,fromGit,search,findRelated,save, andloadFromDisk; exportsDEFAULT_CONTENT,ALL_CONTENT, andGIT_CLONE_TIMEOUT_MS.createIndexFromPathbuilds BM25 overtokenize(enrichForBm25(chunk))and a semantic index from embeddings; walks subdirs, skips files > 1 MB, supportsextensionsanddisplayRoot.PersistencePathstandardizes on-disk layout:chunks.json,bm25_index/,semantic_index/,metadata.json.fromGitclones withspawn('git', ...)and a timeout viaCSP_CLONE_TIMEOUT→GIT_CLONE_TIMEOUT_MS; cleans up tmpdir.Bug Fixes
search: return [] for zero-match filters and whentopK <= 0; no fallback to unfiltered results.fromGit: ignore stdout to prevent pipe deadlocks; toleraterm()failures during cleanup; keep stdin closed and timeout handling.statSync().sizefor accurate byte counts and to avoid reading file contents.resolveDirectory; fix absolute-path resolution viapath.resolvefor file size computation.Written for commit fde496f. Summary will update on new commits.