
feat(indexing): port CspIndex orchestrator (fromPath/fromGit/search/findRelated/save/load)#17

Merged
amondnet merged 2 commits into main from feat/unit-12-cspindex on May 28, 2026

@amondnet amondnet commented May 28, 2026

Summary

Port src/semble/index/{create,index,types}.py to TypeScript:

  • src/indexing/types.ts: PersistencePath class with chunks, bm25Index, semanticIndex, metadata fields, nonExisting() instance method, and static fromPath(base) factory matching semble's layout (chunks.json, bm25_index/, semantic_index/, metadata.json).
  • src/indexing/create.ts: createIndexFromPath(path, options) walks files, enforces MAX_FILE_BYTES = 1_000_000, chunks each file, builds the BM25 index over tokenize(enrichForBm25(chunk)), and embeds chunks into a SelectableBasicBackend. Throws when no chunks are produced.
  • src/indexing/index.ts: CspIndex class with:
    • static fromPath(path | URL, options) → resolves directory, throws if missing or not a dir
    • static fromGit(url, options) → clones via spawn('git', ['clone', '--depth', '1', ...]) with GIT_CLONE_TIMEOUT_MS (from CSP_CLONE_TIMEOUT, default 60s), manual SIGTERM on timeout, tmpdir cleanup in finally
    • static loadFromDisk(path) → reads metadata.json + chunks.json and reconstructs from disk
    • search(query, options) and findRelated(source, options) → delegated to ../search.ts, with rerank defaulting to ContentType.Code in this._content
    • save(path) → mkdir -p + writes all 4 artifacts
    • Exports DEFAULT_CONTENT, ALL_CONTENT, and GIT_CLONE_TIMEOUT_MS to match the spec
  • Inline minimal stubs for sibling modules (../types.ts, ../tokens.ts, ../search.ts, ../stats.ts, ../chunking/chunk-source.ts, ./dense.ts, ./sparse.ts, ./files.ts, ./file-walker.ts) marked // TODO(integration): import from … so the unit compiles and tests pass before the dependent units land. (Stubs are not committed in this PR.)
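
The on-disk layout described for PersistencePath can be sketched as follows. This is a minimal illustration, not the code from this PR: the field and file names come from the description above, while the return type of nonExisting() (assumed here to be the list of artifact paths missing on disk) is a guess.

```typescript
import { existsSync, mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch of the layout helper; the real class may differ.
class PersistencePath {
  readonly chunks: string;
  readonly bm25Index: string;
  readonly semanticIndex: string;
  readonly metadata: string;

  constructor(chunks: string, bm25Index: string, semanticIndex: string, metadata: string) {
    this.chunks = chunks;
    this.bm25Index = bm25Index;
    this.semanticIndex = semanticIndex;
    this.metadata = metadata;
  }

  // Derive all four artifact paths from a single base directory.
  static fromPath(base: string): PersistencePath {
    return new PersistencePath(
      join(base, "chunks.json"),
      join(base, "bm25_index"),
      join(base, "semantic_index"),
      join(base, "metadata.json"),
    );
  }

  // Assumed semantics: return the artifact paths that do not exist yet.
  nonExisting(): string[] {
    return [this.chunks, this.bm25Index, this.semanticIndex, this.metadata]
      .filter((p) => !existsSync(p));
  }
}

// Usage: a partially written index reports the artifacts still missing.
const base = mkdtempSync(join(tmpdir(), "csp-index-"));
const layout = PersistencePath.fromPath(base);
writeFileSync(layout.chunks, "[]");
mkdirSync(layout.bm25Index);
console.log(layout.nonExisting()); // semantic_index and metadata.json paths
```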

Tests

bun test src/indexing/types.test.ts src/indexing/create.test.ts src/indexing/index.test.ts → 20/20 pass.

  • PersistencePath: layout, nonExisting() empty/partial/full cases
  • createIndexFromPath: chunk/bm25/semantic build, throws when empty, extension override, MAX_FILE_BYTES skip, subdir descent
  • CspIndex.stats: empty + populated language distribution
  • CspIndex.search('') / empty index → []
  • CspIndex.findRelated: excludes the source chunk, accepts Chunk or SearchResult
  • save → loadFromDisk roundtrip preserves chunks + language stats
  • fromPath throws on non-existent path / file (not directory)

Code-review fixes applied during review

  1. fromGit used execFile with a non-existent stdio option. child_process.execFile does not accept stdio (that's spawn-only), so the stdin-redirection request was silently dropped. Replaced with a dedicated runGitClone helper using spawn('git', args, { stdio: ['ignore', 'pipe', 'pipe'] }) and a manual setTimeout + kill('SIGTERM') so the child is reliably torn down on timeout — matching semble's subprocess.run(..., stdin=subprocess.DEVNULL, timeout=…) semantics.
  2. _computeFileSizes used path.join instead of path.resolve. join('/root', '/abs/path') returns /root/abs/path, but semble's root / chunk.file_path follows pathlib rules where an absolute right-hand operand wins. Switched to resolve(root, chunk.filePath) so future absolute-path scenarios behave identically to the upstream.
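
The join/resolve difference in fix 2 is easy to check in isolation. The snippet below uses path.posix so the result is the same on any platform; the paths are made-up examples, not values from the repository.

```typescript
import { posix } from "node:path";

const root = "/repo";
const relative = "src/index.ts";
const absolute = "/abs/path/file.ts";

// join concatenates segments, so an absolute right-hand operand ends up nested under root.
const joined = posix.join(root, absolute); // "/repo/abs/path/file.ts"

// resolve works right to left and stops at the first absolute segment,
// which matches pathlib's behaviour for `root / absolute`.
const resolved = posix.resolve(root, absolute); // "/abs/path/file.ts"

// For the usual case of relative chunk paths the two calls agree.
const sameForRelative = posix.join(root, relative) === posix.resolve(root, relative); // true

console.log({ joined, resolved, sameForRelative });
```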

Notes

  • Public-API field names are camelCase (filePath, startLine, endLine) per CLAUDE.md.
  • Stats file path is ~/.csp/savings.jsonl (not ~/.semble/).
  • TypeScript strict + noUncheckedIndexedAccess + exactOptionalPropertyTypes honored — optional spread guards on extensions / selector.
  • e2e (real model / real walker) intentionally out of scope per unit spec.

Summary by cubic

Ported the CspIndex indexing orchestrator to TypeScript for local and git-based indexing with hybrid (BM25 + semantic) search, plus a save/load disk format. Adds tests, enforces a 1 MB per-file cap, and improves clone reliability, filter handling, and path/size accuracy.

  • New Features

    • CspIndex with fromPath, fromGit, search, findRelated, save, and loadFromDisk; exports DEFAULT_CONTENT, ALL_CONTENT, and GIT_CLONE_TIMEOUT_MS.
    • createIndexFromPath builds BM25 over tokenize(enrichForBm25(chunk)) and a semantic index from embeddings; walks subdirs, skips files > 1 MB, supports extensions and displayRoot.
    • PersistencePath standardizes on-disk layout: chunks.json, bm25_index/, semantic_index/, metadata.json.
    • fromGit clones with spawn('git', ...) and a timeout via CSP_CLONE_TIMEOUT → GIT_CLONE_TIMEOUT_MS; cleans up tmpdir.
  • Bug Fixes

    • search: return [] for zero-match filters and when topK <= 0; no fallback to unfiltered results.
    • fromGit: ignore stdout to prevent pipe deadlocks; tolerate rm() failures during cleanup; keep stdin closed and timeout handling.
    • File sizes: use statSync().size for accurate byte counts and to avoid reading file contents.
    • Path handling: preserve filesystem-root trailing separators in resolveDirectory; fix absolute-path resolution via path.resolve for file size computation.
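
The clone-reliability points above (stdin closed, stdout ignored, manual timeout with SIGTERM) can be sketched as a generic helper. runWithTimeout is a hypothetical name; the PR's actual runGitClone is not shown in this conversation and may be structured differently.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper illustrating the spawn + manual timeout pattern.
function runWithTimeout(command: string, args: string[], timeoutMs: number): Promise<void> {
  return new Promise((resolve, reject) => {
    // stdin and stdout are ignored so no unread pipe can fill up;
    // stderr is piped and drained below for error reporting.
    const child = spawn(command, args, { stdio: ["ignore", "ignore", "pipe"] });
    let stderr = "";
    let timedOut = false;

    child.stderr?.on("data", (chunk: Buffer) => {
      stderr += chunk.toString("utf8");
    });

    // Manual timeout: ask the child to terminate, then report in the close handler.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGTERM");
    }, timeoutMs);

    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });

    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        reject(new Error(`${command} timed out after ${timeoutMs} ms`));
      } else if (code === 0) {
        resolve();
      } else {
        reject(new Error(`${command} exited with code ${code}: ${stderr.trim()}`));
      }
    });
  });
}
```

Under these assumptions a shallow clone would be invoked roughly as runWithTimeout('git', ['clone', '--depth', '1', url, dest], GIT_CLONE_TIMEOUT_MS), where url and dest are placeholders for the caller's values.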

Written for commit fde496f. Summary will update on new commits.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request ports the indexing functionality from Python, introducing the CspIndex class for hybrid search, directory and Git repository indexing, and persistence helpers, along with comprehensive tests. The reviewer feedback highlights several critical improvements: optimizing file size calculation in _computeFileSizes by using statSync instead of reading entire files into memory, fixing a filtering bug in _getSelectorVector where empty filter results incorrectly disable filtering, avoiding potential process hangs in runGitClone by ignoring stdout instead of piping it without consumption, and cleaning up unused imports and helper functions.

Comment thread src/indexing/index.ts
Comment thread src/indexing/index.ts
Comment thread src/indexing/index.ts Outdated
Comment thread src/indexing/index.ts Outdated
Comment thread src/indexing/index.ts

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 6 files

Comment thread src/indexing/index.ts Outdated
Comment thread src/indexing/index.ts
Comment thread src/indexing/index.ts Outdated
Comment thread src/indexing/index.ts Outdated
- _computeFileSizes: use statSync().size instead of readFileSync().length
  to avoid loading file content into memory (P1, gemini) and to report
  true UTF-8 byte counts instead of UTF-16 code units (P3, cubic).
- _getSelectorVector: return [] (not null) when filters are set but match
  no chunks, so search() honors the empty filter instead of falling back
  to an unfiltered search (P1, gemini/cubic).
- search(): early-return [] for zero-match selectors and topK<=0.
- resolveDirectory: preserve filesystem-root trailing separators so paths
  like 'C:\' or '/' survive normalization on Windows/POSIX (P2, cubic).
- runGitClone: set stdout to 'ignore' (not 'pipe') so the OS pipe buffer
  can't fill and deadlock git clone (P2, gemini/cubic). stdin/stderr
  unchanged.
- fromGit: tolerate rm() failures during finally so they never mask the
  inner error.
- Drop now-unused readFileSync import and readFileSyncSafe helper.

Tests:
- Add regression test for filters-with-zero-matches returning [].
- Add regression test for topK<=0 returning [].
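
The byte-count point in the first item can be illustrated with a small standalone check. The file name and contents are invented for the example; only statSync, readFileSync, and the UTF-8 versus UTF-16 distinction come from the commit message.

```typescript
import { mkdtempSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write "héllo 🌍" using escapes so the exact code points are unambiguous.
const file = join(mkdtempSync(join(tmpdir(), "csp-size-")), "sample.txt");
writeFileSync(file, "h\u00e9llo \u{1F30D}", "utf8");

// String length counts UTF-16 code units: 6 BMP characters + 1 surrogate pair = 8.
const codeUnits = readFileSync(file, "utf8").length;

// statSync reports bytes on disk without reading the content:
// 5 ASCII bytes + 2 bytes for "é" + 4 bytes for the emoji = 11.
const bytes = statSync(file).size;

console.log({ codeUnits, bytes }); // { codeUnits: 8, bytes: 11 }
```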
@amondnet amondnet self-assigned this May 28, 2026

@amondnet amondnet left a comment

Applied all 9 bot-review findings (5 from gemini-code-assist, 4 from cubic). Pushed in fde496f.

Summary:

  • _computeFileSizes: statSync().size instead of readFileSync().length — no in-memory file content, correct UTF-8 byte counts.
  • _getSelectorVector: distinguishes 'no filter' (returns null) from 'filter matches nothing' (returns []). search() early-returns [] for the empty-selector case so an unmatched filter never silently falls back to an unfiltered search. Intentional divergence from semble's Python, which has the same latent bug.
  • resolveDirectory: trailing-separator stripping now skips filesystem-root paths (/ on POSIX, C:\ on Windows) so we don't mutate C:\ into C:.
  • runGitClone: stdio: ['ignore', 'ignore', 'pipe'] — stdout no longer piped, preventing OS pipe-buffer deadlock when verbose git hooks emit output. Kept spawn + manual timeout as per the original code-review decision (NOT reverted to execFile).
  • fromGit: rm errors during finally are swallowed so they never mask the inner error.
  • Cleanup: removed unused readFileSync import and readFileSyncSafe helper.
  • Tests: added regressions for filters-with-zero-matches → [] and topK<=0 → [].
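
The 'no filter' versus 'filter matches nothing' distinction can be sketched as below. The types and function names are hypothetical, and the sketch returns matching chunks rather than a selector vector to stay short; it only illustrates the null / [] contract and the early returns described above.

```typescript
// Hypothetical shapes; the real Chunk and filter types live elsewhere in the repo.
interface Chunk {
  filePath: string;
  language: string;
}

interface Filters {
  languages?: string[];
  paths?: string[];
}

// null means "no filter was requested"; [] means "a filter was requested and matched nothing".
function selectChunks(chunks: Chunk[], filters: Filters): Chunk[] | null {
  const { languages, paths } = filters;
  if (!languages?.length && !paths?.length) return null;
  return chunks.filter((chunk) => {
    const languageOk = !languages?.length || languages.includes(chunk.language);
    const pathOk = !paths?.length || paths.includes(chunk.filePath);
    return languageOk && pathOk;
  });
}

function searchChunks(chunks: Chunk[], filters: Filters, topK: number): Chunk[] {
  const selected = selectChunks(chunks, filters);
  // Early return: an unmatched filter or a non-positive topK never falls back to everything.
  if (topK <= 0 || (selected !== null && selected.length === 0)) return [];
  const candidates = selected === null ? chunks : selected;
  return candidates.slice(0, topK); // stand-in for the real hybrid ranking
}
```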

Applied 9, deferred 0.

@amondnet amondnet merged commit df14647 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-12-cspindex branch May 28, 2026 16:08
This was referenced Jun 18, 2026