2 changes: 1 addition & 1 deletion .please/docs/decisions/0003-rewrite-in-rust.md
@@ -72,7 +72,7 @@ Adding napi-rs *now* would directly conflict with motivation #1 (single binary),

- [ADR 0001](0001-native-tree-sitter.md)'s native-vs-WASM tension dissolves — tree-sitter is a native Rust crate. ADR 0001 stays accepted for the TS lineage but no longer constrains the Rust line.
- [ADR 0002](0002-index-storage-cache-model.md)'s global `~/.csp/index/` cache model is language-agnostic and carries over unchanged.
- The existing TS test suite becomes **golden fixtures** for verifying the Rust rewrite's behavioral equivalence, then is retired with the TS code.
- The **Python upstream** (`MinishLab/semble`) is the equivalence oracle for the Rust rewrite; the existing TS test suite is reused as convenient language-neutral **golden fixtures** for already-ported modules, then retired with the TS code. *(Clarified 2026-06-19: the TS `src/` is no longer the source of truth; it is deprecated and slated for deletion. Where the Rust port has moved past the old TS stubs (real dense embeddings, wired ranking, chunk length 750), the upstream Python is authoritative, not the TS source.)*

## Alternatives considered

91 changes: 51 additions & 40 deletions .please/docs/references/semble.md
@@ -6,11 +6,14 @@
> captures the load-bearing algorithm + its constants, the Rust-specific structure/idioms, and
> where the port diverges.
>
> **Analyzed at**: upstream semble `136b6f7` (2026-06-18); Rust port at repo `2f2baa2`
> (PR #34 "Rust rewrite foundation"). **Parity oracle**: the TS `src/` test suite reused as
> golden fixtures — Rust reproduces the TS module behavior bit-for-bit, so "parity" is
> *fixture-level*, not full-runtime. The TS `src/` stays the source of truth until Rust reaches
> parity (per ADR-0003).
> **Analyzed at**: upstream semble `136b6f7` (2026-06-18); Rust port baseline `2f2baa2`
> (PR #34 "Rust rewrite foundation"), since advanced by PR #37 (ranking wired + chunk 750).
> **Source of truth**: the **Python upstream** (`MinishLab/semble`) — the Rust port targets
> behavioral equivalence with it. The TS `src/` is **deprecated** (slated for deletion, kept
> only as a historical/reference implementation) and is no longer the parity oracle; its test
> suite remains usable as language-neutral golden fixtures for already-ported modules, but where
> the Rust port has moved past the old TS stubs (real dense embeddings, wired ranking, chunk
> length 750) the upstream Python is authoritative.
> **Upstream layout**: Python `src/semble/`. **Port layout**: `crates/csp/src/` (lib) +
> `crates/csp-cli/src/` (bin).

@@ -51,14 +54,15 @@ They are fused with **Reciprocal Rank Fusion** and then reranked with code-speci
(dense matrix) ("{content} {stem} ▼
│ {stem} {dir[-3:]}") rerank (if CODE):
▼ → tokenize → BM25 boost_multi_chunk_files (wired)
SelectableBasicBackend index apply_query_boost (⚠ identity stub)
(cosine) rerank_top_k (⚠ saturation-only stub)
SelectableBasicBackend index apply_query_boost (wired)
(cosine) rerank_top_k (wired, path penalties + saturation)
top_k SearchResult → ~/.csp/savings.jsonl
```

⚠ = TD-002: the full ranking lives in `ranking::{boosting,penalties}` but is **not yet wired**
into `search.rs`, mirroring the TS source's current state (see §4.10/§6).
The full ranking in `ranking::{boosting,penalties}` is now **wired** into `search.rs`
(query-type boosts + path penalties + file saturation), matching the upstream
`search.search` pipeline order (see §4.10).

---

@@ -69,18 +73,18 @@ into `search.rs`, mirroring the TS source's current state (see §4.10/§6).
| `types.py` | `csp/src/types.rs` | ported | `Chunk`, `ContentType`, `CallType` enums; `ChunkDict`/`SearchResultDict` serde |
| `tokens.py` | `csp/src/tokens.rs` | ported | identifier-aware tokenizer (BM25 input) |
| `chunking/core.py` | `csp/src/chunking/core.rs` | ported (real tree-sitter) | node-merge + line-fallback boundary algorithm; `TsNode` bridge |
| `chunking/chunking.py` | `csp/src/chunking/source.rs` | ported (⚠ param drift) | `chunk_source` → `Vec<Chunk>`; char↔line conversion |
| `chunking/chunking.py` | `csp/src/chunking/source.rs` | ported | `chunk_source` → `Vec<Chunk>`; char↔line conversion (chunk length 750) |
| `index/file_walker.py` | `csp/src/indexing/file_walker.rs` | ported (`.cspignore`) | gitignore-aware recursive walk (`ignore` crate idioms) |
| `index/files.py` | `csp/src/indexing/files.rs` | ported | ext→language map, content-type sets, file status checks |
| `index/dense.py` | `csp/src/indexing/dense.rs` | ported (real + stub) | `Model` enum, `embed_chunks`, `SelectableBasicBackend` cosine |
| `index/sparse.py` | `csp/src/indexing/sparse.rs` | ported | `Bm25Index`, `enrich_for_bm25`, selector→mask |
| `index/create.py` | `csp/src/indexing/create.rs` | ported | build BM25 + dense + chunks from a path |
| `index/index.py` | `csp/src/indexing/index.rs` | ported | `CspIndex` orchestrator (from_path/from_git/search/find_related/save/load) + `load_or_build_index` |
| `cache.py` | `csp/src/indexing/cache.rs` | adapted | content-hash cache at `~/.csp/index/` (ADR-0002), 0700 perms |
| `search.py` | `csp/src/search.rs` | ported (ranking stub) | hybrid RRF + alpha blend; trait seams |
| `search.py` | `csp/src/search.rs` | ported (ranking wired) | hybrid RRF + alpha blend; trait seams |
| `ranking/weighting.py` | `csp/src/ranking/weighting.rs` | ported | adaptive alpha |
| `ranking/boosting.py` | `csp/src/ranking/boosting.rs` | ported (boost_multi wired; others unwired) | query-type detection + definition/stem/embedded boosts |
| `ranking/penalties.py` | `csp/src/ranking/penalties.rs` | ported (unwired) | path penalties + file-saturation rerank |
| `ranking/boosting.py` | `csp/src/ranking/boosting.rs` | ported (wired) | query-type detection + definition/stem/embedded boosts |
| `ranking/penalties.py` | `csp/src/ranking/penalties.rs` | ported (wired) | path penalties + file-saturation rerank |
| `stats.py` | `csp/src/stats.rs` | adapted | `~/.csp/savings.jsonl` read/write + report formatting |
| `mcp.py` | `csp/src/mcp.rs` (core) + `csp-cli/src/mcp_server.rs` (rmcp transport) | ported | MCP `search` / `find_related` tools |
| `cli.py` | `csp-cli/src/main.rs` | adapted (clap) | subcommands: search / find-related / index / savings / clear / init / mcp |
@@ -135,7 +139,9 @@ Same contract as semble `tokens.py`:
shape as semble. Byte offsets are converted to char offsets for multibyte safety.

**`source.rs`** (`chunk_source`):
- `DESIRED_CHUNK_LENGTH_CHARS = 1500` (⚠ upstream is now **750** — see §6).
- `DESIRED_CHUNK_LENGTH_CHARS = 750` (matches upstream `_DESIRED_CHUNK_LENGTH_CHARS`).
The value is also recorded in the index manifest (`chunk_size`) so a cache built
with a different target length is auto-invalidated (see §4.9/§4.14).
- AST chunking is gated by `is_supported_language(lang)` (metadata-only, no download); the
subsequent `chunk(...)` may still return `None` (e.g. an offline grammar fetch failure),
falling back to line chunking. Char offsets → 1-indexed line numbers; clamps end to avoid the
@@ -199,7 +205,7 @@ The public façade (parallels `SembleIndex`):
`~/.csp/index/<hash>` on a validated hit, else build and persist.
- Builds file→indices and language→indices maps for selectors and stats.

### 4.10 `search.rs` — hybrid retrieval & fusion (with TD-002 stub)
### 4.10 `search.rs` — hybrid retrieval & fusion

The heart of ranking. `search(query, model, semantic_index, bm25_index, chunks, top_k, options)`:
1. `resolve_alpha(query, options.alpha)`; `rerank = options.rerank.unwrap_or(true)`.
@@ -210,9 +216,10 @@ The heart of ranking. `search(query, model, semantic_index, bm25_index, chunks,
`1/(RRF_K + rank)`, `RRF_K = 60`, rank from 1.
6. Union of indices, **sorted by `start_line`** to neutralize hash-iteration nondeterminism;
`combined = α·rrf_semantic + (1-α)·rrf_bm25`.
7. If `rerank`: `boost_multi_chunk_files` (**wired**, shared impl) → `apply_query_boost_identity`
(⚠ stub) → `rerank_top_k_saturation` (⚠ stub: file-saturation decay only, path penalties
**not applied**, `penalise_paths` ignored). Else plain sort + truncate.
7. If `rerank`: `boost_multi_chunk_files` → `ranking::boosting::apply_query_boost` →
`ranking::penalties::rerank_top_k(.., penalise_paths = alpha_weight < 1.0)` — the real
ranking functions, matching the upstream `search.search` order (path penalties apply only
when BM25 contributes). Else plain sort + truncate.
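
Steps 5 and 6 (RRF scoring and the alpha blend) can be sketched as follows. This is a minimal illustration and not the code in `search.rs`: the helper names `rrf_scores` and `fuse` are hypothetical, a chunk missing from one ranked list is assumed to contribute 0 from that list, and the output is ordered by chunk index because the sketch carries no `start_line` metadata.

```rust
use std::collections::HashMap;

/// RRF constant named in the text (`RRF_K = 60`).
const RRF_K: f64 = 60.0;

/// Reciprocal-rank scores for a ranked list of chunk indices (best first).
/// Ranks start at 1, so the top hit scores 1 / (RRF_K + 1).
/// Hypothetical helper, not the port's API.
fn rrf_scores(ranked: &[usize]) -> HashMap<usize, f64> {
    ranked
        .iter()
        .enumerate()
        .map(|(i, &idx)| (idx, 1.0 / (RRF_K + (i as f64 + 1.0))))
        .collect()
}

/// Blend the two RRF maps: alpha * rrf_semantic + (1 - alpha) * rrf_bm25.
/// Assumption: a chunk absent from one list contributes 0 from that list.
/// The union is sorted by chunk index so iteration order is deterministic.
fn fuse(semantic: &[usize], bm25: &[usize], alpha: f64) -> Vec<(usize, f64)> {
    let sem = rrf_scores(semantic);
    let lex = rrf_scores(bm25);

    let mut ids: Vec<usize> = sem.keys().chain(lex.keys()).copied().collect();
    ids.sort_unstable();
    ids.dedup();

    ids.into_iter()
        .map(|id| {
            let s = sem.get(&id).copied().unwrap_or(0.0);
            let l = lex.get(&id).copied().unwrap_or(0.0);
            (id, alpha * s + (1.0 - alpha) * l)
        })
        .collect()
}
```

With `alpha = 0.5`, a chunk ranked second semantically and first by BM25 scores `0.5/62 + 0.5/61`, which is higher than a chunk that appears first in only one of the two lists (`0.5/61`).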

**Rust idioms / structure**:
- **Trait seams** for testability: `EmbeddingModel`, `VectorBackend`, `SparseBackend`,
@@ -226,33 +233,33 @@ The heart of ranking. `search(query, model, semantic_index, bm25_index, chunks,
than panicking — matters for the long-running MCP server.
- `SearchOptions` struct (`alpha`, `selector`, `rerank`) instead of Python kwargs.

> **TD-002**: `ranking::boosting::apply_query_boost` and `ranking::penalties::rerank_top_k` are
> fully ported (with tests) but **not wired** into `search.rs` — exactly as in the TS source,
> which still uses the inline stubs (`TODO(integration)`). So search-ranking parity is
> fixture-level. `FILE_SATURATION_THRESHOLD`/`DECAY` are therefore defined **twice** (the inline
> stub in `search.rs` and the real one in `ranking/penalties.rs`).
> **TD-002 (resolved)**: `ranking::boosting::apply_query_boost` and
> `ranking::penalties::rerank_top_k` are now wired into `search.rs`, so the full ranking
> (query-type boosts + path penalties + file-saturation decay) runs in the search path.
> The duplicate inline `FILE_SATURATION_THRESHOLD`/`DECAY` stub constants in `search.rs` were
> removed; the canonical definitions live only in `ranking/penalties.rs`.

### 4.11 `ranking/weighting.rs` — adaptive alpha

`resolve_alpha(query, alpha)`: explicit wins; else `ALPHA_SYMBOL = 0.3` (BM25-leaning) for symbol
queries vs `ALPHA_NL = 0.5` for NL, decided by `is_symbol_query`.
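
The selection rule can be sketched as below. The two constants come from the text; `looks_like_symbol` is a hypothetical stand-in for `is_symbol_query`, whose actual pattern (`SYMBOL_QUERY_RE`) is not reproduced in this document, so the heuristic here is an assumption for illustration only.

```rust
const ALPHA_SYMBOL: f64 = 0.3; // BM25-leaning, for symbol queries
const ALPHA_NL: f64 = 0.5; // balanced, for natural-language queries

/// Hypothetical stand-in for `is_symbol_query`. Assumption: a single
/// whitespace-free token containing `_`, `::`, `.`, or an interior
/// uppercase letter is treated as a symbol. The real port uses a regex.
fn looks_like_symbol(query: &str) -> bool {
    let q = query.trim();
    if q.is_empty() || q.contains(char::is_whitespace) {
        return false;
    }
    q.contains('_')
        || q.contains("::")
        || q.contains('.')
        || q.chars().skip(1).any(|c| c.is_ascii_uppercase())
}

/// An explicit alpha always wins; otherwise pick by query type.
fn resolve_alpha(query: &str, alpha: Option<f64>) -> f64 {
    match alpha {
        Some(a) => a,
        None if looks_like_symbol(query) => ALPHA_SYMBOL,
        None => ALPHA_NL,
    }
}
```

Because the result is the weight on the semantic side of the blend, the lower symbol value shifts weight toward BM25, where exact identifier matches tend to rank well.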

### 4.12 `ranking/boosting.rs` — query-type detection & boosts (mostly unwired)
### 4.12 `ranking/boosting.rs` — query-type detection & boosts (wired)

Ported faithfully (`LazyLock<Regex>` for the static patterns, `RefCell<HashMap>` LRU for
`definition_pattern` cache):
- `SYMBOL_QUERY_RE` / `EMBEDDED_SYMBOL_RE` — symbol vs NL classification.
- `apply_query_boost` (unwired): symbol → `_boost_symbol_definitions` (definition regex per
- `apply_query_boost` (wired into `search.rs`): symbol → `_boost_symbol_definitions` (definition regex per
keyword set: `class def fn func struct enum trait type …` case-sensitive + SQL DDL
case-insensitive; `DEFINITION_BOOST_MULTIPLIER = 3.0`, ×1.5 on stem match); NL →
`_boost_stem_matches` (`STEM_BOOST_MULTIPLIER = 1.0`, ≥0.10 ratio, prefix-match morphology) +
`_boost_embedded_symbols` (`EMBEDDED_SYMBOL_BOOST_SCALE = 0.5`, `EMBEDDED_STEM_MIN_LEN = 4`).
- `boost_multi_chunk_files` (**wired** into search): top chunk per file boosted by
`max_score * FILE_COHERENCE_BOOST_FRAC` (=0.2) × (file score sum / max file sum).
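
The file-coherence formula in the last bullet can be sketched as follows. This follows the description above rather than the ported source: the function name and signature are hypothetical, the boost is assumed to be additive, and the sketch applies it to every file with at least one scored chunk, since any extra gating on multi-chunk files is not spelled out here.

```rust
use std::collections::HashMap;

const FILE_COHERENCE_BOOST_FRAC: f64 = 0.2;

/// `scores` maps chunk index -> fused score; `file_of[i]` is the path of chunk i.
/// Hypothetical signature, not the port's API.
fn boost_top_chunk_per_file(scores: &mut HashMap<usize, f64>, file_of: &[&str]) {
    let max_score = scores.values().copied().fold(0.0_f64, f64::max);

    // Per file: (sum of chunk scores, index of best chunk, best chunk score).
    let mut per_file: HashMap<&str, (f64, usize, f64)> = HashMap::new();
    for (&idx, &score) in scores.iter() {
        let entry = per_file
            .entry(file_of[idx])
            .or_insert((0.0, idx, f64::NEG_INFINITY));
        entry.0 += score;
        if score > entry.2 {
            entry.1 = idx;
            entry.2 = score;
        }
    }

    let max_file_sum = per_file.values().map(|e| e.0).fold(0.0_f64, f64::max);
    if max_file_sum <= 0.0 {
        return; // nothing scored, nothing to boost
    }

    // Assumption: the boost is added to the best chunk of each file.
    for &(sum, top_idx, _) in per_file.values() {
        let boost = max_score * FILE_COHERENCE_BOOST_FRAC * (sum / max_file_sum);
        if let Some(s) = scores.get_mut(&top_idx) {
            *s += boost;
        }
    }
}
```

The file with the largest score sum receives the full `0.2 * max_score` on its best chunk; other files receive a proportionally smaller share, so a file with several relevant chunks is lifted above one with a single hit.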

### 4.13 `ranking/penalties.rs` — path penalties & saturation rerank (unwired)
### 4.13 `ranking/penalties.rs` — path penalties & saturation rerank (wired)

`rerank_top_k(scores, chunks, top_k, penalise_paths)` ported but unwired:
`rerank_top_k(scores, chunks, top_k, penalise_paths)` ported and wired into `search.rs`:
- Path penalties (multiplicative): test files/dirs `STRONG_PENALTY = 0.3`; compat/legacy +
examples/docs `0.3`; re-export barrels (`__init__.py`, `package-info.java`)
`MODERATE_PENALTY = 0.5`; `.d.ts` `MILD_PENALTY = 0.7`.
Expand All @@ -266,6 +273,11 @@ Ported faithfully (`LazyLock<Regex>` for the static patterns, `RefCell<HashMap>`
(NFR-003), tightening pre-existing dirs on Unix.
- `clear_index_cache` removes only the index dir — never the `~/.csp` home (which also holds
`savings.jsonl`).
- **Cache validity** (`try_reuse`): a cached index is reused only when the manifest's
`chunk_size` equals the current `DESIRED_CHUNK_LENGTH_CHARS` (a manifest predating the field
→ `None` → rebuild) **and**, for local sources, the live source-file content hash matches.
This mirrors upstream `_metadata_matches`, which gained a `chunk_size` check so the 1500→750
change auto-invalidates stale caches.
- **Divergence from upstream**: semble uses the OS cache dir (`~/Library/Caches/semble`, XDG,
`%LOCALAPPDATA%`) + `SEMBLE_CACHE_LOCATION`; csp fixes a global `~/.csp/index/` per ADR-0002.
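
The cache-validity rule described above can be sketched like this. Only `chunk_size` and `DESIRED_CHUNK_LENGTH_CHARS` are named in the text; the `Manifest` struct, its `content_hash` field, and the `can_reuse` function are hypothetical names used for illustration, and the sketch assumes non-local sources need no hash check.

```rust
const DESIRED_CHUNK_LENGTH_CHARS: usize = 750;

/// Hypothetical manifest shape; only `chunk_size` is named in the text.
struct Manifest {
    /// `None` when the manifest predates the field.
    chunk_size: Option<usize>,
    /// Content hash recorded at build time (hypothetical field name).
    content_hash: String,
}

/// `live_hash` is `Some(..)` for local sources and `None` otherwise.
/// Assumption: non-local sources are validated by chunk size alone.
fn can_reuse(manifest: &Manifest, live_hash: Option<&str>) -> bool {
    // A missing or different chunk size forces a rebuild.
    if manifest.chunk_size != Some(DESIRED_CHUNK_LENGTH_CHARS) {
        return false;
    }
    match live_hash {
        Some(h) => h == manifest.content_hash.as_str(),
        None => true,
    }
}
```

Comparing against `Some(DESIRED_CHUNK_LENGTH_CHARS)` covers both stale cases in one check: an index built at 1500 and a manifest written before the field existed are rejected alike.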

@@ -315,7 +327,7 @@ Clean two-layer split:
| RRF k | `60` | `60` | `search.rs RRF_K` |
| α symbol / NL | `0.3` / `0.5` | `0.3` / `0.5` | `ranking/weighting.rs` |
| candidate over-fetch | `top_k * 5` | `top_k * 5` | `search.rs` |
| desired chunk length | **`750`** | **`1500`** ⚠ | `chunking/source.rs` |
| desired chunk length | `750` | `750` | `chunking/source.rs` |
| min chunk size | `50` | `50` | `chunking/core.rs` |
| recursion depth | `500` | `500` | `chunking/core.rs` |
| definition boost × | `3.0` | `3.0` | `ranking/boosting.rs` |
@@ -324,7 +336,7 @@ Clean two-layer split:
| stem boost × | `1.0` | `1.0` | `ranking/boosting.rs` |
| file-coherence frac | `0.2` | `0.2` | `ranking/boosting.rs` |
| strong / moderate / mild penalty | `0.3` / `0.5` / `0.7` | same | `ranking/penalties.rs` |
| file saturation threshold / decay | `1` / `0.5` | `1` / `0.5` (defined **twice**, see §4.10) | `search.rs` + `ranking/penalties.rs` |
| file saturation threshold / decay | `1` / `0.5` | `1` / `0.5` | `ranking/penalties.rs` |
| max file bytes | `1_000_000` | `1_000_000` | `index/files.py` / `indexing/create.rs` |
| default model | `minishlab/potion-code-16M` | same (real + stub) | `utils.py` / `indexing/dense.rs` |
| MCP in-mem LRU | `10` | `10` | `mcp.py` / `csp::mcp` |
Expand All @@ -350,10 +362,11 @@ Clean two-layer split:

### 6.2 Open stubs & gaps (verify before claiming runtime parity)

- **TD-002 — ranking not wired**: `search.rs` uses `apply_query_boost_identity` +
`rerank_top_k_saturation`; the real `ranking::{boosting::apply_query_boost,
penalties::rerank_top_k}` are ported but unwired (matches the TS source). Search-ranking parity
is fixture-level only. Saturation constants are duplicated as a result.
- ~~**TD-002 — ranking not wired**~~ — **closed** by [#37](https://github.com/pleaseai/code-search/pull/37).
`search.rs` now calls the real `ranking::boosting::apply_query_boost` +
`ranking::penalties::rerank_top_k` (`penalise_paths = alpha_weight < 1.0`), replacing the
identity/saturation stubs. Search-ranking is wired end-to-end (parity remains fixture-level
until the dense/BM25 backends are validated against upstream).
- ~~**Curated tree-sitter set**~~ — **closed** ([ADR-0004](../decisions/0004-rust-grammar-coverage-language-pack.md),
[#38](https://github.com/pleaseai/code-search/issues/38)). `language_for` now resolves through
`tree_sitter_language_pack` (306 grammars, full upstream parity; 264/265 `EXTENSION_TO_LANGUAGE`
@@ -363,13 +376,11 @@ Clean two-layer split:

### 6.3 Upstream drift since the review baseline (`eacbe43` → `136b6f7`)

> ⚠ **Action item**: reconcile before claiming parity.

- **Chunk length changed to 750** upstream (`chunking/chunking.py`), while the Rust port (and the
TS source it mirrors) still use **1500**. Upstream also added a `chunk_size` field to index
metadata + cache validation so the change auto-invalidates stale caches. Decide whether csp
follows 750 (and adds the metadata field to both TS and Rust) or documents 1500 as a deliberate
divergence.
- **Chunk length 750 (reconciled)** — the Rust port now uses `DESIRED_CHUNK_LENGTH_CHARS = 750`
(was 1500), matching upstream `chunking/chunking.py`. The value is recorded in the index
manifest as `chunk_size` and validated in `try_reuse`, so the change auto-invalidates stale
caches (mirrors upstream's added metadata field + cache check). The TS source still uses 1500,
but per the current direction Python upstream — not TS — is the source of truth.

---
