diff --git a/.please/INDEX.md b/.please/INDEX.md index d59cbbf..653b531 100644 --- a/.please/INDEX.md +++ b/.please/INDEX.md @@ -21,7 +21,7 @@ | `docs/decisions/` | Architecture Decision Records → [Decisions Index](docs/decisions/index.md) | | `docs/investigations/` | Bug investigation reports | | `docs/research/` | Research documents | -| `docs/references/` | External reference materials (-llms.txt etc.) | +| `docs/references/` | External reference materials & upstream analyses → [References Index](docs/references/index.md) | | `docs/knowledge/` | Stable project context (product, tech-stack, guidelines, workflow) | | `templates/` | Workflow templates (plugin-provided) | | `scripts/` | Utility scripts (plugin-provided) | diff --git a/.please/docs/references/index.md b/.please/docs/references/index.md new file mode 100644 index 0000000..c1463a7 --- /dev/null +++ b/.please/docs/references/index.md @@ -0,0 +1,35 @@ +# Reference Analyses + +> Deep-dive analyses of the external libraries `@pleaseai/csp` ports from or draws on. +> Each document maps an upstream codebase module-by-module to its csp counterpart, +> records the load-bearing algorithms and parameters, and tracks where the csp port +> intentionally (or accidentally) diverges from upstream. + +These are **reference materials**, not contracts. The contract is `README.md` / `ARCHITECTURE.md`. +When upstream moves, update the relevant analysis here and reconcile any new drift against +[the sync baseline](#sync-baselines). + +## Documents + +| Library | Upstream | Role | Analysis | +|---|---|---|---| +| **semble** | [MinishLab/semble](https://github.com/MinishLab/semble) | Direct port source — analyzed against the **Rust port** (`crates/csp`, [ADR-0003](../decisions/0003-rewrite-in-rust.md)) | [semble.md](semble.md) | + + + +## Sync baselines + +Each analysis records the exact upstream commit it was written against. To check for upstream +drift, diff from that commit forward (`git log ..main` in the upstream checkout). + +| Library | Analyzed at | Notes | +|---|---|---| +| semble | upstream `136b6f7` (2026-06-18); Rust port `2f2baa2` (PR #34) | Mapped to the Rust crates; beyond prior review baseline `eacbe43`; see semble.md §Divergences | + +## How to add a new reference analysis + +1. Create `.please/docs/references/.md` following the structure of `semble.md`: + overview → pipeline → module map → per-module deep dives → key algorithms → divergences. +2. Pin the exact upstream commit analyzed (record it in the **Sync baselines** table above). +3. Map every upstream module to its csp counterpart (or mark it *not ported* / *adapted*). +4. Add a row to the **Documents** table and link the new file. diff --git a/.please/docs/references/semble.md b/.please/docs/references/semble.md new file mode 100644 index 0000000..b0aa53c --- /dev/null +++ b/.please/docs/references/semble.md @@ -0,0 +1,386 @@ +# Reference Analysis — MinishLab/semble → Rust port (`crates/csp`) + +> Module-by-module analysis of [MinishLab/semble](https://github.com/MinishLab/semble) (the +> Python original) mapped to the **Rust port** under `crates/csp` (library) and `crates/csp-cli` +> (`csp` binary), introduced by [ADR-0003](../decisions/0003-rewrite-in-rust.md). Each section +> captures the load-bearing algorithm + its constants, the Rust-specific structure/idioms, and +> where the port diverges. +> +> **Analyzed at**: upstream semble `136b6f7` (2026-06-18); Rust port at repo `2f2baa2` +> (PR #34 "Rust rewrite foundation"). **Parity oracle**: the TS `src/` test suite reused as +> golden fixtures — Rust reproduces the TS module behavior bit-for-bit, so "parity" is +> *fixture-level*, not full-runtime. The TS `src/` stays the source of truth until Rust reaches +> parity (per ADR-0003). +> **Upstream layout**: Python `src/semble/`. **Port layout**: `crates/csp/src/` (lib) + +> `crates/csp-cli/src/` (bin). + +--- + +## 1. What semble is + +A hybrid **dense + sparse** code-search engine for AI agents that runs entirely on CPU with no +API keys and no GPU. It indexes a local directory or a cloned git repo, then answers +natural-language or symbol queries in single-digit milliseconds. Two retrieval signals: + +- **Dense**: [Model2Vec](https://github.com/MinishLab/model2vec) static embeddings + (`minishlab/potion-code-16M`, 256-dim) — vocab→vector lookup + mean pooling, *not* a + transformer forward pass. CPU-fast. The Rust port wires the official + [`model2vec-rs`](https://crates.io/crates/model2vec-rs) `StaticModel`, with a deterministic + stub fallback (see §4.6). +- **Sparse**: BM25 over identifier-aware tokens (semble uses `bm25s`; Rust ports BM25 directly). + +They are fused with **Reciprocal Rank Fusion** and then reranked with code-specific priors +(definition boosts, path penalties, file-saturation decay). + +--- + +## 2. Pipeline at a glance + +``` + INDEX (once, cached) SEARCH (per query) + ┌────────────────────────────────────────────┐ ┌──────────────────────────────────────────┐ + walk_files ──► detect_language ──► chunk_source resolve_alpha(query) (0.3 symbol / 0.5 NL) + (.gitignore (ext→lang map) (tree-sitter │ + + .cspignore) AST, line- ├─► dense: Model::encode → cosine kNN + │ fallback) ├─► sparse: tokenize → BM25 get_scores + ▼ │ │ (over-fetch top_k * 5 each) + Vec ────────┬───────────────────┘ ▼ + │ RRF normalize each list (1/(k+rank), k=60) + ┌─────────────┴─────────────┐ ▼ + embed_chunks enrich_for_bm25 combined = α·rrf_dense + (1-α)·rrf_bm25 + (dense matrix) ("{content} {stem} ▼ + │ {stem} {dir[-3:]}") rerank (if CODE): + ▼ → tokenize → BM25 boost_multi_chunk_files (wired) + SelectableBasicBackend index apply_query_boost (⚠ identity stub) + (cosine) rerank_top_k (⚠ saturation-only stub) + ▼ + top_k SearchResult → ~/.csp/savings.jsonl +``` + +⚠ = TD-002: the full ranking lives in `ranking::{boosting,penalties}` but is **not yet wired** +into `search.rs`, mirroring the TS source's current state (see §4.10/§6). + +--- + +## 3. Module map (semble → Rust) + +| Upstream `src/semble/` | Rust | Status | Purpose | +|---|---|---|---| +| `types.py` | `csp/src/types.rs` | ported | `Chunk`, `ContentType`, `CallType` enums; `ChunkDict`/`SearchResultDict` serde | +| `tokens.py` | `csp/src/tokens.rs` | ported | identifier-aware tokenizer (BM25 input) | +| `chunking/core.py` | `csp/src/chunking/core.rs` | ported (real tree-sitter) | node-merge + line-fallback boundary algorithm; `TsNode` bridge | +| `chunking/chunking.py` | `csp/src/chunking/source.rs` | ported (⚠ param drift) | `chunk_source` → `Vec`; char↔line conversion | +| `index/file_walker.py` | `csp/src/indexing/file_walker.rs` | ported (`.cspignore`) | gitignore-aware recursive walk (`ignore` crate idioms) | +| `index/files.py` | `csp/src/indexing/files.rs` | ported | ext→language map, content-type sets, file status checks | +| `index/dense.py` | `csp/src/indexing/dense.rs` | ported (real + stub) | `Model` enum, `embed_chunks`, `SelectableBasicBackend` cosine | +| `index/sparse.py` | `csp/src/indexing/sparse.rs` | ported | `Bm25Index`, `enrich_for_bm25`, selector→mask | +| `index/create.py` | `csp/src/indexing/create.rs` | ported | build BM25 + dense + chunks from a path | +| `index/index.py` | `csp/src/indexing/index.rs` | ported | `CspIndex` orchestrator (from_path/from_git/search/find_related/save/load) + `load_or_build_index` | +| `cache.py` | `csp/src/indexing/cache.rs` | adapted | content-hash cache at `~/.csp/index/` (ADR-0002), 0700 perms | +| `search.py` | `csp/src/search.rs` | ported (⚠ ranking stub) | hybrid RRF + alpha blend; trait seams | +| `ranking/weighting.py` | `csp/src/ranking/weighting.rs` | ported | adaptive alpha | +| `ranking/boosting.py` | `csp/src/ranking/boosting.rs` | ported (boost_multi wired; others unwired) | query-type detection + definition/stem/embedded boosts | +| `ranking/penalties.py` | `csp/src/ranking/penalties.rs` | ported (unwired) | path penalties + file-saturation rerank | +| `stats.py` | `csp/src/stats.rs` | adapted | `~/.csp/savings.jsonl` read/write + report formatting | +| `mcp.py` | `csp/src/mcp.rs` (core) + `csp-cli/src/mcp_server.rs` (rmcp transport) | ported | MCP `search` / `find_related` tools | +| `cli.py` | `csp-cli/src/main.rs` | adapted (clap) | subcommands: search / find-related / index / savings / clear / init / mcp | +| `utils.py` | `csp/src/utils.rs` | ported | git-URL detection, `format_results` (snake_case wire), `resolve_chunk` | +| `installer/` | `csp-cli/agents/*.md` (+ `init`) | adapted | agent config templates wired via `init` | +| — | `csp/src/lib.rs` | new | crate root / public re-exports | + +--- + +## 4. Module deep-dives (algorithm + Rust idiom) + +### 4.1 `types.rs` — domain model & two serde shapes + +- `Chunk { content, file_path, start_line: u32, end_line: u32, language: Option }` + derives `PartialEq, Eq` (no `Hash` — score maps key by **index**, not by `Chunk`; see §4.10). +- `ContentType { Code, Docs, Config }` and `CallType { Search, FindRelated }` are serde enums; + `ContentType::as_str()` yields the wire value (`"code"`…). +- **Two distinct dict representations** (the single most important port-time structural choice): + - `ChunkDict` / `SearchResultDict` — **camelCase** (`filePath, startLine, endLine`; both are + `#[serde(rename_all = "camelCase")]`), used for **on-disk persistence** (`chunk_to_dict` / + `chunk_from_dict` / `search_result_to_dict`), matching the camelCase public API. + - `utils::result_to_dict` / `format_results` — **snake_case** (`file_path, start_line, + end_line`), the **CLI/MCP wire format** (built with the `json!` macro, mirroring the TS + `SearchResult.toDict`). +- `chunk_from_dict` returns `Result` (`thiserror`) instead of throwing. +- `chunk_location(chunk)` → `"{file_path}:{start}-{end}"`. + +### 4.2 `tokens.rs` — identifier-aware tokenization + +Same contract as semble `tokens.py`: +- token regex `[A-Za-z_][A-Za-z0-9_]*`; camel/Pascal splitter + `[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+`. +- `split_identifier`: snake → split on `_`; else camel/Pascal split. Returns `[lower]` or + `[lower, *parts]` (≥2 parts). `HandlerStack → [handlerstack, handler, stack]`. +- `tokenize` flat-maps `split_identifier` over every token match. Original lowercased compound + is preserved alongside sub-tokens (exact + partial match). + +### 4.3 `chunking/` — AST chunking with line fallback + +**`core.rs`** (boundary algorithm, byte-based; `RECURSION_DEPTH = 500`, `MIN_CHUNK_SIZE = 50`): +- The merge algorithm is generic over a node trait so tests can drive it with mock nodes; in + production a **`TsNode` bridge** adapts `tree_sitter::Node` to it. +- `language_for(language)` returns a statically-linked `tree_sitter::Language` for a **curated + grammar set**: rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash, + html, css. Unsupported languages → `None` → line fallback. ⚠ This is **narrower than upstream**, + which uses `tree_sitter_language_pack` (≈all languages); see §6. +- `_merge_node_inner` (greedy pack), `_merge_adjacent_chunks`, `chunk_lines` fallback — same + shape as semble. Byte offsets are converted to char offsets for multibyte safety. + +**`source.rs`** (`chunk_source`): +- `DESIRED_CHUNK_LENGTH_CHARS = 1500` (⚠ upstream is now **750** — see §6). +- AST chunking when `language_for(lang).is_some()`, else line fallback. Char offsets → 1-indexed + line numbers; clamps end to avoid the zero-length off-by-one. + +### 4.4 `indexing/file_walker.rs` — gitignore-aware walk + +- Default-ignored dirs mirror semble (`.git/ node_modules/ dist/ build/ .next/ …`) with + **`.csp/`** replacing semble's `.semble/`. +- Merges `.gitignore` **and `.cspignore`** per directory; skips symlinks; sorts entries for + determinism. Files are yielded when the suffix matches the content-type extension set (plus the + negation-pattern re-include rule from upstream). + +### 4.5 `indexing/files.rs` — language detection & file gating + +- `EXTENSION_TO_LANGUAGE` — `&[(&str, &str)]` table (~350 entries). `detect_language(name)` + lowercases the suffix and looks it up. Note: this recognizes far more extensions than the + curated tree-sitter set in §4.3 — recognized-but-unparsed languages still get **walked and + line-chunked**. +- Content-type partition: `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`; code = all minus + those. `get_extensions(types, extra)` inverts the map; the **`extra`** param (custom extensions) + is a small Rust-side API addition. +- File gating: `MAX_FILE_BYTES = 1_000_000` (in `create.rs`); empty/whitespace and too-new + (mtime) files are skipped (`FileStatus`). + +### 4.6 `indexing/dense.rs` — Model2Vec embeddings (real + stub) + +- `Model` is an **enum**: `Static { inner: Arc, dim }` (real `model2vec-rs` + `StaticModel::from_pretrained` + `encode`) | `Stub { dim }` (offline/test, the bit-for-bit TS + stub via `stub_embed`). `Model::encode` / `Model::dim` dispatch over the variant. +- `load_model` resolves `minishlab/potion-code-16M` (or env override) and **falls back to the + stub with a stderr warning** if loading fails; `load_model_with` is the DI seam that keeps unit + tests offline. `make_stub_model(dim)` for tests. `MODEL_CACHE` (`LazyLock>`) + memoizes loads. +- `SelectableBasicBackend` — cosine backend with an optional `selector` (subset of chunk + indices) for language/path-filtered search; `query()` returns `Result` (errors degrade to + empty results in the search path, never panic — see §4.10). +- **Status**: per the Rust track, dense + tree-sitter are **no longer stubs** (TD-001 resolved); + the stub remains only as an offline fallback. + +### 4.7 `indexing/sparse.rs` — BM25 + enrichment + +- `enrich_for_bm25(chunk)` → `"{content} {stem} {stem} {dir[-3:]}"` — stem repeated twice to + up-weight path matches; last 3 parent dir components. `Bm25Index` ports the BM25 scoring + (`get_scores(tokens, weight_mask)`). `selector_to_mask(selector, size)` → `Vec` mask. + +### 4.8 `indexing/create.rs` — index construction + +`create_index_from_path`: walk → per file: `detect_language`, size/empty gate, read text, store +path **relative to `display_root`**, `chunk_source`. Then `embed_chunks` → dense matrix; build +`Bm25Index` over `tokenize(enrich_for_bm25(chunk))`; wrap dense in `SelectableBasicBackend`. +Empty → error. + +### 4.9 `indexing/index.rs` — `CspIndex` orchestrator + +The public façade (parallels `SembleIndex`): +- `from_path`, `from_git` (git clone into a tempdir; repo-relative chunk paths), `search` + (`QueryOptions`), `find_related` (semantic kNN on the seed, same-language, excludes seed), + `save` / `load_from_disk` (persists chunks + bm25 + semantic + metadata). +- `load_or_build_index` (`LoadOrBuildOptions`) — the cache-aware entry the CLI/MCP use: load from + `~/.csp/index/` on a validated hit, else build and persist. +- Builds file→indices and language→indices maps for selectors and stats. + +### 4.10 `search.rs` — hybrid retrieval & fusion (with TD-002 stub) + +The heart of ranking. `search(query, model, semantic_index, bm25_index, chunks, top_k, options)`: +1. `resolve_alpha(query, options.alpha)`; `rerank = options.rerank.unwrap_or(true)`. +2. **Over-fetch** `candidate_count = top_k * 5` for both signals. +3. `search_semantic` — `Model::encode([query])` → backend kNN → `score = 1 - distance`. +4. `search_bm25` — `tokenize(query)` → `get_scores(mask)` → top-k via `sort_top_k`, drop ≤0. +5. **RRF** (`rrf_scores`): rank each list by raw score desc (`f64::total_cmp`, stable), + `1/(RRF_K + rank)`, `RRF_K = 60`, rank from 1. +6. Union of indices, **sorted by `start_line`** to neutralize hash-iteration nondeterminism; + `combined = α·rrf_semantic + (1-α)·rrf_bm25`. +7. If `rerank`: `boost_multi_chunk_files` (**wired**, shared impl) → `apply_query_boost_identity` + (⚠ stub) → `rerank_top_k_saturation` (⚠ stub: file-saturation decay only, path penalties + **not applied**, `penalise_paths` ignored). Else plain sort + truncate. + +**Rust idioms / structure**: +- **Trait seams** for testability: `EmbeddingModel`, `VectorBackend`, `SparseBackend`, + implemented for the concrete `dense::Model`, `SelectableBasicBackend`, `sparse::Bm25Index`. + Tests inject mocks. +- `Scores = IndexMap` (see `ranking/mod.rs`) — keyed by **chunk index** into the + canonical `&[Chunk]`, insertion-ordered. This is the Rust counterpart of the TS + `Map` (object-identity keyed) whose iteration order the upstream relies on for + tie-breaking. Rust can't hash `Chunk` cheaply, so it indexes instead. +- **Error degradation**: a backend `query` failure prints to stderr and returns empty rather + than panicking — matters for the long-running MCP server. +- `SearchOptions` struct (`alpha`, `selector`, `rerank`) instead of Python kwargs. + +> **TD-002**: `ranking::boosting::apply_query_boost` and `ranking::penalties::rerank_top_k` are +> fully ported (with tests) but **not wired** into `search.rs` — exactly as in the TS source, +> which still uses the inline stubs (`TODO(integration)`). So search-ranking parity is +> fixture-level. `FILE_SATURATION_THRESHOLD`/`DECAY` are therefore defined **twice** (the inline +> stub in `search.rs` and the real one in `ranking/penalties.rs`). + +### 4.11 `ranking/weighting.rs` — adaptive alpha + +`resolve_alpha(query, alpha)`: explicit wins; else `ALPHA_SYMBOL = 0.3` (BM25-leaning) for symbol +queries vs `ALPHA_NL = 0.5` for NL, decided by `is_symbol_query`. + +### 4.12 `ranking/boosting.rs` — query-type detection & boosts (mostly unwired) + +Ported faithfully (`LazyLock` for the static patterns, `RefCell` LRU for +`definition_pattern` cache): +- `SYMBOL_QUERY_RE` / `EMBEDDED_SYMBOL_RE` — symbol vs NL classification. +- `apply_query_boost` (unwired): symbol → `_boost_symbol_definitions` (definition regex per + keyword set: `class def fn func struct enum trait type …` case-sensitive + SQL DDL + case-insensitive; `DEFINITION_BOOST_MULTIPLIER = 3.0`, ×1.5 on stem match); NL → + `_boost_stem_matches` (`STEM_BOOST_MULTIPLIER = 1.0`, ≥0.10 ratio, prefix-match morphology) + + `_boost_embedded_symbols` (`EMBEDDED_SYMBOL_BOOST_SCALE = 0.5`, `EMBEDDED_STEM_MIN_LEN = 4`). +- `boost_multi_chunk_files` (**wired** into search): top chunk per file boosted by + `max_score * FILE_COHERENCE_BOOST_FRAC` (=0.2) × (file score sum / max file sum). + +### 4.13 `ranking/penalties.rs` — path penalties & saturation rerank (unwired) + +`rerank_top_k(scores, chunks, top_k, penalise_paths)` ported but unwired: +- Path penalties (multiplicative): test files/dirs `STRONG_PENALTY = 0.3`; compat/legacy + + examples/docs `0.3`; re-export barrels (`__init__.py`, `package-info.java`) + `MODERATE_PENALTY = 0.5`; `.d.ts` `MILD_PENALTY = 0.7`. +- File-saturation decay: beyond `FILE_SATURATION_THRESHOLD = 1` per file, ×`FILE_SATURATION_DECAY + = 0.5 ^ excess`; greedy with safe early-exit. Penalties apply only when `alpha_weight < 1.0`. + +### 4.14 `indexing/cache.rs` — index cache (ADR-0002) + +- Cache home `$HOME/.csp` (override via `CacheLocation`), index root `/index`, per-source + leaf `/index/`. `ensure_cache_dir` creates the chain with **0700** perms + (NFR-003), tightening pre-existing dirs on Unix. +- `clear_index_cache` removes only the index dir — never the `~/.csp` home (which also holds + `savings.jsonl`). +- **Divergence from upstream**: semble uses the OS cache dir (`~/Library/Caches/semble`, XDG, + `%LOCALAPPDATA%`) + `SEMBLE_CACHE_LOCATION`; csp fixes a global `~/.csp/index/` per ADR-0002. + +### 4.15 `stats.rs` — token-savings telemetry + +- Appends JSONL `{ts, call, results, snippet_chars, file_chars}` to `~/.csp/savings.jsonl` + (`now_secs`, `default_stats_file`). `format_savings_report` renders the colored ASCII report + (Total saved, efficiency bar, By Period; By Call Type gated behind `--verbose`). `clear_savings`. +- **Divergence**: fixed `~/.csp/savings.jsonl` (not the OS cache dir); no `flock` (sub-4KB + appends are atomic on POSIX); header is "Csp". + +### 4.16 MCP — `csp/src/mcp.rs` (core) + `csp-cli/src/mcp_server.rs` (rmcp transport) + +Clean two-layer split: +- **`csp::mcp`** (lib) — the unit-tested tool **core**: `search` / `find_related` handler logic, + in-process LRU `IndexCache` (`CACHE_MAX_SIZE = 10`, `Arc` so indexes are `Send` + across tasks), `_get_index` with git-transport guards. +- **`csp-cli::mcp_server`** (bin) — **rmcp 1.7** stdio wiring: `CspMcpServer` with + `#[tool_router]` + `#[tool]` async `search`/`find_related`, `#[tool_handler(router = + self.tool_router)]` (routes through the stored field; the default `Self::tool_router()` would + rebuild per call and trip clippy `dead_code`). `run_mcp(path, ref, content)` serves on a tokio + runtime. Verified on the wire (initialize / tools/list / tools/call). + +### 4.17 `csp-cli/src/main.rs` — CLI (clap) + +- `#[derive(Parser)]` with a `Command` `#[derive(Subcommand)]` enum: **search**, **find-related**, + **index** (build + persist a standalone index), **savings**, **clear** (`all|index|savings`), + **init** (write an agent file), **mcp** (run the stdio server). +- `search` / `find-related` route through `load_or_build_index` (or an explicit `--index` via + `LoadOptions`). Output is the snake_case wire JSON (`utils::format_results`). +- **Divergence from upstream**: csp keeps **`init`** (not `install`/`uninstall`), exposes the MCP + server under an explicit **`mcp`** subcommand (semble starts it from the bare binary), and adds + `index` / `clear`. + +### 4.18 `utils.rs` — helpers + +- `is_git_url` (scheme prefixes + scp-style), `resolve_chunk(chunks, file_path, line) -> + Option<&Chunk>` (interior match preferred, boundary fallback), `result_to_dict` / + `format_results` (snake_case wire dict). Model name resolution honors the env override. + +--- + +## 5. Load-bearing constants (semble vs Rust port) + +| Constant | semble | Rust | Location | +|---|---|---|---| +| RRF k | `60` | `60` | `search.rs RRF_K` | +| α symbol / NL | `0.3` / `0.5` | `0.3` / `0.5` | `ranking/weighting.rs` | +| candidate over-fetch | `top_k * 5` | `top_k * 5` | `search.rs` | +| desired chunk length | **`750`** | **`1500`** ⚠ | `chunking/source.rs` | +| min chunk size | `50` | `50` | `chunking/core.rs` | +| recursion depth | `500` | `500` | `chunking/core.rs` | +| definition boost × | `3.0` | `3.0` | `ranking/boosting.rs` | +| embedded-symbol scale | `0.5` | `0.5` | `ranking/boosting.rs` | +| embedded stem min len | `4` | `4` | `ranking/boosting.rs` | +| stem boost × | `1.0` | `1.0` | `ranking/boosting.rs` | +| file-coherence frac | `0.2` | `0.2` | `ranking/boosting.rs` | +| strong / moderate / mild penalty | `0.3` / `0.5` / `0.7` | same | `ranking/penalties.rs` | +| file saturation threshold / decay | `1` / `0.5` | `1` / `0.5` (defined **twice**, see §4.10) | `search.rs` + `ranking/penalties.rs` | +| max file bytes | `1_000_000` | `1_000_000` | `index/files.py` / `indexing/create.rs` | +| default model | `minishlab/potion-code-16M` | same (real + stub) | `utils.py` / `indexing/dense.rs` | +| MCP in-mem LRU | `10` | `10` | `mcp.py` / `csp::mcp` | +| cache dir mode | — | `0o700` | `indexing/cache.rs` | + +--- + +## 6. Divergences & drift + +### 6.1 Intentional adaptations (Rust port by design) + +1. **Score maps keyed by index** — `Scores = IndexMap` vs Python/TS `dict/Map` + keyed by the `Chunk` object. Same semantics, different key type. +2. **Trait seams** — `EmbeddingModel`/`VectorBackend`/`SparseBackend` for DI/testability. +3. **`Model` enum** (real `model2vec-rs` `Static` + offline `Stub`), graceful stub fallback. +4. **Two serde shapes** — camelCase `ChunkDict`/`SearchResultDict` (disk persistence) vs + snake_case `utils::result_to_dict`/`format_results` (CLI/MCP wire). +5. **Error handling** — `Result` + `thiserror`; backend errors degrade instead of panicking. +6. **MCP split** — testable core in `csp::mcp`, rmcp 1.7 transport in `csp-cli::mcp_server`. +7. **Storage** — fixed `~/.csp/index/` (0700) + `~/.csp/savings.jsonl` (ADR-0002), not the OS + cache dir / `SEMBLE_CACHE_LOCATION`. `.cspignore` (not `.sembleignore`). +8. **CLI** — clap; `init` (not `install`/`uninstall`); explicit `mcp` subcommand; adds `index`. + +### 6.2 Open stubs & gaps (verify before claiming runtime parity) + +- **TD-002 — ranking not wired**: `search.rs` uses `apply_query_boost_identity` + + `rerank_top_k_saturation`; the real `ranking::{boosting::apply_query_boost, + penalties::rerank_top_k}` are ported but unwired (matches the TS source). Search-ranking parity + is fixture-level only. Saturation constants are duplicated as a result. +- **Curated tree-sitter set** — only ~14 grammars are statically linked (`language_for`); upstream + uses `tree_sitter_language_pack` (≈all languages). Languages outside the curated set are + recognized by the extension map but **line-chunked**, not AST-chunked. This is a real behavioral + narrowing vs upstream. + +### 6.3 Upstream drift since the review baseline (`eacbe43` → `136b6f7`) + +> ⚠ **Action item**: reconcile before claiming parity. + +- **Chunk length changed to 750** upstream (`chunking/chunking.py`), while the Rust port (and the + TS source it mirrors) still use **1500**. Upstream also added a `chunk_size` field to index + metadata + cache validation so the change auto-invalidates stale caches. Decide whether csp + follows 750 (and adds the metadata field to both TS and Rust) or documents 1500 as a deliberate + divergence. + +--- + +## 7. How to refresh this analysis + +1. Update the upstream checkout and diff from the recorded baseline: + `git -C log 136b6f7..main --oneline`. +2. Quality gate the Rust side: `cargo fmt --all && cargo clippy --all-targets --all-features -- + -D warnings && cargo test --workspace`. +3. Re-read any changed module and update the matching §4 section + §5 constants table. +4. When a stub gets wired (TD-002) or grammars are added, move the item out of §6.2 and update + §4.10 / §4.3. Bump the baseline in `index.md` and this file's header. +5. Cross-check against the `upstream-semble-sync-baseline` and `rust-rewrite-track-status` + memories and `CLAUDE.md` (Rust rewrite section). + +--- + +*Related: [`ARCHITECTURE.md`](../../../ARCHITECTURE.md) (target architecture), +[ADR-0001](../decisions/0001-native-tree-sitter.md) (tree-sitter bindings), +[ADR-0002](../decisions/0002-index-storage-cache-model.md) (cache model), +[ADR-0003](../decisions/0003-rewrite-in-rust.md) (Rust rewrite), +[`../knowledge/tech-stack.md`](../knowledge/tech-stack.md).* diff --git a/CLAUDE.md b/CLAUDE.md index 5379c1e..295df0e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -96,5 +96,9 @@ For full file listing with workspace artifacts, use `Skill("please:project-knowl ### Decision Records - `.please/docs/decisions/` — Architecture Decision Records (ADR) + +### Reference Analyses (.please/docs/references/) +- `index.md` — index of upstream-library analyses (scales as more libraries are adopted) +- `semble.md` — module-by-module analysis of MinishLab/semble mapped to the **Rust port** (`crates/csp`), with algorithms, constants, and drift tracking