From 199cdd08d44b0a3ca8dace36eec1fc15d284f8be Mon Sep 17 00:00:00 2001
From: Minsu Lee
Date: Fri, 19 Jun 2026 20:13:18 +0900
Subject: [PATCH 1/3] docs(references): add cocoindex-code and model2vec reference analyses

- cocoindex-code: prior-art comparison (dense-only MCP vs csp hybrid); includes published CoIR benchmark row and a caveat footnote on methodology differences
- model2vec: analysis of the model2vec / model2vec-rs dependency used by csp's dense retrieval leg (potion-code-16M); includes MTEB and CoIR benchmark figures
- index.md: registered both analyses in the Documents table and Sync baselines table

Key finding: model2vec's own card shows dense+BM25 hybrid (40.41 CoIR) beats dense-only (37.05), directly validating csp's hybrid design over cocoindex's dense-only approach.
---
 .please/docs/references/cocoindex.md | 125 ++++++++++++++++++++++++++
 .please/docs/references/index.md     |   4 +
 .please/docs/references/model2vec.md | 126 +++++++++++++++++++++++++++
 3 files changed, 255 insertions(+)
 create mode 100644 .please/docs/references/cocoindex.md
 create mode 100644 .please/docs/references/model2vec.md

diff --git a/.please/docs/references/cocoindex.md b/.please/docs/references/cocoindex.md
new file mode 100644
index 0000000..5ef5079
--- /dev/null
+++ b/.please/docs/references/cocoindex.md
@@ -0,0 +1,125 @@
+# Reference Analysis — CocoIndex Code (`cocoindex-code`)
+
+> Prior-art / comparison analysis of [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code)
+> and the underlying [CocoIndex](https://cocoindex.io) data-pipeline framework. **This is not a
+> port source** — csp ports [MinishLab/semble](semble.md). CocoIndex Code is an independent,
+> competing project in the *same* niche (AST-aware semantic code search exposed to coding agents
+> over MCP), so it is the closest direct comparator for csp's product surface. This document maps
+> their design choices against csp's and flags what is worth borrowing vs. where the two
+> deliberately diverge.
+>
+> **Analyzed at**: web docs + GitHub README as of 2026-06-19 (no upstream commit pinned — this is a
+> comparison, not a parity oracle). Sources: ,
+> , .
+
+---
+
+## 1. What CocoIndex Code is
+
+An MCP server + CLI that gives coding agents (Claude Code, Cursor, Codex) **AST-aware semantic
+code search** over a whole repo, pitched on token savings (~70%) and sub-second freshness. It is
+built on the general-purpose **CocoIndex** data-indexing framework (a Rust engine with a Python
+API for declarative ETL/embedding pipelines); `cocoindex-code` is the packaged, code-search-specific
+application of that framework.
+
+Two layers, easy to conflate:
+
+- **CocoIndex** (the framework) — Rust core + Python DSL for incremental, lineage-tracked indexing
+  pipelines. Built-in `SplitRecursively` (tree-sitter chunker) and `SentenceTransformerEmbed`
+  functions. This is the "build your own pipeline" toolkit.
+- **CocoIndex Code** (`ccc`) — the opinionated end-user product built on it: a CLI/MCP server with
+  an index, daemon, and agent integration baked in.
+
+---
+
+## 2. How it compares to csp (the load-bearing differences)
+
+| Aspect | **CocoIndex Code** | **csp / semble** |
+|---|---|---|
+| Retrieval signal | **Dense-only** semantic vectors (no BM25) | **Hybrid** dense + BM25, fused via RRF (`k=60`) |
+| Embeddings | **Real** models: `Snowflake/snowflake-arctic-embed-xs` local (SentenceTransformers), or 100+ cloud providers via LiteLLM | Model2Vec static embeddings (`potion-code-16M`); TS/Rust ship a deterministic **stub** until integration |
+| Chunking | tree-sitter AST via `SplitRecursively` (`chunk_size=1000`, `chunk_overlap=300` in the canonical example) | tree-sitter AST, **no overlap**, target 1500 chars, `_MIN_CHUNK_SIZE=50`, line-fallback |
+| Indexing model | **Incremental delta** — diff vs. prior AST, re-embed only changed chunks, "80–90% cache hit"; optional **daemon** keeps index warm | Content-hash cache at `~/.csp/index/` (ADR-0002); rebuild on change, no long-running daemon |
+| Storage | **SQLite** (`cocoindex.db` + `target_sqlite.db`) under `/.cocoindex_code/` | JSON/serde index files under global `~/.csp/index/` |
+| Ranking | Dense cosine kNN; no documented code-specific reranking | RRF → multi-chunk file boost → query-type boost → path penalties + file-saturation decay |
+| Symbol / lexical queries | Weak by construction (dense-only — exact identifier match relies on the embedding) | **Adaptive alpha** (`0.3` symbol / `0.5` NL) + identifier-aware BM25 tokenization explicitly handle symbol lookup |
+| Embedding benchmark[^bench] | `arctic-embed-xs`: 22M params, 384-dim — **50.15** MTEB *general-text* Retrieval (no public CoIR/code score) | `potion-code-16M`: 16M, 256-dim — **37.05** CoIR (code) avg NDCG@10; teacher `CodeRankEmbed` (137M) = 59.14 |
+| Tech stack | Rust engine, Python wrapper (~98% Python repo); `pipx install`, binary `ccc` | TS/Bun (`@pleaseai/csp`, binary `csp`) + Rust port (`crates/csp`, ADR-0003) |
+| Deployment tiers | local / shared **daemon** / VPC enterprise (branch overlays, cross-repo, SSO) | single local tool; no daemon/enterprise tier |
+| License | Apache-2.0 | MIT |
+
+**One-line takeaway**: CocoIndex Code bets on *real embeddings + incremental delta indexing + a
+daemon*; csp/semble bet on *hybrid dense+sparse + code-specific reranking on a zero-dependency CPU
+stack*. The dense-only choice makes CocoIndex weaker on exact-symbol queries but lets it lean on
+stronger embedding models; csp's BM25 leg + adaptive alpha is precisely the hedge against that.
+
+[^bench]: **Not a head-to-head — the two numbers are from different benchmarks.** `arctic-embed-xs`'s
+    50.15 is **MTEB general-text Retrieval** (English prose), *not* code; arctic-embed is not a
+    code-trained model and has no public CoIR score, so 50.15 must **not** be read as "code-search
+    quality." `potion-code-16M`'s 37.05 is **CoIR** (code-specific, NDCG@10 avg over CosQA /
+    CodeFeedback ST/MT). The genuinely load-bearing figure for csp: the model2vec card itself
+    reports **potion-code-16M + BM25 hybrid = 40.41** (vs. 37.05 dense-only, **+3.36**) — the model's
+    own authors measure that adding sparse retrieval beats static-dense alone, which directly
+    validates csp/semble's hybrid + adaptive-alpha design over CocoIndex's dense-only path.
+    Sources: [potion-code-16M card](https://huggingface.co/minishlab/potion-code-16M),
+    [Snowflake-Labs/arctic-embed](https://github.com/Snowflake-Labs/arctic-embed),
+    [CoIR benchmark (ACL 2025)](https://github.com/coir-team/coir).
+
+---
+
+## 3. CLI surface (`ccc`)
+
+For comparison with csp's `csp` subcommands (`search`, `index`, `find-related`, `mcp`, `init`,
+`savings`):
+
+| `ccc` | Purpose | csp analog |
+|---|---|---|
+| `ccc init` | scaffold settings | `csp init` |
+| `ccc index` | build/update index | `csp index` |
+| `ccc search ` | semantic search | `csp search` |
+| `ccc status` | index stats | (≈ `csp savings` / stats) |
+| `ccc mcp` | MCP server, stdio | `csp mcp` |
+| `ccc daemon [status\|restart\|stop]` | background index daemon | — (no equivalent) |
+| `ccc reset` | delete index DBs | `csp clear index` |
+| `ccc doctor` | diagnostics | — |
+
+- **MCP tool**: a single `search()` tool with `languages` / `paths` / `limit` / `offset` filters.
+  csp instead exposes **two** tools (`search`, `find_related`); CocoIndex has no `find_related`
+  analog, but it offers structured filters that csp lacks.
+- **Install/run**: `pipx install 'cocoindex-code[full]'` (local embeddings) or
+  `pipx install cocoindex-code` (slim, cloud-only); binary `ccc`. csp: `bunx @pleaseai/csp`.
+- **Config**: `~/.cocoindex_code/global_settings.yml` (`embedding.model`, `embedding.provider`,
+  `embedding.device` = cpu/cuda/mps, `min_interval_ms`, asymmetric `indexing_params`/`query_params`);
+  per-project `include_patterns` / `exclude_patterns`.
+
+---
+
+## 4. Ideas worth tracking for csp
+
+Not endorsements — open questions surfaced by the comparison:
+
+1. **Incremental delta indexing + daemon.** CocoIndex's headline feature is re-embedding only
+   changed chunks against a warm index. csp currently caches by content hash and rebuilds; a
+   chunk-level delta + optional daemon is the natural next perf step if/when real embeddings land
+   (the stub makes re-embedding cost moot today — see [[dense-embedding-is-a-stub]]).
+2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant
+   once csp wires real Model2Vec — many code-retrieval models want a query prefix.
+3. **Structured MCP filters** (`languages`, `paths`, `limit`, `offset`). csp's `search` tool could
+   adopt these cheaply; they map onto existing chunk metadata.
+4. **Chunk overlap** (`chunk_overlap=300`). semble/csp use **no** overlap; worth measuring whether
+   overlap improves recall on boundary-spanning definitions, or just inflates the index.
+5. **Branch overlays** (treat a PR/branch as a delta on a shared main index). Enterprise-tier idea,
+   but the "index once, overlay per branch" model could inform csp's global `~/.csp/index/` layout.
+
+Where csp should **not** follow: dropping BM25. The dense-only path is CocoIndex's biggest
+weakness for symbol/identifier queries, and csp's hybrid + adaptive-alpha design is the explicit
+counter-position (see [CLAUDE.md](../../../CLAUDE.md) "Conventions to preserve from semble").
+
+---
+
+## 5. Sources
+
+- CocoIndex Code product page —
+- `cocoindex-io/cocoindex-code` (CLI/MCP) —
+- CocoIndex framework code-index example —
+- `cocoindex-io/realtime-codebase-indexing` —
diff --git a/.please/docs/references/index.md b/.please/docs/references/index.md
index c1463a7..c368ebb 100644
--- a/.please/docs/references/index.md
+++ b/.please/docs/references/index.md
@@ -14,6 +14,8 @@ When upstream moves, update the relevant analysis here and reconcile any new dri
 | Library | Upstream | Role | Analysis |
 |---|---|---|---|
 | **semble** | [MinishLab/semble](https://github.com/MinishLab/semble) | Direct port source — analyzed against the **Rust port** (`crates/csp`, [ADR-0003](../decisions/0003-rewrite-in-rust.md)) | [semble.md](semble.md) |
+| **cocoindex-code** | [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code) | Prior-art / comparator — independent AST code-search MCP in the same niche (not a port source) | [cocoindex.md](cocoindex.md) |
+| **model2vec** | [MinishLab/model2vec](https://github.com/MinishLab/model2vec) + [model2vec-rs](https://github.com/MinishLab/model2vec-rs) | Direct dependency — the dense-retrieval leg (`potion-code-16M`); Rust port wires `model2vec-rs` | [model2vec.md](model2vec.md) |
@@ -25,6 +27,8 @@ drift, diff from that commit forward (`git log ..main` in the upstream
 | Library | Analyzed at | Notes |
 |---|---|---|
 | semble | upstream `136b6f7` (2026-06-18); Rust port `2f2baa2` (PR #34) | Mapped to the Rust crates; beyond prior review baseline `eacbe43`; see semble.md §Divergences |
+| cocoindex-code | web docs + GitHub README, 2026-06-19 (no commit pinned); embedding benchmarks from model HF cards | Comparison/prior-art, **not a port** — no parity oracle. Drift = re-check vs. cocoindex's docs/README; benchmark row reflects published `potion-code-16M` (CoIR) vs. `arctic-embed-xs` (MTEB) figures. See cocoindex.md §2 + `[^bench]` |
+| model2vec | GitHub READMEs + HF cards, 2026-06-19; `model2vec-rs` `0.2.1` (no commit pinned) | **Direct dependency**, not a port — the dense leg. Drift = pin `model2vec-rs` crate version + `potion-code-16M` card revision when the stub is swapped for real weights. See model2vec.md §4–5 |

 ## How to add a new reference analysis

diff --git a/.please/docs/references/model2vec.md b/.please/docs/references/model2vec.md
new file mode 100644
index 0000000..0519f8c
--- /dev/null
+++ b/.please/docs/references/model2vec.md
@@ -0,0 +1,126 @@
+# Reference Analysis — Model2Vec (`model2vec` / `model2vec-rs`)
+
+> Analysis of [MinishLab/model2vec](https://github.com/MinishLab/model2vec) (Python) and its Rust
+> inference crate [MinishLab/model2vec-rs](https://github.com/MinishLab/model2vec-rs). **This is a
+> direct dependency, not a port source** — Model2Vec *is* csp's dense-retrieval leg. semble uses the
+> `minishlab/potion-code-16M` static model; the csp Rust port (`crates/csp/src/indexing/dense.rs`)
+> wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback until
+> integration lands (see [[dense-embedding-is-a-stub]]). This doc captures how the embeddings are
+> produced, which model csp uses, the published benchmarks, and the Rust crate's API/limits.
+>
+> **Analyzed at**: GitHub READMEs + HF model cards, 2026-06-19. `model2vec-rs` `0.2.1` (May 2026).
+> Both projects MIT. Sources in §6.
+
+---
+
+## 1. What Model2Vec is
+
+A method (and toolkit) that **distills a sentence-transformer into a static embedding model**:
+each vocabulary token gets one precomputed vector; encoding a string is a **vocab→vector lookup +
+mean pooling**, *not* a transformer forward pass. Result: ~50× smaller, up to ~500× faster on CPU,
+with a modest quality drop. No GPU, no API key, deterministic. This is exactly why semble/csp can
+do "single-digit-millisecond" search on a laptop — the dense signal is a matrix gather, not
+inference.
+
+**Distillation pipeline** (`distill()`, ~30s on CPU, no training data needed):
+1. **Vocabulary forward pass** — run the vocab through the teacher (e.g. `BAAI/bge-base-en-v1.5`)
+   to get one embedding per token.
+2. **Token→vector table + mean pooling** — store the table; inference pools token vectors.
+3. **Post-processing** — **PCA** (dim reduction), **Zipf weighting** (down-weight frequent tokens),
+   and **tokenlearn / POTION** (the training trick that lifted the `potion-*` generation above
+   plain distillation).
+
+---
+
+## 2. The "potion" pre-trained models
+
+| Model | Params | Dim | Task | Notes |
+|---|---|---|---|---|
+| `potion-base-2M` | 1.8M | — | general | smallest |
+| `potion-base-4M` | 3.7M | — | general | |
+| `potion-base-8M` | 7.5M | 256 | general | crate's example default |
+| `potion-base-32M` | 32.3M | 512 | general | "most performant static model" |
+| `potion-retrieval-32M` | 32.3M | 512 | retrieval | retrieval-tuned `potion-base-32M` |
+| `potion-multilingual-128M` | 128M | — | 101 langs | multilingual |
+| **`potion-code-16M`** | **16M** | **256** | **code** | **← csp/semble's model** (vocab ≈ 62.5k) |
+
+**`potion-code-16M`** (the one that matters for csp) is distilled from **`CodeRankEmbed`** (137M
+teacher), then **tokenlearn on CornStack pairs**, **contrastive fine-tune (MultipleNegativesRankingLoss)**,
+and **post-SIF re-regularization**. 256-dim output — matching what `dense.rs` expects.
+
+---
+
+## 3. Benchmarks (and what they mean for csp)
+
+**General-text MTEB** (from the model2vec results README):
+
+| Model | MTEB avg | Retrieval (NDCG@10) |
+|---|---|---|
+| `potion-base-32M` | 52.13 | 32.67 |
+| `potion-base-8M` | 51.08 | — |
+| `potion-retrieval-32M` | — | 35.06 |
+
+**Code retrieval — CoIR** (from the `potion-code-16M` HF card; this is csp's relevant number):
+
+| Model | Params | CoIR avg | CosQA | CodeFeedback-ST | CodeFeedback-MT |
+|---|---|---|---|---|---|
+| `CodeRankEmbed` (teacher) | 137M | 59.14 | 35.92 | 78.11 | 42.61 |
+| **`potion-code-16M`** | 16M | **37.05** | 21.37 | 50.27 | 36.26 |
+| **`potion-code-16M` + BM25 hybrid** | 16M | **40.41** | 21.63 | 64.26 | 51.23 |
+
+> **The load-bearing line for csp**: the model's own card reports **dense-only 37.05 → +BM25 hybrid
+> 40.41 (+3.36)**. The Model2Vec authors themselves measure that pairing the static dense model with
+> sparse BM25 beats dense alone — which is precisely csp/semble's hybrid + adaptive-alpha design, and
+> the direct counter-argument to a dense-only engine like [cocoindex-code](cocoindex.md). Static
+> embeddings trade ~22 NDCG points vs. the 137M teacher for orders-of-magnitude speed; the BM25 leg
+> is how csp claws some of that back.
+
+---
+
+## 4. `model2vec-rs` — the Rust inference crate (what csp wires)
+
+- **Crate**: `model2vec-rs` `0.2.1` (crates.io), 100% Rust, MIT. ~1.7× the Python throughput
+  (≈8000 vs 4650 samples/s). **Inference-only**.
+- **API** — `StaticModel` struct:
+  - `from_pretrained(id_or_path, token, normalize, subfolder)` — load from HF Hub or local path.
+  - `from_bytes(...)` — in-memory load (WASM / embedded).
+  - `encode(&[String])` — default params; `encode_with_args(.., max_length, batch_size)` — custom.
+- **Formats**: `safetensors` with **f32 / f16 / i8** weights. Tokenization via `onig` or
+  `fancy-regex`. Feature flags `local-only`, `wasm`. Ships a CLI for single/batch encode.
+- **Does NOT do**: distillation, training, fine-tuning, dynamic embeddings — those stay in the
+  Python `model2vec`. The Rust crate is purely the lookup+pool inference path.
+
+```rust
+let model = StaticModel::from_pretrained("minishlab/potion-code-16M", None, None, None)?;
+let embeddings = model.encode(&["where do we embed chunks".to_string()]);
+```
+
+### How csp uses it
+
+`crates/csp/src/indexing/dense.rs` exposes a `Model` enum wrapping `model2vec-rs` `StaticModel`
+(real path) **and** a deterministic stub (`TODO(integration)`), so the Rust port reproduces the TS
+stub bit-for-bit for fixture-level parity (see [semble.md §4.6](semble.md), [[dense-embedding-is-a-stub]]).
+`SelectableBasicBackend` does cosine kNN over the resulting matrix. When the stub is swapped for real
+`potion-code-16M` weights, the dense leg becomes the table above; the BM25 leg and ranking are
+unchanged.
+
+---
+
+## 5. Relevance map (Model2Vec → csp)
+
+| Model2Vec concept | csp counterpart |
+|---|---|
+| `StaticModel` (Python) / `model2vec-rs::StaticModel` (Rust) | `dense.rs` `Model` enum (real + stub) |
+| `potion-code-16M` (256-dim) | the embedding model semble/csp target |
+| vocab→vector lookup + mean pooling | `embed_chunks` dense matrix build |
+| cosine over embeddings | `SelectableBasicBackend` kNN |
+| dense-only vs. +BM25 hybrid (40.41) | csp's RRF fusion of dense + BM25 (the whole point) |
+
+---
+
+## 6. Sources
+
+- `MinishLab/model2vec` (method, distill/encode, potion models) —
+- MTEB results table —
+- `MinishLab/model2vec-rs` (Rust inference crate) —
+- `minishlab/potion-code-16M` card (CoIR + hybrid numbers, training recipe) —

From 11d946915b071b80bb4624f8768a72a75b3f4e3e Mon Sep 17 00:00:00 2001
From: Minsu Lee
Date: Fri, 19 Jun 2026 20:16:34 +0900
Subject: [PATCH 2/3] docs(references): use backtick code format for memory refs instead of wiki-links

Apply AI code review suggestions (gemini-code-assist): GitHub markdown does not render [[wiki-link]] syntax, and the target (dense-embedding-is-a-stub) is a memory note with no in-repo path. Switch the 3 occurrences to backtick code format.
---
 .please/docs/references/cocoindex.md | 2 +-
 .please/docs/references/model2vec.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.please/docs/references/cocoindex.md b/.please/docs/references/cocoindex.md
index 5ef5079..fc73fb2 100644
--- a/.please/docs/references/cocoindex.md
+++ b/.please/docs/references/cocoindex.md
@@ -101,7 +101,7 @@ Not endorsements — open questions surfaced by the comparison:
 1. **Incremental delta indexing + daemon.** CocoIndex's headline feature is re-embedding only
    changed chunks against a warm index. csp currently caches by content hash and rebuilds; a
    chunk-level delta + optional daemon is the natural next perf step if/when real embeddings land
-   (the stub makes re-embedding cost moot today — see [[dense-embedding-is-a-stub]]).
+   (the stub makes re-embedding cost moot today — see `dense-embedding-is-a-stub`).
 2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant
    once csp wires real Model2Vec — many code-retrieval models want a query prefix.
 3. **Structured MCP filters** (`languages`, `paths`, `limit`, `offset`). csp's `search` tool could
diff --git a/.please/docs/references/model2vec.md b/.please/docs/references/model2vec.md
index 0519f8c..a13c541 100644
--- a/.please/docs/references/model2vec.md
+++ b/.please/docs/references/model2vec.md
@@ -5,7 +5,7 @@
 > direct dependency, not a port source** — Model2Vec *is* csp's dense-retrieval leg. semble uses the
 > `minishlab/potion-code-16M` static model; the csp Rust port (`crates/csp/src/indexing/dense.rs`)
 > wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback until
-> integration lands (see [[dense-embedding-is-a-stub]]). This doc captures how the embeddings are
+> integration lands (see `dense-embedding-is-a-stub`). This doc captures how the embeddings are
 > produced, which model csp uses, the published benchmarks, and the Rust crate's API/limits.
 >
 > **Analyzed at**: GitHub READMEs + HF model cards, 2026-06-19. `model2vec-rs` `0.2.1` (May 2026).
@@ -99,7 +99,7 @@ let embeddings = model.encode(&["where do we embed chunks".to_string()]);

 `crates/csp/src/indexing/dense.rs` exposes a `Model` enum wrapping `model2vec-rs` `StaticModel`
 (real path) **and** a deterministic stub (`TODO(integration)`), so the Rust port reproduces the TS
-stub bit-for-bit for fixture-level parity (see [semble.md §4.6](semble.md), [[dense-embedding-is-a-stub]]).
+stub bit-for-bit for fixture-level parity (see [semble.md §4.6](semble.md), `dense-embedding-is-a-stub`).
 `SelectableBasicBackend` does cosine kNN over the resulting matrix. When the stub is swapped for real
 `potion-code-16M` weights, the dense leg becomes the table above; the BM25 leg and ranking are
 unchanged.

From e22e340dfa433385094388f88c388d4f0cfdabc7 Mon Sep 17 00:00:00 2001
From: Minsu Lee
Date: Sat, 20 Jun 2026 03:58:14 +0900
Subject: [PATCH 3/3] docs(references): clarify Rust dense path already wires model2vec-rs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Apply AI review (coderabbitai, cubic-dev-ai): the prior wording ('stub until integration lands') implied the model2vec-rs wiring was still a TODO. Verified against crates/csp/src/indexing/dense.rs — it already loads StaticModel via StaticModel::from_pretrained; the deterministic stub is only a fallback on load failure (offline/missing weights/bad path) or in tests. Reword cocoindex.md and model2vec.md to reflect this, and keep the accurate note that the TS port's dense signal is still a stub until Rust reaches parity.
---
 .please/docs/references/cocoindex.md | 11 ++++++-----
 .please/docs/references/model2vec.md | 19 +++++++++++--------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/.please/docs/references/cocoindex.md b/.please/docs/references/cocoindex.md
index fc73fb2..a3f1bbb 100644
--- a/.please/docs/references/cocoindex.md
+++ b/.please/docs/references/cocoindex.md
@@ -37,7 +37,7 @@ Two layers, easy to conflate:
 | Aspect | **CocoIndex Code** | **csp / semble** |
 |---|---|---|
 | Retrieval signal | **Dense-only** semantic vectors (no BM25) | **Hybrid** dense + BM25, fused via RRF (`k=60`) |
-| Embeddings | **Real** models: `Snowflake/snowflake-arctic-embed-xs` local (SentenceTransformers), or 100+ cloud providers via LiteLLM | Model2Vec static embeddings (`potion-code-16M`); TS/Rust ship a deterministic **stub** until integration |
+| Embeddings | **Real** models: `Snowflake/snowflake-arctic-embed-xs` local (SentenceTransformers), or 100+ cloud providers via LiteLLM | Model2Vec static embeddings (`potion-code-16M`); the Rust port loads `model2vec-rs::StaticModel` with a deterministic **stub** fallback on load failure, the TS port is still a stub |
 | Chunking | tree-sitter AST via `SplitRecursively` (`chunk_size=1000`, `chunk_overlap=300` in the canonical example) | tree-sitter AST, **no overlap**, target 1500 chars, `_MIN_CHUNK_SIZE=50`, line-fallback |
 | Indexing model | **Incremental delta** — diff vs. prior AST, re-embed only changed chunks, "80–90% cache hit"; optional **daemon** keeps index warm | Content-hash cache at `~/.csp/index/` (ADR-0002); rebuild on change, no long-running daemon |
 | Storage | **SQLite** (`cocoindex.db` + `target_sqlite.db`) under `/.cocoindex_code/` | JSON/serde index files under global `~/.csp/index/` |
@@ -100,10 +100,11 @@ Not endorsements — open questions surfaced by the comparison:

 1. **Incremental delta indexing + daemon.** CocoIndex's headline feature is re-embedding only
    changed chunks against a warm index. csp currently caches by content hash and rebuilds; a
-   chunk-level delta + optional daemon is the natural next perf step if/when real embeddings land
-   (the stub makes re-embedding cost moot today — see `dense-embedding-is-a-stub`).
-2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant
-   once csp wires real Model2Vec — many code-retrieval models want a query prefix.
+   chunk-level delta + optional daemon is the natural next perf step now that the Rust port loads
+   real embeddings (under the stub fallback, re-embedding is cheap and the win is smaller — see
+   `dense-embedding-is-a-stub`).
+2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant to
+   csp's real Model2Vec path (Rust) — many code-retrieval models want a query prefix.
 3. **Structured MCP filters** (`languages`, `paths`, `limit`, `offset`). csp's `search` tool could
    adopt these cheaply; they map onto existing chunk metadata.
 4. **Chunk overlap** (`chunk_overlap=300`). semble/csp use **no** overlap; worth measuring whether
diff --git a/.please/docs/references/model2vec.md b/.please/docs/references/model2vec.md
index a13c541..6d8cdd2 100644
--- a/.please/docs/references/model2vec.md
+++ b/.please/docs/references/model2vec.md
@@ -4,8 +4,9 @@
 > inference crate [MinishLab/model2vec-rs](https://github.com/MinishLab/model2vec-rs). **This is a
 > direct dependency, not a port source** — Model2Vec *is* csp's dense-retrieval leg. semble uses the
 > `minishlab/potion-code-16M` static model; the csp Rust port (`crates/csp/src/indexing/dense.rs`)
-> wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback until
-> integration lands (see `dense-embedding-is-a-stub`). This doc captures how the embeddings are
+> wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback when the
+> model can't be loaded (offline / missing weights / bad path); the TS port is still a stub
+> (see `dense-embedding-is-a-stub`). This doc captures how the embeddings are
 > produced, which model csp uses, the published benchmarks, and the Rust crate's API/limits.
 >
 > **Analyzed at**: GitHub READMEs + HF model cards, 2026-06-19. `model2vec-rs` `0.2.1` (May 2026).
@@ -97,12 +98,14 @@ let embeddings = model.encode(&["where do we embed chunks".to_string()]);

 ### How csp uses it

-`crates/csp/src/indexing/dense.rs` exposes a `Model` enum wrapping `model2vec-rs` `StaticModel`
-(real path) **and** a deterministic stub (`TODO(integration)`), so the Rust port reproduces the TS
-stub bit-for-bit for fixture-level parity (see [semble.md §4.6](semble.md), `dense-embedding-is-a-stub`).
-`SelectableBasicBackend` does cosine kNN over the resulting matrix. When the stub is swapped for real
-`potion-code-16M` weights, the dense leg becomes the table above; the BM25 leg and ranking are
-unchanged.
+`crates/csp/src/indexing/dense.rs` exposes a `Model` enum: a **real** path
+(`Static { StaticModel }`, loaded via `model2vec-rs` `StaticModel::from_pretrained`) **and** a
+deterministic stub used only as a fallback when the model can't be loaded (offline / missing weights
+/ bad path) or in tests. The stub reproduces the former TS stub bit-for-bit for fixture-level parity
+(see [semble.md §4.6](semble.md), `dense-embedding-is-a-stub`). `SelectableBasicBackend` does cosine
+kNN over the resulting matrix. With real `potion-code-16M` weights loaded the dense leg matches the
+benchmark table above; the BM25 leg and ranking are unchanged either way. (The TS port has no real
+dense path yet — its dense signal stays a stub until Rust reaches parity.)

 ---
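
For readers of this series, the load-or-fall-back shape described in the last hunk can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the code in `dense.rs` and not the `model2vec-rs` API: the `Model` variants, `load_or_stub`, `stub_embed`, the FNV-1a hash, and the toy lookup table are all hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// Stub embedding width; 256 mirrors the `potion-code-16M` dimension quoted in the docs.
const DIM: usize = 256;

/// Hypothetical stand-in for the enum described above (not the real `dense.rs` type).
enum Model {
    /// Real path: a token -> vector lookup table (assumed shape, for illustration only).
    Static(HashMap<String, Vec<f32>>),
    /// Fallback path: deterministic pseudo-embeddings that need no weights.
    Stub,
}

/// FNV-1a hash, written out so the stub is reproducible across runs and platforms.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Scale a vector to unit length; an all-zero vector is left untouched.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Hypothetical stub: each whitespace token votes +1 or -1 into one hashed bucket.
fn stub_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for tok in text.split_whitespace() {
        let h = fnv1a(tok.as_bytes());
        let idx = (h % DIM as u64) as usize;
        let sign = if ((h >> 63) & 1) == 0 { 1.0 } else { -1.0 };
        v[idx] += sign;
    }
    l2_normalize(&mut v);
    v
}

impl Model {
    /// Try the real loader; fall back to the stub on any load error.
    fn load_or_stub(load_real: impl FnOnce() -> Result<Model, String>) -> Model {
        load_real().unwrap_or(Model::Stub)
    }

    /// Embed one string: lookup + mean pooling on the real path, hashing on the stub path.
    fn embed(&self, text: &str) -> Vec<f32> {
        match self {
            Model::Static(table) => {
                let hits: Vec<&Vec<f32>> =
                    text.split_whitespace().filter_map(|t| table.get(t)).collect();
                let dim = table.values().next().map_or(0, |v| v.len());
                let mut out = vec![0.0f32; dim];
                for v in &hits {
                    for (o, x) in out.iter_mut().zip(v.iter()) {
                        *o += *x;
                    }
                }
                if !hits.is_empty() {
                    let n = hits.len() as f32;
                    for o in out.iter_mut() {
                        *o /= n;
                    }
                }
                l2_normalize(&mut out);
                out
            }
            Model::Stub => stub_embed(text),
        }
    }
}

fn main() {
    // Simulate a load failure (e.g. offline): the caller still gets a usable model.
    let model = Model::load_or_stub(|| Err("offline".to_string()));
    let v = model.embed("where do we embed chunks");
    println!("stub in use: {}, dim: {}", matches!(model, Model::Stub), v.len());
}
```

The point of the shape is that callers only ever see `embed`; whether weights loaded is decided once at construction, so the BM25 leg and ranking code do not need to know which variant is active.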