pleaseai · amondnet · Jun 19, 2026
diff --git a/.please/docs/references/cocoindex.md b/.please/docs/references/cocoindex.md
@@ -0,0 +1,126 @@
+# Reference Analysis — CocoIndex Code (`cocoindex-code`)
+
+> Prior-art / comparison analysis of [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code)
+> and the underlying [CocoIndex](https://cocoindex.io) data-pipeline framework. **This is not a
+> port source** — csp ports [MinishLab/semble](semble.md). CocoIndex Code is an independent,
+> competing project in the *same* niche (AST-aware semantic code search exposed to coding agents
+> over MCP), so it is the closest direct comparator for csp's product surface. This document maps
+> their design choices against csp's and flags what is worth borrowing vs. where the two
+> deliberately diverge.
+>
+> **Analyzed at**: web docs + GitHub README as of 2026-06-19 (no upstream commit pinned — this is a
+> comparison, not a parity oracle). Sources: <https://cocoindex.io/cocoindex-code/>,
+> <https://github.com/cocoindex-io/cocoindex-code>, <https://cocoindex.io/docs-v0/examples/code_index/>.
+
+---
+
+## 1. What CocoIndex Code is
+
+An MCP server + CLI that gives coding agents (Claude Code, Cursor, Codex) **AST-aware semantic
+code search** over a whole repo, pitched on token savings (~70%) and sub-second freshness. It is
+built on the general-purpose **CocoIndex** data-indexing framework (a Rust engine with a Python
+API for declarative ETL/embedding pipelines); `cocoindex-code` is the packaged, code-search-specific
+application of that framework.
+
+Two layers, easy to conflate:
+
+- **CocoIndex** (the framework) — Rust core + Python DSL for incremental, lineage-tracked indexing
+  pipelines. Built-in `SplitRecursively` (tree-sitter chunker) and `SentenceTransformerEmbed`
+  functions. This is the "build your own pipeline" toolkit.
+- **CocoIndex Code** (`ccc`) — the opinionated end-user product built on it: a CLI/MCP server with
+  an index, daemon, and agent integration baked in.
+
+---
+
+## 2. How it compares to csp (the load-bearing differences)
+
+| Aspect | **CocoIndex Code** | **csp / semble** |
+|---|---|---|
+| Retrieval signal | **Dense-only** semantic vectors (no BM25) | **Hybrid** dense + BM25, fused via RRF (`k=60`) |
+| Embeddings | **Real** models: `Snowflake/snowflake-arctic-embed-xs` local (SentenceTransformers), or 100+ cloud providers via LiteLLM | Model2Vec static embeddings (`potion-code-16M`); the Rust port loads `model2vec-rs::StaticModel` with a deterministic **stub** fallback on load failure, the TS port is still a stub |
+| Chunking | tree-sitter AST via `SplitRecursively` (`chunk_size=1000`, `chunk_overlap=300` in the canonical example) | tree-sitter AST, **no overlap**, target 1500 chars, `_MIN_CHUNK_SIZE=50`, line-fallback |
+| Indexing model | **Incremental delta** — diff vs. prior AST, re-embed only changed chunks, "80–90% cache hit"; optional **daemon** keeps index warm | Content-hash cache at `~/.csp/index/` (ADR-0002); rebuild on change, no long-running daemon |
+| Storage | **SQLite** (`cocoindex.db` + `target_sqlite.db`) under `<project>/.cocoindex_code/` | JSON/serde index files under global `~/.csp/index/` |
+| Ranking | Dense cosine kNN; no documented code-specific reranking | RRF → multi-chunk file boost → query-type boost → path penalties + file-saturation decay |
+| Symbol / lexical queries | Weak by construction (dense-only — exact identifier match relies on the embedding) | **Adaptive alpha** (`0.3` symbol / `0.5` NL) + identifier-aware BM25 tokenization explicitly handle symbol lookup |
+| Embedding benchmark[^bench] | `arctic-embed-xs`: 22M params, 384-dim — **50.15** MTEB *general-text* Retrieval (no public CoIR/code score) | `potion-code-16M`: 16M, 256-dim — **37.05** CoIR (code) avg NDCG@10; teacher `CodeRankEmbed` (137M) = 59.14 |
+| Tech stack | Rust engine, Python wrapper (~98% Python repo); `pipx install`, binary `ccc` | TS/Bun (`@pleaseai/csp`, binary `csp`) + Rust port (`crates/csp`, ADR-0003) |
+| Deployment tiers | local / shared **daemon** / VPC enterprise (branch overlays, cross-repo, SSO) | single local tool; no daemon/enterprise tier |
+| License | Apache-2.0 | MIT |
+
+**One-line takeaway**: CocoIndex Code bets on *real embeddings + incremental delta indexing + a
+daemon*; csp/semble bet on *hybrid dense+sparse + code-specific reranking on a zero-dependency CPU
+stack*. The dense-only choice makes CocoIndex weaker on exact-symbol queries but lets it lean on
+stronger embedding models; csp's BM25 leg + adaptive alpha is precisely the hedge against that.
+
+[^bench]: **Not a head-to-head — the two numbers are from different benchmarks.** `arctic-embed-xs`'s
+    50.15 is **MTEB general-text Retrieval** (English prose), *not* code; arctic-embed is not a
+    code-trained model and has no public CoIR score, so 50.15 must **not** be read as "code-search
+    quality." `potion-code-16M`'s 37.05 is the **CoIR** average (code-specific, NDCG@10 across the CoIR
+    tasks, including CosQA and CodeFeedback ST/MT). The load-bearing figure for csp: the model card itself
+    reports **potion-code-16M + BM25 hybrid = 40.41** (vs. 37.05 dense-only, **+3.36**) — the model's
+    own authors measure that adding sparse retrieval beats static-dense alone, which directly
+    validates csp/semble's hybrid + adaptive-alpha design over CocoIndex's dense-only path.
+    Sources: [potion-code-16M card](https://huggingface.co/minishlab/potion-code-16M),
+    [Snowflake-Labs/arctic-embed](https://github.com/Snowflake-Labs/arctic-embed),
+    [CoIR benchmark (ACL 2025)](https://github.com/coir-team/coir).
+
+---
+
+## 3. CLI surface (`ccc`)
+
+For comparison with csp's `csp` subcommands (`search`, `index`, `find-related`, `mcp`, `init`,
+`savings`):
+
+| `ccc` | Purpose | csp analog |
+|---|---|---|
+| `ccc init` | scaffold settings | `csp init` |
+| `ccc index` | build/update index | `csp index` |
+| `ccc search <query>` | semantic search | `csp search` |
+| `ccc status` | index stats | (≈ `csp savings` / stats) |
+| `ccc mcp` | MCP server, stdio | `csp mcp` |
+| `ccc daemon [status\|restart\|stop]` | background index daemon | — (no equivalent) |
+| `ccc reset` | delete index DBs | `csp clear index` |
+| `ccc doctor` | diagnostics | — |
+
+- **MCP tool**: a single `search()` tool with `languages` / `paths` / `limit` / `offset` filters.
+  csp instead exposes **two** tools (`search`, `find_related`): `find_related` has no analog
+  on the CocoIndex side, and CocoIndex has structured filters csp does not.
+- **Install/run**: `pipx install 'cocoindex-code[full]'` (local embeddings) or
+  `pipx install cocoindex-code` (slim, cloud-only); binary `ccc`. csp: `bunx @pleaseai/csp`.
+- **Config**: `~/.cocoindex_code/global_settings.yml` (`embedding.model`, `embedding.provider`,
+  `embedding.device` = cpu/cuda/mps, `min_interval_ms`, asymmetric `indexing_params`/`query_params`);
+  per-project `include_patterns` / `exclude_patterns`.
+
+---
+
+## 4. Ideas worth tracking for csp
+
+Not endorsements — open questions surfaced by the comparison:
+
+1. **Incremental delta indexing + daemon.** CocoIndex's headline feature is re-embedding only
+   changed chunks against a warm index. csp currently caches by content hash and rebuilds; a
+   chunk-level delta + optional daemon is the natural next perf step now that the Rust port loads
+   real embeddings (under the stub fallback, re-embedding is cheap and the win is smaller — see
+   `dense-embedding-is-a-stub`).
+2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant to
+   csp's real Model2Vec path (Rust) — many code-retrieval models want a query prefix.
+3. **Structured MCP filters** (`languages`, `paths`, `limit`, `offset`). csp's `search` tool could
+   adopt these cheaply; they map onto existing chunk metadata.
+4. **Chunk overlap** (`chunk_overlap=300`). semble/csp use **no** overlap; worth measuring whether
+   overlap improves recall on boundary-spanning definitions, or just inflates the index.
+5. **Branch overlays** (treat a PR/branch as a delta on a shared main index). Enterprise-tier idea,
+   but the "index once, overlay per branch" model could inform csp's global `~/.csp/index/` layout.
+
+Where csp should **not** follow: dropping BM25. The dense-only path is CocoIndex's biggest
+weakness for symbol/identifier queries, and csp's hybrid + adaptive-alpha design is the explicit
+counter-position (see [CLAUDE.md](../../../CLAUDE.md) "Conventions to preserve from semble").
+
+---
+
+## 5. Sources
+
+- CocoIndex Code product page — <https://cocoindex.io/cocoindex-code/>
+- `cocoindex-io/cocoindex-code` (CLI/MCP) — <https://github.com/cocoindex-io/cocoindex-code>
+- CocoIndex framework code-index example — <https://cocoindex.io/docs-v0/examples/code_index/>
+- `cocoindex-io/realtime-codebase-indexing` — <https://github.com/cocoindex-io/realtime-codebase-indexing>
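
The "Retrieval signal" row in §2 names RRF with `k=60` as csp's fusion step. A minimal sketch of that idea, assuming the standard reciprocal-rank-fusion formula (each ranking contributes `1 / (k + rank)` with 1-based ranks); this is not csp's or semble's actual code, and `rrf_fuse` and the chunk ids are hypothetical names used only for illustration:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion (illustrative, not csp's implementation).
/// score(id) = sum over rankings of 1 / (k + rank), with rank starting at 1.
/// An id missing from a ranking simply contributes nothing from that ranking.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[dense, bm25], 60.0)` with a dense ranking and a BM25 ranking returns one merged list. Because only ranks enter the formula, the two legs need no score normalization, which is one reason RRF is a common choice for mixing cosine and BM25 results.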
diff --git a/.please/docs/references/index.md b/.please/docs/references/index.md
@@ -14,6 +14,8 @@ When upstream moves, update the relevant analysis here and reconcile any new dri
 | Library | Upstream | Role | Analysis |
 |---|---|---|---|
 | **semble** | [MinishLab/semble](https://github.com/MinishLab/semble) | Direct port source — analyzed against the **Rust port** (`crates/csp`, [ADR-0003](../decisions/0003-rewrite-in-rust.md)) | [semble.md](semble.md) |
+| **cocoindex-code** | [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code) | Prior-art / comparator — independent AST code-search MCP in the same niche (not a port source) | [cocoindex.md](cocoindex.md) |
+| **model2vec** | [MinishLab/model2vec](https://github.com/MinishLab/model2vec) + [model2vec-rs](https://github.com/MinishLab/model2vec-rs) | Direct dependency — the dense-retrieval leg (`potion-code-16M`); Rust port wires `model2vec-rs` | [model2vec.md](model2vec.md) |
 
 <!-- Add new reference analyses above this line as additional libraries are adopted. -->
 
@@ -25,6 +27,8 @@ drift, diff from that commit forward (`git log <baseline>..main` in the upstream
 | Library | Analyzed at | Notes |
 |---|---|---|
 | semble | upstream `136b6f7` (2026-06-18); Rust port `2f2baa2` (PR #34) | Mapped to the Rust crates; beyond prior review baseline `eacbe43`; see semble.md §Divergences |
+| cocoindex-code | web docs + GitHub README, 2026-06-19 (no commit pinned); embedding benchmarks from model HF cards | Comparison/prior-art, **not a port** — no parity oracle. Drift = re-check vs. cocoindex's docs/README; benchmark row reflects published `potion-code-16M` (CoIR) vs. `arctic-embed-xs` (MTEB) figures. See cocoindex.md §2 + `[^bench]` |
+| model2vec | GitHub READMEs + HF cards, 2026-06-19; `model2vec-rs` `0.2.1` (no commit pinned) | **Direct dependency**, not a port — the dense leg. Drift = pin `model2vec-rs` crate version + `potion-code-16M` card revision when the stub is swapped for real weights. See model2vec.md §4–5 |
 
 ## How to add a new reference analysis
 

diff --git a/.please/docs/references/model2vec.md b/.please/docs/references/model2vec.md
@@ -0,0 +1,129 @@
+# Reference Analysis — Model2Vec (`model2vec` / `model2vec-rs`)
+
+> Analysis of [MinishLab/model2vec](https://github.com/MinishLab/model2vec) (Python) and its Rust
+> inference crate [MinishLab/model2vec-rs](https://github.com/MinishLab/model2vec-rs). **This is a
+> direct dependency, not a port source** — Model2Vec *is* csp's dense-retrieval leg. semble uses the
+> `minishlab/potion-code-16M` static model; the csp Rust port (`crates/csp/src/indexing/dense.rs`)
+> wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback when the
+> model can't be loaded (offline / missing weights / bad path); the TS port is still a stub
+> (see `dense-embedding-is-a-stub`). This doc captures how the embeddings are
+> produced, which model csp uses, the published benchmarks, and the Rust crate's API/limits.
+>
+> **Analyzed at**: GitHub READMEs + HF model cards, 2026-06-19. `model2vec-rs` `0.2.1` (May 2026).
+> Both projects MIT. Sources in §6.
+
+---
+
+## 1. What Model2Vec is
+
+A method (and toolkit) that **distills a sentence-transformer into a static embedding model**:
+each vocabulary token gets one precomputed vector; encoding a string is a **vocab→vector lookup +
+mean pooling**, *not* a transformer forward pass. Result: ~50× smaller, up to ~500× faster on CPU,
+with a modest quality drop. No GPU, no API key, deterministic. This is exactly why semble/csp can
+do "single-digit-millisecond" search on a laptop — the dense signal is a matrix gather, not
+inference.
+
+**Distillation pipeline** (`distill()`, ~30s on CPU, no training data needed):
+1. **Vocabulary forward pass** — run the vocab through the teacher (e.g. `BAAI/bge-base-en-v1.5`)
+   to get one embedding per token.
+2. **Token→vector table + mean pooling** — store the table; inference pools token vectors.
+3. **Post-processing** — **PCA** (dim reduction), **Zipf weighting** (down-weight frequent tokens),
+   and **tokenlearn / POTION** (the training trick that lifted the `potion-*` generation above
+   plain distillation).
+
+---
+
+## 2. The "potion" pre-trained models
+
+| Model | Params | Dim | Task | Notes |
+|---|---|---|---|---|
+| `potion-base-2M` | 1.8M | — | general | smallest |
+| `potion-base-4M` | 3.7M | — | general | |
+| `potion-base-8M` | 7.5M | 256 | general | crate's example default |
+| `potion-base-32M` | 32.3M | 512 | general | "most performant static model" |
+| `potion-retrieval-32M` | 32.3M | 512 | retrieval | retrieval-tuned `potion-base-32M` |
+| `potion-multilingual-128M` | 128M | — | 101 langs | multilingual |
+| **`potion-code-16M`** | **16M** | **256** | **code** | **← csp/semble's model** (vocab ≈ 62.5k) |
+
+**`potion-code-16M`** (the one that matters for csp) is distilled from **`CodeRankEmbed`** (137M
+teacher), then **tokenlearn on CornStack pairs**, **contrastive fine-tune (MultipleNegativesRankingLoss)**,
+and **post-SIF re-regularization**. 256-dim output — matching what `dense.rs` expects.
+
+---
+
+## 3. Benchmarks (and what they mean for csp)
+
+**General-text MTEB** (from the model2vec results README):
+
+| Model | MTEB avg | Retrieval (NDCG@10) |
+|---|---|---|
+| `potion-base-32M` | 52.13 | 32.67 |
+| `potion-base-8M` | 51.08 | — |
+| `potion-retrieval-32M` | — | 35.06 |
+
+**Code retrieval — CoIR** (from the `potion-code-16M` HF card; this is csp's relevant number):
+
+| Model | Params | CoIR avg | CosQA | CodeFeedback-ST | CodeFeedback-MT |
+|---|---|---|---|---|---|
+| `CodeRankEmbed` (teacher) | 137M | 59.14 | 35.92 | 78.11 | 42.61 |
+| **`potion-code-16M`** | 16M | **37.05** | 21.37 | 50.27 | 36.26 |
+| **`potion-code-16M` + BM25 hybrid** | 16M | **40.41** | 21.63 | 64.26 | 51.23 |
+
+> **The load-bearing line for csp**: the model's own card reports **dense-only 37.05 → +BM25 hybrid
+> 40.41 (+3.36)**. The Model2Vec authors themselves measure that pairing the static dense model with
+> sparse BM25 beats dense alone — which is precisely csp/semble's hybrid + adaptive-alpha design, and
+> the direct counter-argument to a dense-only engine like [cocoindex-code](cocoindex.md). Static
+> embeddings trade ~22 NDCG points vs. the 137M teacher for orders-of-magnitude speed; the BM25 leg
+> is how csp claws some of that back.
+
+---
+
+## 4. `model2vec-rs` — the Rust inference crate (what csp wires)
+
+- **Crate**: `model2vec-rs` `0.2.1` (crates.io), 100% Rust, MIT. ~1.7× the Python throughput
+  (≈8000 vs 4650 samples/s). **Inference-only**.
+- **API** — `StaticModel` struct:
+  - `from_pretrained(id_or_path, token, normalize, subfolder)` — load from HF Hub or local path.
+  - `from_bytes(...)` — in-memory load (WASM / embedded).
+  - `encode(&[String])` — default params; `encode_with_args(.., max_length, batch_size)` — custom.
+- **Formats**: `safetensors` with **f32 / f16 / i8** weights. Tokenization via `onig` or
+  `fancy-regex`. Feature flags `local-only`, `wasm`. Ships a CLI for single/batch encode.
+- **Does NOT do**: distillation, training, fine-tuning, dynamic embeddings — those stay in the
+  Python `model2vec`. The Rust crate is purely the lookup+pool inference path.
+
+```rust
+let model = model2vec_rs::model::StaticModel::from_pretrained("minishlab/potion-code-16M", None, None, None)?;
+let embeddings = model.encode(&["where do we embed chunks".to_string()]);
+```
+
+### How csp uses it
+
+`crates/csp/src/indexing/dense.rs` exposes a `Model` enum: a **real** path
+(`Static { StaticModel }`, loaded via `model2vec-rs` `StaticModel::from_pretrained`) **and** a
+deterministic stub used only as a fallback when the model can't be loaded (offline / missing weights
+/ bad path) or in tests. The stub reproduces the former TS stub bit-for-bit for fixture-level parity
+(see [semble.md §4.6](semble.md), `dense-embedding-is-a-stub`). `SelectableBasicBackend` does cosine
+kNN over the resulting matrix. With real `potion-code-16M` weights loaded, the dense leg is the model
+measured in the §3 CoIR table; the BM25 leg and ranking are unchanged either way. (The TS port has no real
+dense path yet — its dense signal stays a stub until Rust reaches parity.)
+
+---
+
+## 5. Relevance map (Model2Vec → csp)
+
+| Model2Vec concept | csp counterpart |
+|---|---|
+| `StaticModel` (Python) / `model2vec-rs::StaticModel` (Rust) | `dense.rs` `Model` enum (real + stub) |
+| `potion-code-16M` (256-dim) | the embedding model semble/csp target |
+| vocab→vector lookup + mean pooling | `embed_chunks` dense matrix build |
+| cosine over embeddings | `SelectableBasicBackend` kNN |
+| dense-only vs. +BM25 hybrid (40.41) | csp's RRF fusion of dense + BM25 (the whole point) |
+
+---
+
+## 6. Sources
+
+- `MinishLab/model2vec` (method, distill/encode, potion models) — <https://github.com/MinishLab/model2vec>
+- MTEB results table — <https://github.com/MinishLab/model2vec/blob/main/results/README.md#mteb-results>
+- `MinishLab/model2vec-rs` (Rust inference crate) — <https://github.com/MinishLab/model2vec-rs>
+- `minishlab/potion-code-16M` card (CoIR + hybrid numbers, training recipe) — <https://huggingface.co/minishlab/potion-code-16M>
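
The model2vec analysis describes encoding as a vocab→vector lookup plus mean pooling, followed by cosine kNN. A minimal sketch of that inference path, assuming tokenization has already produced token ids and that the pooled vector is L2-normalized (an assumption made here so that cosine becomes a dot product); `embed`, `cosine`, and the toy table are hypothetical and are not the `model2vec-rs` or `dense.rs` API:

```rust
/// Static embedding (illustrative): look up one precomputed row per token id,
/// average the rows, then L2-normalize. No transformer forward pass is involved.
fn embed(token_ids: &[usize], table: &[Vec<f32>]) -> Vec<f32> {
    let dim = table.first().map_or(0, |row| row.len());
    let mut pooled = vec![0.0f32; dim];
    if token_ids.is_empty() {
        return pooled; // nothing to pool: return the zero vector
    }
    for &id in token_ids {
        for (acc, v) in pooled.iter_mut().zip(&table[id]) {
            *acc += v;
        }
    }
    let n = token_ids.len() as f32;
    for v in pooled.iter_mut() {
        *v /= n; // mean pooling
    }
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for v in pooled.iter_mut() {
            *v /= norm; // unit length, assumed for this sketch
        }
    }
    pooled
}

/// For unit-length vectors, cosine similarity is just the dot product.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

With a real model the table would have one 256-dim row per vocabulary token. The point of the sketch is only that encoding is a gather and an average, which is why it stays cheap on CPU.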