126 changes: 126 additions & 0 deletions .please/docs/references/cocoindex.md
@@ -0,0 +1,126 @@
# Reference Analysis — CocoIndex Code (`cocoindex-code`)

> Prior-art / comparison analysis of [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code)
> and the underlying [CocoIndex](https://cocoindex.io) data-pipeline framework. **This is not a
> port source** — csp ports [MinishLab/semble](semble.md). CocoIndex Code is an independent,
> competing project in the *same* niche (AST-aware semantic code search exposed to coding agents
> over MCP), so it is the closest direct comparator for csp's product surface. This document maps
> their design choices against csp's and flags what is worth borrowing vs. where the two
> deliberately diverge.
>
> **Analyzed at**: web docs + GitHub README as of 2026-06-19 (no upstream commit pinned — this is a
> comparison, not a parity oracle). Sources: <https://cocoindex.io/cocoindex-code/>,
> <https://github.com/cocoindex-io/cocoindex-code>, <https://cocoindex.io/docs-v0/examples/code_index/>.

---

## 1. What CocoIndex Code is

An MCP server + CLI that gives coding agents (Claude Code, Cursor, Codex) **AST-aware semantic
code search** over a whole repo, pitched on token savings (~70%) and sub-second freshness. It is
built on the general-purpose **CocoIndex** data-indexing framework (a Rust engine with a Python
API for declarative ETL/embedding pipelines); `cocoindex-code` is the packaged, code-search-specific
application of that framework.

Two layers, easy to conflate:

- **CocoIndex** (the framework) — Rust core + Python DSL for incremental, lineage-tracked indexing
pipelines. Built-in `SplitRecursively` (tree-sitter chunker) and `SentenceTransformerEmbed`
functions. This is the "build your own pipeline" toolkit.
- **CocoIndex Code** (`ccc`) — the opinionated end-user product built on it: a CLI/MCP server with
an index, daemon, and agent integration baked in.

---

## 2. How it compares to csp (the load-bearing differences)

| Aspect | **CocoIndex Code** | **csp / semble** |
|---|---|---|
| Retrieval signal | **Dense-only** semantic vectors (no BM25) | **Hybrid** dense + BM25, fused via RRF (`k=60`) |
| Embeddings | **Real** models: `Snowflake/snowflake-arctic-embed-xs` local (SentenceTransformers), or 100+ cloud providers via LiteLLM | Model2Vec static embeddings (`potion-code-16M`); the Rust port loads `model2vec-rs::StaticModel` with a deterministic **stub** fallback on load failure, the TS port is still a stub |
| Chunking | tree-sitter AST via `SplitRecursively` (`chunk_size=1000`, `chunk_overlap=300` in the canonical example) | tree-sitter AST, **no overlap**, target 1500 chars, `_MIN_CHUNK_SIZE=50`, line-fallback |
| Indexing model | **Incremental delta** — diff vs. prior AST, re-embed only changed chunks, "80–90% cache hit"; optional **daemon** keeps index warm | Content-hash cache at `~/.csp/index/` (ADR-0002); rebuild on change, no long-running daemon |
| Storage | **SQLite** (`cocoindex.db` + `target_sqlite.db`) under `<project>/.cocoindex_code/` | JSON/serde index files under global `~/.csp/index/` |
| Ranking | Dense cosine kNN; no documented code-specific reranking | RRF → multi-chunk file boost → query-type boost → path penalties + file-saturation decay |
| Symbol / lexical queries | Weak by construction (dense-only — exact identifier match relies on the embedding) | **Adaptive alpha** (`0.3` symbol / `0.5` NL) + identifier-aware BM25 tokenization explicitly handle symbol lookup |
| Embedding benchmark[^bench] | `arctic-embed-xs`: 22M params, 384-dim — **50.15** MTEB *general-text* Retrieval (no public CoIR/code score) | `potion-code-16M`: 16M, 256-dim — **37.05** CoIR (code) avg NDCG@10; teacher `CodeRankEmbed` (137M) = 59.14 |
| Tech stack | Rust engine, Python wrapper (~98% Python repo); `pipx install`, binary `ccc` | TS/Bun (`@pleaseai/csp`, binary `csp`) + Rust port (`crates/csp`, ADR-0003) |
| Deployment tiers | local / shared **daemon** / VPC enterprise (branch overlays, cross-repo, SSO) | single local tool; no daemon/enterprise tier |
| License | Apache-2.0 | MIT |
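
The chunking row above contrasts an overlapping window (`chunk_size=1000`, `chunk_overlap=300`) with a no-overlap 1500-char target. The following is a minimal sketch of what overlap costs in stored text, assuming plain character windows; both tools actually split on tree-sitter AST boundaries first, and the function names here are hypothetical.

```rust
/// Split `text` into windows of at most `size` chars, where each window
/// starts `size - overlap` chars after the previous one.
/// Character windows are a simplification of AST-aware chunking.
pub fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}

/// Total chars stored across all chunks, a rough proxy for index size.
pub fn stored_chars(chunks: &[String]) -> usize {
    chunks.iter().map(|c| c.chars().count()).sum()
}
```

On a 3000-char input, the 1000/300 setting yields 4 chunks holding 3900 chars in total, while 1500/0 yields 2 chunks holding 3000 chars, so overlap stores 30% more text in this case.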

**One-line takeaway**: CocoIndex Code bets on *real embeddings + incremental delta indexing + a
daemon*; csp/semble bet on *hybrid dense+sparse + code-specific reranking on a zero-dependency CPU
stack*. The dense-only choice makes CocoIndex weaker on exact-symbol queries but lets it lean on
stronger embedding models; csp's BM25 leg + adaptive alpha is precisely the hedge against that.
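
To make the hybrid leg concrete, here is a minimal sketch of Reciprocal Rank Fusion with the `k=60` constant cited in the table. Using `alpha` as a per-list weight is an assumption made for illustration; how csp actually applies its adaptive alpha is not specified here, and `rrf_fuse` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion constant; the comparison table cites `k=60`.
const RRF_K: f64 = 60.0;

/// Fuse two ranked lists of document ids (best first) into one ranking.
/// `alpha` weights the dense list and `1 - alpha` the BM25 list; this
/// weighting scheme is an assumption for illustration, not csp's code.
pub fn rrf_fuse(dense: &[&str], bm25: &[&str], alpha: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, ranking) in [(alpha, dense), (1.0 - alpha, bm25)] {
        for (idx, doc) in ranking.iter().enumerate() {
            // Ranks are 1-based, so the top hit contributes weight / (k + 1).
            let rank = (idx + 1) as f64;
            *scores.entry((*doc).to_string()).or_insert(0.0) += weight / (RRF_K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document that appears in both lists accumulates two contributions, which is why a chunk ranked second by dense and first by BM25 can outrank the dense top hit.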

[^bench]: **Not a head-to-head — the two numbers are from different benchmarks.** `arctic-embed-xs`'s
50.15 is **MTEB general-text Retrieval** (English prose), *not* code; arctic-embed is not a
code-trained model and has no public CoIR score, so 50.15 must **not** be read as "code-search
quality." `potion-code-16M`'s 37.05 is **CoIR** (code-specific; average NDCG@10 across CoIR tasks,
including CosQA and CodeFeedback ST/MT). The figure that matters most for csp: the model2vec card itself
reports **potion-code-16M + BM25 hybrid = 40.41** (vs. 37.05 dense-only, **+3.36**) — the model's
own authors measure that adding sparse retrieval beats static-dense alone, which directly
validates csp/semble's hybrid + adaptive-alpha design over CocoIndex's dense-only path.
Sources: [potion-code-16M card](https://huggingface.co/minishlab/potion-code-16M),
[Snowflake-Labs/arctic-embed](https://github.com/Snowflake-Labs/arctic-embed),
[CoIR benchmark (ACL 2025)](https://github.com/coir-team/coir).

---

## 3. CLI surface (`ccc`)

For comparison with csp's `csp` subcommands (`search`, `index`, `find-related`, `mcp`, `init`,
`savings`):

| `ccc` | Purpose | csp analog |
|---|---|---|
| `ccc init` | scaffold settings | `csp init` |
| `ccc index` | build/update index | `csp index` |
| `ccc search <query>` | semantic search | `csp search` |
| `ccc status` | index stats | (≈ `csp savings` / stats) |
| `ccc mcp` | MCP server, stdio | `csp mcp` |
| `ccc daemon [status\|restart\|stop]` | background index daemon | — (no equivalent) |
| `ccc reset` | delete index DBs | `csp clear index` |
| `ccc doctor` | diagnostics | — |

- **MCP tool**: a single `search()` tool with `languages` / `paths` / `limit` / `offset` filters.
csp instead exposes **two** tools (`search`, `find_related`). CocoIndex has no `find_related`
analog, and csp has none of CocoIndex's structured filters.
- **Install/run**: `pipx install 'cocoindex-code[full]'` (local embeddings) or
`pipx install cocoindex-code` (slim, cloud-only); binary `ccc`. csp: `bunx @pleaseai/csp`.
- **Config**: `~/.cocoindex_code/global_settings.yml` (`embedding.model`, `embedding.provider`,
`embedding.device` = cpu/cuda/mps, `min_interval_ms`, asymmetric `indexing_params`/`query_params`);
per-project `include_patterns` / `exclude_patterns`.
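
The filter set on CocoIndex's `search()` tool can be pictured as a post-filter over ranked hits. This is a sketch under assumptions: the `Hit` and `SearchFilters` types are hypothetical, and `paths` is treated as a list of path prefixes, which may differ from how CocoIndex matches paths.

```rust
/// One search hit; the field names are illustrative, not either tool's schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub path: String,
    pub language: String,
    pub score: f32,
}

/// Hypothetical filter set mirroring the `languages` / `paths` / `limit` /
/// `offset` parameters listed above. An empty list means "no restriction".
pub struct SearchFilters {
    pub languages: Vec<String>,
    pub paths: Vec<String>, // treated as path prefixes in this sketch
    pub limit: usize,
    pub offset: usize,
}

/// Apply the filters to hits that are already sorted best-first.
pub fn apply_filters(hits: &[Hit], f: &SearchFilters) -> Vec<Hit> {
    hits.iter()
        .filter(|h| f.languages.is_empty() || f.languages.contains(&h.language))
        .filter(|h| f.paths.is_empty() || f.paths.iter().any(|p| h.path.starts_with(p.as_str())))
        .skip(f.offset) // pagination happens after filtering
        .take(f.limit)
        .cloned()
        .collect()
}
```

Applying `offset` and `limit` after the language and path filters keeps pagination stable for a given filter set.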

---

## 4. Ideas worth tracking for csp

Not endorsements — open questions surfaced by the comparison:

1. **Incremental delta indexing + daemon.** CocoIndex's headline feature is re-embedding only
changed chunks against a warm index. csp currently caches by content hash and rebuilds; a
chunk-level delta + optional daemon is the natural next perf step now that the Rust port loads
real embeddings (under the stub fallback, re-embedding is cheap and the win is smaller — see
`dense-embedding-is-a-stub`).
2. **Asymmetric query vs. index embedding params** (`indexing_params`/`query_params`). Relevant to
csp's real Model2Vec path (Rust) — many code-retrieval models want a query prefix.
3. **Structured MCP filters** (`languages`, `paths`, `limit`, `offset`). csp's `search` tool could
adopt these cheaply; they map onto existing chunk metadata.
4. **Chunk overlap** (`chunk_overlap=300`). semble/csp use **no** overlap; worth measuring whether
overlap improves recall on boundary-spanning definitions, or just inflates the index.
5. **Branch overlays** (treat a PR/branch as a delta on a shared main index). Enterprise-tier idea,
but the "index once, overlay per branch" model could inform csp's global `~/.csp/index/` layout.
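
Item 1 can be sketched as a chunk-level delta keyed by content hash: only chunks whose hash is absent from the previous cache get re-embedded. This is an illustrative sketch, not CocoIndex's or csp's implementation; `chunk_hash` and `delta_index` are hypothetical names, and `DefaultHasher` is used only to keep the example std-only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content hash of one chunk. `DefaultHasher` keeps the sketch std-only;
/// a persistent cache would want a hash that is stable across Rust releases.
pub fn chunk_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Rebuild the hash -> embedding cache for `chunks`, calling `embed` only for
/// chunks whose content hash is missing from `cache`. Returns the new cache
/// and how many chunks were re-embedded.
pub fn delta_index<F>(
    cache: &HashMap<u64, Vec<f32>>,
    chunks: &[&str],
    mut embed: F,
) -> (HashMap<u64, Vec<f32>>, usize)
where
    F: FnMut(&str) -> Vec<f32>,
{
    let mut next = HashMap::new();
    let mut embedded = 0;
    for &chunk in chunks {
        let key = chunk_hash(chunk);
        if next.contains_key(&key) {
            continue; // duplicate chunk within this pass
        }
        let vector = match cache.get(&key) {
            Some(v) => v.clone(), // unchanged chunk: reuse the cached vector
            None => {
                embedded += 1;
                embed(chunk)
            }
        };
        next.insert(key, vector);
    }
    (next, embedded)
}
```

Because the new cache is rebuilt from the current chunk list, entries for deleted chunks drop out without a separate cleanup pass.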

Where csp should **not** follow: dropping BM25. The dense-only path is CocoIndex's biggest
weakness for symbol/identifier queries, and csp's hybrid + adaptive-alpha design is the explicit
counter-position (see [CLAUDE.md](../../../CLAUDE.md) "Conventions to preserve from semble").

---

## 5. Sources

- CocoIndex Code product page — <https://cocoindex.io/cocoindex-code/>
- `cocoindex-io/cocoindex-code` (CLI/MCP) — <https://github.com/cocoindex-io/cocoindex-code>
- CocoIndex framework code-index example — <https://cocoindex.io/docs-v0/examples/code_index/>
- `cocoindex-io/realtime-codebase-indexing` — <https://github.com/cocoindex-io/realtime-codebase-indexing>
4 changes: 4 additions & 0 deletions .please/docs/references/index.md
@@ -14,6 +14,8 @@ When upstream moves, update the relevant analysis here and reconcile any new dri
| Library | Upstream | Role | Analysis |
|---|---|---|---|
| **semble** | [MinishLab/semble](https://github.com/MinishLab/semble) | Direct port source — analyzed against the **Rust port** (`crates/csp`, [ADR-0003](../decisions/0003-rewrite-in-rust.md)) | [semble.md](semble.md) |
| **cocoindex-code** | [cocoindex-io/cocoindex-code](https://github.com/cocoindex-io/cocoindex-code) | Prior-art / comparator — independent AST code-search MCP in the same niche (not a port source) | [cocoindex.md](cocoindex.md) |
| **model2vec** | [MinishLab/model2vec](https://github.com/MinishLab/model2vec) + [model2vec-rs](https://github.com/MinishLab/model2vec-rs) | Direct dependency — the dense-retrieval leg (`potion-code-16M`); Rust port wires `model2vec-rs` | [model2vec.md](model2vec.md) |

<!-- Add new reference analyses above this line as additional libraries are adopted. -->

@@ -25,6 +27,8 @@ drift, diff from that commit forward (`git log <baseline>..main` in the upstream
| Library | Analyzed at | Notes |
|---|---|---|
| semble | upstream `136b6f7` (2026-06-18); Rust port `2f2baa2` (PR #34) | Mapped to the Rust crates; beyond prior review baseline `eacbe43`; see semble.md §Divergences |
| cocoindex-code | web docs + GitHub README, 2026-06-19 (no commit pinned); embedding benchmarks from model HF cards | Comparison/prior-art, **not a port** — no parity oracle. Drift = re-check vs. cocoindex's docs/README; benchmark row reflects published `potion-code-16M` (CoIR) vs. `arctic-embed-xs` (MTEB) figures. See cocoindex.md §2 + `[^bench]` |
| model2vec | GitHub READMEs + HF cards, 2026-06-19; `model2vec-rs` `0.2.1` (no commit pinned) | **Direct dependency**, not a port — the dense leg. Drift = pin `model2vec-rs` crate version + `potion-code-16M` card revision when the stub is swapped for real weights. See model2vec.md §4–5 |

## How to add a new reference analysis

129 changes: 129 additions & 0 deletions .please/docs/references/model2vec.md
@@ -0,0 +1,129 @@
# Reference Analysis — Model2Vec (`model2vec` / `model2vec-rs`)

> Analysis of [MinishLab/model2vec](https://github.com/MinishLab/model2vec) (Python) and its Rust
> inference crate [MinishLab/model2vec-rs](https://github.com/MinishLab/model2vec-rs). **This is a
> direct dependency, not a port source** — Model2Vec *is* csp's dense-retrieval leg. semble uses the
> `minishlab/potion-code-16M` static model; the csp Rust port (`crates/csp/src/indexing/dense.rs`)
> wires the official `model2vec-rs` `StaticModel`, with a deterministic stub fallback when the
> model can't be loaded (offline / missing weights / bad path); the TS port is still a stub
> (see `dense-embedding-is-a-stub`). This doc captures how the embeddings are
> produced, which model csp uses, the published benchmarks, and the Rust crate's API/limits.
>
> **Analyzed at**: GitHub READMEs + HF model cards, 2026-06-19. `model2vec-rs` `0.2.1` (May 2026).
> Both projects MIT. Sources in §6.

---

## 1. What Model2Vec is

A method (and toolkit) that **distills a sentence-transformer into a static embedding model**:
each vocabulary token gets one precomputed vector; encoding a string is a **vocab→vector lookup +
mean pooling**, *not* a transformer forward pass. Result: ~50× smaller, up to ~500× faster on CPU,
with a modest quality drop. No GPU, no API key, deterministic. This is exactly why semble/csp can
do "single-digit-millisecond" search on a laptop — the dense signal is a matrix gather, not
inference.
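
The lookup-and-pool step can be illustrated in a few lines. This is a minimal sketch assuming the input is already tokenized and the table is a plain map; real Model2Vec models use a subword tokenizer and a dense weight matrix, and `encode_static` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Encode pre-tokenized text with a static embedding table:
/// look up each token's vector and average them (mean pooling).
/// Unknown tokens are skipped; returns all zeros if nothing matches.
pub fn encode_static(table: &HashMap<String, Vec<f32>>, tokens: &[&str], dim: usize) -> Vec<f32> {
    let mut pooled = vec![0.0f32; dim];
    let mut hits = 0usize;
    for &token in tokens {
        if let Some(vector) = table.get(token) {
            // Accumulate the token vector component-wise.
            for (acc, value) in pooled.iter_mut().zip(vector) {
                *acc += *value;
            }
            hits += 1;
        }
    }
    if hits > 0 {
        // Divide by the number of matched tokens to get the mean.
        for acc in pooled.iter_mut() {
            *acc /= hits as f32;
        }
    }
    pooled
}
```

No matrix multiplication or attention is involved: the cost is one table lookup and one vector add per token, which is where the CPU speed comes from.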

**Distillation pipeline** (`distill()`, ~30s on CPU, no training data needed):
1. **Vocabulary forward pass** — run the vocab through the teacher (e.g. `BAAI/bge-base-en-v1.5`)
to get one embedding per token.
2. **Token→vector table + mean pooling** — store the table; inference pools token vectors.
3. **Post-processing** — **PCA** (dim reduction), **Zipf weighting** (down-weight frequent tokens),
and **tokenlearn / POTION** (the training trick that lifted the `potion-*` generation above
plain distillation).

---

## 2. The "potion" pre-trained models

| Model | Params | Dim | Task | Notes |
|---|---|---|---|---|
| `potion-base-2M` | 1.8M | — | general | smallest |
| `potion-base-4M` | 3.7M | — | general | |
| `potion-base-8M` | 7.5M | 256 | general | crate's example default |
| `potion-base-32M` | 32.3M | 512 | general | "most performant static model" |
| `potion-retrieval-32M` | 32.3M | 512 | retrieval | retrieval-tuned `potion-base-32M` |
| `potion-multilingual-128M` | 128M | — | 101 langs | multilingual |
| **`potion-code-16M`** | **16M** | **256** | **code** | **← csp/semble's model** (vocab ≈ 62.5k) |

**`potion-code-16M`** (the one that matters for csp) is distilled from **`CodeRankEmbed`** (137M
teacher), then refined with **tokenlearn on CornStack pairs**, a **contrastive fine-tune
(MultipleNegativesRankingLoss)**, and **post-SIF re-regularization**. Its output is 256-dim, matching
what `dense.rs` expects.

---

## 3. Benchmarks (and what they mean for csp)

**General-text MTEB** (from the model2vec results README):

| Model | MTEB avg | Retrieval (NDCG@10) |
|---|---|---|
| `potion-base-32M` | 52.13 | 32.67 |
| `potion-base-8M` | 51.08 | — |
| `potion-retrieval-32M` | — | 35.06 |

**Code retrieval — CoIR** (from the `potion-code-16M` HF card; this is csp's relevant number):

| Model | Params | CoIR avg | CosQA | CodeFeedback-ST | CodeFeedback-MT |
|---|---|---|---|---|---|
| `CodeRankEmbed` (teacher) | 137M | 59.14 | 35.92 | 78.11 | 42.61 |
| **`potion-code-16M`** | 16M | **37.05** | 21.37 | 50.27 | 36.26 |
| **`potion-code-16M` + BM25 hybrid** | 16M | **40.41** | 21.63 | 64.26 | 51.23 |

> **The load-bearing line for csp**: the model's own card reports **dense-only 37.05 → +BM25 hybrid
> 40.41 (+3.36)**. The Model2Vec authors themselves measure that pairing the static dense model with
> sparse BM25 beats dense alone — which is precisely csp/semble's hybrid + adaptive-alpha design, and
> the direct counter-argument to a dense-only engine like [cocoindex-code](cocoindex.md). Static
> embeddings trade ~22 NDCG points vs. the 137M teacher for orders-of-magnitude speed; the BM25 leg
> is how csp claws some of that back.
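
The arithmetic behind those figures, written out as a small sketch. The constants are the CoIR averages from the table above; the function names are hypothetical.

```rust
/// CoIR avg NDCG@10 figures quoted in the table above.
const TEACHER: f64 = 59.14; // CodeRankEmbed (137M)
const DENSE: f64 = 37.05; // potion-code-16M, dense-only
const HYBRID: f64 = 40.41; // potion-code-16M + BM25

/// Points lost by going from the teacher to the static model.
pub fn distillation_gap() -> f64 {
    TEACHER - DENSE
}

/// Points regained by adding the BM25 leg.
pub fn hybrid_gain() -> f64 {
    HYBRID - DENSE
}

/// Fraction of the teacher gap that the hybrid recovers.
pub fn recovered_fraction() -> f64 {
    hybrid_gain() / distillation_gap()
}
```

The gap to the teacher is 22.09 points and the hybrid regains 3.36 of them, roughly 15% of the gap.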

---

## 4. `model2vec-rs` — the Rust inference crate (what csp wires)

- **Crate**: `model2vec-rs` `0.2.1` (crates.io), 100% Rust, MIT. ~1.7× the Python throughput
(≈8000 vs 4650 samples/s). **Inference-only**.
- **API** — `StaticModel` struct:
- `from_pretrained(id_or_path, token, normalize, subfolder)` — load from HF Hub or local path.
- `from_bytes(...)` — in-memory load (WASM / embedded).
- `encode(&[String])` — default params; `encode_with_args(.., max_length, batch_size)` — custom.
- **Formats**: `safetensors` with **f32 / f16 / i8** weights. Tokenization via `onig` or
`fancy-regex`. Feature flags `local-only`, `wasm`. Ships a CLI for single/batch encode.
- **Does NOT do**: distillation, training, fine-tuning, dynamic embeddings — those stay in the
Python `model2vec`. The Rust crate is purely the lookup+pool inference path.

```rust
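use model2vec_rs::model::StaticModel;
// Args after the model id: HF token, normalize override, subfolder (`None` = default).
let model = StaticModel::from_pretrained("minishlab/potion-code-16M", None, None, None)?;
let embeddings = model.encode(&["where do we embed chunks".to_string()]);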
```

### How csp uses it

`crates/csp/src/indexing/dense.rs` exposes a `Model` enum: a **real** path
(`Static { StaticModel }`, loaded via `model2vec-rs` `StaticModel::from_pretrained`) **and** a
deterministic stub used only as a fallback when the model can't be loaded (offline / missing weights
/ bad path) or in tests. The stub reproduces the former TS stub bit-for-bit for fixture-level parity
(see [semble.md §4.6](semble.md), `dense-embedding-is-a-stub`). `SelectableBasicBackend` does cosine
kNN over the resulting matrix. With real `potion-code-16M` weights loaded the dense leg matches the
benchmark table above; the BM25 leg and ranking are unchanged either way. (The TS port has no real
dense path yet — its dense signal stays a stub until Rust reaches parity.)
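
The cosine kNN step can be sketched as a brute-force scan over the embedding matrix. This is a generic illustration, not the code of `SelectableBasicBackend`; `cosine` and `knn` are hypothetical names.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force kNN: (row index, score) of the `k` rows of `matrix`
/// most similar to `query`, best first.
pub fn knn(matrix: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = matrix
        .iter()
        .enumerate()
        .map(|(i, row)| (i, cosine(row, query)))
        .collect();
    // Highest similarity first; ties broken by row index for determinism.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(k);
    scored
}
```

The scan is the same whether the rows come from real weights or from the stub, which is consistent with the ranking stages being unchanged either way.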

---

## 5. Relevance map (Model2Vec → csp)

| Model2Vec concept | csp counterpart |
|---|---|
| `StaticModel` (Python) / `model2vec-rs::StaticModel` (Rust) | `dense.rs` `Model` enum (real + stub) |
| `potion-code-16M` (256-dim) | the embedding model semble/csp target |
| vocab→vector lookup + mean pooling | `embed_chunks` dense matrix build |
| cosine over embeddings | `SelectableBasicBackend` kNN |
| dense-only vs. +BM25 hybrid (40.41) | csp's RRF fusion of dense + BM25 (the whole point) |

---

## 6. Sources

- `MinishLab/model2vec` (method, distill/encode, potion models) — <https://github.com/MinishLab/model2vec>
- MTEB results table — <https://github.com/MinishLab/model2vec/blob/main/results/README.md#mteb-results>
- `MinishLab/model2vec-rs` (Rust inference crate) — <https://github.com/MinishLab/model2vec-rs>
- `minishlab/potion-code-16M` card (CoIR + hybrid numbers, training recipe) — <https://huggingface.co/minishlab/potion-code-16M>