feat(indexing): port Model2Vec embedding + vector backend from semble#7

Merged
amondnet merged 2 commits into main from feat/unit-10-dense
May 28, 2026
Conversation

@amondnet amondnet commented May 28, 2026


Port `src/semble/index/dense.py` to TypeScript as `src/indexing/dense.ts` — Worker 10 of the parallel csp port.

What it does

  • Loads a Model2Vec model (default `minishlab/potion-code-16M`).
  • Embeds chunks into Float32Array row vectors.
  • Wraps Vicinity's `BasicBackend` (cosine distance) with a `selector` parameter for filtering candidate indices.

Public surface

  • `DEFAULT_MODEL_NAME` — kept identical to semble for parity.
  • `loadModel(modelPath?: string): Promise<{ model, modelPath }>` — async, module-level cache, mirrors semble's `@cache` behavior.
  • `embedChunks(model, chunks): Float32Array[]` — one row per chunk, `[]` for empty input (matching semble).
  • `SelectableBasicBackend`:
    • constructor pre-normalises rows so cosine distance reduces to `1 - dot`.
    • `query(queryVectors, k, selector?)` returns `[index, distance][]` per query, sorted ascending.
    • `save(dir)` / `static load(dir)` persist vectors as a flat `vectors.bin` plus `args.json` metadata.
    • `k < 1` throws (parity).
    • `effective_k = min(k, numVectors, selector.length)` (parity).
    • Selector results map pool-relative indices back to absolute chunk indices (parity with `selector[sorted_indices]`).
  • `Chunk` is inlined locally (TODO: switch to `../types` once that module lands).
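
The query contract listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `src/indexing/dense.ts`: the names `Hit`, `normalizeRows`, `dot`, and `queryCosine` are hypothetical, and a full sort stands in for whatever selection strategy the real backend uses.

```typescript
// Hypothetical helper names; only the behaviour follows the PR description.
type Hit = [index: number, distance: number];

/** Returns L2-normalised copies so that cosine distance reduces to 1 - dot. */
function normalizeRows(rows: Float32Array[]): Float32Array[] {
  return rows.map((row) => {
    const norm = Math.sqrt(row.reduce((acc, v) => acc + v * v, 0)) || 1;
    return row.map((v) => v / norm);
  });
}

function dot(a: Float32Array, b: Float32Array): number {
  return a.reduce((acc, v, i) => acc + v * (b[i] ?? 0), 0);
}

/**
 * Nearest neighbours of each query among `stored` (assumed already
 * normalised), optionally restricted to the absolute indices in `selector`.
 */
function queryCosine(
  stored: Float32Array[],
  queries: Float32Array[],
  k: number,
  selector?: Uint32Array,
): Hit[][] {
  if (k < 1) throw new Error("k must be >= 1");
  // Candidate pool: the selector's absolute indices, or every stored row.
  const pool = selector ? Array.from(selector) : stored.map((_, i) => i);
  const effectiveK = Math.min(k, stored.length, pool.length);
  return normalizeRows(queries).map((q) =>
    pool
      .map((abs): Hit => {
        const row = stored[abs];
        if (row === undefined) throw new Error("Selector index out of bounds");
        // Results carry the absolute index, not the position inside the pool.
        return [abs, 1 - dot(q, row)];
      })
      .sort((x, y) => x[1] - y[1]) // ascending distance
      .slice(0, effectiveK),
  );
}
```

With stored rows `[1, 0]`, `[0, 1]`, and `[1, 1]` (normalised), a query of `[1, 0]` with `k = 5` and selector `[1, 2]` yields two hits, index 2 first and then index 1, because `effective_k` is capped by the selector length.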

Option chosen: C (stub Model2Vec)

Per the worker instructions, network model loading is explicitly out of scope for this batch. `loadModel` returns a stub model that produces deterministic, hash-seeded unit-length vectors (FNV-1a → Mulberry32 → Box-Muller). This satisfies:

  • API contract is real.
  • Same content → same vector (tests rely on this).
  • Different content → different vectors (with overwhelming probability).
  • No network, no HuggingFace dependency, no flakiness.

Marked with `// TODO(dense): integrate real Model2Vec model loading.` Real loading (Option A via `@huggingface/transformers`, or Option B reimplementing inference) is a follow-up PR.
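
For readers unfamiliar with the FNV-1a → Mulberry32 → Box-Muller chain, here is a rough sketch of how such a deterministic stub embedding could work. It is an illustration only: the function names and the default dimension of 256 are assumptions (256 appears in the review diagram, not in the description), and hashing UTF-16 code units rather than bytes is a simplification.

```typescript
/** FNV-1a 32-bit hash, applied here to UTF-16 code units (a simplification). */
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0;
}

/** Mulberry32 PRNG: returns a generator of floats in [0, 1). */
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), state | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Hash-seeded Gaussian vector, scaled to unit length. */
function stubEmbed(text: string, dim = 256): Float32Array {
  const rand = mulberry32(fnv1a(text));
  const raw = new Float32Array(dim);
  let sumSq = 0;
  for (let i = 0; i < dim; i++) {
    const u1 = 1 - rand(); // shift to (0, 1] so Math.log never sees 0
    const u2 = rand();
    // Box-Muller transform: two uniforms to one standard normal sample.
    const g = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    raw[i] = g;
    sumSq += g * g;
  }
  const norm = Math.sqrt(sumSq) || 1;
  return raw.map((v) => v / norm);
}
```

Because the seed depends only on the text, `stubEmbed("x")` returns the same vector on every call, which is the property the determinism tests rely on.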

Tests (13 passing)

`bun test src/indexing/dense.test.ts` covers:

  • `loadModel` returns a Model with `dim > 0` and is cached per path.
  • `embedChunks([])` → `[]`.
  • `embedChunks([c1, c2])` → 2 rows of `model.dim` floats each.
  • Determinism: same content → same vector.
  • Different content → different vectors.
  • `query` throws on `k < 1`.
  • `query` returns sorted `[index, distance]` pairs with self at distance ~0.
  • `query` with selector `[1, 2]` only returns indices from that set, and effective_k respects `selector.length`.
  • Multiple query vectors → multiple result rows.
  • `effective_k` capped at vector count.
  • `save` → `load` roundtrip preserves vectors and query results.

`bun run typecheck` — clean.
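
The caching and empty-input behaviour these tests check could look roughly like the sketch below. It is not the PR's code: the `Model` interface, `makePlaceholderModel`, the zero-vector `encode`, and the `content` field on chunks are all assumptions made to keep the example self-contained.

```typescript
// Assumed shape of a model; the PR only states that it has `dim` and `encode`.
interface Model {
  dim: number;
  encode(texts: string[]): Float32Array[];
}

const DEFAULT_MODEL_NAME = "minishlab/potion-code-16M";

// Module-level cache keyed by model path.
const MODEL_CACHE = new Map<string, Model>();

// Placeholder only: returns zero vectors instead of hash-seeded ones.
function makePlaceholderModel(dim: number): Model {
  return { dim, encode: (texts) => texts.map(() => new Float32Array(dim)) };
}

async function loadModel(
  modelPath?: string,
): Promise<{ model: Model; modelPath: string }> {
  const resolved = modelPath ?? DEFAULT_MODEL_NAME;
  let model = MODEL_CACHE.get(resolved);
  if (model === undefined) {
    model = makePlaceholderModel(256);
    MODEL_CACHE.set(resolved, model);
  }
  return { model, modelPath: resolved };
}

// Assumes each chunk exposes its text as `content`.
function embedChunks(model: Model, chunks: { content: string }[]): Float32Array[] {
  if (chunks.length === 0) return [];
  return model.encode(chunks.map((c) => c.content));
}
```

Two calls to `loadModel` with the same path resolve to the same model object, while `embedChunks(model, [])` short-circuits to `[]` without calling `encode`.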

Out of scope

  • Real Model2Vec inference / HuggingFace download (Option A/B).
  • Integration with sparse BM25 backend (separate unit).
  • Wiring into `CspIndex` (later unit).

Summary by cubic

Ports the Semble dense indexing unit to TypeScript with a stubbed Model2Vec and a cosine-distance backend, enabling deterministic embeddings and k-NN queries without network calls. Adds save/load, selector-based filtering, and stricter validation.

  • New Features

    • DEFAULT_MODEL_NAME: minishlab/potion-code-16M.
    • loadModel(modelPath?): async, cached per path; returns stub model with .dim and .encode.
    • embedChunks(model, chunks): Float32Array[] per chunk; empty input returns [].
    • SelectableBasicBackend:
      • Cosine distance with pre-normalized rows.
      • query(queryVectors, k, selector?): [index, distance][] sorted; k < 1 throws; effective_k = min(k, numVectors, selector.length).
      • Selector results map pool-relative indices back to absolute indices.
      • save(dir) / static load(dir): persist as vectors.bin + args.json.
  • Bug Fixes

    • Constructor throws on inconsistent vector dimensions.
    • query throws on query vector dim mismatch and selector indices out of bounds.
    • load rejects truncated vectors.bin (size mismatch).

Written for commit b817e45. Summary will update on new commits.

Port src/semble/index/dense.py to TypeScript with stub Model2Vec
inference (Option C from the porting plan).

Exports:
- DEFAULT_MODEL_NAME = 'minishlab/potion-code-16M'
- loadModel(modelPath?) — async, cached per path
- embedChunks(model, chunks) — Float32Array per chunk
- SelectableBasicBackend — cosine-distance backend with
  optional Uint32Array selector for index filtering, plus
  save/load roundtrip

The Model2Vec model loading is a stub: deterministic, hash-seeded
random vectors keep the API contract exercised by tests without
requiring HuggingFace network I/O. Real model integration is
flagged with a TODO and is explicitly out of scope per the
coordinator's e2e recipe.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request ports the dense vector indexing module and its unit tests to TypeScript, introducing a stub Model2Vec implementation and the SelectableBasicBackend class for in-memory vector storage, querying, and persistence. The reviewer feedback focuses on enhancing robustness by adding input validation: specifically, ensuring consistent vector dimensions in the constructor, validating that query vectors match the expected dimension, checking that selector indices are within bounds during queries, and verifying the file size of vectors.bin matches the metadata when loading from disk.


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 2 files

Architecture diagram
sequenceDiagram
    participant Test as Test Runner
    participant DenseMod as dense.ts (Module)
    participant ModelCache as _MODEL_CACHE (Map)
    participant StubModel as Stub Model
    participant Backend as SelectableBasicBackend
    participant FS as File System

    Note over Test,FS: Embedding Flow

    Test->>DenseMod: loadModel("test/path")
    DenseMod->>ModelCache: get("test/path")
    opt Cache miss
        ModelCache-->>DenseMod: undefined
        DenseMod->>DenseMod: makeStubModel(256)
        Note over DenseMod: Creates model with stubEmbed()<br/>FNV-1a → Mulberry32 → Box-Muller
        DenseMod->>ModelCache: set("test/path", stubModel)
    end
    ModelCache-->>DenseMod: cached model
    DenseMod-->>Test: { model, modelPath }

    Test->>DenseMod: embedChunks(model, [chunk1, chunk2])
    DenseMod->>StubModel: encode(["content1", "content2"])
    StubModel->>StubModel: For each content:<br/>fnv1a() → mulberry32 → Box-Muller
    Note over StubModel: Deterministic unit-length vectors
    StubModel-->>DenseMod: Float32Array[] (rows × 256)
    DenseMod-->>Test: Float32Array[]

    Note over Test,FS: Query Flow

    Test->>Backend: new SelectableBasicBackend(vectors)
    Backend->>Backend: Defensive copy + L2 normalize each row
    Note over Backend: cosine distance = 1 - dot(normalized_q, normalized_v)

    Test->>Backend: query([queryVec], k=2, selector?)
    alt k < 1
        Backend-->>Test: throw Error
    else selector provided
        Note over Backend: effective_k = min(k, numVectors, selector.length)
        Backend->>Backend: Iterate selector indices only
        Backend->>Backend: Compute 1 - dot(q, v) for each candidate
        Note over Backend: Partial sort → slice top k
        Backend->>Backend: Map poolIdx → absolute chunk index
        Backend-->>Test: [[index, distance], ...]
    else no selector
        Backend->>Backend: Iterate all stored vectors
        Backend->>Backend: Compute distances, sort, slice top k
        Backend-->>Test: [[index, distance], ...]
    end

    Note over Test,FS: Persistence Flow

    Test->>Backend: save("/tmp/vectors")
    Backend->>FS: mkdir("/tmp/vectors")
    Backend->>FS: writeFile("vectors.bin")<br/>Flat Float32Array of rows × dim
    Backend->>FS: writeFile("args.json")<br/>{ rows, dim, arguments }
    FS-->>Backend: done

    Test->>Backend: SelectableBasicBackend.load("/tmp/vectors")
    Backend->>FS: readFile("args.json")
    FS-->>Backend: metadata
    Backend->>FS: readFile("vectors.bin")
    FS-->>Backend: binary buffer
    Backend->>Backend: Slice into rows × dim Float32Array[]
    Backend->>Backend: Create new SelectableBasicBackend(reconstructed)
    Backend-->>Test: loaded backend instance


@amondnet amondnet left a comment


Applied 4 hardening fixes from gemini-code-assist (2 also flagged by cubic — same root causes):

  • Constructor: throws `Inconsistent vector dimensions` when input rows differ in length. `dim` is now derived from `vectors[0]` before the map, so the first row anchors the contract.
  • Query vector dim: `query()` throws `Query vector dimension mismatch` before normalization and the dot product, avoiding silent `NaN` distances.
  • Selector bounds: `query()` pre-validates that every `selector[i] < numVectors` and throws `Selector index out of bounds`, replacing the post-hoc crash on the non-null assertion (`!`).
  • Persistence integrity: `load()` verifies `vectors.bin.byteLength === meta.rows * meta.dim * 4` and throws `Vector file size mismatch` on corruption or truncation.

Each fix has a dedicated unit test. Total 17/17 passing (was 13), bun run typecheck clean.
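
As an illustration of the dimension and file-size checks, here is a hedged sketch of the flat layout with the file I/O left out. `packVectors`, `unpackVectors`, and `VectorMeta` are hypothetical names; in the PR the bytes live in `vectors.bin` and the metadata in `args.json`, and the real code may differ in detail.

```typescript
// Hypothetical metadata shape: only the two fields the size check needs.
interface VectorMeta {
  rows: number;
  dim: number;
}

/** Flattens rows into one byte buffer, rejecting rows of unequal length. */
function packVectors(vectors: Float32Array[]): { bytes: Uint8Array; meta: VectorMeta } {
  const dim = vectors[0]?.length ?? 0; // the first row anchors the dimension
  const flat = new Float32Array(vectors.length * dim);
  vectors.forEach((row, r) => {
    if (row.length !== dim) throw new Error("Inconsistent vector dimensions");
    flat.set(row, r * dim);
  });
  return { bytes: new Uint8Array(flat.buffer), meta: { rows: vectors.length, dim } };
}

/** Rebuilds the rows, rejecting buffers whose size disagrees with the metadata. */
function unpackVectors(bytes: Uint8Array, meta: VectorMeta): Float32Array[] {
  // 4 bytes per float32 value.
  if (bytes.byteLength !== meta.rows * meta.dim * 4) {
    throw new Error("Vector file size mismatch");
  }
  // Copy first so the Float32Array view starts at an aligned offset 0.
  const flat = new Float32Array(bytes.slice().buffer);
  return Array.from({ length: meta.rows }, (_, r) =>
    flat.slice(r * meta.dim, (r + 1) * meta.dim),
  );
}
```

A truncated buffer, for example 20 bytes where the metadata implies 24, fails the size check up front instead of producing a short or misaligned final row.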

No deferrals — all comments were validation hardening, none required real Model2Vec wiring (that remains a TODO for a follow-up PR per the stub-model design).

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit 3937a3f into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-10-dense branch May 28, 2026 16:05
This was referenced Jun 18, 2026