feat(indexing): port Model2Vec embedding + vector backend from semble#7
Port src/semble/index/dense.py to TypeScript with stub Model2Vec inference (Option C from the porting plan).

Exports:
- DEFAULT_MODEL_NAME = 'minishlab/potion-code-16M'
- loadModel(modelPath?): async, cached per path
- embedChunks(model, chunks): Float32Array per chunk
- SelectableBasicBackend: cosine-distance backend with optional Uint32Array selector for index filtering, plus save/load roundtrip

The Model2Vec model loading is a stub: deterministic, hash-seeded random vectors keep the API contract exercised by tests without requiring HuggingFace network I/O. Real model integration is flagged with a TODO and is explicitly out of scope per the coordinator's e2e recipe.
Code Review
This pull request ports the dense vector indexing module and its unit tests to TypeScript, introducing a stub Model2Vec implementation and the SelectableBasicBackend class for in-memory vector storage, querying, and persistence. The reviewer feedback focuses on enhancing robustness by adding input validation: specifically, ensuring consistent vector dimensions in the constructor, validating that query vectors match the expected dimension, checking that selector indices are within bounds during queries, and verifying the file size of vectors.bin matches the metadata when loading from disk.
2 issues found across 2 files
Architecture diagram
sequenceDiagram
participant Test as Test Runner
participant DenseMod as dense.ts (Module)
participant ModelCache as _MODEL_CACHE (Map)
participant StubModel as Stub Model
participant Backend as SelectableBasicBackend
participant FS as File System
Note over Test,FS: Embedding Flow
Test->>DenseMod: loadModel("test/path")
DenseMod->>ModelCache: get("test/path")
opt Cache miss
ModelCache-->>DenseMod: undefined
DenseMod->>DenseMod: makeStubModel(256)
Note over DenseMod: Creates model with stubEmbed()<br/>FNV-1a → Mulberry32 → Box-Muller
DenseMod->>ModelCache: set("test/path", stubModel)
end
ModelCache-->>DenseMod: cached model
DenseMod-->>Test: { model, modelPath }
Test->>DenseMod: embedChunks(model, [chunk1, chunk2])
DenseMod->>StubModel: encode(["content1", "content2"])
StubModel->>StubModel: For each content:<br/>fnv1a() → mulberry32 → Box-Muller
Note over StubModel: Deterministic unit-length vectors
StubModel-->>DenseMod: Float32Array[] (rows × 256)
DenseMod-->>Test: Float32Array[]
Note over Test,FS: Query Flow
Test->>Backend: new SelectableBasicBackend(vectors)
Backend->>Backend: Defensive copy + L2 normalize each row
Note over Backend: cosine = 1 - dot(normalized_q, normalized_v)
Test->>Backend: query([queryVec], k=2, selector?)
alt k < 1
Backend-->>Test: throw Error
else selector provided
Note over Backend: effective_k = min(k, numVectors, selector.length)
Backend->>Backend: Iterate selector indices only
Backend->>Backend: Compute 1 - dot(q, v) for each candidate
Note over Backend: Partial sort → slice top k
Backend->>Backend: Map poolIdx → absolute chunk index
Backend-->>Test: [[index, distance], ...]
else no selector
Backend->>Backend: Iterate all stored vectors
Backend->>Backend: Compute distances, sort, slice top k
Backend-->>Test: [[index, distance], ...]
end
Note over Test,FS: Persistence Flow
Test->>Backend: save("/tmp/vectors")
Backend->>FS: mkdir("/tmp/vectors")
Backend->>FS: writeFile("vectors.bin")<br/>Flat Float32Array of rows × dim
Backend->>FS: writeFile("args.json")<br/>{ rows, dim, arguments }
FS-->>Backend: done
Test->>Backend: SelectableBasicBackend.load("/tmp/vectors")
Backend->>FS: readFile("args.json")
FS-->>Backend: metadata
Backend->>FS: readFile("vectors.bin")
FS-->>Backend: binary buffer
Backend->>Backend: Slice into rows × dim Float32Array[]
Backend->>Backend: Create new SelectableBasicBackend(reconstructed)
Backend-->>Test: loaded backend instance
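The persistence flow at the end of the diagram can be sketched without touching the file system. This is a minimal illustration, not the PR's code: `packVectors`, `unpackVectors`, and `VectorMeta` are hypothetical names, the `arguments` field of `args.json` is left out because its contents are not shown here, and an empty input is assumed to yield `dim = 0`.

```typescript
// Hypothetical sketch of the vectors.bin / args.json layout described above.
// All names here are illustrative assumptions, not the PR's actual code.

interface VectorMeta {
  rows: number;
  dim: number;
}

/** Flatten equally sized rows into one Float32Array (the vectors.bin payload). */
function packVectors(vectors: Float32Array[]): { flat: Float32Array; meta: VectorMeta } {
  const rows = vectors.length;
  const dim = vectors[0]?.length ?? 0; // assumption: empty input means dim 0
  const flat = new Float32Array(rows * dim);
  vectors.forEach((row, r) => flat.set(row, r * dim));
  return { flat, meta: { rows, dim } };
}

/** Rebuild the rows from the raw buffer, rejecting a size that disagrees with the metadata. */
function unpackVectors(bytes: ArrayBufferLike, meta: VectorMeta): Float32Array[] {
  if (bytes.byteLength !== meta.rows * meta.dim * 4) {
    throw new Error("Vector file size mismatch"); // Float32 is 4 bytes per value
  }
  const flat = new Float32Array(bytes);
  const out: Float32Array[] = [];
  for (let r = 0; r < meta.rows; r++) {
    out.push(flat.slice(r * meta.dim, (r + 1) * meta.dim)); // slice() copies the row
  }
  return out;
}
```

In a real `save`/`load` pair the flat buffer and the metadata would be written to and read from disk; keeping the packing logic separate from I/O makes the roundtrip easy to unit test.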
amondnet
left a comment
Applied 4 hardening fixes from gemini-code-assist (2 also flagged by cubic — same root causes):
- Constructor: throws `Inconsistent vector dimensions` when input rows differ in length. `dim` is now derived from `vectors[0]` before the map so the first row anchors the contract.
- Query vector dim: `query()` throws `Query vector dimension mismatch` before normalization/dot, avoiding silent NaN distances.
- Selector bounds: `query()` pre-validates every `selector[i] < numVectors` and throws `Selector index out of bounds`, replacing the post-hoc `!` crash.
- Persistence integrity: `load()` verifies `vectors.bin.byteLength === meta.rows * meta.dim * 4` and throws `Vector file size mismatch` on corruption/truncation.
Each fix has a dedicated unit test. Total 17/17 passing (was 13), bun run typecheck clean.
No deferrals — all comments were validation hardening, none required real Model2Vec wiring (that remains a TODO for a follow-up PR per the stub-model design).
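For readers who want to see the shape of these checks, here is a hedged sketch of the four validations as standalone helpers. The error messages are the ones quoted in this comment; the helper names, their signatures, and the treatment of an empty vector list (`dim = 0`) are assumptions for illustration, since in the PR the checks live inside the constructor, `query()`, and `load()`.

```typescript
// Hypothetical helpers mirroring the four hardening checks.
// Function names and signatures are illustrative, not the PR's code.

/** Returns the shared dimension, anchored on the first row. */
function assertConsistentDims(vectors: Float32Array[]): number {
  const dim = vectors[0]?.length ?? 0; // assumption: empty input means dim 0
  for (const row of vectors) {
    if (row.length !== dim) throw new Error("Inconsistent vector dimensions");
  }
  return dim;
}

/** Fails fast instead of producing NaN distances later. */
function assertQueryDim(queryVec: Float32Array, dim: number): void {
  if (queryVec.length !== dim) throw new Error("Query vector dimension mismatch");
}

/** Every selector entry must address a stored vector. */
function assertSelectorInBounds(selector: Uint32Array, numVectors: number): void {
  selector.forEach((idx) => {
    if (idx >= numVectors) throw new Error("Selector index out of bounds");
  });
}

/** The byte length must match rows * dim * 4 (4 bytes per Float32 value). */
function assertVectorFileSize(byteLength: number, meta: { rows: number; dim: number }): void {
  if (byteLength !== meta.rows * meta.dim * 4) throw new Error("Vector file size mismatch");
}
```

Running each check before any arithmetic keeps the failure close to the bad input, which is the point of replacing the post-hoc crash with explicit errors.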
Port `src/semble/index/dense.py` to TypeScript as `src/indexing/dense.ts` — Worker 10 of the parallel csp port.
What it does
Public surface
Option chosen: C (stub Model2Vec)
Per the worker instructions, network model loading is explicitly out of scope for this batch. `loadModel` returns a stub model that produces deterministic, hash-seeded unit-length vectors (FNV-1a → Mulberry32 → Box-Muller). This satisfies:
Marked with `// TODO(dense): integrate real Model2Vec model loading.` Real loading (Option A via `@huggingface/transformers`, or Option B reimplementing inference) is a follow-up PR.
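To make the FNV-1a → Mulberry32 → Box-Muller pipeline concrete, a minimal sketch follows. It is an assumption-laden illustration rather than the code in `dense.ts`: `stubEmbedSketch` and its helpers are hypothetical names, and the hash runs over UTF-16 code units instead of UTF-8 bytes to keep the example short.

```typescript
// Hypothetical sketch of a deterministic, hash-seeded stub embedding.
// Names and details are illustrative; the PR's stubEmbed() may differ.

/** 32-bit FNV-1a hash (over UTF-16 code units, a simplification). */
function fnv1aSketch(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime
  }
  return hash >>> 0;
}

/** Mulberry32 PRNG: returns a function yielding floats in [0, 1). */
function mulberry32Sketch(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Same text always yields the same unit-length vector. */
function stubEmbedSketch(text: string, dim: number): Float32Array {
  const rand = mulberry32Sketch(fnv1aSketch(text));
  const gaussians: number[] = [];
  let normSq = 0;
  for (let i = 0; i < dim; i++) {
    // Box-Muller transform; 1 - rand() lies in (0, 1], so log() never sees 0.
    const u1 = 1 - rand();
    const u2 = rand();
    const g = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    gaussians.push(g);
    normSq += g * g;
  }
  const norm = Math.sqrt(normSq) || 1; // guard against an all-zero draw
  return Float32Array.from(gaussians, (g) => g / norm);
}
```

Seeding the generator from a content hash is what lets tests assert on exact vectors without any network access: identical chunk text maps to an identical embedding on every run.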
Tests (13 passing)
`bun test src/indexing/dense.test.ts` covers:
`bun run typecheck` — clean.
Out of scope
Summary by cubic
Ports the Semble dense indexing unit to TypeScript with a stubbed Model2Vec and a cosine-distance backend, enabling deterministic embeddings and k-NN queries without network calls. Adds save/load, selector-based filtering, and stricter validation.
New Features
- `DEFAULT_MODEL_NAME`: `minishlab/potion-code-16M`.
- `loadModel(modelPath?)`: async, cached per path; returns stub model with `.dim` and `.encode`.
- `embedChunks(model, chunks)`: `Float32Array[]` per chunk; empty input returns `[]`.
- `SelectableBasicBackend`:
  - `query(queryVectors, k, selector?)`: `[index, distance][]` sorted; `k < 1` throws; `effective_k = min(k, numVectors, selector.length)`.
  - `save(dir)` / `static load(dir)`: persist as `vectors.bin` + `args.json`.

Bug Fixes
- `query` throws on query vector dim mismatch and selector indices out of bounds.
- `load` rejects truncated `vectors.bin` (size mismatch).

Written for commit b817e45. Summary will update on new commits.
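The `query` semantics summarized above (cosine distance, optional selector, `effective_k`) can be sketched for a single query vector as follows. This is a simplified illustration under stated assumptions: `queryOneSketch` and its helpers are hypothetical names, the stored rows are assumed to be L2-normalized already, the `k < 1` error text is invented for the example, and a full sort stands in for the partial sort mentioned in the diagram.

```typescript
// Hypothetical single-query sketch of selector-filtered cosine k-NN.
// Not the PR's implementation; names and the k < 1 message are assumptions.

type HitSketch = [number, number]; // [absolute index, cosine distance]

function l2NormalizeSketch(v: Float32Array): Float32Array {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1; // zero vectors stay zero
  return v.map((x) => x / norm); // map() returns a new Float32Array
}

function dotSketch(a: Float32Array, b: Float32Array): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

function queryOneSketch(
  stored: Float32Array[], // assumed already L2-normalized
  queryVec: Float32Array,
  k: number,
  selector?: Uint32Array,
): HitSketch[] {
  if (k < 1) throw new Error("k must be >= 1");
  const q = l2NormalizeSketch(queryVec);
  // Without a selector, every stored row is a candidate.
  const candidates: number[] = selector ? Array.from(selector) : stored.map((_, i) => i);
  const effectiveK = Math.min(k, stored.length, candidates.length);
  const hits = candidates.map((idx): HitSketch => {
    const row = stored[idx];
    if (row === undefined) throw new Error("Selector index out of bounds");
    return [idx, 1 - dotSketch(q, row)]; // cosine distance on unit vectors
  });
  hits.sort((x, y) => x[1] - y[1]); // ascending distance; full sort for simplicity
  return hits.slice(0, effectiveK);
}
```

Because the selector holds absolute indices, the returned hits can be used directly against the original chunk list, with no remapping from a filtered pool.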