dense.ts ships a stub Model2Vec — semantic search uses fake embeddings #26

Description

@amondnet

Summary

src/indexing/dense.ts ships a stub Model2Vec implementation. loadModel / makeStubModel never download or run a real model — they return deterministic, FNV‑1a hash‑seeded random vectors (stubEmbed). As a result the dense (semantic) half of hybrid search produces meaningless results, even though the pipeline, RRF fusion, and persistence are all wired correctly.
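To make the failure mode concrete, here is a minimal sketch of what a hash-seeded stub embedding looks like. It is not the code in dense.ts: `stubEmbedSketch`, `fnv1a`, `mulberry32`, and `cosine` are hypothetical names, and the choice of PRNG is an assumption. The point is only that the vector depends on the text's hash, not on its meaning, so cosine similarity between two related strings carries no signal.

```typescript
// Hypothetical reconstruction of a hash-seeded stub; not the code in dense.ts.

/** 32-bit FNV-1a hash over the UTF-16 code units of a string. */
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

/** mulberry32-style PRNG (an assumed choice): deterministic floats in [0, 1). */
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Deterministic pseudo-random unit vector seeded by the text's hash. */
function stubEmbedSketch(text: string, dim: number): Float32Array {
  const next = mulberry32(fnv1a(text));
  const vec = new Float32Array(dim);
  let normSq = 0;
  for (let i = 0; i < dim; i++) {
    const v = next() * 2 - 1; // uniform in [-1, 1)
    vec[i] = v;
    normSq += v * v;
  }
  const scale = 1 / (Math.sqrt(normSq) || 1);
  return vec.map((v) => v * scale); // L2-normalize
}

/** Cosine similarity for unit vectors (reduces to a dot product). */
function cosine(a: Float32Array, b: Float32Array): number {
  return a.reduce((sum, v, i) => sum + v * (b[i] ?? 0), 0);
}

// Same text always yields the same vector, but related texts are unrelated
// in vector space: the similarity below is effectively noise.
const similarity = cosine(
  stubEmbedSketch("parse a JSON file", 256),
  stubEmbedSketch("read and decode JSON", 256),
);
console.log(similarity.toFixed(3));
```

Because the output is deterministic, tests that only check shapes, caching, or persistence pass against such a stub, which is consistent with the pipeline appearing correctly wired while the rankings stay meaningless.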

Evidence

  • src/indexing/dense.ts:6-10: "NOTE: This unit ships a STUB Model2Vec implementation … TODO(dense): integrate real Model2Vec model loading."
  • src/indexing/dense.ts:75-120: stubEmbed, makeStubModel, and loadModel falling back to makeStubModel(_DEFAULT_STUB_DIM).
  • @huggingface/transformers (^4.2.0) is already in dependencies but is never imported.

Impact

  • Semantic similarity is fake → search and findRelated rank almost entirely on BM25 in practice.
  • This is the single largest functional gap vs. upstream semble (which uses minishlab/potion-code-16M, 256‑dim).

Acceptance criteria

  • loadModel loads a real Model2Vec / potion-code-16M‑equivalent model (via @huggingface/transformers or equivalent) and caches it.
  • embedChunks / encode emit real embeddings of the model's native dimension.
  • Stub path retained only as an explicit test/offline fallback (e.g. behind a flag), not the default.
  • Persisted index modelId reflects the real model so loadFromDisk can reject mismatches.
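The criteria above could be sketched roughly as follows. Everything here is hypothetical: `Embedder`, `loadEmbedder`, `allowStub`, `stubDim`, and `assertModelMatches` are illustrative names, not the existing dense.ts API, and the real model loader is injected as a function so the sketch makes no claims about how @huggingface/transformers would be called. It shows three ideas: cache the loaded model, fall back to a stub only on explicit opt-in, and give the stub its own modelId so a persisted index can be rejected on mismatch.

```typescript
// All names are hypothetical; none are taken from dense.ts.

interface Embedder {
  modelId: string;
  dim: number;
  encode(texts: string[]): Promise<Float32Array[]>;
}

/** Injected loader for the real model; its implementation is out of scope here. */
type RealLoader = (modelId: string) => Promise<Embedder>;

interface LoadOptions {
  modelId: string;
  allowStub?: boolean; // explicit opt-in for tests or offline runs
  stubDim?: number;
}

const embedderCache = new Map<string, Promise<Embedder>>();

/** Stub with a distinct modelId so it can never pass for the real model. */
function makeStub(dim: number): Embedder {
  return {
    modelId: `stub-${dim}`,
    dim,
    // Zero vectors keep the sketch short; a real stub could hash the text.
    encode: async (texts) => texts.map(() => new Float32Array(dim)),
  };
}

async function loadEmbedder(opts: LoadOptions, loadReal: RealLoader): Promise<Embedder> {
  const key = opts.modelId;
  let pending = embedderCache.get(key);
  if (!pending) {
    pending = loadReal(key);
    embedderCache.set(key, pending);
    pending.catch(() => embedderCache.delete(key)); // do not cache failures
  }
  try {
    return await pending;
  } catch (err) {
    if (!opts.allowStub) throw err; // default: fail loudly, never silently stub
    // 256 mirrors the upstream dimension cited in this issue; the value of
    // _DEFAULT_STUB_DIM is not stated here, so this default is an assumption.
    return makeStub(opts.stubDim ?? 256);
  }
}

/** Reject a persisted index that was built with a different model. */
function assertModelMatches(
  persisted: { modelId: string; dim: number },
  active: Embedder,
): void {
  if (persisted.modelId !== active.modelId || persisted.dim !== active.dim) {
    throw new Error(
      `index built with ${persisted.modelId} (${persisted.dim}-dim), ` +
        `but ${active.modelId} (${active.dim}-dim) is loaded`,
    );
  }
}
```

With this shape, an index persisted while the stub was active carries a `stub-…` modelId, so a loadFromDisk-style check would refuse to reuse it once the real model is in place.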

Found during a stub audit. Related: ranking integration and AST chunking gaps (separate issues).

Labels: p1 (Priority 1 - High), type:bug (Something isn't working)