feat(indexing): port BM25 enrichment + index from semble#9
Ports src/semble/index/sparse.py to src/indexing/sparse.ts:
- enrichForBm25(chunk): appends 'stem stem dir1 dir2 dir3' to the chunk content. The stem is repeated to up-weight path matches, and only the last 3 parent directory components are kept, matching Python's Path(...).parent.parts[-3:].
- selectorToMask(selector, size): builds a Uint8Array boolean mask of length size with 1s at each selector index, or returns null when selector is null/undefined (mirrors the numpy boolean-mask semantics used by bm25s.get_scores).
- Bm25Index: minimal Okapi BM25 backend with build / getScores / save / load. Documents are passed pre-tokenized (the caller wraps them with tokenize(enrichForBm25(chunk))), and getScores returns a Float32Array in doc order, matching bm25s.BM25.get_scores. weightMask zeros out scores for masked-out documents.

BM25 backend choice: Option B (inline minimal BM25) over Option A (third-party npm such as wink-bm25-text-search). This keeps the unit self-contained while the dependency tree is still settling, and the required surface (build / getScores / save / load with a weight_mask) is small enough to implement and unit-test in <150 LOC. Replacing the backend later is localized to this file.

Stopgap structural Chunk type is inlined until src/types.ts lands from Unit 1, matching the pattern established by Unit 3.

Ref: src/semble/index/sparse.py
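As a rough illustration of the mask helper described above, the following is a minimal sketch, not the PR's actual code. It assumes the selector is a plain array of document indices; the lower-bound check on each index is also an assumption added for safety.

```typescript
// Sketch only: the selector is assumed to be an array of document indices.
function selectorToMask(
  selector: readonly number[] | null | undefined,
  size: number,
): Uint8Array | null {
  // A missing selector means "no mask", signalled with null.
  if (selector == null) return null;

  // Uint8Array is zero-initialised, so every document starts masked out.
  const mask = new Uint8Array(size);
  for (const index of selector) {
    // Assumption: indices outside [0, size) are ignored rather than rejected.
    if (index >= 0 && index < size) mask[index] = 1;
  }
  return mask;
}
```

Calling it with a null selector returns null, so a caller can pass the result straight through as an optional weight mask.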
Code Review
This pull request introduces a TypeScript port of a sparse BM25 index (Bm25Index) along with helper functions enrichForBm25 and selectorToMask, and a comprehensive suite of unit tests. The feedback highlights two valuable improvement opportunities: first, using path.posix.parse and normalizing backslashes in enrichForBm25 to ensure cross-platform consistency; second, optimizing the getScores method by skipping BM25 calculations for masked-out documents directly within the postings list loop rather than zeroing them out afterward.
No issues found across 2 files
Architecture diagram
sequenceDiagram
participant App as Application Code
participant Indexer as BM25 Indexer
participant Enricher as enrichForBm25()
participant Mask as selectorToMask()
participant FS as File System
Note over App,FS: Document Indexing Flow
App->>Indexer: build(documents[])
Indexer->>Indexer: Tokenize documents
Indexer->>Indexer: Compute doc lengths, avg length
Indexer->>Indexer: Build postings list
Indexer->>Indexer: Compute document frequencies
Indexer-->>App: Return Bm25Index instance
Note over App,FS: Document Enrichment (pre-indexing)
App->>Enricher: enrichForBm25(chunk)
Enricher->>Enricher: Parse file path
Enricher->>Enricher: Extract stem (repeated twice)
Enricher->>Enricher: Extract last 3 directory parts
Enricher-->>App: Return enriched content string
Note over App,FS: Query Scoring with Optional Mask
App->>Indexer: getScores(queryTokens[], weightMask?)
Indexer->>Indexer: De-duplicate query tokens
loop For each unique term
Indexer->>Indexer: Lookup postings list
Indexer->>Indexer: Compute IDF score
loop For each matching document
Indexer->>Indexer: Compute BM25 contribution
Indexer->>Indexer: Accumulate score
end
end
alt weightMask provided
Indexer->>Indexer: Zero scores for masked-out docs
end
Indexer-->>App: Return Float32Array of scores
Note over App,FS: Mask Conversion
App->>Mask: selectorToMask(selector[], size)
alt selector is null/undefined
Mask-->>App: Return null
else
Mask->>Mask: Create Uint8Array of length size
loop For each index in selector
alt index < size
Mask->>Mask: Set mask[index] = 1
end
end
Mask-->>App: Return Uint8Array
end
Note over App,FS: Index Persistence
App->>Indexer: save(dir)
Indexer->>Indexer: Serialize state to JSON
Indexer->>FS: mkdir(dir, recursive)
Indexer->>FS: writeFile(dir/bm25.json)
FS-->>Indexer: Confirm write
Indexer-->>App: Promise<void>
App->>FS: load(dir)
FS-->>App: Read bm25.json
App->>Indexer: Deserialize state
Indexer->>Indexer: Reconstruct postings Map
Indexer-->>App: Return Bm25Index instance
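The persistence steps at the end of the diagram (serialize state to JSON, then reconstruct the postings Map on load) can be sketched as below. The `Bm25State` shape and the function names are hypothetical, and the file-system calls (mkdir and writing dir/bm25.json) are left out so the round trip stays self-contained.

```typescript
// Hypothetical state shape; the field names are assumptions for illustration.
type Posting = [number, number]; // [docId, termFrequency]

interface Bm25State {
  docLengths: number[];
  avgLength: number;
  postings: Map<string, Posting[]>;
}

// A Map is not JSON-serialisable as-is, so it is stored as an array of entries.
function serializeState(state: Bm25State): string {
  return JSON.stringify({
    docLengths: state.docLengths,
    avgLength: state.avgLength,
    postings: Array.from(state.postings.entries()),
  });
}

// Rebuild the Map from the stored entries array.
function deserializeState(json: string): Bm25State {
  const raw = JSON.parse(json) as {
    docLengths: number[];
    avgLength: number;
    postings: Array<[string, Posting[]]>;
  };
  return {
    docLengths: raw.docLengths,
    avgLength: raw.avgLength,
    postings: new Map(raw.postings),
  };
}
```

Because the postings survive the round trip unchanged, scores computed from a loaded index match those from the original.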
- enrichForBm25: normalize backslashes and use path.posix.parse so repo-relative paths produce the same enrichment on Windows and POSIX hosts. Filter '.' segments only (filtering '/' is no longer needed since the path is split on a single delimiter).
- Bm25Index.getScores: skip masked-out documents inside the postings iteration instead of zeroing them in a separate O(N) pass. Float32Array defaults to 0 so the result is identical.
- Add a backslash-normalization test to lock in cross-platform behavior.
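A minimal sketch of the normalized enrichment, assuming a `Chunk` with `content` and `filePath` fields (the field names are not given in this PR text). It also drops empty segments, which is an addition in this sketch so that a file with no parent directory yields no stray separator.

```typescript
import * as path from "node:path";

// Assumed stopgap shape; the real Chunk type is not shown here.
interface Chunk {
  content: string;
  filePath: string;
}

function enrichForBm25(chunk: Chunk): string {
  // Turn Windows separators into "/" so both hosts parse the same string.
  const normalized = chunk.filePath.replace(/\\/g, "/");
  const parsed = path.posix.parse(normalized);

  // Keep the last three parent directory components, skipping "." and
  // (an assumption of this sketch) empty segments.
  const dirParts = parsed.dir
    .split("/")
    .filter((part) => part !== "" && part !== ".")
    .slice(-3);

  // The stem appears twice to up-weight path matches.
  return [chunk.content, parsed.name, parsed.name, ...dirParts].join(" ");
}
```

With this version, `a\b\c\d\file.ts` and `a/b/c/d/file.ts` produce the same enriched string.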
amondnet
left a comment
Applied 2 of 2 gemini-code-assist suggestions in f47f867:
- enrichForBm25: normalize backslashes and use path.posix.parse for cross-platform consistency on repo-relative paths. Added a backslash-normalization test.
- Bm25Index.getScores: skip masked-out docs inside the postings-list loop instead of zeroing scores in a separate O(N) pass. The result is identical (Float32Array entries default to 0) and the extra pass is avoided.
cubic-dev-ai reported no issues. Deferred: 0. All 17 sparse tests pass.
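The masked scoring path discussed in this comment can be sketched roughly as follows. This is not the PR's implementation: the class name is hypothetical, and the Lucene-style IDF and the parameters k1 = 1.5, b = 0.75 are assumptions, since the PR text does not state which variant it uses.

```typescript
// Assumed parameters; the PR text does not specify them.
const K1 = 1.5;
const B = 0.75;

class Bm25Sketch {
  // term -> list of [docId, termFrequency]
  private postings = new Map<string, Array<[number, number]>>();
  private docLengths: number[] = [];
  private avgLength = 0;

  static build(documents: string[][]): Bm25Sketch {
    const index = new Bm25Sketch();
    let totalLength = 0;
    documents.forEach((tokens, docId) => {
      index.docLengths.push(tokens.length);
      totalLength += tokens.length;
      const counts = new Map<string, number>();
      for (const token of tokens) counts.set(token, (counts.get(token) ?? 0) + 1);
      for (const [term, tf] of counts) {
        const list = index.postings.get(term) ?? [];
        list.push([docId, tf]);
        index.postings.set(term, list);
      }
    });
    index.avgLength = documents.length > 0 ? totalLength / documents.length : 0;
    return index;
  }

  getScores(queryTokens: string[], weightMask?: Uint8Array | null): Float32Array {
    const docCount = this.docLengths.length;
    const scores = new Float32Array(docCount); // entries default to 0
    for (const term of new Set(queryTokens)) { // de-duplicate query tokens
      const list = this.postings.get(term);
      if (!list) continue;
      const df = list.length;
      const idf = Math.log(1 + (docCount - df + 0.5) / (df + 0.5));
      for (const [docId, tf] of list) {
        // Skip masked-out documents here; their score simply stays 0.
        if (weightMask && weightMask[docId] === 0) continue;
        const norm = 1 - B + B * ((this.docLengths[docId] ?? 0) / this.avgLength);
        scores[docId] = (scores[docId] ?? 0) + (idf * tf * (K1 + 1)) / (tf + K1 * norm);
      }
    }
    return scores;
  }
}
```

Skipping inside the postings loop touches only documents that contain a query term, whereas a separate zeroing pass visits every document regardless.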
Summary
Adds packages/layer/modules/markdown-rewrite.ts: a build-time Nuxt module that injects rewrite rules into Vercel's build-output config.json so AI agents get raw markdown instead of the SPA shell.

Ports docus upstream commits:
- 6fd8686b: feat(llms): redirect homepage to /llms.txt
- 9ceafe6f: feat(llms): add docs page redirection to raw markdown for agents

See docs/docus-upstream-changes.md item #9.

Behaviour
- preset.startsWith('vercel'), so it covers vercel, vercel-edge, vercel-static, etc.
- <output.publicDir>/../config.json (the Vercel build-output config), confirm llms.txt was emitted, then unshift route pairs onto routes so they fire before the SPA fallback.
- Accept: text/markdown or User-Agent: curl/*:
  - ^/$ → /llms.txt
  - ^/<locale>/?$ → /llms.txt (one per runtimeConfig.public.i18n.locales entry)
  - ^<page>$ → /raw<page>.md (one per /raw/...md link discovered in llms.txt)
- has array is AND-ed, so OR semantics between the Accept and User-Agent matchers require emitting two rule entries per src → dest pair.

Conventions
- packages/layer/modules/{config,shadcn}.ts (defineNuxtModule + named module).
- VercelBuildOutputConfig / VercelRoute / VercelHeaderHas interface describes the parts we touch; no any.
- packages/layer/nuxt.config.ts immediately after ./modules/shadcn.

Verification
Output (homepage rules only, because the current nuxt-llms config emits links to canonical /docs/... URLs rather than /raw/...md; the per-page rules will start being emitted as soon as llms.txt carries raw-md links):
{ "src": "^/$", "dest": "/llms.txt", "headers": { "content-type": "text/markdown; charset=utf-8" }, "has": [{ "type": "header", "key": "accept", "value": "(.*)text/markdown(.*)" }], "continue": true }
{ "src": "^/$", "dest": "/llms.txt", "headers": { "content-type": "text/markdown; charset=utf-8" }, "has": [{ "type": "header", "key": "user-agent", "value": "curl/.*" }], "continue": true }

Default (cloudflare) build is unaffected: the module bails silently and dist/ is produced as before.

Notes
- llms.txt to enumerate /raw/...md URLs. The current site config doesn't, so only the homepage routes are emitted today. This matches the upstream behaviour and avoids accidentally rewriting asset URLs. Wiring nuxt-llms to also emit raw-md URLs is tracked separately.
- config.json.

Follow-up to #27.
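The rule expansion described under Behaviour (two entries per src → dest pair, prepended to the existing routes) might look roughly like this. The interfaces model only the fields visible in the JSON output, and both helper names are hypothetical; the module's real code is not shown in this text.

```typescript
// Only the fields visible in the emitted JSON are modelled here.
interface VercelHeaderHas {
  type: "header";
  key: string;
  value: string;
}

interface VercelRoute {
  src: string;
  dest: string;
  headers?: Record<string, string>;
  has?: VercelHeaderHas[];
  continue?: boolean;
}

// Entries inside one `has` array are AND-ed, so OR semantics between the
// Accept and User-Agent matchers need one route entry per matcher.
const AGENT_MATCHERS: VercelHeaderHas[] = [
  { type: "header", key: "accept", value: "(.*)text/markdown(.*)" },
  { type: "header", key: "user-agent", value: "curl/.*" },
];

// Hypothetical helper: expand one src → dest pair into two route entries.
function agentRoutes(src: string, dest: string): VercelRoute[] {
  return AGENT_MATCHERS.map((matcher) => ({
    src,
    dest,
    headers: { "content-type": "text/markdown; charset=utf-8" },
    has: [matcher],
    continue: true,
  }));
}

// Hypothetical helper: put the agent routes ahead of the existing ones so
// they are evaluated before the SPA fallback.
function prependAgentRoutes(routes: VercelRoute[], locales: string[]): VercelRoute[] {
  const added: VercelRoute[] = [...agentRoutes("^/$", "/llms.txt")];
  for (const locale of locales) {
    added.push(...agentRoutes(`^/${locale}/?$`, "/llms.txt"));
  }
  routes.unshift(...added);
  return routes;
}
```

Per-page rules (`^<page>$` → `/raw<page>.md`) would follow the same pattern, one `agentRoutes` call per raw-md link found in llms.txt.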
Summary by cubic
Ports a minimal, dependency-free BM25 index and helpers from Semble for sparse retrieval with path-aware enrichment. Adds optional mask and disk persistence, normalizes path separators for cross-platform consistency, and optimizes mask handling in scoring.
- Bm25Index with build/getScores, query de-dup, and optional weight mask.
- save(dir) and load(dir); scores are preserved on round-trip.
- enrichForBm25 (adds stem twice + last 3 dir parts) and selectorToMask (Uint8 mask, null-safe).

Written for commit f47f867. Summary will update on new commits.