feat(indexing): port BM25 enrichment + index from semble#9

Merged
amondnet merged 2 commits into main from feat/unit-9-sparse-bm25 on May 28, 2026

@amondnet amondnet commented May 28, 2026

Summary

Adds packages/layer/modules/markdown-rewrite.ts — a build-time Nuxt module that injects rewrite rules into Vercel's build-output config.json so AI agents get raw markdown instead of the SPA shell.

Ports docus upstream commits:

  • 6fd8686b feat(llms): redirect homepage to /llms.txt
  • 9ceafe6f feat(llms): add docs page redirection to raw markdown for agents

See docs/docus-upstream-changes.md item #9.

Behaviour

  • No-op on every non-Vercel preset. The check is preset.startsWith('vercel'), so it covers vercel, vercel-edge, vercel-static, etc.
  • On Vercel: read <output.publicDir>/../config.json (the Vercel build-output config), confirm llms.txt was emitted, then unshift route pairs onto routes so they fire before the SPA fallback.
  • Rules emitted when the request carries Accept: text/markdown or User-Agent: curl/*:
    • ^/$ → /llms.txt
    • ^/<locale>/?$ → /llms.txt (one per runtimeConfig.public.i18n.locales entry)
    • ^<page>$ → /raw<page>.md (one per /raw/...md link discovered in llms.txt)
  • Vercel's has array is AND-ed, so OR semantics between the Accept and User-Agent matchers require emitting two rule entries per src → dest pair.
  • Locale codes are regex-escaped before being joined into the alternation, so an exotic code can't break the pattern.
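
The rule emission described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the interface fields are taken from the sample output in this description, while `escapeRegex`, `routePair`, `localeSrc`, the sample locale codes, and the SPA fallback entry are hypothetical names and values. The sketch joins the locale codes into a single alternation, which is one reading of the notes above.

```typescript
// Shapes inferred from the sample config.json output; only the fields used here.
interface VercelHeaderHas { type: 'header', key: string, value: string }
interface VercelRoute {
  src: string
  dest: string
  headers?: Record<string, string>
  has?: VercelHeaderHas[]
  continue?: boolean
}

// One matcher per rule entry: entries inside a single `has` array are AND-ed,
// so OR between Accept and User-Agent needs two separate routes.
const AGENT_MATCHERS: VercelHeaderHas[] = [
  { type: 'header', key: 'accept', value: '(.*)text/markdown(.*)' },
  { type: 'header', key: 'user-agent', value: 'curl/.*' },
]

// Hypothetical helper: escape regex metacharacters in a locale code.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: emit the two rule entries for one src -> dest pair.
function routePair(src: string, dest: string): VercelRoute[] {
  return AGENT_MATCHERS.map(matcher => ({
    src,
    dest,
    headers: { 'content-type': 'text/markdown; charset=utf-8' },
    has: [matcher],
    continue: true,
  }))
}

// Hypothetical helper: build the locale homepage pattern from escaped codes.
function localeSrc(locales: string[]): string {
  return `^/(?:${locales.map(escapeRegex).join('|')})/?$`
}

// Assumed pre-existing SPA fallback; the real config.json contents may differ.
const routes: VercelRoute[] = [{ src: '/(.*)', dest: '/index.html' }]

// unshift keeps the new rules ahead of the fallback so they fire first.
routes.unshift(
  ...routePair('^/$', '/llms.txt'),
  ...routePair(localeSrc(['en', 'zh-Hant']), '/llms.txt'),
)
```

Keeping one matcher per entry makes the OR semantics explicit in the generated config, at the cost of doubling the number of routes.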

Conventions

  • Module style follows the existing packages/layer/modules/{config,shadcn}.ts (defineNuxtModule + named module).
  • TypeScript only; an inline VercelBuildOutputConfig / VercelRoute / VercelHeaderHas interface describes the parts we touch — no any.
  • Module is registered in packages/layer/nuxt.config.ts immediately after ./modules/shadcn.

Verification

bun install
cd packages/layer && bun typecheck      # no errors in markdown-rewrite.ts
cd ../.. && bun lint                     # clean

# Vercel build
NITRO_PRESET=vercel bun --filter @pleaseai/docs-site build
jq '.routes[] | select(.has // empty)' apps/docs/.vercel/output/config.json

Output (homepage rules only, because the current nuxt-llms config emits links to canonical /docs/... URLs rather than /raw/...md — the per-page rules will start being emitted as soon as llms.txt carries raw-md links):

{
  "src": "^/$",
  "dest": "/llms.txt",
  "headers": { "content-type": "text/markdown; charset=utf-8" },
  "has": [{ "type": "header", "key": "accept", "value": "(.*)text/markdown(.*)" }],
  "continue": true
}
{
  "src": "^/$",
  "dest": "/llms.txt",
  "headers": { "content-type": "text/markdown; charset=utf-8" },
  "has": [{ "type": "header", "key": "user-agent", "value": "curl/.*" }],
  "continue": true
}

Default (cloudflare) build is unaffected — the module bails silently and dist/ is produced as before.

Notes

  • Per-page rules require llms.txt to enumerate /raw/...md URLs. The current site config doesn't, so only the homepage routes are emitted today. This matches the upstream behaviour and avoids accidentally rewriting asset URLs. Wiring nuxt-llms to also emit raw-md URLs is tracked separately.
  • Runtime e2e is Vercel-only (build-output rewrites apply at the edge), so verification here is limited to inspecting the generated config.json.

Follow-up to #27.


Summary by cubic

Ports a minimal, dependency-free BM25 index and helpers from Semble for sparse retrieval with path-aware enrichment. Adds optional mask and disk persistence, normalizes path separators for cross-platform consistency, and optimizes mask handling in scoring.

  • New Features
    • Bm25Index with build/getScores, query de-dup, and optional weight mask.
    • Persistence via save(dir) and load(dir); scores are preserved on round-trip.
    • Helpers: enrichForBm25 (adds stem twice + last 3 dir parts) and selectorToMask (Uint8 mask, null-safe).

Written for commit f47f867. Summary will update on new commits.

Ports src/semble/index/sparse.py to src/indexing/sparse.ts:

- enrichForBm25(chunk): appends 'stem stem dir1 dir2 dir3' to the
  chunk content. Stem is repeated to up-weight path matches and
  only the last 3 parent directory components are kept, matching
  Python's Path(...).parent.parts[-3:].
- selectorToMask(selector, size): builds a Uint8Array mask of length
  size with 1s at each selector index, or returns null when selector
  is null/undefined (mirrors the numpy boolean-mask semantics used by
  bm25s.get_scores).
- Bm25Index: minimal Okapi BM25 backend with build / getScores /
  save / load. Documents are passed pre-tokenized (caller wraps
  with tokenize(enrichForBm25(chunk))) and getScores returns a
  Float32Array in doc order, matching bm25s.BM25.get_scores.
  weightMask zeros out scores for masked-out documents.
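
As an illustration of the mask helper described in the list above, a minimal sketch might look like this. The parameter types and the bounds check are assumptions; the ported signature in src/indexing/sparse.ts may differ.

```typescript
// Sketch only: convert a list of selected document indices into a 0/1 mask.
function selectorToMask(
  selector: readonly number[] | null | undefined,
  size: number,
): Uint8Array | null {
  // No selector means "no filtering", signalled by null rather than an all-ones mask.
  if (selector == null)
    return null

  const mask = new Uint8Array(size) // zero-filled by default
  for (const index of selector) {
    // Assumption: out-of-range indices are ignored instead of throwing.
    if (index >= 0 && index < size)
      mask[index] = 1
  }
  return mask
}

// Usage: select documents 0 and 2 out of 3; index 9 is out of range and dropped.
const exampleMask = selectorToMask([0, 2, 9], 3)
```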

BM25 backend choice — Option B (inline minimal BM25) over Option A
(third-party npm such as wink-bm25-text-search): keeps the unit
self-contained while the dependency tree is still settling, and the
required surface (build / getScores / save / load with a weight_mask)
is small enough to implement and unit-test in <150 LOC. Replacing
the backend later is localized to this file.
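
To give a sense of how small that surface is, here is a rough sketch of the build / getScores half. It is not the ported code: the class name `MiniBm25`, the parameter values k1 = 1.5 and b = 0.75, and the non-negative IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)) are all assumptions chosen for illustration.

```typescript
// Sketch of an inline Okapi-style BM25 over pre-tokenized documents.
class MiniBm25 {
  private readonly k1 = 1.5 // assumed term-frequency saturation parameter
  private readonly b = 0.75 // assumed length-normalization parameter
  private docLengths: number[] = []
  private avgLength = 0
  // term -> (document index -> term frequency)
  private postings = new Map<string, Map<number, number>>()

  build(docs: string[][]): this {
    this.docLengths = docs.map(tokens => tokens.length)
    const total = this.docLengths.reduce((sum, n) => sum + n, 0)
    this.avgLength = docs.length > 0 ? total / docs.length : 0
    this.postings = new Map()
    docs.forEach((tokens, docId) => {
      for (const token of tokens) {
        let byDoc = this.postings.get(token)
        if (!byDoc) {
          byDoc = new Map()
          this.postings.set(token, byDoc)
        }
        byDoc.set(docId, (byDoc.get(docId) ?? 0) + 1)
      }
    })
    return this
  }

  getScores(query: string[], weightMask?: Uint8Array | null): Float32Array {
    const n = this.docLengths.length
    const scores = new Float32Array(n) // entries default to 0
    // De-duplicate query terms so a repeated term is only counted once.
    new Set(query).forEach((term) => {
      const byDoc = this.postings.get(term)
      if (!byDoc)
        return
      const df = byDoc.size
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5))
      byDoc.forEach((tf, docId) => {
        // Masked-out documents are skipped, so their score stays 0.
        if (weightMask && weightMask[docId] === 0)
          return
        const norm = 1 - this.b + this.b * (this.docLengths[docId] / this.avgLength)
        scores[docId] += idf * (tf * (this.k1 + 1)) / (tf + this.k1 * norm)
      })
    })
    return scores
  }
}

// Usage: three tiny pre-tokenized documents, scored in document order.
const demoIndex = new MiniBm25().build([
  ['bm25', 'index', 'sparse'],
  ['dense', 'vector', 'index'],
  ['sparse', 'sparse', 'mask'],
])
const demoScores = demoIndex.getScores(['sparse'])
```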

Stopgap structural Chunk type is inlined until src/types.ts lands
from Unit 1, matching the pattern established by Unit 3.

Ref: src/semble/index/sparse.py

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a TypeScript port of a sparse BM25 index (Bm25Index) along with helper functions enrichForBm25 and selectorToMask, and a comprehensive suite of unit tests. The feedback highlights two valuable improvement opportunities: first, using path.posix.parse and normalizing backslashes in enrichForBm25 to ensure cross-platform consistency; second, optimizing the getScores method by skipping BM25 calculations for masked-out documents directly within the postings list loop rather than zeroing them out afterward.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 2 files

Architecture diagram
sequenceDiagram
    participant App as Application Code
    participant Indexer as BM25 Indexer
    participant Enricher as enrichForBm25()
    participant Mask as selectorToMask()
    participant FS as File System

    Note over App,FS: Document Indexing Flow

    App->>Indexer: build(documents[])
    Indexer->>Indexer: Tokenize documents
    Indexer->>Indexer: Compute doc lengths, avg length
    Indexer->>Indexer: Build postings list
    Indexer->>Indexer: Compute document frequencies
    Indexer-->>App: Return Bm25Index instance

    Note over App,FS: Document Enrichment (pre-indexing)

    App->>Enricher: enrichForBm25(chunk)
    Enricher->>Enricher: Parse file path
    Enricher->>Enricher: Extract stem (repeated twice)
    Enricher->>Enricher: Extract last 3 directory parts
    Enricher-->>App: Return enriched content string

    Note over App,FS: Query Scoring with Optional Mask

    App->>Indexer: getScores(queryTokens[], weightMask?)
    Indexer->>Indexer: De-duplicate query tokens
    loop For each unique term
        Indexer->>Indexer: Lookup postings list
        Indexer->>Indexer: Compute IDF score
        loop For each matching document
            Indexer->>Indexer: Compute BM25 contribution
            Indexer->>Indexer: Accumulate score
        end
    end
    alt weightMask provided
        Indexer->>Indexer: Zero scores for masked-out docs
    end
    Indexer-->>App: Return Float32Array of scores

    Note over App,FS: Mask Conversion

    App->>Mask: selectorToMask(selector[], size)
    alt selector is null/undefined
        Mask-->>App: Return null
    else
        Mask->>Mask: Create Uint8Array of length size
        loop For each index in selector
            alt index < size
                Mask->>Mask: Set mask[index] = 1
            end
        end
        Mask-->>App: Return Uint8Array
    end

    Note over App,FS: Index Persistence

    App->>Indexer: save(dir)
    Indexer->>Indexer: Serialize state to JSON
    Indexer->>FS: mkdir(dir, recursive)
    Indexer->>FS: writeFile(dir/bm25.json)
    FS-->>Indexer: Confirm write
    Indexer-->>App: Promise<void>

    App->>FS: load(dir)
    FS-->>App: Read bm25.json
    App->>Indexer: Deserialize state
    Indexer->>Indexer: Reconstruct postings Map
    Indexer-->>App: Return Bm25Index instance
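
The persistence leg of the diagram (serialize to JSON, then reconstruct the postings Map on load) could look roughly like the sketch below. The state shape, the JSON layout, and the function names are assumptions; the file-system calls (mkdir, writeFile of bm25.json) are left out so the example stays self-contained.

```typescript
// Hypothetical in-memory state of the index.
interface Bm25State {
  docLengths: number[]
  avgLength: number
  postings: Map<string, Map<number, number>> // term -> (docId -> tf)
}

// Hypothetical on-disk layout: Maps flattened to arrays of entries,
// because JSON.stringify serializes a Map as an empty object.
interface Bm25Json {
  docLengths: number[]
  avgLength: number
  postings: [string, [number, number][]][]
}

function serializeState(state: Bm25State): string {
  const postings: [string, [number, number][]][] = []
  state.postings.forEach((byDoc, term) => {
    const entries: [number, number][] = []
    byDoc.forEach((tf, docId) => entries.push([docId, tf]))
    postings.push([term, entries])
  })
  const json: Bm25Json = {
    docLengths: state.docLengths,
    avgLength: state.avgLength,
    postings,
  }
  return JSON.stringify(json)
}

function deserializeState(text: string): Bm25State {
  const raw = JSON.parse(text) as Bm25Json
  const postings = new Map<string, Map<number, number>>()
  for (const [term, entries] of raw.postings)
    postings.set(term, new Map(entries))
  return { docLengths: raw.docLengths, avgLength: raw.avgLength, postings }
}

// Usage: a round-trip keeps term frequencies and lengths, so scores are unchanged.
const original: Bm25State = {
  docLengths: [3, 3],
  avgLength: 3,
  postings: new Map([['sparse', new Map([[0, 1], [1, 2]])]]),
}
const restored = deserializeState(serializeState(original))
```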

- enrichForBm25: normalize backslashes and use path.posix.parse so
  repo-relative paths produce the same enrichment on Windows and POSIX
  hosts. Filter '.' segments only (no longer need '/' since splitting
  on a single delimiter).
- Bm25Index.getScores: skip masked-out documents inside the postings
  iteration instead of zeroing them in a separate O(N) pass. Float32Array
  defaults to 0 so the result is identical.
- Add a backslash-normalization test to lock in cross-platform behavior.
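
For illustration, the path handling described in this commit can be approximated as below. To stay dependency-free the sketch uses plain string operations in place of path.posix.parse, so it is an approximation of the described behavior rather than the actual code. The `Chunk` shape (`content`, `filePath`) and the single-space separator are assumptions, and unlike the commit, the sketch also drops empty segments.

```typescript
// Hypothetical structural chunk type; the real stopgap type may differ.
interface Chunk {
  content: string
  filePath: string
}

function enrichForBm25(chunk: Chunk): string {
  // Normalize Windows separators so both hosts yield the same tokens.
  const normalized = chunk.filePath.replace(/\\/g, '/')
  const slash = normalized.lastIndexOf('/')
  const base = normalized.slice(slash + 1)
  const dir = slash >= 0 ? normalized.slice(0, slash) : ''

  // Stem = file name without its last extension (dotfiles keep their name).
  const dot = base.lastIndexOf('.')
  const stem = dot > 0 ? base.slice(0, dot) : base

  // Keep only the last 3 parent directory components.
  const dirParts = dir
    .split('/')
    .filter(part => part !== '' && part !== '.')
    .slice(-3)

  // The stem is repeated to up-weight matches on the file name.
  return [chunk.content, stem, stem].concat(dirParts).join(' ')
}

// Usage: the same repo-relative path written with either separator.
const posixEnriched = enrichForBm25({ content: 'x', filePath: 'src/semble/index/sparse.py' })
const windowsEnriched = enrichForBm25({ content: 'x', filePath: 'src\\semble\\index\\sparse.py' })
```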

@amondnet amondnet left a comment

Applied 2 of 2 gemini-code-assist suggestions in f47f867:

  1. enrichForBm25 — normalize backslashes and use path.posix.parse for cross-platform consistency on repo-relative paths. Added a backslash-normalization test.
  2. Bm25Index.getScores — skip masked-out docs inside the postings-list loop instead of zeroing scores in a separate O(N) pass. Result is identical (Float32Array entries default to 0); avoids extra work.

cubic-dev-ai reported no issues. Deferred: 0. All 17 sparse tests pass.

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit d1692bd into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-9-sparse-bm25 branch May 28, 2026 16:05
This was referenced Jun 18, 2026