feat(ranking): port weighting + boosting from semble by amondnet · Pull Request #13 · pleaseai/code-search

amondnet · 2026-05-28T15:21:36Z

Port of `src/semble/ranking/weighting.py` and `src/semble/ranking/boosting.py` from MinishLab/semble.

What's in this PR

`src/ranking/weighting.ts` — `resolveAlpha(query, alpha)` with `ALPHA_SYMBOL = 0.3` (BM25-leaning) and `ALPHA_NL = 0.5` (balanced). Auto-detects symbol vs NL via `isSymbolQuery`.
`src/ranking/boosting.ts` — full ranking-boost pipeline:
- Regexes `SYMBOL_QUERY_RE` and `EMBEDDED_SYMBOL_RE` ported verbatim from semble.
- Constants (`DEFINITION_BOOST_MULTIPLIER = 3.0`, `STEM_BOOST_MULTIPLIER = 1.0`, `FILE_COHERENCE_BOOST_FRAC = 0.2`, `EMBEDDED_SYMBOL_BOOST_SCALE = 0.5`, etc.).
- Public: `isSymbolQuery`, `applyQueryBoost` (returns new map, doesn't mutate), `boostMultiChunkFiles` (in-place).
- Internals: `_extractSymbolName`, `_definitionPattern` (FIFO cache, max 256 entries), `_chunkDefinesSymbol`, `_stemMatches`, `_definitionTier`, `_scanNonCandidates`, `_boostSymbolDefinitions`, `_boostEmbeddedSymbols`, `_boostStemMatches`, `_countKeywordMatches`.
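
As a rough illustration of the alpha selection described above, the sketch below picks `ALPHA_SYMBOL` or `ALPHA_NL` only when no explicit alpha is passed. The two constants come from this PR; the `isSymbolQuery` body is a hypothetical stand-in, since the ported `SYMBOL_QUERY_RE` is not reproduced in this description, and the `alpha?: number | null` signature is an assumption.

```typescript
// Constants as stated in the PR description.
const ALPHA_SYMBOL = 0.3; // BM25-leaning, used for symbol-like queries
const ALPHA_NL = 0.5; // balanced, used for natural-language queries

// HYPOTHETICAL detector: NOT the ported SYMBOL_QUERY_RE.
// Assumption: a single whitespace-free token that contains an underscore,
// a dot, a `::` separator, or a lower-to-upper case transition is a symbol.
function isSymbolQuery(query: string): boolean {
  const q = query.trim();
  if (q.length === 0 || /\s/.test(q)) return false;
  return /[_.]|::|[a-z][A-Z]/.test(q);
}

// An explicit alpha always wins; otherwise the query type decides.
function resolveAlpha(query: string, alpha?: number | null): number {
  if (alpha !== undefined && alpha !== null) return alpha;
  return isSymbolQuery(query) ? ALPHA_SYMBOL : ALPHA_NL;
}

const symbolAlpha = resolveAlpha("parse_config"); // auto-detected as a symbol
const nlAlpha = resolveAlpha("how does ranking work"); // auto-detected as NL
const explicitAlpha = resolveAlpha("parse_config", 0.9); // caller override
```

Checking for `undefined`/`null` rather than truthiness means an explicit alpha of `0` is still honored in this sketch.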

Implementation notes

Uses `Map<Chunk, number>` instead of plain objects so `Chunk` reference identity works as a key (mirrors Python `dict[Chunk, float]`).
Inline stubs for `Chunk` type and `splitIdentifier`; both marked with `TODO(integration)` to swap in imports from Unit 1 / Unit 2 once those branches land in main.
`Math.max(...iterable.values())` replaced with a reducer (`maxValue`) to avoid argument-count limits on large indexes.
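
A minimal sketch of the two notes above, assuming a `Chunk` shape with only `filePath` and `content` (the real stub may carry more fields). The name `maxValue` is from this PR, but its body here is a plain loop written for illustration, not the ported implementation.

```typescript
// Minimal stand-in for the PR's inline `Chunk` stub; the real fields may differ.
interface Chunk {
  filePath: string;
  content: string;
}

// Single-pass maximum over any iterable of numbers.
// Returns -Infinity for an empty iterable, like `Math.max()` with no arguments.
// No spread is involved, so the input size is not bounded by argument limits.
function maxValue(values: Iterable<number>): number {
  let max = -Infinity;
  for (const v of values) {
    if (v > max) max = v;
  }
  return max;
}

const a: Chunk = { filePath: "src/a.ts", content: "export function a() {}" };
const b: Chunk = { filePath: "src/b.ts", content: "export function b() {}" };
const scores = new Map<Chunk, number>([[a, 1.5], [b, 0.25]]);

// Map keys are compared by reference, so a structurally equal copy misses.
const copyOfA: Chunk = { ...a };
const hitByReference = scores.get(a);
const missByCopy = scores.get(copyOfA);

const top = maxValue(scores.values());
```

A plain object could not be keyed this way, because object keys are coerced to strings and every `Chunk` would collapse to the same key.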

Tests

40 `bun:test` cases under `src/ranking/` cover `isSymbolQuery`, `resolveAlpha`, `_stemMatches`, `_chunkDefinesSymbol` (Python/Elixir/SQL DDL, case sensitivity, in-line false-positive guards), `_countKeywordMatches` (exact + prefix), `boostMultiChunkFiles`, and `applyQueryBoost` (symbol-definition / NL / embedded-symbol / stem-match paths). All pass.

```
bun test v1.3.14
40 pass / 0 fail / 62 expect() calls
```

Out of scope

Penalties (Unit 6).
Wiring into the search pipeline.
Replacing inline Chunk type / splitIdentifier stub — depends on Units 1 & 2 merging.

Part of the parallel port effort tracking MinishLab/semble → @pleaseai/csp.

Summary by cubic

Ports ranking weighting and boosting from MinishLab/semble to TypeScript to improve ranking for symbol vs natural-language queries. Adds alpha defaults and query-aware boosts for better top hits, file coherence, and safer scoring.

New Features
- src/ranking/weighting.ts: resolveAlpha(query, alpha) with ALPHA_SYMBOL=0.3 and ALPHA_NL=0.5 (auto-detected via isSymbolQuery).
- src/ranking/boosting.ts: isSymbolQuery, applyQueryBoost (non-mutating), boostMultiChunkFiles (in-place); ports SYMBOL_QUERY_RE, EMBEDDED_SYMBOL_RE, and constants (DEFINITION_BOOST_MULTIPLIER=3.0, STEM_BOOST_MULTIPLIER=1.0, FILE_COHERENCE_BOOST_FRAC=0.2, EMBEDDED_SYMBOL_BOOST_SCALE=0.5).
- Definition detection across languages, half-strength embedded-symbol boosts, and NL file-stem keyword matching with prefix support.
- Uses Map<Chunk, number>; temporary Chunk/splitIdentifier stubs marked TODO(integration). Tests cover symbol/NL paths, file-coherence boost, and regression cases.

Bug Fixes
- Prevent NaN/Infinity in boostMultiChunkFiles by guarding maxFileSum <= 0.
- applyQueryBoost now returns a fresh Map for empty input to avoid aliasing.
- Minor perf/refactors (no behavior change): hoisted stripTrailingS, reused pathStemOriginal in pathStemLower, iterated boosted.keys() directly, and reduced allocations in _countKeywordMatches. Added regressions for these.

Written for commit 04c2957. Summary will update on new commits.

Port src/semble/ranking/weighting.py and boosting.py to TypeScript:
- weighting.ts: resolveAlpha(query, alpha) with ALPHA_SYMBOL=0.3, ALPHA_NL=0.5 auto-detection via isSymbolQuery.
- boosting.ts: SYMBOL_QUERY_RE / EMBEDDED_SYMBOL_RE verbatim from semble; isSymbolQuery, applyQueryBoost, boostMultiChunkFiles, plus internals _extractSymbolName, _definitionPattern (FIFO cache, max 256), _chunkDefinesSymbol, _stemMatches, _definitionTier, _scanNonCandidates, _boostSymbolDefinitions, _boostEmbeddedSymbols, _boostStemMatches, _countKeywordMatches.

Uses Map<Chunk, number> instead of dict[Chunk, float] to preserve reference-identity keys. Inlines minimal Chunk type and splitIdentifier stub pending Unit 1 (types) and Unit 2 (tokens); both marked with TODO(integration) comments. Avoids Math.max(...iterable) spread to dodge argument-count limits on large indexes.

40 unit tests covering isSymbolQuery, resolveAlpha, _stemMatches, _chunkDefinesSymbol, _countKeywordMatches, boostMultiChunkFiles, and applyQueryBoost (symbol / NL / embedded paths).
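
The FIFO cache mentioned for `_definitionPattern` could look roughly like the sketch below. Only the eviction policy and the 256-entry limit come from this PR; the names `PATTERN_CACHE_MAX`, `patternCache`, `escapeRegExp`, and `cachedPattern` are hypothetical, and the word-boundary regex is a placeholder rather than the ported definition pattern.

```typescript
// Limit stated in the PR; the constant name is an assumption.
const PATTERN_CACHE_MAX = 256;
const patternCache = new Map<string, RegExp>();

// Escape regex metacharacters so a symbol name is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function cachedPattern(name: string): RegExp {
  const hit = patternCache.get(name);
  // FIFO, not LRU: a hit does not move the entry to the back of the queue.
  if (hit !== undefined) return hit;

  if (patternCache.size >= PATTERN_CACHE_MAX) {
    // A Map iterates in insertion order, so its first key is the oldest entry.
    const oldest = patternCache.keys().next().value;
    if (oldest !== undefined) patternCache.delete(oldest);
  }

  // PLACEHOLDER pattern; the real definition regex is not shown in this PR.
  const re = new RegExp(`\\b${escapeRegExp(name)}\\b`);
  patternCache.set(name, re);
  return re;
}
```

Because hits never reorder entries, an entry is evicted purely by age, even if it was used a moment ago.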

gemini-code-assist

Code Review

This pull request ports the ranking, boosting, and weighting logic from Python to TypeScript, introducing boosting.ts and weighting.ts along with their respective test suites. The implementation handles query-type boosts, file coherence boosts, and embedded symbol boosts. The review feedback focuses on performance optimizations and code safety, such as adding a guard to prevent potential division-by-zero errors, avoiding unnecessary array allocations by iterating directly over Map.keys(), extracting inline helper functions to the module level, and reducing duplicate logic in file-stem extraction.

cubic-dev-ai

2 issues found across 4 files

Architecture diagram

sequenceDiagram
    participant Rank as Ranking Engine
    participant Weight as weighting.ts
    participant Boost as boosting.ts
    participant ChunkStore as Chunk Store
    participant QueryParser as Query Parser

    Note over Rank,QueryParser: NEW: Ranking weighting & boosting pipeline

    Rank->>Weight: resolveAlpha(query, alpha?)
    Weight->>Boost: isSymbolQuery(query)
    Boost-->>Weight: boolean
    alt alpha not provided
        Weight->>Weight: select ALPHA_SYMBOL (0.3) or ALPHA_NL (0.5)
    else alpha provided
        Weight->>Weight: use explicit alpha
    end
    Weight-->>Rank: alpha value

    Rank->>Boost: applyQueryBoost(combinedScores, query, allChunks)
    Boost->>Boost: isSymbolQuery(query)
    alt Symbol query
        Boost->>Boost: _extractSymbolName(query)
        Boost->>Boost: _boostSymbolDefinitions()
        Boost->>Boost: _definitionPattern(symbolName)
        Boost->>Boost: FIFO cache (max 256 entries)
        Boost->>Boost: _chunkDefinesSymbol(chunk, symbolName)
        Boost->>Boost: check file stem for match
        alt Definition found with stem match
            Boost->>Boost: tier = boostUnit * 1.5
        else Definition found without stem match
            Boost->>Boost: tier = boostUnit * 1.0
        end
        Boost->>Boost: _scanNonCandidates() for non-candidate chunks
        Boost->>Boost: _boostEmbeddedSymbols()
        Boost->>Boost: _stemMatches() for file stem
    else NL query
        Boost->>Boost: _boostStemMatches(query, maxScore)
        Boost->>Boost: stem = pathStemLower(filePath)
        Boost->>Boost: _stemMatches(stem, queryWord)
        Boost->>Boost: _countKeywordMatches()
        Boost->>Boost: _boostEmbeddedSymbols()
        Boost->>Boost: embedRegex = EMBEDDED_SYMBOL_RE
        Boost->>Boost: _chunkDefinesSymbol() half-strength
    end
    Boost-->>Rank: new Map<Chunk, number>

    Rank->>Boost: boostMultiChunkFiles(scores)
    Boost->>Boost: group chunks by filePath
    Boost->>Boost: calculate fileSum per file
    Boost->>Boost: identify best chunk per file
    Boost->>Boost: maxFileSum = max(fileSum values)
    Boost->>Boost: boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC
    Boost->>Boost: boost = boostUnit * fileSum / maxFileSum
    Boost->>Boost: add boost to best chunk
    Boost-->>Rank: scores modified in-place

    Note over Rank,ChunkStore: Supporting data flows

    Boost->>QueryParser: splitIdentifier(token)
    QueryParser-->>Boost: token parts (lowercase)
    Boost->>ChunkStore: access chunk.content, chunk.filePath
    ChunkStore-->>Boost: chunk metadata
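
The `boostMultiChunkFiles` steps in the diagram can be sketched as follows. This is an illustration under assumptions, not the ported code: the `Chunk` shape is a minimal stand-in, `maxScore` is assumed to be the largest score in the input map, and the `maxFileSum <= 0` early return mirrors the guard described elsewhere in this thread.

```typescript
interface Chunk {
  filePath: string;
  content: string;
}

const FILE_COHERENCE_BOOST_FRAC = 0.2; // value stated in the PR

// Mutates `scores` in place: the best chunk of each file gets a boost
// proportional to that file's summed score.
function boostMultiChunkFiles(scores: Map<Chunk, number>): void {
  if (scores.size === 0) return;

  const fileSum = new Map<string, number>();
  const bestChunk = new Map<string, Chunk>();
  let maxScore = -Infinity; // ASSUMPTION: max over all chunk scores

  for (const [chunk, score] of scores) {
    fileSum.set(chunk.filePath, (fileSum.get(chunk.filePath) ?? 0) + score);
    const best = bestChunk.get(chunk.filePath);
    if (best === undefined || score > (scores.get(best) ?? -Infinity)) {
      bestChunk.set(chunk.filePath, chunk);
    }
    if (score > maxScore) maxScore = score;
  }

  let maxFileSum = -Infinity;
  for (const sum of fileSum.values()) {
    if (sum > maxFileSum) maxFileSum = sum;
  }
  // Without this guard, sums that cancel to zero would yield 0 / 0 = NaN.
  if (maxFileSum <= 0) return;

  const boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC;
  for (const [filePath, chunk] of bestChunk) {
    const boost = (boostUnit * (fileSum.get(filePath) ?? 0)) / maxFileSum;
    scores.set(chunk, (scores.get(chunk) ?? 0) + boost);
  }
}
```

With scores of 1.0 and 0.5 in one file and 0.8 in another, the first file's best chunk receives the full `boostUnit` of 0.2, and the second file's chunk receives 0.2 × 0.8 / 1.5.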


- boostMultiChunkFiles: guard against maxFileSum <= 0 to prevent NaN/Infinity when chunk scores cancel out within every file (gemini + cubic).
- applyQueryBoost: return a fresh Map for empty input so the result never aliases caller state (cubic).
- _stemMatches: hoist stripTrailingS helper to module scope so it isn't re-created on every call (gemini).
- pathStemLower: delegate to pathStemOriginal to remove duplicated stem extraction logic (gemini).
- _boostSymbolDefinitions / _boostEmbeddedSymbols / _boostStemMatches: iterate boosted.keys() directly instead of materializing the key list via Array.from(...); only existing entries are mutated inside the loop, and new entries are added in a separate phase afterwards (gemini).
- _countKeywordMatches: avoid per-iteration array allocation + destructuring when picking the shorter/longer of two strings (gemini).

Adds regression tests for the empty-map aliasing and the maxFileSum-zero guard. Existing semble-parity behavior is unchanged.

amondnet

Bot review pass — applied 9 / deferred 0

Addressed all 9 line comments from gemini-code-assist (7) and cubic (2) in 04c2957. No comments deferred. No constants, regexes, or semble-parity behavior were modified.

Fixes (real correctness issues)

boostMultiChunkFiles — added if (maxFileSum <= 0) return before the boost loop so positive/negative chunk scores that cancel within every file no longer produce NaN / Infinity (gemini line 165, cubic line 164).
applyQueryBoost empty input — now returns a fresh new Map() instead of the caller's map, so the non-mutating contract is preserved even on the empty path (cubic line 125).
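
The aliasing fix can be pictured with the small sketch below. `applyBoostSketch` is a hypothetical name and omits the real boost logic; it only shows why the empty path should return a new `Map` rather than the argument.

```typescript
interface Chunk {
  filePath: string;
  content: string;
}

// HYPOTHETICAL skeleton of a non-mutating boost function.
function applyBoostSketch(scores: Map<Chunk, number>): Map<Chunk, number> {
  // The early exit still returns a new Map, never the caller's instance.
  if (scores.size === 0) return new Map<Chunk, number>();

  const boosted = new Map(scores); // shallow copy; the input is left untouched
  // ... query-dependent boosts would be applied to `boosted` here ...
  return boosted;
}

const input = new Map<Chunk, number>();
const result = applyBoostSketch(input);

// Writing to the result must not leak back into the caller's map.
result.set({ filePath: "src/x.ts", content: "" }, 1);
```

Had the function returned `scores` itself on the empty path, the final `set` call would have added an entry to `input` as well.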

Cleanups (performance + readability, no behavior change)

Hoisted stripTrailingS to module scope (was being re-created on every _stemMatches call).
pathStemLower now delegates to pathStemOriginal (deduped the stem-extraction body).
Replaced Array.from(boosted.keys()) with boosted.keys() in _boostSymbolDefinitions, _boostEmbeddedSymbols, and _boostStemMatches — only existing entries are updated inside the loop; new entries (non-candidate scan) are added in a separate phase afterwards.
_countKeywordMatches — replaced the [shorter, longer] = … destructuring with two direct ternary assignments to avoid per-iteration array allocation.
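
The `boosted.keys()` cleanup relies on a property of JavaScript `Map` iteration: overwriting an existing key during iteration does not change the iteration order, whereas entries added mid-loop would be visited by the same live iterator. A minimal sketch of the two-phase pattern, using string keys and a placeholder doubling step in place of the real `Chunk` keys and boost arithmetic:

```typescript
// HYPOTHETICAL helper; returns how many keys phase 1 visited.
function boostInTwoPhases(
  scores: Map<string, number>,
  extras: Map<string, number>,
): number {
  let visited = 0;

  // Phase 1: only existing keys are overwritten, so iterating the live
  // keys() iterator is safe and no Array.from(...) copy is needed.
  for (const key of scores.keys()) {
    visited++;
    scores.set(key, (scores.get(key) ?? 0) * 2); // placeholder boost
  }

  // Phase 2: new entries are added after the loop has finished, so they
  // are never picked up (and boosted) by the iterator above.
  for (const [key, value] of extras) {
    if (!scores.has(key)) scores.set(key, value);
  }

  return visited;
}

const live = new Map<string, number>([["a", 1], ["b", 2]]);
const added = new Map<string, number>([["c", 0.5], ["a", 99]]);
const visitedCount = boostInTwoPhases(live, added);
```

Here `visitedCount` counts only the two original keys, and `"c"` keeps its unboosted value because it was inserted after the loop.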

Regression tests

boostMultiChunkFiles — added "no NaN/Infinity when fileSums cancel to zero".
applyQueryBoost — strengthened the empty-input test to assert the result is a fresh map and that mutating it does not affect the input.

Verified

bun test src/ranking/ → 41 pass / 0 fail.

gemini-code-assist Bot reviewed May 28, 2026


cubic-dev-ai Bot reviewed May 28, 2026


Comment thread src/ranking/boosting.ts Outdated

Comment thread src/ranking/boosting.ts

amondnet commented May 28, 2026


amondnet self-assigned this May 28, 2026

amondnet merged commit 08f5b06 into main May 28, 2026
1 check passed

amondnet deleted the feat/unit-5-ranking-boost branch May 28, 2026 16:07

This was referenced Jun 18, 2026

chore(main): release 1.0.0 #23

Closed

chore(main): release 0.1.0 #41

Merged

amondnet merged 2 commits into main from feat/unit-5-ranking-boost
