feat(ranking): port weighting + boosting from semble #13

Merged
amondnet merged 2 commits into main from feat/unit-5-ranking-boost
May 28, 2026
@amondnet amondnet commented May 28, 2026

Contributor

Port of `src/semble/ranking/weighting.py` and `src/semble/ranking/boosting.py` from MinishLab/semble.

What's in this PR

  • `src/ranking/weighting.ts` — `resolveAlpha(query, alpha)` with `ALPHA_SYMBOL = 0.3` (BM25-leaning) and `ALPHA_NL = 0.5` (balanced). Auto-detects symbol vs NL via `isSymbolQuery`.
  • `src/ranking/boosting.ts` — full ranking-boost pipeline:
    • Regexes `SYMBOL_QUERY_RE` and `EMBEDDED_SYMBOL_RE` ported verbatim from semble.
    • Constants (`DEFINITION_BOOST_MULTIPLIER = 3.0`, `STEM_BOOST_MULTIPLIER = 1.0`, `FILE_COHERENCE_BOOST_FRAC = 0.2`, `EMBEDDED_SYMBOL_BOOST_SCALE = 0.5`, etc.).
    • Public: `isSymbolQuery`, `applyQueryBoost` (returns new map, doesn't mutate), `boostMultiChunkFiles` (in-place).
    • Internals: `_extractSymbolName`, `_definitionPattern` (FIFO cache, max 256 entries), `_chunkDefinesSymbol`, `_stemMatches`, `_definitionTier`, `_scanNonCandidates`, `_boostSymbolDefinitions`, `_boostEmbeddedSymbols`, `_boostStemMatches`, `_countKeywordMatches`.
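The alpha selection described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the two constants come from the description, but `isSymbolQuerySketch` is a hypothetical stand-in, because the real `isSymbolQuery` relies on `SYMBOL_QUERY_RE`, which is not reproduced here.

```typescript
// Constants as stated in the PR description.
const ALPHA_SYMBOL = 0.3; // BM25-leaning
const ALPHA_NL = 0.5; // balanced

// HYPOTHETICAL stand-in for isSymbolQuery. The real function uses
// SYMBOL_QUERY_RE ported from semble; this simplified check treats a single
// whitespace-free token containing `_`, `.`, `:` or an inner capital
// letter as a symbol query.
function isSymbolQuerySketch(query: string): boolean {
  const q = query.trim();
  if (q.length === 0 || /\s/.test(q)) return false;
  return /[_.:]/.test(q) || /[a-z][A-Z]/.test(q);
}

// An explicit alpha always wins; otherwise the default depends on query type.
function resolveAlpha(query: string, alpha?: number | null): number {
  if (alpha !== undefined && alpha !== null) return alpha;
  return isSymbolQuerySketch(query) ? ALPHA_SYMBOL : ALPHA_NL;
}
```

Comparing against `undefined` and `null` rather than using a truthiness test keeps an explicit `alpha` of `0` from being silently replaced by a default.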

Implementation notes

  • Uses `Map<Chunk, number>` instead of plain objects so `Chunk` reference identity works as a key (mirrors Python `dict[Chunk, float]`).
  • Inline stubs for `Chunk` type and `splitIdentifier`; both marked with `TODO(integration)` to swap in imports from Unit 1 / Unit 2 once those branches land in main.
  • `Math.max(...iterable.values())` replaced with a reducer (`maxValue`) to avoid argument-count limits on large indexes.
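The first and third notes can be illustrated together. The sketch below assumes a minimal `Chunk` shape and a `fallback` parameter on `maxValue`; both are illustrative choices and may differ from the stub and reducer in this PR.

```typescript
// Minimal Chunk shape for illustration only; the PR inlines its own stub.
interface Chunk {
  filePath: string;
  content: string;
}

// Reducer-style maximum over any iterable of numbers. Unlike
// Math.max(...values), it never spreads the values into call arguments,
// so it is not subject to engine argument-count limits.
// The `fallback` for empty input is an assumption of this sketch.
function maxValue(values: Iterable<number>, fallback = 0): number {
  let max = -Infinity;
  let seen = false;
  for (const v of values) {
    seen = true;
    if (v > max) max = v;
  }
  return seen ? max : fallback;
}

// Two chunks with identical fields are still distinct Map keys,
// because Map compares object keys by reference.
const a: Chunk = { filePath: "src/a.ts", content: "function foo() {}" };
const b: Chunk = { filePath: "src/a.ts", content: "function foo() {}" };

const scores = new Map<Chunk, number>([
  [a, 0.4],
  [b, 0.9],
]);
```

With a plain object, both chunks would need to be serialized into string keys, and two chunks with equal fields would collide. The `Map` keeps them apart, which matches the Python `dict[Chunk, float]` behavior only to the extent that the Python `Chunk` also hashes by identity.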

Tests

40 `bun:test` cases under `src/ranking/` cover `isSymbolQuery`, `resolveAlpha`, `_stemMatches`, `_chunkDefinesSymbol` (Python/Elixir/SQL DDL, case sensitivity, in-line false-positive guards), `_countKeywordMatches` (exact + prefix), `boostMultiChunkFiles`, and `applyQueryBoost` (symbol-definition / NL / embedded-symbol / stem-match paths). All pass.

```
bun test v1.3.14
40 pass / 0 fail / 62 expect() calls
```

Out of scope

  • Penalties (Unit 6).
  • Wiring into the search pipeline.
  • Replacing inline Chunk type / splitIdentifier stub — depends on Units 1 & 2 merging.

Part of the parallel port effort tracking MinishLab/semble → @pleaseai/csp.


Summary by cubic

Ports ranking weighting and boosting from MinishLab/semble to TypeScript to improve ranking for symbol vs natural-language queries. Adds alpha defaults and query-aware boosts for better top hits, file coherence, and safer scoring.

  • New Features

    • src/ranking/weighting.ts: resolveAlpha(query, alpha) with ALPHA_SYMBOL=0.3 and ALPHA_NL=0.5 (auto-detected via isSymbolQuery).
    • src/ranking/boosting.ts: isSymbolQuery, applyQueryBoost (non-mutating), boostMultiChunkFiles (in-place); ports SYMBOL_QUERY_RE, EMBEDDED_SYMBOL_RE, and constants (DEFINITION_BOOST_MULTIPLIER=3.0, STEM_BOOST_MULTIPLIER=1.0, FILE_COHERENCE_BOOST_FRAC=0.2, EMBEDDED_SYMBOL_BOOST_SCALE=0.5).
    • Definition detection across languages, half-strength embedded-symbol boosts, and NL file-stem keyword matching with prefix support.
    • Uses Map<Chunk, number>; temporary Chunk/splitIdentifier stubs marked TODO(integration). Tests cover symbol/NL paths, file-coherence boost, and regression cases.
  • Bug Fixes

    • Prevent NaN/Infinity in boostMultiChunkFiles by guarding maxFileSum <= 0.
    • applyQueryBoost now returns a fresh Map for empty input to avoid aliasing.
    • Minor perf/refactors (no behavior change): hoisted stripTrailingS, reused pathStemOriginal in pathStemLower, iterated boosted.keys() directly, and reduced allocations in _countKeywordMatches. Added regressions for these.

Written for commit 04c2957. Summary will update on new commits.

Port src/semble/ranking/weighting.py and boosting.py to TypeScript:

- weighting.ts: resolveAlpha(query, alpha) with ALPHA_SYMBOL=0.3,
  ALPHA_NL=0.5 auto-detection via isSymbolQuery.
- boosting.ts: SYMBOL_QUERY_RE / EMBEDDED_SYMBOL_RE verbatim from
  semble; isSymbolQuery, applyQueryBoost, boostMultiChunkFiles, plus
  internals _extractSymbolName, _definitionPattern (FIFO cache, max
  256), _chunkDefinesSymbol, _stemMatches, _definitionTier,
  _scanNonCandidates, _boostSymbolDefinitions,
  _boostEmbeddedSymbols, _boostStemMatches, _countKeywordMatches.

Uses Map<Chunk, number> instead of dict[Chunk, float] to preserve
reference-identity keys. Inlines minimal Chunk type and
splitIdentifier stub pending Unit 1 (types) and Unit 2 (tokens);
both marked with TODO(integration) comments.

Avoids Math.max(...iterable) spread to dodge argument-count limits
on large indexes.

40 unit tests covering isSymbolQuery, resolveAlpha, _stemMatches,
_chunkDefinesSymbol, _countKeywordMatches, boostMultiChunkFiles,
and applyQueryBoost (symbol / NL / embedded paths).
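A FIFO cache capped at 256 entries, as mentioned for `_definitionPattern`, might look like the sketch below. The eviction logic is the point here; `buildPatternSketch` is a hypothetical placeholder and does not reflect the definition patterns ported from semble.

```typescript
const MAX_CACHE_ENTRIES = 256; // the cap stated in the commit message

const patternCache = new Map<string, RegExp>();

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// HYPOTHETICAL pattern builder, for illustration only.
function buildPatternSketch(symbol: string): RegExp {
  return new RegExp(`\\b(?:function|class|def)\\s+${escapeRegExp(symbol)}\\b`);
}

function definitionPatternSketch(symbol: string): RegExp {
  const cached = patternCache.get(symbol);
  // FIFO: a cache hit does not refresh the entry's position.
  if (cached !== undefined) return cached;

  if (patternCache.size >= MAX_CACHE_ENTRIES) {
    // Map iterates in insertion order, so the first key is the oldest entry.
    const oldest = patternCache.keys().next().value;
    if (oldest !== undefined) patternCache.delete(oldest);
  }

  const pattern = buildPatternSketch(symbol);
  patternCache.set(symbol, pattern);
  return pattern;
}
```

The difference from an LRU cache is the hit path: an LRU would delete and re-insert the entry on every hit to mark it as recently used, while FIFO evicts strictly by insertion age.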

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request ports the ranking, boosting, and weighting logic from Python to TypeScript, introducing boosting.ts and weighting.ts along with their respective test suites. The implementation handles query-type boosts, file coherence boosts, and embedded symbol boosts. The review feedback focuses on performance optimizations and code safety, such as adding a guard to prevent potential division-by-zero errors, avoiding unnecessary array allocations by iterating directly over Map.keys(), extracting inline helper functions to the module level, and reducing duplicate logic in file-stem extraction.

Comment thread src/ranking/boosting.ts
Comment thread src/ranking/boosting.ts
Comment thread src/ranking/boosting.ts Outdated
Comment thread src/ranking/boosting.ts Outdated
Comment thread src/ranking/boosting.ts Outdated
Comment thread src/ranking/boosting.ts Outdated
Comment thread src/ranking/boosting.ts Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 4 files

Architecture diagram
sequenceDiagram
    participant Rank as Ranking Engine
    participant Weight as weighting.ts
    participant Boost as boosting.ts
    participant ChunkStore as Chunk Store
    participant QueryParser as Query Parser

    Note over Rank,QueryParser: NEW: Ranking weighting & boosting pipeline

    Rank->>Weight: resolveAlpha(query, alpha?)
    Weight->>Boost: isSymbolQuery(query)
    Boost-->>Weight: boolean
    alt alpha not provided
        Weight->>Weight: select ALPHA_SYMBOL (0.3) or ALPHA_NL (0.5)
    else alpha provided
        Weight->>Weight: use explicit alpha
    end
    Weight-->>Rank: alpha value

    Rank->>Boost: applyQueryBoost(combinedScores, query, allChunks)
    Boost->>Boost: isSymbolQuery(query)
    alt Symbol query
        Boost->>Boost: _extractSymbolName(query)
        Boost->>Boost: _boostSymbolDefinitions()
        Boost->>Boost: _definitionPattern(symbolName)
        Boost->>Boost: FIFO cache (max 256 entries)
        Boost->>Boost: _chunkDefinesSymbol(chunk, symbolName)
        Boost->>Boost: check file stem for match
        alt Definition found with stem match
            Boost->>Boost: tier = boostUnit * 1.5
        else Definition found without stem match
            Boost->>Boost: tier = boostUnit * 1.0
        end
        Boost->>Boost: _scanNonCandidates() for non-candidate chunks
        Boost->>Boost: _boostEmbeddedSymbols()
        Boost->>Boost: _stemMatches() for file stem
    else NL query
        Boost->>Boost: _boostStemMatches(query, maxScore)
        Boost->>Boost: stem = pathStemLower(filePath)
        Boost->>Boost: _stemMatches(stem, queryWord)
        Boost->>Boost: _countKeywordMatches()
        Boost->>Boost: _boostEmbeddedSymbols()
        Boost->>Boost: embedRegex = EMBEDDED_SYMBOL_RE
        Boost->>Boost: _chunkDefinesSymbol() half-strength
    end
    Boost-->>Rank: new Map<Chunk, number>

    Rank->>Boost: boostMultiChunkFiles(scores)
    Boost->>Boost: group chunks by filePath
    Boost->>Boost: calculate fileSum per file
    Boost->>Boost: identify best chunk per file
    Boost->>Boost: maxFileSum = max(fileSum values)
    Boost->>Boost: boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC
    Boost->>Boost: boost = boostUnit * fileSum / maxFileSum
    Boost->>Boost: add boost to best chunk
    Boost-->>Rank: scores modified in-place

    Note over Rank,ChunkStore: Supporting data flows

    Boost->>QueryParser: splitIdentifier(token)
    QueryParser-->>Boost: token parts (lowercase)
    Boost->>ChunkStore: access chunk.content, chunk.filePath
    ChunkStore-->>Boost: chunk metadata

Comment thread src/ranking/boosting.ts Outdated
Comment thread src/ranking/boosting.ts
- boostMultiChunkFiles: guard against maxFileSum <= 0 to prevent NaN/Infinity
  when chunk scores cancel out within every file (gemini + cubic).
- applyQueryBoost: return a fresh Map for empty input so the result never
  aliases caller state (cubic).
- _stemMatches: hoist stripTrailingS helper to module scope so it isn't
  re-created on every call (gemini).
- pathStemLower: delegate to pathStemOriginal to remove duplicated stem
  extraction logic (gemini).
- _boostSymbolDefinitions / _boostEmbeddedSymbols / _boostStemMatches:
  iterate boosted.keys() directly instead of materializing the key list
  via Array.from(...); only existing entries are mutated inside the loop,
  and new entries are added in a separate phase afterwards (gemini).
- _countKeywordMatches: avoid per-iteration array allocation + destructuring
  when picking the shorter/longer of two strings (gemini).

Adds regression tests for the empty-map aliasing and the maxFileSum-zero
guard. Existing semble-parity behavior is unchanged.

@amondnet amondnet left a comment

Contributor Author

Bot review pass — applied 9 / deferred 0

Addressed all 9 line comments from gemini-code-assist (7) and cubic (2) in 04c2957. No comments deferred. No constants, regexes, or semble-parity behavior were modified.

Fixes (real correctness issues)

  • `boostMultiChunkFiles`: added `if (maxFileSum <= 0) return` before the boost loop so positive/negative chunk scores that cancel within every file no longer produce `NaN` / `Infinity` (gemini line 165, cubic line 164).
  • `applyQueryBoost` empty input: now returns a fresh `new Map()` instead of the caller's map, so the non-mutating contract is preserved even on the empty path (cubic line 125).
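The guarded file-coherence boost can be sketched from the steps listed in cubic's sequence diagram. This is an approximation under stated assumptions: `maxScore` is taken to be the maximum chunk score, every file receives a boost on its best chunk, and the `Chunk` shape is a minimal stand-in. The merged function may differ, for example by skipping single-chunk files.

```typescript
interface Chunk {
  filePath: string;
  content: string;
}

const FILE_COHERENCE_BOOST_FRAC = 0.2; // constant stated in the PR

// Mutates `scores` in place, like boostMultiChunkFiles.
function boostMultiChunkFilesSketch(scores: Map<Chunk, number>): void {
  if (scores.size === 0) return;

  const fileSum = new Map<string, number>();
  const bestChunk = new Map<string, Chunk>();
  let maxScore = -Infinity;

  // Group by file: accumulate the per-file sum and track the best chunk.
  for (const [chunk, score] of scores) {
    fileSum.set(chunk.filePath, (fileSum.get(chunk.filePath) ?? 0) + score);
    const best = bestChunk.get(chunk.filePath);
    if (best === undefined || score > (scores.get(best) ?? -Infinity)) {
      bestChunk.set(chunk.filePath, chunk);
    }
    if (score > maxScore) maxScore = score;
  }

  let maxFileSum = -Infinity;
  for (const sum of fileSum.values()) {
    if (sum > maxFileSum) maxFileSum = sum;
  }

  // The guard from this review pass: without it, sum / maxFileSum
  // evaluates to 0 / 0 = NaN when scores cancel within every file.
  if (maxFileSum <= 0) return;

  const boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC;
  for (const [filePath, sum] of fileSum) {
    const best = bestChunk.get(filePath);
    if (best === undefined) continue;
    scores.set(best, (scores.get(best) ?? 0) + (boostUnit * sum) / maxFileSum);
  }
}
```

For example, with scores `1.0` and `0.5` in one file and `0.5` in another, the first file's best chunk gains the full `boostUnit` of `0.2`, and the second file's chunk gains one third of it.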

Cleanups (performance + readability, no behavior change)

  • Hoisted `stripTrailingS` to module scope (it was being re-created on every `_stemMatches` call).
  • `pathStemLower` now delegates to `pathStemOriginal`, deduplicating the stem-extraction body.
  • Replaced `Array.from(boosted.keys())` with `boosted.keys()` in `_boostSymbolDefinitions`, `_boostEmbeddedSymbols`, and `_boostStemMatches`. Only existing entries are updated inside the loop; new entries (non-candidate scan) are added in a separate phase afterwards.
  • `_countKeywordMatches`: replaced the `[shorter, longer] = …` destructuring with two direct ternary assignments to avoid per-iteration array allocation.

Regression tests

  • boostMultiChunkFiles — added "no NaN/Infinity when fileSums cancel to zero".
  • applyQueryBoost — strengthened the empty-input test to assert the result is a fresh map and that mutating it does not affect the input.
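The property checked by the strengthened empty-input test can be sketched as below. `applyQueryBoostSketch` is a hypothetical, heavily reduced stand-in that elides every boost step and keeps only the copy semantics; the real `applyQueryBoost` also takes the query and the full chunk list.

```typescript
interface Chunk {
  filePath: string;
  content: string;
}

// HYPOTHETICAL reduced stand-in: boost steps are elided.
function applyQueryBoostSketch(scores: Map<Chunk, number>): Map<Chunk, number> {
  // Empty path: return a fresh map rather than the caller's own instance.
  if (scores.size === 0) return new Map<Chunk, number>();
  // Non-empty path: boosts would be applied to this copy, never to the input.
  return new Map(scores);
}

const input = new Map<Chunk, number>();
const result = applyQueryBoostSketch(input);

// Mutating the result must not leak into the caller's map.
result.set({ filePath: "src/a.ts", content: "" }, 1);
```

After this runs, `result` and `input` are different objects, `input` is still empty, and `result` holds the one entry added by the caller.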

Verified

bun test src/ranking/ → 41 pass / 0 fail.

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit 08f5b06 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-5-ranking-boost branch May 28, 2026 16:07
This was referenced Jun 18, 2026