feat(ranking): port weighting + boosting from semble #13
Port src/semble/ranking/weighting.py and boosting.py to TypeScript:

- weighting.ts: resolveAlpha(query, alpha) with ALPHA_SYMBOL=0.3, ALPHA_NL=0.5 auto-detection via isSymbolQuery.
- boosting.ts: SYMBOL_QUERY_RE / EMBEDDED_SYMBOL_RE verbatim from semble; isSymbolQuery, applyQueryBoost, boostMultiChunkFiles, plus internals _extractSymbolName, _definitionPattern (FIFO cache, max 256), _chunkDefinesSymbol, _stemMatches, _definitionTier, _scanNonCandidates, _boostSymbolDefinitions, _boostEmbeddedSymbols, _boostStemMatches, _countKeywordMatches.

Uses Map<Chunk, number> instead of dict[Chunk, float] to preserve reference-identity keys. Inlines minimal Chunk type and splitIdentifier stub pending Unit 1 (types) and Unit 2 (tokens); both marked with TODO(integration) comments. Avoids Math.max(...iterable) spread to dodge argument-count limits on large indexes.

40 unit tests covering isSymbolQuery, resolveAlpha, _stemMatches, _chunkDefinesSymbol, _countKeywordMatches, boostMultiChunkFiles, and applyQueryBoost (symbol / NL / embedded paths).
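The alpha auto-detection described for weighting.ts could look roughly like the sketch below. The two constants come from this PR; `ASSUMED_SYMBOL_RE` is a hypothetical stand-in, because the real `SYMBOL_QUERY_RE` from semble is not reproduced in this text, and the optional/nullable `alpha` parameter is likewise an assumption about the signature.

```typescript
// Constants as stated in the PR description.
const ALPHA_SYMBOL = 0.3;
const ALPHA_NL = 0.5;

// ASSUMPTION: hypothetical stand-in for semble's SYMBOL_QUERY_RE (not shown in this PR text).
// Treats a single identifier, optionally qualified with "." or "::", as a symbol query.
const ASSUMED_SYMBOL_RE =
  /^[A-Za-z_][A-Za-z0-9_]*(?:(?:\.|::)[A-Za-z_][A-Za-z0-9_]*)*$/;

function isSymbolQuery(query: string): boolean {
  return ASSUMED_SYMBOL_RE.test(query.trim());
}

// An explicit alpha always wins; otherwise the query type selects the default.
function resolveAlpha(query: string, alpha?: number | null): number {
  if (alpha !== undefined && alpha !== null) return alpha;
  return isSymbolQuery(query) ? ALPHA_SYMBOL : ALPHA_NL;
}
```

Checking against `undefined`/`null` rather than truthiness keeps an explicit `alpha` of `0` from being silently replaced by a default.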
Code Review
This pull request ports the ranking, boosting, and weighting logic from Python to TypeScript, introducing boosting.ts and weighting.ts along with their respective test suites. The implementation handles query-type boosts, file coherence boosts, and embedded symbol boosts. The review feedback focuses on performance optimizations and code safety, such as adding a guard to prevent potential division-by-zero errors, avoiding unnecessary array allocations by iterating directly over Map.keys(), extracting inline helper functions to the module level, and reducing duplicate logic in file-stem extraction.
2 issues found across 4 files
Architecture diagram
sequenceDiagram
participant Rank as Ranking Engine
participant Weight as weighting.ts
participant Boost as boosting.ts
participant ChunkStore as Chunk Store
participant QueryParser as Query Parser
Note over Rank,QueryParser: NEW: Ranking weighting & boosting pipeline
Rank->>Weight: resolveAlpha(query, alpha?)
Weight->>Boost: isSymbolQuery(query)
Boost-->>Weight: boolean
alt alpha not provided
Weight->>Weight: select ALPHA_SYMBOL (0.3) or ALPHA_NL (0.5)
else alpha provided
Weight->>Weight: use explicit alpha
end
Weight-->>Rank: alpha value
Rank->>Boost: applyQueryBoost(combinedScores, query, allChunks)
Boost->>Boost: isSymbolQuery(query)
alt Symbol query
Boost->>Boost: _extractSymbolName(query)
Boost->>Boost: _boostSymbolDefinitions()
Boost->>Boost: _definitionPattern(symbolName)
Boost->>Boost: FIFO cache (max 256 entries)
Boost->>Boost: _chunkDefinesSymbol(chunk, symbolName)
Boost->>Boost: check file stem for match
alt Definition found with stem match
Boost->>Boost: tier = boostUnit * 1.5
else Definition found without stem match
Boost->>Boost: tier = boostUnit * 1.0
end
Boost->>Boost: _scanNonCandidates() for non-candidate chunks
Boost->>Boost: _boostEmbeddedSymbols()
Boost->>Boost: _stemMatches() for file stem
else NL query
Boost->>Boost: _boostStemMatches(query, maxScore)
Boost->>Boost: stem = pathStemLower(filePath)
Boost->>Boost: _stemMatches(stem, queryWord)
Boost->>Boost: _countKeywordMatches()
Boost->>Boost: _boostEmbeddedSymbols()
Boost->>Boost: embedRegex = EMBEDDED_SYMBOL_RE
Boost->>Boost: _chunkDefinesSymbol() half-strength
end
Boost-->>Rank: new Map<Chunk, number>
Rank->>Boost: boostMultiChunkFiles(scores)
Boost->>Boost: group chunks by filePath
Boost->>Boost: calculate fileSum per file
Boost->>Boost: identify best chunk per file
Boost->>Boost: maxFileSum = max(fileSum values)
Boost->>Boost: boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC
Boost->>Boost: boost = boostUnit * fileSum / maxFileSum
Boost->>Boost: add boost to best chunk
Boost-->>Rank: scores modified in-place
Note over Rank,ChunkStore: Supporting data flows
Boost->>QueryParser: splitIdentifier(token)
QueryParser-->>Boost: token parts (lowercase)
Boost->>ChunkStore: access chunk.content, chunk.filePath
ChunkStore-->>Boost: chunk metadata
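The `boostMultiChunkFiles` steps in the diagram can be sketched as follows. This is an illustration under assumptions, not the ported code: the `Chunk` shape is a minimal stand-in, `FILE_COHERENCE_BOOST_FRAC = 0.2` is taken from the cubic summary, and the `maxFileSum <= 0` guard reflects the follow-up commit rather than the diagram itself.

```typescript
// ASSUMPTION: minimal Chunk shape; the PR inlines its own temporary type.
interface Chunk {
  filePath: string;
  content: string;
}

const FILE_COHERENCE_BOOST_FRAC = 0.2;

// Boosts the best-scoring chunk of each file in place, in proportion to the file's total score.
function boostMultiChunkFiles(scores: Map<Chunk, number>): void {
  if (scores.size === 0) return;

  let maxScore = -Infinity;
  const fileSum = new Map<string, number>();
  const best = new Map<string, { chunk: Chunk; score: number }>();

  // Single pass: track the global max, per-file sums, and the best chunk per file.
  for (const [chunk, score] of scores) {
    if (score > maxScore) maxScore = score;
    fileSum.set(chunk.filePath, (fileSum.get(chunk.filePath) ?? 0) + score);
    const current = best.get(chunk.filePath);
    if (current === undefined || score > current.score) {
      best.set(chunk.filePath, { chunk, score });
    }
  }

  // Plain loop instead of Math.max(...values) to avoid argument-count limits.
  let maxFileSum = -Infinity;
  for (const sum of fileSum.values()) {
    if (sum > maxFileSum) maxFileSum = sum;
  }
  // Guard: a zero or negative denominator would yield NaN or Infinity.
  if (maxFileSum <= 0) return;

  const boostUnit = maxScore * FILE_COHERENCE_BOOST_FRAC;
  for (const [filePath, { chunk }] of best) {
    const boost = (boostUnit * fileSum.get(filePath)!) / maxFileSum;
    scores.set(chunk, scores.get(chunk)! + boost);
  }
}
```

The final loop iterates `best` rather than `scores`, so the map being mutated is never the one being traversed.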
- boostMultiChunkFiles: guard against maxFileSum <= 0 to prevent NaN/Infinity when chunk scores cancel out within every file (gemini + cubic).
- applyQueryBoost: return a fresh Map for empty input so the result never aliases caller state (cubic).
- _stemMatches: hoist stripTrailingS helper to module scope so it isn't re-created on every call (gemini).
- pathStemLower: delegate to pathStemOriginal to remove duplicated stem extraction logic (gemini).
- _boostSymbolDefinitions / _boostEmbeddedSymbols / _boostStemMatches: iterate boosted.keys() directly instead of materializing the key list via Array.from(...); only existing entries are mutated inside the loop, and new entries are added in a separate phase afterwards (gemini).
- _countKeywordMatches: avoid per-iteration array allocation + destructuring when picking the shorter/longer of two strings (gemini).

Adds regression tests for the empty-map aliasing and the maxFileSum-zero guard. Existing semble-parity behavior is unchanged.
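The two-phase pattern behind the `boosted.keys()` change can be illustrated with a small sketch. `boostInTwoPhases`, its `matches` predicate, and the score given to newly added chunks are all hypothetical; only the structure (update existing keys while iterating, add new keys afterwards) follows the commit message.

```typescript
// ASSUMPTION: minimal Chunk shape for illustration only.
interface Chunk {
  filePath: string;
  content: string;
}

// Hypothetical helper, not a function from the PR.
function boostInTwoPhases(
  boosted: Map<Chunk, number>,
  allChunks: readonly Chunk[],
  matches: (chunk: Chunk) => boolean,
  boost: number,
): void {
  // Phase 1: iterate the live key iterator. Overwriting an existing key adds no
  // entry, so the iteration still visits each key exactly once.
  for (const chunk of boosted.keys()) {
    if (matches(chunk)) boosted.set(chunk, boosted.get(chunk)! + boost);
  }

  // Phase 2: add non-candidates while iterating a different collection, so no
  // new keys appear in a Map that is currently being traversed.
  for (const chunk of allChunks) {
    if (!boosted.has(chunk) && matches(chunk)) boosted.set(chunk, boost);
  }
}
```

Inserting new keys during phase 1 would be the risky case, since a JavaScript `Map` iterator also visits entries added while it is running; keeping insertions in a separate phase avoids that without copying the key list.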
amondnet left a comment
Bot review pass — applied 9 / deferred 0
Addressed all 9 line comments from gemini-code-assist (7) and cubic (2) in 04c2957. No comments deferred. No constants, regexes, or semble-parity behavior were modified.
Fixes (real correctness issues)
- `boostMultiChunkFiles`: added `if (maxFileSum <= 0) return` before the boost loop so positive/negative chunk scores that cancel within every file no longer produce NaN / Infinity (gemini line 165, cubic line 164).
- `applyQueryBoost` empty input: now returns a fresh `new Map()` instead of the caller's map, so the non-mutating contract is preserved even on the empty path (cubic line 125).
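A minimal sketch of the non-aliasing contract, assuming the `applyQueryBoost(combinedScores, query, allChunks)` parameter order shown in the architecture diagram. The boosting logic itself is omitted and the function name `applyQueryBoostSketch` is hypothetical; only the copy-on-entry behavior is shown.

```typescript
// ASSUMPTION: minimal Chunk shape for illustration only.
interface Chunk {
  filePath: string;
  content: string;
}

// Hypothetical skeleton: the real function applies query-type boosts where marked.
function applyQueryBoostSketch(
  combinedScores: Map<Chunk, number>,
  _query: string,
  _allChunks: readonly Chunk[],
): Map<Chunk, number> {
  // Empty input: return a fresh Map, never the caller's own instance.
  if (combinedScores.size === 0) return new Map<Chunk, number>();

  // Non-empty input: work on a shallow copy so the caller's map is not mutated.
  const boosted = new Map(combinedScores);
  // ... symbol / NL / embedded-symbol boosts would be applied to `boosted` here ...
  return boosted;
}
```

Returning the input map on the empty path looks harmless, but a caller that later writes to the result would then be writing to its own input as well.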
Cleanups (performance + readability, no behavior change)
- Hoisted `stripTrailingS` to module scope (was being re-created on every `_stemMatches` call).
- `pathStemLower` now delegates to `pathStemOriginal` (deduped the stem-extraction body).
- Replaced `Array.from(boosted.keys())` with `boosted.keys()` in `_boostSymbolDefinitions`, `_boostEmbeddedSymbols`, and `_boostStemMatches`: only existing entries are updated inside the loop; new entries (non-candidate scan) are added in a separate phase afterwards.
- `_countKeywordMatches`: replaced the `[shorter, longer] = …` destructuring with two direct ternary assignments to avoid per-iteration array allocation.
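The ternary form mentioned for `_countKeywordMatches` might look like the sketch below. `prefixMatches` is a hypothetical helper, and the prefix rule is only a guess at what the "exact + prefix" test cases exercise; the point is the allocation-free way of picking the shorter and longer string.

```typescript
// Hypothetical helper, not the PR's _countKeywordMatches.
// True when the shorter of the two strings is a non-empty prefix of the longer one
// (equal strings count as a match).
function prefixMatches(a: string, b: string): boolean {
  // Two ternaries instead of `const [shorter, longer] = cond ? [a, b] : [b, a]`,
  // which would allocate a temporary array on every call.
  const aIsShorter = a.length <= b.length;
  const shorter = aIsShorter ? a : b;
  const longer = aIsShorter ? b : a;
  return shorter.length > 0 && longer.startsWith(shorter);
}
```

Inside a loop over every keyword and every identifier part, dropping the temporary array removes one short-lived allocation per comparison.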
Regression tests
- `boostMultiChunkFiles`: added "no NaN/Infinity when fileSums cancel to zero".
- `applyQueryBoost`: strengthened the empty-input test to assert the result is a fresh map and that mutating it does not affect the input.
Verified
bun test src/ranking/ → 41 pass / 0 fail.
Port of `src/semble/ranking/weighting.py` and `src/semble/ranking/boosting.py` from MinishLab/semble.
What's in this PR
Implementation notes
Tests
40 `bun:test` cases under `src/ranking/` cover `isSymbolQuery`, `resolveAlpha`, `_stemMatches`, `_chunkDefinesSymbol` (Python/Elixir/SQL DDL, case sensitivity, in-line false-positive guards), `_countKeywordMatches` (exact + prefix), `boostMultiChunkFiles`, and `applyQueryBoost` (symbol-definition / NL / embedded-symbol / stem-match paths). All pass.
```
bun test v1.3.14
40 pass / 0 fail / 62 expect() calls
```
Out of scope
Part of the parallel port effort tracking MinishLab/semble → @pleaseai/csp.
Summary by cubic
Ports ranking weighting and boosting from `MinishLab/semble` to TypeScript to improve ranking for symbol vs natural-language queries. Adds alpha defaults and query-aware boosts for better top hits, file coherence, and safer scoring.

New Features
- `src/ranking/weighting.ts`: `resolveAlpha(query, alpha)` with `ALPHA_SYMBOL=0.3` and `ALPHA_NL=0.5` (auto-detected via `isSymbolQuery`).
- `src/ranking/boosting.ts`: `isSymbolQuery`, `applyQueryBoost` (non-mutating), `boostMultiChunkFiles` (in-place); ports `SYMBOL_QUERY_RE`, `EMBEDDED_SYMBOL_RE`, and constants (`DEFINITION_BOOST_MULTIPLIER=3.0`, `STEM_BOOST_MULTIPLIER=1.0`, `FILE_COHERENCE_BOOST_FRAC=0.2`, `EMBEDDED_SYMBOL_BOOST_SCALE=0.5`).
- Uses `Map<Chunk, number>`; temporary `Chunk`/`splitIdentifier` stubs marked `TODO(integration)`. Tests cover symbol/NL paths, file-coherence boost, and regression cases.

Bug Fixes
- Prevents NaN/Infinity in `boostMultiChunkFiles` by guarding `maxFileSum <= 0`.
- `applyQueryBoost` now returns a fresh Map for empty input to avoid aliasing.
- Hoisted `stripTrailingS`, reused `pathStemOriginal` in `pathStemLower`, iterated `boosted.keys()` directly, and reduced allocations in `_countKeywordMatches`. Added regressions for these.

Written for commit 04c2957. Summary will update on new commits.