feat(ranking): port path penalties + rerankTopK from semble#4
Merged
Port src/semble/ranking/penalties.py to TypeScript:
- TEST_FILE_RE: multi-language test file regex (Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C/C++, Scala, Dart, Lua, helpers)
- TEST_DIR_RE / COMPAT_DIR_RE / EXAMPLES_DIR_RE / TYPE_DEFS_RE
- Penalty constants (STRONG/MODERATE/MILD), REEXPORT_FILENAMES set
- _filePathPenalty: multiplicative combination of applicable penalties with POSIX-style basename extraction (matches Python's Path.name on POSIX, where backslashes are not separators)
- rerankTopK: single-sort greedy pass with file-saturation decay and safe early-exit once topK is filled

Verbatim port, including the parity quirk where TEST_FILE_RE and TEST_DIR_RE are OR'd in a single penalty branch (so tests/test_foo.py gets one STRONG penalty, not two). Chunk type inlined pending Unit 1 (src/types.ts).

Refs: MinishLab/semble@main src/semble/ranking/penalties.py
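To make the multiplicative combination and the OR'd test branch concrete, here is a minimal sketch. It is not the ported code: the regexes are heavily simplified stand-ins for the real multi-language patterns, the backslash-to-slash normalisation is an assumption about what "normalised path" means, and the pairing of `MODERATE_PENALTY` with re-export files and `MILD_PENALTY` with type definitions is a guess made only for illustration. The names `posixBasename` and `filePathPenaltySketch` are hypothetical.

```typescript
// Penalty constants as listed in the PR description.
const STRONG_PENALTY = 0.3;
const MODERATE_PENALTY = 0.5;
const MILD_PENALTY = 0.7;

// ASSUMPTION: simplified stand-ins for the real multi-language regexes.
const TEST_FILE_RE = /(^|\/)test_[^/]*\.py$|\.test\.[jt]sx?$/;
const TEST_DIR_RE = /(^|\/)tests?\//;
const TYPE_DEFS_RE = /\.d\.ts$/;
const REEXPORT_FILENAMES = new Set(["__init__.py", "package-info.java"]);

// POSIX-style basename: only "/" separates components, so a backslash
// stays part of the name. Trailing-slash edge cases are ignored here.
function posixBasename(filePath: string): string {
  return filePath.slice(filePath.lastIndexOf("/") + 1);
}

function filePathPenaltySketch(filePath: string): number {
  // ASSUMPTION: "normalised" means backslashes replaced with forward slashes.
  const normalised = filePath.replace(/\\/g, "/");
  let penalty = 1.0;

  // Test file OR test directory: a single branch, so the penalty applies once.
  if (TEST_FILE_RE.test(normalised) || TEST_DIR_RE.test(normalised)) {
    penalty *= STRONG_PENALTY;
  }
  // Basename check runs on the raw path, not the normalised one.
  // ASSUMPTION: which constant applies to re-export files is a guess.
  if (REEXPORT_FILENAMES.has(posixBasename(filePath))) {
    penalty *= MODERATE_PENALTY;
  }
  // ASSUMPTION: which constant applies to type definitions is a guess.
  if (TYPE_DEFS_RE.test(normalised)) {
    penalty *= MILD_PENALTY;
  }
  return penalty;
}
```

In this sketch `tests/test_foo.py` matches both the file and directory patterns but is multiplied by `STRONG_PENALTY` only once, and `pkg\__init__.py` is not treated as a re-export file because its POSIX basename is the whole string.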
Code Review
This pull request introduces path penalties and saturation decay logic for reranking search chunks, along with comprehensive unit tests. The feedback highlights a performance bottleneck in the greedy selection loop of rerankTopK that leads to O(N^2) complexity, suggesting a sorted array approach to optimize it. Additionally, it is recommended to handle non-positive topK values defensively to prevent incorrect slicing behavior.
Apply defensive guard: return [] for topK <= 0 to avoid the Array.prototype.slice(0, -1) footgun when negative topK is passed. Added a test covering 0, -1, -5. Deferred the O(N^2) greedy-loop suggestion — it changes the `selected` array semantics (always bounded to topK vs. unbounded in semble) and the early-exit threshold (min over top-K subset vs. min over all selected), which would diverge from upstream semble's rerank_topk parity.
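The slice footgun mentioned in this commit message can be reproduced in a few lines. The helper name `takeTopK` is hypothetical and only illustrates the shape of the guard, not the actual code in `rerankTopK`.

```typescript
const ranked = ["a", "b", "c"];

// A negative end index counts from the back of the array, so slice(0, -1)
// silently drops only the last element instead of returning nothing.
const surprising = ranked.slice(0, -1);

// Hypothetical helper showing the guard: any non-positive topK yields [].
function takeTopK<T>(items: readonly T[], topK: number): T[] {
  if (topK <= 0) return [];
  return items.slice(0, topK);
}

const guarded = takeTopK(ranked, -1);
```

Without the guard, a caller passing `-1` would get two results back rather than none.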
amondnet commented May 28, 2026
Applied 1, deferred 1.
- Applied (medium): defensive `topK <= 0` guard with regression tests covering `0`, `-1`, `-5` (commit 8dd26b0).
- Deferred (high): O(N²) greedy-loop rewrite. The suggested top-K sorted array changes two load-bearing semantics vs. semble's `rerank_topk` (unbounded `selected` array; early-exit min computed over all selected, not just top-K). Diverging requires either an upstream change or parity-snapshot tests proving the rewrite is equivalent. Filed as follow-up.
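The difference between the two early-exit thresholds can be seen with a handful of numbers. The scores below are made up for illustration, and `wouldBreak` is a hypothetical helper that mirrors the break condition quoted in the PR description. The sketch shows only that the two thresholds differ, not that the final results would.

```typescript
const topK = 2;

// Hypothetical effective scores of the chunks selected so far.
const selectedScores = [1.0, 0.45, 0.6];

// Upstream semantics as described: min over everything selected.
const minOverAll = Math.min(...selectedScores);

// Suggested rewrite as described: min over the best topK entries only.
const minOverTopK = Math.min(
  ...[...selectedScores].sort((a, b) => b - a).slice(0, topK),
);

// Break condition: enough chunks selected and the next candidate cannot beat the threshold.
function wouldBreak(penScore: number, threshold: number): boolean {
  return selectedScores.length >= topK && penScore <= threshold;
}

const upstreamStops = wouldBreak(0.5, minOverAll);
const rewriteStops = wouldBreak(0.5, minOverTopK);
```

For a next candidate scoring 0.5, the loop keeps going under the min-over-all threshold (0.45) but stops under the min-over-top-K threshold (0.6), so the two variants visit different numbers of candidates.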
This was referenced Jun 18, 2026
Summary
Worker Unit 6: ports `src/semble/ranking/penalties.py` → `src/ranking/penalties.ts`.

What's in
- `TEST_FILE_RE`: multi-language test-file alternation (Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C/C++, Scala, Dart, Lua, plus shared `test_helpers?` helpers).
- `TEST_DIR_RE`, `COMPAT_DIR_RE`, `EXAMPLES_DIR_RE`, `TYPE_DEFS_RE`.
- `STRONG_PENALTY=0.3`, `MODERATE_PENALTY=0.5`, `MILD_PENALTY=0.7`, `FILE_SATURATION_THRESHOLD=1`, `FILE_SATURATION_DECAY=0.5`.
- `REEXPORT_FILENAMES`: `Set(['__init__.py', 'package-info.java'])`.
- `_filePathPenalty(filePath)`: multiplicative combination over normalised path; basename check uses raw `filePath` to match Python's POSIX `Path.name` semantics.
- `rerankTopK(scores, topK, { penalisePaths })`: single-sort greedy pass with file-saturation decay; early break when `selected >= topK && penScore <= minSelected`.

Notes
- `Chunk` type is inlined with a comment pointing to Unit 1 (`src/types.ts`); it will be replaced once that lands.
- The port ORs `TEST_FILE_RE` and `TEST_DIR_RE` into a single penalty branch, so `tests/test_foo.py` gets one `STRONG_PENALTY` (0.3), not 0.09. The tests reflect this (and the worker-prompt expectation has been double-checked against running Python).
- Saturation decay factor: `FILE_SATURATION_DECAY ** (already_selected - threshold + 1)`.

Verification
- `bun test src/ranking/penalties.test.ts`.
- `_filePathPenalty` outputs cross-checked against running the original Python source for 10+ inputs (including `tests/test_foo.py`, `examples/foo.test.ts`, `src/__init__.d.ts`, `src\\foo.test.ts`, etc.).
- `TEST_FILE_RE` matches verified identical to Python across 34 positive/negative cases.

Out of scope
- `package.json` is untouched.
- Moving `Chunk` to `src/types.ts` waits on Unit 1.

Summary by cubic
Ported path-based penalties and `rerankTopK` from Semble to TypeScript to improve ranking by downweighting tests/compat/examples and limiting repeated chunks from the same file. Adds parity tests and a guard for non-positive `topK`.

New Features
- `TEST_FILE_RE` plus dir regexes for tests, compat/legacy, examples, and `.d.ts`.
- `REEXPORT_FILENAMES` for re-export barrels.
- `_filePathPenalty(filePath)` combines applicable penalties with POSIX-style basename.
- `rerankTopK(scores, topK, { penalisePaths })` applies path penalties and per-file saturation decay with a single sort and early exit.
- Tests for `topK` edge cases.

Bug Fixes
- Returns `[]` when `topK <= 0` to avoid negative slice behavior; tests added.

Written for commit 8dd26b0. Summary will update on new commits.
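The reranking pass described in this PR (one sort, per-file saturation decay, early exit) can be sketched roughly as follows. This is an illustration under stated assumptions, not the ported implementation. The `Chunk` shape is a hypothetical minimum, `pathPenalty` is a hypothetical hook standing in for `_filePathPenalty`, scores are assumed non-negative, and the final ordering step is a guess because the PR text does not spell out how the result list is assembled.

```typescript
// Constants as listed in the PR description.
const FILE_SATURATION_THRESHOLD = 1;
const FILE_SATURATION_DECAY = 0.5;

// Hypothetical minimal chunk shape; the real Chunk type is pending Unit 1.
interface Chunk {
  filePath: string;
  text: string;
}

function rerankTopKSketch(
  scores: Map<Chunk, number>,
  topK: number,
  pathPenalty: (filePath: string) => number = () => 1, // hypothetical hook
): Array<[Chunk, number]> {
  if (topK <= 0) return [];

  // Rank chunks by path-penalised score, best first.
  const ranked = [...scores.entries()]
    .map(([chunk, score]): [Chunk, number] => [chunk, score * pathPenalty(chunk.filePath)])
    .sort((a, b) => b[1] - a[1]);

  const perFile = new Map<string, number>();
  const selected: Array<[Chunk, number]> = [];
  let minSelected = Infinity; // running min over everything selected so far

  for (const [chunk, penScore] of ranked) {
    // With non-negative scores, decay can only lower a score, so once topK
    // chunks are selected and this one cannot beat the weakest, stop.
    if (selected.length >= topK && penScore <= minSelected) break;

    const already = perFile.get(chunk.filePath) ?? 0;
    const decay =
      already >= FILE_SATURATION_THRESHOLD
        ? FILE_SATURATION_DECAY ** (already - FILE_SATURATION_THRESHOLD + 1)
        : 1;
    const effective = penScore * decay;

    selected.push([chunk, effective]);
    perFile.set(chunk.filePath, already + 1);
    minSelected = Math.min(minSelected, effective);
  }

  // ASSUMPTION: order the candidates by decayed score before truncating.
  return selected.sort((a, b) => b[1] - a[1]).slice(0, topK);
}

const a1: Chunk = { filePath: "src/a.ts", text: "first" };
const a2: Chunk = { filePath: "src/a.ts", text: "second" };
const b1: Chunk = { filePath: "src/b.ts", text: "other" };
const result = rerankTopKSketch(new Map([[a1, 1.0], [a2, 0.9], [b1, 0.6]]), 2);
```

With these inputs the second chunk from `src/a.ts` decays from 0.9 to 0.45, so the `src/b.ts` chunk at 0.6 takes the second slot.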