feat(ranking): port path penalties + rerankTopK from semble#4

Merged
amondnet merged 2 commits into main from feat/unit-6-ranking-penalties
May 28, 2026
@amondnet amondnet commented May 28, 2026

Contributor

Summary

Worker Unit 6: ports src/semble/ranking/penalties.py to src/ranking/penalties.ts.

What's in

  • Regexes (verbatim from Python):
    • TEST_FILE_RE — multi-language test-file alternation (Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C/C++, Scala, Dart, Lua, plus shared test_helpers? helpers).
    • TEST_DIR_RE, COMPAT_DIR_RE, EXAMPLES_DIR_RE, TYPE_DEFS_RE.
  • Constants: STRONG_PENALTY=0.3, MODERATE_PENALTY=0.5, MILD_PENALTY=0.7, FILE_SATURATION_THRESHOLD=1, FILE_SATURATION_DECAY=0.5.
  • REEXPORT_FILENAMES: Set(['__init__.py', 'package-info.java']).
  • _filePathPenalty(filePath): multiplicative combination over normalised path; basename check uses raw filePath to match Python's POSIX Path.name semantics.
  • rerankTopK(scores, topK, { penalisePaths }): single-sort greedy pass with file-saturation decay; early break when selected >= topK && penScore <= minSelected.
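
For readers who have not seen the Python original, here is a minimal sketch of how the multiplicative combination and the raw-path basename check could fit together. It is not the ported code: the two regexes are deliberately simplified stand-ins, the names `filePathPenaltySketch` and `posixBasename` are hypothetical, the backslash-to-slash normalisation is assumed, and applying MODERATE_PENALTY to re-export files is an assumption (the PR does not say which constant that branch uses).

```typescript
// Constants as listed in the PR summary.
const STRONG_PENALTY = 0.3;
const MODERATE_PENALTY = 0.5;

// ASSUMPTION: simplified stand-ins for TEST_FILE_RE / TEST_DIR_RE. The real
// regexes cover many more languages and are not reproduced here.
const TEST_FILE_RE_SKETCH = /(?:^|\/)test_[^\/]*\.py$|\.test\.[jt]sx?$/;
const TEST_DIR_RE_SKETCH = /(?:^|\/)tests?\//;

const REEXPORT_FILENAMES = new Set(["__init__.py", "package-info.java"]);

// POSIX-style basename: only "/" separates components, so a backslash
// stays part of the file name.
function posixBasename(rawPath: string): string {
  return rawPath.slice(rawPath.lastIndexOf("/") + 1);
}

function filePathPenaltySketch(filePath: string): number {
  // ASSUMPTION: "normalised" means backslashes are turned into slashes.
  const normalised = filePath.replace(/\\/g, "/");
  let penalty = 1.0;

  // Test file OR test directory share one branch, so the strong penalty
  // is applied at most once.
  if (TEST_FILE_RE_SKETCH.test(normalised) || TEST_DIR_RE_SKETCH.test(normalised)) {
    penalty *= STRONG_PENALTY;
  }

  // Basename check uses the raw path, not the normalised one.
  // ASSUMPTION: the constant used for this branch is a guess.
  if (REEXPORT_FILENAMES.has(posixBasename(filePath))) {
    penalty *= MODERATE_PENALTY;
  }

  return penalty;
}
```

With this sketch, `tests/test_foo.py` yields 0.3 (one strong penalty), while `src\__init__.py` yields 1 because the raw basename is the whole string and so does not match a re-export filename.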

Notes

  • Chunk type is inlined with a comment pointing to Unit 1 (src/types.ts) — will be replaced once that lands.
  • The Python code ORs TEST_FILE_RE and TEST_DIR_RE into a single penalty branch, so tests/test_foo.py gets one STRONG_PENALTY (0.3), not the 0.09 (0.3 × 0.3) that two separate branches would give. The tests reflect this, and the worker-prompt expectation has been double-checked against running Python.
  • Saturation decay formula matches Python: FILE_SATURATION_DECAY ** (already_selected - threshold + 1).
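
To make the decay formula concrete, here is a small worked sketch using the constants from the summary (threshold 1, decay 0.5). The helper name `saturationFactor` is hypothetical, and returning 1 below the threshold is an assumption about how the formula is gated.

```typescript
const FILE_SATURATION_THRESHOLD = 1;
const FILE_SATURATION_DECAY = 0.5;

// Multiplier for a chunk's score, given how many chunks from the same file
// have already been selected.
// ASSUMPTION: no decay is applied while alreadySelected is below the threshold.
function saturationFactor(alreadySelected: number): number {
  if (alreadySelected < FILE_SATURATION_THRESHOLD) return 1;
  return FILE_SATURATION_DECAY ** (alreadySelected - FILE_SATURATION_THRESHOLD + 1);
}

// First chunk from a file keeps its score; each later one is halved again.
const factors = [0, 1, 2, 3].map(saturationFactor);
// factors is [1, 0.5, 0.25, 0.125]
```

So under these assumptions the second chunk from a file competes at half its penalised score, the third at a quarter, and so on.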

Verification

  • 21 unit tests pass: bun test src/ranking/penalties.test.ts.
  • _filePathPenalty outputs cross-checked against running the original Python source for 10+ inputs (including tests/test_foo.py, examples/foo.test.ts, src/__init__.d.ts, src\\foo.test.ts, etc.).
  • TEST_FILE_RE matches verified identical to Python across 34 positive/negative cases.

Out of scope

  • e2e (skipped per worker instructions).
  • package.json is untouched.
  • Wiring Chunk to src/types.ts waits on Unit 1.

Summary by cubic

Ported path-based penalties and rerankTopK from Semble to TypeScript to improve ranking by downweighting tests/compat/examples and limiting repeated chunks from the same file. Adds parity tests and a guard for non‑positive topK.

  • New Features

    • Multi-language TEST_FILE_RE plus dir regexes for tests, compat/legacy, examples, and .d.ts.
    • Penalty constants and REEXPORT_FILENAMES for re-export barrels.
    • _filePathPenalty(filePath) combines applicable penalties with POSIX-style basename.
    • rerankTopK(scores, topK, { penalisePaths }) applies path penalties and per-file saturation decay with a single sort and early exit.
    • 21 unit tests for penalties, decay, sorting, and topK edge cases.
  • Bug Fixes

    • Return [] when topK <= 0 to avoid negative slice behavior; tests added.

Written for commit 8dd26b0. Summary will update on new commits.
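
The negative-slice behaviour that the guard avoids is easy to reproduce in isolation. The sketch below uses a hypothetical helper, `takeTopK`, rather than the real rerankTopK, and only shows why a negative count must not reach `slice`.

```typescript
// Hypothetical helper: take the first topK items of an already-ranked list.
function takeTopK<T>(ranked: T[], topK: number): T[] {
  if (topK <= 0) return []; // guard: never let a non-positive count reach slice
  return ranked.slice(0, topK);
}

const ranked = ["a", "b", "c"];

// A negative end index counts from the back, so this silently drops only
// the last element instead of returning nothing.
const unguarded = ranked.slice(0, -1); // ["a", "b"]

const guarded = takeTopK(ranked, -1); // []
```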

Port src/semble/ranking/penalties.py to TypeScript:

- TEST_FILE_RE: multi-language test file regex (Python, Go, Java, PHP,
  Ruby, JS/TS, Kotlin, Swift, C#, C/C++, Scala, Dart, Lua, helpers)
- TEST_DIR_RE / COMPAT_DIR_RE / EXAMPLES_DIR_RE / TYPE_DEFS_RE
- Penalty constants (STRONG/MODERATE/MILD), REEXPORT_FILENAMES set
- _filePathPenalty: multiplicative combination of applicable penalties
  with POSIX-style basename extraction (matches Python's Path.name on
  POSIX — backslashes are not separators)
- rerankTopK: single-sort greedy pass with file-saturation decay and
  safe early-exit once topK is filled

Verbatim port, including the parity quirk where TEST_FILE_RE and
TEST_DIR_RE are OR'd in a single penalty branch (so tests/test_foo.py
gets one STRONG penalty, not two).

Chunk type inlined pending Unit 1 (src/types.ts).

Refs: MinishLab/semble@main src/semble/ranking/penalties.py

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces path penalties and saturation decay logic for reranking search chunks, along with comprehensive unit tests. The feedback highlights a performance bottleneck in the greedy selection loop of rerankTopK that leads to O(N^2) complexity, suggesting a sorted array approach to optimize it. Additionally, it is recommended to handle non-positive topK values defensively to prevent incorrect slicing behavior.

Comment thread src/ranking/penalties.ts
Comment thread src/ranking/penalties.ts Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 2 files


Apply defensive guard: return [] for topK <= 0 to avoid the
Array.prototype.slice(0, -1) footgun when negative topK is passed.
Added a test covering 0, -1, -5.

Deferred the O(N^2) greedy-loop suggestion — it changes the
`selected` array semantics (always bounded to topK vs. unbounded
in semble) and the early-exit threshold (min over top-K subset vs.
min over all selected), which would diverge from upstream semble's
rerank_topk parity.

@amondnet amondnet left a comment

Contributor Author


Applied 1, deferred 1.

  • Applied (medium): defensive topK <= 0 guard with regression tests covering 0, -1, -5 (commit 8dd26b0).
  • Deferred (high): O(N²) greedy-loop rewrite — the suggested top-K sorted array changes two load-bearing semantics vs. semble's rerank_topk (unbounded selected array; early-exit min computed over all selected, not just top-K). Diverging requires either an upstream change or parity-snapshot tests proving the rewrite is equivalent. Filed as follow-up.
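
To make the two semantics named above concrete, here is a rough reconstruction of the greedy pass from the descriptions in this PR. It is a sketch, not the merged code: `Scored` and `greedySketch` are hypothetical names, the input scores are assumed to be already path-penalised, and the final re-sort before slicing is an assumption of this sketch.

```typescript
// Hypothetical shape: a chunk reduced to its file path and penalised score.
interface Scored {
  filePath: string;
  score: number;
}

const THRESHOLD = 1; // FILE_SATURATION_THRESHOLD
const DECAY = 0.5; // FILE_SATURATION_DECAY

function greedySketch(candidates: Scored[], topK: number): Scored[] {
  if (topK <= 0) return [];
  // Rank candidates once by penalised score, highest first.
  const rankedByScore = [...candidates].sort((a, b) => b.score - a.score);
  const perFile = new Map<string, number>();
  const selected: Scored[] = []; // unbounded: may grow past topK
  let minSelected = Infinity; // min over ALL selected, not only the best topK

  for (const c of rankedByScore) {
    // Early exit: enough items selected and this candidate cannot beat any of them.
    if (selected.length >= topK && c.score <= minSelected) break;
    const already = perFile.get(c.filePath) ?? 0;
    const factor = already >= THRESHOLD ? DECAY ** (already - THRESHOLD + 1) : 1;
    const effective = c.score * factor;
    selected.push({ filePath: c.filePath, score: effective });
    perFile.set(c.filePath, already + 1);
    minSelected = Math.min(minSelected, effective);
  }

  // ASSUMPTION: decayed scores are re-ordered before the final cut.
  return selected.sort((a, b) => b.score - a.score).slice(0, topK);
}

const picked = greedySketch(
  [
    { filePath: "a.ts", score: 1.0 },
    { filePath: "a.ts", score: 0.9 },
    { filePath: "b.ts", score: 0.8 },
    { filePath: "c.ts", score: 0.3 },
  ],
  2,
);
// picked holds a.ts (1.0) then b.ts (0.8); the second a.ts chunk decays to 0.45.
```

In this example `selected` briefly holds three entries for a topK of 2, and the loop keeps going after topK is reached because 0.8 still beats the decayed 0.45. A selection array bounded to topK would change both of those behaviours.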

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit df248d2 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-6-ranking-penalties branch May 28, 2026 16:05
This was referenced Jun 18, 2026