feat(tokens): port identifier-aware tokenizer from semble by amondnet · Pull Request #1 · pleaseai/code-search

amondnet · 2026-05-28T15:15:44Z

Ports src/semble/tokens.py to src/tokens.ts as part of the parallel port effort (Unit 2).

What

splitIdentifier(token) — lowered original + camelCase/snake_case sub-tokens.
Returns only [lower] when fewer than two sub-tokens exist, mirroring the Python behaviour.
tokenize(text) — walks identifier matches with TOKEN_RE and expands each via splitIdentifier for BM25 indexing.
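
The two functions described above can be sketched roughly as follows. This is a minimal sketch, not the merged file: the regexes and the splitIdentifier logic follow this PR's description, while the body of tokenize is not quoted anywhere in the thread, so its loop shape here is an assumption.

```typescript
// Regexes as quoted in this PR description.
const TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Lowered original plus sub-tokens; only [lower] when fewer than two parts exist.
function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Assumed shape: walk every TOKEN_RE match and append its expansion in order.
function tokenize(text: string): string[] {
  const tokens: string[] = []
  for (const [match] of text.matchAll(TOKEN_RE)) {
    tokens.push(...splitIdentifier(match))
  }
  return tokens
}

console.log(tokenize('getUserName')) // [ 'getusername', 'get', 'user', 'name' ]
```

Keeping the lowered original alongside its parts lets a BM25 query match either the full identifier or any of its words.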

How

Regexes ported verbatim from upstream:
- TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
- CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g
The g flag is required because we use String.prototype.matchAll, which throws a TypeError when given a non-global RegExp.
The snake_case branch filters empty parts (.filter(p => p.length > 0)) to match Python's [p for p in lower.split('_') if p].
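
To make the points above concrete, the sketch below shows how the CAMEL_RE alternatives carve up two identifiers, that matchAll rejects a regex without the g flag, and what the empty-part filter removes. The sample identifiers (HTTPServer, parseURL2) and the helper names camelParts and throwsWithoutGlobalFlag are illustrative and do not come from the PR.

```typescript
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Hypothetical helper: list the raw CAMEL_RE matches for a token.
function camelParts(token: string): string[] {
  return Array.from(token.matchAll(CAMEL_RE), ([m]) => m)
}

// The lookahead alternative stops an acronym before the next capitalised word.
console.log(camelParts('HTTPServer')) // [ 'HTTP', 'Server' ]
// A trailing acronym falls through to the plain [A-Z]+ alternative.
console.log(camelParts('parseURL2')) // [ 'parse', 'URL', '2' ]

// Hypothetical helper: matchAll throws a TypeError for a regex without the g flag.
function throwsWithoutGlobalFlag(): boolean {
  try {
    'fooBar'.matchAll(/[a-z]+/)
    return false
  }
  catch (err) {
    return err instanceof TypeError
  }
}
console.log(throwsWithoutGlobalFlag()) // true

// split('_') keeps an empty string between consecutive underscores;
// the filter drops it, like Python's `if p` clause.
const rawParts = 'foo__bar'.split('_')
const keptParts = rawParts.filter(p => p.length > 0)
console.log(rawParts, keptParts) // [ 'foo', '', 'bar' ] [ 'foo', 'bar' ]
```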

Tests

bun test src/tokens.test.ts — 13 pass. Covers the canonical semble examples plus edge cases verified against the upstream Python implementation:

Consecutive underscores (foo__bar)
Leading underscore (_foo has only one non-empty part, so just ['_foo'] is returned)
Digit runs (abc123Def -> ['abc123def','abc','123','def'])
Plain space-separated words
Non-identifier digits (123) correctly dropped by TOKEN_RE

Parity for tokenize('camelCase_snake_case 123') was verified directly against Python and is ['camelcase_snake_case', 'camelcase', 'snake', 'case'] — the original task description's expected value contained extra entries that do not correspond to the semble algorithm; the test asserts the Python-verified output.
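
The edge cases listed above can be written as a small input/output table. This is a sketch under stated assumptions rather than the contents of src/tokens.test.ts: it uses plain comparisons instead of the bun test API, the body of tokenize is assumed, and the expected values for foo__bar, _foo, and the plain words are derived from the described algorithm rather than quoted from the PR.

```typescript
const TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Compact restatement of the logic described in this PR.
function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Assumed body: expand every TOKEN_RE match in order.
function tokenize(text: string): string[] {
  return Array.from(text.matchAll(TOKEN_RE)).flatMap(([m]) => splitIdentifier(m))
}

// Each case pairs an input with the tokens the algorithm should produce.
const cases: Array<[string, string[]]> = [
  ['foo__bar', ['foo__bar', 'foo', 'bar']], // consecutive underscores
  ['_foo', ['_foo']], // leading underscore: one effective part
  ['abc123Def', ['abc123def', 'abc', '123', 'def']], // digit run
  ['hello world', ['hello', 'world']], // plain words
  ['123', []], // not an identifier, dropped by TOKEN_RE
  ['camelCase_snake_case 123', ['camelcase_snake_case', 'camelcase', 'snake', 'case']],
]

const failures = cases.filter(
  ([input, expected]) => JSON.stringify(tokenize(input)) !== JSON.stringify(expected),
)
console.log(failures.length === 0 ? 'all cases match' : failures)
```

The last case shows why the Python-verified value is shorter than one might guess: once a token contains an underscore, only the snake_case branch runs, so camelcase is not split further into camel and case.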

Notes

Pure functions, no e2e.
No package.json changes.

Summary by cubic

Ports the identifier-aware tokenizer from Python to TypeScript for Unit 2. Adds splitIdentifier and tokenize that expand camelCase/PascalCase/snake_case for BM25 indexing with upstream parity, plus a fast-path for pure lowercase tokens.

New Features
- splitIdentifier(token): returns the lowered original plus sub-tokens; returns only [lower] when fewer than two parts exist.
- tokenize(text): extracts identifier-like tokens via TOKEN_RE and expands each with splitIdentifier; drops non-identifier digits.

Written for commit 67e3607. Summary will update on new commits.

Ports src/semble/tokens.py to src/tokens.ts:
- splitIdentifier(token): lowered original + camelCase/snake_case sub-tokens. Returns only [lower] when fewer than two sub-tokens exist, matching the Python behaviour.
- tokenize(text): walks identifier matches with TOKEN_RE and expands each via splitIdentifier for BM25 indexing.

Regexes are ported verbatim (TOKEN_RE, CAMEL_RE) with the /g flag so matchAll works. The snake_case branch filters empty parts to match Python's '[p for p in lower.split("_") if p]'.

Tests cover the canonical semble examples plus edge cases verified against the Python implementation (consecutive underscores, leading underscore, digit runs).

gemini-code-assist

Code Review

This pull request ports tokenization and identifier-splitting logic from Python to TypeScript, introducing the splitIdentifier and tokenize functions along with their corresponding unit tests. The code reviewer suggested optimizing splitIdentifier by adding a fast-path check for simple lowercase tokens to bypass regex execution, and recommended adding a defensive guard in tokenize to safely handle null or undefined inputs.
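
For reference, the defensive guard mentioned in this review could look roughly like the sketch below. It is hypothetical: the author deferred this suggestion, the function name identifierMatchesOrEmpty is invented for illustration, and only the TOKEN_RE matching step is shown, not the expansion through splitIdentifier.

```typescript
const TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g

// Hypothetical guard, not part of the PR: treat null/undefined input as "no tokens"
// instead of letting text.matchAll throw on a missing value.
function identifierMatchesOrEmpty(text: string | null | undefined): string[] {
  if (text == null) {
    return []
  }
  return Array.from(text.matchAll(TOKEN_RE), ([m]) => m)
}

console.log(identifierMatchesOrEmpty(undefined)) // []
console.log(identifierMatchesOrEmpty('fooBar 123 _x')) // [ 'fooBar', '_x' ]
```

Whether such a guard belongs in a function typed as tokenize(text: string) is a design choice; under strict TypeScript the type already rules out null at compile time, which may be one reason to defer it.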

gemini-code-assist · 2026-05-28T15:17:37Z

+export function splitIdentifier(token: string): string[] {
+  const lower = token.toLowerCase()
+  let parts: string[]
+
+  if (token.includes('_')) {
+    // snake_case splitting
+    parts = lower.split('_').filter(p => p.length > 0)
+  }
+  else {
+    // camelCase / PascalCase splitting
+    parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
+  }
+
+  if (parts.length >= 2) {
+    return [lower, ...parts]
+  }
+  return [lower]
+}


For simple lowercase tokens (which make up the vast majority of words in typical text), we can bypass the expensive matchAll regex execution and array allocation entirely. Since TOKEN_RE only matches [a-zA-Z0-9_], any token that does not contain underscores, uppercase letters, or digits consists solely of lowercase letters and cannot be split further. Adding a fast-path check at the beginning of splitIdentifier significantly improves tokenization performance.

export function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
    return [lower]
  }
  let parts: string[]

  if (token.includes('_')) {
    // snake_case splitting
    parts = lower.split('_').filter(p => p.length > 0)
  }
  else {
    // camelCase / PascalCase splitting
    parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  }

  if (parts.length >= 2) {
    return [lower, ...parts]
  }
  return [lower]
}
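
One way to check that the fast path preserves behaviour is to compare it against the original logic on sample tokens. The sketch below does that; the names splitSlow and splitFast and the sample list are illustrative, and the check is a spot test, not a proof.

```typescript
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Logic from the diff above, without the fast path.
function splitSlow(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Reviewer's variant: skip the regex when no '_', uppercase letter, or digit is present.
function splitFast(token: string): string[] {
  if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
    return [token.toLowerCase()]
  }
  return splitSlow(token)
}

// Tokens of the kind TOKEN_RE can produce.
const samples = ['hello', 'x', 'getUserName', 'HTTPServer', 'foo__bar', '_foo', 'abc123Def']
const mismatches = samples.filter(
  t => JSON.stringify(splitSlow(t)) !== JSON.stringify(splitFast(t)),
)
console.log(mismatches.length) // 0

// Outside TOKEN_RE's alphabet the two diverge, so the parity argument
// depends on splitIdentifier only ever seeing TOKEN_RE matches.
console.log(splitSlow('foo-bar')) // [ 'foo-bar', 'foo', 'bar' ]
console.log(splitFast('foo-bar')) // [ 'foo-bar' ]
```

The last two lines illustrate the precondition stated in the review comment: the shortcut is equivalent only because TOKEN_RE never yields characters outside [a-zA-Z0-9_]. A direct caller passing other strings would see different results.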

cubic-dev-ai

No issues found across 2 files

Architecture diagram

sequenceDiagram
    participant User as User Code
    participant Tokens as tokens.ts
    participant RegEx as JavaScript RegExp Engine

    Note over User,RegEx: Tokenization Flow (New Module)

    User->>Tokens: tokenize(text: string)
    Tokens->>RegEx: text.matchAll(TOKEN_RE)
    RegEx-->>Tokens: iterator of [match, ...]

    loop For each identifier match
        Tokens->>Tokens: splitIdentifier(match)
        alt Identifier contains '_'
            Tokens->>Tokens: lower.split('_').filter(p => p.length > 0)
            alt parts.length >= 2
                Tokens-->>Tokens: [lower, ...parts]
            else parts.length < 2
                Tokens-->>Tokens: [lower]
            end
        else No underscore (camelCase/PascalCase)
            Tokens->>RegEx: matchAll(CAMEL_RE) on token
            RegEx-->>Tokens: sub-token matches
            Tokens->>Tokens: Convert sub-tokens to lowercase
            alt parts.length >= 2
                Tokens-->>Tokens: [lower, ...parts]
            else parts.length < 2
                Tokens-->>Tokens: [lower]
            end
        end
        Tokens-->>Tokens: Collect expanded tokens
    end

    Tokens-->>User: Array of lowercase expanded tokens


Add fast-path in splitIdentifier for pure-lowercase tokens to skip unnecessary regex execution and array allocation. Behavior is identical to semble (TOKEN_RE only matches [a-zA-Z0-9_], so absence of _, A-Z, 0-9 means the token cannot split further).

amondnet

Applied 1 suggestion (fast-path in splitIdentifier for pure-lowercase tokens — perf, parity preserved); deferred 1 (null guard in tokenize — see thread reply). All 13 tests still pass.

gemini-code-assist Bot reviewed May 28, 2026


cubic-dev-ai Bot reviewed May 28, 2026


amondnet commented May 28, 2026


amondnet self-assigned this May 28, 2026

amondnet merged commit e619bc2 into main May 28, 2026
1 check passed

amondnet deleted the feat/unit-2-tokens branch May 28, 2026 16:05

This was referenced Jun 18, 2026

chore(main): release 1.0.0 #23 (Closed)
chore(main): release 0.1.0 #41 (Merged)

amondnet merged 2 commits into main from feat/unit-2-tokens
