feat(tokens): port identifier-aware tokenizer from semble#1

Merged: amondnet merged 2 commits into main from feat/unit-2-tokens on May 28, 2026
@amondnet amondnet commented May 28, 2026

Ports src/semble/tokens.py to src/tokens.ts as part of the parallel port effort (Unit 2).

What

  • splitIdentifier(token) — lowered original + camelCase/snake_case sub-tokens.
    Returns only [lower] when fewer than two sub-tokens exist, mirroring the Python behaviour.
  • tokenize(text) — walks identifier matches with TOKEN_RE and expands each via splitIdentifier for BM25 indexing.

How

  • Regexes ported verbatim from upstream:
    • TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
    • CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g
  • The g flag is required because we use String.prototype.matchAll.
  • The snake_case branch filters empty parts (.filter(p => p.length > 0)) to match Python's [p for p in lower.split('_') if p].
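
The two implementation notes above can be checked in isolation. The following is a minimal sketch, not code from the PR: `CAMEL_RE` is copied from the description, while the input `'parseHTTPResponse'` and the variable names are illustrative assumptions.

```typescript
// CAMEL_RE as quoted in the PR description.
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// With the g flag, matchAll yields every sub-token match.
// 'parseHTTPResponse' is an illustrative input, not one taken from the PR.
const camelParts = Array.from(
  'parseHTTPResponse'.matchAll(CAMEL_RE),
  ([m]) => m.toLowerCase(),
)
console.log(camelParts) // ['parse', 'http', 'response']

// Without the g flag, matchAll throws a TypeError instead of returning an iterator.
let threwWithoutG = false
try {
  'fooBar'.matchAll(/[a-z]+/)
}
catch (err) {
  threwWithoutG = err instanceof TypeError
}
console.log(threwWithoutG) // true

// split('_') keeps empty strings between consecutive underscores,
// so they are filtered out, like the `if p` clause in the Python comprehension.
const rawParts = 'foo__bar'.split('_')
const snakeParts = rawParts.filter(p => p.length > 0)
console.log(rawParts) // ['foo', '', 'bar']
console.log(snakeParts) // ['foo', 'bar']
```

The lookahead in the first alternative of `CAMEL_RE` is what keeps an acronym such as `HTTP` together while leaving the final capital to start the next word.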

Tests

bun test src/tokens.test.ts — 13 pass. Covers the canonical semble examples plus edge cases verified against the upstream Python implementation:

  • Consecutive underscores (foo__bar)
  • Leading underscore (_foo collapses to one effective part)
  • Digit runs (abc123Def -> ['abc123def','abc','123','def'])
  • Plain space-separated words
  • Non-identifier digits (123) correctly dropped by TOKEN_RE

Parity for tokenize('camelCase_snake_case 123') was verified directly against Python; the result is ['camelcase_snake_case', 'camelcase', 'snake', 'case']. The original task description's expected value contained extra entries that do not correspond to the semble algorithm, so the test asserts the Python-verified output.
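
The edge cases listed above can be reproduced with a small table-driven check. This is a sketch under assumptions, not the PR's actual test file: `splitIdentifier` follows the code quoted in the review thread, the body of `tokenize` is not shown in this conversation so its shape here is assumed, and the `checks` table and the input `'hello world'` are hypothetical.

```typescript
const TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Same logic as the splitIdentifier quoted in the review thread.
function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Assumed shape: the PR describes tokenize but does not show its body.
function tokenize(text: string): string[] {
  const out: string[] = []
  for (const [match] of text.matchAll(TOKEN_RE)) {
    out.push(...splitIdentifier(match))
  }
  return out
}

// Hypothetical check table mirroring the bullet list in the description.
const checks: Array<{ name: string, actual: string[], expected: string[] }> = [
  { name: 'consecutive underscores', actual: splitIdentifier('foo__bar'), expected: ['foo__bar', 'foo', 'bar'] },
  { name: 'leading underscore', actual: splitIdentifier('_foo'), expected: ['_foo'] },
  { name: 'digit runs', actual: splitIdentifier('abc123Def'), expected: ['abc123def', 'abc', '123', 'def'] },
  { name: 'plain words', actual: tokenize('hello world'), expected: ['hello', 'world'] },
  { name: 'non-identifier digits', actual: tokenize('123'), expected: [] },
  {
    name: 'mixed parity case',
    actual: tokenize('camelCase_snake_case 123'),
    expected: ['camelcase_snake_case', 'camelcase', 'snake', 'case'],
  },
]

const failures = checks
  .filter(c => JSON.stringify(c.actual) !== JSON.stringify(c.expected))
  .map(c => c.name)
console.log(failures.length === 0 ? 'all cases match' : `mismatches: ${failures.join(', ')}`)
```

Note that `_foo` yields a single non-empty part after filtering, so only the lowered original is returned.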

Notes

  • Pure functions, no e2e.
  • No package.json changes.

Summary by cubic

Ports the identifier-aware tokenizer from Python to TypeScript for Unit 2. Adds splitIdentifier and tokenize that expand camelCase/PascalCase/snake_case for BM25 indexing with upstream parity, plus a fast-path for pure lowercase tokens.

  • New Features
    • splitIdentifier(token): returns the lowered original plus sub-tokens; returns only [lower] when fewer than two parts exist.
    • tokenize(text): extracts identifier-like tokens via TOKEN_RE and expands each with splitIdentifier; drops non-identifier digits.

Written for commit 67e3607. Summary will update on new commits.

Ports src/semble/tokens.py to src/tokens.ts:

- splitIdentifier(token): lowered original + camelCase/snake_case
  sub-tokens. Returns only [lower] when fewer than two sub-tokens
  exist, matching the Python behaviour.
- tokenize(text): walks identifier matches with TOKEN_RE and
  expands each via splitIdentifier for BM25 indexing.

Regexes are ported verbatim (TOKEN_RE, CAMEL_RE) with the /g flag
so matchAll works. The snake_case branch filters empty parts to
match Python's '[p for p in lower.split("_") if p]'.

Tests cover the canonical semble examples plus edge cases verified
against the Python implementation (consecutive underscores, leading
underscore, digit runs).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request ports tokenization and identifier-splitting logic from Python to TypeScript, introducing the splitIdentifier and tokenize functions along with their corresponding unit tests. The code reviewer suggested optimizing splitIdentifier by adding a fast-path check for simple lowercase tokens to bypass regex execution, and recommended adding a defensive guard in tokenize to safely handle null or undefined inputs.

Comment thread src/tokens.ts
Comment on lines +19 to +36
export function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  let parts: string[]

  if (token.includes('_')) {
    // snake_case splitting
    parts = lower.split('_').filter(p => p.length > 0)
  }
  else {
    // camelCase / PascalCase splitting
    parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  }

  if (parts.length >= 2) {
    return [lower, ...parts]
  }
  return [lower]
}

medium

For simple lowercase tokens (likely the majority of words in typical text), we can skip the matchAll regex execution and the array allocation entirely. Since TOKEN_RE only matches characters in [a-zA-Z0-9_], any token that contains no underscores, uppercase letters, or digits consists solely of lowercase letters and cannot be split further. A fast-path check at the beginning of splitIdentifier returns [lower] immediately for such tokens, which should reduce per-token work during tokenization.

export function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
    return [lower]
  }
  let parts: string[]

  if (token.includes('_')) {
    // snake_case splitting
    parts = lower.split('_').filter(p => p.length > 0)
  }
  else {
    // camelCase / PascalCase splitting
    parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  }

  if (parts.length >= 2) {
    return [lower, ...parts]
  }
  return [lower]
}
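
To illustrate why the fast path preserves parity, the sketch below compares the two variants over a handful of inputs. The function names `splitWithoutFastPath` and `splitWithFastPath` and the sample list are hypothetical and exist only for this comparison; the logic follows the two versions quoted in this thread.

```typescript
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Original version from the PR, without the fast path.
function splitWithoutFastPath(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Suggested version: return early when the token has no '_', A-Z, or 0-9.
function splitWithFastPath(token: string): string[] {
  if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
    return [token.toLowerCase()]
  }
  return splitWithoutFastPath(token)
}

// Hypothetical sample inputs covering both branches and the fast path.
const samples = ['token', 'x', 'splitIdentifier', 'HTTPServer', 'foo__bar', '_foo', 'abc123Def']

const mismatches = samples.filter(
  s => JSON.stringify(splitWithFastPath(s)) !== JSON.stringify(splitWithoutFastPath(s)),
)
console.log(mismatches.length === 0 ? 'fast path matches on all samples' : mismatches)
```

A pure-lowercase token produces exactly one `CAMEL_RE` match (the token itself), so the slow path would also return only `[lower]`; the early return just skips the work.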

Comment thread src/tokens.ts

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 2 files

Architecture diagram
sequenceDiagram
    participant User as User Code
    participant Tokens as tokens.ts
    participant RegEx as JavaScript RegExp Engine

    Note over User,RegEx: Tokenization Flow (New Module)

    User->>Tokens: tokenize(text: string)
    Tokens->>RegEx: text.matchAll(TOKEN_RE)
    RegEx-->>Tokens: iterator of [match, ...]

    loop For each identifier match
        Tokens->>Tokens: splitIdentifier(match)
        alt Identifier contains '_'
            Tokens->>Tokens: lower.split('_').filter(p => p.length > 0)
            alt parts.length >= 2
                Tokens-->>Tokens: [lower, ...parts]
            else parts.length < 2
                Tokens-->>Tokens: [lower]
            end
        else No underscore (camelCase/PascalCase)
            Tokens->>RegEx: matchAll(CAMEL_RE) on token
            RegEx-->>Tokens: sub-token matches
            Tokens->>Tokens: Convert sub-tokens to lowercase
            alt parts.length >= 2
                Tokens-->>Tokens: [lower, ...parts]
            else parts.length < 2
                Tokens-->>Tokens: [lower]
            end
        end
        Tokens-->>Tokens: Collect expanded tokens
    end

    Tokens-->>User: Array of lowercase expanded tokens

Add fast-path in splitIdentifier for pure-lowercase tokens to skip
unnecessary regex execution and array allocation. Behavior is identical
to semble (TOKEN_RE only matches [a-zA-Z0-9_], so absence of _, A-Z, 0-9
means the token cannot split further).

@amondnet amondnet left a comment

Applied 1 suggestion (fast-path in splitIdentifier for pure-lowercase tokens — perf, parity preserved); deferred 1 (null guard in tokenize — see thread reply). All 13 tests still pass.

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit e619bc2 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-2-tokens branch May 28, 2026 16:05
This was referenced Jun 18, 2026