feat(tokens): port identifier-aware tokenizer from semble#1
Ports src/semble/tokens.py to src/tokens.ts:
- splitIdentifier(token): lowered original + camelCase/snake_case
sub-tokens. Returns only [lower] when fewer than two sub-tokens
exist, matching the Python behaviour.
- tokenize(text): walks identifier matches with TOKEN_RE and
expands each via splitIdentifier for BM25 indexing.
Regexes are ported verbatim (TOKEN_RE, CAMEL_RE) with the /g flag
so matchAll works. The snake_case branch filters empty parts to
match Python's '[p for p in lower.split("_") if p]'.
Tests cover the canonical semble examples plus edge cases verified
against the Python implementation (consecutive underscores, leading
underscore, digit runs).
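The behaviour described above can be sketched as follows. This is a minimal illustration, not the code from src/tokens.ts: the two regexes are copied from the PR description, while the body of tokenize is an assumption based on the prose, since the diff shown on this page only includes splitIdentifier.

```typescript
// Regexes as quoted in the PR description; /g is needed for matchAll.
const TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Lowered original plus sub-tokens; only [lower] when fewer than two parts.
function splitIdentifier(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    // snake_case: drop empty parts produced by leading or doubled underscores
    ? lower.split('_').filter(p => p.length > 0)
    // camelCase / PascalCase: each CAMEL_RE match is one sub-token
    : Array.from(token.matchAll(CAMEL_RE), m => m[0].toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Assumed shape of tokenize: expand every TOKEN_RE match via splitIdentifier.
function tokenize(text: string): string[] {
  const out: string[] = []
  for (const m of text.matchAll(TOKEN_RE)) {
    out.push(...splitIdentifier(m[0]))
  }
  return out
}

console.log(splitIdentifier('foo__bar'))
// ['foo__bar', 'foo', 'bar']
console.log(splitIdentifier('_foo'))
// ['_foo']
console.log(splitIdentifier('abc123Def'))
// ['abc123def', 'abc', '123', 'def']
console.log(tokenize('camelCase_snake_case 123'))
// ['camelcase_snake_case', 'camelcase', 'snake', 'case']
```

The bare 123 never reaches splitIdentifier because TOKEN_RE requires a letter or underscore as the first character.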
Code Review
This pull request ports tokenization and identifier-splitting logic from Python to TypeScript, introducing the splitIdentifier and tokenize functions along with their corresponding unit tests. The code reviewer suggested optimizing splitIdentifier by adding a fast-path check for simple lowercase tokens to bypass regex execution, and recommended adding a defensive guard in tokenize to safely handle null or undefined inputs.
    export function splitIdentifier(token: string): string[] {
      const lower = token.toLowerCase()
      let parts: string[]

      if (token.includes('_')) {
        // snake_case splitting
        parts = lower.split('_').filter(p => p.length > 0)
      }
      else {
        // camelCase / PascalCase splitting
        parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
      }

      if (parts.length >= 2) {
        return [lower, ...parts]
      }
      return [lower]
    }
For simple lowercase tokens (likely the majority of words in typical text), we can skip the matchAll regex execution and its array allocation entirely. TOKEN_RE only matches [a-zA-Z0-9_], so any token that contains no underscores, uppercase letters, or digits consists solely of lowercase letters and cannot be split further. Adding a fast-path check at the beginning of splitIdentifier avoids that work for such tokens and should improve tokenization performance.
    export function splitIdentifier(token: string): string[] {
      const lower = token.toLowerCase()
      if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
        return [lower]
      }
      let parts: string[]
      if (token.includes('_')) {
        // snake_case splitting
        parts = lower.split('_').filter(p => p.length > 0)
      }
      else {
        // camelCase / PascalCase splitting
        parts = Array.from(token.matchAll(CAMEL_RE), ([m]) => m.toLowerCase())
      }
      if (parts.length >= 2) {
        return [lower, ...parts]
      }
      return [lower]
    }
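One way to check that the suggested fast path does not change results is to run both variants over a handful of identifiers and compare their output. The sketch below is only an illustration: splitSlow, splitFast, and the sample list are hypothetical names chosen here, and CAMEL_RE is copied from the PR description.

```typescript
const CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g

// Original logic from the diff, without the fast path.
function splitSlow(token: string): string[] {
  const lower = token.toLowerCase()
  const parts = token.includes('_')
    ? lower.split('_').filter(p => p.length > 0)
    : Array.from(token.matchAll(CAMEL_RE), m => m[0].toLowerCase())
  return parts.length >= 2 ? [lower, ...parts] : [lower]
}

// Same logic with the reviewer's early return for pure-lowercase tokens.
function splitFast(token: string): string[] {
  if (!token.includes('_') && !/[A-Z0-9]/.test(token)) {
    return [token.toLowerCase()]
  }
  return splitSlow(token)
}

// Hypothetical sample set mixing fast-path and slow-path inputs.
const samples = ['foo', 'x', 'fooBar', 'HTTPServer', 'foo_bar', 'foo__bar', '_foo', 'abc123Def', 'a1']

// Collect every sample where the two variants disagree.
const mismatches = samples.filter(
  s => JSON.stringify(splitSlow(s)) !== JSON.stringify(splitFast(s)),
)

console.log(mismatches.length === 0 ? 'parity holds on samples' : mismatches)
```

A sample list like this only spot-checks parity; the argument in the review comment (no underscore, uppercase letter, or digit means at most one CAMEL_RE match) is what covers the general case.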
No issues found across 2 files
Architecture diagram
sequenceDiagram
participant User as User Code
participant Tokens as tokens.ts
participant RegEx as JavaScript RegExp Engine
Note over User,RegEx: Tokenization Flow (New Module)
User->>Tokens: tokenize(text: string)
Tokens->>RegEx: text.matchAll(TOKEN_RE)
RegEx-->>Tokens: iterator of [match, ...]
loop For each identifier match
Tokens->>Tokens: splitIdentifier(match)
alt Identifier contains '_'
Tokens->>Tokens: lower.split('_').filter(p => p.length > 0)
alt parts.length >= 2
Tokens-->>Tokens: [lower, ...parts]
else parts.length < 2
Tokens-->>Tokens: [lower]
end
else No underscore (camelCase/PascalCase)
Tokens->>RegEx: matchAll(CAMEL_RE) on token
RegEx-->>Tokens: sub-token matches
Tokens->>Tokens: Convert sub-tokens to lowercase
alt parts.length >= 2
Tokens-->>Tokens: [lower, ...parts]
else parts.length < 2
Tokens-->>Tokens: [lower]
end
end
Tokens-->>Tokens: Collect expanded tokens
end
Tokens-->>User: Array of lowercase expanded tokens
Add fast-path in splitIdentifier for pure-lowercase tokens to skip unnecessary regex execution and array allocation. Behavior is identical to semble (TOKEN_RE only matches [a-zA-Z0-9_], so absence of _, A-Z, 0-9 means the token cannot split further).
amondnet left a comment
Applied 1 suggestion (fast-path in splitIdentifier for pure-lowercase tokens — perf, parity preserved); deferred 1 (null guard in tokenize — see thread reply). All 13 tests still pass.
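For reference, the deferred null guard could take a shape like the sketch below if it were adopted later. It is not part of this PR: withNullGuard and demoTokenize are hypothetical names, and demoTokenize is a simplified stand-in rather than the real tokenize.

```typescript
type Tokenizer = (text: string) => string[]

// Wraps any tokenizer so that null or undefined input yields an empty list.
function withNullGuard(
  tokenizeFn: Tokenizer,
): (text: string | null | undefined) => string[] {
  return text => (text == null ? [] : tokenizeFn(text))
}

// Stand-in tokenizer for the demo only: lowercased identifier-like runs.
const demoTokenize: Tokenizer = text =>
  text.toLowerCase().match(/[a-z_][a-z0-9_]*/g) ?? []

const guardedTokenize = withNullGuard(demoTokenize)

console.log(guardedTokenize(null))
// []
console.log(guardedTokenize('foo Bar'))
// ['foo', 'bar']
```

Keeping the guard in a wrapper leaves the ported function's signature identical to the Python original, which is one possible reason to defer it.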
Ports src/semble/tokens.py to src/tokens.ts as part of the parallel port effort (Unit 2).

What
- splitIdentifier(token): lowered original + camelCase/snake_case sub-tokens. Returns only [lower] when fewer than two sub-tokens exist, mirroring the Python behaviour.
- tokenize(text): walks identifier matches with TOKEN_RE and expands each via splitIdentifier for BM25 indexing.

How
- TOKEN_RE = /[a-zA-Z_][a-zA-Z0-9_]*/g
- CAMEL_RE = /[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g
- The g flag is required because we use String.prototype.matchAll.
- The snake_case branch filters empty parts (.filter(p => p.length > 0)) to match Python's [p for p in lower.split('_') if p].

Tests
bun test src/tokens.test.ts: 13 pass. Covers the canonical semble examples plus edge cases verified against the upstream Python implementation:
- consecutive underscores (foo__bar)
- leading underscore (_foo collapses to one effective part)
- digit runs (abc123Def -> ['abc123def', 'abc', '123', 'def'])
- non-identifier digits (123) correctly dropped by TOKEN_RE

Parity for tokenize('camelCase_snake_case 123') was verified directly against Python and is ['camelcase_snake_case', 'camelcase', 'snake', 'case']. The original task description's expected value contained extra entries that do not correspond to the semble algorithm; the test asserts the Python-verified output.

Notes
- … package.json changes.

Summary by cubic
Ports the identifier-aware tokenizer from Python to TypeScript for Unit 2. Adds splitIdentifier and tokenize that expand camelCase/PascalCase/snake_case for BM25 indexing with upstream parity, plus a fast-path for pure lowercase tokens.
- splitIdentifier(token): returns the lowered original plus sub-tokens; returns only [lower] when fewer than two parts exist.
- tokenize(text): extracts identifier-like tokens via TOKEN_RE and expands each with splitIdentifier; drops non-identifier digits.

Written for commit 67e3607. Summary will update on new commits.