feat(indexing): port extension→language detection from semble#3
Port `src/semble/index/files.py` to TypeScript at `src/indexing/files.ts`.

Exports:
- `EXTENSION_TO_LANGUAGE`: full 350+ entry record (no abbreviation)
- `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`, `ALL_LANGUAGES` (`ReadonlySet`)
- `detectLanguage(fileName)`: mirrors Python's `Path(name).suffix.lower()` lookup (case-insensitive and dotfile-aware: `'.gitignore'` → `undefined`, like Python)
- `getExtensions(types, extensions)`: unions content-type extensions with the user-provided list and returns a sorted, deduplicated array

Includes 22 `bun:test` cases covering common languages, case-insensitivity, dotfile semantics, multi-dot filenames, content-type unions, and set non-emptiness invariants.
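The suffix lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/indexing/files.ts`: `SAMPLE_EXTENSION_TO_LANGUAGE`, `suffixLower`, and `detectLanguageSketch` are hypothetical names, the three-entry map stands in for the real 350+ entry record, and the input is assumed to be a bare file name with no directory part.

```typescript
// Hypothetical three-entry sample; the real EXTENSION_TO_LANGUAGE has 350+ entries.
const SAMPLE_EXTENSION_TO_LANGUAGE: Readonly<Record<string, string>> = {
  ".ts": "typescript",
  ".py": "python",
  ".md": "markdown",
};

// Approximates Python's Path(name).suffix.lower() for a bare file name.
function suffixLower(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  // No dot, or only a leading dot (".gitignore"), means there is no suffix.
  if (dot <= 0) return "";
  return fileName.slice(dot).toLowerCase();
}

// Returns the language for a file name, or undefined when the suffix is unknown or empty.
function detectLanguageSketch(fileName: string): string | undefined {
  const suffix = suffixLower(fileName);
  return suffix === "" ? undefined : SAMPLE_EXTENSION_TO_LANGUAGE[suffix];
}
```

Because only the last dot matters, a multi-dot name such as `archive.test.py` is looked up under `.py`, and `Foo.TS` matches `.ts` after lowercasing.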
Code Review
This pull request introduces file extension mapping and language detection utilities in src/indexing/files.ts, along with comprehensive tests in src/indexing/files.test.ts. Feedback highlights a cross-platform compatibility issue in detectLanguage where Windows backslash path separators (\) are not handled, potentially leading to incorrect extension parsing. It is recommended to support both path separators and add corresponding test cases to prevent regressions.
1 issue found across 2 files
Architecture diagram
sequenceDiagram
participant Caller as Caller (e.g., indexer)
participant Detect as detectLanguage(fileName)
participant ExtMap as EXTENSION_TO_LANGUAGE
participant GetExt as getExtensions(types, extras)
participant TypeSets as CONTENT_TYPE_LANGUAGES
Note over Caller,TypeSets: Language Detection Flow (NEW)
Caller->>Detect: detectLanguage("foo.ts")
Detect->>Detect: Extract suffix (".ts"), lowercase
Detect->>ExtMap: Lookup ".ts"
ExtMap-->>Detect: "typescript"
Detect-->>Caller: "typescript"
alt Unknown extension
Caller->>Detect: detectLanguage("foo.xyz")
Detect->>Detect: Extract suffix ".xyz", lowercase
Detect->>ExtMap: Lookup ".xyz"
ExtMap-->>Detect: undefined
Detect-->>Caller: undefined
end
alt Dotfile (e.g., ".gitignore")
Caller->>Detect: detectLanguage(".gitignore")
Detect->>Detect: Extract suffix "" (empty)
Detect->>ExtMap: Lookup ""
ExtMap-->>Detect: undefined
Detect-->>Caller: undefined
end
Note over Caller,TypeSets: Extension Resolution Flow (NEW)
Caller->>GetExt: getExtensions(["code"], undefined)
GetExt->>TypeSets: Lookup "code" set
TypeSets-->>GetExt: CODE_LANGUAGES
GetExt->>GetExt: Map languages to extensions via LANGUAGE_TO_EXTENSIONS
GetExt->>GetExt: Sort and deduplicate
GetExt-->>Caller: [".go", ".py", ".ts", ...]
alt With user-provided extensions
Caller->>GetExt: getExtensions(["config"], [".custom"])
GetExt->>TypeSets: Lookup "config" set
TypeSets-->>GetExt: CONFIG_LANGUAGES
GetExt->>GetExt: Merge ".custom" into extension list
GetExt->>GetExt: Sort and deduplicate
GetExt-->>Caller: [".custom", ".toml", ".yaml", ...]
end
alt Union multiple content types
Caller->>GetExt: getExtensions(["code", "docs"], undefined)
GetExt->>TypeSets: Lookup "code" and "docs" sets
TypeSets-->>GetExt: CODE_LANGUAGES, DOC_LANGUAGES
GetExt->>GetExt: Union language sets, map to extensions
GetExt->>GetExt: Sort and deduplicate
GetExt-->>Caller: [".md", ".py", ".ts", ...]
end
Note over Caller,TypeSets: Internal Set Computation (at module init)
Note over TypeSets: CODE_LANGUAGES = ALL_LANGUAGES \ DOC \ CONFIG \ DATA
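The extension resolution flow and the set difference in the diagram can be sketched as below. This is an illustration under stated assumptions, not the shipped module: every `SAMPLE_*` constant and `getExtensionsSketch` are hypothetical names, the eight-entry map stands in for the real record, and the language-to-content-type assignments are chosen only for the example.

```typescript
type ContentType = "code" | "docs" | "config";

// Hypothetical eight-entry sample; the real map has 350+ entries.
const SAMPLE_EXT_TO_LANG: Readonly<Record<string, string>> = {
  ".ts": "typescript",
  ".py": "python",
  ".go": "go",
  ".md": "markdown",
  ".toml": "toml",
  ".yaml": "yaml",
  ".yml": "yaml",
  ".csv": "csv",
};

// Assumed category assignments, for illustration only.
const SAMPLE_DOC: ReadonlySet<string> = new Set(["markdown"]);
const SAMPLE_CONFIG: ReadonlySet<string> = new Set(["toml", "yaml"]);
const SAMPLE_DATA: ReadonlySet<string> = new Set(["csv"]);
const SAMPLE_ALL: ReadonlySet<string> = new Set(
  Object.keys(SAMPLE_EXT_TO_LANG).map((ext) => SAMPLE_EXT_TO_LANG[ext]),
);

// CODE = ALL \ DOC \ CONFIG \ DATA, computed once at module init.
const SAMPLE_CODE: ReadonlySet<string> = new Set(
  Array.from(SAMPLE_ALL).filter(
    (lang) => !SAMPLE_DOC.has(lang) && !SAMPLE_CONFIG.has(lang) && !SAMPLE_DATA.has(lang),
  ),
);

const SAMPLE_BY_TYPE: Record<ContentType, ReadonlySet<string>> = {
  code: SAMPLE_CODE,
  docs: SAMPLE_DOC,
  config: SAMPLE_CONFIG,
};

function getExtensionsSketch(
  types: readonly ContentType[],
  extensions?: readonly string[],
): string[] {
  // Union the language sets of every requested content type.
  const languages = new Set<string>();
  types.forEach((t) => SAMPLE_BY_TYPE[t].forEach((lang) => languages.add(lang)));

  // Start from the user-provided extensions; the Set removes duplicates.
  const result = new Set<string>(extensions ?? []);
  Object.keys(SAMPLE_EXT_TO_LANG).forEach((ext) => {
    if (languages.has(SAMPLE_EXT_TO_LANG[ext])) result.add(ext);
  });
  return Array.from(result).sort();
}
```

With this sample data, `getExtensionsSketch(["config"], [".custom"])` yields `[".custom", ".toml", ".yaml", ".yml"]`. Collecting into a `Set` before sorting keeps deduplication independent of input order.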
amondnet left a comment
Applied 3 (Windows path separator support in `detectLanguage` plus matching tests), deferred 0. All 3 bot comments addressed the same root cause: `detectLanguage` only handled `/` and would misclassify Windows-style paths. The fix uses `Math.max(lastIndexOf('/'), lastIndexOf('\\'))` to find the last separator, so `dir\.gitignore` resolves to `undefined` and `C:\Users\me\foo.py` returns `'python'`, mirroring `pathlib.Path` on Windows. 23/23 tests pass.
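A minimal sketch of the separator handling described in this comment, assuming the base name is isolated before the suffix is taken. `baseNameSketch` and `suffixOfPathSketch` are hypothetical helper names, not functions from the PR.

```typescript
// Strip everything up to and including the last "/" or "\" separator.
function baseNameSketch(path: string): string {
  // lastIndexOf returns -1 when a separator is absent, so Math.max picks the
  // later of the two separators, or -1 when the path has neither.
  const cut = Math.max(path.lastIndexOf("/"), path.lastIndexOf("\\"));
  return path.slice(cut + 1);
}

// Lowercased suffix of the base name; empty for dotfiles and names without a dot.
function suffixOfPathSketch(path: string): string {
  const name = baseNameSketch(path);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot).toLowerCase() : "";
}
```

Treating `\` as a separator everywhere matches Windows path semantics. On POSIX a backslash is a legal file name character, so this choice trades exact POSIX fidelity for consistent results across platforms.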
Port of `src/semble/index/files.py` → `src/indexing/files.ts`. This is Unit 4 of the parallel TS port effort.
What ships
Conventions preserved
Tests
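`bun test src/indexing/files.test.ts`: 22 pass, 0 fail. Cases cover common languages, case-insensitivity, dotfile semantics, multi-dot filenames, content-type unions, and set non-emptiness invariants.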
Out of scope
Summary by cubic
Ports `src/semble/index/files.py` to TypeScript as `src/indexing/files.ts` for consistent extension→language detection in indexing. Mirrors Python behavior and ships with 22 unit tests.

- `EXTENSION_TO_LANGUAGE` mapping (350+ entries).
- `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`, `ALL_LANGUAGES` as `ReadonlySet<string>`.
- `detectLanguage(fileName)` with case-insensitive suffix, dotfile-safe handling, and POSIX/Windows path support.
- `getExtensions(types, extensions)` returns a sorted, deduped union; includes inline `ContentType = 'code' | 'docs' | 'config'`.

Written for commit 7d4a229. Summary will update on new commits.