feat(indexing): port extension→language detection from semble#3

Merged
amondnet merged 2 commits into main from feat/unit-4-language-detection
May 28, 2026

Conversation

@amondnet amondnet commented May 28, 2026

Contributor

Port of `src/semble/index/files.py` → `src/indexing/files.ts`. This is Unit 4 of the parallel TS port effort.

What ships

  • `EXTENSION_TO_LANGUAGE` — full 350+ entry record with no entries skipped; the commented-out `.txt` mapping is preserved as a TS comment.
  • `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`, `ALL_LANGUAGES` as `ReadonlySet` (mirroring Python `frozenset`).
  • `detectLanguage(fileName: string): string | undefined` — mirrors Python's `Path(name).suffix.lower()` lookup. Case-insensitive on the suffix; for dotfiles like `.gitignore` it returns `undefined` (Python's `Path('.gitignore').suffix` is `''`).
  • `getExtensions(types, extensions)` — resolves `ContentType[]` to a sorted, deduplicated extension list, optionally unioning user-provided extensions.
  • Inline `ContentType = 'code' | 'docs' | 'config'` (will be re-exported from `src/types.ts` once that lands).
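The suffix rule described for `detectLanguage` can be sketched roughly as follows. This is a minimal illustration, not the code in `src/indexing/files.ts`: the three-entry map is a hypothetical stand-in for the real record, the `.md` value is a guess, and only `/` is treated as a path separator here.

```typescript
// Hypothetical, heavily trimmed stand-in for the real 350+ entry record.
// The ".md" value is an illustrative guess, not taken from the PR.
const EXTENSION_TO_LANGUAGE: Readonly<Record<string, string>> = {
  ".ts": "typescript",
  ".py": "python",
  ".md": "markdown",
};

function detectLanguage(fileName: string): string | undefined {
  // Keep only the final path component (this sketch splits on "/" only).
  const base = fileName.slice(fileName.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  // dot === -1: no extension at all (e.g. "Makefile").
  // dot === 0: dotfile such as ".gitignore", which has no suffix.
  if (dot <= 0) return undefined;
  // Lowercase the suffix so "Foo.TS" and "foo.ts" resolve identically.
  return EXTENSION_TO_LANGUAGE[base.slice(dot).toLowerCase()];
}
```

Under these assumptions, `detectLanguage("Foo.TS")` and `detectLanguage("foo.bar.ts")` both resolve to `typescript`, while `Makefile` and `.gitignore` resolve to `undefined`.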

Conventions preserved

  • The internal `CODE_LANGUAGES` set is computed as `ALL − DOCS − CONFIG − DATA`, matching the upstream definition. `CONTENT_TYPE_LANGUAGES` keys map to `code`/`docs`/`config` for parity with `ContentType.CODE | DOCS | CONFIG`.
  • Sort uses default lexicographic comparison to match Python's `sorted(set)`.
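The set difference and the sorted union can be sketched as below. The tables are hypothetical miniatures, deriving `ALL_LANGUAGES` from the map's values is an assumption, and the body of `getExtensions` is one plausible reading of the description rather than the merged implementation.

```typescript
type ContentType = "code" | "docs" | "config";

// Hypothetical miniature tables; the real module ships 350+ entries.
const EXTENSION_TO_LANGUAGE: Readonly<Record<string, string>> = {
  ".ts": "typescript",
  ".py": "python",
  ".md": "markdown",
  ".toml": "toml",
  ".yaml": "yaml",
  ".csv": "csv",
};
const DOC_LANGUAGES: ReadonlySet<string> = new Set(["markdown"]);
const CONFIG_LANGUAGES: ReadonlySet<string> = new Set(["toml", "yaml"]);
const DATA_LANGUAGES: ReadonlySet<string> = new Set(["csv"]);
// Assumption: ALL_LANGUAGES is the set of values appearing in the map.
const ALL_LANGUAGES: ReadonlySet<string> = new Set(
  Object.keys(EXTENSION_TO_LANGUAGE).map((ext) => EXTENSION_TO_LANGUAGE[ext]),
);

// CODE = ALL − DOCS − CONFIG − DATA
const CODE_LANGUAGES: ReadonlySet<string> = new Set(
  Array.from(ALL_LANGUAGES).filter(
    (lang) =>
      !DOC_LANGUAGES.has(lang) &&
      !CONFIG_LANGUAGES.has(lang) &&
      !DATA_LANGUAGES.has(lang),
  ),
);

const CONTENT_TYPE_LANGUAGES: Readonly<Record<ContentType, ReadonlySet<string>>> = {
  code: CODE_LANGUAGES,
  docs: DOC_LANGUAGES,
  config: CONFIG_LANGUAGES,
};

function getExtensions(
  types: readonly ContentType[],
  extensions?: readonly string[],
): string[] {
  // Union the language sets of every requested content type.
  const languages = new Set<string>();
  types.forEach((t) => CONTENT_TYPE_LANGUAGES[t].forEach((l) => languages.add(l)));
  // Start from the user-provided extensions; the Set deduplicates.
  const result = new Set<string>(extensions ?? []);
  Object.keys(EXTENSION_TO_LANGUAGE).forEach((ext) => {
    if (languages.has(EXTENSION_TO_LANGUAGE[ext])) result.add(ext);
  });
  // Default sort with no comparator, as the PR describes.
  return Array.from(result).sort();
}
```

With these toy tables, `getExtensions(["config"], [".custom"])` yields `[".custom", ".toml", ".yaml"]`, and `.csv` never appears under `code` because its language sits in `DATA_LANGUAGES`.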

Tests

`bun test src/indexing/files.test.ts` — 22 pass, 0 fail. Cases:

  • Common extensions (`.ts`, `.tsx`, `.py`, `.md`).
  • Case-insensitivity (`Foo.TS` → `typescript`).
  • Multi-dot filenames (`foo.bar.ts` → `typescript`).
  • Path with directory separators (`src/indexing/files.ts`).
  • Files without an extension (`Makefile` → undefined).
  • Dotfile semantics (`.gitignore` → undefined, matching Python).
  • `getExtensions` for each content type, with and without user extensions, including union across types.
  • All exported sets are non-empty, and every value in `EXTENSION_TO_LANGUAGE` is in `ALL_LANGUAGES`.

Out of scope

  • No changes to `package.json`.
  • No e2e tests (per the unit brief).

Summary by cubic

Ports src/semble/index/files.py to TypeScript as src/indexing/files.ts for consistent extension→language detection in indexing. Mirrors Python behavior and ships with 22 unit tests.

  • New Features
    • EXTENSION_TO_LANGUAGE mapping (350+ entries).
    • DOC_LANGUAGES, CONFIG_LANGUAGES, DATA_LANGUAGES, ALL_LANGUAGES as ReadonlySet<string>.
    • detectLanguage(fileName) with case-insensitive suffix, dotfile-safe handling, and POSIX/Windows path support.
    • getExtensions(types, extensions) returns a sorted, deduped union; includes inline ContentType = 'code' | 'docs' | 'config'.

Written for commit 7d4a229. Summary will update on new commits.

Port src/semble/index/files.py to TypeScript at src/indexing/files.ts.

Exports:
- EXTENSION_TO_LANGUAGE: full 350+ entry record (no abbreviation)
- DOC_LANGUAGES, CONFIG_LANGUAGES, DATA_LANGUAGES, ALL_LANGUAGES (ReadonlySet)
- detectLanguage(fileName): mirrors Python's Path(name).suffix.lower() lookup
  (case-insensitive, dotfile-aware — '.gitignore' → undefined like Python)
- getExtensions(types, extensions): unions content-type extensions with the
  user-provided list and returns a sorted, deduplicated array

Includes 22 bun:test cases covering common languages, case-insensitivity,
dotfile semantics, multi-dot filenames, content-type unions, and set
non-emptiness invariants.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces file extension mapping and language detection utilities in src/indexing/files.ts, along with comprehensive tests in src/indexing/files.test.ts. Feedback highlights a cross-platform compatibility issue in detectLanguage where Windows backslash path separators (\) are not handled, potentially leading to incorrect extension parsing. It is recommended to support both path separators and add corresponding test cases to prevent regressions.

Comment thread src/indexing/files.ts Outdated
Comment thread src/indexing/files.test.ts

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files

Architecture diagram
sequenceDiagram
    participant Caller as Caller (e.g., indexer)
    participant Detect as detectLanguage(fileName)
    participant ExtMap as EXTENSION_TO_LANGUAGE
    participant GetExt as getExtensions(types, extras)
    participant TypeSets as CONTENT_TYPE_LANGUAGES

    Note over Caller,TypeSets: Language Detection Flow (NEW)

    Caller->>Detect: detectLanguage("foo.ts")
    Detect->>Detect: Extract suffix (".ts"), lowercase
    Detect->>ExtMap: Lookup ".ts"
    ExtMap-->>Detect: "typescript"
    Detect-->>Caller: "typescript"

    alt Unknown extension
        Caller->>Detect: detectLanguage("foo.xyz")
        Detect->>Detect: Extract suffix ".xyz", lowercase
        Detect->>ExtMap: Lookup ".xyz"
        ExtMap-->>Detect: undefined
        Detect-->>Caller: undefined
    end

    alt Dotfile (e.g., ".gitignore")
        Caller->>Detect: detectLanguage(".gitignore")
        Detect->>Detect: Extract suffix "" (empty)
        Detect->>ExtMap: Lookup ""
        ExtMap-->>Detect: undefined
        Detect-->>Caller: undefined
    end

    Note over Caller,TypeSets: Extension Resolution Flow (NEW)

    Caller->>GetExt: getExtensions(["code"], undefined)
    GetExt->>TypeSets: Lookup "code" set
    TypeSets-->>GetExt: CODE_LANGUAGES
    GetExt->>GetExt: Map languages to extensions via LANGUAGE_TO_EXTENSIONS
    GetExt->>GetExt: Sort and deduplicate
    GetExt-->>Caller: [".ts", ".py", ".go", ...]

    alt With user-provided extensions
        Caller->>GetExt: getExtensions(["config"], [".custom"])
        GetExt->>TypeSets: Lookup "config" set
        TypeSets-->>GetExt: CONFIG_LANGUAGES
        GetExt->>GetExt: Merge ".custom" into extension list
        GetExt->>GetExt: Sort and deduplicate
        GetExt-->>Caller: [".custom", ".toml", ".yaml", ...]
    end

    alt Union multiple content types
        Caller->>GetExt: getExtensions(["code", "docs"], undefined)
        GetExt->>TypeSets: Lookup "code" and "docs" sets
        TypeSets-->>GetExt: CODE_LANGUAGES, DOC_LANGUAGES
        GetExt->>GetExt: Union language sets, map to extensions
        GetExt->>GetExt: Sort and deduplicate
        GetExt-->>Caller: [".md", ".py", ".ts", ...]
    end

    Note over Caller,TypeSets: Internal Set Computation (at module init)
    Note over TypeSets: CODE_LANGUAGES = ALL_LANGUAGES \ DOC \ CONFIG \ DATA
Comment thread src/indexing/files.ts Outdated

@amondnet amondnet left a comment

Contributor Author

Applied 3 (Windows path separator support in detectLanguage + matching tests), deferred 0. All 3 bot comments pointed to the same root cause: detectLanguage only handled '/' and would misclassify Windows-style paths. The fix uses `Math.max(lastIndexOf('/'), lastIndexOf('\\'))` to find the last separator of either kind, so '.gitignore' behind 'dir\' resolves to undefined and 'C:\Users\me\foo.py' returns 'python', mirroring pathlib.Path on Windows. 23/23 tests pass.
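A rough sketch of the separator handling described in this comment, assuming a hypothetical helper named `suffixOf` (the merged code may structure this differently):

```typescript
// Hypothetical helper: returns the lowercased suffix of the final path
// component, or "" when there is none (no dot, or a leading-dot dotfile).
function suffixOf(path: string): string {
  // Treat both "/" and "\" as separators; whichever occurs last wins.
  // When neither is present both calls return -1, so the slice starts at 0.
  const lastSep = Math.max(path.lastIndexOf("/"), path.lastIndexOf("\\"));
  const base = path.slice(lastSep + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}
```

Here `suffixOf("C:\\Users\\me\\foo.py")` gives `.py` and `suffixOf("dir\\.gitignore")` gives an empty string. One trade-off of this approach: a backslash is a legal filename character on POSIX, so treating it as a separator everywhere follows Windows path semantics rather than POSIX ones.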

@amondnet amondnet self-assigned this May 28, 2026
@amondnet amondnet merged commit 3c74752 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-4-language-detection branch May 28, 2026 16:05
This was referenced Jun 18, 2026