Rust port: expand tree-sitter grammar coverage to match upstream language pack #38

@amondnet

Context

PR #37 closed two upstream-parity gaps in the Rust port (crates/csp): the ranking pipeline is now wired (TD-002) and the chunk length matches upstream (750). The remaining open gap tracked in semble.md §6.2 is tree-sitter grammar coverage.

Problem

language_for in crates/csp/src/chunking/core.rs statically links only 14 grammars:

rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash, html, css

Upstream semble uses tree_sitter_language_pack, which covers nearly all languages. Meanwhile, EXTENSION_TO_LANGUAGE (crates/csp/src/indexing/files.rs, ~350 entries) recognizes far more languages than the curated grammar set.

Effect: a file in a recognized-but-uncurated language (on the Rust side, e.g. kotlin, swift, php, scala, lua, …) is still walked and indexed, but it falls through to line-based chunking instead of AST chunking. The result is coarser, less semantically aligned chunk boundaries than upstream produces. This is a real behavioral narrowing relative to upstream, not just missing recognition.
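
The fallback described above can be illustrated with a minimal, self-contained sketch. It is not the code in core.rs: `Grammar` is a hypothetical stand-in for `tree_sitter::Language`, `ChunkStrategy` and `strategy_for` are names invented for illustration, and the signature of `language_for` is assumed. Only the 14 language names come from the list above.

```rust
/// Hypothetical stand-in for `tree_sitter::Language`, so the sketch
/// compiles without any grammar crates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Grammar(pub &'static str);

/// Hypothetical: which chunker a file ends up with.
#[derive(Debug, PartialEq, Eq)]
pub enum ChunkStrategy {
    /// A grammar is linked, so chunk along AST node boundaries.
    Ast(Grammar),
    /// No grammar is linked, so fall back to line-based chunking.
    Lines,
}

/// The curated set listed in this issue.
const CURATED: [&str; 14] = [
    "rust", "python", "javascript", "typescript", "tsx", "go", "java",
    "c", "cpp", "ruby", "json", "bash", "html", "css",
];

/// Assumed shape of `language_for`: `Some` only for curated grammars.
pub fn language_for(name: &str) -> Option<Grammar> {
    CURATED.iter().copied().find(|&c| c == name).map(Grammar)
}

/// A recognized language without a grammar is still indexed,
/// but only with the coarser line-based strategy.
pub fn strategy_for(name: &str) -> ChunkStrategy {
    match language_for(name) {
        Some(grammar) => ChunkStrategy::Ast(grammar),
        None => ChunkStrategy::Lines,
    }
}
```

Under these assumptions, `strategy_for("rust")` selects AST chunking, while `strategy_for("kotlin")` silently selects the line-based fallback even though the extension table recognizes the language.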

Proposed work

  1. Decide the target set: full language-pack parity vs an expanded curated set (weigh binary size / build time of pulling in many tree-sitter-* crates).
  2. Add the chosen grammar crates as deps and extend the language_for match arms (+ keep is_supported_language in sync).
  3. Add chunking tests for the newly-AST-supported languages (mirror the existing core.rs grammar tests).
  4. Update semble.md §4.3 / §6.2 and the constants/coverage notes when the gap closes.
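
As a rough sketch of steps 1 to 3, one option is to derive is_supported_language from language_for so the two cannot drift, and to compute the recognized-but-uncurated set from the extension table when deciding the target set. Everything here is an assumption for illustration: the `Grammar` enum stands in for `tree_sitter::Language`, the `kotlin` and `swift` arms are an example target set rather than a decision, `uncovered` is a hypothetical helper, and the real signatures in crates/csp may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `tree_sitter::Language`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grammar {
    Rust,
    Python,
    Kotlin,
    Swift,
}

/// Assumed shape of `language_for`, with two illustrative new arms.
pub fn language_for(name: &str) -> Option<Grammar> {
    match name {
        "rust" => Some(Grammar::Rust),
        "python" => Some(Grammar::Python),
        // Example additions only; the real target set is still to be decided.
        "kotlin" => Some(Grammar::Kotlin),
        "swift" => Some(Grammar::Swift),
        _ => None,
    }
}

/// Derived from `language_for`, so adding a match arm updates both.
pub fn is_supported_language(name: &str) -> bool {
    language_for(name).is_some()
}

/// Hypothetical audit helper: languages that an (extension, language)
/// table recognizes but that have no grammar linked.
pub fn uncovered(
    ext_table: &[(&'static str, &'static str)],
) -> BTreeSet<&'static str> {
    ext_table
        .iter()
        .map(|&(_, lang)| lang)
        .filter(|lang| !is_supported_language(lang))
        .collect()
}
```

Running `uncovered` over the real EXTENSION_TO_LANGUAGE entries would give the candidate list for step 1, and a test asserting that the set is empty for the chosen target languages would cover part of step 3.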

Acceptance criteria

  • Target grammar set decided and documented (with rationale on binary-size trade-off).
  • The selected languages return Some(Language) from language_for and are AST-chunked (verified by tests) rather than falling back to line-based chunking.
  • Quality gate green: cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings && cargo test --workspace.
  • semble.md §6.2 item removed / updated.

References

  • .please/docs/references/semble.md §4.3 (chunking) and §6.2 (open gaps)
  • ADR-0001 — native tree-sitter
  • Source of truth: Python upstream MinishLab/semble (src/semble/chunking/)
