Context
PR #37 closed two upstream-parity gaps in the Rust port (crates/csp): the ranking pipeline is now wired (TD-002) and the chunk length matches upstream (750). The remaining open gap tracked in semble.md §6.2 is tree-sitter grammar coverage.
Problem
language_for in crates/csp/src/chunking/core.rs statically links only ~14 grammars:
rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash, html, css
Upstream semble uses tree_sitter_language_pack (≈all languages). Meanwhile EXTENSION_TO_LANGUAGE (crates/csp/src/indexing/files.rs, ~350 entries) recognizes far more languages than the curated grammar set.
Effect: a file in a recognized-but-uncurated language (e.g. on the Rust side: kotlin, swift, php, scala, lua, …) is still walked and indexed, but falls through to line-based chunking instead of AST chunking. The result is coarser, less semantically aligned chunk boundaries than upstream produces. This is a real behavioral narrowing vs upstream, not just missing recognition.
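The fall-through can be pictured with a minimal, std-only sketch. Everything here is an assumption for illustration: `Grammar` stands in for a linked tree-sitter grammar, `ChunkStrategy` and `strategy_for` are hypothetical names, and the match arms show only a few of the curated languages. The real `language_for` in core.rs may be shaped differently.

```rust
/// Stand-in for a linked tree-sitter grammar (hypothetical; the real code
/// would presumably hold a tree-sitter language handle instead).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Grammar(&'static str);

/// Hypothetical label for which chunker a file ends up with.
#[derive(Debug, PartialEq, Eq)]
enum ChunkStrategy {
    Ast(Grammar),
    LineFallback,
}

/// Assumed shape of `language_for`: only curated names yield a grammar.
fn language_for(name: &str) -> Option<Grammar> {
    match name {
        "rust" => Some(Grammar("rust")),
        "python" => Some(Grammar("python")),
        "go" => Some(Grammar("go")),
        // Remaining curated grammars are omitted from this sketch.
        _ => None,
    }
}

/// A recognized language without a grammar silently degrades to line chunking.
fn strategy_for(name: &str) -> ChunkStrategy {
    match language_for(name) {
        Some(grammar) => ChunkStrategy::Ast(grammar),
        None => ChunkStrategy::LineFallback,
    }
}

fn main() {
    for name in ["rust", "kotlin"] {
        println!("{name}: {:?}", strategy_for(name));
    }
}
```

In this sketch kotlin is still processed, but it takes the `LineFallback` branch, which is the narrowing described above.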
Proposed work
- Decide the target set: full language-pack parity vs an expanded curated set (weigh binary size / build time of pulling in many tree-sitter-* crates).
- Add the chosen grammar crates as deps and extend the language_for match arms (+ keep is_supported_language in sync).
- Add chunking tests for the newly-AST-supported languages (mirror the existing core.rs grammar tests).
- Update semble.md §4.3 / §6.2 and the constants/coverage notes when the gap closes.
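One possible way to keep language_for and is_supported_language in sync is to derive the latter from the former and check a target list against it. The sketch below is std-only and rests on assumptions: `TARGET_LANGUAGES`, `Grammar`, and `missing_grammars` are hypothetical names, the kotlin and swift arms are placeholders for whichever languages are chosen, and the real functions in core.rs may have different signatures.

```rust
/// Hypothetical target set; the real list depends on the parity decision.
const TARGET_LANGUAGES: &[&str] = &["rust", "python", "kotlin", "swift"];

/// Stand-in for a linked tree-sitter grammar (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Grammar(&'static str);

/// Assumed shape of `language_for` after extending the match arms.
fn language_for(name: &str) -> Option<Grammar> {
    match name {
        "rust" => Some(Grammar("rust")),
        "python" => Some(Grammar("python")),
        // Placeholder new arms; each would need its grammar crate as a dep.
        "kotlin" => Some(Grammar("kotlin")),
        "swift" => Some(Grammar("swift")),
        _ => None,
    }
}

/// Deriving support from `language_for` keeps the two in sync by construction.
fn is_supported_language(name: &str) -> bool {
    language_for(name).is_some()
}

/// Target languages that would still fall back to line-based chunking.
fn missing_grammars() -> Vec<&'static str> {
    TARGET_LANGUAGES
        .iter()
        .copied()
        .filter(|name| !is_supported_language(name))
        .collect()
}

fn main() {
    let missing = missing_grammars();
    assert!(missing.is_empty(), "still line-fallback: {missing:?}");
    println!("all {} target languages have a grammar", TARGET_LANGUAGES.len());
}
```

A check like `missing_grammars` could double as a regression test: adding a name to the target list without a matching arm would fail the assertion instead of silently degrading to line chunking.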
Acceptance criteria
- Languages in the chosen target set return Some(Language) from language_for and are AST-chunked (verified by tests), no longer line-fallback.
- cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings && cargo test --workspace passes.
- semble.md §6.2 item removed / updated.
References
- .please/docs/references/semble.md §4.3 (chunking) and §6.2 (open gaps)
- ADR-0001: native tree-sitter
- Source of truth: Python upstream MinishLab/semble (src/semble/chunking/)