diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
index 1f1926e..6bb87b3 100644
--- a/.github/workflows/rust.yml
+++ b/.github/workflows/rust.yml
@@ -19,6 +19,11 @@ on:
       - rust-toolchain.toml
       - rustfmt.toml
       - .github/workflows/rust.yml
+  # The `ignored-tests` job below is network-gated; run it manually or weekly so
+  # the PR gate stays offline. See ADR-0004.
+  workflow_dispatch:
+  schedule:
+    - cron: '0 6 * * 1' # Mondays 06:00 UTC

 permissions:
   contents: read
@@ -47,3 +52,19 @@ jobs:

       - name: Test
         run: cargo test --all-features --locked --workspace
+
+  ignored-tests:
+    # Network-gated: these tests download tree-sitter grammars from GitHub
+    # releases (see ADR-0004). Kept off the PR/push gate so it stays offline and
+    # flake-free; runs on manual dispatch or the weekly schedule to catch
+    # grammar-fetch / parser-compat regressions.
+    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        with:
+          persist-credentials: false
+
+      - name: Test (ignored / network-gated)
+        run: cargo test --all-features --locked --workspace -- --ignored
diff --git a/.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md b/.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md
new file mode 100644
index 0000000..62b7764
--- /dev/null
+++ b/.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md
@@ -0,0 +1,69 @@
+# ADR 0004 — Rust grammar coverage via `tree-sitter-language-pack` (downloaded parsers)
+
+- **Status**: Accepted
+- **Date**: 2026-06-20
+- **Deciders**: csp maintainers
+- **Relates to**: [ADR 0001](0001-native-tree-sitter.md) (native tree-sitter bindings, the TS side), [ADR 0003](0003-rewrite-in-rust.md) (Rust rewrite & single-binary distribution)
+- **Closes**: [#38](https://github.com/pleaseai/code-search/issues/38) — "Rust port: expand tree-sitter grammar coverage
to match upstream language pack"
+
+## Context
+
+The Rust port's chunker (`crates/csp/src/chunking/core.rs`) resolved a tree-sitter grammar through `language_for(language)`, which statically linked a **curated set of ~14 grammars** (rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash, html, css) via individual `tree-sitter-*` crates.
+
+Upstream semble parses with the Python `tree-sitter-language-pack` (≈all languages). Meanwhile the Rust port's `EXTENSION_TO_LANGUAGE` table (`crates/csp/src/indexing/files.rs`, ~350 entries / 265 distinct language names) recognizes far more languages than the curated grammar set. The effect was a **real behavioral narrowing vs upstream**: a file in a recognized-but-uncurated language (kotlin, swift, php, scala, lua, …) was still walked and indexed, but fell through to line-based chunking instead of AST chunking — coarser, less semantically-aligned chunk boundaries than upstream produces.
+
+The options for closing the gap:
+
+1. **Expanded curated set** — hand-pick and statically link ~30–40 more `tree-sitter-*` crates. Keeps a self-contained offline binary, but only partial parity, grows build time / binary size, and requires a per-language audit (not every grammar is published to crates.io at a compatible version).
+2. **`tree-sitter-language-pack` crate (dynamic loading + download)** — adopt the Rust crate published by the same project semble uses. 306 languages, `get_language(name)` maps 1:1 onto `language_for`, and 264 of csp's 265 `EXTENSION_TO_LANGUAGE` names resolve (only `wolfram` is absent). Parsers are fetched from GitHub releases on first use and cached on disk.
+
+## Decision
+
+**Adopt `tree-sitter-language-pack` = "1.9"** with its default features (`dynamic-loading` + `download`) as the canonical grammar source for the Rust chunker. Replace the 14 individual `tree-sitter-*` grammar crates with it.
+
+- `language_for(language)` → `tree_sitter_language_pack::get_language(language).ok()`.
Downloads the parser on first use, caches it, then hits the in-process registry. Unknown language or an offline fetch failure → `None` → line fallback (the exact pre-existing degradation contract).
+- `is_supported_language(language)` → `tree_sitter_language_pack::has_language(language)`. A **metadata-only** lookup (bundled manifest + aliases) that does **not** download, so `chunk_source` gates AST chunking cheaply before paying for a fetch.
+- `tree-sitter` stays a direct dependency (the port drives `Parser`/`Node`/`Language` itself). The crate resolves to the same `tree-sitter 0.26.x`, so the returned `Language` is ABI-compatible.
+
+### Trade-off: single binary vs. runtime grammar cache
+
+ADR-0003 motivation #1 is single-binary distribution. This decision **narrows** that property: the `csp` binary is still a single executable that runs with no Node/Bun present, but it is **no longer fully self-contained / offline** for AST chunking — grammars are fetched from GitHub releases on first use and cached under the OS cache dir (`tree_sitter_language_pack::cache_dir()`; `dirs`-based, e.g. `~/Library/Caches/...` or `~/.cache/...`).
+
+We accept this because:
+
+- **Parity is the point of this work.** A statically-linked subset cannot reach upstream's ≈full coverage without an unbounded crate-audit treadmill; the language pack tracks 306 grammars maintained by the same upstream semble depends on.
+- **Degradation is graceful, not fatal.** Offline or fetch-failed → line chunking, exactly what an unsupported language already did. No language regresses below the previous behavior; the previously-curated 14 also just download once and cache.
+- **Binary size shrinks** (grammars no longer compiled in) at the cost of a one-time per-language network fetch.
+
+The negatives: first-use latency and a network/GitHub-availability dependency for never-before-seen languages; a writable cache dir is required for AST chunking.
These are documented and considered acceptable for a developer tool. A future offline/air-gapped mode could pre-seed the cache via `tree_sitter_language_pack::download(&[...])` or `download_all()`, or pin a `download`-disabled build that links a static subset — out of scope here.
+
+## Consequences
+
+### Positive
+
+- Full upstream-parity AST chunking: 264/265 recognized languages now AST-chunk instead of line-falling-back.
+- One dependency replaces 14; coverage tracks upstream without a per-language audit.
+- Smaller binary (no compiled-in grammars).
+
+### Negative
+
+- AST chunking now requires a one-time network fetch per language and a writable cache dir; fully offline runs degrade those languages to line chunking until the cache is seeded.
+- `cargo test` for real-parse tests needs network → those tests are `#[ignore]`d (run with `cargo test -- --ignored`); the default suite stays offline-green via metadata-only (`has_language`) and fallback assertions.
+- One language in csp's extension table (`wolfram`) has no pack grammar and stays on line fallback.
+
+### Neutral
+
+- `chunk_source`'s gate (`is_supported_language` then `chunk`) is unchanged in shape; only the resolver backing it changed.
+- An offline/static build mode remains a future option (feature-gated `download`-off build, or cache pre-seeding).
+
+## Alternatives considered
+
+- **Expanded curated static set.** Rejected: partial parity only, ongoing crate-audit burden, and larger build/binary, for a coverage ceiling still well below upstream.
+- **Hybrid (static subset + optional language-pack feature).** Rejected for now: doubles the chunker's resolver paths and the test matrix for little benefit over the download model, whose offline degradation already matches the old static fallback. Kept as a future option if an air-gapped build becomes a requirement.
+
+## References
+
+- Issue [#38](https://github.com/pleaseai/code-search/issues/38).
+- `tree-sitter-language-pack` — , .
+- Upstream: `src/semble/chunking/core.py`, `src/semble/index/files.py` in [MinishLab/semble](https://github.com/MinishLab/semble).
+- `.please/docs/references/semble.md` §4.3 (chunking) and §6.2 (open gaps).
diff --git a/.please/docs/decisions/index.md b/.please/docs/decisions/index.md
index 9fefb19..a6f846d 100644
--- a/.please/docs/decisions/index.md
+++ b/.please/docs/decisions/index.md
@@ -7,3 +7,4 @@
 | [0001](0001-native-tree-sitter.md) | Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack` | 2026-05-28 | Accepted |
 | [0002](0002-index-storage-cache-model.md) | Index Storage & Caching Model: Global `~/.csp/index/` Content-Hash Cache | 2026-06-18 | Accepted |
 | [0003](0003-rewrite-in-rust.md) | Rewrite `@pleaseai/csp` from TypeScript/Bun to Rust | 2026-06-18 | Proposed |
+| [0004](0004-rust-grammar-coverage-language-pack.md) | Rust grammar coverage via `tree-sitter-language-pack` (downloaded parsers) | 2026-06-20 | Accepted |
diff --git a/.please/docs/references/semble.md b/.please/docs/references/semble.md
index b0aa53c..4cd66f8 100644
--- a/.please/docs/references/semble.md
+++ b/.please/docs/references/semble.md
@@ -123,17 +123,23 @@ Same contract as semble `tokens.py`:
 **`core.rs`** (boundary algorithm, byte-based; `RECURSION_DEPTH = 500`, `MIN_CHUNK_SIZE = 50`):
 - The merge algorithm is generic over a node trait so tests can drive it with mock nodes; in production a **`TsNode` bridge** adapts `tree_sitter::Node` to it.
-- `language_for(language)` returns a statically-linked `tree_sitter::Language` for a **curated
-  grammar set**: rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash,
-  html, css. Unsupported languages → `None` → line fallback. ⚠ This is **narrower than upstream**,
-  which uses `tree_sitter_language_pack` (≈all languages); see §6.
+- `language_for(language)` returns a `tree_sitter::Language` from
+  **`tree_sitter_language_pack`** (306 grammars — full upstream parity; semble uses the Python
+  package of the same name). Parsers download from GitHub releases on first use and cache on disk;
+  unknown language or an offline fetch failure → `None` → line fallback. `is_supported_language`
+  uses `has_language` — a **metadata-only** check (no download) — so `chunk_source` gates AST
+  chunking before paying for a fetch. 264/265 `EXTENSION_TO_LANGUAGE` names resolve (`wolfram` is
+  the sole gap). See [ADR-0004](../decisions/0004-rust-grammar-coverage-language-pack.md) for the
+  single-binary ↔ runtime-cache trade-off.
 - `_merge_node_inner` (greedy pack), `_merge_adjacent_chunks`, `chunk_lines` fallback — same shape as semble. Byte offsets are converted to char offsets for multibyte safety.

 **`source.rs`** (`chunk_source`):
 - `DESIRED_CHUNK_LENGTH_CHARS = 1500` (⚠ upstream is now **750** — see §6).
-- AST chunking when `language_for(lang).is_some()`, else line fallback. Char offsets → 1-indexed
-  line numbers; clamps end to avoid the zero-length off-by-one.
+- AST chunking is gated by `is_supported_language(lang)` (metadata-only, no download); the
+  subsequent `chunk(...)` may still return `None` (e.g. an offline grammar fetch failure),
+  falling back to line chunking. Char offsets → 1-indexed line numbers; clamps end to avoid the
+  zero-length off-by-one.

 ### 4.4 `indexing/file_walker.rs` — gitignore-aware walk

@@ -146,9 +152,9 @@ Same contract as semble `tokens.py`:
 ### 4.5 `indexing/files.rs` — language detection & file gating

 - `EXTENSION_TO_LANGUAGE` — `&[(&str, &str)]` table (~350 entries). `detect_language(name)`
-  lowercases the suffix and looks it up. Note: this recognizes far more extensions than the
-  curated tree-sitter set in §4.3 — recognized-but-unparsed languages still get **walked and
-  line-chunked**.
+  lowercases the suffix and looks it up.
Since §4.3 now resolves grammars through
+  `tree_sitter_language_pack` (306 grammars), 264/265 of these language names AST-chunk; only
+  `wolfram` (and any extension the pack can't fetch) falls back to line chunking.
 - Content-type partition: `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`; code = all minus those. `get_extensions(types, extra)` inverts the map; the **`extra`** param (custom extensions) is a small Rust-side API addition.
@@ -348,10 +354,12 @@ Clean two-layer split:
   `rerank_top_k_saturation`; the real `ranking::{boosting::apply_query_boost, penalties::rerank_top_k}` are ported but unwired (matches the TS source). Search-ranking parity is fixture-level only. Saturation constants are duplicated as a result.
-- **Curated tree-sitter set** — only ~14 grammars are statically linked (`language_for`); upstream
-  uses `tree_sitter_language_pack` (≈all languages). Languages outside the curated set are
-  recognized by the extension map but **line-chunked**, not AST-chunked. This is a real behavioral
-  narrowing vs upstream.
+- ~~**Curated tree-sitter set**~~ — **closed** ([ADR-0004](../decisions/0004-rust-grammar-coverage-language-pack.md),
+  [#38](https://github.com/pleaseai/code-search/issues/38)). `language_for` now resolves through
+  `tree_sitter_language_pack` (306 grammars, full upstream parity; 264/265 `EXTENSION_TO_LANGUAGE`
+  names AST-chunk, `wolfram` excepted). Trade-off recorded in ADR-0004: parsers download on first
+  use and cache on disk, so AST chunking is no longer fully offline/self-contained — it degrades
+  gracefully to line chunking when offline, exactly as an unsupported language already did.
### 6.3 Upstream drift since the review baseline (`eacbe43` → `136b6f7`) diff --git a/Cargo.lock b/Cargo.lock index 738d49d..0af49f2 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -15,6 +15,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5a15f179cd60c4584b8a8c596927aadc462e27f2ca70c04e0071964a73ba7a75" dependencies = [ "cfg-if", + "const-random", "getrandom 0.3.4", "once_cell", "serde", @@ -155,6 +156,15 @@ dependencies = [ "generic-array", ] +[[package]] +name = "block-buffer" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d2f6c7dbe95a6ed67ad9f18e57daf93a2f034c524b99fd2b76d18fdfeb6660aa" +dependencies = [ + "hybrid-array", +] + [[package]] name = "bstr" version = "1.12.1" @@ -199,9 +209,17 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dad887fd958be91b5098c0248def011f4523ab786cd411be668777e55063501f" dependencies = [ "find-msvc-tools", + "jobserver", + "libc", "shlex", ] +[[package]] +name = "cesu8" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6d43a04d8753f35258c91f8ec639f792891f748a1edbd759cf1dcea3382ad83c" + [[package]] name = "cfg-if" version = "1.0.4" @@ -266,6 +284,16 @@ version = "1.0.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1d07550c9036bf2ae0c684c4297d503f838287c83c53686d05370d0e139ae570" +[[package]] +name = "combine" +version = "4.6.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba5a308b75df32fe02788e748662718f03fde005016435c444eea572398219fd" +dependencies = [ + "bytes", + "memchr", +] + [[package]] name = "compact_str" version = "0.9.1" @@ -294,6 +322,42 @@ dependencies = [ "windows-sys 0.59.0", ] +[[package]] +name = "const-oid" +version = "0.10.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a6ef517f0926dd24a1582492c791b6a4818a4d94e789a334894aa15b0d12f55c" + +[[package]] +name = 
"const-random" +version = "0.1.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "87e00182fe74b066627d63b85fd550ac2998d4b0bd86bfed477a0ae4c7c71359" +dependencies = [ + "const-random-macro", +] + +[[package]] +name = "const-random-macro" +version = "0.1.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f9d839f2a20b0aee515dc581a6172f2321f96cab76c1a38a4c584a194955390e" +dependencies = [ + "getrandom 0.2.17", + "once_cell", + "tiny-keccak", +] + +[[package]] +name = "core-foundation" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b2a6cd9ae233e7f62ba4e9353e81a88df7fc8a5987b8d445b4d90c879bd156f6" +dependencies = [ + "core-foundation-sys", + "libc", +] + [[package]] name = "core-foundation-sys" version = "0.8.7" @@ -309,6 +373,15 @@ dependencies = [ "libc", ] +[[package]] +name = "cpufeatures" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8b2a41393f66f16b0823bb79094d54ac5fbd34ab292ddafb9a0456ac9f87d201" +dependencies = [ + "libc", +] + [[package]] name = "crc32fast" version = "1.5.0" @@ -359,6 +432,15 @@ dependencies = [ "typenum", ] +[[package]] +name = "crypto-common" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ce6e4c961d6cd6c9a86db418387425e8bdeaf05b3c8bc1411e6dca4c252f1453" +dependencies = [ + "hybrid-array", +] + [[package]] name = "csp" version = "0.0.0" @@ -370,23 +452,11 @@ dependencies = [ "regex", "serde", "serde_json", - "sha2", + "sha2 0.10.9", "tempfile", - "thiserror", + "thiserror 2.0.18", "tree-sitter", - "tree-sitter-bash", - "tree-sitter-c", - "tree-sitter-cpp", - "tree-sitter-css", - "tree-sitter-go", - "tree-sitter-html", - "tree-sitter-java", - "tree-sitter-javascript", - "tree-sitter-json", - "tree-sitter-python", - "tree-sitter-ruby", - "tree-sitter-rust", - "tree-sitter-typescript", + "tree-sitter-language-pack", ] [[package]] @@ 
-519,8 +589,19 @@ version = "0.10.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" dependencies = [ - "block-buffer", - "crypto-common", + "block-buffer 0.10.4", + "crypto-common 0.1.7", +] + +[[package]] +name = "digest" +version = "0.11.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f1dd6dbb5841937940781866fa1281a1ff7bd3bf827091440879f9994983d5c2" +dependencies = [ + "block-buffer 0.12.1", + "const-oid", + "crypto-common 0.2.2", ] [[package]] @@ -615,6 +696,27 @@ version = "2.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6" +[[package]] +name = "fd-lock" +version = "4.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0ce92ff622d6dadf7349484f42c93271a0d49b7cc4d466a936405bacbe10aa78" +dependencies = [ + "cfg-if", + "rustix", + "windows-sys 0.59.0", +] + +[[package]] +name = "filetime" +version = "0.2.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5c287a33c7f0a620c38e641e7f60827713987b3c0f26e8ddc9462cc69cf75759" +dependencies = [ + "cfg-if", + "libc", +] + [[package]] name = "find-msvc-tools" version = "0.1.9" @@ -828,8 +930,8 @@ dependencies = [ "rand", "serde", "serde_json", - "thiserror", - "ureq", + "thiserror 2.0.18", + "ureq 2.12.1", "windows-sys 0.60.2", ] @@ -843,6 +945,21 @@ dependencies = [ "itoa", ] +[[package]] +name = "httparse" +version = "1.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" + +[[package]] +name = "hybrid-array" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9155a582abd142abc056962c29e3ce5ff2ad5469f4246b537ed42c5deba857da" +dependencies = [ + "typenum", +] + [[package]] name = 
"iana-time-zone" version = "0.1.65" @@ -1036,6 +1153,60 @@ version = "1.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682" +[[package]] +name = "jni" +version = "0.21.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1a87aa2bb7d2af34197c04845522473242e1aa17c12f4935d5856491a7fb8c97" +dependencies = [ + "cesu8", + "cfg-if", + "combine", + "jni-sys 0.3.1", + "log", + "thiserror 1.0.69", + "walkdir", + "windows-sys 0.45.0", +] + +[[package]] +name = "jni-sys" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "41a652e1f9b6e0275df1f15b32661cf0d4b78d4d87ddec5e0c3c20f097433258" +dependencies = [ + "jni-sys 0.4.1", +] + +[[package]] +name = "jni-sys" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c6377a88cb3910bee9b0fa88d4f42e1d2da8e79915598f65fb0c7ee14c878af2" +dependencies = [ + "jni-sys-macros", +] + +[[package]] +name = "jni-sys-macros" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "38c0b942f458fe50cdac086d2f946512305e5631e720728f2a61aabcd47a6264" +dependencies = [ + "quote", + "syn", +] + +[[package]] +name = "jobserver" +version = "0.1.34" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9afb3de4395d6b3e67a780b6de64b51c978ecf11cb9a462c66be7d4ca9039d33" +dependencies = [ + "getrandom 0.3.4", + "libc", +] + [[package]] name = "js-sys" version = "0.3.102" @@ -1053,6 +1224,16 @@ version = "0.2.186" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "68ab91017fe16c622486840e4c83c9a37afeff978bd239b5293d61ece587de66" +[[package]] +name = "libloading" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "754ca22de805bb5744484a5b151a9e1a8e837d5dc232c2d7d8c2e3492edc8b60" +dependencies = [ + "cfg-if", + 
"windows-link", +] + [[package]] name = "libredox" version = "0.1.17" @@ -1143,7 +1324,7 @@ dependencies = [ "serde", "serde_json", "tokenizers", - "ureq", + "ureq 2.12.1", ] [[package]] @@ -1258,6 +1439,12 @@ dependencies = [ "pkg-config", ] +[[package]] +name = "openssl-probe" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7c87def4c32ab89d880effc9e097653c8da5d6ef28e6b539d313baaacfbafcbe" + [[package]] name = "option-ext" version = "0.2.0" @@ -1422,7 +1609,7 @@ checksum = "a4e608c6638b9c18977b00b475ac1f28d14e84b27d8d42f70e0bf1e3dec127ac" dependencies = [ "getrandom 0.2.17", "libredox", - "thiserror", + "thiserror 2.0.18", ] [[package]] @@ -1504,7 +1691,7 @@ dependencies = [ "schemars", "serde", "serde_json", - "thiserror", + "thiserror 2.0.18", "tokio", "tokio-util", "tracing", @@ -1551,6 +1738,18 @@ dependencies = [ "zeroize", ] +[[package]] +name = "rustls-native-certs" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dab5152771c58876a2146916e53e35057e1a4dfa2b9df0f0305b07f611fdea4d" +dependencies = [ + "openssl-probe", + "rustls-pki-types", + "schannel", + "security-framework", +] + [[package]] name = "rustls-pki-types" version = "1.14.1" @@ -1560,6 +1759,33 @@ dependencies = [ "zeroize", ] +[[package]] +name = "rustls-platform-verifier" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1d99feebc72bae7ab76ba994bb5e121b8d83d910ca40b36e0921f53becc41784" +dependencies = [ + "core-foundation", + "core-foundation-sys", + "jni", + "log", + "once_cell", + "rustls", + "rustls-native-certs", + "rustls-platform-verifier-android", + "rustls-webpki", + "security-framework", + "security-framework-sys", + "webpki-root-certs", + "windows-sys 0.61.2", +] + +[[package]] +name = "rustls-platform-verifier-android" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"f87165f0995f63a9fbeea62b64d10b4d9d8e78ec6d7d51fb2125fda7bb36788f" + [[package]] name = "rustls-webpki" version = "0.103.13" @@ -1602,6 +1828,15 @@ dependencies = [ "winapi-util", ] +[[package]] +name = "schannel" +version = "0.1.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91c1b7e4904c873ef0710c1f407dde2e6287de2bebc1bbbf7d430bb7cbffd939" +dependencies = [ + "windows-sys 0.61.2", +] + [[package]] name = "schemars" version = "1.2.1" @@ -1628,6 +1863,29 @@ dependencies = [ "syn", ] +[[package]] +name = "security-framework" +version = "3.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b7f4bc775c73d9a02cde8bf7b2ec4c9d12743edf609006c7facc23998404cd1d" +dependencies = [ + "bitflags", + "core-foundation", + "core-foundation-sys", + "libc", + "security-framework-sys", +] + +[[package]] +name = "security-framework-sys" +version = "2.17.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ce2691df843ecc5d231c0b14ece2acc3efb62c0a398c7e1d875f3983ce020e3" +dependencies = [ + "core-foundation-sys", + "libc", +] + [[package]] name = "serde" version = "1.0.228" @@ -1690,8 +1948,19 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" dependencies = [ "cfg-if", - "cpufeatures", - "digest", + "cpufeatures 0.2.17", + "digest 0.10.7", +] + +[[package]] +name = "sha2" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "446ba717509524cb3f22f17ecc096f10f4822d76ab5c0b9822c5f9c284e825f4" +dependencies = [ + "cfg-if", + "cpufeatures 0.3.0", + "digest 0.11.3", ] [[package]] @@ -1793,6 +2062,17 @@ dependencies = [ "syn", ] +[[package]] +name = "tar" +version = "0.4.46" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3f6221d9a6003c78398e3b239969f352578258df48c8eb051caadae0015bc840" +dependencies = [ + "filetime", + "libc", + 
"xattr", +] + [[package]] name = "tempfile" version = "3.27.0" @@ -1806,13 +2086,33 @@ dependencies = [ "windows-sys 0.61.2", ] +[[package]] +name = "thiserror" +version = "1.0.69" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52" +dependencies = [ + "thiserror-impl 1.0.69", +] + [[package]] name = "thiserror" version = "2.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4288b5bcbc7920c07a1149a35cf9590a2aa808e0bc1eafaade0b80947865fbc4" dependencies = [ - "thiserror-impl", + "thiserror-impl 2.0.18", +] + +[[package]] +name = "thiserror-impl" +version = "1.0.69" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1" +dependencies = [ + "proc-macro2", + "quote", + "syn", ] [[package]] @@ -1826,6 +2126,15 @@ dependencies = [ "syn", ] +[[package]] +name = "tiny-keccak" +version = "2.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2c9d3793400a45f954c52e73d068316d76b6f4e36977e3fcebb13a2721e80237" +dependencies = [ + "crunchy", +] + [[package]] name = "tinystr" version = "0.8.3" @@ -1864,7 +2173,7 @@ dependencies = [ "serde", "serde_json", "spm_precompiled", - "thiserror", + "thiserror 2.0.18", "unicode-normalization-alignments", "unicode-segmentation", "unicode_categories", @@ -1950,96 +2259,6 @@ dependencies = [ "tree-sitter-language", ] -[[package]] -name = "tree-sitter-bash" -version = "0.25.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9e5ec769279cc91b561d3df0d8a5deb26b0ad40d183127f409494d6d8fc53062" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-c" -version = "0.24.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a9b2eb57a55fed6b00812912e730b7a275cf4fe98bfd6a5d76263d4438371728" -dependencies = [ - "cc", - 
"tree-sitter-language", -] - -[[package]] -name = "tree-sitter-cpp" -version = "0.23.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "df2196ea9d47b4ab4a31b9297eaa5a5d19a0b121dceb9f118f6790ad0ab94743" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-css" -version = "0.25.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a5cbc5e18f29a2c6d6435891f42569525cf95435a3e01c2f1947abcde178686f" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-go" -version = "0.25.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c8560a4d2f835cc0d4d2c2e03cbd0dde2f6114b43bc491164238d333e28b16ea" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-html" -version = "0.23.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "261b708e5d92061ede329babaaa427b819329a9d427a1d710abb0f67bbef63ee" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-java" -version = "0.23.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0aa6cbcdc8c679b214e616fd3300da67da0e492e066df01bcf5a5921a71e90d6" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-javascript" -version = "0.25.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68204f2abc0627a90bdf06e605f5c470aa26fdcb2081ea553a04bdad756693f5" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-json" -version = "0.24.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4d727acca406c0020cffc6cf35516764f36c8e3dc4408e5ebe2cb35a947ec471" -dependencies = [ - "cc", - "tree-sitter-language", -] - [[package]] name = "tree-sitter-language" version = "0.1.7" @@ -2047,43 +2266,25 @@ source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "009994f150cc0cd50ff54917d5bc8bffe8cad10ca10d81c34da2ec421ae61782" [[package]] -name = "tree-sitter-python" -version = "0.25.0" +name = "tree-sitter-language-pack" +version = "1.9.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6bf85fd39652e740bf60f46f4cda9492c3a9ad75880575bf14960f775cb74a1c" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-ruby" -version = "0.23.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "be0484ea4ef6bb9c575b4fdabde7e31340a8d2dbc7d52b321ac83da703249f95" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-rust" -version = "0.24.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "439e577dbe07423ec2582ac62c7531120dbfccfa6e5f92406f93dd271a120e45" -dependencies = [ - "cc", - "tree-sitter-language", -] - -[[package]] -name = "tree-sitter-typescript" -version = "0.23.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6c5f76ed8d947a75cc446d5fccd8b602ebf0cde64ccf2ffa434d873d7a575eff" +checksum = "09a9d3b46347363ce7dc86d1df53ebe6af125cb9b30c24461d190aa942da6b50" dependencies = [ + "ahash", "cc", - "tree-sitter-language", + "dirs", + "fd-lock", + "libloading", + "memchr", + "serde", + "serde_json", + "sha2 0.11.0", + "tar", + "thiserror 2.0.18", + "tree-sitter", + "ureq 3.3.0", + "zstd", ] [[package]] @@ -2150,6 +2351,37 @@ dependencies = [ "webpki-roots 0.26.11", ] +[[package]] +name = "ureq" +version = "3.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dea7109cdcd5864d4eeb1b58a1648dc9bf520360d7af16ec26d0a9354bafcfc0" +dependencies = [ + "base64 0.22.1", + "flate2", + "log", + "percent-encoding", + "rustls", + "rustls-pki-types", + "rustls-platform-verifier", + "socks", + "ureq-proto", + "utf8-zero", + "webpki-roots 1.0.8", +] + +[[package]] +name 
= "ureq-proto" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e994ba84b0bd1b1b0cf92878b7ef898a5c1760108fe7b6010327e274917a808c" +dependencies = [ + "base64 0.22.1", + "http", + "httparse", + "log", +] + [[package]] name = "url" version = "2.5.8" @@ -2162,6 +2394,12 @@ dependencies = [ "serde", ] +[[package]] +name = "utf8-zero" +version = "0.8.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b8c0a043c9540bae7c578c88f91dda8bd82e59ae27c21baca69c8b191aaf5a6e" + [[package]] name = "utf8_iter" version = "1.0.4" @@ -2260,6 +2498,15 @@ dependencies = [ "wasm-bindgen", ] +[[package]] +name = "webpki-root-certs" +version = "1.0.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0d46a5a140e6f7afeccd8eae97eff335163939eac8b929834875168b29b3d267" +dependencies = [ + "rustls-pki-types", +] + [[package]] name = "webpki-roots" version = "0.26.11" @@ -2368,6 +2615,15 @@ dependencies = [ "windows-link", ] +[[package]] +name = "windows-sys" +version = "0.45.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "75283be5efb2831d37ea142365f009c02ec203cd29a3ebecbc093d52315b66d0" +dependencies = [ + "windows-targets 0.42.2", +] + [[package]] name = "windows-sys" version = "0.52.0" @@ -2404,6 +2660,21 @@ dependencies = [ "windows-link", ] +[[package]] +name = "windows-targets" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e5180c00cd44c9b1c88adb3693291f1cd93605ded80c250a75d472756b4d071" +dependencies = [ + "windows_aarch64_gnullvm 0.42.2", + "windows_aarch64_msvc 0.42.2", + "windows_i686_gnu 0.42.2", + "windows_i686_msvc 0.42.2", + "windows_x86_64_gnu 0.42.2", + "windows_x86_64_gnullvm 0.42.2", + "windows_x86_64_msvc 0.42.2", +] + [[package]] name = "windows-targets" version = "0.52.6" @@ -2437,6 +2708,12 @@ dependencies = [ "windows_x86_64_msvc 0.53.1", ] +[[package]] +name = 
"windows_aarch64_gnullvm" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "597a5118570b68bc08d8d59125332c54f1ba9d9adeedeef5b99b02ba2b0698f8" + [[package]] name = "windows_aarch64_gnullvm" version = "0.52.6" @@ -2449,6 +2726,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a9d8416fa8b42f5c947f8482c43e7d89e73a173cead56d044f6a56104a6d1b53" +[[package]] +name = "windows_aarch64_msvc" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e08e8864a60f06ef0d0ff4ba04124db8b0fb3be5776a5cd47641e942e58c4d43" + [[package]] name = "windows_aarch64_msvc" version = "0.52.6" @@ -2461,6 +2744,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b9d782e804c2f632e395708e99a94275910eb9100b2114651e04744e9b125006" +[[package]] +name = "windows_i686_gnu" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c61d927d8da41da96a81f029489353e68739737d3beca43145c8afec9a31a84f" + [[package]] name = "windows_i686_gnu" version = "0.52.6" @@ -2485,6 +2774,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "fa7359d10048f68ab8b09fa71c3daccfb0e9b559aed648a8f95469c27057180c" +[[package]] +name = "windows_i686_msvc" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "44d840b6ec649f480a41c8d80f9c65108b92d89345dd94027bfe06ac444d1060" + [[package]] name = "windows_i686_msvc" version = "0.52.6" @@ -2497,6 +2792,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1e7ac75179f18232fe9c285163565a57ef8d3c89254a30685b57d83a38d326c2" +[[package]] +name = "windows_x86_64_gnu" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"8de912b8b8feb55c064867cf047dda097f92d51efad5b491dfb98f6bbb70cb36" + [[package]] name = "windows_x86_64_gnu" version = "0.52.6" @@ -2509,6 +2810,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9c3842cdd74a865a8066ab39c8a7a473c0778a3f29370b5fd6b4b9aa7df4a499" +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "26d41b46a36d453748aedef1486d5c7a85db22e56aff34643984ea85514e94a3" + [[package]] name = "windows_x86_64_gnullvm" version = "0.52.6" @@ -2521,6 +2828,12 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0ffa179e2d07eee8ad8f57493436566c7cc30ac536a3379fdf008f47f6bb7ae1" +[[package]] +name = "windows_x86_64_msvc" +version = "0.42.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9aec5da331524158c6d1a4ac0ab1541149c0b9505fde06423b02f5ef0106b9f0" + [[package]] name = "windows_x86_64_msvc" version = "0.52.6" @@ -2545,6 +2858,16 @@ version = "0.6.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1ffae5123b2d3fc086436f8834ae3ab053a283cfac8fe0a0b8eaae044768a4c4" +[[package]] +name = "xattr" +version = "1.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32e45ad4206f6d2479085147f02bc2ef834ac85886624a23575ae137c8aa8156" +dependencies = [ + "libc", + "rustix", +] + [[package]] name = "yoke" version = "0.8.3" @@ -2653,3 +2976,31 @@ name = "zmij" version = "1.0.21" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa" + +[[package]] +name = "zstd" +version = "0.13.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e91ee311a569c327171651566e07972200e76fcfe2242a4fa446149a3881c08a" +dependencies = [ + "zstd-safe", +] + +[[package]] +name = "zstd-safe" +version = "7.2.4" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f49c4d5f0abb602a93fb8736af2a4f4dd9512e36f7f570d66e65ff867ed3b9d" +dependencies = [ + "zstd-sys", +] + +[[package]] +name = "zstd-sys" +version = "2.0.16+zstd.1.5.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91e19ebc2adc8f83e43039e79776e3fda8ca919132d68a1fed6a5faca2683748" +dependencies = [ + "cc", + "pkg-config", +] diff --git a/crates/csp/Cargo.toml b/crates/csp/Cargo.toml index 93d1493..14bacfa 100644 --- a/crates/csp/Cargo.toml +++ b/crates/csp/Cargo.toml @@ -20,19 +20,11 @@ sha2 = { workspace = true } tempfile = { workspace = true } # Phase 3 — real Model2Vec dense embeddings (official MinishLab Rust port). model2vec-rs = { workspace = true } -# Phase 3 — tree-sitter AST chunking (curated grammar set; statically linked for -# the single-binary goal — unsupported languages fall back to line chunking). +# Phase 3 — tree-sitter AST chunking. Full upstream-parity grammar coverage via +# `tree-sitter-language-pack` (306 languages), replacing the earlier ~14-grammar +# curated static set (see ADR-0004). Parsers are fetched from GitHub releases on +# first use and cached on disk (`dynamic-loading` + `download`, the crate +# defaults); a language with no available grammar (or an offline fetch failure) +# degrades to line chunking exactly as before. 
tree-sitter = { workspace = true } -tree-sitter-rust = "0.24" -tree-sitter-python = "0.25" -tree-sitter-javascript = "0.25" -tree-sitter-typescript = "0.23" -tree-sitter-go = "0.25" -tree-sitter-java = "0.23" -tree-sitter-c = "0.24" -tree-sitter-cpp = "0.23" -tree-sitter-ruby = "0.23" -tree-sitter-json = "0.24" -tree-sitter-bash = "0.25" -tree-sitter-html = "0.23" -tree-sitter-css = "0.25" +tree-sitter-language-pack = "1.9" diff --git a/crates/csp/src/chunking/core.rs b/crates/csp/src/chunking/core.rs index 83b6576..7b46133 100644 --- a/crates/csp/src/chunking/core.rs +++ b/crates/csp/src/chunking/core.rs @@ -3,41 +3,55 @@ //! //! The merge algorithm is generic over [`AstNode`] so it can be unit-tested with //! mock nodes; in production it is driven by [`tree_sitter::Node`] via [`TsNode`]. -//! A curated set of grammars is statically linked (see [`language_for`]); a -//! language with no bundled grammar makes [`chunk`] return `None` and callers -//! fall back to [`chunk_lines`] — exactly the upstream behavior when -//! `tree_sitter_language_pack` has no parser for the language. +//! Grammars come from [`tree_sitter_language_pack`] (306 languages, full upstream +//! parity — semble uses the Python `tree_sitter_language_pack`; see ADR-0004). +//! Parsers are fetched from GitHub releases on first use and cached on disk; a +//! language with no available grammar — or an offline fetch failure — makes +//! [`language_for`] return `None`, so [`chunk`] returns `None` and callers fall +//! back to [`chunk_lines`], exactly the upstream behavior when the language pack +//! has no parser for the language. + +use std::collections::HashSet; +use std::sync::{LazyLock, Mutex}; use tree_sitter::{Language, Parser}; pub const RECURSION_DEPTH: usize = 500; pub const MIN_CHUNK_SIZE: usize = 50; +/// Languages we've already warned about failing to resolve, so a polyglot +/// offline index degrades quietly after the first notice per language. 
+static WARNED_LANGUAGES: LazyLock<Mutex<HashSet<String>>> = + LazyLock::new(|| Mutex::new(HashSet::new())); + /// Resolve a semble language name (the values in -/// [`crate::indexing::files`]'s `EXTENSION_TO_LANGUAGE`) to a statically-linked -/// tree-sitter grammar, or `None` when no grammar is bundled for it. +/// [`crate::indexing::files`]'s `EXTENSION_TO_LANGUAGE`) to a tree-sitter grammar +/// from the language pack, or `None` when the pack has no grammar for it. /// -/// The curated set covers the common code languages; everything else falls back -/// to line chunking. Add a grammar crate + an arm here to extend coverage. +/// This calls [`tree_sitter_language_pack::get_language`], which downloads the +/// parser from GitHub releases on first use and caches it on disk; later calls +/// hit the in-process registry. A network failure (offline) or an unknown +/// language degrades to `None` → line chunking, with a one-time stderr warning +/// per language so the cause is visible without spamming a polyglot index. Use +/// [`is_supported_language`] (a metadata-only check) when you only need to know +/// whether a grammar *exists* without triggering a download.
pub fn language_for(language: &str) -> Option<Language> { - let lang: Language = match language { - "rust" => tree_sitter_rust::LANGUAGE.into(), - "python" => tree_sitter_python::LANGUAGE.into(), - "javascript" => tree_sitter_javascript::LANGUAGE.into(), - "typescript" => tree_sitter_typescript::LANGUAGE_TYPESCRIPT.into(), - "tsx" => tree_sitter_typescript::LANGUAGE_TSX.into(), - "go" => tree_sitter_go::LANGUAGE.into(), - "java" => tree_sitter_java::LANGUAGE.into(), - "c" => tree_sitter_c::LANGUAGE.into(), - "cpp" => tree_sitter_cpp::LANGUAGE.into(), - "ruby" => tree_sitter_ruby::LANGUAGE.into(), - "json" => tree_sitter_json::LANGUAGE.into(), - "bash" => tree_sitter_bash::LANGUAGE.into(), - "html" => tree_sitter_html::LANGUAGE.into(), - "css" => tree_sitter_css::LANGUAGE.into(), - _ => return None, - }; - Some(lang) + match tree_sitter_language_pack::get_language(language) { + Ok(lang) => Some(lang), + Err(err) => { + // Don't swallow the error silently: warn once per language, then + // fall back to line chunking (the caller treats `None` that way). + if let Ok(mut warned) = WARNED_LANGUAGES.lock() { + if warned.insert(language.to_string()) { + eprintln!( + "csp: could not load tree-sitter grammar for '{language}': {err}. \ + Falling back to line-based chunking for this language." + ); + } + } + None + } + } } /// A half-open `[start, end)` boundary in character offsets. @@ -54,9 +68,15 @@ pub trait AstNode: Sized { fn children(&self) -> Vec<Self>; } -/// Check if the language has a bundled tree-sitter grammar. +/// Check whether the language pack knows a grammar for `language`. +/// +/// This is a metadata-only lookup ([`tree_sitter_language_pack::has_language`], +/// resolving aliases) — unlike [`language_for`] it does **not** download the +/// parser, so [`crate::chunking::source::chunk_source`] can gate AST chunking +/// cheaply before paying for a fetch. A recognized language can still fall back +/// to line chunking later if the actual download fails (e.g. offline).
pub fn is_supported_language(language: &str) -> bool { - language_for(language).is_some() + tree_sitter_language_pack::has_language(language) } /// [`AstNode`] adapter over a real [`tree_sitter::Node`]. `Node` is `Copy` and @@ -320,7 +340,8 @@ mod tests { } #[test] - fn supported_languages_resolve_grammars() { + fn recognizes_expanded_language_set() { + // The original curated set stays supported... for lang in [ "rust", "python", @@ -338,13 +359,20 @@ mod tests { "css", ] { assert!(is_supported_language(lang), "{lang} should be supported"); + } + // ...plus the languages that used to fall through to line chunking and + // now resolve to a language-pack grammar (full upstream parity). This is + // a metadata-only check (`has_language`) — it does not download parsers. + for lang in [ + "kotlin", "swift", "php", "scala", "lua", "csharp", "dart", "elixir", "haskell", + "ocaml", "zig", "nix", "perl", "r", "julia", "clojure", + ] { assert!( - language_for(lang).is_some(), - "{lang} grammar should resolve" + is_supported_language(lang), + "{lang} should be supported by the language pack" ); } assert!(!is_supported_language("not-a-real-language")); - assert!(language_for("not-a-real-language").is_none()); } // --- merge_adjacent_chunks --- @@ -470,13 +498,21 @@ mod tests { #[test] fn chunk_returns_none_without_parser() { + // An unknown language is rejected by the language pack's bundled manifest + // without any network access, so `chunk` returns `None` → line fallback. assert_eq!( chunk("let x = 1\n", "__definitely_not_a_real_language__", 1500), None ); } + // The remaining `chunk(...)` tests parse real source, which makes the language + // pack download the grammar from GitHub releases on first use (then cache it). + // They are `#[ignore]`d so the default `cargo test` stays offline; run them + // with `cargo test -- --ignored` where network is available. 
+ #[test] + #[ignore = "downloads a tree-sitter grammar from GitHub releases"] fn chunk_parses_real_rust_into_covering_boundaries() { let src = "fn a() {\n let x = 1;\n}\n\nfn b() {\n let y = 2;\n}\n"; let boundaries = chunk(src, "rust", 1500).expect("rust is supported → Some"); @@ -498,6 +534,7 @@ mod tests { } #[test] + #[ignore = "downloads a tree-sitter grammar from GitHub releases"] fn chunk_byte_to_char_handles_multibyte() { // A multibyte comment ensures byte→char conversion doesn't over-count. let src = "// café ☕ a comment\nfn z() {}\n"; @@ -507,4 +544,43 @@ mod tests { assert!(b.end <= n, "char boundary {b:?} exceeds char count {n}"); } } + + #[test] + #[ignore = "downloads tree-sitter grammars from GitHub releases"] + fn newly_supported_languages_are_ast_chunked() { + // Languages outside the old curated set now AST-chunk instead of falling + // back to line chunking. A small `desired_length` forces a split that the + // AST honors at declaration boundaries. + let cases = [ + ( + "kotlin", + "fun a() {\n val x = 1\n}\n\nfun b() {\n val y = 2\n}\n", + ), + ( + "swift", + "func a() {\n let x = 1\n}\n\nfunc b() {\n let y = 2\n}\n", + ), + ( + "php", + "= 2, + "{lang}: small desired_length should split: {split:?}" + ); + } + } }
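
Reviewer note: the warn-once fallback in `language_for` can be exercised offline without the language pack. The sketch below is not code from this PR. It reproduces the same pattern with the standard library only, and `first_warning`, `resolve`, and the `u32` stand-in for a grammar are hypothetical names chosen for illustration. It assumes Rust 1.80 or later for `std::sync::LazyLock`.

```rust
use std::collections::HashSet;
use std::sync::{LazyLock, Mutex};

/// Languages already reported, shared across the whole process.
static WARNED: LazyLock<Mutex<HashSet<String>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

/// Returns `true` only the first time `language` is seen.
/// `HashSet::insert` returns `false` when the value was already present.
fn first_warning(language: &str) -> bool {
    match WARNED.lock() {
        Ok(mut warned) => warned.insert(language.to_string()),
        // A poisoned lock means another thread panicked while holding it.
        // Stay silent instead of panicking again, like the `if let Ok` in the diff.
        Err(_) => false,
    }
}

/// Hypothetical stand-in for the grammar lookup: `lookup` plays the role of
/// the fallible fetch, and `u32` plays the role of a loaded grammar.
fn resolve(language: &str, lookup: impl Fn(&str) -> Result<u32, String>) -> Option<u32> {
    match lookup(language) {
        Ok(grammar) => Some(grammar),
        Err(err) => {
            // Warn once per language, then degrade to `None` (line chunking).
            if first_warning(language) {
                eprintln!("could not load grammar for '{language}': {err}");
            }
            None
        }
    }
}
```

Injecting the lookup as a closure is only a testing convenience here: it lets a unit test simulate an offline fetch failure without network access, which the `#[ignore]`d tests in this diff cannot do.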