pleaseai · amondnet · Jun 20, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
@@ -19,6 +19,11 @@ on:
       - rust-toolchain.toml
       - rustfmt.toml
       - .github/workflows/rust.yml
+  # The `ignored-tests` job below is network-gated; run it manually or weekly so
+  # the PR gate stays offline. See ADR-0004.
+  workflow_dispatch:
+  schedule:
+    - cron: '0 6 * * 1' # Mondays 06:00 UTC
 
 permissions:
   contents: read
@@ -47,3 +52,19 @@ jobs:
 
       - name: Test
         run: cargo test --all-features --locked --workspace
+
+  ignored-tests:
+    # Network-gated: these tests download tree-sitter grammars from GitHub
+    # releases (see ADR-0004). Kept off the PR/push gate so it stays offline and
+    # flake-free; runs on manual dispatch or the weekly schedule to catch
+    # grammar-fetch / parser-compat regressions.
+    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        with:
+          persist-credentials: false
+
+      - name: Test (ignored / network-gated)
+        run: cargo test --all-features --locked --workspace -- --ignored
diff --git a/.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md b/.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md
@@ -0,0 +1,69 @@
+# ADR 0004 — Rust grammar coverage via `tree-sitter-language-pack` (downloaded parsers)
+
+- **Status**: Accepted
+- **Date**: 2026-06-20
+- **Deciders**: csp maintainers
+- **Relates to**: [ADR 0001](0001-native-tree-sitter.md) (native tree-sitter bindings, the TS side), [ADR 0003](0003-rewrite-in-rust.md) (Rust rewrite & single-binary distribution)
+- **Closes**: [#38](https://github.com/pleaseai/code-search/issues/38) — "Rust port: expand tree-sitter grammar coverage to match upstream language pack"
+
+## Context
+
+The Rust port's chunker (`crates/csp/src/chunking/core.rs`) resolved a tree-sitter grammar through `language_for(language)`, which statically linked a **curated set of ~14 grammars** (rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash, html, css) via individual `tree-sitter-*` crates.
+
+Upstream semble parses with the Python `tree-sitter-language-pack` (≈all languages). Meanwhile the Rust port's `EXTENSION_TO_LANGUAGE` table (`crates/csp/src/indexing/files.rs`, ~350 entries / 265 distinct language names) recognizes far more languages than the curated grammar set. The effect was a **real behavioral narrowing vs upstream**: a file in a recognized-but-uncurated language (kotlin, swift, php, scala, lua, …) was still walked and indexed, but fell through to line-based chunking instead of AST chunking — coarser, less semantically-aligned chunk boundaries than upstream produces.
+
+The options for closing the gap:
+
+1. **Expanded curated set** — hand-pick and statically link ~30–40 more `tree-sitter-*` crates. Keeps a self-contained offline binary, but only partial parity, grows build time / binary size, and requires a per-language audit (not every grammar is published to crates.io at a compatible version).
+2. **`tree-sitter-language-pack` crate (dynamic loading + download)** — adopt the Rust crate published by the same project semble uses. 306 languages, `get_language(name)` maps 1:1 onto `language_for`, and 264 of csp's 265 `EXTENSION_TO_LANGUAGE` names resolve (only `wolfram` is absent). Parsers are fetched from GitHub releases on first use and cached on disk.
+
+## Decision
+
+**Adopt `tree-sitter-language-pack` = "1.9"** with its default features (`dynamic-loading` + `download`) as the canonical grammar source for the Rust chunker. Replace the 14 individual `tree-sitter-*` grammar crates with it.
+
+- `language_for(language)` → `tree_sitter_language_pack::get_language(language).ok()`. Downloads the parser on first use, caches it, then hits the in-process registry. Unknown language or an offline fetch failure → `None` → line fallback (the exact pre-existing degradation contract).
+- `is_supported_language(language)` → `tree_sitter_language_pack::has_language(language)`. A **metadata-only** lookup (bundled manifest + aliases) that does **not** download, so `chunk_source` gates AST chunking cheaply before paying for a fetch.
+- `tree-sitter` stays a direct dependency (the port drives `Parser`/`Node`/`Language` itself). The crate resolves to the same `tree-sitter 0.26.x`, so the returned `Language` is ABI-compatible.
+
+### Trade-off: single binary vs. runtime grammar cache
+
+ADR-0003 motivation #1 is single-binary distribution. This decision **narrows** that property: the `csp` binary is still a single executable that runs with no Node/Bun present, but it is **no longer fully self-contained / offline** for AST chunking — grammars are fetched from GitHub releases on first use and cached under the OS cache dir (`tree_sitter_language_pack::cache_dir()`; `dirs`-based, e.g. `~/Library/Caches/...` or `~/.cache/...`).
+
+We accept this because:
+
+- **Parity is the point of this work.** A statically-linked subset cannot reach upstream's ≈full coverage without an unbounded crate-audit treadmill; the language pack tracks 306 grammars and is maintained by the same upstream project that semble depends on.
+- **Degradation is graceful, not fatal.** Offline or fetch-failed → line chunking, exactly what an unsupported language already did. No language regresses below the previous behavior; the previously-curated 14 also just download once and cache.
+- **Binary size shrinks** (grammars no longer compiled in) at the cost of a one-time per-language network fetch.
+
+The negatives: first-use latency and a network/GitHub-availability dependency for never-before-seen languages; a writable cache dir is required for AST chunking. These are documented and considered acceptable for a developer tool. A future offline/air-gapped mode could pre-seed the cache via `tree_sitter_language_pack::download(&[...])` or `download_all()`, or pin a `download`-disabled build that links a static subset — out of scope here.
+
+## Consequences
+
+### Positive
+
+- Full upstream-parity AST chunking: 264/265 recognized languages now AST-chunk instead of falling back to line chunking.
+- One dependency replaces 14; coverage tracks upstream without a per-language audit.
+- Smaller binary (no compiled-in grammars).
+
+### Negative
+
+- AST chunking now requires a one-time network fetch per language and a writable cache dir; fully offline runs degrade those languages to line chunking until the cache is seeded.
+- Tests that parse with a real grammar need network access, so they are marked `#[ignore]` (run with `cargo test -- --ignored`); the default suite stays offline-green via metadata-only (`has_language`) and fallback assertions.
+- One language in csp's extension table (`wolfram`) has no pack grammar and stays on line fallback.
+
+### Neutral
+
+- `chunk_source`'s gate (`is_supported_language` then `chunk`) is unchanged in shape; only the resolver backing it changed.
+- An offline/static build mode remains a future option (feature-gated `download`-off build, or cache pre-seeding).
+
+## Alternatives considered
+
+- **Expanded curated static set.** Rejected: partial parity only, ongoing crate-audit burden, and larger build/binary, for a coverage ceiling still well below upstream.
+- **Hybrid (static subset + optional language-pack feature).** Rejected for now: doubles the chunker's resolver paths and the test matrix for little benefit over the download model, whose offline degradation already matches the old static fallback. Kept as a future option if an air-gapped build becomes a requirement.
+
+## References
+
+- Issue [#38](https://github.com/pleaseai/code-search/issues/38).
+- `tree-sitter-language-pack` — <https://crates.io/crates/tree-sitter-language-pack>, <https://github.com/kreuzberg-dev/tree-sitter-language-pack>.
+- Upstream: `src/semble/chunking/core.py`, `src/semble/index/files.py` in [MinishLab/semble](https://github.com/MinishLab/semble).
+- `.please/docs/references/semble.md` §4.3 (chunking) and §6.2 (open gaps).
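The degradation contract described in ADR 0004 above (a metadata-only gate, then a resolver that may fail, then line fallback) can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in `crates/csp`: the `GrammarSource` trait, `MockPack`, `Grammar`, `ChunkMode`, and `chunk_mode` are hypothetical stand-ins invented here so the sketch compiles without the real `tree_sitter_language_pack` crate or network access. Only the names `is_supported_language`, `language_for`, `has_language`, and `get_language` come from the ADR, and their signatures here are assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a resolved grammar (the real code returns a
/// `tree_sitter::Language`).
#[derive(Debug, Clone, PartialEq)]
struct Grammar(String);

/// Hypothetical abstraction over the grammar pack, assumed for this sketch.
trait GrammarSource {
    /// Metadata-only lookup: must not trigger a download.
    fn has_language(&self, name: &str) -> bool;
    /// May download on first use; fails for unknown names or when offline.
    fn get_language(&self, name: &str) -> Result<Grammar, String>;
}

/// Which chunking strategy a file ends up with.
#[derive(Debug, PartialEq)]
enum ChunkMode {
    Ast,
    Lines,
}

/// Cheap gate: consult metadata only.
fn is_supported_language(src: &dyn GrammarSource, language: &str) -> bool {
    src.has_language(language)
}

/// Resolver: any error (unknown language, failed fetch) collapses to `None`.
fn language_for(src: &dyn GrammarSource, language: &str) -> Option<Grammar> {
    src.get_language(language).ok()
}

/// Gate first, then resolve; every failure path lands on line chunking.
fn chunk_mode(src: &dyn GrammarSource, language: &str) -> ChunkMode {
    if !is_supported_language(src, language) {
        return ChunkMode::Lines;
    }
    match language_for(src, language) {
        Some(_) => ChunkMode::Ast,
        None => ChunkMode::Lines,
    }
}

/// Mock pack: a fixed manifest plus an `online` switch simulating the network.
struct MockPack {
    manifest: HashSet<&'static str>,
    online: bool,
}

impl MockPack {
    fn new(names: &[&'static str], online: bool) -> Self {
        MockPack { manifest: names.iter().copied().collect(), online }
    }
}

impl GrammarSource for MockPack {
    fn has_language(&self, name: &str) -> bool {
        self.manifest.contains(name)
    }

    fn get_language(&self, name: &str) -> Result<Grammar, String> {
        if !self.manifest.contains(name) {
            return Err(format!("unknown language: {name}"));
        }
        if !self.online {
            return Err(format!("fetch failed for {name}"));
        }
        Ok(Grammar(name.to_string()))
    }
}
```

With a mock built as `MockPack::new(&["kotlin", "rust"], false)`, `is_supported_language` still reports `true` for `kotlin` while `chunk_mode` yields `ChunkMode::Lines`. That is the offline case the ADR describes: the gate passes, the fetch fails, and chunking falls back to lines.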
diff --git a/.please/docs/decisions/index.md b/.please/docs/decisions/index.md
@@ -7,3 +7,4 @@
 | [0001](0001-native-tree-sitter.md) | Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack` | 2026-05-28 | Accepted |
 | [0002](0002-index-storage-cache-model.md) | Index Storage & Caching Model: Global `~/.csp/index/` Content-Hash Cache | 2026-06-18 | Accepted |
 | [0003](0003-rewrite-in-rust.md) | Rewrite `@pleaseai/csp` from TypeScript/Bun to Rust | 2026-06-18 | Proposed |
+| [0004](0004-rust-grammar-coverage-language-pack.md) | Rust grammar coverage via `tree-sitter-language-pack` (downloaded parsers) | 2026-06-20 | Accepted |
diff --git a/.please/docs/references/semble.md b/.please/docs/references/semble.md
@@ -123,17 +123,23 @@ Same contract as semble `tokens.py`:
 **`core.rs`** (boundary algorithm, byte-based; `RECURSION_DEPTH = 500`, `MIN_CHUNK_SIZE = 50`):
 - The merge algorithm is generic over a node trait so tests can drive it with mock nodes; in
   production a **`TsNode` bridge** adapts `tree_sitter::Node` to it.
-- `language_for(language)` returns a statically-linked `tree_sitter::Language` for a **curated
-  grammar set**: rust, python, javascript, typescript, tsx, go, java, c, cpp, ruby, json, bash,
-  html, css. Unsupported languages → `None` → line fallback. ⚠ This is **narrower than upstream**,
-  which uses `tree_sitter_language_pack` (≈all languages); see §6.
+- `language_for(language)` returns a `tree_sitter::Language` from
+  **`tree_sitter_language_pack`** (306 grammars — full upstream parity; semble uses the Python
+  package of the same name). Parsers download from GitHub releases on first use and cache on disk;
+  unknown language or an offline fetch failure → `None` → line fallback. `is_supported_language`
+  uses `has_language` — a **metadata-only** check (no download) — so `chunk_source` gates AST
+  chunking before paying for a fetch. 264/265 `EXTENSION_TO_LANGUAGE` names resolve (`wolfram` is
+  the sole gap). See [ADR-0004](../decisions/0004-rust-grammar-coverage-language-pack.md) for the
+  single-binary ↔ runtime-cache trade-off.
 - `_merge_node_inner` (greedy pack), `_merge_adjacent_chunks`, `chunk_lines` fallback — same
   shape as semble. Byte offsets are converted to char offsets for multibyte safety.
 
 **`source.rs`** (`chunk_source`):
 - `DESIRED_CHUNK_LENGTH_CHARS = 1500` (⚠ upstream is now **750** — see §6).
-- AST chunking when `language_for(lang).is_some()`, else line fallback. Char offsets → 1-indexed
-  line numbers; clamps end to avoid the zero-length off-by-one.
+- AST chunking is gated by `is_supported_language(lang)` (metadata-only, no download); the
+  subsequent `chunk(...)` may still return `None` (e.g. an offline grammar fetch failure),
+  falling back to line chunking. Char offsets → 1-indexed line numbers; clamps end to avoid the
+  zero-length off-by-one.
 
 ### 4.4 `indexing/file_walker.rs` — gitignore-aware walk
 
@@ -146,9 +152,9 @@ Same contract as semble `tokens.py`:
 ### 4.5 `indexing/files.rs` — language detection & file gating
 
 - `EXTENSION_TO_LANGUAGE` — `&[(&str, &str)]` table (~350 entries). `detect_language(name)`
-  lowercases the suffix and looks it up. Note: this recognizes far more extensions than the
-  curated tree-sitter set in §4.3 — recognized-but-unparsed languages still get **walked and
-  line-chunked**.
+  lowercases the suffix and looks it up. Since §4.3 now resolves grammars through
+  `tree_sitter_language_pack` (306 grammars), 264/265 of these language names AST-chunk; only
+  `wolfram` (and any language whose grammar the pack can't fetch) falls back to line chunking.
 - Content-type partition: `DOC_LANGUAGES`, `CONFIG_LANGUAGES`, `DATA_LANGUAGES`; code = all minus
   those. `get_extensions(types, extra)` inverts the map; the **`extra`** param (custom extensions)
   is a small Rust-side API addition.
@@ -348,10 +354,12 @@ Clean two-layer split:
   `rerank_top_k_saturation`; the real `ranking::{boosting::apply_query_boost,
   penalties::rerank_top_k}` are ported but unwired (matches the TS source). Search-ranking parity
   is fixture-level only. Saturation constants are duplicated as a result.
-- **Curated tree-sitter set** — only ~14 grammars are statically linked (`language_for`); upstream
-  uses `tree_sitter_language_pack` (≈all languages). Languages outside the curated set are
-  recognized by the extension map but **line-chunked**, not AST-chunked. This is a real behavioral
-  narrowing vs upstream.
+- ~~**Curated tree-sitter set**~~ — **closed** ([ADR-0004](../decisions/0004-rust-grammar-coverage-language-pack.md),
+  [#38](https://github.com/pleaseai/code-search/issues/38)). `language_for` now resolves through
+  `tree_sitter_language_pack` (306 grammars, full upstream parity; 264/265 `EXTENSION_TO_LANGUAGE`
+  names AST-chunk, `wolfram` excepted). Trade-off recorded in ADR-0004: parsers download on first
+  use and cache on disk, so AST chunking is no longer fully offline/self-contained — it degrades
+  gracefully to line chunking when offline, exactly as an unsupported language already did.
 
 ### 6.3 Upstream drift since the review baseline (`eacbe43` → `136b6f7`)