feat(csp): expand tree-sitter coverage via tree-sitter-language-pack #39

Merged: amondnet merged 2 commits into main from amondnet/rust-port-expand-tree-sitter-grammar-coverage-to on Jun 20, 2026

@amondnet

@amondnet amondnet commented Jun 19, 2026

Contributor

Summary

Closes #38. The Rust chunker resolved grammars through a curated ~14-grammar static set (language_for), so files in recognized-but-uncurated languages (kotlin, swift, php, scala, lua, …) were walked and indexed but fell through to line chunking instead of AST chunking — a real behavioral narrowing vs upstream semble, which uses tree_sitter_language_pack (≈all languages).

This swaps the curated set for the Rust tree-sitter-language-pack crate (306 grammars, full upstream parity — the Rust sibling of the package semble itself uses). 264 of csp's 265 EXTENSION_TO_LANGUAGE names now AST-chunk; only wolfram lacks a pack grammar.

Changes

  • language_for → tree_sitter_language_pack::get_language(name).ok(): downloads the parser from GitHub releases on first use and caches it on disk; unknown language or offline fetch failure → None → line fallback (the prior degradation contract, unchanged).
  • is_supported_language → has_language(name): a metadata-only lookup (no download), so chunk_source gates AST chunking cheaply before paying for a fetch.
  • Replaced 14 individual tree-sitter-* crates with tree-sitter-language-pack = "1.9". tree-sitter stays a direct dep and resolves to the same 0.26.x (ABI-compatible Language).
  • ADR-0004 records the single-binary ↔ runtime-grammar-cache trade-off; semble.md §4.3/§4.5/§6.2 updated.
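
The gating order described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from crates/csp: the `GrammarSource` trait, `FakePack`, `Chunking`, and `choose_chunking` are hypothetical names introduced here so the sketch compiles without network access or external crates. Only the method names `has_language` and `get_language` mirror the calls named in this PR, and the grammar is modelled as a plain `String`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the grammar pack (assumption, not the real crate API).
pub trait GrammarSource {
    /// Metadata-only lookup: must not download anything.
    fn has_language(&self, name: &str) -> bool;
    /// May fetch on first use; `Err` models an offline or failed fetch.
    fn get_language(&mut self, name: &str) -> Result<String, String>;
}

#[derive(Debug, PartialEq)]
pub enum Chunking {
    Ast,
    Lines,
}

/// Cheap gate first, then the possibly expensive fetch; any failure
/// degrades to line chunking.
pub fn choose_chunking<S: GrammarSource>(source: &mut S, language: &str) -> Chunking {
    if !source.has_language(language) {
        // Unknown language: never pay for a fetch.
        return Chunking::Lines;
    }
    match source.get_language(language).ok() {
        Some(_grammar) => Chunking::Ast,
        None => Chunking::Lines,
    }
}

/// Fake pack used only to exercise the sketch.
pub struct FakePack {
    pub known: HashSet<String>,
    pub online: bool,
    pub fetch_attempts: usize,
}

impl GrammarSource for FakePack {
    fn has_language(&self, name: &str) -> bool {
        self.known.contains(name)
    }

    fn get_language(&mut self, name: &str) -> Result<String, String> {
        // Count every fetch attempt so the gate's effect is observable.
        self.fetch_attempts += 1;
        if self.online && self.known.contains(name) {
            Ok(name.to_string())
        } else {
            Err(format!("could not fetch grammar for {}", name))
        }
    }
}
```

With a `FakePack` that knows only `kotlin`, a request for `wolfram` returns `Chunking::Lines` without incrementing `fetch_attempts`, which is the point of gating on the metadata check first.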

Trade-off (decided with maintainer, ADR-0004)

The csp binary stays a single executable, but AST chunking is no longer fully offline/self-contained — grammars fetch from GitHub releases on first use and cache under the OS cache dir. Degradation is graceful (offline → line chunking, exactly what an unsupported language already did) and no language regresses below prior behavior. Binary shrinks (no compiled-in grammars).

Test plan

  • cargo fmt --all ✅ · cargo clippy --all-targets --all-features -- -D warnings ✅ · cargo test --workspace ✅ (256 + 8 passed, 4 ignored)
  • Offline-green: recognition (has_language) + fallback assertions run in the default suite.
  • Real-parse tests are #[ignore]d (need network to download a grammar). Verified locally:
    cargo test -p csp chunking -- --ignored   # 3 passed (rust, multibyte, kotlin/swift/php AST-chunk)
    

Acceptance criteria (#38)

  • Target grammar set decided & documented (ADR-0004, with binary-size/offline rationale).
  • Selected languages return Some(Language) from language_for and AST-chunk (verified by newly_supported_languages_are_ast_chunked).
  • Quality gate green.
  • semble.md §6.2 item updated/closed.

Note: independent of #37 (chunk-length 750 / ranking) — branches off main, touches disjoint lines.


Summary by cubic

Switches the Rust chunker to tree-sitter-language-pack for near-full grammar coverage; 264/265 recognized languages now AST-chunk instead of lines. Parsers download on first use and cache on disk, with one-time warnings on grammar load failures and graceful line fallback when offline.

  • New Features

    • Resolve grammars via tree_sitter_language_pack::get_language(name); is_supported_language uses has_language(name) (metadata-only, no download).
    • Coverage: 306 grammars; only wolfram in EXTENSION_TO_LANGUAGE still line-chunks.
    • Failure behavior: unknown/offline → None + one-time stderr warning per language → line fallback.
    • CI: add a network-gated ignored-tests job (manual/weekly) to run real parse tests that download grammars.
  • Dependencies

    • Replaced 14 tree-sitter-* crates with tree-sitter-language-pack = "1.9"; tree-sitter remains (0.26.x ABI-compatible).
    • Trade-off documented in ADR-0004 (single binary with runtime grammar cache).

Written for commit 60530fd. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation

    • Added an ADR and updated the documentation to explain the new grammar-resolution and chunking behavior (including offline fallback).
  • Refactor

    • Expanded AST chunking coverage to hundreds of languages via a centralized language-pack source.
    • Reduced package/binary size by removing bundled grammars.
    • Grammars are downloaded on first use and cached; when unavailable/offline, chunking falls back to line-based behavior.
  • Tests

    • Adjusted CI to run ignored/network-dependent parsing tests only on manual or scheduled runs.

Resolve grammars through `tree_sitter_language_pack` (306 languages, full
upstream parity) instead of the curated ~14-grammar static set. 264 of the
265 `EXTENSION_TO_LANGUAGE` names now AST-chunk instead of line-falling-back
(only `wolfram` lacks a pack grammar).

- `language_for` → `get_language(name).ok()` (downloads on first use, caches
  on disk; unknown/offline → None → line fallback, the prior contract).
- `is_supported_language` → `has_language(name)` (metadata-only, no download)
  so `chunk_source` gates AST chunking before paying for a fetch.
- Replace 14 individual `tree-sitter-*` crates with `tree-sitter-language-pack`;
  `tree-sitter` stays (same 0.26.x, ABI-compatible Language).
- Tests: always-offline recognition + fallback assertions; real-parse tests
  (`#[ignore]`, run with `--ignored`) verify kotlin/swift/php AST-chunk.

Trade-off (ADR-0004): single binary stays, but AST chunking is no longer
fully offline — parsers fetch from GitHub releases on first use and cache on
disk, degrading gracefully to line chunking when offline.

Closes #38
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 18a4c1fa-9487-422d-913f-e40228b216e9

📥 Commits

Reviewing files that changed from the base of the PR and between 959c4d7 and 60530fd.

📒 Files selected for processing (3)
  • .github/workflows/rust.yml
  • .please/docs/references/semble.md
  • crates/csp/src/chunking/core.rs
✅ Files skipped from review due to trivial changes (1)
  • .please/docs/references/semble.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/csp/src/chunking/core.rs

📝 Walkthrough

Walkthrough

The Rust chunker (crates/csp) replaces ~14 statically linked tree-sitter-* grammar crates with a single tree-sitter-language-pack = "1.9" dependency. language_for now calls get_language with download-and-cache semantics; is_supported_language uses the metadata-only has_language check. Network-dependent tests are marked #[ignore], with a new CI job to run them on schedule. A new ADR (0004) and updates to semble.md document the decision and its trade-offs.

Changes

Grammar resolution via tree-sitter-language-pack

| Layer / File(s) | Summary |
|---|---|
| Language resolution rewrite and dependency swap (crates/csp/Cargo.toml, crates/csp/src/chunking/core.rs) | Removes 13 per-language tree-sitter-* crates, adds tree-sitter-language-pack = "1.9". Rewrites language_for to call get_language (download + disk cache, None on failure/offline) and is_supported_language to use the metadata-only has_language check. |
| Test suite updates for language-pack behavior (crates/csp/src/chunking/core.rs) | Expands supported-language coverage test to a much larger language set (Kotlin, Swift, PHP, Scala, Lua, etc.), adds #[ignore] to grammar-download tests, and adds a new ignored test (newly_supported_languages_are_ast_chunked) verifying newly covered languages produce non-empty AST chunk boundaries and split with smaller desired_length. |
| Workflow support for ignored tests (.github/workflows/rust.yml) | Adds workflow_dispatch and weekly schedule triggers. Introduces an ignored-tests job that runs on manual/scheduled events and executes cargo test -- --ignored to verify network-dependent tests without blocking default push/PR runs. |
| ADR 0004 and reference documentation updates (.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md, .please/docs/decisions/index.md, .please/docs/references/semble.md) | Adds ADR 0004 recording the decision, trade-offs (offline capability for broader language coverage), consequences, and rejected alternatives. Updates the ADR index. Revises semble.md to reflect the new language_for resolution path with download + disk caching, metadata-only gating, and closes the "curated tree-sitter set" gap item. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pleaseai/code-search#35: Introduced semble.md with explicit documentation of the prior curated grammar set and line-fallback behavior—the exact gap this PR closes.

Poem

🐇 Fourteen crates once lined my hutch,
Each grammar crate a little crutch.
Now one pack holds them all with grace,
Two-sixty-four langs find their place!
On first fetch they download and cache —
The rabbit hops at quite a pace. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(csp): expand tree-sitter coverage via tree-sitter-language-pack' directly summarizes the main change: replacing curated grammars with tree-sitter-language-pack to expand coverage. |
| Linked Issues check | ✅ Passed | All coding-related requirements from issue #38 are met: target grammar set decided (full language-pack with rationale in ADR-0004), newly-supported languages AST-chunk with tests confirming 264/265 languages supported, quality gates pass, and documentation updated. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to expanding tree-sitter grammar coverage: dependency replacement, grammar resolution refactoring, test updates, documentation, and CI configuration for grammar download testing. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@socket-security

socket-security Bot commented Jun 19, 2026


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | cargo/thiserror@1.0.69 | 81 | 100 | 93 | 100 | 100 |
| Added | cargo/tree-sitter-language-pack@1.9.1 | 83 | 100 | 100 | 100 | 100 |
| Added | cargo/sha2@0.11.0 | 100 | 100 | 93 | 100 | 100 |

View full report

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

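This pull request replaces the 14 statically linked tree-sitter grammars with the tree-sitter-language-pack crate, expanding AST chunking support to 306 languages and bringing it to parity with upstream. With this change, parsers are downloaded dynamically on first use and cached on disk. The review feedback suggests that, rather than discarding errors in language_for via tree_sitter_language_pack::get_language(language).ok(), the function should emit a warning when a download fails, to make debugging and monitoring easier.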

Comment thread crates/csp/src/chunking/core.rs
@codacy-production

codacy-production Bot commented Jun 19, 2026


Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 6 complexity · 0 duplication

Metric Results
Complexity 6
Duplication 0

View in Codacy


@codecov

codecov Bot commented Jun 19, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/csp/src/chunking/core.rs (1)

485-489: Add a network-enabled CI lane for ignored parser tests.

#[ignore] keeps default runs offline, but Lines 485-489/490-526 mean the download/cache + real parse path is no longer exercised in normal CI. Consider a scheduled/manual job running cargo test --workspace -- --ignored to catch grammar-fetch and parser-compat regressions earlier.

Also applies to: 490-526

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/csp/src/chunking/core.rs` around lines 485 - 489, The tests marked
with #[ignore] for real source parsing and grammar downloading (in lines 490-526
of the chunking core tests) are not being executed in normal CI runs, which
means grammar-fetch and parser-compatibility regressions are not being caught
early. Add a new CI job (either scheduled or manually triggered) to your CI
configuration that runs `cargo test --workspace -- --ignored` to exercise these
ignored tests when network is available, ensuring the grammar caching and real
parser paths are tested as part of your CI pipeline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.please/docs/references/semble.md:
- Around line 126-133: The documentation in Section 4.3 describes two different
gating mechanisms for AST chunking: lines 126-133 state that
`is_supported_language` (a metadata-only check via `has_language`) gates
chunking before any download, but lines 137-140 still reference
`language_for(lang).is_some()` as the gate, which would trigger a download
attempt. Update the later section (around line 139) to clarify that
`is_supported_language` is the actual gate for AST chunking to maintain
consistency with the metadata-only approach described earlier and to accurately
reflect that the download attempt only happens after the metadata check passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8fb990dd-347c-4d60-aa4e-f04751e850ad

📥 Commits

Reviewing files that changed from the base of the PR and between beae45d and 959c4d7.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • .please/docs/decisions/0004-rust-grammar-coverage-language-pack.md
  • .please/docs/decisions/index.md
  • .please/docs/references/semble.md
  • crates/csp/Cargo.toml
  • crates/csp/src/chunking/core.rs

Comment thread .please/docs/references/semble.md

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 6 files

Architecture diagram
sequenceDiagram
    participant Caller as chunk_source
    participant IsSupported as is_supported_language
    participant LangFor as language_for
    participant Pack as tree-sitter-language-pack
    participant Cache as Disk Cache
    participant Net as GitHub Releases
    participant Lines as line chunking fallback

    Note over Caller,Lines: NEW: Dynamic grammar resolution via language pack

    Caller->>IsSupported: is_supported_language(language)
    Note over IsSupported: Metadata-only (no download)
    IsSupported->>Pack: has_language(name)
    Pack-->>IsSupported: bool
    alt language unknown
        IsSupported-->>Caller: false
        Caller->>Lines: fallback to line chunking
    else language known
        IsSupported-->>Caller: true
        Caller->>LangFor: language_for(language)
        LangFor->>Pack: get_language(name)
        alt parser already cached
            Pack->>Cache: read cached .so/.dylib
            Cache-->>Pack: parser bytes
            Pack-->>LangFor: Language
        else first use OR cache expired
            Pack->>Net: download parser from GitHub release
            alt download success
                Net-->>Pack: parser bytes
                Pack->>Cache: write to OS cache dir
                Cache-->>Pack: ok
                Pack-->>LangFor: Language
            else network failure
                Net-->>Pack: error
                Pack-->>LangFor: None
            end
        end
        alt Language returned
            LangFor-->>Caller: Some(Language)
            Caller->>Caller: AST parsing → chunk boundaries
        else None
            LangFor-->>Caller: None
            Caller->>Lines: fallback to line chunking (graceful degradation)
        end
    end

    Note over Caller,Lines: Tests: offline-safe checks use has_language only<br/>Real‑parse tests need network (marked #[ignore])
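
The cache-or-download branch of the diagram can be modelled in a few lines. This is only a sketch under stated assumptions: `GrammarCache`, `Resolved`, and `resolve` are hypothetical names, an in-memory map stands in for the on-disk cache, and the `download` closure stands in for the fetch from GitHub releases. It says nothing about how tree-sitter-language-pack implements its cache internally.

```rust
use std::collections::HashMap;

/// Outcome of resolving one grammar (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Resolved {
    FromCache,
    Downloaded,
    Unavailable,
}

/// In-memory stand-in for the disk cache; the real cache lives on disk.
pub struct GrammarCache {
    entries: HashMap<String, Vec<u8>>,
}

impl GrammarCache {
    pub fn new() -> Self {
        GrammarCache { entries: HashMap::new() }
    }

    /// Return the cached parser if present; otherwise try `download` once
    /// and store the result. A failed download leaves the cache untouched,
    /// so a later call can retry.
    pub fn resolve<F>(&mut self, name: &str, download: F) -> Resolved
    where
        F: FnOnce(&str) -> Option<Vec<u8>>,
    {
        if self.entries.contains_key(name) {
            return Resolved::FromCache;
        }
        match download(name) {
            Some(bytes) => {
                self.entries.insert(name.to_string(), bytes);
                Resolved::Downloaded
            }
            None => Resolved::Unavailable,
        }
    }
}
```

In this model, `Resolved::Unavailable` corresponds to the diagram's "network failure" arm, where the caller falls back to line chunking.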

- language_for: warn once per language on grammar-resolution failure (stderr)
  instead of swallowing the error with `.ok()`, mirroring dense.rs's stub
  fallback warning. Dedup keeps a polyglot offline index from spamming.
- ci(rust): add a network-gated `ignored-tests` job (workflow_dispatch +
  weekly schedule) running `cargo test -- --ignored`, so the grammar
  download/parse path is exercised without burdening the offline PR gate.
- docs(semble.md §4.3): correct the chunk_source gate description —
  `is_supported_language` (metadata-only) gates, then `chunk` may still
  return None on an offline fetch failure.

Addresses gemini-code-assist + CodeRabbit review on #39.
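
One way to get the "warn once per language" behaviour described in this commit message is a process-wide set of already-warned names. The sketch below is an assumption about the general technique, not the code merged in core.rs: `warn_once`, `warned_languages`, and the message wording are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Process-wide set of languages that have already produced a warning.
fn warned_languages() -> &'static Mutex<HashSet<String>> {
    static WARNED: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();
    WARNED.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Print a stderr warning the first time `language` fails to resolve.
/// Returns true if a warning was printed, false if it was deduplicated.
pub fn warn_once(language: &str, reason: &str) -> bool {
    // Recover the guard even if another thread panicked while holding the lock.
    let mut warned = warned_languages()
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    // `insert` returns true only when the value was not already present.
    let first_time = warned.insert(language.to_string());
    if first_time {
        eprintln!(
            "warning: grammar for `{}` unavailable ({}); falling back to line chunking",
            language, reason
        );
    }
    first_time
}
```

Keying the set by language rather than by file is what keeps an offline index over many files in the same language from repeating the message.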
@amondnet

Contributor Author

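Review feedback addressed (commit 60530fd):

  • gemini [MEDIUM], swallowed error: language_for used to return None silently via .ok(); it now prints a stderr warning once per language on failure. This matches the style of the stub fallback warning in dense.rs, and the warning is deduplicated so that it does not repeat for every file during polyglot offline indexing.
  • CodeRabbit [doc], §4.3 contradiction: corrected the description of the chunk_source gate so that it is based on is_supported_language (metadata-only). It now states that chunk(...) can afterwards still return None on an offline fetch failure, which results in line fallback.
  • CodeRabbit [nitpick], no CI for ignored tests: added a network-gated ignored-tests job to rust.yml (workflow_dispatch + weekly schedule). The PR gate stays offline, while regressions in the grammar download/parse path are caught periodically.

cubic reported "No issues found". Confirmed that the quality gate (fmt/clippy/test) passes again.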

@amondnet amondnet self-assigned this Jun 20, 2026
@amondnet amondnet merged commit 9a0cde3 into main Jun 20, 2026
9 checks passed
@amondnet amondnet deleted the amondnet/rust-port-expand-tree-sitter-grammar-coverage-to branch June 20, 2026 01:27
This was referenced Jun 20, 2026
amondnet added a commit that referenced this pull request Jun 20, 2026
Resolve conflicts in .please/docs/references/semble.md:
- chunk length: keep 750 (PR #37 reconciliation; upstream value), drop main's
  stale 1500
- AST chunking gate: adopt main's is_supported_language (metadata-only) wording,
  matching the merged #39 tree-sitter-language-pack code
- §6.2 gaps: mark TD-002 (ranking) closed by #37 and the curated tree-sitter set
  closed by #39/ADR-0004 — both gaps are resolved on the merged branch

Development

Successfully merging this pull request may close these issues.

Rust port: expand tree-sitter grammar coverage to match upstream language pack

1 participant