feat(csp): expand tree-sitter coverage via tree-sitter-language-pack#39
Conversation
Resolve grammars through `tree_sitter_language_pack` (306 languages, full upstream parity) instead of the curated ~14-grammar static set. 264 of the 265 `EXTENSION_TO_LANGUAGE` names now AST-chunk instead of line-falling-back (only `wolfram` lacks a pack grammar). - `language_for` → `get_language(name).ok()` (downloads on first use, caches on disk; unknown/offline → None → line fallback, the prior contract). - `is_supported_language` → `has_language(name)` (metadata-only, no download) so `chunk_source` gates AST chunking before paying for a fetch. - Replace 14 individual `tree-sitter-*` crates with `tree-sitter-language-pack`; `tree-sitter` stays (same 0.26.x, ABI-compatible Language). - Tests: always-offline recognition + fallback assertions; real-parse tests (`#[ignore]`, run with `--ignored`) verify kotlin/swift/php AST-chunk. Trade-off (ADR-0004): single binary stays, but AST chunking is no longer fully offline — parsers fetch from GitHub releases on first use and cache on disk, degrading gracefully to line chunking when offline. Closes #38
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe Rust chunker ( ChangesGrammar resolution via tree-sitter-language-pack
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
Code Review
본 풀 리퀘스트는 기존의 정적으로 연결된 14개의 tree-sitter 문법을 tree-sitter-language-pack 크레이트로 대체하여 AST 청킹 지원을 306개 언어로 확장하고 업스트림과의 패리티를 맞춥니다. 이 변경으로 파서를 최초 사용 시 동적으로 다운로드하고 디스크에 캐싱하게 됩니다. 리뷰 피드백에서는 language_for 함수에서 tree_sitter_language_pack::get_language(language).ok()를 사용하여 에러를 무시하는 대신, 다운로드 실패 시 경고 메시지를 출력하여 디버깅과 모니터링을 용이하게 하도록 개선할 것을 제안했습니다.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 6 |
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/csp/src/chunking/core.rs (1)
485-489: Add a network-enabled CI lane for ignored parser tests.
#[ignore]keeps default runs offline, but Lines 485-489/490-526 mean the download/cache + real parse path is no longer exercised in normal CI. Consider a scheduled/manual job runningcargo test --workspace -- --ignoredto catch grammar-fetch and parser-compat regressions earlier.Also applies to: 490-526
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/csp/src/chunking/core.rs` around lines 485 - 489, The tests marked with #[ignore] for real source parsing and grammar downloading (in lines 490-526 of the chunking core tests) are not being executed in normal CI runs, which means grammar-fetch and parser-compatibility regressions are not being caught early. Add a new CI job (either scheduled or manually triggered) to your CI configuration that runs `cargo test --workspace -- --ignored` to exercise these ignored tests when network is available, ensuring the grammar caching and real parser paths are tested as part of your CI pipeline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.please/docs/references/semble.md:
- Around line 126-133: The documentation in Section 4.3 describes two different
gating mechanisms for AST chunking: lines 126-133 state that
`is_supported_language` (a metadata-only check via `has_language`) gates
chunking before any download, but lines 137-140 still reference
`language_for(lang).is_some()` as the gate, which would trigger a download
attempt. Update the later section (around line 139) to clarify that
`is_supported_language` is the actual gate for AST chunking to maintain
consistency with the metadata-only approach described earlier and to accurately
reflect that the download attempt only happens after the metadata check passes.
---
Nitpick comments:
In `@crates/csp/src/chunking/core.rs`:
- Around line 485-489: The tests marked with #[ignore] for real source parsing
and grammar downloading (in lines 490-526 of the chunking core tests) are not
being executed in normal CI runs, which means grammar-fetch and
parser-compatibility regressions are not being caught early. Add a new CI job
(either scheduled or manually triggered) to your CI configuration that runs
`cargo test --workspace -- --ignored` to exercise these ignored tests when
network is available, ensuring the grammar caching and real parser paths are
tested as part of your CI pipeline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8fb990dd-347c-4d60-aa4e-f04751e850ad
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
.please/docs/decisions/0004-rust-grammar-coverage-language-pack.md.please/docs/decisions/index.md.please/docs/references/semble.mdcrates/csp/Cargo.tomlcrates/csp/src/chunking/core.rs
There was a problem hiding this comment.
No issues found across 6 files
Architecture diagram
sequenceDiagram
participant Caller as chunk_source
participant IsSupported as is_supported_language
participant LangFor as language_for
participant Pack as tree-sitter-language-pack
participant Cache as Disk Cache
participant Net as GitHub Releases
participant Lines as line chunking fallback
Note over Caller,Lines: NEW: Dynamic grammar resolution via language pack
Caller->>IsSupported: is_supported_language(language)
Note over IsSupported: Metadata-only (no download)
IsSupported->>Pack: has_language(name)
Pack-->>IsSupported: bool
alt language unknown
IsSupported-->>Caller: false
Caller->>Lines: fallback to line chunking
else language known
IsSupported-->>Caller: true
Caller->>LangFor: language_for(language)
LangFor->>Pack: get_language(name)
alt parser already cached
Pack->>Cache: read cached .so/.dylib
Cache-->>Pack: parser bytes
Pack-->>LangFor: Language
else first use OR cache expired
Pack->>Net: download parser from GitHub release
alt download success
Net-->>Pack: parser bytes
Pack->>Cache: write to OS cache dir
Cache-->>Pack: ok
Pack-->>LangFor: Language
else network failure
Net-->>Pack: error
Pack-->>LangFor: None
end
end
alt Language returned
LangFor-->>Caller: Some(Language)
Caller->>Caller: AST parsing → chunk boundaries
else None
LangFor-->>Caller: None
Caller->>Lines: fallback to line chunking (graceful degradation)
end
end
Note over Caller,Lines: Tests: offline-safe checks use has_language only<br/>Real‑parse tests need network (marked #[ignore])
- language_for: warn once per language on grammar-resolution failure (stderr) instead of swallowing the error with `.ok()`, mirroring dense.rs's stub fallback warning. Dedup keeps a polyglot offline index from spamming. - ci(rust): add a network-gated `ignored-tests` job (workflow_dispatch + weekly schedule) running `cargo test -- --ignored`, so the grammar download/parse path is exercised without burdening the offline PR gate. - docs(semble.md §4.3): correct the chunk_source gate description — `is_supported_language` (metadata-only) gates, then `chunk` may still return None on an offline fetch failure. Addresses gemini-code-assist + CodeRabbit review on #39.
|
리뷰 반영 완료 (commit 60530fd):
cubic은 "No issues found"였습니다. Quality gate(fmt/clippy/test) 재통과 확인했습니다. |
Resolve conflicts in .please/docs/references/semble.md: - chunk length: keep 750 (PR #37 reconciliation; upstream value), drop main's stale 1500 - AST chunking gate: adopt main's is_supported_language (metadata-only) wording, matching the merged #39 tree-sitter-language-pack code - §6.2 gaps: mark TD-002 (ranking) closed by #37 and the curated tree-sitter set closed by #39/ADR-0004 — both gaps are resolved on the merged branch
Summary
Closes #38. The Rust chunker resolved grammars through a curated ~14-grammar static set (
language_for), so files in recognized-but-uncurated languages (kotlin, swift, php, scala, lua, …) were walked and indexed but fell through to line chunking instead of AST chunking — a real behavioral narrowing vs upstream semble, which usestree_sitter_language_pack(≈all languages).This swaps the curated set for the Rust
tree-sitter-language-packcrate (306 grammars, full upstream parity — the Rust sibling of the package semble itself uses). 264 of csp's 265EXTENSION_TO_LANGUAGEnames now AST-chunk; onlywolframlacks a pack grammar.Changes
language_for→tree_sitter_language_pack::get_language(name).ok()— downloads the parser from GitHub releases on first use, caches it on disk; unknown language or offline fetch failure →None→ line fallback (the prior degradation contract, unchanged).is_supported_language→has_language(name)— a metadata-only lookup (no download), sochunk_sourcegates AST chunking cheaply before paying for a fetch.tree-sitter-*crates withtree-sitter-language-pack = "1.9".tree-sitterstays a direct dep and resolves to the same0.26.x(ABI-compatibleLanguage).semble.md§4.3/§4.5/§6.2 updated.Trade-off (decided with maintainer, ADR-0004)
The
cspbinary stays a single executable, but AST chunking is no longer fully offline/self-contained — grammars fetch from GitHub releases on first use and cache under the OS cache dir. Degradation is graceful (offline → line chunking, exactly what an unsupported language already did) and no language regresses below prior behavior. Binary shrinks (no compiled-in grammars).Test plan
cargo fmt --all✅ ·cargo clippy --all-targets --all-features -- -D warnings✅ ·cargo test --workspace✅ (256 + 8 passed, 4 ignored)has_language) + fallback assertions run in the default suite.#[ignore]d (need network to download a grammar). Verified locally:Acceptance criteria (#38)
Some(Language)fromlanguage_forand AST-chunk (verified bynewly_supported_languages_are_ast_chunked).semble.md§6.2 item updated/closed.Summary by cubic
Switches the Rust chunker to
tree-sitter-language-packfor near-full grammar coverage; 264/265 recognized languages now AST-chunk instead of lines. Parsers download on first use and cache on disk, with one-time warnings on grammar load failures and graceful line fallback when offline.New Features
tree_sitter_language_pack::get_language(name);is_supported_languageuseshas_language(name)(metadata-only, no download).wolframinEXTENSION_TO_LANGUAGEstill line-chunks.None+ one-time stderr warning per language → line fallback.ignored-testsjob (manual/weekly) to run real parse tests that download grammars.Dependencies
tree-sitter-*crates withtree-sitter-language-pack = "1.9";tree-sitterremains (0.26.x ABI-compatible).Written for commit 60530fd. Summary will update on new commits.
Summary by CodeRabbit
Documentation
Refactor
Tests