diff --git a/.please/docs/decisions/0001-native-tree-sitter.md b/.please/docs/decisions/0001-native-tree-sitter.md
new file mode 100644
index 0000000..a478890
--- /dev/null
+++ b/.please/docs/decisions/0001-native-tree-sitter.md
@@ -0,0 +1,59 @@
+# ADR 0001 — Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack`
+
+- **Status**: Accepted
+- **Date**: 2026-05-28
+- **Deciders**: csp maintainers
+- **Supersedes**: the "No native add-ons" guideline previously stated in `ARCHITECTURE.md`
+
+## Context
+
+`@pleaseai/csp` ports MinishLab/semble from Python to TypeScript / Bun. Semble parses source code with tree-sitter via the Python `tree-sitter-language-pack`, which exposes a few hundred pre-built grammars through native bindings. The chunker (`src/semble/chunking/core.py`) depends on this coverage — every supported language has an entry in `EXTENSION_TO_LANGUAGE`, and missing a grammar degrades silently to line-based fallback chunking.
+
+For the TypeScript port we considered two options:
+
+1. **`web-tree-sitter`** (WASM). Portable across Linux / macOS / Windows / containers, no native build step. This was the original `ARCHITECTURE.md` guidance ("No native add-ons").
+2. **`@kreuzberg/tree-sitter-language-pack`** (NAPI). Native bindings, ships pre-compiled binaries for macOS / Linux / Windows. Closer parity with the upstream Python implementation.
+
+The trade-off is portability (WASM wins) vs. coverage + startup cost + parity (NAPI wins).
+
+## Decision
+
+**Adopt `@kreuzberg/tree-sitter-language-pack`** as the canonical tree-sitter binding for csp.
+
+Rationale:
+
+- **Parity with semble.** Upstream uses the same multi-grammar `tree-sitter-language-pack` package. Matching the same set of grammars means the TypeScript port can adopt semble's extension → language map (`src/semble/index/files.py`) without per-language audit.
+- **Coverage.** 305 languages out of the box (per the package's published description). Sourcing equivalent WASM grammars individually for each language would be a multi-week chore and add a runtime fetcher.
+- **Startup cost.** Native parsers load once when the process boots; WASM grammars must be fetched/instantiated per language, which slows the first index of a polyglot repo.
+- **Pre-built binaries.** `@kreuzberg/tree-sitter-language-pack` publishes prebuilds for macOS (arm64 / x64), Linux (arm64 / x64, glibc + musl) and Windows (x64). Most users get a binary download, not a node-gyp compile.
+
+## Consequences
+
+### Positive
+
+- Day-1 support for the same language set as semble.
+- No WASM loader, no asset bundling for grammars.
+- Hot-path parsing runs at native speed (a measurable factor for `csp index` on large repos).
+
+### Negative
+
+- Installs now require a supported platform. Users on exotic targets (FreeBSD, Alpine arm64 without musl prebuilds, sandboxed environments without binary loading) may fail to install. The csp README will list supported platforms explicitly.
+- Bun must support loading the NAPI prebuild on the install target. Tested on Bun ≥ 1.3.10 (macOS arm64 / Linux x64).
+- The package adds ~50–100 MB to `node_modules` (native binaries × language grammars). This is acceptable for a developer tool but should be documented.
+
+### Neutral
+
+- Future work may introduce an optional WASM fallback for browser / sandboxed environments. That is out of scope for this ADR — the primary distribution path remains native.
+- Other native add-ons remain discouraged. Any additional NAPI / node-gyp dependency requires its own ADR.
+
+## Alternatives considered
+
+- **`web-tree-sitter` + curated grammar list.** Rejected: coverage gap vs. semble (would need ~50+ separate `tree-sitter-*-wasm` packages, none of which is a maintained drop-in) and per-language loader complexity.
+- **Bun-only tree-sitter via FFI.** Rejected: ties the project to Bun-only, loses Node.js 22+ support promised in `engines`.
+- **Wait and write our own chunker without tree-sitter.** Rejected: semble's chunk quality (definition-bounded, comment-attached) depends on AST awareness; a line-window chunker would regress search precision measurably.
+
+## References
+
+- Upstream: `src/semble/chunking/core.py`, `src/semble/index/files.py` in the upstream [MinishLab/semble](https://github.com/MinishLab/semble) repository.
+- `@kreuzberg/tree-sitter-language-pack` —
+- Previously stated guideline in `ARCHITECTURE.md` ("No native add-ons") is amended in the same commit that introduces this ADR.
diff --git a/.please/docs/decisions/index.md b/.please/docs/decisions/index.md
index 1db25c5..8bf7200 100644
--- a/.please/docs/decisions/index.md
+++ b/.please/docs/decisions/index.md
@@ -4,3 +4,4 @@
 
 | ADR | Title | Date | Status |
 |-----|-------|------|--------|
+| [0001](0001-native-tree-sitter.md) | Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack` | 2026-05-28 | Accepted |
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 9ed62e2..f06d534 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -124,7 +124,7 @@ For **understanding ranking decisions** (the heart of why csp beats grep):
 
 **Algorithmic ports must read the original Python source, not memory.** Use `ask src github:MinishLab/semble@main` and read the relevant `src/semble/*.py` before writing TypeScript. When porting a non-trivial function, leave a `// Port of src/semble/::` comment so reviewers can diff against the source of truth.
 
-**No native add-ons.** Tree-sitter must be `web-tree-sitter` (WASM), not `node-tree-sitter`. This keeps installs portable across Linux / macOS / Windows / containers without C toolchains, and works under Bun where many node-gyp packages still misbehave.
+**Native tree-sitter is allowed via `@kreuzberg/tree-sitter-language-pack`.** This NAPI package ships pre-compiled binaries for macOS / Linux / Windows and gives parity with upstream semble's `tree-sitter-language-pack` Python bindings (305 languages out of the box, no WASM loader overhead). The decision is captured in [ADR 0001](.please/docs/decisions/0001-native-tree-sitter.md). Other native add-ons remain discouraged — anything beyond tree-sitter needs its own ADR justifying the loss of portability under Bun and across container images that lack a C toolchain.
 
 **Bilingual README must stay in sync.** Any public-API change (CLI flag, library symbol, MCP tool, config option, stats path) updates **both** `README.md` and `README.ko.md` in the same commit. The CLAUDE.md captures this as load-bearing.
 
diff --git a/package.json b/package.json
index 6b31303..3c24e89 100644
--- a/package.json
+++ b/package.json
@@ -52,6 +52,14 @@
     "test": "bun test",
     "prepublishOnly": "bun run build"
   },
+  "dependencies": {
+    "@huggingface/transformers": "^4.2.0",
+    "@kreuzberg/tree-sitter-language-pack": "^1.8.1",
+    "@modelcontextprotocol/sdk": "^1.29.0",
+    "chokidar": "^5.0.0",
+    "commander": "^14.0.3",
+    "ignore": "^7.0.5"
+  },
   "devDependencies": {
     "@pleaseai/eslint-config": "^0.0.3",
     "@types/bun": "latest",
diff --git a/src/agents/claude.md b/src/agents/claude.md
new file mode 100644
index 0000000..238afdd
--- /dev/null
+++ b/src/agents/claude.md
@@ -0,0 +1,56 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Grep/Glob/Read for any semantic or exploratory question.
+tools: Bash, Read
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/agents/copilot.md b/src/agents/copilot.md
new file mode 100644
index 0000000..238afdd
--- /dev/null
+++ b/src/agents/copilot.md
@@ -0,0 +1,56 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Grep/Glob/Read for any semantic or exploratory question.
+tools: Bash, Read
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/agents/cursor.md b/src/agents/cursor.md
new file mode 100644
index 0000000..23e85d9
--- /dev/null
+++ b/src/agents/cursor.md
@@ -0,0 +1,55 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Bash/Read for any semantic or exploratory question.
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/agents/gemini.md b/src/agents/gemini.md
new file mode 100644
index 0000000..9436d1a
--- /dev/null
+++ b/src/agents/gemini.md
@@ -0,0 +1,58 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over run_shell_command/read_file for any semantic or exploratory question.
+tools:
+  - run_shell_command
+  - read_file
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/agents/kiro.md b/src/agents/kiro.md
new file mode 100644
index 0000000..01e0df1
--- /dev/null
+++ b/src/agents/kiro.md
@@ -0,0 +1,58 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over shell/read tools for any semantic or exploratory question.
+tools:
+  - shell
+  - read
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/agents/opencode.md b/src/agents/opencode.md
new file mode 100644
index 0000000..8a5abc0
--- /dev/null
+++ b/src/agents/opencode.md
@@ -0,0 +1,59 @@
+---
+name: csp-search
+description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Bash/Read for any semantic or exploratory question.
+mode: subagent
+permission:
+  bash: allow
+  read: allow
+---
+
+Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+csp search "authentication flow" ./my-project
+csp search "save_pretrained" ./my-project
+csp search "save model to disk" ./my-project --top-k 10
+```
+
+If you anticipate doing more than one search, use `csp index` to create an index.
+
+```bash
+csp index ./my-project -o my_index
+```
+
+You can then reuse this index later on:
+
+```bash
+csp search "save_pretrained" --index my_index
+```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+csp search "deployment guide" ./my-project --content docs
+csp search "database host port" ./my-project --content config
+csp search "authentication" ./my-project --content all
+```
+
+Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):
+
+```bash
+csp find-related src/auth.ts 42 ./my-project
+```
+
+Like search, `find-related` also accepts an `--index` argument.
+
+`path` defaults to the current directory when omitted; git URLs are accepted.
+
+If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.
+
+### Workflow
+
+1. Index the repo using `csp index -o cached_index`.
+2. Start with `csp search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
diff --git a/src/chunking/.gitkeep b/src/chunking/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/src/indexing/.gitkeep b/src/indexing/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/src/mcp/.gitkeep b/src/mcp/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/src/ranking/.gitkeep b/src/ranking/.gitkeep
new file mode 100644
index 0000000..e69de29