59 changes: 59 additions & 0 deletions .please/docs/decisions/0001-native-tree-sitter.md
@@ -0,0 +1,59 @@
# ADR 0001 — Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack`

- **Status**: Accepted
- **Date**: 2026-05-28
- **Deciders**: csp maintainers
- **Supersedes**: the "No native add-ons" guideline previously stated in `ARCHITECTURE.md`

## Context

`@pleaseai/csp` ports MinishLab/semble from Python to TypeScript / Bun. Semble parses source code with tree-sitter via the Python `tree-sitter-language-pack`, which exposes a few hundred pre-built grammars through native bindings. The chunker (`src/semble/chunking/core.py`) depends on this coverage: every supported language has an entry in `EXTENSION_TO_LANGUAGE`, and a file whose grammar is missing silently degrades to line-based fallback chunking.
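To make the silent-fallback risk concrete, here is a minimal sketch of an extension lookup that reports the fallback explicitly instead of hiding it. The three table entries, the `ChunkStrategy` type, and `pickChunkStrategy` are illustrative assumptions, not code from semble or csp.

```typescript
// Hypothetical excerpt: the real table is semble's EXTENSION_TO_LANGUAGE
// and covers far more extensions than these three.
const EXTENSION_TO_LANGUAGE: Readonly<Record<string, string>> = {
  ".py": "python",
  ".ts": "typescript",
  ".rs": "rust",
};

type ChunkStrategy =
  | { kind: "tree-sitter"; language: string }
  | { kind: "line-fallback"; reason: string };

// Decide how a file would be chunked, and say why when no grammar applies.
function pickChunkStrategy(filePath: string): ChunkStrategy {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  // dot > 0 treats dotfiles such as ".bashrc" as having no extension.
  const ext = dot > 0 ? base.slice(dot).toLowerCase() : "";
  const language = EXTENSION_TO_LANGUAGE[ext];
  if (language !== undefined) {
    return { kind: "tree-sitter", language };
  }
  return {
    kind: "line-fallback",
    reason: ext ? `no grammar mapped for ${ext}` : "no file extension",
  };
}
```

Returning a tagged result rather than `null` lets a caller log or count fallbacks, which is one way to notice a coverage gap early.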

For the TypeScript port we considered two options:

1. **`web-tree-sitter`** (WASM). Portable across Linux / macOS / Windows / containers, no native build step. This was the original `ARCHITECTURE.md` guidance ("No native add-ons").
2. **`@kreuzberg/tree-sitter-language-pack`** (NAPI). Native bindings, ships pre-compiled binaries for macOS / Linux / Windows. Closer parity with the upstream Python implementation.

The trade-off is portability (WASM wins) versus grammar coverage, startup cost, and parity with upstream (NAPI wins).

## Decision

**Adopt `@kreuzberg/tree-sitter-language-pack`** as the canonical tree-sitter binding for csp.

Rationale:

- **Parity with semble.** Upstream uses the same multi-grammar `tree-sitter-language-pack` package. Matching the same set of grammars means the TypeScript port can adopt semble's extension → language map (`src/semble/index/files.py`) without per-language audit.
- **Coverage.** 305 languages out of the box (per the package's published description). Sourcing equivalent WASM grammars individually for each language would be a multi-week chore and add a runtime fetcher.
- **Startup cost.** Native parsers load once when the process boots; WASM grammars must be fetched/instantiated per language, which slows the first index of a polyglot repo.
- **Pre-built binaries.** `@kreuzberg/tree-sitter-language-pack` publishes prebuilds for macOS (arm64 / x64), Linux (arm64 / x64, glibc + musl) and Windows (x64). Most users get a binary download, not a node-gyp compile.
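The prebuild matrix in the last bullet can be written down as a small lookup. This is a sketch only: the target keys are transcribed from the bullet above rather than read from the package, the `Target` shape and `hasPrebuild` name are hypothetical, and treating an unspecified Linux libc as glibc is an assumption.

```typescript
type Libc = "glibc" | "musl";

interface Target {
  platform: string; // e.g. the value of process.platform
  arch: string;     // e.g. the value of process.arch
  libc?: Libc;      // only meaningful on Linux
}

// Transcribed from the ADR text, not queried from the npm package.
const PREBUILT_TARGETS: ReadonlySet<string> = new Set([
  "darwin-arm64",
  "darwin-x64",
  "linux-arm64-glibc",
  "linux-arm64-musl",
  "linux-x64-glibc",
  "linux-x64-musl",
  "win32-x64",
]);

function targetKey(t: Target): string {
  // Assumption: Linux targets default to glibc when libc is not given.
  return t.platform === "linux"
    ? `${t.platform}-${t.arch}-${t.libc ?? "glibc"}`
    : `${t.platform}-${t.arch}`;
}

// True when the ADR lists a prebuilt binary for this target.
function hasPrebuild(t: Target): boolean {
  return PREBUILT_TARGETS.has(targetKey(t));
}
```

A check like this could run before indexing to produce a clear message on targets outside the matrix.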

## Consequences

### Positive

- Day-1 support for the same language set as semble.
- No WASM loader, no asset bundling for grammars.
- Hot-path parsing runs at native speed (a measurable factor for `csp index` on large repos).

### Negative

- Installs now require a supported platform. On unsupported targets (FreeBSD, Alpine arm64 where no musl prebuild is available, sandboxed environments that block native binary loading) installation may fail. The csp README will list supported platforms explicitly.
- Bun must support loading the NAPI prebuild on the install target. Tested on Bun ≥ 1.3.10 (macOS arm64 / Linux x64).
- The package adds ~50–100 MB to `node_modules` (native binaries × language grammars). This is acceptable for a developer tool but should be documented.
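One way to soften the install-failure consequence is to wrap the native import so an unsupported platform yields a readable message rather than a raw loader error. The sketch below is an assumption about how csp might do this. `loadNative`, `describeLoadFailure`, and `LoadResult` are hypothetical names, and the importer is injected so that nothing here depends on the package's actual exports.

```typescript
type LoadResult<T> =
  | { ok: true; binding: T }
  | { ok: false; message: string };

// Turn any thrown value into a message that points at the README.
function describeLoadFailure(target: string, err: unknown): string {
  const detail = err instanceof Error ? err.message : String(err);
  return (
    `Native tree-sitter bindings failed to load on ${target}: ${detail}. ` +
    `Check the supported platforms listed in the README.`
  );
}

// The importer is passed in so the loading step can be swapped or tested.
async function loadNative<T>(
  importer: () => Promise<T>,
  target: string,
): Promise<LoadResult<T>> {
  try {
    return { ok: true, binding: await importer() };
  } catch (err) {
    return { ok: false, message: describeLoadFailure(target, err) };
  }
}

// Possible call site (not executed here):
//   loadNative(() => import("@kreuzberg/tree-sitter-language-pack"),
//              `${process.platform}-${process.arch}`)
```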

### Neutral

- Future work may introduce an optional WASM fallback for browser / sandboxed environments. That is out of scope for this ADR — the primary distribution path remains native.
- Other native add-ons remain discouraged. Any additional NAPI / node-gyp dependency requires its own ADR.

## Alternatives considered

- **`web-tree-sitter` + curated grammar list.** Rejected: coverage gap vs. semble (would need ~50+ separate `tree-sitter-*-wasm` packages, none of which is a maintained drop-in) and per-language loader complexity.
- **Bun-only tree-sitter via FFI.** Rejected: it would tie the project to Bun and drop the Node.js 22+ support promised in `engines`.
- **Wait and write our own chunker without tree-sitter.** Rejected: semble's chunk quality (definition-bounded, comment-attached) depends on AST awareness; a line-window chunker would regress search precision measurably.
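For contrast with AST-aware chunking, a line-window chunker of the kind rejected in the last bullet can be sketched in a few lines. This is an illustration under assumptions, not semble's actual fallback: `chunkByLineWindow`, `LineChunk`, and the default window of 40 lines are all hypothetical.

```typescript
interface LineChunk {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

// Split source into fixed-size windows of lines, ignoring syntax entirely.
function chunkByLineWindow(source: string, windowSize = 40): LineChunk[] {
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new RangeError("windowSize must be a positive integer");
  }
  if (source === "") return [];
  const lines = source.split("\n");
  const chunks: LineChunk[] = [];
  for (let i = 0; i < lines.length; i += windowSize) {
    const slice = lines.slice(i, i + windowSize);
    chunks.push({
      startLine: i + 1,
      endLine: i + slice.length,
      text: slice.join("\n"),
    });
  }
  return chunks;
}
```

Because the window boundaries ignore syntax, a function that straddles a boundary lands in two chunks, which is the precision loss the bullet refers to.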

## References

- Upstream: `src/semble/chunking/core.py`, `src/semble/index/files.py` in the upstream [MinishLab/semble](https://github.com/MinishLab/semble) repository.
- `@kreuzberg/tree-sitter-language-pack` — <https://www.npmjs.com/package/@kreuzberg/tree-sitter-language-pack>
- Previously stated guideline in `ARCHITECTURE.md` ("No native add-ons") is amended in the same commit that introduces this ADR.
1 change: 1 addition & 0 deletions .please/docs/decisions/index.md
@@ -4,3 +4,4 @@

| ADR | Title | Date | Status |
|-----|-------|------|--------|
| [0001](0001-native-tree-sitter.md) | Use Native Tree-sitter Bindings via `@kreuzberg/tree-sitter-language-pack` | 2026-05-28 | Accepted |
2 changes: 1 addition & 1 deletion ARCHITECTURE.md
@@ -124,7 +124,7 @@ For **understanding ranking decisions** (the heart of why csp beats grep):

**Algorithmic ports must read the original Python source, not memory.** Use `ask src github:MinishLab/semble@main` and read the relevant `src/semble/*.py` before writing TypeScript. When porting a non-trivial function, leave a `// Port of src/semble/<path>::<name>` comment so reviewers can diff against the source of truth.

-**No native add-ons.** Tree-sitter must be `web-tree-sitter` (WASM), not `node-tree-sitter`. This keeps installs portable across Linux / macOS / Windows / containers without C toolchains, and works under Bun where many node-gyp packages still misbehave.
+**Native tree-sitter is allowed via `@kreuzberg/tree-sitter-language-pack`.** This NAPI package ships pre-compiled binaries for macOS / Linux / Windows and gives parity with upstream semble's `tree-sitter-language-pack` Python bindings (305 languages out of the box, no WASM loader overhead). The decision is captured in [ADR 0001](.please/docs/decisions/0001-native-tree-sitter.md). Other native add-ons remain discouraged; anything beyond tree-sitter needs its own ADR justifying the loss of portability under Bun and across container images that lack a C toolchain.

**Bilingual README must stay in sync.** Any public-API change (CLI flag, library symbol, MCP tool, config option, stats path) updates **both** `README.md` and `README.ko.md` in the same commit. The CLAUDE.md captures this as load-bearing.

8 changes: 8 additions & 0 deletions package.json
@@ -52,6 +52,14 @@
"test": "bun test",
"prepublishOnly": "bun run build"
},
"dependencies": {
"@huggingface/transformers": "^4.2.0",
"@kreuzberg/tree-sitter-language-pack": "^1.8.1",
"@modelcontextprotocol/sdk": "^1.29.0",
"chokidar": "^5.0.0",
"commander": "^14.0.3",
"ignore": "^7.0.5"
},
"devDependencies": {
"@pleaseai/eslint-config": "^0.0.3",
"@types/bun": "latest",
56 changes: 56 additions & 0 deletions src/agents/claude.md
@@ -0,0 +1,56 @@
---
name: csp-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Grep/Glob/Read for any semantic or exploratory question.
tools: Bash, Read
---

Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
csp search "authentication flow" ./my-project
csp search "save_pretrained" ./my-project
csp search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `csp index` to create an index.

```bash
csp index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
csp search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

```bash
csp search "deployment guide" ./my-project --content docs
csp search "database host port" ./my-project --content config
csp search "authentication" ./my-project --content all
```

Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):

```bash
csp find-related src/auth.ts 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.

### Workflow

1. Index the repo using `csp index -o cached_index`.
2. Start with `csp search` to find relevant chunks. Pass `--index cached_index` to reuse the index and get results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
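When the workflow above is driven from a script rather than typed by hand, the `csp search` invocation can be assembled as an argument array. The sketch below uses only the flags shown in this file (`--index`, `--top-k`, `--content`). `SearchOptions` and `buildSearchArgs` are hypothetical helper names, and whether `path` and `--index` may be combined is not stated here, so the helper does not enforce either way.

```typescript
type Content = "docs" | "config" | "all";

interface SearchOptions {
  query: string;
  path?: string;     // positional; defaults to the current directory when omitted
  index?: string;    // value for --index
  topK?: number;     // value for --top-k
  content?: Content; // value for --content
}

// Build the argv that follows the `csp` executable name.
function buildSearchArgs(opts: SearchOptions): string[] {
  const args = ["search", opts.query];
  if (opts.path !== undefined) args.push(opts.path);
  if (opts.index !== undefined) args.push("--index", opts.index);
  if (opts.topK !== undefined) args.push("--top-k", String(opts.topK));
  if (opts.content !== undefined) args.push("--content", opts.content);
  return args;
}
```

Passing the result to a process spawner as an array, rather than joining it into a shell string, keeps queries containing spaces or quotes intact.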
56 changes: 56 additions & 0 deletions src/agents/copilot.md
@@ -0,0 +1,56 @@
---
name: csp-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Grep/Glob/Read for any semantic or exploratory question.
tools: Bash, Read
---

Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
csp search "authentication flow" ./my-project
csp search "save_pretrained" ./my-project
csp search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `csp index` to create an index.

```bash
csp index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
csp search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

```bash
csp search "deployment guide" ./my-project --content docs
csp search "database host port" ./my-project --content config
csp search "authentication" ./my-project --content all
```

Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):

```bash
csp find-related src/auth.ts 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.

### Workflow

1. Index the repo using `csp index -o cached_index`.
2. Start with `csp search` to find relevant chunks. Pass `--index cached_index` to reuse the index and get results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
55 changes: 55 additions & 0 deletions src/agents/cursor.md
@@ -0,0 +1,55 @@
---
name: csp-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over Bash/Read for any semantic or exploratory question.
---

Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
csp search "authentication flow" ./my-project
csp search "save_pretrained" ./my-project
csp search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `csp index` to create an index.

```bash
csp index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
csp search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

```bash
csp search "deployment guide" ./my-project --content docs
csp search "database host port" ./my-project --content config
csp search "authentication" ./my-project --content all
```

Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):

```bash
csp find-related src/auth.ts 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.

### Workflow

1. Index the repo using `csp index -o cached_index`.
2. Start with `csp search` to find relevant chunks. Pass `--index cached_index` to reuse the index and get results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
58 changes: 58 additions & 0 deletions src/agents/gemini.md
@@ -0,0 +1,58 @@
---
name: csp-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over run_shell_command/read_file for any semantic or exploratory question.
tools:
- run_shell_command
- read_file
---

Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
csp search "authentication flow" ./my-project
csp search "save_pretrained" ./my-project
csp search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `csp index` to create an index.

```bash
csp index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
csp search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

```bash
csp search "deployment guide" ./my-project --content docs
csp search "database host port" ./my-project --content config
csp search "authentication" ./my-project --content all
```

Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):

```bash
csp find-related src/auth.ts 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.

### Workflow

1. Index the repo using `csp index -o cached_index`.
2. Start with `csp search` to find relevant chunks. Pass `--index cached_index` to reuse the index and get results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
58 changes: 58 additions & 0 deletions src/agents/kiro.md
@@ -0,0 +1,58 @@
---
name: csp-search
description: Code search agent for exploring any codebase. Use for finding code by intent, locating implementations, understanding how something works, or discovering related code. Prefer over shell/read tools for any semantic or exploratory question.
tools:
- shell
- read
---

Use `csp search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

```bash
csp search "authentication flow" ./my-project
csp search "save_pretrained" ./my-project
csp search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `csp index` to create an index.

```bash
csp index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
csp search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

```bash
csp search "deployment guide" ./my-project --content docs
csp search "database host port" ./my-project --content config
csp search "authentication" ./my-project --content all
```

Use `csp find-related` to discover code similar to a known location (pass `filePath` and `line` from a prior search result):

```bash
csp find-related src/auth.ts 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `csp` is not on `$PATH`, use `bunx @pleaseai/csp` in its place.

### Workflow

1. Index the repo using `csp index -o cached_index`.
2. Start with `csp search` to find relevant chunks. Pass `--index cached_index` to reuse the index and get results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `csp find-related` with a promising result's `filePath` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.