feat(deps): bootstrap external dependencies and adr #6
Bootstrap Unit 0 of the semble → @pleaseai/csp port: add the external runtime deps that downstream units will pull from, record the ADR that amends ARCHITECTURE.md's 'No native add-ons' guideline, and copy the six agent prompts that 'csp init' will render.

Dependencies (runtime):
- @kreuzberg/tree-sitter-language-pack ^1.8.1: NAPI tree-sitter parsers (305 languages) for parity with upstream semble's tree-sitter-language-pack
- @modelcontextprotocol/sdk ^1.29.0: MCP server (csp mcp tools)
- ignore ^7.0.5: .gitignore / .cspignore matching for file walker
- commander ^14.0.3: CLI arg parsing
- chokidar ^5.0.0: index file watcher for mcp.serve
- @huggingface/transformers ^4.2.0: Model2Vec-equivalent token embeddings

ADR 0001 records the decision to adopt native tree-sitter bindings instead of web-tree-sitter (WASM). ARCHITECTURE.md's 'No native add-ons' paragraph is amended to point at the ADR.

Agent markdown sources copied from semble's src/semble/agents/ with 'semble' → 'csp', the uvx fallback replaced by 'bunx @pleaseai/csp', and example file paths re-pointed at .ts.

BM25 dep deferred: neither 'bm25' (minimal, no custom tokenizer) nor 'wink-bm25-text-search' (heavy NLP transitive deps) fit. Unit 9 will implement BM25 inline.

Empty .gitkeep files seed src/chunking, src/indexing, src/mcp, src/ranking for downstream worker units.
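Since BM25 is deferred to an inline implementation in Unit 9, here is a rough sketch of how small such a scorer can be. Everything in it is an assumption made for illustration: the names `tokenize`, `buildIndex`, and `score`, the regex tokenizer, the defaults `k1 = 1.2` and `b = 0.75`, and the Lucene-style non-negative IDF variant. None of it comes from this PR or from upstream semble.

```typescript
// Hypothetical inline BM25 sketch; all names and defaults are illustrative.
interface Bm25Index {
  docs: string[][] // tokenized documents
  docFreq: Map<string, number> // number of documents containing each term
  avgLen: number // average document length in tokens
}

// Assumed tokenizer: lowercase, split on anything that is not [a-z0-9_].
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter(t => t.length > 0)
}

function buildIndex(corpus: string[]): Bm25Index {
  const docs = corpus.map(tokenize)
  const docFreq = new Map<string, number>()
  for (const doc of docs) {
    // Count each term once per document.
    for (const term of new Set(doc)) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1)
    }
  }
  const totalLen = docs.reduce((sum, d) => sum + d.length, 0)
  const avgLen = docs.length > 0 ? totalLen / docs.length : 0
  return { docs, docFreq, avgLen }
}

// Returns one BM25 score per document, in corpus order.
function score(index: Bm25Index, query: string, k1 = 1.2, b = 0.75): number[] {
  const n = index.docs.length
  const terms = tokenize(query)
  return index.docs.map((doc) => {
    const tf = new Map<string, number>()
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1)
    let total = 0
    for (const term of terms) {
      const f = tf.get(term) ?? 0
      if (f === 0) continue
      const df = index.docFreq.get(term) ?? 0
      // Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5))
      // Length normalisation against the average document length.
      const norm = f + k1 * (1 - b + b * (doc.length / index.avgLen))
      total += idf * (f * (k1 + 1)) / norm
    }
    return total
  })
}
```

Keeping the tokenizer as a plain function is what makes a custom, code-aware tokenizer easy to swap in later, which is the gap the commit message names in the 'bm25' package.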
Code Review
This pull request introduces ADR 0001 to adopt native tree-sitter bindings via @kreuzberg/tree-sitter-language-pack, updating ARCHITECTURE.md and package.json accordingly. It also adds agent configuration files for various platforms (Claude, Copilot, Cursor, Gemini, Kiro, and Opencode). Feedback on the changes includes addressing a hardcoded absolute local path in the ADR references and correcting snake_case field names (file_path) to camelCase (filePath) in the agent documentation files to align with the project's TypeScript architecture invariants.
No issues found across 14 files
Architecture diagram
sequenceDiagram
participant Agent as AI Agent (Claude/Copilot/etc.)
participant CLI as csp CLI (commander)
participant Indexer as Indexer (chunking/indexing)
participant TSPack as @kreuzberg/tree-sitter-language-pack
participant Embedder as @huggingface/transformers
participant MCP as MCP Server (@modelcontextprotocol/sdk)
participant FileSys as File System / chokidar
participant Ignore as .gitignore (ignore package)
Note over Agent,Ignore: NEW: Core dependency and scaffolding bootstrapped by this PR
Agent->>CLI: csp search "authentication flow" ./my-project
CLI->>CLI: Parse args with commander
CLI->>Indexer: invoke search(query, path, opts)
alt Index exists (--index flag)
Indexer->>Indexer: Load pre-built index
else No index
Indexer->>Ignore: check .gitignore rules
Ignore-->>Indexer: allowed paths
Indexer->>FileSys: traverse files (chokidar watch)
FileSys-->>Indexer: file stream
Indexer->>TSPack: parse source file (native tree-sitter)
TSPack-->>Indexer: AST (305 languages supported)
Indexer->>Indexer: chunk by definitions/comments (AST-aware)
Indexer->>Embedder: generate token embeddings
Embedder-->>Indexer: embedding vectors
Indexer->>Indexer: build/search index
end
Indexer-->>CLI: ranked chunks (top-k)
CLI-->>Agent: results (file_path, line, content)
Note over Agent,MCP: Alternative access via MCP protocol
Agent->>MCP: csp_search (MCP tool call)
MCP->>Indexer: internal search call
Indexer-->>MCP: results
MCP-->>Agent: formatted MCP response
Note over CLI,FileSys: NEW: Agent prompt files seed usage instructions
Agent->>Agent: reads agent prompt (claude.md/copilot.md/cursor.md/gemini.md/kiro.md/opencode.md)
Agent->>Agent: learns csp search/index/find-related workflow
Agent->>CLI: follows prompted commands
- ADR 0001: replace hardcoded /Users/lms/... ask cache path with link to upstream MinishLab/semble repository.
- src/agents/*.md (claude, copilot, cursor, gemini, kiro, opencode): rename field reference `file_path` to `filePath` to match the camelCase invariant in ARCHITECTURE.md and the public surface defined in README (chunk.filePath).
amondnet left a comment
Applied 13, deferred 0.
All gemini-code-assist feedback addressed in 0635880:
- ADR 0001: replaced the hardcoded `/Users/lms/.ask/...` ask cache path with a link to the upstream MinishLab/semble repository (portable for all developers).
- src/agents/{claude,copilot,cursor,gemini,kiro,opencode}.md: renamed `file_path` to `filePath` in both the find-related description (~line 37) and the workflow step (~line 55) so the agent prompts match the camelCase invariant from ARCHITECTURE.md and the documented public surface (chunk.filePath) in README.
cubic-dev-ai found no issues.
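To make the `file_path` → `filePath` rename concrete, here is a small sketch of mapping a snake_case record onto the camelCase surface. Only `filePath` is confirmed by this thread (as `chunk.filePath` in the README). The `line` and `content` fields are taken from the diagram's result tuple, and the type and function names (`UpstreamChunk`, `Chunk`, `toChunk`) are hypothetical.

```typescript
// Hypothetical shapes: only `filePath` is confirmed by the PR discussion.
interface UpstreamChunk {
  file_path: string // snake_case, as in the upstream Python project
  line: number
  content: string
}

interface Chunk {
  filePath: string // camelCase, per the ARCHITECTURE.md invariant
  line: number
  content: string
}

// Map a snake_case record to the camelCase public shape.
function toChunk(raw: UpstreamChunk): Chunk {
  return { filePath: raw.file_path, line: raw.line, content: raw.content }
}
```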
Summary
Adds `packages/layer/modules/markdown-rewrite.ts`, a build-time Nuxt module that injects rewrite rules into Vercel's build-output `config.json` so AI agents get raw markdown instead of the SPA shell.

Ports docus upstream commits:
- `6fd8686b`: `feat(llms): redirect homepage to /llms.txt`
- `9ceafe6f`: `feat(llms): add docs page redirection to raw markdown for agents`

See `docs/docus-upstream-changes.md` item #9.

Behaviour
- Runs only when `preset.startsWith('vercel')`, so it covers `vercel`, `vercel-edge`, `vercel-static`, etc.
- Read `<output.publicDir>/../config.json` (the Vercel build-output config), confirm `llms.txt` was emitted, then `unshift` route pairs onto `routes` so they fire before the SPA fallback.
- Rules match requests with `Accept: text/markdown` or `User-Agent: curl/*`:
  - `^/$` → `/llms.txt`
  - `^/<locale>/?$` → `/llms.txt` (one per `runtimeConfig.public.i18n.locales` entry)
  - `^<page>$` → `/raw<page>.md` (one per `/raw/...md` link discovered in `llms.txt`)
- The `has` array is AND-ed, so OR semantics between the `Accept` and `User-Agent` matchers require emitting two rule entries per `src → dest` pair.

Conventions
- Follows the pattern of `packages/layer/modules/{config,shadcn}.ts` (`defineNuxtModule` + named module).
- A `VercelBuildOutputConfig` / `VercelRoute` / `VercelHeaderHas` interface describes the parts we touch, with no `any`.
- Registered in `packages/layer/nuxt.config.ts` immediately after `./modules/shadcn`.

Verification
Output (homepage rules only, because the current `nuxt-llms` config emits links to canonical `/docs/...` URLs rather than `/raw/...md`; the per-page rules will start being emitted as soon as `llms.txt` carries raw-md links):

{ "src": "^/$", "dest": "/llms.txt", "headers": { "content-type": "text/markdown; charset=utf-8" }, "has": [{ "type": "header", "key": "accept", "value": "(.*)text/markdown(.*)" }], "continue": true }
{ "src": "^/$", "dest": "/llms.txt", "headers": { "content-type": "text/markdown; charset=utf-8" }, "has": [{ "type": "header", "key": "user-agent", "value": "curl/.*" }], "continue": true }

Default (cloudflare) build is unaffected: the module bails silently and `dist/` is produced as before.

Notes
- Per-page rules rely on `llms.txt` to enumerate `/raw/...md` URLs. The current site config doesn't, so only the homepage routes are emitted today. This matches the upstream behaviour and avoids accidentally rewriting asset URLs. Wiring `nuxt-llms` to also emit raw-md URLs is tracked separately.
- `config.json`.

Follow-up to #27.
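The two-entries-per-pair rule described under Behaviour can be sketched roughly as follows. The `VercelRoute` and `VercelHeaderHas` shapes here are inferred from the JSON output in this description, not copied from the module. `agentRoutes` and `prependAgentRoutes` are hypothetical helper names, and the real module may structure this differently.

```typescript
// Hypothetical shapes, modelled on the JSON output shown in this description.
interface VercelHeaderHas {
  type: 'header'
  key: string
  value: string
}

interface VercelRoute {
  src: string
  dest: string
  headers?: Record<string, string>
  has?: VercelHeaderHas[]
  continue?: boolean
}

// One matcher per "agent" signal. Entries inside a single `has` array are
// AND-ed, so OR semantics need one route entry per matcher.
const AGENT_MATCHERS: VercelHeaderHas[] = [
  { type: 'header', key: 'accept', value: '(.*)text/markdown(.*)' },
  { type: 'header', key: 'user-agent', value: 'curl/.*' },
]

// Expand one src → dest pair into one route per matcher.
function agentRoutes(src: string, dest: string): VercelRoute[] {
  return AGENT_MATCHERS.map(matcher => ({
    src,
    dest,
    headers: { 'content-type': 'text/markdown; charset=utf-8' },
    has: [matcher],
    continue: true,
  }))
}

// Prepend the generated routes so they are evaluated before existing ones
// (for example an SPA fallback). Mutates `routes` in place.
function prependAgentRoutes(routes: VercelRoute[], pairs: Array<[string, string]>): void {
  const extra = pairs.flatMap(([src, dest]) => agentRoutes(src, dest))
  routes.unshift(...extra)
}
```

Calling `prependAgentRoutes(routes, [['^/$', '/llms.txt']])` on a config whose `routes` holds only an SPA fallback would leave two agent rules ahead of that fallback, which matches the two-line output above.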
Summary by cubic
Bootstraps Unit 0 for `@pleaseai/csp`: adds core runtime deps, adopts native tree-sitter via ADR 0001, and seeds agent prompts and directories for upcoming indexing/MCP/ranking work. Also polishes ADR and agent prompts to align with the public API.

Dependencies
- `@kreuzberg/tree-sitter-language-pack` for native tree-sitter parsers.
- `@modelcontextprotocol/sdk` for MCP server tools.
- `@huggingface/transformers` for token embeddings.
- `commander`, `chokidar`, and `ignore` for CLI, file watching, and ignore handling.

New Features
- Added ADR 0001 and amended `ARCHITECTURE.md` to allow native tree-sitter.
- Seeded `src/chunking`, `src/indexing`, `src/mcp`, and `src/ranking` with scaffolding.
- Agent prompts now reference `filePath` (was `file_path`) to match the public API.

Written for commit 0635880. Summary will update on new commits.