Skip to content

Latest commit

 

History

History
282 lines (209 loc) · 8.59 KB

File metadata and controls

282 lines (209 loc) · 8.59 KB

Technical Specification

Overview

jcodemunch-mcp pre-indexes repository source code using tree-sitter AST parsing, extracting a structured catalog of every symbol (function, class, method, constant, type). Each symbol stores its signature + one-line summary, with full source retrievable on demand via O(1) byte-offset seeking.

Token Savings

Scenario Raw dump codemunch Savings
Explore 500-file repo structure ~200,000 tokens ~2,000 tokens 99%
Find a specific function ~40,000 tokens ~200 tokens 99.5%
Read one function body ~40,000 tokens ~500 tokens 98.7%
Understand module API ~15,000 tokens ~800 tokens 94.7%

MCP Tools (11)

Indexing Tools

index_repo — Index a GitHub repository

{
  "url": "owner/repo",
  "use_ai_summaries": true
}

Fetches source via git/trees?recursive=1 (single API call), filters through the security pipeline, parses with tree-sitter, summarizes, and saves the index plus raw files.

index_folder — Index a local folder

{
  "path": "/path/to/project",
  "extra_ignore_patterns": ["*.generated.*"],
  "follow_symlinks": false
}

Walks the local directory with full security controls: path traversal prevention, symlink escape protection, secret detection, binary filtering, and .gitignore respect.

invalidate_cache — Delete index for a repository

{
  "repo": "owner/repo"
}

Deletes both the index JSON and raw content directory.


Discovery Tools

list_repos — List indexed repositories

No input required. Returns all indexed repositories with symbol counts, file counts, languages, and index version.

get_file_tree — Get file structure

{
  "repo": "owner/repo",
  "path_prefix": "src/"
}

Returns a nested directory tree with per-file language and symbol count annotations.

get_file_outline — Get symbols in a file

{
  "repo": "owner/repo",
  "file_path": "src/main.py"
}

Returns a hierarchical symbol tree (classes contain methods) with signatures and summaries. Source code is not included; use get_symbol for that.

get_repo_outline — High-level repository overview

{
  "repo": "owner/repo"
}

Returns directory file counts, language breakdown, and symbol kind distribution. Lighter than get_file_tree.


Retrieval Tools

get_symbol — Get full source of a symbol

{
  "repo": "owner/repo",
  "symbol_id": "src/main.py::MyClass.login#method",
  "verify": true,
  "context_lines": 3
}

Retrieves source via byte-offset seeking (O(1)). Optional verify re-hashes the source and compares it to the stored content_hash. Optional context_lines includes surrounding lines.

get_symbols — Batch retrieve multiple symbols

{
  "repo": "owner/repo",
  "symbol_ids": ["id1", "id2", "id3"]
}

Returns a list of symbols plus an error list for any IDs not found.


Search Tools

search_symbols — Search across all symbols

{
  "repo": "owner/repo",
  "query": "authenticate",
  "kind": "function",
  "language": "python",
  "file_pattern": "src/**/*.py",
  "max_results": 10
}

Weighted scoring search across name, signature, summary, keywords, and docstring. All filters are optional.

search_text — Full-text search across file contents

{
  "repo": "owner/repo",
  "query": "TODO",
  "file_pattern": "*.py",
  "max_results": 20
}

Case-insensitive substring search across indexed file contents. Returns matching lines with file, line number, and surrounding context. Use when symbol search misses (string literals, comments, config values).


Data Models

Symbol

@dataclass
class Symbol:
    id: str                  # "{file_path}::{qualified_name}#{kind}"
    file: str                # Relative file path
    name: str                # Symbol name
    qualified_name: str      # Dot-separated with parent context
    kind: str                # function | class | method | constant | type
    language: str            # python | javascript | typescript | go | rust | java | php
    signature: str           # Full signature line(s)
    content_hash: str = ""   # SHA-256 of source bytes (drift detection)
    docstring: str = ""
    summary: str = ""
    decorators: list[str]    # Decorators/attributes
    keywords: list[str]      # Search keywords
    parent: str | None       # Parent symbol ID (methods → class)
    line: int = 0            # Start line (1-indexed)
    end_line: int = 0        # End line (1-indexed)
    byte_offset: int = 0     # Start byte in raw file
    byte_length: int = 0     # Byte length of source

CodeIndex

@dataclass
class CodeIndex:
    repo: str                        # "owner/repo"
    owner: str
    name: str
    indexed_at: str                  # ISO timestamp
    index_version: int               # Schema version (current: 2)
    source_files: list[str]
    languages: dict[str, int]        # language → file count
    symbols: list[dict]              # Serialized symbols (no source)
    file_hashes: dict[str, str]      # file_path → SHA-256 (for incremental)
    git_head: str                    # HEAD commit hash (for git repos, empty if unavailable)

File Discovery

GitHub Repositories

Single API call: GET /repos/{owner}/{repo}/git/trees/HEAD?recursive=1

Local Folders

Recursive directory walk with the full security pipeline.

Filtering Pipeline (Both Paths)

  1. Extension filter — must be in LANGUAGE_EXTENSIONS (.py, .js, .jsx, .ts, .tsx, .go, .rs, .java, .php)
  2. Skip patternsnode_modules/, vendor/, .git/, build/, dist/, lock files, minified files, etc.
  3. .gitignore — respected via the pathspec library
  4. Secret detection.env, *.pem, *.key, *.p12, credentials files excluded
  5. Binary detection — extension-based + null-byte content sniffing
  6. Size limit — 500 KB per file (configurable)
  7. File count limit — 500 files max, prioritized: src/lib/pkg/cmd/internal/ → remainder

Response Envelope

All tools return a _meta object with timing, context, and token savings:

{
  "_meta": {
    "timing_ms": 42,
    "repo": "owner/repo",
    "symbol_count": 387,
    "truncated": false,
    "content_verified": true,
    "tokens_saved": 2450,
    "total_tokens_saved": 184320
  }
}
  • tokens_saved: Tokens saved by this specific call (raw file bytes vs response bytes, divided by 4)
  • total_tokens_saved: Cumulative tokens saved across all tool calls, persisted to ~/.code-index/_savings.json

Present on: get_file_outline, get_symbol, get_symbols, get_repo_outline, search_symbols.


Error Handling

All errors return:

{
  "error": "Human-readable message",
  "_meta": { "timing_ms": 1 }
}
Scenario Behavior
Repository not found (GitHub 404) Error with message
Rate limited (GitHub 403) Error with reset time; suggest setting GITHUB_TOKEN
File fetch fails File skipped; indexing continues
Parse fails (single file) File skipped; indexing continues
No source files found Error message returned
Symbol ID not found Error in response
Repository not indexed Error suggesting indexing first
AI summarization fails Falls back to docstring or signature
Index version mismatch Old index ignored; full reindex required

Environment Variables

Variable Purpose Required
GITHUB_TOKEN GitHub API authentication (higher limits, private repos) No
ANTHROPIC_API_KEY AI summarization via Claude Haiku (takes priority if both keys set) No
GOOGLE_API_KEY AI summarization via Gemini Flash (used if ANTHROPIC_API_KEY not set) No
CODE_INDEX_PATH Custom storage path (default: ~/.code-index/) No