A general-purpose static analysis system for multi-language codebases.
Extract. Unify. Analyze.
Arboretum is a static analysis system that extracts semantic graphs from source code during compilation. It unifies shared definitions across translation units and enriches the graph with derived program properties including:
- Ownership structure
- Aliasing relationships
- Lifetime analysis
- Control flow
- Control dependence
The extracted data forms a code property graph that can be queried via PostgreSQL for downstream analysis tools, security analyzers, documentation generators, refactoring tools, and AI/ML pipelines.
Arboretum is designed as a standalone infrastructure component with no dependency on any downstream consumer. Any system that needs deep semantic understanding of source code can consume its output through a language-agnostic query interface.
The C++ to Rust translation tool is one such consumer, not a requirement.
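To make the idea of cross-TU unification concrete, here is a minimal sketch in Python. It assumes that two definitions unify when their kind, qualified name, and a normalized body all match, and it groups them by a content hash. The function names, the tuple layout, and the hashing scheme are hypothetical illustrations, not Arboretum's actual implementation.

```python
import hashlib
from collections import defaultdict


def definition_key(kind: str, qualified_name: str, normalized_body: str) -> str:
    """Hash the parts of a definition that must match for two copies to unify."""
    payload = "\x00".join((kind, qualified_name, normalized_body))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def unify(definitions):
    """Group per-TU definitions by content key.

    `definitions` is an iterable of (tu, kind, qualified_name, normalized_body).
    Returns {key: sorted list of TUs that contain that definition}.
    """
    groups = defaultdict(set)
    for tu, kind, name, body in definitions:
        groups[definition_key(kind, name, body)].add(tu)
    return {key: sorted(tus) for key, tus in groups.items()}


# Hypothetical input: the same inline function seen through a header in two TUs.
extracted = [
    ("a.cpp", "FunctionDecl", "util::clamp", "int(int,int,int){...}"),
    ("b.cpp", "FunctionDecl", "util::clamp", "int(int,int,int){...}"),
    ("b.cpp", "FunctionDecl", "main", "int(){...}"),
]
unified = unify(extracted)
```

In this sketch the two copies of `util::clamp` collapse into one entity that remembers both translation units, while `main` stays a separate entity.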
| Feature | Description |
|---|---|
| Multi-language Support | C/C++ currently; Rust, Python, JavaScript, Go, and Java are planned |
| Semantic Extraction | Full AST extraction with LLVM IR at multiple optimization stages |
| Cross-TU Unification | Identical definitions across translation units are unified into single entities |
| Program Analysis | Def-use chains, alias analysis, control dependence, dominator trees, lifetime analysis |
| Package Integration | Track package versions, symbol versions, and dependencies |
| Distro-scale | Designed to analyze entire Linux distributions |
| PostgreSQL Backend | Query results via SQL with recursive CTEs for fixpoint analysis |
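As an illustration of fixpoint analysis with a recursive CTE, the sketch below computes the set of functions reachable from `main` over call edges. It runs against an in-memory SQLite database as a stand-in for PostgreSQL so that it is self-contained; the `WITH RECURSIVE` syntax used here is also accepted by PostgreSQL. The column names (`id`, `kind`, `name`, `src`, `dst`) and the `CALLS` edge kind are assumptions made for the example, not the actual Arboretum schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cpg_node (id INTEGER PRIMARY KEY, kind TEXT, name TEXT);
    CREATE TABLE cpg_edge (src INTEGER, dst INTEGER, kind TEXT);
    INSERT INTO cpg_node VALUES
        (1, 'FunctionDecl', 'main'),
        (2, 'FunctionDecl', 'parse'),
        (3, 'FunctionDecl', 'lex'),
        (4, 'FunctionDecl', 'unused');
    INSERT INTO cpg_edge VALUES
        (1, 2, 'CALLS'),
        (2, 3, 'CALLS'),
        (3, 2, 'CALLS');
""")

# UNION (rather than UNION ALL) drops rows already seen, so the recursion
# reaches a fixpoint even though parse and lex call each other.
REACHABLE = """
WITH RECURSIVE reachable(id) AS (
    SELECT id FROM cpg_node WHERE name = ?
    UNION
    SELECT e.dst
    FROM cpg_edge AS e
    JOIN reachable AS r ON e.src = r.id
    WHERE e.kind = 'CALLS'
)
SELECT n.name
FROM reachable AS r
JOIN cpg_node AS n ON n.id = r.id
ORDER BY n.name;
"""

reachable_from_main = [row[0] for row in conn.execute(REACHABLE, ("main",))]
print(reachable_from_main)
```

The same pattern extends to other transitive properties, such as following def-use or control-dependence edges, by changing the edge kind in the recursive step.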
┌─────────────────────────────────────────────┐
│                 Source Code                 │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│               Compiler Plugin               │
│  - Clang (C/C++)                            │
│  - Rustc (Rust - future)                    │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│              PostgreSQL Store               │
│  - Normalized Graph (cpg_node, cpg_edge)    │
│  - Language-Specific Tables                 │
│  - Build Artifacts                          │
│  - Analysis Results                         │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│               Query Interface               │
│  - SQL with PostgreSQL                      │
│  - pgvector extension support               │
│  - Recursive CTEs for analysis              │
└─────────────────────────────────────────────┘
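The split between the normalized graph and the language-specific tables can be pictured with a small sketch. It assumes each language-specific row (here `FunctionDecl`) is keyed by the id of its `cpg_node` row, so generic graph queries and language-aware queries can be joined. The column names `node_id` and `is_inline` are hypothetical, and SQLite is used in place of PostgreSQL only to keep the example runnable on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Language-agnostic layer: every extracted entity is a node.
    CREATE TABLE cpg_node (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);

    -- Language-specific layer: one table per AST node type (assumed layout).
    CREATE TABLE FunctionDecl (
        node_id   INTEGER PRIMARY KEY REFERENCES cpg_node(id),
        name      TEXT NOT NULL,
        is_inline INTEGER NOT NULL
    );

    INSERT INTO cpg_node VALUES
        (1, 'FunctionDecl'), (2, 'FunctionDecl'), (3, 'VarDecl');
    INSERT INTO FunctionDecl VALUES (1, 'main', 0), (2, 'clamp', 1);
""")

# Join the generic node with its language-specific detail row.
rows = conn.execute("""
    SELECT n.id, n.kind, f.name
    FROM cpg_node AS n
    JOIN FunctionDecl AS f ON f.node_id = n.id
    WHERE f.name = 'main'
""").fetchall()
print(rows)
```

A layout along these lines lets graph-level analyses ignore language details while still allowing a consumer to recover them through a join.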
# Install dependencies
sudo apt install cmake postgresql postgresql-contrib
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build LLVM (first time only - ~15-20 minutes)
make llvm-project/build/llvm-stamp

# Build the project
make arboretum

# Compile with Arboretum plugin
clang++ -fplugin=./build/libarboretum.so \
-std=c++20 \
your_code.cpp

# Query the results
psql -d arboretum -c "SELECT * FROM cpg_node LIMIT 10;"
psql -d arboretum -c "SELECT * FROM FunctionDecl WHERE name = 'main';"

| Document | Description |
|---|---|
| AGENTS.md | Project overview for AI agents and developers |
| ROADMAP.md | Detailed milestones and release timeline |
| TASKS.md | Prioritized task board for contributors |
| PROJECT_OVERVIEW.md | Executive summary and architecture highlights |
| PROJECT_STRUCTURE.md | Directory layout and data model |
| DOCS_INDEX.md | Documentation navigation guide |
| Component | Description |
|---|---|
| reificator | Clang plugin for schema generation and AST extraction |
| reify-cpp | C++ AST visitor library |
| reify-rs | Rust AST reification with PostgreSQL I/O |
| arboretum-ffi | FFI bindings for C++ ↔ Rust communication |
| arboretum-plugin | Clang plugin integration |
- C/C++ extraction with LLVM IR
- Cross-TU unification
- All V1 analyses (def-use, alias, CDG, etc.)
- Docker distribution (RHEL, Debian, Ubuntu)
- Rust extraction
- Additional language support (Python, JS, Go, Java)
- Package registry
- Distro build system integration
- Global catalog for shared analysis
- Enterprise features
- Advanced analyses
- AI/ML integrations
Arboretum enables:
| Use Case | Example |
|---|---|
| Security Analysis | Identify vulnerabilities across dependencies |
| Code Translation | Translate C++ to Rust with semantic awareness |
| Documentation | Generate accurate documentation from the semantic graph |
| Refactoring | Safe, semantic-aware code transformations |
| AI/ML Training | Provide structured code data for model training |
| Legacy Migration | Understand dependencies before modernization |
We welcome contributions! Here's how to get involved:
- Review TASKS.md for available tasks
- Review ROADMAP.md for project direction
- Read AGENTS.md for project architecture
- Comment on a task to claim it
- Submit a PR
This project is licensed under the MIT License. See LICENSE for details.
Arboretum is designed to serve the broader software engineering community. We're grateful to the LLVM, Rust, and PostgreSQL communities whose tools make this project possible.
This is a research and development project. While we strive for quality, the API and schema may change as we iterate toward V1.
