Updated — March 9, 2026
The Arboretum is a general-purpose static analysis system for multi-language codebases. It extracts a semantic graph from source code during compilation, unifies shared definitions across translation units, and enriches the graph with derived program properties including ownership structure, aliasing, lifetimes, control flow, and control dependence.
The engine is designed as a standalone, reusable infrastructure component with no dependency on any downstream consumer. Translation tools, security analyzers, documentation generators, refactoring tools, and any other system that needs deep semantic understanding of source code can consume its output via a language-agnostic query API. The C++ to Rust translation tool is one such consumer.
The engine launches with C++ support and Rust as the second language. The schema and analysis plugin architecture are designed from the outset to support arbitrary additional languages without schema migration.
The engine is fully open source. A hosted global catalog provides shared storage and deduplication of analysis results across users and codebases, with a freemium model covering open source packages.
Current Status (March 2026):
- C++ Clang plugin extraction to Postgres is in progress
- Rust extraction planned for MIR/HIR/LLVM IR
- Deep build system integration deferred; wrapper scripts used in the interim
- All analyses use SQL CTEs (no Souffle for V1)
V1 Scope:
- Single package build at scale
- C/C++ extraction with LLVM IR at PreOpt/PostInline/PostOpt stages
- Cross-TU unification via structural hash + verification
- All analyses: def-use chains, alias analysis, CDG, dominator tree, lifetime extents, call graph, ownership inference
- First consumers: AI/ML data pipelines that sample from the dataset
Success Criteria:
- Extract and analyze the most commonly installed applications, tools, and libraries together
- Output available for downstream consumption
- Data computed rapidly enough for AI model feature engineering
- Consumer agnostic. The engine computes properties of programs. What consumers do with those properties is not the engine's concern. No translation-specific concepts appear in the engine's data model or APIs.
- Language agnostic core. The normalized graph schema and analysis plugin interface are language-independent. Language-specific extractors and analyses sit alongside the agnostic core without polluting it.
- Language-specific depth. Language-specific tables coexist with the normalized core, enabling analyses that require language-specific attributes without forcing all consumers through a lowest-common-denominator representation.
- Append-only. TU subgraphs are immutable once written. Recompilation produces a new subgraph rather than mutating an existing one. This simplifies consistency, enables versioning, and supports the content-addressed catalog.
- Build-system agnostic. Extraction happens inside the compiler via plugins injected through wrapper scripts. Any build system that invokes a supported compiler contributes to the graph automatically.
- Distro-scale. The engine is designed to analyze entire Linux distributions. Package and library metadata are first-class concepts, not afterthoughts.
- Fixpoint analysis via SQL. Program analyses are expressed as Postgres recursive CTEs, providing Datalog fixpoint semantics without an additional runtime dependency.
| Component | Status | Notes |
|---|---|---|
| Clang Plugin | In Progress | RecursiveASTVisitor + LLVM IR extraction writing directly to Postgres |
| Rustc Plugin | Planned | HIR + MIR + LLVM IR extraction for Rust TUs |
| Compiler/Linker Wrappers | Planned | Rust binaries injecting plugin flags and capturing link artifacts |
| Graph Store | In Progress | Postgres: normalized cpg_node/cpg_edge + language-specific tables |
| Build Artifact Layer | Planned | BuildArtifact, symbol_provider, runtime_resolution from linker wrapper + postlink binary |
| Graph Unification | Planned | Two-phase: structural hash via recursive CTE, then candidate verification |
| Program Analysis | Planned | SQL CTE-based analyses (no Souffle for V1) |
| Package Registry | Planned | Package, library, and distro metadata layer |
| Build Integration | Planned | RHEL/Fedora (primary), Debian/Ubuntu (secondary) via compiler wrappers |
| Global Catalog | Deferred | PostgreSQL for now; parquet for offline storage later |
| Query API | Planned | Using PostgreSQL directly - versioned by upstream |
Every node in the graph has a compound ID: a 32-bit subgraph_id and a 32-bit object_id. Subgraph IDs are assigned by a Postgres auto-increment sequence — no hashing, no collision risk. The content hash of the TU is stored separately for deduplication and catalog lookup but is not used as identity. Subgraph IDs may be recycled after a subgraph is deleted.
```
subgraph
  subgraph_id  : uint32     -- identity, auto-incremented
  content_hash : bytes      -- SHA-256 of TU source; for dedup and catalog
  file_path    : text
  language     : text       -- 'cpp', 'rust', 'python', etc.
  build_id     : uuid
  created_at   : timestamp
  kind         : Enum[TU, Unified]
```
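For in-memory processing on top of these tables, the two 32-bit halves pack naturally into a single 64-bit key. A minimal sketch (the helper names are illustrative, not part of the schema):

```python
def pack_node_id(subgraph_id: int, object_id: int) -> int:
    """Pack a (subgraph_id, object_id) compound ID into one 64-bit key."""
    assert 0 <= subgraph_id < 2**32 and 0 <= object_id < 2**32
    return (subgraph_id << 32) | object_id

def unpack_node_id(key: int) -> tuple[int, int]:
    """Recover the compound ID from a packed 64-bit key."""
    return key >> 32, key & 0xFFFFFFFF
```

Packing with subgraph_id in the high bits keeps all nodes of a subgraph contiguous in key order, which is convenient for range scans.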
The core graph schema is language-agnostic. All language extractors write to these tables, enabling cross-language analysis and a uniform query interface for consumers. The normalized tables are the substrate for language-agnostic analyses.
```
cpg_node
  node_id       : uint32
  subgraph_id   : uint32
  node_kind     : text    -- 'function', 'variable', 'call', etc.
  language      : text    -- denormalized for query convenience
  visit_order   : uint32  -- canonical child ordering within subgraph
  source_file   : text
  source_line   : int
  source_col    : int
  name          : text    -- nullable; common attribute
  type_str      : text    -- nullable; surface type as text
  visibility    : text    -- nullable; public/private/etc.
  is_definition : bool
  attrs         : jsonb   -- language-specific overflow attributes

cpg_edge
  subgraph_id : uint32
  parent_id   : uint32
  child_id    : uint32
  edge_kind   : Enum[AST, CFG, DefUse, Call, ControlDep]
  visit_order : uint32

-- Note: Edges may be stored as JSON arrays in node attrs for performance.
-- This is optimized per access pattern rather than as a global decision.
```
Language-specific tables coexist with the normalized schema, linked by node_id. They carry the full richness of the language's AST without forcing it through the normalized representation. Language-specific analyses consume these tables directly.
```
-- C++ specific (prefixed cpp_)
cpp_function_decl (subgraph_id, node_id, mangled_name, return_type,
                   is_virtual, is_const, is_template, visit_order, ...)
cpp_var_decl      (subgraph_id, node_id, type, storage_class, ...)
cpp_call_expr     (subgraph_id, node_id, ...)
-- ... one table per Clang node type

-- Rust specific (prefixed rust_)
rust_fn_def       (subgraph_id, node_id, mangled_name, return_type,
                   is_unsafe, is_async, lifetimes, ...)
rust_let_stmt     (subgraph_id, node_id, pattern, type, is_mut, ...)
-- ... one table per MIR/HIR node type

-- Language-specific edge tables
cpp_ast_edge  (parent_subgraph_id, parent_id, child_subgraph_id, child_id, visit_order)
cpp_cfg_edge  (src_subgraph_id, src_id, dst_subgraph_id, dst_id, edge_kind)
rust_mir_edge (src_subgraph_id, src_id, dst_subgraph_id, dst_id, edge_kind)
```
The Clang plugin captures LLVM IR at multiple optimization stages alongside the AST. This provides a post-optimization view of what the compiler actually generates, making optimizer-exploited undefined behavior visible that is invisible at the AST level. Each stage is stored as a separate set of IR nodes linked to the AST via source location.
```
ir_function    (subgraph_id, node_id, llvm_name, opt_stage, ...)
ir_block       (subgraph_id, node_id, function_id, opt_stage, ...)
ir_instruction (subgraph_id, node_id, block_id, opcode, opt_stage, ...)
-- opt_stage: Enum[PreOpt, PostInline, PostOpt]

ir_ast_mapping
  subgraph_id : uint32
  ir_node_id  : uint32
  ast_node_id : uint32  -- nullable; unmapped IR has no source loc
  source_line : int
  source_col  : int
  opt_stage   : text
```
Unmapped IR nodes — those without a source location — are typically compiler-generated: implicit constructors, inlined functions that lost debug info, optimizer-introduced temporaries. At distro scale the pattern of what gets unmapped is informative signal about which code is most compiler-transformed.
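The unmapped-IR signal falls out of a simple aggregate over `ir_ast_mapping`. A sketch against an in-memory SQLite stand-in for the Postgres tables (sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ir_ast_mapping (
    subgraph_id INTEGER, ir_node_id INTEGER, ast_node_id INTEGER,
    source_line INTEGER, source_col INTEGER, opt_stage TEXT)""")
conn.executemany(
    "INSERT INTO ir_ast_mapping VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 10, 5,    12,   3,    "PreOpt"),
     (1, 11, None, None, None, "PostOpt"),   # optimizer-introduced temporary
     (1, 12, None, None, None, "PostOpt")])  # inlined, debug info lost

# Count IR nodes with no AST mapping, per optimization stage.
rows = conn.execute("""
    SELECT opt_stage,
           SUM(ast_node_id IS NULL) AS unmapped,
           COUNT(*)                 AS total
    FROM ir_ast_mapping
    GROUP BY opt_stage
    ORDER BY opt_stage
""").fetchall()
print(rows)  # [('PostOpt', 2, 2), ('PreOpt', 0, 1)]
```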
The build artifact layer captures what the linker produces and how symbols resolve at runtime. This is essential for cross-package analysis and for correctly modeling dynamic linking behavior, where the implementation of a symbol depends on the runtime environment rather than being determinable from source alone.
```
build_artifact
  artifact_id    : uint32
  kind           : Enum[Executable, SharedLib, StaticLib, Object]
  path           : text
  package_id     : uint32
  build_id       : uuid
  linker         : text  -- e.g. 'lld-17.0.6'
  linker_flags   : text
  capture_status : Enum[Complete, LinkFailed(reason),
                        PluginFailed(reason), PostlinkFailed(reason)]

artifact_tu
  artifact_id : uint32
  subgraph_id : uint32

symbol_provider
  artifact_id    : uint32
  mangled_name   : text
  symbol_version : text  -- GNU symbol versioning, e.g. GLIBC_2.17
  subgraph_id    : uint32
  node_id        : uint32
  binding        : Enum[Strong, Weak, Global, Local]

runtime_resolution
  consumer_id     : uint32  -- artifact referencing the symbol
  mangled_name    : text
  symbol_version  : text
  provider_id     : uint32  -- artifact providing it at runtime
  resolution_kind : Enum[Static, Dynamic, Weak, VersionedDynamic]
```
Package and library metadata provide organizational context above the subgraph level, enabling distro-scale queries.
```
package
  package_id   : uint32
  name         : text
  version      : text
  distro       : text
  source_url   : text
  content_hash : bytes

library
  library_id : uint32
  package_id : uint32
  name       : text
  kind       : Enum[Static, Shared, HeaderOnly]

tu_library
  subgraph_id : uint32
  library_id  : uint32

package_dependency
  dependent_id       : uint32
  dependency_id      : uint32
  version_constraint : text
```
Unified entities receive new subgraph_ids from the same sequence, distinguished by kind = Unified. A provenance table records the mapping back to source TU entities.
```
unification_map
  unified_subgraph_id : uint32
  unified_object_id   : uint32
  source_subgraph_id  : uint32
  source_object_id    : uint32
```
Each supported language has a dedicated extractor that writes to both the normalized cpg_node/cpg_edge tables and the language-specific tables. All extractors share the same subgraph ID scheme and Postgres connection model.
The C++ extractor is a Clang plugin using a RecursiveASTVisitor. It runs inside the compiler process for each TU, emitting AST nodes in visit_order alongside their language-specific cpp_ table entries. It also triggers LLVM IR generation at three optimization stages (PreOpt, PostInline, PostOpt) and captures IR nodes and their source location mappings.
For each TU the plugin: assigns a new subgraph_id; walks the AST emitting cpg_node rows and cpp_* rows with a monotonically increasing visit_order; emits AST and CFG edges to both cpg_edge and cpp_ast_edge/cpp_cfg_edge; runs the LLVM pass pipeline at each stage capturing IR; writes all rows in a single transaction; commits.
If the build fails or the plugin crashes, the transaction rolls back and no partial state is written. On the next build the compiler reprocesses the TU and the plugin retries cleanly. Incremental extraction is provided by the build system — only recompiled TUs produce new subgraphs.
The Rust extractor runs as a rustc compiler plugin, operating on the MIR (Mid-level Intermediate Representation), which is Rust's primary analysis-friendly IR. MIR is already lowered from the source HIR and has explicit ownership and borrow information encoded, making it a richer extraction target than the raw source AST for ownership-related analyses.
The Rust extractor writes to cpg_node with node_kind values drawn from a Rust-specific vocabulary and to rust_* language-specific tables. Because Rust's ownership and lifetime information is explicit in MIR, the analysis engine can extract it directly rather than inferring it, which produces higher-confidence results than the equivalent inference from C++.
Extractors are injected via compiler/linker wrapper scripts placed in the global compiler installation. No per-project or per-build-system changes are required:
- CMake — `CMAKE_C_COMPILER`/`CMAKE_CXX_COMPILER` pointing to wrapper
- Bazel — custom toolchain with wrapped compiler
- Make / Ninja — `CC`/`CXX` environment variables
- Meson — native file compiler override
- RPM (RHEL/Fedora) — `%{__cc}`/`%{__cxx}` macros in `redhat-rpm-config`; single injection point covers all packages
- DEB (Debian/Ubuntu) — `CC`/`CXX` via `dpkg-buildpackage`; standard packages covered
Each extractor invocation is a separate compiler process with its own Postgres connection and transaction. There is no row-level contention between TUs since each writes to its own subgraph_id partition. Parallel builds up to -j 100 are handled by Postgres natively. PgBouncer can be added transparently if larger parallelism is needed.
Source-level analysis alone cannot determine how symbols resolve at runtime. The h/cc split means declarations and definitions have different AST shapes and must be linked by symbol identity rather than structural equality. Dynamic linking means the correct implementation of a symbol depends on the runtime environment — LD_PRELOAD, RPATH/RUNPATH, symbol versioning — not just the source graph. The linker wrapper captures this information.
Three Rust binaries are injected alongside the compiler extractor:
```
our-cc-wrapper (Rust binary)
  -- inject extractor plugin flags
  -- record build_id
  -- exec real compiler

our-cxx-wrapper (Rust binary)
  -- same as cc-wrapper for C++

our-ld-wrapper (Rust binary)
  -- force LLD via -fuse-ld=lld
  -- inject --Map=<tmpfile> to capture linker map
  -- exec real linker
  -- on success: exec cpp-analysis-postlink binary
  -- on failure: write LinkFailed(reason) to build_artifact
```
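The flag-injection step the wrappers perform can be sketched as follows. The real wrappers are Rust binaries; the plugin path and the environment variable name here are assumptions, though `-fplugin=` is Clang's actual plugin-loading flag:

```python
import os
import uuid

def build_compiler_argv(real_compiler, user_args,
                        plugin_path="/usr/lib/arboretum/extractor.so"):
    """Construct the argv the wrapper would exec in place of itself.

    plugin_path and ARBORETUM_BUILD_ID are illustrative assumptions;
    -fplugin= is Clang's real plugin-loading flag.
    """
    # Record a build-scoped ID once per build; child compiler processes
    # inherit it, so the extractor tags all rows with the same build_id.
    os.environ.setdefault("ARBORETUM_BUILD_ID", str(uuid.uuid4()))
    return [real_compiler, f"-fplugin={plugin_path}", *user_args]

argv = build_compiler_argv("/usr/bin/clang++", ["-O2", "-c", "foo.cc", "-o", "foo.o"])
# The wrapper would now os.execv(argv[0], argv); shown here without exec'ing.
```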
Build ID Linking: Each wrapper binary records a shared build_id that ties together all TU extraction runs from the same build, so the build can be queried and garbage-collected as a unit. Atomicity is per TU: each TU extraction runs in its own transaction, so a failure rolls back that TU cleanly without invalidating the rest of the build.
LLD is the standard linker for this toolchain. It is the default for LLVM/Clang, significantly faster than BFD at distro scale, and has a stable, well-documented map file format. The linker wrapper forces LLD via -fuse-ld=lld for all link steps.
Packages that fail due to LLD incompatibility have their build_artifact record written with capture_status = LinkFailed and a descriptive reason. They are not silently skipped — the failure is visible and queryable. At distro scale the set of failing packages is itself informative about the scope of LLD incompatibility.
The linker column on build_artifact is the extensibility point for future linker support. When BFD or gold support is added, the schema already has a place for it and artifacts can be queried by which linker produced them.
The postlink analysis is implemented as a standalone Rust binary (cpp-analysis-postlink) invoked by the linker wrapper after a successful link. Shell is insufficient for this step due to the complexity of ELF parsing and GNU symbol versioning. The binary uses the goblin library for ELF analysis and writes directly to Postgres under the same build_id as the compilation steps.
The postlink binary performs:
- Parse LLD map file → `symbol_provider` entries (which TU provides each symbol)
- Parse ELF via goblin → RPATH, RUNPATH, symbol versioning, binding (strong/weak/global/local)
- Run `ldd` against the artifact in the build environment → `runtime_resolution` entries
- Write `build_artifact` record with `capture_status = Complete`
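The map-parsing step boils down to attributing each defined symbol to its provider and recording its binding. Real LLD map parsing is more involved; as an illustration, the same attribution done over `nm -A` (BSD format) style output:

```python
def parse_nm_lines(lines):
    """Attribute defined symbols to their providing object file.

    Parses `nm -A` (BSD format) style lines: "file.o: <addr> <type> <name>".
    'U' entries are references, not providers; weak symbols carry the
    W/w/V/v type letters, everything else defined counts as Strong here
    (a simplification of the full binding model).
    """
    providers = []
    for line in lines:
        file_part, rest = line.split(":", 1)
        fields = rest.split()
        if len(fields) == 3:
            _addr, sym_type, name = fields
        elif len(fields) == 2:          # undefined symbols have no address
            sym_type, name = fields
        else:
            continue
        if sym_type == "U":
            continue                    # reference, not a provider
        binding = "Weak" if sym_type in "WwVv" else "Strong"
        providers.append({"object": file_part, "mangled_name": name,
                          "binding": binding})
    return providers

rows = parse_nm_lines([
    "foo.o: 0000000000000000 T _Z3addii",
    "foo.o:                  U malloc",
    "bar.o: 0000000000000010 W _ZN3Foo3barEv",
])
```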
The h/cc split produces declarations (FunctionDecl with no body) and definitions (FunctionDecl with CompoundStmt) as structurally different AST nodes. They cannot be unified by structural hash. Instead they are linked by mangled name after unification, using the symbol_provider table as the authority on which TU's definition is canonical for each symbol. This linking step runs as part of the analysis enrichment pass after unification.
The graph store is append-only. TU subgraphs are never mutated. When a file is recompiled a new subgraph is written with a new subgraph_id. The old subgraph remains until explicitly garbage collected. Unification identifies shared structure between old and new versions; once a new subgraph is confirmed equivalent the old one can be pruned.
This design provides implicit versioning — the full history of every TU is preserved — and aligns with the content-addressed catalog, where every unique content hash is a permanent entry.
Garbage collection policy is deferred. Options include eager pruning of superseded subgraphs on confirmation of a new version, retention of a configurable number of historical versions, and indefinite retention for catalog entries. The append-only design means GC can be added later without changing the core schema.
Header definitions appear redundantly across TUs. Unification identifies structurally identical definitions and merges them into a single unified entity, enabling interprocedural analysis across the full codebase without N² comparisons.
Unification proceeds in two phases:
- Phase 1 — Structural hashing: compute a bottom-up hash for every AST node across all subgraphs. Build a global table of `hash -> [(subgraph_id, object_id)]`. Hash buckets with a single entry are trivially unique. Only buckets with multiple entries proceed to phase 2.
- Phase 2 — Candidate verification: for each multi-entry hash bucket, verify structural equality by direct AST comparison. Confirmed-equal nodes are merged into the unified graph. Hash collisions that differ structurally remain separate.
The hash is computed bottom-up via a Postgres recursive CTE. Leaf nodes are hashed from their node type and scalar fields. Interior nodes are hashed from their node type, fields, and an ordered array of child hashes. Child ordering uses visit_order, which is stable and canonical across TUs because it reflects the RecursiveASTVisitor's source-text traversal order.
The HAVING clause ensures a node is hashed only once all of its children have been hashed, providing Datalog fixpoint semantics:
```sql
WITH RECURSIVE structural_hash AS (
  -- base case: leaf nodes (no children in the same subgraph)
  SELECT n.subgraph_id, n.object_id,
         hash_leaf(n.node_type, n.fields) AS hash
  FROM ast_nodes n
  WHERE NOT EXISTS (
    SELECT 1 FROM ast_edges e
    WHERE e.parent_id = n.object_id
      AND e.parent_subgraph_id = n.subgraph_id)
  UNION ALL
  -- inductive: interior nodes whose children are all hashed
  SELECT n.subgraph_id, n.object_id,
         hash_interior(n.node_type, n.fields,
                       array_agg(h.hash ORDER BY e.visit_order))
  FROM ast_nodes n
  JOIN ast_edges e ON e.parent_id = n.object_id
                  AND e.parent_subgraph_id = n.subgraph_id
  JOIN structural_hash h ON h.object_id = e.child_id
                        AND h.subgraph_id = e.child_subgraph_id
  GROUP BY n.subgraph_id, n.object_id, n.node_type, n.fields
  HAVING COUNT(*) = (
    SELECT COUNT(*) FROM ast_edges e2
    WHERE e2.parent_id = n.object_id
      AND e2.parent_subgraph_id = n.subgraph_id)
)
SELECT * FROM structural_hash;
```

Note that Postgres disallows aggregates and GROUP BY in the recursive term of a `WITH RECURSIVE`, so in practice the inductive step executes as an iterative level-by-level `INSERT ... SELECT` loop with the same join, grouping, and HAVING condition.

Definitions that differ across TUs despite appearing in the same header are kept separate and flagged as divergent. Divergence indicates a potential ODR violation, platform-specific conditional compilation, or macro-dependent behavior. Divergent definitions are surfaced to consumers as a first-class signal rather than silently picking one version.
The hash computation and unification algorithm traverse AST edges only. CFG edges are excluded because the CFG may contain back edges (loop back-edges) which would cause the bottom-up hash computation to fail to terminate. The CFG is a derived view of the AST; if two AST subtrees are identical their CFGs are implicitly identical.
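The bottom-up scheme is easy to prototype in memory as a sanity check. Hash inputs here are illustrative; the SQL computation is the real one:

```python
import hashlib

def structural_hash(node):
    """Bottom-up structural hash of an AST node.

    A node is (kind, fields, children); child order must be canonical
    (visit_order) for hashes to match across TUs.
    """
    kind, fields, children = node
    child_hashes = [structural_hash(c) for c in children]
    payload = repr((kind, fields, child_hashes)).encode()
    return hashlib.sha256(payload).hexdigest()

# Two structurally identical inline functions from different TUs hash equal...
a = ("function", {"name": "min"}, [("param", {"name": "x"}, []),
                                   ("param", {"name": "y"}, [])])
b = ("function", {"name": "min"}, [("param", {"name": "x"}, []),
                                   ("param", {"name": "y"}, [])])
# ...while a divergent copy (e.g. macro-dependent shape) does not.
c = ("function", {"name": "min"}, [("param", {"name": "x"}, [])])
assert structural_hash(a) == structural_hash(b)
assert structural_hash(a) != structural_hash(c)
```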
The unified graph is enriched with derived analysis results stored as additional annotation and edge tables. Analyses are implemented as plugins with a declared language scope and dependency ordering. Language-agnostic analyses run first; language-specific analyses consume their results.
All analyses are expressed as Postgres recursive CTEs providing Datalog fixpoint semantics. If analyses grow complex enough to warrant it, Souffle can be used as a Datalog-to-SQL compiler, emitting Postgres-compatible SQL while allowing analyses to be authored in a more readable Datalog dialect.
```
AnalysisPlugin
  language : Option[text]  -- None = language agnostic
  produces : List[text]    -- analysis_kinds this plugin produces
  consumes : List[text]    -- analysis_kinds this plugin requires
  run(subgraph_id) -> results
  -- Results land in normalized or language-specific tables

cpg_analysis_result
  node_id       : uint32
  subgraph_id   : uint32
  analysis_kind : text
  result        : jsonb

-- Language-specific result tables, e.g.
cpp_ownership_result (node_id, subgraph_id, ownership, confidence)
rust_lifetime_result (node_id, subgraph_id, lifetime)
```
The engine schedules plugins topologically based on their consumes/produces declarations. Language-agnostic plugins run first and produce results available to all language-specific plugins.
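The scheduler is an ordinary topological sort over the consumes/produces declarations; a sketch using Python's standard `graphlib` (plugin names and analysis kinds are invented examples):

```python
from graphlib import TopologicalSorter

def schedule(plugins):
    """Order plugins so every `consumes` dependency runs first.

    `plugins` maps plugin name -> {"produces": [...], "consumes": [...]}.
    Raises graphlib.CycleError on circular dependencies.
    """
    producer_of = {kind: name
                   for name, p in plugins.items()
                   for kind in p["produces"]}
    # Map each plugin to the set of plugins producing what it consumes.
    deps = {name: {producer_of[kind] for kind in p["consumes"]}
            for name, p in plugins.items()}
    return list(TopologicalSorter(deps).static_order())

order = schedule({
    "def_use":   {"produces": ["def_use"],   "consumes": []},
    "dominance": {"produces": ["dom_tree"],  "consumes": []},
    "lifetime":  {"produces": ["lifetimes"], "consumes": ["def_use", "dom_tree"]},
    "cpp_own":   {"produces": ["ownership"], "consumes": ["def_use", "lifetimes"]},
})
assert order.index("lifetime") > order.index("def_use")
assert order.index("cpp_own") > order.index("lifetime")
```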
- Def-use chains — for each definition, the set of uses reachable along the CFG. Foundation for ownership inference and dead code detection.
- Dominance tree — dominator tree of the CFG, used for lifetime extent computation, loop detection, and ownership heuristics.
- Control Dependence Graph (CDG) — captures which conditions control whether a statement executes. Dual of data dependence. Essential for tracking how attacker-controlled data influences control flow and for understanding conditional ownership patterns.
- Call graph — interprocedural edges built after unification using unified symbol identities. Calls to library functions resolve correctly regardless of which TU's copy of the header was used.
- Lifetime extents — for each pointer or reference, the minimal scope in which it must remain valid, derived from def-use chains and the dominator tree.
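As an illustration of the first of these, def-use chains on a toy CFG reduce to a reachability walk that stops at redefinitions (in the engine this runs as a recursive CTE over `cpg_edge`):

```python
def def_use_chains(cfg_succ, defs, uses):
    """For each definition (node, var), the uses it reaches along the CFG.

    cfg_succ : node -> list of successor nodes
    defs     : node -> variable defined there (or None)
    uses     : node -> variable used there (or None)
    A def reaches a use if some CFG path carries no intervening redefinition.
    """
    chains = {}
    for d, var in defs.items():
        if var is None:
            continue
        reached, frontier = set(), list(cfg_succ.get(d, []))
        while frontier:
            n = frontier.pop()
            if n in reached:
                continue
            reached.add(n)
            if uses.get(n) == var:
                chains.setdefault((d, var), set()).add(n)
            if defs.get(n) != var:  # a redefinition kills the def on this path
                frontier.extend(cfg_succ.get(n, []))
    return chains

# x is defined at n1 and redefined at n3; the use at n4 sees only n3's def.
cfg = {"n1": ["n2"], "n2": ["n3"], "n3": ["n4"], "n4": []}
chains = def_use_chains(
    cfg,
    defs={"n1": "x", "n2": None, "n3": "x", "n4": None},
    uses={"n1": None, "n2": "x", "n3": None, "n4": "x"},
)
assert chains == {("n1", "x"): {"n2"}, ("n3", "x"): {"n4"}}
```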
Alias analysis has two tiers with different precision/cost tradeoffs:
- Flow-sensitive intraprocedural (V1) — computes points-to sets within a function, tracking how pointer values flow through assignments. Pointer escape across function boundaries is flagged for consumer attention. Fast and sufficient for most ownership inference cases.
- Context-sensitive interprocedural (V2) — distinguishes different call sites to the same function, enabling precise alias information across function boundaries. Significantly more expensive but necessary for correct ownership inference in complex aliasing patterns. Context sensitivity is the key precision gap identified relative to state-of-the-art tools such as MATE.
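In its simplest form, the V1 tier tracks points-to sets through assignments with strong updates. A deliberately tiny sketch over two statement forms (the real analysis also handles loads, stores, and control-flow joins):

```python
def points_to(stmts):
    """Flow-sensitive points-to over straight-line pointer assignments.

    Statement forms (a tiny subset of the real analysis):
      ("addr", p, x)   p = &x
      ("copy", p, q)   p = q
    Returns the final environment: pointer -> set of possible pointees.
    """
    env = {}
    for op, lhs, rhs in stmts:
        if op == "addr":
            env[lhs] = {rhs}                     # strong update: p = &x
        elif op == "copy":
            env[lhs] = set(env.get(rhs, set()))  # p inherits q's targets
    return env

env = points_to([
    ("addr", "p", "x"),
    ("copy", "q", "p"),
    ("addr", "p", "y"),   # later strong update does not disturb q
])
assert env == {"p": {"y"}, "q": {"x"}}
```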
- C++ ownership inference — reconstructs ownership structure from def-use, alias, and lifetime results. Produces `cpp_ownership_result` annotations consumed by the translation tool.
- C++ UB detection — identifies patterns of undefined behavior using both AST-level analysis and the LLVM IR layer where optimizer-exploited UB is visible.
- Rust ownership extraction — extracts explicit ownership and lifetime information directly from MIR. Higher confidence than C++ inference since the information is stated rather than inferred.
The engine is designed to analyze entire Linux distributions, treating the package ecosystem as the unit of analysis rather than individual projects. This enables cross-package analysis: tracing a vulnerability through all packages that depend on an affected library, understanding the full semantic impact of an API change, and providing distro-wide memory safety assessments.
The targets are RHEL/Fedora (primary) and Debian/Ubuntu (secondary). RHEL/Fedora was selected first because integration is simpler via the %{__cc} / %{__cxx} macros in redhat-rpm-config; Debian/Ubuntu is supported via the CC / CXX environment variables.
RPM build infrastructure is well standardized. The injection point is redhat-rpm-config, which provides compiler flag macros used by virtually all packages. Setting %{__cc} / %{__cxx} to the wrapper scripts covers the entire distribution build from a single change. The Koji build system provides the orchestration layer; each package build in Koji automatically contributes to the graph.
CC and CXX environment variables propagate through the standard dpkg-buildpackage infrastructure. Well-behaved packages are covered automatically. Packages with non-standard build systems or hardcoded compiler invocations require individual attention. Coverage will be high but not complete from a single injection point.
Symbol references that cross library boundaries are first-class edges in the unified graph. A call from one package into a shared library is an interprocedural edge that crosses a package boundary, and can be analyzed for ownership and aliasing just as intrapackage edges are. This is the core capability that makes distro-scale analysis qualitatively different from per-project analysis.
V1 uses Docker images as the primary distribution mechanism:
- RHEL Dockerfile: With analysis tooling pre-configured
- Debian Dockerfile: With analysis tooling pre-configured
- Ubuntu Dockerfile: With analysis tooling pre-configured
Users invoke the analysis tooling on packages specified within the Docker image.
Status: Deferred to V2+
The global catalog is a shared service that stores subgraphs and analysis results contributed by all users. It enables deduplication across codebases: if two users have the same version of a library, their subgraphs for that library produce the same content hash and share a single catalog entry.
The catalog is keyed by content hash. Before uploading a subgraph, the local service queries the catalog with the content hash. If the subgraph is already present, the local service receives the catalog's canonical subgraph_id and skips the upload. Cached analysis results travel with the subgraph — a user pulling a well-known library gets the analysis for free.
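The check-before-upload protocol is a straightforward content-addressed lookup; a sketch with a dict standing in for the hosted catalog:

```python
import hashlib

def upload_subgraph(catalog, tu_source, subgraph):
    """Upload a subgraph unless the catalog already has its content hash.

    `catalog` stands in for the hosted service, keyed by hex SHA-256.
    Returns (content_hash, uploaded?).
    """
    content_hash = hashlib.sha256(tu_source).hexdigest()
    if content_hash in catalog:
        return content_hash, False      # dedup hit: reuse canonical entry
    catalog[content_hash] = subgraph
    return content_hash, True

catalog = {}
src = b"int add(int a, int b) { return a + b; }"
h1, first = upload_subgraph(catalog, src, {"nodes": 12})
h2, second = upload_subgraph(catalog, src, {"nodes": 12})
assert first and not second and h1 == h2
```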
```
LocalService
  graph_store    : Postgres  -- user's own codebase
  catalog_client             -- push/pull by content hash

GlobalCatalog
  subgraph_store  -- content-addressed by content_hash
  analysis_store  -- cached analysis results per subgraph
```
The catalog has a freemium structure. The free tier covers all open source packages — analyzing the open source ecosystem is a strategic investment that makes the catalog valuable to paying users. The paid tier covers private codebases, higher retention of historical versions, SLAs on analysis freshness, and API access at scale for integration into CI pipelines and security tooling.
Enterprise partnerships with Red Hat and Canonical are a natural commercial extension: licensing the catalog infrastructure for a hosted instance scoped to their package ecosystem, with open source results flowing back to the public catalog.
For V1, there is no publicly available global catalog. Some users may fund setup of a private catalog instance. Long-term, the plan is to use PostgreSQL as the backend, potentially with parquet exports for offline storage.
Sharing policy is file-level, specified by the user as glob patterns in a .gitignore-style configuration. The first matching rule wins; the default is all-private. Content hashes already present in the catalog are implicitly public regardless of local policy — the content is already known, and the user is only revealing that they use it. Full sharing policy design is deferred; the local service operates fully offline until the user opts in.
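First-match-wins evaluation can be sketched with `fnmatch`; note that `fnmatch`'s `*` crosses path separators, unlike real gitignore semantics, and the rule patterns here are invented examples:

```python
from fnmatch import fnmatch

def sharing_decision(path, rules):
    """Evaluate a .gitignore-style sharing policy: first matching rule wins.

    `rules` is an ordered list of (glob_pattern, "public" | "private").
    Default is all-private when nothing matches.
    """
    for pattern, decision in rules:
        if fnmatch(path, pattern):
            return decision
    return "private"

rules = [
    ("vendor/*", "public"),       # third-party code may be shared
    ("src/secret/*", "private"),  # must precede the broader src/* rule
    ("src/*", "public"),
]
assert sharing_decision("vendor/zlib.c", rules) == "public"
assert sharing_decision("src/secret/keys.c", rules) == "private"
assert sharing_decision("docs/readme.md", rules) == "private"   # default
```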
Status: Using PostgreSQL directly for V1
The engine exposes a query API that downstream consumers use to retrieve subgraphs, analysis results, and derived properties. The API is consumer-agnostic — it provides access to the enriched graph without any knowledge of what consumers will do with it.
The exact API design is deferred until the graph store and analysis layers are stable. The interface should support: subgraph retrieval by content hash or subgraph_id; symbol lookup by name and package; reachability queries over the call graph; analysis result retrieval for specific nodes; and streaming updates when new analysis results are available for a subgraph.
For V1, consumers use PostgreSQL directly as the query interface. This provides:
- Direct SQL access to all data
- Full Postgres features (CTEs, indexes, extensions like pgvector)
- No abstraction layer overhead
- Versioning tied to upstream (llvm/clang) schema versions
Later, an abstraction layer may be added for convenience, but V1 uses PostgreSQL directly.
| Topic | Notes |
|---|---|
| Context-sensitive alias analysis | Precision vs. cost tradeoff. V1 is flow-sensitive intraprocedural. V2 escalation criteria TBD. |
| Souffle threshold | At what analysis complexity to switch from hand-written CTEs to Souffle-generated SQL. |
| Garbage collection policy | Eager vs. versioned vs. indefinite retention. Local and catalog may differ. |
| Template instantiation merging | Structural hash handles identical instantiations; parametrically related ones need further design. |
| Catalog sharing policy | Full glob-pattern design deferred. Default is all-private. |
| Query API design | Deferred until graph store and analysis are stable. |
| Debian coverage gaps | Non-standard build systems require individual handling; scope TBD. |
| Rust extractor IR layer | Rustc MIR is already post-lowering; whether to also capture LLVM IR for Rust TBD. |
| Additional language priority | After C++ and Rust, which language is highest value? Python, Java, Go are candidates. |
| LLD incompatibility scope | What fraction of distro packages fail with LLD? Measure before deciding whether to add BFD support. |
V1 targets a working extraction and unification pipeline for a single package build, with full program analysis capabilities. The goal is a system that produces a correct unified graph with ownership and alias annotations, queryable by downstream consumers.
- C++ Clang plugin extraction to Postgres with LLVM IR at all three stages (PreOpt, PostInline, PostOpt)
- Normalized `cpg_node`/`cpg_edge` tables populated alongside `cpp_*` tables
- AST, CFG, and IR edge extraction with `visit_order`
- Linker wrapper with LLD (Rust binaries) and postlink binary for build artifact capture
- Cross-TU unification via structural hash and candidate verification
- Declaration/definition linking via mangled name and `symbol_provider`
- Forward declarations unified with definitions for clients
- Def-use chain analysis
- Flow-sensitive intraprocedural alias analysis
- Control Dependence Graph (CDG)
- Lifetime extent computation
- Dominator tree
- Call graph construction
- C++ ownership inference (language-specific plugin)
- Package and symbol version tracking (separate tracking for package versions and GNU symbol versions)
- All analyses expressed as SQL CTEs (no Souffle for V1)
The engine tracks multiple version dimensions:
- Package versions: Standard package versioning (e.g., libc 2.31-0ubuntu9.2)
- Symbol versions: GNU symbol versioning (e.g., GLIBC_2.17)
- Schema versions: Versioned by upstream (llvm/clang); new version = new schema in separate tables
These are tracked separately because they are semantically different dimensions. Different versions never unify; they are genuinely distinct entities.
- Context-sensitive interprocedural alias analysis
- Template instantiation merging beyond identical copies
- Additional language extractors (Python, JS, Go, Java)
- Rust extractor (requires rustc plugin development)
- Package registry and distro metadata
- Distro build system integration (RPM, DEB)
- Global catalog infrastructure (PostgreSQL + parquet for offline)
- Souffle-based analysis authoring (if CTEs prove insufficient)
- Query API abstraction layer (using PostgreSQL directly for now)
V1 is considered successful when the engine can:
- Extract: Process a single package build at scale (multiple TUs)
- Unify: Create a unified graph across translation units
- Analyze: Run all V1 analyses (def-use, alias, CDG, dominator tree, lifetime, call graph, ownership)
- Output: Produce queryable results via PostgreSQL
- Enable: First consumers (AI/ML data pipelines) can sample from the dataset
Target applications, tools, and libraries for V1 validation:
| Priority | Package | Reason |
|---|---|---|
| 1 | OpenSSL | Widely used, complex, good test for ownership/alias analysis |
| 2 | SQLite | Simple but complete, good for validation |
| 3 | BusyBox | Embedded systems, good for distro-scale validation |
| 4 | GCC | Self-hosting, tests the extraction toolchain |
| 5 | Node.js | Complex, multi-TU, good for unification testing |
- Build time: < X minutes for a typical package (TBD)
- Storage: < X GB per package build (TBD)
- Query latency: < X ms for common queries (TBD)
- Unification coverage: > X% of definitions unified across TUs (TBD)
The first consumer will be an AI/ML data pipeline that:
- Pulls data from the database
- Samples features for model training
- Iterates rapidly on feature engineering
Success is measured by the pipeline's ability to:
- Extract features in minutes rather than days/weeks
- Iterate on new feature ideas quickly
- Access comprehensive semantic data
| Month | Milestone |
|---|---|
| April | Complete C++ extraction with LLVM IR |
| May | Basic unification working |
| June | All analyses working (def-use, alias, CDG, etc.) |
| July | Docker distribution ready (RHEL, Debian, Ubuntu) |
| August | Hero use case validation (OpenSSL, SQLite, etc.) |
| September | Documentation and first external trials |
| Month | Milestone |
|---|---|
| October | Rust extractor (HIR + MIR + LLVM IR) |
| November | Package registry and distro metadata |
| December | Build system integration (RHEL/Fedora, Debian/Ubuntu) |
| January | Context-sensitive alias analysis (V2) |
| February | Template instantiation merging |
| March | Additional language extractors (Python, JS, Go, Java) |
| Release | Timeline | Features |
|---|---|---|
| V3 | Q2-Q3 2027 | Global catalog infrastructure, query API abstraction |
| V4 | Q4 2027+ | Advanced analyses, AI/ML integrations, enterprise features |
The roadmap is designed to:
- V1: Prove the core value (data availability for downstream consumption)
- V2: Expand language coverage (Rust + additional languages)
- V3: Build infrastructure (global catalog, enterprise features)
- V4+: Enable advanced use cases (AI/ML, enterprise tooling)
Each release builds on the previous, with V1 being the critical proving ground.