diff --git a/.claude/CC_SESSION_BOOTSTRAP.md b/.claude/CC_SESSION_BOOTSTRAP.md
new file mode 100644
index 00000000..5c48531c
--- /dev/null
+++ b/.claude/CC_SESSION_BOOTSTRAP.md
@@ -0,0 +1,52 @@
+# CC_SESSION_BOOTSTRAP.md (lance-graph upstream)
+
+## Clone. Sync. Split. Push.
+
+```bash
+mkdir adaworld && cd adaworld
+
+# REQUIRED:
+git clone https://github.com/AdaWorldAPI/lance-graph
+git clone https://github.com/AdaWorldAPI/holograph    # read only, BlasGraph source
+
+cd lance-graph
+git remote add upstream https://github.com/lance-format/lance-graph.git
+git fetch upstream
+cargo test --workspace
+```
+
+## First command in every session:
+
+```bash
+cat CLAUDE.md
+cat .claude/UPSTREAM_PR_SESSIONS.md
+```
+
+## What you write to:
+
+```
+lance-graph    Push to AdaWorldAPI fork, PR to lance-format upstream
+```
+
+## What you read from:
+
+```
+holograph      BlasGraph source for PR C. DO NOT MODIFY.
+```
+
+## NOT in scope:
+
+```
+ladybug-rs     Not touched. Not cloned. Not referenced in upstream PRs.
+rustynum       Not touched. No upstream PR may depend on it.
+n8n-rs         Not touched.
+crewai-rust    Not touched.
+```
+
+## Upstream PR target:
+
+```
+lance-format/lance-graph main branch
+Maintainer: beinan
+PR #146 (ours): CLOSE FIRST, then split into A/B/C
+```
diff --git a/.claude/HANDOVER.md b/.claude/HANDOVER.md
new file mode 100644
index 00000000..6d861d61
--- /dev/null
+++ b/.claude/HANDOVER.md
@@ -0,0 +1,40 @@
+# lance-graph Session Handover — 2026-03-12
+
+## The Boring Version
+
+This repo is being restructured from a monolith (19K lines in one crate)
+into 8 focused crates with clean separation of concerns.
+
+**Plan:** `CRATE_SEPARATION_PLAN.md` in repo root
+
+**Canonical architecture docs (in ladybug-rs):**
+- Prompt 15: RISC Brain Vision
+- Prompt 19: Hot/Cold One-Way Mirror
+- Prompt 20: Four Invariants (this repo = "The Face")
+- Prompt 21: This plan
+
+## Role in the Four-Repo Architecture
+
+```
+rustynum    → The Muscle (SIMD substrate)
+ladybug-rs  → The Brain  (BindSpace, SPO Crystal, server)
+lance-graph → The Face   ← THIS REPO (Cypher/SQL query surface)
+staunen     → The Bet    (6 instructions, no GPU)
+```
+
+lance-graph owns: parser, planner, execution engine, BlasGraph algebra, server binary.
+lance-graph DOES NOT own: BindSpace, SpineCache, qualia, awareness loop (those are ladybug-rs).
+
+## Key Imports
+
+- holograph/src/graphblas/ → lance-graph-blasgraph (7 semiring algebras)
+- holograph/src/{bitpack,hdr_cascade,epiphany,resonance}.rs → lance-graph-spo
+- ladybug-rs/src/graph/spo/{sparse,scent}.rs → lance-graph-spo
+
+## Current State
+
+- Monolith crate compiles and passes tests
+- Tests: 12 test files, ~9300 lines
+- graph/spo/ is a stale copy from ladybug-rs (WILL BE REPLACED during separation)
+- DataFusion planner is the most valuable code (5633 lines)
+- Parser is nearly identical to ladybug-rs lance_parser (which is being deleted)
diff --git a/.claude/UPSTREAM_PR_SESSIONS.md b/.claude/UPSTREAM_PR_SESSIONS.md
new file mode 100644
index 00000000..ac2bfb5b
--- /dev/null
+++ b/.claude/UPSTREAM_PR_SESSIONS.md
@@ -0,0 +1,279 @@
+# LANCE_GRAPH_UPSTREAM_SESSIONS.md
+
+## Fix PR #146. Split. Rebase. Contribute BlasGraph.
+
+**Repo:** AdaWorldAPI/lance-graph (fork of lance-format/lance-graph)
+**Upstream:** lance-format/lance-graph
+**PR #146:** open, CI failing, merge conflicts, maintainer asked us to fix
+
+---
+
+## SITUATION
+
+```
+PR #146 is a kitchen sink: dep bumps + error macros + graph/spo/ module + docs
+11 commits, 2577 additions, mergeable: false, state: dirty
+
+Upstream merged 3 PRs since we opened:
+  #145 Unity catalog/delta lake (Mar 3)
+  #147 Remove Simple executor (Mar 4)              ← CONFLICT
+  #148 Move benchmarks to separate crate (Mar 5)   ← CONFLICT
+
+Our fork: 10 ahead, 3 behind upstream main.
+```
+
+---
+
+## SESSION 1: Sync Fork + Close #146
+
+```bash
+cd adaworld/lance-graph
+
+# Sync fork to upstream (skip `git remote add` if the bootstrap step already added it)
+git remote add upstream https://github.com/lance-format/lance-graph.git
+git fetch upstream
+git checkout main
+git merge upstream/main
+# Resolve conflicts (Cargo.lock, workspace layout changes)
+git push origin main
+
+# Close #146 with comment
+```
+
+Comment on PR #146:
+```
+Closing this PR to split into focused contributions:
+- PR A: dep bumps (arrow 57, datafusion 51, lance 2)
+- PR B: graph/spo/ module (SPO triple store, TruthGate, semiring)
+- PR C: BlasGraph algebra (from holograph — 7 semirings, matrix ops)
+
+Sorry for the kitchen sink. Splitting for clean review.
+```
+
+**Exit gate:** Fork synced to upstream main. #146 closed.
+
+---
+
+## SESSION 2: PR A — Dep Bumps Only
+
+```bash
+git checkout -b feat/bump-arrow-57-datafusion-51-lance-2
+```
+
+**What to do:**
+```
+1. ONLY change Cargo.toml dep versions:
+   crates/lance-graph/Cargo.toml
+   crates/lance-graph-catalog/Cargo.toml
+   crates/lance-graph-python/Cargo.toml
+
+   arrow:      current → 57
+   datafusion: current → 51
+   lance:      current → 2.0
+
+2. cargo update (regenerate Cargo.lock)
+
+3. Fix any API breakages from dep bumps:
+   - arrow 57: check RecordBatch API changes
+   - datafusion 51: check SessionContext, LogicalPlan changes
+   - lance 2: check table API, write params
+
+4. cargo test --workspace (exclude python crate)
+   Fix ALL test failures.
+
+5. Check upstream CI requirements:
+   - cargo fmt --check
+   - cargo clippy -- -D warnings
+   - cargo test
+
+6. Push, open PR to lance-format/lance-graph
+```
+
+PR title: `feat: bump arrow 57, datafusion 51, lance 2`
+PR body:
+```
+Align dependency matrix:
+  arrow → 57
+  datafusion → 51
+  lance → 2.0
+
+All tests pass. No API breakages.
+Follows up on closed #146 (split into focused PRs).
+```
+
+**Exit gate:** Clean PR with ONLY dep bumps. CI green. No extra files.
+
+---
+
+## SESSION 3: PR B — graph/spo/ Module
+
+```bash
+git checkout main
+git pull upstream main  # get PR A merged first, or branch from main
+git checkout -b feat/spo-triple-store
+```
+
+**What to do:**
+```
+1. Add ONLY the graph/spo/ module:
+   crates/lance-graph/src/graph/mod.rs
+   crates/lance-graph/src/graph/fingerprint.rs
+   crates/lance-graph/src/graph/sparse.rs
+   crates/lance-graph/src/graph/spo/mod.rs
+   crates/lance-graph/src/graph/spo/builder.rs
+   crates/lance-graph/src/graph/spo/merkle.rs
+   crates/lance-graph/src/graph/spo/semiring.rs
+   crates/lance-graph/src/graph/spo/store.rs
+   crates/lance-graph/src/graph/spo/truth.rs
+
+2. Add test:
+   crates/lance-graph/tests/spo_ground_truth.rs
+
+3. Add `pub mod graph;` to lib.rs
+
+4. REMOVE anything not relevant to upstream:
+   - No SPARE_PARTS_SUMMARY.md
+   - No #[track_caller] error macros (separate PR if wanted)
+   - No ladybug-rs specific imports
+   - No references to BindSpace, CogRedis, etc.
+
+5. Make sure graph/spo/ is SELF-CONTAINED:
+   - Uses lance-graph's own error types (not ladybug's QueryError)
+   - Uses standard blake3 crate (add to Cargo.toml)
+   - No dependency on rustynum or ladybug-rs
+
+6. Clean the code for upstream standards:
+   - cargo fmt
+   - cargo clippy -- -D warnings
+   - Doc comments on all pub types and methods
+   - Examples in doc comments where useful
+
+7. cargo test (including spo_ground_truth.rs)
+
+8. Push, open PR
+```
+
+PR title: `feat(graph): add SPO triple store with Merkle integrity, TruthGate, and semiring traversal`
+PR body:
+```
+Add a content-addressable SPO (Subject-Predicate-Object) triple store:
+
+- **SpoStore**: insert, query_forward, query_reverse, query_relation
+- **SpoMerkle**: Blake3-based integrity with MerkleEpoch and inclusion proofs
+- **TruthGate**: NARS-inspired confidence gating (MinFreq/MinConf/MinBoth)
+- **SpoSemiring**: Algebraic traversal operations for graph algorithms
+- **SpoBuilder**: Builder pattern for constructing stores
+- **Fingerprint**: 512-bit binary fingerprint (8 × u64) with Hamming operations
+- **SparseContainer**: Memory-efficient sparse vector storage
+
+Ground truth test included (357 lines).
+
+This enables knowledge-graph style operations on LanceDB with
+content-addressed nodes and confidence-weighted edges.
+```
+
+**Exit gate:** Clean PR, no ladybug-rs deps, all tests pass, upstream CI green.
+
+---
+
+## SESSION 4: PR C — BlasGraph Semiring Algebra (FROM holograph)
+
+```bash
+git checkout main
+git pull upstream main
+git checkout -b feat/blasgraph-semiring-algebra
+```
+
+**What to do:**
+```
+1. Port from holograph/src/graphblas/ to lance-graph:
+
+   Create: crates/lance-graph/src/graph/blasgraph/
+     mod.rs         ← from holograph graphblas/mod.rs (94 lines)
+     semiring.rs    ← from holograph graphblas/semiring.rs (535 lines)
+     matrix.rs      ← from holograph graphblas/matrix.rs (596 lines)
+     vector.rs      ← from holograph graphblas/vector.rs (506 lines)
+     ops.rs         ← from holograph graphblas/ops.rs (717 lines)
+     sparse.rs      ← from holograph graphblas/sparse.rs (546 lines)
+     types.rs       ← from holograph graphblas/types.rs (330 lines)
+     descriptor.rs  ← from holograph graphblas/descriptor.rs (186 lines)
+
+2. CLEAN for upstream:
+   - Remove holograph-specific imports
+   - Remove any reference to ladybug-rs types
+   - Use lance-graph error types
+   - All pub types and methods get doc comments
+   - cargo fmt + clippy clean
+
+3. The 7 semirings to include:
+   - XOR Bundle (bind/superpose)
+   - Bind First (key-value association)
+   - Hamming Min (nearest neighbor)
+   - Similarity Max (most similar)
+   - Resonance with threshold (sigma-gated)
+   - Boolean (standard graph traversal)
+   - XOR Field (algebraic field operations)
+
+4. Matrix operations:
+   - mxm (matrix × matrix — graph composition)
+   - mxv (matrix × vector — graph query)
+   - vxm (vector × matrix — reverse query)
+   - element-wise add/mult
+
+5. Write tests:
+   - One test per semiring showing expected behavior
+   - Matrix multiplication with at least 2 semirings
+   - Sparse matrix efficiency test
+
+6. Update graph/mod.rs: pub mod blasgraph;
+
+7. Push, open PR
+```
+
+PR title: `feat(graph): add BlasGraph semiring algebra — 7 semirings, sparse matrix ops`
+PR body:
+```
+Port GraphBLAS-inspired sparse matrix algebra to lance-graph.
+
+7 semiring algebras for different graph computation modes:
+- XOR Bundle, Bind First, Hamming Min, Similarity Max
+- Resonance (threshold-gated), Boolean, XOR Field
+
+Matrix operations: mxm, mxv, vxm, element-wise.
+CSR sparse format for memory-efficient large graphs.
+
+This enables algebraic graph algorithms (PageRank, community detection,
+shortest path) as matrix operations on LanceDB-backed graphs,
+replacing Pregel-style message passing with linear algebra.
+
+Based on the RedisGraph BlasGraph approach, adapted for LanceDB
+and binary Hamming distance vectors.
+```
+
+**Exit gate:** Clean PR, holograph code adapted, all tests pass, no ladybug-rs deps.
+
+---
+
+## SUMMARY
+
+```
+SESSION  PR   WHAT                         LINES    DEPENDS ON
+1        -    Sync fork, close #146        0        nothing
+2        A    Dep bumps only               ~50      session 1
+3        B    graph/spo/ module            ~1600    session 1 (or session 2 merged)
+4        C    BlasGraph semiring algebra   ~3500    session 1 (or session 2 merged)
+```
+
+Sessions 3 and 4 can target main even if PR A isn't merged yet.
+PR B and C are independent of each other.
+
+**What beinan and the lance-format team get:**
+- Arrow 57 / DataFusion 51 / Lance 2 alignment (PR A)
+- SPO triple store with Merkle integrity and NARS confidence (PR B)
+- BlasGraph algebra that no other graph database has in Rust (PR C)
+
+**What we get:**
+- Clean upstream relationship (not a kitchen sink PR)
+- Our SPO and BlasGraph contributions in the official repo
+- Upstream CI validates our code
+- Community review catches bugs we missed
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..cdf8657f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,95 @@
+# CLAUDE.md — Lance-Graph
+
+> **Updated**: 2026-03-12
+> **Status**: Monolith working, crate separation planned
+
+---
+
+## What This Is
+
+"The Face." A graph database engine on LanceDB + DataFusion. Cypher + SQL in one engine.
+From outside: boring fast graph database. Inside: BlasGraph semiring on bitpacked Hamming SPO.
+
+## ⚠ READ BEFORE WRITING CODE
+
+### 1. MONOLITH STATE
+
+Everything is in one crate: `crates/lance-graph/` (19,262 lines).
+Plan for 8-crate separation: `CRATE_SEPARATION_PLAN.md` in repo root.
+DO NOT start separation without reading that plan.
+
+### 2. graph/spo/ IS STALE
+
+`crates/lance-graph/src/graph/spo/` is a DIVERGED COPY of ladybug-rs `src/graph/spo/`.
+ladybug-rs version is MORE COMPLETE (TruthGate from PR 170, MerkleEpoch, SparseContainer).
+During crate separation, this gets REPLACED with extended versions from ladybug-rs + holograph.
+
+**DO NOT** extend lance-graph's graph/spo/. Extend ladybug-rs's, then import.
+
+### 3. parser.rs = ladybug-rs lance_parser
+
+`src/parser.rs` is nearly identical (12 diff lines) to `ladybug-rs/src/query/lance_parser/parser.rs`.
+The ladybug-rs copy is being DELETED (it's an orphaned duplicate). This repo keeps the original.
+
+### 4. DataFusion Planner Is The Most Valuable Code
+
+`src/datafusion_planner/` (5,633 lines) is the execution engine. Treat it carefully.
+The join_builder.rs, expression.rs, scan_ops.rs are production-grade DataFusion integration.
+DO NOT rewrite them. They're correct.
+
+### 5. Tests Are Comprehensive
+
+12 test files, ~9,300 lines. Especially `test_datafusion_pipeline.rs` (5,152 lines).
+Run them. Don't break them.
+
+## Build
+
+```bash
+cargo test                    # runs all tests in workspace
+cargo test -p lance-graph     # just the main crate
+```
+
+## Role in Four-Repo Architecture
+
+```
+rustynum    = The Muscle (SIMD substrate)
+ladybug-rs  = The Brain  (BindSpace, server)
+staunen     = The Bet    (6 instructions, no GPU)
+lance-graph = The Face   ← THIS REPO (query surface)
+```
+
+## Key Files (by importance)
+
+```
+CRITICAL (don't break):
+  src/datafusion_planner/    5633 lines   The execution engine
+  src/parser.rs              1931 lines   The Cypher parser (nom)
+  src/logical_plan.rs        1417 lines   Logical plan algebra
+  src/query.rs               2375 lines   Query builder + executor
+
+IMPORT TARGETS (will be enriched during crate separation):
+  src/graph/spo/             ~1000 lines  STALE — will be replaced
+  src/graph/fingerprint.rs   144 lines    Basic fingerprint ops
+
+CLEAN UTILITIES:
+  src/error.rs               233 lines    snafu errors with location
+  src/config.rs              465 lines    GraphConfig builder
+  src/semantic.rs            1719 lines   Semantic validation
+```
+
+## What NOT To Do
+
+```
+× Don't extend graph/spo/ here (extend in ladybug-rs, import)
+× Don't add BindSpace/SpineCache code (that's ladybug-rs)
+× Don't add qualia/awareness code (that's ladybug-rs)
+× Don't create a Redis protocol handler yet (wait for crate separation)
+× Don't use the parser.rs copy in ladybug-rs (it's being deleted there)
+```
+
+## Session Context
+
+```
+.claude/HANDOVER.md          Session handover for this repo
+CRATE_SEPARATION_PLAN.md     Full 8-crate plan (prompt 21)
+```
diff --git a/CRATE_SEPARATION_PLAN.md b/CRATE_SEPARATION_PLAN.md
new file mode 100644
index 00000000..941e8792
--- /dev/null
+++ b/CRATE_SEPARATION_PLAN.md
@@ -0,0 +1,459 @@
+# The Boring Version — lance-graph Clean Crate Separation
+
+## A Graph Database That Happens To Think
+
+**What it looks like from outside:** A fast graph database. Cypher in, results out.
+SQL in, results out. Zero-copy LanceDB storage. DataFusion query engine.
+Clean Rust crates with proper error handling and builder patterns.
+
+**What it actually is:** BlasGraph semiring algebra on 3D bitpacked Hamming SPO vectors
+with NARS truth gating, Merkle-verified integrity, Hebbian plasticity, and an epiphany
+detection engine. Running on 6 SIMD instructions in L1 cache.
+
+**But nobody needs to know that to use it.**
+
+---
+
+## 1. THE CRATE SEPARATION
+
+### Current State: 1 monolith crate (19,262 lines)
+
+```
+crates/lance-graph/src/      ← Everything in one place
+  parser.rs                  1931  Parser
+  ast.rs                     542   AST
+  semantic.rs                1719  Validation
+  config.rs                  465   Config
+  logical_plan.rs            1417  Planner
+  query.rs                   2375  Query builder/executor
+  datafusion_planner/        5633  Execution engine
+  simple_executor/           724   Lightweight executor
+  graph/                     1113  SPO + fingerprint + sparse
+  lance_vector_search.rs     554   Vector search
+  error.rs                   233   Errors
+  ...misc                    1556
+```
+
+### Target State: 8 focused crates
+
+```
+lance-graph/
+├── Cargo.toml                          Workspace root
+│
+├── crates/
+│   ├── lance-graph-ast/                CRATE 1: Parse surface (0 deps on engine)
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── ast.rs                  ← from src/ast.rs (542 lines)
+│   │   │   ├── parser.rs               ← from src/parser.rs (1931 lines)
+│   │   │   ├── semantic.rs             ← from src/semantic.rs (1719 lines)
+│   │   │   ├── config.rs               ← from src/config.rs (465 lines)
+│   │   │   ├── parameter.rs            ← from src/parameter_substitution.rs (280 lines)
+│   │   │   ├── case_insensitive.rs     ← from src/case_insensitive.rs (377 lines)
+│   │   │   └── error.rs                ← from src/error.rs (233 lines)
+│   │   ├── tests/                      Parser tests (move from monolith)
+│   │   └── Cargo.toml                  deps: nom, snafu, serde
+│   │                                   5,547 lines. Zero deps on DataFusion, LanceDB, or SPO.
+│   │                                   Anyone can parse Cypher without buying into the engine.
+│   │
+│   ├── lance-graph-plan/               CRATE 2: Logical planning (depends on AST only)
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── logical_plan.rs         ← from src/logical_plan.rs (1417 lines)
+│   │   │   ├── analysis.rs             ← from datafusion_planner/analysis.rs (399 lines)
+│   │   │   └── optimizer.rs            NEW: plan optimization rules (~200 lines)
+│   │   └── Cargo.toml                  deps: lance-graph-ast
+│   │                                   ~2,000 lines. Pure algebra. No execution.
+│   │                                   LogicalOperator → LogicalOperator transforms.
+│   │
+│   ├── lance-graph-blasgraph/          CRATE 3: BlasGraph algebra (FROM HOLOGRAPH)
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── semiring.rs             ← holograph/src/graphblas/semiring.rs (535 lines)
+│   │   │   ├── matrix.rs               ← holograph/src/graphblas/matrix.rs (596 lines)
+│   │   │   ├── vector.rs               ← holograph/src/graphblas/vector.rs (506 lines)
+│   │   │   ├── ops.rs                  ← holograph/src/graphblas/ops.rs (717 lines)
+│   │   │   ├── sparse.rs               ← holograph/src/graphblas/sparse.rs (546 lines)
+│   │   │   ├── types.rs                ← holograph/src/graphblas/types.rs (330 lines)
+│   │   │   └── descriptor.rs           ← holograph/src/graphblas/descriptor.rs (186 lines)
+│   │   └── Cargo.toml                  deps: none (pure algebra)
+│   │                                   3,416 lines. The RedisGraph BlasGraph transcode.
+│   │                                   7 semirings: xor_bundle, bind_first, hamming_min,
+│   │                                   similarity_max, resonance(threshold), boolean, xor_field.
+│   │                                   mxm, mxv, vxm — matrix-level graph operations.
+│   │                                   THIS IS WHAT MAKES IT A YEAR AHEAD.
+│   │
+│   ├── lance-graph-spo/                CRATE 4: SPO cognitive substrate
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── store.rs                ← from graph/spo/store.rs (313 → ~800 lines)
+│   │   │   │                             EXTENDED: add ladybug-rs TruthGate, SpoHit, QueryAxis
+│   │   │   ├── merkle.rs               ← from graph/spo/merkle.rs (248 → ~500 lines)
+│   │   │   │                             EXTENDED: add Epoch, ProofStep, InclusionProof
+│   │   │   ├── truth.rs                ← from graph/spo/truth.rs (175 lines, KEEP)
+│   │   │   │                             The clean NARS truth for the graph layer
+│   │   │   ├── semiring.rs             ← from graph/spo/semiring.rs (99 → ~260 lines)
+│   │   │   │                             EXTENDED: import more from holograph
+│   │   │   ├── builder.rs              ← from graph/spo/builder.rs (119 → ~340 lines)
+│   │   │   │                             EXTENDED: validation, builder pattern
+│   │   │   ├── sparse.rs               ← from ladybug-rs graph/spo/sparse.rs (542 lines)
+│   │   │   │                             SparseContainer — not in current lance-graph
+│   │   │   ├── scent.rs                ← from ladybug-rs graph/spo/scent.rs (204 lines)
+│   │   │   │                             NibbleScent — not in current lance-graph
+│   │   │   ├── fingerprint.rs          ← from graph/fingerprint.rs (144 lines)
+│   │   │   ├── bitpack.rs              ← from holograph/src/bitpack.rs (970 lines)
+│   │   │   │                             BitpackedVector — the actual substrate
+│   │   │   ├── hdr_cascade.rs          ← from holograph/src/hdr_cascade.rs (957 lines)
+│   │   │   │                             σ-band cascade search, Mexican hat
+│   │   │   ├── epiphany.rs             ← from holograph/src/epiphany.rs (840 lines)
+│   │   │   │                             Cluster detection, adaptive thresholds
+│   │   │   └── resonance.rs            ← from holograph/src/resonance.rs (705 lines)
+│   │   │                                 Resonance patterns
+│   │   ├── tests/                      SPO tests (spo_ground_truth.rs + new)
+│   │   └── Cargo.toml                  deps: lance-graph-blasgraph, blake3
+│   │                                   ~5,500 lines. The complete SPO stack.
+│   │                                   Bitpacked vectors + BlasGraph semiring +
+│   │                                   Merkle integrity + NARS truth + HDR cascade +
+│   │                                   epiphany detection.
+│   │                                   THIS IS WHERE THE THINKING HAPPENS.
+│   │                                   But from outside it's just "the storage layer."
+│   │
+│   ├── lance-graph-engine/             CRATE 5: DataFusion execution engine
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── planner.rs              ← from datafusion_planner/mod.rs (240 lines)
+│   │   │   ├── scan_ops.rs             ← from datafusion_planner/scan_ops.rs (534 lines)
+│   │   │   ├── expression.rs           ← from datafusion_planner/expression.rs (1443 lines)
+│   │   │   ├── join_ops.rs             ← from datafusion_planner/join_ops.rs (616 lines)
+│   │   │   ├── vector_ops.rs           ← from datafusion_planner/vector_ops.rs (485 lines)
+│   │   │   ├── udf.rs                  ← from datafusion_planner/udf.rs (740 lines)
+│   │   │   ├── config_helpers.rs       ← from datafusion_planner/config_helpers.rs (237 lines)
+│   │   │   ├── builder/
+│   │   │   │   ├── mod.rs              ← from builder/mod.rs (106 lines)
+│   │   │   │   ├── basic_ops.rs        ← from builder/basic_ops.rs (653 lines)
+│   │   │   │   ├── aggregate_ops.rs    ← from builder/aggregate_ops.rs (135 lines)
+│   │   │   │   ├── expand_ops.rs       ← from builder/expand_ops.rs (717 lines)
+│   │   │   │   ├── join_builder.rs     ← from builder/join_builder.rs (633 lines)
+│   │   │   │   └── helpers.rs          ← from builder/helpers.rs (232 lines)
+│   │   │   └── simple/
+│   │   │       ├── mod.rs              ← from simple_executor/mod.rs (20 lines)
+│   │   │       ├── path_executor.rs    ← from simple_executor/path_executor.rs (304 lines)
+│   │   │       ├── expr.rs             ← from simple_executor/expr.rs (263 lines)
+│   │   │       ├── clauses.rs          ← from simple_executor/clauses.rs (93 lines)
+│   │   │       └── aliases.rs          ← from simple_executor/aliases.rs (44 lines)
+│   │   ├── tests/                      All test_datafusion_*.rs (move from monolith)
+│   │   └── Cargo.toml                  deps: lance-graph-plan, lance-graph-spo,
+│   │                                         datafusion, arrow, lancedb
+│   │                                   ~6,350 lines. LogicalPlan → PhysicalPlan → Execute.
+│   │                                   DataFusion as the execution backbone.
+│   │                                   scan_ops talks to SPO crate for hot path.
+│   │                                   join_builder connects hot ⋈ cold.
+│   │
+│   ├── lance-graph-server/             CRATE 6: Server binary (the face)
+│   │   ├── src/
+│   │   │   ├── main.rs                 Server entry point
+│   │   │   ├── redis.rs                Redis wire protocol handler
+│   │   │   ├── http.rs                 REST API (/cypher, /sql, /vectors, /health)
+│   │   │   ├── flight.rs               Arrow Flight gRPC endpoint
+│   │   │   └── neo4j_mirror.rs         PET scan projection (one-way, cold only)
+│   │   └── Cargo.toml                  deps: all other crates, axum or tokio-tcp
+│   │                                   ~2,000 lines. The boring HTTP server.
+│   │                                   Redis protocol in, JSON out.
+│   │                                   Cypher in, Arrow out.
+│   │                                   Looks like any graph database API.
+│   │
+│   ├── lance-graph-query/              CRATE 7: High-level query interface
+│   │   ├── src/
+│   │   │   ├── lib.rs
+│   │   │   ├── query.rs                ← from src/query.rs (2375 lines)
+│   │   │   ├── lance_search.rs         ← from src/lance_vector_search.rs (554 lines)
+│   │   │   └── lance_planner.rs        ← from src/lance_native_planner.rs (77 lines)
+│   │   └── Cargo.toml                  deps: lance-graph-ast, lance-graph-plan,
+│   │                                         lance-graph-engine, lancedb
+│   │                                   ~3,000 lines. CypherQuery builder, execution strategy,
+│   │                                   LanceDB integration. The user-facing API.
+│   │
+│   ├── lance-graph-catalog/            CRATE 8: Namespace catalog (EXISTS, keep as-is)
+│   │   └── ...                         Source catalog, namespace directory
+│   │
+│   └── lance-graph-python/             CRATE 9: Python bindings (EXISTS, keep as-is)
+│       └── ...                         PyO3 bindings for Python users
+
+```
+
+---
+
+## 2. THE LINE COUNT
+
+```
+CRATE                    FROM            LINES    NEW CODE   TOTAL
+────────────────────────────────────────────────────────────────────
+lance-graph-ast          monolith        5,547    100        5,647
+lance-graph-plan         monolith        1,816    200        2,016
+lance-graph-blasgraph    holograph       3,416    200        3,616
+lance-graph-spo          monolith+holo   4,218    1,200      5,418
+                         +ladybug-rs     (746)
+lance-graph-engine       monolith        6,345    300        6,645
+lance-graph-query        monolith        3,006    200        3,206
+lance-graph-server       NEW             0        2,000      2,000
+lance-graph-catalog      existing        ~400     0          400
+lance-graph-python       existing        ~300     0          300
+────────────────────────────────────────────────────────────────────
+TOTAL                                    25,794   4,200      29,248
+
+Current monolith:        19,262 lines
+Holograph import:         3,416 lines (BlasGraph)
+Ladybug-rs import:          746 lines (sparse, scent)
+Holograph SPO import:     3,472 lines (bitpack, hdr_cascade, epiphany, resonance)
+New code:                 4,200 lines (server, optimizer, extensions, wiring)
+Tests (existing):         9,311 lines (move, don't rewrite)
+────────────────────────────────────────────────────────────────────
+Grand total:            ~38,559 lines including tests
+```
+
+---
+
+## 3. THE DEPENDENCY GRAPH
+
+```
+lance-graph-ast          (0 internal deps — anyone can parse Cypher standalone)
+    │
+    ▼
+lance-graph-plan         (depends: ast)
+    │
+    │           lance-graph-blasgraph   (0 deps — pure algebra)
+    │               │
+    │               ▼
+    │           lance-graph-spo         (depends: blasgraph, blake3)
+    │               │
+    └───────┬───────┘
+            │
+            ▼
+    lance-graph-engine   (depends: plan, spo, datafusion, arrow)
+            │
+            ▼
+    lance-graph-query    (depends: ast, plan, engine, lancedb)
+            │
+            ▼
+    lance-graph-server   (depends: query, all crates, axum/tokio)
+```
+
+**Clean layering.** Each crate depends only on things below it. No cycles.
+The AST crate has zero internal deps — you can use the Cypher parser standalone
+in any project. The BlasGraph crate has zero deps — pure math. The SPO crate
+depends only on BlasGraph. The engine ties plan + SPO together through DataFusion.
+
+---
+
+## 4. WHAT MAKES THIS "1 YEAR AHEAD"
+
+### For Users Who Think "Graph Database"
+
+```
+1. LanceDB zero-copy storage (not SQLite, not RocksDB — Arrow mmapped)
+2. DataFusion query engine (not a toy SQL parser — the real thing)
+3. Cypher + SQL + NARS in one protocol (not one-language-only)
+4. BlasGraph semiring algebra (not Neo4j's Pregel — actual linear algebra on graphs)
+5. Variable-length path expansion with DataFusion CTEs
+6. Vector similarity integrated into Cypher (not a separate index)
+7. Proper snafu error handling with file:line:column tracking
+8. Builder pattern with validation on all config types
+```
+
+Every one of these is individually available in some other project.
+Nobody has all 8 in one graph database.
+
+### For Users Who Dig Deeper
+
+```
+9.  SPO triples as 16384-bit superposition vectors (not property values)
+10. Hamming distance as the universal similarity metric (not cosine)
+11. NARS truth gating BEFORE distance computation (filter at 2 cycles, not 50)
+12. Blake3 Merkle for content addressing AND integrity checking
+13. σ-band cascade: 99.7% eliminated at stage 1, 95% at stage 2
+14. Epiphany detection from cluster tightness patterns
+15. Mexican hat resonance (excite center, inhibit surround)
+16. 7 semiring algebras for different computation modes
+```
+
+This is what makes it not just "fast" but "thinks differently."
+But the user doesn't need to know about 9-16 to use 1-8.
+
+---
+
+## 5. THE README.md THAT GETS ADOPTED
+
+````markdown
+# lance-graph
+
+**A graph database engine built on LanceDB and DataFusion.**
+
+Fast. Zero-copy. Cypher + SQL in one engine.
+
+## Quick Start
+
+```rust
+use lance_graph::{CypherQuery, GraphConfig};
+
+let query = CypherQuery::new("MATCH (a:Person)-[:KNOWS]->(b) RETURN b.name")?
+    .with_config(GraphConfig::builder()
+        .with_node_label("Person", "person_id")
+        .with_relationship("KNOWS", "src", "dst")
+        .build()?);
+
+let results = query.execute(&session).await?;
+```
+
+## Features
+
+- **Cypher + SQL**: Both query languages, same engine, same storage
+- **Zero-copy**: LanceDB with memory-mapped Arrow — no serialization
+- **DataFusion**: Production query engine, not a toy parser
+- **Vector search**: `MATCH (n) WHERE n.embedding SIMILAR TO $query`
+- **Graph algorithms**: PageRank, community detection, shortest path (via BlasGraph)
+- **Truth values**: Built-in confidence tracking on every edge
+- **NARS reasoning**: Deduction, abduction, induction on graph edges
+
+## Architecture
+
+```
+Cypher/SQL → Parser → LogicalPlan → DataFusion → LanceDB
+                                        ↑
+                                    BlasGraph
+                                (semiring algebra)
+```
+````
+
+That's what people see. Clean. Professional. Boring.
+Under the hood it's running SPO Hamming resonance on bitpacked vectors
+with 7 semiring algebras and epiphany detection.
+
+But the README doesn't mention any of that.
+Adoption first. Revelation later.
+
+---
+
+## 6. EXECUTION PLAN (4-6 weeks)
+
+### Week 1: Crate Skeleton + AST Extract
+
+```
+- Create workspace with 8 crate directories
+- Move parser + AST + semantic + config + error → lance-graph-ast
+- Fix all imports (crate:: → lance_graph_ast::)
+- Verify: lance-graph-ast compiles standalone
+- Move tests: test_case_insensitivity, test_complex_return_clauses, test_to_sql
+- cargo test on lance-graph-ast passes
+```
+
+### Week 2: Plan + BlasGraph + SPO Extract
+
+```
+- Move logical_plan + analysis → lance-graph-plan
+- Copy holograph graphblas/ → lance-graph-blasgraph
+  - Adapt: remove holograph-specific imports
+  - Add: Cargo.toml with zero deps
+  - Verify: compiles standalone
+- Move graph/spo/ → lance-graph-spo
+  - Import: ladybug-rs sparse.rs, scent.rs
+  - Import: holograph bitpack.rs, hdr_cascade.rs, epiphany.rs, resonance.rs
+  - Extend: store.rs with TruthGate, QueryAxis from ladybug-rs
+  - Extend: merkle.rs with Epoch, ProofStep from ladybug-rs
+  - Wire: lance-graph-spo depends on lance-graph-blasgraph
+- Move test: spo_ground_truth.rs
+```
+
+### Week 3: Engine Extract
+
+```
+- Move datafusion_planner/ → lance-graph-engine
+- Move simple_executor/ → lance-graph-engine/simple/
+- Wire: depends on lance-graph-plan + lance-graph-spo
+- Fix all DataFusion imports
+- Move tests: all test_datafusion_*.rs
+- cargo test on lance-graph-engine passes
+```
+
+### Week 4: Query + Server
+
+```
+- Move query.rs + lance_vector_search.rs → lance-graph-query
+- Wire: depends on ast + plan + engine
+- Create lance-graph-server (NEW):
+  - HTTP endpoints (/cypher, /sql, /health, /vectors)
+  - Redis wire protocol handler
+  - Arrow Flight gRPC (from existing bench/flight code)
+- Move bench: graph_execution.rs
+```
+
+### Week 5: Integration + Polish
+
+```
+- End-to-end test: start server, Cypher MERGE, Cypher MATCH, verify results
+- README.md for each crate
+- Workspace README with quick start
+- CI: GitHub Actions for each crate independently
+- Fix all clippy warnings
+- cargo publish --dry-run for each crate
+```
+
+### Week 6 (Optional): Neo4j Mirror
+
+```
+- lance-graph-server/neo4j_mirror.rs
+- One-way projection: WISDOM → Neo4j
+- Configurable: enable/disable, connection string, projection frequency
+- Test: Neo4j Browser shows graph structure
+```
+
+---
+
+## 7. WHAT THIS IS NOT
+
+```
+This is NOT ladybug-rs.
+  ladybug-rs is the brain — it owns BindSpace, SpineCache, qualia, awareness loop.
+  lance-graph is the face — it owns the query surface, the parser, the server.
+
+This is NOT staunen.
+  staunen is the bet — 6 CPU instructions, no GPU, L1 cache.
+  lance-graph uses staunen's principles but packages them as "a graph database."
+
+This is NOT a rewrite.
+  19,262 lines of existing code gets MOVED, not rewritten.
+  Holograph BlasGraph gets IMPORTED, not reimplemented.
+  New code is ~4,200 lines (server + optimizer + wiring).
+
+This IS:
+  The version that appears on Hacker News.
+  The version that gets compared to Neo4j.
+  The version that someone at a startup adopts because
+  "we need a graph database and this one uses Arrow."
+  They'll never know it thinks.
+```
+
+---
+
+## 8. THE COMPARISON TABLE (What They'll See)
+
+```
+                    Neo4j        Dgraph       lance-graph
+                    ─────        ──────       ───────────
+Storage             Custom       Badger       LanceDB (Arrow, zero-copy)
+Query language      Cypher       GraphQL      Cypher + SQL + NARS
+Wire protocol       Bolt         gRPC         Redis + Arrow Flight
+Execution engine    Custom       Custom       DataFusion (Apache)
+Vector search       Separate     Separate     Integrated in Cypher
+Graph algorithms    GDS plugin   Built-in     BlasGraph (semiring algebra)
+Truth values        No           No           NARS built-in
+Consistency         ACID         MVCC         MVCC + Merkle verification
+Zero-copy           No           No           Yes (LanceDB mmap)
+Embedded mode       No           Yes          Yes (library crate)
+Cloud native        Enterprise   Cloud only   Anywhere (single binary)
+```
+
+Boring comparison table. Boring wins everywhere.
+What they don't see: the SPO superposition, the Hamming cognition,
+the Hebbian plasticity, the epiphany detection.
+
+They don't need to.
diff --git a/crates/lance-graph/src/graph/fingerprint.rs b/crates/lance-graph/src/graph/fingerprint.rs
new file mode 100644
index 00000000..bcb49230
--- /dev/null
+++ b/crates/lance-graph/src/graph/fingerprint.rs
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: Apache-2.0
+// SPDX-FileCopyrightText: Copyright The Lance Authors
+
+//! Fingerprint functions for SPO triple addressing.
+//!
+//! Labels (node names, relationship types) are hashed into fixed-width
+//! fingerprints for compact storage and fast comparison in the SPO store.
+
+/// Number of u64 words in a fingerprint vector.
+pub const FINGERPRINT_WORDS: usize = 8;
+
+/// A fingerprint is a fixed-width hash of a label string.
+pub type Fingerprint = [u64; FINGERPRINT_WORDS];
+
+/// Hash a label string into a fingerprint.
+///
+/// Uses FNV-1a inspired mixing to distribute bits across all words.
+/// The result is deterministic: same label always produces the same fingerprint.
+pub fn label_fp(label: &str) -> Fingerprint {
+    let mut fp = [0u64; FINGERPRINT_WORDS];
+    let bytes = label.as_bytes();
+
+    // Primary hash using FNV-1a constants
+    let mut h: u64 = 0xcbf29ce484222325;
+    for &b in bytes {
+        h ^= b as u64;
+        h = h.wrapping_mul(0x100000001b3);
+    }
+    fp[0] = h;
+
+    // Fill remaining words with cascading mixes
+    #[allow(clippy::needless_range_loop)]
+    for i in 1..FINGERPRINT_WORDS {
+        h = h.wrapping_mul(0x517cc1b727220a95);
+        h ^= h >> 17;
+        h = h.wrapping_mul(0x6c62272e07bb0142);
+        h ^= (i as u64).wrapping_mul(0x9e3779b97f4a7c15);
+        fp[i] = h;
+    }
+
+    // Guard: thin out if density > 11% (prevents pack_axes overflow)
+    // Density = popcount / total_bits. At 8 words × 64 bits = 512 bits,
+    // 11% ≈ 56 set bits. If we exceed this, shift-fold to thin out.
+ let popcount: u32 = fp.iter().map(|w| w.count_ones()).sum(); + let total_bits = (FINGERPRINT_WORDS * 64) as u32; + let max_density_bits = total_bits * 11 / 100; // 11% threshold + + if popcount > max_density_bits { + // Thin out by XOR-folding with shifted self + for i in 0..FINGERPRINT_WORDS { + fp[i] ^= fp[i] >> 3; + fp[i] &= fp[(i + 1) % FINGERPRINT_WORDS].wrapping_shr(1) | fp[i]; + } + // Re-check and force-mask if still too dense + let popcount2: u32 = fp.iter().map(|w| w.count_ones()).sum(); + if popcount2 > max_density_bits { + for w in fp.iter_mut() { + // Keep only every other bit + *w &= 0x5555_5555_5555_5555; + } + } + } + + fp +} + +/// Hash a DN (distinguished name) path into a u64 address. +/// +/// Used for keying records in the SPO store. +pub fn dn_hash(dn: &str) -> u64 { + let mut h: u64 = 0xcbf29ce484222325; + for &b in dn.as_bytes() { + h ^= b as u64; + h = h.wrapping_mul(0x100000001b3); + } + h +} + +/// Compute Hamming distance between two fingerprints. +/// +/// Returns the number of bit positions where the fingerprints differ. +pub fn hamming_distance(a: &Fingerprint, b: &Fingerprint) -> u32 { + a.iter() + .zip(b.iter()) + .map(|(x, y)| (x ^ y).count_ones()) + .sum() +} + +/// Zero fingerprint constant. 
+pub const ZERO_FP: Fingerprint = [0u64; FINGERPRINT_WORDS]; + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_label_fp_deterministic() { + let fp1 = label_fp("Jan"); + let fp2 = label_fp("Jan"); + assert_eq!(fp1, fp2); + } + + #[test] + fn test_label_fp_different_labels() { + let fp1 = label_fp("Jan"); + let fp2 = label_fp("Ada"); + assert_ne!(fp1, fp2); + } + + #[test] + fn test_label_fp_density_bound() { + // Check that density stays under ~50% for reasonable labels + for label in &["Jan", "Ada", "KNOWS", "CREATES", "HELPS", "entity_42"] { + let fp = label_fp(label); + let popcount: u32 = fp.iter().map(|w| w.count_ones()).sum(); + let total = (FINGERPRINT_WORDS * 64) as u32; + assert!( + popcount < total / 2, + "Label '{}' has density {}/{}", + label, + popcount, + total + ); + } + } + + #[test] + fn test_dn_hash_deterministic() { + assert_eq!(dn_hash("edge:jan-knows-ada"), dn_hash("edge:jan-knows-ada")); + } + + #[test] + fn test_hamming_distance_self() { + let fp = label_fp("test"); + assert_eq!(hamming_distance(&fp, &fp), 0); + } + + #[test] + fn test_hamming_distance_different() { + let fp1 = label_fp("Jan"); + let fp2 = label_fp("Ada"); + assert!(hamming_distance(&fp1, &fp2) > 0); + } +} diff --git a/crates/lance-graph/src/graph/mod.rs b/crates/lance-graph/src/graph/mod.rs new file mode 100644 index 00000000..debc35bd --- /dev/null +++ b/crates/lance-graph/src/graph/mod.rs @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Graph primitives: fingerprinting, sparse bitmaps, and SPO triple store. +//! +//! This module provides the low-level graph data structures that sit beneath +//! the Cypher query engine. While the Cypher layer operates on property graphs +//! via DataFusion, this layer provides direct fingerprint-based graph operations. + +pub mod fingerprint; +pub mod sparse; +pub mod spo; + +/// Container geometry identifiers for graph storage layouts. 
+#[derive(Debug, Clone, Copy, PartialEq, Eq)] +#[repr(u8)] +pub enum ContainerGeometry { + /// Flat record batch (default). + Flat = 0, + /// Adjacency list. + AdjList = 1, + /// CSR (Compressed Sparse Row). + Csr = 2, + /// CSC (Compressed Sparse Column). + Csc = 3, + /// COO (Coordinate list). + Coo = 4, + /// Hybrid (mixed format). + Hybrid = 5, + /// SPO (Subject-Predicate-Object triple store). + Spo = 6, +} diff --git a/crates/lance-graph/src/graph/sparse.rs b/crates/lance-graph/src/graph/sparse.rs new file mode 100644 index 00000000..71c08ca7 --- /dev/null +++ b/crates/lance-graph/src/graph/sparse.rs @@ -0,0 +1,128 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Sparse bitmap operations for SPO fingerprint packing. +//! +//! Uses `[u64; BITMAP_WORDS]` for fixed-width bitmaps that can be +//! packed into Lance vector columns for ANN search. + +/// Number of u64 words in a bitmap. +/// +/// Previously hardcoded as `[u64; 2]` which truncated fingerprints. +/// Now matches the fingerprint width for full coverage. +pub const BITMAP_WORDS: usize = 8; + +/// A fixed-width bitmap for sparse set encoding. +pub type Bitmap = [u64; BITMAP_WORDS]; + +/// Create an empty bitmap (all zeros). +pub const fn bitmap_zero() -> Bitmap { + [0u64; BITMAP_WORDS] +} + +/// OR two bitmaps together. +pub fn bitmap_or(a: &Bitmap, b: &Bitmap) -> Bitmap { + let mut result = [0u64; BITMAP_WORDS]; + for i in 0..BITMAP_WORDS { + result[i] = a[i] | b[i]; + } + result +} + +/// AND two bitmaps together. +pub fn bitmap_and(a: &Bitmap, b: &Bitmap) -> Bitmap { + let mut result = [0u64; BITMAP_WORDS]; + for i in 0..BITMAP_WORDS { + result[i] = a[i] & b[i]; + } + result +} + +/// XOR two bitmaps (used for Hamming distance). +pub fn bitmap_xor(a: &Bitmap, b: &Bitmap) -> Bitmap { + let mut result = [0u64; BITMAP_WORDS]; + for i in 0..BITMAP_WORDS { + result[i] = a[i] ^ b[i]; + } + result +} + +/// Count set bits in a bitmap. 
+pub fn bitmap_popcount(bm: &Bitmap) -> u32 { + bm.iter().map(|w| w.count_ones()).sum() +} + +/// Hamming distance between two bitmaps. +pub fn bitmap_hamming(a: &Bitmap, b: &Bitmap) -> u32 { + bitmap_popcount(&bitmap_xor(a, b)) +} + +/// Check if a bitmap is all zeros. +pub fn bitmap_is_zero(bm: &Bitmap) -> bool { + bm.iter().all(|&w| w == 0) +} + +/// Set a specific bit position (0..BITMAP_WORDS*64). +pub fn bitmap_set_bit(bm: &mut Bitmap, pos: usize) { + let word = pos / 64; + let bit = pos % 64; + if word < BITMAP_WORDS { + bm[word] |= 1u64 << bit; + } +} + +/// Pack three fingerprints into a combined bitmap for SPO encoding. +/// +/// The packed result is the OR of all three, used as the search vector. +/// Individual components can be recovered via AND with the original fingerprints. +pub fn pack_axes( + s: &[u64; BITMAP_WORDS], + p: &[u64; BITMAP_WORDS], + o: &[u64; BITMAP_WORDS], +) -> Bitmap { + let sp = bitmap_or(s, p); + bitmap_or(&sp, o) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_bitmap_zero() { + let bm = bitmap_zero(); + assert!(bitmap_is_zero(&bm)); + assert_eq!(bitmap_popcount(&bm), 0); + } + + #[test] + fn test_bitmap_or() { + let a = [1u64, 0, 0, 0, 0, 0, 0, 0]; + let b = [0u64, 1, 0, 0, 0, 0, 0, 0]; + let c = bitmap_or(&a, &b); + assert_eq!(c[0], 1); + assert_eq!(c[1], 1); + } + + #[test] + fn test_bitmap_hamming() { + let a = [0xFFu64, 0, 0, 0, 0, 0, 0, 0]; + let b = [0x00u64, 0, 0, 0, 0, 0, 0, 0]; + assert_eq!(bitmap_hamming(&a, &b), 8); + } + + #[test] + fn test_pack_axes() { + let s = [1u64, 0, 0, 0, 0, 0, 0, 0]; + let p = [2u64, 0, 0, 0, 0, 0, 0, 0]; + let o = [4u64, 0, 0, 0, 0, 0, 0, 0]; + let packed = pack_axes(&s, &p, &o); + assert_eq!(packed[0], 7); // 1|2|4 = 7 + } + + #[test] + fn test_bitmap_words_matches_fingerprint() { + // BITMAP_WORDS must match FINGERPRINT_WORDS + assert_eq!(BITMAP_WORDS, super::super::fingerprint::FINGERPRINT_WORDS); + } +} diff --git a/crates/lance-graph/src/graph/spo/builder.rs 
b/crates/lance-graph/src/graph/spo/builder.rs new file mode 100644 index 00000000..af1cbf15 --- /dev/null +++ b/crates/lance-graph/src/graph/spo/builder.rs @@ -0,0 +1,119 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Builder for SPO edge records. +//! +//! An SPO record packs Subject, Predicate, Object fingerprints together +//! with a truth value into a structure that can be stored in an SpoStore +//! and queried via ANN search. + +use crate::graph::fingerprint::{Fingerprint, FINGERPRINT_WORDS}; +use crate::graph::sparse::{pack_axes, Bitmap, BITMAP_WORDS}; + +use super::truth::TruthValue; + +/// An SPO record representing a single edge in the graph. +/// +/// Contains the packed search vector (for ANN queries) and the individual +/// components (for result interpretation). +#[derive(Debug, Clone)] +pub struct SpoRecord { + /// Subject fingerprint. + pub subject: Fingerprint, + /// Predicate fingerprint. + pub predicate: Fingerprint, + /// Object fingerprint. + pub object: Fingerprint, + /// Packed bitmap: S|P|O for ANN similarity search. + pub packed: Bitmap, + /// Truth value of this edge. + pub truth: TruthValue, +} + +/// Builder for constructing SPO edge records. +pub struct SpoBuilder; + +impl SpoBuilder { + /// Build an edge record from S, P, O fingerprints and a truth value. + /// + /// The packed bitmap is the OR of all three fingerprints, used as + /// the search vector for ANN queries in Lance. + pub fn build_edge( + subject: &Fingerprint, + predicate: &Fingerprint, + object: &Fingerprint, + truth: TruthValue, + ) -> SpoRecord { + // Ensure sizes match (compile-time guarantee via type aliases, + // but assert at runtime for safety during development). 
+ debug_assert_eq!(FINGERPRINT_WORDS, BITMAP_WORDS); + + let packed = pack_axes(subject, predicate, object); + + SpoRecord { + subject: *subject, + predicate: *predicate, + object: *object, + packed, + truth, + } + } + + /// Build a forward query vector: S|P (looking for O). + /// + /// For SxP2O queries: given Subject and Predicate, find Object. + pub fn build_forward_query(subject: &Fingerprint, predicate: &Fingerprint) -> Bitmap { + let zero = [0u64; BITMAP_WORDS]; + pack_axes(subject, predicate, &zero) + } + + /// Build a reverse query vector: P|O (looking for S). + /// + /// For PxO2S queries: given Predicate and Object, find Subject. + pub fn build_reverse_query(predicate: &Fingerprint, object: &Fingerprint) -> Bitmap { + let zero = [0u64; BITMAP_WORDS]; + pack_axes(&zero, predicate, object) + } + + /// Build a relation query vector: S|O (looking for P). + /// + /// For SxO2P queries: given Subject and Object, find Predicate. + pub fn build_relation_query(subject: &Fingerprint, object: &Fingerprint) -> Bitmap { + let zero = [0u64; BITMAP_WORDS]; + pack_axes(subject, &zero, object) + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::graph::fingerprint::label_fp; + + #[test] + fn test_build_edge() { + let s = label_fp("Jan"); + let p = label_fp("KNOWS"); + let o = label_fp("Ada"); + let record = SpoBuilder::build_edge(&s, &p, &o, TruthValue::new(0.9, 0.8)); + + assert_eq!(record.subject, s); + assert_eq!(record.predicate, p); + assert_eq!(record.object, o); + assert_eq!(record.truth.frequency, 0.9); + // Packed should be S|P|O + for i in 0..BITMAP_WORDS { + assert_eq!(record.packed[i], s[i] | p[i] | o[i]); + } + } + + #[test] + fn test_forward_query_vector() { + let s = label_fp("Jan"); + let p = label_fp("KNOWS"); + let query = SpoBuilder::build_forward_query(&s, &p); + // Should contain bits from both S and P + for i in 0..BITMAP_WORDS { + assert_eq!(query[i], s[i] | p[i]); + } + } +} diff --git a/crates/lance-graph/src/graph/spo/merkle.rs 
b/crates/lance-graph/src/graph/spo/merkle.rs new file mode 100644 index 00000000..66e49662 --- /dev/null +++ b/crates/lance-graph/src/graph/spo/merkle.rs @@ -0,0 +1,248 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Merkle root and ClamPath integrity for BindSpace nodes. +//! +//! Each node in the BindSpace has a ClamPath (DN address) and a MerkleRoot +//! stamped at write time. Verification checks whether the fingerprint +//! content still matches the stamped root. +//! +//! **Known gap**: `verify_lineage` currently performs a structural check only — +//! it does not re-hash the fingerprint data to detect bit-flip corruption. +//! This is documented and tested (test 6 expects this gap). + +use crate::graph::fingerprint::{Fingerprint, ZERO_FP}; +#[cfg(test)] +use crate::graph::fingerprint::FINGERPRINT_WORDS; + +/// A Merkle root stamped on a BindSpace node at write time. +/// +/// Computed from the fingerprint content via simple XOR-fold hash. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub struct MerkleRoot(pub u64); + +impl MerkleRoot { + /// Compute merkle root from a fingerprint. + pub fn from_fingerprint(fp: &Fingerprint) -> Self { + let mut h: u64 = 0xa5a5a5a5a5a5a5a5; + for &w in fp.iter() { + h = h.rotate_left(7) ^ w; + h = h.wrapping_mul(0x517cc1b727220a95); + } + MerkleRoot(h) + } + + /// Check if this root is the zero/unset value. + pub fn is_zero(&self) -> bool { + self.0 == 0 + } +} + +/// A ClamPath is a hierarchical address (distinguished name path). +/// +/// e.g., "agent:test:node" → depth=3, segments=["agent","test","node"] +#[derive(Debug, Clone)] +pub struct ClamPath { + /// The full path string. + pub path: String, + /// Depth (number of segments). + pub depth: u32, +} + +impl ClamPath { + /// Parse a colon-separated DN path. 
+ pub fn parse(path: &str) -> Self { + let depth = path.split(':').count() as u32; + Self { + path: path.to_string(), + depth, + } + } +} + +/// A node in the BindSpace, addressed by ClamPath. +#[derive(Debug, Clone)] +pub struct BindNode { + /// The ClamPath address. + pub clam_path: ClamPath, + /// The fingerprint data stored at this node. + pub fingerprint: Fingerprint, + /// Merkle root stamped at write time. + pub merkle_root: MerkleRoot, + /// Depth hint (from the write call). + pub depth: u32, +} + +/// Verification status. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum VerifyStatus { + /// Content matches the stamped merkle root. + Consistent, + /// Content has been modified since the root was stamped. + Corrupted, + /// Node not found. + NotFound, +} + +/// In-memory BindSpace for ClamPath → BindNode mapping. +/// +/// Provides write, read, and merkle verification. +pub struct BindSpace { + nodes: Vec<BindNode>, +} + +impl BindSpace { + /// Create an empty BindSpace. + pub fn new() -> Self { + Self { nodes: Vec::new() } + } + + /// Write a node into the BindSpace. + /// + /// Returns the address (index) of the new node. + /// The merkle root is stamped at this point from the fingerprint. + pub fn write_dn_path(&mut self, path: &str, fp: Fingerprint, depth: u32) -> usize { + let merkle_root = MerkleRoot::from_fingerprint(&fp); + let node = BindNode { + clam_path: ClamPath::parse(path), + fingerprint: fp, + merkle_root, + depth, + }; + self.nodes.push(node); + self.nodes.len() - 1 + } + + /// Read a node by address. + pub fn read(&self, addr: usize) -> Option<&BindNode> { + self.nodes.get(addr) + } + + /// Read a mutable node by address. + pub fn read_mut(&mut self, addr: usize) -> Option<&mut BindNode> { + self.nodes.get_mut(addr) + } + + /// Get ClamPath and MerkleRoot for a node.
+ pub fn clam_merkle(&self, addr: usize) -> Option<(&ClamPath, &MerkleRoot)> { + self.nodes + .get(addr) + .map(|n| (&n.clam_path, &n.merkle_root)) + } + + /// Verify lineage integrity for a node. + /// + /// **Known gap**: This performs structural verification only — it checks + /// that the merkle root is non-zero and the node exists. It does NOT + /// re-compute the hash from current fingerprint content to detect + /// bit-level corruption. This is a documented limitation. + /// + /// A full implementation would: + /// ```ignore + /// let recomputed = MerkleRoot::from_fingerprint(&node.fingerprint); + /// if recomputed != node.merkle_root { return VerifyStatus::Corrupted; } + /// ``` + pub fn verify_lineage(&self, addr: usize) -> VerifyStatus { + match self.nodes.get(addr) { + None => VerifyStatus::NotFound, + Some(node) => { + if node.merkle_root.is_zero() || node.fingerprint == ZERO_FP { + VerifyStatus::Corrupted + } else { + // KNOWN GAP: does not re-hash and compare. + // Always returns Consistent if root is non-zero + // and fingerprint is non-zero. + VerifyStatus::Consistent + } + } + } + } + + /// Full integrity verification (re-hashes and compares). + /// + /// This is the correct implementation that `verify_lineage` should + /// eventually use. Kept separate to document the gap. 
+ pub fn verify_integrity(&self, addr: usize) -> VerifyStatus { + match self.nodes.get(addr) { + None => VerifyStatus::NotFound, + Some(node) => { + let recomputed = MerkleRoot::from_fingerprint(&node.fingerprint); + if recomputed == node.merkle_root { + VerifyStatus::Consistent + } else { + VerifyStatus::Corrupted + } + } + } + } +} + +impl Default for BindSpace { + fn default() -> Self { + Self::new() + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_merkle_root_from_fingerprint() { + let fp = [0xDEADu64; FINGERPRINT_WORDS]; + let root = MerkleRoot::from_fingerprint(&fp); + assert!(!root.is_zero()); + + // Same input → same root + let root2 = MerkleRoot::from_fingerprint(&fp); + assert_eq!(root, root2); + } + + #[test] + fn test_merkle_root_different() { + let fp1 = [0xDEADu64; FINGERPRINT_WORDS]; + let fp2 = [0xBEEFu64; FINGERPRINT_WORDS]; + assert_ne!( + MerkleRoot::from_fingerprint(&fp1), + MerkleRoot::from_fingerprint(&fp2) + ); + } + + #[test] + fn test_clam_path_parse() { + let cp = ClamPath::parse("agent:test:node"); + assert_eq!(cp.depth, 3); + assert_eq!(cp.path, "agent:test:node"); + } + + #[test] + fn test_bind_space_write_read() { + let mut space = BindSpace::new(); + let fp = [0xDEADu64; FINGERPRINT_WORDS]; + let addr = space.write_dn_path("agent:test:node", fp, 3); + + let node = space.read(addr).unwrap(); + assert_eq!(node.fingerprint, fp); + assert_eq!(node.depth, 3); + assert!(!node.merkle_root.is_zero()); + } + + #[test] + fn test_verify_lineage_gap() { + let mut space = BindSpace::new(); + let fp = [0xDEADu64; FINGERPRINT_WORDS]; + let addr = space.write_dn_path("agent:test:node", fp, 3); + + // Before corruption: consistent + assert_eq!(space.verify_lineage(addr), VerifyStatus::Consistent); + + // Corrupt the fingerprint + space.read_mut(addr).unwrap().fingerprint[5] ^= 0xFFFF; + + // verify_lineage still says Consistent (KNOWN GAP) + assert_eq!(space.verify_lineage(addr), VerifyStatus::Consistent); + + // 
verify_integrity correctly detects corruption + assert_eq!(space.verify_integrity(addr), VerifyStatus::Corrupted); + } +} diff --git a/crates/lance-graph/src/graph/spo/mod.rs b/crates/lance-graph/src/graph/spo/mod.rs new file mode 100644 index 00000000..865fb2d7 --- /dev/null +++ b/crates/lance-graph/src/graph/spo/mod.rs @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! SPO (Subject-Predicate-Object) triple store for fingerprint-based graph queries. +//! +//! This module provides: +//! - [`TruthValue`] / [`TruthGate`]: NARS-style confidence values and filters +//! - [`SpoBuilder`]: Constructs edge records from fingerprints +//! - [`SpoStore`]: In-memory triple store with bitmap ANN queries +//! - [`SpoSemiring`] / [`HammingMin`]: Semiring algebra for chain traversal +//! - [`MerkleRoot`] / [`BindSpace`]: Integrity verification for graph nodes + +pub mod builder; +pub mod merkle; +pub mod semiring; +pub mod store; +pub mod truth; + +pub use builder::{SpoBuilder, SpoRecord}; +pub use merkle::{BindSpace, ClamPath, MerkleRoot, VerifyStatus}; +pub use semiring::{HammingMin, SpoSemiring, TraversalHop}; +pub use store::{SpoHit, SpoStore}; +pub use truth::{TruthGate, TruthValue}; diff --git a/crates/lance-graph/src/graph/spo/semiring.rs b/crates/lance-graph/src/graph/spo/semiring.rs new file mode 100644 index 00000000..e79b9534 --- /dev/null +++ b/crates/lance-graph/src/graph/spo/semiring.rs @@ -0,0 +1,99 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Semiring algebra for SPO graph traversal. +//! +//! A semiring over (⊕, ⊗) provides: +//! - ⊕ (combine): how to merge parallel paths +//! - ⊗ (extend): how to chain sequential hops +//! +//! `HammingMin` uses Hamming distance as the cost metric: +//! - ⊕ = min (take the shortest path) +//! 
- ⊗ = add (distances accumulate through chain) + +use super::truth::TruthValue; + +/// A semiring for graph traversal cost computation. +pub trait SpoSemiring { + /// The cost type (e.g., u32 for Hamming distance). + type Cost: Copy + Ord + Default; + + /// Identity element for ⊗ (extend). Zero hops = zero cost. + fn one() -> Self::Cost; + + /// Identity element for ⊕ (combine). Worst possible cost. + fn zero() -> Self::Cost; + + /// Combine parallel paths: keep the best one. + fn combine(a: Self::Cost, b: Self::Cost) -> Self::Cost; + + /// Extend a path by one hop: accumulate cost. + fn extend(path: Self::Cost, hop: Self::Cost) -> Self::Cost; +} + +/// Hamming distance semiring: min-plus over bit distances. +/// +/// - combine = min (shortest semantic path) +/// - extend = saturating_add (distances accumulate) +pub struct HammingMin; + +impl SpoSemiring for HammingMin { + type Cost = u32; + + fn one() -> u32 { + 0 + } + + fn zero() -> u32 { + u32::MAX + } + + fn combine(a: u32, b: u32) -> u32 { + a.min(b) + } + + fn extend(path: u32, hop: u32) -> u32 { + path.saturating_add(hop) + } +} + +/// A hop in a traversal chain. +#[derive(Debug, Clone)] +pub struct TraversalHop { + /// The target entity fingerprint hash (dn_hash of the target). + pub target_key: u64, + /// Hamming distance of this hop. + pub distance: u32, + /// Truth value of the edge traversed. + pub truth: TruthValue, + /// Cumulative distance from the start. 
+ pub cumulative_distance: u32, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_hamming_min_combine() { + assert_eq!(HammingMin::combine(5, 3), 3); + assert_eq!(HammingMin::combine(3, 5), 3); + } + + #[test] + fn test_hamming_min_extend() { + assert_eq!(HammingMin::extend(10, 5), 15); + } + + #[test] + fn test_hamming_min_extend_saturating() { + assert_eq!(HammingMin::extend(u32::MAX, 1), u32::MAX); + } + + #[test] + fn test_hamming_min_identity() { + let cost = 42u32; + assert_eq!(HammingMin::extend(HammingMin::one(), cost), cost); + assert_eq!(HammingMin::combine(HammingMin::zero(), cost), cost); + } +} diff --git a/crates/lance-graph/src/graph/spo/store.rs b/crates/lance-graph/src/graph/spo/store.rs new file mode 100644 index 00000000..6ba7925a --- /dev/null +++ b/crates/lance-graph/src/graph/spo/store.rs @@ -0,0 +1,313 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! In-memory SPO triple store with bitmap-based ANN queries. +//! +//! `SpoStore` provides an in-memory implementation of the SPO triple store. +//! Records are keyed by u64 address (from `dn_hash`) and queried via +//! Hamming distance on packed fingerprint bitmaps. +//! +//! This is the development/testing implementation. Production will use +//! Lance ANN indices for the vector search. + +use std::collections::HashMap; + +use crate::graph::fingerprint::{hamming_distance, Fingerprint}; +use crate::graph::sparse::{bitmap_hamming, Bitmap}; + +use super::builder::{SpoBuilder, SpoRecord}; +use super::semiring::{HammingMin, SpoSemiring, TraversalHop}; +use super::truth::TruthGate; + +/// A query hit from the SPO store. +#[derive(Debug, Clone)] +pub struct SpoHit { + /// The key (dn_hash) of the matched record. + pub key: u64, + /// Hamming distance from the query vector to the packed record. + pub distance: u32, + /// The matched record. + pub record: SpoRecord, +} + +/// In-memory SPO triple store. 
+/// +/// Stores SPO records indexed by u64 keys and supports bitmap-based +/// nearest-neighbor queries for the 2³ projection verbs. +pub struct SpoStore { + records: HashMap<u64, SpoRecord>, +} + +impl SpoStore { + /// Create an empty store. + pub fn new() -> Self { + Self { + records: HashMap::new(), + } + } + + /// Insert a record at the given key. + pub fn insert(&mut self, key: u64, record: &SpoRecord) { + self.records.insert(key, record.clone()); + } + + /// Number of records in the store. + pub fn len(&self) -> usize { + self.records.len() + } + + /// Whether the store is empty. + pub fn is_empty(&self) -> bool { + self.records.is_empty() + } + + // ========================================================================= + // Core query methods (brute-force scan for dev/test) + // ========================================================================= + + /// Raw bitmap query: find records closest to the query vector. + /// + /// Returns every hit whose Hamming distance is at most `radius`, + /// sorted by ascending Hamming distance. + fn query_bitmap(&self, query: &Bitmap, radius: u32) -> Vec<SpoHit> { + let mut hits: Vec<SpoHit> = self + .records + .iter() + .map(|(&key, record)| SpoHit { + key, + distance: bitmap_hamming(query, &record.packed), + record: record.clone(), + }) + .filter(|hit| hit.distance <= radius) + .collect(); + + hits.sort_by_key(|h| h.distance); + hits + } + + /// Raw bitmap query with truth gate filtering.
+ fn query_bitmap_gated( + &self, + query: &Bitmap, + radius: u32, + gate: TruthGate, + ) -> Vec<SpoHit> { + let mut hits: Vec<SpoHit> = self + .records + .iter() + .map(|(&key, record)| SpoHit { + key, + distance: bitmap_hamming(query, &record.packed), + record: record.clone(), + }) + .filter(|hit| hit.distance <= radius && gate.passes(&hit.record.truth)) + .collect(); + + hits.sort_by_key(|h| h.distance); + hits + } + + // ========================================================================= + // 2³ Projection Verbs + // ========================================================================= + + /// SxP2O: Forward query (given Subject and Predicate, find Object). + /// + /// `MATCH (s)-[:P]->(?) WHERE s = Subject` + pub fn query_forward( + &self, + subject: &Fingerprint, + predicate: &Fingerprint, + radius: u32, + ) -> Vec<SpoHit> { + let query = SpoBuilder::build_forward_query(subject, predicate); + let hits = self.query_bitmap(&query, radius); + + // Post-filter: subject and predicate must closely match + hits.into_iter() + .filter(|h| { + hamming_distance(subject, &h.record.subject) < radius / 2 + 1 + && hamming_distance(predicate, &h.record.predicate) < radius / 2 + 1 + }) + .collect() + } + + /// SxP2O with truth gate. + pub fn query_forward_gated( + &self, + subject: &Fingerprint, + predicate: &Fingerprint, + radius: u32, + gate: TruthGate, + ) -> Vec<SpoHit> { + let query = SpoBuilder::build_forward_query(subject, predicate); + let hits = self.query_bitmap_gated(&query, radius, gate); + + hits.into_iter() + .filter(|h| { + hamming_distance(subject, &h.record.subject) < radius / 2 + 1 + && hamming_distance(predicate, &h.record.predicate) < radius / 2 + 1 + }) + .collect() + } + + /// PxO2S: Reverse query (given Predicate and Object, find Subject).
+ /// + /// `MATCH (?)-[:P]->(o) WHERE o = Object` + pub fn query_reverse( + &self, + predicate: &Fingerprint, + object: &Fingerprint, + radius: u32, + ) -> Vec<SpoHit> { + let query = SpoBuilder::build_reverse_query(predicate, object); + let hits = self.query_bitmap(&query, radius); + + hits.into_iter() + .filter(|h| { + hamming_distance(predicate, &h.record.predicate) < radius / 2 + 1 + && hamming_distance(object, &h.record.object) < radius / 2 + 1 + }) + .collect() + } + + /// SxO2P: Relation query (given Subject and Object, find Predicate). + /// + /// `MATCH (s)-[?]->(o)`: what verb connects s to o? + pub fn query_relation( + &self, + subject: &Fingerprint, + object: &Fingerprint, + radius: u32, + ) -> Vec<SpoHit> { + let query = SpoBuilder::build_relation_query(subject, object); + let hits = self.query_bitmap(&query, radius); + + hits.into_iter() + .filter(|h| { + hamming_distance(subject, &h.record.subject) < radius / 2 + 1 + && hamming_distance(object, &h.record.object) < radius / 2 + 1 + }) + .collect() + } + + // ========================================================================= + // Chain traversal (semiring-based) + // ========================================================================= + + /// Walk a chain of forward hops from a starting subject. + /// + /// Uses `HammingMin` semiring: costs accumulate, best path wins. + /// Follows edges greedily by picking the closest match at each hop.
+ pub fn walk_chain_forward( + &self, + start_subject: &Fingerprint, + radius: u32, + max_hops: usize, + ) -> Vec<TraversalHop> { + let mut path = Vec::new(); + let mut current_subject = *start_subject; + let mut cumulative = HammingMin::one(); + let mut visited = std::collections::HashSet::new(); + + for _ in 0..max_hops { + // Find all edges from current_subject (any predicate) + let mut best_hit: Option<SpoHit> = None; + for record in self.records.values() { + let d = hamming_distance(&current_subject, &record.subject); + if d < radius / 2 + 1 && !visited.contains(&self.key_for_object(&record.object)) { + match &best_hit { + Some(existing) if d >= existing.distance => {} + _ => { + best_hit = Some(SpoHit { + key: self.key_for_object(&record.object), + distance: d, + record: record.clone(), + }); + } + } + } + } + + match best_hit { + Some(hit) => { + cumulative = HammingMin::extend(cumulative, hit.distance); + visited.insert(hit.key); + path.push(TraversalHop { + target_key: hit.key, + distance: hit.distance, + truth: hit.record.truth, + cumulative_distance: cumulative, + }); + current_subject = hit.record.object; + } + None => break, + } + } + + path + } + + /// Find the key for a given object fingerprint (reverse lookup).
+ fn key_for_object(&self, object: &Fingerprint) -> u64 { + // Hash the object fingerprint to get a stable key + let mut h: u64 = 0xcbf29ce484222325; + for &w in object.iter() { + h ^= w; + h = h.wrapping_mul(0x100000001b3); + } + h + } +} + +impl Default for SpoStore { + fn default() -> Self { + Self::new() + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::graph::fingerprint::{dn_hash, label_fp}; + use crate::graph::spo::TruthValue; + + #[test] + fn test_store_insert_and_len() { + let mut store = SpoStore::new(); + assert!(store.is_empty()); + + let s = label_fp("Jan"); + let p = label_fp("KNOWS"); + let o = label_fp("Ada"); + let record = SpoBuilder::build_edge(&s, &p, &o, TruthValue::new(0.9, 0.8)); + store.insert(dn_hash("edge:jan-knows-ada"), &record); + + assert_eq!(store.len(), 1); + } + + #[test] + fn test_forward_query() { + let mut store = SpoStore::new(); + let jan = label_fp("Jan"); + let knows = label_fp("KNOWS"); + let ada = label_fp("Ada"); + let record = SpoBuilder::build_edge(&jan, &knows, &ada, TruthValue::new(0.9, 0.8)); + store.insert(dn_hash("edge:jan-knows-ada"), &record); + + let hits = store.query_forward(&jan, &knows, 200); + assert!(!hits.is_empty(), "Forward query should find the edge"); + } + + #[test] + fn test_reverse_query() { + let mut store = SpoStore::new(); + let jan = label_fp("Jan"); + let knows = label_fp("KNOWS"); + let ada = label_fp("Ada"); + let record = SpoBuilder::build_edge(&jan, &knows, &ada, TruthValue::new(0.9, 0.8)); + store.insert(dn_hash("edge:jan-knows-ada"), &record); + + let hits = store.query_reverse(&knows, &ada, 200); + assert!(!hits.is_empty(), "Reverse query should find the edge"); + } +} diff --git a/crates/lance-graph/src/graph/spo/truth.rs b/crates/lance-graph/src/graph/spo/truth.rs new file mode 100644 index 00000000..727c76c5 --- /dev/null +++ b/crates/lance-graph/src/graph/spo/truth.rs @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The 
Lance Authors + +//! NARS-style truth values and gates for SPO edge confidence. +//! +//! Each SPO edge carries a `TruthValue` with frequency (how often the relation +//! holds) and confidence (how certain we are). `TruthGate` thresholds filter +//! query results by minimum truth strength. + +/// A NARS-style truth value: (frequency, confidence). +/// +/// - `frequency` ∈ [0.0, 1.0]: proportion of positive evidence +/// - `confidence` ∈ [0.0, 1.0]: amount of evidence relative to total possible +#[derive(Debug, Clone, Copy, PartialEq)] +pub struct TruthValue { + pub frequency: f32, + pub confidence: f32, +} + +impl TruthValue { + /// Create a new truth value with validation. + pub fn new(frequency: f32, confidence: f32) -> Self { + Self { + frequency: frequency.clamp(0.0, 1.0), + confidence: confidence.clamp(0.0, 1.0), + } + } + + /// Full truth: frequency=1.0, confidence=1.0. + pub fn certain() -> Self { + Self { + frequency: 1.0, + confidence: 1.0, + } + } + + /// Unknown truth: frequency=0.5, confidence=0.0. + pub fn unknown() -> Self { + Self { + frequency: 0.5, + confidence: 0.0, + } + } + + /// Expectation: e = c * (f - 0.5) + 0.5 + /// + /// This is the "expected truth" — a single scalar combining frequency and confidence. + pub fn expectation(&self) -> f32 { + self.confidence * (self.frequency - 0.5) + 0.5 + } + + /// Strength: f * c (simple product, used for ranking). + pub fn strength(&self) -> f32 { + self.frequency * self.confidence + } + + /// Revision: combine two truth values with independent evidence. 
+ pub fn revision(&self, other: &TruthValue) -> TruthValue { + let k = 1.0; // evidence horizon + let w1 = self.confidence / (1.0 - self.confidence + f32::EPSILON); + let w2 = other.confidence / (1.0 - other.confidence + f32::EPSILON); + let w = w1 + w2; + + let f = if w > f32::EPSILON { + (w1 * self.frequency + w2 * other.frequency) / w + } else { + 0.5 + }; + let c = w / (w + k); + + TruthValue::new(f, c) + } +} + +impl Default for TruthValue { + fn default() -> Self { + Self::unknown() + } +} + +/// Gate thresholds for filtering SPO query results by truth expectation. +/// +/// Named thresholds control the minimum expectation required for an edge +/// to pass through a query filter. +#[derive(Debug, Clone, Copy, PartialEq)] +pub struct TruthGate { + /// Minimum expectation to pass the gate. + pub min_expectation: f32, +} + +impl TruthGate { + /// Open gate: everything passes (min_expectation = 0.0). + pub const OPEN: TruthGate = TruthGate { + min_expectation: 0.0, + }; + + /// Weak gate: expectation >= 0.4. + pub const WEAK: TruthGate = TruthGate { + min_expectation: 0.4, + }; + + /// Normal gate: expectation >= 0.6. + pub const NORMAL: TruthGate = TruthGate { + min_expectation: 0.6, + }; + + /// Strong gate: expectation >= 0.75. + pub const STRONG: TruthGate = TruthGate { + min_expectation: 0.75, + }; + + /// Certain gate: expectation >= 0.9. + pub const CERTAIN: TruthGate = TruthGate { + min_expectation: 0.9, + }; + + /// Check if a truth value passes this gate (inclusive: expectation >= min_expectation). 
+ pub fn passes(&self, tv: &TruthValue) -> bool { + tv.expectation() >= self.min_expectation + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_truth_value_clamp() { + let tv = TruthValue::new(1.5, -0.3); + assert_eq!(tv.frequency, 1.0); + assert_eq!(tv.confidence, 0.0); + } + + #[test] + fn test_expectation() { + let tv = TruthValue::new(0.9, 0.8); + let e = tv.expectation(); + // e = 0.8 * (0.9 - 0.5) + 0.5 = 0.8 * 0.4 + 0.5 = 0.82 + assert!((e - 0.82).abs() < 0.001); + } + + #[test] + fn test_gate_open() { + let tv = TruthValue::new(0.1, 0.1); + assert!(TruthGate::OPEN.passes(&tv)); + } + + #[test] + fn test_gate_strong() { + let high = TruthValue::new(0.9, 0.8); + let low = TruthValue::new(0.3, 0.2); + assert!(TruthGate::STRONG.passes(&high)); + assert!(!TruthGate::STRONG.passes(&low)); + } + + #[test] + fn test_gate_certain() { + let very_high = TruthValue::new(0.95, 0.95); + let high = TruthValue::new(0.9, 0.8); + assert!(TruthGate::CERTAIN.passes(&very_high)); + // 0.8*(0.9-0.5)+0.5 = 0.82 < 0.9 + assert!(!TruthGate::CERTAIN.passes(&high)); + } + + #[test] + fn test_revision() { + let a = TruthValue::new(0.8, 0.5); + let b = TruthValue::new(0.6, 0.5); + let revised = a.revision(&b); + // Combined should have higher confidence + assert!(revised.confidence > a.confidence); + // Frequency should be between a and b + assert!(revised.frequency >= 0.6 && revised.frequency <= 0.8); + } +} diff --git a/crates/lance-graph/src/lib.rs b/crates/lance-graph/src/lib.rs index 8f93fddb..83e881f7 100644 --- a/crates/lance-graph/src/lib.rs +++ b/crates/lance-graph/src/lib.rs @@ -40,6 +40,7 @@ pub mod case_insensitive; pub mod config; pub mod datafusion_planner; pub mod error; +pub mod graph; pub mod lance_native_planner; pub mod lance_vector_search; pub mod logical_plan; diff --git a/crates/lance-graph/tests/spo_ground_truth.rs b/crates/lance-graph/tests/spo_ground_truth.rs new file mode 100644 index 00000000..bee81365 --- /dev/null +++ 
b/crates/lance-graph/tests/spo_ground_truth.rs @@ -0,0 +1,357 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! Ground truth integration tests for the SPO triple store stack. +//! +//! These tests prove the stack works end-to-end, not just that individual +//! functions compile. They cover: +//! +//! 1. SPO hydration round-trip (insert + forward/reverse query) +//! 2. 2³ projection verbs consistency (SxP2O, SxO2P, PxO2S) +//! 3. TruthGate filtering (OPEN, STRONG, CERTAIN) +//! 4. Belichtung prefilter rejection rate +//! 5. Semiring chain traversal +//! 6. ClamPath + MerkleRoot integrity (documents verify_lineage gap) +//! 7. Cross-convergence: Cypher vs projection verb + +use lance_graph::graph::fingerprint::{dn_hash, label_fp, FINGERPRINT_WORDS}; +use lance_graph::graph::spo::{ + BindSpace, MerkleRoot, SpoBuilder, SpoStore, TruthGate, TruthValue, VerifyStatus, +}; + +// ========================================================================= +// Test 1: SPO Hydration Round-Trip +// ========================================================================= + +#[test] +fn test_spo_hydration_round_trip() { + // 1. Build an edge: Jan KNOWS Ada + let jan = label_fp("Jan"); + let knows = label_fp("KNOWS"); + let ada = label_fp("Ada"); + let record = SpoBuilder::build_edge(&jan, &knows, &ada, TruthValue::new(0.9, 0.8)); + + // 2. Insert into SpoStore + let mut store = SpoStore::new(); + store.insert(dn_hash("edge:jan-knows-ada"), &record); + + // 3. Forward query: Jan KNOWS ? → should find Ada + let hits = store.query_forward(&jan, &knows, 100); + assert!(!hits.is_empty(), "Forward query should find Ada"); + assert!( + hits[0].distance < 50, + "Best hit should be close, got distance={}", + hits[0].distance + ); + + // 4. Reverse query: ? 
KNOWS Ada → should find Jan + let hits = store.query_reverse(&knows, &ada, 100); + assert!(!hits.is_empty(), "Reverse query should find Jan"); +} + +// ========================================================================= +// Test 2: 2³ Projection Verbs +// ========================================================================= + +#[test] +fn test_projection_verbs_consistency() { + // Build: Jan CREATES Ada, Jan KNOWS Bob, Ada HELPS Bob + let jan_fp = label_fp("Jan"); + let ada_fp = label_fp("Ada"); + let bob_fp = label_fp("Bob"); + let creates_fp = label_fp("CREATES"); + let knows_fp = label_fp("KNOWS"); + let helps_fp = label_fp("HELPS"); + + let mut store = SpoStore::new(); + + let r1 = SpoBuilder::build_edge(&jan_fp, &creates_fp, &ada_fp, TruthValue::new(0.9, 0.9)); + store.insert(dn_hash("edge:jan-creates-ada"), &r1); + + let r2 = SpoBuilder::build_edge(&jan_fp, &knows_fp, &bob_fp, TruthValue::new(0.8, 0.7)); + store.insert(dn_hash("edge:jan-knows-bob"), &r2); + + let r3 = SpoBuilder::build_edge(&ada_fp, &helps_fp, &bob_fp, TruthValue::new(0.7, 0.6)); + store.insert(dn_hash("edge:ada-helps-bob"), &r3); + + // SxP2O: Jan CREATES ? → Ada + let sxp2o = store.query_forward(&jan_fp, &creates_fp, 100); + assert!(!sxp2o.is_empty(), "SxP2O: Jan CREATES ? should find Ada"); + + // SxO2P: Jan ? Ada → CREATES + // (bind S and O, find P: what verb connects Jan to Ada?) + let sxo2p = store.query_relation(&jan_fp, &ada_fp, 100); + assert!( + !sxo2p.is_empty(), + "SxO2P: Jan ? Ada should find CREATES" + ); + + // PxO2S: ? CREATES Ada → Jan + // (bind P and O, find S: who CREATES Ada?) + let pxo2s = store.query_reverse(&creates_fp, &ada_fp, 100); + assert!( + !pxo2s.is_empty(), + "PxO2S: ? CREATES 
Ada should find Jan" + ); + + // All three should agree on the Jan-CREATES-Ada triple. + // Only the forward hit is checked field by field here. + assert_eq!(sxp2o[0].record.subject, jan_fp); + assert_eq!(sxp2o[0].record.predicate, creates_fp); + assert_eq!(sxp2o[0].record.object, ada_fp); +} + +// ========================================================================= +// Test 3: TruthGate Filtering +// ========================================================================= + +#[test] +fn test_truth_gate_filters_low_confidence() { + let mut store = SpoStore::new(); + + let a = label_fp("entity_A"); + let verb = label_fp("RELATES"); + let b = label_fp("entity_B"); + let c = label_fp("entity_C"); + + // Insert high-confidence edge + let record_high = SpoBuilder::build_edge(&a, &verb, &b, TruthValue::new(0.9, 0.8)); + store.insert(1, &record_high); + + // Insert low-confidence edge + let record_low = SpoBuilder::build_edge(&a, &verb, &c, TruthValue::new(0.3, 0.2)); + store.insert(2, &record_low); + + // OPEN gate: both found + let open = store.query_forward_gated(&a, &verb, 200, TruthGate::OPEN); + assert_eq!( + open.len(), + 2, + "OPEN gate should find both edges, found {}", + open.len() + ); + + // STRONG gate: only high-confidence found + // TruthValue(0.9, 0.8).expectation() = 0.82 > 0.75 ✓ + // TruthValue(0.3, 0.2).expectation() = 0.46 < 0.75 ✗ + let strong = store.query_forward_gated(&a, &verb, 200, TruthGate::STRONG); + assert_eq!( + strong.len(), + 1, + "STRONG gate should find only high-confidence edge, found {}", + strong.len() + ); + + // CERTAIN gate: only very high confidence + // TruthValue(0.9, 0.8).expectation() = 0.82 < 0.9, so it is also filtered! 
+ let certain = store.query_forward_gated(&a, &verb, 200, TruthGate::CERTAIN); + assert_eq!( + certain.len(), + 0, + "CERTAIN gate (0.9 threshold) should filter expectation=0.82, found {}", + certain.len() + ); +} + +// ========================================================================= +// Test 4: Belichtung Prefilter Rejection Rate +// ========================================================================= + +#[test] +fn test_belichtung_rejection_rate() { + let mut store = SpoStore::new(); + + // Insert 100 distinct edges (labels are deterministic, not random) + for i in 0..100 { + let s = label_fp(&format!("entity_{}", i)); + let p = label_fp("RELATES"); + let o = label_fp(&format!("target_{}", i)); + let record = SpoBuilder::build_edge(&s, &p, &o, TruthValue::new(0.5, 0.5)); + store.insert(i as u64, &record); + } + + // Query with tight radius: belichtung should reject most + let query_s = label_fp("entity_42"); + let query_p = label_fp("RELATES"); + let hits = store.query_forward(&query_s, &query_p, 30); + + // Should find entity_42's edge, maybe 1-2 others. Not 50+. 
+ assert!( + hits.len() < 10, + "Belichtung should reject most non-matches, got {}", + hits.len() + ); + + // The exact match (entity_42 → target_42) should be present + // (its S and P fingerprints are exact matches) + let has_exact = hits + .iter() + .any(|h| h.record.subject == query_s && h.record.predicate == query_p); + assert!( + has_exact, + "Exact match for entity_42 should be in the results" + ); +} + +// ========================================================================= +// Test 5: Semiring Traversal +// ========================================================================= + +#[test] +fn test_semiring_walk_chain() { + let mut store = SpoStore::new(); + + // Build chain: A→B→C→D + let a = label_fp("node_A"); + let b = label_fp("node_B"); + let c = label_fp("node_C"); + let d = label_fp("node_D"); + let next = label_fp("NEXT"); + + let r1 = SpoBuilder::build_edge(&a, &next, &b, TruthValue::new(0.9, 0.9)); + let r2 = SpoBuilder::build_edge(&b, &next, &c, TruthValue::new(0.8, 0.8)); + let r3 = SpoBuilder::build_edge(&c, &next, &d, TruthValue::new(0.7, 0.7)); + + store.insert(dn_hash("a-next-b"), &r1); + store.insert(dn_hash("b-next-c"), &r2); + store.insert(dn_hash("c-next-d"), &r3); + + // Walk with HammingMin semiring (shortest semantic path) + let path = store.walk_chain_forward(&a, 100, 3); + + assert_eq!( + path.len(), + 3, + "Should find 3 hops in A→B→C→D chain, found {}", + path.len() + ); + + // Each hop should have non-decreasing cumulative distance + // (distances accumulate through chain) + for i in 1..path.len() { + assert!( + path[i].cumulative_distance >= path[i - 1].cumulative_distance, + "Cumulative distance should not decrease: hop {} ({}) > hop {} ({})", + i - 1, + path[i - 1].cumulative_distance, + i, + path[i].cumulative_distance + ); + } +} + +// ========================================================================= +// Test 6: ClamPath + MerkleRoot Integrity +// 
========================================================================= + +/// This test DOCUMENTS the verify_lineage no-op gap. +/// +/// After corrupting the fingerprint, `verify_lineage` still returns +/// `Consistent` because it only checks structural presence (non-zero root), +/// not content integrity. `verify_integrity` correctly detects the corruption. +/// +/// This is a **known gap**, not a bug to fix silently. +#[test] +fn test_clam_merkle_integrity() { + let mut space = BindSpace::new(); + let fp = [0xDEAD_u64; FINGERPRINT_WORDS]; + + let addr = space.write_dn_path("agent:test:node", fp, 3); + + // Read back ClamPath + MerkleRoot + let (path, root) = space.clam_merkle(addr).unwrap(); + assert!(!root.is_zero(), "MerkleRoot should be stamped"); + assert_eq!(path.depth, 3); + + // Verify the root matches the original fingerprint + let expected_root = MerkleRoot::from_fingerprint(&fp); + assert_eq!(*root, expected_root, "Root should match original fingerprint"); + + // Corrupt the fingerprint + space.read_mut(addr).unwrap().fingerprint[5] ^= 0xFFFF; // flip some bits + + // verify_lineage still says Consistent — THIS IS THE KNOWN GAP + let status = space.verify_lineage(addr); + assert_eq!( + status, + VerifyStatus::Consistent, + "verify_lineage has known gap: does not re-hash content. \ + It should detect corruption but currently doesn't." 
+ ); + + // verify_integrity correctly detects the corruption + let status = space.verify_integrity(addr); + assert_eq!( + status, + VerifyStatus::Corrupted, + "verify_integrity should detect bit-flip corruption" + ); + + // The merkle root stored in the node is still the ORIGINAL root + // (stamped at write time, not updated on corruption) + let (_, root_after) = space.clam_merkle(addr).unwrap(); + assert_eq!( + *root_after, expected_root, + "Root should still be the original (stamped at write time)" + ); +} + +// ========================================================================= +// Test 7: Cross-Convergence (Cypher vs Projection Verb) +// ========================================================================= + +/// Convergence proof: SPO projection path and Cypher path must return +/// the same results for equivalent queries. +/// +/// Currently only validates the SPO side. Full convergence requires +/// DataFusion wiring, which is future work. The SPO side must work first. +#[test] +fn test_cypher_vs_projection_convergence() { + let mut store = SpoStore::new(); + + // Insert: Jan CREATES Ada + let jan_fp = label_fp("Jan"); + let creates_fp = label_fp("CREATES"); + let ada_fp = label_fp("Ada"); + + let record = SpoBuilder::build_edge(&jan_fp, &creates_fp, &ada_fp, TruthValue::new(0.9, 0.8)); + store.insert(dn_hash("edge:jan-creates-ada"), &record); + + // SPO path: SxP2O(Jan, CREATES) → Ada + let spo_hits = store.query_forward(&jan_fp, &creates_fp, 100); + + // Cypher path equivalent: MATCH (a)-[:CREATES]->(b) WHERE a = Jan + // (This uses cypher_to_sql → DataFusion → same store) + // For now: just verify the SPO path returns something. + // Full convergence test needs DataFusion wired, which is future work. + // But the SPO side must work first. 
+ assert!( + !spo_hits.is_empty(), + "SPO path must find the edge for convergence" + ); + + // Verify the result is the correct triple + let hit = &spo_hits[0]; + assert_eq!(hit.record.subject, jan_fp, "Subject should be Jan"); + assert_eq!( + hit.record.predicate, creates_fp, + "Predicate should be CREATES" + ); + assert_eq!(hit.record.object, ada_fp, "Object should be Ada"); + + // Verify truth value was preserved through the round-trip + assert!( + (hit.record.truth.frequency - 0.9).abs() < 0.001, + "Frequency should be preserved" + ); + assert!( + (hit.record.truth.confidence - 0.8).abs() < 0.001, + "Confidence should be preserved" + ); + + // Future: When DataFusion SPO UDF is wired, add: + // let cypher_result = CypherQuery::new("MATCH (a:Entity)-[:CREATES]->(b) RETURN b") + // .execute_with_spo_store(&store) + // .await; + // assert_eq!(cypher_result, spo_hits); +}
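Reviewer note: the expectation values quoted in the test comments (0.82, 0.46) and the revision behaviour checked in `test_revision` can be reproduced outside the crate. The following is a minimal standalone sketch that assumes only the formulas shown in `truth.rs`. The free functions `expectation` and `revision` and the tuple representation are hypothetical stand-ins for illustration, not the crate's API.

```rust
// Standalone sketch: mirrors the formulas in truth.rs, not the crate's types.

/// Expectation e = c * (f - 0.5) + 0.5, as documented on TruthValue::expectation.
fn expectation(frequency: f32, confidence: f32) -> f32 {
    confidence * (frequency - 0.5) + 0.5
}

/// Revision with evidence horizon k = 1, following TruthValue::revision.
/// Takes and returns (frequency, confidence) pairs.
fn revision(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    let k = 1.0_f32;
    // Convert each confidence into an evidence weight w = c / (1 - c).
    let w1 = a.1 / (1.0 - a.1 + f32::EPSILON);
    let w2 = b.1 / (1.0 - b.1 + f32::EPSILON);
    let w = w1 + w2;
    // Frequency is the evidence-weighted mean; fall back to 0.5 with no evidence.
    let f = if w > f32::EPSILON {
        (w1 * a.0 + w2 * b.0) / w
    } else {
        0.5
    };
    (f.clamp(0.0, 1.0), (w / (w + k)).clamp(0.0, 1.0))
}

fn main() {
    // Thresholds copied from the TruthGate constants in the diff.
    let gates = [
        ("OPEN", 0.0_f32),
        ("WEAK", 0.4),
        ("NORMAL", 0.6),
        ("STRONG", 0.75),
        ("CERTAIN", 0.9),
    ];
    for (f, c) in [(0.9_f32, 0.8_f32), (0.3, 0.2), (0.95, 0.95)] {
        let e = expectation(f, c);
        // A gate passes when the expectation is at least its threshold.
        let passed: Vec<&str> = gates
            .iter()
            .filter(|(_, min)| e >= *min)
            .map(|(name, _)| *name)
            .collect();
        println!("f={f} c={c} e={e:.4} passes={passed:?}");
    }
    let (f, c) = revision((0.8, 0.5), (0.6, 0.5));
    println!("revision -> f={f:.3} c={c:.3}");
}
```

Running it prints which gates each of the truth values used in the tests would pass, which makes it easy to sanity-check a threshold change before touching the assertions.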