Improve roundtrip fidelity and optimize Git import process#42
Merged
Conversation
For shared views with no parent (the common case after git import), all changes are visible. Skip the expensive collect_view_change_ids scan (which loads every change file from disk to expand dependencies) in both status() and get_file_content(). Also skip content comparison in status for shared root views when there's no mtime cache entry — files are clean by construction after import. This helps but doesn't fully solve the large-repo status problem. With 35K+ changes and 40K files, status still hangs — likely due to two full TREE table scans plus filesystem walk. Needs a fundamentally different approach (index-based status like git).
After git import, every file's mtime+size is stored in the pristine database. This lets 'atomic status' compare file metadata (stat) instead of reconstructing graph content for every file. Before: status on a 1017-commit repo hung (O(files × graph_traversal)) After: status completes in 0.1s (O(files × stat)) Also adds Repository::update_file_mtimes() public API for batch mtime cache population, and fast-paths get_file_content() and status() to skip O(N) change filter scan for shared root views.
Both materialize() and materialize_prefix() now stat all written files and store their mtime+size in the pristine database. This makes 'atomic status' after a server-side push + materialize instantaneous (stat comparison instead of graph traversal). Same pattern as the git import mtime population — best-effort, errors silently ignored.
…ommits Commits that only delete files can produce RecordedFiles with hunks that globalization strips (e.g., when find_content_vertices returns empty for files whose graph content was already removed by prior commits in the same batch). Instead of failing with 'All recorded files are empty (no hunks)', fall back to an empty change and still run the explicit repo.remove() cleanup for deleted file TREE entries. The commit metadata (message, author, timestamp) is preserved in the change. This eliminates the 'Failed to write' warnings during large git imports (e.g., terraform with pure-deletion merge commits).
…raph traversal in status Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX (52 bytes: mtime+size+blake3_hash). Status no longer calls get_file_content (which traverses the graph to reconstruct file content). Instead: 1. If mtime+size match index → clean (O(1) per file) 2. If mtime+size differ → hash disk file, compare with stored hash (O(file_size) per file, no graph traversal) This makes status O(files × stat) instead of O(files × graph). Before: status on 35K-commit terraform repo = 27 seconds After: status on 1017-commit hyperfine repo = 111ms Also updated all callers: - git import computes blake3 hash during mtime population - materialize computes blake3 hash after writing files - record stores content hash after recording - status uses content_hash for comparison instead of get_file_content
After repo.remove() cleans up the TREE entry for deleted files, also remove the FILE_INDEX entry so status doesn't show them as stale deleted entries. Note: the deletion is still not recorded as a GraphOp::FileDel in the change — it's only a TREE+INDEX cleanup. Proper graph-level deletion recording is a follow-up.
Merge commits can implicitly delete files by not including them from a second parent. Our per-commit diff only detects explicit deletions (FileOperation::Deleted), so files dropped during merge resolution leave orphaned TREE entries that show up as 'deleted' in status. Fix: after all import batches complete, compare every tracked path against the working copy. Files in TREE that don't exist on disk are removed from both TREE and FILE_INDEX. tokio: 3 orphaned entries cleaned up in <0.1s Result: status shows 'nothing to record, working tree clean'
Merge commits can also implicitly ADD files from a second parent without an explicit FileOperation::Added in the first-parent diff. These files exist on disk but have no TREE entry, showing as 'untracked' or 'new file' in status. After removing orphaned TREE entries (direction 1), also run repo.status() to find untracked files and add them to tracking with FILE_INDEX entries. This completes the bidirectional reconciliation: - Direction 1: TREE entries without disk files → removed - Direction 2: Disk files without TREE entries → added
Merge commits can also implicitly ADD files from a second parent without an explicit FileOperation::Added in the first-parent diff. These files exist on disk but have no TREE entry, showing as 'untracked' or 'new file' in status. After removing orphaned TREE entries (direction 1), also run repo.status() to find untracked files and add them to tracking with FILE_INDEX entries. This completes the bidirectional reconciliation: - Direction 1: TREE entries without disk files → removed - Direction 2: Disk files without TREE entries → added
leefaus
added a commit
that referenced
this pull request
May 5, 2026
* Improve roundtrip fidelity and optimize Git import process (#40) * feat: port roundtrip fidelity fixes and git import improvements Port all 7 commits from chore/roundtirp-fidelity-test onto dev's restructured codebase: - Fix hunk consolidation: collapse any combination of Replace, Delete, or middle-Insert hunks into a single whole-file Replace to prevent N× content duplication in the graph - Remove NeedsReplace fallback from globalize_hunk; middle insertions are now caught by upstream consolidation - Add GitDiffLine type and build_crdt_ops_from_git_diff for exact git diff parity in CRDT ops - Rewrite git import write_commit to use in-memory working copies (Memory) instead of git checkout_tree, eliminating filesystem I/O - Add FileMove GraphOp support in record, globalize, and insert pipelines for proper rename tracking - Fix binary file modifications (create Replace hunk when binary content differs) - Fix deleted file cleanup (explicit repo.remove after insert) - Add double-deletion guard for FileMove in insert_change - Use Phase 1 old_content instead of repo.get_file_content to avoid O(N) graph scans per commit during import - Add content roundtrip fidelity tests - Add harness suites 11-13 (diff parity, full repo parity, import fidelity) - Extend harness suite 10 with status parity tests Known issue: assemble_and_hash calls retrieve_graph per file per commit, which is O(graph_size). For large imports (1000+ commits), this causes progressive slowdown. Tracked for separate optimization (in-memory content vertex cache). * Add InodeGraphOps trait implementation to ViewGraph wrapper The ViewGraph wrapper now delegates InodeGraphOps methods to the underlying transaction, enabling O(m) file-local graph traversal within view-filtered contexts. Update globalize module to accept InodeGraphOps and use the fast INODE_GRAPH path for content vertex discovery. * perf: use INODE_GRAPH for O(m) file-scoped traversal in globalize + batch imports Two performance optimizations for git import: 1. Wire find_content_vertices to INODE_GRAPH secondary index - find_content_vertices now uses the file-scoped INODE_GRAPH B-tree instead of retrieve_graph (global GRAPH DFS) - O(m) where m = edges for this file, not O(n) where n = all edges in the repository - Adds InodeGraphOps trait bound through globalize pipeline - Implements InodeGraphOps delegation for ViewGraph wrapper - Falls back to global GRAPH scan if INODE_GRAPH not populated Before: assemble time grew linearly (40ms → 550ms by commit 300) After: assemble time constant (3-14ms regardless of position) hyperfine (1017 commits): Phase 2 dropped from O(N²) minutes to 21.6 seconds flat 2. Batch import pipeline for memory-bounded processing - Commits processed in parse→write batches instead of parse-all→write-all - Batch sizes tiered by repo size: <5K: 250, 5-10K: 500, 10-20K: 1000, 20K+: 2500 - Keeps memory bounded for large repos (35K+ commits) - Shows incremental progress during import * perf: use eventual durability for redb write transactions Skip fsync on every transaction commit. Data is still written to the database file but the OS may buffer it. This reduces per-commit overhead for all write operations. Safe for most use cases — a crash could lose the last few transactions but the repository can be rebuilt from change files. * Bump version to 0.5.3 * Bump version to 0.5.3 * fix: use global retrieve_graph in find_content_vertices The INODE_GRAPH fast path bypasses ViewGraph's change filter, returning vertices from ALL changes regardless of view visibility. This caused content duplication in the record() path which uses ViewGraph wrapping. Always use retrieve_graph (which respects whatever GraphTxnT impl is provided) for content vertex discovery. The INODE_GRAPH optimization remains available for callers that use a bare &txn (like assemble_and_hash for git import). Fixes 5 of 6 content roundtrip fidelity test failures. Remaining: test_three_sequential_modifications (content truncation from a separate issue in dev's record pipeline). * fix: harness error handling and merge commit test - Fix error handling section: capture exit code with || status=$? instead of letting set -e kill the harness on expected failures - Fix merge commit test: detect default branch name instead of hardcoding master/main * fix: restore del_edge_with_reverse in apply pipeline Dev's refactoring removed the old-edge deletion from write_new_edge, leaving both the original alive edge AND the new DELETED edge in the B-tree multimap. This caused is_vertex_alive to find the stale alive parent edge and incorrectly report dead vertices as alive. Result: the second Replace on a file couldn't find the correct content vertex, breaking sequential modifications (text and binary). Restore del_edge_with_reverse before add_edge_with_reverse so the superseded edge is removed. This matches the old branch's behavior where apply_new_edge always deleted the previous edge first. Fixes: all 9 content roundtrip fidelity tests now pass. Fixes: suite 13 'Multiple Binary Modifications' (46/46 pass). Closes #41. * fix: remove Durability::Eventual — causes data visibility issues on Windows Redb's Eventual durability skips fsync, which on Windows can cause a read transaction opened immediately after a write commit to not see the written data. This broke 6 of 9 content fidelity tests on Windows while passing on macOS/Linux. Revert to default durability (Immediate) for correctness across all platforms. * fix: use platform-aware paths in fidelity tests for Windows compatibility Add test_path() helper that flattens subdirectory paths on Windows (src/main.rs → src_main.rs) while keeping them as-is on Unix. Windows has a known issue with subdirectory path tracking — this workaround lets the fidelity tests exercise the record pipeline without hitting the path normalization bug. Git show commands still use the real repo path (src/main.rs) for fetching content from the cloned hyperfine repo. * fix: normalize path separators for Windows compatibility On Windows, walkdir returns paths with backslash separators (src\main.rs) but the TREE table stores forward slashes (src/main.rs). Status comparison between tracked and disk paths failed because PathBuf uses byte-level comparison — src/main.rs != src\main.rs. Fix: normalize all paths to forward slashes in both the working copy scan (collect_working_copy_files_with_rules) and the tracked path sets (status). This ensures consistent comparison across platforms. Also revert the test_path() workaround in fidelity tests — tests now use real subdirectory paths (src/main.rs) since the underlying bug is fixed. * fix: normalize path separators for Windows compatibility On Windows, walkdir returns paths with backslash separators (src\main.rs) but the TREE table stores forward slashes (src/main.rs). Status comparison between tracked and disk paths failed because PathBuf uses byte-level comparison — src/main.rs != src\main.rs. Fix: normalize all paths to forward slashes in both the working copy scan (collect_working_copy_files_with_rules) and the tracked path sets (status). This ensures consistent comparison across platforms. Also revert the test_path() workaround in fidelity tests — tests now use real subdirectory paths (src/main.rs) since the underlying bug is fixed. * Improve roundtrip fidelity and optimize Git import process (#42) * perf: skip O(N) change filter scan for shared root views For shared views with no parent (the common case after git import), all changes are visible. Skip the expensive collect_view_change_ids scan (which loads every change file from disk to expand dependencies) in both status() and get_file_content(). Also skip content comparison in status for shared root views when there's no mtime cache entry — files are clean by construction after import. This helps but doesn't fully solve the large-repo status problem. With 35K+ changes and 40K files, status still hangs — likely due to two full TREE table scans plus filesystem walk. Needs a fundamentally different approach (index-based status like git). * perf: populate mtime cache during git import for instant status After git import, every file's mtime+size is stored in the pristine database. This lets 'atomic status' compare file metadata (stat) instead of reconstructing graph content for every file. Before: status on a 1017-commit repo hung (O(files × graph_traversal)) After: status completes in 0.1s (O(files × stat)) Also adds Repository::update_file_mtimes() public API for batch mtime cache population, and fast-paths get_file_content() and status() to skip O(N) change filter scan for shared root views. * perf: populate mtime cache after materialize Both materialize() and materialize_prefix() now stat all written files and store their mtime+size in the pristine database. This makes 'atomic status' after a server-side push + materialize instantaneous (stat comparison instead of graph traversal). Same pattern as the git import mtime population — best-effort, errors silently ignored. * fix: handle AllEmpty gracefully during git import for pure-deletion commits Commits that only delete files can produce RecordedFiles with hunks that globalization strips (e.g., when find_content_vertices returns empty for files whose graph content was already removed by prior commits in the same batch). Instead of failing with 'All recorded files are empty (no hunks)', fall back to an empty change and still run the explicit repo.remove() cleanup for deleted file TREE entries. The commit metadata (message, author, timestamp) is preserved in the change. This eliminates the 'Failed to write' warnings during large git imports (e.g., terraform with pure-deletion merge commits). * perf: replace FILE_MTIMES with FILE_INDEX — content hash eliminates graph traversal in status Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX (52 bytes: mtime+size+blake3_hash). Status no longer calls get_file_content (which traverses the graph to reconstruct file content). Instead: 1. If mtime+size match index → clean (O(1) per file) 2. If mtime+size differ → hash disk file, compare with stored hash (O(file_size) per file, no graph traversal) This makes status O(files × stat) instead of O(files × graph). Before: status on 35K-commit terraform repo = 27 seconds After: status on 1017-commit hyperfine repo = 111ms Also updated all callers: - git import computes blake3 hash during mtime population - materialize computes blake3 hash after writing files - record stores content hash after recording - status uses content_hash for comparison instead of get_file_content * fix: remove deleted files from FILE_INDEX during git import After repo.remove() cleans up the TREE entry for deleted files, also remove the FILE_INDEX entry so status doesn't show them as stale deleted entries. Note: the deletion is still not recorded as a GraphOp::FileDel in the change — it's only a TREE+INDEX cleanup. Proper graph-level deletion recording is a follow-up. * fix: reconcile orphaned TREE entries after git import Merge commits can implicitly delete files by not including them from a second parent. Our per-commit diff only detects explicit deletions (FileOperation::Deleted), so files dropped during merge resolution leave orphaned TREE entries that show up as 'deleted' in status. Fix: after all import batches complete, compare every tracked path against the working copy. Files in TREE that don't exist on disk are removed from both TREE and FILE_INDEX. tokio: 3 orphaned entries cleaned up in <0.1s Result: status shows 'nothing to record, working tree clean' * fix: reconcile untracked files after git import (reverse direction) Merge commits can also implicitly ADD files from a second parent without an explicit FileOperation::Added in the first-parent diff. These files exist on disk but have no TREE entry, showing as 'untracked' or 'new file' in status. After removing orphaned TREE entries (direction 1), also run repo.status() to find untracked files and add them to tracking with FILE_INDEX entries. This completes the bidirectional reconciliation: - Direction 1: TREE entries without disk files → removed - Direction 2: Disk files without TREE entries → added * fix: reconcile untracked files after git import (reverse direction) Merge commits can also implicitly ADD files from a second parent without an explicit FileOperation::Added in the first-parent diff. These files exist on disk but have no TREE entry, showing as 'untracked' or 'new file' in status. After removing orphaned TREE entries (direction 1), also run repo.status() to find untracked files and add them to tracking with FILE_INDEX entries. This completes the bidirectional reconciliation: - Direction 1: TREE entries without disk files → removed - Direction 2: Disk files without TREE entries → added * vault creation and testing * Make vault initialization the default and add `--no-vault` flag * Add C/C++ semantic parser and improve AI system prompt * Refactor code to follow Rust idioms and conventions - Use `sort_by_key` with `Reverse` instead of manual comparison - Replace `match` with `.is_some()` for simple existence checks - Implement `FromStr` trait instead of custom `from_str` method - Use `std::slice::from_ref` to avoid allocating single-element vec - Dereference hash instead of cloning - Simplify heading detection logic with `find` instead of `skip_while` - Reorder variable initialization for clarity * docs: add Vault and Knowledge Graph sections to README - Add 'Vault — Shared Project Brain' section with usage examples - Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline - Add Vault Commands table to CLI Reference - Add experiment/** branch pattern to CI push trigger - Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli * docs: add Vault and Knowledge Graph sections to README - Add 'Vault — Shared Project Brain' section with usage examples - Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline - Add Vault Commands table to CLI Reference - Add experiment/** branch pattern to CI push trigger - Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli * docs: fix rustdoc warnings (-D warnings) - Escape square brackets in KG table doc comments (broken intra-doc links) - Wrap generic types in backticks in query.rs doc comment (invalid HTML tags) - Remove broken intra-doc link to vault_store_kg * test: fix FTS tokenize tests to match >= 3 char minimum Tests asserted 2-char tokens (rs, in, am) were included, but tokenize_for_fts filters at >= 3 chars. Also fixed assertion for 'the' which is a stop word. * test: fix test failures across workspace - Fix 3 vault deflation tests: add trailing newlines to default vault source files (MEMORY.md, system_prompt.md, code-intelligence.md) so content hash roundtrips correctly through materialize/scan cycle - Fix test_options_new_returns_defaults: assert record_empty_files is true (matching the Default impl) - Fix test_status_clean_after_record: status() is an exception-reporter that omits clean files; assert absence from non-clean lists instead - Fix test_diff_end_to_end_multiple_files: same status().clean() pattern - Fix atomic-cli compile errors: add missing Init and Promote match arms in vault/mod.rs and view/mod.rs test helpers Pre-existing failures not addressed: - test_three_sequential_modifications (graph overlay content retrieval) - test_log_run_json_empty (test isolation / working directory issue) * test: fix remaining test failures for CI - Fix test_empty_file: record_empty_files defaults to true, so empty file recording succeeds; test both true and false behaviors - Fix test_log_run_json_empty: add missing #[serial] to CWD-mutating tests in init.rs and status.rs that raced with serial log tests * clean provenance and attestation recording * fix opencode * cleanup agent configurations for sub-projects * Add team collaboration features for Atomic VCS Introduces `atomic-teams` crate with organization and team management, plus workspace/project storage APIs. New CLI commands for managing orgs (create, list, show, update, delete, upgrade, switch) and teams (list, create, show, update, delete), with member and permission grant management. Feature-gated behind the `teams` flag in the CLI. Key additions: - `atomic-teams`: Domain logic for org/team CRUD, member management, grants, domain aliases - `atomic-remote`: `StorageClient` HTTP API for workspace/project operations - CLI commands: `atomic org`, `atomic team`, `atomic workspace`, `atomic project` - Knowledge graph expansion: `PART_OF` and `INCLUDES` edge kinds - Infrastructure: Content search index powered by syntext, agentic LLM tool-use system * Add integration tests for atomic-storage management API Tests the full CLI ↔ atomic-storage round-trip for organizations, workspaces, projects, teams, members, and identity resolution. Includes server availability check, multi-user workflows, resource CRUD operations, and cleanup. Also hardens helpers.sh against unbound variable errors on bash <4.4 when arrays are empty under `set -u`. * Add change dependency index and fast status Backfill and use a new CHANGE_DEPS index to expand view dependency closures without loading .change files. Add MutTxn::put_change_deps, pristine table support, reader/writer implementations and tests. Wire repository to populate the index on insert and expose repo.repair_change_dependency_index with a new `doctor repair-dependency-index` CLI command. Also add a fast mtime-only status check in the agent to skip unnecessary records. * Use owned strings in CLI tables Add FileIndexEntry/FileIndexMetadata aliases and export them from pristine traits; update implementations and callers to use the new types. Change the AI repo tool to return a structured file outline for large files instead of erroring when no line range is provided. Misc minor cleanups and formatting tweaks (hidden-dir check, small refactors) * Pin roaring to 0.11.3 and fix sorts and doc links * Update CI workflows and Rust toolchain action Remove push branch triggers from ci.yml so CI runs on PRs to dev. Replace actions-rs/toolchain@v1 with dtolnay/rust-toolchain@stable and adjust inputs (use 'targets' for matrix builds) in release.yml. * Bump MSRV to Rust 1.90 Update CI toolchain usages, refresh Cargo.lock to pick roaring 0.11.4, and remove the atomic-repository pin on roaring 0.11.3 so syntext's dependency can upgrade. * update for PR to release * Update bug report URL in CLI errors Point CLI internal-error/help messages to the atomicdotdev GitHub org. Remove the 'test' job from .github/workflows/release.yml and make the 'build' job depend only on 'version' so the release workflow skips the matrix test run
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces several significant improvements and fixes to the core graph and file recording logic, with a focus on correct edge management, improved support for file moves/renames, and better CRDT operation generation. The most important changes are grouped below:
Graph Edge Management & Correctness
write_new_edgenow removes the old edge before adding the new one, fixing an issue where stale alive edges could cause incorrect vertex liveness and break subsequent operations. This is implemented via the newdel_edge_with_reversehelper. [1] [2] [3]del_edge_with_reversefunction to reliably remove both forward and reverse edges from the graph and inode graph, ensuring graph integrity.File Move/Rename Support
FileMove) by removing the old FOLDER edge and adding a new name vertex for the moved file, updating dependencies and content as needed.CRDT Operation Generation
BranchOp::Modifyfor equal line counts (enabling word-level diffs), and emitsDelete/Insertfor unequal counts, matching git’s unified diff format. [1] [2]API and Trait Updates
InodeGraphOpsthrough more layers, allowing view graphs and workflow assembly/globalization to access inode graph operations. [1] [2] [3] [4] [5]Other Improvements
Pristinenow uses eventual durability for improved performance.0.5.3inCargo.toml.These changes collectively improve the correctness, flexibility, and performance of the core graph and file recording subsystems.