Enhance Git import performance and roundtrip fidelity by leefaus · Pull Request #44 · atomicdotdev/atomic

leefaus · 2026-05-05T00:34:34Z

This pull request makes significant improvements to the documentation and configuration for the Atomic project, focusing on clarifying core concepts, updating the CI/CD setup, and adding support for new workspace members and features. The most important changes include major documentation enhancements to explain new features like workspace shelving, the vault, and the knowledge graph; updates to the GitHub Actions workflows for Rust toolchain management; and the addition of new crates and dependencies.

Documentation improvements:

Expanded and clarified the README.md to introduce and explain new concepts such as workspace shelving, the vault (project knowledge store), and the knowledge graph, including detailed usage examples and command references. Updated terminology from "stack" to "view" throughout, and added tables comparing features and commands.
Updated AGENTS.md to document workspace shelving, including references to relevant code and a summary of how ignored files and exposed paths are handled per view.

CI/CD workflow updates:

Replaced the deprecated actions-rs/toolchain with dtolnay/rust-toolchain in all GitHub Actions workflows (ci.yml, release.yml), and updated toolchain versions to use Rust 1.90 where appropriate. Simplified the test step to run tests for all workspace members at once.

Workspace and dependency additions:

Added new workspace members (atomic-semantic, atomic-teams) and updated the workspace Rust version to 1.90 in Cargo.toml.
Added new dependencies for IDs (uuid) and content search indexing (syntext).

Minor codebase improvements:

Updated the agent event hook system to recognize additional task lifecycle verbs for Copilot and other agents in atomic-agent/src/event.rs.

These changes collectively improve developer experience, documentation clarity, and CI reliability, while laying the groundwork for new features related to workspace management and code intelligence.

* feat: port roundtrip fidelity fixes and git import improvements

Port all 7 commits from chore/roundtirp-fidelity-test onto dev's restructured codebase:

- Fix hunk consolidation: collapse any combination of Replace, Delete, or middle-Insert hunks into a single whole-file Replace to prevent N× content duplication in the graph
- Remove NeedsReplace fallback from globalize_hunk; middle insertions are now caught by upstream consolidation
- Add GitDiffLine type and build_crdt_ops_from_git_diff for exact git diff parity in CRDT ops
- Rewrite git import write_commit to use in-memory working copies (Memory) instead of git checkout_tree, eliminating filesystem I/O
- Add FileMove GraphOp support in record, globalize, and insert pipelines for proper rename tracking
- Fix binary file modifications (create Replace hunk when binary content differs)
- Fix deleted file cleanup (explicit repo.remove after insert)
- Add double-deletion guard for FileMove in insert_change
- Use Phase 1 old_content instead of repo.get_file_content to avoid O(N) graph scans per commit during import
- Add content roundtrip fidelity tests
- Add harness suites 11-13 (diff parity, full repo parity, import fidelity)
- Extend harness suite 10 with status parity tests

Known issue: assemble_and_hash calls retrieve_graph per file per commit, which is O(graph_size). For large imports (1000+ commits), this causes progressive slowdown. Tracked for separate optimization (in-memory content vertex cache).

* Add InodeGraphOps trait implementation to ViewGraph wrapper

The ViewGraph wrapper now delegates InodeGraphOps methods to the underlying transaction, enabling O(m) file-local graph traversal within view-filtered contexts. Update globalize module to accept InodeGraphOps and use the fast INODE_GRAPH path for content vertex discovery.

* perf: use INODE_GRAPH for O(m) file-scoped traversal in globalize + batch imports

Two performance optimizations for git import:

1. Wire find_content_vertices to INODE_GRAPH secondary index
   - find_content_vertices now uses the file-scoped INODE_GRAPH B-tree instead of retrieve_graph (global GRAPH DFS)
   - O(m) where m = edges for this file, not O(n) where n = all edges in the repository
   - Adds InodeGraphOps trait bound through globalize pipeline
   - Implements InodeGraphOps delegation for ViewGraph wrapper
   - Falls back to global GRAPH scan if INODE_GRAPH not populated

   Before: assemble time grew linearly (40ms → 550ms by commit 300)
   After: assemble time constant (3-14ms regardless of position)
   hyperfine (1017 commits): Phase 2 dropped from O(N²) minutes to 21.6 seconds flat

2. Batch import pipeline for memory-bounded processing
   - Commits processed in parse→write batches instead of parse-all→write-all
   - Batch sizes tiered by repo size: <5K: 250, 5-10K: 500, 10-20K: 1000, 20K+: 2500
   - Keeps memory bounded for large repos (35K+ commits)
   - Shows incremental progress during import

* perf: use eventual durability for redb write transactions

Skip fsync on every transaction commit. Data is still written to the database file but the OS may buffer it. This reduces per-commit overhead for all write operations. Safe for most use cases: a crash could lose the last few transactions but the repository can be rebuilt from change files.

* Bump version to 0.5.3

* fix: use global retrieve_graph in find_content_vertices

The INODE_GRAPH fast path bypasses ViewGraph's change filter, returning vertices from ALL changes regardless of view visibility. This caused content duplication in the record() path which uses ViewGraph wrapping.

Always use retrieve_graph (which respects whatever GraphTxnT impl is provided) for content vertex discovery. The INODE_GRAPH optimization remains available for callers that use a bare &txn (like assemble_and_hash for git import).

Fixes 5 of 6 content roundtrip fidelity test failures. Remaining: test_three_sequential_modifications (content truncation from a separate issue in dev's record pipeline).

* fix: harness error handling and merge commit test

- Fix error handling section: capture exit code with || status=$? instead of letting set -e kill the harness on expected failures
- Fix merge commit test: detect default branch name instead of hardcoding master/main

* fix: restore del_edge_with_reverse in apply pipeline

Dev's refactoring removed the old-edge deletion from write_new_edge, leaving both the original alive edge AND the new DELETED edge in the B-tree multimap. This caused is_vertex_alive to find the stale alive parent edge and incorrectly report dead vertices as alive. Result: the second Replace on a file couldn't find the correct content vertex, breaking sequential modifications (text and binary).

Restore del_edge_with_reverse before add_edge_with_reverse so the superseded edge is removed. This matches the old branch's behavior where apply_new_edge always deleted the previous edge first.

Fixes: all 9 content roundtrip fidelity tests now pass.
Fixes: suite 13 'Multiple Binary Modifications' (46/46 pass).
Closes #41.

* fix: remove Durability::Eventual (causes data visibility issues on Windows)

Redb's Eventual durability skips fsync, which on Windows can cause a read transaction opened immediately after a write commit to not see the written data. This broke 6 of 9 content fidelity tests on Windows while passing on macOS/Linux. Revert to default durability (Immediate) for correctness across all platforms.

* fix: use platform-aware paths in fidelity tests for Windows compatibility

Add test_path() helper that flattens subdirectory paths on Windows (src/main.rs → src_main.rs) while keeping them as-is on Unix. Windows has a known issue with subdirectory path tracking; this workaround lets the fidelity tests exercise the record pipeline without hitting the path normalization bug. Git show commands still use the real repo path (src/main.rs) for fetching content from the cloned hyperfine repo.

* fix: normalize path separators for Windows compatibility

On Windows, walkdir returns paths with backslash separators (src\main.rs) but the TREE table stores forward slashes (src/main.rs). Status comparison between tracked and disk paths failed because PathBuf uses byte-level comparison: src/main.rs != src\main.rs.

Fix: normalize all paths to forward slashes in both the working copy scan (collect_working_copy_files_with_rules) and the tracked path sets (status). This ensures consistent comparison across platforms.

Also revert the test_path() workaround in fidelity tests; tests now use real subdirectory paths (src/main.rs) since the underlying bug is fixed.
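The separator normalization described in the last commit above can be illustrated with a minimal sketch. The function name `normalize_separators` is hypothetical and is not the helper used in collect_working_copy_files_with_rules; the sketch only shows the idea of converting both sides of the comparison to forward slashes before comparing them as strings.

```rust
/// Hypothetical helper: convert backslash separators to forward slashes
/// so a path from a Windows directory walk matches the stored form.
pub fn normalize_separators(path: &str) -> String {
    path.replace('\\', "/")
}

/// Compare a tracked path with a path found on disk after normalizing
/// both, so the separator style no longer affects the result.
pub fn same_tracked_path(tracked: &str, on_disk: &str) -> bool {
    normalize_separators(tracked) == normalize_separators(on_disk)
}
```

Normalizing both inputs, not only the disk side, keeps the comparison symmetric if a backslash path ever reaches the tracked set.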

* perf: skip O(N) change filter scan for shared root views

For shared views with no parent (the common case after git import), all changes are visible. Skip the expensive collect_view_change_ids scan (which loads every change file from disk to expand dependencies) in both status() and get_file_content().

Also skip content comparison in status for shared root views when there's no mtime cache entry; files are clean by construction after import.

This helps but doesn't fully solve the large-repo status problem. With 35K+ changes and 40K files, status still hangs, likely due to two full TREE table scans plus filesystem walk. Needs a fundamentally different approach (index-based status like git).

* perf: populate mtime cache during git import for instant status

After git import, every file's mtime+size is stored in the pristine database. This lets 'atomic status' compare file metadata (stat) instead of reconstructing graph content for every file.

Before: status on a 1017-commit repo hung (O(files × graph_traversal))
After: status completes in 0.1s (O(files × stat))

Also adds Repository::update_file_mtimes() public API for batch mtime cache population, and fast-paths get_file_content() and status() to skip O(N) change filter scan for shared root views.

* perf: populate mtime cache after materialize

Both materialize() and materialize_prefix() now stat all written files and store their mtime+size in the pristine database. This makes 'atomic status' after a server-side push + materialize instantaneous (stat comparison instead of graph traversal). Same pattern as the git import mtime population: best-effort, errors silently ignored.

* fix: handle AllEmpty gracefully during git import for pure-deletion commits

Commits that only delete files can produce RecordedFiles with hunks that globalization strips (e.g., when find_content_vertices returns empty for files whose graph content was already removed by prior commits in the same batch).

Instead of failing with 'All recorded files are empty (no hunks)', fall back to an empty change and still run the explicit repo.remove() cleanup for deleted file TREE entries. The commit metadata (message, author, timestamp) is preserved in the change.

This eliminates the 'Failed to write' warnings during large git imports (e.g., terraform with pure-deletion merge commits).

* perf: replace FILE_MTIMES with FILE_INDEX (content hash eliminates graph traversal in status)

Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX (52 bytes: mtime+size+blake3_hash).

Status no longer calls get_file_content (which traverses the graph to reconstruct file content). Instead:

1. If mtime+size match index → clean (O(1) per file)
2. If mtime+size differ → hash disk file, compare with stored hash (O(file_size) per file, no graph traversal)

This makes status O(files × stat) instead of O(files × graph).

Before: status on 35K-commit terraform repo = 27 seconds
After: status on 1017-commit hyperfine repo = 111ms

Also updated all callers:

- git import computes blake3 hash during mtime population
- materialize computes blake3 hash after writing files
- record stores content hash after recording
- status uses content_hash for comparison instead of get_file_content

* fix: remove deleted files from FILE_INDEX during git import

After repo.remove() cleans up the TREE entry for deleted files, also remove the FILE_INDEX entry so status doesn't show them as stale deleted entries.

Note: the deletion is still not recorded as a GraphOp::FileDel in the change; it's only a TREE+INDEX cleanup. Proper graph-level deletion recording is a follow-up.

* fix: reconcile orphaned TREE entries after git import

Merge commits can implicitly delete files by not including them from a second parent. Our per-commit diff only detects explicit deletions (FileOperation::Deleted), so files dropped during merge resolution leave orphaned TREE entries that show up as 'deleted' in status.

Fix: after all import batches complete, compare every tracked path against the working copy. Files in TREE that don't exist on disk are removed from both TREE and FILE_INDEX.

tokio: 3 orphaned entries cleaned up in <0.1s
Result: status shows 'nothing to record, working tree clean'

* fix: reconcile untracked files after git import (reverse direction)

Merge commits can also implicitly ADD files from a second parent without an explicit FileOperation::Added in the first-parent diff. These files exist on disk but have no TREE entry, showing as 'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run repo.status() to find untracked files and add them to tracking with FILE_INDEX entries.

This completes the bidirectional reconciliation:

- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added
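The bidirectional reconciliation above amounts to two set differences between tracked paths and paths on disk. The following is a minimal sketch under that reading; `Reconciliation` and `reconcile` are hypothetical names, and plain path sets stand in for the TREE table and the working copy scan.

```rust
use std::collections::BTreeSet;

/// Hypothetical result type for the post-import reconciliation pass.
#[derive(Debug, PartialEq, Eq)]
pub struct Reconciliation {
    /// Direction 1: tracked paths with no file on disk (to be removed).
    pub orphaned: Vec<String>,
    /// Direction 2: files on disk with no tracked entry (to be added).
    pub untracked: Vec<String>,
}

/// Compute both directions as set differences. BTreeSet keeps the
/// resulting lists in sorted order.
pub fn reconcile(tracked: &BTreeSet<String>, on_disk: &BTreeSet<String>) -> Reconciliation {
    Reconciliation {
        orphaned: tracked.difference(on_disk).cloned().collect(),
        untracked: on_disk.difference(tracked).cloned().collect(),
    }
}
```

When both lists come back empty, the two sets agree, which corresponds to the 'nothing to record, working tree clean' result quoted above.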

- Use `sort_by_key` with `Reverse` instead of manual comparison
- Replace `match` with `.is_some()` for simple existence checks
- Implement `FromStr` trait instead of custom `from_str` method
- Use `std::slice::from_ref` to avoid allocating single-element vec
- Dereference hash instead of cloning
- Simplify heading detection logic with `find` instead of `skip_while`
- Reorder variable initialization for clarity
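Three of the idioms in the list above can be shown in a short, self-contained sketch. The `Level` type and all function names here are hypothetical and do not come from the Atomic codebase; only `sort_by_key`, `Reverse`, `FromStr`, and `std::slice::from_ref` are standard library items.

```rust
use std::cmp::Reverse;
use std::str::FromStr;

/// Hypothetical enum used only to illustrate a `FromStr` impl.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Level {
    Low,
    High,
}

/// Implementing `FromStr` (instead of an inherent `from_str` method)
/// makes `"high".parse::<Level>()` work.
impl FromStr for Level {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "low" => Ok(Level::Low),
            "high" => Ok(Level::High),
            other => Err(format!("unknown level: {other}")),
        }
    }
}

/// Descending sort via `sort_by_key` and `Reverse` rather than a
/// hand-written comparator.
pub fn sort_descending(scores: &mut [u32]) {
    scores.sort_by_key(|&s| Reverse(s));
}

/// A function that takes a slice of values.
pub fn sum(values: &[u64]) -> u64 {
    values.iter().sum()
}

/// Call a slice-taking function with one value without allocating a
/// single-element `Vec`.
pub fn sum_one(value: &u64) -> u64 {
    sum(std::slice::from_ref(value))
}
```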

- Add 'Vault — Shared Project Brain' section with usage examples
- Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline
- Add Vault Commands table to CLI Reference
- Add experiment/** branch pattern to CI push trigger
- Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli

- Escape square brackets in KG table doc comments (broken intra-doc links)
- Wrap generic types in backticks in query.rs doc comment (invalid HTML tags)
- Remove broken intra-doc link to vault_store_kg

Tests asserted 2-char tokens (rs, in, am) were included, but tokenize_for_fts filters at >= 3 chars. Also fixed assertion for 'the' which is a stop word.
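To make the filtering rule concrete, here is a minimal sketch of a tokenizer with a 3-character minimum and a stop-word list. It is not the real tokenize_for_fts: the name `tokenize_sketch`, the splitting rule, the lowercasing, and the contents of `STOP_WORDS` are all assumptions for illustration, apart from 'the' being a stop word as stated above.

```rust
/// Assumed stop-word list; only "the" is confirmed by the commit message.
const STOP_WORDS: &[&str] = &["the", "and", "for"];

/// Hypothetical tokenizer: split on non-alphanumeric characters,
/// lowercase, drop tokens shorter than 3 characters, drop stop words.
pub fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .map(str::to_lowercase)
        .filter(|t| t.chars().count() >= 3)
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}
```

Under these assumptions, the input `"I am in the src/main.rs file"` yields `src`, `main`, and `file`: the tokens `am`, `in`, and `rs` fall below the length minimum, and `the` is removed as a stop word.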

- Fix 3 vault deflation tests: add trailing newlines to default vault source files (MEMORY.md, system_prompt.md, code-intelligence.md) so content hash roundtrips correctly through materialize/scan cycle
- Fix test_options_new_returns_defaults: assert record_empty_files is true (matching the Default impl)
- Fix test_status_clean_after_record: status() is an exception-reporter that omits clean files; assert absence from non-clean lists instead
- Fix test_diff_end_to_end_multiple_files: same status().clean() pattern
- Fix atomic-cli compile errors: add missing Init and Promote match arms in vault/mod.rs and view/mod.rs test helpers

Pre-existing failures not addressed:

- test_three_sequential_modifications (graph overlay content retrieval)
- test_log_run_json_empty (test isolation / working directory issue)

- Fix test_empty_file: record_empty_files defaults to true, so empty file recording succeeds; test both true and false behaviors
- Fix test_log_run_json_empty: add missing #[serial] to CWD-mutating tests in init.rs and status.rs that raced with serial log tests

Introduces `atomic-teams` crate with organization and team management, plus workspace/project storage APIs. New CLI commands for managing orgs (create, list, show, update, delete, upgrade, switch) and teams (list, create, show, update, delete), with member and permission grant management. Feature-gated behind the `teams` flag in the CLI.

Key additions:

- `atomic-teams`: Domain logic for org/team CRUD, member management, grants, domain aliases
- `atomic-remote`: `StorageClient` HTTP API for workspace/project operations
- CLI commands: `atomic org`, `atomic team`, `atomic workspace`, `atomic project`
- Knowledge graph expansion: `PART_OF` and `INCLUDES` edge kinds
- Infrastructure: Content search index powered by syntext, agentic LLM tool-use system

Tests the full CLI ↔ atomic-storage round-trip for organizations, workspaces, projects, teams, members, and identity resolution. Includes server availability check, multi-user workflows, resource CRUD operations, and cleanup. Also hardens helpers.sh against unbound variable errors on bash <4.4 when arrays are empty under `set -u`.

Backfill and use a new CHANGE_DEPS index to expand view dependency closures without loading .change files. Add MutTxn::put_change_deps, pristine table support, reader/writer implementations and tests. Wire repository to populate the index on insert and expose repo.repair_change_dependency_index with a new `doctor repair-dependency-index` CLI command. Also add a fast mtime-only status check in the agent to skip unnecessary records.
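Expanding a dependency closure from an index is a plain graph traversal. The sketch below assumes the index can be viewed as a map from a change to its direct dependencies; `ChangeId`, `dependency_closure`, and the in-memory `HashMap` are hypothetical stand-ins and do not reflect the actual CHANGE_DEPS table layout or the MutTxn API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical change identifier; the real type is not shown in this PR.
pub type ChangeId = u64;

/// Expand the transitive dependency closure of `roots` using only the
/// dependency index, so no change file needs to be loaded.
pub fn dependency_closure(
    index: &HashMap<ChangeId, Vec<ChangeId>>,
    roots: &[ChangeId],
) -> BTreeSet<ChangeId> {
    let mut seen = BTreeSet::new();
    let mut stack: Vec<ChangeId> = roots.to_vec();
    while let Some(id) = stack.pop() {
        // `insert` returns false when the id was already visited.
        if !seen.insert(id) {
            continue;
        }
        if let Some(deps) = index.get(&id) {
            stack.extend(deps.iter().copied());
        }
    }
    seen
}
```

The closure includes the roots themselves, and a change with no index entry is treated as having no dependencies. A real implementation might prefer to flag a missing entry so that it can be repaired.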

Add FileIndexEntry/FileIndexMetadata aliases and export them from pristine traits; update implementations and callers to use the new types. Change the AI repo tool to return a structured file outline for large files instead of erroring when no line range is provided. Misc minor cleanups and formatting tweaks (hidden-dir check, small refactors)
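For context on what a file index entry is used for, the two-step status check described in the FILE_INDEX commit earlier in this PR (metadata match first, content hash second) can be sketched as follows. The struct layout, field names, and `classify` function are hypothetical; the hash is supplied by a caller-provided closure as a stand-in for blake3, so the sketch stays free of external crates.

```rust
/// Hypothetical index entry: modification time, size, and a 32-byte
/// content hash. The real FileIndexEntry layout is not shown here.
pub struct IndexEntrySketch {
    pub mtime_nanos: u128,
    pub size: u64,
    pub content_hash: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum FileState {
    Clean,
    Modified,
}

/// Step 1: matching mtime and size means clean, with no file read.
/// Step 2: otherwise hash the disk file and compare with the stored
/// hash. `hash_disk_file` only runs when step 1 fails.
pub fn classify(
    entry: &IndexEntrySketch,
    mtime_nanos: u128,
    size: u64,
    hash_disk_file: impl FnOnce() -> [u8; 32],
) -> FileState {
    if entry.mtime_nanos == mtime_nanos && entry.size == size {
        return FileState::Clean;
    }
    if hash_disk_file() == entry.content_hash {
        FileState::Clean
    } else {
        FileState::Modified
    }
}
```

Passing the hash as a closure keeps the expensive read lazy, which mirrors the point of the design: most files are resolved by a stat call alone.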

Remove push branch triggers from ci.yml so CI runs on PRs to dev. Replace actions-rs/toolchain@v1 with dtolnay/rust-toolchain@stable and adjust inputs (use 'targets' for matrix builds) in release.yml.

Update CI toolchain usages, refresh Cargo.lock to pick roaring 0.11.4, and remove the atomic-repository pin on roaring 0.11.3 so syntext's dependency can upgrade.

Point CLI internal-error/help messages to the atomicdotdev GitHub org. Remove the 'test' job from .github/workflows/release.yml and make the 'build' job depend only on 'version' so the release workflow skips the matrix test run

leefaus added 23 commits April 9, 2026 12:16

vault creation and testing

54292b7

Make vault initialization the default and add --no-vault flag

dca1775

Add C/C++ semantic parser and improve AI system prompt

f1d66ef

docs: fix rustdoc warnings (-D warnings)

0bdaebe

- Escape square brackets in KG table doc comments (broken intra-doc links)
- Wrap generic types in backticks in query.rs doc comment (invalid HTML tags)
- Remove broken intra-doc link to vault_store_kg

test: fix FTS tokenize tests to match >= 3 char minimum

ba3af2d

Tests asserted 2-char tokens (rs, in, am) were included, but tokenize_for_fts filters at >= 3 chars. Also fixed assertion for 'the' which is a stop word.

clean provenance and attestation recording

d703df0

fix opencode

d2b4843

cleanup agent configurations for sub-projects

580ec60

Pin roaring to 0.11.3 and fix sorts and doc links

b0722bf

Update CI workflows and Rust toolchain action

38b03b0

Remove push branch triggers from ci.yml so CI runs on PRs to dev. Replace actions-rs/toolchain@v1 with dtolnay/rust-toolchain@stable and adjust inputs (use 'targets' for matrix builds) in release.yml.

Bump MSRV to Rust 1.90

e4820c8

Update CI toolchain usages, refresh Cargo.lock to pick roaring 0.11.4, and remove the atomic-repository pin on roaring 0.11.3 so syntext's dependency can upgrade.

update for PR to release

24f8ae4

leefaus force-pushed the dev branch from a9363db to 24f8ae4 May 5, 2026 00:38

Update bug report URL in CLI errors

00ad4e7

Point CLI internal-error/help messages to the atomicdotdev GitHub org. Remove the 'test' job from .github/workflows/release.yml and make the 'build' job depend only on 'version' so the release workflow skips the matrix test run

leefaus merged commit 961df86 into release May 5, 2026
7 checks passed

leefaus deleted the dev branch May 5, 2026 14:12

leefaus merged 24 commits into release from dev
