Skip to content

Improve roundtrip fidelity and optimize Git import process#42

Merged
leefaus merged 9 commits into
devfrom
chore/performance-checks
Apr 10, 2026
Merged

Improve roundtrip fidelity and optimize Git import process#42
leefaus merged 9 commits into
devfrom
chore/performance-checks

Conversation

@leefaus
Copy link
Copy Markdown
Contributor

@leefaus leefaus commented Apr 9, 2026

This pull request introduces several significant improvements and fixes to the core graph and file recording logic, with a focus on correct edge management, improved support for file moves/renames, and better CRDT operation generation. The most important changes are grouped below:

Graph Edge Management & Correctness

  • The edge application logic in write_new_edge now removes the old edge before adding the new one, fixing an issue where stale alive edges could cause incorrect vertex liveness and break subsequent operations. This is implemented via the new del_edge_with_reverse helper. [1] [2] [3]
  • Added the del_edge_with_reverse function to reliably remove both forward and reverse edges from the graph and inode graph, ensuring graph integrity.

File Move/Rename Support

  • The file globalization pipeline now handles file moves/renames (FileMove) by removing the old FOLDER edge and adding a new name vertex for the moved file, updating dependencies and content as needed.

CRDT Operation Generation

  • Improved logic for generating CRDT operations on modified files: now emits BranchOp::Modify for equal line counts (enabling word-level diffs), and emits Delete/Insert for unequal counts, matching git’s unified diff format. [1] [2]

API and Trait Updates

  • Extended trait bounds and implementations to propagate InodeGraphOps through more layers, allowing view graphs and workflow assembly/globalization to access inode graph operations. [1] [2] [3] [4] [5]

Other Improvements

  • The write transaction in Pristine now uses eventual durability for improved performance.
  • Minor: Bumped workspace version to 0.5.3 in Cargo.toml.
  • Minor: Exported additional CRDT-related functions and types for workflow recording.

These changes collectively improve the correctness, flexibility, and performance of the core graph and file recording subsystems.

leefaus added 2 commits April 9, 2026 12:39
For shared views with no parent (the common case after git import),
all changes are visible. Skip the expensive collect_view_change_ids
scan (which loads every change file from disk to expand dependencies)
in both status() and get_file_content().

Also skip content comparison in status for shared root views when
there's no mtime cache entry — files are clean by construction after
import.

This helps but doesn't fully solve the large-repo status problem.
With 35K+ changes and 40K files, status still hangs — likely due
to two full TREE table scans plus filesystem walk. Needs a
fundamentally different approach (index-based status like git).
After git import, every file's mtime+size is stored in the pristine
database. This lets 'atomic status' compare file metadata (stat)
instead of reconstructing graph content for every file.

Before: status on a 1017-commit repo hung (O(files × graph_traversal))
After: status completes in 0.1s (O(files × stat))

Also adds Repository::update_file_mtimes() public API for batch
mtime cache population, and fast-paths get_file_content() and
status() to skip O(N) change filter scan for shared root views.
@leefaus leefaus changed the base branch from release to dev April 9, 2026 17:04
leefaus added 7 commits April 9, 2026 13:07
Both materialize() and materialize_prefix() now stat all written files
and store their mtime+size in the pristine database. This makes
'atomic status' after a server-side push + materialize instantaneous
(stat comparison instead of graph traversal).

Same pattern as the git import mtime population — best-effort,
errors silently ignored.
…ommits

Commits that only delete files can produce RecordedFiles with hunks
that globalization strips (e.g., when find_content_vertices returns
empty for files whose graph content was already removed by prior
commits in the same batch).

Instead of failing with 'All recorded files are empty (no hunks)',
fall back to an empty change and still run the explicit repo.remove()
cleanup for deleted file TREE entries. The commit metadata (message,
author, timestamp) is preserved in the change.

This eliminates the 'Failed to write' warnings during large git
imports (e.g., terraform with pure-deletion merge commits).
…raph traversal in status

Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX
(52 bytes: mtime+size+blake3_hash).

Status no longer calls get_file_content (which traverses the graph
to reconstruct file content). Instead:
1. If mtime+size match index → clean (O(1) per file)
2. If mtime+size differ → hash disk file, compare with stored hash
   (O(file_size) per file, no graph traversal)

This makes status O(files × stat) instead of O(files × graph).

Before: status on 35K-commit terraform repo = 27 seconds
After: status on 1017-commit hyperfine repo = 111ms

Also updated all callers:
- git import computes blake3 hash during mtime population
- materialize computes blake3 hash after writing files
- record stores content hash after recording
- status uses content_hash for comparison instead of get_file_content
After repo.remove() cleans up the TREE entry for deleted files,
also remove the FILE_INDEX entry so status doesn't show them as
stale deleted entries.

Note: the deletion is still not recorded as a GraphOp::FileDel
in the change — it's only a TREE+INDEX cleanup. Proper graph-level
deletion recording is a follow-up.
Merge commits can implicitly delete files by not including them from
a second parent. Our per-commit diff only detects explicit deletions
(FileOperation::Deleted), so files dropped during merge resolution
leave orphaned TREE entries that show up as 'deleted' in status.

Fix: after all import batches complete, compare every tracked path
against the working copy. Files in TREE that don't exist on disk are
removed from both TREE and FILE_INDEX.

tokio: 3 orphaned entries cleaned up in <0.1s
Result: status shows 'nothing to record, working tree clean'
Merge commits can also implicitly ADD files from a second parent
without an explicit FileOperation::Added in the first-parent diff.
These files exist on disk but have no TREE entry, showing as
'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run
repo.status() to find untracked files and add them to tracking
with FILE_INDEX entries.

This completes the bidirectional reconciliation:
- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added
Merge commits can also implicitly ADD files from a second parent
without an explicit FileOperation::Added in the first-parent diff.
These files exist on disk but have no TREE entry, showing as
'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run
repo.status() to find untracked files and add them to tracking
with FILE_INDEX entries.

This completes the bidirectional reconciliation:
- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added
@leefaus leefaus merged commit dc0f694 into dev Apr 10, 2026
8 checks passed
@graywolf336 graywolf336 deleted the chore/performance-checks branch April 10, 2026 21:50
leefaus added a commit that referenced this pull request May 5, 2026
* Improve roundtrip fidelity and optimize Git import process (#40)

* feat: port roundtrip fidelity fixes and git import improvements

Port all 7 commits from chore/roundtirp-fidelity-test onto dev's
restructured codebase:

- Fix hunk consolidation: collapse any combination of Replace, Delete,
  or middle-Insert hunks into a single whole-file Replace to prevent
  N× content duplication in the graph
- Remove NeedsReplace fallback from globalize_hunk; middle insertions
  are now caught by upstream consolidation
- Add GitDiffLine type and build_crdt_ops_from_git_diff for exact
  git diff parity in CRDT ops
- Rewrite git import write_commit to use in-memory working copies
  (Memory) instead of git checkout_tree, eliminating filesystem I/O
- Add FileMove GraphOp support in record, globalize, and insert
  pipelines for proper rename tracking
- Fix binary file modifications (create Replace hunk when binary
  content differs)
- Fix deleted file cleanup (explicit repo.remove after insert)
- Add double-deletion guard for FileMove in insert_change
- Use Phase 1 old_content instead of repo.get_file_content to avoid
  O(N) graph scans per commit during import
- Add content roundtrip fidelity tests
- Add harness suites 11-13 (diff parity, full repo parity, import
  fidelity)
- Extend harness suite 10 with status parity tests

Known issue: assemble_and_hash calls retrieve_graph per file per
commit, which is O(graph_size). For large imports (1000+ commits),
this causes progressive slowdown. Tracked for separate optimization
(in-memory content vertex cache).

* Add InodeGraphOps trait implementation to ViewGraph wrapper

The ViewGraph wrapper now delegates InodeGraphOps methods to the
underlying transaction, enabling O(m) file-local graph traversal within
view-filtered contexts. Update globalize module to accept InodeGraphOps
and use the fast INODE_GRAPH path for content vertex discovery.

* perf: use INODE_GRAPH for O(m) file-scoped traversal in globalize + batch imports

Two performance optimizations for git import:

1. Wire find_content_vertices to INODE_GRAPH secondary index
   - find_content_vertices now uses the file-scoped INODE_GRAPH
     B-tree instead of retrieve_graph (global GRAPH DFS)
   - O(m) where m = edges for this file, not O(n) where n = all
     edges in the repository
   - Adds InodeGraphOps trait bound through globalize pipeline
   - Implements InodeGraphOps delegation for ViewGraph wrapper
   - Falls back to global GRAPH scan if INODE_GRAPH not populated

   Before: assemble time grew linearly (40ms → 550ms by commit 300)
   After:  assemble time constant (3-14ms regardless of position)

   hyperfine (1017 commits): Phase 2 dropped from O(N²) minutes
   to 21.6 seconds flat

2. Batch import pipeline for memory-bounded processing
   - Commits processed in parse→write batches instead of
     parse-all→write-all
   - Batch sizes tiered by repo size:
     <5K: 250, 5-10K: 500, 10-20K: 1000, 20K+: 2500
   - Keeps memory bounded for large repos (35K+ commits)
   - Shows incremental progress during import

* perf: use eventual durability for redb write transactions

Skip fsync on every transaction commit. Data is still written to the
database file but the OS may buffer it. This reduces per-commit
overhead for all write operations.

Safe for most use cases — a crash could lose the last few transactions
but the repository can be rebuilt from change files.

* Bump version to 0.5.3

* Bump version to 0.5.3

* fix: use global retrieve_graph in find_content_vertices

The INODE_GRAPH fast path bypasses ViewGraph's change filter,
returning vertices from ALL changes regardless of view visibility.
This caused content duplication in the record() path which uses
ViewGraph wrapping.

Always use retrieve_graph (which respects whatever GraphTxnT impl
is provided) for content vertex discovery. The INODE_GRAPH
optimization remains available for callers that use a bare &txn
(like assemble_and_hash for git import).

Fixes 5 of 6 content roundtrip fidelity test failures.
Remaining: test_three_sequential_modifications (content truncation
from a separate issue in dev's record pipeline).

* fix: harness error handling and merge commit test

- Fix error handling section: capture exit code with || status=$?
  instead of letting set -e kill the harness on expected failures
- Fix merge commit test: detect default branch name instead of
  hardcoding master/main

* fix: restore del_edge_with_reverse in apply pipeline

Dev's refactoring removed the old-edge deletion from write_new_edge,
leaving both the original alive edge AND the new DELETED edge in the
B-tree multimap. This caused is_vertex_alive to find the stale alive
parent edge and incorrectly report dead vertices as alive.

Result: the second Replace on a file couldn't find the correct content
vertex, breaking sequential modifications (text and binary).

Restore del_edge_with_reverse before add_edge_with_reverse so the
superseded edge is removed. This matches the old branch's behavior
where apply_new_edge always deleted the previous edge first.

Fixes: all 9 content roundtrip fidelity tests now pass.
Fixes: suite 13 'Multiple Binary Modifications' (46/46 pass).
Closes #41.

* fix: remove Durability::Eventual — causes data visibility issues on Windows

Redb's Eventual durability skips fsync, which on Windows can cause
a read transaction opened immediately after a write commit to not
see the written data. This broke 6 of 9 content fidelity tests on
Windows while passing on macOS/Linux.

Revert to default durability (Immediate) for correctness across
all platforms.

* fix: use platform-aware paths in fidelity tests for Windows compatibility

Add test_path() helper that flattens subdirectory paths on Windows
(src/main.rs → src_main.rs) while keeping them as-is on Unix.
Windows has a known issue with subdirectory path tracking — this
workaround lets the fidelity tests exercise the record pipeline
without hitting the path normalization bug.

Git show commands still use the real repo path (src/main.rs) for
fetching content from the cloned hyperfine repo.

* fix: normalize path separators for Windows compatibility

On Windows, walkdir returns paths with backslash separators (src\main.rs)
but the TREE table stores forward slashes (src/main.rs). Status comparison
between tracked and disk paths failed because PathBuf uses byte-level
comparison — src/main.rs != src\main.rs.

Fix: normalize all paths to forward slashes in both the working copy scan
(collect_working_copy_files_with_rules) and the tracked path sets
(status). This ensures consistent comparison across platforms.

Also revert the test_path() workaround in fidelity tests — tests now use
real subdirectory paths (src/main.rs) since the underlying bug is fixed.

* fix: normalize path separators for Windows compatibility

On Windows, walkdir returns paths with backslash separators (src\main.rs)
but the TREE table stores forward slashes (src/main.rs). Status comparison
between tracked and disk paths failed because PathBuf uses byte-level
comparison — src/main.rs != src\main.rs.

Fix: normalize all paths to forward slashes in both the working copy scan
(collect_working_copy_files_with_rules) and the tracked path sets
(status). This ensures consistent comparison across platforms.

Also revert the test_path() workaround in fidelity tests — tests now use
real subdirectory paths (src/main.rs) since the underlying bug is fixed.

* Improve roundtrip fidelity and optimize Git import process (#42)

* perf: skip O(N) change filter scan for shared root views

For shared views with no parent (the common case after git import),
all changes are visible. Skip the expensive collect_view_change_ids
scan (which loads every change file from disk to expand dependencies)
in both status() and get_file_content().

Also skip content comparison in status for shared root views when
there's no mtime cache entry — files are clean by construction after
import.

This helps but doesn't fully solve the large-repo status problem.
With 35K+ changes and 40K files, status still hangs — likely due
to two full TREE table scans plus filesystem walk. Needs a
fundamentally different approach (index-based status like git).

* perf: populate mtime cache during git import for instant status

After git import, every file's mtime+size is stored in the pristine
database. This lets 'atomic status' compare file metadata (stat)
instead of reconstructing graph content for every file.

Before: status on a 1017-commit repo hung (O(files × graph_traversal))
After: status completes in 0.1s (O(files × stat))

Also adds Repository::update_file_mtimes() public API for batch
mtime cache population, and fast-paths get_file_content() and
status() to skip O(N) change filter scan for shared root views.

* perf: populate mtime cache after materialize

Both materialize() and materialize_prefix() now stat all written files
and store their mtime+size in the pristine database. This makes
'atomic status' after a server-side push + materialize instantaneous
(stat comparison instead of graph traversal).

Same pattern as the git import mtime population — best-effort,
errors silently ignored.

* fix: handle AllEmpty gracefully during git import for pure-deletion commits

Commits that only delete files can produce RecordedFiles with hunks
that globalization strips (e.g., when find_content_vertices returns
empty for files whose graph content was already removed by prior
commits in the same batch).

Instead of failing with 'All recorded files are empty (no hunks)',
fall back to an empty change and still run the explicit repo.remove()
cleanup for deleted file TREE entries. The commit metadata (message,
author, timestamp) is preserved in the change.

This eliminates the 'Failed to write' warnings during large git
imports (e.g., terraform with pure-deletion merge commits).

* perf: replace FILE_MTIMES with FILE_INDEX — content hash eliminates graph traversal in status

Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX
(52 bytes: mtime+size+blake3_hash).

Status no longer calls get_file_content (which traverses the graph
to reconstruct file content). Instead:
1. If mtime+size match index → clean (O(1) per file)
2. If mtime+size differ → hash disk file, compare with stored hash
   (O(file_size) per file, no graph traversal)

This makes status O(files × stat) instead of O(files × graph).

Before: status on 35K-commit terraform repo = 27 seconds
After: status on 1017-commit hyperfine repo = 111ms

Also updated all callers:
- git import computes blake3 hash during mtime population
- materialize computes blake3 hash after writing files
- record stores content hash after recording
- status uses content_hash for comparison instead of get_file_content

* fix: remove deleted files from FILE_INDEX during git import

After repo.remove() cleans up the TREE entry for deleted files,
also remove the FILE_INDEX entry so status doesn't show them as
stale deleted entries.

Note: the deletion is still not recorded as a GraphOp::FileDel
in the change — it's only a TREE+INDEX cleanup. Proper graph-level
deletion recording is a follow-up.

* fix: reconcile orphaned TREE entries after git import

Merge commits can implicitly delete files by not including them from
a second parent. Our per-commit diff only detects explicit deletions
(FileOperation::Deleted), so files dropped during merge resolution
leave orphaned TREE entries that show up as 'deleted' in status.

Fix: after all import batches complete, compare every tracked path
against the working copy. Files in TREE that don't exist on disk are
removed from both TREE and FILE_INDEX.

tokio: 3 orphaned entries cleaned up in <0.1s
Result: status shows 'nothing to record, working tree clean'

* fix: reconcile untracked files after git import (reverse direction)

Merge commits can also implicitly ADD files from a second parent
without an explicit FileOperation::Added in the first-parent diff.
These files exist on disk but have no TREE entry, showing as
'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run
repo.status() to find untracked files and add them to tracking
with FILE_INDEX entries.

This completes the bidirectional reconciliation:
- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added

* fix: reconcile untracked files after git import (reverse direction)

Merge commits can also implicitly ADD files from a second parent
without an explicit FileOperation::Added in the first-parent diff.
These files exist on disk but have no TREE entry, showing as
'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run
repo.status() to find untracked files and add them to tracking
with FILE_INDEX entries.

This completes the bidirectional reconciliation:
- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added

* vault creation and testing

* Make vault initialization the default and add `--no-vault` flag

* Add C/C++ semantic parser and improve AI system prompt

* Refactor code to follow Rust idioms and conventions

- Use `sort_by_key` with `Reverse` instead of manual comparison
- Replace `match` with `.is_some()` for simple existence checks
- Implement `FromStr` trait instead of custom `from_str` method
- Use `std::slice::from_ref` to avoid allocating single-element vec
- Dereference hash instead of cloning
- Simplify heading detection logic with `find` instead of `skip_while`
- Reorder variable initialization for clarity

* docs: add Vault and Knowledge Graph sections to README

- Add 'Vault — Shared Project Brain' section with usage examples
- Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline
- Add Vault Commands table to CLI Reference
- Add experiment/** branch pattern to CI push trigger
- Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli

* docs: add Vault and Knowledge Graph sections to README

- Add 'Vault — Shared Project Brain' section with usage examples
- Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline
- Add Vault Commands table to CLI Reference
- Add experiment/** branch pattern to CI push trigger
- Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli

* docs: fix rustdoc warnings (-D warnings)

- Escape square brackets in KG table doc comments (broken intra-doc links)
- Wrap generic types in backticks in query.rs doc comment (invalid HTML tags)
- Remove broken intra-doc link to vault_store_kg

* test: fix FTS tokenize tests to match >= 3 char minimum

Tests asserted 2-char tokens (rs, in, am) were included, but
tokenize_for_fts filters at >= 3 chars. Also fixed assertion for
'the' which is a stop word.

* test: fix test failures across workspace

- Fix 3 vault deflation tests: add trailing newlines to default vault
  source files (MEMORY.md, system_prompt.md, code-intelligence.md) so
  content hash roundtrips correctly through materialize/scan cycle
- Fix test_options_new_returns_defaults: assert record_empty_files is
  true (matching the Default impl)
- Fix test_status_clean_after_record: status() is an exception-reporter
  that omits clean files; assert absence from non-clean lists instead
- Fix test_diff_end_to_end_multiple_files: same status().clean() pattern
- Fix atomic-cli compile errors: add missing Init and Promote match arms
  in vault/mod.rs and view/mod.rs test helpers

Pre-existing failures not addressed:
- test_three_sequential_modifications (graph overlay content retrieval)
- test_log_run_json_empty (test isolation / working directory issue)

* test: fix remaining test failures for CI

- Fix test_empty_file: record_empty_files defaults to true, so empty
  file recording succeeds; test both true and false behaviors
- Fix test_log_run_json_empty: add missing #[serial] to CWD-mutating
  tests in init.rs and status.rs that raced with serial log tests

* clean provenance and attestation recording

* fix opencode

* cleanup agent configurations for sub-projects

* Add team collaboration features for Atomic VCS

Introduces `atomic-teams` crate with organization and team management,
plus workspace/project storage APIs. New CLI commands for managing orgs
(create, list, show, update, delete, upgrade, switch) and teams (list,
create, show, update, delete), with member and permission grant
management.
Feature-gated behind the `teams` flag in the CLI.

Key additions:
- `atomic-teams`: Domain logic for org/team CRUD, member management,
  grants, domain aliases
- `atomic-remote`: `StorageClient` HTTP API for workspace/project
  operations
- CLI commands: `atomic org`, `atomic team`, `atomic workspace`, `atomic
  project`
- Knowledge graph expansion: `PART_OF` and `INCLUDES` edge kinds
- Infrastructure: Content search index powered by syntext, agentic LLM
  tool-use system

* Add integration tests for atomic-storage management API

Tests the full CLI ↔ atomic-storage round-trip for organizations,
workspaces, projects, teams, members, and identity resolution. Includes
server availability check, multi-user workflows, resource CRUD
operations,
and cleanup. Also hardens helpers.sh against unbound variable errors on
bash <4.4 when arrays are empty under `set -u`.

* Add change dependency index and fast status

Backfill and use a new CHANGE_DEPS index to expand view dependency
closures without loading .change files. Add MutTxn::put_change_deps,
pristine table support, reader/writer implementations and tests.
Wire repository to populate the index on insert and expose
repo.repair_change_dependency_index with a new
`doctor repair-dependency-index` CLI command. Also add a fast
mtime-only status check in the agent to skip unnecessary records.

* Use owned strings in CLI tables

Add FileIndexEntry/FileIndexMetadata aliases and export them from
pristine traits; update implementations and callers to use the new
types.
Change the AI repo tool to return a structured file outline for large
files
instead of erroring when no line range is provided. Misc minor cleanups
and formatting tweaks (hidden-dir check, small refactors)

* Pin roaring to 0.11.3 and fix sorts and doc links

* Update CI workflows and Rust toolchain action

Remove push branch triggers from ci.yml so CI runs on PRs to dev.
Replace actions-rs/toolchain@v1 with dtolnay/rust-toolchain@stable
and adjust inputs (use 'targets' for matrix builds) in release.yml.

* Bump MSRV to Rust 1.90

Update CI toolchain usages, refresh Cargo.lock to pick roaring 0.11.4,
and remove the atomic-repository pin on roaring 0.11.3 so syntext's
dependency can upgrade.

* update for PR to release

* Update bug report URL in CLI errors

Point CLI internal-error/help messages to the atomicdotdev GitHub org.

Remove the 'test' job from .github/workflows/release.yml and make the
'build' job depend only on 'version' so the release workflow skips the
matrix test run
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant