Enhance Git import performance and roundtrip fidelity #44

Merged: leefaus merged 24 commits into release from dev on May 5, 2026

@leefaus leefaus commented May 5, 2026

This pull request makes significant improvements to the documentation and configuration for the Atomic project, focusing on clarifying core concepts, updating the CI/CD setup, and adding support for new workspace members and features. The most important changes include major documentation enhancements to explain new features like workspace shelving, the vault, and the knowledge graph; updates to the GitHub Actions workflows for Rust toolchain management; and the addition of new crates and dependencies.

Documentation improvements:

  • Expanded and clarified the README.md to introduce and explain new concepts such as workspace shelving, the vault (project knowledge store), and the knowledge graph, including detailed usage examples and command references. Updated terminology from "stack" to "view" throughout, and added tables comparing features and commands.
  • Updated AGENTS.md to document workspace shelving, including references to relevant code and a summary of how ignored files and exposed paths are handled per view.

CI/CD workflow updates:

  • Replaced the deprecated actions-rs/toolchain with dtolnay/rust-toolchain in all GitHub Actions workflows (ci.yml, release.yml), and updated toolchain versions to use Rust 1.90 where appropriate. Simplified the test step to run tests for all workspace members at once.

Workspace and dependency additions:

  • Added new workspace members (atomic-semantic, atomic-teams) and updated the workspace Rust version to 1.90 in Cargo.toml.
  • Added new dependencies for IDs (uuid) and content search indexing (syntext).

Minor codebase improvements:

  • Updated the agent event hook system to recognize additional task lifecycle verbs for Copilot and other agents in atomic-agent/src/event.rs.

These changes collectively improve developer experience, documentation clarity, and CI reliability, while laying the groundwork for new features related to workspace management and code intelligence.

leefaus added 23 commits April 9, 2026 12:16
* feat: port roundtrip fidelity fixes and git import improvements

Port all 7 commits from chore/roundtirp-fidelity-test onto dev's
restructured codebase:

- Fix hunk consolidation: collapse any combination of Replace, Delete,
  or middle-Insert hunks into a single whole-file Replace to prevent
  N× content duplication in the graph
- Remove NeedsReplace fallback from globalize_hunk; middle insertions
  are now caught by upstream consolidation
- Add GitDiffLine type and build_crdt_ops_from_git_diff for exact
  git diff parity in CRDT ops
- Rewrite git import write_commit to use in-memory working copies
  (Memory) instead of git checkout_tree, eliminating filesystem I/O
- Add FileMove GraphOp support in record, globalize, and insert
  pipelines for proper rename tracking
- Fix binary file modifications (create Replace hunk when binary
  content differs)
- Fix deleted file cleanup (explicit repo.remove after insert)
- Add double-deletion guard for FileMove in insert_change
- Use Phase 1 old_content instead of repo.get_file_content to avoid
  O(N) graph scans per commit during import
- Add content roundtrip fidelity tests
- Add harness suites 11-13 (diff parity, full repo parity, import
  fidelity)
- Extend harness suite 10 with status parity tests

Known issue: assemble_and_hash calls retrieve_graph per file per
commit, which is O(graph_size). For large imports (1000+ commits),
this causes progressive slowdown. Tracked for separate optimization
(in-memory content vertex cache).

* Add InodeGraphOps trait implementation to ViewGraph wrapper

The ViewGraph wrapper now delegates InodeGraphOps methods to the
underlying transaction, enabling O(m) file-local graph traversal within
view-filtered contexts. Update globalize module to accept InodeGraphOps
and use the fast INODE_GRAPH path for content vertex discovery.
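The O(m) file-local traversal described above can be pictured as a range scan over a B-tree whose keys start with the file's inode. The sketch below is only an illustration of that idea: the key layout `(inode, from, to)` and the function name are assumptions, not the actual INODE_GRAPH schema or the InodeGraphOps API.

```rust
use std::collections::BTreeSet;

/// Hypothetical edge key: (inode, source vertex, target vertex).
/// Sorting by inode first keeps each file's edges contiguous in the tree.
type EdgeKey = (u64, u64, u64);

/// Return only the edges that belong to `inode`.
/// The range scan costs O(log n + m), where m is the number of edges for
/// this file, instead of visiting all n edges in the repository.
fn edges_for_inode(index: &BTreeSet<EdgeKey>, inode: u64) -> Vec<(u64, u64)> {
    index
        .range((inode, u64::MIN, u64::MIN)..=(inode, u64::MAX, u64::MAX))
        .map(|&(_, from, to)| (from, to))
        .collect()
}
```

A global scan would instead iterate the whole set and filter by inode, which is the O(n) behavior the commit moves away from.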

* perf: use INODE_GRAPH for O(m) file-scoped traversal in globalize + batch imports

Two performance optimizations for git import:

1. Wire find_content_vertices to INODE_GRAPH secondary index
   - find_content_vertices now uses the file-scoped INODE_GRAPH
     B-tree instead of retrieve_graph (global GRAPH DFS)
   - O(m) where m = edges for this file, not O(n) where n = all
     edges in the repository
   - Adds InodeGraphOps trait bound through globalize pipeline
   - Implements InodeGraphOps delegation for ViewGraph wrapper
   - Falls back to global GRAPH scan if INODE_GRAPH not populated

   Before: assemble time grew linearly (40ms → 550ms by commit 300)
   After:  assemble time constant (3-14ms regardless of position)

   hyperfine (1017 commits): Phase 2 dropped from O(N²) minutes
   to 21.6 seconds flat

2. Batch import pipeline for memory-bounded processing
   - Commits processed in parse→write batches instead of
     parse-all→write-all
   - Batch sizes tiered by repo size:
     <5K: 250, 5-10K: 500, 10-20K: 1000, 20K+: 2500
   - Keeps memory bounded for large repos (35K+ commits)
   - Shows incremental progress during import
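The tiered batch sizes listed above could be expressed roughly as follows. This is a minimal sketch: the function names are hypothetical, and placing each boundary value (for example exactly 5,000 commits) in the larger tier is an assumption, since the commit message does not say which side the boundaries fall on.

```rust
/// Pick an import batch size from the total commit count.
/// Tiers follow the commit message: <5K: 250, 5-10K: 500, 10-20K: 1000, 20K+: 2500.
/// Assumption: a count exactly on a boundary belongs to the larger tier.
fn batch_size_for(commit_count: usize) -> usize {
    match commit_count {
        0..=4_999 => 250,
        5_000..=9_999 => 500,
        10_000..=19_999 => 1_000,
        _ => 2_500,
    }
}

/// Split the commit list into parse→write batches so that only one batch
/// of parsed commits needs to be held in memory at a time.
fn batches<T>(commits: &[T]) -> impl Iterator<Item = &[T]> {
    commits.chunks(batch_size_for(commits.len()))
}
```

Each yielded slice would be parsed and written before the next one is touched, which is what keeps memory bounded and lets progress be reported per batch.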

* perf: use eventual durability for redb write transactions

Skip fsync on every transaction commit. Data is still written to the
database file but the OS may buffer it. This reduces per-commit
overhead for all write operations.

Safe for most use cases — a crash could lose the last few transactions
but the repository can be rebuilt from change files.

* Bump version to 0.5.3

* fix: use global retrieve_graph in find_content_vertices

The INODE_GRAPH fast path bypasses ViewGraph's change filter,
returning vertices from ALL changes regardless of view visibility.
This caused content duplication in the record() path which uses
ViewGraph wrapping.

Always use retrieve_graph (which respects whatever GraphTxnT impl
is provided) for content vertex discovery. The INODE_GRAPH
optimization remains available for callers that use a bare &txn
(like assemble_and_hash for git import).

Fixes 5 of 6 content roundtrip fidelity test failures.
Remaining: test_three_sequential_modifications (content truncation
from a separate issue in dev's record pipeline).

* fix: harness error handling and merge commit test

- Fix error handling section: capture exit code with || status=$?
  instead of letting set -e kill the harness on expected failures
- Fix merge commit test: detect default branch name instead of
  hardcoding master/main

* fix: restore del_edge_with_reverse in apply pipeline

Dev's refactoring removed the old-edge deletion from write_new_edge,
leaving both the original alive edge AND the new DELETED edge in the
B-tree multimap. This caused is_vertex_alive to find the stale alive
parent edge and incorrectly report dead vertices as alive.

Result: the second Replace on a file couldn't find the correct content
vertex, breaking sequential modifications (text and binary).

Restore del_edge_with_reverse before add_edge_with_reverse so the
superseded edge is removed. This matches the old branch's behavior
where apply_new_edge always deleted the previous edge first.

Fixes: all 9 content roundtrip fidelity tests now pass.
Fixes: suite 13 'Multiple Binary Modifications' (46/46 pass).
Closes #41.
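To see why a stale alive edge makes a dead vertex look alive, consider a toy model of the multimap. Everything here is hypothetical: the `(child, parent, deleted)` tuple, `is_alive`, and `mark_deleted` only mimic the behavior described above and are not the real edge encoding or the `del_edge_with_reverse` / `add_edge_with_reverse` signatures.

```rust
use std::collections::BTreeSet;

/// Hypothetical edge record: (child vertex, parent vertex, deleted flag).
/// A set of such tuples behaves like a multimap keyed by child.
type Edge = (u32, u32, bool);

/// A vertex counts as alive if any incoming edge is not flagged deleted.
fn is_alive(edges: &BTreeSet<Edge>, child: u32) -> bool {
    edges.iter().any(|&(c, _, deleted)| c == child && !deleted)
}

/// Record a deletion. With `remove_old == false` the original alive edge is
/// left behind (the regression); with `true` it is removed first (the fix).
fn mark_deleted(edges: &mut BTreeSet<Edge>, child: u32, parent: u32, remove_old: bool) {
    if remove_old {
        edges.remove(&(child, parent, false));
    }
    edges.insert((child, parent, true));
}
```

With `remove_old` set to `false`, both the alive and the deleted edge coexist, so `is_alive` keeps returning `true` for a vertex that should be dead.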

* fix: remove Durability::Eventual — causes data visibility issues on Windows

Redb's Eventual durability skips fsync, which on Windows can cause
a read transaction opened immediately after a write commit to not
see the written data. This broke 6 of 9 content fidelity tests on
Windows while passing on macOS/Linux.

Revert to default durability (Immediate) for correctness across
all platforms.

* fix: use platform-aware paths in fidelity tests for Windows compatibility

Add test_path() helper that flattens subdirectory paths on Windows
(src/main.rs → src_main.rs) while keeping them as-is on Unix.
Windows has a known issue with subdirectory path tracking — this
workaround lets the fidelity tests exercise the record pipeline
without hitting the path normalization bug.

Git show commands still use the real repo path (src/main.rs) for
fetching content from the cloned hyperfine repo.

* fix: normalize path separators for Windows compatibility

On Windows, walkdir returns paths with backslash separators (src\main.rs)
but the TREE table stores forward slashes (src/main.rs). Status comparison
between tracked and disk paths failed because PathBuf uses byte-level
comparison — src/main.rs != src\main.rs.

Fix: normalize all paths to forward slashes in both the working copy scan
(collect_working_copy_files_with_rules) and the tracked path sets
(status). This ensures consistent comparison across platforms.

Also revert the test_path() workaround in fidelity tests — tests now use
real subdirectory paths (src/main.rs) since the underlying bug is fixed.
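The separator normalization can be sketched as a plain string transform applied to both sides before comparison. The helper names below are hypothetical, and the sketch works on strings rather than reproducing collect_working_copy_files_with_rules or status.

```rust
use std::collections::BTreeSet;

/// Rewrite backslash separators to forward slashes so a path reported by a
/// Windows directory walk matches the form stored for tracked files.
fn normalize_separators(path: &str) -> String {
    path.replace('\\', "/")
}

/// Normalize a whole collection of paths into a set suitable for comparison.
fn normalized_set<'a>(paths: impl IntoIterator<Item = &'a str>) -> BTreeSet<String> {
    paths.into_iter().map(normalize_separators).collect()
}
```

Applying the same function to the tracked set and the disk scan means the comparison no longer depends on which platform produced the path.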

* perf: skip O(N) change filter scan for shared root views

For shared views with no parent (the common case after git import),
all changes are visible. Skip the expensive collect_view_change_ids
scan (which loads every change file from disk to expand dependencies)
in both status() and get_file_content().

Also skip content comparison in status for shared root views when
there's no mtime cache entry — files are clean by construction after
import.

This helps but doesn't fully solve the large-repo status problem.
With 35K+ changes and 40K files, status still hangs — likely due
to two full TREE table scans plus filesystem walk. Needs a
fundamentally different approach (index-based status like git).

* perf: populate mtime cache during git import for instant status

After git import, every file's mtime+size is stored in the pristine
database. This lets 'atomic status' compare file metadata (stat)
instead of reconstructing graph content for every file.

Before: status on a 1017-commit repo hung (O(files × graph_traversal))
After: status completes in 0.1s (O(files × stat))

Also adds Repository::update_file_mtimes() public API for batch
mtime cache population, and fast-paths get_file_content() and
status() to skip O(N) change filter scan for shared root views.

* perf: populate mtime cache after materialize

Both materialize() and materialize_prefix() now stat all written files
and store their mtime+size in the pristine database. This makes
'atomic status' after a server-side push + materialize instantaneous
(stat comparison instead of graph traversal).

Same pattern as the git import mtime population — best-effort,
errors silently ignored.

* fix: handle AllEmpty gracefully during git import for pure-deletion commits

Commits that only delete files can produce RecordedFiles with hunks
that globalization strips (e.g., when find_content_vertices returns
empty for files whose graph content was already removed by prior
commits in the same batch).

Instead of failing with 'All recorded files are empty (no hunks)',
fall back to an empty change and still run the explicit repo.remove()
cleanup for deleted file TREE entries. The commit metadata (message,
author, timestamp) is preserved in the change.

This eliminates the 'Failed to write' warnings during large git
imports (e.g., terraform with pure-deletion merge commits).

* perf: replace FILE_MTIMES with FILE_INDEX — content hash eliminates graph traversal in status

Schema change: FILE_MTIMES (20 bytes: mtime+size) → FILE_INDEX
(52 bytes: mtime+size+blake3_hash).

Status no longer calls get_file_content (which traverses the graph
to reconstruct file content). Instead:
1. If mtime+size match index → clean (O(1) per file)
2. If mtime+size differ → hash disk file, compare with stored hash
   (O(file_size) per file, no graph traversal)

This makes status O(files × stat) instead of O(files × graph).

Before: status on 35K-commit terraform repo = 27 seconds
After: status on 1017-commit hyperfine repo = 111ms

Also updated all callers:
- git import computes blake3 hash during mtime population
- materialize computes blake3 hash after writing files
- record stores content hash after recording
- status uses content_hash for comparison instead of get_file_content
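The two-step check described above (metadata first, content hash only on mismatch) might look like the sketch below. The `IndexEntry` struct, `FileState` enum, and `classify` function are hypothetical and do not reproduce the 52-byte on-disk encoding; the hash is supplied by a closure so the example stays free of external crates, whereas the PR itself uses blake3.

```rust
/// Hypothetical in-memory view of an index entry: file metadata plus a
/// 32-byte content hash. The real on-disk layout is not reproduced here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct IndexEntry {
    mtime_nanos: u128,
    size: u64,
    content_hash: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
enum FileState {
    Clean,
    Modified,
}

/// Classify one file without touching the graph.
/// `hash_file` is invoked only when the metadata differs, so the common
/// unchanged case costs a single stat-style comparison.
fn classify(
    entry: &IndexEntry,
    mtime_nanos: u128,
    size: u64,
    hash_file: impl FnOnce() -> [u8; 32],
) -> FileState {
    // Step 1: matching mtime and size means the file is treated as clean.
    if entry.mtime_nanos == mtime_nanos && entry.size == size {
        return FileState::Clean;
    }
    // Step 2: metadata changed, so compare content hashes instead.
    if hash_file() == entry.content_hash {
        FileState::Clean
    } else {
        FileState::Modified
    }
}
```

A file that was merely touched (new mtime, same bytes) still comes out clean in step 2, at the cost of hashing that one file.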

* fix: remove deleted files from FILE_INDEX during git import

After repo.remove() cleans up the TREE entry for deleted files,
also remove the FILE_INDEX entry so status doesn't show them as
stale deleted entries.

Note: the deletion is still not recorded as a GraphOp::FileDel
in the change — it's only a TREE+INDEX cleanup. Proper graph-level
deletion recording is a follow-up.

* fix: reconcile orphaned TREE entries after git import

Merge commits can implicitly delete files by not including them from
a second parent. Our per-commit diff only detects explicit deletions
(FileOperation::Deleted), so files dropped during merge resolution
leave orphaned TREE entries that show up as 'deleted' in status.

Fix: after all import batches complete, compare every tracked path
against the working copy. Files in TREE that don't exist on disk are
removed from both TREE and FILE_INDEX.

tokio: 3 orphaned entries cleaned up in <0.1s
Result: status shows 'nothing to record, working tree clean'

* fix: reconcile untracked files after git import (reverse direction)

Merge commits can also implicitly ADD files from a second parent
without an explicit FileOperation::Added in the first-parent diff.
These files exist on disk but have no TREE entry, showing as
'untracked' or 'new file' in status.

After removing orphaned TREE entries (direction 1), also run
repo.status() to find untracked files and add them to tracking
with FILE_INDEX entries.

This completes the bidirectional reconciliation:
- Direction 1: TREE entries without disk files → removed
- Direction 2: Disk files without TREE entries → added
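The bidirectional reconciliation amounts to two set differences between tracked paths and paths on disk. The sketch below assumes both sides are already available as sets of normalized path strings; `Reconciliation` and `reconcile` are hypothetical names, not part of the repository API.

```rust
use std::collections::BTreeSet;

/// Result of comparing tracked paths with the working copy.
#[derive(Debug, PartialEq, Eq)]
struct Reconciliation {
    /// Direction 1: tracked entries with no file on disk (to be removed).
    orphaned: Vec<String>,
    /// Direction 2: files on disk with no tracked entry (to be added).
    untracked: Vec<String>,
}

/// Compute both directions in one pass over each set.
fn reconcile(tracked: &BTreeSet<String>, on_disk: &BTreeSet<String>) -> Reconciliation {
    Reconciliation {
        orphaned: tracked.difference(on_disk).cloned().collect(),
        untracked: on_disk.difference(tracked).cloned().collect(),
    }
}
```

When both lists come back empty, tracked state and working copy agree, which corresponds to the clean status output quoted earlier.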

- Use `sort_by_key` with `Reverse` instead of manual comparison
- Replace `match` with `.is_some()` for simple existence checks
- Implement `FromStr` trait instead of custom `from_str` method
- Use `std::slice::from_ref` to avoid allocating single-element vec
- Dereference hash instead of cloning
- Simplify heading detection logic with `find` instead of `skip_while`
- Reorder variable initialization for clarity
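Three of the idioms named in the list above are standard-library patterns and can be shown in isolation. The `Level` type and the helper functions are made up for illustration and do not come from the Atomic codebase.

```rust
use std::cmp::Reverse;
use std::str::FromStr;

#[derive(Debug, PartialEq, Eq)]
enum Level {
    Low,
    High,
}

/// Implementing `FromStr` (rather than an inherent `from_str` method)
/// makes `"high".parse::<Level>()` work.
impl FromStr for Level {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "low" => Ok(Level::Low),
            "high" => Ok(Level::High),
            other => Err(format!("unknown level: {other}")),
        }
    }
}

/// Descending sort via `sort_by_key` and `Reverse` instead of a manual comparator.
fn sort_descending(values: &mut [u32]) {
    values.sort_by_key(|&v| Reverse(v));
}

fn total(items: &[u32]) -> u32 {
    items.iter().sum()
}

/// `std::slice::from_ref` views one value as a one-element slice without
/// allocating a `Vec`.
fn total_of_one(item: &u32) -> u32 {
    total(std::slice::from_ref(item))
}
```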

- Add 'Vault — Shared Project Brain' section with usage examples
- Add 'Knowledge Graph Queries with LLM Assist' section with RAG pipeline
- Add Vault Commands table to CLI Reference
- Add experiment/** branch pattern to CI push trigger
- Fix 10 clippy -D warnings across atomic-core, atomic-repository, atomic-cli

- Escape square brackets in KG table doc comments (broken intra-doc links)
- Wrap generic types in backticks in query.rs doc comment (invalid HTML tags)
- Remove broken intra-doc link to vault_store_kg

Tests asserted that 2-char tokens (rs, in, am) were included, but
tokenize_for_fts filters at >= 3 chars. Also fixed the assertion for
'the', which is a stop word.
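The filtering behavior the corrected tests rely on can be sketched as follows. This is not the real tokenize_for_fts: the function name, the splitting rule, and the stop-word list are assumptions; only the minimum length of 3 characters and the fact that 'the' is a stop word come from the commit message.

```rust
/// Hypothetical stop-word list; the project's actual list is not shown in this PR.
const STOP_WORDS: &[&str] = &["the", "and", "for"];

/// Split on non-alphanumeric characters, lowercase, then drop tokens that
/// are shorter than 3 characters or appear in the stop-word list.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .map(str::to_lowercase)
        .filter(|t| t.chars().count() >= 3)
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}
```

Under these rules, short tokens such as `rs` and `in` never reach the index, so a test asserting their presence cannot pass.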

- Fix 3 vault deflation tests: add trailing newlines to default vault
  source files (MEMORY.md, system_prompt.md, code-intelligence.md) so
  content hash roundtrips correctly through materialize/scan cycle
- Fix test_options_new_returns_defaults: assert record_empty_files is
  true (matching the Default impl)
- Fix test_status_clean_after_record: status() is an exception-reporter
  that omits clean files; assert absence from non-clean lists instead
- Fix test_diff_end_to_end_multiple_files: same status().clean() pattern
- Fix atomic-cli compile errors: add missing Init and Promote match arms
  in vault/mod.rs and view/mod.rs test helpers

Pre-existing failures not addressed:
- test_three_sequential_modifications (graph overlay content retrieval)
- test_log_run_json_empty (test isolation / working directory issue)

- Fix test_empty_file: record_empty_files defaults to true, so empty
  file recording succeeds; test both true and false behaviors
- Fix test_log_run_json_empty: add missing #[serial] to CWD-mutating
  tests in init.rs and status.rs that raced with serial log tests

Introduces the `atomic-teams` crate with organization and team management,
plus workspace/project storage APIs. New CLI commands for managing orgs
(create, list, show, update, delete, upgrade, switch) and teams (list,
create, show, update, delete), with member and permission grant
management. Feature-gated behind the `teams` flag in the CLI.

Key additions:
- `atomic-teams`: Domain logic for org/team CRUD, member management,
  grants, domain aliases
- `atomic-remote`: `StorageClient` HTTP API for workspace/project
  operations
- CLI commands: `atomic org`, `atomic team`, `atomic workspace`, `atomic
  project`
- Knowledge graph expansion: `PART_OF` and `INCLUDES` edge kinds
- Infrastructure: Content search index powered by syntext, agentic LLM
  tool-use system

Tests the full CLI ↔ atomic-storage round-trip for organizations,
workspaces, projects, teams, members, and identity resolution. Includes
server availability check, multi-user workflows, resource CRUD operations,
and cleanup. Also hardens helpers.sh against unbound variable errors on
bash <4.4 when arrays are empty under `set -u`.

Backfill and use a new CHANGE_DEPS index to expand view dependency
closures without loading .change files. Add MutTxn::put_change_deps,
pristine table support, reader/writer implementations and tests.
Wire repository to populate the index on insert and expose
repo.repair_change_dependency_index with a new
`doctor repair-dependency-index` CLI command. Also add a fast
mtime-only status check in the agent to skip unnecessary records.
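Expanding a dependency closure from an index, rather than from change files, is a graph reachability walk over an in-memory map. The sketch below assumes the index can be read as a map from a change ID to its direct dependencies; `ChangeId` as a `u64` and the function name are hypothetical, not the CHANGE_DEPS table format.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical change identifier; the real type is not shown in this PR.
type ChangeId = u64;

/// Breadth-first expansion of every change reachable from `roots` through
/// direct-dependency links. Only the index is consulted, never file contents.
fn dependency_closure(
    deps: &HashMap<ChangeId, Vec<ChangeId>>,
    roots: &[ChangeId],
) -> BTreeSet<ChangeId> {
    let mut seen: BTreeSet<ChangeId> = roots.iter().copied().collect();
    let mut queue: VecDeque<ChangeId> = roots.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        if let Some(children) = deps.get(&id) {
            for &dep in children {
                // `insert` returns true only for IDs not visited before.
                if seen.insert(dep) {
                    queue.push_back(dep);
                }
            }
        }
    }
    seen
}
```

The `seen` set doubles as the result and as the guard that keeps shared dependencies from being expanded more than once.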

Add FileIndexEntry/FileIndexMetadata aliases and export them from
pristine traits; update implementations and callers to use the new
types.

Change the AI repo tool to return a structured file outline for large
files instead of erroring when no line range is provided. Misc minor
cleanups and formatting tweaks (hidden-dir check, small refactors).

Remove push branch triggers from ci.yml so CI runs on PRs to dev.
Replace actions-rs/toolchain@v1 with dtolnay/rust-toolchain@stable
and adjust inputs (use 'targets' for matrix builds) in release.yml.
Update CI toolchain usages, refresh Cargo.lock to pick roaring 0.11.4,
and remove the atomic-repository pin on roaring 0.11.3 so syntext's
dependency can upgrade.

Point CLI internal-error/help messages to the atomicdotdev GitHub org.

Remove the 'test' job from .github/workflows/release.yml and make the
'build' job depend only on 'version' so the release workflow skips the
matrix test run.
@leefaus leefaus merged commit 961df86 into release May 5, 2026
7 checks passed
@leefaus leefaus deleted the dev branch May 5, 2026 14:12