perf(ingest): no-op fast path gated on a source fingerprint (#468)#471
Conversation
`ingest()` cost ~0.7s on every call even when nothing upstream changed — it always loaded the cursor blob, cloned it, walked every session file, and deserialized a cursor per file. Profiling a 94k-turn ledger showed the no-op cost is dominated by the cursor load/clone + per-file deserialize plus the source walk, none of which produce any new turns. Add a stat-only source fingerprint (`count:totalBytes:mtimeSum` over the Claude/Codex/OpenCode session files, folding in OpenCode's per-session message-dir mtime). `ingest_all` computes it up front and, when it matches the value the last sweep recorded in the new `archive_state.source_fingerprint` column, returns an empty report without loading cursors or parsing anything. The fingerprint is stored after every full sweep; it is blanked by `state reset` and `rebuild_derivable` so a wiped/rebuilt ledger always re-walks. Storing the fingerprint captured at the *start* of the sweep is the safe choice: a file that changes mid-sweep yields a different fingerprint next call, forcing a (cheap) re-scan that per-file cursors then skip. Schema bumped to v6 via in-place `ALTER TABLE ADD COLUMN` (auto-migrated). On a live ledger a no-op `burn ingest` dropped from ~0.74s to ~0.19s (bounded by the unavoidable stat walk over the source trees); on Claude/Codex-dominant ledgers without a large OpenCode tree it returns in tens of ms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
CodeAnt AI is reviewing your PR. |
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR adds a stat-only source fingerprint used by ingest_all to detect unchanged upstream session stores and short-circuit ingest. Schema bumped to v6; ledger read/write API persists the fingerprint; ingest computes and compares the fingerprint, skips scanning when unchanged (unless forced), and avoids persisting fingerprints if files were skipped. Watch/CLI and tests updated accordingly. ChangesSource fingerprint fast-path for ingest
Sequence Diagram(s)sequenceDiagram
participant ingest_all
participant source_fingerprint
participant Ledger
participant ScanOrchestration
ingest_all->>source_fingerprint: compute current_fp (stat-only)
ingest_all->>Ledger: read_source_fingerprint()
alt stored == current_fp && !force_scan
Ledger->>ingest_all: return match
ingest_all->>ingest_all: return empty IngestReport
else
ingest_all->>ScanOrchestration: perform full scan/parse
ScanOrchestration->>ingest_all: parsed turns + had_skips flag
alt !had_skips
ingest_all->>Ledger: write_source_fingerprint(current_fp)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec53de3473
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ledger | ||
| .write_source_fingerprint(¤t_fp) | ||
| .map_err(|e| anyhow::anyhow!(e))?; |
There was a problem hiding this comment.
Avoid caching fingerprints after skipped source files
When any harness sees a source file but cannot read or parse it, the per-file loop logs the error and continues, yet this still persists the fingerprint that already counted that file. If the failure is transient without changing size/mtime (for example a PermissionDenied read fixed with chmod, which changes ctime but not the fingerprint inputs), the next ingest_all matches this stored fingerprint and returns before retrying, leaving that session un-ingested indefinitely. Only record the fingerprint when every counted source was successfully processed, or include failed files in a way that forces a retry.
Useful? React with 👍 / 👎.
| let sid_b = "cccccccc-cccc-cccc-cccc-cccccccccccc"; | ||
| let file_b = project_dir.join(format!("{sid_b}.jsonl")); | ||
| fs::write(&file_b, claude_distinct_session(sid_b)).unwrap(); |
There was a problem hiding this comment.
Suggestion: This test claims to verify that appending to an existing session file reopens the fast-path gate, but it actually creates a brand-new file. That only validates add-file detection (count change), so a regression where in-place appends are missed could still pass. Update the test to append new JSONL lines to the existing file and assert those new turns are ingested. [incomplete implementation]
Severity Level: Major ⚠️
- ⚠️ Append-only session updates may regress without test failure.
- ⚠️ Fast-path source fingerprint gate untested for in-place appends.
- ⚠️ Future refactors risk missing append detection semantics.Steps of Reproduction ✅
1. Open `crates/relayburn-sdk/src/ingest/orchestration_tests.rs` and locate the test
`ingest_all_rescans_after_source_file_appended` starting around line 529, with its doc
comment at lines 527–528 describing "appending to an existing session file" changing size
and mtime.
2. Within that test, observe the initial setup at lines 522–524 where `sid_a` and `file_a`
are created and `fs::write(&file_a, claude_minimal_session(sid_a)).unwrap();` writes the
first session file, followed by `ingest_all(&mut ledger, &opts).unwrap();` at line 551
that ingests it and arms the fast-path gate.
3. After confirming the fast path is armed via `assert_eq!(ingest_all(&mut ledger,
&opts).unwrap().scanned_sessions, 0);` at line 555, examine the subsequent code at lines
561–563: it defines `sid_b`, computes `file_b`, and calls `fs::write(&file_b,
claude_distinct_session(sid_b)).unwrap();`, which creates an entirely new JSONL file
rather than appending to the existing `file_a`.
4. Run `cargo test ingest_all_rescans_after_source_file_appended` in the
`crates/relayburn-sdk` crate; the test will pass if adding a new file correctly reopens
the fast-path gate (checked via `r3.scanned_sessions >= 1` and `r3.appended_turns >= 1` at
lines 47–53) even if a regression were introduced where in-place appends to an existing
session file do not update the fingerprint, because no code path in this test ever appends
to `file_a` as implied by the doc comment.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** crates/relayburn-sdk/src/ingest/orchestration_tests.rs
**Line:** 561:563
**Comment:**
*Incomplete Implementation: This test claims to verify that appending to an existing session file reopens the fast-path gate, but it actually creates a brand-new file. That only validates add-file detection (count change), so a regression where in-place appends are missed could still pass. Update the test to append new JSONL lines to the existing file and assert those new turns are ingested.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix|
CodeAnt AI finished reviewing your PR. |
|
✅ pr-reviewer applied fixes — committed and pushed Reviewed and fixed the PR locally. Changes made:
Validation:
|
|
Reviewed the PR diff and traced the ingest/schema/status impact. Fixed:
Verification:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
CHANGELOG.md (1)
9-9: ⚡ Quick winSimplify to impact-first wording per coding guidelines.
The entry includes implementation backstory ("stat-only source fingerprint is compared...", "skips the cursor load and per-file deserialize") and a PR reference that the guidelines ask you to drop. Focus on the practical effect:
ingest()is now much faster on unchanged sources.As per coding guidelines, changelog entries should be concise and impact-first, dropping implementation details unless they clearly explain the shipped impact.
Suggested condensed wording
-- `ingest()` short-circuits when nothing upstream changed: a stat-only source fingerprint is compared against the last sweep's value, so a no-op ingest skips the cursor load and per-file deserialize and returns `{ ingested: 0 }` in roughly the cost of walking the source dirs instead of ~0.7s. Adds `archive_state.source_fingerprint` (schema v6, auto-migrated). (`#468`) +- `ingest()` skips re-scanning when upstream sessions haven't changed, reducing no-op calls from ~0.7s to ~0.2s. Adds `archive_state.source_fingerprint` (schema v6, auto-migrated).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` at line 9, Rewrite the changelog line to be impact-first and concise: state that ingest() is now much faster when sources are unchanged and omit the implementation backstory and PR reference; keep a brief note that this adds archive_state.source_fingerprint (schema v6, auto-migrated) and that unchanged runs return { ingested: 0 } to reflect the observable behavior; target the existing text mentioning ingest() and archive_state.source_fingerprint for replacement.packages/sdk-node/CHANGELOG.md (1)
7-7: ⚡ Quick winSimplify to impact-first wording per coding guidelines.
The entry includes a PR reference and implementation details ("Gated on a stat-only source fingerprint...", "cursor-load + per-file scan") that the guidelines ask you to drop. For the Node SDK facade changelog, focus on the observable API behavior.
As per coding guidelines, changelog entries should be concise and impact-first, naming the API touched and the practical effect.
Suggested condensed wording
-- `ingest()` returns in roughly source-walk time (`{ ingested: 0 }`) when nothing upstream has changed, instead of paying the full ~0.7s cursor-load + per-file scan every call. Gated on a stat-only source fingerprint persisted in the ledger. (`#468`) +- `ingest()` returns immediately with `{ ingested: 0 }` when upstream sessions haven't changed, avoiding ~0.7s cursor load.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/sdk-node/CHANGELOG.md` at line 7, Changelog entry for the Node SDK should be rewritten to an impact-first, user-facing sentence: replace the current line that mentions implementation details and PR number with a concise statement naming the API (ingest()) and the observable effect (returns immediately with no work when nothing upstream changed, e.g., returns { ingested: 0 } faster). Remove phrases like "Gated on a stat-only source fingerprint", "cursor-load + per-file scan", and the PR number; keep only the API name and the practical behavior change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/relayburn-sdk/src/ledger/schema.rs`:
- Around line 60-66: The doc comment for archive_state.source_fingerprint
incorrectly states the stored shape as `count:totalBytes:hash`; update the
documentation to match the implementation by describing it as a stat-only
aggregate over mtimes (for example `count:totalBytes:mtimes` or
`count:totalBytes:mtime-aggregate`) and mention that it is derived from file
mtimes rather than a per-file content hash; modify the comment block that
references `archive_state.source_fingerprint` (also the similar text later in
the file) to use this corrected wording.
---
Nitpick comments:
In `@CHANGELOG.md`:
- Line 9: Rewrite the changelog line to be impact-first and concise: state that
ingest() is now much faster when sources are unchanged and omit the
implementation backstory and PR reference; keep a brief note that this adds
archive_state.source_fingerprint (schema v6, auto-migrated) and that unchanged
runs return { ingested: 0 } to reflect the observable behavior; target the
existing text mentioning ingest() and archive_state.source_fingerprint for
replacement.
In `@packages/sdk-node/CHANGELOG.md`:
- Line 7: Changelog entry for the Node SDK should be rewritten to an
impact-first, user-facing sentence: replace the current line that mentions
implementation details and PR number with a concise statement naming the API
(ingest()) and the observable effect (returns immediately with no work when
nothing upstream changed, e.g., returns { ingested: 0 } faster). Remove phrases
like "Gated on a stat-only source fingerprint", "cursor-load + per-file scan",
and the PR number; keep only the API name and the practical behavior change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: edd99d3d-e11b-42a4-8015-46ce058762c7
📒 Files selected for processing (9)
CHANGELOG.mdcrates/relayburn-sdk/src/ingest/ingest.rscrates/relayburn-sdk/src/ingest/orchestration_tests.rscrates/relayburn-sdk/src/ledger.rscrates/relayburn-sdk/src/ledger/db.rscrates/relayburn-sdk/src/ledger/schema.rscrates/relayburn-sdk/src/ledger/tests.rscrates/relayburn-sdk/src/query_verbs.rspackages/sdk-node/CHANGELOG.md
| /// - `6`: adds `archive_state.source_fingerprint TEXT` — a cheap aggregate | ||
| /// (`count:totalBytes:hash`) over the source session files ingest scans. | ||
| /// `ingest_all` records it after each sweep and short-circuits to an | ||
| /// empty report when the live source fingerprint is unchanged, so a | ||
| /// no-op ingest costs a stat-only walk instead of a full cursor load + | ||
| /// per-file deserialize. Blanked by `state reset` so the next ingest | ||
| /// re-walks from scratch. (#468) |
There was a problem hiding this comment.
Align the documented fingerprint shape with the implementation.
These comments say the stored value is count:totalBytes:hash, but this PR’s fingerprint contract is a stat-only aggregate over mtimes. Leaving hash here will misdescribe the on-disk value in archive_state.source_fingerprint.
Suggested wording
-/// (`count:totalBytes:hash`) over the source session files ingest scans.
+/// (`count:totalBytes:mtimeSum`) over the source session files ingest scans.
...
- -- Cheap source-side change gate: `count:totalBytes:hash` over the
+ -- Cheap source-side change gate: `count:totalBytes:mtimeSum` over theAlso applies to: 232-233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/relayburn-sdk/src/ledger/schema.rs` around lines 60 - 66, The doc
comment for archive_state.source_fingerprint incorrectly states the stored shape
as `count:totalBytes:hash`; update the documentation to match the implementation
by describing it as a stat-only aggregate over mtimes (for example
`count:totalBytes:mtimes` or `count:totalBytes:mtime-aggregate`) and mention
that it is derived from file mtimes rather than a per-file content hash; modify
the comment block that references `archive_state.source_fingerprint` (also the
similar text later in the file) to use this corrected wording.
There was a problem hiding this comment.
2 issues found across 6 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/sdk-node/CHANGELOG.md">
<violation number="1" location="packages/sdk-node/CHANGELOG.md:91">
P3: Inconsistent indentation on continuation line inside bullet point: the surrounding continuation lines use 2-space indent but this line starts at column 0.</violation>
</file>
<file name="CHANGELOG.md">
<violation number="1" location="CHANGELOG.md:262">
P3: Changelog text accidentally turned `+ free-function form` into a nested `-` list item, which breaks sentence rendering and changes the meaning of the release note.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| - Umbrella facade now coerces napi-rs `BigInt` return values to `Number` | ||
| for safe-range integers (`[Number.MIN_SAFE_INTEGER, | ||
| Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values | ||
| Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values |
There was a problem hiding this comment.
P3: Inconsistent indentation on continuation line inside bullet point: the surrounding continuation lines use 2-space indent but this line starts at column 0.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/sdk-node/CHANGELOG.md, line 91:
<comment>Inconsistent indentation on continuation line inside bullet point: the surrounding continuation lines use 2-space indent but this line starts at column 0.</comment>
<file context>
@@ -88,7 +88,7 @@
- Umbrella facade now coerces napi-rs `BigInt` return values to `Number`
for safe-range integers (`[Number.MIN_SAFE_INTEGER,
- Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values
+Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values
outside that range stay `BigInt` to avoid silent precision loss.
- Conformance suite is now wired into CI: `napi build` writes its outputs
</file context>
| Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values | |
| Number.MAX_SAFE_INTEGER]`), matching the TS 1.x runtime shape; values |
| - free-function form) returning `SessionsListResult { sessions, limit, | ||
| truncated }`. Derived from the `turns` table so older ledgers with an | ||
| empty `sessions` table still enumerate correctly. |
There was a problem hiding this comment.
P3: Changelog text accidentally turned + free-function form into a nested - list item, which breaks sentence rendering and changes the meaning of the release note.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At CHANGELOG.md, line 262:
<comment>Changelog text accidentally turned `+ free-function form` into a nested `-` list item, which breaks sentence rendering and changes the meaning of the release note.</comment>
<file context>
@@ -259,9 +259,9 @@ Cross-package release notes for relayburn. Package changelogs contain package-le
- + free-function form) returning `SessionsListResult { sessions, limit,
- truncated }`. Derived from the `turns` table so older ledgers with an
- empty `sessions` table still enumerate correctly.
+ - free-function form) returning `SessionsListResult { sessions, limit,
+truncated }`. Derived from the `turns` table so older ledgers with an
+ empty `sessions` table still enumerate correctly.
</file context>
| - free-function form) returning `SessionsListResult { sessions, limit, | |
| truncated }`. Derived from the `turns` table so older ledgers with an | |
| empty `sessions` table still enumerate correctly. | |
| + free-function form) returning `SessionsListResult { sessions, limit, | |
| truncated }`. Derived from the `turns` table so older ledgers with an | |
| empty `sessions` table still enumerate correctly. |
There was a problem hiding this comment.
1 issue found across 9 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/sdk-node/CHANGELOG.md">
<violation number="1" location="packages/sdk-node/CHANGELOG.md:91">
P3: Inconsistent indentation on continuation line inside bullet point: the surrounding continuation lines use 2-space indent but this line starts at column 0.</violation>
</file>
<file name="CHANGELOG.md">
<violation number="1" location="CHANGELOG.md:262">
P3: Changelog text accidentally turned `+ free-function form` into a nested `-` list item, which breaks sentence rendering and changes the meaning of the release note.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
Resolve reviewer findings on #471 and reconcile the path-aware fingerprint with the watch-loop force-scan + OpenCode child-count work: - Don't cache the source fingerprint when a statted source file fails to read/parse (Codex P2). A transient failure (e.g. PermissionDenied later fixed with chmod, moving ctime but not size/mtime) would otherwise match the stored fingerprint and short-circuit before retrying, stranding the session. Threaded a `had_skips` flag through the per-harness loops and skip the fingerprint write when set. Regression test added. - Rewrite `ingest_all_rescans_after_source_file_appended` to append JSONL in place instead of creating a new file (CodeAnt + Cubic), so the test actually covers append-triggered fast-path invalidation. - Keep the path-aware `count:totalBytes:hash` fingerprint AND fold in the OpenCode message-dir child count so a new message file re-opens the gate even when the dir mtime granularity is too coarse to move. - Changelog entries trimmed to impact-first wording per CLAUDE.md. cargo build --workspace, full relayburn-sdk + relayburn-cli suites, and clippy --workspace all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
CodeAnt AI is running Incremental review |
|
CodeAnt AI Incremental review completed. |
User description
Fixes #468.
Problem
ingest()cost ~0.7s on every call even when nothing upstream had changed. Profiling a 94k-turn / ~125 MB ledger showed a no-op sweep spends its time on work that produces zero new turns:load_cursors(parse the cursor blob)clone+save-compare of the cursor mapget_typed)fs::metadataper filefingerprint()already detects "nothing changed" in ~47 ms, but it reflects the post-ingestturnstable, so it can't gate ingest itself.Fix
Add a stat-only source fingerprint —
count:totalBytes:mtimeSumover the Claude/Codex/OpenCode session files (folding in OpenCode's per-session message-dir mtime, the signal that adapter watches for new messages).ingest_allcomputes it up front and, when it matches the value the previous sweep recorded, returns an empty report without loading cursors, cloning, deserializing per-file cursors, or parsing.archive_state.source_fingerprintcolumn (schema v6, in-placeALTER TABLE ADD COLUMN, auto-migrated; legacy''never matches so the first post-upgrade ingest walks normally and records it).state resetandrebuild_derivable, so a wiped/rebuilt ledger always re-walks.Detects every relevant change: file add/remove (count + bytes), in-place append (bytes + mtime — which directory mtimes miss), truncate (bytes), and touch (mtime sum, sensitive in either direction so a backwards clock still registers).
Impact
A no-op
burn ingeston a live ledger dropped from ~0.74s → ~0.19s. The residual is the unavoidable stat walk over the source trees — on this machine an atypically large OpenCode tree (138k message files); on Claude/Codex-dominant ledgers (the polling/refresh case the issue targets) it returns in tens of ms.Tests
source_fingerprint_is_stable_and_moves_on_change— stable across calls; moves on file add and on append.ingest_all_short_circuits_when_source_unchanged— second sweep with unchanged source reportsscanned_sessions == 0(proves the loops never ran, not just that nothing was appended).ingest_all_rescans_after_source_file_appended— a new source file defeats the gate and its turns are ingested + queryable.legacy_v1_ledger_migrates_to_v2_on_open…extended to assert the v6source_fingerprintcolumn is added on migration.Full
cargo test --workspacegreen.🤖 Generated with Claude Code
CodeAnt-AI Description
Skip unchanged ingest sweeps, while still rescanning on real file events
What Changed
ingest()now returns immediately when the source files match the last recorded fingerprint, so a no-op sweep no longer loads cursors or walks every session fileImpact
✅ Faster no-op ingest✅ Fewer unnecessary session scans✅ Safer live file updates during watch mode💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.