
Add --clean flag to sync-full for fresh re-sync after bulk deletions#123

Open
robelkin wants to merge 5 commits into wesm:main from robelkin:resync-clean

Conversation

@robelkin
Contributor

Background

I'm aware that msgvault is primarily an archiving tool. However, once I was able to use it with an LLM via MCP, I could very efficiently cut out a lot of the newsletter, promotional, and other spam that I'd accumulated over the last 20-25 years.

It seemed crazy to keep all of that information in my local database when I had happily deleted it from Gmail. So I added a flag that lets you re-sync the database for a particular account, pulling down a fresh copy once you've completed a round of mass deletions.

It's effectively the same as wiping the database and running sync-full, but that isn't viable if you have multiple accounts.

Usage

# Delete local data for one account and re-sync from Gmail
msgvault sync-full you@gmail.com --clean

# Skip confirmation prompt (for scripting)
msgvault sync-full you@gmail.com --clean --yes

What it does

  1. Shows what will be deleted (message/conversation/label counts)
  2. Requires confirmation before proceeding
  3. Deletes all local data for the specified account only
  4. Immediately runs a full sync to re-download from Gmail

Other accounts in the database are completely untouched.

Technical details

Performance optimization

The initial implementation used a single transaction with CASCADE deletes, which took ~2.5 hours for 125k messages. The optimized version completes in ~2-5 minutes by:

  • Disabling FK checks during bulk delete (PRAGMA foreign_keys = OFF)
  • Explicit child table deletion - deletes from message_bodies, message_raw, message_recipients, message_labels, attachments, reactions directly instead of relying on CASCADE triggers
  • Batched deletes - 5000 rows per batch to keep WAL size manageable
  • Progress reporting - shows current table and deletion percentage
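The batching arithmetic behind those last two bullets is simple; here is an illustrative Go sketch (the `batchSize` constant name and the integer-percentage reporting are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// batchSize mirrors the 5000-row batches described above (assumed constant name).
const batchSize = 5000

// numBatches returns how many LIMIT-ed DELETE statements are needed
// to remove `total` rows in chunks of batchSize.
func numBatches(total int) int {
	if total <= 0 {
		return 0
	}
	return (total + batchSize - 1) / batchSize
}

// progressPercent reports deletion progress after `deleted` of `total` rows.
func progressPercent(deleted, total int) int {
	if total == 0 {
		return 100
	}
	return deleted * 100 / total
}

func main() {
	fmt.Println(numBatches(125000))             // 25 batches for the 125k-message case
	fmt.Println(progressPercent(62500, 125000)) // 50
}
```

Keeping each batch at 5000 rows bounds the WAL growth per transaction while still letting the percentage counter advance smoothly.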

Safety features

  • Requires --yes flag or interactive confirmation
  • Only operates on the specified account (isolation verified by tests)
  • Double Ctrl+C pattern: first saves checkpoint, second force quits
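The double Ctrl+C behavior can be modeled as a tiny state machine; a minimal sketch (the real CLI presumably wires this to `os/signal.Notify`, which is not shown here):

```go
package main

import "fmt"

// interruptHandler models the double Ctrl+C pattern described above:
// the first interrupt requests a checkpoint save, any subsequent one
// forces a quit. Illustrative only; names are assumptions.
type interruptHandler struct{ count int }

// onInterrupt returns the action to take for this interrupt.
func (h *interruptHandler) onInterrupt() string {
	h.count++
	if h.count == 1 {
		return "save-checkpoint"
	}
	return "force-quit"
}

func main() {
	h := &interruptHandler{}
	fmt.Println(h.onInterrupt()) // save-checkpoint
	fmt.Println(h.onInterrupt()) // force-quit
}
```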

New store methods

  • ResetSourceData(sourceID) - simple wrapper for backwards compatibility
  • ResetSourceDataWithProgress(sourceID, callback) - batched delete with progress reporting
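The relationship between the two methods is presumably a thin delegation; a sketch under assumed signatures (the actual callback shape in the PR may differ):

```go
package main

import "fmt"

// ProgressFunc is an assumed shape for the callback passed to
// ResetSourceDataWithProgress: the table being cleared plus rows
// deleted so far out of the total.
type ProgressFunc func(table string, deleted, total int64)

// resetSourceData sketches the backwards-compatible wrapper described
// above: it delegates to the progress-aware variant with a no-op callback.
func resetSourceData(sourceID int64, withProgress func(int64, ProgressFunc) error) error {
	return withProgress(sourceID, func(string, int64, int64) {})
}

func main() {
	err := resetSourceData(42, func(id int64, cb ProgressFunc) error {
		cb("messages", 5000, 125000) // simulated progress event
		fmt.Println("resetting source", id)
		return nil
	})
	fmt.Println(err == nil) // true
}
```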

Tests added

  • TestResetSourceData - verifies basic functionality
  • TestResetSourceData_IsolatesAccounts - critical test ensuring one account's reset doesn't affect other accounts' messages, conversations, labels, or sync state

🤖 Generated with Claude Code

Adds a --clean option that deletes all local data for an account
(messages, conversations, labels, sync history) and re-syncs from
scratch. Useful for recovering from corrupted state or resetting
staged/deleted items.

Features:
- ResetSourceData() in store with transaction and CASCADE deletes
- Confirmation prompt showing what will be deleted (use --yes to skip)
- Double Ctrl+C pattern: first saves checkpoint, second force quits
- Account isolation tests to verify other accounts are untouched

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major performance improvements to ResetSourceData:

1. Disable foreign key checks during bulk delete (avoids validation overhead)
2. Delete child tables explicitly (bypasses CASCADE trigger overhead)
3. Batch deletes with LIMIT 5000 (keeps transactions small)
4. Add progress callback for real-time status updates

Expected speedup: 2.5 hours -> ~2-5 minutes for 125k messages.

The CLI now shows per-table progress and message deletion percentage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Local MCP configuration should not be tracked.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@roborev-ci

roborev-ci bot commented Feb 10, 2026

roborev: Combined Review

Verdict: Changes introduce two High-severity issues and one Medium-severity issue; no Critical findings.

High

  1. internal/store/sync.go:425, 428, 584
    PRAGMA foreign_keys is connection-local in SQLite, but s.db.Exec uses pooled connections. This risks returning a connection with foreign keys disabled to the pool and makes the intended PRAGMA ineffective for subsequent deletes.
    Fix: Acquire a dedicated *sql.Conn with s.db.Conn(ctx) and run PRAGMA + all batched deletes on that connection, ensuring foreign_keys is re-enabled before returning it to the pool.

  2. internal/store/sync.go (function deleteChildTableBatched)
    The batching loop never advances past the first LIMIT of message IDs because it repeatedly selects the same first batch and stops when RowsAffected() is 0. This can leave child rows for later messages untouched and creates orphans after parent deletes with foreign keys off.
    Fix: Iterate deterministically through message IDs (e.g., ORDER BY id with lastID tracking), or prefetch IDs per batch and delete children for that batch before advancing.
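The fix for finding 1 amounts to pinning a single connection for the whole sequence. A sketch of the statement ordering the reviewers describe (statement text and the helper name are assumptions; the real fix would run these via `s.db.Conn(ctx)` and `Conn.ExecContext`):

```go
package main

import "fmt"

// pragmaSequence lists, in order, the statements to run on one dedicated
// connection so foreign_keys is disabled, the child-table deletes run,
// and the flag is restored on that same connection before it returns to
// the pool. The DELETE bodies are placeholders.
func pragmaSequence(childTables []string) []string {
	stmts := []string{"PRAGMA foreign_keys = OFF"}
	for _, t := range childTables {
		stmts = append(stmts, fmt.Sprintf("DELETE FROM %s WHERE ...", t))
	}
	return append(stmts, "PRAGMA foreign_keys = ON")
}

func main() {
	for _, s := range pragmaSequence([]string{"message_bodies", "message_labels"}) {
		fmt.Println(s)
	}
}
```

Because every statement goes through the same `*sql.Conn`, the PRAGMA provably applies to the deletes, and re-enabling on that connection prevents a tainted connection from re-entering the pool.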

Medium

  1. internal/store/sync.go:409
    Long-running reset loop does not accept context.Context, so Ctrl+C cancellation from the CLI does not propagate to DB calls or loop control.
    Fix: Update signature to ResetSourceDataWithProgress(ctx context.Context, ...), use ExecContext, and check ctx.Err() inside loops.

Low
None.


Synthesized from 4 reviews (agents: codex, gemini | types: review, security)

Wrap deferred s.db.Exec in anonymous function with explicit
blank identifier assignment to satisfy errcheck linter.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@roborev-ci

roborev-ci bot commented Feb 10, 2026

roborev: Combined Review

Overall: Needs changes — two high-risk data integrity issues, plus test and cleanup gaps.

Critical
None.

High

  1. internal/store/sync.go:424-427
    PRAGMA foreign_keys = OFF is executed on pooled *sql.DB, so it may apply to a different connection than the deletes, and a tainted connection can return to the pool with FKs disabled. This risks inconsistent delete behavior and future integrity issues.
    Fix: Use a dedicated connection (s.db.Conn(ctx)) or a single-connection transaction; run PRAGMA + deletes + re-enable on that same connection.

  2. internal/store/sync.go:597-618
    deleteChildTableBatched repeatedly selects the same first LIMIT ? message IDs because parents aren’t deleted until later; after the first batch, RowsAffected becomes 0 and the loop exits, leaving orphaned child rows.
    Fix: Select only message IDs that still have child rows (join to child table), or batch delete by child table row IDs.

Medium

  1. internal/store/sync_test.go:210
    Tests use small message counts (< batchSize) and don’t validate child table cleanup, so the batching bug is unexercised.
    Fix: Add a test with >5000 messages (or make batch size configurable for tests) and assert all child rows are removed.

Low

  1. internal/store/sync.go:427-428
    Error from re-enabling FKs is ignored (_ = ...), risking silent failure.
    Fix: Log the error (or propagate if possible) when re-enabling fails.

Synthesized from 4 reviews (agents: codex, gemini | types: security, review)

1. PRAGMA on pooled connection: PRAGMA foreign_keys = OFF was executed
   on *sql.DB which is a connection pool. The PRAGMA might apply to a
   different connection than the subsequent deletes, and a connection
   could return to the pool with FKs disabled.

   Fix: Use sql.Conn to get a dedicated connection for all operations.

2. Batching logic bug: deleteChildTableBatched repeatedly selected the
   same first LIMIT N message IDs because messages weren't deleted yet.
   After the first batch deleted those children, subsequent batches
   found 0 rows and exited early, leaving orphaned child rows.

   Fix: Delete by child table rowid with a JOIN to messages, ensuring
   each batch finds actual existing child rows to delete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@roborev-ci

roborev-ci bot commented Feb 10, 2026

roborev: Combined Review

Summary verdict: 1 high, 2 low findings; no critical issues.

High

  1. internal/store/sync.go (lines 430-575): PRAGMA foreign_keys = OFF on a pooled connection is only re-enabled on the success path. Any early return leaves FKs disabled when the connection returns to the pool, risking integrity corruption. Add a defer to restore FKs (ideally with context.Background()), optionally keeping an explicit re-enable on the happy path.

Low

  1. cmd/msgvault/cmd/syncfull.go (lines 144-146 in diff): Errors from COUNT(*) Scan calls are ignored; failures show 0 counts and can mislead before a destructive action. Handle and abort on error.
  2. internal/store/sync_test.go: Missing test for error paths in ResetSourceDataWithProgress to ensure FK checks are restored on the same pooled connection after failure. Add a failure-injection test.

Synthesized from 4 reviews (agents: codex, gemini | types: security, review)

@wesm
Owner

wesm commented Feb 11, 2026

Sounds like a good idea to me. I have deleted over a million emails from my Gmail account but I don't want to delete them from my archive, so I wouldn't use this myself, though I can understand why someone would want to. I had actually planned to add a permadelete option to the CLI and TUI to delete groups of emails from the local archive using the same UX as the remote delete-staged API (maybe that should be called delete-remote for clarity). If you delete locally, then we probably also need a msgvault gc or something that will vacuum the SQLite database.

I'll take a close look at this and work on it when I can, so bear with me; I have a lot going on right now!

@robelkin
Contributor Author

No worries at all. I appreciate that this repo has taken off a little, and it's great to hear you see the value here! Agreed on the vacuuming. I'll hold off on playing with all that, since it sounds like you might fold this into a wider consideration and I don't want to get in the way, but it sounds sensible!
