Add WhatsApp import from decrypted backup (#136) #160
roborev: Combined Review (
|
Add `import --type whatsapp` command for importing messages from a decrypted WhatsApp msgstore.db into the msgvault unified schema.

New package internal/whatsapp/:
- Reads msgstore.db as read-only SQLite
- Maps WhatsApp schema to msgvault tables (conversations, participants, messages, attachments, reactions, reply threading)
- Batch processing (1000 msgs/batch) with checkpoint/resume
- Optional --contacts for vCard name resolution (update-only, no creation)
- Optional --media-dir for content-addressed media storage
- Imports: text, images, video, audio, voice notes, documents, stickers, GIFs, reactions, replies, group participants
- Skips: system messages, calls, location shares, contacts, polls

Security:
- Media path traversal defense (sanitize, reject absolute/.. paths, boundary check against mediaDir)
- Streaming hash + copy for media (no io.ReadAll, 100MB max)
- E.164 phone validation (reject non-numeric JIDs, 4-15 digit range)
- File permissions 0600/0750 for attachments
- Per-chat reply map scoping to bound memory

Query engine updates:
- Sender filters in DuckDB and SQLite check both message_recipients (email path) and direct sender_id (WhatsApp/chat path)
- Phone number included in sender predicates for from:+447... queries
- MatchesEmpty filters account for sender_id to avoid false positives
- MCP handler routes non-email from values to display name matching
- Parquet cache extended with sender_id, message_type, attachment_count, phone_number, title columns
- Cache schema versioning to force rebuild on column layout changes

Store additions:
- EnsureConversationWithType (parameterized conversation_type)
- EnsureParticipantByPhone (E.164 validated, with identifier row)
- UpdateParticipantDisplayNameByPhone (update-only for contacts)
- EnsureConversationParticipant, UpsertReaction, UpsertMessageRawWithFormat

Tested with 1.19M messages (13.5k conversations, April 2016-present).
b3274f3 to 600d3d8
- Add missing p_direct_sender JOIN in SQLite MatchesEmpty(ViewSenders) path so SenderName filter doesn't reference an unjoined alias
- Add DB fallback for reply threading: when quoted message key_id is not in the per-chat in-memory map, look it up in the messages table to link replies from previous runs or other chats
- Scope GetGmailIDsByFilter to Gmail sources only (JOIN on source_type='gmail' in SQLite, message_type filter in DuckDB Parquet fallback) to prevent WhatsApp IDs leaking into deletion/staging workflows
- Remove misleading checkpoint PageToken (resume not implemented; re-runs are safe via upsert dedup)
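
The reply-threading fallback described above (in-memory map first, then the messages table) could be sketched as follows. The `lookupFunc` type stands in for whatever store query the PR actually uses, and caching a database hit back into the map is an assumption made for this illustration only.

```go
package main

import "fmt"

// lookupFunc is a hypothetical stand-in for a store query that finds a
// message ID by its WhatsApp key_id; the real store API is not shown here.
type lookupFunc func(keyID string) (int64, bool)

// resolveReplyTarget checks the per-chat in-memory map first and falls back
// to the database lookup. A database hit is cached in the map (an assumption
// of this sketch) so later replies to the same message skip the query.
func resolveReplyTarget(keyIDToMsgID map[string]int64, dbLookup lookupFunc, quotedKeyID string) (int64, bool) {
	if quotedKeyID == "" {
		return 0, false
	}
	if id, ok := keyIDToMsgID[quotedKeyID]; ok {
		return id, true
	}
	if dbLookup == nil {
		return 0, false
	}
	id, ok := dbLookup(quotedKeyID)
	if !ok {
		return 0, false
	}
	keyIDToMsgID[quotedKeyID] = id
	return id, true
}

func main() {
	inChat := map[string]int64{"KEY_A": 10}
	// Pretend the database knows one message imported by a previous run.
	db := func(keyID string) (int64, bool) {
		if keyID == "KEY_OLD" {
			return 7, true
		}
		return 0, false
	}
	for _, k := range []string{"KEY_A", "KEY_OLD", "KEY_MISSING"} {
		id, ok := resolveReplyTarget(inChat, db, k)
		fmt.Println(k, id, ok)
	}
}
```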
…rmalization
- ImportContacts now counts DB errors and returns an aggregated error instead of silently dropping failures from UpdateParticipantDisplayNameByPhone
- Remove UK-specific 0→+44 normalization in normalizeVCardPhone; local numbers without country code are now skipped as ambiguous
- Only normalize unambiguous international formats (+ prefix, 00 prefix)
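
A minimal sketch of the normalization rule described above: accept only numbers with a `+` or `00` prefix and skip local numbers as ambiguous. The function name `normalizePhoneSketch` is hypothetical (it is not the PR's `normalizeVCardPhone`), the set of stripped separators is an assumption, and the 4-15 digit bound is borrowed from the E.164 validation mentioned in the first commit.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePhoneSketch is an illustrative stand-in, not the PR's code.
// It returns "+<digits>" for numbers with an explicit country-code
// indicator and reports false for everything else.
func normalizePhoneSketch(raw string) (string, bool) {
	// Drop common visual separators (assumed set); any other non-digit
	// character causes rejection in the loop below.
	s := strings.Map(func(r rune) rune {
		switch r {
		case ' ', '\t', '-', '.', '(', ')':
			return -1
		}
		return r
	}, raw)

	var digits string
	switch {
	case strings.HasPrefix(s, "+"):
		digits = s[1:]
	case strings.HasPrefix(s, "00"):
		digits = s[2:]
	default:
		return "", false // local number: the country is ambiguous
	}
	if len(digits) < 4 || len(digits) > 15 {
		return "", false
	}
	for _, r := range digits {
		if r < '0' || r > '9' {
			return "", false
		}
	}
	return "+" + digits, true
}

func main() {
	for _, in := range []string{"+44 7700 900000", "0044 7700 900000", "07700 900000"} {
		out, ok := normalizePhoneSketch(in)
		fmt.Printf("%q -> %q (%v)\n", in, out, ok)
	}
}
```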
…s error
- MatchesEmpty(ViewSenders) now treats a sender as non-empty when either email_address or phone_number is present (both SQLite and DuckDB)
- Contact import errors now cause the import command to exit with failure instead of printing a warning and returning success
…refresh
- parseVCardFile now unfolds RFC 2425 continuation lines (leading space/tab) so multi-line FN and TEL values are correctly parsed
- Decode QUOTED-PRINTABLE encoded FN values (e.g., =C3=A9 → é)
- EnsureConversationWithType now updates conversation_type and title on existing rows when values have changed (non-empty title only)
- Added tests for folded lines, encoded names, and QP decoding
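
The line unfolding and QUOTED-PRINTABLE decoding described above can be illustrated with the Go standard library. This is a sketch only: `unfoldLines` and `decodeQP` are hypothetical names, and the PR's `parseVCardFile` may be structured quite differently.

```go
package main

import (
	"fmt"
	"io"
	"mime/quotedprintable"
	"strings"
)

// unfoldLines joins continuation lines (those starting with a space or tab)
// onto the previous line, dropping the single leading whitespace character.
func unfoldLines(raw string) []string {
	var out []string
	for _, line := range strings.Split(strings.ReplaceAll(raw, "\r\n", "\n"), "\n") {
		if len(out) > 0 && (strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t")) {
			out[len(out)-1] += line[1:]
			continue
		}
		out = append(out, line)
	}
	return out
}

// decodeQP decodes a QUOTED-PRINTABLE value. On a decode error it returns
// the input unchanged, which is a choice made for this sketch.
func decodeQP(v string) string {
	b, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(v)))
	if err != nil {
		return v
	}
	return string(b)
}

func main() {
	card := "BEGIN:VCARD\r\nFN:Jane\r\n  Doe\r\nTEL:+447700900000\r\nEND:VCARD"
	for _, l := range unfoldLines(card) {
		fmt.Println(l)
	}
	fmt.Println(decodeQP("Ren=C3=A9"))
}
```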
vCard field names are case-insensitive per RFC 2426. Match BEGIN/END/FN/TEL using uppercased key portion while preserving original value bytes.
…erminal injection
- Handle QUOTED-PRINTABLE soft line breaks (= at EOL) during vCard parsing by joining continuation lines before property extraction
- Tighten phone normalization to only accept numbers with explicit country code indicators (+ or 00 prefix), avoiding false matches
- Add SanitizeTerminal() to strip ANSI escape sequences and control characters from untrusted metadata (chat names, snippets) before rendering to terminal/TUI
- Add tests for all three fixes
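
To make the terminal-injection fix concrete, here is a rough sketch of stripping ANSI escape sequences and control characters from untrusted strings. It is not the PR's `SanitizeTerminal()`: the lowercase name, the regular expression (CSI and OSC sequences only), and the decision to drop every control character including newlines and tabs are all assumptions of this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// ansiSeq matches CSI sequences (ESC [ ... final byte) and OSC sequences
// (ESC ] ... terminated by BEL or ESC \). Other escape forms are handled
// by the control-character pass below, which removes the ESC byte itself.
var ansiSeq = regexp.MustCompile(`\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)`)

// sanitizeTerminal is an illustrative stand-in for the PR's helper.
func sanitizeTerminal(s string) string {
	s = ansiSeq.ReplaceAllString(s, "")
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // drop C0/C1 control characters, including ESC and BEL
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", sanitizeTerminal("\x1b[31mFamily chat\x1b[0m"))
	fmt.Printf("%q\n", sanitizeTerminal("a\x1b]0;new title\x07b"))
}
```

Removing whole sequences before the per-character pass matters: dropping only the ESC byte would leave fragments such as `[31m` visible in the output.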
Databases created before the WhatsApp feature have no phone_number, sender_id, message_type, attachment_count, or title columns because CREATE TABLE IF NOT EXISTS is a no-op for existing tables. This breaks the MCP server and cache builder with:

    Binder Error: Column "phone_number" in REPLACE list not found

Add ALTER TABLE migrations in InitSchema() for all v2 columns. Silently ignores "duplicate column name" errors for databases that already have the columns.
Existing v2 caches may have been built before the schema migration added phone_number to the participants table, resulting in Parquet files without the column. Bumping to v3 ensures build-cache detects the mismatch and triggers a full rebuild automatically.
Post-rebase test failures on v0.9.0

After rebasing

Root cause: The test schemas in

Fix: Add to each test schema:

    attachment_count INTEGER DEFAULT 0,
    sender_id INTEGER,
    message_type TEXT NOT NULL DEFAULT 'email',

Build passes clean, and all other test packages pass. Just these 5 tests in
From Claude:

"The test failures are a legitimate bug in this branch, not a rebase artifact. The branch adds three new

The build cache query in cmd/msgvault/cmd/build_cache.go (lines 249-252) now references all three columns:

    COALESCE(TRY_CAST(m.attachment_count AS INTEGER), 0) as attachment_count,

But the test helper setupTestSQLite() in build_cache_test.go (line 43) still creates the messages table

This is the contributor's responsibility to fix: the branch modified the production schema and the build

The CI on main passes because main doesn't have these columns in the query; they were introduced by this
Summary
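- Add `import --type whatsapp` command for importing messages from a decrypted WhatsApp `msgstore.db` backup
- Sender filters in the DuckDB and SQLite query engines also check the direct `sender_id` path (not just email-based `message_recipients`)
- MCP handler routes non-email `from` values to display name and phone number matching
- Parquet cache extended with `sender_id`, `message_type`, `attachment_count`, `phone_number`, and `title` columns
What's in this PR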
`import --type whatsapp`
    msgvault import --type whatsapp --phone "+447700900000" /path/to/msgstore.db
- Reads `msgstore.db` (SQLite) as read-only
- … `_id`)
- … syncfull.go)
- `--media-dir` for copying media files to content-addressed storage
- `--contacts` for importing vCard contacts (display name resolution only; updates existing participants, does not create new ones)
What's imported
Skipped: system messages (type 7), calls (15/64/66), location shares (9), contacts (10), polls (99), statuses/stories (11).
Security
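- Media path traversal defense: sanitize, reject absolute and `..`-prefixed paths, verify resolved path stays within `--media-dir` boundary
- Streaming hash + copy for media via `io.TeeReader` (no `io.ReadAll`), capped at 100MB max file size
- E.164 phone validation (reject non-numeric JIDs, 4-15 digit range), … `+` prefix in store layer
- Reply map (`keyIDToMsgID`) cleared per-chat to prevent unbounded growth across 13k+ conversations
Query engine updates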
- DuckDB and SQLite `buildFilterConditions` now check both `message_recipients` (email path) and direct `sender_id` (WhatsApp/chat path) for sender filters
- Phone number included in sender predicates for `from:+447...` queries
- `MatchesEmpty` filters account for `sender_id` to avoid false positive "no sender" matches on WhatsApp messages
- MCP handler `from` parameter: if value contains `@`, filters by email; if starts with `+`, filters by phone; otherwise filters by display name
Store additions
- `EnsureConversationWithType`: like `EnsureConversation` but accepts a `conversationType` parameter
- `EnsureParticipantByPhone`: get-or-create by phone number with E.164 validation
- `UpdateParticipantDisplayNameByPhone`: update-only (for contacts import, does not create participants)
- `EnsureConversationParticipant`: insert-or-ignore into conversation_participants
- `UpsertReaction`: insert-or-ignore into reactions table
- `UpsertMessageRawWithFormat`: like `UpsertMessageRaw` but accepts a format parameter
Design note: `import --type` vs `import-whatsapp`
The existing pattern uses `import-mbox` / `import-emlx`. This PR uses `import --type whatsapp` with a dispatcher, which is extensible for future sources (iMessage, Telegram, etc.) without adding new top-level commands. Happy to rename to `import-whatsapp` if you prefer consistency with the existing pattern.
MCP support
WhatsApp messages are fully queryable via MCP after import and cache rebuild:
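
As an illustration of how an MCP query reaches WhatsApp senders, the `from` routing rule stated under the query engine updates (contains `@` means email, leading `+` means phone, otherwise display name) can be sketched as below. The `senderFilter` type and `routeFrom` function are hypothetical names for this example and are not the handler's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// senderFilter is a hypothetical result type: at most one field is set.
type senderFilter struct {
	Email       string
	Phone       string
	DisplayName string
}

// routeFrom applies the routing rule from the PR description to a raw
// `from` value. Trimming whitespace first is an assumption of this sketch.
func routeFrom(value string) senderFilter {
	v := strings.TrimSpace(value)
	switch {
	case v == "":
		return senderFilter{}
	case strings.Contains(v, "@"):
		return senderFilter{Email: v}
	case strings.HasPrefix(v, "+"):
		return senderFilter{Phone: v}
	default:
		return senderFilter{DisplayName: v}
	}
}

func main() {
	for _, v := range []string{"alice@example.com", "+447700900000", "Alice Smith"} {
		fmt.Printf("%q -> %+v\n", v, routeFrom(v))
	}
}
```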
Test plan
- `go test ./internal/whatsapp/...`: mapping and contacts unit tests
- `go test ./internal/query/...`: DuckDB/SQLite engine tests (updated fixtures)
- `go test ./internal/store/...`: store tests pass
- `go test ./internal/mcp/...`: MCP handler tests pass
- `go vet ./...`: clean
- `gosec ./...`: no findings in PR files (expected false positives only)
- `msgvault import --type whatsapp --phone "+44..." /path/to/msgstore.db --contacts /path/to/contacts.vcf`
- `msgvault build-cache --full-rebuild` after import
- … `msgvault search`, TUI, and MCP queries
Tested with a real 1.19M message WhatsApp backup (13.5k conversations, April 2016–present). Import completes successfully, messages are searchable via TUI and MCP.