Skip to content

Add WhatsApp import from decrypted backup (#136)#160

Open
eddowding wants to merge 9 commits intowesm:mainfrom
eddowding:feat/whatsapp-import
Open

Add WhatsApp import from decrypted backup (#136)#160
eddowding wants to merge 9 commits intowesm:mainfrom
eddowding:feat/whatsapp-import

Conversation

@eddowding
Copy link

@eddowding eddowding commented Feb 26, 2026

Summary

  • Add import --type whatsapp command for importing messages from a decrypted WhatsApp msgstore.db backup
  • Extend sender/query filters in both DuckDB and SQLite engines to support WhatsApp's sender_id path (not just email-based message_recipients)
  • Update MCP handler to route non-email from values to display name and phone number matching
  • Extend Parquet cache schema with sender_id, message_type, attachment_count, phone_number, and title columns
  • Add cache schema versioning to force rebuild on column layout changes

What's in this PR

import --type whatsapp

  • New CLI: msgvault import --type whatsapp --phone "+447700900000" /path/to/msgstore.db
  • Reads a decrypted WhatsApp msgstore.db (SQLite) as read-only
  • Maps WhatsApp schema to msgvault's existing multi-source tables
  • Batch processing (1000 messages at a time) for memory efficiency
  • Checkpoint/resume via sync_run cursor (last processed _id)
  • Progress callback with CLIProgress (same pattern as syncfull.go)
  • Optional --media-dir for copying media files to content-addressed storage
  • Optional --contacts for importing vCard contacts (display name resolution only — updates existing participants, does not create new ones)
  • Idempotent: re-running uses upsert dedup, no duplicates

What's imported

WhatsApp msgvault Notes
Text messages (type 0) messages + message_bodies Full text
Images (type 1) messages + attachments Metadata; file if --media-dir
Videos (type 2) messages + attachments Metadata only
Audio (type 3) messages + attachments Metadata
Voice notes (type 5) messages + attachments Metadata; file if --media-dir
Documents (type 13) messages + attachments Metadata
Stickers (type 90) messages + attachments Metadata; file if --media-dir
GIFs (type 4) messages + attachments Metadata
Reactions reactions Emoji reactions
Replies/quotes reply_to_message_id Via key_id lookup (scoped per-chat)
Group participants conversation_participants Role (admin/member)
Captions message_bodies.body_text From media_caption
1:1 chats conversations (direct_chat)
Group chats conversations (group_chat) With title

Skipped: system messages (type 7), calls (15/64/66), location shares (9), contacts (10), polls (99), statuses/stories (11).

Security

  • Media path traversal defense: sanitize paths from DB, reject absolute and .. prefixed paths, verify resolved path stays within --media-dir boundary
  • Streaming media processing: hash + copy via io.TeeReader (no io.ReadAll), capped at 100MB max file size
  • E.164 phone validation: reject non-numeric JID users, enforce 4–15 digit range, require + prefix in store layer
  • File permissions: 0600 for attachment files, 0750 for directories
  • Bounded memory: reply-threading map (keyIDToMsgID) cleared per-chat to prevent unbounded growth across 13k+ conversations

Query engine updates

  • DuckDB and SQLite buildFilterConditions now check both message_recipients (email path) and direct sender_id (WhatsApp/chat path) for sender filters
  • Phone number included in sender predicates for from:+447... queries
  • MatchesEmpty filters account for sender_id to avoid false positive "no sender" matches on WhatsApp messages
  • MCP from parameter: if value contains @, filters by email; if starts with +, filters by phone; otherwise filters by display name
  • Parquet cache schema extended with new columns; cache schema versioning forces full rebuild when column layout changes

Store additions

  • EnsureConversationWithType — like EnsureConversation but accepts conversationType parameter
  • EnsureParticipantByPhone — get-or-create by phone number with E.164 validation
  • UpdateParticipantDisplayNameByPhone — update-only (for contacts import, does not create participants)
  • EnsureConversationParticipant — insert-or-ignore into conversation_participants
  • UpsertReaction — insert-or-ignore into reactions table
  • UpsertMessageRawWithFormat — like UpsertMessageRaw but accepts format parameter

Design note: import --type vs import-whatsapp

The existing pattern uses import-mbox / import-emlx. This PR uses import --type whatsapp with a dispatcher, which is extensible for future sources (iMessage, Telegram, etc.) without adding new top-level commands. Happy to rename to import-whatsapp if you prefer consistency with the existing pattern.

MCP support

WhatsApp messages are fully queryable via MCP after import and cache rebuild:

# Search by sender name
list_messages(from: "Jane Smith")

# Search by phone
list_messages(from: "+447700900000")

# Full-text search
search_messages(query: "dinner plans", from: "Jane Smith")

# Conversation listing includes WhatsApp groups
list_conversations()

Test plan

  • go test ./internal/whatsapp/... — mapping and contacts unit tests
  • go test ./internal/query/... — DuckDB/SQLite engine tests (updated fixtures)
  • go test ./internal/store/... — store tests pass
  • go test ./internal/mcp/... — MCP handler tests pass
  • go vet ./... — clean
  • gosec ./... — no findings in PR files (expected false positives only)
  • Manual test: msgvault import --type whatsapp --phone "+44..." /path/to/msgstore.db --contacts /path/to/contacts.vcf
  • Manual test: msgvault build-cache --full-rebuild after import
  • Manual test: verify via msgvault search, TUI, and MCP queries

Tested with a real 1.19M message WhatsApp backup (13.5k conversations, April 2016–present). Import completes successfully, messages are searchable via TUI and MCP.

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (44cc74f)

Verdict: The PR introduces valuable WhatsApp import
functionality but requires important fixes for path traversal vulnerabilities, memory consumption, cache schema compatibility, and storage path routing before merging.

High Severity

Path traversal / arbitrary file read via media paths

  • Files: internal/whatsapp/importer.go:414-433, 477
    -486
  • Issue: filepath.Join(mediaDir, relPath) is used directly with untrusted DB data. Absolute paths (e.g., /etc/passwd) and directory traversal sequences (../../...) are not blocked before os.Open. A crafted backup database can trick the importer
    into reading arbitrary local files outside the intended mediaDir.
  • Remediation: Sanitize relPath using filepath.Clean(), reject absolute paths, and strictly verify that the resulting joined path resides within the mediaDir boundary before opening. Fall back to using just the base filename if
    necessary.

Media copy writes to the wrong root path

  • Files: internal/whatsapp/importer.go:454, 460
  • Issue: handleMediaFile builds storageDir as attachments/<hashprefix> and then anchors writes with filepath .Dir("") (effectively the current working directory) rather than cfg.AttachmentsDir(). This stores imported attachments outside the msgvault data directory, making them unretrievable.
  • Remediation: Pass the attachments root directory into the importer options. Write to `/<hash[:2]

/and store thestorage_pathas<hash[:2]>/(without theattachments/` prefix).

Cache schema compatibility regression for existing Parquet caches

  • Files: internal/query/duckdb.go:178, 197;
    cmd/msgvault/cmd/build_cache.go:98, 138
  • Issue: DuckDB queries now require new Parquet columns (attachment_count, sender_id, message_type, phone_number, title). The incremental cache logic lacks a
    schema-version gate and won't force a rebuild when old-format files exist, causing query failures.
  • Remediation: Add cache schema versioning and force a full rebuild on mismatch, or explicitly handle mixed schemas via union_by_name and defensive column handling.

Medium Severity

Unbounded memory consumption when processing media files

  • Files: internal/whatsapp/importer.go:440-444
  • Issue: io.ReadAll(f) loads the entire media file into memory. Large or malicious WhatsApp media files (like videos) can
    cause severe memory pressure or OOM crashes.
  • Remediation: Replace io.ReadAll with streaming. Hash the file incrementally via io.Copy (to hash.Hash), stream the content to the destination, and enforce a maximum allowed media size with io.LimitedReader.

**
from:+<phone> MCP filter path is routed to email filtering**

  • Files: internal/mcp/handlers.go:327; internal/query/sqlite.go:289; internal/query/duckdb.go:701
  • Issue
    :
    MCP treats +... as filter.Sender, but the sender predicates only compare email_address, not phone_number. Filtering by WhatsApp phone senders will fail to match.
  • Remediation: Include phone_number = ? in the sender predicates (for both recipient and direct
    -sender paths), or route +... inputs to a phone-aware field.

Contact import behavior contradicts contract

  • Files: internal/whatsapp/contacts.go:19, 39
  • Issue: ImportContacts claims to update existing participants, but it unconditionally calls Ensure ParticipantByPhone. This creates new participants and artificially inflates the matched count on every call.
  • Remediation: Implement "update-if-exists" semantics for contact imports, and only increment matched when an existing participant is actually found and updated.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (b11426d)

This PR implements the new WhatsApp import feature but introduces several medium-severity issues requiring fixes
, including an MCP context-poisoning vulnerability, potential memory exhaustion, and parsing/query regressions.

Medium Severity

  • Sender Identity Spoofing / Context Poisoning in MCP Tool
    • File: internal/mcp/handlers.go:327

Description: The condition strings.Contains(v, " @") incorrectly checks for a space before the @. Standard emails (user@example.com) fail this check and fall through to display name matching (filter.SenderName). This creates a spoofing vulnerability where an attacker can forge a display name to match
a target email and poison the LLM's context.
* Remediation: Remove the leading space: strings.Contains(v, "@").

  • Unbounded Memory Growth on Large Imports (DoS)

    • File: internal/whatsapp/importer. go:100, internal/whatsapp/importer.go:258
    • Description: keyIDToMsgID stores all imported message keys for the full run. A very large or crafted msgstore.db can force high memory usage and crash the process
      .
    • Remediation: Bound memory use by scoping this map per chat/batch, or replace it with a DB-backed lookup (indexed by source message key) combined with a deferred reply-resolution pass.
  • WhatsApp JID Parsing Error

    • File:
      internal/whatsapp/mapping.go:126
    • Description: strings.TrimSuffix(user, " @"+server) incorrectly expects a space before the @ symbol, causing it to fail at stripping standard WhatsApp JID suffixes. Additionally, the inline comment above it is mang
      led.
    • Remediation: Change " @"+server to "@" + server and correct the inline comment.
  • Regression in "Empty Sender" Filters

    • File: internal/query/duckdb.go:706, internal/ query/duckdb.go:721, internal/query/sqlite.go:292, internal/query/sqlite.go:313
    • Description: MatchesEmpty(ViewSenders) and MatchesEmpty(ViewSenderNames) only check
      message_recipients and ignore messages.sender_id. WhatsApp messages can have a valid direct sender without a from recipient row, causing them to be incorrectly treated as having an empty sender/name.
    • Remediation: Include sender_id -> participants in the empty
      checks for both query engines.
  • Database Pollution from Invalid Phone Values

    • File: internal/store/messages.go:782, internal/whatsapp/mapping.go:121, internal/whatsapp/importer.go:147
    • Description: EnsureParticipantByPhone accepts any string (including empty/invalid), and normalizePhone can return non-E.164 values from unexpected JID forms. This can insert garbage participants and pollute conversation membership/sender links.
    • Remediation: Validate
      phone numbers (strict E.164) before insertion, and skip or log invalid sender/member IDs.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (ed2bf62)

Summary Verdict: The code demonstrates strong security practices overall, but contains several medium-severity issues in the new WhatsApp importer regarding unbounded memory growth, silent
error ignoring, and hardcoded regional settings.

Medium Severity Findings

  1. Unbounded Memory Growth in keyIDToMsgID

    • Files: [internal/whatsapp/importer.go:96](/home/roborev/repos/msgvault/internal/whatsapp/
      importer.go:96), internal/whatsapp/importer.go:261
    • Description: The keyIDToMsgID map grows for the entire import
      and is never bounded or pruned. A very large or malicious msgstore.db with many unique key_id values can drive unbounded memory growth and crash the importer (availability/DoS risk).
    • Suggested Remediation: Replace the in-memory global map with a bounded cache plus DB lookup fallback, or
      persist key mappings in a temp table keyed by source_message_id and query on demand.
  2. Incorrect Metric Tracking and Limit Enforcement

    • Files: [internal/whatsapp/importer.go:171](/home/roborev/repos/msgvault/internal/
      whatsapp/importer.go:171), internal/whatsapp/importer.go:255, [internal/whatsapp/importer.go:272](/home
      /roborev/repos/msgvault/internal/whatsapp/importer.go:272)
    • Description: summary.MessagesAdded++ and totalAdded++ happen after every successful UpsertMessage, but upsert may simply update an existing row. On re-
      imports, the output can claim many messages were "added" when none were newly inserted, and the --limit flag can trigger early based on these inflated counts.
    • Suggested Remediation: Have UpsertMessage return an inserted bool flag, increment MessagesAdded only when inserted == true , and track a separate "processed" counter for limit semantics if desired.
  3. Silent Error Dropping in Contact Import

    • File: [internal/whatsapp/contacts.go:35](/home/roborev/repos/msgvault/internal/whatsapp/contacts.go:
  • Description: In ImportContacts, errors from UpdateParticipantDisplayNameByPhone are ignored unless err == nil && updated. If database writes fail, the caller still receives success-like results.
  • Suggested Remediation: Accumulate and return the first error (or an aggregated error
    count) while continuing best-effort processing.
  1. Hardcoded Country Code for Phone Numbers

    • File: internal/whatsapp/contacts.go:150
    • **
      Description:** A hardcoded UK country code (+44) is prepended to any local phone number starting with 0. This will corrupt contact numbers for users located in other countries.
    • Suggested Remediation: Make the default regional country code configurable or pass it as a parameter in ImportOptions.
  2. Silent Error Dropping in Attachment Metadata Update

    • File: internal/whatsapp/importer.go:420
    • Description: Silently ignoring the
      error from imp.store.DB().Exec when updating attachments metadata masks potential SQL schema errors (e.g., if width or duration_ms columns do not exist).
    • Suggested Remediation: Log or return the error to ensure schema mismatches are detectable.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (b3274f3)

The PR successfully implements the WhatsApp importer and query extensions, but requires fixes for a high-severity SQL generation bug and several medium-severity logic and data-handling issues
.

🔴 High Severity

SQL Generation Error with Sender Filters

  • Location: internal/query/sqlite.go:290 (and internal/query/sqlite.go:297)
  • Issue: When an empty sender filter (MatchesEmpty(View Senders)) and a sender name filter are used together, the LEFT JOIN for p_direct_sender is skipped by the first condition. This leads to a SQL runtime syntax error in the second condition (no such column: p_direct_sender.display_name).
  • Recommendation
    :
    Add LEFT JOIN participants p_direct_sender ON p_direct_sender.id = m.sender_id to the MatchesEmpty(ViewSenders) join block, or rewrite the sender filters to use EXISTS subqueries to avoid alias coupling.

🟡 Medium Severity

**
Reply Threads Break Across Resumes**

  • Location: internal/whatsapp/importer.go:329
  • Issue: The keyIDToMsgID map only tracks messages imported in the current memory process. If the import is resumed, quoting a message imported in a previous run
    silently drops the reply_to_message_id link.
  • Recommendation: If quoted.QuotedKeyID is not found in keyIDToMsgID, fall back to querying the database to resolve the message_id using the source_message_id.

**
Missing FTS Indexing Fallback for Direct Chats**

  • Location: internal/whatsapp/importer.go:290
  • Issue: Received messages in direct chats will have an empty sender address in the FTS index because the fallback to chat.User (which is correctly
    used during database senderID resolution) is missing in the FTS indexing block.
  • Recommendation: Add the fallback logic to FTS resolution: else if chat.GroupType == 0 && waMsg.FromMe == 0 { senderAddr = normalizePhone(chat.User, chat. Server) }.

Hardcoded UK Phone Normalization

  • Location: internal/whatsapp/contacts.go:145
  • Issue: normalizeVCardPhone automatically converts any local number starting with 0 to +44.... This UK-specific hardcoding
    will mis-normalize non-UK contacts, causing missed or incorrect participant name matches.
  • Recommendation: Remove the hardcoded UK conversion unless a country code is explicitly configured; otherwise, only accept explicit international forms (+ / 00).

Silenced Database Update Errors

  • Location: internal/ whatsapp/contacts.go:39
  • Issue: ImportContacts silently drops database update errors (UpdateParticipantDisplayNameByPhone) and still returns success, potentially hiding failed contact imports from the user.
  • Recommendation: Accumulate and return at least the first update error (or an aggregated error
    wrapper), while optionally continuing best-effort processing.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

Add `import --type whatsapp` command for importing messages from a
decrypted WhatsApp msgstore.db into the msgvault unified schema.

New package internal/whatsapp/:
- Reads msgstore.db as read-only SQLite
- Maps WhatsApp schema to msgvault tables (conversations, participants,
  messages, attachments, reactions, reply threading)
- Batch processing (1000 msgs/batch) with checkpoint/resume
- Optional --contacts for vCard name resolution (update-only, no creation)
- Optional --media-dir for content-addressed media storage
- Imports: text, images, video, audio, voice notes, documents,
  stickers, GIFs, reactions, replies, group participants
- Skips: system messages, calls, location shares, contacts, polls

Security:
- Media path traversal defense (sanitize, reject absolute/.. paths,
  boundary check against mediaDir)
- Streaming hash + copy for media (no io.ReadAll, 100MB max)
- E.164 phone validation (reject non-numeric JIDs, 4-15 digit range)
- File permissions 0600/0750 for attachments
- Per-chat reply map scoping to bound memory

Query engine updates:
- Sender filters in DuckDB and SQLite check both message_recipients
  (email path) and direct sender_id (WhatsApp/chat path)
- Phone number included in sender predicates for from:+447... queries
- MatchesEmpty filters account for sender_id to avoid false positives
- MCP handler routes non-email from values to display name matching
- Parquet cache extended with sender_id, message_type, attachment_count,
  phone_number, title columns
- Cache schema versioning to force rebuild on column layout changes

Store additions:
- EnsureConversationWithType (parameterized conversation_type)
- EnsureParticipantByPhone (E.164 validated, with identifier row)
- UpdateParticipantDisplayNameByPhone (update-only for contacts)
- EnsureConversationParticipant, UpsertReaction, UpsertMessageRawWithFormat

Tested with 1.19M messages (13.5k conversations, April 2016-present).
@eddowding eddowding force-pushed the feat/whatsapp-import branch from b3274f3 to 600d3d8 Compare February 26, 2026 12:31
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (600d3d8)

Verdict: The PR introduces solid WhatsApp import capabilities and secure data handling, but requires fixes for a high-severity SQL join error and a few medium-severity logical regressions.

High Severity

1. SQL execution error due to missing join

  • File: internal/query/sqlite.go (around line 293)
  • Description: When filter.MatchesEmpty(ViewSenders) is true, the p_direct_sender alias
    is not joined. If a SenderName filter is simultaneously applied, the code skips adding the missing join but appends a condition referencing p_direct_sender.display_name. This results in a SQL execution error (no such column: p_direct_sender.display_name).
  • Suggested
    Fix:
    Add LEFT JOIN participants p_direct_sender ON p_direct_sender.id = m.sender_id to the joins array inside the else if filter.MatchesEmpty(ViewSenders) block.

Medium Severity

2. Checkpoint/resume is not actually
implemented despite UI messaging

  • Files: internal/whatsapp/importer.go:101, [internal/whatsapp/importer.go:391](/home/robore
    v/repos/msgvault/internal/whatsapp/importer.go:391)
  • Description: The importer writes checkpoints (UpdateSyncCheckpoint) and prints “Saving checkpoint... Run again to continue.”, but it never reads a prior checkpoint to resume. Re-runs will rescan from the
    start.
  • Suggested Fix: Load the last checkpoint at import start and initialize the per-chat/message cursor from it (or remove the checkpoint/resume messaging if not supported yet).

3. Reply threading misses already-imported quoted messages

  • Description: reply_to_message_id is set only when the quoted key exists in the in-memory keyIDToMsgID map from the current run. If the quoted message was imported in a previous run, reply links are dropped.
  • Suggested Fix
    :
    On map miss, do a DB lookup by (source_id, source_message_id) and set reply_to_message_id when found.

4. GetGmailIDsByFilter broadened sender matching can pull non-Gmail IDs into Gmail workflows

  • Files
    :
    internal/query/sqlite.go:896, [internal/query/duckdb.go:1472](/home/roborev/repos/msgvault/internal
    /query/duckdb.go:1472)
  • Description: This function is Gmail-specific by contract, but the sender filter logic now matches phone/direct-sender rows (WhatsApp/chat). In mixed-source datasets, this can return non-Gmail source_message_id values for deletion/staging flows.
  • Suggested Fix: Explicitly scope this method to Gmail sources (source_type='gmail' or equivalent join/filter) regardless of the sender filter semantics.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Add missing p_direct_sender JOIN in SQLite MatchesEmpty(ViewSenders)
  path so SenderName filter doesn't reference an unjoined alias
- Add DB fallback for reply threading: when quoted message key_id is
  not in the per-chat in-memory map, look it up in the messages table
  to link replies from previous runs or other chats
- Scope GetGmailIDsByFilter to Gmail sources only (JOIN on
  source_type='gmail' in SQLite, message_type filter in DuckDB
  Parquet fallback) to prevent WhatsApp IDs leaking into
  deletion/staging workflows
- Remove misleading checkpoint PageToken (resume not implemented;
  re-runs are safe via upsert dedup)
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (4ef671b)

Summary Verdict: The code is secure and generally well-implemented, with two medium-severity logic issues identified in the contact import process.

Medium

  1. Silent data-path failures in contact import

    File: internal/whatsapp/contacts.go:37
    ImportContacts ignores errors from UpdateParticipantDisplayNameByPhone (err is dropped unless err == nil && updated). If DB writes fail, the command can report success with partial/zero updates and no warning.
    Suggested fix: Propagate the first error (or collect and return an aggregated error count), and surface failures in CLI output.

  2. Incorrect phone normalization can mis-assign names

    File: internal/whatsapp/contacts.go:143
    normalizeVCardPhone rewrites any local 0... number to +44....
    That is UK-specific and can incorrectly map non-UK contacts, leading to wrong participant name updates.
    Suggested fix: Only normalize unambiguous international formats (+ / 00) unless a country code is explicitly configured.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

…rmalization

- ImportContacts now counts DB errors and returns an aggregated error
  instead of silently dropping failures from UpdateParticipantDisplayNameByPhone
- Remove UK-specific 0→+44 normalization in normalizeVCardPhone; local
  numbers without country code are now skipped as ambiguous
- Only normalize unambiguous international formats (+ prefix, 00 prefix)
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (03abaf4)

The code is generally clean and secure with no vulnerabilities found, but there are a few medium-severity issues related to query filtering and error handling that need addressing.

Medium

Incorrect Empty Sender Filtering
Files: /home/roborev/repos/msgvault/internal/query/sqlite.go:292, /home/roborev/repos/msgvault/internal/query/duckdb.go:704
MatchEmpty (ViewSenders) checks only the sender email (plus sender_id), not the sender phone. A phone-only sender in message_recipients can be incorrectly treated as an “empty sender.”
Suggested fix: In both engines, treat the sender as non-empty when either the email or phone
_number
is present.

Hidden Partial Failures on Contact Import
File: /home/roborev/repos/msgvault/cmd/msgvault/cmd/import.go:130
Contact import DB errors are only printed as warnings, and the command still exits with a
success status. This can hide partial failures in scripted usage.
Suggested fix: Return an error when whatsapp.ImportContacts returns an error, or gate this behavior behind an explicit “best-effort” flag.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

…s error

- MatchesEmpty(ViewSenders) now treats a sender as non-empty when either
  email_address or phone_number is present (both SQLite and DuckDB)
- Contact import errors now cause the import command to exit with failure
  instead of printing a warning and returning success
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (f11e5ff)

Verdict: The code is secure, but there are two medium-severity functional issues regarding vCard parsing and conversation metadata updates that need addressing.

Medium

  • vCard parsing misses common valid formats (line folding/encoded values), causing contact import misses

    • File: internal/whatsapp/contacts.go:59
    • Problem: parseVCardFile reads raw lines directly and does not unfold continuation lines or decode encoded
      FN values. Real-world .vcf exports often use folded lines and encoded names, so names/phones can be skipped or malformed.
    • Fix: Add RFC-compliant unfolding + decoding (or use a vCard parser lib), and add tests for folded FN/
      TEL lines and encoded names.
  • Conversation metadata is insert-only; existing rows never get title/type refresh

    • File: [internal/store/messages.go:741](/home/roborev/repos/msgvault/internal/store/messages.
      go:741)
    • Problem: EnsureConversationWithType does SELECT then INSERT, returning early when the row exists. If a group title/type changes (or was initially empty), the data stays stale.
    • Fix: Use upsert (ON CONFLICT) to
      update conversation_type, and update title when the new title is non-empty.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

…refresh

- parseVCardFile now unfolds RFC 2425 continuation lines (leading
  space/tab) so multi-line FN and TEL values are correctly parsed
- Decode QUOTED-PRINTABLE encoded FN values (e.g., =C3=A9 → é)
- EnsureConversationWithType now updates conversation_type and title
  on existing rows when values have changed (non-empty title only)
- Added tests for folded lines, encoded names, and QP decoding
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (58b39e3)

Summary Verdict: One medium severity issue was identified regarding vCard parsing case-sensitivity; no security vulnerabilities or high severity issues were found.

Medium

vCard parsing is case-sensitive and can silently miss valid contacts
Files: internal/whatsapp/contacts.go:92, [internal/whatsapp/contacts.go:104](/
home/roborev/repos/msgvault/internal/whatsapp/contacts.go:104)
BEGIN:VCARD/END:VCARD, FN, and TEL are matched with exact-case prefixes. vCard field names are case-insensitive, so lowercase/mixed
-case files can be skipped or partially parsed.
Suggested fix: parse the key segment case-insensitively (e.g., uppercase only the key/params part before matching), while preserving original value bytes.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

vCard field names are case-insensitive per RFC 2426. Match BEGIN/END/FN/TEL
using uppercased key portion while preserving original value bytes.
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (b4c6db8)

Review completed with 1 High and 4 Medium severity findings regarding
email/JID parsing bugs, terminal injection vulnerabilities, and vCard decoding errors.

High

  • File: internal/mcp/handlers.go:326
    Issue: Incorrect string matching for email addresses. strings.Contains(v, " @") looks for a space
    before the @ symbol, which fails to match valid email addresses and misroutes them to the SenderName filter instead of Sender.
    Remediation: Change to strings.Contains(v, "@").

Medium

  • File: cmd/msgvault/cmd/import. go:195, 205 (Also introduced in internal/tui/view.go:536, 541, 897, 904)
    Issue: Terminal control-sequence injection. Untrusted message/chat metadata (e.g
    ., WhatsApp snippet, chat titles) is written directly to terminal output, allowing crafted values containing ANSI/OSC escape sequences to manipulate the terminal display (spoofing, clipboard injection).
    Remediation: Sanitize untrusted text before rendering to the terminal/TUI by stripping ESC/C0/C
    1 control characters and normalizing newlines.

  • File: internal/whatsapp/contacts.go:69, 149
    Issue: Quoted-printable decoding is incomplete for soft line breaks. The current implementation handles =XX bytes but not QP soft wraps (= at EOL), which
    can corrupt folded encoded names common in vCards.
    Remediation: Decode using the mime/quotedprintable reader or explicitly strip =\r?\n soft breaks before hex decoding.

  • File: internal/whatsapp/contacts.go:188
    Issue: Phone normalization
    can incorrectly convert ambiguous local numbers into E.164. The code prefixes + for any 10+ digit number not starting with 0, potentially mis-normalizing local/national formats and matching the wrong participant.
    Remediation: Only accept numbers with an explicit + or 00 unless a default region/country code is explicitly configured.

  • File: internal/whatsapp/mapping.go:134
    Issue: JID suffix trimming failure. strings.TrimSuffix(user, " @"+server) assumes a space before the @ symbol,
    failing to strip the server suffix from standard JIDs (e.g., 447700900000@s.whatsapp.net).
    Remediation: Change to strings.TrimSuffix(user, "@"+server).


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

…erminal injection

- Handle QUOTED-PRINTABLE soft line breaks (= at EOL) during vCard
  parsing by joining continuation lines before property extraction
- Tighten phone normalization to only accept numbers with explicit
  country code indicators (+ or 00 prefix), avoiding false matches
- Add SanitizeTerminal() to strip ANSI escape sequences and control
  characters from untrusted metadata (chat names, snippets) before
  rendering to terminal/TUI
- Add tests for all three fixes
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (7a7f594)

Summary Verdict: The code review identified one medium-severity issue related to attachment deduplication during failed media copies.

Medium

  • File: internal/whatsapp/ importer.go (~L300) and internal/store/messages.go (UpsertAttachment, ~L650)
  • Description: If a media copy fails, handleMediaFile returns an empty contentHash and storagePath, but the import still calls Ups ertAttachment. Because UpsertAttachment deduplicates by (message_id, content_hash), multiple attachments with an empty hash will incorrectly collapse into a single record.
  • Suggested remediation: Only deduplicate when content_hash != '' (or use a fallback stable key like the WhatsApp media row
    ID or path), and update metadata by the attachment row ID instead of the hash.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

Databases created before the WhatsApp feature have no phone_number,
sender_id, message_type, attachment_count, or title columns because
CREATE TABLE IF NOT EXISTS is a no-op for existing tables.

This breaks the MCP server and cache builder with:
  Binder Error: Column "phone_number" in REPLACE list not found

Add ALTER TABLE migrations in InitSchema() for all v2 columns.
Silently ignores "duplicate column name" errors for databases that
already have the columns.
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (d56499e)

Verdict: The PR implements WhatsApp import functionality with good security practices, but requires fixes for message ID collisions and string parsing errors.

Medium

  • Message ID collision / data overwrite

    • Files: internal/whatsapp/queries.go:58, internal /whatsapp/importer.go:258
    • Issue: fetchMessages coerces m.key_id to '' (COALESCE(m.key_id, '')), then the importer uses that as SourceMessageID for upserts. If any non
      -skipped messages have null/empty key_id, multiple distinct rows can collapse onto the same logical message key.
    • Remediation: Reject/skip empty key_id messages with a warning, or use a stable fallback key (e.g., chat_row_id:_id ) when key_id is empty.
  • Unexpected space in suffix trimming logic

    • File: internal/whatsapp/mapping.go:132
    • Issue: The suffix trimming logic strings.TrimSuffix(user, " @"+server) contains an
      unexpected space before the @, which will fail to strip valid JID suffixes. Additionally, the comment contains an accidental path replacement (@internal/api/server_test.go instead of @server).
    • Remediation: Remove the space to use "@" + server and fix the comment.
  • Unexpected space in email matching

    • File: internal/mcp/handlers.go:327

    • Issue: strings.Contains(v, " @") includes an unexpected space before the @, which will fail to match standard email addresses without spaces.

    • Remediation: Change to strings.Contains(v, "@").


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

Existing v2 caches may have been built before the schema migration
added phone_number to the participants table, resulting in Parquet
files without the column. Bumping to v3 ensures build-cache detects
the mismatch and triggers a full rebuild automatically.
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (b7dcbcd)

Verdict: The PR introduces solid features and security practices, but requires fixes for DSN construction, terminal injection in single-line outputs, and phone number normalization.

Medium

1. SQLite DSN String Concatenation
File: internal/whatsapp/importer.go:40
Issue: SQLite DSN is built by string concatenation (waDBPath + "?mode=ro..."). If waDBPath contains URI metacharacters (not
ably ?), DSN parsing can be altered so enforced options like mode=ro are not reliably applied, and the opened DB path can differ from the validated path.
Remediation: Build DSN with explicit URI construction/escaping instead of concatenation (for example, file: URI + escaped
path + fixed query params), and reject/normalize reserved URI characters in filesystem paths before opening.

2. Terminal Injection in Single-Line Output
Files: internal/textutil/encoding.go:183, cmd/msgvault/cmd/import.go:204 , internal/tui/view.go:533
Issue: SanitizeTerminal intentionally preserves \n and \r. In single-line render paths (progress line and list rows), that still allows output spoofing/layout breakage.
Remediation: Add a
single-line sanitizer (strip \r, \n, \t) for these UI contexts, and keep the current sanitizer for multiline-safe contexts.

3. Contact Phone Normalization
File: internal/whatsapp/contacts.go:205
Issue
:
normalizeVCardPhone keeps all digits when input starts with +, so values like +44 (0)7700... become +4407700... and won’t match participant numbers (+447700...).
**
Remediation:** Strip optional trunk markers like (0) before digit normalization for +-prefixed numbers, and enforce E.164 length bounds.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@eddowding
Copy link
Author

Post-rebase test failures on v0.9.0

After rebasing feat/whatsapp-import onto current main (v0.9.0), all 5 TestBuildCache_* tests fail with:

Binder Error: Table "m" does not have a column named "attachment_count"

Root cause: The test schemas in cmd/msgvault/cmd/build_cache_test.go don't include the three columns added by this branch (attachment_count, sender_id, message_type). There are 4 CREATE TABLE messages definitions in the test file (lines ~43, ~1131, ~1331, ~1410) that all need updating.

Fix: Add to each test schema:

attachment_count INTEGER DEFAULT 0,
sender_id INTEGER,
message_type TEXT NOT NULL DEFAULT 'email',

Build passes clean, and all other test packages pass. Just these 5 tests in cmd/msgvault/cmd.

@wesm
Copy link
Owner

wesm commented Feb 27, 2026

From Claude

"The test failures are a legitimate bug in this branch, not a rebase artifact. The branch adds three new
columns to the messages table in the production schema (internal/store/schema.sql):

  1. attachment_count (line 143)
  2. sender_id (line 122)
  3. message_type (line 112)

The build cache query in cmd/msgvault/cmd/build_cache.go (lines 249-252) now references all three columns:

COALESCE(TRY_CAST(m.attachment_count AS INTEGER), 0) as attachment_count,
m.sender_id,
COALESCE(TRY_CAST(m.message_type AS VARCHAR), '') as message_type,

But the test helper setupTestSQLite() in build_cache_test.go (line 43) still creates the messages table
without these columns. The test schema was never updated to match the production schema changes made in this
branch's own commits.

This is the contributor's responsibility to fix — the branch modified the production schema and the build
cache queries but didn't update the test schemas. The fix is straightforward: add the three missing columns to
the test's CREATE TABLE messages definition. There's one setupTestSQLite() function (line 43) that all the
failing tests share, so a single edit should fix all 5 tests.

The CI on main passes because main doesn't have these columns in the query — they were introduced by this
branch."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants