
Race condition: disk writes/deletes during second full_sync can cause duplication or SEGFAULT on follower #1806

@viktorerlingsson


Describe the bug
Disk mutations (message publishing, queue purge/delete) are not atomic with the replication actions sent to followers. During the second full_sync (which holds @lock), another thread can write to or delete a file on disk while the corresponding append/delete_file call blocks waiting for @lock. This can cause:

  • Message duplication — a message written to disk is included in the synced file and then also sent as an append action after the follower is marked synced.
  • SEGFAULT — a queue purge/delete closes and unmaps an MFile that full_sync is reading via file.to_slice.

Timeline (duplication)

  1. Thread A (follower sync) — starts second full_sync, acquires @lock
  2. Thread A — files_with_hash computes hash for file_1 (current state)
  3. Thread B (publisher) — writes msg_x to file_1 on disk
  4. Thread B — calls replicator.append(file_1, msg_x); each_follower blocks waiting for @lock
  5. Thread A — follower requests file_1, receives it (now including msg_x)
  6. Thread A — marks follower as synced, releases @lock
  7. Thread B — acquires @lock, follower is now synced, sends append(file_1, msg_x)
  8. Result — file_1 on the follower contains msg_x twice
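The interleaving above can be forced deterministically with two threads and a pair of events. This is a hypothetical Python model of the race (Replicator, publisher, full_sync and the in-memory "disk" are stand-ins for the Crystal originals), not LavinMQ code:

```python
import threading

class Replicator:
    """Toy model: @lock guards follower state and replication actions."""
    def __init__(self):
        self.lock = threading.Lock()
        self.disk = {"file_1": []}   # simulated on-disk file contents
        self.follower = []           # messages the follower has received
        self.synced = False

sync_started = threading.Event()
wrote_to_disk = threading.Event()

def publisher(r):
    sync_started.wait()               # ensure full_sync already holds @lock
    r.disk["file_1"].append("msg_x")  # step 3: disk write happens OUTSIDE the lock
    wrote_to_disk.set()
    with r.lock:                      # step 4: append action blocks on @lock
        if r.synced:                  # step 7: follower is synced by now
            r.follower.append("msg_x")

def full_sync(r):
    with r.lock:                      # step 1: second full_sync holds @lock
        sync_started.set()
        wrote_to_disk.wait()          # let step 3 land while we hold the lock
        r.follower.extend(r.disk["file_1"])  # step 5: file already includes msg_x
        r.synced = True               # step 6: mark follower synced

r = Replicator()
t1 = threading.Thread(target=full_sync, args=(r,))
t2 = threading.Thread(target=publisher, args=(r,))
t1.start(); t2.start()
t1.join(); t2.join()
print(r.follower)  # ['msg_x', 'msg_x'] — duplicated
```

The events only pin down the ordering that the scheduler can produce naturally; no step here does anything the real code path could not do.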

Root cause

The disk write (step 3) and the replication action (step 4) are not atomic with respect to the second full_sync holding @lock. The message is written to disk before each_follower is called, so the second full_sync can observe the new data in the file and transfer it to the follower. When @lock is released, the pending append action goes through to the now-synced follower, duplicating the data.
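One way to close the window — a sketch of an assumption, not a fix this issue proposes — is to widen the critical section so the disk mutation and the replication decision happen under the same @lock. Then full_sync either sees the message on disk while the follower is still unsynced (and the append is skipped), or it never sees the message in the file at all:

```python
import threading

class AtomicReplicator:
    """Sketch: disk write and replication decision share one critical section."""
    def __init__(self):
        self.lock = threading.Lock()
        self.disk = {"file_1": []}
        self.follower = []
        self.synced = False

    def publish(self, fname, msg):
        with self.lock:
            self.disk[fname].append(msg)  # disk write under @lock...
            if self.synced:               # ...and the append decision with it
                self.follower.append(msg)

    def full_sync(self, fname):
        with self.lock:
            self.follower.extend(self.disk[fname])
            self.synced = True

r = AtomicReplicator()
r.publish("file_1", "a")  # before sync: written to disk only
r.full_sync("file_1")     # transfers "a" and marks the follower synced
r.publish("file_1", "b")  # after sync: written and appended, exactly once
print(r.follower)  # ['a', 'b'] — each message delivered once
```

The trade-off is that publishers now serialize against full_sync for the duration of the disk write, which may be unacceptable for large files; the sketch only shows why atomicity removes the duplication, not how to get it cheaply.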

Notes

  • The same class of issue exists for delete_file — a file could be deleted from disk (e.g. queue purge/delete) before each_follower runs. The second full_sync may then try to read an MFile that has been closed/unmapped, causing a SEGFAULT. Or the follower receives a delete for a file it never got.
  • This affects 2.7.0 where follower sync runs on a parallel execution context (@mt), making the race window more likely. In 2.6.x (single-threaded fibers), the window is narrower but still theoretically possible at yield points within full_sync.
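The delete_file hazard can be mimicked with a memory-mapped file. CPython's mmap object guards accesses and raises instead of segfaulting, but the lifetime bug has the same shape (a hypothetical sketch, not LavinMQ code):

```python
import mmap
import tempfile

# Stand-in for an MFile backing a queue's message store.
f = tempfile.TemporaryFile()
f.write(b"persisted messages")
f.flush()
m = mmap.mmap(f.fileno(), 0)

assert m[:9] == b"persisted"  # full_sync reading through the mapping

# A concurrent queue purge/delete closes and unmaps the file...
m.close()

# ...while full_sync still holds a reference. In Crystal, file.to_slice
# now points at an unmapped region and dereferencing it SEGFAULTs;
# Python's mmap detects the closed mapping and raises instead.
try:
    m[:9]
except ValueError as e:
    print("read after close:", e)  # e.g. "mmap closed or invalid"
```

In native code there is no such guard: once munmap has run, any read through the stale slice is undefined behavior, which is why the Crystal process crashes rather than raising.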
