Open
Description
Describe the bug
Disk mutations (message publishing, queue purge/delete) are not atomic with the replication actions sent to followers. During the second `full_sync` (which holds `@lock`), another thread can write to or delete a file on disk while the corresponding `append`/`delete_file` call blocks waiting for `@lock`. This can cause:
- Message duplication: a message written to disk is included in the synced file and then also sent as an `append` action after the follower is marked synced.
- SEGFAULT: a queue purge/delete closes and unmaps an `MFile` that `full_sync` is reading via `file.to_slice`.
Timeline (duplication)
1. Thread A (follower sync): starts second `full_sync`, acquires `@lock`
2. Thread A: `files_with_hash` computes hash for `file_1` (current state)
3. Thread B (publisher): writes `msg_x` to `file_1` on disk
4. Thread B: calls `replicator.append(file_1, msg_x)` → `each_follower` → blocks waiting for `@lock`
5. Thread A: follower requests `file_1`, receives it (now including `msg_x`)
6. Thread A: marks follower as synced, releases `@lock`
7. Thread B: acquires `@lock`, follower is now synced, sends `append(file_1, msg_x)`
8. Result: `file_1` on the follower contains `msg_x` twice
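
To make the interleaving easier to follow, here is a minimal sketch that replays the timeline above deterministically, without real threads. It is a Python model, not the actual replication code: `Follower`, `replay_race`, `leader_file` and `pending_actions` are hypothetical names, and the lock is reduced to a boolean flag.

```python
# Deterministic replay of the timeline, without real threads.
# All names here are hypothetical and only model the behaviour described.

class Follower:
    def __init__(self):
        self.files = {}       # file name -> bytes held by the follower
        self.synced = False

    def receive_file(self, name, data):
        # Full file transfer during full_sync.
        self.files[name] = bytes(data)

    def apply_append(self, name, data):
        # Incremental append action sent to a synced follower.
        self.files[name] = self.files.get(name, b"") + data


def replay_race():
    leader_file = bytearray(b"msg_1")   # file_1 on the leader's disk
    follower = Follower()
    pending_actions = []                # actions blocked waiting for the lock

    # Steps 1-2: full_sync takes the lock and hashes the current state.
    lock_held = True
    # Step 3: the publisher writes msg_x to disk; the lock is not involved.
    leader_file += b"msg_x"
    # Step 4: the append action cannot be sent while the lock is held.
    if lock_held:
        pending_actions.append(("file_1", b"msg_x"))
    # Step 5: the follower fetches file_1, which already contains msg_x.
    follower.receive_file("file_1", leader_file)
    # Step 6: the follower is marked as synced and the lock is released.
    follower.synced = True
    lock_held = False
    # Step 7: the blocked append now goes through to the synced follower.
    for name, data in pending_actions:
        if follower.synced:
            follower.apply_append(name, data)
    # Step 8: return both copies so they can be compared.
    return bytes(leader_file), follower.files["file_1"]


leader, replica = replay_race()
print(leader)    # leader copy of file_1
print(replica)   # follower copy of file_1
```

In this model the leader's copy holds `msg_x` once and the follower's copy holds it twice, which matches the result in the last step of the timeline.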
Root cause
The disk write (step 3) and the replication action (step 4) are not atomic with respect to the second `full_sync` holding `@lock`. The message is written to disk before `each_follower` is called, so the second `full_sync` can observe the new data in the file and transfer it to the follower. When `@lock` is released, the pending `append` action goes through to the now-synced follower, duplicating the data.
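
For contrast, a rough sketch of what "atomic" would mean here: the disk write and the replication action share one critical section with the sync. This is only an illustration of the invariant in Python, not a proposed patch. The `Replicator` class and its fields are hypothetical, and holding the lock across disk writes may have a throughput cost this sketch ignores.

```python
import threading

# Hypothetical model: one lock guards the disk write, the replication
# action, and the full sync, so they can no longer interleave.

class Replicator:
    def __init__(self):
        self.lock = threading.Lock()
        self.disk = bytearray(b"msg_1")   # file_1 on the leader
        self.follower_data = None         # None until the follower is synced

    def publish(self, msg):
        # Disk write and replication action happen under the same lock.
        with self.lock:
            self.disk += msg
            if self.follower_data is not None:
                self.follower_data += msg

    def full_sync(self, started, proceed):
        with self.lock:
            started.set()     # signal that the sync holds the lock
            proceed.wait()    # keep the lock while a publisher arrives
            self.follower_data = bytearray(self.disk)


rep = Replicator()
started = threading.Event()
proceed = threading.Event()

sync_thread = threading.Thread(target=rep.full_sync, args=(started, proceed))
sync_thread.start()
started.wait()                # the sync now holds the lock

pub_thread = threading.Thread(target=rep.publish, args=(b"msg_x",))
pub_thread.start()            # cannot write to disk until the sync finishes
proceed.set()

sync_thread.join()
pub_thread.join()
print(bytes(rep.follower_data) == bytes(rep.disk))
```

Because the publisher cannot touch the file while the sync holds the lock, the sync transfers the file without `msg_x`, and the message then reaches the follower exactly once via the append path.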
Notes
- The same class of issue exists for `delete_file`: a file could be deleted from disk (e.g. queue purge/delete) before `each_follower` runs. The second `full_sync` may then try to read an `MFile` that has been closed/unmapped, causing a SEGFAULT. Or the follower receives a delete for a file it never got.
- This affects 2.7.0, where follower sync runs on a parallel execution context (`@mt`), making the race more likely. In 2.6.x (single-threaded fibers), the window is narrower but still theoretically possible at yield points within `full_sync`.