
Bugfix: replication shouldn't break if mfiles are closed #1792

Merged
spuun merged 5 commits into main from bugfix-dont-replicate-closed-files
Mar 8, 2026

Conversation

@spuun
Member

@spuun spuun commented Mar 4, 2026

WHAT is this pull request doing?

If a queue is closed because of a bad segment, its segment files are still registered in the replicator and will be replicated during a full sync. Because they are closed, an error is raised that aborts the replication. The close can also happen while a checksum is being calculated, which results in a segfault.

This PR will do two things:

  1. Wait for any full sync to be done before closing files
  2. When mfiles are closed, they are "re-registered" in the replicator with path only, to make sure the files are still replicated.
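
The close-side pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the actual LavinMQ code: names such as `@replicator`, `@mfiles`, and `register_file` are assumptions based on the snippets later in this thread.

```crystal
# Illustrative sketch (assumed names, not the actual LavinMQ implementation)
def close : Nil
  wg = WaitGroup.new
  # each pending replication action for these files calls wg.done once sent
  spawn(name: "wait for file close is replicated") do
    wg.wait                        # wait WITHOUT holding the sync lock
    @replicator.wait_for_sync do   # serializes with any running full_sync
      @mfiles.each do |mfile|
        @replicator.register_file(mfile.path) # path only: later syncs read from disk
        mfile.close                           # now safe to unmap
      end
    end
  end
end
```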

HOW can this pull request be tested?

Specs should cover it

spuun added 3 commits March 4, 2026 09:02
They should still be replicated when a message store is closed, but
we must re-register them with path only for replication to work.
@spuun spuun requested a review from a team as a code owner March 4, 2026 09:50
@claude

claude bot commented Mar 4, 2026

Code Review

No issues found.

The approach is sound: acquiring @sync_lock before closing MFiles prevents the race with full_sync, and re-registering files by path causes files_with_hash/with_file to fall back to safe disk reads instead of accessing unmapped memory. The locking order avoids deadlocks, and the checksum invalidation on re-registration is correct.
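
As a rough illustration of that locking order (assumed shapes, not the actual LavinMQ API), both paths synchronize on the same mutex:

```crystal
# Illustrative: wait_for_sync and full_sync serialize on one mutex, so
# mfiles cannot be unmapped while a follower's full sync is in progress.
@sync_lock = Mutex.new

def wait_for_sync(&)
  @sync_lock.synchronize { yield }  # callers close mfiles inside this block
end

def full_sync(follower)
  @sync_lock.synchronize do
    send_files_with_hash(follower)  # hypothetical helper: reads registered files
  end
end
```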

@spuun spuun changed the title Bugfix dont replicate closed files Bugfix: replication shouldn't break if mfiles are closed Mar 4, 2026
@viktorerlingsson
Member

Should we wait_for_sync in delete_file as well?

@viktorerlingsson viktorerlingsson linked an issue Mar 4, 2026 that may be closed by this pull request
@spuun
Member Author

spuun commented Mar 4, 2026

Should we wait_for_sync in delete_file as well?

Yeah, maybe. I think I managed to reason about it and that it wasn't necessary, but maybe better be safe than sorry... hm.

@claude

claude bot commented Mar 4, 2026

PR Review

Bug: wg.wait inside @sync_lock in delete_file risks holding the lock indefinitely

In src/lavinmq/message_store.cr (the private delete_file method), wg.wait is called inside wait_for_sync (i.e., while holding @sync_lock):

spawn(name: "wait for file deletion is replicated") do
  replicator.wait_for_sync do
    wg.wait          # <-- blocks while holding @sync_lock
  ensure
    file.close
  end
end

If a follower disconnects mid-operation after DeleteAction has been dequeued from the channel but before done is called (e.g., socket failure during sync in action_loop), the WaitGroup will never reach zero. Since @sync_lock is held, all future full syncs will be blocked permanently.

By contrast, the close method correctly places wg.wait outside wait_for_sync:

spawn(name: "wait for file deletion is replicated") do
  wg.wait                       # <-- waits WITHOUT holding @sync_lock
  replicator.wait_for_sync do   # <-- only then acquires @sync_lock
    # re-register and close
  end
end

The delete_file method should follow the same pattern. After replicator.delete_file(file.path, wg) removes the entry from @files, no full sync will iterate over it, so there is no need to hold @sync_lock during wg.wait — only file.close needs the lock protection:

spawn(name: "wait for file deletion is replicated") do
  wg.wait
  replicator.wait_for_sync do
    file.close
  end
end

@viktorerlingsson
Member

Yeah, maybe. I think I managed to reason about it and that it wasn't necessary, but maybe better be safe than sorry... hm.

Yeah, I think it's more unlikely to cause problems, but I think it can cause problems at least on queue delete or purge_all 🤔

require "time"
require "../src/lavinmq/message_store"

class SpyReplicator
Member

I think this needs a comment - what is this and why is it here?

wg.try &.done # negotiation done — server is now entering files_with_hash on @mt
sha1_size = Digest::SHA1.new.digest_size
client_lz4 = Compress::LZ4::Reader.new(client_io)
2.times do
Member

I'd like a comment here as well, explaining why this runs 2 times and maybe what happens in the loop

@claude

claude bot commented Mar 5, 2026

Code Review

No issues found.

The approach is correct:

  • wait_for_sync acquires @sync_lock, which serializes with full_sync, preventing a segfault from accessing unmapped MFiles during a concurrent sync.
  • Re-registering files by path (@files[path] = nil) before closing MFiles ensures files_with_hash falls back to reading from disk via File.open rather than calling to_slice on a closed MFile.
  • Clearing @checksums on all register_file overloads is correct — re-registered files may have changed content, so stale checksums should be invalidated.
  • Removing the !replicator.all_followers.empty? guard in close is appropriate since a new follower could begin a full sync at any moment, so the @sync_lock must always be acquired to prevent races.
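
The disk-read fallback from the second bullet could look roughly like this. Only the `@files` shape, `to_slice`, and the `File.open` fallback are taken from the comments above; `send_slice` and `follower_socket` are hypothetical names:

```crystal
# Illustrative: @files maps path => MFile?; nil means the file was
# re-registered by path only and must be served from disk.
@files.each do |path, mfile|
  if mfile
    send_slice(mfile.to_slice)  # mmap-backed; only valid while the MFile is open
  else
    File.open(path) do |f|      # safe even after the MFile was closed
      IO.copy(f, follower_socket)
    end
  end
end
```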

@spuun
Member Author

spuun commented Mar 5, 2026

Bug: wg.wait inside @sync_lock in delete_file risks holding the lock indefinitely

The WaitGroup will always be resolved: when a follower is disconnected, #close will be called, which resolves all pending actions.

@viktorerlingsson (Member) left a comment

Not sure how to properly test this besides the spec, but the code looks good to me 👍

@spuun spuun merged commit fe2c479 into main Mar 8, 2026
18 checks passed
@spuun spuun deleted the bugfix-dont-replicate-closed-files branch March 8, 2026 08:51
viktorerlingsson pushed a commit that referenced this pull request Mar 9, 2026

Development

Successfully merging this pull request may close these issues.

Clustering full_sync crashes on closed MFile

2 participants