
Bugfix: replication shouldn't break if mfiles are closed #1792

Merged
spuun merged 5 commits into main from bugfix-dont-replicate-closed-files
Mar 8, 2026

Conversation

@spuun
Member

@spuun spuun commented Mar 4, 2026

WHAT is this pull request doing?

If a queue is closed because of a bad segment, its segment files are still registered in the replicator and will be replicated during a full sync. Because they are closed, an error is raised that aborts the replication. The close can also happen while a checksum is being calculated, which results in a segfault.

This PR will do two things:

  1. Wait for any full sync to be done before closing files
  2. When mfiles are closed, they are "re-registered" in the replicator with path only, to make sure the files are still replicated.
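
The close-side pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the actual LavinMQ code: names such as `@replicator`, `@mfiles`, and `register_file` are assumptions based on the snippets later in this thread.

```crystal
# Illustrative sketch (assumed names, not the actual LavinMQ implementation)
def close : Nil
  wg = WaitGroup.new
  # each pending replication action for these files calls wg.done once sent
  spawn(name: "wait for file close is replicated") do
    wg.wait                        # wait WITHOUT holding the sync lock
    @replicator.wait_for_sync do   # serializes with any running full_sync
      @mfiles.each do |mfile|
        @replicator.register_file(mfile.path) # path only: later syncs read from disk
        mfile.close                           # now safe to unmap
      end
    end
  end
end
```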

HOW can this pull request be tested?

Specs should cover it

spuun added 3 commits March 4, 2026 09:02
They should still be replicated when a message store is closed, but
we must re-register them with path only for replication to work.
@spuun spuun requested a review from a team as a code owner March 4, 2026 09:50
@claude

claude bot commented Mar 4, 2026

Code Review

No issues found.

The approach is sound: acquiring @sync_lock before closing MFiles prevents the race with full_sync, and re-registering files by path causes files_with_hash/with_file to fall back to safe disk reads instead of accessing unmapped memory. The locking order avoids deadlocks, and the checksum invalidation on re-registration is correct.
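
As a rough illustration of that locking order (assumed shapes, not the actual LavinMQ API), both paths synchronize on the same mutex:

```crystal
# Illustrative: wait_for_sync and full_sync serialize on one mutex, so
# mfiles cannot be unmapped while a follower's full sync is in progress.
@sync_lock = Mutex.new

def wait_for_sync(&)
  @sync_lock.synchronize { yield }  # callers close mfiles inside this block
end

def full_sync(follower)
  @sync_lock.synchronize do
    send_files_with_hash(follower)  # hypothetical helper: reads registered files
  end
end
```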

@spuun spuun changed the title Bugfix dont replicate closed files Bugfix: replication shouldn't break if mfiles are closed Mar 4, 2026
@viktorerlingsson
Member

Should we wait_for_sync in delete_file as well?

@viktorerlingsson viktorerlingsson linked an issue Mar 4, 2026 that may be closed by this pull request
@spuun
Member Author

spuun commented Mar 4, 2026

Should we wait_for_sync in delete_file as well?

Yeah, maybe. I think I managed to reason about it and that it wasn't necessary, but maybe better be safe than sorry... hm.

@claude

claude bot commented Mar 4, 2026

PR Review

Bug: wg.wait inside @sync_lock in delete_file risks holding the lock indefinitely

In src/lavinmq/message_store.cr (the private delete_file method), wg.wait is called inside wait_for_sync (i.e., while holding @sync_lock):

spawn(name: "wait for file deletion is replicated") do
  replicator.wait_for_sync do
    wg.wait          # <-- blocks while holding @sync_lock
  ensure
    file.close
  end
end

If a follower disconnects mid-operation after DeleteAction has been dequeued from the channel but before done is called (e.g., socket failure during sync in action_loop), the WaitGroup will never reach zero. Since @sync_lock is held, all future full syncs will be blocked permanently.

By contrast, the close method correctly places wg.wait outside wait_for_sync:

spawn(name: "wait for file deletion is replicated") do
  wg.wait                       # <-- waits WITHOUT holding @sync_lock
  replicator.wait_for_sync do   # <-- only then acquires @sync_lock
    # re-register and close
  end
end

The delete_file method should follow the same pattern. After replicator.delete_file(file.path, wg) removes the entry from @files, no full sync will iterate over it, so there is no need to hold @sync_lock during wg.wait — only file.close needs the lock protection:

spawn(name: "wait for file deletion is replicated") do
  wg.wait
  replicator.wait_for_sync do
    file.close
  end
end

@viktorerlingsson
Member

Yeah, maybe. I think I managed to reason about it and that it wasn't necessary, but maybe better be safe than sorry... hm.

Yeah, I think it's more unlikely to cause problems, but I think it can cause problems at least on queue delete or purge_all 🤔

require "time"
require "../src/lavinmq/message_store"

class SpyReplicator
Member

I think this needs a comment - what is this and why is it here?

wg.try &.done # negotiation done — server is now entering files_with_hash on @mt
sha1_size = Digest::SHA1.new.digest_size
client_lz4 = Compress::LZ4::Reader.new(client_io)
2.times do
Member

I'd like a comment here as well, explaining why this runs 2 times and maybe what happens in the loop

@claude

claude bot commented Mar 5, 2026

Code Review

No issues found.

The approach is correct:

  • wait_for_sync acquires @sync_lock, which serializes with full_sync, preventing a segfault from accessing unmapped MFiles during a concurrent sync.
  • Re-registering files by path (@files[path] = nil) before closing MFiles ensures files_with_hash falls back to reading from disk via File.open rather than calling to_slice on a closed MFile.
  • Clearing @checksums on all register_file overloads is correct — re-registered files may have changed content, so stale checksums should be invalidated.
  • Removing the !replicator.all_followers.empty? guard in close is appropriate since a new follower could begin a full sync at any moment, so the @sync_lock must always be acquired to prevent races.
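
The disk-read fallback from the second bullet could look roughly like this. Only the `@files` shape, `to_slice`, and the `File.open` fallback are taken from the comments above; `send_slice` and `follower_socket` are hypothetical names:

```crystal
# Illustrative: @files maps path => MFile?; nil means the file was
# re-registered by path only and must be served from disk.
@files.each do |path, mfile|
  if mfile
    send_slice(mfile.to_slice)  # mmap-backed; only valid while the MFile is open
  else
    File.open(path) do |f|      # safe even after the MFile was closed
      IO.copy(f, follower_socket)
    end
  end
end
```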

@spuun
Member Author

spuun commented Mar 5, 2026

Bug: wg.wait inside @sync_lock in delete_file risks holding the lock indefinitely

The WaitGroup will always be resolved: when a follower is disconnected, #close will be called, which resolves all pending actions.

@viktorerlingsson (Member) left a comment

Not sure how to properly test this besides the spec, but the code looks good to me 👍

@spuun spuun merged commit fe2c479 into main Mar 8, 2026
18 checks passed
@spuun spuun deleted the bugfix-dont-replicate-closed-files branch March 8, 2026 08:51
viktorerlingsson pushed a commit that referenced this pull request Mar 9, 2026

Development

Successfully merging this pull request may close these issues.

Clustering full_sync crashes on closed MFile

2 participants