Reuse the `message_expire_loop` for `drop_overflow` #1721
Conversation

PR Review: Reuse …

Force-pushed ca3924e to a419c87
@claude you are wrong about this:

> 1. Bug: Missing `@consumers_empty.when_false` in no-consumer / no-TTL select (lines 135-144)
PR Review: Recreate drop_overflow_channel when needed

**Bug: Lost signals from unbuffered channel can cause permanent max_length violation**

There is a window where … Concrete scenario: … The old synchronous …

Suggested fix: use a buffered channel with capacity 1:

```crystal
@drop_overflow_channel = ::Channel(Nil).new(1)
```

This allows … This change would need to be applied in all three places where the channel is created (initialization, …).

**Minor: Queues with TTL partially mask the bug**

In the TTL branch, …
FYI, I've found some kind of deadlock. Trying to narrow it down.
PR Review: bugfix/callstack-dropoverflow

**1. Unbuffered channel can silently lose overflow signals (bug)**

Whenever the expire loop is executing … Concrete scenario: … The same applies during …

Fix: use a buffered channel with capacity 1:

```crystal
@drop_overflow_channel = ::Channel(Nil).new(1)
```

This ensures at least one pending signal is retained. Multiple signals are naturally deduplicated since the buffer holds at most 1 value, and …
kickster97 left a comment:

To me this looks OK. I ran a few tests and it worked well.
Force-pushed b669658 to f70eecd
PR Review

**1. Missing specs**

All files changed. Per project rules, all code changes must have corresponding specs. This PR restructures core queue behavior (overflow dropping moved to the expire loop, lock scope narrowed in …).

**2. TOCTOU in …**
```crystal
  end

  private def drop_overflow : Nil
    return if immediate_delivery?
```
This was added as a result of a failing spec. The queue could end up with more messages than max-length, probably because of missed signals (e.g. other signals taking precedence). Since drop overflow was handled in all states (select statements), it might as well be moved to the beginning of the loop to react to any loop wakeup. The check whether a drop is about to happen is pretty cheap: just a quick lock and size check.
Force-pushed f70eecd to 42831ce
PR Review

**Bug 1 (Critical): TTL change signals gated behind overflow settings**

```crystal
private def cleanup_messages(reason : CleanupReason) : Nil
  @cleanup_message_channel.try_send?(reason) if (@max_length || @max_length_bytes) && !immediate_delivery?
end
```

The guard … Previously, … The guard should only apply to the …

**Bug 2 (Critical): MQTT session overflow handling broken**

Previously, …

```crystal
private def cleanup_messages(reason : CleanupReason) : Nil
  drop_overflow if (@max_length || @max_length_bytes) && !immediate_delivery?
end
```

**Missing specs**

Per project policy, all code changes must have corresponding specs. This PR has no spec changes. Given the significant refactoring of the expire loop, overflow handling, and lock scope changes, regression specs would be valuable, particularly for: …
PR Review

Bug: …
WHAT is this pull request doing?
reason for the change
When a queue overflows, `drop_overflow` is called synchronously in the `publish` method. `drop_overflow` dead-letters messages via `expire_msg`, which calls `vhost.publish` to route the dead-lettered message to another queue. If that target queue also overflows, it dead-letters back to the original queue, creating a recursive `publish` → `drop_overflow` → `publish` → `drop_overflow` chain. The call stack grows with each cycle until the process crashes.
solution
Reuse the existing `message_expire_loop` fiber to perform `drop_overflow` asynchronously. A new zero-buffered `@drop_overflow_channel` (`Channel(Nil)`) signals the fiber that overflow needs to be checked. The `publish` and `requeue` methods send a non-blocking signal (`try_send?`) instead of calling `drop_overflow` synchronously. This breaks the recursive call stack because the dead-lettering happens in a separate fiber, not in the caller's stack frame.
changed behaviour
Overflow is no longer immediate
Previously, `drop_overflow` ran synchronously inside `publish`, so the queue was trimmed before `publish` returned. Now the overflow signal is processed by the `message_expire_loop` fiber, which means there is a brief window where the queue exceeds `max-length` or `max-length-bytes` until the fiber wakes up and trims it. In practice this window is very short (a single fiber yield).
Extra fiber wakeup per empty→non-empty transition
The original loop was fully parked when consumers existed: it blocked on `@consumers_empty.when_true.receive` at the top and never woke until all consumers disconnected. Now the loop blocks on `@msg_store.empty.when_false.receive`, which fires on the empty→non-empty state transition (not per message). When consumers are present, the fiber wakes, checks `@consumers.empty?`, enters the `select` in the `else` branch, and parks again. This is one extra wakeup per transition, not per publish, so the overhead is negligible.