Non-blocking follower action sending by baelter · Pull Request #1763 · cloudamqp/lavinmq

baelter · 2026-02-28T20:39:08Z

Summary

Replace blocking @actions.send with @actions.try_send in Follower#send_action
When a follower's action channel is full, disconnect it immediately instead of stalling the leader's main fiber
Add FollowerTooSlowError, which is caught in Server#each_follower; the handler logs a warning and closes the slow follower
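A minimal sketch of the idea in the summary above, not the PR's actual diff. SketchFollower, its capacity argument, and the driver loop are hypothetical stand-ins. A non-blocking `select`/`else` on a standard Crystal Channel stands in for the try_send call named above.

```crystal
# Hypothetical stand-ins for illustration; not LavinMQ's actual classes.
class FollowerTooSlowError < Exception
end

class SketchFollower
  getter name : String
  getter? disconnected = false

  def initialize(@name : String, capacity : Int32)
    # Bounded queue of pending actions for this follower.
    @actions = Channel(String).new(capacity)
  end

  # Never blocks: when the buffer is full, the `else` branch runs at once.
  def send_action(action : String) : Nil
    select
    when @actions.send(action)
      # Enqueued for the follower's own fiber to pick up.
    else
      raise FollowerTooSlowError.new("#{@name}: action queue full")
    end
  end

  def disconnect : Nil
    @disconnected = true
    @actions.close
  end
end

followers = [SketchFollower.new("slow", 1), SketchFollower.new("healthy", 8)]

# Nothing drains the queues here, so the follower with capacity 1
# overflows on its second action and is dropped.
followers.each do |f|
  begin
    3.times { |i| f.send_action("action #{i}") }
  rescue ex : FollowerTooSlowError
    puts "dropping follower: #{ex.message}"
    f.disconnect
  end
end
```

The trade-off is the one stated in the motivation: the caller's fiber keeps running, at the cost of losing a follower that cannot keep up.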

Motivation

When the leader is CPU-saturated, slow or unresponsive followers cause send_action to block on the bounded channel. This stalls the leader's main fiber, blocking ALL message processing for all clients. Now the leader disconnects slow followers and continues serving traffic.

Test plan

Existing clustering spec updated to verify non-blocking disconnect behavior
make test SPEC=spec/clustering_spec.cr
Manual test: saturate a leader, add a slow follower, verify it gets disconnected without stalling

…talling leader

Replace blocking @actions.send with @actions.try_send in Follower#send_action. When the action channel is full, the follower is disconnected with a warning instead of blocking the leader's fiber. This prevents a single slow or unresponsive follower from stalling all message processing.

carlhoerberg · 2026-03-02T02:11:16Z

or should we have a more aggressive write timeout for follower connections?

carlhoerberg · 2026-03-02T02:25:37Z

no, this is probably better. a very low write_timeout could cause issues for followers that are not loaded but suddenly see elevated latency.

@lock

When a follower was too slow (FollowerTooSlowError), each_follower called f.close while holding @lock. f.close calls @running.wait, which blocks until action_loop finishes. But action_loop may be stuck in sync() waiting for socket ACKs, and the socket is only closed *after* @running.wait returns: a circular wait, i.e. a deadlock.

Fix: add Follower#disconnect that closes the actions channel and socket immediately (interrupting any blocked IO in action_loop) without waiting. The full cleanup (draining pending actions, closing lz4/socket) is already handled by handle_socket's outer ensure block, which calls close after action_loop exits.

Mirrors the existing Channel::ClosedError pattern, which uses Fiber.yield rather than blocking on cleanup.

P1 finding from code review of PR #1763.
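The difference between interrupting the loop and waiting for it can be sketched as follows. This is a hypothetical model, not the PR's code: InterruptibleFollower is an invented name, and plain channels stand in for the follower socket and for @running.

```crystal
# Hypothetical sketch; channels stand in for the follower socket.
class InterruptibleFollower
  getter? exited = false

  def initialize
    @actions = Channel(String).new(4)
    @acks = Channel(Nil).new # stand-in for the socket ACK read
    @done = Channel(Nil).new # closed once action_loop has returned
  end

  def send_action(action : String) : Nil
    @actions.send(action)
  end

  def action_loop : Nil
    while @actions.receive?
      @acks.receive # blocks until the follower acknowledges
    end
  rescue Channel::ClosedError
    # disconnect closed @acks while the loop was waiting on it
  ensure
    @exited = true
    @done.close
  end

  # Interrupts the loop and returns without waiting for it.
  def disconnect : Nil
    @actions.close
    @acks.close
  end

  # Waits for the loop to finish; in this model it should only be
  # called by code that does not hold the lock.
  def wait : Nil
    @done.receive?
  end
end

follower = InterruptibleFollower.new
spawn { follower.action_loop }
follower.send_action("publish")
Fiber.yield         # let action_loop block waiting for an ACK
follower.disconnect # returns immediately; the loop is woken with an error
follower.wait       # cleanup wait happens separately
```

Calling only a wait here, before anything closes @acks, would hang in the same way the commit message describes for f.close.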

baelter · 2026-03-02T13:48:24Z

Test the scenario in 2.7 where we have better MT.
Can we block less when @actions is full?

Adding a follower should have lower prio than keeping the leader healthy.

oskgu360 · 2026-03-05T08:14:47Z

Adding a follower should have lower prio than keeping the leader healthy.

Just a note on the subject (not on these particular changes): when running an HA cluster, isn't the highest prio to get all nodes in sync? IMO blocking producers/consumers would be the action that allows followers to catch up. Otherwise what's the point of HA?

Not saying we should let the leader die, but maybe not allow more actions until the followers catch up? I.e. the min_isr take: either a config value or at least the minimum number of nodes for quorum.

baelter wants to merge 2 commits into main from feature/nonblocking-follower-actions
