Skip to content

Non-blocking follower action sending#1763

Draft
baelter wants to merge 2 commits intomainfrom
feature/nonblocking-follower-actions
Draft

Non-blocking follower action sending#1763
baelter wants to merge 2 commits intomainfrom
feature/nonblocking-follower-actions

Conversation

@baelter
Copy link
Member

@baelter baelter commented Feb 28, 2026

Summary

  • Replace blocking @actions.send with @actions.try_send in Follower#send_action
  • When a follower's action channel is full, disconnect it immediately instead of stalling the leader's main fiber
  • Adds FollowerTooSlowError caught in Server#each_follower which logs a warning and closes the slow follower

Motivation

When the leader is CPU-saturated, slow or unresponsive followers cause send_action to block on the bounded channel. This stalls the leader's main fiber, blocking ALL message processing for all clients. Now the leader disconnects slow followers and continues serving traffic.

Test plan

  • Existing clustering spec updated to verify non-blocking disconnect behavior
  • make test SPEC=spec/clustering_spec.cr
  • Manual test: saturate a leader, add a slow follower, verify it gets disconnected without stalling

…talling leader

Replace blocking @actions.send with @actions.try_send in Follower#send_action.
When the action channel is full, the follower is disconnected with a warning
instead of blocking the leader's fiber. This prevents a single slow or
unresponsive follower from stalling all message processing.
@claude

This comment was marked as resolved.

@carlhoerberg
Copy link
Member

or should we have a more aggressive write timeout for follower connections?

@carlhoerberg
Copy link
Member

no, this is probably better. a very low write_timeout could cause issues with non loaded followers but with suddenly elevated latency.

When a follower was too slow (FollowerTooSlowError), each_follower called
f.close while holding @lock. f.close calls @running.wait, which blocks
until action_loop finishes. But action_loop may be stuck in sync() waiting
for socket ACKs, and the socket is only closed *after* @running.wait returns
— a circular wait/deadlock.

Fix: add Follower#disconnect that closes the actions channel and socket
immediately (interrupting any blocked IO in action_loop) without waiting.
The full cleanup (draining pending actions, closing lz4/socket) is already
handled by handle_socket's outer ensure block, which calls close after
action_loop exits.

Mirrors the existing Channel::ClosedError pattern which uses Fiber.yield
rather than blocking on cleanup.

P1 finding from code review of PR #1763.
@baelter
Copy link
Member Author

baelter commented Mar 2, 2026

  • Test the scenario in 2.7 where we have better MT.
  • Can we block less when @actions is full?

Adding a follower should have lower prio then keeping the leader health.

@oskgu360
Copy link
Member

oskgu360 commented Mar 5, 2026

Adding a follower should have lower prio then keeping the leader health.

Just a note on the subject (not on these particular changes), when running HA cluster, isn't highest prio to get all nodes in sync? IMO blocking producers/consumers would be the action to allow for followers to catch up? Otherwise whats the point of HA?

Not saying we should let the leader die, but maybe not allow for more actions until follower catch up? I.e the min_isr take, either config or up to a mininmum number of nodes for quorom at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants