…talling leader
Replace blocking @actions.send with @actions.try_send in Follower#send_action. When the action channel is full, the follower is disconnected with a warning instead of blocking the leader's fiber. This prevents a single slow or unresponsive follower from stalling all message processing.
Or should we have a more aggressive write timeout for follower connections?
No, this is probably better. A very low write_timeout could cause issues for followers that are not under load but suddenly see elevated latency.
When a follower was too slow (FollowerTooSlowError), each_follower called f.close while holding @lock. f.close calls @running.wait, which blocks until action_loop finishes. But action_loop may be stuck in sync() waiting for socket ACKs, and the socket is only closed *after* @running.wait returns: a circular wait, i.e. a deadlock.

Fix: add Follower#disconnect that closes the actions channel and socket immediately (interrupting any blocked IO in action_loop) without waiting. The full cleanup (draining pending actions, closing lz4/socket) is already handled by handle_socket's outer ensure block, which calls close after action_loop exits.

Mirrors the existing Channel::ClosedError pattern which uses Fiber.yield rather than blocking on cleanup.

P1 finding from code review of PR #1763.
Adding a follower should have lower prio than keeping the leader healthy.
Just a note on the subject (not on these particular changes): when running an HA cluster, isn't the highest prio to get all nodes in sync? IMO blocking producers/consumers would be the action that allows followers to catch up. Otherwise, what's the point of HA? Not saying we should let the leader die, but maybe not allow more actions until the followers catch up? I.e the
Summary
- Replace blocking @actions.send with @actions.try_send in Follower#send_action
- FollowerTooSlowError caught in Server#each_follower, which logs a warning and closes the slow follower
Motivation
When the leader is CPU-saturated, slow or unresponsive followers cause send_action to block on the bounded channel. This stalls the leader's main fiber, blocking ALL message processing for all clients. Now the leader disconnects slow followers and continues serving traffic.
Test plan
make test SPEC=spec/clustering_spec.cr