fix: drop stale routing entries on NoShardsAvailable failures#2
fix: drop stale routing entries on NoShardsAvailable failures#2osyniakov wants to merge 4 commits into
Conversation
When an index was deleted and recreated, the router's per-ingester routing entry for the old incarnation could stay marked as having open shards because the ingester's piggybacked routing update only covers sources it still holds. Persist retries then kept picking the dead entry and the request surfaced as a 503 until Chitchat eventually caught up. Treat a `NoShardsAvailable` failure as a signal that this (leader, index_uid, source_id) has no reachable shard and zero it out in the routing table. If no nodes remain for that (index_id, source_id) the next attempt re-queries the control plane, which returns the fresh incarnation's shards. Fixes quickwit-oss#6324
Clarifies the hidden contract the fix leans on: the zero-out and piggybacked routing update run under the same lock, which is what keeps the rate-limited subcase of NoShardsAvailable correct.
nadav-govari
left a comment
There was a problem hiding this comment.
Thank you for the change - looks good to me. Just one small thing to clean up.
| shard_update.open_shard_count as usize, | ||
| index_uid, | ||
| source_id, | ||
| 0, |
There was a problem hiding this comment.
I'd prefer to just zero out the open shard count but keep the capacity score the same. The best way to do this is with a new short method on the routing table.
There was a problem hiding this comment.
Thanks for review @nadav-govari! The commit with a fix has been pushed. Could you please take another look and if it looks good I am happy to open PR towards main quickwit-oss branch
Address PR review: introduce RoutingTable::mark_node_no_shards instead of calling apply_capacity_update(.., 0, 0). The new method only zeros the open_shard_count and leaves the capacity_score untouched (capacity is a node-level WAL signal independent of any specific source). It also no-ops on missing entries/nodes and on incarnation mismatches, so a narrowing signal can never roll back a fresher entry.
| return; | ||
| }; | ||
| if entry.index_uid != *index_uid { | ||
| return; |
There was a problem hiding this comment.
I think you actually end up with the same problem a different way here- on index_uid changes, we don't actually apply the update. In the existing code we have index uid ordering checks here.
There was a problem hiding this comment.
@nadav-govari the review comment has been addressed. Could you please take another look?
…shards Address PR review: replace the != short-circuit with the same Less / Equal / Greater cmp match used by apply_capacity_update and merge_from_shards. A stale signal (entry newer than the failure's index_uid) is still ignored; a signal for a newer incarnation now advances the entry, drops stale nodes, and forces a CP re-seed — consistent with how the rest of the routing table handles monotonic incarnations.
When an index was deleted and recreated, the router's per-ingester
routing entry for the old incarnation could stay marked as having open
shards because the ingester's piggybacked routing update only covers
sources it still holds. Persist retries then kept picking the dead
entry and the request surfaced as a 503 until Chitchat eventually
caught up.
Treat a
NoShardsAvailablefailure as a signal that this (leader,index_uid, source_id) has no reachable shard and zero it out in the
routing table. If no nodes remain for that (index_id, source_id) the
next attempt re-queries the control plane, which returns the fresh
incarnation's shards.
Fixes quickwit-oss#6324