fix: drop stale routing entries on NoShardsAvailable failures by osyniakov · Pull Request #2 · osyniakov/quickwit

osyniakov · 2026-04-21T18:15:15Z

When an index was deleted and recreated, the router's per-ingester
routing entry for the old incarnation could stay marked as having open
shards because the ingester's piggybacked routing update only covers
sources it still holds. Persist retries then kept picking the dead
entry and the request surfaced as a 503 until Chitchat eventually
caught up.

Treat a NoShardsAvailable failure as a signal that this (leader,
index_uid, source_id) has no reachable shard and zero it out in the
routing table. If no nodes remain for that (index_id, source_id) the
next attempt re-queries the control plane, which returns the fresh
incarnation's shards.

Fixes quickwit-oss#6324
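
A minimal sketch of the behavior described above, not the actual Quickwit code. The types `RoutingTable` and `NodeEntry` are simplified stand-ins, and the method names `on_no_shards_available` and `needs_control_plane_query` are hypothetical. It only illustrates the idea: zero out the failing leader's entry, and fall back to the control plane once no node advertises an open shard.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the router's types; the real
// Quickwit definitions differ.
type NodeId = String;

#[derive(Debug, Default)]
struct NodeEntry {
    open_shard_count: usize,
}

#[derive(Debug, Default)]
struct RoutingTable {
    // Keyed by (index_id, source_id), then by ingester node.
    entries: HashMap<(String, String), HashMap<NodeId, NodeEntry>>,
}

impl RoutingTable {
    /// Records that `leader` reported no available shards for this source.
    /// Unknown sources or nodes are left untouched.
    fn on_no_shards_available(&mut self, index_id: &str, source_id: &str, leader: &str) {
        let key = (index_id.to_string(), source_id.to_string());
        if let Some(nodes) = self.entries.get_mut(&key) {
            if let Some(node) = nodes.get_mut(leader) {
                node.open_shard_count = 0;
            }
        }
    }

    /// True when no known node still advertises an open shard, meaning the
    /// next attempt should ask the control plane instead of retrying locally.
    fn needs_control_plane_query(&self, index_id: &str, source_id: &str) -> bool {
        let key = (index_id.to_string(), source_id.to_string());
        match self.entries.get(&key) {
            Some(nodes) => nodes.values().all(|node| node.open_shard_count == 0),
            None => true,
        }
    }
}
```

With a single node in the table, one `NoShardsAvailable` failure is enough to make the next attempt go to the control plane rather than retry the dead entry.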

nadav-govari

Thank you for the change - looks good to me. Just one small thing to clean up.

nadav-govari · 2026-04-27T16:36:18Z

-                                shard_update.open_shard_count as usize,
+                                index_uid,
+                                source_id,
+                                0,


I'd prefer to just zero out the open shard count but keep the capacity score the same. The best way to do this is with a new short method on the routing table.

Thanks for the review, @nadav-govari! The commit with the fix has been pushed. Could you please take another look? If it looks good, I am happy to open a PR against the main quickwit-oss branch.

Address PR review: introduce RoutingTable::mark_node_no_shards instead of calling apply_capacity_update(.., 0, 0). The new method only zeros the open_shard_count and leaves the capacity_score untouched (capacity is a node-level WAL signal independent of any specific source). It also no-ops on missing entries/nodes and on incarnation mismatches, so a narrowing signal can never roll back a fresher entry.
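
A rough sketch of what such a method could look like at this stage of the review, under stated assumptions: `IndexUid`, `NodeEntry`, `SourceEntry`, and the table layout are simplified, hypothetical types, and only the method name `mark_node_no_shards` and the fields `open_shard_count` and `capacity_score` come from the commit message. The point it illustrates is that only the shard count is touched and every mismatch is a no-op.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types; the real Quickwit definitions differ.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IndexUid {
    index_id: String,
    incarnation: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeEntry {
    capacity_score: usize,
    open_shard_count: usize,
}

#[derive(Debug)]
struct SourceEntry {
    index_uid: IndexUid,
    nodes: HashMap<String, NodeEntry>,
}

#[derive(Debug, Default)]
struct RoutingTable {
    entries: HashMap<(String, String), SourceEntry>,
}

impl RoutingTable {
    /// Zeroes the open shard count for `node_id` and leaves its capacity
    /// score untouched. Does nothing when the entry or node is missing, or
    /// when the entry belongs to a different incarnation of the index.
    fn mark_node_no_shards(&mut self, node_id: &str, index_uid: &IndexUid, source_id: &str) {
        let key = (index_uid.index_id.clone(), source_id.to_string());
        let Some(entry) = self.entries.get_mut(&key) else {
            return;
        };
        if entry.index_uid != *index_uid {
            return;
        }
        let Some(node) = entry.nodes.get_mut(node_id) else {
            return;
        };
        node.open_shard_count = 0;
    }
}
```

Keeping `capacity_score` out of this method follows the reasoning in the commit message: capacity is a node-level signal, so a failure for one source should not change it.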

nadav-govari · 2026-05-04T18:38:05Z

+            return;
+        };
+        if entry.index_uid != *index_uid {
+            return;


I think you actually end up with the same problem a different way here: on index_uid changes, we don't actually apply the update. The existing code has index_uid ordering checks here.

@nadav-govari the review comment has been addressed. Could you please take another look?

…shards

Address PR review: replace the != short-circuit with the same Less / Equal / Greater cmp match used by apply_capacity_update and merge_from_shards. A stale signal (entry newer than the failure's index_uid) is still ignored; a signal for a newer incarnation now advances the entry, drops stale nodes, and forces a control plane (CP) re-seed, consistent with how the rest of the routing table handles monotonic incarnations.
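
The three-way comparison can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `IndexUid` is modeled as an index id plus a numeric incarnation with derived ordering, and the `seeded_from_cp` flag is a hypothetical way to represent "forces a CP re-seed".

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

// Hypothetical, simplified types. Ordering on IndexUid is derived, so for the
// same index_id a larger incarnation compares as greater (newer).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct IndexUid {
    index_id: String,
    incarnation: u64,
}

#[derive(Debug)]
struct NodeEntry {
    capacity_score: usize,
    open_shard_count: usize,
}

#[derive(Debug)]
struct SourceEntry {
    index_uid: IndexUid,
    nodes: HashMap<String, NodeEntry>,
    // Hypothetical flag: false means the entry must be re-seeded from the
    // control plane before it is used for routing again.
    seeded_from_cp: bool,
}

fn mark_node_no_shards(entry: &mut SourceEntry, node_id: &str, index_uid: &IndexUid) {
    match entry.index_uid.cmp(index_uid) {
        // The entry is newer than the failure's incarnation: stale signal.
        Ordering::Greater => {}
        // Same incarnation: zero the count, keep the capacity score.
        Ordering::Equal => {
            if let Some(node) = entry.nodes.get_mut(node_id) {
                node.open_shard_count = 0;
            }
        }
        // The failure refers to a newer incarnation: advance the entry,
        // drop the nodes of the old incarnation, and require a re-seed.
        Ordering::Less => {
            entry.index_uid = index_uid.clone();
            entry.nodes.clear();
            entry.seeded_from_cp = false;
        }
    }
}
```

Compared with an equality check, the `Less` arm is the difference: a signal for a newer incarnation is no longer silently dropped.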

claude added 2 commits April 21, 2026 18:14

fix: document routing_update invariant on NoShardsAvailable fix

227048f

Clarifies the hidden contract the fix leans on: the zero-out and piggybacked routing update run under the same lock, which is what keeps the rate-limited subcase of NoShardsAvailable correct.
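
One way to read this invariant, sketched under loud assumptions: the zero-out is applied first and the piggybacked routing update second, both while the same guard is held, and in the rate-limited subcase the ingester still holds the source, so its piggybacked update carries the real count. The `Router` type, its field, and the `on_no_shards_available` method are hypothetical and collapse the routing table to a single source.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical, simplified router state: open shard count per ingester node
// for a single (index_uid, source_id).
#[derive(Debug, Default)]
struct Router {
    open_shard_counts: Mutex<HashMap<String, usize>>,
}

impl Router {
    /// Handles a NoShardsAvailable failure from `leader`. `piggybacked` is
    /// the open shard count carried by the routing update on the same
    /// response, or `None` when the ingester no longer holds the source.
    fn on_no_shards_available(&self, leader: &str, piggybacked: Option<usize>) {
        // One guard covers both steps, so no other thread can observe the
        // zeroed count before the piggybacked update has been applied.
        let mut counts = self.open_shard_counts.lock().unwrap();
        if let Some(count) = counts.get_mut(leader) {
            *count = 0;
        }
        // Assumed order: the piggybacked update is applied after the zero-out
        // and restores the count when the ingester still holds the source.
        if let Some(open_shard_count) = piggybacked {
            counts.insert(leader.to_string(), open_shard_count);
        }
    }
}
```

Under these assumptions, a rate-limited ingester ends up with its real shard count after the call, while an ingester that no longer holds the source stays at zero.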

nadav-govari reviewed Apr 27, 2026

nadav-govari reviewed May 4, 2026

ncoiffier-celonis mentioned this pull request Jun 19, 2026

Persistent "no open shard found on ingester" after some indexer restart quickwit-oss/quickwit#6531

Open

osyniakov wants to merge 4 commits into main from claude/fix-issue-6324-mHady

osyniakov commented Apr 21, 2026
nadav-govari left a comment
nadav-govari Apr 27, 2026
osyniakov Apr 27, 2026
nadav-govari May 4, 2026
osyniakov May 10, 2026
