
[BUGFIX] Ingester: avoid send-on-closed-channel panic in ActiveQueriedSeriesService #7533

Open
sandy2008 wants to merge 1 commit into cortexproject:master from sandy2008:fix-active-queried-series-close-race-v2

Conversation

@sandy2008
Contributor

What this PR does

Fixes a crash-on-shutdown bug in ActiveQueriedSeriesService.

ActiveQueriedSeriesService.stopping() (pkg/ingester/active_queried_series.go:426 before this patch) called close(m.updateChan) while concurrent callers of UpdateSeriesBatch (lines 455-476) could still be inside a non-blocking select { case m.updateChan <- ... : default: } send. A select with a default arm does NOT protect against sending on a closed channel: a send on a closed channel is always ready to proceed, so the send case is chosen and panics. Any in-flight UpdateSeriesBatch during shutdown could therefore crash the ingester with panic: send on closed channel.
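The failure mode can be reproduced in isolation. The following is a minimal sketch, not code from this PR: trySend is a hypothetical helper that wraps the same select+default shape and uses recover only so the demo can report the panic instead of crashing.

```go
package main

import "fmt"

// trySend is a hypothetical helper (not part of Cortex). It performs a
// non-blocking send and reports whether the value was queued. If the send
// panics, recover captures the panic value so the caller can inspect it.
func trySend(ch chan<- int, v int) (sent bool, panicked any) {
	defer func() { panicked = recover() }()
	select {
	case ch <- v:
		return true, nil
	default:
		// Taken only when the send would block (buffer full, no receiver).
		return false, nil
	}
}

func main() {
	ch := make(chan int, 1)

	sent, p := trySend(ch, 1) // free buffer slot: value is queued
	fmt.Println(sent, p)

	sent, p = trySend(ch, 2) // buffer full: default arm is taken
	fmt.Println(sent, p)

	close(ch)
	sent, p = trySend(ch, 3) // closed channel: the send case runs and panics
	fmt.Println(sent, p)
}
```

The default arm only covers the "would block" case. Once the channel is closed the send never blocks, so default is never reached.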

This PR removes close(m.updateChan) entirely. Workers already exit via the existing <-ctx.Done() arm in processUpdates (the BasicService lifecycle cancels the service context before invoking stopping), and m.workers.Wait() still synchronises shutdown. A non-blocking drain after Wait() returns the remaining pooled hash slices to the sync.Pool. Late sends that arrive after the drain exits are tolerated: UpdateSeriesBatch uses a non-blocking send so producers never block, and any leftover entries in the buffered channel are reclaimed when the service is GC'd.
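The shutdown order described above can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the actual Cortex code: the tracker type, its field names, and the worker body are all hypothetical, and the context is cancelled by hand here where the real service relies on the BasicService lifecycle.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// tracker is a hypothetical stand-in for the service; all names are illustrative.
type tracker struct {
	updates chan []uint64 // never closed, so senders can never panic
	workers sync.WaitGroup
	cancel  context.CancelFunc
	pool    sync.Pool
}

func newTracker(numWorkers, buf int) *tracker {
	ctx, cancel := context.WithCancel(context.Background())
	tr := &tracker{updates: make(chan []uint64, buf), cancel: cancel}
	tr.pool.New = func() any { return make([]uint64, 0, 16) }
	for i := 0; i < numWorkers; i++ {
		tr.workers.Add(1)
		go func() {
			defer tr.workers.Done()
			for {
				select {
				case <-ctx.Done(): // workers exit on cancellation, not on close
					return
				case h := <-tr.updates:
					tr.pool.Put(h[:0]) // "process" the batch, then recycle it
				}
			}
		}()
	}
	return tr
}

// update is a non-blocking send; it drops the batch if the buffer is full.
func (tr *tracker) update(hashes []uint64) bool {
	select {
	case tr.updates <- hashes:
		return true
	default:
		return false
	}
}

// stop cancels the workers, waits for them, then drains whatever is still
// buffered back into the pool. It deliberately does not close tr.updates.
func (tr *tracker) stop() {
	tr.cancel()
	tr.workers.Wait()
	for {
		select {
		case h := <-tr.updates:
			tr.pool.Put(h[:0])
		default:
			return
		}
	}
}

func main() {
	tr := newTracker(2, 4)
	tr.update([]uint64{1, 2, 3})
	tr.stop()
	// A late send after shutdown lands in the buffer and is later collected
	// together with the channel; it neither blocks nor panics.
	fmt.Println("late send accepted:", tr.update([]uint64{4}))
}
```

The trade-off in this design is that entries sent after the drain are never processed or returned to the pool, which is acceptable only because the service is going away.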

This was found as part of the wider audit of CPU / memory / goroutine leaks in this codebase (same audit as #7528).

Which issue(s) this PR fixes

Fixes #7531

Checklist

  • CHANGELOG.md updated — [BUGFIX] Ingester: entry under master / unreleased.
  • Documentation updated — not applicable; no flags or config changed.
  • Tests added: TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdown races 32 concurrent producers against StopAndAwaitTerminated with a two-phase hammer (before and after shutdown) and per-goroutine panic recovery. Test fails deterministically (5/5) if close(m.updateChan) is reintroduced; passes 100/100 iterations under -race with the fix.

Test plan

  • go build -tags "netgo slicelabels" ./pkg/ingester/... — clean
  • go vet -tags "netgo slicelabels" ./pkg/ingester/... — clean
  • gofmt -l — clean
  • goimports -local github.com/cortexproject/cortex -l — clean
  • go test -race -tags "netgo slicelabels" -run TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdown ./pkg/ingester/... -count=100 -timeout 1200s — PASS, no flakes
  • go test -tags "netgo slicelabels" -run "^TestIngester|^TestActiveQueriedSeries" ./pkg/ingester/... -count=1 — PASS, no regressions
  • Reverting the fix and re-running the regression test produces panic: send on closed channel (5/5 iterations); restoring fixes it (5/5 pass)

🤖 Generated with Claude Code

…dSeriesService

ActiveQueriedSeriesService.stopping() previously closed updateChan while
concurrent UpdateSeriesBatch callers could still be inside a non-blocking
select+default send. select+default does NOT protect against panicking
on send to a closed channel — that always panics. As a result, any
in-flight UpdateSeriesBatch during ingester shutdown could crash the
process with "panic: send on closed channel".

Stop closing the channel. Workers already exit via the existing
<-ctx.Done() arm (BasicService cancels the service context before
invoking stopping), and m.workers.Wait() still synchronises shutdown.
A non-blocking drain after Wait() returns pooled hash slices, keeping
shutdown allocation behavior clean. Late sends that arrive after the
drain exits are tolerated: UpdateSeriesBatch uses a non-blocking
select+default send so producers never block, and any leftover entries
in the buffered channel are reclaimed when the service is GC'd.

Add TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdown: a
deterministic regression test that races 32 concurrent producers
against StopAndAwaitTerminated, with a two-phase hammer (before and
after shutdown) and per-goroutine panic recovery. The test fails
deterministically (5/5 iterations) when the close() is reintroduced
and passes 100/100 iterations under -race with the fix.

Fixes cortexproject#7531

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@dosubot (bot) added the component/ingester, go, type/bug, and type/tests labels on May 20, 2026

Development

Successfully merging this pull request may close these issues.

[BUG] Ingester: panic 'send on closed channel' in ActiveQueriedSeriesService — Stop races concurrent UpdateSeriesBatch senders
