[BUGFIX] Ingester: avoid send-on-closed-channel panic in ActiveQueriedSeriesService#7533
Open
sandy2008 wants to merge 1 commit into
Open
Conversation
…dSeriesService ActiveQueriedSeriesService.stopping() previously closed updateChan while concurrent UpdateSeriesBatch callers could still be inside a non-blocking select+default send. select+default does NOT protect against panicking on send to a closed channel — that always panics. As a result, any in-flight UpdateSeriesBatch during ingester shutdown could crash the process with "panic: send on closed channel". Stop closing the channel. Workers already exit via the existing <-ctx.Done() arm (BasicService cancels the service context before invoking stopping), and m.workers.Wait() still synchronises shutdown. A non-blocking drain after Wait() returns pooled hash slices, keeping shutdown allocation behavior clean. Late sends that arrive after the drain exits are tolerated: UpdateSeriesBatch uses a non-blocking select+default send so producers never block, and any leftover entries in the buffered channel are reclaimed when the service is GC'd. Add TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdown: a deterministic regression test that races 32 concurrent producers against StopAndAwaitTerminated, with a two-phase hammer (before and after shutdown) and per-goroutine panic recovery. The test fails deterministically (5/5 iterations) when the close() is reintroduced and passes 100/100 iterations under -race with the fix. Fixes cortexproject#7531 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Fixes a crash-on-shutdown bug in
ActiveQueriedSeriesService.ActiveQueriedSeriesService.stopping()(pkg/ingester/active_queried_series.go:426before this patch) calledclose(m.updateChan)while concurrent callers ofUpdateSeriesBatch(line 455-476) could still be inside a non-blockingselect { case m.updateChan <- ... : default: }send.select+defaultdoes NOT protect against sending on a closed channel — that always panics. Any in-flightUpdateSeriesBatchduring shutdown could crash the ingester withpanic: send on closed channel.This PR removes
close(m.updateChan)entirely. Workers already exit via the existing<-ctx.Done()arm inprocessUpdates(theBasicServicelifecycle cancels the service context before invokingstopping), andm.workers.Wait()still synchronises shutdown. A non-blocking drain afterWait()returns the remaining pooled hash slices to thesync.Pool. Late sends that arrive after the drain exits are tolerated:UpdateSeriesBatchuses a non-blocking send so producers never block, and any leftover entries in the buffered channel are reclaimed when the service is GC'd.This was found as part of the wider audit of CPU / memory / goroutine leaks in this codebase (same audit as #7528).
Which issue(s) this PR fixes
Fixes #7531
Checklist
CHANGELOG.mdupdated —[BUGFIX] Ingester:entry undermaster / unreleased.TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdownraces 32 concurrent producers againstStopAndAwaitTerminatedwith a two-phase hammer (before and after shutdown) and per-goroutine panic recovery. Test fails deterministically (5/5) ifclose(m.updateChan)is reintroduced; passes 100/100 iterations under-racewith the fix.Test plan
go build -tags "netgo slicelabels" ./pkg/ingester/...— cleango vet -tags "netgo slicelabels" ./pkg/ingester/...— cleangofmt -l— cleangoimports -local github.com/cortexproject/cortex -l— cleango test -race -tags "netgo slicelabels" -run TestActiveQueriedSeriesService_NoSendOnClosedChannelOnShutdown ./pkg/ingester/... -count=100 -timeout 1200s— PASS, no flakesgo test -tags "netgo slicelabels" -run "^TestIngester|^TestActiveQueriedSeries" ./pkg/ingester/... -count=1— PASS, no regressionspanic: send on closed channel(5/5 iterations); restoring fixes it (5/5 pass)🤖 Generated with Claude Code