
fix: panic that happens when a target gets deleted when using decompression #3475

Merged
kalleep merged 6 commits into main from kalleep/fix-loki-file-panic-2 on May 2, 2025
Conversation

@kalleep (Contributor) commented Apr 30, 2025

PR Description

Alternative to #3452. In this PR I am taking a different approach.

Instead of having to call a Stop function on readers, we accept a context.Context in Run. Run now blocks until the context is canceled or the reader stops for any other reason; for the decompressor this happens when the whole file has been read.
Run is also now responsible for cleanup, so we don't have to use any locks or check for nil channels. This is all done as an attempt to simplify the logic a bit.

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@kalleep kalleep requested a review from a team as a code owner April 30, 2025 10:29

```diff
 for {
-	r.reader.Run()
+	r.reader.Run(ctx)
```
@kalleep (Contributor, Author) commented:
This part is a bit problematic when using the decompressor, and potentially when using the tailer. We will retry until the target disappears. But the decompressor doesn't tail a file; it reads the file fully until it's finished. I wonder if we should quit there so we don't keep these tasks around when it's not necessary.

I can think of a couple of ways to do this: either set MaxRetries so we don't retry forever, or return some kind of result from Run that indicates whether we should keep going or stop.

@ptodev (Contributor) commented:

I thought that we read the file after initial_delay has elapsed?

@kalleep (Contributor, Author) replied:

Yes, it happens here:

```go
if d.cfg.InitialDelay > 0 {
	level.Info(d.logger).Log("msg", "sleeping before starting decompression", "path", d.path, "duration", d.cfg.InitialDelay.String())
	time.Sleep(d.cfg.InitialDelay)
}
```

The issue is that when Run is done with the decompressor, there is nothing more to read from the file, but we still keep the task around.

@ptodev (Contributor) left a comment:
Thank you, this is definitely a step in the right direction!



```go
go d.updatePosition()

d.metrics.filesActive.Add(1.)
go func() {
```
Contributor commented:
Do we want to have a waitgroup to make sure all those goroutines stop before Run returns?
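The pattern being asked about is the standard sync.WaitGroup idiom: Run waits for every goroutine it spawned before returning. A minimal sketch, with illustrative names rather than the actual Alloy code:

```go
package main

import (
	"fmt"
	"sync"
)

// Run spawns worker goroutines and only returns once every one of them
// has finished, by waiting on a sync.WaitGroup.
func Run() int {
	var wg sync.WaitGroup
	results := make(chan int, 2)

	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- id // stand-in for the real goroutine's work
		}(i)
	}

	wg.Wait() // block until both goroutines above have returned
	close(results)

	n := 0
	for range results {
		n++
	}
	return n
}

func main() {
	fmt.Println("goroutines finished:", Run())
}
```

As noted in the reply below the question, the PR achieves the same guarantee through the components' stop functions instead, so an explicit WaitGroup inside Run isn't needed there.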

@kalleep (Contributor, Author) replied on Apr 30, 2025:
Not really necessary; the stop function for both components only returns when both of these are done, e.g. https://github.com/grafana/alloy/pull/3475/files#diff-fd87f4f66dac7e8ca37dcaa131b3cb5a0c3498347d5c5309fe0bba0b50492576R380-R383

```diff
@@ -47,12 +48,10 @@ type tailer struct {
 	componentStopping func() bool
```
Contributor commented:

Does this function still work? I thought it'd lock mut and return stopping?

Comment on lines +52 to +54:

```go
posquit chan struct{} // used by the readLine method to tell the updatePosition method to stop
posdone chan struct{} // used by the updatePosition method to notify when it stopped
done    chan struct{} // used by the readLine method to notify when it stopped
```
Contributor commented:

I wonder if all these channels have to be struct fields and whether all the goroutines have to be started in Run... it is harder to see what depends on what this way. WDYT about e.g. readLines starting the updatePosition goroutine and closing it before it closes?
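The suggestion above, readLines starting and stopping the updatePosition goroutine itself so the channels never need to be struct fields, could look roughly like this (function signatures and the line-counting are illustrative, not the actual code):

```go
package main

import "fmt"

// updatePosition blocks until posquit is closed, then signals via posdone.
func updatePosition(posquit <-chan struct{}, posdone chan<- struct{}) {
	defer close(posdone)
	<-posquit // the real method would periodically sync the position file
}

// readLines owns the updatePosition goroutine: it starts it on entry and
// stops and waits for it before returning, keeping the dependency local.
func readLines(lines []string) int {
	posquit := make(chan struct{})
	posdone := make(chan struct{})
	go updatePosition(posquit, posdone)
	defer func() {
		close(posquit) // tell updatePosition to stop...
		<-posdone      // ...and wait until it has
	}()

	n := 0
	for _, l := range lines {
		fmt.Println(l) // stand-in for forwarding a read line
		n++
	}
	return n
}

func main() {
	fmt.Println("read", readLines([]string{"a", "b"}), "lines")
}
```

With this structure the channels are local variables, so the lifetime of updatePosition is visible entirely within readLines.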

@kalleep (Contributor, Author) replied:

I was experimenting with this before, not having them as struct fields, but then we have to pass them around.

I made it so that both readers work the same way.

Run starts both updatePosition and readLines. When Run stops, either by itself or because the context was canceled, we stop readLines and wait for it; readLines in turn closes posquit, which stops updatePosition.

I can try to think whether there is a better way to structure this.

@kalleep (Contributor, Author) commented on Apr 30, 2025:
dd1f2e7 is an attempt to do this, wdyt?

@kalleep kalleep force-pushed the kalleep/fix-loki-file-panic-2 branch from 8e5e4b4 to dd1f2e7 Compare April 30, 2025 13:18
@kalleep kalleep merged commit 85718e8 into main May 2, 2025
38 checks passed
@kalleep kalleep deleted the kalleep/fix-loki-file-panic-2 branch May 2, 2025 13:22
kalleep added a commit that referenced this pull request May 5, 2025
…ession (#3475)

* Fix panic that can happen when file is removed for decompressor

* Change so that readLine starts and stops updatePosition

* Add changelog
kalleep added a commit that referenced this pull request May 5, 2025
* fix: panic that happens when a target gets deleted when using decompression (#3475)

* Fix panic that can happen when file is removed for decompressor

* Change so that readLine starts and stops updatePosition

* Add changelog

* Fix mimir.rules.kubernetes panic on non-leader debug info retrieval (#3451)

* Fix mimir.rules.kubernetes to only return eventProcessor state if it exists

* fix: deadlock in loki.source.file when target is removed (#3488)

* Fix deadlock that can happen when stopping reader tasks

Co-authored-by: William Dumont <william.dumont@grafana.com>

* fix: emit valid logfmt key (#3495)

* Fix log keys to be valid for logfmt

* Add changelog

* Fix streams limit error check so that metrics are correctly labeled as `ReasonStreamLimited` (#3466)

* fix: replace direct error string compare with isErrMaxStreamsLimitExceeded helper

* update CHANGELOG

* Make errMaxStreamsLimitExceeded an error type

---------

Co-authored-by: Théo Brigitte <theo.brigitte@gmail.com>
Co-authored-by: William Dumont <william.dumont@grafana.com>
Co-authored-by: Marat Khvostov <marathvostov@gmail.com>
marctc pushed a commit that referenced this pull request May 7, 2025
…ession (#3475)

* Fix panic that can happen when file is removed for decompressor

* Change so that readLine starts and stops updatePosition

* Add changelog
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 2, 2025

3 participants