fix: panic that happens when a target gets deleted when using decompression #3475
Conversation
```diff
 for {
-	r.reader.Run()
+	r.reader.Run(ctx)
```
This part is a bit problematic when using the decompressor, and potentially when using the tailer. We will retry this until the target disappears. But when using the decompressor we don't tail a file; we read it fully until it is finished. I wonder if we should quit there, so we don't keep those readers around when it's not necessary.
I can think of a couple of ways to do this: either set MaxRetries so we don't retry forever, or return some kind of result from Run that indicates whether we should keep going or stop.
I thought that we read the file after initial_delay has elapsed?
Yes, it happens here:
alloy/internal/component/loki/source/file/decompresser.go
Lines 215 to 218 in 4774ea5
The issue is that, when using the decompressor, once Run is done there is nothing more to read from the file, but we still keep the task around.
ptodev left a comment:
Thank you, this is definitely a step in the right direction!
```go
	go d.updatePosition()

	d.metrics.filesActive.Add(1.)
	go func() {
```
Do we want to have a waitgroup to make sure all those goroutines stop before Run returns?
Not really necessary; the stop function for both components only returns when both of these are done, e.g. https://github.com/grafana/alloy/pull/3475/files#diff-fd87f4f66dac7e8ca37dcaa131b3cb5a0c3498347d5c5309fe0bba0b50492576R380-R383
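To illustrate why a WaitGroup isn't strictly needed here: a sketch (with illustrative names, not the actual Alloy code) where each goroutine closes its own done channel and Stop blocks on both, so Stop only returns once everything started by Run has finished:

```go
package main

import "fmt"

// reader models the pattern discussed above: instead of a sync.WaitGroup,
// each goroutine closes a channel when it exits, and Stop waits on both.
type reader struct {
	done    chan struct{}
	posdone chan struct{}
}

func (r *reader) Run() {
	r.done = make(chan struct{})
	r.posdone = make(chan struct{})
	go func() { // position-updating goroutine
		defer close(r.posdone)
		// ... periodically sync the positions file ...
	}()
	go func() { // line-reading goroutine
		defer close(r.done)
		// ... read lines until told to stop ...
	}()
}

// Stop returns only once both goroutines have finished.
func (r *reader) Stop() {
	<-r.done
	<-r.posdone
}

func main() {
	r := &reader{}
	r.Run()
	r.Stop()
	fmt.Println("both goroutines stopped")
}
```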
```diff
@@ -47,12 +48,10 @@ type tailer struct {
 	componentStopping func() bool
```
Does this function still work? I thought it'd lock mut and return stopping?
This one is passed by loki.source.file component:
- https://github.com/grafana/alloy/blob/main/internal/component/loki/source/file/file.go#L293
- https://github.com/grafana/alloy/blob/main/internal/component/loki/source/file/file.go#L315
It is then used by both readers to determine what to do when stopping:
https://github.com/grafana/alloy/blob/main/internal/component/loki/source/file/tailer.go#L392
So yes, this still works.
```go
posquit chan struct{} // used by the readLine method to tell the updatePosition method to stop
posdone chan struct{} // used by the updatePosition method to notify when it stopped
done    chan struct{} // used by the readLine method to notify when it stopped
```
I wonder if all these channels have to be struct fields, and whether all the goroutines have to be started in Run... it is harder to see what depends on what this way. WDYT about, e.g., readLines starting the updatePosition goroutine and stopping it before readLines itself exits?
I was experimenting a bit with this before, not having them as struct fields, but then we have to pass them around.
I made it so that both readers work the same: Run starts both updatePosition and readLines. When Run stops, either by itself or because the context is canceled, we stop readLines and wait for it. readLines then closes posquit, which in turn stops updatePosition.
I can try to think if there is a better way to structure this.
Force-pushed from 8e5e4b4 to dd1f2e7.
…ession (#3475)
* Fix panic that can happen when the file is removed for the decompressor
* Change so that readLine starts and stops updatePosition
* Add changelog
* fix: panic that happens when a target gets deleted when using decompression (#3475)
  * Fix panic that can happen when the file is removed for the decompressor
  * Change so that readLine starts and stops updatePosition
  * Add changelog
* Fix mimir.rules.kubernetes panic on non-leader debug info retrieval (#3451)
  * Fix mimir.rules.kubernetes to only return eventProcessor state if it exists
* fix: deadlock in loki.source.file when target is removed (#3488)
  * Fix deadlock that can happen when stopping reader tasks
* fix: emit valid logfmt key (#3495)
  * Fix log keys to be valid for logfmt
  * Add changelog
* Fix streams limit error check so that metrics are correctly labeled as `ReasonStreamLimited` (#3466)
  * fix: replace direct error string compare with isErrMaxStreamsLimitExceeded helper
  * update CHANGELOG
  * Make errMaxStreamsLimitExceeded an error type

Co-authored-by: Théo Brigitte <theo.brigitte@gmail.com>
Co-authored-by: William Dumont <william.dumont@grafana.com>
Co-authored-by: Marat Khvostov <marathvostov@gmail.com>
PR Description
Alternative to #3452. In this PR I am taking a different approach.
Instead of having to call a `Stop` function on readers, we instead accept a `context.Context` in `Run`. `Run` now blocks until the context is canceled or it stops for any other reason; for the decompressor this happens when the whole file has been read. `Run` is also now responsible for cleanup, so we don't have to use any locks or check for nil channels. This is all an attempt to simplify the logic a bit.
Which issue(s) this PR fixes
Notes to the Reviewer
PR Checklist