Skip to content

loki.source.file: refactor tail package#5003

Merged
kalleep merged 30 commits intomainfrom
kalleep/refactor-tailer
Dec 11, 2025
Merged

loki.source.file: refactor tail package#5003
kalleep merged 30 commits intomainfrom
kalleep/refactor-tailer

Conversation

@kalleep
Copy link
Contributor

@kalleep kalleep commented Dec 4, 2025

PR Description

During hackaton I decided to explore new api for our internal / vendored tail package.

The package had questionable locking and it is really hard to determine if it was done correctly. The design of it also spawned 2 goroutines per file we tailed and everything was communicated over a channel.

So the design I came up with is a much smaller api surface with only two public methods, Next and Stop.

Next will return next line of the file or block until either a file event occurs or the file is stopped.
Stop will cancel any blocked calls to Next and close the file.

I also added offset to Line, this is the offset into the file right after the line we just consumed. This allows us to remove an additional goroutine in tailer.go, the backgroud job updating position and metrics. We can instead do this during the read loop still respecting the interval for updates.

I ran two alloy collectors with the same config and tailing 20 files

2025-12-04-11:52:59

This will drastically reduce number of goroutines and reduce the memory overhead that comes with that for alloy instances tailing many files.

I also removed usage of loki.EntryHandler and set labels directly on entry we send, this will remove one additional goroutine.

So in total it would be 4 less goroutines per filed taild.

Which issue(s) this PR fixes

Notes to the Reviewer

  • Watcher was refactored into two functions instead, one that block until a file exists or context is canceled and one that blocks until a file event is detected, no need to run these as backgroud jobs and send events through channels

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@kalleep kalleep requested a review from a team as a code owner December 4, 2025 10:56
Copy link
Contributor Author

@kalleep kalleep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what I should put in the changelog, any suggestions?

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 4, 2025
@kalleep kalleep requested a review from Copilot December 4, 2025 14:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the internal tail package to simplify its API and reduce resource usage. The refactoring eliminates 4 goroutines per tailed file by replacing the channel-based design with direct method calls (Next() and Stop()), consolidating position updates into the main read loop, and removing the loki.EntryHandler wrapper. The new design reduces memory overhead and complexity while maintaining all existing functionality.

Key Changes:

  • Simplified tail package API from channel-based to blocking method calls with Next() and Stop() methods
  • Added Offset field to Line struct to enable inline position tracking without a separate goroutine
  • Consolidated file watching logic into standalone blocking functions instead of background goroutines

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/component/loki/source/file/tailer.go Refactored to use new tail API, removed separate position update goroutine, merged label handling into entry creation
internal/component/loki/source/file/tailer_test.go Updated test to accommodate new tailer startup behavior
internal/component/loki/source/file/internal/tail/file.go New implementation of file tailing with blocking Next() method
internal/component/loki/source/file/internal/tail/file_test.go Comprehensive test suite for new File implementation
internal/component/loki/source/file/internal/tail/line.go Added Line struct with Offset field for position tracking
internal/component/loki/source/file/internal/tail/config.go New configuration structures for File and WatcherConfig
internal/component/loki/source/file/internal/tail/block.go Standalone blocking functions for file existence and event detection
internal/component/loki/source/file/internal/tail/fileext/file_*.go Consolidated and renamed from winfile package, includes platform-specific file operations
internal/component/loki/source/file/internal/tail/watch/*.go Removed old watch package with channel-based file watching
internal/component/loki/source/file/internal/tail/tail.go Removed old channel-based tail implementation
internal/component/loki/source/file/internal/tail/util/util.go Removed unused utility package
internal/component/loki/source/file/internal/tail/README.md Removed outdated README referencing old implementation
Comments suppressed due to low confidence (1)

internal/component/loki/source/file/internal/tail/fileext/file_windows.go:104

  • Typo in comment: "ended information" should be "extended information" to match the description above.

Comment on lines +204 to +205
MinBackoff: defaultWatcherConfig.MaxPollFrequency,
MaxBackoff: defaultWatcherConfig.MaxPollFrequency,
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backoff configuration uses defaultWatcherConfig.MaxPollFrequency instead of f.cfg.WatcherConfig.MaxPollFrequency. This means that even if the user configured custom polling frequencies, the reopen logic will always use the default values (250ms). This is inconsistent with the rest of the file watching behavior and could lead to unexpected delays or polling frequencies during file reopens. Consider using f.cfg.WatcherConfig.MaxPollFrequency instead.

Suggested change
MinBackoff: defaultWatcherConfig.MaxPollFrequency,
MaxBackoff: defaultWatcherConfig.MaxPollFrequency,
MinBackoff: f.cfg.WatcherConfig.MaxPollFrequency,
MaxBackoff: f.cfg.WatcherConfig.MaxPollFrequency,

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, AI's comment seems valid here, can you leave a comment to explain?

Copy link
Contributor Author

@kalleep kalleep Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the original implementation had this too https://github.com/grafana/alloy/blob/main/internal/component/loki/source/file/internal/tail/tail.go#L229

So I just kept the same behavior but we can for sure change this

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 4, 2025
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 9d7217a to 1fa3763 Compare December 5, 2025 09:02
@kalleep kalleep added os:windows publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed os:windows publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 5, 2025
@kalleep kalleep marked this pull request as draft December 5, 2025 10:35
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 4d413e9 to 1fa3763 Compare December 5, 2025 10:53
@kalleep
Copy link
Contributor Author

kalleep commented Dec 5, 2025

Deployed this to our biggest internal cluster:
2025-12-05-14:05:00

@kalleep kalleep added publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository and removed publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository labels Dec 8, 2025
@kalleep kalleep marked this pull request as ready for review December 8, 2025 09:06

line = strings.TrimRight(line, "\n")
// Trim Windows line endings
line = strings.TrimSuffix(line, "\r")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just TrimRight "\r\n"? TrimRight will remove all of the trailing characters that are in the set, so it'll remove anything, and the current behavior will only trim a single newline on Windows but all newlines on non-Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what was there before but you are right we can just use strings.TrimRight("\r\n") 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I did it this way is that I wasn't sure whether \r could have any other uses at the end of the line. TrimRight(line, "\r\n") would also trim a line which ends with \r\r\n. I wanted to retain the first \r since it's technically not part of the line break.

Comment on lines +204 to +205
MinBackoff: defaultWatcherConfig.MaxPollFrequency,
MaxBackoff: defaultWatcherConfig.MaxPollFrequency,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, AI's comment seems valid here, can you leave a comment to explain?

@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 27d7fc2 to 12dcebf Compare December 10, 2025 08:45
@kalleep kalleep requested a review from dehaansa December 10, 2025 08:53
@github-actions
Copy link
Contributor

github-actions bot commented Dec 10, 2025

🔍 Dependency Review

gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 -> removed — ⚠️ Needs Review

The dependency gopkg.in/tomb.v1 has been removed. This library was previously used to manage goroutine lifecycles (start/kill/wait via a Tomb), notably within the file tailing implementation under internal/component/loki/source/file/internal/tail.

The PR already refactors the internal tailer to use context-based cancellation and backoff, eliminating the need for tomb. If there are other parts of the repository still using tomb, those will now fail to compile and will require migration to context (or errgroup). Please review any remaining imports of gopkg.in/tomb.v1 across the repo.

Key migration patterns:

  • Replace tomb lifecycles with context.Context and cancel functions.
  • Replace tomb-backed blocking and polling with context-aware polling/backoff.

Concise code updates to adopt removal:

  • Replace embedded Tomb fields and Kill/Wait handling with context:
- type Tail struct {
-   ...
-   tomb.Tomb // provides: Done, Kill, Dying
- }
+ type Tail struct {
+   ...
+   ctx    context.Context
+   cancel context.CancelFunc
+ }
  • Replace constructor patterns that relied on tomb with context:
- go t.tailFileSync()
+ ctx, cancel := context.WithCancel(context.Background())
+ t.ctx, t.cancel = ctx, cancel
+ go t.tailFileSync()
  • Replace select cases using tomb’s Dying with context cancellation:
- case <-tail.Dying():
+ case <-tail.ctx.Done():
  • Replace Kill/Wait-based Stop with context cancel + resource close:
- func (t *Tail) Stop() error {
-   t.Kill(nil)
-   return t.Wait()
- }
+ func (t *Tail) Stop() error {
+   t.cancel()
+   t.mu.Lock()
+   defer t.mu.Unlock()
+   return t.file.Close()
+ }
  • Replace tomb-aware “wait until file exists” with context + backoff:
- if err := tail.watcher.BlockUntilExists(&tail.Tomb); err != nil { ... }
+ if err := blockUntilExists(ctx, cfg); err != nil { ... }
  • Replace tomb + watcher notifications with a context-aware polling loop that detects file events:
- changes, err := tail.watcher.ChangeEvents(&tail.Tomb, pos)
- select {
- case <-changes.Modified:
-   ...
- case <-changes.Deleted:
-   ...
- case <-changes.Truncated:
-   ...
- }
+ event, err := blockUntilEvent(ctx, f, prevSize, cfg)
+ switch event {
+ case eventModified:
+   ...
+ case eventDeleted:
+   ...
+ case eventTruncated:
+   ...
+ }

These replacements match the refactor already present in this PR (see the new context-based tailer implementation in internal/component/loki/source/file/internal/tail).

Relevant evidence and references:

  • Removed usage (old) relied on tomb:

    • tomb.Tomb // provides: Done, Kill, Dying
    • t.Kill(err), t.Wait(), t.Dying()
    • watch.NewPollingFileWatcher(...).BlockUntilExists(&t.Tomb) and tomb-aware change events
  • New implementation (in this PR) no longer requires tomb:

    • blockUntilExists(ctx, cfg) using github.com/grafana/dskit/backoff
    • blockUntilEvent(ctx, f, prevSize, cfg) polling for file changes
    • Tailer methods use context.Context and cancel for lifecycle
    • Stop/cleanup flows use file.Close() guarded by context

Notes

  • Only one dependency changed in go.mod: gopkg.in/tomb.v1 was removed.
  • The PR already refactors internal tailer code to context-based cancellation and polling; ensure no other packages in the repo still import tomb. If they do, apply the migration snippets above.

@kalleep
Copy link
Contributor Author

kalleep commented Dec 10, 2025

Test that fails in windows is not related to this pr and seems to be broken on main

@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 3c6c2a8 to 1b6fdf8 Compare December 10, 2025 13:41
@kalleep kalleep force-pushed the kalleep/refactor-tailer branch from 099fdce to 7d564d4 Compare December 11, 2025 10:47
@kalleep kalleep merged commit 688da66 into main Dec 11, 2025
44 checks passed
@kalleep kalleep deleted the kalleep/refactor-tailer branch December 11, 2025 13:43
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 26, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

frozen-due-to-age os:windows publish-dev:linux builds and deploys an image to grafana/alloy-dev container repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants