fix: no longer drop request if stream is dropped in loki.source.api#4834
fix: no longer drop request if stream is dropped in loki.source.api#4834
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical bug in loki.source.api where requests were incorrectly cancelled when relabel rules dropped a specific stream (instead of continuing to process remaining streams). Additionally, it refactors the component to send batches of entries rather than individual entries, improving transactional safety during shutdown scenarios.
Key changes:
- Fixed bug where
returnshould have beencontinuewhen a stream is dropped by relabel rules - Added
ForceShutdown()method to cancel all in-flight requests before server shutdown - Refactored to send batches of entries instead of individual entries for better atomicity
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
internal/component/loki/source/api/internal/lokipush/push_api_server.go |
Fixed the relabel rules bug (continue instead of return), added ForceShutdown(), and changed to batch sending |
internal/component/loki/source/api/internal/lokipush/push_api_server_test.go |
Added fakeBatchReceiver test helper to support batch-based testing |
internal/component/loki/source/api/api.go |
Changed handler from LogsReceiver to LogsBatchReceiver and updated Run() to process batches |
internal/component/loki/source/api/api_test.go |
Updated tests to remove manual cleanup now handled by context cancellation |
internal/component/common/loki/receiver.go |
New file containing receiver interfaces, extracted from entry_handler.go |
internal/component/common/loki/entry.go |
New file containing Entry type definition, extracted from entry_handler.go |
internal/component/common/loki/entry_handler.go |
Removed receiver and entry code that was moved to separate files |
internal/component/common/loki/entry_handler_test.go |
Added test coverage for label middleware functionality |
CHANGELOG.md |
Added entry documenting the bug fix |
internal/component/loki/source/api/internal/lokipush/push_api_server_test.go
Show resolved
Hide resolved
internal/component/loki/source/api/internal/lokipush/push_api_server.go
Outdated
Show resolved
Hide resolved
internal/component/loki/source/api/internal/lokipush/push_api_server.go
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| if lastErr != nil { | ||
| level.Warn(s.logger).Log("msg", "at least one entry in the push request failed to process", "err", lastErr.Error()) |
There was a problem hiding this comment.
I think it would be cheap and easy to add a counter of how many entries failed out of how many in the batch. And that would be useful.
There was a problem hiding this comment.
Hm after checking what kind of metrics other exposes we could add two metrics:
- loki_source_api_entries_processed - this would be the total number of entries in a request
- loki_source_api_entries_written - this would be the total number of entries that we have forwarded
With that you could derive how many are dropped either by relabeling or by invalid labels, WDYT?
There was a problem hiding this comment.
I went with my suggestion. Let me know what you think :)
| w.WriteHeader(http.StatusServiceUnavailable) | ||
| return | ||
| } | ||
| entries = append(entries, e) |
There was a problem hiding this comment.
Might want to reuse the slices in a pool if this comes up in allocations.
There was a problem hiding this comment.
Agree, do you think we should do that now or wait and see if it shows up?
There was a problem hiding this comment.
If we go down the pooling path could we try to avoid sending it through a channel? It's fine today because we know the implementation is an unbuffered channel but it feels risky to depend upon that given the server can't enforce it.
There was a problem hiding this comment.
Let's wait to see if this comes up in profiles.
|
💻 Deploy preview deleted (fix: no longer drop request if stream is dropped in loki.source.api). |
2b5b401 to
1dfccfc
Compare
| * `loki_source_api_entries_processed` (counter): Total number of log entries processed. | ||
| * `loki_source_api_entries_written` (counter): Total number of log entries forwarded. |
There was a problem hiding this comment.
I think we do want some metrics, but we also don't want to have too many metrics. Should we settle on only one off them? I think loki_source_api_entries_written would be better option to keep. The idea is to be able to spot some major issues and then continue debugging using other means like logging, profiles or local repro.
There was a problem hiding this comment.
Sure, only kept loki_source_api_entries_written
thampiotr
left a comment
There was a problem hiding this comment.
LGMT % adding only one debug metric instead of two
This will ensure that all entries in a request is sent down the pipeline.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
a66e3fc to
73e08d5
Compare
…4834) * fix: add ForceShutdown that will cancel in-flight requests before stopping server * Split into multiple files and add LogsBatchReceiver * don't drop request when relabel rules drops a specific stream * fix: use loki.LogsBatchReceiver to ensure all entries in a request is sent down the pipeline * add changelog * add checks for entries and use sync once to close channel
* fix: no longer drop request if stream is dropped in loki.source.api (#4834) * fix: add ForceShutdown that will cancel in-flight requests before stopping server * Split into multiple files and add LogsBatchReceiver * don't drop request when relabel rules drops a specific stream * fix: use loki.LogsBatchReceiver to ensure all entries in a request is sent down the pipeline * add changelog * add checks for entries and use sync once to close channel * update changelog for next rc * Fix flaky tests: port in for loki source api tests and logs integration test (#4875) * Fix port in use flakyness for loki source api tests * Pin loki container version for integration tests * Add a new mimir.alerts.kubernetes component (#3448) * Add a new mimir.alerts.kubernetes component * Sync Mimir periodically, test the case of a CRD deletion * Add TODOs * Longer test timeout * Check if pods are running * Apply suggestions from code review Co-authored-by: Clayton Cornell <131809008+clayton-cornell@users.noreply.github.com> * Fix metric doc * Fix changelog --------- Co-authored-by: Clayton Cornell <131809008+clayton-cornell@users.noreply.github.com> * Remove experimental flag from stage.windowsevent (#4879) --------- Co-authored-by: Karl Persson <23356117+kalleep@users.noreply.github.com> Co-authored-by: Kyle Eckhart <kgeckhart@users.noreply.github.com> Co-authored-by: Paulin Todev <paulin.todev@gmail.com> Co-authored-by: Clayton Cornell <131809008+clayton-cornell@users.noreply.github.com>
…rafana#4834) * fix: add ForceShutdown that will cancel in-flight requests before stopping server * Split into multiple files and add LogsBatchReceiver * don't drop request when relabel rules drops a specific stream * fix: use loki.LogsBatchReceiver to ensure all entries in a request is sent down the pipeline * add changelog * add checks for entries and use sync once to close channel
PR Description
This pr solves several issues in
loki.source.api.No longer cancel request if stream is dropped by relabel rules, this is clearly a return instead of continue bug.
Add function
ForceShutdownthat will cancel all request before shutting down server.Send batches instead of individual entries from request handler to component. This makes it a bit more transaction safe where we don't partially ingest entries if request is either canceled or a shutdown happens.
Which issue(s) this PR fixes
Notes to the Reviewer
PR Checklist