[receiver/awss3receiver] Fix data loss on partial SQS message processing failure (#45153)
Merged: atoulme merged 1 commit into open-telemetry:main on Dec 31, 2025.
Commit: Fix data loss on partial SQS message processing failure

The SQS notification reader was unconditionally deleting messages after processing all records, even when some S3 object retrievals or callback processing failed. This caused data loss when an SQS message contained multiple S3 notification records and any of them failed to process.

Changes:
- Track success/failure for all records in a message
- Only delete the message from SQS if ALL records processed successfully
- Log a warning when a message is left for retry due to failures
- Add tests for partial failure, S3 retrieval error, and callback error

The fix ensures messages remain in the queue for retry after the visibility timeout when any record fails, preventing permanent data loss.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
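The control flow the commit describes can be sketched roughly as below. This is a minimal illustration, not the receiver's actual code: `record`, `processRecord`, and `handleMessage` are hypothetical stand-ins for the real notification types and the S3-fetch/callback path, and the injected `deleteMessage` func stands in for the SQS `DeleteMessage` call.

```go
package main

import "fmt"

// record stands in for one S3 notification record inside an SQS message.
type record struct {
	key  string
	fail bool // simulate an S3 retrieval or callback failure
}

// processRecord is a hypothetical stand-in for fetching the S3 object
// and handing it to the consumer callback.
func processRecord(r record) error {
	if r.fail {
		return fmt.Errorf("failed to process %s", r.key)
	}
	return nil
}

// handleMessage mirrors the fixed logic: the message is deleted only when
// every record processed successfully; otherwise it is left in the queue
// and becomes visible again after the visibility timeout.
func handleMessage(records []record, deleteMessage func()) bool {
	allRecordsSucceeded := true
	for _, r := range records {
		if err := processRecord(r); err != nil {
			fmt.Println("warn:", err)
			allRecordsSucceeded = false
		}
	}
	if !allRecordsSucceeded {
		fmt.Println("warn: leaving message in queue for retry")
		return false
	}
	deleteMessage()
	return true
}

func main() {
	deleted := false
	del := func() { deleted = true }

	// A message with one failing record must not be deleted.
	handleMessage([]record{{key: "a"}, {key: "b", fail: true}}, del)
	fmt.Println("deleted after partial failure:", deleted)

	// A fully successful message is deleted.
	handleMessage([]record{{key: "a"}, {key: "b"}}, del)
	fmt.Println("deleted after success:", deleted)
}
```

Before the fix, the `deleteMessage()` call sat outside any success check, so the branch returning `false` did not exist and failed records were lost.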
atoulme approved these changes on Dec 31, 2025.
Contributor: Thank you for your contribution @aknuds1! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.
seongpil0948 pushed a commit to seongpil0948/opentelemetry-collector-contrib that referenced this pull request on Jan 10, 2026:
Fix data loss on partial SQS message processing failure (open-telemetry#45153)

## Summary
- Fix data loss when SQS messages contain multiple S3 object notifications and some fail to process
- The SQS notification reader was unconditionally deleting messages after processing, even when some S3 object retrievals or callback processing failed
- Messages are now only deleted when ALL records are successfully processed

## Background
When S3 sends notifications to SQS, each message can contain multiple S3 object notification records. The existing code would:
1. Process all records in a message
2. Log errors for any failures
3. Delete the message regardless of failures

This caused permanent data loss because failed records could never be retried - the message was already deleted from the queue.

## Changes
- Track success/failure for all records in a message using an `allRecordsSucceeded` flag
- Only call `DeleteMessage` if all records processed successfully
- Add a warning log when a message is left for retry due to failures
- Failed messages remain in the queue and become visible after the visibility timeout for retry

## Test plan
- [x] Added test `does_not_delete_message_on_S3_retrieval_error` - verifies the message is not deleted when S3 GetObject fails
- [x] Added test `does_not_delete_message_on_partial_failure` - verifies the message is not deleted when a multi-record message has mixed success/failure
- [x] Added test `does_not_delete_message_on_callback_error` - verifies the message is not deleted when callback processing fails
- [x] Existing tests continue to pass

## Trade-off
With this fix, if any record fails, the entire message stays in the queue - meaning successful records will be reprocessed on retry. This is acceptable because:
- SQS provides at-least-once delivery, so consumers should be idempotent anyway
- Data loss is worse than duplicate processing
- Partial acknowledgment would require external state tracking, adding significant complexity

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
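The trade-off section assumes consumers are idempotent, since retrying a partially failed message redelivers the records that already succeeded. One common way to make a consumer tolerate that is to de-duplicate on a stable object identity, such as bucket/key plus the S3 object version or ETag. The sketch below is a hypothetical illustration of that idea, assuming in-memory state; it is not code from the receiver.

```go
package main

import "fmt"

// objectID is a hypothetical stable identity for one S3 object version.
type objectID struct {
	bucket, key, etag string
}

// idempotentConsumer remembers which object versions it has already
// processed, so redelivered records become no-ops.
type idempotentConsumer struct {
	seen map[objectID]bool
}

func newIdempotentConsumer() *idempotentConsumer {
	return &idempotentConsumer{seen: make(map[objectID]bool)}
}

// Consume returns true only the first time a given object version is seen.
func (c *idempotentConsumer) Consume(id objectID) bool {
	if c.seen[id] {
		return false // duplicate delivery: skip reprocessing
	}
	c.seen[id] = true
	return true
}

func main() {
	c := newIdempotentConsumer()
	id := objectID{"my-bucket", "logs/2025/obj.json", "etag-1"}
	fmt.Println(c.Consume(id)) // first delivery is processed
	fmt.Println(c.Consume(id)) // redelivery after a partial failure is skipped
}
```

A production consumer would need durable or bounded state (the map grows without limit here), but the shape is the same: duplicate deliveries are detected and dropped rather than double-counted.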