
[receiver/awss3receiver] Fix data loss on partial SQS message processing failure #45153

Merged
atoulme merged 1 commit into open-telemetry:main from aknuds1:arve/awss3-sqs-partial-failure
Dec 31, 2025

Conversation

@aknuds1 (Contributor) commented Dec 25, 2025

Summary

  • Fix data loss when SQS messages contain multiple S3 object notifications and some fail to process
  • The SQS notification reader was unconditionally deleting messages after processing, even when some S3 object retrievals or callback processing failed
  • Messages are now only deleted when ALL records are successfully processed

Background

When S3 sends notifications to SQS, each message can contain multiple S3 object notification records. The existing code would:

  1. Process all records in a message
  2. Log errors for any failures
  3. Delete the message regardless of failures

This caused permanent data loss because failed records could never be retried - the message was already deleted from the queue.

Changes

  • Track success/failure for all records in a message using an `allRecordsSucceeded` flag
  • Only call `DeleteMessage` if all records were processed successfully
  • Add a warning log when a message is left in the queue for retry due to failures
  • Failed messages remain in the queue and become visible again after the visibility timeout, allowing retry
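The control flow described above can be sketched as follows. This is a minimal, self-contained illustration of the fix's logic, not the receiver's actual code: `record`, `processRecord`, and the `deleteMessage` callback are hypothetical stand-ins for the real S3 retrieval, consumer callback, and SQS `DeleteMessage` call.

```go
package main

import "fmt"

// record stands in for a single S3 object notification inside one SQS message.
type record struct {
	key  string
	fail bool // simulate an S3 retrieval or callback failure
}

// processRecord is an illustrative stand-in for fetching the S3 object and
// invoking the receiver's callback.
func processRecord(r record) error {
	if r.fail {
		return fmt.Errorf("failed to process %s", r.key)
	}
	return nil
}

// handleMessage mirrors the fixed behavior: the SQS message is deleted only
// when every record in it processed successfully; otherwise it is left in the
// queue and will become visible again after the visibility timeout.
func handleMessage(records []record, deleteMessage func()) {
	allRecordsSucceeded := true
	for _, r := range records {
		if err := processRecord(r); err != nil {
			allRecordsSucceeded = false
			fmt.Println("error:", err)
		}
	}
	if allRecordsSucceeded {
		deleteMessage()
	} else {
		fmt.Println("leaving message in queue for retry")
	}
}

func main() {
	deleted := false
	handleMessage(
		[]record{{key: "a.json"}, {key: "b.json", fail: true}},
		func() { deleted = true },
	)
	fmt.Println("deleted:", deleted) // prints "deleted: false"
}
```

Before the fix, `deleteMessage` was effectively called unconditionally after the loop, which is what made partial failures unrecoverable.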

Test plan

  • Added test does_not_delete_message_on_S3_retrieval_error - verifies the message is not deleted when S3 GetObject fails
  • Added test does_not_delete_message_on_partial_failure - verifies the message is not deleted when a multi-record message has mixed success/failure
  • Added test does_not_delete_message_on_callback_error - verifies the message is not deleted when callback processing fails
  • Existing tests continue to pass

Trade-off

With this fix, if any record fails, the entire message stays in the queue - meaning successful records will be reprocessed on retry. This is acceptable because:

  • SQS provides at-least-once delivery, so consumers should be idempotent anyway
  • Data loss is worse than duplicate processing
  • Partial acknowledgment would require external state tracking, adding significant complexity
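Since the whole message is redelivered on any failure, the idempotency the trade-off relies on has to live in the consumer. A sketch of one consumer-side approach (not part of this PR, and purely illustrative): deduplicate by object identity so a redelivered record is ingested only once. Keying by bucket/key is an assumption here; a real consumer might also include the object version or ETag.

```go
package main

import "fmt"

// dedup remembers already-ingested S3 objects so that a redelivered SQS
// message (at-least-once delivery) does not produce duplicate data.
type dedup struct {
	seen map[string]bool
}

func newDedup() *dedup { return &dedup{seen: map[string]bool{}} }

// ingest reports whether this object should be processed: true the first
// time a given bucket/key is seen, false on any redelivery.
func (d *dedup) ingest(bucket, key string) bool {
	id := bucket + "/" + key
	if d.seen[id] {
		return false
	}
	d.seen[id] = true
	return true
}

func main() {
	d := newDedup()
	fmt.Println(d.ingest("logs", "a.json")) // first delivery: true
	fmt.Println(d.ingest("logs", "a.json")) // redelivery after retry: false
}
```

An in-memory map only protects a single process; surviving restarts would require external state, which is exactly the complexity the PR declines to add to the receiver itself.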

@aknuds1 aknuds1 force-pushed the arve/awss3-sqs-partial-failure branch 2 times, most recently from b7a0a7a to 1a3bd48, on December 25, 2025 11:12
…ing failure

The SQS notification reader was unconditionally deleting messages after
processing all records, even when some S3 object retrievals or callback
processing failed. This caused data loss when an SQS message contained
multiple S3 notification records and any of them failed to process.

Changes:
- Track success/failure for all records in a message
- Only delete message from SQS if ALL records processed successfully
- Add log warning when message is left for retry due to failures
- Add tests for partial failure, S3 retrieval error, and callback error

The fix ensures messages remain in the queue for retry after visibility
timeout when any record fails, preventing permanent data loss.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
@aknuds1 aknuds1 force-pushed the arve/awss3-sqs-partial-failure branch from 1a3bd48 to 452dd91 on December 25, 2025 11:13
@aknuds1 aknuds1 marked this pull request as ready for review December 25, 2025 11:14
@aknuds1 aknuds1 requested review from a team and atoulme as code owners December 25, 2025 11:14
@github-actions github-actions bot requested a review from adcharre December 25, 2025 11:14
@atoulme atoulme merged commit 88bceba into open-telemetry:main Dec 31, 2025
194 checks passed
@otelbot (Contributor) commented Dec 31, 2025

Thank you for your contribution @aknuds1! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

@github-actions github-actions bot added this to the next release milestone Dec 31, 2025
@aknuds1 aknuds1 deleted the arve/awss3-sqs-partial-failure branch January 2, 2026 09:59
seongpil0948 pushed a commit to seongpil0948/opentelemetry-collector-contrib that referenced this pull request Jan 10, 2026
…ing failure (open-telemetry#45153)
