
[receiver/prometheusremotewritereceiver] Fix silent data loss on consumer failure #45151

Merged
songy23 merged 2 commits into open-telemetry:main from
aknuds1:arve/prom-remote-write-silent-loss
Jan 7, 2026

@aknuds1 aknuds1 commented Dec 25, 2025

Summary

The receiver was sending HTTP 204 No Content before calling ConsumeMetrics(), so if the consumer failed, clients incorrectly concluded that the data had been delivered. This violates the Prometheus Remote Write 2.0 specification, which states:

Receivers MUST NOT return a 2xx HTTP status code if any of the pieces of sent data known to the Receiver were NOT written successfully.

Changes

  • Move WriteHeader(204) to after ConsumeMetrics() succeeds
  • Return 400 Bad Request for permanent consumer errors
  • Return 500 Internal Server Error for retryable errors
  • Add tests for consumer error handling

Impact

Without this fix, when a downstream consumer fails (e.g., backend unavailable, memory limiter rejecting batches, exporter failures), Prometheus clients receive a success response and won't retry, leading to silent data loss.
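To illustrate why the status code matters on the sending side, here is a small sketch of a retry decision. It assumes a sender that retries on 5xx responses, does not retry on 2xx or other 4xx responses, and (as a further assumption of this sketch) chooses to retry on 429. `shouldRetry` is a hypothetical helper, not a function from Prometheus or the collector.

```go
package main

import "net/http"

// shouldRetry reports whether a hypothetical remote-write sender would
// resend a request after receiving the given HTTP status.
func shouldRetry(status int) bool {
	switch {
	case status >= 200 && status < 300:
		// Success: the sender considers the data delivered and moves on.
		return false
	case status == http.StatusTooManyRequests:
		// Assumption: this sender opts in to retrying when throttled.
		return true
	case status >= 500:
		// Server-side failure: resend, ideally with backoff.
		return true
	default:
		// Other 4xx: the request itself is treated as bad, so drop it.
		return false
	}
}
```

Under these assumptions, a premature 204 makes `shouldRetry` return false even though the consumer failed, which is the silent loss described above; a 500 keeps the data in the sender's retry path.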

Testing

Added TestHandlePRWConsumerResponse with sub-tests:

  • success returns 204 - verifies normal operation
  • retryable error returns 500 - verifies temporary failures return 500
  • permanent error returns 400 - verifies permanent failures return 400

@aknuds1 aknuds1 force-pushed the arve/prom-remote-write-silent-loss branch 2 times, most recently from 23971f0 to 85f883a December 25, 2025 10:11
@aknuds1 aknuds1 marked this pull request as ready for review December 25, 2025 10:11
@aknuds1 aknuds1 requested review from a team, ArthurSens and dashpole as code owners December 25, 2025 10:12
@github-actions github-actions bot requested a review from perebaj December 25, 2025 10:12
@aknuds1 aknuds1 force-pushed the arve/prom-remote-write-silent-loss branch from 85f883a to 0dc03c0 January 2, 2026 10:07
…umer failure

The receiver was sending HTTP 204 No Content before calling ConsumeMetrics(),
so if the consumer failed, clients incorrectly thought data was delivered.
This violates the Prometheus Remote Write spec which states receivers MUST NOT
return 2xx if data was not successfully written.

Changes:
- Move WriteHeader(204) to after ConsumeMetrics() succeeds
- Return 400 Bad Request for permanent consumer errors
- Return 500 Internal Server Error for retryable errors
- Add tests for consumer error handling

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
@aknuds1 aknuds1 force-pushed the arve/prom-remote-write-silent-loss branch from 0dc03c0 to ec15cd2 January 2, 2026 10:24
…ote-write-silent-loss

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
@ArthurSens ArthurSens added the "ready to merge" label (code review completed; ready to merge by maintainers) and removed the "waiting-for-code-owners" label Jan 7, 2026
@songy23 songy23 merged commit b2a1e07 into open-telemetry:main Jan 7, 2026
218 of 222 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 7, 2026
otelbot bot commented Jan 7, 2026

Thank you for your contribution @aknuds1! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

@aknuds1 aknuds1 deleted the arve/prom-remote-write-silent-loss branch February 2, 2026 14:51
rashmichandrashekar added a commit to Azure/prometheus-collector that referenced this pull request Feb 18, 2026
This PR upgrades the otelcollector to the latest version available for
the opentelemetry-collector and opentelemetry-operator.

It was automatically generated by the GitHub Actions workflow.

The summary of the OSS changelog is below:
# Prometheusreceiver Changes
## v0.142.0 to v0.144.0

Generated on: 2026-01-27 07:11:01

---

### v0.144.0
- [**FEATURE**] `receiver/prometheus`: receiver/prometheus now
associates scraped _created text lines as the created timestamp of its
metric family rather than its own metric series, as defined by the
OpenMetricsText spec
([#45291](open-telemetry/opentelemetry-collector-contrib#45291))
- [**FEATURE**] `receiver/prometheus`: Add comprehensive troubleshooting
and best practices guide to Prometheus receiver README
([#44925](open-telemetry/opentelemetry-collector-contrib#44925))
The guide includes common issues and solutions, performance optimization
strategies, production deployment best practices, monitoring
recommendations, and debugging tips.
- [**FEATURE**] `receiver/prometheusremotewrite`: Replace labels.Map()
iteration with direct label traversal to eliminate intermediate map
allocations.
([#45166](open-telemetry/opentelemetry-collector-contrib#45166))
- [**BUG FIX**] `receiver/prometheusremotewrite`: Fix silent data loss
when consumer fails by returning appropriate HTTP error codes instead of
204 No Content.
([#45151](open-telemetry/opentelemetry-collector-contrib#45151))
The receiver was sending HTTP 204 No Content before calling
ConsumeMetrics(), causing clients to believe data was successfully
delivered even when the consumer failed. Now returns 400 Bad Request for
permanent errors and 500 Internal Server Error for retryable errors, as
per the Prometheus Remote Write 2.0 specification.
### v0.143.0
- [**BREAKING**] `receiver/prometheus`: Remove deprecated
`use_start_time_metric` and `start_time_metric_regex` configuration
options.
([#44180](open-telemetry/opentelemetry-collector-contrib#44180))
The `use_start_time_metric` and `start_time_metric_regex` configuration
options have been removed after being deprecated in v0.142.0. Users who
have these options set in their configuration will experience collector
startup failures after upgrading. To migrate, remove these configuration
options and use the `metricstarttime` processor instead for equivalent
functionality.
- [**FEATURE**] `receiver/prometheus`: Add
`receiver.prometheusreceiver.RemoveReportExtraScrapeMetricsConfig`
feature gate to disable the `report_extra_scrape_metrics` config option.
([#44181](open-telemetry/opentelemetry-collector-contrib#44181))
When enabled, the `report_extra_scrape_metrics` configuration option is
ignored, and extra scrape metrics are controlled solely by the
`receiver.prometheusreceiver.EnableReportExtraScrapeMetrics` feature
gate. This mimics Prometheus behavior where extra scrape metrics are
controlled by a feature flag.

## Summary

| Category | Count |
|----------|-------|
| Breaking Changes | 1 |
| Features | 4 |
| Bug Fixes | 1 |
| Other Changes | 0 |
| **Total** | **6** |

# Target-allocator Changes
## v0.142.0 to v0.144.0

Generated on: 2026-01-27 07:11:16

---

No changes found for target-allocator between v0.142.0 and v0.144.0

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Rashmi Chandrashekar <rashmy@microsoft.com>
Co-authored-by: Grace Wehner <grace.wehner@microsoft.com>
