
Add rate‑limited log warning when BatchSpanProcessor drops spans #8167

Merged
jkwatson merged 5 commits into open-telemetry:main from
VamshikrishnaMonagari:alert-through-logs-when-BatchSpanProcessor-drops-span
Mar 25, 2026

Conversation

@VamshikrishnaMonagari
Contributor

Summary

When BatchSpanProcessor's internal queue fills and spans are dropped, the SDK previously only emitted internal metrics (which can be missed if no metrics pipeline is configured). This PR adds a rate-limited WARNING log so users can detect span drops via logs.

Key changes

  • Add ThrottlingLogger to BatchSpanProcessor
  • Log a warning when a span is dropped due to the queue being full
  • Add a unit test ensuring the warning is emitted under queue-full conditions

Fixes

Fixes #7103

@VamshikrishnaMonagari VamshikrishnaMonagari requested a review from a team as a code owner March 9, 2026 03:09
if (!queue.offer(span)) {
spanProcessorInstrumentation.dropSpans(1);
throttledLogger.log(
Level.WARNING,
Contributor

Isn't warning too high level? I think INFO would be enough.

Contributor Author

Good point. I think WARNING is appropriate here because dropped spans represent actual data loss, which impacts observability. This warrants immediate attention from operators in production. Since we're using ThrottlingLogger, the output is already rate-limited to prevent log spam.

However, if the team prefers INFO, I'm happy to adjust. Please let me know.

ComponentId.generateLazy("batching_span_processor");

private static final Logger logger = Logger.getLogger(BatchSpanProcessor.class.getName());
private static final ThrottlingLogger throttledLogger = new ThrottlingLogger(logger);
Contributor

throttlingLogger would be a better name.

Contributor Author

Good catch! I've updated the variable name from throttledLogger to throttlingLogger to better reflect that this logger performs throttling (matching the active form of the class name ThrottlingLogger).

Member

@jack-berg jack-berg left a comment

Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?

More generally, do we want to duplicate signals? If so, when?

We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?

cc @jkwatson

@jkwatson
Contributor

> Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?
>
> More generally, do we want to duplicate signals? If so, when?
>
> We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?
>
> cc @jkwatson

I think we've been assuming that everyone will at least have metrics in their pipeline, but perhaps that's a bad assumption. A throttled log for this seems non-terrible to me, but I don't really have a strong opinion one way or another. Perhaps a simpler way would be to just log if we dropped spans at the end of the export (or do we already do that?)

@VamshikrishnaMonagari
Contributor Author

> Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?
>
> More generally, do we want to duplicate signals? If so, when?
>
> We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?
>
> cc @jkwatson

> I think we've been assuming that everyone will at least have metrics in their pipeline, but perhaps that's a bad assumption. A throttled log for this seems non-terrible to me, but I don't really have a strong opinion one way or another. Perhaps a simpler way would be to just log if we dropped spans at the end of the export (or do we already do that?)

Thanks for the feedback, @jack-berg and @jkwatson!

I'd argue the log isn't really a duplication; it serves a different audience. The internal metrics via dropSpans() are valuable when a MeterProvider is configured, but many users (especially those only using the SDK for tracing, or those early in their OpenTelemetry adoption) won't have a metrics pipeline set up. For them, spans get silently dropped with no indication that anything is wrong. I believe a throttled log provides immediate feedback right in the application output.

This also follows established patterns within the SDK: ThrottlingLogger is widely used across the codebase (in SdkDoubleCounter, AsynchronousMetricStorage, DefaultSynchronousMetricStorage, etc.) for exactly this kind of diagnostic signal.

Regarding the suggestion to log at the end of the export: the drop happens at addSpan() time when the queue is at capacity, so that's the most accurate and actionable place to surface it. By the time the export runs, the dropped spans are already gone and the exporter has no visibility into them. The existing export-failure logs cover a different failure mode (exporter errors), while this covers queue overflow; having both gives complete coverage.

The use of ThrottlingLogger ensures this won't cause log spam even under sustained high-throughput dropping.
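For readers unfamiliar with the rate-limiting idea being discussed, here is a minimal, self-contained sketch of the concept. This is an illustrative stand-in, not the SDK's actual internal ThrottlingLogger (whose policy and API differ); the interval-based suppression shown here is an assumption for the sketch.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal illustration of a rate-limited (throttling) logger.
// NOT the SDK's ThrottlingLogger; just a sketch of the concept.
final class SimpleThrottlingLogger {
  private final Logger delegate;
  private final long intervalNanos;
  // Earliest nanoTime at which the next message may be logged.
  private final AtomicLong nextAllowedNanos = new AtomicLong(Long.MIN_VALUE);

  SimpleThrottlingLogger(Logger delegate, long intervalNanos) {
    this.delegate = delegate;
    this.intervalNanos = intervalNanos;
  }

  /** Logs unless a message was already logged within the interval; returns whether it logged. */
  boolean log(Level level, String message) {
    long now = System.nanoTime();
    long next = nextAllowedNanos.get();
    // CAS so that under contention only one thread wins the slot.
    if (now >= next && nextAllowedNanos.compareAndSet(next, now + intervalNanos)) {
      delegate.log(level, message);
      return true;
    }
    return false;
  }
}
```

With a one-minute interval, the first warning is emitted and subsequent ones are suppressed until the interval elapses, which is the spam-prevention property the comment above relies on.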

@jkwatson
Contributor

I think my (and perhaps Jack's) point was not that the ThrottlingLogger wasn't a good option, but that it might be more important to know how many spans got dropped, which this approach will not let you know, since the logs will only be emitted for the first few. It would let you know that spans were being dropped, but not how bad the problem is. If a solution could be put together to log the total number dropped per batch (which would certainly require more work!), that would probably be a better solution.

@VamshikrishnaMonagari
Contributor Author

> I think my (and perhaps Jack's) point was not that the ThrottlingLogger wasn't a good option, but that it might be more important to know how many spans got dropped, which this approach will not let you know, since the logs will only be emitted for the first few. It would let you know that spans were being dropped, but not how bad the problem is. If a solution could be put together to log the total number dropped per batch (which would certainly require more work!), that would probably be a better solution.

That's a great point, @jkwatson. I see what you mean now.

I think the per-batch drop count approach you're suggesting is a better solution. The implementation would be fairly straightforward: replace the ThrottlingLogger call with an AtomicInteger counter that is incremented in addSpan() on each drop and then read and reset in exportCurrentBatch() to log the total. Something like:

"BatchSpanProcessor dropped 47 spans since the last export because the queue is full (maxQueueSize=2048)"

This is naturally throttled by the export cycle, gives visibility into how severe the backpressure is, and the threading model already supports it cleanly since addSpan() runs on caller threads and exportCurrentBatch() runs on the single worker thread.

I'll update the PR with this approach. Let me know if that aligns with what you had in mind.
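A minimal sketch of this per-batch counting idea, with illustrative names (this is not the actual BatchSpanProcessor code; exportCurrentBatch() returns the reported drop count here purely to keep the sketch easy to exercise):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch of the per-batch drop-count approach discussed above.
final class DropCountingQueue {
  private static final Logger logger = Logger.getLogger(DropCountingQueue.class.getName());

  private final int maxQueueSize;
  private final BlockingQueue<Object> queue;
  // Incremented by application threads on each drop, drained by the worker thread.
  private final AtomicInteger droppedSinceLastExport = new AtomicInteger();

  DropCountingQueue(int maxQueueSize) {
    this.maxQueueSize = maxQueueSize;
    this.queue = new ArrayBlockingQueue<>(maxQueueSize);
  }

  // Called from application threads; offer() never blocks, it returns false when full.
  void addSpan(Object span) {
    if (!queue.offer(span)) {
      droppedSinceLastExport.incrementAndGet();
    }
  }

  // Called from the single worker thread once per export cycle.
  int exportCurrentBatch() {
    // getAndSet(0) atomically reads and resets, so no drops are double-counted or lost.
    int dropped = droppedSinceLastExport.getAndSet(0);
    if (dropped > 0) {
      logger.log(
          Level.WARNING,
          "Dropped " + dropped + " span(s) since the last export because the queue is full"
              + " (maxQueueSize=" + maxQueueSize + ")");
    }
    queue.clear(); // stand-in for draining the queue and exporting the batch
    return dropped;
  }
}
```

Because the log is emitted at most once per export cycle and carries the accumulated count, it is self-throttling while still conveying the severity of the backpressure.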

…anProcessor

Instead of logging a throttled message per dropped span, accumulate
drop counts using an AtomicInteger and report the total number of
dropped spans once per export cycle in exportCurrentBatch(). This
gives users visibility into the magnitude of span drops, not just
that drops are occurring.

The counter is incremented atomically in addSpan() (called from
application threads) and read/reset with getAndSet(0) in
exportCurrentBatch() (called from the worker thread).
@VamshikrishnaMonagari
Contributor Author

@jkwatson
I've updated the PR to go with the per-batch count approach you suggested. It now tracks drops in an AtomicInteger and reports the total once per export cycle, so you'd see something like:

BatchSpanProcessor dropped 47 span(s) since the last export because the queue is full (maxQueueSize=2048)

This naturally limits itself to one log per batch and gives a clear picture of how much backpressure there is. Let me know what you think!

BatchSpanProcessor.builder(mockSpanExporter)
.setMaxQueueSize(1)
.setMaxExportBatchSize(1)
.setScheduleDelay(Duration.ofMillis(1))
Contributor

Since we don't need any exports, may as well set to a huge number to reduce potential flakiness. And maybe it makes sense then to drop more than one span

await()
.untilAsserted(
() ->
logs.assertContains(
Contributor

Seems reasonable to assert the whole message, especially since it has some dynamic parts

Contributor Author

Thanks for the review, @anuraaga! Good catches on the test. I've updated it.

@jkwatson @jack-berg -- on the question about SpanProcessorInstrumentation, I agree with @anuraaga that keeping the tracking in BSP makes more sense, since the instrumentation counters are synchronous and adding state for this there would complicate things without much benefit.

Would appreciate another look when you get a chance!

Use large scheduleDelay with forceFlush() instead of relying on
timing to reduce flakiness. Drop multiple spans (2) to verify the
count is reported correctly. Assert the full log message including
the drop count and maxQueueSize.
Contributor

@jkwatson jkwatson left a comment

Thanks!

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.31%. Comparing base (3ea5e08) to head (903425f).
⚠️ Report is 29 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8167      +/-   ##
============================================
+ Coverage     90.29%   90.31%   +0.01%     
- Complexity     7650     7652       +2     
============================================
  Files           843      843              
  Lines         23059    23071      +12     
  Branches       2309     2311       +2     
============================================
+ Hits          20822    20836      +14     
+ Misses         1519     1516       -3     
- Partials        718      719       +1     


@jkwatson jkwatson merged commit dd7fd4b into open-telemetry:main Mar 25, 2026
27 checks passed
@otelbot
Contributor

otelbot bot commented Mar 25, 2026

Thank you for your contribution @VamshikrishnaMonagari! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.


Development

Successfully merging this pull request may close these issues.

Alert through logs when BatchSpanProcessor drops span

5 participants