Shutdown paths panic on thread join failures instead of returning errors #3375

@bryantbiggs

Summary

Several places in the batch processors panic via .unwrap() or .expect() on thread operations (spawning and joining). These panics occur in shutdown/cleanup code and in constructors, where graceful error handling is expected. If the worker thread panicked for any reason, calling handle.join().unwrap() during shutdown will cause a second panic, masking the original error and potentially causing cascading failures.

Panic Sites

1. BatchSpanProcessor::new() -- thread spawn

File: opentelemetry-sdk/src/trace/span_processor.rs, line 456

.expect("Failed to spawn thread"); //TODO: Handle thread spawn failure

The existing TODO comment acknowledges this should be handled. If thread creation fails (e.g., resource exhaustion), the entire process panics instead of returning an error.
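A minimal sketch of the fallible alternative: `std::thread::Builder::spawn` returns `io::Result<JoinHandle<T>>`, so a spawn failure (e.g., resource exhaustion) surfaces as an error the constructor can propagate instead of a panic. The thread name below is illustrative, not the one used by the SDK.

```rust
use std::io;
use std::thread;

// Spawning via thread::Builder returns io::Result, so the caller
// decides what to do on failure; thread::spawn would panic instead.
fn spawn_worker() -> io::Result<thread::JoinHandle<()>> {
    thread::Builder::new()
        .name("otel-batch-worker".to_string()) // illustrative name
        .spawn(|| {
            // worker loop would run here
        })
}

fn main() -> io::Result<()> {
    let handle = spawn_worker()?; // spawn failure propagates as Err
    handle.join().expect("worker exited cleanly");
    Ok(())
}
```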

2. BatchSpanProcessor::shutdown() -- thread join

File: opentelemetry-sdk/src/trace/span_processor.rs, lines 688-689

if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}

Two panics here: lock().unwrap() panics if the mutex is poisoned (which happens if another thread panicked while holding the lock), and join().unwrap() panics if the worker thread panicked.
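The dual-panic mechanics can be demonstrated in isolation. In this standalone sketch a worker thread panics while holding a lock: `join()` then returns `Err` (so `unwrap()` would re-panic), and the mutex is left poisoned (so `lock().unwrap()` would panic a second time), which is exactly the situation the shutdown path hits.

```rust
use std::panic;
use std::sync::{Arc, Mutex};
use std::thread;

// Returns (worker_join_errored, mutex_poisoned).
fn simulate_worker_panic() -> (bool, bool) {
    // Silence the default panic message printed by the worker thread.
    panic::set_hook(Box::new(|_| {}));

    let state = Arc::new(Mutex::new(()));
    let worker_state = Arc::clone(&state);

    // The worker panics while holding the lock, poisoning the mutex.
    let handle = thread::spawn(move || {
        let _guard = worker_state.lock().unwrap();
        panic!("worker bug");
    });

    // join() reports the panic as Err rather than re-raising it,
    // so join().unwrap() in shutdown would panic again here.
    let join_errored = handle.join().is_err();
    // lock() on the now-poisoned mutex returns Err(PoisonError),
    // so lock().unwrap() in shutdown is a second panic source.
    let poisoned = state.lock().is_err();

    (join_errored, poisoned)
}

fn main() {
    let (join_errored, poisoned) = simulate_worker_panic();
    assert!(join_errored);
    assert!(poisoned);
}
```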

3. BatchLogProcessor::new() -- thread spawn

File: opentelemetry-sdk/src/logs/batch_log_processor.rs, line 488

.expect("Thread spawn failed."); //TODO: Handle thread spawn failure

Same issue as the span processor constructor, with the same TODO comment.

4. BatchLogProcessor::shutdown() -- thread join

File: opentelemetry-sdk/src/logs/batch_log_processor.rs, lines 283-284

if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}

Same dual-panic issue as the span processor shutdown path.

Why This Is Problematic

  1. Masks original errors: If the worker thread panicked due to a bug, calling join().unwrap() in shutdown causes a second panic that obscures the root cause.
  2. Cascading failures: Panicking in shutdown paths can abort the process before other cleanup (e.g., flushing other processors) completes.
  3. Violates shutdown contract: shutdown() already returns OTelSdkResult, so callers expect errors to be returned, not to trigger panics.
  4. Constructor panics are unrecoverable: Panicking in new() on thread spawn failure gives callers no chance to fall back to an alternative (e.g., a simpler processor or a no-op).

Existing Precedent

PeriodicReader in opentelemetry-sdk/src/metrics/periodic_reader.rs (lines 327-335) already handles thread spawn failure gracefully:

// TODO: Should we fail-fast here and bubble up the error to user?
#[allow(unused_variables)]
if let Err(e) = result_thread_creation {
    otel_error!(
        name: "PeriodReaderThreadStartError",
        message = "Failed to start PeriodicReader thread. Metrics will not be exported.",
        error = format!("{:?}", e)
    );
}

Although it silently degrades (which has its own issues per #3210), it at least does not panic.

Suggested Fix

For constructors (new())

Return Result<Self, OTelSdkError> from the constructors (or provide a fallible builder .build() method) so callers can handle thread spawn failures. This is a breaking API change.

For shutdown paths

  • Replace handle.join().unwrap() with proper error handling that converts a thread panic into an OTelSdkError:
    if let Some(handle) = self.handle.lock()
        .map_err(|e| OTelSdkError::InternalFailure(format!("Mutex poisoned: {e}")))?
        .take()
    {
        handle.join().map_err(|_| {
            OTelSdkError::InternalFailure("Worker thread panicked".into())
        })?;
    }
  • This ensures the OTelSdkResult return type is used as intended.
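Putting both halves together, a minimal end-to-end sketch of the proposed shape (a stand-in `SdkError` type is used here in place of `OTelSdkError`; the struct and thread name are hypothetical, not the SDK's actual API):

```rust
use std::thread;

// Stand-in for OTelSdkError; the variant mirrors
// OTelSdkError::InternalFailure from the suggested fix.
#[derive(Debug)]
enum SdkError {
    InternalFailure(String),
}

struct BatchProcessor {
    handle: Option<thread::JoinHandle<()>>,
}

impl BatchProcessor {
    // Fallible constructor: a spawn failure becomes an Err the caller
    // can act on, e.g. by falling back to a simpler processor.
    fn build() -> Result<Self, SdkError> {
        let handle = thread::Builder::new()
            .name("otel-batch-worker".into())
            .spawn(|| { /* worker loop */ })
            .map_err(|e| SdkError::InternalFailure(format!("thread spawn failed: {e}")))?;
        Ok(Self { handle: Some(handle) })
    }

    // Shutdown that maps a worker panic to an error instead of unwrapping.
    fn shutdown(&mut self) -> Result<(), SdkError> {
        if let Some(handle) = self.handle.take() {
            handle
                .join()
                .map_err(|_| SdkError::InternalFailure("worker thread panicked".into()))?;
        }
        Ok(())
    }
}

fn main() {
    let mut p = BatchProcessor::build().expect("spawn failed");
    p.shutdown().expect("shutdown failed");
}
```

Taking the handle with `.take()` before joining also makes `shutdown()` idempotent: a second call finds no handle and returns `Ok(())`.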
