Summary
Several places in the batch processors panic via `.unwrap()` or `.expect()` on thread operations (spawning and joining). These panics occur in shutdown/cleanup code and in constructors, where graceful error handling is expected. If the worker thread panicked for any reason, calling `handle.join().unwrap()` during shutdown causes a second panic, masking the original error and potentially triggering cascading failures.
Panic Sites
1. BatchSpanProcessor::new() -- thread spawn
File: `opentelemetry-sdk/src/trace/span_processor.rs`, line 456

```rust
.expect("Failed to spawn thread"); //TODO: Handle thread spawn failure
```

The existing TODO comment acknowledges this should be handled. If thread creation fails (e.g., due to resource exhaustion), the entire process panics instead of returning an error.
2. BatchSpanProcessor::shutdown() -- thread join
File: `opentelemetry-sdk/src/trace/span_processor.rs`, lines 688-689

```rust
if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}
```

There are two potential panics here: `lock().unwrap()` panics if the mutex is poisoned (which happens when another thread panicked while holding the lock), and `join().unwrap()` panics if the worker thread itself panicked.
3. BatchLogProcessor::new() -- thread spawn
File: `opentelemetry-sdk/src/logs/batch_log_processor.rs`, line 488

```rust
.expect("Thread spawn failed."); //TODO: Handle thread spawn failure
```

Same issue as the span processor constructor, with the same TODO comment.
4. BatchLogProcessor::shutdown() -- thread join
File: `opentelemetry-sdk/src/logs/batch_log_processor.rs`, lines 283-284

```rust
if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}
```

Same dual-panic issue as the span processor shutdown path.
Why This Is Problematic
- Masks original errors: if the worker thread panicked due to a bug, calling `join().unwrap()` in shutdown causes a second panic that obscures the root cause.
- Cascading failures: panicking in shutdown paths can abort the process before other cleanup (e.g., flushing other processors) completes.
- Violates the shutdown contract: `shutdown()` already returns `OTelSdkResult`, so callers expect errors to be returned, not panics.
- Constructor panics are unrecoverable: panicking in `new()` on thread spawn failure gives callers no chance to fall back to an alternative (e.g., a simpler processor or a no-op).
Existing Precedent
`PeriodicReader` in `opentelemetry-sdk/src/metrics/periodic_reader.rs` (lines 327-335) already handles thread spawn failure gracefully:

```rust
// TODO: Should we fail-fast here and bubble up the error to user?
#[allow(unused_variables)]
if let Err(e) = result_thread_creation {
    otel_error!(
        name: "PeriodReaderThreadStartError",
        message = "Failed to start PeriodicReader thread. Metrics will not be exported.",
        error = format!("{:?}", e)
    );
}
```

Although this silently degrades (which has its own issues per #3210), it at least does not panic.
Suggested Fix
For constructors (`new()`)
Return `Result<Self, OTelSdkError>` from the constructors (or provide a fallible builder `.build()` method) so callers can handle thread spawn failures. Note that this is a breaking API change.
For shutdown paths
- Replace `handle.join().unwrap()` with error handling that converts a poisoned mutex or a worker-thread panic into an `OTelSdkError`:

```rust
if let Some(handle) = self
    .handle
    .lock()
    .map_err(|e| OTelSdkError::InternalFailure(format!("Mutex poisoned: {e}")))?
    .take()
{
    handle.join().map_err(|_| {
        OTelSdkError::InternalFailure("Worker thread panicked".into())
    })?;
}
```

- This ensures the `OTelSdkResult` return type is used as intended.
Related
- Library crates should propagate errors instead of silently logging them #3210 -- broader effort to propagate errors instead of silently logging them
- [Bug]: panic during shutdown with TokioCurrentThread batch tracer #1963 -- previous report of panic during shutdown (different root cause, now closed)