Shutdown paths panic on thread join failures instead of returning errors #3375

@bryantbiggs

Summary

Several places in the batch processors panic via .unwrap() or .expect() on thread operations (spawning and joining). These panics occur in shutdown/cleanup code and in constructors, where graceful error handling is expected. If the worker thread panicked for any reason, calling handle.join().unwrap() during shutdown will cause a second panic, masking the original error and potentially causing cascading failures.

Panic Sites

1. BatchSpanProcessor::new() -- thread spawn

File: opentelemetry-sdk/src/trace/span_processor.rs, line 456

.expect("Failed to spawn thread"); //TODO: Handle thread spawn failure

The existing TODO comment acknowledges this should be handled. If thread creation fails (e.g., resource exhaustion), the entire process panics instead of returning an error.
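A minimal sketch of the fallible alternative: `std::thread::Builder::spawn` returns `io::Result<JoinHandle<T>>`, so a spawn failure (e.g., resource exhaustion) surfaces as an error the constructor can propagate instead of a panic. The thread name below is illustrative, not the one used by the SDK.

```rust
use std::io;
use std::thread;

// Spawning via thread::Builder returns io::Result, so the caller
// decides what to do on failure; thread::spawn would panic instead.
fn spawn_worker() -> io::Result<thread::JoinHandle<()>> {
    thread::Builder::new()
        .name("otel-batch-worker".to_string()) // illustrative name
        .spawn(|| {
            // worker loop would run here
        })
}

fn main() -> io::Result<()> {
    let handle = spawn_worker()?; // spawn failure propagates as Err
    handle.join().expect("worker exited cleanly");
    Ok(())
}
```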

2. BatchSpanProcessor::shutdown() -- thread join

File: opentelemetry-sdk/src/trace/span_processor.rs, lines 688-689

if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}

Two panics here: lock().unwrap() panics if the mutex is poisoned (which happens if another thread panicked while holding the lock), and join().unwrap() panics if the worker thread panicked.
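The dual-panic mechanics can be demonstrated in isolation. In this standalone sketch a worker thread panics while holding a lock: `join()` then returns `Err` (so `unwrap()` would re-panic), and the mutex is left poisoned (so `lock().unwrap()` would panic a second time), which is exactly the situation the shutdown path hits.

```rust
use std::panic;
use std::sync::{Arc, Mutex};
use std::thread;

// Returns (worker_join_errored, mutex_poisoned).
fn simulate_worker_panic() -> (bool, bool) {
    // Silence the default panic message printed by the worker thread.
    panic::set_hook(Box::new(|_| {}));

    let state = Arc::new(Mutex::new(()));
    let worker_state = Arc::clone(&state);

    // The worker panics while holding the lock, poisoning the mutex.
    let handle = thread::spawn(move || {
        let _guard = worker_state.lock().unwrap();
        panic!("worker bug");
    });

    // join() reports the panic as Err rather than re-raising it,
    // so join().unwrap() in shutdown would panic again here.
    let join_errored = handle.join().is_err();
    // lock() on the now-poisoned mutex returns Err(PoisonError),
    // so lock().unwrap() in shutdown is a second panic source.
    let poisoned = state.lock().is_err();

    (join_errored, poisoned)
}

fn main() {
    let (join_errored, poisoned) = simulate_worker_panic();
    assert!(join_errored);
    assert!(poisoned);
}
```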

3. BatchLogProcessor::new() -- thread spawn

File: opentelemetry-sdk/src/logs/batch_log_processor.rs, line 488

.expect("Thread spawn failed."); //TODO: Handle thread spawn failure

Same issue as the span processor constructor, with the same TODO comment.

4. BatchLogProcessor::shutdown() -- thread join

File: opentelemetry-sdk/src/logs/batch_log_processor.rs, lines 283-284

if let Some(handle) = self.handle.lock().unwrap().take() {
    handle.join().unwrap();
}

Same dual-panic issue as the span processor shutdown path.

Why This Is Problematic

  1. Masks original errors: If the worker thread panicked due to a bug, calling join().unwrap() in shutdown causes a second panic that obscures the root cause.
  2. Cascading failures: Panicking in shutdown paths can abort the process before other cleanup (e.g., flushing other processors) completes.
  3. Violates shutdown contract: shutdown() already returns OTelSdkResult, so callers expect errors to be returned, not to trigger panics.
  4. Constructor panics are unrecoverable: Panicking in new() on thread spawn failure gives callers no chance to fall back to an alternative (e.g., a simpler processor or a no-op).

Existing Precedent

PeriodicReader in opentelemetry-sdk/src/metrics/periodic_reader.rs (lines 327-335) already handles thread spawn failure gracefully:

// TODO: Should we fail-fast here and bubble up the error to user?
#[allow(unused_variables)]
if let Err(e) = result_thread_creation {
    otel_error!(
        name: "PeriodReaderThreadStartError",
        message = "Failed to start PeriodicReader thread. Metrics will not be exported.",
        error = format!("{:?}", e)
    );
}

Although it silently degrades (which has its own issues per #3210), it at least does not panic.

Suggested Fix

For constructors (new())

Return Result<Self, OTelSdkError> from the constructors (or provide a fallible builder .build() method) so callers can handle thread spawn failures. This is a breaking API change.

For shutdown paths

  • Replace handle.join().unwrap() with proper error handling that converts a thread panic into an OTelSdkError:
    if let Some(handle) = self.handle.lock()
        .map_err(|e| OTelSdkError::InternalFailure(format!("Mutex poisoned: {e}")))?
        .take()
    {
        handle.join().map_err(|_| {
            OTelSdkError::InternalFailure("Worker thread panicked".into())
        })?;
    }
  • This ensures the OTelSdkResult return type is used as intended.
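Putting both halves together, a minimal end-to-end sketch of the proposed shape (a stand-in `SdkError` type is used here in place of `OTelSdkError`; the struct and thread name are hypothetical, not the SDK's actual API):

```rust
use std::thread;

// Stand-in for OTelSdkError; the variant mirrors
// OTelSdkError::InternalFailure from the suggested fix.
#[derive(Debug)]
enum SdkError {
    InternalFailure(String),
}

struct BatchProcessor {
    handle: Option<thread::JoinHandle<()>>,
}

impl BatchProcessor {
    // Fallible constructor: a spawn failure becomes an Err the caller
    // can act on, e.g. by falling back to a simpler processor.
    fn build() -> Result<Self, SdkError> {
        let handle = thread::Builder::new()
            .name("otel-batch-worker".into())
            .spawn(|| { /* worker loop */ })
            .map_err(|e| SdkError::InternalFailure(format!("thread spawn failed: {e}")))?;
        Ok(Self { handle: Some(handle) })
    }

    // Shutdown that maps a worker panic to an error instead of unwrapping.
    fn shutdown(&mut self) -> Result<(), SdkError> {
        if let Some(handle) = self.handle.take() {
            handle
                .join()
                .map_err(|_| SdkError::InternalFailure("worker thread panicked".into()))?;
        }
        Ok(())
    }
}

fn main() {
    let mut p = BatchProcessor::build().expect("spawn failed");
    p.shutdown().expect("shutdown failed");
}
```

Taking the handle with `.take()` before joining also makes `shutdown()` idempotent: a second call finds no handle and returns `Ok(())`.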
