[BUG] Static-destruction race: Telemetry::increment_counter touches destroyed counters_ after Telemetry::~Telemetry returns

## Summary

When a `tracing::Tracer` (or just a `telemetry::Telemetry`) is held by an object with **static-storage lifetime** and the process exits, the `CurlImpl` libcurl event-loop thread can call back into `Telemetry::increment_counter` *after* `Telemetry::~Telemetry` has already returned and its members (`counters_`, `rates_`, …) are mid-destruction. The result is a SEGV inside `std::unordered_map::operator[]` reaching into a destroyed hash table.

Reproducible on **v2.0.0** (the version we're pinning). I read the v2.1.0 sources too — `Telemetry::increment_counter` and `Telemetry::~Telemetry` are byte-identical between v2.0.0 and v2.1.0, so the race shape is unchanged.

## Symptoms

- AddressSanitizer reports a `SEGV on unknown address` (typically a low, garbage-looking pointer) from inside `Telemetry::increment_counter` → `std::unordered_map::operator[]` → `_M_bucket_index`.
- Fires **after** Catch2 (or whatever test framework / `main`) has already printed its success summary — i.e. during `exit()` / static destruction, not during the actual work the program did.
- Happens with **`tracing_enabled=false` and `report_traces=false`** in the `TracerConfig` — the telemetry curl thread still starts and still sends `app-started` + queues `app-closing`, which is what triggers the in-flight callback.

## Stack trace (ASan, g++ 15.2.1, x86_64-linux, debug build with `-fsanitize=address,undefined`)

I've trimmed the noisy hash-table internals; the relevant frames are #6 onwards:

```
==N==ERROR: AddressSanitizer: SEGV on unknown address 0x... (T1)
  ==> READ memory access from a background thread

  #0–#5  std::__detail::_Hash_code_base<MetricContext<Metric<Counter>>, …>::_M_bucket_index
         std::unordered_map<…>::operator[]
  #6     Telemetry::increment_counter(
           Metric<Counter> const&,
           std::vector<std::string> const&)                          telemetry_impl.cpp:799
  #7     send_payload(…)::<lambda(Error)>                            telemetry_impl.cpp:399
  #8–#11 std::function<void(Error)>::operator()
  #12    CurlImpl::handle_message(CURLMsg const&)                    curl.cpp:591
  #13    CurlImpl::run()                                             curl.cpp:559
  #14    CurlImpl::CurlImpl(…)::<lambda()>                           curl.cpp:297
  …      std::thread invoker / asan_thread_start / start_thread
```

## Root cause (as I read the code)

`Telemetry::~Telemetry()` (telemetry_impl.cpp:228 in v2.0.0) does two things:

1. `cancel_tasks(tasks_)`: cancels the heartbeat scheduler.
2. `app_closing()` → `send_payload("app-closing", …)`: **enqueues** an HTTP POST onto `http_client_`, which is the `Curl` instance owned outside of `Telemetry`. The libcurl event-loop thread is still alive at this point; the `send_payload` lambda captures `this` (the `Telemetry`) and calls `increment_counter` from inside the response/error callback.

The destructor then returns. Reverse-declaration-order member destruction begins: `distributions_`, `rates_snapshot_`, `rates_`, `counters_snapshot_`, **`counters_` (destroyed here)**, etc.

`CurlImpl::~CurlImpl` (curl.cpp:312) joins the event-loop thread, but it does not run until the **parent** `Tracer` / outer-owner destructor unwinds far enough to destroy the `Curl` instance, which happens *after* the `Telemetry` members have been destroyed within that same enclosing object's teardown. In that window the event-loop thread is still processing the `app-closing` response and invokes the callback, which calls `Telemetry::increment_counter` → already-destroyed `counters_` → SEGV.

This is invisible in normal use because most test programs let the OS reclaim the process before the curl thread's response lands. With ASan / under a static-storage owner, the destruction order is fully forced and the race is reliably triggered.
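
To make the ordering above concrete, here is a minimal sketch that only models the destruction sequence. All type names (`Noisy`, `TelemetryModel`, `OwnerModel`) are hypothetical stand-ins, not dd-trace-cpp types, and the sketch assumes the HTTP client is declared before the telemetry object in its owner, so that it is destroyed last, as described above.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: records its own destruction in a shared log.
struct Noisy {
  Noisy(std::vector<std::string>* log, std::string name)
      : log_(log), name_(std::move(name)) {}
  ~Noisy() { log_->push_back(name_ + " destroyed"); }
  std::vector<std::string>* log_;
  std::string name_;
};

// Stand-in for Telemetry: the destructor body runs first, then the
// members are destroyed in reverse declaration order.
struct TelemetryModel {
  explicit TelemetryModel(std::vector<std::string>* log)
      : log_(log), counters_(log, "counters_"), rates_(log, "rates_") {}
  ~TelemetryModel() { log_->push_back("~TelemetryModel body returned"); }
  std::vector<std::string>* log_;
  Noisy counters_;
  Noisy rates_;
};

// Stand-in for the outer owner. ASSUMPTION: the client is declared
// first, so it is destroyed (and its thread joined) last.
struct OwnerModel {
  explicit OwnerModel(std::vector<std::string>* log)
      : http_client_(log, "http_client_ (event-loop thread joined)"),
        telemetry_(log) {}
  Noisy http_client_;
  TelemetryModel telemetry_;
};

// Destroys one OwnerModel and returns the recorded order of events.
std::vector<std::string> destruction_order() {
  std::vector<std::string> log;
  { OwnerModel owner(&log); }
  return log;
}
```

In this model the log reads: destructor body, `rates_`, `counters_`, and only then the client. Everything between the first and last entry is the window in which a still-running event-loop thread could call back into a half-destroyed object.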

## Minimal repro shape

We don't have a self-contained mini-repro pre-baked, but the shape is:

```cpp
// In some translation unit:
static std::shared_ptr<datadog::tracing::Tracer> make_tracer() {
  datadog::tracing::TracerConfig cfg;
  cfg.service = "repro";
  cfg.tracing_enabled = false;
  cfg.report_traces    = false;
  auto vc = datadog::tracing::finalize_config(cfg);
  return std::make_shared<datadog::tracing::Tracer>(*vc);
}

// And then somewhere with static lifetime (function-scope static or
// namespace-scope, doesn't matter):
static auto s_tracer = make_tracer();

int main() {
  // Trivially exit. The crash is in static destruction, not in main().
}
```

Build with `-fsanitize=address,undefined` and run. The ASan SEGV from the curl thread is the symptom.

## Suggested fix directions

(I read v2.0.0 and v2.1.0 of `telemetry_impl.cpp`; the same options apply to both.)

1. **Synchronously drain in-flight callbacks in `Telemetry::~Telemetry`** before letting the destructor return, e.g. by calling `http_client_->drain(deadline)` (the API already exists on `Curl`, see curl.cpp:421). After `drain`, no callback referencing `this` can still be queued.
2. **Capture by `shared_ptr` rather than by raw `this` in `send_payload`'s response/error lambda**, with the `Telemetry` either being internally `shared_from_this()` or wrapping its mutable state in a `shared_ptr<State>`. This keeps the closure self-contained and removes the cross-object ordering coupling.
3. **Document the lifetime contract.** Currently nothing in the public API warns the host that a `Tracer` held in static storage may crash during process exit. If neither of the above fixes lands, at minimum a comment in `tracer_config.h` saying "do not hold `Tracer` in static storage" would save downstream debugging.
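
As a rough illustration of direction 1 in the list above, here is a self-contained model, not a patch. `ClientModel`, its `post` and `drain` members, and the counter name are hypothetical; the real `Curl::drain` signature and semantics may differ. It only shows the shape: the destructor enqueues its last request and then waits while `counters_` is still alive.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an HTTP client with one event-loop thread.
class ClientModel {
 public:
  ClientModel() : worker_([this] { run(); }) {}
  ~ClientModel() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    worker_.join();  // like CurlImpl::~CurlImpl, this runs late
  }
  // Queue a completion callback to be invoked on the worker thread.
  void post(std::function<void()> on_done) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(on_done));
      ++in_flight_;
    }
    cv_.notify_all();
  }
  // Block until no callback is queued or running, or the deadline passes.
  // Returns true if everything finished in time.
  bool drain(std::chrono::steady_clock::time_point deadline) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_until(lock, deadline, [this] { return in_flight_ == 0; });
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
      if (queue_.empty()) return;  // stopping, nothing left to deliver
      auto callback = std::move(queue_.front());
      queue_.pop_front();
      lock.unlock();
      callback();  // runs without the lock held
      lock.lock();
      --in_flight_;
      cv_.notify_all();
    }
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  int in_flight_ = 0;
  bool stopping_ = false;
  std::thread worker_;  // declared last: starts after the members above exist
};

// Stand-in for Telemetry whose destructor drains before returning.
class TelemetryModel {
 public:
  TelemetryModel(ClientModel& client, std::atomic<bool>& callback_ran)
      : client_(client), callback_ran_(callback_ran) {}
  ~TelemetryModel() {
    // Models app_closing(): the callback captures `this`.
    client_.post([this] {
      ++counters_["app-closing.responses"];  // hypothetical counter name
      callback_ran_.store(true);
    });
    // Direction 1: wait here, while counters_ is still alive.
    // A real fix would also need a policy for the timeout case.
    client_.drain(std::chrono::steady_clock::now() + std::chrono::seconds(5));
  }

 private:
  ClientModel& client_;
  std::atomic<bool>& callback_ran_;
  std::unordered_map<std::string, int> counters_;
};

// Returns true if the callback finished before ~TelemetryModel returned.
bool run_drain_demo() {
  std::atomic<bool> callback_ran{false};
  ClientModel client;  // outlives telemetry, like the externally owned client
  { TelemetryModel telemetry(client, callback_ran); }
  return callback_ran.load();
}
```

The model leaves one question open that a real change would have to answer: what to do when the deadline expires with a callback still pending, since returning at that point reopens the same window.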

Option (1) feels like the smallest, most targeted change. (2) is more invasive but would also fix any future shape of this same bug class.
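
For direction (2), a minimal sketch of the state-in-`shared_ptr` idea follows. It uses a `weak_ptr` capture, which is a variant of what is proposed above: a strong `shared_ptr` capture would keep the state alive until the callback runs, whereas the weak capture lets a late callback notice that the owner is gone and skip the work. `TelemetryState`, `TelemetryModel`, and the counter name are hypothetical and do not reflect the actual dd-trace-cpp class layout.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical: the mutable state that callbacks need to touch.
struct TelemetryState {
  std::mutex mutex;
  std::unordered_map<std::string, int> counters;
};

// Stand-in for Telemetry that hands out callbacks not tied to `this`.
class TelemetryModel {
 public:
  TelemetryModel() : state_(std::make_shared<TelemetryState>()) {}

  // Builds a completion callback that holds only a weak reference.
  // The callback returns true if it updated a counter, false if the
  // state had already been destroyed.
  std::function<bool()> make_completion_callback() const {
    std::weak_ptr<TelemetryState> weak = state_;
    return [weak] {
      std::shared_ptr<TelemetryState> state = weak.lock();
      if (!state) return false;  // owner is gone: skip instead of crashing
      std::lock_guard<std::mutex> lock(state->mutex);
      ++state->counters["app-closing.responses"];  // hypothetical name
      return true;
    };
  }

 private:
  std::shared_ptr<TelemetryState> state_;
};

// Runs one callback while the owner is alive and one, on another
// thread, after the owner has been destroyed.
std::pair<bool, bool> run_callback_lifetime_demo() {
  std::function<bool()> late_callback;
  bool ran_while_alive = false;
  {
    TelemetryModel telemetry;
    ran_while_alive = telemetry.make_completion_callback()();
    late_callback = telemetry.make_completion_callback();
  }  // telemetry and its state are destroyed here
  bool ran_after_destruction = true;
  std::thread worker([&] { ran_after_destruction = late_callback(); });
  worker.join();
  return {ran_while_alive, ran_after_destruction};
}
```

The closure no longer depends on who joins the event-loop thread or when, which is the cross-object coupling this direction is meant to remove.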

## Context

Surfaced during a sanitizer sweep on a third-party C++ proxy. Both the Catch2 unit suite and the Jest integration suite under an ASan-instrumented proxy are otherwise clean (zero proxy-side findings); this is the only artifact, and it's 100% in `dd-trace-cpp` + libstdc++ frames.

Happy to provide more detail or test a proposed patch.
