fix: compute rate under lock by KowalskiThomas · Pull Request #928 · DataDog/datadogpy

KowalskiThomas · 2026-03-28T20:20:09Z

What does this PR do?

Fixes a race condition in MaxSampleMetric.flush() where the sample rate was computed outside the lock, causing it to be inconsistent with the number of samples returned inside the lock. This leads to incorrect sample rates being sent to the DogStatsd backend for Histogram, Distribution, and Timing metrics when max_metric_samples_per_context is enabled.

Description of the Change

In MaxSampleMetric.flush(), the effective sample rate (stored_metric_samples / total_metric_samples) was calculated before acquiring the metric lock:

# BEFORE (buggy):
def flush(self):
    rate = self.stored_metric_samples / self.total_metric_samples  # outside lock
    with self.lock:
        return [MetricAggregator(..., rate, ...) for i in range(self.stored_metric_samples)]

A concurrent sampling thread could acquire the metric lock between the rate calculation and the with self.lock: block, adding a new sample. This results in:

rate reflecting N total samples
The returned list containing N+1 samples (or a different stored count)

The rate reported to the backend is used for statistical scaling (e.g., rate=0.5 means each sample represents 2 events). Returning mismatched rate and sample counts causes the backend to over- or under-weight the metrics.

The fix moves the rate calculation inside the lock so the rate and the sample list are always consistent:

# AFTER (fixed):
def flush(self):
    with self.lock:
        rate = self.stored_metric_samples / self.total_metric_samples
        return [MetricAggregator(..., rate, ...) for i in range(self.stored_metric_samples)]

Alternate Designs

One alternative would be to snapshot the counts before acquiring the lock and cap stored_metric_samples inside the lock. However, moving both reads inside the lock is simpler, more correct, and involves no additional overhead.

Possible Drawbacks

None. The lock is already acquired for the list comprehension, so adding the rate calculation inside the lock has negligible performance impact.

Verification Process

Added unit tests in tests/unit/dogstatsd/test_max_sample_metrics.py to assert that:
- The flushed rate correctly reflects skipped samples (stored/total < 1.0 when skip_sample() is called)
- The flushed rate is correct for bounded reservoir sampling (max_metric_samples > 0)
- The flushed rate is consistent with the number of samples returned
All existing tests continue to pass.

Additional Notes

This bug only manifests when max_metric_samples_per_context > 0 is set (the experimental bounded-sampling feature) and concurrent threads are both sampling and flushing. The incorrect rate would cause the Datadog backend to apply the wrong statistical weight to the reported metrics.

Release Notes

Fix race condition in MaxSampleMetric.flush() where sample rate was computed outside the metric lock, causing incorrect rates to be reported to the backend when concurrent sampling and flushing occurred with max_metric_samples_per_context enabled.

Review checklist (to be filled by reviewers)

Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
PR title must be written as a CHANGELOG entry (see why)
Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
PR must have one changelog/ label attached. If applicable it should have the backward-incompatible label attached.
PR should not have do-not-merge/ label attached.
If Applicable, issue must have kind/ and severity/ labels attached at least.

PR by Bits - View session in Datadog

Comment @DataDog to request changes

Co-authored-by: KowalskiThomas <14239160+KowalskiThomas@users.noreply.github.com>

datadog-datadog-prod-us1 · 2026-03-28T20:20:10Z

View session in Datadog

Bits Dev status: ✅ Done

CI Auto-fix: Disabled | Enable

Comment @DataDog to request changes

datadog-datadog-prod-us1-2 · 2026-03-28T20:20:15Z

I can only run on private repositories.

fix: compute rate under lock

774e539

Co-authored-by: KowalskiThomas <14239160+KowalskiThomas@users.noreply.github.com>

datadog-datadog-prod-us1-2 bot added the Bits AI label Mar 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: compute rate under lock#928

fix: compute rate under lock#928
KowalskiThomas wants to merge 1 commit intomasterfrom
dd/kowalski/fix/max-sample-metric-rate-race-condition

KowalskiThomas commented Mar 28, 2026

Uh oh!

datadog-datadog-prod-us1 bot commented Mar 28, 2026

Uh oh!

datadog-datadog-prod-us1-2 bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

KowalskiThomas commented Mar 28, 2026

What does this PR do?

Description of the Change

Alternate Designs

Possible Drawbacks

Verification Process

Additional Notes

Release Notes

Review checklist (to be filled by reviewers)

Uh oh!

datadog-datadog-prod-us1 bot commented Mar 28, 2026

Uh oh!

datadog-datadog-prod-us1-2 bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant