Skip to content

fix: compute rate under lock#928

Draft
KowalskiThomas wants to merge 1 commit intomasterfrom
dd/kowalski/fix/max-sample-metric-rate-race-condition
Draft

fix: compute rate under lock#928
KowalskiThomas wants to merge 1 commit intomasterfrom
dd/kowalski/fix/max-sample-metric-rate-race-condition

Conversation

@KowalskiThomas
Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes a race condition in MaxSampleMetric.flush() where the sample rate was computed outside the lock, causing it to be inconsistent with the number of samples returned inside the lock. This leads to incorrect sample rates being sent to the DogStatsd backend for Histogram, Distribution, and Timing metrics when max_metric_samples_per_context is enabled.

Description of the Change

In MaxSampleMetric.flush(), the effective sample rate (stored_metric_samples / total_metric_samples) was calculated before acquiring the metric lock:

# BEFORE (buggy):
def flush(self):
    rate = self.stored_metric_samples / self.total_metric_samples  # outside lock
    with self.lock:
        return [MetricAggregator(..., rate, ...) for i in range(self.stored_metric_samples)]

A concurrent sampling thread could acquire the metric lock between the rate calculation and the with self.lock: block, adding a new sample. This results in:

  • rate reflecting N total samples
  • The returned list containing N+1 samples (or a different stored count)

The rate reported to the backend is used for statistical scaling (e.g., rate=0.5 means each sample represents 2 events). Returning mismatched rate and sample counts causes the backend to over- or under-weight the metrics.

The fix moves the rate calculation inside the lock so the rate and the sample list are always consistent:

# AFTER (fixed):
def flush(self):
    with self.lock:
        rate = self.stored_metric_samples / self.total_metric_samples
        return [MetricAggregator(..., rate, ...) for i in range(self.stored_metric_samples)]

Alternate Designs

One alternative would be to snapshot the counts before acquiring the lock and cap stored_metric_samples inside the lock. However, moving both reads inside the lock is simpler, more correct, and involves no additional overhead.

Possible Drawbacks

None. The lock is already acquired for the list comprehension, so adding the rate calculation inside the lock has negligible performance impact.

Verification Process

  • Added unit tests in tests/unit/dogstatsd/test_max_sample_metrics.py to assert that:
    • The flushed rate correctly reflects skipped samples (stored/total < 1.0 when skip_sample() is called)
    • The flushed rate is correct for bounded reservoir sampling (max_metric_samples > 0)
    • The flushed rate is consistent with the number of samples returned
  • All existing tests continue to pass.

Additional Notes

This bug only manifests when max_metric_samples_per_context > 0 is set (the experimental bounded-sampling feature) and concurrent threads are both sampling and flushing. The incorrect rate would cause the Datadog backend to apply the wrong statistical weight to the reported metrics.

Release Notes

Fix race condition in MaxSampleMetric.flush() where sample rate was computed outside the metric lock, causing incorrect rates to be reported to the backend when concurrent sampling and flushing occurred with max_metric_samples_per_context enabled.

Review checklist (to be filled by reviewers)

  • Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have one changelog/ label attached. If applicable it should have the backward-incompatible label attached.
  • PR should not have do-not-merge/ label attached.
  • If Applicable, issue must have kind/ and severity/ labels attached at least.

PR by Bits - View session in Datadog

Comment @DataDog to request changes

Co-authored-by: KowalskiThomas <14239160+KowalskiThomas@users.noreply.github.com>
@datadog-datadog-prod-us1
Copy link
Copy Markdown

View session in Datadog

Bits Dev status: ✅ Done

CI Auto-fix: Disabled | Enable

Comment @DataDog to request changes

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

I can only run on private repositories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant