Co-authored-by: KowalskiThomas <14239160+KowalskiThomas@users.noreply.github.com>
What does this PR do?
Fixes a race condition in `MaxSampleMetric.flush()` where the sample rate was computed outside the lock, causing it to be inconsistent with the number of samples returned inside the lock. This leads to incorrect sample rates being sent to the DogStatsd backend for Histogram, Distribution, and Timing metrics when `max_metric_samples_per_context` is enabled.

Description of the Change
In `MaxSampleMetric.flush()`, the effective sample rate (`stored_metric_samples / total_metric_samples`) was calculated before acquiring the metric lock. A concurrent sampling thread could acquire the lock between the rate calculation and the `with self.lock:` block and add a new sample. As a result, `rate` reflects N total samples while the flushed sample list reflects N+1, so the two disagree. The rate reported to the backend is used for statistical scaling (e.g., rate=0.5 means each sample represents 2 events), so returning mismatched rates and sample counts causes the backend to over- or under-weight the metrics.
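To illustrate, here is a simplified sketch of the racy ordering, replayed deterministically. The class and attribute names below are illustrative stand-ins, not the actual datadogpy source:

```python
import threading

class RacyMetric:
    """Simplified stand-in for MaxSampleMetric (names are illustrative)."""

    def __init__(self, max_samples):
        self.lock = threading.Lock()
        self.max_samples = max_samples
        self.samples = []              # plays the role of the stored samples
        self.total_metric_samples = 0

    def sample(self, value):
        with self.lock:
            self.total_metric_samples += 1
            if len(self.samples) < self.max_samples:
                self.samples.append(value)

# Deterministic replay of the problematic interleaving:
m = RacyMetric(max_samples=2)
m.sample(1)
m.sample(2)

# Step 1: the flusher computes the rate OUTSIDE the lock (the bug).
rate = len(m.samples) / m.total_metric_samples   # 2 / 2 = 1.0

# Step 2: a sampling thread slips in before the flusher takes the lock.
m.sample(3)   # total becomes 3; the stored list stays capped at 2

# Step 3: the flusher now takes the lock and swaps out the sample list.
with m.lock:
    flushed, m.samples = m.samples, []
    m.total_metric_samples = 0

# rate claims every flushed sample represents exactly one event (1.0),
# but 3 events occurred and only 2 samples were stored: the correct
# rate would have been 2/3.
print(rate, len(flushed))   # 1.0 2
```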
The fix moves the rate calculation inside the lock so the rate and the sample list are always consistent.
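A minimal sketch of the corrected ordering, using illustrative stand-in names rather than the actual datadogpy source:

```python
import threading

class FixedMetric:
    """Simplified stand-in for MaxSampleMetric (names are illustrative)."""

    def __init__(self, max_samples):
        self.lock = threading.Lock()
        self.max_samples = max_samples
        self.samples = []
        self.total_metric_samples = 0

    def sample(self, value):
        with self.lock:
            self.total_metric_samples += 1
            if len(self.samples) < self.max_samples:
                self.samples.append(value)

    def flush(self):
        with self.lock:
            # Rate and sample list are read under the same lock, so a
            # concurrent sample() can no longer make them disagree.
            rate = len(self.samples) / self.total_metric_samples
            flushed, self.samples = self.samples, []
            self.total_metric_samples = 0
        return rate, flushed

m = FixedMetric(max_samples=2)
for v in (1, 2, 3):
    m.sample(v)
rate, flushed = m.flush()
print(rate, len(flushed))   # 2 stored out of 3 events, so rate is 2/3
```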
Alternate Designs
One alternative would be to snapshot the counts before acquiring the lock and cap `stored_metric_samples` inside the lock. However, moving both reads inside the lock is simpler, more correct, and involves no additional overhead.

Possible Drawbacks
None. The lock is already acquired for the list comprehension, so adding the rate calculation inside the lock has negligible performance impact.
Verification Process
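The invariant being verified can be sketched as a self-contained stress check: flush repeatedly while another thread keeps sampling, and confirm every flush returns a rate consistent with its own sample list. The class and helper below are simplified stand-ins, not the actual test code:

```python
import threading

class FixedMetric:
    """Simplified stand-in used only for this sketch (not the real class)."""

    def __init__(self, max_samples):
        self.lock = threading.Lock()
        self.max_samples = max_samples
        self.samples = []
        self.total_metric_samples = 0

    def sample(self, value):
        with self.lock:
            self.total_metric_samples += 1
            if len(self.samples) < self.max_samples:
                self.samples.append(value)

    def flush(self):
        with self.lock:
            total = self.total_metric_samples
            if total == 0:
                return None
            rate = len(self.samples) / total
            flushed, self.samples = self.samples, []
            self.total_metric_samples = 0
        return rate, total, flushed

def stress(iterations=1000):
    """Return True if every flush saw a rate consistent with its samples."""
    m = FixedMetric(max_samples=8)
    stop = threading.Event()

    def sampler():
        v = 0
        while not stop.is_set():
            m.sample(v)
            v += 1

    t = threading.Thread(target=sampler)
    t.start()
    consistent = True
    for _ in range(iterations):
        result = m.flush()
        if result is None:
            continue
        rate, total, flushed = result
        # With the fix, rate and flushed are read under one lock,
        # so this equality must hold on every iteration.
        if rate != len(flushed) / total:
            consistent = False
    stop.set()
    t.join()
    return consistent

print(stress())   # True
```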
Updated `tests/unit/dogstatsd/test_max_sample_metrics.py` to assert that:

- the reported rate matches the flushed sample count (`stored/total < 1.0` when `skip_sample()` is called)
- the rate stays consistent when samples are dropped under the cap (`max_metric_samples > 0`)

Additional Notes
This bug only manifests when `max_metric_samples_per_context > 0` is set (the experimental bounded-sampling feature) and concurrent threads are both sampling and flushing. The incorrect rate would cause the Datadog backend to apply the wrong statistical weight to the reported metrics.

Release Notes
Fix race condition in `MaxSampleMetric.flush()` where the sample rate was computed outside the metric lock, causing incorrect rates to be reported to the backend when concurrent sampling and flushing occurred with `max_metric_samples_per_context` enabled.

Review checklist (to be filled by reviewers)
- `changelog/` label attached. If applicable it should have the `backward-incompatible` label attached.
- `do-not-merge/` label attached if the changes should not be merged.
- `kind/` and `severity/` labels attached at least.