Skip to content

[statsd] Fix buffer operation thread-safety#642

Merged
sgnn7 merged 2 commits intomasterfrom
sgnn7/ac34-statsd-batch-thread-safety
Apr 9, 2021
Merged

[statsd] Fix buffer operation thread-safety#642
sgnn7 merged 2 commits intomasterfrom
sgnn7/ac34-statsd-batch-thread-safety

Conversation

@sgnn7
Copy link
Copy Markdown
Contributor

@sgnn7 sgnn7 commented Mar 5, 2021

What does this PR do?

Old code did not have locking around its open_buffer() and
close_buffer() operations which made those two methods not
thread-safe. This change fixes this and aligns this module with the
documentation.

Fixes #439
Fixes #601

Description of the Change

We now have an RLock around buffer operations that prevents threads
from changing the buffer owned by another thread.

Overall performance decrease when sending 2+ metrics is within the +/- 5% margin of error:

$ /time_buffer_ops.py 
Using Python3
baseline: 6.19679 (baseline)
mod     : 6.02907 (97.29%, 2.78% perf increase)
baseline: 6.20096 (baseline)
mod     : 6.04097 (97.42%, 2.65% perf increase)
baseline: 6.05407 (baseline)
mod     : 6.12549 (101.18%, -1.17% perf increase)
baseline: 6.16327 (baseline)
mod     : 6.16413 (100.01%, -0.01% perf increase)
baseline: 5.94379 (baseline)
mod     : 6.18436 (104.05%, -3.89% perf increase)

$ ./time_buffer_ops.py 
Using Python2
baseline: 5.24699 (baseline)
mod     : 5.25159 (100.09%, -0.09% perf increase)
baseline: 5.17050 (baseline)
mod     : 5.27538 (102.03%, -1.99% perf increase)
baseline: 5.46608 (baseline)
mod     : 5.56813 (101.87%, -1.83% perf increase)
baseline: 5.10800 (baseline)
mod     : 5.41473 (106.01%, -5.66% perf increase)
baseline: 5.06513 (baseline)
mod     : 5.27768 (104.20%, -4.03% perf increase)

Alternate Designs

Possible Drawbacks

Verification Process

Unit tests cover the full scope of changes

Additional Notes

Release Notes

Review checklist (to be filled by reviewers)

  • Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have one changelog/ label attached. If applicable it should have the backward-incompatible label attached.
  • PR should not have do-not-merge/ label attached.
  • If Applicable, issue must have kind/ and severity/ labels attached at least.

@sgnn7 sgnn7 added the changelog/Fixed Fixed features results into a bug fix version bump label Mar 5, 2021
@sgnn7 sgnn7 requested review from a team as code owners March 5, 2021 23:41
Comment on lines +383 to +384
if not hasattr(self, 'buffer'):
raise BufferError('Cannot close buffer that was never opened')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be correct though it looks like a different change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The edge case was found during the test writing. It's a separate commit but it can be split out into a separate PR if you want.

Comment on lines +391 to +395
finally:
# Release all locks that might have been held by our thread.
while self._buffer_lock_depth > 0:
self._buffer_lock_depth -= 1
self._buffer_lock.release()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean you can have multiple open_buffer from the same thread and only one close_buffer?
That sounds like trying to circumvent programming errors.

Copy link
Copy Markdown
Contributor Author

@sgnn7 sgnn7 Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean you can have multiple open_buffer from the same thread and only one close_buffer?
That sounds like trying to circumvent programming errors.

Hey @jd, there was no real guidance to our users that an open_buffer should match a close_buffer. Even if we did have that guidance, it's not technically a programming error - the following code gets into the tricky territory of "what's the correct behavior expected" that ends up being ambiguous as to what it should do/allow:

statds.open_buffer()
# do something to the buffer

statsd.open_buffer()
# do something else
statsd.close_buffer()

The second open_buffer in the strictest of terms (which is what you were pointing out I think) seems like it should just throw an exception as it's out-of-order but given the former behavior and its user friendliness it could be interpreted as a "just give me a new buffer to write". We've discussed this a bit on slack before the PR was implemented and chose the latter as a more-appropriate behavior.

As a sidenote, both open_buffer and close_buffer are getting deprecated soon since we want buffering by default so they will effectively be noops in the future.

"""
Flush the buffer and switch back to single metric packets.
"""
if not hasattr(self, 'buffer'):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider setting self.buffer at None in the constructor as an alternative solution?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point and I did consider that as an alternative but that option has the unintended effect of ignoring blatantly erroneous behavior when the user does not call open_buffer even a single time before trying to close the buffer. While we might not care from the logic point of view if the buffer was created or not beforehand, it made more sense here to notify the user that their code is likely very wrong.

self._flush_buffer()
finally:
# Release all locks that might have been held by our thread.
while self._buffer_lock_depth > 0:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following code:

d = DogStatsd()
# ...
with d as batch:
       with d as other_batch:
             other_batch.gauge("users.online", 123)
       # As the lock was released at the previous line, another thread can call `d.open_buffer()` and update buffer at the same time as `batch.gauge("users.online", 123)`
       batch.gauge("users.online", 123)

As a sidenote, both open_buffer and close_buffer are getting deprecated soon since we want buffering by default so they will effectively be noops in the future

Note: I am not sure it worth handling the case in the previous code as these functions are getting deprecated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting edge case for sure! Let me think about this some more when I get time and I'll get back to you on it.

Copy link
Copy Markdown
Contributor Author

@sgnn7 sgnn7 Apr 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thinking about this particular example and other edge use cases, I think it might be just better to make the lock not explicitly re-entrant from the same thread which should fix your and @jd's concerns.

ogaca-dd
ogaca-dd previously approved these changes Apr 2, 2021
sgnn7 added 2 commits April 8, 2021 14:18
When a `close_buffer` was called before any `open_buffer`, we would raise
an unexpected exception. This change ensures that we have a consistent
error shown to the user when that happens.
Old code did not have locking around its `open_buffer()` and
`close_buffer()` operations which made those two methods not
thread-safe. This change fixes this and aligns this module with the
documentation.
@sgnn7 sgnn7 force-pushed the sgnn7/ac34-statsd-batch-thread-safety branch from f439b1c to 4b2aa1d Compare April 8, 2021 20:23
Copy link
Copy Markdown

@ogaca-dd ogaca-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sgnn7 sgnn7 merged commit 33f727c into master Apr 9, 2021
@sgnn7 sgnn7 deleted the sgnn7/ac34-statsd-batch-thread-safety branch April 9, 2021 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/Fixed Fixed features results into a bug fix version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible race condition in _flush_buffer DogStatsd is not thread-safe when using buffer

3 participants