Skip to content

feat: Add datadog metrics backend#703

Merged
markstory merged 4 commits into
mainfrom
metrics-adapter
Jun 17, 2026
Merged

feat: Add datadog metrics backend#703
markstory merged 4 commits into
mainfrom
metrics-adapter

Conversation

@markstory

Copy link
Copy Markdown
Member

We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious.

Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability.

Refs STREAM-816

We currently require applications to implement their own metrics
backend. This has resulted in metrics from the taskworker runtime having
inconsistent tags and metric names, which makes building dashboards and
alerting for taskworkers tedious.

Having a more opinionated metrics backend will allow us to make metric
names and tags that are important to observability structurally
required. I've also included a prefixed metric shim which will allow us
to switch metrics providers without gaps in observability.

Refs STREAM-816
@markstory markstory requested a review from a team as a code owner June 12, 2026 20:10
@linear-code

linear-code Bot commented Jun 12, 2026

Copy link
Copy Markdown

STREAM-816

try:
yield None
finally:
self._emit("timing", key, time.monotonic() - start, tags, sample_rate)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timer sends seconds not milliseconds

High Severity

DatadogMetrics.timer emits elapsed wall time from time.monotonic() directly to DogStatsD timing, but that API expects values in milliseconds. Reported durations for RPC and task timers will be roughly three orders of magnitude too small, breaking latency dashboards and alerts.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fe820a1. Configure here.



def _rss_bytes() -> int:
return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory metric uses peak RSS

Medium Severity

track_memory_usage derives its distribution from the difference in ru_maxrss, which is a process lifetime high-water mark, not current RSS. Memory freed inside the block does not lower the metric, and on Linux ru_maxrss is in kilobytes despite _rss_bytes implying bytes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fe820a1. Configure here.

Comment thread clients/python/src/taskbroker_client/metrics.py
Comment thread clients/python/src/taskbroker_client/metrics.py

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2078b84. Configure here.

Comment thread clients/python/src/taskbroker_client/metrics.py
Comment thread clients/python/src/taskbroker_client/metrics.py
Comment thread clients/python/src/taskbroker_client/metrics.py
It should be aligned with the other methods.
instance: str | None = None,
tags: Tags | None = None,
sample_rate: float = 1,
sample_rate: float | None = None,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default value has been silly for a while. Now is a good time to change this.

Comment thread clients/python/src/taskbroker_client/metrics.py
client.distribution.assert_called_once()
args, _ = client.distribution.call_args
assert args[0] == "taskworker.mem"
assert isinstance(args[1], int)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The track_memory_usage function incorrectly uses ru_maxrss, a high-water mark for memory, to calculate memory deltas. This can cause test failures and makes the production metric unreliable.
Severity: MEDIUM

Suggested Fix

To accurately measure memory usage within the context, replace resource.getrusage with a method that measures current memory usage, not a historical peak. For example, use a library like psutil to get the current resident set size (RSS) at the beginning and end of the block. The difference between these two values will provide a more accurate measurement of the memory consumed within that specific context.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/tests/test_metrics.py#L140

Potential issue: The function `track_memory_usage` calculates memory usage deltas using
`resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`. This value represents the peak
memory usage (a high-water mark) of the process, not the current usage. In the test
`test_track_memory_usage`, an assertion `args[1] > 0` expects a positive memory delta
after an allocation. However, if previous tests have already set a high memory peak, the
new allocation may not increase it, resulting in a delta of zero and a test failure.
This also affects production, where the metric will likely report zero for most tasks,
making it ineffective for its intended purpose of tracking memory usage per task.

@markstory markstory merged commit 63fadc8 into main Jun 17, 2026
27 checks passed
@markstory markstory deleted the metrics-adapter branch June 17, 2026 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants