Skip to content

ref(worker): add more observability for producer futures#719

Merged
bmckerry merged 2 commits into
mainfrom
ben/taskbroker-future-observability
Jun 18, 2026
Merged

ref(worker): add more observability for producer futures#719
bmckerry merged 2 commits into
mainfrom
ben/taskbroker-future-observability

Conversation

@bmckerry

Copy link
Copy Markdown
Member

This PR adds more observability to TaskProducer (and the worker child) to help troubleshoot why some producer futures never got marked as done.

Specifically:

  • changes TaskProducer's _pending_futures from a single queue to a dict keyed by producer name, so we can break down worker child metrics by producer
  • adds metric for how many messages were produced by a task, tagged by producer name
  • separates the task execution time metric from the time a task that produced spends sitting in the worker child's pending queue
    • adds a new e2e metric that records both of these put together
  • records a gauge of how many futures are still incomplete in a worker child's tasks with pending futures
    • not sure if this should be a dist?
  • records a gauge of how long the oldest task with pending futures is
    • previously only tasks with 100% complete futures had their latency recorded, this will tell us how old the tasks with incomplete futures are

@bmckerry bmckerry requested a review from a team as a code owner June 17, 2026 19:36

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c6f52d0. Configure here.

"processing_pool": processing_pool_name,
"taskbroker_host": taskbroker_host,
},
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kafka wait omitted from latency

Medium Severity

For activations that await producer futures, record_task_execution still derives execution_latency, COGS usage, and Sentry check-in duration from completion_time, which is now futures_start_time rather than wall-clock completion. Time spent waiting on Kafka futures is dropped from those values even though e2e_latency includes it.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c6f52d0. Configure here.

@bmckerry bmckerry Jun 17, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we'd want to handle these - in theory tasks aren't "complete" until all produced futures are complete, but I still see task function execution time & future awaiting time as separate things we should track

Open to advice here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to me that what we're measuring here is the additional time it takes to move futures from one list to another, not waiting for the produce to actually finished. is that correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execution_duration is how long the task function takes to execute, futures_duration is how long it took for all of a tasks producer futures to be marked as done. Since tasks that produced messages don't go into processed_tasks until all their futures are done, this e2e time metric tracks how long that took.

inflight=task.inflight,
next_state=task.status,
execution_start_time=task.execution_start_time,
execution_end_time=task.futures_start_time,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you're already passing in futures_start_time

@bmckerry bmckerry Jun 18, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this branch of the code, tasks have been sitting in pending_task_futures waiting for their producer futures to finish (so execution_end_time == futures_start_time, which is when the tasks were added to pending_task_futures).

Here is where we call _task_execution_complete() when a task had no pending futures, so a different value (the current time) is passed in as execution_end_time.

I think this module is due for a bit of a refactor to clean up the branching logic, but I'd rather do that in a separate PR.

"processing_pool": processing_pool_name,
"taskbroker_host": taskbroker_host,
},
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to me that what we're measuring here is the additional time it takes to move futures from one list to another, not waiting for the produce to actually finished. is that correct?

Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated

@untitaker untitaker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving to unblock

i think the histogram vs gauge thing might prevent you from getting insights. but if you feel like it's good enough for your purposes to see workerchild's metrics muddled together per-pool (and I think it'll be fine), then i'd rather unblock for now and revisit later (i can see this discussion dragging on for too long, especially since it concerns cardinality)

and the other thing about which latency is the right one: i also don't think we need to resolve this conversation in this PR

it's important to me that we get to a state where we can safely roll out producer futures in existing high-scale tasks (and also for tasks that we will create from legacy consumers), and for that we need the investigation to continue. otherwise legacy consumers project will soon get to a point where we can't port more consumers either. everything else is irrelevant right now.

@bmckerry bmckerry force-pushed the ben/taskbroker-future-observability branch from c995b96 to 6941464 Compare June 18, 2026 15:16
@bmckerry bmckerry merged commit 6d22bb7 into main Jun 18, 2026
27 checks passed
@bmckerry bmckerry deleted the ben/taskbroker-future-observability branch June 18, 2026 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants