Skip to content

compute: move mz_dataflow_initial_output_duration_seconds metric to the controller#21533

Merged
teskje merged 5 commits into
MaterializeInc:mainfrom
teskje:controller-dataflow-initial-output-duration
Sep 5, 2023
Merged

compute: move mz_dataflow_initial_output_duration_seconds metric to the controller#21533
teskje merged 5 commits into
MaterializeInc:mainfrom
teskje:controller-dataflow-initial-output-duration

Conversation

@teskje

@teskje teskje commented Sep 1, 2023

Copy link
Copy Markdown
Contributor

This PR moves the mz_dataflow_initial_output_duration_seconds metric from being a replica export to a controller export. The main benefit we get from this is that the worker_id label is removed, which both makes the metric easier to use and greatly reduces its cardinality.

As part of this change, this PR:

Motivation

  • This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/5547.
Design doc: #19717.

Tips for reviewer

Look at the commits separately!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje force-pushed the controller-dataflow-initial-output-duration branch from 0d940d3 to fc6a422 Compare September 4, 2023 09:35
@teskje teskje marked this pull request as ready for review September 4, 2023 10:32
@teskje teskje requested a review from a team September 4, 2023 10:32
@shepherdlybot

shepherdlybot Bot commented Sep 4, 2023

Copy link
Copy Markdown

This PR has higher risk. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag. What's This?

Risk Score Probability Buggy File Hotspots
🔴 81 / 100 62% 0

@teskje teskje requested review from antiguru and vmarcos September 4, 2023 10:33

@vmarcos vmarcos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, thanks!

@def- def- force-pushed the controller-dataflow-initial-output-duration branch from fc6a422 to 2c9beb6 Compare September 5, 2023 09:37
def- and others added 5 commits September 5, 2023 12:11
The # character is considered a comment in verbose mode
This commit removes the `CollectionMetric` infrastructure from the
compute replica code. There was only a single per-collection metric,
`mz_dataflow_initial_output_duration`, which we want to have as a
controller export instead. No other per-collection metrics are currently
planned on the replica-side, so there is no need to keep
`CollectionMetric` around.
This commit performs some cleanup work on the `ReplicaTask` logic. This
mostly consists of factoring out logic into methods and accessing needed
state via `self` rather than passing it around in parameters.

This commit also introduces `observe_command` and `observe_response`
methods, which will be the place where the
`initial_output_duration_seconds` metric is updated.
This commit adds the `mz_dataflow_initial_output_duration_seconds`
metric back in, this time on the controller side. The metric is
per-replica so maintaining it within the replica tasks is easiest. For
this the replica task needs to start observing compute commands and
responses more closely, to get to know about new collections being
created and their frontiers advancing beyond their `as_of`s.
@teskje teskje force-pushed the controller-dataflow-initial-output-duration branch from 2c9beb6 to 3261d30 Compare September 5, 2023 10:13

@def- def- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on https://buildkite.com/materialize/coverage/builds/214 seems to be covered in tests. Edit: Actually let me rerun, I think there was a problem.

@teskje

teskje commented Sep 5, 2023

Copy link
Copy Markdown
Contributor Author

Actually let me rerun, I think there was a problem.

Looks like it finished, but I'm not sure how to interpret the results.

@def-

def- commented Sep 5, 2023

Copy link
Copy Markdown
Contributor

@teskje

teskje commented Sep 5, 2023

Copy link
Copy Markdown
Contributor Author

TFTRs!

@teskje teskje merged commit e14d030 into MaterializeInc:main Sep 5, 2023
@teskje teskje deleted the controller-dataflow-initial-output-duration branch September 5, 2023 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants