Skip to content

compute: mz_dataflow_initial_output_duration_seconds#20126

Merged
teskje merged 4 commits into
MaterializeInc:mainfrom
teskje:metric-time-to-snapshot
Jul 25, 2023
Merged

compute: mz_dataflow_initial_output_duration_seconds#20126
teskje merged 4 commits into
MaterializeInc:mainfrom
teskje:metric-time-to-snapshot

Conversation

@teskje

@teskje teskje commented Jun 23, 2023

Copy link
Copy Markdown
Contributor

This PR introduces a new replica metric, mz_dataflow_initial_output_duration_seconds, that tracks the time from the installation of a compute collection until it first produced any outputs. This is not necessarily the time until a dataflow has been fully hydrated (depending on your definition of 'hydration'), but might be a good stand-in.

From the design doc (#19717):

  • mz_dataflow_initial_output_duration_seconds
    • Type: gauge
    • Labels: worker_id, collection_id
    • Description: The time from dataflow installation up to when the first output was produced.
    • Export Type: direct
    • Notes: To reduce the cardinality of this metric, we limit it to non-transient dataflows.

Demo

You can try the metric out by running these commands in psql:

materialize=> create table t (a int);
CREATE TABLE
materialize=> insert into t select generate_series(1, 2000);
INSERT 0 2000
materialize=> create view v as select t1.a as x, t2.a as y from t t1, t t2;
CREATE VIEW
materialize=> \timing
Timing is on.
materialize=> create default index on v; select * from v limit 1;
CREATE INDEX
Time: 173.885 ms
  x  |  y
-----+-----
 256 | 256
(1 row)

Time: 15370.698 ms (00:15.371)

Then check the metrics endpoint and you should see an entry with a time close to the duration of that final SELECT:

mz_dataflow_initial_output_duration_seconds{collection_id="u2",worker_id="0"} 15.362563792

When you drop the index, the metrics entry will go away too.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje force-pushed the metric-time-to-snapshot branch 4 times, most recently from 5fddd8b to b66a0a7 Compare July 24, 2023 09:08
This commit introduces a (still empty) `CollectionMetrics` type that
will contain per-collection metrics, and adds it to the
`CollectionState` of non-transient collections. No metrics are collected
for transient collections, to reduce metric cardinality.
@teskje teskje force-pushed the metric-time-to-snapshot branch from b66a0a7 to 963721c Compare July 24, 2023 09:47
///
// TODO(teskje): Now that we explicitly track the collection's `as_of`, we might be able to
// simplify this to an `Antichain<Timestamp>` again.
reported_frontier: ReportedFrontier,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made private to make it more likely that the different paths that update the reported frontier also call observe_snapshot_produced. Unfortunately, most of these paths are in the same module and there is nothing but discipline keeping us from modifying reported_frontier directly.

@teskje teskje marked this pull request as ready for review July 24, 2023 09:51
@teskje teskje requested review from a team, antiguru and vmarcos July 24, 2023 09:51

@antiguru antiguru left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think to looks fine. I wonder if we can improve the testing of this PR. If I recall correctly, we don't have a way to test the metrics exported by a replica, so it's not easy to do this at the moment. Maybe we should invest a bit more in this? (Or, file an issue so we don't forget!)

Comment thread src/compute/src/compute_state.rs Outdated
Comment thread test/cluster/mzcompose.py Outdated
@teskje teskje force-pushed the metric-time-to-snapshot branch from 8f5ca68 to abc06cc Compare July 25, 2023 08:50
@teskje

teskje commented Jul 25, 2023

Copy link
Copy Markdown
Contributor Author

Added an mzcompose-based cluster test. And it found a bug, so huge thanks for insisting on a test @antiguru!

teskje added 3 commits July 25, 2023 11:03
This commit introduces a new replica metric,
`mz_dataflow_initial_output_duration_seconds`, that tracks the time from
the installation of a compute collection until it first produced any
outputs. This is not necessarily the time until a dataflow has been
fully hydrated (depending on your definition of 'hydration'), but might
be a good stand-in.
This commit adds a new mzcompose test to verify that the
`mz_dataflow_initial_output_duration_seconds` metric works as expected.
Future dataflow metrics can be covered in the same test.
@teskje teskje force-pushed the metric-time-to-snapshot branch from abc06cc to f220133 Compare July 25, 2023 09:03

@vmarcos vmarcos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be extremely helpful, thanks!

@teskje

teskje commented Jul 25, 2023

Copy link
Copy Markdown
Contributor Author

TFTRs!

@teskje teskje merged commit 2ba7e36 into MaterializeInc:main Jul 25, 2023
@teskje teskje deleted the metric-time-to-snapshot branch July 25, 2023 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants