compute: refactor trace metrics by teskje · Pull Request #20942 · MaterializeInc/materialize

teskje · 2023-08-02T13:14:34Z

This commit refactors the way trace metrics are handled in compute.

It includes one functional change: `mz_arrangement_maintenance_seconds_total` loses its `arrangement_id` label. We determined that this label inflates the metric's cardinality beyond what is defensible (it currently produces ~15k time series in production).
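To make the cardinality argument concrete, here is a minimal sketch of how dropping a label collapses many time series into a few. It is an illustration only, not code from this PR: the `worker_id` label, the worker and arrangement counts, and the `aggregate` helper are all assumptions.

```python
from collections import defaultdict


def aggregate(samples, keep_labels):
    """Sum sample values after dropping every label not in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels identify the time series the sample lands in.
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[key] += value
    return dict(totals)


# Hypothetical workload: 4 workers, each maintaining 100 arrangements.
samples = [
    ({"worker_id": str(worker), "arrangement_id": str(arrangement)}, 0.5)
    for worker in range(4)
    for arrangement in range(100)
]

# One series per (worker, arrangement) pair versus one series per worker.
with_label = aggregate(samples, {"worker_id", "arrangement_id"})
without_label = aggregate(samples, {"worker_id"})
```

The total maintenance time is preserved by the aggregation; only the per-arrangement breakdown is lost.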

The larger refactor moves the definition of trace metrics into the `ComputeMetrics` type. This puts the definitions of all replica metrics in one place and simplifies the metrics plumbing done during initialization.
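The shape of that refactor can be sketched roughly as follows. This is a Python illustration of the pattern only; the real code is Rust, and `Registry`, `Counter`, `TraceMetrics`, `for_traces`, `record_maintenance`, and the `worker_id` label are hypothetical names chosen for this example, not the actual API.

```python
class Counter:
    """A toy labeled counter, standing in for a real metrics library type."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = {}

    def inc_by(self, amount, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[name] for name in self.label_names)
        self.values[key] = self.values.get(key, 0.0) + amount


class Registry:
    """A toy registry that rejects duplicate metric names."""

    def __init__(self):
        self.counters = {}

    def register_counter(self, name, label_names):
        if name in self.counters:
            raise ValueError(f"metric {name} already registered")
        counter = Counter(name, label_names)
        self.counters[name] = counter
        return counter


class TraceMetrics:
    """A per-worker view handed to the component that maintains traces."""

    def __init__(self, counter, worker_id):
        self._counter = counter
        self._worker_id = worker_id

    def record_maintenance(self, seconds):
        self._counter.inc_by(seconds, worker_id=self._worker_id)


class ComputeMetrics:
    """Single place where replica metrics are defined and registered."""

    def __init__(self, registry):
        # Assumed label set: only a worker label, no arrangement_id.
        self.arrangement_maintenance_seconds_total = registry.register_counter(
            "mz_arrangement_maintenance_seconds_total", ["worker_id"]
        )

    def for_traces(self, worker_id):
        return TraceMetrics(self.arrangement_maintenance_seconds_total, str(worker_id))
```

With this layout, initialization only needs to construct one `ComputeMetrics` value and hand out narrower views from it, instead of threading individual metric handles through several constructors.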

Motivation

This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.
Design doc: #19717.

Tips for reviewer

I thought about adding a collection_id label to mz_arrangement_maintenance_seconds_total. However, implementing this seems difficult, as the TraceManager would have to learn about the arrangement -> collection mapping. Given that we will have additional metrics showing us row and batch counts per collection, I think we should be able to derive the likely sources of changed maintenance times without a collection_id label on this specific metric. LMK if anyone feels strongly the other way!

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/A

vmarcos

Thanks! This is much more readable, and I also agree with the reasoning to reduce the cardinality of the arrangement maintenance metric. One minor nit below for your consideration.

teskje · 2023-08-03T16:47:54Z

TFTRs!

teskje force-pushed the trace-metrics branch from b6bec9a to 7c6070e · August 3, 2023 11:38

teskje marked this pull request as ready for review August 3, 2023 13:12

teskje requested review from a team and vmarcos August 3, 2023 13:12

umanwizard approved these changes Aug 3, 2023

vmarcos approved these changes Aug 3, 2023

Comment thread test/cluster/mzcompose.py Outdated

test/cluster: extend replica metrics test

3d59c77

This commit extends the existing replica metrics test to include more metrics exported by replicas.
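As a rough idea of what such a check can look like, the sketch below parses Prometheus text exposition output and asserts on the metrics it finds. It is not the test from this PR: `parse_metrics`, `SAMPLE`, and the `worker_id` label are assumptions, and the parser is deliberately simplistic (it does not handle label values containing commas, spaces, or escaped quotes).

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {name: [(labels, value), ...]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics


# Hypothetical scrape output from a replica's metrics endpoint.
SAMPLE = """\
# TYPE mz_arrangement_maintenance_seconds_total counter
mz_arrangement_maintenance_seconds_total{worker_id="0"} 1.25
mz_arrangement_maintenance_seconds_total{worker_id="1"} 0.75
"""

parsed = parse_metrics(SAMPLE)
series = parsed["mz_arrangement_maintenance_seconds_total"]

# The metric must be exported, and no series may carry the removed label.
assert series, "metric missing from scrape output"
assert all("arrangement_id" not in labels for labels, _ in series)
```

A real test would fetch the text from a running replica rather than from a string literal.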

teskje force-pushed the trace-metrics branch from 7c6070e to 3d59c77 · August 3, 2023 15:18

teskje merged commit d8fb2b6 into MaterializeInc:main Aug 3, 2023

teskje deleted the trace-metrics branch July 23, 2025 10:52
