compute: refactor trace metrics #20942

Merged: teskje merged 2 commits into MaterializeInc:main from teskje:trace-metrics on Aug 3, 2023

teskje commented Aug 2, 2023 (Contributor)

This commit refactors the way trace metrics are handled in compute.

It includes one functional change: `mz_arrangement_maintenance_seconds_total` loses its `arrangement_id` label. We determined that this label inflates the cardinality of the metric too much to be defensible (it currently produces ~15k time series in production).
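
To illustrate why a per-arrangement label is costly, here is a minimal, std-only sketch of how label combinations turn into time series. It is not the code from this PR: `CounterVec` is a toy stand-in for a real metrics type, and the `worker_id` label is an assumption made for the example.

```rust
use std::collections::HashMap;

/// Toy counter family: one time series per distinct combination of label values.
/// (Hypothetical stand-in for a registry-backed counter vector.)
struct CounterVec {
    label_names: Vec<&'static str>,
    series: HashMap<Vec<String>, f64>,
}

impl CounterVec {
    fn new(label_names: &[&'static str]) -> Self {
        Self {
            label_names: label_names.to_vec(),
            series: HashMap::new(),
        }
    }

    /// Add `secs` to the series identified by `label_values`.
    fn inc_by(&mut self, label_values: &[&str], secs: f64) {
        assert_eq!(label_values.len(), self.label_names.len());
        let key: Vec<String> = label_values.iter().map(|v| v.to_string()).collect();
        *self.series.entry(key).or_insert(0.0) += secs;
    }

    /// Number of distinct time series this metric currently exports.
    fn series_count(&self) -> usize {
        self.series.len()
    }
}

/// Record the same maintenance work with and without an `arrangement_id` label
/// and return the resulting series counts as `(with_label, without_label)`.
fn series_counts(workers: usize, arrangements: usize) -> (usize, usize) {
    // The `worker_id` label is an assumption for this sketch.
    let mut with_id = CounterVec::new(&["worker_id", "arrangement_id"]);
    let mut without_id = CounterVec::new(&["worker_id"]);
    for w in 0..workers {
        for a in 0..arrangements {
            let (w, a) = (w.to_string(), a.to_string());
            with_id.inc_by(&[w.as_str(), a.as_str()], 0.001);
            without_id.inc_by(&[w.as_str()], 0.001);
        }
    }
    (with_id.series_count(), without_id.series_count())
}

fn main() {
    let (with_id, without_id) = series_counts(4, 100);
    println!("with arrangement_id: {with_id} series, without: {without_id} series");
}
```

In this toy setup, 4 workers and 100 arrangements yield 400 series with the label and 4 without it; the series count grows with the product of the label cardinalities.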

The larger refactor moves the definition of trace metrics into the `ComputeMetrics` type. This puts all replica metric definitions in one place and simplifies the metrics plumbing done during initialization.
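
The shape of that consolidation can be sketched roughly as follows. This is an illustrative, std-only sketch and not the actual Materialize code: the field names, the `for_traces` and `record_maintenance` methods, and the `TraceMetrics` handle are all hypothetical, and a real implementation would register its counters with a metrics registry.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Toy counter family keyed by worker id (hypothetical stand-in for a
/// registry-backed metric).
type PerWorkerCounter = Rc<RefCell<HashMap<usize, f64>>>;

/// All replica metrics are defined on one type. Field names are illustrative.
#[derive(Default)]
struct ComputeMetrics {
    arrangement_maintenance_seconds_total: PerWorkerCounter,
    // Further replica metrics would be added here as additional fields.
}

/// Handle passed to one worker's trace management code.
struct TraceMetrics {
    worker_id: usize,
    maintenance_seconds_total: PerWorkerCounter,
}

impl ComputeMetrics {
    /// Hand out a trace-metrics handle for one worker (hypothetical method).
    fn for_traces(&self, worker_id: usize) -> TraceMetrics {
        TraceMetrics {
            worker_id,
            maintenance_seconds_total: Rc::clone(&self.arrangement_maintenance_seconds_total),
        }
    }

    /// Read back the accumulated maintenance time for one worker.
    fn maintenance_seconds(&self, worker_id: usize) -> f64 {
        self.arrangement_maintenance_seconds_total
            .borrow()
            .get(&worker_id)
            .copied()
            .unwrap_or(0.0)
    }
}

impl TraceMetrics {
    /// Add `secs` of maintenance time to this worker's counter.
    fn record_maintenance(&self, secs: f64) {
        *self
            .maintenance_seconds_total
            .borrow_mut()
            .entry(self.worker_id)
            .or_insert(0.0) += secs;
    }
}

fn main() {
    // Initialization only needs to create and pass around one metrics object.
    let metrics = ComputeMetrics::default();
    let worker0 = metrics.for_traces(0);
    worker0.record_maintenance(0.25);
    worker0.record_maintenance(0.5);
    println!("worker 0 maintenance: {}s", metrics.maintenance_seconds(0));
}
```

The point of the sketch is the ownership structure: initialization constructs a single metrics object, and each consumer derives the narrow handle it needs from it instead of receiving separately constructed metrics.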

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.
Design doc: #19717.

Tips for reviewer

I thought about adding a `collection_id` label to `mz_arrangement_maintenance_seconds_total`. However, implementing this seems difficult, as the `TraceManager` would have to learn about the arrangement -> collection mapping. Given that we will have additional metrics showing row and batch counts per collection, I think we should be able to derive the likely sources of changed maintenance times without a `collection_id` label on this specific metric. LMK if anyone feels strongly the other way!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

teskje marked this pull request as ready for review on August 3, 2023 13:12
teskje requested review from a team and vmarcos on August 3, 2023 13:12

vmarcos left a comment (Contributor)

Thanks! This is much more readable, and I also agree with the reasoning to reduce the cardinality of the arrangement maintenance metric. One minor nit below for your consideration.

Comment thread test/cluster/mzcompose.py Outdated
This commit extends the existing replica metrics test to include more
metrics exported by replicas.
teskje merged commit d8fb2b6 into MaterializeInc:main on Aug 3, 2023

teskje commented Aug 3, 2023 (Contributor, Author)

TFTRs!

teskje deleted the trace-metrics branch on July 23, 2025 10:52