compute: refactor trace metrics#20942
Merged
Merged
Conversation
This commit refactors the way trace metrics are handled in compute. It includes one functional change: `mz_arrangement_maintenance_seconds_total` loses its `arrangement_id` label. We determined that having this label blows up the cardinality of this metric too much (currently it produces ~15k timeseries in production) to be defensible. The larger refactor moves the definition of trace metrics into the `ComputeMetrics` type. This makes all replica metrics defined at the same place and simplifies the metrics plumbing done during initialization.
umanwizard
approved these changes
Aug 3, 2023
vmarcos
approved these changes
Aug 3, 2023
vmarcos
left a comment
Contributor
There was a problem hiding this comment.
Thanks! This is much more readable, and I also agree with the reasoning to reduce the cardinality of the arrangement maintenance metric. One minor nit below for your consideration.
This commit extends the existing replica metrics test to include more metrics exported by replicas.
Contributor
Author
|
TFTRs! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit refactors the way trace metrics are handled in compute.
It includes one functional change:
mz_arrangement_maintenance_seconds_totalloses itsarrangement_idlabel. We determined that having this label blows up the cardinality of this metric too much (currently it produces ~15k timeseries in production) to be defensible.The larger refactor moves the definition of trace metrics into the
ComputeMetricstype. This makes all replica metrics defined at the same place and simplifies the metrics plumbing done during initialization.Motivation
Part of MaterializeInc/database-issues#5547.
Design doc: #19717.
Tips for reviewer
I thought about adding a
collection_idlabel tomz_arrangement_maintenance_seconds_total. However, implementing this seems difficult, as theTraceManagerwould have to learn about the arrangement -> collection mapping. Given that we will have additional metrics showing us row and batch counts per collection, I think we should be able to derive the likely sources of changed maintenance times without acollection_idlabel on this specific metric. LMK if anyone feels strongly the other way!Checklist
$T ⇔ Proto$Tmapping (possibly in a backwards-incompatible way), then it is tagged with aT-protolabel.