compute-client: peek metrics by teskje · Pull Request #20063 · MaterializeInc/materialize

teskje · 2023-06-21T13:45:02Z

This PR adds metrics for finished peeks as described in the Compute metrics design (#19717):

the total counts of peeks performed since the last restart
a histogram of peek durations

Both metrics are labeled by peek result type (rows, error, or canceled), since the difference will likely be interesting to us. For example, if we want to calculate the average peek response time, we'll probably want to exclude canceled peeks as they might artificially lower the average.

Motivation

This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.

Tips for reviewer

I've added some clarifying annotations to the relevant places in the code.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/A

teskje · 2023-06-21T16:35:19Z

+            let duration = peek.requested_at.elapsed();
+            self.compute
+                .metrics
+                .observe_peek_response(&response, duration);


I tried implementing this method on self.compute (i.e. the Instance type) instead, but that lead to borrow-checker issues because we are mutably borrowing from self.compute.peeks here.

teskje · 2023-06-21T16:36:58Z

+                name: "mz_compute_peek_duration_seconds",
+                help: "A histogram of peek durations since restart.",
+                var_labels: ["instance_id", "result"],
+                buckets: histogram_seconds_buckets(0.000_500, 32.),


For the lower bound I looked at what existing code is already using. It seems unrealistic that peeks that involve network communication would be complete in less than one ms, so this should be fine.

vmarcos

The changes LGTM!

I was not 100% sure that the reduction in cardinality is absolutely necessary? It's too bad that we'd not be able to observe the distribution of successful peeks for a given workload.

One alternative would be to coarsen the histograms for cancellations and errors, for example, since they are likely to reveal only bigger effects (e.g., users wait that much before cancelling, errors come back typically rather quickly except if everything is slowed down anyway).

teskje · 2023-06-22T18:31:44Z

I don't think reducing the cardinality is absolutely necessary, this was just me thinking that we don't have a use for differentiating between Rows and Error responses. If you think having the distinction would be useful, we can add it in!

vmarcos · 2023-06-23T15:25:07Z

I don't think reducing the cardinality is absolutely necessary, this was just me thinking that we don't have a use for differentiating between Rows and Error responses. If you think having the distinction would be useful, we can add it in!

I think that it would be useful to have the distinction, so that we can analyze the distribution of peek durations focusing only on "goodput". It should be that well-tuned production clusters should have very few Error responses (making the distinction less relevant for analysis when true, but good as an indicator to be checked), while in development / staging clusters we should see more differences between the two. If cardinality is not a big concern, then we could collect with the same number of buckets also the Errors and cancellations.

This commit adds metrics for finished peeks: * the total counts of peeks performed since the last restart * a histogram of peek durations Both metrics are labeled by peek result type (completed or canceled), since the difference will likely be interesting to us. For example, if we want to calculate the average peek response time, we'll probably want to exclude canceled peeks as they might artificially lower the average.

teskje · 2023-07-17T12:05:49Z

This makes sense to me. I've added a second commit that splits rows and error responses. Mind taking another look?

vmarcos

Thanks, LGTM!

teskje · 2023-07-18T10:11:22Z

TFTRs!

teskje marked this pull request as ready for review June 21, 2023 16:27

teskje requested review from a team, umanwizard and vmarcos June 21, 2023 16:27

teskje commented Jun 21, 2023

View reviewed changes

Comment thread doc/developer/design/20230531_compute_metrics.md Outdated

teskje commented Jun 21, 2023

View reviewed changes

vmarcos approved these changes Jun 22, 2023

View reviewed changes

teskje added 2 commits July 17, 2023 13:37

fixup! compute-client: peek metrics

adb4769

teskje force-pushed the peek-metrics branch from aea254b to adb4769 Compare July 17, 2023 12:03

teskje requested a review from vmarcos July 17, 2023 12:05

antiguru approved these changes Jul 17, 2023

View reviewed changes

Comment thread src/compute-client/src/controller/instance.rs Outdated

fixup! compute-client: peek metrics

042955b

vmarcos approved these changes Jul 18, 2023

View reviewed changes

teskje merged commit 214ffaa into MaterializeInc:main Jul 18, 2023

teskje deleted the peek-metrics branch July 18, 2023 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compute-client: peek metrics#20063

compute-client: peek metrics#20063
teskje merged 3 commits into
MaterializeInc:mainfrom
teskje:peek-metrics

teskje commented Jun 21, 2023 •

edited

Loading

Uh oh!

Uh oh!

teskje Jun 21, 2023

Uh oh!

teskje Jun 21, 2023

Uh oh!

vmarcos left a comment

Uh oh!

teskje commented Jun 22, 2023

Uh oh!

vmarcos commented Jun 23, 2023

Uh oh!

teskje commented Jul 17, 2023

Uh oh!

Uh oh!

vmarcos left a comment

Uh oh!

teskje commented Jul 18, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

teskje commented Jun 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Tips for reviewer

Checklist

Uh oh!

Uh oh!

teskje Jun 21, 2023

Choose a reason for hiding this comment

Uh oh!

teskje Jun 21, 2023

Choose a reason for hiding this comment

Uh oh!

vmarcos left a comment

Choose a reason for hiding this comment

Uh oh!

teskje commented Jun 22, 2023

Uh oh!

vmarcos commented Jun 23, 2023

Uh oh!

teskje commented Jul 17, 2023

Uh oh!

Uh oh!

vmarcos left a comment

Choose a reason for hiding this comment

Uh oh!

teskje commented Jul 18, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

teskje commented Jun 21, 2023 •

edited

Loading