Skip to content

compute-client: peek metrics#20063

Merged
teskje merged 3 commits into
MaterializeInc:mainfrom
teskje:peek-metrics
Jul 18, 2023
Merged

compute-client: peek metrics#20063
teskje merged 3 commits into
MaterializeInc:mainfrom
teskje:peek-metrics

Conversation

@teskje

@teskje teskje commented Jun 21, 2023

Copy link
Copy Markdown
Contributor

This PR adds metrics for finished peeks as described in the Compute metrics design (#19717):

  • the total counts of peeks performed since the last restart
  • a histogram of peek durations

Both metrics are labeled by peek result type (rows, error, or canceled), since the difference will likely be interesting to us. For example, if we want to calculate the average peek response time, we'll probably want to exclude canceled peeks as they might artificially lower the average.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.

Tips for reviewer

I've added some clarifying annotations to the relevant places in the code.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje marked this pull request as ready for review June 21, 2023 16:27
@teskje teskje requested review from a team, umanwizard and vmarcos June 21, 2023 16:27
Comment thread doc/developer/design/20230531_compute_metrics.md Outdated
let duration = peek.requested_at.elapsed();
self.compute
.metrics
.observe_peek_response(&response, duration);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried implementing this method on self.compute (i.e. the Instance type) instead, but that lead to borrow-checker issues because we are mutably borrowing from self.compute.peeks here.

name: "mz_compute_peek_duration_seconds",
help: "A histogram of peek durations since restart.",
var_labels: ["instance_id", "result"],
buckets: histogram_seconds_buckets(0.000_500, 32.),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the lower bound I looked at what existing code is already using. It seems unrealistic that peeks that involve network communication would be complete in less than one ms, so this should be fine.

@vmarcos vmarcos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes LGTM!

I was not 100% sure that the reduction in cardinality is absolutely necessary? It's too bad that we'd not be able to observe the distribution of successful peeks for a given workload.

One alternative would be to coarsen the histograms for cancellations and errors, for example, since they are likely to reveal only bigger effects (e.g., users wait that much before cancelling, errors come back typically rather quickly except if everything is slowed down anyway).

@teskje

teskje commented Jun 22, 2023

Copy link
Copy Markdown
Contributor Author

I don't think reducing the cardinality is absolutely necessary, this was just me thinking that we don't have a use for differentiating between Rows and Error responses. If you think having the distinction would be useful, we can add it in!

@vmarcos

vmarcos commented Jun 23, 2023

Copy link
Copy Markdown
Contributor

I don't think reducing the cardinality is absolutely necessary, this was just me thinking that we don't have a use for differentiating between Rows and Error responses. If you think having the distinction would be useful, we can add it in!

I think that it would be useful to have the distinction, so that we can analyze the distribution of peek durations focusing only on "goodput". It should be that well-tuned production clusters should have very few Error responses (making the distinction less relevant for analysis when true, but good as an indicator to be checked), while in development / staging clusters we should see more differences between the two. If cardinality is not a big concern, then we could collect with the same number of buckets also the Errors and cancellations.

teskje added 2 commits July 17, 2023 13:37
This commit adds metrics for finished peeks:
 * the total counts of peeks performed since the last restart
 * a histogram of peek durations

Both metrics are labeled by peek result type (completed or canceled),
since the difference will likely be interesting to us. For example, if
we want to calculate the average peek response time, we'll probably want
to exclude canceled peeks as they might artificially lower the average.
@teskje

teskje commented Jul 17, 2023

Copy link
Copy Markdown
Contributor Author

This makes sense to me. I've added a second commit that splits rows and error responses. Mind taking another look?

@teskje teskje requested a review from vmarcos July 17, 2023 12:05
Comment thread src/compute-client/src/controller/instance.rs Outdated

@vmarcos vmarcos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@teskje teskje merged commit 214ffaa into MaterializeInc:main Jul 18, 2023
@teskje

teskje commented Jul 18, 2023

Copy link
Copy Markdown
Contributor Author

TFTRs!

@teskje teskje deleted the peek-metrics branch July 18, 2023 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants