compute-client: peek metrics #20063
    let duration = peek.requested_at.elapsed();
    self.compute
        .metrics
        .observe_peek_response(&response, duration);
I tried implementing this method on self.compute (i.e. the Instance type) instead, but that led to borrow-checker issues because we are mutably borrowing from self.compute.peeks here.
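The borrow-checker issue described above can be reproduced in a minimal sketch. All types here (`Instance`, `Metrics`, `PendingPeek`, `finish_peek`) are simplified, hypothetical stand-ins rather than the actual Materialize definitions; only the borrow structure is meant to match. A method on the whole instance needs all of `self`, while a call on the `metrics` field only borrows a field that is disjoint from `peeks`.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the real types; only the borrow structure matters.
struct PendingPeek {
    requested_at: Instant,
}

#[derive(Default)]
struct Metrics {
    observed: Vec<Duration>,
}

impl Metrics {
    fn observe_peek_response(&mut self, duration: Duration) {
        self.observed.push(duration);
    }
}

#[derive(Default)]
struct Instance {
    peeks: BTreeMap<u64, PendingPeek>,
    metrics: Metrics,
}

impl Instance {
    // A method on the whole instance borrows all of `self`, so it cannot be
    // called while an entry of `self.peeks` is still mutably borrowed.
    #[allow(dead_code)]
    fn observe_peek_response(&mut self, duration: Duration) {
        self.metrics.observe_peek_response(duration);
    }
}

fn finish_peek(compute: &mut Instance, id: u64) -> Option<Duration> {
    // Mutable borrow of the `peeks` field only.
    let peek = compute.peeks.get_mut(&id)?;
    let duration = peek.requested_at.elapsed();

    // Rejected by the borrow checker, because `peek` is used again below:
    // compute.observe_peek_response(duration);

    // Accepted: `metrics` and `peeks` are disjoint fields.
    compute.metrics.observe_peek_response(duration);

    // `peek` is still live here, so the borrow of `compute.peeks` spans the call above.
    peek.requested_at = Instant::now();
    Some(duration)
}
```

Calling the method through the field keeps the two borrows disjoint, which is why the call in the diff goes through `self.compute.metrics`.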
    name: "mz_compute_peek_duration_seconds",
    help: "A histogram of peek durations since restart.",
    var_labels: ["instance_id", "result"],
    buckets: histogram_seconds_buckets(0.000_500, 32.),
For the lower bound I looked at what existing code is already using. It seems unrealistic that peeks that involve network communication would complete in less than one ms, so this should be fine.
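To illustrate what the lower bound implies, here is a rough sketch with a hypothetical `doubling_buckets` helper. It is not the real `histogram_seconds_buckets` and only assumes, for illustration, that bucket bounds double from the lower bound up to the upper bound. Whatever the real spacing is, any peek faster than the smallest bound lands in the first bucket, so resolution below 0.5 ms is lost.

```rust
// Hypothetical helper, NOT the real `histogram_seconds_buckets`: it assumes
// bucket upper bounds that double from `min` for as long as they stay <= `max`.
fn doubling_buckets(min: f64, max: f64) -> Vec<f64> {
    let mut buckets = Vec::new();
    let mut bound = min;
    while bound <= max {
        buckets.push(bound);
        bound *= 2.0;
    }
    buckets
}

// Index of the first bucket whose upper bound is >= `value`, if there is one.
// Values above the largest bound would fall into the implicit +Inf bucket.
fn bucket_index(buckets: &[f64], value: f64) -> Option<usize> {
    buckets.iter().position(|bound| value <= *bound)
}
```

With these assumptions, a 0.2 ms peek and a 0.5 ms peek are indistinguishable in the histogram, which is acceptable if sub-millisecond peeks are not expected.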
vmarcos left a comment
The changes LGTM!
I'm not 100% sure that the reduction in cardinality is absolutely necessary. It's too bad that we would not be able to observe the distribution of successful peeks for a given workload.
One alternative would be to coarsen the histograms for cancellations and errors, for example, since they are likely to reveal only bigger effects (e.g., users wait that much before cancelling, errors come back typically rather quickly except if everything is slowed down anyway).
I don't think reducing the cardinality is absolutely necessary; this was just me thinking that we don't have a use for differentiating between Rows and Error responses. If you think having the distinction would be useful, we can add it in!
I think it would be useful to have the distinction, so that we can analyze the distribution of peek durations focusing only on "goodput". Well-tuned production clusters should have very few Error responses (which makes the distinction less relevant for analysis there, but still useful as an indicator to check), while in development / staging clusters we should see more differences between the two. If cardinality is not a big concern, then we could also collect Errors and cancellations with the same number of buckets.
This commit adds metrics for finished peeks:
* the total counts of peeks performed since the last restart
* a histogram of peek durations
Both metrics are labeled by peek result type (completed or canceled), since the difference will likely be interesting to us. For example, if we want to calculate the average peek response time, we'll probably want to exclude canceled peeks as they might artificially lower the average.
This makes sense to me. I've added a second commit that splits
TFTRs!
This PR adds metrics for finished peeks as described in the Compute metrics design (#19717):
* the total counts of peeks performed since the last restart
* a histogram of peek durations
Both metrics are labeled by peek result type (rows, error, or canceled), since the difference will likely be interesting to us. For example, if we want to calculate the average peek response time, we'll probably want to exclude canceled peeks as they might artificially lower the average.
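The point about averages can be made concrete with a small sketch. The `PeekResult` and `PeekMetrics` types below are hypothetical and only mirror the three result labels named above; the real metrics are Prometheus counters and histograms, not an in-memory map.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Hypothetical result type mirroring the labels in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PeekResult {
    Rows,
    Error,
    Canceled,
}

impl PeekResult {
    // Value that would be used for the `result` label.
    fn label(self) -> &'static str {
        match self {
            PeekResult::Rows => "rows",
            PeekResult::Error => "error",
            PeekResult::Canceled => "canceled",
        }
    }
}

#[derive(Default)]
struct PeekMetrics {
    // (peek count, total duration in seconds) per result type.
    by_result: BTreeMap<PeekResult, (u64, f64)>,
}

impl PeekMetrics {
    fn observe(&mut self, result: PeekResult, duration: Duration) {
        let entry = self.by_result.entry(result).or_insert((0, 0.0));
        entry.0 += 1;
        entry.1 += duration.as_secs_f64();
    }

    // Average duration over the selected result types, or `None` if no
    // peek with one of those results has been observed.
    fn average_secs(&self, results: &[PeekResult]) -> Option<f64> {
        let (count, sum) = results
            .iter()
            .filter_map(|r| self.by_result.get(r))
            .fold((0u64, 0.0f64), |(c, s), &(n, t)| (c + n, s + t));
        if count == 0 {
            None
        } else {
            Some(sum / count as f64)
        }
    }
}
```

Because the counts and durations are kept per result type, a quickly canceled peek can be left out of the average simply by not selecting `PeekResult::Canceled`.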
Motivation
Part of MaterializeInc/database-issues#5547.
Tips for reviewer
I've added some clarifying annotations to the relevant places in the code.
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.