design doc: Operational metrics for Compute #19717
0bc3afe to e53ce47
e53ce47 to fb0d940
danhhz
left a comment
LGTM! the biggest thing i've found is being careful about cardinality, but it's clear that you've put a lot of thought into it here
  * **Labels**: `instance_id`, `replica_id`, `type`
  * **Description**: The total number of compute responses sent, by replica and response type.
* `mz_compute_command_message_size_bytes`
  * **Type**: histogram
fwiw, the approach persist has taken is to never use histograms at first, and then introduce them when we found we need the additional fidelity in practice. (that's a pretty extreme take, so not suggesting you have to do the same thing)
my wild guess would be that the command and response message_size metrics might be fine as counters of total bytes. otoh mz_peek_duration_seconds definitely feels worth a histogram (though maybe there's a way to consolidate the result label?)
The main issue with histograms is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality. We definitely want to avoid histograms with a `collection_id` label.
But I agree we shouldn't add unnecessary histograms. Having a counter for the message sizes sounds reasonable to me, I'll change that.
What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.
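As a rough illustration of the per-bucket overhead discussed here, the sketch below counts the time series a Prometheus histogram exposes compared to a plain counter. The bucket count and the number of label combinations are made-up numbers, not values from the design doc.

```python
def histogram_series(num_buckets: int, label_combinations: int) -> int:
    """Time series exposed by a Prometheus histogram.

    `num_buckets` is the number of explicit (finite) bucket bounds.
    Each label combination gets one `_bucket` series per bound, one
    `+Inf` bucket, plus a `_sum` and a `_count` series.
    """
    per_combination = (num_buckets + 1) + 2
    return per_combination * label_combinations


def counter_series(label_combinations: int) -> int:
    """A counter exposes exactly one series per label combination."""
    return label_combinations


# Hypothetical sizing: 3 instances x 2 replicas x 5 message types.
combinations = 3 * 2 * 5
as_histogram = histogram_series(num_buckets=10, label_combinations=combinations)
as_counter = counter_series(combinations)
print(as_histogram, as_counter)
```

Under these assumed numbers the histogram costs 13 series per label combination, which is why a total-bytes counter is the cheaper choice when the distribution itself is not needed.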
> The main issue with histograms is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality.
yup. and that's reasonable. was mostly an fyi that persist started from an even more extreme place of not having any until they proved necessary
> What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.
yeah, sorry, I didn't have enough domain knowledge to have a specific counter-proposal. given this context, I'd say the strawman would be to either discard "error" timings or fold "error" into "success". if you don't care about per-instance_id breakdowns of how long people are willing to wait before cancelling, then you could split cancels into a second histogram which doesn't even have the instance-id label
(to be clear, all the comments I've made on this doc are just suggestions, feel free to ignore any/all of them)
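The "fold error into success" strawman above could look roughly like the following sketch. The function names and the sample peeks are hypothetical and only illustrate how folding shrinks the set of `result` label values from three to two.

```python
from collections import defaultdict


def fold_result(result: str) -> str:
    """Fold "error" into "success"; only "canceled" stays separate."""
    return "canceled" if result == "canceled" else "success"


def group_durations(peeks):
    """Group (instance_id, result, seconds) samples by the folded labels."""
    grouped = defaultdict(list)
    for instance_id, result, seconds in peeks:
        grouped[(instance_id, fold_result(result))].append(seconds)
    return dict(grouped)


# Hypothetical peek samples: (instance_id, result, duration in seconds).
peeks = [
    ("u1", "success", 0.02),
    ("u1", "error", 0.05),
    ("u1", "canceled", 3.0),
    ("u2", "success", 0.01),
]
grouped = group_durations(peeks)
print(sorted(grouped))
```

The other variant mentioned above, a separate cancellation histogram without the `instance_id` label, would instead key the canceled samples on the result alone.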
The database has two ways of exposing metrics: directly and through the prometheus-exporter.

### Direct Export
fyi that this adds things like env+pod labels to everything when it scrapes them, so all cardinalities end up getting multiplied by the number of processes
Yes, this is true for prometheus-exporter metrics as well! I think that's the behavior we need to be able to separate metrics by environment (for controller metrics) or compute worker (for replica metrics).
I haven't thought too much about multi-process replicas. When you have, say, a 2-process replica with 4 workers, you'd get the labels:
- (process-0, worker 0)
- (process-0, worker 1)
- (process-1, worker 2)
- (process-1, worker 3)
So in this case the process tag doesn't contribute any additional time series.
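A small sketch of why the process label adds no series in this example. It assumes workers are numbered globally and spread evenly across processes, which matches the listing above but is not stated as a general rule here. The function name is hypothetical.

```python
def worker_label_sets(processes: int, workers: int):
    """Enumerate (process, worker) label pairs for one replica.

    Assumes globally numbered workers that are evenly distributed
    across processes, so each worker belongs to exactly one process.
    """
    per_process = workers // processes
    return [
        (f"process-{worker // per_process}", f"worker {worker}")
        for worker in range(workers)
    ]


labels = worker_label_sets(processes=2, workers=4)
for pair in labels:
    print(pair)
```

Because every worker maps to exactly one process, the number of label pairs equals the number of workers, not processes times workers.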
hmmm I think maybe I wasn't clear. perhaps an example would help? if you run the following in staging us-east-1:
- `mz_envd_up{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}`
  - this is a prometheus-exporter metric
  - this query returns 1 result with a `pod` label of `promsql-exporter-9d5f88d89-fjsz8`
- `mz_persist_metadata_seconds{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}`
  - this is a direct export metric
  - this query returns 5 results, one for each process/pod. each of them gets a different value for the `pod` label
which means any direct export metrics that are scraped from computed (as opposed to ones from the controller in envd) might end up with more copies than one would naturally expect. this effect is more relevant for persist (which runs in every process) than for anything you'd be direct exporting from the controller (which is only on envd). not necessarily anything to do, just a heads up in case it wasn't already known
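To make the multiplication concrete, here is a minimal simulation of a scraper attaching a `pod` label to whatever each pod exposes. The pod names and the `scrape` helper are invented for illustration. Only the counts (1 result vs. 5 results) mirror the example above.

```python
def scrape(pods, exposed_series):
    """Return the label sets stored after scraping each pod.

    The scraper attaches a `pod` label to every series it collects, so
    a series exposed by N pods is stored as N distinct time series.
    """
    return [dict(labels, pod=pod) for pod in pods for labels in exposed_series]


# Hypothetical pod names. A direct-export metric is exposed by every process.
direct_pods = [f"pod-{i}" for i in range(5)]
direct = scrape(direct_pods, [{"__name__": "mz_persist_metadata_seconds"}])

# An exporter metric is exposed once, by the single exporter pod.
exported = scrape(["exporter-pod"], [{"__name__": "mz_envd_up"}])

print(len(direct), len(exported))
```

With 5 pods the direct-export metric yields 5 stored series while the exporter metric yields 1, so any per-series cardinality estimate for a direct-export metric needs to be multiplied by the number of processes that expose it.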
Ah, when I said "this is true for prometheus-exporter metrics as well" I was thinking of the per-replica metrics supported by the new prometheus-exporter. If you specify that you want to collect a metric per replica, then the exporter will run the metric query on each replica and expose multiple time series with a replica_full_name label. It's true that this is not per-pod but per-replica though.
Thanks for the elaboration and sorry for the confusion. I think we are on the same page here :)
jpepin
left a comment
This is an excellent overview of the metrics tooling available and a helpful proposal for evaluating the health of compute resources.
The work in https://github.com/MaterializeInc/cloud/issues/5210 would be useful for monitoring the load of the new compute metrics on an ongoing basis. Ad hoc analysis of our Prometheus quotas and metrics is probably good enough to get started, though.
It's out of scope for this doc, but it would be helpful if during the creation of these metrics, compute engineers could assess whether any of them would be suitable for our proposed automated release checks or any other compute health alerts.
* [ ] `mz_peek_duration_seconds`
  * **Type**: histogram
  * **Labels**: `instance_id`, `result`
  * **Description**: A histogram of peek durations since restart, by result type (success, error, canceled).
For purposes of standardization and discoverability, would it be possible to prefix all of these metrics with mz_compute_, or would this be misleading in some cases?
For mz_peeks_* I don't have strong feelings. Peeks are a compute concept, so there shouldn't be any ambiguity without "compute", but discoverability is probably a good reason to still have a prefix.
The other metrics that lack a "compute" prefix are the `mz_dataflow` and `mz_arrangement` metrics. For future-proofing reasons I think we shouldn't mention compute here. Once we have cluster unification we will hopefully be able to include storage dataflows in these metrics.
e145237 to 3ffc73f
Thanks for the reviews! There doesn't seem to be much need for discussion on this doc. I'm planning to merge it tomorrow, so if you still mean to comment, please do so before then.
I mentioned in the sync today that we may want a histogram on optimizer runtime: getting statistics might be slow sometimes (I have a 250ms overall timeout on stats), but tracking optimizer runtime could be generally informative for us.
This would be an optimizer metric, which I excluded from the scope of this design doc. I think we should either handle this as part of https://github.com/MaterializeInc/database-issues/issues/5111 or talk to the adapter folks if we think such a metric would better be implemented outside of the optimizer.
Oops, did not notice that in the "non-goals". Sorry! I'll leave a note there.
This PR adds a design doc for operational metrics for Compute.
View the rendered version here.
Motivation
Part of MaterializeInc/database-issues#5547.
Tips for Reviewers
We can do bike-shedding about metric names on the future PRs that implement them.
Here I am mostly interested in your thoughts on: