design doc: Operational metrics for Compute by teskje · Pull Request #19717 · MaterializeInc/materialize

teskje · 2023-06-05T15:21:49Z

This PR adds a design doc for operational metrics for Compute.

Motivation

This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.

Tips for Reviewers

We can do bike-shedding about metric names on the future PRs that implement them.
Here I am mostly interested in your thoughts on:

missing metrics we should also have
proposed metrics that seem not valuable
proposed metrics that should have different labels
proposed metrics that should be implemented differently

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/A

teskje · 2023-06-07T08:41:44Z

I added @jpepin and @danhhz as non-Compute reviewers because of your Prometheus expertise.

danhhz

LGTM! the biggest thing i've found is being careful about cardinality, but it's clear that you've put a lot of thought into it here

danhhz · 2023-06-06T18:37:02Z

+    * **Labels**: `instance_id`, `replica_id`, `type`
+    * **Description**: The total number of compute responses sent, by replica and response type.
+  * `mz_compute_command_message_size_bytes`
+    * **Type**: histogram


fwiw, the approach persist has taken is to never use histograms at first, and then introduce them when we found we need the additional fidelity in practice. (that's a pretty extreme take, so not suggesting you have to do the same thing)

my wild guess would be that the command and response message_size metrics might be fine as counters of total bytes. otoh mz_peek_duration_seconds definitely feels worth a histogram (though maybe there's a way to consolidate the result label?)

The main issue with histogram is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality. We definitely want to avoid histograms with a collection_id label.

But I agree we shouldn't add unnecessary histograms. Having a counter for the message sizes sounds reasonable to me, I'll change that.

What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.

The main issue with histogram is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality.

yup. and that's reasonable. was mostly an fyi that persist started from an even more extreme place of not having any until they proved necessary

What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.

yeah, sorry, I didn't have enough domain knowledge to have a specific counter-proposal. given this context, I'd say the strawman would be to either discard "error" timings or fold "error" into "success". if you don't care about per-instance_id breakdowns of how long people are willing to wait before cancelling, then you could split cancels into a second histogram which doesn't even have the instance-id label

(to be clear, all the comments I've made on this doc are just suggestions, feel free to ignore any/all of them)

danhhz · 2023-06-06T18:38:56Z

+
+The database has two ways of exposing metrics: directly and through the prometheus-exporter.
+
+### Direct Export


fyi that this adds things like env+pod labels to everything when it scrapes them, so all cardinalities end up getting multiplied by the number of processes

Yes, this is true for prometheus-exporter metrics as well! I think that's the behavior we need to be able to separate metrics by environment (for controller metrics) or compute worker (for replica metrics).

I haven't thought too much about multi-process replicas. When you have, say, a 2-process replica with 4 workers, you'd get the labels:

(process-0, worker 0)

(process-0, worker 1)

(process-1, worker 2)

(process-1, worker 3)

So in this case the process tag doesn't contribute any additional time series.

hmmm I think maybe I wasn't clear. perhaps an example would help? if you run the following in staging us-east-1:

mz_envd_up{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}

this is a promethus-exporter metric

this query returns 1 result with a pod label of promsql-exporter-9d5f88d89-fjsz8

mz_persist_metadata_seconds{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}

this is a direct export metric

this query returns 5 results, one for each process/pod. each of them get a different value for the pod label

which means any direct export metrics that are scraped from computed (as opposed to ones from the controller in envd) might end up with more copies than one would naturally expect. this effect is more relevant for persist (which runs in every process) than for anything you'd be direct exporting from the controller (which is only on envd). not necessarily anything to do, just a heads up in case it wasn't already known

Ah, when I said "this is true for prometheus-exporter metrics as well" I was thinking of the per-replica metrics supported by the new prometheus-exporter. If you specify that you want to collect a metric per replica, then the exporter will run the metric query on each replica and expose multiple time series with a replica_full_name label. It's true that this is not per-pod but per-replica though.

Thanks for the elaboration and sorry for the confusion. I think we are on the same page here :)

jpepin

This is an excellent overview of the metrics tooling available and a helpful proposal for evaluating the health of compute resources.
The work in https://github.com/MaterializeInc/cloud/issues/5210 would be useful for monitoring the load of the new compute metrics on an ongoing basis. Ad hoc analysis of our Prometheus quotas and metrics is probably good enough to get started, though.

It's out of scope for this doc, but it would be helpful if during the creation of these metrics, compute engineers could assess whether any of them would be suitable for our proposed automated release checks or any other compute health alerts.

jpepin · 2023-06-08T03:11:35Z

+  * [ ] `mz_peek_duration_seconds`
+    * **Type**: histogram
+    * **Labels**: `instance_id`, `result`
+    * **Description**: A histogram of peek durations since restart, by result type (success, error, canceled).


For purposes of standardization and discoverability, would it be possible to prefix all of these metrics with mz_compute_, or would this be misleading in some cases?

For mz_peeks_* I don't have strong feelings. Peeks are a compute concept, so there shouldn't be any ambiguity without "compute", but discoverability is probably a good reason to still have a prefix.

The other metrics that lack a "compute" prefix are the mz_dataflow and mz_arrangement metrics. For future-proofing reasons I think we shouldn't mention compute here. Once we have cluster unification we will hopefully be able to include storage dataflow in these metrics.

teskje · 2023-06-14T08:51:56Z

Thanks for the reviews!

There doesn't seem to be much need for discussion on this doc. I'm planning to merge it tomorrow, so if you still mean to comment, please do so until then.

mgree · 2023-06-14T19:01:54Z

I mentioned in the sync today that we may want a histogram on optimizer runtime: getting statistics might be slow sometimes---I have a 250ms overall timeout on stats---but tracking optimizer runtime could be generally informative for us.

teskje · 2023-06-15T08:55:04Z

I mentioned in the sync today that we may want a histogram on optimizer runtime

This would be an optimizer metric, which I excluded from the scope of this design doc. I think we should either handle this as part of https://github.com/MaterializeInc/database-issues/issues/5111 or talk to the adapter folk if we think such a metric would better be implemented outside of the optimizer.

mgree · 2023-06-15T13:21:09Z

I mentioned in the sync today that we may want a histogram on optimizer runtime

This would be an optimizer metric, which I excluded from the scope of this design doc. I think we should either handle this as part of MaterializeInc/database-issues#5111 or talk to the adapter folks if we think such a metric would better be implemented outside of the optimizer.

Oops, did not notice that in the "non-goals". Sorry! I'll leave a note there.

teskje force-pushed the design-compute-metrics branch 11 times, most recently from 0bc3afe to e53ce47 Compare June 6, 2023 13:50

teskje changed the title ~~[WIP] design doc: Operational metrics for Compute~~ design doc: Operational metrics for Compute Jun 6, 2023

teskje marked this pull request as ready for review June 6, 2023 13:52

teskje requested review from a team, danhhz and jpepin June 6, 2023 13:52

design doc: Operational metrics for Compute

fb0d940

teskje force-pushed the design-compute-metrics branch from e53ce47 to fb0d940 Compare June 7, 2023 12:16

danhhz approved these changes Jun 7, 2023

View reviewed changes

jpepin approved these changes Jun 8, 2023

View reviewed changes

teskje added 2 commits June 12, 2023 12:58

'compute' prefix for peek metrics

0500020

Make message size metrics counters

3ffc73f

teskje force-pushed the design-compute-metrics branch from e145237 to 3ffc73f Compare June 12, 2023 12:06

teskje mentioned this pull request Jun 13, 2023

Compute protocol and controller state metrics #19905

Merged

5 tasks

teskje merged commit 6880434 into MaterializeInc:main Jun 15, 2023

teskje deleted the design-compute-metrics branch June 15, 2023 08:56

teskje mentioned this pull request Jun 15, 2023

compute*: command history metrics #19964

Merged

5 tasks

This was referenced Jun 21, 2023

compute-client: peek metrics #20063

Merged

compute: mz_dataflow_initial_output_duration_seconds #20126

Merged

This was referenced Aug 2, 2023

compute: adjust reconciliation metrics #20939

Merged

compute: refactor trace metrics #20942

Merged

teskje mentioned this pull request Sep 1, 2023

compute: move mz_dataflow_initial_output_duration_seconds metric to the controller #21533

Merged

5 tasks


		The database has two ways of exposing metrics: directly and through the prometheus-exporter.

		### Direct Export

Conversation

teskje commented Jun 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Tips for Reviewers

Checklist

Uh oh!

teskje commented Jun 7, 2023

Uh oh!

danhhz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danhhz Jun 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jpepin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

teskje commented Jun 14, 2023

Uh oh!

mgree commented Jun 14, 2023

Uh oh!

teskje commented Jun 15, 2023

Uh oh!

mgree commented Jun 15, 2023 • edited by teskje Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

teskje commented Jun 5, 2023 •

edited

Loading

danhhz Jun 12, 2023 •

edited

Loading

mgree commented Jun 15, 2023 •

edited by teskje

Loading