Skip to content

design doc: Operational metrics for Compute#19717

Merged
teskje merged 3 commits into
MaterializeInc:mainfrom
teskje:design-compute-metrics
Jun 15, 2023
Merged

design doc: Operational metrics for Compute#19717
teskje merged 3 commits into
MaterializeInc:mainfrom
teskje:design-compute-metrics

Conversation

@teskje

@teskje teskje commented Jun 5, 2023

Copy link
Copy Markdown
Contributor

This PR adds a design doc for operational metrics for Compute.

View the rendered version here.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#5547.

Tips for Reviewers

We can do bike-shedding about metric names on the future PRs that implement them.
Here I am mostly interested in your thoughts on:

  • missing metrics we should also have
  • proposed metrics that seem not valuable
  • proposed metrics that should have different labels
  • proposed metrics that should be implemented differently

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje force-pushed the design-compute-metrics branch 11 times, most recently from 0bc3afe to e53ce47 Compare June 6, 2023 13:50
@teskje teskje changed the title [WIP] design doc: Operational metrics for Compute design doc: Operational metrics for Compute Jun 6, 2023
@teskje teskje marked this pull request as ready for review June 6, 2023 13:52
@teskje teskje requested review from a team, danhhz and jpepin June 6, 2023 13:52
@teskje

teskje commented Jun 7, 2023

Copy link
Copy Markdown
Contributor Author

I added @jpepin and @danhhz as non-Compute reviewers because of your Prometheus expertise.

@teskje teskje force-pushed the design-compute-metrics branch from e53ce47 to fb0d940 Compare June 7, 2023 12:16

@danhhz danhhz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! the biggest thing i've found is being careful about cardinality, but it's clear that you've put a lot of thought into it here

* **Labels**: `instance_id`, `replica_id`, `type`
* **Description**: The total number of compute responses sent, by replica and response type.
* `mz_compute_command_message_size_bytes`
* **Type**: histogram

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, the approach persist has taken is to never use histograms at first, and then introduce them when we found we need the additional fidelity in practice. (that's a pretty extreme take, so not suggesting you have to do the same thing)

my wild guess would be that the command and response message_size metrics might be fine as counters of total bytes. otoh mz_peek_duration_seconds definitely feels worth a histogram (though maybe there's a way to consolidate the result label?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue with histogram is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality. We definitely want to avoid histograms with a collection_id label.

But I agree we shouldn't add unnecessary histograms. Having a counter for the message sizes sounds reasonable to me, I'll change that.

What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue with histogram is that they add a bunch of per-bucket time series behind the scenes, right? My stance in this design was that it's probably fine to add these as long as the metric is otherwise low-cardinality.

yup. and that's reasonable. was mostly an fyi that persist started from an even more extreme place of not having any until they proved necessary

What do you mean by consolidating the result label? Omitting it? My thinking was that the result would allow us to see how long it takes for us to respond to a peek, but also how long people are willing to wait before they cancel a peek. So what I really care about is whether or not the peek was cancelled, not so much whether it returned successfully or with an error.

yeah, sorry, I didn't have enough domain knowledge to have a specific counter-proposal. given this context, I'd say the strawman would be to either discard "error" timings or fold "error" into "success". if you don't care about per-instance_id breakdowns of how long people are willing to wait before cancelling, then you could split cancels into a second histogram which doesn't even have the instance-id label

(to be clear, all the comments I've made on this doc are just suggestions, feel free to ignore any/all of them)


The database has two ways of exposing metrics: directly and through the prometheus-exporter.

### Direct Export

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi that this adds things like env+pod labels to everything when it scrapes them, so all cardinalities end up getting multiplied by the number of processes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is true for prometheus-exporter metrics as well! I think that's the behavior we need to be able to separate metrics by environment (for controller metrics) or compute worker (for replica metrics).

I haven't thought too much about multi-process replicas. When you have, say, a 2-process replica with 4 workers, you'd get the labels:

  • (process-0, worker 0)
  • (process-0, worker 1)
  • (process-1, worker 2)
  • (process-1, worker 3)

So in this case the process tag doesn't contribute any additional time series.

@danhhz danhhz Jun 12, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I think maybe I wasn't clear. perhaps an example would help? if you run the following in staging us-east-1:

  • mz_envd_up{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}
    • this is a promethus-exporter metric
    • this query returns 1 result with a pod label of promsql-exporter-9d5f88d89-fjsz8
  • mz_persist_metadata_seconds{namespace="environment-82803c07-9246-4d3d-8516-e080dd332293-0"}
    • this is a direct export metric
    • this query returns 5 results, one for each process/pod. each of them get a different value for the pod label

which means any direct export metrics that are scraped from computed (as opposed to ones from the controller in envd) might end up with more copies than one would naturally expect. this effect is more relevant for persist (which runs in every process) than for anything you'd be direct exporting from the controller (which is only on envd). not necessarily anything to do, just a heads up in case it wasn't already known

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, when I said "this is true for prometheus-exporter metrics as well" I was thinking of the per-replica metrics supported by the new prometheus-exporter. If you specify that you want to collect a metric per replica, then the exporter will run the metric query on each replica and expose multiple time series with a replica_full_name label. It's true that this is not per-pod but per-replica though.

Thanks for the elaboration and sorry for the confusion. I think we are on the same page here :)

@jpepin jpepin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent overview of the metrics tooling available and a helpful proposal for evaluating the health of compute resources.
The work in https://github.com/MaterializeInc/cloud/issues/5210 would be useful for monitoring the load of the new compute metrics on an ongoing basis. Ad hoc analysis of our Prometheus quotas and metrics is probably good enough to get started, though.

It's out of scope for this doc, but it would be helpful if during the creation of these metrics, compute engineers could assess whether any of them would be suitable for our proposed automated release checks or any other compute health alerts.

* [ ] `mz_peek_duration_seconds`
* **Type**: histogram
* **Labels**: `instance_id`, `result`
* **Description**: A histogram of peek durations since restart, by result type (success, error, canceled).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For purposes of standardization and discoverability, would it be possible to prefix all of these metrics with mz_compute_, or would this be misleading in some cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mz_peeks_* I don't have strong feelings. Peeks are a compute concept, so there shouldn't be any ambiguity without "compute", but discoverability is probably a good reason to still have a prefix.

The other metrics that lack a "compute" prefix are the mz_dataflow and mz_arrangement metrics. For future-proofing reasons I think we shouldn't mention compute here. Once we have cluster unification we will hopefully be able to include storage dataflow in these metrics.

@teskje

teskje commented Jun 14, 2023

Copy link
Copy Markdown
Contributor Author

Thanks for the reviews!

There doesn't seem to be much need for discussion on this doc. I'm planning to merge it tomorrow, so if you still mean to comment, please do so until then.

@mgree

mgree commented Jun 14, 2023

Copy link
Copy Markdown
Contributor

I mentioned in the sync today that we may want a histogram on optimizer runtime: getting statistics might be slow sometimes---I have a 250ms overall timeout on stats---but tracking optimizer runtime could be generally informative for us.

@teskje

teskje commented Jun 15, 2023

Copy link
Copy Markdown
Contributor Author

I mentioned in the sync today that we may want a histogram on optimizer runtime

This would be an optimizer metric, which I excluded from the scope of this design doc. I think we should either handle this as part of https://github.com/MaterializeInc/database-issues/issues/5111 or talk to the adapter folk if we think such a metric would better be implemented outside of the optimizer.

@teskje teskje merged commit 6880434 into MaterializeInc:main Jun 15, 2023
@teskje teskje deleted the design-compute-metrics branch June 15, 2023 08:56
@teskje teskje mentioned this pull request Jun 15, 2023
5 tasks
@mgree

mgree commented Jun 15, 2023

Copy link
Copy Markdown
Contributor

I mentioned in the sync today that we may want a histogram on optimizer runtime

This would be an optimizer metric, which I excluded from the scope of this design doc. I think we should either handle this as part of MaterializeInc/database-issues#5111 or talk to the adapter folks if we think such a metric would better be implemented outside of the optimizer.

Oops, did not notice that in the "non-goals". Sorry! I'll leave a note there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants