Description
Helm Chart
- kube-prometheus-stack: 82.12.0
Description
The kube-prometheus-stack chart comes with some fine dashboards for exploring cluster resource usage.
However, two of the main dashboards use different metrics to display the same information:
- The Kubernetes / Compute Resources / Cluster dashboard relies on the container_memory_rss metric. This dashboard also contains tables with resources specific to individual namespaces, and links to a second dashboard...
- The Kubernetes / Compute Resources / Namespace (Pods) dashboard, which relies heavily on the container_memory_working_set_bytes metric.
Which metric to use has been debated for years, but the issue here is less about the merits of either metric than about the fact that both are used to calculate the same information, with wildly different results.
For example, here is a snapshot I just took of the "Memory Requests By Namespace" table in the Cluster dashboard:
And of the same information, as given by the linked Compute Resources / Namespace (Pods) dashboard:

As you can see, one displays the "Memory Request %" value as 33.3% and the other as 65.5%, a gap of roughly 32 percentage points! Using these numbers to right-size a cluster, for example, is complicated when the two dashboards disagree this much.
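To put a number on the discrepancy, here is a minimal sketch. It assumes both panels divide by the same memory-requests total, so that the whole gap comes from the usage metric in the numerator; that assumption is mine and is not confirmed by the dashboard definitions. The function names are hypothetical, and the two percentages are the ones quoted above.

```python
def request_utilisation(usage_bytes: float, requests_bytes: float) -> float:
    """Return memory usage as a percentage of requested memory."""
    if requests_bytes <= 0:
        raise ValueError("requests_bytes must be positive")
    return 100.0 * usage_bytes / requests_bytes


def implied_usage_ratio(pct_low: float, pct_high: float) -> float:
    """Ratio of the two usage numerators, ASSUMING an identical denominator."""
    return pct_high / pct_low


# The two "Memory Request %" values quoted in this issue.
pct_low = 33.3
pct_high = 65.5

gap_points = pct_high - pct_low
ratio = implied_usage_ratio(pct_low, pct_high)
print(f"gap: {gap_points:.1f} percentage points, usage ratio: {ratio:.2f}x")
```

Under that assumption, the metric behind the higher figure reports nearly twice the usage of the metric behind the lower one for the same namespace.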
A quick check tells me that both metrics are used in many of the dashboards:
angelin01:~/tmp/kube-prometheus-stack/templates/grafana/dashboards-1.14 > grep -rl container_memory_rss
k8s-resources-pod.yaml
k8s-resources-node.yaml
k8s-resources-namespace.yaml
k8s-resources-multicluster.yaml
k8s-resources-cluster.yaml
angelin01:~/tmp/kube-prometheus-stack/templates/grafana/dashboards-1.14 > grep -rl container_memory_working_set_bytes
k8s-resources-workloads-namespace.yaml
k8s-resources-workload.yaml
k8s-resources-pod.yaml
k8s-resources-node.yaml
k8s-resources-namespace.yaml
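The two grep runs can also be combined into a single pass that reports which dashboards reference both metrics. This is a minimal sketch, assuming the dashboard templates are .yaml files under one directory; the function names are hypothetical, and the match is a plain substring search, like grep.

```python
from pathlib import Path

METRICS = ("container_memory_rss", "container_memory_working_set_bytes")


def files_by_metric(directory: str) -> dict[str, set[str]]:
    """Map each metric name to the names of .yaml files that mention it."""
    found: dict[str, set[str]] = {metric: set() for metric in METRICS}
    # rglob mirrors the recursive behaviour of `grep -r`.
    for path in sorted(Path(directory).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for metric in METRICS:
            if metric in text:
                found[metric].add(path.name)
    return found


def summarize(found: dict[str, set[str]]) -> dict[str, list[str]]:
    """Split the files into those using both metrics and those using only one."""
    rss, working_set = (found[metric] for metric in METRICS)
    return {
        "both": sorted(rss & working_set),
        "rss_only": sorted(rss - working_set),
        "working_set_only": sorted(working_set - rss),
    }
```

Calling `summarize(files_by_metric("."))` from the dashboards directory would produce the split. Going by the grep output above, three files appear in both lists: k8s-resources-pod.yaml, k8s-resources-node.yaml and k8s-resources-namespace.yaml.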
So I guess my question is: does it make sense to pick one metric for each piece of information and standardize it across these dashboards? For my use case, I'm particularly interested in the "% of resources used vs allocated" panels, since we are going through some right-sizing work!
I'm willing to make the changes and open a pull request, but first I'd like some guidance from the maintainers about the direction to go! Thanks!