Description
URL
https://grafana.com/docs/alloy/latest/configure/kubernetes/
Component(s)
No response
Feedback
For context, we recently had issues on our Azure clusters where we appeared to hit the AKS in-flight request limits.
On AKS, the in-flight request limit determines when the Kube API gets throttled.
Running Alloy with its default configuration on 4 clusters saturated the Kube API to the point that new objects couldn't even be created.
Since each Alloy pod (we can have 20 nodes, so 20 pods) was watching the logs of every pod in the cluster, the containerLogs path was definitely saturated, and so were the Azure in-flight requests.
In the end, we added the following rule to the Alloy configuration for pods:
rule {
  source_labels = ["__meta_kubernetes_pod_node_name"]
  action        = "keep"
  regex         = env("K8S_NODE_NAME")
}
with K8S_NODE_NAME being an extraEnv entry passed to Alloy:
extraEnv:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
With that configuration, each Alloy pod only retrieves logs from pods running on its own node.
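For anyone wanting to see where such a rule could sit, here is a minimal sketch of a complete pipeline. It assumes logs are collected with loki.source.kubernetes (which reads them through the Kube API); the component labels ("pods", "local_pods", "default") and the Loki URL are hypothetical and not taken from our actual setup.

```alloy
// Discover every pod in the cluster through the Kubernetes API.
discovery.kubernetes "pods" {
  role = "pod"
}

// Keep only the pods scheduled on the same node as this Alloy instance.
discovery.relabel "local_pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    action        = "keep"
    // K8S_NODE_NAME is injected through extraEnv (spec.nodeName).
    // Newer Alloy releases may expect sys.env("K8S_NODE_NAME") instead;
    // adjust to your version.
    regex         = env("K8S_NODE_NAME")
  }
}

// Tail logs only for the targets that survived the relabel step.
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.local_pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    // Hypothetical endpoint, replace with your own Loki push URL.
    url = "http://loki.example.internal:3100/loki/api/v1/push"
  }
}
```

The key point is that the log source consumes the relabeled targets rather than the raw discovery output, so each instance only opens log streams for pods on its own node.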
First, memory usage dropped sharply: each pod had been consuming ~3 GB of memory.
Now Alloy uses around 300 MB per pod.
The containerLogs path showed a night-and-day difference too.
And, specific to Azure in our case, in-flight requests decreased as well.
--
Consequences:
Since Loki deduplicates entries, we didn't notice at first that every Alloy pod was actually watching all pods.
We had 4 Kubernetes API saturations, which affected our pipelines (runners were throttled).
Our monitoring was also affected for a while, since we had to disable Alloy multiple times.
Of course, this also cost our team time and increased our cluster costs, since we tried moving to a higher AKS in-flight tier (around $300 per cluster).
To summarize, given the potential cost of this behavior, I think the documentation should state more explicitly that it happens.
Of course, using clustering mode for Alloy would mitigate the issue too.
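As a rough illustration of that alternative, the sketch below enables the clustering block on loki.source.kubernetes so that targets are distributed across Alloy instances instead of every instance tailing every pod. It assumes Alloy itself is started in clustered mode (for example through the Helm chart's clustering setting), and the component labels and Loki URL are hypothetical. We have not run this variant ourselves.

```alloy
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]

  // Share the target list among cluster members so that each pod's logs
  // are tailed by a single Alloy instance. This only takes effect when
  // Alloy is running in clustered mode.
  clustering {
    enabled = true
  }
}

loki.write "default" {
  endpoint {
    // Hypothetical endpoint, replace with your own Loki push URL.
    url = "http://loki.example.internal:3100/loki/api/v1/push"
  }
}
```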
Either way, none of what I described was really clear to us when reading the documentation.