
Docs feedback: It should be specified that each daemonset pod watches ALL pods logs in the cluster #4787

@adrian-salas

Description


URL

https://grafana.com/docs/alloy/latest/configure/kubernetes/

Component(s)

No response

Feedback

For context, we have had issues lately on our Azure clusters where it seemed that we hit the AKS Inflight limits.

Azure Inflight determines whether the Kube API is throttled.
Using Alloy with its default configuration on 4 clusters saturated the Kube API to the point that new objects could not even be created.

Since each Alloy pod (we could have 20 nodes, so 20 pods) watches the logs of every pod in the cluster, the containerLogs path was saturated, and Azure Inflight was too.
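To illustrate the default behavior, here is a minimal sketch (not the exact Helm chart output; component names are illustrative) of the shape of the default configuration. `discovery.kubernetes` with `role = "pod"` returns every pod in the cluster, so every Alloy DaemonSet replica tails every pod's logs through the API server:

```river
// Illustrative sketch of the default, unfiltered setup.
// Every pod in the cluster becomes a target for every Alloy replica.
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "all_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}
```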

In the end, we were able to add the following rule to our Alloy pod configuration:

        rule {
          source_labels = ["__meta_kubernetes_pod_node_name"]
          action        = "keep"
          regex         = env("K8S_NODE_NAME")
        }

with K8S_NODE_NAME being an extraEnv passed to Alloy:

  extraEnv:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName

That configuration restricts each Alloy pod to retrieving logs only from the pods running on the same node as the watching Alloy pod.
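Wiring the rule above into a full pipeline looks roughly like this (a hedged sketch; the component labels `pods` and `local_pods` are illustrative, not from the chart):

```river
// Filter the cluster-wide target list down to pods on this node,
// using the downward-API env var K8S_NODE_NAME set via extraEnv.
discovery.relabel "local_pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    action        = "keep"
    regex         = env("K8S_NODE_NAME")
  }
}

loki.source.kubernetes "local_pods" {
  targets    = discovery.relabel.local_pods.output
  forward_to = [loki.write.default.receiver]
}
```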

Firstly, there was a huge memory decrease: each pod had been consuming ~3 GB of memory.
Now Alloy uses around 300 MB per pod.

The request rate on the containerLogs path showed a night and day difference too:

[screenshot: containerLogs request rate before/after]

And also, specific to Azure in our case, a decrease in Inflight requests:

[screenshot: Inflight requests]

--
Consequences:

Since Loki deduplicates entries, we hadn't noticed at first that each Alloy pod was actually watching all pods.

We have had 4 saturations of our Kubernetes API, which affected our pipelines (runners were throttled).
Our monitoring was also affected for a while, since we had Alloy disabled multiple times.

Of course, there was time spent by our team, as well as increased costs for our clusters, since we tried changing the AKS Inflight tier (around $300 per cluster).

To summarize, given the potential cost of this behavior, I think the documentation should state more clearly that it happens.
Of course, using clustering mode for Alloy would mitigate the issue too.
Either way, none of what I described was really clear to us when reading the documentation.
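For reference, clustering can be enabled through the Helm chart values, roughly like this (a sketch; verify the exact keys against your chart version). With clustering on, components that support it distribute targets across the Alloy instances instead of every instance watching everything:

```yaml
# Hedged sketch of grafana/alloy Helm values enabling clustering.
alloy:
  clustering:
    enabled: true
```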

