delete loop deletes c.Pod map entries for newly-created pod, causing lookup miss #42978

@VicThomas-Medallia

Description

Component(s)

processor/k8sattributes

What happened?

Description

We noticed an interesting phenomenon when a StatefulSet pod was restarted:

  1. For 2-2.5 minutes, spans generated in the newly-created pod would have various k8s resource attributes set correctly.
  2. But then, for a few minutes, spans generated by the newly-created pod would no longer have these k8s resource attributes set at all.
  3. Finally, spans generated by the newly-created pod would once again have these k8s resource attributes set correctly, and this behavior would hold for the remaining duration of the pod's lifespan.

The container in the pod does not set k8s.pod.uid. Instead, the lookup is apparently being done based on the connection IP address. In our environment it is common for a newly-created pod to be assigned the same pod IP address as the pod it replaces.
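
To illustrate why IP reuse matters, here is a minimal sketch with a simplified, hypothetical identifier type (not the real kube package API): a connection-source association keys the pod purely by the observed peer IP, so the old pod and its replacement produce the identical map key.

```go
package main

import "fmt"

// PodIdentifier is a simplified stand-in for the processor's identifier type
// (hypothetical; the real kube package uses a richer structure).
type PodIdentifier struct {
	Source string // e.g. "connection"
	Value  string // the observed IP address
}

// fromConnection builds the identifier used for connection-based lookup.
func fromConnection(ip string) PodIdentifier {
	return PodIdentifier{Source: "connection", Value: ip}
}

func main() {
	oldKey := fromConnection("10.0.0.1") // key for the deleted pod
	newKey := fromConnection("10.0.0.1") // replacement pod reuses the same IP
	// Both pods resolve to the identical map key, so whichever write (or
	// delete) happens last wins.
	fmt.Println(oldKey == newKey) // prints "true"
}
```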

What I believe is happening is the following race condition sequence:

  1. Pod is deleted. Consequently, a deleteRequest is appended to c.deleteQueue. Due to gracePeriod and interval, the deleteLoop() function won't attempt to remove the related entries from c.Pods for 2 to 2.5 minutes.
  2. Very soon after the pod is deleted, a new pod -- having the same IP address, namespace, and pod name -- is created. The entries in c.Pods are updated to point at the new pod.
  3. After the gracePeriod has expired, the entries in c.Pods for the deleted pod are removed the next time the deleteLoop() housekeeping executes. Unfortunately, the identifiers for the new pod are the same as those for the old pod, except for the one based on k8s.pod.uid. So, the new pod loses all of its entries in c.Pods except the one keyed by k8s.pod.uid.
  4. After some amount of time, handlePodUpdate() is triggered, which restores the c.Pods entries for the new pod to what they should be.
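
The sequence above suggests one possible direction for a fix (a minimal sketch with simplified, hypothetical types, not the actual kube package): have the delete loop remove a c.Pods entry only if it still points at the deleted pod's UID, so entries already overwritten by the replacement pod survive.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the processor's types.
type PodIdentifier string

type Pod struct {
	Name, PodUID, Address string
}

type client struct {
	Pods map[PodIdentifier]*Pod
}

// identifiersFor derives the lookup keys; only the UID-based key differs
// between the old pod and its same-name, same-IP replacement.
func identifiersFor(p *Pod) []PodIdentifier {
	return []PodIdentifier{
		PodIdentifier("uid/" + p.PodUID),
		PodIdentifier("name/" + p.Name),
		PodIdentifier("ip/" + p.Address),
	}
}

func (c *client) add(p *Pod) {
	for _, id := range identifiersFor(p) {
		c.Pods[id] = p
	}
}

// deleteGuarded removes only entries that still point at the deleted pod,
// leaving entries the replacement pod has since overwritten.
func (c *client) deleteGuarded(deleted *Pod) {
	for _, id := range identifiersFor(deleted) {
		if cur, ok := c.Pods[id]; ok && cur.PodUID == deleted.PodUID {
			delete(c.Pods, id)
		}
	}
}

// demo replays the race: old pod added, replacement added before the grace
// period expires, then the guarded delete runs for the old pod.
func demo() (int, string) {
	c := &client{Pods: map[PodIdentifier]*Pod{}}
	oldPod := &Pod{Name: "abc-0", PodUID: "uid-1", Address: "10.0.0.1"}
	c.add(oldPod)

	newPod := &Pod{Name: "abc-0", PodUID: "uid-2", Address: "10.0.0.1"}
	c.add(newPod)

	c.deleteGuarded(oldPod)
	return len(c.Pods), c.Pods[PodIdentifier("ip/10.0.0.1")].PodUID
}

func main() {
	n, uid := demo()
	fmt.Println(n, uid) // prints "3 uid-2": the new pod keeps all its entries
}
```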

Steps to Reproduce

Use the following pod_association rules:

pod_association:
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.uid"
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.name"
      - from: "resource_attribute"
        name: "k8s.namespace.name"
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.ip"
  - sources:
      - from: "connection"

Add this test to processor/k8sattributesprocessor/internal/kube/client_test.go and run it:

func TestReproduceBug(t *testing.T) {
	makePodIdentifiers := func(pod *api_v1.Pod) []PodIdentifier {
		return []PodIdentifier{
			newPodIdentifier("resource_attribute", "k8s.pod.uid", string(pod.UID)),
			PodIdentifier{
				{
					Source: AssociationSource{
						From: "resource_attribute",
						Name: "k8s.namespace.name",
					},
					Value: "ns",
				},
				{
					Source: AssociationSource{
						From: "resource_attribute",
						Name: "k8s.pod.name",
					},
					Value: "abc-0",
				},
			},
			newPodIdentifier("resource_attribute", "k8s.pod.ip", "10.0.0.1"),
			newPodIdentifier("connection", "", "10.0.0.1"),
		}
	}

	makePod := func(podUid string) *api_v1.Pod {
		pod1 := &api_v1.Pod{
			ObjectMeta: meta_v1.ObjectMeta{
				Name:      "abc-0",
				Namespace: "ns",
				UID:       types.UID(podUid),
			},
			Status: api_v1.PodStatus{
				PodIP: "10.0.0.1",
			},
		}
		return pod1
	}

	c, _ := newTestClient(t)
	doAssertions := func(pod *api_v1.Pod) {
		podIdentifiers := makePodIdentifiers(pod)
		for _, id := range podIdentifiers {
			foundPod, ok := c.Pods[id]
			assert.True(t, ok, "Pod should be present in c.Pods for identifier %v", id)
			assert.Equal(t, pod.UID, types.UID(foundPod.PodUID))
			assert.Equal(t, "ns", foundPod.Namespace)
			assert.Equal(t, "abc-0", foundPod.Name)
			assert.Equal(t, "10.0.0.1", foundPod.Address)
		}
	}

	// Clear the pods map to start fresh...
	c.Pods = make(map[PodIdentifier]*Pod)
	// Set associations to match what we have configured for our OpenTelemetry Collector.
	c.Associations = []Association{
		{
			Sources: []AssociationSource{
				{
					From: "resource_attribute",
					Name: "k8s.pod.uid",
				},
			},
		},
		{
			Sources: []AssociationSource{
				{
					From: "resource_attribute",
					Name: "k8s.namespace.name",
				},
				{
					From: "resource_attribute",
					Name: "k8s.pod.name",
				},
			},
		},
		{
			Sources: []AssociationSource{
				{
					From: "resource_attribute",
					Name: "k8s.pod.ip",
				},
			},
		},
		{
			Sources: []AssociationSource{
				{
					From: "connection",
				},
			},
		},
	}

	// Add a pod and verify that we can find it by all identifiers.
	pod1 := makePod("12345678-1234-1234-1234-123456789abc")
	c.handlePodAdd(pod1)
	doAssertions(pod1)

	// Delete the pod and verify that we can still find it by all identifiers until 2-2.5 minutes have passed.
	c.handlePodDelete(pod1)
	doAssertions(pod1)
	assert.Len(t, c.Pods, 4)

	// Add a pod with the same values as pod1 except for UID, and verify that we can find it by all identifiers.
	pod2 := makePod("87654321-4321-4321-4321-cba987654321")
	c.handlePodAdd(pod2)
	doAssertions(pod2)
	assert.Len(t, c.Pods, 5) // 4 from pod2 + 1 from pod1 (the pod UID identifier)

	// Prove that the bug exists!
	// Start a new delete loop that has no grace period and has a short interval. Then, wait long enough for the
	// delete loop housekeeping processing to have run. After that, assert that only the pod UID identifier still
	// exists in the map.
	go func() {
		time.Sleep(3 * time.Second)
		close(c.stopCh)
	}()
	c.deleteLoop(1*time.Second, 0*time.Second)
	assert.Len(t, c.Pods, 1) // Only one mapping remains.

	podIdentifiers := makePodIdentifiers(pod2)
	foundPod, ok := c.Pods[podIdentifiers[0]] // The identifier based on k8s.pod.uid.
	assert.True(t, ok)
	assert.Equal(t, string(pod2.UID), foundPod.PodUID) // The one remaining mapping is the one based on k8s.pod.uid.

	_, ok = c.Pods[podIdentifiers[1]] // The identifier based on (k8s.namespace.name, k8s.pod.name).
	assert.False(t, ok)

	_, ok = c.Pods[podIdentifiers[2]] // The identifier based on k8s.pod.ip.
	assert.False(t, ok)

	_, ok = c.Pods[podIdentifiers[3]] // The identifier based on connection.
	assert.False(t, ok)
}
}

The test passes, but that is precisely the problem: the assertions at the end verify that, after the delete loop runs, c.Pods is in the undesired state.

Expected Result

There should not be a period of time during which spans are not decorated with the expected k8s resource attributes.

Actual Result

2-2.5 minutes after the new pod starts, spans generated by the pod are not decorated with the k8s resource attributes. Then, after some additional time passes, the decoration starts working again and continues to work for the rest of the pod's lifespan.

Collector version

v0.130.1

Environment information

Environment

Kubernetes 1.33.4, with crio container runtime.

OpenTelemetry Collector configuration

# Our configuration is assembled via many snippet files and would be very time-consuming to assemble here. Also, the pod associations and test code that I shared above should be sufficient to reproduce the problem. However, here is the k8sattributesprocessor config that I used when initially debugging the problem on my laptop:

auth_type: "kubeConfig"
context: "den"
wait_for_metadata: true
wait_for_metadata_timeout: "10s"
filter:
  node: "den-r10-u09"
extract:
  metadata:
    - "container.id"
    - "k8s.container.name"
    - "k8s.cronjob.name"
    - "k8s.daemonset.name"
    - "k8s.deployment.name"
    - "k8s.job.name"
    - "k8s.namespace.name"
    - "k8s.node.name"
    - "k8s.pod.ip"
    - "k8s.pod.name"
    - "k8s.replicaset.name"
    - "k8s.statefulset.name"
  labels:
    - tag_name: "medallia.temp.deployment.environment.name"  # Maps to "meta_environment"
      from: "pod"
      key: "app.medallia.com/environment"
    - tag_name: "medallia.temp.medallia.component"  # Maps to "meta_component"
      from: "pod"
      key: "app.kubernetes.io/component"
    - tag_name: "medallia.temp.medallia.instance"  # Maps to "meta_instance"
      from: "pod"
      key: "app.kubernetes.io/instance"
    - tag_name: "medallia.temp.medallia.managed_by"  # Maps to "meta_managed_by"
      from: "pod"
      key: "app.kubernetes.io/managed-by"
    - tag_name: "medallia.temp.medallia.operator"  # Maps to "meta_operator"
      from: "pod"
      key: "app.medallia.com/operator"
    - tag_name: "medallia.temp.medallia.owner"  # Maps to "meta_owner"
      from: "pod"
      key: "app.medallia.com/owner"
    - tag_name: "medallia.temp.medallia.team"  # Maps to "meta_team"
      from: "pod"
      key: "app.medallia.com/team"
    - tag_name: "medallia.temp.service.name"  # Maps to "meta_name"
      from: "pod"
      key: "app.kubernetes.io/name"
    - tag_name: "medallia.temp.service.namespace"  # Maps to "meta_part_of"
      from: "pod"
      key: "app.kubernetes.io/part-of"
    - tag_name: "medallia.temp.service.version"  # Maps to "meta_version"
      from: "pod"
      key: "app.kubernetes.io/version"
    - tag_name: "medallia.temp.express_cluster_name"  # For special handling of Express and Social Media workloads.
      from: "pod"
      key: "express.medallia.com/cluster-name"
  annotations:
    - tag_name: "medallia.temp.pipeline_output_loki"
      from: "pod"
      key: "logging.medallia.com/pipeline-output-loki"
pod_association:
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.uid"
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.name"
      - from: "resource_attribute"
        name: "k8s.namespace.name"
  - sources:
      - from: "resource_attribute"
        name: "k8s.pod.ip"
  - sources:
      - from: "connection"

Log output

Additional context

No response

