
[loadbalancingexporter] Distribution across target endpoints is not approximately uniform #41200

@jamesmoessis


Component(s)

exporter/loadbalancing

What happened?

Description

In our production workload, the load balancing exporter distributes spans across pods far less evenly than I expected. The pod receiving the most spans got 3.5x as many as the pod receiving the fewest.

This skew produces a wide spread of CPU usage across the endpoints the LB exporter sends to, which makes it hard to manage the infrastructure and to right-size resource allocations for the downstream workload.

Steps to Reproduce

I've written a test that confirms the distribution is not uniform: over 100k random trace IDs, I tracked the number of resolutions per endpoint.

[Chart: number of resolutions per endpoint, sorted descending, showing a steep downward trend]

As the chart shows, the trend is steeply downward. I generated the data by adding the following test to consistent_hashing_test.go, then plotted the chart from the resulting CSV.

func TestUniformityOfDistribution(t *testing.T) {
	// Build a ring with 300 endpoints.
	endpoints := make([]string, 300)
	for i := range endpoints {
		endpoints[i] = fmt.Sprintf("endpoint-%d", i)
	}
	ring := newHashRing(endpoints)

	// Resolve 100k random trace IDs and record which endpoint each lands on.
	n := 100_000
	resolutions := make(map[string][]pcommon.TraceID, len(endpoints))

	for i := 0; i < n; i++ {
		id := generateRandomTraceID(t)
		resolved := ring.endpointFor(id[:])
		resolutions[resolved] = append(resolutions[resolved], id)
	}

	// Count resolutions per endpoint; endpoints never resolved count as zero.
	timesResolved := make([]int, len(endpoints))
	for i := range timesResolved {
		res, ok := resolutions[endpoints[i]]
		numRes := 0
		if ok {
			numRes = len(res)
		}
		timesResolved[i] = numRes
	}

	// Dump the counts to a CSV for charting.
	f, err := os.Create("resolutions.csv")
	require.NoError(t, err)
	defer f.Close()

	_, err = f.WriteString("endpoint,n\n")
	require.NoError(t, err)

	for i := range endpoints {
		s := fmt.Sprintf("%s,%d\n", endpoints[i], timesResolved[i])
		_, err = f.WriteString(s)
		require.NoError(t, err)
	}
}

Expected Result

Traces are evenly distributed across endpoints.

Actual Result

They are nowhere near uniformly distributed; we observe the steep downward trend shown in the chart above.

Collector version

v0.128.0

Environment information

OpenTelemetry Collector configuration

Log output

Additional context

No response

