Component(s)
exporter/loadbalancing
What happened?
Description
In our production workload, the load balancing exporter distributes spans across pods far less evenly than I expected: the busiest pod received 3.5x as many spans as the least busy one.
This creates a wide spread of CPU usage across the endpoints the LB exporter sends to, which is not ideal for managing the downstream infrastructure or right-sizing its resource allocations.
Steps to Reproduce
I've written a test that confirms the distribution is not uniform: over 100k random trace IDs, I tracked the number of resolutions per endpoint.
As you can see, the trend falls steeply across endpoints. I added the test below to consistent_hashing_test.go, then generated the chart from the CSV it writes.
func TestUniformityOfDistribution(t *testing.T) {
	endpoints := make([]string, 300)
	for i := range endpoints {
		endpoints[i] = fmt.Sprintf("endpoint-%d", i)
	}
	ring := newHashRing(endpoints)
	n := 100_000
	resolutions := make(map[string][]pcommon.TraceID, len(endpoints))
	for i := 0; i < n; i++ {
		id := generateRandomTraceID(t)
		resolved := ring.endpointFor(id[:])
		resolutions[resolved] = append(resolutions[resolved], id)
	}
	timesResolved := make([]int, len(endpoints))
	for i := range timesResolved {
		res, ok := resolutions[endpoints[i]]
		numRes := 0
		if ok {
			numRes = len(res)
		}
		timesResolved[i] = numRes
	}
	f, err := os.Create("resolutions.csv")
	require.NoError(t, err)
	defer f.Close()
	_, err = f.WriteString("endpoint,n\n")
	require.NoError(t, err)
	for i := range endpoints {
		s := fmt.Sprintf("%s,%d\n", endpoints[i], timesResolved[i])
		_, err = f.WriteString(s)
		require.NoError(t, err)
	}
}
Expected Result
Traces are evenly distributed across endpoints.
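For 100k IDs over 300 endpoints, a uniform hash would resolve each endpoint roughly 333 times. One way to quantify how far the CSV counts deviate from that (a sketch, not part of the repository's tests; the helper name chiSquare is mine) is a chi-square-style statistic:

```go
package main

import "fmt"

// chiSquare measures how far observed per-endpoint counts deviate from
// a uniform expectation; 0 means perfectly even.
func chiSquare(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	expected := float64(total) / float64(len(counts))
	var stat float64
	for _, c := range counts {
		d := float64(c) - expected
		stat += d * d / expected
	}
	return stat
}

func main() {
	fmt.Println(chiSquare([]int{333, 333, 334})) // near-uniform: small statistic
	fmt.Println(chiSquare([]int{150, 100, 50}))  // skewed: much larger statistic
}
```

For counts drawn from a truly uniform hash, the statistic should land near len(counts)-1, so a value orders of magnitude above that would confirm the skew numerically rather than just visually.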
Actual Result
The distribution is not even close to uniform; the per-endpoint counts trend steeply downward, as shown in the screenshot.
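A common source of this kind of skew in consistent hashing is too few points per endpoint on the ring: with one point per endpoint, arc lengths vary wildly, and adding virtual nodes evens them out. Below is a self-contained sketch illustrating that effect. It is not the exporter's actual ring implementation (newRing, skew, and the FNV-1a hashing are my assumptions for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// hashRing is a minimal consistent-hash ring; vnodes controls how many
// positions each endpoint occupies on the ring.
type hashRing struct {
	points []uint32
	owner  map[uint32]string
}

func newRing(endpoints []string, vnodes int) *hashRing {
	r := &hashRing{owner: make(map[uint32]string)}
	for _, e := range endpoints {
		for v := 0; v < vnodes; v++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%s|%d", e, v)
			p := h.Sum32()
			r.points = append(r.points, p)
			r.owner[p] = e
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// endpointFor returns the owner of the first ring point at or after the
// key's hash, wrapping around at the end of the ring.
func (r *hashRing) endpointFor(id []byte) string {
	h := fnv.New32a()
	h.Write(id)
	k := h.Sum32()
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

// skew returns the max/min resolution ratio over 100k random 16-byte IDs
// for 300 endpoints, given a number of virtual nodes per endpoint.
func skew(vnodes int) float64 {
	endpoints := make([]string, 300)
	for i := range endpoints {
		endpoints[i] = fmt.Sprintf("endpoint-%d", i)
	}
	r := newRing(endpoints, vnodes)
	rng := rand.New(rand.NewSource(42))
	counts := make(map[string]int)
	for i := 0; i < 100_000; i++ {
		id := make([]byte, 16)
		rng.Read(id)
		counts[r.endpointFor(id)]++
	}
	lo, hi := 1<<30, 0
	for _, e := range endpoints {
		c := counts[e]
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	if lo == 0 {
		lo = 1 // an endpoint received nothing; avoid division by zero
	}
	return float64(hi) / float64(lo)
}

func main() {
	fmt.Printf("vnodes=1   max/min=%.1f\n", skew(1))
	fmt.Printf("vnodes=100 max/min=%.1f\n", skew(100))
}
```

Running this shows the max/min ratio shrinking dramatically as vnodes grows, so it may be worth checking how many ring points per endpoint the exporter's hash ring actually uses.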
Collector version
v0.128.0
Environment information
OpenTelemetry Collector configuration
Log output
Additional context
No response