
[connector/spanmetrics] Metrics keeps being produced for spans that are no longer being received #30559

@matej-g

Component(s)

connector/spanmetrics

What happened?

Description

I have an application that is sending spans to the collector, which are subsequently run through the connector. However, once that application is shut down, I keep seeing metrics being produced indefinitely for the spans previously generated by the app, even though no new traces are being emitted (since the application has already been shut down, as stated above). This is particularly problematic for applications with a large number of operations (spans), since I keep receiving large amounts of data indefinitely (i.e. until I restart the collector).

Steps to Reproduce

Easiest is to reproduce with telemetrygen. For example:

  1. Run a collector with a simple pipeline that accepts OTLP traces -> exports traces to the spanmetrics connector -> receives the resulting metrics -> exports metrics to the debug exporter
  2. Send a couple of traces from telemetrygen to the collector
  3. Observe on stdout that the duration histogram is being produced by the span metrics connector for the spans emitted by telemetrygen
  4. Terminate the telemetrygen pod
  5. Observe on stdout that the duration histogram data points are still being produced for the spans previously created by telemetrygen, even though the service has already been terminated
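For step 2, a telemetrygen invocation along these lines should work (the endpoint and trace count are illustrative; adjust them to your setup):

```shell
# Send a handful of traces to the collector's OTLP gRPC endpoint.
telemetrygen traces --otlp-endpoint localhost:4317 --otlp-insecure --traces 5
```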

Expected Result

The metrics should stop being produced eventually.

Actual Result

The metrics keep getting exported indefinitely (until I restart the collector).

Collector version

v0.91.0

Environment information

Environment

Local kind cluster

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  debug:

connectors:
  spanmetrics:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [debug]

Log output

No response

Additional context

There have been a couple of similar issues flying around (e.g. #29604, #17306), although it's not 100% clear those users are describing the same issue as here, since previously there were also related reports of memory leaks.

Some users have been advised to adjust the config (e.g. this suggestion: #17306 (comment)), but these suggestions unfortunately do not address the cause of the issue. As a side note, even when decreasing the cache size, the number of metrics that keep being produced is unaffected according to my tests; at least for the cumulative temporality, cache eviction does not actually seem to take place, though this is only my deduction after glancing at the connector code.

I would imagine that ideally this could be solved by implementing logic along the lines of: "if span X has not been seen for Y amount of time, stop producing metrics for this span".
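To make the proposal concrete, here is a minimal sketch (in Python, purely illustrative, not the connector's actual code) of the suggested behavior: track the last-seen time per metric key and drop keys that have been idle longer than some configurable duration before each export.

```python
import time

class SpanKeyExpiry:
    """Track last-seen time per metric key; expire keys idle longer than max_idle.

    Illustrative sketch of the proposed "stop producing metrics for unseen
    spans" logic; names and structure are assumptions, not the connector's API.
    """

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle = max_idle_seconds
        self.clock = clock          # injectable for testing
        self.last_seen = {}         # metric key -> last-seen timestamp

    def observe(self, key):
        # Called whenever a span matching this metric key arrives.
        self.last_seen[key] = self.clock()

    def expire(self):
        # Called before each export: drop keys not seen within max_idle,
        # so no further data points are emitted for them.
        now = self.clock()
        stale = [k for k, t in self.last_seen.items() if now - t > self.max_idle]
        for k in stale:
            del self.last_seen[k]
        return stale

    def active_keys(self):
        # Only these keys would still have metrics produced on export.
        return set(self.last_seen)
```

With a fake clock: a key observed at t=0 is still active at t=5 with a 10-second idle limit, but is expired (and metrics for it stop) once the clock passes t=10 with no further observations.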
