
fix: disable stats computation to prevent per-invocation latency in Lambda#244

Merged: jchrostek-dd merged 1 commit into main from john/sles-2790 on Apr 20, 2026

@jchrostek-dd
Contributor

Summary

Fixes ~40% Lambda latency regression for customers using datadog-lambda-go with extension v91+ and dd-trace-go v2.1.0+.

Jira: SLES-2790


Root Cause

DataDog/dd-trace-go#3548 changed the default value of DD_TRACE_STATS_COMPUTATION_ENABLED from false to true in dd-trace-go v2.1.0, enabling client-side stats computation by default.

What happens when stats computation is enabled

At the end of every Lambda invocation, HandlerFinished calls tracer.Flush(). Here is the exact call chain when stats computation is enabled:

  1. tracer.Flush() sends a flush signal to the tracer worker goroutine and blocks on a done channel
  2. The worker goroutine processes the flush in order:
    t.traceWriter.flush()                                 // send traces via /v0.4/traces
    t.statsd.Flush()                                      // flush statsd metrics
    t.stats.flushAndSend(time.Now(), withCurrentBucket)  // ← NEW in v2.1.0
    done <- struct{}{}                                    // unblock tracer.Flush()
    
  3. flushAndSend() calls canComputeStats(), which returns true when three conditions are met:
    • The tracer has StatsComputationEnabled == true (default since v2.1.0)
    • The extension advertises stats: true in its /info response (extension v91 does this)
    • The extension advertises drop_p0s: true in its /info response
  4. With all conditions met, flushAndSend() makes a synchronous HTTP POST to http://127.0.0.1:8126/v0.6/stats
  5. Only after the extension responds to that POST does done get sent, unblocking tracer.Flush()

This means every Lambda invocation pays an extra HTTP round-trip to the extension, on top of the trace payload flush, before the handler can return. This round-trip costs 33–59 ms per warm invocation and up to 809 ms on cold start (first TCP connection to the extension).

Why extension v58→v91 matters: Extension v58 did not advertise the /v0.6/stats endpoint in its /info response, so canComputeStats() returned false and no stats POST was made. Extension v91 added this support, which combined with the v2.1.0 default change means all three conditions became true simultaneously — triggering the regression.
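The interaction between the tracer default and the extension version reduces to a three-way AND. The sketch below models that gate; `agentFeatures` and this `canComputeStats` signature are hypothetical and are not taken from dd-trace-go's source.

```go
package main

import "fmt"

// agentFeatures is a hypothetical stand-in for the part of the extension's
// /info response that matters here.
type agentFeatures struct {
	Stats   bool // extension advertises the /v0.6/stats endpoint
	DropP0s bool // extension advertises drop_p0s
}

// canComputeStats models the three conditions listed above; it is not
// dd-trace-go's actual implementation.
func canComputeStats(statsComputationEnabled bool, f agentFeatures) bool {
	return statsComputationEnabled && f.Stats && f.DropP0s
}

func main() {
	// The drop_p0s value for v58 is not stated in the PR; the result is
	// false either way because the stats endpoint is not advertised.
	v58 := agentFeatures{Stats: false}
	v91 := agentFeatures{Stats: true, DropP0s: true}

	fmt.Println("tracer default off, extension v91:", canComputeStats(false, v91))
	fmt.Println("tracer default on,  extension v58:", canComputeStats(true, v58))
	fmt.Println("tracer default on,  extension v91:", canComputeStats(true, v91))
}
```

Only the last combination evaluates to true, which is why neither upgrade on its own triggered the extra POST.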

Why disabling telemetry didn't help: DD_INSTRUMENTATION_TELEMETRY_ENABLED=false disables the telemetry reporter goroutine (a completely different subsystem). The correct env var to disable stats computation is DD_TRACE_STATS_COMPUTATION_ENABLED=false.


Fix

Added tracer.WithStatsComputation(false) to initTracer() in internal/trace/listener.go. This sets the tracer option that disables stats computation regardless of the DD_TRACE_STATS_COMPUTATION_ENABLED env var and regardless of what the extension advertises in /info. The extension still receives and processes traces — only the stats pre-aggregation and the /v0.6/stats POST are skipped.

This is the identical fix already applied to dd-trace-go's own Lambda wrapper in DataDog/dd-trace-go#4471 (APMSVLS-389).


Verification

Two Go Lambda functions were deployed to AWS sandbox using extension v91 on provided.al2/x86_64 (matching the customer's exact config):

  • sles-2790-go-unfixed: published datadog-lambda-go v1.31.0 (stats enabled by default)
  • sles-2790-go-fixed: patched with WithStatsComputation(false)

Each function received 20 sequential warm-up invocations (excluded from the results), followed by 10,000 measured warm invocations from 10 concurrent workers:

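| Metric | Unfixed   | Fixed     | Improvement  |
|--------|-----------|-----------|--------------|
| Mean   | 34.59 ms  | 25.07 ms  | 27.5% faster |
| Median | 30.52 ms  | 20.86 ms  | 31.6% faster |
| p90    | 50.66 ms  | 39.82 ms  | 21.4% faster |
| p95    | 60.91 ms  | 48.90 ms  | 19.7% faster |
| p99    | 170.20 ms | 122.51 ms | 28.0% faster |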

Every percentile improved. The median (30ms → 21ms, 32% improvement) is the cleanest signal as it is least affected by cold-start outliers.
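For readers checking the arithmetic, the Improvement column is the latency reduction relative to the unfixed build. The same numbers measured against the fixed build give the size of the regression: about 38% on the mean and 46% on the median, which is in the neighbourhood of the ~40% quoted in the Summary. The sketch below recomputes both from the table; the helper names are illustrative, and because the table inputs are already rounded, the last digit may differ slightly from the published column.

```go
package main

import "fmt"

// improvement returns the percentage reduction in latency going from the
// unfixed build to the fixed build (baseline: unfixed).
func improvement(unfixed, fixed float64) float64 {
	return (unfixed - fixed) / unfixed * 100
}

// regression returns the percentage increase in latency going from the
// fixed build to the unfixed build (baseline: fixed).
func regression(unfixed, fixed float64) float64 {
	return (unfixed - fixed) / fixed * 100
}

func main() {
	rows := []struct {
		metric         string
		unfixed, fixed float64 // milliseconds, copied from the table
	}{
		{"Mean", 34.59, 25.07},
		{"Median", 30.52, 20.86},
		{"p90", 50.66, 39.82},
		{"p95", 60.91, 48.90},
		{"p99", 170.20, 122.51},
	}
	for _, r := range rows {
		fmt.Printf("%-6s %.1f%% faster with fix, %.1f%% slower without\n",
			r.metric, improvement(r.unfixed, r.fixed), regression(r.unfixed, r.fixed))
	}
}
```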

CloudWatch extension logs for the unfixed function show on every invocation:

DD_EXTENSION | DEBUG | Stats request to https://trace.agent.datadoghq.com/api/v0.2/stats took 33-59 ms

Fixed function: no Stats request log line at all across all 10,000 invocations.


Immediate Workaround (no deploy needed)

Set DD_TRACE_STATS_COMPUTATION_ENABLED=false in Lambda environment variables.

…head in Lambda

Stats computation was enabled by default in dd-trace-go v2.1.0 (DataDog/dd-trace-go#3548).
When extension v91+ advertises /v0.6/stats support in its /info response, tracer.Flush()
makes a synchronous HTTP POST to http://127.0.0.1:8126/v0.6/stats before returning,
adding 33-59ms per warm invocation.

This fix disables stats computation at the tracer level so it is never posted from Lambda,
mirroring the identical fix already applied to the dd-trace-go contrib Lambda wrapper in
DataDog/dd-trace-go#4471 (APMSVLS-389).

Resolves: SLES-2790
@purple4reina
Contributor

> CloudWatch extension logs for the unfixed function show on every invocation:

I suspect that once we enable Agent-Side Stats, we'll start seeing these log messages again and will probably see post-runtime duration increase again. The difference is that you were seeing these requests on every invocation, whereas with proper Agent-Side Stats they'll be sent every 10s.

purple4reina marked this pull request as ready for review April 20, 2026 20:30
purple4reina requested a review from a team as a code owner April 20, 2026 20:30
jchrostek-dd merged commit fafc57f into main Apr 20, 2026
11 of 12 checks passed
jchrostek-dd deleted the john/sles-2790 branch April 20, 2026 20:46