Listen on both TCP and named pipes when a pipe name is configured#94
Conversation
28db9c8 to
131147e
Compare
2c71ea0 to
9d56bcc
Compare
2f5764b to
f9c920d
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the Windows named-pipe transport behavior in the trace mini-agent so that, when DD_APM_WINDOWS_PIPE_NAME is set, the agent listens on both TCP (DD_APM_RECEIVER_PORT, default 8126) and the configured named pipe concurrently—preserving connectivity for legacy tracers that only support TCP.
Changes:
- Refactors
MiniAgentstartup to spawn TCP and (optionally) named-pipe accept loops and supervise them via a centralizedserve()function. - Removes the config behavior that forced
dd_apm_receiver_port = 0when a Windows pipe name is configured. - Updates the related config unit test to assert the default TCP port is preserved when a pipe name is set.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
crates/datadog-trace-agent/src/mini_agent.rs |
Starts TCP unconditionally, optionally starts named-pipe accept loop, and centralizes task supervision. |
crates/datadog-trace-agent/src/config.rs |
Keeps TCP port enabled even when DD_APM_WINDOWS_PIPE_NAME is configured; updates test expectation accordingly. |
Comments suppressed due to low confidence (1)
crates/datadog-trace-agent/src/mini_agent.rs:93
- Same as above: this
mutis unused and suppressed via#[allow(unused_mut)]. Prefer removing themutbinding (and the allow) to avoid masking future cases where mutability is accidentally unused.
#[allow(unused_mut)]
let mut stats_flusher_handle = tokio::spawn(async move {
stats_flusher
.start_stats_flusher(stats_config, stats_rx)
.await;
});
duncanpharvey
left a comment
There was a problem hiding this comment.
What versions of the .NET, Java, and Node.js tracers was named pipe support added in? Can we validate that anyone is using one of these older version to justify the added complexity in listening over TCP and Named Pipes?
If we do move forward with adding this backwards compatibility for older tracers, do we have a plan for deprecation?
|
@duncanpharvey We need this for the .net tracer specifically, where this support does not yet exist: DataDog/dd-trace-dotnet#8164 -- see the table in the PR description. I actually think this complexity is worth keeping-- it's under 100 lines, I wouldn't prioritize deprecating this against any of the other work we have. At best, I would throw it at the bottom of the backlog. Do you expect performance issues or similar that would change the calculus there? |
6751361 to
48b70aa
Compare
Previously, configuring DD_APM_WINDOWS_PIPE_NAME forced the TCP receiver port to 0, leaving legacy TCP-only tracers stranded. The mini agent now listens on both transports concurrently when a pipe name is set: TCP stays on dd_apm_receiver_port (default 8126), and the named pipe is added alongside. Behavior is unchanged when no pipe name is configured. A single serve() function supervises the TCP and named-pipe accept loops plus the trace flusher, stats flusher, and stats concentrator. Any task dying unexpectedly is fatal — tracers pick one transport at startup and stay on it, so silently degrading a transport would strand the tracers using it. On shutdown_rx, accept loops drain in-flight handlers via a watch fan-out before the supervisor signals the stats flusher to do its final emit, preserving the trace-stats invariant that no AddChunk/ClientStatsPayload messages are lost. Survivors are aborted before serve() returns so they don't detach and hold sockets, named pipes, or buffered channel state alive. shutdown_rx now uses tokio::sync::watch::Receiver<bool> instead of oneshot::Receiver<()> so accept loops can subscribe directly via .clone(); no separate accept_shutdown channel is needed. Pipe code stays fully cfg-gated. On non-Windows or no-windows-pipes builds, no pipe symbols are emitted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
48b70aa to
6efdb89
Compare
Boots one mini agent with both a non-zero dd_apm_receiver_port and a named pipe configured, sends a service-tagged trace through each, and asserts both service-name needles appear in the mock backend's request bodies. Then fires shutdown_tx and asserts the final stats flush arrives — exercising the watch-channel fan-out to both accept loops, the drain of each transport's in-flight handlers, and the supervisor's sequenced final flusher signal. Covers the dropped receiver_port=0 override in config.rs and the serve() supervisor's responsibility to keep both accept loops alive. create_test_trace_payload now takes Option<&str> so the dual-transport test can distinguish payloads by service name without a second helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6efdb89 to
1957ae3
Compare
test_mini_agent_tcp_with_real_flushers flaked on Ubuntu 24.04 ARM CI (and was reproducible occasionally on macOS too): the test would call verify_stats_request immediately after agent_handle.await, but on slow runners the mock server hadn't finished reading and storing the body of the final stats POST in time. The fixed-duration FLUSH_WAIT_DURATION sleep before each verify_* call had the same race in the other direction — fast runners waited longer than necessary, slow runners sometimes weren't enough. Replace both timed sleeps with a wait_for_request_at_path helper that polls the mock server every 50ms until at least one request arrives at the expected path, with a 60-second hard ceiling that panics with a clear message if exceeded. verify_trace_request and verify_stats_request are now async and use the helper internally, so callers just .await them — no separate sleep. FLUSH_WAIT_DURATION stays only for verify_no_stats_request, where a bounded wait is fundamental to the negative assertion (we can't poll for the absence of an event). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
557b7f3 to
f89f181
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f89f181bbf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Some(h) = pipe_handle.as_ref() { | ||
| h.abort(); | ||
| } | ||
| trace_flusher_handle.abort(); |
There was a problem hiding this comment.
Avoid aborting the trace flusher on graceful shutdown
When shutdown is requested after a trace handler has returned but before the next trace flush interval, this aborts trace_flusher_handle immediately after the accept loops drain. start_trace_flusher has no shutdown/final-flush path (it only flushes on its timer), so any SendData just queued by the drained handlers can remain in the receiver/aggregator and never be sent; the new real-flusher tests avoid this by waiting for verify_trace_request before shutdown, but production callers can lose traces during graceful shutdown.
Useful? React with 👍 / 👎.
## Summary of changes * Allow use of unique windows named pipes for IPC, coordinated with the serverless compatibility layer's mini-agent. * See: DataDog/datadog-serverless-compat-dotnet#20 for compat-side changes. * See: DataDog/serverless-compat-self-monitoring#54 for integration testing in Azure. ## Reason for change * Improve support for running multiple Azure Functions in the same hosting plan. [[jira](https://datadoghq.atlassian.net/browse/SVLS-8244)] ## Implementation details For [Azure Functions](https://docs.datadoghq.com/serverless/azure_functions/?tab=dotnet#setup), the tracer loads first, then the compat layer, and then the compat layer launches a rust binary to act as the agent. Additionally, the client may or may not be using the DogstatsD client, and it is also possible to have the compat layer and DogstatsD in place without the tracer. We need to coordinate named pipes across all three/four of these codebases, noting that the tracer will use DogstatsD's pipe for runtime metrics if available. Given this order of operations, and that we want this to work with immutable configuration, the tracer should set pipe names for the compatibility layer via instrumentation if the tracer is present. 1. Tracer loads -- pipe name is generated 2. We check the presence of the compat layer + windows configs (using undocumented envvar `DD_SERVERLESS_COMPAT_BINARY_PATH` if necessary to adapt to Azure changes) to see if named pipes can be used or if we should rely on TCP 4. Compat layer loads 5. Tracer uses instrumentation to force the compat layer to generate the pipe names <pipe-a>, <pipe-b> 6. Compat layer passes the name to the mini-agent tracer <pipe-a> and dogstatsd <pipe-b>. | Scenario | Pipe generated? | Transport | Result | |---|---|---|---| | New tracer + no compat at all | No (binary not found) | TCP | Works | | New tracer + old compat (< 1.5) | No (version check fails) | TCP | Works | | New tracer + new compat (≥ 1.5) | Yes | Named pipe | Works (integration coordinates) | | New tracer + dev compat (0.0.0) | Yes | Named pipe | Works (integration coordinates) | | Old tracer + no compat | No | TCP | Works | | Old tracer + old compat (< 1.5) | No | TCP | Works (neither side knows about pipes) | | Old tracer + new compat (≥ 1.5) | No | TCP | **Broken** — tracer sends via TCP, but compat binary listens on pipe only. Requires [more serverless-components changes](DataDog/serverless-components#94). | | Old tracer + dev compat (0.0.0) | No | TCP | Not supported | ## Test coverage * Self Monitoring * New Unit Tests --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Andrew Lock <andrewlock.net@gmail.com>
## Summary - Bumps the `Datadog.Serverless.Compat` `PackageReference` in `tracer/src/Datadog.AzureFunctions/Datadog.AzureFunctions.csproj` from `1.4.0` → `1.5.0`. - 1.5.0 ships multifunction named-pipe coordination for .NET on Azure Functions Premium plans. It bundles the [`datadog-serverless-compat/v0.24.0`](https://github.com/DataDog/serverless-components/releases/tag/datadog-serverless-compat%2Fv0.24.0) binary, which contains the dual TCP+pipe listener ([serverless-components#94](DataDog/serverless-components#94)) and best-effort TCP-8126 bind ([serverless-components#129](DataDog/serverless-components#129)) — both required to land the named-pipe path merged in [#8164](#8164). ## Test plan - [ ] CI green (csproj-only change; should be a no-op outside of `Datadog.AzureFunctions` packaging) - [ ] Confirm a built `Datadog.AzureFunctions` package resolves `Datadog.Serverless.Compat 1.5.0` from NuGet at restore time 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tracers running on Azure Functions Premium (multifunction) plans need named-pipe transport so each function gets isolated trace/dogstatsd channels. Older tracers without pipe support still need to reach the agent over TCP. Today the binary picks one transport: configuring a pipe name disables TCP, which would strand any old-tracer + new-compat combination.
This PR makes the binary listen on both transports concurrently when
DD_APM_WINDOWS_PIPE_NAMEis set: TCP stays onDD_APM_RECEIVER_PORT(default8126), named pipe is added alongside. When no pipe name is configured, behavior is unchanged (TCP-only, identical to before).What changed
mini_agent.rs: spawn TCP and (optionally) named-pipe accept loops, supervised by a singleserve()function. Any task dying unexpectedly is fatal — tracers pick one transport at startup and don't switch, so silently degrading a transport would strand tracers using it.mini_agent.rsandmain.rs:start_mini_agentnow takesshutdown_rx: tokio::sync::watch::Receiver<bool>(wasoneshot::Receiver<()>). Each accept loop subscribes via.clone(), so the shutdown signal fans out to both transports. Production caller inmain.rsalso updated.mini_agent.rs: graceful shutdown is preserved across the dual-transport architecture — onshutdown_rx, both accept loops drain their in-flight HTTP handlers before the supervisor signals the stats flusher (flusher_shutdown_tx) to do its final emit. The trace-stats invariant ("noAddChunk/ClientStatsPayloadmessages lost on shutdown") stays intact.config.rs: drop the override that forceddd_apm_receiver_port = 0whenever a pipe was configured. That override was load-bearing only when pipe-mode meant "no TCP."cfg-gated. On non-Windows / nowindows-pipesbuilds (Lambda), zero pipe symbols make it into the binary.Test plan
integration_test.rs)test_apm_windows_pipe_name)test_mini_agent_dual_transport_with_real_flushersintegration test (Windows +windows-pipes): boots one agent with both transports configured, sends a service-tagged trace through each, firesshutdown_tx, and asserts both traces and the final stats flush reach the mock backend. Exercises the watch fan-out and graceful drain on both accept loops in one go.x86_64-pc-windows-gnu --features windows-pipes, and the default Lambda buildRelated PRs and context
Part of the multifunction-named-pipes workstream — epic SVLS-8429, this PR's ticket SVLS-8243.
stats_concentrator_service_handleand the graceful drain into the newserve()supervisor)This PR's binary release must precede
datadog-serverless-compat-dotnetv1.5; v1.4 stays TCP-only and is unaffected.Notes
config.rs,dd_dogstatsd_port). Out of scope here; flagged for a follow-up.