fix(spanner): don't tear down host app's OTel ContextManager#8344
fix(spanner): don't tear down host app's OTel ContextManager#8344jcalem-rogo wants to merge 2 commits into
Conversation
`ensureInitialContextManagerSet()` (added to make async/await tracing work
for apps that haven't configured OpenTelemetry) currently tears down the
host application's `ContextManager` on most calls. The guard is OR-joined:
if (!context['_contextManager'] || context.active() === ROOT_CONTEXT) {
context.disable();
// ...install fresh AsyncHooksContextManager
}
Two problems:
1. `context['_contextManager']` is not a public field on the API singleton.
`setGlobalContextManager` writes into the package-private global registry
(`registerGlobal` in `@opentelemetry/api/internal/global-utils.js`), not
to a `_contextManager` property. So the first leg of the OR is effectively
always true and the function runs its install path on every call.
2. Even with that, `context.active() === ROOT_CONTEXT` fires on every gRPC
call made outside an active span — exactly what the Spanner session pool
does during background warmup (BatchCreateSessions, pool maintenance).
On those calls, the function `context.disable()`s the host app's
already-installed `ContextManager` and replaces it with a fresh one.
Any baggage the host had set before that moment is silently lost, along
with span parent linkage.
In practice this means: an app that uses OpenTelemetry, sets baggage on the
root context (e.g. user id, request id), and then issues a Spanner call,
will observe its baggage disappear on the next span emitted after Spanner's
pool warms up. We hit this in a Node service that uses OTel + Spanner —
chats emitted log records with `undefined` ids because host baggage had
been wiped by Spanner's pool-warmup `startTrace` call.
The fix uses the public API surface — `setGlobalContextManager()` already
returns `false` when a manager is already registered. Try to install; if
something else owns the slot, back out without touching anything.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request refactors the OpenTelemetry context manager initialization in the Spanner instrumentation to prevent breaking existing context managers by avoiding unnecessary teardowns. Instead of force-disabling previous managers, it now attempts to set a global manager and only keeps it if one wasn't already registered. Feedback was provided to optimize this process by using a module-level flag to prevent redundant object creation and initialization on every call.
Latch on a module-level flag so we don't allocate + enable + disable a fresh
AsyncHooksContextManager on every `new Spanner({...})` when the host app has
already registered its own ContextManager. Addresses review feedback.
Problem
ensureInitialContextManagerSet()inhandwritten/spanner/src/instrument.ts(added to make async/await tracing work for apps that haven't configured OpenTelemetry) currently tears down the host application'sContextManageron most calls. The guard is:Two issues:
context['_contextManager']is not a public field on the API singleton.setGlobalContextManagerwrites into the package-private global registry (registerGlobalin@opentelemetry/api/internal/global-utils.js), not to a_contextManagerproperty on the API object. So the first leg of the OR is effectively always true and the function runs its install path on every call.context.active() === ROOT_CONTEXTis true on every gRPC call made outside an active span. That's exactly what the Spanner session pool does during background warmup (BatchCreateSessions, keep-alives, pool maintenance). On those calls, the functioncontext.disable()s the host app's already-installedContextManagerand replaces it with a freshAsyncHooksContextManager.Impact
Any host app that:
ContextManager(viaNodeSDK, manualAsyncHooksContextManager, etc.)…will see its baggage and active-context state silently disappear shortly after
new Spanner({...})warms up the session pool. Span parents andBaggageSpanProcessor-style enrichment break for everything downstream.We hit this in a Node service that uses OTel + Spanner. Chats produced LLM observation log records with
undefinedchat ids because the host's baggage had been wiped by Spanner's pool-warmupstartTracecall. Bisecting the regression landed exactly on adding@google-cloud/spannerto the process.Affected versions confirmed:
8.7.1(latest),8.6.0, and the same body exists back through7.21.0. No relevant change between 7.x and 8.x in this code path.Fix
Use the public API surface:
setGlobalContextManager()already returnsfalsewhen a manager is already registered. Try to install; if something else owns the slot, back out without touching anything.This preserves the original intent — apps without OTel still get a working
AsyncHooksContextManager— while ensuring Spanner cannot tear down a manager the host already configured.Notes
ROOT_CONTEXTimport fromhandwritten/spanner/src/instrument.ts.observability-test/observability.tsblocks use a per-describesetGlobalContextManagerfor setup, which doesn't compose cleanly with assertions on this function's own registration behavior. The fix is verified end-to-end against a host service that previously showed the symptom.