feat(spanner): optimize RequestId propagation and minimize OpenTelemetry active tracing overhead#8329
feat(spanner): optimize RequestId propagation and minimize OpenTelemetry active tracing overhead#8329surbhigarg92 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to check if tracing is enabled before performing instrumentation, improving performance by avoiding unnecessary work when tracing is inactive. It also optimizes project ID retrieval in prepareGapicRequest_ by caching the result and refactors request ID header injection. Feedback identifies a logic bug in the isTracingEnabled caching mechanism regarding the precedence of custom providers and global state pollution. Additionally, suggestions were made to correct a copy-pasted docstring and to utilize the existing craftRequestId utility to maintain consistency.
…try active tracing overhead
5ea84d5 to
652517b
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces conditional tracing logic to the Spanner client, ensuring that observability features like context propagation and request ID attribution are only active when tracing is enabled. It also refactors project ID replacement and request ID header generation for improved efficiency. Feedback from the review highlights critical issues where project ID and resource header replacements are incorrectly skipped after the first request, which could lead to incorrect headers in subsequent calls. Additionally, the reviewer suggested refining the tracing enablement cache to avoid permanently disabling tracing if the client is initialized before the OpenTelemetry SDK is fully configured.
| config.headers[CLOUD_RESOURCE_HEADER] = replaceProjectIdToken( | ||
| config.headers[CLOUD_RESOURCE_HEADER], | ||
| projectId!, | ||
| ); | ||
| this.projectIdReplaced_ = true; | ||
| } else { | ||
| reqOpts = config.reqOpts; | ||
| } |
There was a problem hiding this comment.
The CLOUD_RESOURCE_HEADER replacement has been moved inside the if (!this.projectIdReplaced_) block, meaning it will be skipped for all requests after the first one. This header must be updated for every request to ensure the correct project ID is sent to the backend.
this.projectIdReplaced_ = true;
}
config.headers[CLOUD_RESOURCE_HEADER] = replaceProjectIdToken(
config.headers[CLOUD_RESOURCE_HEADER],
projectId!,
);There was a problem hiding this comment.
During the first request, the SDK permanently replaces the {{projectId}} token in-place inside all cached instance and database objects (instance.formattedName_, database.formattedName_). For all subsequent requests, the headers are constructed from these already-resolved properties, meaning config.headers[CLOUD_RESOURCE_HEADER] already contains the correct project ID. Running the replacement on every subsequent request is mathematically redundant and adds unnecessary regex overhead on Node's single-threaded event loop.
There was a problem hiding this comment.
I do think that this is an actual problem, and then specifically for non-cached resource names. See this test case:
it('real application flow: should replace project ID tokens on Backup objects after initial Spanner request', done => {
const {Instance} = require('../src/instance');
const {Backup} = require('../src/backup');
// Stub setup...
const appSpanner = new Spanner({projectId: '{{projectId}}'});
asAny(appSpanner).auth.getProjectId = callback => {
callback(null, PROJECT_ID);
};
const realInstance = new Instance(appSpanner, 'my-instance');
appSpanner.instances_.set('my-instance', realInstance);
// Backup created before project ID is replaced
const realBackup = new Backup(realInstance, 'my-backup');
// 1. Initial request triggers prepareGapicRequest_
FAKE_GAPIC_CLIENT.getInstanceConfig = (reqOpts, gaxOpts, callback) => {
callback(null, {});
};
appSpanner.getInstanceConfig('nam1', err => {
if (err) return done(err);
// At this point, appSpanner.projectIdReplaced_ is true.
// 2. Application calls backup.getMetadata()
FAKE_GAPIC_CLIENT.getBackup = reqOpts => {
try {
assert.strictEqual(
reqOpts.name,
`projects/${PROJECT_ID}/instances/my-instance/backups/my-backup`,
);
done();
} catch (e) {
done(e);
}
return Promise.resolve([{}]);
};
realBackup.getMetadata();
});
});There was a problem hiding this comment.
I missed this case, this seems a problem only for classes like Backups.
Because any DataClient operations will be done via instance object which is created using Spanner instance. For these objects we are replacing the projectId.
this.instances_.forEach(instance => {
instance.formattedName_ = replaceProjectIdToken(
instance.formattedName_,
projectId!,
);
instance.databases_.forEach(database => {
database.formattedName_ = replaceProjectIdToken(
database.formattedName_,
projectId!,
);
});
});
I think the same problem will come for instanceConfigs object also. Using Spanner client for AdminOperations are deprecated but it may still impact existing customers . Let me think of an alternate solution
a983da5 to
571de00
Compare
571de00 to
85e4b52
Compare
| const gaxClient = this.clients_.get(clientName)!; | ||
| let reqOpts = extend(true, {}, config.reqOpts); | ||
| reqOpts = replaceProjectIdToken(reqOpts, projectId!); | ||
| // It would have been preferable to replace the projectId already in the |
There was a problem hiding this comment.
Should we really remove this comment? Isn't it still valid?
| const probeSpan = globalProvider | ||
| .getTracer(TRACER_NAME, TRACER_VERSION) | ||
| .startSpan('probe'); | ||
| const isRecording = probeSpan.isRecording(); |
There was a problem hiding this comment.
I don't think this is a safe assumption. It assumes that probeSpan.isRecording() will return true if tracing is enabled. But if tracing is enabled with a 5% sample rate, then there is no guarantee that this will return true. Or am I misunderstanding what is going on here?
There was a problem hiding this comment.
Thanks for highlighting it. Completely missed this . Another option which I was trying was below which also didn't look like a strong approach. I am checking this with OpenTelemetry team . Unlike Java, Node does not expose an option to check if tracer is enabled https://github.com/open-telemetry/opentelemetry-java/blob/31b3cd5f561a7cf6278a255fad33d40887c1a48b/api/all/src/main/java/io/opentelemetry/api/trace/Tracer.java#L72
const globalProvider = trace.getTracerProvider();
if (globalProvider) {
let delegate = globalProvider;
if (typeof (globalProvider as any).getDelegate === 'function') {
delegate = (globalProvider as any).getDelegate();
}
if (delegate) {
const name = delegate.constructor.name;
// Exclude the dummy NoopTracerProvider and uninitialized ProxyTracerProvider
if (name !== 'NoopTracerProvider' && name !== 'ProxyTracerProvider') {
globalTracingEnabled = true;
return true;
}
}
}
| */ | ||
| function isGlobalTracingEnabled(): boolean { | ||
| if (globalTracingEnabled !== undefined) { | ||
| return globalTracingEnabled; |
There was a problem hiding this comment.
This caching means that the result that is returned the first time is valid for the lifetime of the application. This means that:
- If this function happens to be called before the application has configured OpenTelemetry, then the value that it calculates and caches can be wrong.
- It does not take into the (maybe theoretical) possibility that an application could change its configuration later.
There was a problem hiding this comment.
If this function happens to be called before the application has configured OpenTelemetry, then the value that it calculates and caches can be wrong.
Expectation is OpenTelemetry Global registration should be done before Spanner instance is created, even in Java we expect customers to do before SpannerInstance creation , if done later it will not be picked for adding traces.
It does not take into the (maybe theoretical) possibility that an application could change its configuration later.
If opentelemetry provider is passed while creating Spanner object that will be considered, but global configuration is not expected to be changed later
There was a problem hiding this comment.
If we try to accommodate the requirement of letting customer register OpenTelemetry later, we will not be able to avoid enabling registering of AsyncHooksContextManager() . Registering this adds a good load on the application.
| X_GOOG_SPANNER_REQUEST_ID_SPAN_ATTR, | ||
| attributeXGoogSpannerRequestIdToActiveSpan, | ||
| craftRequestId, | ||
| PROCESS_PREFIX, |
There was a problem hiding this comment.
Is this really used elsewhere?
| InMemorySpanExporter, | ||
| } = require('@opentelemetry/sdk-trace-node'); | ||
| const {SimpleSpanProcessor} = require('@opentelemetry/sdk-trace-base'); | ||
| const {startTrace, ObservabilityOptions} = require('../src/instrument'); |
There was a problem hiding this comment.
(not related to this line, but it feels like the most logical place to add this comment)
We are not adding any new tests for this. Should we add tests that verify that:
- It does not matter when the OpenTelemetry configuration is done (before or after creating a Spanner instance).
- It does not matter what the trace sampling is. When tracing is enabled, even with a 1% sampling rate, then the request ID should be added to the traces.
There was a problem hiding this comment.
It does not matter when the OpenTelemetry configuration is done (before or after creating a Spanner instance).
As mentioned in previous comment reply, we need to discuss if we want to allow this usecase.
It does not matter what the trace sampling is. When tracing is enabled, even with a 1% sampling rate, then the request ID should be added to the traces.
Sure will add it
| config.headers[CLOUD_RESOURCE_HEADER] = replaceProjectIdToken( | ||
| config.headers[CLOUD_RESOURCE_HEADER], | ||
| projectId!, | ||
| ); | ||
| this.projectIdReplaced_ = true; | ||
| } else { | ||
| reqOpts = config.reqOpts; | ||
| } |
There was a problem hiding this comment.
I do think that this is an actual problem, and then specifically for non-cached resource names. See this test case:
it('real application flow: should replace project ID tokens on Backup objects after initial Spanner request', done => {
const {Instance} = require('../src/instance');
const {Backup} = require('../src/backup');
// Stub setup...
const appSpanner = new Spanner({projectId: '{{projectId}}'});
asAny(appSpanner).auth.getProjectId = callback => {
callback(null, PROJECT_ID);
};
const realInstance = new Instance(appSpanner, 'my-instance');
appSpanner.instances_.set('my-instance', realInstance);
// Backup created before project ID is replaced
const realBackup = new Backup(realInstance, 'my-backup');
// 1. Initial request triggers prepareGapicRequest_
FAKE_GAPIC_CLIENT.getInstanceConfig = (reqOpts, gaxOpts, callback) => {
callback(null, {});
};
appSpanner.getInstanceConfig('nam1', err => {
if (err) return done(err);
// At this point, appSpanner.projectIdReplaced_ is true.
// 2. Application calls backup.getMetadata()
FAKE_GAPIC_CLIENT.getBackup = reqOpts => {
try {
assert.strictEqual(
reqOpts.name,
`projects/${PROJECT_ID}/instances/my-instance/backups/my-backup`,
);
done();
} catch (e) {
done(e);
}
return Promise.resolve([{}]);
};
realBackup.getMetadata();
});
});| const span = trace.getActiveSpan(); | ||
| if (span) { | ||
| return span; | ||
| if (isTracingEnabled()) { |
There was a problem hiding this comment.
Note that this could just as well call isGlobalTracingEnabled directly, as it does not supply any options. So the more specific check whether a tracer has been set on any options is always skipped. Is that intentional?
Also, calling trace.getActiveSpan() should be an extremely cheap method to call when tracing is disabled, so I am not sure this entire method really optimizes anything.
Optimize RequestId propagation and minimize OpenTelemetry active tracing overhead.