feat: add telemetry service infrastructure (Phase A core)#919
Conversation
6731a38 to
a996d0a
Compare
mtojek
left a comment
There was a problem hiding this comment.
Thanks Ehab! I'm good, but you can try codex agents against it too :)
|
/coder-agents-review |
There was a problem hiding this comment.
Clean, well-layered telemetry spine. The sink interface contract, per-sink error isolation, and reactive level gating are well-designed. Test coverage is thorough at ~1.4:1 test:source ratio, with meaningful assertions on event shape, trace correlation, and error rethrow identity. The configWatcher API upgrade to ReadonlyMap is a genuine improvement that benefits all consumers.
6 P2, 3 P3, 2 Nit.
The P2s cluster around two themes: (1) shutdown ordering and event object safety in the fan-out loop, both of which become real bugs the moment sinks ship; (2) test gaps where mocks compile but assertions are missing, and one test that passes regardless of whether the code it claims to test is correct.
The default-on telemetry concern (DEREM-5) was already raised by @mtojek in the PR comments. That's a product decision that needs a human call before sinks land.
"I tried to build a case that the premises are wrong and could not." (Pariston)
🤖 This review was automatically generated with Coder Agents.
73c603f to
dd65ffb
Compare
Lays the spine for Phase A telemetry per the RFC. Defines the TelemetryEvent, TelemetryContext, and TelemetrySink interfaces, adds the TelemetryService with log/logError/time/trace, the Trace correlation handle, the coder.telemetry.level setting (off, local; default local), and ServiceContainer plus DeploymentManager wiring. No real sinks ship in this PR. LocalJsonlSink and the instrumentation sites land in follow-up tickets. Sinks are static and self-gating: the service holds a kill switch on coder.telemetry.level === "off", and each sink decides whether to emit based on its own gating signal. The setting is reactive via the existing watchConfigurationChanges helper, which now passes the changed values along so consumers don't have to re-read the config. Refs #900
event.ts: - Use raw vscode.env.appName for platformType to keep granularity (Visual Studio Code, Visual Studio Code - Insiders, Cursor, etc.). - Use 'unknown' sentinel for extensionVersion when packageJSON.version is missing or non-string, instead of misleading empty string. - Drop overly defensive ?? '' on platformVersion (vscode.version is always defined). Add a comment on the win32 -> windows mapping. service.ts: - Inline noopEmit into the NOOP_TRACE constructor. - Convert coerceLevel to a switch so future Phase B 'deployment' lands cleanly. - Have the watcher callback coerce the new value once and pass a TelemetryLevel directly to #applyLevelChange. - Extract a #safeCall(sink, action) helper that dedupes the catch+log pattern shared by dispose and the level-off flush. trace.ts: - Dedupe emitTimed's two emit calls via a small send closure. Tests: - event.test.ts: cover raw appName granularity (Insiders, Cursor) and the 'unknown' sentinel. - trace.test.ts: drop the JS-semantics test that wasn't really about Trace. - service.test.ts: combine redundant log and level-off cases, replace vi.fn 'was called' assertions with return-value assertions where the wrapped function ran.
…riptions Structural changes: - TelemetryContext now extends a readonly SessionContext. The service stores #session as readonly and tracks #deploymentUrl separately, combining them at emit time. - EmitFn and emitTimed move to a new emit.ts so the timed-emit pattern has a clear home (was previously in trace.ts alongside the Trace class). trace.ts now contains only the Trace class. - Watcher's getValue calls readLevel directly so the watcher fires only on effective level changes (e.g., toggling between unknown values that all coerce to local does not fire). Tests: - buildSession (renamed from buildContext) tests no longer mutate vscode.env.appName since platformType is now a raw passthrough. - trace.test.ts imports EmitFn from emit.ts. - Drop the unused logger field from the service test harness. package.json: - Tighten the coder.telemetry.level descriptions.
- Phase events compose hierarchical eventNames; new parentEventId encodes parent/child unambiguously under retries. time also carries a traceId so every timed event maps to an OTel span. - Trace -> Span; phase callback receives a child Span, so grandchildren are just recursion. time/trace/phase accept caller measurements. - All emission routed through #safeEmit so telemetry failures never reach callers; per-sink try/catch isolates writes. - Address review fixes: logger.dispose last, readonly fields, isTelemetryLevel guard, platformType -> platformName, telemetry tag on the setting, setDeploymentUrl test assertions, reactive-level test fix.
7749dee to
3ab123c
Compare
|
/coder-agents-review |
Remove coder.telemetry.level from package.json until a sink ships. The service still reads it internally and defaults to local. Drop time() since it just delegated to trace() with a mismatched signature.
There was a problem hiding this comment.
Strong response to R1. 9 of 11 findings fixed, DEREM-5 (default opt-in) evaluated and closed by the panel (3/4 accept the author's defense: local-only data is outside VS Code's off-device telemetry guidance, setting tagged "telemetry" for discoverability). The Trace→Span refactor, #safeEmit catch-all, readonly event types, and isTelemetryLevel guard are all clean improvements.
4 P3, 2 Nit new.
The main R2 finding is that DEREM-1's fix (logger dispose ordering) is incomplete: void this.telemetryService.dispose() is fire-and-forget, so the async flush still runs after logger.dispose() despite the comment claiming otherwise. Inert today (no sinks), but the comment asserts a guarantee the async execution model does not provide.
"The code and async execution model show this model is incorrect." (Pariston, on the DEREM-1 fix)
🤖 This review was automatically generated with Coder Agents.
Reserve result and durationMs as framework-managed keys so callers cannot pass them and have them silently overwritten. ServiceContainer.dispose() is async and awaits telemetry teardown so the logger survives any sink flush warnings.
401e03b to
5b9184b
Compare
- NOOP_SPAN.phase now passes NOOP_SPAN explicitly so detaching the method cannot strip the receiver - Trace level is captured at trace start and threaded through SpanOptions so a mid-trace level toggle no longer drops phases or the parent and leaves dangling parentEventId references - Pull a SpanOptions type out of EmitOptions so #startSpan and #emitTimed share the trace-context shape - Soften the phase-name docstring to match the sanitize-with-warning behavior
Lays the spine for Phase A telemetry per the RFC. No real sinks ship in this PR.
LocalJsonlSinkand the instrumentation sites land in follow-up tickets.What this adds:
TelemetryEvent,TelemetryContext,SessionContext, andTelemetrySinkinterfaces insrc/telemetry/event.ts, plusbuildSessionandbuildErrorBlockhelpersTelemetryServicewithlog/logError/time/trace/setDeploymentUrl/dispose, including auto-injected context (extension version, machine/session ID, OS, host, platform, deployment URL) and a monotoniceventSequence.time/trace/Span.phaseaccept caller-suppliedmeasurementsalongside the framework-setdurationMs.Spancorrelation handle (src/telemetry/span.ts): every timed event carries atraceId; phase events also carryparentEventId. Phase eventNames compose hierarchically (e.g.remote.setup.workspace_lookup).Span.phasecallbacks receive a childSpan, so grandchildren are just.phase()recursion. Maps directly to OTeltrace_id/span_id/parent_span_idfor OTLP/Traces export.coder.telemetry.levelsetting with valuesoffandlocal(defaultlocal), tagged so it surfaces under VS Code's Telemetry settings section.deploymentis reserved for Phase B.ServiceContainerregisters the service;DeploymentManagercallssetDeploymentUrlfromsetDeploymentand resets it fromclearDeployment.Design choices worth flagging:
coder.telemetry.level === "off". Each sink decides whether to emit based on its own gating signal, which makes Phase B (CoderServerSinkgated ondeployment) and Phase C (external sink gated on VS Code's built-intelemetry.telemetryLevel) drop in without service changes.watchConfigurationChangespasses aMap<setting, newValue>to its callback so consumers don't re-read the config. Flipping tooffflushes all sinks before the kill switch trips, so in-flight buffered events are not lost. The new value is validated by anisTelemetryLeveltype guard.#safeEmit, which catches and logs any telemetry-internal failure via the regular logger. Per-sinktry/catcharoundsink.writeisolates write failures further. A throwing sink (or any other telemetry bug) cannot affect the user fn's return value or its error..collide with the hierarchical path separator; the producer warns and sanitizes (replaces.with_), never throws — a typo in instrumentation cannot break a feature.TelemetryEventand its nested records arereadonlyat the type level so a sink can't mutate the event in TypeScript without an explicitascast.dispose()returnsPromise<void>. VS Code does not await deactivation, so flushes during shutdown are best-effort.Closes #900