Skip to content

feat: add telemetry service infrastructure (Phase A core)#919

Merged
EhabY merged 8 commits into
mainfrom
telemetry-phase-a-core
May 4, 2026
Merged

feat: add telemetry service infrastructure (Phase A core)#919
EhabY merged 8 commits into
mainfrom
telemetry-phase-a-core

Conversation

@EhabY
Copy link
Copy Markdown
Collaborator

@EhabY EhabY commented Apr 27, 2026

Lays the spine for Phase A telemetry per the RFC. No real sinks ship in this PR. LocalJsonlSink and the instrumentation sites land in follow-up tickets.

What this adds:

  • TelemetryEvent, TelemetryContext, SessionContext, and TelemetrySink interfaces in src/telemetry/event.ts, plus buildSession and buildErrorBlock helpers
  • TelemetryService with log/logError/time/trace/setDeploymentUrl/dispose, including auto-injected context (extension version, machine/session ID, OS, host, platform, deployment URL) and a monotonic eventSequence. time/trace/Span.phase accept caller-supplied measurements alongside the framework-set durationMs.
  • Span correlation handle (src/telemetry/span.ts): every timed event carries a traceId; phase events also carry parentEventId. Phase eventNames compose hierarchically (e.g. remote.setup.workspace_lookup). Span.phase callbacks receive a child Span, so grandchildren are just .phase() recursion. Maps directly to OTel trace_id/span_id/parent_span_id for OTLP/Traces export.
  • coder.telemetry.level setting with values off and local (default local), tagged so it surfaces under VS Code's Telemetry settings section. deployment is reserved for Phase B.
  • ServiceContainer registers the service; DeploymentManager calls setDeploymentUrl from setDeployment and resets it from clearDeployment.

Design choices worth flagging:

  • Sinks are static and self-gating. Sinks are constructed once at activation. The service holds a global kill switch on coder.telemetry.level === "off". Each sink decides whether to emit based on its own gating signal, which makes Phase B (CoderServerSink gated on deployment) and Phase C (external sink gated on VS Code's built-in telemetry.telemetryLevel) drop in without service changes.
  • Reactive level changes. watchConfigurationChanges passes a Map<setting, newValue> to its callback so consumers don't re-read the config. Flipping to off flushes all sinks before the kill switch trips, so in-flight buffered events are not lost. The new value is validated by an isTelemetryLevel type guard.
  • Errors never reach callers. All emission goes through #safeEmit, which catches and logs any telemetry-internal failure via the regular logger. Per-sink try/catch around sink.write isolates write failures further. A throwing sink (or any other telemetry bug) cannot affect the user fn's return value or its error.
  • Phase name validation. Phase names containing . collide with the hierarchical path separator; the producer warns and sanitizes (replaces . with _), never throws — a typo in instrumentation cannot break a feature.
  • Readonly event shape. TelemetryEvent and its nested records are readonly at the type level so a sink can't mutate the event in TypeScript without an explicit as cast.
  • dispose() returns Promise<void>. VS Code does not await deactivation, so flushes during shutdown are best-effort.

Closes #900

@EhabY EhabY self-assigned this Apr 27, 2026
@EhabY EhabY force-pushed the telemetry-phase-a-core branch 3 times, most recently from 6731a38 to a996d0a Compare April 28, 2026 12:44
Comment thread src/telemetry/service.ts Outdated
Comment thread src/telemetry/service.ts Outdated
Comment thread src/core/container.ts
Comment thread src/core/container.ts Outdated
Comment thread package.json Outdated
Comment thread src/telemetry/trace.ts Outdated
Comment thread src/telemetry/event.ts
Comment thread src/core/container.ts
Comment thread src/deployment/deploymentManager.ts
Comment thread src/telemetry/event.ts
@EhabY EhabY requested a review from mtojek April 28, 2026 17:05
Copy link
Copy Markdown
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ehab! I'm good, but you can try codex agents against it too :)

Comment thread src/telemetry/event.ts
Comment thread src/telemetry/trace.ts Outdated
Comment thread package.json Outdated
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented Apr 29, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-layered telemetry spine. The sink interface contract, per-sink error isolation, and reactive level gating are well-designed. Test coverage is thorough at ~1.4:1 test:source ratio, with meaningful assertions on event shape, trace correlation, and error rethrow identity. The configWatcher API upgrade to ReadonlyMap is a genuine improvement that benefits all consumers.

6 P2, 3 P3, 2 Nit.

The P2s cluster around two themes: (1) shutdown ordering and event object safety in the fan-out loop, both of which become real bugs the moment sinks ship; (2) test gaps where mocks compile but assertions are missing, and one test that passes regardless of whether the code it claims to test is correct.

The default-on telemetry concern (DEREM-5) was already raised by @mtojek in the PR comments. That's a product decision that needs a human call before sinks land.

"I tried to build a case that the premises are wrong and could not." (Pariston)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/core/container.ts Outdated
Comment thread src/telemetry/service.ts
Comment thread src/telemetry/emit.ts Outdated
Comment thread test/unit/telemetry/service.test.ts
Comment thread package.json Outdated
Comment thread src/telemetry/event.ts Outdated
Comment thread src/telemetry/service.ts Outdated
Comment thread src/telemetry/emit.ts Outdated
Comment thread src/telemetry/event.ts Outdated
Comment thread src/telemetry/event.ts Outdated
@EhabY EhabY force-pushed the telemetry-phase-a-core branch from 73c603f to dd65ffb Compare April 29, 2026 16:12
EhabY added 5 commits May 4, 2026 13:31
Lays the spine for Phase A telemetry per the RFC. Defines the
TelemetryEvent, TelemetryContext, and TelemetrySink interfaces, adds
the TelemetryService with log/logError/time/trace, the Trace
correlation handle, the coder.telemetry.level setting (off, local;
default local), and ServiceContainer plus DeploymentManager wiring.

No real sinks ship in this PR. LocalJsonlSink and the instrumentation
sites land in follow-up tickets.

Sinks are static and self-gating: the service holds a kill switch on
coder.telemetry.level === "off", and each sink decides whether to emit
based on its own gating signal. The setting is reactive via the
existing watchConfigurationChanges helper, which now passes the
changed values along so consumers don't have to re-read the config.

Refs #900
event.ts:
- Use raw vscode.env.appName for platformType to keep granularity
  (Visual Studio Code, Visual Studio Code - Insiders, Cursor, etc.).
- Use 'unknown' sentinel for extensionVersion when packageJSON.version
  is missing or non-string, instead of misleading empty string.
- Drop overly defensive ?? '' on platformVersion (vscode.version is
  always defined). Add a comment on the win32 -> windows mapping.

service.ts:
- Inline noopEmit into the NOOP_TRACE constructor.
- Convert coerceLevel to a switch so future Phase B 'deployment' lands
  cleanly.
- Have the watcher callback coerce the new value once and pass a
  TelemetryLevel directly to #applyLevelChange.
- Extract a #safeCall(sink, action) helper that dedupes the catch+log
  pattern shared by dispose and the level-off flush.

trace.ts:
- Dedupe emitTimed's two emit calls via a small send closure.

Tests:
- event.test.ts: cover raw appName granularity (Insiders, Cursor) and
  the 'unknown' sentinel.
- trace.test.ts: drop the JS-semantics test that wasn't really about
  Trace.
- service.test.ts: combine redundant log and level-off cases, replace
  vi.fn 'was called' assertions with return-value assertions where
  the wrapped function ran.
…riptions

Structural changes:
- TelemetryContext now extends a readonly SessionContext. The service
  stores #session as readonly and tracks #deploymentUrl separately,
  combining them at emit time.
- EmitFn and emitTimed move to a new emit.ts so the timed-emit pattern
  has a clear home (was previously in trace.ts alongside the Trace
  class). trace.ts now contains only the Trace class.
- Watcher's getValue calls readLevel directly so the watcher fires only
  on effective level changes (e.g., toggling between unknown values
  that all coerce to local does not fire).

Tests:
- buildSession (renamed from buildContext) tests no longer mutate
  vscode.env.appName since platformType is now a raw passthrough.
- trace.test.ts imports EmitFn from emit.ts.
- Drop the unused logger field from the service test harness.

package.json:
- Tighten the coder.telemetry.level descriptions.
- Phase events compose hierarchical eventNames; new parentEventId
  encodes parent/child unambiguously under retries. time also carries
  a traceId so every timed event maps to an OTel span.
- Trace -> Span; phase callback receives a child Span, so grandchildren
  are just recursion. time/trace/phase accept caller measurements.
- All emission routed through #safeEmit so telemetry failures never
  reach callers; per-sink try/catch isolates writes.
- Address review fixes: logger.dispose last, readonly fields, isTelemetryLevel
  guard, platformType -> platformName, telemetry tag on the setting,
  setDeploymentUrl test assertions, reactive-level test fix.
@EhabY EhabY force-pushed the telemetry-phase-a-core branch from 7749dee to 3ab123c Compare May 4, 2026 10:32
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 4, 2026

/coder-agents-review

Remove coder.telemetry.level from package.json until a sink ships.
The service still reads it internally and defaults to local.
Drop time() since it just delegated to trace() with a mismatched signature.
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong response to R1. 9 of 11 findings fixed, DEREM-5 (default opt-in) evaluated and closed by the panel (3/4 accept the author's defense: local-only data is outside VS Code's off-device telemetry guidance, setting tagged "telemetry" for discoverability). The Trace→Span refactor, #safeEmit catch-all, readonly event types, and isTelemetryLevel guard are all clean improvements.

4 P3, 2 Nit new.

The main R2 finding is that DEREM-1's fix (logger dispose ordering) is incomplete: void this.telemetryService.dispose() is fire-and-forget, so the async flush still runs after logger.dispose() despite the comment claiming otherwise. Inert today (no sinks), but the comment asserts a guarantee the async execution model does not provide.

"The code and async execution model show this model is incorrect." (Pariston, on the DEREM-1 fix)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/telemetry/span.ts Outdated
Comment thread test/unit/telemetry/service.test.ts
Comment thread src/telemetry/service.ts
Comment thread src/telemetry/span.ts Outdated
Comment thread src/telemetry/service.ts Outdated
Reserve result and durationMs as framework-managed keys so callers cannot
pass them and have them silently overwritten. ServiceContainer.dispose()
is async and awaits telemetry teardown so the logger survives any sink
flush warnings.
@EhabY EhabY force-pushed the telemetry-phase-a-core branch from 401e03b to 5b9184b Compare May 4, 2026 11:34
- NOOP_SPAN.phase now passes NOOP_SPAN explicitly so detaching the method
  cannot strip the receiver
- Trace level is captured at trace start and threaded through SpanOptions
  so a mid-trace level toggle no longer drops phases or the parent and
  leaves dangling parentEventId references
- Pull a SpanOptions type out of EmitOptions so #startSpan and #emitTimed
  share the trace-context shape
- Soften the phase-name docstring to match the sanitize-with-warning behavior
@EhabY EhabY merged commit 06327d0 into main May 4, 2026
10 checks passed
@EhabY EhabY deleted the telemetry-phase-a-core branch May 4, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telemetry: core service, Trace, and ServiceContainer wiring

2 participants