Skip to content

Pre-release hardening: playground TS bootstrap telemetry and worker exit on uncaught#646

Merged
tnaum-ms merged 4 commits into
nextfrom
dev/tnaum/pre-release-fixes
May 13, 2026
Merged

Pre-release hardening: playground TS bootstrap telemetry and worker exit on uncaught#646
tnaum-ms merged 4 commits into
nextfrom
dev/tnaum/pre-release-fixes

Conversation

@tnaum-ms
Copy link
Copy Markdown
Collaborator

Three pre-release hardening fixes identified by the 0.8.0 release audit. All are scoped to the playground worker bootstrap path; nothing in this PR changes connectivity, data handling, or public APIs.

Changes

fix(playground): instrument and retry TS plugin bootstrap (F-01, F-02)

ClustersExtension.ensureTsRestart previously swallowed all errors into outputChannel.debug and set tsRestarted = true before doing any async work, so a single failure on first playground open left IntelliSense broken for the rest of the session with no signal to operators.

  • Wraps the bootstrap in callWithTelemetryAndErrorHandling('playground.tsPluginBootstrap') so result, error name, and errorMessage are captured automatically.
  • Adds a stage property (stubInstall, tsActivate, tsWait, tsRestart, complete) and stubCreated / tsExtensionFound so we can tell what failed without parsing messages.
  • Resets tsRestarted on failure so the next playground open retries. This matters on read-only extension installs (Snap, system package, locked enterprise) where the FS write can fail transiently.

fix(playground): exit worker after uncaughtException (F-04)

playgroundWorker.ts previously reported the exception to the main thread and stayed alive with a potentially corrupted mongoClient / shellRuntime. Now the worker schedules process.exit(1) after a 50 ms delay to let the IPC error flush. WorkerSessionManager already rejects pending requests, emits worker.unexpectedExit telemetry, and respawns on the next eval, so the path is already well covered downstream.

What this PR does not do

  • Does not change the long-term plan to publish the TS plugin as a standalone npm package (WIP: Publish TS server plugin as standalone npm package #548).
  • Does not change the 2 s magic sleep before typescript.restartTsServer; the new telemetry will tell us if that timing is actually a problem in the field.
  • Does not address sovereign-cloud OIDC ALLOWED_HOSTS (tracked separately).

Follow-up issues

Filed against the remaining audit findings:

Test plan

  • Lint and tests should pass unchanged; this is a pure logic refactor of two narrow code paths.
  • Manual: open a .documentdb.js playground; verify the bootstrap telemetry event fires once with result=Succeeded. Force a failure (chmod 0555 on the extension node_modules/) and verify result=Failed is emitted with stage=stubInstall and that opening a second playground retries.
  • Manual: throw inside a playground eval (for example process.nextTick(() => { throw new Error('boom'); })); verify the worker exits, the error surfaces in the result panel, and the next Run respawns a fresh worker.

tnaum-ms added 2 commits May 13, 2026 10:28
The runtime node_modules stub install + TS server restart used to fail
silently into outputChannel.debug. On read-only extension installs
(Snap, system package, locked enterprise, some container setups) the
stub install throws EACCES/EROFS and playground IntelliSense breaks
with no signal to operators and no possibility to recover during the
session.

Changes:
- Wrap ensureTsRestart in callWithTelemetryAndErrorHandling so failures
  are reported via the standard playground.tsPluginBootstrap event with
  result, error name, and errorMessage captured automatically.
- Add a stage property (stubInstall, tsActivate, tsWait, tsRestart,
  complete) so we can tell which phase failed without parsing error
  messages.
- Track stubCreated and tsExtensionFound for additional context.
- Reset tsRestarted on failure so opening another playground retries
  the bootstrap instead of being stuck for the rest of the session.

No additional logic change to the 2-second wait before
typescript.restartTsServer; that magic delay is tracked separately for
a future revisit (see follow-up issue).
Previously the worker reported the error to the main thread and stayed
alive. mongoClient and shellRuntime could be in a corrupted state, so a
subsequent eval might silently misbehave on the same connection.

Now the worker schedules process.exit(1) after a short delay (to allow
the IPC error message to flush). WorkerSessionManager already handles
unexpected worker exits: it rejects pending requests with a meaningful
error and emits worker.unexpectedExit telemetry. The next eval call on
the same cluster will respawn a fresh worker via ensureWorker.
@tnaum-ms tnaum-ms requested a review from a team as a code owner May 13, 2026 08:36
Copilot AI review requested due to automatic review settings May 13, 2026 08:36
tnaum-ms added 2 commits May 13, 2026 10:40
…-01)

stubCreated=false was ambiguous: the stub may have already existed, or
the stub-install branch may never have been reached (for example if an
error fired before that block). Add an explicit stubExisted property so
the combinations are unambiguous:

  stubCreated=true,  stubExisted=false  -> first-time install on this profile
  stubCreated=false, stubExisted=true   -> stub already present, no work
  stubCreated=false, stubExisted=false  -> bootstrap never reached the stub step

Pure telemetry refinement; no behavior change.
The uncaughtException handler was updated to schedule process.exit(1)
so the supervisor can respawn a clean worker (F-04). The companion
unhandledRejection handler was left as log-only, which means a late
promise rejection that escapes all eval-scoped catch blocks can leave
the worker running with potentially the same kind of corrupted state
F-04 set out to prevent.

Apply the same treatment symmetrically:
- If an eval is in flight, forward the rejection as an evalError so the
  user sees a meaningful failure instead of 'Worker exited unexpectedly'.
- Log the message and stack.
- Schedule process.exit(1) with the same 50 ms IPC-flush delay used by
  uncaughtException. WorkerSessionManager already handles the resulting
  unexpected exit and emits worker.unexpectedExit telemetry.

Pushes us closer to the 'crash early, respawn fresh' invariant for the
playground worker.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the Query Playground bootstrap path by adding telemetry/observability around TypeScript plugin initialization and ensuring the playground worker terminates after an uncaught exception so it can be respawned cleanly by the supervisor.

Changes:

  • Wrap TS plugin bootstrap in callWithTelemetryAndErrorHandling('playground.tsPluginBootstrap'), adding staged telemetry properties and resetting the retry flag on failure.
  • On uncaughtException in the playground worker, post the error back to the main thread and then schedule a non-graceful exit to force a clean respawn on the next eval.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/documentdb/playground/playgroundWorker.ts Ensures the worker exits shortly after reporting an uncaught exception to avoid continuing in a potentially corrupted state.
src/documentdb/ClustersExtension.ts Adds staged telemetry and retry behavior for TS plugin bootstrap, improving diagnosability and resilience on read-only installs.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Code Quality Checks

Check Status How to fix
Localization (l10n) ✅ Passed
ESLint ✅ Passed
Prettier formatting ✅ Passed

This comment is updated automatically on each push.

@github-actions
Copy link
Copy Markdown
Contributor

📦 Build Size Report

Metric Base (next) PR Delta
VSIX (vscode-documentdb-0.8.0.vsix) 7.51 MB 7.51 MB ⬆️ +0 KB (+0.0%)
Webview bundle (views.js) 5.79 MB 5.79 MB ✅ 0 KB (0.0%)

Download artifact · updated automatically on each push.

@tnaum-ms tnaum-ms merged commit 4a4278a into next May 13, 2026
8 checks passed
@tnaum-ms tnaum-ms deleted the dev/tnaum/pre-release-fixes branch May 13, 2026 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants