Snapshot Runtime: QuickJS WASM VM with snapshot/restore for workflow execution #1300
TooTallNate wants to merge 138 commits into
…refix
Start of the serialization refactor (separate from snapshot-runtime).
New files:
- serialization/types.ts: SerializationFormat enum, SerializableSpecial interface, Reducers/Revivers types
- serialization/codec.ts: Codec interface with formatPrefix, serialize, deserialize, and optional deserializeLegacy
- serialization/format.ts: format prefix encode/decode/peek, moved from the monolithic serialization.ts
The Codec interface enables future alternative formats (CBOR, JSON) while keeping the devalue implementation as the current default.
Serialization refactor Phase 1: create the new module structure alongside the existing monolithic serialization.ts (which continues to work).
New files:
- serialization/reducers/common.ts: Date, Error, Map, Set, URL, BigInt, typed arrays, Headers, Request, Response, RegExp, URLSearchParams
- serialization/reducers/class.ts: Class/Instance with WORKFLOW_SERIALIZE/DESERIALIZE support
- serialization/reducers/step-function.ts: StepFunction with closure vars
- serialization/codec-devalue.ts: devalue Codec implementation
- serialization/encryption.ts: composable encrypt/decrypt layer
- serialization/workflow.ts: synchronous, no encryption, for VM use
- serialization/step.ts: async with encryption, for step handler
- serialization/client.ts: async with encryption, for start() API
- serialization/index.ts: re-exports all public API
- serialization/serialization.test.ts: 25 focused tests
All modes compose their reducer/reviver sets from the shared building blocks. Cross-mode compatibility verified: data serialized in any mode can be deserialized in any other mode (for common types). Existing 108 serialization tests continue to pass unchanged.
- Add ./serialization/workflow export to @workflow/core package.json
- Add ./internal/serialization re-export to workflow meta-package
- The workflow bundle can now import serialize/deserialize via:
import { serialize, deserialize } from 'workflow/internal/serialization'
Full test suite passes: 493 tests across 22 files (including 25 new
serialization module tests).
1. Fix reducer composition order: Class/Instance reducers now come BEFORE common reducers in all three modes (workflow, step, client). This ensures custom Error subclasses with WORKFLOW_SERIALIZE are handled by the Instance reducer before the generic Error reducer (devalue uses first-match-wins semantics).
2. Fix encryption decrypt() to fail fast when encrypted data is encountered without a decryption key, instead of silently returning encrypted bytes that would fail later with an unhelpful format error.
3. Remove Request/Response from common reducers: they don't have matching common revivers, so including them caused asymmetric behavior (serialize as Request, deserialize as plain object). Request/Response handling belongs in mode-specific modules that can provide proper revivers.
4. Document Node.js dependency in the workflow serialization re-export. The current implementation uses node:util and Buffer. For the QuickJS VM (snapshot runtime), these will need polyfills (tracked separately).
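The ordering fix in item 1 can be illustrated with a small stand-in for ordered reducer matching. This is only a sketch: `PaymentError`, `reduceFirstMatch`, and the reducer bodies are hypothetical names for illustration, and the loop is a simplified model of first-match-wins, not devalue's actual code.

```typescript
// A reducer claims a value by returning a truthy payload (hypothetical shape).
type Reducer = (value: unknown) => unknown;

// Hypothetical custom Error subclass carrying extra state.
class PaymentError extends Error {
  code: string;
  constructor(message: string, code: string) {
    super(message);
    // Keep instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.code = code;
  }
}

const instanceReducers: Record<string, Reducer> = {
  Instance: (v) =>
    v instanceof PaymentError ? { message: v.message, code: v.code } : undefined,
};

const commonReducers: Record<string, Reducer> = {
  Error: (v) => (v instanceof Error ? { message: v.message } : undefined),
};

// Simplified first-match-wins: walk reducers in insertion order and
// return the name of the first one that claims the value.
function reduceFirstMatch(
  value: unknown,
  reducers: Record<string, Reducer>,
): string | null {
  for (const name of Object.keys(reducers)) {
    const fn = reducers[name];
    if (fn && fn(value)) return name;
  }
  return null;
}

const err = new PaymentError('card declined', 'E_DECLINED');
// Instance reducers first: the subclass keeps its `code`.
const fixedOrder = reduceFirstMatch(err, { ...instanceReducers, ...commonReducers });
// Common reducers first: the generic Error reducer wins and `code` is lost.
const brokenOrder = reduceFirstMatch(err, { ...commonReducers, ...instanceReducers });
```

With the Instance reducer listed first the subclass is matched by the more specific reducer; with the old order the generic Error reducer shadows it.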
The Codec interface now takes a SerializationMode ('workflow', 'step',
'client') instead of raw reducers/revivers. The reducer/reviver
composition is internal to the devalue codec implementation.
This is the right abstraction because reducers/revivers are devalue-
specific concepts. A future CBOR codec would handle Date, typed arrays,
Map, Set natively via the CBOR type system — it wouldn't use reducers
at all. A JSON codec would only support standard JSON types.
The mode-specific modules (workflow.ts, step.ts, client.ts) are now
simpler — they just pass the mode string to the codec.
The format prefix is now a branded string type validated by
isFormatPrefix() — any 4-character [a-z0-9] string is valid.
This removes the hard-coded enum of known formats, making the system
truly open for extension:
type FormatPrefix = string & { __brand: 'FormatPrefix' };
function isFormatPrefix(value: string): value is FormatPrefix;
The SerializationFormat object still provides well-known constants
('devl', 'encr') but they're now just typed constants, not an
exhaustive enum.
peekFormatPrefix() and decodeFormatPrefix() use isFormatPrefix() for
validation instead of checking against a known list. Unknown but valid
prefixes (e.g. 'cbor', 'json', 'v2b1') are accepted — the caller
decides whether they can handle the format.
6 new isFormatPrefix tests covering: valid strings, too short, too long,
uppercase, special characters. 1 new test for unknown-but-valid prefixes.
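A minimal sketch of the validation rule described above (any 4-character `[a-z0-9]` string). The type and the `isFormatPrefix` signature come from the commit message; the function bodies and the exact shape of `peekFormatPrefix` are assumptions made for illustration.

```typescript
type FormatPrefix = string & { __brand: 'FormatPrefix' };

function isFormatPrefix(value: string): value is FormatPrefix {
  // Exactly four characters, lowercase ASCII letters or digits only.
  return /^[a-z0-9]{4}$/.test(value);
}

// Assumed behavior: read the first four bytes as ASCII and return the
// prefix if it is well-formed, or null otherwise. Whether the caller can
// actually decode that format is a separate decision.
function peekFormatPrefix(data: Uint8Array): FormatPrefix | null {
  if (data.length < 4) return null;
  let prefix = '';
  for (let i = 0; i < 4; i++) {
    prefix += String.fromCharCode(data[i] ?? 0);
  }
  return isFormatPrefix(prefix) ? prefix : null;
}
```

Under this rule `'devl'`, `'cbor'`, and `'v2b1'` are all accepted, while `'DEVL'`, `'dev'`, and `'de-l'` are rejected.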
Proves that data serialized by the new modules can be deserialized by the old serialization.ts functions, and vice versa. This validates that the new modules are wire-format compatible and safe for incremental migration:
- new workflow.serialize → old hydrateStepReturnValue (primitives, Date, Map, nested)
- old dehydrateStepReturnValue → new workflow.deserialize (primitives, Date, nested)
- old dehydrateWorkflowArguments → new workflow.deserialize
- new client.serialize → old hydrateWorkflowArguments
- new step.serialize + encryption → old hydrateStepArguments + decryption
- old dehydrateStepArguments + encryption → new step.deserialize + decryption
All 11 tests pass, confirming the new and old modules produce identical wire formats and can coexist during the migration.
Phase 1 of the VM snapshot runtime (RFC #1298).
World interface changes (packages/world):
- Add SnapshotMetadata type (lastEventId, createdAt) with zod schema
- Add snapshots sub-interface to Storage: save(), load(), delete()
- Export new types and schema from @workflow/world
world-local implementation (packages/world-local):
- Filesystem-based snapshot storage in {dataDir}/snapshots/
  - {runId}.bin for serialized VM snapshot data
  - {runId}.json for metadata (lastEventId, createdAt)
- save() overwrites existing snapshots (atomic via ensureDir + write)
- load() returns null if no snapshot exists
- delete() removes both files
- Wired into createStorage() with tracing instrumentation
Phase 2 of the VM snapshot runtime (RFC #1298).
- Add quickjs-wasi dependency to @workflow/core
- Create snapshot-runtime.ts with the basic structure:
  - runSnapshotWorkflow() entry point
  - Fresh VM creation with deterministic WASI clock and seeded Math.random
  - Snapshot restore path (TODO: event processing)
  - Host function stubs for useStep, sleep, createHook via Symbol.for()
  - Interrupt handler (30s timeout)
  - Memory limit (64MB)
  - Snapshot serialization on suspension
The useStep, sleep, and createHook host functions are stubs with TODO markers; the basic VM lifecycle and snapshot/restore flow is in place.
Demonstrates the core snapshot/restore mechanism with a compiled workflow pattern:
- useStep implemented inside QuickJS as JS code (not host functions)
- Pending step resolve/reject functions stored on globalThis.__resolvers
- Step metadata (stepId, args) preserved across snapshot/restore
- Multi-step workflow: snapshot at each suspension, restore and resolve, workflow continues from exact suspension point
- Both tests pass: simple workflow + metadata preservation
The snapshot runtime (runSnapshotWorkflow) now handles the complete workflow lifecycle:
- First run: bootstrap VM with workflow primitives, evaluate compiled workflow bundle, start workflow function, process any existing events
- Snapshot: capture VM state when workflow suspends on step/sleep
- Restore: deserialize snapshot, process delta events to resolve/reject pending promises, execute pending jobs
- Completion: detect workflow result or error
Workflow primitives (useStep, sleep) are implemented as JavaScript code inside the QuickJS VM, not as host function callbacks. This keeps the implementation simple: the host communicates by evaluating small JS snippets to resolve/reject promises.
7 tests covering: simple completion, step suspension, snapshot/restore with step completion, multi-step across 3 snapshots, sleep suspension and wake, step failure with try/catch.
…napshot flag
- Add snapshot-entrypoint.ts that handles the full lifecycle: snapshot load → event fetching → runSnapshotWorkflow → result handling (create events, queue steps, save/delete snapshots)
- Add feature flag: set WORKFLOW_RUNTIME=snapshot to use the new runtime
- When enabled, the snapshot path runs before the event-replay path
- Step queuing matches the existing step handler's expected payload format
- Wait handling includes timeout calculation for delayed re-queuing
- Extract workflow ID from SWC-compiled bundle's manifest comment
The snapshot runtime now successfully:
1. Evaluates the compiled workflow bundle in QuickJS
2. Suspends on the first step call
3. Snapshots the VM state
4. Creates step_created events and queues step execution
Web API stubs added for TransformStream, ReadableStream, WritableStream, TextEncoder, TextDecoder, Headers, URL, console. These are referenced by the compiled bundle but not needed for basic step/sleep workflows.
Remaining issue: step_created events use raw JSON for step input args, but the step handler expects devalue-serialized data. This is the data serialization boundary that needs to be resolved (RFC #1298 discusses moving devalue inside the QuickJS VM).
…untime
The step_created events now contain properly devalue-serialized input data (Uint8Array with 'devl' format prefix) instead of raw JSON. This makes the step handler's hydrateStepArguments() work correctly.
When processing step_completed events, the output is deserialized via workflow.deserialize() on the host side before passing to the QuickJS VM as JSON. This handles the devalue format prefix correctly.
Also properly serializes the run_completed output.
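The pending-promise pattern behind `useStep` can be sketched in plain TypeScript, without a VM. This is an illustration under assumptions: the `pending` map stands in for `globalThis.__resolvers`, `resolveStep` stands in for the small JS snippet the host evaluates, and the `step_N` correlation IDs are a simplified counter. The real runtime additionally snapshots the QuickJS heap between the suspension and the resolution.

```typescript
type PendingStep = {
  stepId: string;
  args: unknown[];
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
};

// Stand-in for globalThis.__resolvers inside the VM.
const pending = new Map<string, PendingStep>();
let nextCorrelation = 0;

// Calling the returned function records the step and suspends the caller
// on a promise that only the "host" can settle.
function useStep(stepId: string): (...args: unknown[]) => Promise<unknown> {
  return (...args: unknown[]) =>
    new Promise<unknown>((resolve, reject) => {
      const correlationId = `step_${nextCorrelation++}`;
      pending.set(correlationId, { stepId, args, resolve, reject });
    });
}

// What the host does when a step_completed event arrives.
function resolveStep(correlationId: string, value: unknown): void {
  const entry = pending.get(correlationId);
  if (!entry) throw new Error(`no pending step ${correlationId}`);
  pending.delete(correlationId);
  entry.resolve(value);
}

// A two-step workflow that suspends twice before completing.
async function workflow(): Promise<number> {
  const add = useStep('add');
  const first = (await add(10, 7)) as number;
  const second = (await add(first, 8)) as number;
  return second;
}
```

Starting `workflow()` leaves `step_0` pending with args `[10, 7]`; resolving it lets the function advance to `step_1`, mirroring how each restore continues from the exact suspension point.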
Step arguments are now wrapped in { args: [...], closureVars?: {...} }
before being serialized with workflow.serialize(), matching the format
expected by the step handler's hydrateStepArguments().
The step handler successfully:
- Receives the step message
- Deserializes the step arguments
- Executes the step function (add(10, 7))
- Handles retry on retryable errors
- Completes the step and re-queues the workflow
New files:
- serialization/base64.ts: pure-JS base64 encode/decode (no Buffer)
- serialization/reducers/common-vm.ts: VM-compatible reducers using instanceof Error instead of types.isNativeError(), pure-JS base64 instead of Buffer
- serialization/codec-devalue-vm.ts: devalue codec using VM reducers
- serialization/workflow-vm.ts: VM workflow serialize/deserialize
The VM serializer produces the EXACT same wire format as the Node.js serializer (devl-prefixed devalue data). Verified by 14 tests including critical cross-compatibility:
- VM serialize → Node.js hydrateStepArguments (step handler path)
- Node.js dehydrateStepReturnValue → VM deserialize (step result path)
- Pure-JS base64 matches Node.js Buffer base64
Sub-path export: @workflow/core/serialization/workflow-vm
Re-export: workflow/internal/serialization now points to workflow-vm
Data now flows as format-prefixed devalue bytes (devl + devalue.stringify)
across the VM boundary, with no JSON conversion in the middle:
Step args: VM __wdk_serialize({args}) → Uint8Array → event input
Step results: event output Uint8Array → VM __wdk_deserialize → value
Workflow result: VM __wdk_serialize(result) → Uint8Array → event output
Host functions __wdk_serialize/__wdk_deserialize are installed on
globalThis and use the VM-compatible workflow serializer (pure JS,
no Node.js deps). They are re-installed after snapshot restore since
host callbacks don't survive the snapshot.
VM-compatible serializer (workflow-vm.ts) produces the EXACT same
wire format as the Node.js serializer — verified by cross-compatibility
tests.
The serializer (devalue + reducers + TextEncoder/TextDecoder polyfills) is now bundled as a 16.6KB IIFE that's evaluated inside the QuickJS VM during bootstrap. The serialize/deserialize functions are real JS functions running inside the VM, operating on QuickJS-native values (Date, Map, Set, etc.) that can't cross the VM boundary via dump().
Architecture:
- vm-bundle-entry.ts is bundled by esbuild into a self-contained IIFE
- esbuild inject option ensures TextEncoder/TextDecoder polyfills run before any module-level code
- The host only passes opaque Uint8Array blobs (devl-prefixed devalue) across the VM boundary
- On snapshot restore, the serde functions survive in the QuickJS heap (no re-registration needed)
New files:
- polyfills/text-encoder.ts: pure JS TextEncoder (from nx.js)
- polyfills/text-decoder.ts: pure JS TextDecoder (from nx.js)
- polyfills/install-text-coding.ts: installs polyfills on globalThis
- serialization/vm-bundle-entry.ts: esbuild entry for VM serde bundle
- runtime/vm-serde-bundle.generated.ts: auto-generated bundle string
- scripts/build-vm-serde-bundle.js: build script (runs during pnpm build)
Removed: installSerdeHostFunctions (no longer needed, since serde is in-VM)
…ecution
The snapshot metadata now stores eventsCursor (the pagination cursor from events.list()) instead of lastEventId (the raw event ID). The world-local pagination expects cursors in 'timestamp|id' format, not raw event IDs.
This fix enables the full workflow lifecycle:
1. First invocation: QuickJS VM evaluates workflow, suspends on step_0
2. Step handler executes add(10, 7) = 17
3. Second invocation: snapshot restored, step_0 resolved, suspends on step_1
4. Step handler executes add(17, 8) = 25
5. Third invocation: snapshot restored, both steps resolved, workflow completes
6. run_completed event created, snapshot cleaned up
Verified end-to-end with the nextjs-turbopack workbench:
- All events created correctly (run_created → run_completed)
- Step retries work (the add function throws on first attempt)
- Snapshots are saved/restored/deleted at correct lifecycle points
- Run status transitions to 'completed'
🦋 Changeset detected
Latest commit: a8776ee
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 21 packages
🧪 E2E Test Results
❌ Some tests failed
Failed Tests
- ▲ Vercel Production (1 failed): fastify-replay (1 failed)
- 💻 Local Development (1 failed): nuxt-stable-snapshot (1 failed)
- 📦 Local Production (1 failed): astro-stable-snapshot (1 failed)
- 🐘 Local Postgres (1 failed): astro-stable-snapshot (1 failed)
Details by Category
- ❌ ▲ Vercel Production
- ❌ 💻 Local Development
- ❌ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
- Extract workflow arguments from run_created event and pass to the workflow function via __wdk_deserialize()
- Call executePendingJobs() after each step_completed/step_failed/wait_completed event to allow async function await resumptions to unwind one step at a time
- Add debug logging for workflow result bytes
The addTenWorkflow e2e test is still failing: the workflow result bytes are 'devl-1' (devalue for undefined) even though all steps complete successfully. The issue appears to be that the async function return value is not propagating through the SWC-compiled workflow bundle's promise chain. This needs investigation, since the unit tests with simple inline workflow code work correctly.
Three coupled changes in the snapshot entrypoint's suspension handler:
1. Build per-pending-op promises and await them with Promise.all instead of running them in a sequential for-loop. Mirrors the replay runtime's suspension-handler.ts pattern.
2. Run snapshot.save concurrently with the op dispatch via the same Promise.all. The snapshot is an optimization: if save lags or fails, the next workflow invocation simply replays from events. Previously, step queueing was blocked on a full storage round-trip.
3. Drop the redundant hooks.list pre-check from the hook_created branch. With deterministic correlationIds (snapshot runtime PRNG fix) and per-(runId, correlationId) uniqueness in worlds (world-local + world-postgres dedup fixes), EntityConflictError on events.create is the correct dedup signal and the pre-check is an unnecessary round-trip per pending hook.
CI run 25095263499 measured snapshot ~2.37x slower than replay per-test on Vercel (sum: 2418s vs 1021s); these changes should narrow that gap considerably on cloud worlds where each storage call is a network round-trip.
Hook-related e2e tests (hookWorkflow, hookCleanupTestWorkflow,
hookDisposeTestWorkflow, hookWithSleepWorkflow, distributedAbortController)
previously slept a fixed 5 seconds before calling getHookByToken to wait
for the hook to be registered. On slower runtimes — notably the snapshot
runtime on Vercel where each workflow round-trip is several seconds longer
than replay — that fixed budget is too tight and the test fails with
HookNotFoundError. On faster runtimes it's unnecessarily slow.
Adds a waitForHook(token, { timeoutMs, intervalMs, runId }) helper that
polls until the hook resolves or the timeout (default 30s) expires, with
an optional runId filter for token-reuse tests where eventually-consistent
backends may briefly still report a stale hook. Each hook-wait site now
uses this helper. Non-hook fixed sleeps (workflow-progress polling for
sleepingWorkflow cancel tests, payload-processing waits in
hookWithSleepWorkflow) are unchanged.
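A polling helper along these lines might look like the sketch below. It is not the PR's implementation: the lookup function is passed in as a parameter so the example stays self-contained (the real helper takes only `token` and options), the `Hook` shape and the 250 ms default interval are assumptions, and every lookup error is treated as "not registered yet".

```typescript
type Hook = { token: string; runId: string };
type GetHookByToken = (token: string) => Promise<Hook | null>;

async function waitForHook(
  getHookByToken: GetHookByToken,
  token: string,
  opts: { timeoutMs?: number; intervalMs?: number; runId?: string } = {},
): Promise<Hook> {
  const timeoutMs = opts.timeoutMs ?? 30000; // default 30s, as in the commit
  const intervalMs = opts.intervalMs ?? 250; // assumed default
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    // Treat a failed lookup (e.g. a not-found error) as "try again".
    const hook = await getHookByToken(token).catch(() => null);
    // With a runId filter, skip stale hooks left over from a previous run.
    if (hook && (opts.runId === undefined || hook.runId === opts.runId)) {
      return hook;
    }
    if (Date.now() >= deadline) {
      throw new Error(`hook ${token} not found within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Compared with a fixed 5 second sleep, this returns as soon as the hook is visible on fast runtimes and keeps waiting on slow ones.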
The recursion-hazard fixes that motivated the blast-radius cap have all
landed:
1. Snapshot runtime correlationIds are now deterministic across
concurrent VM invocations (commit 83bcec — `__ulidTimestamp`
injection so same-resumption invocations produce identical ULIDs).
2. The seeded PRNG state is preserved by the VM heap snapshot itself
(commit a71503 — events cursor mixed into seed; ULID
monotonicFactory closure persists in the QuickJS heap).
3. Per-(runId, correlationId) uniqueness is enforced atomically in
world-local (commit ca0078) and via unique partial index in
world-postgres (commit 009a00) for step_created / hook_created /
wait_created.
With those guarantees the duplicate `start()` invocation that previously
fanned out hundreds of thousands of child runs on the fastify deployment
is no longer possible. Restore the full Vercel project matrix
(11 frameworks) and unskip fibonacciWorkflow on Vercel.
…aces
Pipelining world.snapshots.save with the per-pending-op events.create + queueMessage dispatch (introduced in 22ab779) opened a window where a fast-completing step could re-invoke the workflow handler before the new snapshot was persisted. The handler then loads a stale (or missing) snapshot whose coroutine state doesn't match the latest events, leaving the workflow stuck.
CI run 25098135190 caught this: fetchWorkflow on Vercel snapshot mode regressed from ~16s passing to a 60s timeout. Diagnostic showed both step_completed events landed at +5.5s but no run_completed ever fired.
Restore the original ordering: await snapshot.save fully before any step is queued. Per-pending-op dispatch within a single suspension still runs in parallel via Promise.all, which retains the bulk of the wall-clock reduction (run 25098135190 measured ~568s saved on Vercel snapshot vs. the pre-parallelize baseline). Only the cross-invocation pipelining of save with queue is rolled back.
Wedges on Vercel snapshot runtime under concurrent matrix load are opaque from CI logs alone: the workflow handler runs inside a function on Vercel and its console output isn't surfaced in the CI job. This commit adds two pieces of diagnostic plumbing:
1. Always-on checkpoint logs at every major step of the snapshot suspension/restore lifecycle (`SNAPSHOT_DIAG`), plus matching entry/exit logs in the workflow and step queue handlers (`WORKFLOW_HANDLER_DIAG`, `STEP_HANDLER_DIAG`). Each record carries a per-invocation id, runId, elapsed time, and structured fields (snapshot bytes, events fetched + counts by type, pending op summary, outcome, exit action). Emitted at `warn` level so they show up in Vercel function logs without DEBUG=1.
2. e2e diagnostic harness extension that fetches matching function logs from `/v3/deployments/:id/events` for the wedged runId after a test failure and appends them to the existing run-diagnostic block. Only runs when `WORKFLOW_VERCEL_AUTH_TOKEN` / `WORKFLOW_VERCEL_TEAM` / `VERCEL_DEPLOYMENT_ID` are set (i.e. the Vercel-prod CI matrix); silently no-ops elsewhere.
Together these let a failed test surface the function-side activity for its wedged run: e.g. whether the snapshot runtime even reached its post-VM checkpoint, what its last successful save / queue operation was, whether the next handler invocation ever started, etc. That visibility is what we need to actually find the wedge cause.
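The restored ordering can be sketched as follows. The function and parameter names (`handleSuspension`, `saveSnapshot`, `dispatchOp`, `PendingOp`) are hypothetical stand-ins for the entrypoint's real code; only the ordering (save fully, then dispatch in parallel) is taken from the commit message.

```typescript
type PendingOp = { correlationId: string };

async function handleSuspension(
  saveSnapshot: () => Promise<void>,
  dispatchOp: (op: PendingOp) => Promise<void>,
  ops: PendingOp[],
): Promise<void> {
  // Persist first: a step queued below may complete and re-invoke the
  // workflow handler immediately, and that invocation must find this snapshot.
  await saveSnapshot();

  // Ops within one suspension are independent of each other, so their
  // event creation and queueing can still run in parallel.
  await Promise.all(ops.map((op) => dispatchOp(op)));
}
```

The trade-off is one storage round-trip of added latency per suspension in exchange for never racing a re-invocation against an unsaved snapshot.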
…reserve Buffer body across retries
Wedge root cause for snapshot runtime on Vercel under concurrent matrix
load. The old save() in world-vercel/src/snapshots.ts used:
fetch(url, { method: 'PUT', body: compressed, dispatcher: getDispatcher() })
where getDispatcher() returns a RetryAgent. fetch() wraps Buffer/Uint8Array
bodies in a one-shot ReadableStream (web fetch spec), so when the
RetryAgent retries on a transient 5xx or network error, the second
attempt has nothing left to read — the iterable yields 0 bytes, undici
detects the mismatch with Content-Length, and throws
UND_ERR_REQ_CONTENT_LENGTH_MISMATCH. With 5–15 MB snapshot bodies the
bug fires under any meaningful network turbulence.
The downstream impact is a permanent wedge:
1. Save throws -> workflow handler returns 500.
2. Queue retries the handler with backoff.
3. Each retry repeats the same save -> same throw -> same 500.
4. Production logs showed attempt: 19 (≈1.5 hours of retries)
before the test framework gave up at the 60s test timeout.
Switch to undici.request() (the lower-level API), which hands the Buffer
to the connection layer directly without stream wrapping, so retries
can replay the same body. Verified locally with a vitest regression
test that reproduces the exact production stack trace
(AsyncWriter.end -> writeIterable -> UND_ERR_REQ_CONTENT_LENGTH_MISMATCH)
without the fix and passes with it.
Other world-vercel endpoints (events, hooks, runs, …) hit the same
underlying undici limitation but in practice rarely fail this way: their
bodies are tiny (KB CBOR-encoded payloads), so the chance of network
turbulence mid-stream is much lower. They remain on fetch() for now.
Avoid a guaranteed-404 round-trip to the snapshot storage backend on
the very first workflow handler invocation. The suspension handler in
this file always saves the snapshot BEFORE creating any
step_created / hook_created / wait_created events, so if the events
preloaded by events.create('run_started') contain only run_created /
run_started, no save cycle has run yet and no snapshot can exist.
Detected by the new exported `canSkipSnapshotLoad(preloadedEvents)`
helper, with 8 unit tests covering each event-type combination
(undefined / empty / run_created+run_started / run_started only /
step_* / hook_received / wait_completed). When the helper returns true,
`existingSnapshot` is set to null without calling
`world.snapshots.load()` and the entrypoint falls through to the
first-run path with the preloaded events.
The wfdiag('snapshot_loaded') checkpoint now also reports
`skippedLoad: true` when the fast path was taken so we can confirm
the optimization is firing in production logs.
Reduces 404 noise on workflow-server's `/v2/runs/:runId/snapshot`
endpoint and saves a network round-trip on every initial workflow
invocation. Falls back to the normal load path whenever
`preloadedEvents` is missing or contains any non-initial event.
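The fast-path check could be approximated as below. Only the helper name and its overall rule come from the commit message; the `eventType` field name is an assumption about the event shape, and returning `false` for an empty array is a conservative guess, since the message does not say which way that case goes.

```typescript
// Assumed minimal event shape; the real events carry more fields.
type WorkflowEvent = { eventType: string };

const INITIAL_EVENT_TYPES = new Set(['run_created', 'run_started']);

function canSkipSnapshotLoad(preloadedEvents?: WorkflowEvent[]): boolean {
  // Missing (or, by assumption here, empty) events: fall back to a real load.
  if (!preloadedEvents || preloadedEvents.length === 0) return false;
  // Any step_*, hook_*, or wait_* event means a save cycle may have run.
  return preloadedEvents.every((event) => INITIAL_EVENT_TYPES.has(event.eventType));
}
```

The check is sound only because the suspension handler saves the snapshot before creating any step, hook, or wait event, so the absence of those events implies the absence of a snapshot.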
…ming breakdown
Two changes that go together:
1. New `stripInlineSourceMap()` helper in `source-map.ts` (with 4 unit
tests). The runtime entrypoint now strips the trailing
`//# sourceMappingURL=data:…` comment from the workflow bundle
before passing it to `vm.evalCode()`. The original (unstripped)
string is kept in the host-side scope so `remapErrorStack` can
still resolve original source positions on workflow failures.
The map is purely host-side metadata for stack-trace remapping —
the VM never reads it. But QuickJS retains source text for
stack-trace line lookups, so the multi-MB base64 comment was being
carried into the VM heap and showing up in every snapshot save+load
round-trip. Empirically, on the example workbench's bundle:
- Bundle string drops 5.16 MB → 1.20 MB (-77%)
- QuickJS heap snapshot drops 11.75 MB → 8.00 MB (-32%)
That maps to ~1s saved per step round-trip on Vercel.
2. Extend the `SNAPSHOT_DIAG snapshot_loaded` and
`SNAPSHOT_DIAG snapshot_saved` checkpoint logs with per-stage byte
counts and timings:
- load: returnedBytes (post-decompress, pre-decrypt),
loadDurationMs (HTTP round-trip), decryptDurationMs
- save: plaintextBytes (raw QuickJS output),
handedToWorldBytes (after host-side encrypt),
encryptDurationMs, storeDurationMs
So the savings show up in CI-fetched function logs alongside the
existing OTel attributes. Naming clarified: 'returnedBytes' /
'handedToWorldBytes' instead of misleading 'wireBytes', because
the world (e.g. world-vercel) applies its own gzip layer below
this — true on-the-wire bytes are emitted by world-vercel's own
diagnostic (separate commit).
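A helper like `stripInlineSourceMap()` can be sketched with a single anchored regular expression. The function name comes from the commit message; the regex and its exact handling of surrounding whitespace are assumptions, and the real helper in `source-map.ts` may differ.

```typescript
// Remove a trailing inline source map comment (data: URL form only).
// The pattern is anchored to the end of the string, so a sourceMappingURL
// comment in the middle of the bundle is left alone.
function stripInlineSourceMap(code: string): string {
  return code.replace(/\n?\/\/# sourceMappingURL=data:[^\n]*\s*$/, '');
}

const bundle =
  'const x = 1;\n//# sourceMappingURL=data:application/json;base64,AAAA\n';

// The VM gets the stripped text; the host keeps `bundle` for stack remapping.
const forVm = stripInlineSourceMap(bundle);
```

Keeping both strings on the host side is what lets stack traces still be remapped while the VM heap no longer carries the base64 payload.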
Adds `WORLD_SNAPSHOT_DIAG` checkpoint logs to the snapshot save and load paths.
Save reports inputBytes (what the core handed in) → wireBytes (after gzipSync) → compressionRatio, plus separate gzipDurationMs and putDurationMs. Load reports the equivalents: wireBytes (raw HTTP body) → decompressedBytes (after gunzipSync), plus getDurationMs and gunzipDurationMs.
Pairs with the core `SNAPSHOT_DIAG` checkpoints from the previous commit so the entire snapshot lifecycle for any wedged run is grep-able by runId in Vercel function logs. Also covers the 404 (no-snapshot) case so a core `skippedLoad: true` checkpoint can be cross-referenced against the world's view: when both line up, the optimization is firing as intended; when only one side fires, something's off.
All emitted at `console.warn` level (no DEBUG required), matching the format/style of the core wfdiag helper.
…able
The snapshot save path was doing the wrong thing: core encrypted the
bytes first, and each world (vercel, postgres, local) then gzipped the
already-encrypted bytes before handing them to its transport. Net
result was `gzip(encrypt(plain))` on the wire. Encryption produces
ciphertext that doesn't compress, so the gzip step was largely wasted CPU.
Flip the order so compression goes BEFORE encryption (the standard
compress-then-encrypt pattern used for at-rest blob encryption — no
CRIME/BREACH applicability here since the snapshot is opaque, no
attacker injection, no per-request size leakage). Move compression
into core so it happens once, in the right place, and so the world
layers can be simplified to opaque-bytes transport.
Codec choice: zstd when available (Node 22.15+), gzip otherwise.
Benchmarked against an 8 MB QuickJS heap snapshot (representative
production payload):
| codec | ratio | compress | decompress |
|--------|-------|----------|------------|
| zstd-3 | 4.29x | 18 ms | 6 ms |
| gzip-6 | 4.02x | 127 ms | 11 ms |
zstd is faster AND smaller. The format prefix on each blob (`zstd`
or `gzip`) marks the codec, so deployments running different Node
versions remain interoperable.
Pipeline now:
- SAVE: serialize → compress → encrypt → world.snapshots.save
- LOAD: world.snapshots.load → decrypt → decompress → deserialize
`@workflow/core`:
* New `serialization/compression.ts` with `compress` /
`decompress` / `isCompressed` / `PREFERRED_CODEC`. 11 unit
tests covering codec selection, idempotency, format-prefix
dispatch, legacy-blob passthrough.
* New SerializationFormat constants `GZIP` / `ZSTD`.
* `runtime/snapshot-entrypoint.ts` save path: compress → encrypt
→ store. Load path: decrypt → decompress. New byte-count and
timing fields on `SNAPSHOT_DIAG snapshot_saved` /
`snapshot_loaded` (compressedBytes, compressionRatio,
compressionCodec, compressDurationMs, decompressDurationMs).
* 7 new tests in `runtime/snapshot-encryption.test.ts` covering
the full pipeline round-trip with and without encryption, plus
legacy-blob backward compatibility.
`@workflow/world-vercel`:
* Drop `gzipSync` from save. Body is sent verbatim (already
compressed+encrypted by core upstream).
* Drop the `X-Snapshot-Content-Encoding: gzip` header on save.
* Load still gunzips when the response carries that header — for
backward compatibility with blobs written by older deployments.
`@workflow/world-postgres`:
* Drop `gzipSync` / `gunzipSync`. Stores opaque bytes.
Snapshots table is created per CI run; no migration concern.
`@workflow/world-local`:
* Save as `{runId}.bin` (was `.bin.gz`). Load still gunzips
legacy `.bin.gz` files via the `dataFile` metadata so a
developer's stale `.workflow-data/` directory keeps working.
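The codec selection and format-prefix dispatch could look roughly like the sketch below. The names `compress`, `decompress`, and `PREFERRED_CODEC` come from the commit message; the byte layout (4 ASCII prefix bytes followed by the compressed body), the default compression levels, and the feature detection via `zstdCompressSync` are assumptions. Encryption is left out: in the pipeline above it would wrap the output of `compress`.

```typescript
import * as zlib from 'node:zlib';

type SyncCodec = (data: Uint8Array) => Uint8Array;
// zstd functions only exist on newer Node versions, so look them up dynamically.
const dynamicZlib = zlib as unknown as Record<string, SyncCodec | undefined>;
const zstdCompress = dynamicZlib['zstdCompressSync'];
const zstdDecompress = dynamicZlib['zstdDecompressSync'];

const PREFERRED_CODEC: 'zstd' | 'gzip' =
  typeof zstdCompress === 'function' ? 'zstd' : 'gzip';

function compress(plain: Uint8Array): Uint8Array {
  const body =
    PREFERRED_CODEC === 'zstd' && zstdCompress
      ? zstdCompress(plain)
      : zlib.gzipSync(plain);
  // Assumed layout: 4 ASCII prefix bytes, then the compressed body.
  const out = new Uint8Array(4 + body.length);
  for (let i = 0; i < 4; i++) out[i] = PREFERRED_CODEC.charCodeAt(i);
  out.set(body, 4);
  return out;
}

function decompress(blob: Uint8Array): Uint8Array {
  let prefix = '';
  for (let i = 0; i < 4 && i < blob.length; i++) {
    prefix += String.fromCharCode(blob[i] ?? 0);
  }
  const body = blob.subarray(4);
  if (prefix === 'gzip') return zlib.gunzipSync(body);
  if (prefix === 'zstd') {
    if (!zstdDecompress) throw new Error('zstd blob but zstd is unavailable');
    return zstdDecompress(body);
  }
  // No known compression prefix: pass the bytes through unchanged.
  return blob;
}
```

Because the reader dispatches on the prefix rather than on its own preferred codec, a gzip-only deployment can still read its own blobs after other deployments switch to zstd, though it cannot read zstd blobs in this sketch.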
The compress-then-encrypt pipeline that landed in 519bb1d added backward-compatibility code to read older snapshot blobs that were written under the previous SDK-side gzip scheme. The snapshot runtime is still on the snapshot-runtime feature branch and has no production deploy, so no blob written under the old scheme needs to outlive a feature-branch deploy.
world-vercel:
- Remove the X-Snapshot-Content-Encoding: gzip header round-trip on save and load.
- Drop the gunzipSync import.
- File header comment no longer mentions back-compat.
world-local:
- Drop the .bin.gz / dataFile metadata mechanism. Snapshots are now always stored as {runId}.bin alongside {runId}.json.
- Drop the gunzipSync import and the LocalSnapshotMetadataSchema extension; metadata is just SnapshotMetadataSchema (eventsCursor + createdAt).
- File-naming helpers extracted as dataPath() / metadataPath().
core:
- Remove the now-irrelevant 'legacy snapshots saved before compression was added' test from snapshot-encryption.test.ts. The remaining 'plaintext bytes pass through unchanged' test still exercises the contract that decryptSerializedData() does not require prefixed input. That is a real pre-existing API contract used by non-snapshot callers, not snapshot back-compat.
Replaces 14 incremental per-commit changesets with 4 terse, package-scoped ones (one each for @workflow/core, world-vercel, world-postgres, world-local). The detailed per-change context is preserved in git history; CHANGELOG entries from changesets should describe what consumers need to know, not the implementation history.
This changeset is part of the serialization-refactor base branch (introduced in 6add40c) and was incorrectly deleted in the previous consolidation pass. Only changesets local to the snapshot-runtime branch should have been consolidated.
The file is regenerated on every build (`scripts/build-vm-serde-bundle.js`) and is already listed under turbo.json's outputs for caching. Tracking it just produced noisy diffs whenever someone built the package with a slightly different esbuild version.
…isites
Standardize on `Symbol.for('workflow-serialize')` /
`Symbol.for('workflow-deserialize')` everywhere — the parallel
`globalThis.__wdk_serialize` / `__wdk_deserialize` aliases have been
removed from `vm-bundle-entry.ts` and the snapshot runtime's inline
JS strings now use the symbol form directly. Single canonical name,
no duplication.
Drop the `?? Math.random` and `?? Date.now()` fallbacks from the
ULID generator setup. Both prerequisites
(`globalThis.__ulidTimestamp` and the host-replaced seeded
`Math.random`) are always set by `snapshot-runtime.ts` before the
serde bundle is evaluated; silently falling back to unseeded
`Math.random` or live `Date.now()` would re-introduce the
non-determinism we deliberately fixed (concurrent VM invocations of
the same resumption must produce identical correlationIds for the
world's EntityConflictError dedup to work). Now throws if
`__ulidTimestamp` isn't a number, and passes the seeded
`Math.random` reference explicitly to `monotonicFactory` so
upstream's `detectPRNG` never runs (it'd throw in QuickJS anyway,
since `crypto` is unavailable).
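A minimal sketch of the "throw instead of silently falling back" idea described above. This is not the PR's code and does not use the `ulid` package: `seededRandom`, `createIdGenerator` and the simplified ULID-like layout are hypothetical names chosen for illustration. It only shows why a seeded PRNG plus a pinned timestamp gives identical IDs across concurrent invocations.

```typescript
// Crockford base32 alphabet, as used by ULIDs.
const ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// Hypothetical seeded PRNG (Park-Miller LCG); the real runtime installs its own seeded Math.random.
function seededRandom(seed: number): () => number {
  let state = (Math.abs(Math.floor(seed)) % 2147483646) + 1;
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // always in [0, 1)
  };
}

// Encode a millisecond timestamp as 10 base32 characters.
function encodeTime(ms: number): string {
  let out = '';
  let n = ms;
  for (let i = 0; i < 10; i++) {
    out = ALPHABET.charAt(n % 32) + out;
    n = Math.floor(n / 32);
  }
  return out;
}

// Hypothetical factory: refuses to run without both prerequisites
// instead of falling back to Date.now() or an unseeded Math.random.
function createIdGenerator(timestamp: unknown, random: () => number): () => string {
  if (typeof timestamp !== 'number' || !isFinite(timestamp) || timestamp < 0) {
    throw new Error('timestamp prerequisite is missing; refusing to fall back to Date.now()');
  }
  const time = encodeTime(Math.floor(timestamp));
  return () => {
    let rand = '';
    for (let i = 0; i < 16; i++) {
      rand += ALPHABET.charAt(Math.floor(random() * 32));
    }
    return time + rand;
  };
}
```

Two generators built from the same seed and timestamp yield the same ID sequence, which is the property the correlationId dedup relies on.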
Drop the `URL` / `URLSearchParams` / `DOMException` availability
guards in `common-vm.ts`. quickjs-wasi's URL extension is always
loaded (`url.so`) and DOMException is always constructible — the
guards were dead code carried over from when those weren't reliably
available. The reducer/reviver code is now straightforward
`instanceof URL` / `new URL(...)` / `new DOMException(...)`.
Remove `packages/core/src/serialization/base64.ts` and its
sub-path exports (`./serialization/workflow`,
`./serialization/workflow-vm`). The pure-JS base64 helpers were
leftover from before `base64.so` shipped `btoa`/`atob` natively;
the VM-side reducers in `common-vm.ts` now build base64 strings via
the native ones. The sub-path exports had zero consumers in this
repo (the same cleanup landed on the `serialization-refactor`
branch in 05e0fee but never made it onto `snapshot-runtime`
because the branches diverged earlier).
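For illustration, a sketch of a byte round-trip through native `btoa`/`atob`, in the spirit of the VM-side reducers mentioned above. The helper names `bytesToBase64` and `base64ToBytes` are hypothetical, and the sketch leaves out any special handling of empty buffers that the real reducers may apply.

```typescript
// btoa expects a "binary string" (one char per byte), so build one from the byte view.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i] ?? 0);
  }
  return btoa(binary);
}

// atob returns a binary string; copy each char code back into a byte array.
function base64ToBytes(encoded: string): Uint8Array {
  const binary = atob(encoded);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    out[i] = binary.charCodeAt(i);
  }
  return out;
}
```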
Remove `packages/workflow/src/internal/serialization.ts` and its
`./internal/serialization` package.json export. Same story — zero
consumers, previously removed in #1082, then accidentally
reintroduced via `f04fd8e91`.
The `/v3/deployments/:id/events` endpoint mostly returned empty results in our wedge-debugging usage and the runId-substring filter made it slow when it did return data. The function-log fetch belongs in a dedicated diagnostic CLI command rather than baked into the test diagnostic block. Dropping for now; can be revived in a follow-up PR if needed.
Updates the per-package changesets to match AGENTS.md guidance and the
current state of the PR:
- Bump from `patch` to `minor` (snapshot runtime is a new feature, not
a bug fix; correctness matters when the changesets land on `stable`)
- Correct snapshot-runtime-core.md: snapshot is now the default, with
replay available via `WORKFLOW_RUNTIME=replay` (was incorrectly
describing snapshot as opt-in)
- Drop the misleading 'enforces uniqueness' line from
snapshot-runtime-world-vercel.md (no uniqueness work happens in this
package; that lives in workflow-server)
- Tighten language across all four changesets per AGENTS.md
('Keep the changesets terse')
…stack regression
Per CI history (runs 25100278265 vs 25130930859), the regression boundary for the 'basic step error preserves message and stack trace' / 'cross-file step error preserves message and function names in stack' e2e tests on astro local-dev is commit 770c433 ('Add CI-visible runtime diagnostics for snapshot wedges'), NOT the later 9168353 source-map-strip commit. The astro-dev failure reproduces on both replay and snapshot runtimes with identical symptoms (function name shows up as `__getOwnPropDesc` instead of the actual step function name in the source-mapped stack), which rules out any snapshot-runtime specific cause.
The STEP_HANDLER_DIAG entries were always-on `runtimeLogger.warn` calls inside the step queue handler. They didn't add real diagnostic value beyond what the existing OTel spans already cover; their main purpose was to grep-correlate step activity with SNAPSHOT_DIAG checkpoints in Vercel function logs during the wedge-debugging session that's now resolved. SNAPSHOT_DIAG and WORKFLOW_HANDLER_DIAG are kept; only the STEP_HANDLER_DIAG pair is removed.
The exact mechanism by which the diagnostic warns affect the `stepFn.apply()` stack frame's source-mapped function name is still unclear (the most plausible explanation is that the line-shift in step-handler.ts perturbed Vite's dev-mode module graph in a way that changes which export getter wraps the step function reference at the `__copyProps` site shared with the namespace import in `_workflows.ts`). Reverting the diagnostic is sufficient to restore the test, and the diagnostic itself is not load-bearing.
Pull request overview
Implements the new default snapshot-based workflow runtime (QuickJS WASM VM with snapshot/restore) and wires snapshot persistence into world backends, while keeping the existing event-replay runtime as an opt-out via WORKFLOW_RUNTIME=replay.
Changes:
- Add snapshot runtime execution path in `@workflow/core` (VM bootstrap, snapshot save/load pipeline with compression + optional encryption, runtime-mode dispatch, and new telemetry attributes).
- Introduce `snapshots.save/load/delete` to the `@workflow/world` storage interface and implement it for `world-vercel`, `world-postgres`, and `world-local`.
- Expand CI/E2E coverage to run tests against both runtimes and reduce E2E flakiness by polling for hook registration instead of fixed sleeps.
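As a rough sketch of the runtime-mode dispatch mentioned above: snapshot is the default and `replay` is the opt-out. The function name `parseWorkflowRuntime` is hypothetical, and throwing on unrecognized values is an assumption; the PR's actual validation in `runtime-mode.ts` may behave differently.

```typescript
type WorkflowRuntime = 'snapshot' | 'replay';

// Hypothetical parser for the WORKFLOW_RUNTIME env var.
// Assumption: unset or empty means the default ('snapshot'); unknown values are rejected.
function parseWorkflowRuntime(raw: string | undefined): WorkflowRuntime {
  if (raw === undefined || raw === '') return 'snapshot';
  if (raw === 'snapshot' || raw === 'replay') return raw;
  throw new Error('Invalid WORKFLOW_RUNTIME "' + raw + '"; expected "snapshot" or "replay"');
}
```

A caller would typically pass `process.env.WORKFLOW_RUNTIME` and branch on the result.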
Reviewed changes
Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/create-test-matrix.mjs | Duplicates app matrix across snapshot and replay runtime axes. |
| pnpm-lock.yaml | Adds quickjs-wasi@2.0.0 lock entries. |
| packages/world/src/snapshots.ts | Adds SnapshotMetadataSchema (eventsCursor, createdAt). |
| packages/world/src/interfaces.ts | Extends Storage with snapshots.save/load/delete. |
| packages/world/src/index.ts | Exposes snapshot types/schema from @workflow/world. |
| packages/world-vercel/src/storage.ts | Wires snapshots into Vercel storage and instrumentation. |
| packages/world-vercel/src/snapshots.ts | Implements snapshot storage via workflow-server snapshot endpoints. |
| packages/world-vercel/src/snapshots.test.ts | Adds tests for PUT body correctness and retry behavior. |
| packages/world-postgres/test/storage.test.ts | Adds tests asserting dedup behavior for entity-creation races. |
| packages/world-postgres/src/storage.ts | Maps pg unique-violation for entity-creating events to EntityConflictError. |
| packages/world-postgres/src/snapshots.ts | Implements Postgres snapshot upsert/load/delete storage. |
| packages/world-postgres/src/index.ts | Wires snapshots storage into Postgres createStorage. |
| packages/world-postgres/src/drizzle/schema.ts | Adds snapshots table + entity-creation partial unique index. |
| packages/world-postgres/src/drizzle/migrations/meta/_journal.json | Registers new migrations in drizzle journal. |
| packages/world-postgres/src/drizzle/migrations/0010_add_snapshots_table.sql | Creates workflow.workflow_snapshots table. |
| packages/world-postgres/src/drizzle/migrations/0011_add_events_entity_creation_unique_index.sql | Adds partial unique index for step/hook/wait creation events. |
| packages/world-local/src/storage/snapshots-storage.ts | Adds filesystem-backed snapshot storage (bytes + metadata files). |
| packages/world-local/src/storage/index.ts | Wires snapshots storage into local storage and instrumentation. |
| packages/world-local/src/storage/events-storage.ts | Adds atomic lock-file dedup for step_created and wait_created. |
| packages/world-local/src/storage.test.ts | Adds race tests for local step/wait creation dedup behavior. |
| packages/world-local/src/queue.ts | Logs queue handler errors with stack for debugging. |
| packages/core/turbo.json | Adds generated VM bundle/assets files to build outputs. |
| packages/core/src/telemetry/semantic-conventions.ts | Adds snapshot runtime semantic convention attributes. |
| packages/core/src/source-map.ts | Adds stripInlineSourceMap() to reduce VM heap/snapshot size. |
| packages/core/src/source-map.test.ts | Tests stripInlineSourceMap() behavior. |
| packages/core/src/serialization/workflow-vm.ts | Adds VM-safe workflow-mode serializer/deserializer. |
| packages/core/src/serialization/workflow-vm.test.ts | Tests VM serializer and VM↔Node compatibility. |
| packages/core/src/serialization/vm-bundle-entry.ts | VM bundle entry: installs serde + deterministic ULID generator. |
| packages/core/src/serialization/types.ts | Adds compression format prefixes (gzip, zstd). |
| packages/core/src/serialization/reducers/common-vm.ts | Adds VM-safe reducers/revivers (base64 via btoa/atob). |
| packages/core/src/serialization/compression.ts | Adds compress/decompress layer with gzip/zstd feature detection. |
| packages/core/src/serialization/compression.test.ts | Tests compression layer behavior and codec selection. |
| packages/core/src/serialization/compat.test.ts | Adds compatibility tests between modular and legacy serialization APIs. |
| packages/core/src/serialization/codec-devalue.ts | Adds clarifying notes about modular modules vs legacy runtime path. |
| packages/core/src/serialization/codec-devalue-vm.ts | Adds VM-compatible devalue codec using VM reducers/revivers. |
| packages/core/src/runtime/start.ts | Propagates WORKFLOW_RUNTIME choice into executionContext. |
| packages/core/src/runtime/snapshot-runtime.ts | Implements QuickJS snapshot/restore runtime engine. |
| packages/core/src/runtime/snapshot-runtime.test.ts | Unit tests for snapshot runtime behavior and determinism. |
| packages/core/src/runtime/snapshot-entrypoint.ts | Integrates snapshot runtime into devkit entrypoint + storage pipeline. |
| packages/core/src/runtime/snapshot-entrypoint.test.ts | Tests snapshot-load skip heuristic. |
| packages/core/src/runtime/snapshot-encryption.test.ts | Tests compress→encrypt→decrypt→decompress contract. |
| packages/core/src/runtime/runtime-mode.ts | Adds WORKFLOW_RUNTIME parsing/validation. |
| packages/core/src/runtime/runtime-mode.test.ts | Tests runtime-mode env parsing. |
| packages/core/src/runtime.ts | Switches default runtime to snapshot with replay fallback. |
| packages/core/scripts/build-vm-serde-bundle.js | Generates VM serde bundle source used by snapshot runtime. |
| packages/core/scripts/build-quickjs-assets.js | Generates embedded quickjs-wasi wasm/extension assets. |
| packages/core/package.json | Adds quickjs-wasi dependency and generators to build script. |
| packages/core/e2e/e2e.test.ts | Replaces fixed hook sleeps with polling helper to reduce flakiness. |
| packages/core/.gitignore | Ignores generated VM bundle/assets files. |
| .github/workflows/tests.yml | Expands CI matrix across runtimes and avoids ARG_MAX in sticky comment. |
| .changeset/snapshot-runtime-world-vercel.md | Changeset for world-vercel snapshot storage + undici.request rationale. |
| .changeset/snapshot-runtime-world-postgres.md | Changeset for world-postgres snapshots + event uniqueness fix. |
| .changeset/snapshot-runtime-world-local.md | Changeset for world-local snapshots + event dedup fix. |
| .changeset/snapshot-runtime-core.md | Changeset for core snapshot runtime default + replay opt-out. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
| "scripts": { | ||
| "build": "genversion --es6 src/version.ts && tsc", | ||
| "build": "genversion --es6 src/version.ts && node scripts/build-vm-serde-bundle.js && node scripts/build-quickjs-assets.js && tsc", | ||
| "dev": "genversion --es6 src/version.ts && tsc --watch", | ||
| "clean": "tsc --build --clean && rm -rf dist src/version.ts docs ||:", |
| * The binary data is stored gzip-compressed in the `data` column. | ||
| * Metadata (`eventsCursor`, `createdAt`) lives alongside for cheap loads. | ||
| */ |
| const escapedCid = cid.replace(/"/g, '\\"'); | ||
| const eventData = |
| function arrayBufferToBase64( | ||
| value: ArrayBufferLike, | ||
| offset: number, | ||
| length: number | ||
| ): string { | ||
| if (length === 0) return '.'; | ||
| // btoa requires a binary string. Build it from the byte view. | ||
| const uint8 = new Uint8Array(value, offset, length); | ||
| let binary = ''; | ||
| for (let i = 0; i < uint8.length; i++) { | ||
| binary += String.fromCharCode(uint8[i]!); | ||
| } | ||
| return btoa(binary); |
Resolve conflicts:
- packages/core/src/serialization/* (workflow.ts, step.ts, client.ts, codec-devalue.ts, errors.ts, common.ts): take main's version (post-#1849 SerializationError + post-#1851 first-class Error subclass reducers).
- packages/core/src/serialization/types.ts: take main's per-Error-subclass payload shapes; re-add GZIP/ZSTD format prefixes from snapshot-runtime.
- packages/core/src/serialization.ts: take main's V2 helpers (dehydrateStepError, hydrateStepError, dehydrateRunError, hydrateRunError, getWorldLazy import).
- packages/core/src/runtime.ts: take main's V2 inline-replay loop + step-executor + memoizeEncryptionKey + dehydrateRunError patterns; layer back snapshot dispatch (useSnapshotRuntime + runWorkflowWithSnapshots) before the V2 main replay loop, after run_started setup.
- packages/core/src/runtime/start.ts: take main's getWorldLazy; keep snapshot's getWorkflowRuntimeFromEnv usage.
- packages/world-local/src/storage/index.ts: take main's local-var refactor + LocalStorage type; layer back snapshots storage entry.
- packages/world-local/src/storage/events-storage.ts: take main's version (already includes #1877 dedup atomicity and #1851 Uint8Array passthrough).
- packages/world-postgres: take main's tightened EntityConflictError gate (constraint name match) and waitCreated test assertion. Renumber snapshot's 0010_add_snapshots_table.sql to 0012; drop branch's duplicate 0011_add_events_entity_creation_unique_index.sql in favor of main's 0010 with dedup CTE.
- packages/core/e2e/e2e.test.ts: take main's #1879 waitForHookDisposal.
- .github/workflows/tests.yml: take main's runLabel/artifactSuffix naming scheme; keep snapshot's WORKFLOW_RUNTIME env var; take main's Windows job structure.
- scripts/create-test-matrix.mjs: extend the runtime cross-product to fold runtime into runLabel and artifactSuffix so artifacts/job names remain unique.
Snapshot dispatch is now layered on top of V2: when a workflow message arrives and the run's runtime mode is 'snapshot', runtime.ts delegates to runWorkflowWithSnapshots and returns. The V2 inline-replay loop and inline executeStep path remain in place for replay-mode runs and for inline step execution from background-step deliveries (snapshot mode will also re-route step queueing to the unified workflow queue in a follow-up commit so steps hit V2's executeStep instead of stepEntrypoint).
The V2 architecture (#1338) unified step execution into the workflow handler: step messages arrive on the workflow queue with a stepId payload and dispatch to executeStep inline. The separate stepEntrypoint route was removed. Update snapshot-entrypoint.ts to queue steps via the unified queue (`__wkf_workflow_<name>` with { runId, stepId, stepName, traceCarrier, requestedAt }) instead of the removed `__wkf_step_<id>` route. When the step result event lands and the runtime invokes for inline replay, runtime.ts's snapshot dispatch (added in the merge commit) routes back to runWorkflowWithSnapshots, which loads the snapshot and processes the new step_completed/step_failed events. Pin the V2 inline-execution invocation-count tests to replay mode — those tests assert V2-specific batching behavior (1 invocation for sequential steps, 2 for sleep+step) that snapshot runtime can't match since snapshots make a separate flow invocation per resume point.
Three coordinated changes so the snapshot runtime emits and consumes
the same error wire format as the V2 replay runtime — enabling
Error subclass identity (TypeError, FatalError, RetryableError, …),
cause chains, and non-Error throws to round-trip end-to-end.
* common-vm.ts: add VM-side reducers and revivers for every Error
subclass that common.ts already covers on the host (TypeError,
RangeError, SyntaxError, ReferenceError, EvalError, URIError,
AggregateError, FatalError, RetryableError) plus cause preservation
on the base Error reducer/reviver. Match by value.name (instance
property) for cross-realm + bundler-output robustness, mirroring
the host-side rationale. Preserves cause as a side-property after
construction since FatalError/RetryableError constructors don't
forward it.
* snapshot-runtime.ts: serialize the original thrown value via the
VM's workflow-serialize in the rejection handler, exposing it as
__workflowError.valueBytes alongside the existing host-visible
{message, name, stack} fields. checkWorkflowState surfaces the
bytes through SnapshotRuntimeResult.failed.valueBytes. Also
hydrates step_failed event errors via workflow-deserialize so
the workflow VM catch sees a properly-typed Error subclass with
cause chain instead of a synthesized FatalError(message).
* snapshot-entrypoint.ts: when result.failed.valueBytes is present,
hydrate via hydrateRunError, walk the cause chain and remap each
stack via the host source map (the VM can't), then re-dehydrate
for storage. Falls back to passing the bytes through (with
encryption) on rehydration failures, and to the legacy
Error-reconstruction path when valueBytes is absent (e.g.
extractError pseudo-failures from VM bootstrap).
E2E: 82/83 pass on nextjs-turbopack snapshot mode (up from 73/83
before these changes); the one remaining failure
(wellKnownAgentWorkflow) is a pre-existing snapshot-runtime
limitation (workflows registered at separate routes are not in the
combined VM bundle) unrelated to error serialization.
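The reducer/reviver approach in the first bullet above (match on `value.name`, restore `cause` as a side-property) might look roughly like this. It is a simplified stand-alone sketch, not the code in `common-vm.ts`: the payload shape, `reduceError`, `reviveError` and the local `FatalError` stand-in are all hypothetical, and only a few subclasses are registered.

```typescript
// Wire shape for this sketch only; the PR's real payload shapes may differ.
type CausePayload = { kind: 'error'; error: ErrorPayload } | { kind: 'value'; value: unknown };
type ErrorPayload = { name: string; message: string; stack?: string; cause?: CausePayload };
type ErrorCtor = new (message: string) => Error;

// Hypothetical stand-in for the SDK's FatalError; its constructor does not forward `cause`.
class FatalError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
    this.name = 'FatalError';
  }
}

const KNOWN: Record<string, ErrorCtor> = { Error, TypeError, RangeError, FatalError };

function lookup(name: string): ErrorCtor | undefined {
  return Object.prototype.hasOwnProperty.call(KNOWN, name) ? KNOWN[name] : undefined;
}

function reduceError(value: unknown): ErrorPayload | undefined {
  if (typeof value !== 'object' || value === null) return undefined;
  const v = value as { name?: unknown; message?: unknown; stack?: unknown; cause?: unknown };
  // Match on the `name` instance property rather than `instanceof`, so errors
  // from another realm or a re-bundled class are still recognized.
  if (typeof v.name !== 'string' || lookup(v.name) === undefined) return undefined;
  const payload: ErrorPayload = { name: v.name, message: String(v.message) };
  if (typeof v.stack === 'string') payload.stack = v.stack;
  if (v.cause !== undefined) {
    const inner = reduceError(v.cause);
    payload.cause = inner ? { kind: 'error', error: inner } : { kind: 'value', value: v.cause };
  }
  return payload;
}

function reviveError(payload: ErrorPayload): Error {
  const Ctor = lookup(payload.name) ?? Error;
  const err = new Ctor(payload.message);
  err.name = payload.name;
  if (payload.stack !== undefined) err.stack = payload.stack;
  if (payload.cause !== undefined) {
    const cause = payload.cause.kind === 'error' ? reviveError(payload.cause.error) : payload.cause.value;
    // Attach after construction, since some constructors ignore a `cause` option.
    Object.defineProperty(err, 'cause', { value: cause, writable: true, configurable: true, enumerable: false });
  }
  return err;
}
```

Non-Error throws fall through `reduceError` (it returns `undefined`), so a real codec would hand them to its other reducers.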
📊 Benchmark Results
Result tables were not captured. Benchmarks run (all under 💻 Local Development):
- workflow with no steps
- workflow with 1 step
- workflow with 10 / 25 / 50 sequential steps
- Promise.all with 10 / 25 / 50 concurrent steps
- Promise.race with 10 / 25 / 50 concurrent steps
- workflow with 10 / 25 / 50 sequential data payload steps (10KB)
- workflow with 10 / 25 / 50 concurrent data payload steps (10KB)
Stream Benchmarks (includes TTFB metrics):
- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)
Summary
- Fastest Framework by World: winner determined by most benchmark wins
- Fastest World by Framework: winner determined by most benchmark wins
❌ Some benchmark jobs failed: check the workflow run for details.
When @workflow/utils is a transitive dep of the consumer app (e.g.
astro depends on workflow but not @workflow/utils directly), pnpm
strict node_modules isolation makes the package unresolvable from
process.cwd(). The cwd-only createRequire then threw, getPortLazy
silently fell back to undefined-getPort, and step-side
getWorkflowMetadata().url defaulted to localhost:3000 instead of the
real port.
Add a fallback resolution from this module's own location
(import.meta.url) so we find @workflow/utils as a peer of
@workflow/core when the cwd path fails. Mirrors the dual-resolution
pattern in world.ts:getRuntimeRequire.
Symptom on CI: workflowAndStepMetadataWorkflow failed on astro Local
Prod and Local Postgres in snapshot mode because the workflow VM
correctly read port 4321 (snapshot-entrypoint uses a static
`import { getPort }` that bundlers resolve at build time) but the
step path's getPortLazy couldn't reach @workflow/utils through
astro's pnpm node_modules and reported port 3000. Replay mode
incidentally hid the bug because BOTH the workflow and step paths
fell back to 3000 (consistent-but-wrong) so the
toStrictEqual(workflowMetadata, innerWorkflowMetadata) assertion
passed despite both URLs being incorrect.
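The dual-resolution fallback described above can be sketched with Node's `createRequire`. `resolveFromAny` is a hypothetical helper, not the PR's `getPortLazy`. In the real fix the second base would be derived from `import.meta.url`; here the bases are plain directory paths so the sketch stays module-format agnostic.

```typescript
import { createRequire } from 'node:module';
import { join } from 'node:path';

// Try to resolve `specifier` from each base directory in order and
// return the first hit, or undefined when none of the bases can see it.
function resolveFromAny(specifier: string, baseDirs: string[]): string | undefined {
  for (const dir of baseDirs) {
    try {
      // createRequire needs an absolute file path (or file URL); the file itself need not exist.
      return createRequire(join(dir, 'noop.js')).resolve(specifier);
    } catch {
      // Not resolvable from this base (or the base was invalid); try the next one.
    }
  }
  return undefined;
}
```

Usage would be along the lines of `resolveFromAny('@workflow/utils', [process.cwd(), moduleDir])`, where `moduleDir` is the directory of the calling module, so a strict pnpm layout that hides the package from the app root can still find it next to `@workflow/core`.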
Summary
Implements the snapshot-based workflow runtime described in RFC #1298. Instead of replaying the full event log on every workflow handler invocation, workflows run inside a QuickJS WASM VM that is snapshotted at suspension points and restored on resumption — so each invocation only fetches and processes events that arrived since the last save.
The snapshot runtime is the default in this PR. The previous event-replay runtime remains available as an opt-out via `WORKFLOW_RUNTIME=replay` or `executionContext.workflowRuntime: 'replay'`.
How it works
- … `compress → encrypt` pipeline (zstd on Node 22.15+, gzip fallback; AES-256-GCM when an encryption key is configured) and are persisted via `world.snapshots.save`.
- `world.snapshots.load` returns the bytes, the inverse `decrypt → decompress` pipeline restores them, and `vm.restore()` resumes the VM at the exact suspension point.
- … `eventsCursor`, processes them, and either resolves to a result, suspends on a new pending op, or fails.
Most of the snapshot-runtime work lives in `@workflow/core` (`runtime/snapshot-runtime.ts`, `runtime/snapshot-entrypoint.ts`, `serialization/compression.ts`, `serialization/vm-bundle-entry.ts`); each world implements `snapshots.save/load/delete` for its storage backend.
Scope of this PR
- `@workflow/core`: snapshot runtime, VM bootstrap, event-cursor-driven resume, deterministic correlationIds (seeded ULIDs across concurrent VM invocations of the same resumption), encryption and compression pipeline, `WORKFLOW_RUNTIME` env-var dispatch with replay-runtime fallback, OTel spans/attributes for the snapshot lifecycle, CI-visible diagnostic checkpoints (SNAPSHOT_DIAG).
- `@workflow/world`: new `Snapshots` interface (save/load/delete) and metadata schema.
- `@workflow/world-vercel`: workflow-server snapshot endpoints (PUT/GET/DELETE /v2/runs/:runId/snapshot), opaque-bytes transport, switch to `undici.request()` for retry-with-Buffer-body correctness, atomic per-(run, correlation) uniqueness for entity-creating events.
- `@workflow/world-postgres`: new `workflow_snapshots` table, unique partial index on `workflow_events(run_id, correlation_id, type)` for entity-creating events.
- `@workflow/world-local`: filesystem-backed snapshot storage (`{runId}.bin` + `{runId}.json`), atomic correlationId uniqueness for `step_created` / `wait_created`.
- … `[snapshot, replay]`, full Vercel-prod E2E coverage of the snapshot runtime across 11 frameworks.
Custom serializers (`Symbol.for('workflow-serialize')` / `Symbol.for('workflow-deserialize')`) and workflow-side `DOMException` / `WorkflowFunction` round-trip through the VM serde bundle alongside the standard reducers.
Out of scope / future work
- … `runId` (`getVercelFunctionLogs` was removed from the e2e diagnostic harness; belongs in its own PR).
- … `@opentelemetry/api`, `zod`, `ai-sdk`, etc.; tree-shaking those out is a builder-side change worth pursuing later).
- … `snapshot.save` + storage RTT; further work could batch saves or skip them entirely for ops the runtime can recompute).
Based on `serialization-refactor` (PR #1299).
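To make the `compress → encrypt` / `decrypt → decompress` ordering under "How it works" concrete, here is a minimal Node sketch. It is not the PR's `serialization/compression.ts`: the one-byte codec tags, the IV/tag layout, and the names `packSnapshot` / `unpackSnapshot` are assumptions made for illustration. zstd is feature-detected and gzip is the fallback, as the description states.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';
import * as zlib from 'node:zlib';

// Hypothetical one-byte codec tags; the PR's real format prefixes may differ.
const TAG_GZIP = 1;
const TAG_ZSTD = 2;

// zstd only exists on newer Node versions, so it is typed as optional and feature-detected.
type ZstdFns = {
  zstdCompressSync?: (data: Uint8Array) => Buffer;
  zstdDecompressSync?: (data: Uint8Array) => Buffer;
};
const zstd = zlib as unknown as ZstdFns;

function compress(data: Uint8Array): Buffer {
  if (typeof zstd.zstdCompressSync === 'function') {
    return Buffer.concat([Buffer.from([TAG_ZSTD]), zstd.zstdCompressSync(data)]);
  }
  return Buffer.concat([Buffer.from([TAG_GZIP]), zlib.gzipSync(data)]);
}

function decompress(data: Buffer): Buffer {
  const body = data.subarray(1);
  if (data[0] === TAG_GZIP) return zlib.gunzipSync(body);
  if (data[0] === TAG_ZSTD && typeof zstd.zstdDecompressSync === 'function') {
    return zstd.zstdDecompressSync(body);
  }
  throw new Error('unsupported compression tag: ' + String(data[0]));
}

// AES-256-GCM; output layout (an assumption): 12-byte IV | 16-byte auth tag | ciphertext.
function encrypt(plain: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const body = Buffer.concat([cipher.update(plain), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]);
}

function decrypt(blob: Buffer, key: Buffer): Buffer {
  const decipher = createDecipheriv('aes-256-gcm', key, blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  return Buffer.concat([decipher.update(blob.subarray(28)), decipher.final()]);
}

// Save path: compress first, then encrypt only when a key is configured.
function packSnapshot(bytes: Uint8Array, key?: Buffer): Buffer {
  const compressed = compress(bytes);
  return key ? encrypt(compressed, key) : compressed;
}

// Load path: the exact inverse order.
function unpackSnapshot(blob: Buffer, key?: Buffer): Buffer {
  return decompress(key ? decrypt(blob, key) : blob);
}
```

Compressing before encrypting matters because ciphertext is effectively incompressible; reversing the order would lose most of the size benefit.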