Skip to content

feat: serializable AbortController/AbortSignal#1301

Merged
pranaygp merged 82 commits into
mainfrom
pgp/serialize-abort-signal
May 5, 2026
Merged

feat: serializable AbortController/AbortSignal#1301
pranaygp merged 82 commits into
mainfrom
pgp/serialize-abort-signal

Conversation

@pranaygp
Copy link
Copy Markdown
Contributor

@pranaygp pranaygp commented Mar 9, 2026

Summary

Makes AbortController and AbortSignal first-class across workflow and step boundaries: serializable as start() arguments, step inputs/outputs, and inside Request objects; observable from real-time and replay paths.

  • Native AbortController / AbortSignal survive every serialization boundary (external → workflow → step → external) with full type fidelity for reason (DOMException, Error subclasses, custom classes).
  • Calling controller.abort() from the workflow body, from a step, or externally all converge on the same observable state — running steps see the abort in real-time; the workflow sees it deterministically on replay.
  • AbortSignal.any() and pre-aborted AbortSignal.abort() are supported inside the workflow VM. AbortSignal.timeout() is intentionally not supported (replay-non-deterministic) and points users to the sleep() + AbortController pattern.

Architecture

AbortController in a workflow is backed by two primitives:

  • Hook (durable, replay-deterministic) — abort state lands in the event log as a hook_received event. On replay, the events consumer chains _setAborted(reason) onto promiseQueue so listener firing is tied to the orchestrator's deterministic execution, not microtask scheduling.
  • Stream (real-time, cross-compute) — abort payload is also written to a per-controller _system_abort stream so a step running on a different process can observe the abort without waiting for the next replay tick.

When abort() is called from any context, both channels are written. Whichever delivers first drives the visible state; the other path provides redundancy. All three writer sites (workflow-side, step-side patched controller.abort(), external-listener-side) dehydrate via the same dehydrateStepArguments machinery so the hydration step (hydrateStepArguments) round-trips reasons with full fidelity and respects encryption.

System hooks created for abort controllers are tagged isSystem: true (consistent across world-local, world-postgres, and world-vercel). The workflow-server side (vercel/workflow-server#336) honors this when filtering public webhook listings — a hook that backs an abort controller doesn't get exposed via the public webhook endpoint.

What's in this PR

  • Reducer/reviver pairs for AbortController and AbortSignal in all three contexts (external, workflow VM, step). Workflow-VM revivers use the real WorkflowAbortSignal class so signal.addEventListener('abort', fn) actually fires (the natural pattern for code that receives a signal from another step).
  • createCreateAbortController(ctx) — workflow-VM AbortController that registers a system hook on construction, subscribes to the events consumer for hook_created/hook_received/hook_disposed, and propagates abort() through the suspension handler.
  • WorkflowServerWritableStream-based abort packets for real-time delivery to running steps, plus setupAbortStreamReader on the step side that wires the deserialized signal to the stream so an in-flight step's signal.aborted flips even if the replay path hasn't yet reached the workflow.
  • cancelAbortReaders walks step args (including the signal nested inside Request) after the step returns to cancel any dangling readers — prevents serverless functions from being kept alive by abort streams.
  • Implicit abort-hook disposal at workflow completiondrainPendingQueueItems marks any never-aborted system hook as disposed: true before the synthesized suspension, so unused controllers don't leak rows in the hooks table for the run's TTL.
  • isSystem plumbing end-to-end: world types, world-local and world-postgres schemas/storage, and the suspension handler emit it on hook_created.

Docs

Tests

  • Unit — Round-trip serialization with full type fidelity for reasons (string, DOMException, Error subclasses, encrypted), revival in all three contexts, addEventListener on revived signals, AbortSignal.any() ordering + iteration, hook integration, replay determinism. Includes regression tests for the listener-side JSON.stringify bug and the no-op addEventListener stub bug surfaced in review.
  • E2E — Cancels timeouts, parallel steps, in-flight fetch, hook-driven aborts, and a 4-variant matrix verifying that listener firing relative to hook.then() resolution is deterministic across replay (abortHookOrderingWorkflow).

Dependencies

Pairs with vercel/workflow-server#336 (already merged) for isSystem filtering on the public webhook endpoint, so abort hook tokens aren't exposed externally.

…ignal

Adds documentation and test infrastructure for making AbortController and
AbortSignal serializable across workflow and step boundaries. The feature
uses a dual hook+stream backing: hooks for deterministic replay in the
workflow context, streams for real-time propagation to running steps.

Docs:
- Cancellation guide (foundations) covering AbortSignal and run cancellation
- How Cancellation Works (how-it-works) explaining hook+stream internals
- AbortSignal.timeout() error page for the workflow VM restriction
- Updated serialization docs with AbortController/AbortSignal section

Tests (all .todo stubs for TDD):
- VM behavior: AbortController API, static methods, hook integration
- Step-side: stream reader setup, abort propagation, ops queue
- Serialization round-trips: all boundaries, encryption, nested structures
- Consistency: race conditions, partial failure, eventual convergence
- E2E workflows: timeout, parallel, step-initiated, hook-triggered, replay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Mar 9, 2026

🦋 Changeset detected

Latest commit: 06f6654

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages
Name Type
@workflow/core Patch
workflow Patch
@workflow/builders Patch
@workflow/cli Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/vitest Patch
@workflow/web-shared Patch
@workflow/web Patch
@workflow/world-testing Patch
tarballs Patch
@workflow/ai Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/nuxt Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented Mar 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
example-nextjs-workflow-turbopack Ready Ready Preview, Comment May 5, 2026 3:40am
example-nextjs-workflow-webpack Ready Ready Preview, Comment May 5, 2026 3:40am
example-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-astro-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-express-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-fastify-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-hono-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-nitro-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-nuxt-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-sveltekit-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-tanstack-start-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workbench-vite-workflow Ready Ready Preview, Comment May 5, 2026 3:40am
workflow-docs Ready Ready Preview, ✅ 21 resolved, Open in v0 May 5, 2026 3:40am
workflow-nest Ready Ready Preview, Comment May 5, 2026 3:40am
workflow-swc-playground Ready Ready Preview, Comment May 5, 2026 3:40am
workflow-tanstack-start-workflow Error Error May 5, 2026 3:40am
workflow-tarballs Ready Ready Preview, Comment May 5, 2026 3:40am
workflow-web Ready Ready Preview, Comment May 5, 2026 3:40am

Comment thread docs/content/docs/errors/abort-signal-timeout-in-workflow.mdx Outdated
Change type from "error" to "troubleshooting" to match the valid
frontmatter schema used by all other error pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pranaygp and others added 9 commits May 4, 2026 09:19
End-of-run now goes through the same suspension handler that processes a
real suspension. Previously, items left in the invocations queue when the
workflow function returned (or threw) were dropped with an "uncommitted
operation" warning — `controller.abort()` called as the last statement of
a workflow never actually propagated.

Concretely fixes:
- Abort hooks now write hook_received + stream packet so in-flight steps
  on other compute instances see signal.aborted=true and bail out.
- Unawaited hooks are created (so external callers can resume them).
- Unawaited steps and sleeps are queued (will execute / fire later).

Strengthens abortTimeoutWorkflow's test to inspect the event log for the
hook_received event — the original assertion only verified the workflow
VM's local signal.aborted, which was set synchronously by the abort()
call regardless of whether propagation actually happened. The strengthened
test fails on main and passes after this commit.

Drops the warnPendingQueueItems warning entirely. Drain failures are
swallowed so the workflow's own outcome (return value or thrown error)
remains the source of truth for the run's terminal state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ght fetch

The existing abort tests exercised either the polled `signal.aborted` read
path (longStep busy-wait) or the already-aborted-before-fetch path. Nothing
exercised the live listener path: signal starts non-aborted, step kicks off
a fetch against a slow endpoint, abort fires while fetch is awaiting the
response, and fetch's internal `signal.addEventListener('abort', …)` listener
cancels the in-flight HTTP request.

The pre-existing `fetchWithSignal` helper step was orphaned — defined but
not referenced by any workflow. Wires it into a new `abortFetchInFlightWorkflow`
that races a 30s fetch against a 2s sleep, aborts when the sleep wins, and
returns the step's catch-path result. The test asserts both `winner=timeout`
and `fetchResult.aborted=true`, which together prove fetch saw the cancellation
mid-flight (the natural-completion path would set ok=true,aborted=false).

Adds a local /api/delay endpoint to the nextjs-turbopack workbench so the test
doesn't depend on an external service. Honors the request's own AbortSignal
so cancelled connections close immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation

The original test only asserted that the workflow VM's signal saw aborted=true
after a step called controller.abort(). It didn't actually verify that another
in-flight step received the cancellation through the backing stream — those
two paths are different (workflow VM signal updates via the hook event;
sibling-step propagation runs through the live stream packet).

Restructure the workflow to run longStep (a 30s polling loop on signal.aborted)
in parallel with abortFromStep (now sleeps 1s, then aborts). The new assertion
expects longStep.result === 'aborted' — proving it exited via the abort branch
within ~1.5s, NOT ran to its 30s natural completion. Returning 'completed'
would mean realtime cross-step cancellation is broken.

abortFromStep gained an optional delayMs parameter so it can be sequenced
against a sibling without an out-of-band sleep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ents

The abort stream packet was being encoded with bare `JSON.stringify({reason})`
on the writer and decoded with `JSON.parse(text).reason` on the reader. That
codec drops `undefined` (so a reason-less abort wrote literally `{}` and the
observability UI showed an empty stream), and doesn't handle DOMException or
any other type the rest of the codebase serializes via devalue+reducers.

Switch all three sites — suspension-handler workflow-side write, patched
abort step-side write, and `setupAbortStreamReader` — to use
`dehydrateStepArguments`/`hydrateStepArguments`. Now the `reason` round-trips
with full type fidelity (DOMException, custom errors, encrypted payloads),
matching what the hook event payload already does. The suspension handler
literally reuses the same dehydrated bytes for the event and the stream so
they're guaranteed identical.

Encryption key threading:
- Suspension handler: `encryptionKey` was already in scope.
- Patched abort: read from `contextStorage.getStore()?.encryptionKey` (set
  by the step handler before invoking the deserialize chain).
- Reader (`setupAbortStreamReader`): read from `contextStorage.getStore()?.encryptionKey`
  for the same reason; falls back to `undefined` when called outside step
  context (the hydrate path is key-tolerant).

On-disk verification:
- Before: chunk for `controller.abort()` (no reason) was `00 7b 7d` — 3 bytes,
  the literal JSON `{}`, no reason carried at all.
- After: chunk is `00 64 65 76 6c [{"aborted":1,"reason":2},true,"test"]` —
  43 bytes, devalue-flat-encoded with the reason intact.

Updated the existing stream-reader unit test to encode its mock payload
through the same dehydrate path so the reader can decode it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tiated determinism

The polled-`signal.aborted` path was the only abort consumption pattern
exercised end-to-end. Three new e2e tests fill the gaps:

- **abortListenerWorkflow** — `signal.addEventListener('abort', cb)` firing
  on the deserialized step-side signal. Distinct from abortFetchInFlightWorkflow
  which only proves it indirectly through fetch's internal listener; this one
  verifies user-attached listeners directly. Step resolves with via:'listener'
  if propagation worked, via:'timeout' on a 30s safety timeout if it didn't.

- **abortThrowIfAbortedMidFlightWorkflow** — throwIfAborted() in a polling
  loop, not just at step entry. The existing abortThrowIfAbortedWorkflow
  only covers the synchronous-throw case on a pre-aborted signal. This one
  starts the signal non-aborted, polls throwIfAborted every 500ms, and aborts
  from a sibling step after 1s. Verifies the DOMException propagates as
  FatalError (no retries) when fired mid-flight.

- **abortDeterministicBranchFromStepWorkflow** — counterpart to
  abortDeterministicBranchWorkflow, but with the abort source being a step
  (via the patched abort() path / hook event) instead of the workflow body.
  Both branch-reads MUST take the same path on every replay. Uncovered a
  real semantic: signal.aborted reflects step-initiated aborts only after
  the next promise-queue checkpoint (sleep, step await, etc.) since
  _setAborted is chained on promiseQueue. The test inserts the required
  sleep('1s') checkpoint and asserts both pre and post values.

Helper steps factored: stepWaitingOnAbortListener and stepPollingThrowIfAborted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shortcut would have masked a regression in the addEventListener-on-an-
already-aborted-signal contract. Per the AbortSignal spec, calling
addEventListener('abort', cb) on an aborted signal fires the callback (on a
microtask), so user code that subscribes via the listener path alone — the
common pattern — depends on it. Test the contract directly: rely solely on
the listener resolving the promise. If addEventListener-on-aborted ever
silently breaks, this test now reports via:'timeout' instead of paving over
it with a fast-path that reads signal.aborted directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ignal

* origin/main:
  [workbench] Add TanStack Start workbench and tests (#1875)
  Atomically dedupe duplicate step_created/wait_created events in world-local (#1877)
  Split tarball hosting out of docs into its own project (#1893)
  Replace fixed-sleep hook waits with event-driven waitForHook helper (#1879)
… hydrates abort reasons

The observability UI (and CLI) hydrates step IO via `observabilityRevivers`,
which had no `DOMException` entry. When a step returned a value containing
a DOMException (typically `{aborted, reason: <DOMException>}` — synthesized
by native AbortController when abort() is called with no reason), devalue's
`parse` would throw on the `["DOMException", ...]` tag, `hydrateStepIO`'s
try/catch would swallow it, and the raw devalue-flat string survived to
the UI. The user-visible result was step Output showing literal text like:

  devl[{"aborted":1,"reason":2},true,["DOMException",3]...]

instead of a JSON viewer with a proper DOMException card.

Add the reviver. Reconstruct as a real DOMException when the global is
available (modern browsers + Node 18+, where the o11y consumers run),
falling back to a name-tagged Error otherwise. Preserves message/name/
stack/cause for display.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing abortExternalSignalWorkflow only validates a static read of
an already-aborted signal — it tells us nothing about whether an abort
that fires AFTER serialization actually propagates from the caller process,
through the listener attached at workflow-start, into the backing stream,
and out into the deserialized signals on the in-flight step compute.

Add abortExternalSignalInFlightWorkflow that takes a non-aborted signal
and runs two parallel consumption patterns against it: longStep (polling
signal.aborted) and stepWaitingOnAbortListener (addEventListener path).

The test creates a fresh AbortController, calls start() with its non-aborted
signal, and aborts the source controller 1.5s later via setTimeout — well
after both steps are mid-flight on their compute instances.

Both consumers must see the cancellation:
- pollResult === 'aborted' (NOT 'completed' — that would mean longStep ran
  the full 30s without ever seeing signal.aborted=true)
- listenerResult.via === 'listener' (NOT 'timeout' — that would mean the
  addEventListener callback never fired)

This exercises the longest end-to-end abort path in the codebase:
  caller-process AbortController → serialization-time listener →
  backing stream → step compute → deserialized signal →
  (poll OR addEventListener)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pranaygp
Copy link
Copy Markdown
Contributor Author

pranaygp commented May 4, 2026

Summary of changes since my last review

A lot of activity since — bugs uncovered while exercising the e2e path end-to-end against a live dev server, the merge of #1647, doc corrections, and a deeper test pass.

Status now: branch is current with main, 846 unit tests pass, 21 abort e2e tests pass end-to-end through Next.js + Turbopack.

Bugs fixed

  1. signal field dropped from Request serializable type during a main-merge (1bce60fa8) — broke typecheck.

  2. DOMException serialization was broken (a9532014a)
    The reducer's first guard was types.isNativeError(value) — but isNativeError(new DOMException(...)) returns false in Node despite DOMException being instanceof Error. So DOMExceptions never matched any reducer and devalue fell through to its arbitrary-POJO failure path. Switched to a constructor-name check. Also fixed 7 pre-existing DOMException test failures on main.

  3. Pending queue items dropped at workflow completion (0d025f1f7) — the biggest bug
    controller.abort() called as the last statement of a workflow never propagated to in-flight steps on other compute instances. The runtime warned about the "uncommitted operation" but didn't flush it. End-of-run now drains through the same suspension handler that processes any other suspension — no special-casing for abort. Affects unawaited steps, hooks, sleeps too. Matches normal JS semantics where async work spawned by a function continues after it returns.

  4. Abort stream packet used bare JSON.stringify (d70ade7ce)
    Lost type info: controller.abort() with no reason wrote literally {} to the stream (3 bytes); DOMException / custom errors couldn't round-trip. Switched all three sites — suspension-handler workflow-side write, patched-abort step-side write, and setupAbortStreamReader — to use dehydrateStepArguments / hydrateStepArguments. Hook event payload and stream packet now produce byte-identical output.

  5. Observability UI rendered abort reasons as raw devl[...] text (7c07e344b)
    observabilityRevivers had no DOMException reviver. devalue.parse threw on ["DOMException", ...], hydrateStepIO's try/catch swallowed it, raw text reached the UI. Added the reviver — JSON viewer now renders DOMException as a proper Error card with name/message/stack.

  6. Original review feedback fixes (34f490d69)
    Listener-leak dedup via ABORT_LISTENER_ATTACHED marker symbol. Replaced token.replace('abrt_', '') string surgery in the suspension handler. Documented the deliberate sync-listener-firing divergence in the workflow VM. Added a changeset noting the AbortErrorFatalError behavior change.

New / strengthened tests (16 → 21 abort e2e)

The original abortTimeoutWorkflow test only verified the workflow VM's local signal.aborted — which is set synchronously by abort() regardless of whether anything propagates. Strengthened it to inspect the event log for hook_received, then added the missing tests:

Test Unique coverage
abortFromStepWorkflow (extended) Step-initiated abort actually cancels an in-flight sibling step (was: only checked the aborted signal state)
abortFetchInFlightWorkflow addEventListener path firing while fetch() is awaiting an HTTP response
abortListenerWorkflow Direct test of signal.addEventListener('abort', cb) on the deserialized step-side signal
abortThrowIfAbortedMidFlightWorkflow throwIfAborted() thrown mid-flight (was: only pre-aborted entry)
abortDeterministicBranchFromStepWorkflow Step-initiated aborts preserve replay determinism (counterpart to the existing workflow-body version)
abortExternalSignalInFlightWorkflow External signal aborted after start(), propagates mid-flight into nested polling + listener steps

Plus a workbench /api/delay route so the fetch-cancellation tests don't depend on a flaky external service.

Test stubs

The original review's biggest concern was 65 .todo stubs. Two things fixed it:

Current count: 0 .todos in the three abort test files, all assertions are executable.

Original review point status

Concern Status
c1 — listener leak in reducers attachAbortListenerOnce + marker symbol
c2 — token.replace('abrt_', '') string surgery ✅ replaced (getAbortStreamIdFromToken helper from #1647)
c3 — AbortErrorFatalError behavior change ✅ documented in changeset
c4 — sync listener firing ✅ kept sync (test depends on it for replay determinism), comment now explains the deliberate spec divergence
c5 — unused _runId ❌ withdrawn — actually used post-merge
c6 — tests as .todo stubs ✅ resolved

Docs corrections (64264015b)

While reviewing the implementation I found four real inaccuracies in the cancellation docs and fixed them:

  • Self-contradiction about whether signal.aborted is set synchronously when abort() is called in the workflow (it is)
  • "Step-side abort handler retries the hook resume" wasn't accurate (it's best-effort with eventual replay convergence — reworded)
  • "Pending queue items processed on completion" was wrong at the time (was warn-only) — now it IS true after the drain fix, doc matches the new behavior
  • Request.signal scope was oversimplified (only forwards aborted/tagged signals, not all)

Includes #1647

Merged @karthikscale3's follow-up. Highlights from there: dangling-stream-reader fix (cancelAbortReaders after step return), DurableAgent timeout regression (the typeof AbortController !== 'undefined' guard flipped because we added it to the VM globals), is_system migration plumbing through world-postgres + world-local, dedup of ~180 lines of reducer code into shared helpers.


CI running on 45d120743. Ready for another look.

pranaygp and others added 5 commits May 4, 2026 12:11
The previous setup added a /api/delay route to workbench/nextjs-turbopack
to give the test a slow endpoint to fetch against. That made the workflow
fail in CI on every other workbench (nextjs-webpack, astro, sveltekit, …)
since the route only existed on one of them — fetch returned 404 and the
test failed within 1s instead of taking the expected ~3s.

Switch to httpbin.org/delay/30, the same external-service pattern used by
other e2e workflows in this file (jsonplaceholder, example.com). Removes
the per-workbench dependency. Drops the now-unused deploymentUrl argument
from the workflow signature and test call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ignal

* origin/main:
  Fix pnpm type issue after tanstack PR (#1907)
  ci: pass stale-banner via path: to sticky-pull-request-comment in tests + benchmarks workflows (#1887)
…oller section

Two issues on the serialization foundations page:

1. `## Pass-by-Value Semantics` appeared twice. The second occurrence had no
   body, which rendered as an orphaned heading just above the AbortController
   section in the docs preview.

2. `## AbortController & AbortSignal` was at the bottom of the page, after
   `## Custom Class Serialization`. It belongs above the custom-class section
   so the standard serializable types are grouped together before the
   advanced topic.

Removes the empty duplicate; relocates the AbortController section to sit
between Request & Response and Custom Class Serialization. No content
changes inside the section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…button

The Run Cancellation section showed the programmatic path but didn't tie
it back to the UI. Add a callout: calling run.cancel() is the same action
as clicking the Cancel button on a run in the observability UI — both
produce identical run_cancelled events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two distinct paths: the workflow VM ships its own AbortSignal.any impl in
workflow/abort-controller.ts (composes WorkflowAbortSignals via listeners,
no stream/hook backing on the composite), while steps use the native
Node implementation over deserialized signals. Neither was tested.

abortAnyInWorkflowWorkflow exercises the VM impl directly: creates two
controllers, composes their signals via AbortSignal.any, aborts one, and
asserts the composite reflects the abort synchronously without any stream
round-trip. Also asserts the other source signal is unaffected so a
mass-abort regression would surface here.

abortAnyInStepWorkflow exercises the longest end-to-end path that uses
AbortSignal.any: source controller is aborted by a sibling step, abort
flows through the workflow's VM, then the backing stream, into the step's
deserialized signal, into the AbortSignal.any composite, into the user's
listener. Returning via:'timeout' instead of via:'listener' would mean a
break anywhere on that chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@VaguelySerious VaguelySerious left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, please update the changesets to be less AI slop, see suggestions

Comment thread .changeset/fix-dom-exception-serialization.md Outdated
Comment thread .changeset/drain-pending-queue-on-completion.md Outdated
Comment thread .changeset/serializable-abort-controller.md Outdated
pranaygp and others added 5 commits May 4, 2026 14:04
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
…xample

Two toolbar-comment fixes on the abort-signal-timeout-in-workflow error
page:

1. The page title was Title Case ("AbortSignal.timeout() in Workflow")
   while every other page in docs/content/docs/errors/ uses the kebab-case
   slug as the title (e.g. timeout-in-workflow, fetch-in-workflow,
   workflow-not-registered). Match the convention.

2. The recommended replacement for AbortSignal.timeout() was a
   Promise.race that wrapped the abort + null sentinel + custom Error
   throw. Boil it down to the much simpler:

       const controller = new AbortController();
       void sleep("10s").then(() => controller.abort());
       return await fetchData(controller.signal);

   If fetchData finishes within 10s you get the response; if not, the
   timer fires controller.abort(), fetch rejects with AbortError, and
   the step's failure propagates to the workflow as a FatalError (no
   retries). Same observable behavior, no Promise.race scaffolding.

Adds abortVoidSleepTimeoutWorkflow + matching e2e test that exercises
this exact pattern end-to-end so the doc example is verified runnable
(not just pseudocode). Asserts the fetch is cancelled mid-flight by
the timer, returning aborted=true,ok=false from the step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@TooTallNate TooTallNate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big PR with sound architecture (dual-channel hook+stream backing for cross-compute abort propagation, with the hook providing durable replay state and the stream providing real-time delivery to running steps). Spent significant time tracing the implementation; calling out one blocking bug, a few moderate concerns, and a follow-up suggestion.

Blocking

attachAbortListenerOnce writes plain JSON; the reader hydrates with hydrateStepArguments

serialization.ts:671–682 — the listener attached when a native external AbortController crosses the external→workflow or workflow→step boundary writes the abort packet using bare JSON.stringify:

const packet = new TextEncoder().encode(
  JSON.stringify({ reason: signal.reason })
);
ops.push(writer.write(packet).then(() => writer.close()));

But setupAbortStreamReader (serialization.ts:1046–1051) reads with hydrateStepArguments(result.value, runId, encryptionKey), which expects a format-prefixed (and possibly encrypted) buffer per the established envelope.

When attachAbortListenerOnce's packet hits the reader, hydrateStepArguments throws "unknown serialization format" (or fails encryption). The reader's try { … } catch { controller.abort() } (lines 1052–1054) catches that — but with no reason. The user sees signal.reason === undefined even though the external caller did controller.abort('my-reason').

The two other writer sites (runtime/suspension-handler.ts:218–223 for the workflow-side write, and the patched controller.abort() in serialization.ts:1147–1170 for step-initiated aborts) both use dehydrateStepArguments correctly. This site was missed in that refactor.

The fix is mechanical — replace the JSON.stringify with dehydrateStepArguments({aborted:true, reason: signal.reason}, runId, encryptionKey). Worth adding a regression test that aborts externally with a non-trivial reason (e.g. a string, or a DOMException) and asserts signal.reason round-trips into the step. The existing tests use no-reason aborts which mask this entirely.

Wire-format change isn't capability-gated

The new AbortController and AbortSignal entries in SerializableSpecial (serialization/types.ts:127–132) are emitted unconditionally. An older @workflow/core worker deserializing inputs/payloads that contain the new shape will throw "unknown type" from devalue. There's no entry in the getRunCapabilities table to gate this on workflowCoreVersion, unlike the existing pattern (e.g. encryption format).

For partial-deploy / cross-deployment scenarios — e.g. a resumeHook(token, payload) from a newer deployment targeting an older run, or start({deploymentId}) cross-deployment — the older side will fail to revive. AGENTS.md's pattern is: bump workflowCoreVersion, add a capability flag, gate the producer on getRunCapabilities(targetRun.executionContext.workflowCoreVersion).serializableAbortController, fall back to a snapshot-only revive (e.g. an already-aborted signal with the snapshot reason) when the target doesn't support it.

This matters less if the abort feature is gated behind an opt-in user flag for the first release, or only used internally — but for shipping in start() args / resumeHook() payloads / step return values, the gating needs to be in place.

Workflow-server isSystem filtering dependency (already noted in PR description)

The PR description acknowledges this depends on workflow-server PR #336. Confirming for completeness: without server-side isSystem: true filtering, the public webhook endpoint would expose abort hook tokens, allowing externally-initiated aborts via the hook URL. Token format is abrt_<ULID> — not trivially guessable, but not encrypted either. The isSystem flag must be honored server-side before this can ship to production.

Moderate

Workflow VM revivers produce no-op addEventListener stubs

serialization.ts:5975–6016 (workflow-internal revivers) construct plain object stubs with addEventListener: () => {}. A workflow that does signal.addEventListener('abort', fn) after receiving a deserialized signal will silently never fire that callback. The proper class (WorkflowAbortSignal in workflow/abort-controller.ts:18) already exists; the workflow-VM revivers should use it. This is a silent correctness bug for the natural pattern of "register a listener on a signal you received as an arg."

First-run vs replay listener-firing site

On first-run, abort() fires listeners synchronously at the abort() call site. On replay, the events consumer chains _setAborted(reason) onto promiseQueue and listeners fire at the next promise-queue tick — typically before the workflow code reaches the corresponding abort() line. The abortHookOrderingWorkflow test was added to cover this and is currently skipped (commit test: skip abort+hook ordering e2e tests pending full integration).

This is acceptable as a documented constraint (the same shape as hook payloads), but: (a) the test should either be enabled and pass before merge or explicitly removed with a comment explaining the deferral, (b) the user-facing docs (cancellation.mdx) should document that listener firing relative to surrounding code is non-deterministic and should not be used for control flow that depends on the surrounding state.

Abort hooks never disposed

workflow/abort-controller.ts:151–154 handles hook_disposed events but nothing in the PR creates one. WorkflowAbortController has no dispose() method, and drainPendingQueueItems doesn't mark abort hooks as disposed: true before the synthetic suspension. Result: every new AbortController() that's never aborted leaks one row in the postgres hooks table for the run's lifetime (cleaned up only by run TTL).

Realistically bounded by user behavior — most workflows construct one or two controllers, not thousands — but worth either: (a) implicit dispose on workflow completion in drainPendingQueueItems, (b) a dispose method exposed on WorkflowAbortController, or (c) documenting the lifetime.

AbortSignal.any() listener cleanup + iteration

workflow/abort-controller.ts:200–229:

  • No removeEventListener on input signals after the composite aborts. If any input signal outlives the composite (e.g. an external long-lived controller), the listener closure (capturing composite) prevents GC. Bounded by run lifetime so not catastrophic, but real.
  • signals is iterated twice. A single-shot iterable (e.g. a generator) produces zero entries on the second pass. Native AbortSignal.any materializes the iterable into an array first; this implementation should too: const arr = Array.from(signals).

Unit test doesn't import the function under test

abort-controller-step.test.ts:85–161 reimplements reviveAbortController locally rather than importing it. The comment admits "Uses mock data directly to avoid dynamic imports that can cause hangs in vitest's module mock system." This means a regression in the production reviveAbortController won't surface in unit tests — only in the e2e tests in 99_e2e.ts. Worth sorting out the vitest mock issue rather than maintaining a parallel implementation in the test file.

Cold-start race window

setupAbortStreamReader opens a stream reader during step argument hydration, but reader.read() is async (network round-trip). User code in the step body can run before the first read resolves. If the workflow aborted concurrently with step pickup (after step.input was serialized but before the step's reader started), the user's first synchronous if (signal.aborted) check returns false even though the abort packet was already written.

Not a regression from any prior state — the snapshot-in-payload mechanism handles aborts that happened before enqueue, so this only affects the narrow window between enqueue and step pickup. But worth either: (a) a synchronous getChunks(limit:1) snapshot on the abort stream during reviver bootstrap to surface any already-written packet before user code runs, or (b) documenting the window so users know to await something async (a signal.throwIfAborted() check after the first await) rather than rely on synchronous polling.

Stale PR description

The PR description says "Implementation TBD — tests provide the specification to build against" and "all .todo stubs." Both are wrong by a wide margin — the PR has 78 commits, real implementation across 61 files, and ~120 it() blocks with full assertions. Reviewers reading the description may approve a 7000-line implementation thinking it's a spec-only PR. Worth updating before merge.

Follow-up suggestion: native AbortSignal.timeout() support

The timeout() static currently throws and points users to a sleep-based workaround. There's no architectural reason the workflow VM can't support it natively — the workaround is mechanical:

timeout(ms: number): WorkflowAbortSignal {
  const controller = new WorkflowAbortController();
  ctx.promiseQueue = ctx.promiseQueue.then(async () => {
    await ctx.sleep(ms);
    controller.abort(new DOMException('signal timed out', 'TimeoutError'));
  });
  return controller.signal;
}

Replay determinism comes for free since sleep() records a wait event and controller.abort() already goes through the hook machinery. Strictly improves UX vs. the workaround:

  • signal.reason.name === 'TimeoutError' (matches native), instead of 'AbortError' from the manual workaround.
  • Less boilerplate.
  • No void foot-gun.

Would need createAbortSignalStatics to take more than _vmGlobalThis — the orchestrator ctx (the same one passed to createCreateAbortController) gives access to sleep and promiseQueue. Modest refactor.

Not blocking — the workaround is documented and works — but worth considering as a follow-up PR. Better DX and removes a docs page that says "this thing doesn't work, here's how to fake it" which is awkward to ship as-is.

Nits

  • createAbortSignalStatics(_vmGlobalThis: Record<string, any>) — parameter is unused; if you do the timeout follow-up the signature changes anyway, but worth dropping the unused param now.
  • DOMException code field isn't preserved by the reducer (serialization/reducers/common.ts:6092–6102). Recovered via constructor lookup for spec-listed names (AbortError, TimeoutError, etc.) so the common case works; lost for custom names. Probably fine but worth a sentence in the changeset.
  • The aborted: true field in the dehydrated payload (suspension-handler.ts:218–223 etc.) is decorative — the consumer only reads payload.reason. Could just dehydrate the reason directly.
  • Request.signal re-tagging via copyAbortInternals (serialization.ts:5947–5952) depends on platform-specific Request internals. Worth a cross-runtime test (undici, Node 22, browser).
  • Reducer side-effect attaches a listener to user-provided signal objects via reduceAbortWithListener. The ABORT_LISTENER_ATTACHED symbol guards against duplicates, but it's still surprising that serializing an object adds a listener to it. Worth documenting somewhere.

What looks good

  • The dual hook+stream architecture is the right factoring: hook for durable replay state, stream for real-time delivery. Symmetric writer paths (workflow-side, step-side, external-side) all dehydrate via the same machinery (modulo the bug above).
  • getAbortStreamId / getAbortStreamIdFromToken round-tripping through the abrt_ prefix is clean.
  • cancelAbortReaders walking args/this/closure (and explicitly descending into Request.signal) for step cleanup is thorough.
  • ABORT_LISTENER_ATTACHED dedup symbol prevents fanout-listener accumulation when a single signal is passed to N steps.
  • The isSystem: true plumbing is consistent across world-postgres, world-local, and the schema migration — all that's missing is the workflow-server side.
  • AbortSignal.any() correctly handles already-aborted inputs in iteration order (matches native).
  • The AbortSignal.timeout() workaround is documented and works — even if it'd be nicer to support natively.

Summary

Five blocking issues to address before merge: (1) the JSON.stringify bug in attachAbortListenerOnce, (2) capability gating on workflowCoreVersion, (3) the dependency on workflow-server #336, (4) workflow-VM revivers using no-op stubs instead of WorkflowAbortSignal, (5) the stale PR description.

Several moderate concerns worth at least documenting: replay listener-firing divergence, hook leak on undisposed controllers, AbortSignal.any cleanup/iteration, cold-start race window.

Architecture is sound; the implementation needs a few corrections to match the design intent.

Five fixes from PR review:

1. attachAbortListenerOnce now dehydrates via dehydrateStepArguments instead
   of bare JSON.stringify so the reader (hydrateStepArguments) can decode
   the packet. Previously, an external AbortController aborted with a
   non-trivial reason (DOMException, Error, etc.) had its reason silently
   dropped — the reader's catch fell through to controller.abort() with no
   reason. Adds a regression test that aborts with a DOMException reason
   and asserts full type fidelity round-trips.

2. Workflow-VM revivers now use the real WorkflowAbortSignal class instead
   of plain-object stubs with addEventListener: () => {}. The previous
   stubs silently dropped listener registrations — signal.addEventListener
   ('abort', fn) on a deserialized signal would never fire. Adds two
   regression tests covering pre-aborted and post-aborted paths.

3. drainPendingQueueItems now marks any never-aborted system (abort) hook
   as disposed: true so unused AbortController instances don't leak rows
   in the hooks table for the run's TTL. User hooks are intentionally left
   alone (their lifetime is user-managed).

4. AbortSignal.any() materializes the iterable with Array.from(signals)
   before iterating twice (otherwise single-shot iterables like generators
   produce zero entries on the second pass) and removes its abort listeners
   from input signals after the composite aborts so closures don't pin
   long-lived inputs from GC. Drops the unused _vmGlobalThis param while
   we're touching the signature.

5. Unskips abortHookOrderingWorkflow e2e tests (4 variants). Bumps the
   workflow's trailing sleep to 10s so the test harness has time to resume
   the user hook before the workflow returns and `using` disposes it.

Updates PR description and the drain-pending-queue changeset.
@pranaygp
Copy link
Copy Markdown
Contributor Author

pranaygp commented May 5, 2026

@TooTallNate — addressed your review feedback in 409d9dc. Five fixes:

  1. attachAbortListenerOnce now uses dehydrateStepArguments instead of bare JSON.stringify. Reasons (DOMException, Error subclasses, custom classes, encrypted) round-trip with full type fidelity now. Added a regression test that aborts an external AbortController with a DOMException reason and asserts the receiver sees signal.reason instanceof DOMException — this would have failed under the previous code.

  2. Workflow-VM revivers now use the real WorkflowAbortSignal class instead of plain-object stubs with addEventListener: () => {}. signal.addEventListener('abort', fn) after deserialization now actually fires. Added two regression tests covering pre-aborted and post-aborted paths.

  3. Implicit abort-hook disposal at workflow completiondrainPendingQueueItems marks any never-aborted system hook as disposed: true so unused AbortControllers don't leave abandoned rows in the hooks table for the run's TTL. User hooks are intentionally left alone.

  4. AbortSignal.any() fixes — materializes the iterable once with Array.from(signals) so single-shot iterables (generators) work, and removes its abort listeners from input signals after the composite aborts. Also dropped the unused _vmGlobalThis parameter while touching the signature. Two regression tests added.

  5. Unskipped the 4-variant abortHookOrderingWorkflow e2e tests. Bumped the workflow's trailing sleep from 1s to 10s so the test harness has time to resume the user hook before the workflow returns and using disposes it.

Skipped per discussion with @pranaygp: capability gating on workflowCoreVersion. Decision: this feature is greenfield and only ships in v5, so cross-deployment compatibility isn't a real concern for this release. We can revisit if/when we ship to a stable channel where mixed-version runs are expected.

Already in place: isSystem plumbing across world-local/world-postgres/world-vercel. The workflow-server side (vercel/workflow-server#336) is already merged.

Not done (deferred to follow-up): native AbortSignal.timeout() support — your sketch is correct and is a UX improvement, but not blocking for this PR. Will land separately.

PR description has been updated to reflect what's actually shipping. The drain-pending-queue changeset now mentions the disposal behavior.

Verification: 913/913 unit tests pass; 26/26 abort e2e tests pass (the 4 newly-unskipped + 22 existing).

pranaygp added 3 commits May 5, 2026 11:36
…ignal

* origin/main:
  [core] Skip inline step execution when suspension also has a wait (#1924)
  [errors] Replace chalk import in @workfow/errors with inline ANSI shim (#1915)
  Fix compatibility with Zod 4.4.x (#1902)
  Serialize `run_failed`/`step_failed` errors through serialization pipeline (#1851)
  tarballs: redesign preview tarballs index page (#1911)
  Remove extra changeset (#1922)
  Add stable Next.js eager and lazy test coverage (#1747)
  Enforce per-(run, correlation) uniqueness for entity-creating events in world-postgres (#1878)
  fix(world-vercel): add default request timeout to workflow-server HTTP calls (#1807)
  Allow disabling step sourcemap with new `sourcemap` option in builders (#1842)
  [ci] Enable Vercel-prod e2e for tanstack-start (#1904)
  web: configure vercelPreset() for Vercel deployments (#1815)
  [core] Combine flow+step bundle and process steps eagerly (#1338)
  [world-vercel] Revert stream close control framing (#1891)
  [tarballs] Use turbo to build workspace deps before packing (#1908)

# Conflicts:
#	packages/core/src/runtime/step-handler.test.ts
#	packages/core/src/runtime/step-handler.ts
#	packages/core/src/runtime/suspension-handler.ts
#	packages/core/src/step.test.ts
#	packages/world-local/src/storage/events-storage.ts
#	packages/world-postgres/src/drizzle/migrations/meta/_journal.json
V2 (#1338) introduced a lazy world resolution path used by
WorkflowServerWritableStream. The abort-listener tests exercise this
path; add a parallel getWorldLazy mock so writeMock is observed in
both legacy and V2 code paths. Also threads the lazy mock through
the per-test mockReturnValue overrides for the three tests that
already overrode getWorld.
The events consumer was reading event.eventData.payload directly to
extract the abort reason. In production the payload is dehydrated
(written by the suspension handler via dehydrateStepArguments —
either as a Uint8Array or, with encryption, an encrypted Uint8Array
prefixed with `encr`), so the `'reason' in payload` check was always
false on replay. The signal still aborted, but with reason=undefined
— the user-supplied reason ('custom timeout reason', a DOMException,
etc.) was lost across the replay boundary.

Local single-invocation runs masked this because the workflow body's
synchronous _setAborted(reason) call already set the right state on
the first execution; the bug only surfaced when the workflow actually
re-replayed (multi-invocation Vercel runs), at which point the events
consumer overrode the synchronous reason with undefined before
checkSignalState serialized the signal to the next step.

Mirrors the pattern used for regular hook payloads in
workflow/hook.ts:117 — hydrate via hydrateStepReturnValue inside the
promiseQueue chain so the reason round-trips with full type fidelity
and respects the run's encryption key.

Caught by `abortReasonWorkflow: abort reason preserved across
boundaries` failing on Vercel-prod tanstack-start. Now passes on the
local nextjs-turbopack abort suite (26/26).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants