feat: serializable AbortController/AbortSignal#1301
Conversation
…ignal Adds documentation and test infrastructure for making AbortController and AbortSignal serializable across workflow and step boundaries. The feature uses a dual hook+stream backing: hooks for deterministic replay in the workflow context, streams for real-time propagation to running steps. Docs: - Cancellation guide (foundations) covering AbortSignal and run cancellation - How Cancellation Works (how-it-works) explaining hook+stream internals - AbortSignal.timeout() error page for the workflow VM restriction - Updated serialization docs with AbortController/AbortSignal section Tests (all .todo stubs for TDD): - VM behavior: AbortController API, static methods, hook integration - Step-side: stream reader setup, abort propagation, ops queue - Serialization round-trips: all boundaries, encryption, nested structures - Consistency: race conditions, partial failure, eventual convergence - E2E workflows: timeout, parallel, step-initiated, hook-triggered, replay Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 06f6654 The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Change type from "error" to "troubleshooting" to match the valid frontmatter schema used by all other error pages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End-of-run now goes through the same suspension handler that processes a real suspension. Previously, items left in the invocations queue when the workflow function returned (or threw) were dropped with an "uncommitted operation" warning — `controller.abort()` called as the last statement of a workflow never actually propagated. Concretely fixes: - Abort hooks now write hook_received + stream packet so in-flight steps on other compute instances see signal.aborted=true and bail out. - Unawaited hooks are created (so external callers can resume them). - Unawaited steps and sleeps are queued (will execute / fire later). Strengthens abortTimeoutWorkflow's test to inspect the event log for the hook_received event — the original assertion only verified the workflow VM's local signal.aborted, which was set synchronously by the abort() call regardless of whether propagation actually happened. The strengthened test fails on main and passes after this commit. Drops the warnPendingQueueItems warning entirely. Drain failures are swallowed so the workflow's own outcome (return value or thrown error) remains the source of truth for the run's terminal state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ght fetch
The existing abort tests exercised either the polled `signal.aborted` read
path (longStep busy-wait) or the already-aborted-before-fetch path. Nothing
exercised the live listener path: signal starts non-aborted, step kicks off
a fetch against a slow endpoint, abort fires while fetch is awaiting the
response, and fetch's internal `signal.addEventListener('abort', …)` listener
cancels the in-flight HTTP request.
The pre-existing `fetchWithSignal` helper step was orphaned — defined but
not referenced by any workflow. Wires it into a new `abortFetchInFlightWorkflow`
that races a 30s fetch against a 2s sleep, aborts when the sleep wins, and
returns the step's catch-path result. The test asserts both `winner=timeout`
and `fetchResult.aborted=true`, which together prove fetch saw the cancellation
mid-flight (the natural-completion path would set ok=true,aborted=false).
Adds a local /api/delay endpoint to the nextjs-turbopack workbench so the test
doesn't depend on an external service. Honors the request's own AbortSignal
so cancelled connections close immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation The original test only asserted that the workflow VM's signal saw aborted=true after a step called controller.abort(). It didn't actually verify that another in-flight step received the cancellation through the backing stream — those two paths are different (workflow VM signal updates via the hook event; sibling-step propagation runs through the live stream packet). Restructure the workflow to run longStep (a 30s polling loop on signal.aborted) in parallel with abortFromStep (now sleeps 1s, then aborts). The new assertion expects longStep.result === 'aborted' — proving it exited via the abort branch within ~1.5s, NOT ran to its 30s natural completion. Returning 'completed' would mean realtime cross-step cancellation is broken. abortFromStep gained an optional delayMs parameter so it can be sequenced against a sibling without an out-of-band sleep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ents
The abort stream packet was being encoded with bare `JSON.stringify({reason})`
on the writer and decoded with `JSON.parse(text).reason` on the reader. That
codec drops `undefined` (so a reason-less abort wrote literally `{}` and the
observability UI showed an empty stream), and doesn't handle DOMException or
any other type the rest of the codebase serializes via devalue+reducers.
Switch all three sites — suspension-handler workflow-side write, patched
abort step-side write, and `setupAbortStreamReader` — to use
`dehydrateStepArguments`/`hydrateStepArguments`. Now the `reason` round-trips
with full type fidelity (DOMException, custom errors, encrypted payloads),
matching what the hook event payload already does. The suspension handler
literally reuses the same dehydrated bytes for the event and the stream so
they're guaranteed identical.
Encryption key threading:
- Suspension handler: `encryptionKey` was already in scope.
- Patched abort: read from `contextStorage.getStore()?.encryptionKey` (set
by the step handler before invoking the deserialize chain).
- Reader (`setupAbortStreamReader`): read from `contextStorage.getStore()?.encryptionKey`
for the same reason; falls back to `undefined` when called outside step
context (the hydrate path is key-tolerant).
On-disk verification:
- Before: chunk for `controller.abort()` (no reason) was `00 7b 7d` — 3 bytes,
the literal JSON `{}`, no reason carried at all.
- After: chunk is `00 64 65 76 6c [{"aborted":1,"reason":2},true,"test"]` —
43 bytes, devalue-flat-encoded with the reason intact.
Updated the existing stream-reader unit test to encode its mock payload
through the same dehydrate path so the reader can decode it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tiated determinism
The polled-`signal.aborted` path was the only abort consumption pattern
exercised end-to-end. Three new e2e tests fill the gaps:
- **abortListenerWorkflow** — `signal.addEventListener('abort', cb)` firing
on the deserialized step-side signal. Distinct from abortFetchInFlightWorkflow
which only proves it indirectly through fetch's internal listener; this one
verifies user-attached listeners directly. Step resolves with via:'listener'
if propagation worked, via:'timeout' on a 30s safety timeout if it didn't.
- **abortThrowIfAbortedMidFlightWorkflow** — throwIfAborted() in a polling
loop, not just at step entry. The existing abortThrowIfAbortedWorkflow
only covers the synchronous-throw case on a pre-aborted signal. This one
starts the signal non-aborted, polls throwIfAborted every 500ms, and aborts
from a sibling step after 1s. Verifies the DOMException propagates as
FatalError (no retries) when fired mid-flight.
- **abortDeterministicBranchFromStepWorkflow** — counterpart to
abortDeterministicBranchWorkflow, but with the abort source being a step
(via the patched abort() path / hook event) instead of the workflow body.
Both branch-reads MUST take the same path on every replay. Uncovered a
real semantic: signal.aborted reflects step-initiated aborts only after
the next promise-queue checkpoint (sleep, step await, etc.) since
_setAborted is chained on promiseQueue. The test inserts the required
sleep('1s') checkpoint and asserts both pre and post values.
Helper steps factored: stepWaitingOnAbortListener and stepPollingThrowIfAborted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shortcut would have masked a regression in the addEventListener-on-an-
already-aborted-signal contract. Per the AbortSignal spec, calling
addEventListener('abort', cb) on an aborted signal fires the callback (on a
microtask), so user code that subscribes via the listener path alone — the
common pattern — depends on it. Test the contract directly: rely solely on
the listener resolving the promise. If addEventListener-on-aborted ever
silently breaks, this test now reports via:'timeout' instead of paving over
it with a fast-path that reads signal.aborted directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ignal * origin/main: [workbench] Add TanStack Start workbench and tests (#1875) Atomically dedupe duplicate step_created/wait_created events in world-local (#1877) Split tarball hosting out of docs into its own project (#1893) Replace fixed-sleep hook waits with event-driven waitForHook helper (#1879)
… hydrates abort reasons
The observability UI (and CLI) hydrates step IO via `observabilityRevivers`,
which had no `DOMException` entry. When a step returned a value containing
a DOMException (typically `{aborted, reason: <DOMException>}` — synthesized
by native AbortController when abort() is called with no reason), devalue's
`parse` would throw on the `["DOMException", ...]` tag, `hydrateStepIO`'s
try/catch would swallow it, and the raw devalue-flat string survived to
the UI. The user-visible result was step Output showing literal text like:
devl[{"aborted":1,"reason":2},true,["DOMException",3]...]
instead of a JSON viewer with a proper DOMException card.
Add the reviver. Reconstruct as a real DOMException when the global is
available (modern browsers + Node 18+, where the o11y consumers run),
falling back to a name-tagged Error otherwise. Preserves message/name/
stack/cause for display.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing abortExternalSignalWorkflow only validates a static read of an already-aborted signal — it tells us nothing about whether an abort that fires AFTER serialization actually propagates from the caller process, through the listener attached at workflow-start, into the backing stream, and out into the deserialized signals on the in-flight step compute. Add abortExternalSignalInFlightWorkflow that takes a non-aborted signal and runs two parallel consumption patterns against it: longStep (polling signal.aborted) and stepWaitingOnAbortListener (addEventListener path). The test creates a fresh AbortController, calls start() with its non-aborted signal, and aborts the source controller 1.5s later via setTimeout — well after both steps are mid-flight on their compute instances. Both consumers must see the cancellation: - pollResult === 'aborted' (NOT 'completed' — that would mean longStep ran the full 30s without ever seeing signal.aborted=true) - listenerResult.via === 'listener' (NOT 'timeout' — that would mean the addEventListener callback never fired) This exercises the longest end-to-end abort path in the codebase: caller-process AbortController → serialization-time listener → backing stream → step compute → deserialized signal → (poll OR addEventListener) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary of changes since my last reviewA lot of activity since — bugs uncovered while exercising the e2e path end-to-end against a live dev server, the merge of #1647, doc corrections, and a deeper test pass. Status now: branch is current with Bugs fixed
New / strengthened tests (16 → 21 abort e2e)The original
Plus a workbench Test stubsThe original review's biggest concern was 65
Current count: 0 Original review point status
Docs corrections (
|
The previous setup added a /api/delay route to workbench/nextjs-turbopack to give the test a slow endpoint to fetch against. That made the workflow fail in CI on every other workbench (nextjs-webpack, astro, sveltekit, …) since the route only existed on one of them — fetch returned 404 and the test failed within 1s instead of taking the expected ~3s. Switch to httpbin.org/delay/30, the same external-service pattern used by other e2e workflows in this file (jsonplaceholder, example.com). Removes the per-workbench dependency. Drops the now-unused deploymentUrl argument from the workflow signature and test call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oller section Two issues on the serialization foundations page: 1. `## Pass-by-Value Semantics` appeared twice. The second occurrence had no body, which rendered as an orphaned heading just above the AbortController section in the docs preview. 2. `## AbortController & AbortSignal` was at the bottom of the page, after `## Custom Class Serialization`. It belongs above the custom-class section so the standard serializable types are grouped together before the advanced topic. Removes the empty duplicate; relocates the AbortController section to sit between Request & Response and Custom Class Serialization. No content changes inside the section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…button The Run Cancellation section showed the programmatic path but didn't tie it back to the UI. Add a callout: calling run.cancel() is the same action as clicking the Cancel button on a run in the observability UI — both produce identical run_cancelled events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two distinct paths: the workflow VM ships its own AbortSignal.any impl in workflow/abort-controller.ts (composes WorkflowAbortSignals via listeners, no stream/hook backing on the composite), while steps use the native Node implementation over deserialized signals. Neither was tested. abortAnyInWorkflowWorkflow exercises the VM impl directly: creates two controllers, composes their signals via AbortSignal.any, aborts one, and asserts the composite reflects the abort synchronously without any stream round-trip. Also asserts the other source signal is unaffected so a mass-abort regression would surface here. abortAnyInStepWorkflow exercises the longest end-to-end path that uses AbortSignal.any: source controller is aborted by a sibling step, abort flows through the workflow's VM, then the backing stream, into the step's deserialized signal, into the AbortSignal.any composite, into the user's listener. Returning via:'timeout' instead of via:'listener' would mean a break anywhere on that chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VaguelySerious
left a comment
There was a problem hiding this comment.
Code LGTM, please update the changesets to be less AI slop, see suggestions
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Pranay Prakash <pranay.gp@gmail.com>
…xample
Two toolbar-comment fixes on the abort-signal-timeout-in-workflow error
page:
1. The page title was Title Case ("AbortSignal.timeout() in Workflow")
while every other page in docs/content/docs/errors/ uses the kebab-case
slug as the title (e.g. timeout-in-workflow, fetch-in-workflow,
workflow-not-registered). Match the convention.
2. The recommended replacement for AbortSignal.timeout() was a
Promise.race that wrapped the abort + null sentinel + custom Error
throw. Boil it down to the much simpler:
const controller = new AbortController();
void sleep("10s").then(() => controller.abort());
return await fetchData(controller.signal);
If fetchData finishes within 10s you get the response; if not, the
timer fires controller.abort(), fetch rejects with AbortError, and
the step's failure propagates to the workflow as a FatalError (no
retries). Same observable behavior, no Promise.race scaffolding.
Adds abortVoidSleepTimeoutWorkflow + matching e2e test that exercises
this exact pattern end-to-end so the doc example is verified runnable
(not just pseudocode). Asserts the fetch is cancelled mid-flight by
the timer, returning aborted=true,ok=false from the step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TooTallNate
left a comment
There was a problem hiding this comment.
Big PR with sound architecture (dual-channel hook+stream backing for cross-compute abort propagation, with the hook providing durable replay state and the stream providing real-time delivery to running steps). Spent significant time tracing the implementation; calling out one blocking bug, a few moderate concerns, and a follow-up suggestion.
Blocking
attachAbortListenerOnce writes plain JSON; the reader hydrates with hydrateStepArguments
serialization.ts:671–682 — the listener attached when a native external AbortController crosses the external→workflow or workflow→step boundary writes the abort packet using bare JSON.stringify:
const packet = new TextEncoder().encode(
JSON.stringify({ reason: signal.reason })
);
ops.push(writer.write(packet).then(() => writer.close()));But setupAbortStreamReader (serialization.ts:1046–1051) reads with hydrateStepArguments(result.value, runId, encryptionKey), which expects a format-prefixed (and possibly encrypted) buffer per the established envelope.
When attachAbortListenerOnce's packet hits the reader, hydrateStepArguments throws "unknown serialization format" (or fails encryption). The reader's try { … } catch { controller.abort() } (lines 1052–1054) catches that — but with no reason. The user sees signal.reason === undefined even though the external caller did controller.abort('my-reason').
The two other writer sites (runtime/suspension-handler.ts:218–223 for the workflow-side write, and the patched controller.abort() in serialization.ts:1147–1170 for step-initiated aborts) both use dehydrateStepArguments correctly. This site was missed in that refactor.
The fix is mechanical — replace the JSON.stringify with dehydrateStepArguments({aborted:true, reason: signal.reason}, runId, encryptionKey). Worth adding a regression test that aborts externally with a non-trivial reason (e.g. a string, or a DOMException) and asserts signal.reason round-trips into the step. The existing tests use no-reason aborts which mask this entirely.
Wire-format change isn't capability-gated
The new AbortController and AbortSignal entries in SerializableSpecial (serialization/types.ts:127–132) are emitted unconditionally. An older @workflow/core worker deserializing inputs/payloads that contain the new shape will throw "unknown type" from devalue. There's no entry in the getRunCapabilities table to gate this on workflowCoreVersion, unlike the existing pattern (e.g. encryption format).
For partial-deploy / cross-deployment scenarios — e.g. a resumeHook(token, payload) from a newer deployment targeting an older run, or start({deploymentId}) cross-deployment — the older side will fail to revive. AGENTS.md's pattern is: bump workflowCoreVersion, add a capability flag, gate the producer on getRunCapabilities(targetRun.executionContext.workflowCoreVersion).serializableAbortController, fall back to a snapshot-only revive (e.g. an already-aborted signal with the snapshot reason) when the target doesn't support it.
This matters less if the abort feature is gated behind an opt-in user flag for the first release, or only used internally — but for shipping in start() args / resumeHook() payloads / step return values, the gating needs to be in place.
Workflow-server isSystem filtering dependency (already noted in PR description)
The PR description acknowledges this depends on workflow-server PR #336. Confirming for completeness: without server-side isSystem: true filtering, the public webhook endpoint would expose abort hook tokens, allowing externally-initiated aborts via the hook URL. Token format is abrt_<ULID> — not trivially guessable, but not encrypted either. The isSystem flag must be honored server-side before this can ship to production.
Moderate
Workflow VM revivers produce no-op addEventListener stubs
serialization.ts:5975–6016 (workflow-internal revivers) construct plain object stubs with addEventListener: () => {}. A workflow that does signal.addEventListener('abort', fn) after receiving a deserialized signal will silently never fire that callback. The proper class (WorkflowAbortSignal in workflow/abort-controller.ts:18) already exists; the workflow-VM revivers should use it. This is a silent correctness bug for the natural pattern of "register a listener on a signal you received as an arg."
First-run vs replay listener-firing site
On first-run, abort() fires listeners synchronously at the abort() call site. On replay, the events consumer chains _setAborted(reason) onto promiseQueue and listeners fire at the next promise-queue tick — typically before the workflow code reaches the corresponding abort() line. The abortHookOrderingWorkflow test was added to cover this and is currently skipped (commit test: skip abort+hook ordering e2e tests pending full integration).
This is acceptable as a documented constraint (the same shape as hook payloads), but: (a) the test should either be enabled and pass before merge or explicitly removed with a comment explaining the deferral, (b) the user-facing docs (cancellation.mdx) should document that listener firing relative to surrounding code is non-deterministic and should not be used for control flow that depends on the surrounding state.
Abort hooks never disposed
workflow/abort-controller.ts:151–154 handles hook_disposed events but nothing in the PR creates one. WorkflowAbortController has no dispose() method, and drainPendingQueueItems doesn't mark abort hooks as disposed: true before the synthetic suspension. Result: every new AbortController() that's never aborted leaks one row in the postgres hooks table for the run's lifetime (cleaned up only by run TTL).
Realistically bounded by user behavior — most workflows construct one or two controllers, not thousands — but worth either: (a) implicit dispose on workflow completion in drainPendingQueueItems, (b) a dispose method exposed on WorkflowAbortController, or (c) documenting the lifetime.
AbortSignal.any() listener cleanup + iteration
workflow/abort-controller.ts:200–229:
- No
removeEventListeneron input signals after the composite aborts. If any input signal outlives the composite (e.g. an external long-lived controller), the listener closure (capturingcomposite) prevents GC. Bounded by run lifetime so not catastrophic, but real. signalsis iterated twice. A single-shot iterable (e.g. a generator) produces zero entries on the second pass. NativeAbortSignal.anymaterializes the iterable into an array first; this implementation should too:const arr = Array.from(signals).
Unit test doesn't import the function under test
abort-controller-step.test.ts:85–161 reimplements reviveAbortController locally rather than importing it. The comment admits "Uses mock data directly to avoid dynamic imports that can cause hangs in vitest's module mock system." This means a regression in the production reviveAbortController won't surface in unit tests — only in the e2e tests in 99_e2e.ts. Worth sorting out the vitest mock issue rather than maintaining a parallel implementation in the test file.
Cold-start race window
setupAbortStreamReader opens a stream reader during step argument hydration, but reader.read() is async (network round-trip). User code in the step body can run before the first read resolves. If the workflow aborted concurrently with step pickup (after step.input was serialized but before the step's reader started), the user's first synchronous if (signal.aborted) check returns false even though the abort packet was already written.
Not a regression from any prior state — the snapshot-in-payload mechanism handles aborts that happened before enqueue, so this only affects the narrow window between enqueue and step pickup. But worth either: (a) a synchronous getChunks(limit:1) snapshot on the abort stream during reviver bootstrap to surface any already-written packet before user code runs, or (b) documenting the window so users know to await something async (a signal.throwIfAborted() check after the first await) rather than rely on synchronous polling.
Stale PR description
The PR description says "Implementation TBD — tests provide the specification to build against" and "all .todo stubs." Both are wrong by a wide margin — the PR has 78 commits, real implementation across 61 files, and ~120 it() blocks with full assertions. Reviewers reading the description may approve a 7000-line implementation thinking it's a spec-only PR. Worth updating before merge.
Follow-up suggestion: native AbortSignal.timeout() support
The timeout() static currently throws and points users to a sleep-based workaround. There's no architectural reason the workflow VM can't support it natively — the workaround is mechanical:
timeout(ms: number): WorkflowAbortSignal {
const controller = new WorkflowAbortController();
ctx.promiseQueue = ctx.promiseQueue.then(async () => {
await ctx.sleep(ms);
controller.abort(new DOMException('signal timed out', 'TimeoutError'));
});
return controller.signal;
}Replay determinism comes for free since sleep() records a wait event and controller.abort() already goes through the hook machinery. Strictly improves UX vs. the workaround:
signal.reason.name === 'TimeoutError'(matches native), instead of'AbortError'from the manual workaround.- Less boilerplate.
- No
voidfoot-gun.
Would need createAbortSignalStatics to take more than _vmGlobalThis — the orchestrator ctx (the same one passed to createCreateAbortController) gives access to sleep and promiseQueue. Modest refactor.
Not blocking — the workaround is documented and works — but worth considering as a follow-up PR. Better DX and removes a docs page that says "this thing doesn't work, here's how to fake it" which is awkward to ship as-is.
Nits
createAbortSignalStatics(_vmGlobalThis: Record<string, any>)— parameter is unused; if you do the timeout follow-up the signature changes anyway, but worth dropping the unused param now.- DOMException
codefield isn't preserved by the reducer (serialization/reducers/common.ts:6092–6102). Recovered via constructor lookup for spec-listed names (AbortError,TimeoutError, etc.) so the common case works; lost for custom names. Probably fine but worth a sentence in the changeset. - The
aborted: truefield in the dehydrated payload (suspension-handler.ts:218–223etc.) is decorative — the consumer only readspayload.reason. Could just dehydrate the reason directly. Request.signalre-tagging viacopyAbortInternals(serialization.ts:5947–5952) depends on platform-specific Request internals. Worth a cross-runtime test (undici, Node 22, browser).- Reducer side-effect attaches a listener to user-provided signal objects via
reduceAbortWithListener. TheABORT_LISTENER_ATTACHEDsymbol guards against duplicates, but it's still surprising that serializing an object adds a listener to it. Worth documenting somewhere.
What looks good
- The dual hook+stream architecture is the right factoring: hook for durable replay state, stream for real-time delivery. Symmetric writer paths (workflow-side, step-side, external-side) all dehydrate via the same machinery (modulo the bug above).
getAbortStreamId/getAbortStreamIdFromTokenround-tripping through theabrt_prefix is clean.cancelAbortReaderswalking args/this/closure (and explicitly descending intoRequest.signal) for step cleanup is thorough.ABORT_LISTENER_ATTACHEDdedup symbol prevents fanout-listener accumulation when a single signal is passed to N steps.- The
isSystem: trueplumbing is consistent across world-postgres, world-local, and the schema migration — all that's missing is the workflow-server side. AbortSignal.any()correctly handles already-aborted inputs in iteration order (matches native).- The
AbortSignal.timeout()workaround is documented and works — even if it'd be nicer to support natively.
Summary
Five blocking issues to address before merge: (1) the JSON.stringify bug in attachAbortListenerOnce, (2) capability gating on workflowCoreVersion, (3) the dependency on workflow-server #336, (4) workflow-VM revivers using no-op stubs instead of WorkflowAbortSignal, (5) the stale PR description.
Several moderate concerns worth at least documenting: replay listener-firing divergence, hook leak on undisposed controllers, AbortSignal.any cleanup/iteration, cold-start race window.
Architecture is sound; the implementation needs a few corrections to match the design intent.
Five fixes from PR review:
1. attachAbortListenerOnce now dehydrates via dehydrateStepArguments instead
of bare JSON.stringify so the reader (hydrateStepArguments) can decode
the packet. Previously, an external AbortController aborted with a
non-trivial reason (DOMException, Error, etc.) had its reason silently
dropped — the reader's catch fell through to controller.abort() with no
reason. Adds a regression test that aborts with a DOMException reason
and asserts full type fidelity round-trips.
2. Workflow-VM revivers now use the real WorkflowAbortSignal class instead
of plain-object stubs with addEventListener: () => {}. The previous
stubs silently dropped listener registrations — signal.addEventListener
('abort', fn) on a deserialized signal would never fire. Adds two
regression tests covering pre-aborted and post-aborted paths.
3. drainPendingQueueItems now marks any never-aborted system (abort) hook
as disposed: true so unused AbortController instances don't leak rows
in the hooks table for the run's TTL. User hooks are intentionally left
alone (their lifetime is user-managed).
4. AbortSignal.any() materializes the iterable with Array.from(signals)
before iterating twice (otherwise single-shot iterables like generators
produce zero entries on the second pass) and removes its abort listeners
from input signals after the composite aborts so closures don't pin
long-lived inputs from GC. Drops the unused _vmGlobalThis param while
we're touching the signature.
5. Unskips abortHookOrderingWorkflow e2e tests (4 variants). Bumps the
workflow's trailing sleep to 10s so the test harness has time to resume
the user hook before the workflow returns and `using` disposes it.
Updates PR description and the drain-pending-queue changeset.
|
@TooTallNate — addressed your review feedback in 409d9dc. Five fixes:
Skipped per discussion with @pranaygp: capability gating on Already in place: Not done (deferred to follow-up): native PR description has been updated to reflect what's actually shipping. The drain-pending-queue changeset now mentions the disposal behavior. Verification: 913/913 unit tests pass; 26/26 abort e2e tests pass (the 4 newly-unskipped + 22 existing). |
…ignal * origin/main: [core] Skip inline step execution when suspension also has a wait (#1924) [errors] Replace chalk import in @workfow/errors with inline ANSI shim (#1915) Fix compatibility with Zod 4.4.x (#1902) Serialize `run_failed`/`step_failed` errors through serialization pipeline (#1851) tarballs: redesign preview tarballs index page (#1911) Remove extra changeset (#1922) Add stable Next.js eager and lazy test coverage (#1747) Enforce per-(run, correlation) uniqueness for entity-creating events in world-postgres (#1878) fix(world-vercel): add default request timeout to workflow-server HTTP calls (#1807) Allow disabling step sourcemap with new `sourcemap` option in builders (#1842) [ci] Enable Vercel-prod e2e for tanstack-start (#1904) web: configure vercelPreset() for Vercel deployments (#1815) [core] Combine flow+step bundle and process steps eagerly (#1338) [world-vercel] Revert stream close control framing (#1891) [tarballs] Use turbo to build workspace deps before packing (#1908) # Conflicts: # packages/core/src/runtime/step-handler.test.ts # packages/core/src/runtime/step-handler.ts # packages/core/src/runtime/suspension-handler.ts # packages/core/src/step.test.ts # packages/world-local/src/storage/events-storage.ts # packages/world-postgres/src/drizzle/migrations/meta/_journal.json
V2 (#1338) introduced a lazy world resolution path used by WorkflowServerWritableStream. The abort-listener tests exercise this path; add a parallel getWorldLazy mock so writeMock is observed in both legacy and V2 code paths. Also threads the lazy mock through the per-test mockReturnValue overrides for the three tests that already overrode getWorld.
The events consumer was reading event.eventData.payload directly to
extract the abort reason. In production the payload is dehydrated
(written by the suspension handler via dehydrateStepArguments —
either as a Uint8Array or, with encryption, an encrypted Uint8Array
prefixed with `encr`), so the `'reason' in payload` check was always
false on replay. The signal still aborted, but with reason=undefined
— the user-supplied reason ('custom timeout reason', a DOMException,
etc.) was lost across the replay boundary.
Local single-invocation runs masked this because the workflow body's
synchronous _setAborted(reason) call already set the right state on
the first execution; the bug only surfaced when the workflow actually
re-replayed (multi-invocation Vercel runs), at which point the events
consumer overrode the synchronous reason with undefined before
checkSignalState serialized the signal to the next step.
Mirrors the pattern used for regular hook payloads in
workflow/hook.ts:117 — hydrate via hydrateStepReturnValue inside the
promiseQueue chain so the reason round-trips with full type fidelity
and respects the run's encryption key.
Caught by `abortReasonWorkflow: abort reason preserved across
boundaries` failing on Vercel-prod tanstack-start. Now passes on the
local nextjs-turbopack abort suite (26/26).
Summary
Makes
AbortControllerandAbortSignalfirst-class across workflow and step boundaries: serializable asstart()arguments, step inputs/outputs, and insideRequestobjects; observable from real-time and replay paths.AbortController/AbortSignalsurvive every serialization boundary (external → workflow → step → external) with full type fidelity forreason(DOMException, Error subclasses, custom classes).controller.abort()from the workflow body, from a step, or externally all converge on the same observable state — running steps see the abort in real-time; the workflow sees it deterministically on replay.AbortSignal.any()and pre-abortedAbortSignal.abort()are supported inside the workflow VM.AbortSignal.timeout()is intentionally not supported (replay-non-deterministic) and points users to thesleep()+AbortControllerpattern.Architecture
AbortControllerin a workflow is backed by two primitives:hook_receivedevent. On replay, the events consumer chains_setAborted(reason)ontopromiseQueueso listener firing is tied to the orchestrator's deterministic execution, not microtask scheduling._system_abortstream so a step running on a different process can observe the abort without waiting for the next replay tick.When
abort()is called from any context, both channels are written. Whichever delivers first drives the visible state; the other path provides redundancy. All three writer sites (workflow-side, step-side patchedcontroller.abort(), external-listener-side) dehydrate via the samedehydrateStepArgumentsmachinery so the hydration step (hydrateStepArguments) round-trips reasons with full fidelity and respects encryption.System hooks created for abort controllers are tagged
isSystem: true(consistent acrossworld-local,world-postgres, andworld-vercel). The workflow-server side (vercel/workflow-server#336) honors this when filtering public webhook listings — a hook that backs an abort controller doesn't get exposed via the public webhook endpoint.What's in this PR
AbortControllerandAbortSignalin all three contexts (external, workflow VM, step). Workflow-VM revivers use the realWorkflowAbortSignalclass sosignal.addEventListener('abort', fn)actually fires (the natural pattern for code that receives a signal from another step).createCreateAbortController(ctx)— workflow-VMAbortControllerthat registers a system hook on construction, subscribes to the events consumer forhook_created/hook_received/hook_disposed, and propagatesabort()through the suspension handler.WorkflowServerWritableStream-based abort packets for real-time delivery to running steps, plussetupAbortStreamReaderon the step side that wires the deserialized signal to the stream so an in-flight step'ssignal.abortedflips even if the replay path hasn't yet reached the workflow.cancelAbortReaderswalks step args (including the signal nested insideRequest) after the step returns to cancel any dangling readers — prevents serverless functions from being kept alive by abort streams.drainPendingQueueItemsmarks any never-aborted system hook asdisposed: truebefore the synthesized suspension, so unused controllers don't leak rows in the hooks table for the run's TTL.isSystemplumbing end-to-end:worldtypes,world-localandworld-postgresschemas/storage, and the suspension handler emit it onhook_created.Docs
AbortSignal.timeout()error page — explains the VM restriction and thesleep()+AbortControllerworkaroundTests
addEventListeneron revived signals,AbortSignal.any()ordering + iteration, hook integration, replay determinism. Includes regression tests for the listener-sideJSON.stringifybug and the no-opaddEventListenerstub bug surfaced in review.fetch, hook-driven aborts, and a 4-variant matrix verifying that listener firing relative tohook.then()resolution is deterministic across replay (abortHookOrderingWorkflow).Dependencies
Pairs with vercel/workflow-server#336 (already merged) for
isSystemfiltering on the public webhook endpoint, so abort hook tokens aren't exposed externally.