fix(world-vercel): add default request timeout to workflow-server HTTP calls by karthikscale3 · Pull Request #1807 · vercel/workflow

karthikscale3 · 2026-04-17T21:52:49Z

Summary

Add a default per-request timeout to makeRequest() in world-vercel so hung responses from workflow-server can't burn compute up to the function's maxDuration.

Background (original problem)

A production workflow (wrun_01KPDFGK4QBFZ7XERXN9NP7VY2) on a preview deployment showed:

run_started POST to workflow-server took 47s (server under load)
Replay timeout fired at 240s
The run_failed write was sent but the response never came back (External APIs showed ∅ "Timed out while waiting for a response")
The function continued running for 15 minutes (hit maxDuration) before SIGTERM — ~11 minutes of compute burned doing nothing

Original fix (reverted)

The first version of this PR added a 30s hard-exit deadline to the replay timeout handler in packages/core/src/runtime.ts. Per Nate's review, this was the wrong layer: it only protected one of 27 world.events.create() call sites in core, leaving the other 26 (and every other world.* method going through makeRequest()) exposed to the exact same hang.

New fix

Moved the mitigation down into the world-vercel transport layer, where all world.* calls funnel through makeRequest():

packages/world-vercel/src/utils.ts — attaches AbortSignal.timeout(60_000) to every makeRequest() fetch. A TimeoutError or AbortError from fetch is converted into a WorkflowWorldError (with the original error preserved as cause and elapsed ms in the message), so existing catch sites handle it uniformly. The span is tagged with ErrorType('TIMEOUT').
Reverted the runtime-level exit deadline and the REPLAY_TIMEOUT_EXIT_DEADLINE_MS constant.

Why 60s: comfortably above the observed 47s p99 in the incident, well under the 240s replay timeout so upstream retries still have room, and much shorter than the maxDuration SIGTERM horizon.
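As a rough illustration of the transport-level approach described above, the sketch below attaches a timeout signal to a request and wraps timeout failures in a typed error. It is not the code from packages/world-vercel/src/utils.ts: the `WorkflowWorldError` class here is a simplified stand-in, and `requestWithTimeout`, `FetchLike`, and `DEFAULT_TIMEOUT_MS` are hypothetical names chosen for the example.

```typescript
// Hypothetical stand-in for the real WorkflowWorldError; only the shape matters here.
class WorkflowWorldError extends Error {
  readonly url?: string;

  constructor(message: string, options: { url?: string; cause?: unknown } = {}) {
    super(message, { cause: options.cause }); // keep the original error as `cause`
    this.name = "WorkflowWorldError";
    this.url = options.url;
  }
}

// Minimal fetch-like signature so the sketch can be exercised with a fake.
type FetchLike = (
  url: string,
  init: { method: string; signal: AbortSignal },
) => Promise<{ status: number }>;

const DEFAULT_TIMEOUT_MS = 60_000;

async function requestWithTimeout(
  fetchImpl: FetchLike,
  method: string,
  url: string,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<{ status: number }> {
  const startedAt = Date.now();
  try {
    // The signal aborts the request if no response arrives within timeoutMs.
    return await fetchImpl(url, { method, signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    const name = (err as { name?: unknown } | null | undefined)?.name;
    if (name === "TimeoutError" || name === "AbortError") {
      const elapsed = Date.now() - startedAt;
      throw new WorkflowWorldError(`${method} ${url} timed out after ${elapsed}ms`, {
        url,
        cause: err,
      });
    }
    throw err; // non-timeout failures propagate unchanged
  }
}
```

Under these assumptions, a caller would pass the global `fetch` as `fetchImpl` in production and a fake that throws a synthetic `TimeoutError` in tests, which mirrors the two test cases listed in the test plan.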

Impact

Covers all world.* calls through world-vercel (events, runs, steps, queue, hooks, etc.), not just the replay timeout path.
Hangs now surface as typed WorkflowWorldErrors — existing catch sites get predictable retry/failure semantics instead of an infinite await.
Happy path is unchanged: AbortSignal.timeout() doesn't fire on normal requests and the unref'd timer doesn't hold the event loop open.

Test plan

pnpm typecheck on @workflow/core + @workflow/world-vercel
Added packages/world-vercel/src/utils.test.ts with two cases:
- TimeoutError from fetch is wrapped into WorkflowWorldError with elapsed ms and preserved cause
- Non-timeout errors (e.g. TypeError) propagate unchanged
Full world-vercel suite passes (81 tests)
Full core suite passes (591 tests)
Preview deployment: verify normal requests still succeed under the 60s budget
Preview deployment with simulated server hang: verify the call fails fast as WorkflowWorldError instead of running to maxDuration

VaguelySerious

AI review: no blocking issues

- Compose per-request timeout with caller-provided AbortSignal via AbortSignal.any() so a future caller passing options.signal doesn't silently lose the hang protection (and vice versa).
- Move the floating eslint-disable-next-line for the undici dispatcher cast back next to the actual `fetch(... as any)` call where the suppression applies, instead of pointing at `const fetchStart`.

Both nits flagged by @VaguelySerious in the AI review on PR #1807.

Made-with: Cursor

TooTallNate

Approving — withdrawing my prior CHANGES_REQUESTED. The author took the suggestion from my earlier review: the process.exit workaround in runtime.ts is gone, replaced with a transport-level timeout in world-vercel/utils.ts:355 (AbortSignal.timeout(60_000) plumbed into the fetch call). This is the right layer — fixes the underlying issue at the single point where it lives, covers all 27+ world.events.create() call sites uniformly, produces a typed WorkflowWorldError that existing catch sites recognize.

Implementation looks solid:

Timeout primitive: AbortSignal.timeout(60_000) → DOMException with name: 'TimeoutError' → caught and wrapped as WorkflowWorldError with cause preserved. Standard pattern, clean.
Composition with options.signal: AbortSignal.any([options.signal, timeoutSignal]) for when callers eventually pass their own signals. Currently dead code per the comment, but wired correctly for future use.
Error mapping: error message includes ${method} ${endpoint} and ${elapsed}ms — useful for debugging, attaches url via the WorkflowWorldError constructor.
Span attributes: ErrorType('TIMEOUT') for OTEL, plus recordException. Consistent with sibling status-code branches.
Tests: utils.test.ts covers both the wrap-on-timeout path and the pass-through-on-non-timeout path. Mocks fetch with synthetic errors. All 94 world-vercel tests still pass.
Changeset scope: @workflow/world-vercel only, which is correct — no behavior change in core.

A few non-blocking concerns worth raising. None are gating; mostly forward-looking.

1. `start()` retry classification doesn't match timeouts as retryable

isRetryableStartError in start.ts:331 only matches WorkflowWorldError with status >= 500. The new timeout error has no status set, so it falls into the throw err branch at line 283.

Concrete consequence: when events.create(run_created) times out but the parallel queue dispatch already succeeded, the user sees start() throw "POST /runs/... timed out after 60000ms" while the workflow actually does run via the queue path. That's misleading — the right behavior is to mark this as "resilient start" and continue (runtime.ts will retry the run_created event later).

Suggested adjustment:

function isRetryableStartError(err: unknown): boolean {
  if (ThrottleError.is(err)) return true;
  if (WorkflowWorldError.is(err)) {
    // 5xx server errors and timeouts (no status) are both transient
    if (err.status === undefined) return true;
    if (err.status >= 500) return true;
  }
  return false;
}

This is technically a behavior change in core, so it'd need a separate @workflow/core patch in the changeset. Could be a follow-up if you want to keep this PR scoped to world-vercel.
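To make the effect of the suggested predicate concrete, here is a small self-contained sketch that runs it against three representative errors. The `ThrottleError` and `WorkflowWorldError` classes below are simplified, hypothetical stand-ins for the real ones, and the sample messages are illustrative only.

```typescript
// Hypothetical minimal error classes; the real ones live in the workflow packages.
class ThrottleError extends Error {
  static is(err: unknown): err is ThrottleError {
    return err instanceof ThrottleError;
  }
}

class WorkflowWorldError extends Error {
  readonly status?: number;

  constructor(message: string, status?: number) {
    super(message);
    this.status = status;
  }

  static is(err: unknown): err is WorkflowWorldError {
    return err instanceof WorkflowWorldError;
  }
}

// Same logic as the suggested adjustment above.
function isRetryableStartError(err: unknown): boolean {
  if (ThrottleError.is(err)) return true;
  if (WorkflowWorldError.is(err)) {
    if (err.status === undefined) return true; // e.g. a transport timeout
    if (err.status >= 500) return true; // server-side failure
  }
  return false;
}

const timeoutError = new WorkflowWorldError("POST /runs/... timed out after 60000ms");
const serverError = new WorkflowWorldError("bad gateway", 502);
const clientError = new WorkflowWorldError("bad request", 400);

const classification = [timeoutError, serverError, clientError].map(isRetryableStartError);
// classification is [true, true, false]
```

One consequence worth noting: with this predicate, any WorkflowWorldError that lacks a status is treated as retryable, not only timeouts, so it assumes that status-less errors are always transient.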

2. Node 18 + `AbortSignal.any`

The repo's root engines.node is ^18.0.0 || ^20.0.0 || ^22.0.0 || ^24.0.0. AbortSignal.any() was added in Node 20.3.0 and backported to 18.17.0, so it is not available on Node 18.0 through 18.16, which the ^18.0.0 range still allows. The branch at utils.ts:359 only fires when options.signal is set, which the comment notes is currently never. So today this is a latent issue, not an active one. But anyone wiring up a caller-provided signal in the future will get a TypeError at runtime on those older Node 18 releases.

Either drop Node 18 from engines (probably the right move overall, since Node 18 reached end-of-life in April 2025) or guard with a feature check. Could go in a separate PR.
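The feature-check option could look roughly like the sketch below: use the native `AbortSignal.any()` when the runtime has it, and otherwise forward the first abort to a fresh controller. `combineSignals` and `combineSignalsFallback` are hypothetical helper names, not functions from the PR, and the fallback is deliberately simple (it does not remove listeners from signals that never abort).

```typescript
// Fallback for runtimes without AbortSignal.any(): the combined signal aborts
// as soon as any input signal aborts, carrying that signal's reason.
function combineSignalsFallback(signals: AbortSignal[]): AbortSignal {
  const controller = new AbortController();
  for (const signal of signals) {
    if (signal.aborted) {
      controller.abort(signal.reason);
      break;
    }
    // abort() is a no-op once the controller has already aborted.
    signal.addEventListener("abort", () => controller.abort(signal.reason), { once: true });
  }
  return controller.signal;
}

function combineSignals(signals: AbortSignal[]): AbortSignal {
  // Look the static method up dynamically so the code also loads where it is missing.
  const nativeAny = (AbortSignal as unknown as { any?: (s: AbortSignal[]) => AbortSignal }).any;
  if (typeof nativeAny === "function") {
    return nativeAny.call(AbortSignal, signals);
  }
  return combineSignalsFallback(signals);
}
```

A caller would then write something like `combineSignals([callerSignal, AbortSignal.timeout(60_000)])` in place of a direct `AbortSignal.any()` call, where `callerSignal` stands for whatever signal the caller supplied.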

3. Hardcoded 60s timeout

The chosen value only barely clears the slowest legitimate case in your incident report (47s). A slow but legitimate request at, say, 55s would still succeed, but only just; a genuinely hung request takes the full 60s to detect.

That's reasonable as a default, but I'd consider exposing it as a config knob (like the existing VERCEL_WORKFLOW_SERVER_URL env var pattern) for users with different latency profiles. Probably fine to wait for someone to ask.
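If such a knob were added, one possible shape is an environment-variable override with a validated fallback to the 60s default. This is only a sketch: `WORKFLOW_REQUEST_TIMEOUT_MS` is a hypothetical variable name that does not exist in the repo, and `resolveRequestTimeoutMs` is an illustrative helper.

```typescript
const DEFAULT_REQUEST_TIMEOUT_MS = 60_000;

// WORKFLOW_REQUEST_TIMEOUT_MS is a hypothetical name used only for this sketch.
function resolveRequestTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["WORKFLOW_REQUEST_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_REQUEST_TIMEOUT_MS;

  const parsed = Number(raw);
  // Fall back on NaN, Infinity, zero, negative, or fractional values
  // instead of handing them to AbortSignal.timeout().
  if (!Number.isSafeInteger(parsed) || parsed <= 0) return DEFAULT_REQUEST_TIMEOUT_MS;
  return parsed;
}
```

In a Node runtime this would be called as `resolveRequestTimeoutMs(process.env)`; taking the environment as a parameter keeps the helper easy to test.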

Wrap-up

Good fix. The transport-level approach is correct and the concerns above are forward-looking polish, not gating.

TooTallNate · 2026-05-01T21:49:31Z

Aside on my own concerns from the approval — I should retract one thing. I was briefly worried the 60s timeout might affect the long-lived stream GET endpoint, which can legitimately stay open for the full function duration. Confirmed it doesn't: looking at streamer.ts, the long-lived read at line 283 (getReadable/get) goes through a direct fetch() call, not makeRequest. Same for the write/writeMulti/close paths — direct fetch.

Only the discrete request/response calls go through makeRequest:

events.create (and similar) — milliseconds normally
runs.get / runs.list — milliseconds normally
getChunks (paginated stream chunks) — bounded per-page
getInfo (stream metadata) — trivial

60s is appropriately generous for all of those. So the design is correct as-is — no concern about the streaming endpoint.

karthikscale3 added 30 commits January 27, 2026 10:51

Merge branch 'main' of github.com:vercel/workflow

ad8b7f1

Merge branch 'main' of github.com:vercel/workflow

e0cb61b

Merge branch 'main' of github.com:vercel/workflow

051cadc

Merge branch 'main' of github.com:vercel/workflow

af8ac1f

Merge branch 'main' of github.com:vercel/workflow

e68e7d2

Merge branch 'main' of github.com:vercel/workflow

8305e5b

Merge branch 'main' of github.com:vercel/workflow

f3da688

Merge branch 'main' of github.com:vercel/workflow

f8ee413

Merge branch 'main' of github.com:vercel/workflow

589cbd7

Merge branch 'main' of github.com:vercel/workflow

e979fdf

Merge branch 'main' of github.com:vercel/workflow

d4baed2

Merge branch 'main' of github.com:vercel/workflow

53503a4

Merge branch 'main' of github.com:vercel/workflow

d51e2e4

Merge branch 'main' of github.com:vercel/workflow

5123088

Merge branch 'main' of github.com:vercel/workflow

dd1a307

Merge branch 'main' of github.com:vercel/workflow

00bcfcd

Merge branch 'main' of github.com:vercel/workflow

f6a157b

Merge branch 'main' of github.com:vercel/workflow

816f35b

Merge branch 'main' of github.com:vercel/workflow

2b87dce

Merge branch 'main' of github.com:vercel/workflow

146de28

Merge branch 'main' of github.com:vercel/workflow

0ef6455

Merge branch 'main' of github.com:vercel/workflow

2d8c690

Merge branch 'main' of github.com:vercel/workflow

3513471

Merge branch 'main' of github.com:vercel/workflow

0f5d29e

Merge branch 'main' of github.com:vercel/workflow

c76001e

Merge branch 'main' of github.com:vercel/workflow

f97184f

Merge branch 'main' of github.com:vercel/workflow

f71c531

Merge branch 'main' of github.com:vercel/workflow

530e598

Merge branch 'main' of github.com:vercel/workflow

ea38511

Merge branch 'main' of github.com:vercel/workflow

19b1a3f

vercel Bot deployed to Preview – example-nextjs-workflow-webpack April 21, 2026 18:52 View deployment

vercel Bot deployed to Preview – workflow-swc-playground April 21, 2026 18:53 View deployment

VaguelySerious reviewed Apr 23, 2026

Comment thread packages/world-vercel/src/utils.ts Outdated

Comment thread packages/world-vercel/src/utils.ts Outdated

Comment thread packages/world-vercel/src/utils.ts

karthikscale3 added 2 commits April 28, 2026 13:17

Merge remote-tracking branch 'origin/main' into karthik/fix-runtime-t…

edd205b

…imeout Made-with: Cursor # Conflicts: # packages/world-vercel/src/utils.test.ts # packages/world-vercel/src/utils.ts

vercel Bot deployed to Preview – workflow-web April 28, 2026 20:21 View deployment

vercel Bot deployed to Preview – workbench-nitro-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-astro-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-sveltekit-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-nuxt-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-hono-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-fastify-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-express-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workflow-docs April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – example-workflow April 28, 2026 20:22 View deployment

vercel Bot deployed to Preview – workbench-vite-workflow April 28, 2026 20:22 View deployment

Merge branch 'main' into karthik/fix-runtime-timeout

07a3960

VaguelySerious approved these changes Apr 30, 2026

Comment thread .changeset/fix-world-vercel-request-timeout.md Outdated

Comment thread packages/world-vercel/src/utils.ts Outdated

karthikscale3 and others added 2 commits May 1, 2026 14:07

Apply suggestions from code review

b76d8ca

Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>

Apply suggestions from code review

d62abf7

Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>

TooTallNate approved these changes May 1, 2026

karthikscale3 and others added 2 commits May 1, 2026 14:54

Merge branch 'main' into karthik/fix-runtime-timeout

e443b86

Merge branch 'main' into karthik/fix-runtime-timeout

410ed18

Co-authored-by: Cursor <cursoragent@cursor.com>

karthikscale3 merged 99 commits into main from karthik/fix-runtime-timeout
