fix(world-vercel): add default request timeout to workflow-server HTTP calls #1807

Merged
karthikscale3 merged 99 commits into main from karthik/fix-runtime-timeout
May 4, 2026

@karthikscale3
Contributor

@karthikscale3 karthikscale3 commented Apr 17, 2026

Summary

Add a default per-request timeout to makeRequest() in world-vercel so hung responses from workflow-server can't burn compute up to the function's maxDuration.

Background (original problem)

A production workflow (wrun_01KPDFGK4QBFZ7XERXN9NP7VY2) on a preview deployment showed:

  • run_started POST to workflow-server took 47s (server under load)
  • Replay timeout fired at 240s
  • The run_failed write was sent but the response never came back (External APIs showed ∅ "Timed out while waiting for a response")
  • The function continued running for 15 minutes (hit maxDuration) before SIGTERM — ~11 minutes of compute burned doing nothing

Original fix (reverted)

The first version of this PR added a 30s hard-exit deadline to the replay timeout handler in packages/core/src/runtime.ts. Per Nate's review, this was the wrong layer: it only protected one of 27 world.events.create() call sites in core, leaving the other 26 (and every other world.* method going through makeRequest()) exposed to the exact same hang.

New fix

Moved the mitigation down into the world-vercel transport layer, where all world.* calls funnel through makeRequest():

  • packages/world-vercel/src/utils.ts — attaches AbortSignal.timeout(60_000) to every makeRequest() fetch. A TimeoutError or AbortError from fetch is converted into a WorkflowWorldError (with the original error preserved as cause and elapsed ms in the message), so existing catch sites handle it uniformly. The span is tagged with ErrorType('TIMEOUT').
  • Reverted the runtime-level exit deadline and the REPLAY_TIMEOUT_EXIT_DEADLINE_MS constant.
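The wrapping behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual makeRequest() code: SketchWorldError is a hypothetical stand-in for WorkflowWorldError, and requestWithTimeout, FetchLike, and isTimeoutOrAbort are names invented for this example. The fetch implementation is injected so the timeout path can be exercised with a synthetic error.

```typescript
// Hypothetical stand-in for WorkflowWorldError; the real class lives in the repo.
class SketchWorldError extends Error {
  readonly url: string;
  readonly cause: unknown;

  constructor(message: string, options: { url: string; cause?: unknown }) {
    super(message);
    this.name = "SketchWorldError";
    this.url = options.url;
    this.cause = options.cause; // keep the original fetch error for debugging
  }
}

// Minimal fetch-shaped function so callers can inject a real or fake fetch.
type FetchLike = (
  url: string,
  init: { method: string; signal: AbortSignal },
) => Promise<{ status: number }>;

// AbortSignal.timeout() rejects fetch with a "TimeoutError";
// a manually aborted signal usually surfaces as "AbortError".
function isTimeoutOrAbort(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  return name === "TimeoutError" || name === "AbortError";
}

async function requestWithTimeout(
  fetchImpl: FetchLike,
  method: string,
  url: string,
  timeoutMs: number = 60_000,
): Promise<{ status: number }> {
  const startedAt = Date.now();
  try {
    // Every request gets its own timeout signal.
    return await fetchImpl(url, { method, signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    if (isTimeoutOrAbort(err)) {
      const elapsed = Date.now() - startedAt;
      throw new SketchWorldError(`${method} ${url} timed out after ${elapsed}ms`, {
        url,
        cause: err,
      });
    }
    // Anything else (for example a TypeError from a network failure) passes through unchanged.
    throw err;
  }
}
```

With the global fetch this would be called as requestWithTimeout((u, init) => fetch(u, init), "POST", someUrl), where someUrl is whatever endpoint the caller targets.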

Why 60s: comfortably above the observed 47s p99 in the incident, well under the 240s replay timeout so upstream retries still have room, and much shorter than the maxDuration SIGTERM horizon.

Impact

  • Covers all world.* calls through world-vercel (events, runs, steps, queue, hooks, etc.), not just the replay timeout path.
  • Hangs now surface as typed WorkflowWorldErrors — existing catch sites get predictable retry/failure semantics instead of an infinite await.
  • Happy path is unchanged: AbortSignal.timeout() doesn't fire on normal requests and the unref'd timer doesn't hold the event loop open.

Test plan

  • pnpm typecheck on @workflow/core + @workflow/world-vercel
  • Added packages/world-vercel/src/utils.test.ts with two cases:
    • TimeoutError from fetch is wrapped into WorkflowWorldError with elapsed ms and preserved cause
    • Non-timeout errors (e.g. TypeError) propagate unchanged
  • Full world-vercel suite passes (81 tests)
  • Full core suite passes (591 tests)
  • Preview deployment: verify normal requests still succeed under the 60s budget
  • Preview deployment with simulated server hang: verify the call fails fast as WorkflowWorldError instead of running to maxDuration

Member

@VaguelySerious VaguelySerious left a comment

AI review: no blocking issues

Comment thread packages/world-vercel/src/utils.ts Outdated
Comment thread packages/world-vercel/src/utils.ts Outdated
Comment thread packages/world-vercel/src/utils.ts
…imeout

Made-with: Cursor

# Conflicts:
#	packages/world-vercel/src/utils.test.ts
#	packages/world-vercel/src/utils.ts
- Compose per-request timeout with caller-provided AbortSignal via
  AbortSignal.any() so a future caller passing options.signal doesn't
  silently lose the hang protection (and vice versa).
- Move the floating eslint-disable-next-line for the undici dispatcher
  cast back next to the actual `fetch(... as any)` call where the
  suppression applies, instead of pointing at `const fetchStart`.

Both nits flagged by @VaguelySerious in the AI review on PR #1807.

Made-with: Cursor
Comment thread .changeset/fix-world-vercel-request-timeout.md Outdated
Comment thread packages/world-vercel/src/utils.ts Outdated
karthikscale3 and others added 2 commits May 1, 2026 14:07
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>
Member

@TooTallNate TooTallNate left a comment

Approving and withdrawing my prior CHANGES_REQUESTED. The author took the suggestion from my earlier review: the process.exit workaround in runtime.ts is gone, replaced with a transport-level timeout in world-vercel/utils.ts:355 (AbortSignal.timeout(60_000) plumbed into the fetch call). This is the right layer: it fixes the underlying issue at the single point where it lives, covers all 27+ world.events.create() call sites uniformly, and produces a typed WorkflowWorldError that existing catch sites recognize.

Implementation looks solid:

  • Timeout primitive: AbortSignal.timeout(60_000) → DOMException with name: 'TimeoutError' → caught and wrapped as WorkflowWorldError with cause preserved. Standard pattern, clean.
  • Composition with options.signal: AbortSignal.any([options.signal, timeoutSignal]) for when callers eventually pass their own signals. Currently dead code per the comment, but wired correctly for future use.
  • Error mapping: the error message includes ${method} ${endpoint} and ${elapsed}ms, which is useful for debugging, and the url is attached via the WorkflowWorldError constructor.
  • Span attributes: ErrorType('TIMEOUT') for OTEL, plus recordException. Consistent with sibling status-code branches.
  • Tests: utils.test.ts covers both the wrap-on-timeout path and the pass-through-on-non-timeout path. Mocks fetch with synthetic errors. All 94 world-vercel tests still pass.
  • Changeset scope: @workflow/world-vercel only, which is correct — no behavior change in core.

A few non-blocking concerns worth raising. None are gating; mostly forward-looking.

1. start() retry classification doesn't match timeouts as retryable

isRetryableStartError in start.ts:331 only matches WorkflowWorldError with status >= 500. The new timeout error has no status set, so it falls into the throw err branch at line 283.

Concrete consequence: when events.create(run_created) times out but the parallel queue dispatch already succeeded, the user sees start() throw "POST /runs/... timed out after 60000ms" while the workflow actually does run via the queue path. That's misleading — the right behavior is to mark this as "resilient start" and continue (runtime.ts will retry the run_created event later).

Suggested adjustment:

function isRetryableStartError(err: unknown): boolean {
  if (ThrottleError.is(err)) return true;
  if (WorkflowWorldError.is(err)) {
    // 5xx server errors and timeouts (no status) are both transient
    if (err.status === undefined) return true;
    if (err.status >= 500) return true;
  }
  return false;
}

This is technically a behavior change in core, so it'd need a separate @workflow/core patch in the changeset. Could be a follow-up if you want to keep this PR scoped to world-vercel.

2. Node 18 + AbortSignal.any

The repo's root engines.node is ^18.0.0 || ^20.0.0 || ^22.0.0 || ^24.0.0. AbortSignal.any() was added in Node 20.3.0 and backported to 18.17.0, so it is missing on the earlier Node 18 releases that ^18.0.0 still allows. The branch at utils.ts:359 only fires when options.signal is set, which the comment notes is currently never. So today this is a latent issue, not an active one. But anyone wiring up a caller-provided signal in the future will get a TypeError at runtime on those older Node 18 versions.

Either drop Node 18 from engines (probably the right move overall, since Node 18 reached end-of-life in April 2025) or guard with a feature check. Could go in a separate PR.
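For reference, a feature-check guard could look something like the sketch below. The helper name composeSignals is hypothetical and not part of the PR. It uses AbortSignal.any() when the runtime provides it and otherwise falls back to forwarding aborts through a fresh AbortController.

```typescript
// Combine a per-request timeout signal with an optional caller-provided signal.
// composeSignals is a hypothetical helper name used only for this sketch.
function composeSignals(
  timeoutSignal: AbortSignal,
  callerSignal?: AbortSignal,
): AbortSignal {
  // No caller signal: the timeout signal alone is enough.
  if (!callerSignal) return timeoutSignal;

  // Feature check: AbortSignal.any() is missing on some older runtimes.
  const anyFn = (
    AbortSignal as unknown as { any?: (signals: AbortSignal[]) => AbortSignal }
  ).any;
  if (typeof anyFn === "function") {
    return anyFn.call(AbortSignal, [callerSignal, timeoutSignal]);
  }

  // Fallback: abort a fresh controller as soon as either source aborts.
  // Note: listeners stay attached until a source aborts, so this is only a sketch.
  const controller = new AbortController();
  const forward = (source: AbortSignal): void => {
    if (source.aborted) {
      controller.abort(source.reason);
      return;
    }
    source.addEventListener("abort", () => controller.abort(source.reason), {
      once: true,
    });
  };
  forward(callerSignal);
  forward(timeoutSignal);
  return controller.signal;
}
```

The combined signal would then be passed to fetch in place of the bare timeout signal, so that either the caller or the timeout can cancel the request.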

3. Hardcoded 60s timeout

The chosen value only narrowly covers the slowest legitimate case in your incident report (47s). A legitimately slow request at, say, 55s would still succeed, but only just; a hung request takes the full 60s to detect.

That's reasonable as a default, but I'd consider exposing it as a config knob (like the existing VERCEL_WORKFLOW_SERVER_URL env var pattern) for users with different latency profiles. Probably fine to wait for someone to ask.
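If such a knob were added, resolving it could look roughly like this. The variable name WORKFLOW_REQUEST_TIMEOUT_MS and the function resolveRequestTimeoutMs are invented for illustration; nothing in the PR defines them. Invalid or missing values fall back to the 60s default.

```typescript
// Default taken from the PR: 60 seconds per request.
const DEFAULT_REQUEST_TIMEOUT_MS = 60_000;

// Hypothetical env var name; the PR does not define one.
const TIMEOUT_ENV_VAR = "WORKFLOW_REQUEST_TIMEOUT_MS";

// Takes the environment as a plain record so it can be tested without process.env.
function resolveRequestTimeoutMs(
  env: Record<string, string | undefined>,
): number {
  const raw = env[TIMEOUT_ENV_VAR];
  if (raw === undefined || raw.trim() === "") return DEFAULT_REQUEST_TIMEOUT_MS;

  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negative values rather than disabling the timeout.
  if (!Number.isFinite(parsed) || parsed < 1) return DEFAULT_REQUEST_TIMEOUT_MS;

  return Math.floor(parsed);
}
```

In Node this would typically be called as resolveRequestTimeoutMs(process.env), and the result passed to AbortSignal.timeout().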

Wrap-up

Good fix. The transport-level approach is correct and the concerns above are forward-looking polish, not gating.

@TooTallNate
Member

Aside on my own concerns from the approval — I should retract one thing. I was briefly worried the 60s timeout might affect the long-lived stream GET endpoint, which can legitimately stay open for the full function duration. Confirmed it doesn't: looking at streamer.ts, the long-lived read at line 283 (getReadable/get) goes through a direct fetch() call, not makeRequest. Same for the write/writeMulti/close paths — direct fetch.

Only the discrete request/response calls go through makeRequest:

  • events.create (and similar) — milliseconds normally
  • runs.get / runs.list — milliseconds normally
  • getChunks (paginated stream chunks) — bounded per-page
  • getInfo (stream metadata) — trivial

60s is appropriately generous for all of those. So the design is correct as-is — no concern about the streaming endpoint.

3 participants