fix(world-vercel): add default request timeout to workflow-server HTTP calls#1807
Conversation
VaguelySerious
left a comment
There was a problem hiding this comment.
AI review: no blocking issues
…imeout Made-with: Cursor # Conflicts: # packages/world-vercel/src/utils.test.ts # packages/world-vercel/src/utils.ts
- Compose per-request timeout with caller-provided AbortSignal via AbortSignal.any() so a future caller passing options.signal doesn't silently lose the hang protection (and vice versa). - Move the floating eslint-disable-next-line for the undici dispatcher cast back next to the actual `fetch(... as any)` call where the suppression applies, instead of pointing at `const fetchStart`. Both nits flagged by @VaguelySerious in the AI review on PR #1807. Made-with: Cursor
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Karthik Kalyan <105607645+karthikscale3@users.noreply.github.com>
TooTallNate
left a comment
There was a problem hiding this comment.
Approving — withdrawing my prior CHANGES_REQUESTED. The author took the suggestion from my earlier review: the process.exit workaround in runtime.ts is gone, replaced with a transport-level timeout in world-vercel/utils.ts:355 (AbortSignal.timeout(60_000) plumbed into the fetch call). This is the right layer — fixes the underlying issue at the single point where it lives, covers all 27+ world.events.create() call sites uniformly, produces a typed WorkflowWorldError that existing catch sites recognize.
Implementation looks solid:
- Timeout primitive:
AbortSignal.timeout(60_000)→ DOMException withname: 'TimeoutError'→ caught and wrapped asWorkflowWorldErrorwithcausepreserved. Standard pattern, clean. - Composition with
options.signal:AbortSignal.any([options.signal, timeoutSignal])for when callers eventually pass their own signals. Currently dead code per the comment, but wired correctly for future use. - Error mapping: error message includes
${method} ${endpoint}and${elapsed}ms— useful for debugging, attachesurlvia theWorkflowWorldErrorconstructor. - Span attributes:
ErrorType('TIMEOUT')for OTEL, plusrecordException. Consistent with sibling status-code branches. - Tests:
utils.test.tscovers both the wrap-on-timeout path and the pass-through-on-non-timeout path. Mocks fetch with synthetic errors. All 94 world-vercel tests still pass. - Changeset scope:
@workflow/world-vercelonly, which is correct — no behavior change incore.
A few non-blocking concerns worth raising. None are gating; mostly forward-looking.
1. start() retry classification doesn't match timeouts as retryable
isRetryableStartError in start.ts:331 only matches WorkflowWorldError with status >= 500. The new timeout error has no status set, so it falls into the throw err branch at line 283.
Concrete consequence: when events.create(run_created) times out but the parallel queue dispatch already succeeded, the user sees start() throw "POST /runs/... timed out after 60000ms" while the workflow actually does run via the queue path. That's misleading — the right behavior is to mark this as "resilient start" and continue (runtime.ts will retry the run_created event later).
Suggested adjustment:
function isRetryableStartError(err: unknown): boolean {
if (ThrottleError.is(err)) return true;
if (WorkflowWorldError.is(err)) {
// 5xx server errors and timeouts (no status) are both transient
if (err.status === undefined) return true;
if (err.status >= 500) return true;
}
return false;
}This is technically a behavior change in core, so it'd need a separate @workflow/core patch in the changeset. Could be a follow-up if you want to keep this PR scoped to world-vercel.
2. Node 18 + AbortSignal.any
The repo's root engines.node is ^18.0.0 || ^20.0.0 || ^22.0.0 || ^24.0.0. AbortSignal.any() was added in Node 20.3 (May 2023) — not available on Node 18. The branch at utils.ts:359 only fires when options.signal is set, which the comment notes is currently never. So today this is a latent issue, not an active one. But anyone wiring up a caller-provided signal in the future will get a TypeError at runtime on Node 18.
Either drop Node 18 from engines (probably the right move overall — Node 18 is EOL April 2025 as of writing) or guard with a feature check. Could go in a separate PR.
3. Hardcoded 60s timeout
The chosen value barely covers the slowest legitimate case in your incident report (47s). A successful but slow-starting request at, say, 55s would now succeed but be on the edge; a hung request takes 60s to detect.
That's reasonable as a default, but I'd consider exposing it as a config knob (like the existing VERCEL_WORKFLOW_SERVER_URL env var pattern) for users with different latency profiles. Probably fine to wait for someone to ask.
Wrap-up
Good fix. The transport-level approach is correct and the concerns above are forward-looking polish, not gating.
|
Aside on my own concerns from the approval — I should retract one thing. I was briefly worried the 60s timeout might affect the long-lived stream GET endpoint, which can legitimately stay open for the full function duration. Confirmed it doesn't: looking at Only the discrete request/response calls go through
60s is appropriately generous for all of those. So the design is correct as-is — no concern about the streaming endpoint. |
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Add a default per-request timeout to
makeRequest()inworld-vercelso hung responses from workflow-server can't burn compute up to the function'smaxDuration.Background (original problem)
A production workflow (
wrun_01KPDFGK4QBFZ7XERXN9NP7VY2) on a preview deployment showed:run_startedPOST to workflow-server took 47s (server under load)run_failedwrite was sent but the response never came back (External APIs showed∅ "Timed out while waiting for a response")maxDuration) before SIGTERM — ~11 minutes of compute burned doing nothingOriginal fix (reverted)
The first version of this PR added a 30s hard-exit deadline to the replay timeout handler in
packages/core/src/runtime.ts. Per Nate's review, this was the wrong layer: it only protected one of 27world.events.create()call sites in core, leaving the other 26 (and every otherworld.*method going throughmakeRequest()) exposed to the exact same hang.New fix
Moved the mitigation down into the
world-verceltransport layer, where allworld.*calls funnel throughmakeRequest():packages/world-vercel/src/utils.ts— attachesAbortSignal.timeout(60_000)to everymakeRequest()fetch. ATimeoutErrororAbortErrorfrom fetch is converted into aWorkflowWorldError(with the original error preserved ascauseand elapsed ms in the message), so existing catch sites handle it uniformly. The span is tagged withErrorType('TIMEOUT').REPLAY_TIMEOUT_EXIT_DEADLINE_MSconstant.Why 60s: comfortably above the observed 47s p99 in the incident, well under the 240s replay timeout so upstream retries still have room, and much shorter than the
maxDurationSIGTERM horizon.Impact
world.*calls throughworld-vercel(events, runs, steps, queue, hooks, etc.), not just the replay timeout path.WorkflowWorldErrors — existing catch sites get predictable retry/failure semantics instead of an infinite await.AbortSignal.timeout()doesn't fire on normal requests and the unref'd timer doesn't hold the event loop open.Test plan
pnpm typecheckon@workflow/core+@workflow/world-vercelpackages/world-vercel/src/utils.test.tswith two cases:TimeoutErrorfromfetchis wrapped intoWorkflowWorldErrorwith elapsed ms and preservedcauseTypeError) propagate unchangedworld-vercelsuite passes (81 tests)coresuite passes (591 tests)WorkflowWorldErrorinstead of running tomaxDuration