Retry failed VQS handlers immediately #1999
Conversation
🦋 Changeset detected. Latest commit: 45feb40. The changes in this PR will be included in the next version bump. This PR includes changesets to release 19 packages.
📊 Benchmark Results

Benchmark tables are collapsed in this export; no timing figures are shown. Scenarios covered: workflows with 0, 1, 10, 25, and 50 sequential steps; Promise.all and Promise.race with 10, 25, and 50 concurrent steps; 10, 25, and 50 sequential and concurrent 10KB data payload steps; and stream benchmarks with TTFB metrics (workflow with stream, stream pipeline with 5 transform steps at 1MB, 10 parallel streams at 1MB each, and fan-out/fan-in with 10 streams at 1MB each). Each scenario ran in 💻 Local Development and ▲ Production (Vercel), with 🔍 observability links for Nitro and Next.js (Turbopack). Summary tables (fastest framework by world, fastest world by framework) and column definitions are likewise collapsed.

❌ Some benchmark jobs failed: check the workflow run for details.
🧪 E2E Test Results

❌ Some tests failed.

Failed tests:
- ▲ Vercel Production (4 failed): astro (1), express (1), nuxt (1), vite (1)
- 🐘 Local Postgres (1 failed): nextjs-webpack-stable-lazy-discovery-disabled (1)

Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other

❌ Some E2E test jobs failed: check the workflow run for details.
```ts
    }
  },
  {
    retry: () => ({ afterSeconds: 0 }),
```
Done in c0a0ea35f / 107c5cbee: added a block comment explaining the 300s VQS visibility-timeout default, why we pass an explicit retry directive, and the idempotency expectation for close-together retries.
Pull request overview
This PR updates @workflow/world-vercel queue handling so failed VQS workflow handler invocations request an immediate retry directive instead of relying on the default visibility timeout.
Changes:
- Adds a `retry` option to `QueueClient.handleCallback`.
- Updates unit coverage for the callback retry option.
- Adds a patch changeset for `@workflow/world-vercel`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `packages/world-vercel/src/queue.ts` | Wires retry options into the VQS callback handler. |
| `packages/world-vercel/src/queue.test.ts` | Adds/updates assertions for `handleCallback` retry behavior. |
| `.changeset/retry-vqs-handler-errors-immediately.md` | Adds release metadata for the world-vercel patch. |
```ts
  {
    retry: () => ({ afterSeconds: 0 }),
  }
```
Updated in 107c5cbee: this no longer forces afterSeconds: 0. Handler failures now use delivery-count backoff starting at 1s and capped at 60s, so retries are quick without burning the delivery budget in a tight loop.
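For reference, a minimal sketch of the shape of that retry option (the exact code in 107c5cbee may differ; only the option object is shown here, not the full `handleCallback` call):

```ts
// Sketch only: the retry directive handed to QueueClient.handleCallback.
// Delivery 1 -> 1s, 2 -> 2s, 3 -> 4s, ... capped at 60s thereafter.
const retryOptions = {
  retry: (_error: unknown, { deliveryCount }: { deliveryCount: number }) => ({
    afterSeconds: Math.min(2 ** (deliveryCount - 1), 60),
  }),
};
```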
```ts
expect(
  options.retry(new Error('workflow server unavailable'), {
    messageId: 'msg-123',
    deliveryCount: 1,
  })
).toEqual({ afterSeconds: 0 });
```
Updated in 107c5cbee: the test now asserts the bounded retry sequence instead of { afterSeconds: 0 }.
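Roughly, the updated assertion walks delivery counts and checks the capped sequence (a sketch only; the exact test in 107c5cbee may phrase it differently):

```ts
// Sketch: options.retry is the callback captured from handleCallback, as in the
// existing test shown above; the error value is arbitrary.
const err = new Error('workflow server unavailable');
const delays = [1, 2, 3, 4, 5, 6, 7, 8, 20].map(
  (deliveryCount) =>
    options.retry(err, { messageId: 'msg-123', deliveryCount }).afterSeconds
);
expect(delays).toEqual([1, 2, 4, 8, 16, 32, 60, 60, 60]);
```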
karthikscale3 left a comment:
Diagnosis is right — confirmed independently from staging Datadog logs that the 300s gap is the @vercel/queue default visibilityTimeoutSeconds. Fix wires the right API surface, but afterSeconds: 0 overcorrects in a few ways. Inline comments cover them. Worth pairing with a workflow-server-side change that maps getaddrinfo EBUSY / transient AWS SDK errors to 503 Retry-After: 1 so the SDK can retry without ever bouncing back to the queue.
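Something like this on the workflow-server side is what I have in mind (a sketch only: it assumes an Express-style error middleware, and the set of transient error codes is illustrative):

```ts
import type { NextFunction, Request, Response } from 'express';

// Hypothetical middleware: answer transient infrastructure failures with
// 503 + Retry-After: 1 so the calling SDK retries in-process instead of
// failing the handler and bouncing the message back to the queue.
const TRANSIENT_CODES = new Set(['EBUSY', 'EAI_AGAIN', 'ETIMEDOUT', 'ECONNRESET']);

export function transientErrorHandler(
  err: NodeJS.ErrnoException,
  _req: Request,
  res: Response,
  next: NextFunction,
) {
  if (err.code && TRANSIENT_CODES.has(err.code)) {
    res.set('Retry-After', '1').status(503).json({ error: 'transient', code: err.code });
    return;
  }
  next(err); // non-transient errors keep their existing handling
}
```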
```ts
    }
  },
  {
    retry: () => ({ afterSeconds: 0 }),
```
afterSeconds: 0 overrides the configured retryAfterSeconds: 5.
Per the @vercel/queue README: "the message is re-delivered after the configured retry delay (from retryAfterSeconds in vercel.json or the retry callback's afterSeconds)". So returning 0 here bypasses the trigger-config value the SDK already generates.
For the concrete bug that motivated this (DNS EBUSY on a Fluid Compute lambda whose libuv DNS thread pool was wedged for ~9s), a 0s retry will land back on the same wedged container before it recycles. The configured 5s would actually be the right floor.
Suggest: retry: () => ({ afterSeconds: 5 }) to match retryAfterSeconds in the generated trigger config, or read the value from a shared constant so they can't drift apart.
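For concreteness, a sketch of the shared-constant option (assuming `WORKFLOW_QUEUE_TRIGGER` is exported from the builders package; the import path below is a guess):

```ts
// Hypothetical import path: adjust to however packages/builders exposes the constant.
import { WORKFLOW_QUEUE_TRIGGER } from '@workflow/builders';

// Keeps the handler-side retry delay in lockstep with the generated trigger config.
const retry = () => ({ afterSeconds: WORKFLOW_QUEUE_TRIGGER.retryAfterSeconds });
```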
5 seconds still seems too long. Where is that floor coming from?
This is a good point about the interaction between afterSeconds in the retry callback and the retryAfterSeconds: 5 in the trigger config.
However, looking at the handler-side code in packages/core/src/runtime.ts (line 158) and packages/core/src/runtime/step-handler.ts (line 98), runs/steps are bounded at MAX_QUEUE_DELIVERIES = 48 attempts — the handler will gracefully fail the run before unbounded looping occurs. The comment in constants.ts does note the 48-attempt budget was sized assuming ~20 hours of elapsed time with the 5s backoff schedule, so with afterSeconds: 0 those 48 attempts would exhaust much faster.
That said, the main fix here (avoiding the 300s visibility lock) is clearly correct. The question is whether afterSeconds: 0 vs something like afterSeconds: 5 matters in practice — @pranaygp would know best whether VQS applies its own backoff on top of the afterSeconds value from the retry callback or if this truly replaces the entire schedule.
Good catch. I agree that afterSeconds: 0 is more aggressive than needed. The retryAfterSeconds: 5 in the trigger config (also defined in packages/builders/src/constants.ts as WORKFLOW_QUEUE_TRIGGER.retryAfterSeconds) exists for a reason, and the retry callback's afterSeconds overrides it entirely.
The suggestion to use afterSeconds: 5 (or better, import from the shared WORKFLOW_QUEUE_TRIGGER constant) makes sense — it fixes the 300s visibility timeout problem while preserving the backoff floor that MAX_QUEUE_DELIVERIES = 48 was sized around.
Adjusted in 107c5cbee after Pranay’s follow-up: instead of a 5s floor, the retry starts at 1s and then backs off by delivery count. That keeps the first retry quick while avoiding the original 0s hot-loop risk.
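As a rough sanity check on the delivery budget (counting backoff delays only, not handler runtime or queue latency):

```ts
// With afterSeconds = min(2 ** (deliveryCount - 1), 60) across the 48-delivery budget,
// the retry delays alone sum to 1+2+4+8+16+32 + 42*60 = 2583s, roughly 43 minutes.
const MAX_QUEUE_DELIVERIES = 48; // per the core constants referenced above
const totalDelaySeconds = Array.from(
  { length: MAX_QUEUE_DELIVERIES },
  (_, i) => Math.min(2 ** i, 60),
).reduce((sum, s) => sum + s, 0);
console.log(totalDelaySeconds); // 2583
```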
```ts
    }
  },
  {
    retry: () => ({ afterSeconds: 0 }),
```
No deliveryCount-aware backoff — poison messages will hot-loop.
This returns 0 for every delivery. For a permanently-broken message (malformed payload, schema mismatch, a workflow-server 500 that isn't transient), VQS will redeliver every ~0s until it hits its max-receive limit / DLQ. That's much hotter than the current 300s behavior.
Standard shape that handles both transient and stuck without classification:
```ts
retry: (_err, { deliveryCount }) => ({
  afterSeconds: Math.min(2 ** (deliveryCount - 1), 60),
})
```
Gives 1s → 2s → 4s → 8s …, capped at 60s. Fast enough for transient failures (full recovery within seconds), gentle enough not to thrash on a poison message.
Implemented in 107c5cbee: handler errors now use delivery-count-aware backoff (1s -> 2s -> 4s -> 8s ...) capped at 60s.
```ts
    }
  },
  {
    retry: () => ({ afterSeconds: 0 }),
```
Cross-system amplification risk during a workflow-server outage.
The 300s default is bad for happy-path latency but accidentally acts as a circuit-breaker when workflow-server is browning out — each in-flight message is held off-line for 5 min, naturally throttling redrive volume. With afterSeconds: 0, every dispatched VQS message will thrash workflow-server at full speed during exactly the situations we'd want to be gentle on.
The deliveryCount-aware backoff in my other comment addresses this too — once retries pile up, the per-message redrive rate self-throttles via exponential backoff.
Addressed by the same backoff in 107c5cbee: repeated failures self-throttle up to a 60s cap instead of hammering workflow-server with 0s redrives.
```ts
    }
  },
  {
    retry: () => ({ afterSeconds: 0 }),
```
Worth a comment here explaining why this retry handler exists.
Future readers will see retry: () => ({ afterSeconds: 0 }) and wonder why we explicitly opt into a non-default. A short block comment naming the 300s visibilityTimeoutSeconds default and the incident that motivated it (or the linked PR / issue) would make this durable. Without it, the next person to touch this is likely to remove it as redundant.
Also worth a one-line note that workflow/step handlers must be idempotent — the event-sourced model already required this for the 300s replay path, but tightening the redrive window makes back-to-back replays much more likely in practice.
Done in c0a0ea35f / 107c5cbee: added the explanatory comment and called out the idempotency requirement.
```ts
    messageId: 'msg-123',
    deliveryCount: 1,
  })
).toEqual({ afterSeconds: 0 });
```
Test asserts the directive shape but not the contract.
This invokes options.retry directly with a stubbed error, which verifies the function we pass in returns the right thing — but it doesn't exercise the path where @vercel/queue actually calls retry on a real handler throw. If handleCallback's contract for retry ever changes (different metadata shape, different return-type, called/not-called timing), this test still passes.
A small integration-style test using a real QueueClient with a throwing handler, then asserting redelivery happened with the expected delay, would catch contract drift. Not blocking — @vercel/queue owns the other side — but worth considering as a follow-up.
The test validates the contract that the production code passes to @vercel/queue — i.e., the retry callback shape and return value. Testing the actual handleCallback integration would require either not mocking @vercel/queue or a more complex integration test, which seems out of scope for a unit test. The mock-based approach here is consistent with how other handleCallback behavior is tested in this file (e.g., the existing mockHandleCallback pattern).
Agree this is fine for now — it verifies the wiring correctly. An integration test with a real QueueClient would be more robust against contract drift but isn't blocking for this change.
Leaving the real @vercel/queue contract test as a follow-up since that side is owned by the queue package. This PR now covers the world-vercel retry directive and the backoff values it returns.
I don't think this will work. The problem is that we claim the message for 5 minutes. So even if we retry immediately, we will fail to claim the message.
Actually, this is a new message. We probably want some sort of backoff here so we don't hammer the consumer. Let's discuss in Slack.
```ts
// Without an explicit retry directive, @vercel/queue leaves failed
// handler messages invisible until the default 300s visibility timeout
// expires. Start retrying quickly, then back off by delivery count so
// an outage or poison message cannot hot-loop. Workflow handlers are
// event-sourced and must remain idempotent because queue retries can
// happen close together.
```
Shouldn't VQS fall back to the configured 5s retryAfterSeconds from the trigger config when nothing is explicitly set, instead of the 300s visibility timeout?
Checked against the @vercel/queue callback path in ~/github/vercel/vqs: the handler creates a consumer group with visibilityTimeoutSeconds defaulting to 300, and on handler throw it only calls changeVisibility when the retry callback returns { afterSeconds }. If the retry callback is absent/undefined, the error propagates and the message stays hidden until that visibility timeout expires. So this path does not appear to fall back to the trigger config’s retryAfterSeconds: 5; that matches the observed 300s retry delay and is why this PR keeps an explicit retry directive.
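To make that control flow concrete, here is an illustrative sketch of the behavior described above (not the actual @vercel/queue source; every name below is a stand-in):

```ts
type RetryDirective = { afterSeconds: number };
type RetryFn = (
  err: unknown,
  meta: { messageId: string; deliveryCount: number },
) => RetryDirective;

interface QueueMessage { id: string; deliveryCount: number; body: unknown }

// Stand-ins for the queue client's internal operations.
declare function acknowledge(msg: QueueMessage): Promise<void>;
declare function changeVisibility(msg: QueueMessage, afterSeconds: number): Promise<void>;

async function deliver(
  msg: QueueMessage,
  handler: (body: unknown) => Promise<void>,
  retry?: RetryFn,
): Promise<void> {
  try {
    await handler(msg.body);
    await acknowledge(msg); // success: message is removed from the queue
  } catch (err) {
    const directive = retry?.(err, { messageId: msg.id, deliveryCount: msg.deliveryCount });
    if (directive) {
      // Explicit retry directive: re-expose the message after the requested delay.
      await changeVisibility(msg, directive.afterSeconds);
      return;
    }
    // No retry callback: the error propagates and the message stays hidden until
    // the consumer group's visibilityTimeoutSeconds (default 300) expires.
    throw err;
  }
}
```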
Backport PR opened against
Summary
When the Workflow queue handler throws, ask VQS to make the current message visible immediately instead of waiting for the default 300-second visibility lease to expire.
Root cause: the generated Workflow trigger config uses `retryAfterSeconds: 5`, but `@workflow/world-vercel` called `QueueClient.handleCallback` without a retry option. On handler failure, `@vercel/queue` propagated the error and the message stayed locked for the client default `visibilityTimeoutSeconds: 300`.

Changes

- Add `retry: () => ({ afterSeconds: 0 })` to the VQS callback handler in `@workflow/world-vercel`.
- Update unit coverage for `handleCallback`.
- Add a patch changeset for `@workflow/world-vercel`.

Validation
- `git diff --check -- packages/world-vercel/src/queue.ts packages/world-vercel/src/queue.test.ts .changeset/retry-vqs-handler-errors-immediately.md`
- `pnpm install` with Node v24.15.0
- `pnpm --filter @workflow/utils build`
- `pnpm --filter @workflow/errors build`
- `pnpm --filter @workflow/world build`
- `pnpm --filter @workflow/world-vercel build`
- `pnpm --filter @workflow/world-vercel test -- queue.test.ts` (69 tests passed)