Handle 409 errors gracefully for step_completed/failed/retrying#1120
Conversation
…p_retrying events When multiple step invocations race to complete/fail/retry a step, the server returns a 409 Conflict. Previously this would bubble up as an unhandled error. Now these are caught and logged as warnings, matching the pattern established for run_completed/run_failed in #1118. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: c59a690 The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express | Nitro Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (1 failed)nextjs-turbopack (1 failed):
🌍 Community Worlds (45 failed)turso (45 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
pranaygp
left a comment
There was a problem hiding this comment.
Solid PR — the 409 handling pattern is consistent and correct, and the tests cover the key paths well. Left a few minor comments.
| }, | ||
| }); | ||
| try { | ||
| await world.events.create(workflowRunId, { |
There was a problem hiding this comment.
Nit: This is the only step_failed event creation that does not use withServerErrorRetry. The other three sites (fatal error path, max retries exhausted path, and step_retrying) all wrap the call with it. Was this intentional? If not, might be worth wrapping for consistency:
try {
await withServerErrorRetry(() =>
world.events.create(workflowRunId, {
...
})
);
} catch (err) {(Pre-existing, but since you're already touching this block it'd be a clean fix.)
| }, | ||
| event: {}, | ||
| }); | ||
| mockStepFn.mockResolvedValue('step-result'); |
There was a problem hiding this comment.
Unused variable — callCount is incremented but never asserted on. Either assert it (e.g. expect(callCount).toBe(1) to verify step_completed was actually attempted) or remove it.
| ).rejects.toThrow('Internal Server Error'); | ||
| }); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
Nice that you have a should re-throw non-409 errors from step_retrying test. Consider adding analogous re-throw tests for step_completed and step_failed (e.g. a 500 from step_completed should propagate, a 500 from step_failed in the max-retries path should propagate). This would fully cover all the throw err / throw stepFailErr branches.
pranaygp
left a comment
There was a problem hiding this comment.
Did an audit of 409/410 handling across all world.events.create() call sites in the SDK (cross-referenced against the server's event validation logic and the event sourcing docs). The changes in this PR are correct — here are the pre-existing gaps worth addressing:
Should fix in this PR
packages/core/src/runtime/step-handler.ts:150-153 — uses console.warn() instead of runtimeLogger.info() for the step_started 410 (workflow already completed) case. Every other 410/409 handler in the file uses the structured logger. This should be:
runtimeLogger.info(
'Workflow run already completed, skipping step',
{ workflowRunId, stepId, message: err.message }
);info is the right level here — this is an expected, non-actionable condition (same as the 410 handler in the catch block at line 476-486).
Follow-up PR candidates
These are all pre-existing and fine to address separately:
1. packages/core/src/runtime/resume-hook.ts:128 — hook_received has no error handling at all. Server can return 409 (hook disposed) or 410 (run done) — both benign races that should be caught and logged as info. A 404 here would be a real bug and should throw.
2. packages/core/src/runtime/run.ts:69 — run_cancelled (the Run.cancel() method) has no error handling. Server returns 409 if run already terminal — should catch and log as warn.
3. packages/core/src/runtime/runs.ts:95 — run_cancelled (the cancelRun() function) — same issue, different call site. Errors get wrapped in a generic Failed to cancel run message, losing the 409 signal.
4. packages/core/src/runtime.ts:145 — run_started has no error handling. Server returns 410 if run is already terminal. Should catch 410 and log as info ("run already completed, skipping").
5. packages/core/src/runtime/suspension-handler.ts:116 — hook_created catches 410 but not 409. Server returns 409 if the hookId already exists (duplicate creation, same as step_created/wait_created). Should catch 409 and log as info, matching the pattern at lines 183 and 238.
6. packages/core/src/runtime/runs.ts:190 — wait_completed in wakeUpRun() catches 409 silently (increments stoppedCount with no log). A debug log would help observability without noise.
7. packages/core/src/runtime/step-handler.ts:253 — (already commented on this) the pre-execution guard step_failed is the only one not wrapped in withServerErrorRetry.
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
When multiple step invocations race to complete/fail/retry a step, the server returns a 409 Conflict. Previously this would bubble up as an unhandled error. Now these are caught and logged as warnings, matching the pattern established for run_completed/run_failed in #1118.