[world] Re-enqueue active runs on world restart#1534
Conversation
When a world (local or postgres) is stopped and restarted, runs that were pending or running would get stuck because their in-memory queue messages were lost. On start(), both worlds now scan storage for active runs and re-enqueue them. The workflow handler's event-log replay makes this idempotent — duplicate enqueues are safe no-ops. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: dff6564 The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests🌍 Community Worlds (59 failed)mongodb (3 failed):
redis (2 failed):
turso (54 failed):
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
|
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
Move the shared recovery logic into @workflow/world/src/recovery.ts so both world-local and world-postgres import it instead of duplicating the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TooTallNate
left a comment
There was a problem hiding this comment.
Good feature — re-enqueueing active runs on restart is the right approach, and the shared reenqueueActiveRuns helper with cursor-based pagination is clean. The idempotency guarantee from event-log replay makes this safe by design.
What I verified:
- Type signatures line up:
Storage['runs'](withlist),Queue['queue'](enqueue method),ValidQueueNametemplate literal, andQueuePayloadunion all check out. resolveData: 'none'is the right choice — avoids deserializing input/output blobs when onlyrunIdandworkflowNameare needed.'cancelled'is correctly excluded — it's a terminal state across all world implementations with no residual work needed.- CI is green across all environments (local, postgres, vercel, windows, community worlds).
- Test coverage is solid: world-local tests exercise pending/running/completed/failed/empty scenarios; world-postgres tests cover active runs, empty, and pagination.
One blocking issue: reenqueueActiveRuns has no error handling — if runs.list() or queue() throws (e.g. corrupt storage, DB connection issue on restart), the entire start() call fails and the world never starts. This is a startup-path function so partial progress is better than total failure. See inline comment.
| const page = await runs.list({ | ||
| status, | ||
| resolveData: 'none', | ||
| pagination: { cursor }, |
There was a problem hiding this comment.
Blocking: If runs.list() or enqueue() throws here (e.g. corrupt run on disk, transient DB error), the entire start() will reject and the world won't come up at all. For a best-effort recovery path this is too fragile — a single bad run shouldn't prevent startup.
Consider wrapping the inner enqueue in a try/catch (log and continue), and optionally wrapping the outer loop too:
for (const run of page.data) {
try {
const queueName: ValidQueueName = `__wkf_workflow_${run.workflowName}`;
await enqueue(queueName, { runId: run.runId });
reenqueued++;
} catch (err) {
console.warn(
`[${label}] Failed to re-enqueue run ${run.runId}: ${err}`
);
}
}This way a single bad run doesn't prevent the rest from being recovered, and the world still starts.
There was a problem hiding this comment.
Good call — wrapped individual enqueue calls in try/catch so a single corrupt/bad run logs a warning and the rest still get recovered. Fixed in dff6564.
| } | ||
|
|
||
| beforeEach(() => { | ||
| vi.clearAllMocks(); |
There was a problem hiding this comment.
Nit (non-blocking): The comment on lines 87-91 is a bit confusing — it acknowledges the mock returns the same runs for both 'pending' and 'running' queries, leading to 4 enqueues, but then only asserts 2 addJob calls by name rather than verifying toHaveBeenCalledTimes(4). The mismatch between the comment's explanation and what's actually asserted could confuse future readers.
Consider either:
- Making the mock return distinct runs per status so the assertion is 1:1, or
- Adding
expect(workerUtilsMock.addJob).toHaveBeenCalledTimes(4)to match the comment
There was a problem hiding this comment.
Fixed — mockRunsList now takes a { pending, running } map so each status returns distinct runs and the assertion is toHaveBeenCalledTimes(2) matching exactly what we expect.
- Wrap individual enqueue calls in try/catch so a single bad run doesn't prevent the world from starting - Fix postgres test to return distinct runs per status instead of the same runs for both queries, making assertions 1:1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TooTallNate
left a comment
There was a problem hiding this comment.
Both issues from my previous review have been addressed in dff6564:
-
Error handling (was blocking): Individual enqueue calls are now wrapped in try/catch with
console.warn— a single bad run no longer prevents the world from starting. The rest of the runs still get re-enqueued. -
Test mock clarity (was non-blocking):
mockRunsListnow takes arunsByStatusmap with distinct runs per status, and the test assertstoHaveBeenCalledTimes(2)— the confusing comment about 4-vs-2 enqueues is gone.
CI failures are benchmark timeouts only, not related to this change.
LGTM.
Summary
pendingorrunningstate now get re-enqueued automatically duringstart()Closes #1531
Test plan
reenqueue.test.tsfor world-local: verifies pending/running runs are re-enqueued after restart, completed/failed runs are not, and empty data dirs are handledreenqueue.test.tsfor world-postgres: verifies active runs trigger graphile-worker jobs on start, no-op when no active runs, and pagination works correctly🤖 Generated with Claude Code