[core] Combine flow+step bundle and process steps eagerly#1338
Conversation
🦋 Changeset detectedLatest commit: 234fce4 The changes in this PR will be included in the next version bump. This PR includes changesets to release 22 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
🧪 E2E Test Results✅ All tests passed Summary
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
|
|
Seems you had a lot of work here. It's not easy as it seems. Hopefully, this gonna solve some issues we are facing with Workflow. Peace, |
TooTallNate
left a comment
There was a problem hiding this comment.
Withdrawing my prior CHANGES_REQUESTED — the blocker I called out (the stale (err as any).meta?.retryAfter access in step-executor.ts) is fixed at line 169 with err.retryAfter ?? 1. As an aside, my fix recommendation in the prior review was based on a faulty type assumption (I confused TooEarlyError.retryAfter: number with RetryableError.retryAfter: Date). The author's err.retryAfter ?? 1 form is correct: it's a number of seconds with sensible fallback.
My other earlier concerns:
getStepNameFromEventfull-event-load: addressed.stepNameis now plumbed throughWorkflowInvokePayloadSchemaso the handler reads it directly from the queue message payload instead of doing a paginated event load. ✓step.errorguard for max retries: known limitation acknowledged in-thread. The proposed long-term fix (server-sidene(status, 'running')onstep_started) is separate work.- Background step → continuation enqueue safety: acknowledged as intentional.
So my specific items are resolved. But I'm not approving outright — there's still substantial unresolved review from @VaguelySerious touching real concerns I'd defer to them on:
- Removal of the LegacyStreamWorld shim
- Limiting queue concurrency to 1 in tests
- Comments removed without explanation in
suspension-handler.ts/runtime.ts - The
builder-deferred.tsNext.js behavior prediction - Several "is there a systematic fix instead of conditionally allowing edge cases" questions
Plus the auto-flagged event cache duplication bug from Vercel's review bot is worth confirming was understood and dismissed (or fixed).
Approval-wise this should land when @VaguelySerious's concerns are resolved. From my side: no remaining blockers.
TooTallNate
left a comment
There was a problem hiding this comment.
Thorough re-review for Peter. Spent significant time tracing four risk surfaces: the Vercel bot's auto-flag, the concurrent-step-execution model, inline-step replay determinism, and the open review threads. Bottom line up front: no ship-blockers from my side. A few coverage gaps and one acknowledged limitation worth being explicit about, but nothing I'd hold this PR on.
Vercel bot's "event cache duplication" auto-flag is a false positive
"wait_completed events pushed locally without advancing eventsCursor cause duplicate events on next incremental fetch."
The author anticipated exactly this scenario. runtime.ts:585–594 has explicit dedupe-by-eventId logic:
const existingIds = new Set(cachedEvents.map((e) => e.eventId));
for (const e of loaded.events) {
if (!existingIds.has(e.eventId)) {
cachedEvents.push(e);
}
}Verified end-to-end: locally-pushed wait_completed (line 685) keeps its real server-returned eventId, the next iteration's incremental fetch returns it again, the dedupe filter drops it, the consumer never sees a duplicate. The comment in the code calls out the exact scenario the bot flagged. Safe to dismiss the auto-flag.
Concurrent-step-execution: a partial fix shipping with a known gap
Verified the L198 guard (step.attempt > maxRetries + 1 && step.error) traces correctly for the scenario it targets — concurrent first-attempts before any user-code error has been recorded. Without step.error, the guard correctly skips and the step proceeds normally even with inflated step.attempt.
But there's a symmetric gap in the catch-block path at L469:
if (currentAttempt >= maxRetries + 1) { /* step_failed */ }This check has no && step.error guard, so it fires on the first real catch-block failure even when currentAttempt has been inflated by concurrent first-attempts. End result: a workflow with maxRetries=3 and 4+ concurrent handlers can permanently fail after a single transient flake, with a misleading log saying "exceeded max retries (4 retries)" when the step body actually only ran once.
This is the only substantive open thread on the PR (step-executor.ts:198), and the author has acknowledged it in a separate resolved thread:
"Agreed on the concurrent-retry limitation. The server-side
ne(status, 'running')WHERE clause is the right long-term fix — tracking as a follow-up."
So everyone's eyes-open. Worth being explicit in the changelog or PR description that under high queue back-pressure, default-maxRetries workflows can spuriously fail. Either of these would close the gap fully:
- Add the same
&& step.errorguard at L469 (mirrors L198, easy to do, but doesn't fix the L198 case for catch-block). - Server-side
ne(status, 'running')onstep_started(referenced in the resolved thread; also fixes the world-testing concurrency issue Peter resolved separately).
Not a ship-stopper, but I'd want it tracked as a known issue users can hit, not just an internal todo.
Inline-step replay: ordering subtlety in the dedupe path
The locally-pushed-then-incremental-fetch path can produce non-monotonic-by-eventId ordering in cachedEvents. Specifically: after wait_completed is pushed locally (L685) without advancing the cursor, the next incremental fetch can return concurrent events with eventIds between the local push and earlier events. After dedupe, cachedEvents becomes append-ordered but not eventId-ordered. The EventsConsumer walks by index, so it consumes them in append order, which is not the canonical replay order.
In practice this is unlikely to cause issues because:
- Within one handler's writes, the same world process generates monotonic ULIDs.
- Cross-handler events for the same run are gated by the ownership check (
step_createdownership at L813). - The dangerous case requires concurrent handler activity for the same run + non-trivial cross-process clock skew.
But it's a footgun. The cleanest fix would be: after locally pushing an event, set eventsCursor to that event's eventId so the next fetch skips it. This also eliminates the dedupe pass entirely. Worth considering as a follow-up.
"VaguelySerious" review threads — context for Peter
For other reviewers who may stumble on this: most of the open @VaguelySerious threads are Peter's own AI-review notes posted via tooling. Of 28 such threads, 27 are either resolved or outdated (the diff has moved past them). The single unresolved + non-outdated thread is the concurrent-retry concern above (step-executor.ts:198), which Peter has acknowledged as a known limitation in a sibling resolved thread.
External-reviewer threads (mine, the Vercel bot's) are all resolved or addressed. So the apparent "long list of unresolved concerns" from my prior comment is misleading — it was conflating Peter's self-review notes with external review.
Test coverage gaps (non-blocking)
A few coverage gaps for novel V2 behavior worth tracking even if you ship:
-
Dedupe path at
runtime.ts:585–594— zero test coverage. The behavior is novel to V2 and load-bearing for incremental-replay correctness. A unit test that:- First iteration fetches with cursor C0
- Manually pushes a
wait_completedto cachedEvents - Second iteration fetches and gets the same
wait_completedplus an interleaved new event - Asserts: no duplicates AND consumer sees no orphans
...would lock in the contract.
-
L198 guard's positive case — there's no test asserting that
attempt > maxRetries + 1withoutstep.errordoes NOT fail the step. The guard exists specifically for this case and is currently un-asserted. The twostep-handler.test.tstests at lines 296–340 and 343–388 both seterroron the fixture, so they exercise different paths. -
Concurrent-handler races — neither Scenario B (concurrent retry with prior error → premature L198 failure) nor Scenario D (catch-block premature failure with no
&& step.errorguard) has a test. Adding these would catch the bug currently in code if a future fix lands. -
builder-deferred.tsNext.js manifest reading — Peter agreed in review that readingapp-paths-manifest.jsonwas the right approach over regex prediction. Verified that DID land (L1171–1175 reads the manifest with comment "This avoids predicting Next.js conventions with regexes and instead reads from Next.js's own output"). ✅
What looks good
- The architectural shift is well-executed. The replay loop, ownership-gated inline dispatch, idempotency-key-based queue dedupe, builder unification, and
__steps_registeredrollup-tree-shaking guard are all clean. - The
getStepNameFromEventfull-event-load concern from my prior review was addressed —stepNamenow flows throughWorkflowInvokePayloadSchema. ✓ - The TooEarlyError stale
retryAfteraccess pattern is fixed at step-executor.ts:169. ✓ - The CJS
module.exportscollision (ESM format for steps whenbundleFinalOutput: true) is a subtle but correct fix. - The changelog doc
docs/content/docs/changelog/eager-processing.mdxis one of the better architecture docs I've seen in a PR — covers every edge case, failure mode, and platform-specific quirk.
Verdict
No blockers from my side. Ship if Vade and storytime green-light. The concurrent-retry premature-failure case is real but acknowledged, the auto-flagged "event cache duplication" is a non-issue, and the test coverage gaps are real but not regressions (the V2 behaviors they cover are net-new).
If you want one piece of polish before merge, the L469 && step.error mirror would close the most user-visible failure mode at zero cost — but I won't gate on it.
TooTallNate
left a comment
There was a problem hiding this comment.
Pre-emptively
Merges the V1 split flow/step routes (one queue message per step + one per flow continuation, two separate function invocations per step) into a single combined handler at /.well-known/workflow/v1/flow that executes steps inline within the same invocation as the workflow replay. A serial workflow with N steps that previously required ~2N+1 function invocations now completes in 1. Architecture ------------ - packages/core: workflowEntrypoint() handler with an inline replay loop. Each loop iteration: load events incrementally, replay workflow, inline-execute one owned step on suspension, loop. Background steps (from Promise.all) are queued back to the same handler with stepId so the in-flight handler advances the loop without queue round-trips whenever possible. - packages/builders: createCombinedBundle() emits a single ESM bundle with workflowEntrypoint(workflowCode) plus a side-effect __step_registrations bundle. ESM-by-default with createRequire banner for CJS deps (matches main's #1562 pattern). - packages/next, nestjs, sveltekit, astro, nitro, nuxt, hono, express, fastify, vite: framework builders updated to use createCombinedBundle. The step/ directory is no longer generated. Correctness invariants (V2 polish) ---------------------------------- - Single inline executor per step. Atomic step_created per correlationId (per-step in-process mutex in world-local; SQL-level guards in postgres/vercel) means exactly one handler claims ownership of each step. Inline execution gates on ownership; queue dispatch is unconditional with idempotency keys for crash recovery (matches V1). - onUnconsumedEvent reverts to its original PR #1055 contract: any unconsumed event fatals as CORRUPTED_EVENT_LOG. Source-level fixes removed every code path that produced orphaned step lifecycle events during V2 work, so the skip branch became unnecessary. - Lock-release polling interval lowered from 100ms to 10ms in flushable-stream so the V2 step-executor's ops-settle race typically resolves in ~5ms instead of ~50ms per writable-bearing step. - Per-iteration runs.get round-trip eliminated: concurrent completion is detected by scanning the loaded event log for terminal run events instead. - Worker-pool deadlock guard removed from Run#pollReturnValue. The fibonacci/recursive parent→child polling case is now handled by raising world-postgres default queueConcurrency from 10 to 50, with the limitation documented at three call sites (fibonacciWorkflow's JSDoc, queue.ts, eager-processing.mdx). - getWorldLazy module-scope singleton breaks the step-bundle import chain to world.ts so step bundles don't transitively pull in world-vercel + cbor-x + the VQS dev-handler resolver. World-local atomicity --------------------- - Per-step in-process mutex serializes events.create for step lifecycle events on a (runId, correlationId) key. step_started is rejected only when the step is already terminal — duplicate step_started on a non-terminal step is allowed (preserves retry semantics after SIGKILL). Runtime helpers --------------- - loadWorkflowRunEvents(runId, afterCursor?) consolidates the previous getAllWorkflowRunEvents{,WithCursor} + getNewWorkflowRunEvents into one helper. Documentation ------------- - docs/content/docs/changelog/eager-processing.mdx: full record of architecture, edge cases, and the fixes applied during the V2 work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Two CI failures from the earlier V2 cutover, now resolved: 1. @workflow/vitest unit tests (`packages/vitest/src/index.test.ts`) were still mocking the V1 BaseBuilder methods (createWorkflowsBundle + createStepsBundle) and expecting two world handlers (workflow + step). Updated the mocks to expose createCombinedBundle and asserted a single combined handler at __wkf_workflow_, which is what the V2 vitest builder actually emits. 2. E2E Local Dev Tests (nextjs-webpack) were asserting that step error stacks contain `99_e2e.ts` / `helpers.ts`, which held in pre-V2 webpack dev mode (which imported step sources directly). V2 inlines the step bundle into the combined flow route, and webpack's re-bundling collapses original step filenames out of the dev-mode source maps. Extended the existing nextjs-webpack carve-out in hasStepSourceMaps() to cover dev as well as prod, and updated the source-map follow-up callout in eager-processing.mdx to include nextjs-webpack in the matrix that needs source maps wired up for V2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts resolved: * `packages/core/e2e/e2e.test.ts` (PR #1879 vs. V2 hookDispose helper) Drop the locally-defined `waitForHook(expectedRunId)` shadow in the hookDisposeTestWorkflow test in favor of the new top-level `waitForHook(token, { runId })` API merged from main. Keep V2's event-driven `waitForHookDisposal()` helper, which is stricter than main's `await sleep(3_000)` (it polls `getHookByToken` for the HookNotFoundError signal rather than waiting a fixed interval). * `packages/world-local/src/storage/events-storage.ts` (PR #1877 vs. V2 per-step async mutex) Both branches were closing race conditions in the local world's step lifecycle. PR #1877 adds filesystem-level O_CREAT|O_EXCL locks for `step_created` and `wait_created`; the V2 branch already wraps step lifecycle events in an in-process `withStepLock` mutex. They compose: the mutex serializes within a single Node process; the filesystem lock additionally protects against cross-process races (multiple pnpm workers, redelivered queue messages). Resolution takes V2's mutex wrap as the base and patches in main's two `writeExclusive` claims at the start of the `step_created` and `wait_created` handlers, with comments noting the dual-layer guarantee. `packages/core/e2e/utils.ts` auto-merged: V2's nextjs-webpack + all-Vercel source-map carve-outs preserved alongside main's `tanstack-start` addition to `hasWorkflowSourceMaps`. Verified locally: `@workflow/core` and `@workflow/world-local` typecheck clean; all 343 world-local unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The standalone tarballs project's `vercel.json` set `"buildCommand": "pnpm --filter tarballs build"`, which only runs the tarballs package's `build` script (`node scripts/pack.ts`) and does NOT trigger turbo's `dependsOn: ["^build"]` chain. As a result, every workspace package was packed with an empty `dist/` directory — only `bin/`, `package.json`, `README.md`, and `docs/` made it into the tarballs. Downstream installs failed with `Cannot find module '<pkg>/dist/next.cjs'` (and similar for every other entry point). The previous docs-based pipeline didn't hit this because Vercel detected `docs` as a Next.js framework, ran turbo for it, and pack.ts was wired as a `prebuild` step that ran after deps were already built. Fix: switch the buildCommand to `pnpm turbo build --filter=tarballs` so turbo evaluates the build task for `tarballs`, which fans out via `^build` to compile every workspace dependency first, and then runs `tarballs#build` (pack.ts) against the populated `dist/` directories. Verified locally: a fresh build produces tarballs containing the full `dist/` tree (including `dist/next.cjs`, the file flight-booking-app was failing to resolve). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts resolved: * `packages/core/src/logger.ts` (PR #1849 vs. V2 webpack carve-out) PR #1849 ("Friendlier workflow errors") reintroduced the `debug` package as a static import and added a structured `Logger` interface with `child()` / `forRun()` and `composeLogLine`. The V2 branch had previously removed the `debug` static import because it pulls `debug/src/node` and a dynamic `require('tty')` into the Next.js webpack flow route, breaking V2 builds with `Dynamic require of "tty" is not supported`. Resolution combines both: keep main's `Logger` interface, `child()` / `forRun()`, and `composeLogLine` integration; replace the `debug` invocations with the V2 lightweight `matchesDebugNamespace` + `process.env.DEBUG` matcher so the flow route still bundles cleanly under webpack. Comment in the file documents why `debug` is intentionally absent. * `packages/core/src/runtime.ts` (PR #1849 vs. V2 inline replay loop) Multiple regions where main's structured-logging refactor and the V2 inline-execution loop made overlapping changes. Took V2's control flow as the base — its `if (!workflowRun) { ... }` setup block, main replay `while (true)` loop with timeout-based re-scheduling, terminal-event short-circuit, and per-iteration incremental event load are all preserved as-is — and folded in main's contributions: the scoped `runLogger = runtimeLogger.forRun( runId, workflowName)` is created once and used for run-level info / error logging. Discarded main's V1 wait-completion loop and `runWorkflow` user-code error block since V2 handles waits and user-code errors inside the inline loop body. Updated `buildWorkflowSuspensionMessage()` call to drop a stray `runId` first arg that didn't match the helper's 3-param signature. * `packages/core/e2e/e2e.test.ts` (PR #1849 unrelated to V2) No conflict from main's friendly-errors PR, auto-merged. (No hookDispose-style conflict this round.) Verified locally: all 820 `@workflow/core` unit tests pass; all 343 `@workflow/world-local` unit tests pass; `pnpm turbo typecheck` is green across all 40 typecheck tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


See changelog / architecture doc https://workflow-docs-git-peter-v2-flow.vercel.sh/docs/changelog/eager-processing
Likely solves all of these: