feat(engine): durable webhook outbox for the Node adapter#175
Conversation
The Node self-host adapter delivered webhooks fire-and-forget: an in-process send with 3 inline retries, and any failure or restart lost the event. The hosted Cloudflare path gets real durability from CF Queues + DLQ; self-hosters got message loss. The pending_events table existed in the schema but nothing consumed it. The Node adapter's event queue now uses pending_events as a consumed outbox: - send persists the row first (durable once send resolves), then kicks an immediate poll so delivery stays prompt. - A background poller (configurable interval) claims due rows with a single UPDATE ... WHERE id IN (subquery) RETURNING statement — atomic claim with attempts++ and a lease on process_after, per the no-interactive-transactions doctrine in ports/database.ts. A worker that crashes mid-delivery leaves the row reclaimable after the lease. - Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx vs retryable classification): success deletes the row, terminal failures settle it as failed, retryable failures reschedule with capped exponential backoff until max_attempts is exhausted. - Startup resumes leftover due rows, so deliveries survive restarts. - cleanupOldEvents (24h) is wired into the poll cadence so settled rows are pruned. The EventQueue port contract and the Cloudflare path are unchanged; InProcessEventQueue is renamed to DurableEventQueue (engine-internal, no external importers). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI. |
|
Warning Review limit reached
More reviews will be available in 29 minutes and 49 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9284e66b72
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI. |
|
Your free trial PR review limit of 300 PRs has been reached. Please upgrade your plan to continue using CodeAnt AI. |
…178) * feat(engine): durable webhook outbox for the Node adapter The Node self-host adapter delivered webhooks fire-and-forget: an in-process send with 3 inline retries, and any failure or restart lost the event. The hosted Cloudflare path gets real durability from CF Queues + DLQ; self-hosters got message loss. The pending_events table existed in the schema but nothing consumed it. The Node adapter's event queue now uses pending_events as a consumed outbox: - send persists the row first (durable once send resolves), then kicks an immediate poll so delivery stays prompt. - A background poller (configurable interval) claims due rows with a single UPDATE ... WHERE id IN (subquery) RETURNING statement — atomic claim with attempts++ and a lease on process_after, per the no-interactive-transactions doctrine in ports/database.ts. A worker that crashes mid-delivery leaves the row reclaimable after the lease. - Delivery reuses deliverEvent unchanged (HMAC signing, terminal-4xx vs retryable classification): success deletes the row, terminal failures settle it as failed, retryable failures reschedule with capped exponential backoff until max_attempts is exhausted. - Startup resumes leftover due rows, so deliveries survive restarts. - cleanupOldEvents (24h) is wired into the poll cadence so settled rows are pruned. The EventQueue port contract and the Cloudflare path are unchanged; InProcessEventQueue is renamed to DurableEventQueue (engine-internal, no external importers). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 * chore: apply pr-reviewer fixes for #175 * feat(engine): persist-first webhook outbox for queue-backed adapters Move the pending_events outbox insert from the Node adapter into the engine send path so every adapter gets the same durability guarantee: routes insert the row synchronously in the request path (single cheap INSERT via routes/webhookOutbox.ts), then hand the row id to eventQueue.send in the background. If the queue send is lost (Workers isolate dies after the response, queue outage), the row stays pending and is re-enqueued by the sweep instead of vanishing. - QueuedEvent gains an optional outboxId; adapters that receive it must not insert a second row and the consumer settles the row after delivery. Absent outboxId keeps the legacy contract. - DurableEventQueue.send skips the insert when outboxId is present (no double-insert / double-delivery on the Node path). - New sweepPendingEvents(db, opts) claims due rows via the same atomic claimDueEvents the Node poller uses and returns them with complete/fail/reschedule settle callbacks so a scheduled handler can re-enqueue to an external queue without delivering directly. - Export the outbox primitives (enqueueEvent, claimDueEvents, completeEvent, failEvent, rescheduleEvent, sweepPendingEvents, cleanupOldEvents) from the engine package root for queue consumers. Tests: route-level persist-first ordering incl. sync/async queue-send failures leaving the row sweepable; Node adapter no-double-insert; sweep exactly-once claims under concurrent sweepers, lease expiry, and settle callbacks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #178 * chore: apply pr-reviewer fixes for #178 * chore: apply pr-reviewer fixes for #178 --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
Problem
Self-hosters had a webhook durability gap. The Node adapter's
InProcessEventQueuewas fire-and-forget:sendreturned immediately, delivery ran in-process with 3 inline retries, and its own header comment admitted "failures are logged, not retried across restarts". The hosted Cloudflare deployment gets real durability from CF Queues + DLQ; a self-host process crash or a subscriber outage longer than ~13 seconds meant silent event loss.Meanwhile the schema already had a
pending_eventsoutbox table (attempts,max_attempts,process_after,payload) andengine/eventQueue.tshadenqueueEvent/cleanupOldEvents— with zero consumers.Change
The Node adapter's queue (
DurableEventQueue, formerlyInProcessEventQueue) now usespending_eventsas a consumed outbox:sendawaits the outbox insert, so the event is durable the momentsendresolves, then kicks an immediate poll so latency stays low.status='pending',process_after <= now,attempts < max_attempts) with a singleUPDATE ... WHERE id IN (subquery) RETURNINGthat incrementsattemptsand leasesprocess_after— same single-statement style as the WHERE-guarded transitions inengine/delivery.ts, per the no-interactive-transactions doctrine inports/database.ts. A crash mid-delivery leaves the row reclaimable once the lease expires.deliverEventunchanged (HMAC signing, terminal-4xx-except-408/429 classification): success deletes the row; terminal failure settles it asfailed; retryable failure reschedules with capped exponential backoff (default 30s base, 15m cap, 5 attempts).createNodeRuntimestarts the poller and immediately drains leftover due rows.cleanupOldEvents(24h) runs on the poll cadence (default hourly).The
EventQueueport contract and the Cloudflare path are untouched. New engine helpers:claimDueEvents/completeEvent/failEvent/rescheduleEvent.NodeRuntimeOptions.eventQueueexposes poller tuning;NodeRuntime.webhookQueueis exposed for shutdown/tests, andclose()stops the poller.Tests
New suite
src/adapters/node/__tests__/event-queue.test.ts(better-sqlite3, stubbedfetch):sendpersists the row before delivery completesattempts++and backoff; not reclaimed before duefailedwith exactly one fetch, never reclaimedmax_attemptssettles asfailedand surfaces to telemetrycreateNodeRuntimeover the same DB file delivers the leftover rownpx turbo test: 18/18 tasks green, engine 62/62 tests passing.tsc --noEmitandeslintclean.Notes
InProcessEventQueue→DurableEventQueuerename: the old name described the fire-and-forget behavior; no importers outside the adapter's own index existed.attemptsincrements before delivery, so a crash mid-flight consumes one attempt — standard outbox tradeoff that avoids infinite redelivery loops.🤖 Generated with Claude Code