[core] Re-try runtime replays that exceed deadline up to three times #1740
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
🦋 Changeset detected
Latest commit: 0be6831
The changes in this PR will be included in the next version bump. This PR includes changesets to release 17 packages.
🧪 E2E Test Results
❌ Some tests failed
❌ Failed Tests
🌍 Community Worlds (74 failed)
- mongodb (7 failed)
- redis (7 failed)
- turso (60 failed)
Details by Category
✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
📊 Benchmark Results
Benchmark scenarios, each listed for 💻 Local Development and ▲ Production (Vercel):
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
- Stream Benchmarks (includes TTFB metrics): workflow with stream, stream pipeline with 5 transform steps (1MB), 10 parallel streams (1MB each), fan-out fan-in 10 streams (1MB each)
❌ Some benchmark jobs failed. Check the workflow run for details.
karthikscale3
left a comment
My only concern with this change: if a run goes through the retry loop, the queue retries after 300 seconds, so after 3 full retries o11y will show that it was stuck in the queue for ~15 minutes before either succeeding or failing. This will cause a bit of confusion. Also, should we introduce a run_retrying state to persist this?
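For reference, the ~15 minute figure follows from simple arithmetic. The sketch below assumes the 300-second queue retry delay and the three timeout retries mentioned in this thread; the constant `QUEUE_RETRY_DELAY_SECONDS` and the helper name are hypothetical, and the time spent inside each replay attempt is ignored.

```typescript
// Values taken from the discussion in this PR, not read from the codebase.
const QUEUE_RETRY_DELAY_SECONDS = 300; // hypothetical name for the queue redelivery delay
const REPLAY_TIMEOUT_MAX_RETRIES = 3;

// Worst-case time a run spends waiting for redelivery if every
// timeout retry is used (excludes the replay time of each attempt).
function worstCaseQueueDelaySeconds(retries: number, delaySeconds: number): number {
  return retries * delaySeconds;
}

const worstCaseSeconds = worstCaseQueueDelaySeconds(
  REPLAY_TIMEOUT_MAX_RETRIES,
  QUEUE_RETRY_DELAY_SECONDS,
);
const worstCaseMinutes = worstCaseSeconds / 60;

console.log(`${worstCaseSeconds}s = ${worstCaseMinutes}min`);
```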
@karthikscale3 This would only happen for runs that get stuck, so it shouldn't be a common occurrence, and it's better than failing because of a single delayed replay (if the delay is temporary and the second attempt would work).
Makes sense and sounds reasonable. I will let @TooTallNate and @pranaygp weigh in too.
TooTallNate
left a comment
Clean, well-targeted change. The logic is correct:
- On attempts 1–3 (metadata.attempt <= REPLAY_TIMEOUT_MAX_RETRIES): process.exit(1) without writing run_failed. The message isn't acked, so the queue retries with an incremented attempt counter.
- On attempt 4+: writes run_failed then exits. Same behavior as before this PR.
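As I read it, the two branches above amount to the following decision. This is a minimal sketch, not the code in this PR: `decideOnReplayTimeout` and the action labels are hypothetical, and the value 3 for `REPLAY_TIMEOUT_MAX_RETRIES` is taken from the PR title.

```typescript
// Hypothetical labels for the two outcomes described in the review.
type ReplayTimeoutAction = "exit-without-ack" | "write-run-failed-then-exit";

// Assumed value, based on "up to three times" in the PR title.
const REPLAY_TIMEOUT_MAX_RETRIES = 3;

// Decide what to do when a replay exceeds its deadline.
// `attempt` stands in for metadata.attempt (1-based delivery counter).
function decideOnReplayTimeout(
  attempt: number,
  maxRetries: number = REPLAY_TIMEOUT_MAX_RETRIES,
): ReplayTimeoutAction {
  // Within budget: exit non-zero and leave the message unacked so the
  // queue redelivers it with an incremented attempt counter.
  if (attempt <= maxRetries) return "exit-without-ack";
  // Budget exhausted: persist run_failed, then exit (pre-PR behavior).
  return "write-run-failed-then-exit";
}

for (const attempt of [1, 3, 4]) {
  console.log(attempt, decideOnReplayTimeout(attempt));
}
```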
This is strictly more lenient than the previous behavior (immediate failure on first timeout) — it can only help, not hurt. Transient slowness (e.g., cold start + large event log) that causes a one-off timeout now gets retried instead of permanently failing the run.
The timeout retry budget (3 attempts) is drawn from the same metadata.attempt counter as MAX_QUEUE_DELIVERIES (48), so messages that have already been retried for other reasons get fewer timeout grace attempts. This is reasonable — a message already at attempt 10 has had plenty of chances.
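To make the shared-counter point concrete, here is a rough sketch of how many timeouts could still be retried at a given delivery attempt. `remainingTimeoutRetries` is a hypothetical helper for illustration only, and it assumes the `attempt <= REPLAY_TIMEOUT_MAX_RETRIES` check with a limit of 3.

```typescript
// Assumed value, based on "up to three times" in the PR title.
const REPLAY_TIMEOUT_MAX_RETRIES = 3;

// Number of replay timeouts that would still be retried, counting a
// timeout on the current attempt. `attempt` stands in for the 1-based
// metadata.attempt counter that is shared with all other queue retries.
function remainingTimeoutRetries(
  attempt: number,
  maxRetries: number = REPLAY_TIMEOUT_MAX_RETRIES,
): number {
  return Math.max(0, maxRetries - attempt + 1);
}

for (const attempt of [1, 2, 3, 4, 10]) {
  console.log(`attempt ${attempt}: ${remainingTimeoutRetries(attempt)} timeout retries left`);
}
```

Under these assumptions, a message that reaches attempt 10 for unrelated reasons gets no timeout grace at all, which matches the trade-off described above.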
Changeset is properly scoped.
Backport to
To resolve manually:
git fetch origin stable
git checkout stable
git cherry-pick 0810b75872e96d8d8aa6e3dbf4236304d57526a7
# Fix conflicts, then:
git cherry-pick --continue
git push origin stable
Cherry-pick to
…1740) Signed-off-by: Peter Wielander <mittgfu@gmail.com>
No description provided.