fix(backend): harden alert-prone retries #1928

Merged
riderx merged 9 commits into main from codex/backend-alert-resilience on Apr 22, 2026

@riderx
Member

@riderx riderx commented Apr 21, 2026

Summary (AI generated)

  • retry transient Durable Object relocation failures in the files upload worker instead of surfacing them as 500s
  • skip stale cron_stat_app queue jobs when the target app is gone or no longer belongs to the queued org
  • retry transient PostgREST 5xx failures inside cron_stat_app before failing queue work
  • add focused unit coverage for the new retry helpers and the stale-job skip path

Motivation (AI generated)

PostHog and Discord alerts show repeated backend noise from two resilience gaps: transient Durable Object storage relocation during uploads, and transient or stale cron_stat_app queue work. These failures are operationally noisy and cause unnecessary retries even when the underlying request can succeed or should be skipped.

Business Impact (AI generated)

This reduces false-positive backend alerts, lowers avoidable queue churn, and makes uploads and billing-stat refreshes more reliable for production customers. Fewer noisy failures also improve observability by making real regressions easier to see.

Test Plan (AI generated)

  • bunx vitest run tests/backend-alert-resilience.unit.test.ts tests/files-r2-error.test.ts
  • bunx eslint supabase/functions/_backend/files/files.ts supabase/functions/_backend/triggers/cron_stat_app.ts tests/backend-alert-resilience.unit.test.ts tests/trigger-error-cases.test.ts
  • bun typecheck
  • Run Docker-backed integration trigger tests once the local Supabase stack is available

Generated with AI

Summary by CodeRabbit

  • New Features

    • Improved file upload reliability with automatic retrying for replayable forwarding failures and safer streaming semantics (HEAD vs streaming requests handled appropriately).
    • More resilient background sync: temporary database errors are retried consistently, and jobs whose target app is missing are skipped gracefully with a logged reason.
  • Tests

    • Added unit tests covering upload retry behavior, database-retry logic, and skipped-job handling for missing apps.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds retry wrappers for durable-object fetches used in TUS upload forwarding and for Supabase/PostgREST calls in cron_stat_app, switches app lookup to skip when missing, threads request context into retry helpers, exports test utilities, and adds/updates unit tests covering these behaviors.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Durable Object Fetch / Upload Forwarding | supabase/functions/_backend/files/files.ts | Added fetchUploadHandlerWithRetry and isRetryableDurableObjectFetchError. Upload forwarding now builds an explicit Request with duplex: 'half', omits body for HEAD, routes via the retry wrapper, and exports filesTestUtils. |
| Supabase/PostgREST Retry & App Validation | supabase/functions/_backend/triggers/cron_stat_app.ts | Added runSupabaseResultWithRetry and isRetryablePostgrestError. Replaced direct .throwOnError() usage with retry-wrapped calls, changed app lookup to maybeSingle() and return { status: 'skipped', reason: ... } when missing, threaded c context into helpers, and exported cronStatAppTestUtils. |
| Tests & Test Helpers | tests/backend-alert-resilience.unit.test.ts, tests/cron_stat_app_followup.unit.test.ts, tests/trigger-error-cases.test.ts | Added tests for durable-object fetch retry and PostgREST retry behavior; refactored Supabase test stubs to { data, error } shape and updated assertions/call-count checks; added test asserting cron_stat_app skips when app missing. |
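
As a rough illustration of the upload-forwarding change summarized above, the sketch below builds fetch-style options that always set duplex: 'half' and attach a body only when the method is not HEAD. The names ForwardInit and buildForwardInit are hypothetical and are not taken from files.ts; the actual code may differ.

```typescript
// Hypothetical stand-in for the fetch options used when forwarding an upload.
interface ForwardInit<TBody> {
  method: string
  headers: Record<string, string>
  body?: TBody
  duplex?: 'half'
}

// Assumption: the caller passes null when there is no body to forward.
function buildForwardInit<TBody>(
  method: string,
  headers: Record<string, string>,
  body: TBody | null,
): ForwardInit<TBody> {
  // duplex: 'half' is what fetch implementations expect for streamed request bodies.
  const init: ForwardInit<TBody> = { method, headers, duplex: 'half' }
  // A HEAD request must not carry a body, so only attach one for other methods.
  if (method.toUpperCase() !== 'HEAD' && body !== null) {
    init.body = body
  }
  return init
}
```

Options shaped like this could then be passed to a Request constructor. Because a streamed body can be consumed only once, a retry needs a freshly built request rather than a reused one.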

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant UploadHandler as "Upload Handler"
    participant DurableObject as "Durable Object"

    Client->>UploadHandler: Forward TUS request
    UploadHandler->>UploadHandler: Build Request (duplex: 'half'), attach body if method != HEAD
    UploadHandler->>DurableObject: fetch(request) attempt 1
    DurableObject-->>UploadHandler: Error (durableObjectReset / "moved to a different machine")
    UploadHandler->>UploadHandler: Wait (DO_FETCH_RETRY_DELAY_MS * attempt)
    UploadHandler->>DurableObject: fetch(request) attempt 2
    DurableObject-->>UploadHandler: Success (204)
    UploadHandler-->>Client: 204 No Content
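
The retry loop in the diagram above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names, the attempt count, the delay value, and the exact way a relocation error is recognized are all illustrative, not the code in files.ts. The delay grows linearly with the attempt number, matching the "DO_FETCH_RETRY_DELAY_MS * attempt" step.

```typescript
// Assumed values for illustration only; the PR does not state the real ones here.
const DO_FETCH_RETRY_ATTEMPTS = 3
const DO_FETCH_RETRY_DELAY_MS = 50

// Assumption: a relocation failure is recognizable by a durableObjectReset flag
// or by the "moved to a different machine" message shown in the diagram.
function isRetryableFetchError(error: unknown): boolean {
  if (!error || typeof error !== 'object') return false
  const candidate = error as { durableObjectReset?: unknown, message?: unknown }
  if (candidate.durableObjectReset === true) return true
  return typeof candidate.message === 'string'
    && candidate.message.includes('moved to a different machine')
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// `send` is called once per attempt so that each attempt can build a fresh request.
async function fetchWithRetry<T>(
  send: () => Promise<T>,
  attempts: number = DO_FETCH_RETRY_ATTEMPTS,
  delayMs: number = DO_FETCH_RETRY_DELAY_MS,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await send()
    }
    catch (error) {
      // Give up on the last attempt or on errors that are not known to be transient.
      if (attempt >= attempts || !isRetryableFetchError(error)) throw error
      await sleep(delayMs * attempt)
    }
  }
}
```

Errors that do not match the classifier are rethrown immediately, so only the relocation case is retried.
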
sequenceDiagram
    autonumber
    participant CronJob
    participant CronStatApp as "Cron Stat App Handler"
    participant SupabaseClient
    participant PostgREST

    CronJob->>CronStatApp: POST { appId, orgId }
    CronStatApp->>SupabaseClient: get app via maybeSingle()
    SupabaseClient->>PostgREST: SELECT app
    PostgREST-->>SupabaseClient: { data: null, error: null }
    SupabaseClient-->>CronStatApp: app not found
    CronStatApp-->>CronJob: 200 { status: "skipped", reason: "app_not_found" }

    rect rgba(200,100,100,0.5)
      Note over CronStatApp,PostgREST: Retry flow for transient PostgREST 5xx errors
    end

    CronStatApp->>SupabaseClient: operation attempt 1
    SupabaseClient->>PostgREST: RPC / upsert / select
    PostgREST-->>SupabaseClient: 502 error (retryable)
    SupabaseClient-->>CronStatApp: retryable error
    CronStatApp->>CronStatApp: Retry with backoff
    CronStatApp->>SupabaseClient: operation attempt 2
    PostgREST-->>SupabaseClient: Success
    CronStatApp-->>CronJob: 200 { status: "completed" }
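
The skip path at the top of this diagram could look roughly like the sketch below. It assumes a lookup result in the { data, error } shape, where a missing row arrives as data: null with no error, as maybeSingle() returns it. The owner_org column, the org_mismatch reason, and the classifyJob name are hypothetical; only app_not_found appears in the diagram.

```typescript
// Hypothetical row shape; the real apps table may use different column names.
interface AppRow {
  app_id: string
  owner_org: string
}

interface LookupResult {
  data: AppRow | null
  error: { message: string } | null
}

type JobOutcome =
  | { status: 'skipped', reason: 'app_not_found' | 'org_mismatch' }
  | { status: 'proceed', app: AppRow }

function classifyJob(job: { appId: string, orgId: string }, lookup: LookupResult): JobOutcome {
  // A real error is still a failure; only a clean "no row" result means the job is stale.
  if (lookup.error) throw new Error(lookup.error.message)
  if (lookup.data === null) return { status: 'skipped', reason: 'app_not_found' }
  // Assumption: a job queued for a different org than the current owner is also stale.
  if (lookup.data.owner_org !== job.orgId) return { status: 'skipped', reason: 'org_mismatch' }
  return { status: 'proceed', app: lookup.data }
}
```

Returning a skipped outcome instead of throwing lets the queue treat the stale job as finished rather than retrying it.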

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
I hopped through code at break of day,
Retry by retry I found my way.
Backoff beats and durable cheer,
Uploads land safe, the path is clear.
Hooray — carrots for every retry!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(backend): harden alert-prone retries' directly captures the main purpose of the PR: improving resilience and reducing alert noise from transient failures through retry logic. |
| Description check | ✅ Passed | The description includes a clear Summary section, detailed Motivation and Business Impact, plus a comprehensive Test Plan with specific commands and checkmarks, though manual testing steps and screenshots sections are not applicable to this backend change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@codspeed-hq
Contributor

codspeed-hq Bot commented Apr 21, 2026

Merging this PR will not alter performance

✅ 28 untouched benchmarks


Comparing codex/backend-alert-resilience (783dda6) with main (d4446de)

@riderx riderx marked this pull request as ready for review April 21, 2026 16:50
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/trigger-error-cases.test.ts (1)

46-59: Consider using it.concurrent() for parallel execution.

This test reads from the database but doesn't modify shared resources (it uses a nonexistent appId). Per coding guidelines, tests that don't modify shared resources can use it.concurrent() for faster CI/CD.

♻️ Optional: Use concurrent test execution
-  it('should skip stale jobs when the app no longer exists', async () => {
+  it.concurrent('should skip stale jobs when the app no longer exists', async () => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trigger-error-cases.test.ts` around lines 46 - 59, The test "should
skip stale jobs when the app no longer exists" is safe to run in parallel
because it only reads and uses a nonexistent appId; change its declaration from
it(...) to it.concurrent(...) so the test runner executes it concurrently for
faster CI; locate the test by the string "should skip stale jobs when the app no
longer exists" and replace the it call with it.concurrent while keeping the same
assertions and body.
supabase/functions/_backend/triggers/cron_stat_app.ts (1)

44-64: Retry detection logic is sound, but consider using RegExp.exec() per static analysis.

The getRetryablePostgrestStatus function correctly extracts HTTP status codes from both structured error objects and error message strings. However, SonarCloud flags the use of String.match() on line 51.

♻️ Optional: Use RegExp.exec() for consistency with static analysis rules
 function getRetryablePostgrestStatus(error: unknown): number | null {
   if (error && typeof error === 'object') {
     if ('status' in error && typeof (error as { status?: unknown }).status === 'number') {
       return (error as { status: number }).status
     }

     if ('message' in error && typeof (error as { message?: unknown }).message === 'string') {
-      const match = (error as { message: string }).message.match(/error code:\s*(\d{3})/i)
+      const match = /error code:\s*(\d{3})/i.exec((error as { message: string }).message)
       if (match) {
         return Number.parseInt(match[1], 10)
       }
     }
   }

   return null
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 44 - 64,
Replace the use of String.match in getRetryablePostgrestStatus with RegExp.exec
for the message parsing branch: create a RegExp (e.g., /error
code:\s*(\d{3})/i), call regex.exec((error as { message: string }).message) and,
if the result is non-null, parse result[1] to an integer and return it; keep the
existing structured-object status check and leave isRetryablePostgrestError
unchanged to rely on getRetryablePostgrestStatus.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc0a6635-2586-4114-9d3c-74bab2380d98

📥 Commits

Reviewing files that changed from the base of the PR and between a221796 and 2c9136f.

📒 Files selected for processing (5)
  • supabase/functions/_backend/files/files.ts
  • supabase/functions/_backend/triggers/cron_stat_app.ts
  • tests/backend-alert-resilience.unit.test.ts
  • tests/cron_stat_app_followup.unit.test.ts
  • tests/trigger-error-cases.test.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c9136fedd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread supabase/functions/_backend/triggers/cron_stat_app.ts Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
supabase/functions/_backend/triggers/cron_stat_app.ts (1)

183-187: Consider aligning retry policy with the 5xx-only pattern.

queueOrgPlanRefreshWithRetry retries on any failure (!result.ok), while runSupabaseResultWithRetry only retries on 5xx errors. This inconsistency may be intentional since this operation logs and continues rather than throwing, but aligning the policies would improve predictability.

♻️ Optional: Align retry to 5xx-only pattern
   }, {
     attempts: PLAN_REFRESH_RETRY_ATTEMPTS,
     baseDelayMs: PLAN_REFRESH_RETRY_DELAY_MS,
-    shouldRetry: result => !result.ok,
+    shouldRetry: (result) => {
+      if (result.ok) return false
+      return isRetryablePostgrestError(result.error)
+    },
   })
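
To make the difference between the two predicates concrete, here is a minimal sketch of a result-based retry helper with a pluggable shouldRetry. The helper name, the RefreshResult shape, and the doubling delay schedule are assumptions for illustration and are not the project's retryWithBackoff implementation.

```typescript
interface RetryOptions<T> {
  attempts: number
  baseDelayMs: number
  shouldRetry: (result: T) => boolean
}

// Retries based on the returned result rather than on thrown errors.
async function retryResultWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions<T>,
): Promise<T> {
  let result = await operation(1)
  for (let attempt = 2; attempt <= options.attempts && options.shouldRetry(result); attempt++) {
    // Assumed schedule: baseDelayMs, then 2x, then 4x, and so on.
    const delayMs = options.baseDelayMs * 2 ** (attempt - 2)
    await new Promise<void>(resolve => setTimeout(resolve, delayMs))
    result = await operation(attempt)
  }
  return result
}

// Hypothetical result shape for the plan-refresh call.
interface RefreshResult {
  ok: boolean
  status?: number
}

// Retry on any failure, as in the `!result.ok` predicate.
const retryOnAnyFailure = (result: RefreshResult) => !result.ok

// Retry only on 5xx statuses, as in the suggested alignment.
const retryOnServerError = (result: RefreshResult) =>
  !result.ok && typeof result.status === 'number' && result.status >= 500 && result.status <= 599
```

With retryOnServerError, a 4xx failure is returned after a single attempt, while retryOnAnyFailure would spend every attempt on a request that cannot succeed.
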
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 183 -
187, The current call to queueOrgPlanRefreshWithRetry (with
PLAN_REFRESH_RETRY_ATTEMPTS and PLAN_REFRESH_RETRY_DELAY_MS) retries on any
non-ok result (!result.ok), which is inconsistent with
runSupabaseResultWithRetry's 5xx-only retry policy; update the shouldRetry
predicate passed to queueOrgPlanRefreshWithRetry to mirror
runSupabaseResultWithRetry (i.e., only retry when the Supabase response status
is a 5xx server error) or reuse the same predicate function so both retry
policies behave consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c1d4eff-2186-4f4c-bcb5-2a65081123e1

📥 Commits

Reviewing files that changed from the base of the PR and between f3cf24e and 25d975e.

📒 Files selected for processing (3)
  • supabase/functions/_backend/triggers/cron_stat_app.ts
  • tests/backend-alert-resilience.unit.test.ts
  • tests/cron_stat_app_followup.unit.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/cron_stat_app_followup.unit.test.ts
  • tests/backend-alert-resilience.unit.test.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25d975e336

Comment thread supabase/functions/_backend/files/files.ts Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
supabase/functions/_backend/triggers/cron_stat_app.ts (1)

175-192: ⚠️ Potential issue | 🟠 Major

Add idempotency guard to queue_cron_stat_org_for_org before retrying it.

The current version of this SQL function (line 8949 in supabase/schemas/prod.sql) has no idempotency protection. If the first RPC call to queue_cron_stat_org_for_org succeeds in enqueueing a message but the client observes a transient 5xx error, the retry at lines 175–192 will enqueue the same org refresh twice. A prior migration (20251014105957) included a 1-hour rate-limit guard checking plan_calculated_at, but a later migration (20251019123107_fix_stats.sql) removed it. Restore or add an explicit dedupe mechanism—such as a database constraint or an updated guard in the function—to prevent duplicate queuing.
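
The duplicate-enqueue hazard described here can be illustrated with a small in-memory analogue. This is only a TypeScript sketch of the idea, not the SQL function: createDedupedQueue, the window parameter, and simulateRetriedEnqueue are hypothetical, and the real guard would live in the database.

```typescript
interface DedupedQueue {
  messages: string[]
  enqueue: (orgId: string, nowMs: number) => boolean
}

// A queue that ignores repeat enqueues for the same org inside a time window.
// A window of 0 disables the guard, which models a non-idempotent enqueue.
function createDedupedQueue(dedupeWindowMs: number): DedupedQueue {
  const messages: string[] = []
  const lastEnqueuedAt = new Map<string, number>()
  return {
    messages,
    enqueue(orgId: string, nowMs: number): boolean {
      const last = lastEnqueuedAt.get(orgId)
      if (last !== undefined && nowMs - last < dedupeWindowMs) return false
      lastEnqueuedAt.set(orgId, nowMs)
      messages.push(orgId)
      return true
    },
  }
}

// Models the failure mode: the first enqueue succeeds server-side, the client
// sees a transient error instead of the response, and retries 10 ms later.
function simulateRetriedEnqueue(queue: DedupedQueue, orgId: string, nowMs: number): number {
  queue.enqueue(orgId, nowMs) // response lost in transit
  queue.enqueue(orgId, nowMs + 10) // client-side retry
  return queue.messages.length
}
```

Without a window the retry leaves two messages for the same org; with a one-hour window the second call becomes a no-op, which is the property the retry path relies on.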

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 175 -
192, The retry path is enqueuing duplicate work because the RPC
queue_cron_stat_org_for_org (called via queueOrgPlanRefresh) lacks idempotency;
update the system by adding an explicit dedupe guard in that SQL function (or a
DB constraint on the queue table) so repeated calls within a short window are
no-ops—for example, restore the prior plan_calculated_at 1-hour check inside
queue_cron_stat_org_for_org or add a UNIQUE/indexed constraint that prevents a
second enqueue for the same org+job type until a TTL expires; then keep the
existing retry logic (retryWithBackoff calling queueOrgPlanRefresh) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c5e2237-0715-4674-853d-23678071e787

📥 Commits

Reviewing files that changed from the base of the PR and between 25d975e and 52c0a94.

📒 Files selected for processing (2)
  • supabase/functions/_backend/triggers/cron_stat_app.ts
  • tests/cron_stat_app_followup.unit.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/cron_stat_app_followup.unit.test.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56e22a0a3f

Comment thread supabase/functions/_backend/triggers/cron_stat_app.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 223003b611

Comment thread supabase/functions/_backend/triggers/cron_stat_app.ts
@sonarqubecloud

@riderx riderx merged commit 0ba9808 into main Apr 22, 2026
15 checks passed
@riderx riderx deleted the codex/backend-alert-resilience branch April 22, 2026 13:28