fix(backend): harden alert-prone retries#1928
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds retry wrappers for durable-object fetches used in TUS upload forwarding and for Supabase/PostgREST calls in cron_stat_app, switches app lookup to skip when missing, threads request context into retry helpers, exports test utilities, and adds/updates unit tests covering these behaviors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant UploadHandler as "Upload Handler"
participant DurableObject as "Durable Object"
Client->>UploadHandler: Forward TUS request
UploadHandler->>UploadHandler: Build Request (duplex: 'half'), attach body if method != HEAD
UploadHandler->>DurableObject: fetch(request) attempt 1
DurableObject-->>UploadHandler: Error (durableObjectReset / "moved to a different machine")
UploadHandler->>UploadHandler: Wait (DO_FETCH_RETRY_DELAY_MS * attempt)
UploadHandler->>DurableObject: fetch(request) attempt 2
DurableObject-->>UploadHandler: Success (204)
UploadHandler-->>Client: 204 No Content
sequenceDiagram
autonumber
participant CronJob
participant CronStatApp as "Cron Stat App Handler"
participant SupabaseClient
participant PostgREST
CronJob->>CronStatApp: POST { appId, orgId }
CronStatApp->>SupabaseClient: get app via maybeSingle()
SupabaseClient->>PostgREST: SELECT app
PostgREST-->>SupabaseClient: { data: null, error: null }
SupabaseClient-->>CronStatApp: app not found
CronStatApp-->>CronJob: 200 { status: "skipped", reason: "app_not_found" }
rect rgba(200,100,100,0.5)
Note over CronStatApp,PostgREST: Retry flow for transient PostgREST 5xx errors
end
CronStatApp->>SupabaseClient: operation attempt 1
SupabaseClient->>PostgREST: RPC / upsert / select
PostgREST-->>SupabaseClient: 502 error (retryable)
SupabaseClient-->>CronStatApp: retryable error
CronStatApp->>CronStatApp: Retry with backoff
CronStatApp->>SupabaseClient: operation attempt 2
PostgREST-->>SupabaseClient: Success
CronStatApp-->>CronJob: 200 { status: "completed" }
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/trigger-error-cases.test.ts (1)
46-59: Consider usingit.concurrent()for parallel execution.This test reads from the database but doesn't modify shared resources (it uses a nonexistent
appId). Per coding guidelines, tests that don't modify shared resources can useit.concurrent()for faster CI/CD.♻️ Optional: Use concurrent test execution
- it('should skip stale jobs when the app no longer exists', async () => { + it.concurrent('should skip stale jobs when the app no longer exists', async () => {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/trigger-error-cases.test.ts` around lines 46 - 59, The test "should skip stale jobs when the app no longer exists" is safe to run in parallel because it only reads and uses a nonexistent appId; change its declaration from it(...) to it.concurrent(...) so the test runner executes it concurrently for faster CI; locate the test by the string "should skip stale jobs when the app no longer exists" and replace the it call with it.concurrent while keeping the same assertions and body.supabase/functions/_backend/triggers/cron_stat_app.ts (1)
44-64: Retry detection logic is sound, but consider usingRegExp.exec()per static analysis.The
getRetryablePostgrestStatusfunction correctly extracts HTTP status codes from both structured error objects and error message strings. However, SonarCloud flags the use ofString.match()on line 51.♻️ Optional: Use RegExp.exec() for consistency with static analysis rules
function getRetryablePostgrestStatus(error: unknown): number | null { if (error && typeof error === 'object') { if ('status' in error && typeof (error as { status?: unknown }).status === 'number') { return (error as { status: number }).status } if ('message' in error && typeof (error as { message?: unknown }).message === 'string') { - const match = (error as { message: string }).message.match(/error code:\s*(\d{3})/i) + const match = /error code:\s*(\d{3})/i.exec((error as { message: string }).message) if (match) { return Number.parseInt(match[1], 10) } } } return null }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 44 - 64, Replace the use of String.match in getRetryablePostgrestStatus with RegExp.exec for the message parsing branch: create a RegExp (e.g., /error code:\s*(\d{3})/i), call regex.exec((error as { message: string }).message) and, if the result is non-null, parse result[1] to an integer and return it; keep the existing structured-object status check and leave isRetryablePostgrestError unchanged to rely on getRetryablePostgrestStatus.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@supabase/functions/_backend/triggers/cron_stat_app.ts`:
- Around line 44-64: Replace the use of String.match in
getRetryablePostgrestStatus with RegExp.exec for the message parsing branch:
create a RegExp (e.g., /error code:\s*(\d{3})/i), call regex.exec((error as {
message: string }).message) and, if the result is non-null, parse result[1] to
an integer and return it; keep the existing structured-object status check and
leave isRetryablePostgrestError unchanged to rely on
getRetryablePostgrestStatus.
In `@tests/trigger-error-cases.test.ts`:
- Around line 46-59: The test "should skip stale jobs when the app no longer
exists" is safe to run in parallel because it only reads and uses a nonexistent
appId; change its declaration from it(...) to it.concurrent(...) so the test
runner executes it concurrently for faster CI; locate the test by the string
"should skip stale jobs when the app no longer exists" and replace the it call
with it.concurrent while keeping the same assertions and body.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc0a6635-2586-4114-9d3c-74bab2380d98
📒 Files selected for processing (5)
supabase/functions/_backend/files/files.tssupabase/functions/_backend/triggers/cron_stat_app.tstests/backend-alert-resilience.unit.test.tstests/cron_stat_app_followup.unit.test.tstests/trigger-error-cases.test.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c9136fedd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
🧹 Nitpick comments (1)
supabase/functions/_backend/triggers/cron_stat_app.ts (1)
183-187: Consider aligning retry policy with the 5xx-only pattern.
queueOrgPlanRefreshWithRetryretries on any failure (!result.ok), whilerunSupabaseResultWithRetryonly retries on 5xx errors. This inconsistency may be intentional since this operation logs and continues rather than throwing, but aligning the policies would improve predictability.♻️ Optional: Align retry to 5xx-only pattern
}, { attempts: PLAN_REFRESH_RETRY_ATTEMPTS, baseDelayMs: PLAN_REFRESH_RETRY_DELAY_MS, - shouldRetry: result => !result.ok, + shouldRetry: (result) => { + if (result.ok) return false + return isRetryablePostgrestError(result.error) + }, })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 183 - 187, The current call to queueOrgPlanRefreshWithRetry (with PLAN_REFRESH_RETRY_ATTEMPTS and PLAN_REFRESH_RETRY_DELAY_MS) retries on any non-ok result (!result.ok), which is inconsistent with runSupabaseResultWithRetry's 5xx-only retry policy; update the shouldRetry predicate passed to queueOrgPlanRefreshWithRetry to mirror runSupabaseResultWithRetry (i.e., only retry when the Supabase response status is a 5xx server error) or reuse the same predicate function so both retry policies behave consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@supabase/functions/_backend/triggers/cron_stat_app.ts`:
- Around line 183-187: The current call to queueOrgPlanRefreshWithRetry (with
PLAN_REFRESH_RETRY_ATTEMPTS and PLAN_REFRESH_RETRY_DELAY_MS) retries on any
non-ok result (!result.ok), which is inconsistent with
runSupabaseResultWithRetry's 5xx-only retry policy; update the shouldRetry
predicate passed to queueOrgPlanRefreshWithRetry to mirror
runSupabaseResultWithRetry (i.e., only retry when the Supabase response status
is a 5xx server error) or reuse the same predicate function so both retry
policies behave consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6c1d4eff-2186-4f4c-bcb5-2a65081123e1
📒 Files selected for processing (3)
supabase/functions/_backend/triggers/cron_stat_app.tstests/backend-alert-resilience.unit.test.tstests/cron_stat_app_followup.unit.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/cron_stat_app_followup.unit.test.ts
- tests/backend-alert-resilience.unit.test.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25d975e336
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
supabase/functions/_backend/triggers/cron_stat_app.ts (1)
175-192:⚠️ Potential issue | 🟠 MajorAdd idempotency guard to
queue_cron_stat_org_for_orgbefore retrying it.The current version of this SQL function (line 8949 in
supabase/schemas/prod.sql) has no idempotency protection. If the first RPC call toqueue_cron_stat_org_for_orgsucceeds in enqueueing a message but the client observes a transient 5xx error, the retry at lines 175–192 will enqueue the same org refresh twice. A prior migration (20251014105957) included a 1-hour rate-limit guard checkingplan_calculated_at, but a later migration (20251019123107_fix_stats.sql) removed it. Restore or add an explicit dedupe mechanism—such as a database constraint or an updated guard in the function—to prevent duplicate queuing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/triggers/cron_stat_app.ts` around lines 175 - 192, The retry path is enqueuing duplicate work because the RPC queue_cron_stat_org_for_org (called via queueOrgPlanRefresh) lacks idempotency; update the system by adding an explicit dedupe guard in that SQL function (or a DB constraint on the queue table) so repeated calls within a short window are no-ops—for example, restore the prior plan_calculated_at 1-hour check inside queue_cron_stat_org_for_org or add a UNIQUE/indexed constraint that prevents a second enqueue for the same org+job type until a TTL expires; then keep the existing retry logic (retryWithBackoff calling queueOrgPlanRefresh) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@supabase/functions/_backend/triggers/cron_stat_app.ts`:
- Around line 175-192: The retry path is enqueuing duplicate work because the
RPC queue_cron_stat_org_for_org (called via queueOrgPlanRefresh) lacks
idempotency; update the system by adding an explicit dedupe guard in that SQL
function (or a DB constraint on the queue table) so repeated calls within a
short window are no-ops—for example, restore the prior plan_calculated_at 1-hour
check inside queue_cron_stat_org_for_org or add a UNIQUE/indexed constraint that
prevents a second enqueue for the same org+job type until a TTL expires; then
keep the existing retry logic (retryWithBackoff calling queueOrgPlanRefresh)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c5e2237-0715-4674-853d-23678071e787
📒 Files selected for processing (2)
supabase/functions/_backend/triggers/cron_stat_app.tstests/cron_stat_app_followup.unit.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/cron_stat_app_followup.unit.test.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56e22a0a3f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 223003b611
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|



Summary (AI generated)
cron_stat_appqueue jobs when the target app is gone or no longer belongs to the queued orgcron_stat_appbefore failing queue workMotivation (AI generated)
PostHog and Discord alerts show repeated backend noise from two resilience gaps: transient Durable Object storage relocation during uploads, and transient or stale
cron_stat_appqueue work. These failures are operationally noisy and cause unnecessary retries even when the underlying request can succeed or should be skipped.Business Impact (AI generated)
This reduces false-positive backend alerts, lowers avoidable queue churn, and makes uploads and billing-stat refreshes more reliable for production customers. Fewer noisy failures also improves observability by leaving real regressions easier to see.
Test Plan (AI generated)
bunx vitest run tests/backend-alert-resilience.unit.test.ts tests/files-r2-error.test.tsbunx eslint supabase/functions/_backend/files/files.ts supabase/functions/_backend/triggers/cron_stat_app.ts tests/backend-alert-resilience.unit.test.ts tests/trigger-error-cases.test.tsbun typecheckGenerated with AI
Summary by CodeRabbit
New Features
Tests