Improve backend resilience for file and trigger workflows#1564
Conversation
📝 WalkthroughWalkthroughThese changes enhance error handling and resilience across Supabase backend functions by introducing retry mechanisms with exponential backoff for cloud operations, wrapping critical calls in try/catch blocks with improved logging via cloudlogErr, and converting some error scenarios to graceful responses instead of exceptions. Changes
Possibly related PRs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18b78ef969
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| await supabase | ||
| .from('stripe_info') | ||
| .update({ plan_calculated_at: new Date().toISOString() }) | ||
| .eq('customer_id', body.customerId) |
There was a problem hiding this comment.
Gate plan_calculated_at update on successful plan calc
The new inner try/catch swallows failures from checkPlanStatusOnly, but the timestamp update still runs unconditionally. Because queue_cron_plan_for_org checks plan_calculated_at to skip re-queueing for an hour (see supabase/migrations/20251007134349_cron_plan_from_stats_backend.sql), a transient failure now marks the org as “calculated” and suppresses retries, leaving plan status/usage stale until the next window. Consider updating plan_calculated_at only when the plan calculation actually succeeds.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR strengthens backend resilience around plan checks, file downloads, and trigger workflows, with more robust error handling and retry logic.
Changes:
- Wrap plan status calculation and cron-trigger logic with targeted try/catch blocks and structured logging to avoid aborting flows on transient failures.
- Add retry/backoff helpers for manifest size retrieval and channel updates, and make some trigger error paths return success rather than failing the trigger.
- Adjust the global
onErrorhandler to skip reading the body for safe HTTP methods and improve file download error handling by mapping R2/storage failures to a structured 503 response.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
supabase/functions/_backend/utils/plans.ts |
Wraps calculatePlanStatus in a try/catch within checkPlanStatusOnly, logging failures and returning early to avoid failing the entire cron when plan calculation errors occur. |
supabase/functions/_backend/utils/on_error.ts |
Changes the error handler to only attempt JSON body parsing for non-GET/HEAD/OPTIONS methods, reducing spurious “failToReadBody” errors and preserving existing structured logging and alerting. |
supabase/functions/_backend/triggers/on_manifest_create.ts |
Introduces getManifestSizeWithRetry for repeated s3.getSize attempts and alters error handling so some invalid/zero-size cases now log and return BRES instead of throwing, changing the observable error contract. |
supabase/functions/_backend/triggers/on_channel_update.ts |
Adds updateChannelsWithRetry with retry/backoff for Supabase updates and changes missing id/app_id cases from throwing simpleError to returning BRES, altering previously tested error responses. |
supabase/functions/_backend/triggers/cron_stat_org.ts |
Wraps checkPlanStatusOnly and the plan_calculated_at update in inner try/catch blocks to log and continue on failure while still ensuring the DB client is closed in the finally block. |
supabase/functions/_backend/files/files.ts |
Hardens the R2-backed GET handler by wrapping head/get calls in try/catch, logging failures, and mapping get exceptions to a 503 upstream_unavailable error while preserving 404 behavior for missing objects and existing caching logic. |
| if (!record.id) { | ||
| cloudlog({ requestId: c.get('requestId'), message: 'No id' }) | ||
| throw simpleError('no_id', 'No id', { record }) | ||
| return c.json(BRES) | ||
| } | ||
| if (!record.app_id) { | ||
| throw simpleError('no_app_id', 'No app id included the request', { record }) | ||
| cloudlog({ requestId: c.get('requestId'), message: 'No app id included the request' }) | ||
| return c.json(BRES) |
There was a problem hiding this comment.
This change turns invalid payload cases (missing id or app_id) from an error (simpleError('no_id'/'no_app_id')) into a silent success response (BRES), which breaks the existing contract and the tests in tests/trigger-error-cases.test.ts that assert a 400 response with error: 'no_app_id' for this scenario. If the goal is to make the trigger more resilient, consider keeping a 4xx error for structurally invalid records (or updating the tests and any callers explicitly), rather than returning a generic success that hides configuration issues and makes debugging harder.
| if (!record.app_version_id || !record.s3_path) { | ||
| cloudlog({ requestId: c.get('requestId'), message: 'no app_version_id or s3_path' }) | ||
| throw simpleError('no_app_version_id_or_s3_path', 'No app_version_id or s3_path', { record }) | ||
| return c.json(BRES) |
There was a problem hiding this comment.
Previously, when app_version_id or s3_path was missing, this trigger threw simpleError('no_app_version_id_or_s3_path', ...), and the error-case test in tests/trigger-error-cases.test.ts explicitly asserts a 400 response with error: 'no_app_version_id_or_s3_path'. Changing this to return c.json(BRES) silently treats an invalid manifest row as success, which both breaks those tests and makes it harder to detect schema/configuration issues; consider preserving the explicit error (or updating tests and any consumers if this contract change is intentional).
| return c.json(BRES) | |
| throw simpleError('no_app_version_id_or_s3_path', 'no app_version_id or s3_path', { record }) |
| let object: R2ObjectBody | null = null | ||
| try { | ||
| object = await new RetryBucket(bucket, DEFAULT_RETRY_PARAMS).get(fileId, { | ||
| range: c.req.raw.headers, | ||
| }) | ||
| } | ||
| catch (error) { | ||
| cloudlogErr({ requestId: c.get('requestId'), message: 'getHandler files get failed', fileId, error }) | ||
| throw quickError(503, 'upstream_unavailable', 'File storage temporarily unavailable', { fileId }) |
There was a problem hiding this comment.
The new catch block around RetryBucket(bucket, DEFAULT_RETRY_PARAMS).get(...) introduces a specific 503 upstream_unavailable error path for R2/storage failures, but there is no test exercising this behavior even though we already have /files error-case tests (see tests/error-cases.test.ts). To prevent regressions and to lock in the intended contract for transient storage outages (status code and error payload), consider adding a test that forces R2Bucket.get to throw and asserts the 503 response with the expected error code.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@supabase/functions/_backend/triggers/cron_stat_org.ts`:
- Around line 34-46: The stripe_info plan_calculated_at update currently runs
unconditionally when body.customerId exists even if checkPlanStatusOnly()
failed; add a boolean flag (e.g., planStatusSucceeded) that is set true only
when checkPlanStatusOnly() completes without throwing and use that flag to gate
the update so it only runs on success; also replace the Supabase call at the
update block (currently using supabaseAdmin(c).from('stripe_info').update(...))
with the existing drizzleClient (initialized earlier) and perform the equivalent
Drizzle/PG update against the stripe_info table to set plan_calculated_at,
preserving the same logging (cloudlog) on success and error.
In `@supabase/functions/_backend/triggers/on_channel_update.ts`:
- Around line 1-29: The retry loop in updateChannelsWithRetry currently only
handles resolved error results from operation() and will stop retrying if
operation() throws; modify updateChannelsWithRetry so the call to operation() is
wrapped in try/catch: invoke operation() inside the loop, catch any thrown error
and treat it like a failure (assign to lastError), continue to the next retry
after the existing delay logic, and only return early when operation() resolves
with no error; keep the existing cloudlogErr call (including c.get('requestId')
and context) to log lastError when all attempts fail.
| try { | ||
| const supabase = supabaseAdmin(c) | ||
| await supabase | ||
| .from('stripe_info') | ||
| .update({ plan_calculated_at: new Date().toISOString() }) | ||
| .eq('customer_id', body.customerId) | ||
| .throwOnError() | ||
|
|
||
| cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp updated', customerId: body.customerId }) | ||
| cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp updated', customerId: body.customerId }) | ||
| } | ||
| catch (error) { | ||
| cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp update failed', customerId: body.customerId, error }) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n supabase/functions/_backend/triggers/cron_stat_org.ts | head -60Repository: Cap-go/capgo
Length of output: 2271
Gate the timestamp update on plan check success and use Drizzle/PG client for database operations.
The timestamp update runs regardless of whether checkPlanStatusOnly() succeeds or fails—the error is caught at line 28–30 but no flag is set, so the stripe_info update at line 34–46 will execute when body.customerId is present. Add a planStatusSucceeded flag to gate the timestamp update.
Additionally, use drizzleClient (already initialized at line 22–23) instead of supabaseAdmin() at line 35 to align with the migration to Drizzle/PG clients for all database operations in backend code.
🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/cron_stat_org.ts` around lines 34 - 46,
The stripe_info plan_calculated_at update currently runs unconditionally when
body.customerId exists even if checkPlanStatusOnly() failed; add a boolean flag
(e.g., planStatusSucceeded) that is set true only when checkPlanStatusOnly()
completes without throwing and use that flag to gate the update so it only runs
on success; also replace the Supabase call at the update block (currently using
supabaseAdmin(c).from('stripe_info').update(...)) with the existing
drizzleClient (initialized earlier) and perform the equivalent Drizzle/PG update
against the stripe_info table to set plan_calculated_at, preserving the same
logging (cloudlog) on success and error.
| import type { Context } from 'hono' | ||
| import type { MiddlewareKeyVariables } from '../utils/hono.ts' | ||
| import type { Database } from '../utils/supabase.types.ts' | ||
| import { Hono } from 'hono/tiny' | ||
| import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts' | ||
| import { cloudlog } from '../utils/logging.ts' | ||
| import { cloudlog, cloudlogErr } from '../utils/logging.ts' | ||
| import { supabaseAdmin } from '../utils/supabase.ts' | ||
|
|
||
| export const app = new Hono<MiddlewareKeyVariables>() | ||
|
|
||
| const UPDATE_RETRY_ATTEMPTS = 3 | ||
| const UPDATE_RETRY_DELAY_MS = 300 | ||
|
|
||
| async function updateChannelsWithRetry( | ||
| c: Context<MiddlewareKeyVariables>, | ||
| operation: () => Promise<{ error: any }>, | ||
| context: Record<string, unknown>, | ||
| ) { | ||
| let lastError: any | ||
| for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) { | ||
| const { error } = await operation() | ||
| if (!error) | ||
| return | ||
| lastError = error | ||
| if (attempt < UPDATE_RETRY_ATTEMPTS - 1) | ||
| await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1))) | ||
| } | ||
| cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context }) | ||
| } |
There was a problem hiding this comment.
Ensure retries cover thrown errors too.
If operation() rejects/throws, the loop exits and retries are skipped. Wrap it so failures still retry.
♻️ Suggested fix
async function updateChannelsWithRetry(
c: Context<MiddlewareKeyVariables>,
operation: () => Promise<{ error: any }>,
context: Record<string, unknown>,
) {
let lastError: any
for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) {
- const { error } = await operation()
- if (!error)
- return
- lastError = error
+ try {
+ const { error } = await operation()
+ if (!error)
+ return
+ lastError = error
+ }
+ catch (error) {
+ lastError = error
+ }
if (attempt < UPDATE_RETRY_ATTEMPTS - 1)
await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1)))
}
cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context })
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import type { Context } from 'hono' | |
| import type { MiddlewareKeyVariables } from '../utils/hono.ts' | |
| import type { Database } from '../utils/supabase.types.ts' | |
| import { Hono } from 'hono/tiny' | |
| import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts' | |
| import { cloudlog } from '../utils/logging.ts' | |
| import { cloudlog, cloudlogErr } from '../utils/logging.ts' | |
| import { supabaseAdmin } from '../utils/supabase.ts' | |
| export const app = new Hono<MiddlewareKeyVariables>() | |
| const UPDATE_RETRY_ATTEMPTS = 3 | |
| const UPDATE_RETRY_DELAY_MS = 300 | |
| async function updateChannelsWithRetry( | |
| c: Context<MiddlewareKeyVariables>, | |
| operation: () => Promise<{ error: any }>, | |
| context: Record<string, unknown>, | |
| ) { | |
| let lastError: any | |
| for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) { | |
| const { error } = await operation() | |
| if (!error) | |
| return | |
| lastError = error | |
| if (attempt < UPDATE_RETRY_ATTEMPTS - 1) | |
| await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1))) | |
| } | |
| cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context }) | |
| } | |
| import type { Context } from 'hono' | |
| import type { MiddlewareKeyVariables } from '../utils/hono.ts' | |
| import type { Database } from '../utils/supabase.types.ts' | |
| import { Hono } from 'hono/tiny' | |
| import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts' | |
| import { cloudlog, cloudlogErr } from '../utils/logging.ts' | |
| import { supabaseAdmin } from '../utils/supabase.ts' | |
| export const app = new Hono<MiddlewareKeyVariables>() | |
| const UPDATE_RETRY_ATTEMPTS = 3 | |
| const UPDATE_RETRY_DELAY_MS = 300 | |
| async function updateChannelsWithRetry( | |
| c: Context<MiddlewareKeyVariables>, | |
| operation: () => Promise<{ error: any }>, | |
| context: Record<string, unknown>, | |
| ) { | |
| let lastError: any | |
| for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) { | |
| try { | |
| const { error } = await operation() | |
| if (!error) | |
| return | |
| lastError = error | |
| } | |
| catch (error) { | |
| lastError = error | |
| } | |
| if (attempt < UPDATE_RETRY_ATTEMPTS - 1) | |
| await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1))) | |
| } | |
| cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context }) | |
| } |
🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/on_channel_update.ts` around lines 1 -
29, The retry loop in updateChannelsWithRetry currently only handles resolved
error results from operation() and will stop retrying if operation() throws;
modify updateChannelsWithRetry so the call to operation() is wrapped in
try/catch: invoke operation() inside the loop, catch any thrown error and treat
it like a failure (assign to lastError), continue to the next retry after the
existing delay logic, and only return early when operation() resolves with no
error; keep the existing cloudlogErr call (including c.get('requestId') and
context) to log lastError when all attempts fail.


Summary (AI generated)
Test plan (AI generated)
Screenshots (AI generated)
Checklist (AI generated)
accordingly.
my tests
Generated with AI
Summary by CodeRabbit