Skip to content

Improve backend resilience for file and trigger workflows#1564

Closed
riderx wants to merge 3 commits into
mainfrom
riderx/log-fix-plan
Closed

Improve backend resilience for file and trigger workflows#1564
riderx wants to merge 3 commits into
mainfrom
riderx/log-fix-plan

Conversation

@riderx
Copy link
Copy Markdown
Member

@riderx riderx commented Feb 3, 2026

Summary (AI generated)

  • Harden R2 file reads to reduce transient failures
  • Add retries and safer error handling for channel/manifest/cron triggers

Test plan (AI generated)

  • bun lint:backend
  • bun test:backend

Screenshots (AI generated)

  • N/A

Checklist (AI generated)

  • My code follows the code style of this project and passes .
  • My change requires a change to the documentation.
  • I have updated the documentation
    accordingly.
  • My change has adequate E2E test coverage.
  • I have tested my code manually, and I have provided steps how to reproduce
    my tests

Generated with AI

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error handling and recovery mechanisms for file operations, channel updates, and manifest processing.
    • Added automatic retry logic for channel and manifest updates to improve reliability.
    • Improved error logging and handling of edge cases with missing or invalid data.

Copilot AI review requested due to automatic review settings February 3, 2026 06:12
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

These changes enhance error handling and resilience across Supabase backend functions by introducing retry mechanisms with exponential backoff for cloud operations, wrapping critical calls in try/catch blocks with improved logging via cloudlogErr, and converting some error scenarios to graceful responses instead of exceptions.

Changes

Cohort / File(s) Summary
Retry Mechanisms for Cloud Operations
supabase/functions/_backend/triggers/on_channel_update.ts, supabase/functions/_backend/triggers/on_manifest_create.ts
Added retry logic with configurable attempts and progressive delays for channel updates (iOS, Android, public) and S3 manifest size retrieval. Implements exponential backoff and logs final failures via cloudlogErr.
Enhanced Error Handling & Logging
supabase/functions/_backend/files/files.ts, supabase/functions/_backend/triggers/cron_stat_org.ts, supabase/functions/_backend/utils/plans.ts
Wrapped critical operations (file head/get, plan status checks, stripe updates) in try/catch blocks. Replaces silent failures or exceptions with conditional logging via cloudlogErr and graceful error responses (503 for file operations).
HTTP Request Processing
supabase/functions/_backend/utils/on_error.ts
Added conditional request body reading based on HTTP method; skips body capture for GET/HEAD/OPTIONS requests to reduce unnecessary processing.

Possibly related PRs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through error logs with glee,
Retry, retry, until success we see!
With exponential delays and catches so neat,
Resilience blooms where failures once beat. 🌿✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description covers the main changes (retries, error handling) and test plan (linters), but the checklist items are incomplete: documentation updates are unchecked and no manual test reproduction steps are provided despite the checklist stating they should be. Provide manual test reproduction steps or clarify why they are not applicable; confirm whether documentation updates to Cap-go/website are needed and update accordingly.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main focus of the changeset: improving resilience across multiple backend workflows (file reads, channel updates, cron triggers) through retry mechanisms and enhanced error handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch riderx/log-fix-plan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18b78ef969

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +36 to +39
await supabase
.from('stripe_info')
.update({ plan_calculated_at: new Date().toISOString() })
.eq('customer_id', body.customerId)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate plan_calculated_at update on successful plan calc

The new inner try/catch swallows failures from checkPlanStatusOnly, but the timestamp update still runs unconditionally. Because queue_cron_plan_for_org checks plan_calculated_at to skip re-queueing for an hour (see supabase/migrations/20251007134349_cron_plan_from_stats_backend.sql), a transient failure now marks the org as “calculated” and suppresses retries, leaving plan status/usage stale until the next window. Consider updating plan_calculated_at only when the plan calculation actually succeeds.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens backend resilience around plan checks, file downloads, and trigger workflows, with more robust error handling and retry logic.

Changes:

  • Wrap plan status calculation and cron-trigger logic with targeted try/catch blocks and structured logging to avoid aborting flows on transient failures.
  • Add retry/backoff helpers for manifest size retrieval and channel updates, and make some trigger error paths return success rather than failing the trigger.
  • Adjust the global onError handler to skip reading the body for safe HTTP methods and improve file download error handling by mapping R2/storage failures to a structured 503 response.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
supabase/functions/_backend/utils/plans.ts Wraps calculatePlanStatus in a try/catch within checkPlanStatusOnly, logging failures and returning early to avoid failing the entire cron when plan calculation errors occur.
supabase/functions/_backend/utils/on_error.ts Changes the error handler to only attempt JSON body parsing for non-GET/HEAD/OPTIONS methods, reducing spurious “failToReadBody” errors and preserving existing structured logging and alerting.
supabase/functions/_backend/triggers/on_manifest_create.ts Introduces getManifestSizeWithRetry for repeated s3.getSize attempts and alters error handling so some invalid/zero-size cases now log and return BRES instead of throwing, changing the observable error contract.
supabase/functions/_backend/triggers/on_channel_update.ts Adds updateChannelsWithRetry with retry/backoff for Supabase updates and changes missing id/app_id cases from throwing simpleError to returning BRES, altering previously tested error responses.
supabase/functions/_backend/triggers/cron_stat_org.ts Wraps checkPlanStatusOnly and the plan_calculated_at update in inner try/catch blocks to log and continue on failure while still ensuring the DB client is closed in the finally block.
supabase/functions/_backend/files/files.ts Hardens the R2-backed GET handler by wrapping head/get calls in try/catch, logging failures, and mapping get exceptions to a 503 upstream_unavailable error while preserving 404 behavior for missing objects and existing caching logic.

Comment on lines +14 to +41
if (!record.id) {
cloudlog({ requestId: c.get('requestId'), message: 'No id' })
throw simpleError('no_id', 'No id', { record })
return c.json(BRES)
}
if (!record.app_id) {
throw simpleError('no_app_id', 'No app id included the request', { record })
cloudlog({ requestId: c.get('requestId'), message: 'No app id included the request' })
return c.json(BRES)
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change turns invalid payload cases (missing id or app_id) from an error (simpleError('no_id'/'no_app_id')) into a silent success response (BRES), which breaks the existing contract and the tests in tests/trigger-error-cases.test.ts that assert a 400 response with error: 'no_app_id' for this scenario. If the goal is to make the trigger more resilient, consider keeping a 4xx error for structurally invalid records (or updating the tests and any callers explicitly), rather than returning a generic success that hides configuration issues and makes debugging harder.

Copilot uses AI. Check for mistakes.
if (!record.app_version_id || !record.s3_path) {
cloudlog({ requestId: c.get('requestId'), message: 'no app_version_id or s3_path' })
throw simpleError('no_app_version_id_or_s3_path', 'No app_version_id or s3_path', { record })
return c.json(BRES)
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, when app_version_id or s3_path was missing, this trigger threw simpleError('no_app_version_id_or_s3_path', ...), and the error-case test in tests/trigger-error-cases.test.ts explicitly asserts a 400 response with error: 'no_app_version_id_or_s3_path'. Changing this to return c.json(BRES) silently treats an invalid manifest row as success, which both breaks those tests and makes it harder to detect schema/configuration issues; consider preserving the explicit error (or updating tests and any consumers if this contract change is intentional).

Suggested change
return c.json(BRES)
throw simpleError('no_app_version_id_or_s3_path', 'no app_version_id or s3_path', { record })

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +133
let object: R2ObjectBody | null = null
try {
object = await new RetryBucket(bucket, DEFAULT_RETRY_PARAMS).get(fileId, {
range: c.req.raw.headers,
})
}
catch (error) {
cloudlogErr({ requestId: c.get('requestId'), message: 'getHandler files get failed', fileId, error })
throw quickError(503, 'upstream_unavailable', 'File storage temporarily unavailable', { fileId })
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new catch block around RetryBucket(bucket, DEFAULT_RETRY_PARAMS).get(...) introduces a specific 503 upstream_unavailable error path for R2/storage failures, but there is no test exercising this behavior even though we already have /files error-case tests (see tests/error-cases.test.ts). To prevent regressions and to lock in the intended contract for transient storage outages (status code and error payload), consider adding a test that forces R2Bucket.get to throw and asserts the 503 response with the expected error code.

Copilot generated this review using guidance from repository custom instructions.
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Feb 3, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
21.4% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@supabase/functions/_backend/triggers/cron_stat_org.ts`:
- Around line 34-46: The stripe_info plan_calculated_at update currently runs
unconditionally when body.customerId exists even if checkPlanStatusOnly()
failed; add a boolean flag (e.g., planStatusSucceeded) that is set true only
when checkPlanStatusOnly() completes without throwing and use that flag to gate
the update so it only runs on success; also replace the Supabase call at the
update block (currently using supabaseAdmin(c).from('stripe_info').update(...))
with the existing drizzleClient (initialized earlier) and perform the equivalent
Drizzle/PG update against the stripe_info table to set plan_calculated_at,
preserving the same logging (cloudlog) on success and error.

In `@supabase/functions/_backend/triggers/on_channel_update.ts`:
- Around line 1-29: The retry loop in updateChannelsWithRetry currently only
handles resolved error results from operation() and will stop retrying if
operation() throws; modify updateChannelsWithRetry so the call to operation() is
wrapped in try/catch: invoke operation() inside the loop, catch any thrown error
and treat it like a failure (assign to lastError), continue to the next retry
after the existing delay logic, and only return early when operation() resolves
with no error; keep the existing cloudlogErr call (including c.get('requestId')
and context) to log lastError when all attempts fail.

Comment on lines +34 to +46
try {
const supabase = supabaseAdmin(c)
await supabase
.from('stripe_info')
.update({ plan_calculated_at: new Date().toISOString() })
.eq('customer_id', body.customerId)
.throwOnError()

cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp updated', customerId: body.customerId })
cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp updated', customerId: body.customerId })
}
catch (error) {
cloudlog({ requestId: c.get('requestId'), message: 'plan calculated timestamp update failed', customerId: body.customerId, error })
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n supabase/functions/_backend/triggers/cron_stat_org.ts | head -60

Repository: Cap-go/capgo

Length of output: 2271


Gate the timestamp update on plan check success and use Drizzle/PG client for database operations.

The timestamp update runs regardless of whether checkPlanStatusOnly() succeeds or fails—the error is caught at line 28–30 but no flag is set, so the stripe_info update at line 34–46 will execute when body.customerId is present. Add a planStatusSucceeded flag to gate the timestamp update.

Additionally, use drizzleClient (already initialized at line 22–23) instead of supabaseAdmin() at line 35 to align with the migration to Drizzle/PG clients for all database operations in backend code.

🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/cron_stat_org.ts` around lines 34 - 46,
The stripe_info plan_calculated_at update currently runs unconditionally when
body.customerId exists even if checkPlanStatusOnly() failed; add a boolean flag
(e.g., planStatusSucceeded) that is set true only when checkPlanStatusOnly()
completes without throwing and use that flag to gate the update so it only runs
on success; also replace the Supabase call at the update block (currently using
supabaseAdmin(c).from('stripe_info').update(...)) with the existing
drizzleClient (initialized earlier) and perform the equivalent Drizzle/PG update
against the stripe_info table to set plan_calculated_at, preserving the same
logging (cloudlog) on success and error.

Comment on lines +1 to +29
import type { Context } from 'hono'
import type { MiddlewareKeyVariables } from '../utils/hono.ts'
import type { Database } from '../utils/supabase.types.ts'
import { Hono } from 'hono/tiny'
import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts'
import { cloudlog } from '../utils/logging.ts'
import { cloudlog, cloudlogErr } from '../utils/logging.ts'
import { supabaseAdmin } from '../utils/supabase.ts'

export const app = new Hono<MiddlewareKeyVariables>()

const UPDATE_RETRY_ATTEMPTS = 3
const UPDATE_RETRY_DELAY_MS = 300

async function updateChannelsWithRetry(
c: Context<MiddlewareKeyVariables>,
operation: () => Promise<{ error: any }>,
context: Record<string, unknown>,
) {
let lastError: any
for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) {
const { error } = await operation()
if (!error)
return
lastError = error
if (attempt < UPDATE_RETRY_ATTEMPTS - 1)
await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1)))
}
cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context })
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Ensure retries cover thrown errors too.

If operation() rejects/throws, the loop exits and retries are skipped. Wrap it so failures still retry.

♻️ Suggested fix
 async function updateChannelsWithRetry(
   c: Context<MiddlewareKeyVariables>,
   operation: () => Promise<{ error: any }>,
   context: Record<string, unknown>,
 ) {
   let lastError: any
   for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) {
-    const { error } = await operation()
-    if (!error)
-      return
-    lastError = error
+    try {
+      const { error } = await operation()
+      if (!error)
+        return
+      lastError = error
+    }
+    catch (error) {
+      lastError = error
+    }
     if (attempt < UPDATE_RETRY_ATTEMPTS - 1)
       await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1)))
   }
   cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context })
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import type { Context } from 'hono'
import type { MiddlewareKeyVariables } from '../utils/hono.ts'
import type { Database } from '../utils/supabase.types.ts'
import { Hono } from 'hono/tiny'
import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts'
import { cloudlog } from '../utils/logging.ts'
import { cloudlog, cloudlogErr } from '../utils/logging.ts'
import { supabaseAdmin } from '../utils/supabase.ts'
export const app = new Hono<MiddlewareKeyVariables>()
const UPDATE_RETRY_ATTEMPTS = 3
const UPDATE_RETRY_DELAY_MS = 300
async function updateChannelsWithRetry(
c: Context<MiddlewareKeyVariables>,
operation: () => Promise<{ error: any }>,
context: Record<string, unknown>,
) {
let lastError: any
for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) {
const { error } = await operation()
if (!error)
return
lastError = error
if (attempt < UPDATE_RETRY_ATTEMPTS - 1)
await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1)))
}
cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context })
}
import type { Context } from 'hono'
import type { MiddlewareKeyVariables } from '../utils/hono.ts'
import type { Database } from '../utils/supabase.types.ts'
import { Hono } from 'hono/tiny'
import { BRES, middlewareAPISecret, simpleError, triggerValidator } from '../utils/hono.ts'
import { cloudlog, cloudlogErr } from '../utils/logging.ts'
import { supabaseAdmin } from '../utils/supabase.ts'
export const app = new Hono<MiddlewareKeyVariables>()
const UPDATE_RETRY_ATTEMPTS = 3
const UPDATE_RETRY_DELAY_MS = 300
async function updateChannelsWithRetry(
c: Context<MiddlewareKeyVariables>,
operation: () => Promise<{ error: any }>,
context: Record<string, unknown>,
) {
let lastError: any
for (let attempt = 0; attempt < UPDATE_RETRY_ATTEMPTS; attempt++) {
try {
const { error } = await operation()
if (!error)
return
lastError = error
}
catch (error) {
lastError = error
}
if (attempt < UPDATE_RETRY_ATTEMPTS - 1)
await new Promise(resolve => setTimeout(resolve, UPDATE_RETRY_DELAY_MS * (attempt + 1)))
}
cloudlogErr({ requestId: c.get('requestId'), message: 'on_channel_update failed after retries', error: lastError, ...context })
}
🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/on_channel_update.ts` around lines 1 -
29, The retry loop in updateChannelsWithRetry currently only handles resolved
error results from operation() and will stop retrying if operation() throws;
modify updateChannelsWithRetry so the call to operation() is wrapped in
try/catch: invoke operation() inside the loop, catch any thrown error and treat
it like a failure (assign to lastError), continue to the next retry after the
existing delay logic, and only return early when operation() resolves with no
error; keep the existing cloudlogErr call (including c.get('requestId') and
context) to log lastError when all attempts fail.

@riderx riderx closed this Feb 6, 2026
@riderx riderx deleted the riderx/log-fix-plan branch March 17, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants