[codex] harden files upload retries and alert resilience #1934

Merged
riderx merged 7 commits into main from codex/fix-files-do-retries on Apr 23, 2026

@riderx riderx commented Apr 22, 2026

Summary (AI generated)

  • retry transient Durable Object reset responses for TUS uploads, including zero-byte create requests that can now be safely replayed
  • recover PATCH uploads by reloading the current TUS offset after a retryable Durable Object reset instead of surfacing a generic hard failure
  • harden backend env lookups and retry transient on_manifest_create manifest file_size updates, with regression coverage for all three failure paths
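The replay decision described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: `UploadRequestLike`, `hasNonEmptyUploadBody`, and `planAfterDurableObjectReset` are hypothetical names, and the rule that a missing `Content-Length` is treated as "body present" is an assumption made to stay on the safe side.

```typescript
// Hypothetical sketch; the PR's real helper (requestHasNonEmptyUploadBody) may differ.
interface UploadRequestLike {
  method: string
  // Header names are assumed to be lower-cased already.
  headers: Record<string, string | undefined>
}

/** True when the request carries upload bytes that cannot be re-sent once consumed. */
function hasNonEmptyUploadBody(req: UploadRequestLike): boolean {
  const method = req.method.toUpperCase()
  if (method !== 'POST' && method !== 'PATCH')
    return false
  const raw = req.headers['content-length']
  // Assumption: an unknown length is treated as "has a body" so we never replay blindly.
  if (raw === undefined)
    return true
  const length = Number.parseInt(raw, 10)
  return Number.isNaN(length) || length > 0
}

type RetryPlan = 'replay' | 'recover-offset'

/** Chooses what to do after a retryable Durable Object reset. */
function planAfterDurableObjectReset(req: UploadRequestLike): RetryPlan {
  // Zero-byte requests are safe to send again; requests with bytes are not,
  // so the caller reloads the current TUS offset instead.
  return hasNonEmptyUploadBody(req) ? 'recover-offset' : 'replay'
}
```

Under these assumptions a zero-byte create (`POST` with `Content-Length: 0`) yields `'replay'`, while a `PATCH` carrying a chunk yields `'recover-offset'`.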

Motivation (AI generated)

Cloudflare Durable Object storage resets were being treated as terminal upload failures even when the request was safe to replay, and the error-reporting path could throw its own ENVIRONMENT lookup error when bindings were unavailable on the active context. Separately, on_manifest_create still treated transient PostgREST 5xx responses as permanent failures, which matched the queue alerts.

Business Impact (AI generated)

This reduces failed file uploads for bundle and attachment flows, cuts noisy Discord alerts that masked the real upload issue, and makes manifest follow-up jobs more resilient to transient infrastructure hiccups. That lowers support load and improves release reliability for customers shipping updates.

Test Plan (AI generated)

  • bunx vitest run tests/backend-alert-resilience.unit.test.ts
  • bun run lint:backend
  • bun run supabase:with-env -- bunx vitest run tests/trigger-error-cases.test.ts
  • bun run supabase:with-env -- bunx vitest run tests/tus-upload.test.ts

Generated with AI

Summary by CodeRabbit

  • Bug Fixes

    • Smarter upload retry flow: replayable uploads are retried; unreplayable uploads attempt state recovery (may return 409) and handlers signal temporary retryability (503 with retry header).
    • Improved error signaling when upload handlers are temporarily unavailable.
  • Refactor

    • Centralized PostgREST-aware retry logic for consistent retry decisions across jobs.
    • Unified environment-variable access for more reliable configuration reads.
  • Tests

    • Added tests for upload retries, offset recovery, manifest-update retries, and env-binding behavior.

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR rewrites the Durable Object upload forwarding and retry logic (replay heuristics, header-based retry signaling, HEAD-based offset recovery), centralizes the PostgREST retry predicates and uses them across the triggers, statistics, and manifest-update paths, normalizes env access, and adds unit tests for the upload retry behaviors and the manifest retry.
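Header-based retry signaling can be illustrated with a small classifier. This is a sketch under stated assumptions: the header name `X-Capgo-DO-Retryable` and the value `1` are taken from the sequence diagram in this review, while `ResponseLike` and `isRetryableHandlerResponse` are hypothetical names, not the PR's API.

```typescript
// Header name as it appears in the sequence diagram, lower-cased for lookup.
const RETRYABLE_HEADER = 'x-capgo-do-retryable'

// Hypothetical minimal response shape; header names are assumed lower-cased.
interface ResponseLike {
  status: number
  headers: Record<string, string | undefined>
}

/**
 * True only when the upload handler explicitly marked a 503 as temporary.
 * A plain 503 without the marker is treated as a normal failure.
 */
function isRetryableHandlerResponse(res: ResponseLike): boolean {
  return res.status === 503 && res.headers[RETRYABLE_HEADER] === '1'
}
```

Requiring both the status and the marker keeps the caller from retrying unrelated 503s that the handler never declared safe to retry.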

Changes

| Cohort / File(s) | Summary |
|---|---|
| Upload/DO retry & helpers: supabase/functions/_backend/files/files.ts, supabase/functions/_backend/files/util.ts, supabase/functions/_backend/files/uploadHandler.ts | New replay heuristics (requestHasNonEmptyUploadBody), conditional body forwarding (buildDurableObjectRequest, duplex: 'half' only when forwarding), header-driven retry (X_UPLOAD_HANDLER_RETRYABLE), HEAD-based offset recovery when replay disallowed, DO fetch-error retry delegated to isRetryableDurableObjectResetError, and DO timeout extended to 30 minutes. |
| PostgREST retry utilities: supabase/functions/_backend/utils/retry.ts | Add RetryableResult and PostgREST-aware helpers: getRetryablePostgrestStatus, isRetryablePostgrestStatus, isRetryablePostgrestError, isRetryablePostgrestResult. |
| Cron triggers & statistics: supabase/functions/_backend/triggers/cron_stat_app.ts, supabase/functions/_backend/triggers/cron_sync_sub.ts, supabase/functions/_backend/public/statistics/index.ts | Replace local retry heuristics with shared PostgREST predicates; update exports/tests to reference centralized helpers. |
| Manifest update retry: supabase/functions/_backend/triggers/on_manifest_create.ts | Introduce runManifestUpdateWithRetry using retryWithBackoff and isRetryablePostgrestResult; validate results and export test utilities. |
| Env utilities: supabase/functions/_backend/utils/utils.ts | Add getContextEnv to prefer env(c) with fallback to c.env; update existInEnv and getEnv to use normalized accessor. |
| Tests: tests/backend-alert-resilience.unit.test.ts | Add Vitest coverage for DO retry paths (POST/PATCH empty & zero-byte replay, HEAD-based offset recovery), manifest-update retry behavior, and env helper edge cases. |
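The centralized PostgREST retry predicate and its use with a backoff loop might look roughly like the sketch below. All names here (`ResultLike`, `isTransientStatus`, `isTransientResult`, `backoffDelayMs`, `retryTransient`) are hypothetical stand-ins; the real `RetryableResult`, `isRetryablePostgrestResult`, and `retryWithBackoff` in `utils/retry.ts` may have different shapes. Treating only 5xx as transient follows the PR motivation; the delay constants are assumptions.

```typescript
// Hypothetical result shape; the real RetryableResult may differ.
interface ResultLike {
  status?: number
  error?: { message: string } | null
}

/** Assumption: only 5xx statuses count as transient. */
function isTransientStatus(status: number | undefined): boolean {
  return status !== undefined && status >= 500 && status <= 599
}

/** A result is worth retrying when it failed and the failure looks transient. */
function isTransientResult(result: ResultLike): boolean {
  return result.error != null && isTransientStatus(result.status)
}

/** Exponential backoff with a cap; attempt 0 waits baseMs. */
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 2000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

/** Runs `run` up to maxAttempts times, sleeping between transient failures. */
async function retryTransient<T extends ResultLike>(
  run: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
  let result = await run()
  for (let attempt = 1; attempt < maxAttempts && isTransientResult(result); attempt++) {
    await sleep(backoffDelayMs(attempt - 1))
    result = await run()
  }
  // The last result is returned as-is so the caller can validate it.
  return result
}
```

Keeping the predicate separate from the loop is what lets several jobs share one retry decision while choosing their own attempt counts.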

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant UploadHandler as Upload Handler
    participant DurableObject as Durable Object
    participant RetryBackoff as Backoff/Retry

    Client->>UploadHandler: POST/PATCH upload request
    UploadHandler->>UploadHandler: evaluate requestHasNonEmptyUploadBody
    alt Body replayable
        UploadHandler->>DurableObject: fetch (attach body, duplex: 'half', extended timeout)
    else No body / non-replayable
        UploadHandler->>DurableObject: fetch (no body)
    end

    DurableObject-->>UploadHandler: response (e.g. 503 + X-Capgo-DO-Retryable:1)
    UploadHandler->>UploadHandler: classify via header / isRetryableDurableObjectResetError
    alt Can replay
        UploadHandler->>RetryBackoff: schedule retry with backoff (re-send request)
        RetryBackoff-->>UploadHandler: retry outcome
        UploadHandler->>Client: forward final response
    else Cannot replay
        UploadHandler->>DurableObject: HEAD (remove content-length)
        DurableObject-->>UploadHandler: HEAD response (Upload-Offset)
        UploadHandler->>Client: respond 409 with recovered offset headers
    end
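The "Cannot replay" branch of the diagram can be sketched as two small pure functions: one that derives the HEAD request headers and one that turns the HEAD response into the 409 sent to the client. This is an illustration only. `headHeadersFrom`, `buildOffsetRecoveryResponse`, and the plain-object shapes are hypothetical, and adding `Tus-Resumable: 1.0.0` to the reply is an assumption based on the TUS 1.0.0 protocol rather than something this PR states.

```typescript
// Hypothetical shapes; HEAD response header names are assumed lower-cased.
interface HeadResultLike {
  status: number
  headers: Record<string, string | undefined>
}

interface RecoveryResponse {
  status: number
  headers: Record<string, string>
}

/** Copies the PATCH headers for the HEAD probe, dropping Content-Length. */
function headHeadersFrom(patchHeaders: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {}
  for (const [name, value] of Object.entries(patchHeaders)) {
    if (name.toLowerCase() !== 'content-length')
      out[name] = value
  }
  return out
}

/**
 * Builds the 409 that tells the client where to resume.
 * Returns null when the HEAD probe failed or carried no usable offset.
 */
function buildOffsetRecoveryResponse(head: HeadResultLike): RecoveryResponse | null {
  const offset = head.headers['upload-offset']
  const ok = head.status >= 200 && head.status < 300
  if (!ok || offset === undefined || !/^\d+$/.test(offset))
    return null
  return {
    status: 409,
    // Tus-Resumable is an assumption drawn from the TUS protocol.
    headers: { 'Upload-Offset': offset, 'Tus-Resumable': '1.0.0' },
  }
}
```

A TUS client that receives the 409 with `Upload-Offset` can resume its PATCH from that offset, so the reset surfaces as a recoverable conflict rather than a hard failure.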

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

💰 Rewarded

Poem

🐰 Hoppity-hop, a retry and a beam,

Durable objects nudge back into the stream,
HEADs and offsets guide the patch and the post,
Backoff drums softly while I nibble the toast,
I thump my foot — resilient code, brimful of gleam.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.89% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[codex] harden files upload retries and alert resilience' clearly and concisely summarizes the main changes: improving retry logic for file uploads and making alert systems more resilient. |
| Description check | ✅ Passed | The description includes a comprehensive summary of changes, motivation, business impact, and a detailed test plan. All key required sections are present and well-populated. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

codspeed-hq Bot commented Apr 22, 2026

Merging this PR will not alter performance

✅ 28 untouched benchmarks


Comparing codex/fix-files-do-retries (9a77b45) with main (fca6320)

@riderx riderx marked this pull request as ready for review April 22, 2026 20:44

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/backend-alert-resilience.unit.test.ts (1)

76-76: Consider using it.concurrent() for the new tests.

Per coding guidelines, tests should use it.concurrent() when possible to maximize parallelism. The new tests at lines 76, 110, 244, and 273 use isolated state and mocks, so they can safely run concurrently.

♻️ Suggested change for new tests
-  it('retries retryable durable object responses for empty-body upload creation requests', async () => {
+  it.concurrent('retries retryable durable object responses for empty-body upload creation requests', async () => {

Apply similar changes to the other new tests at lines 110, 244, and 273.

As per coding guidelines: tests/**/*.{ts,js}: Use it.concurrent() instead of it() when possible to run tests in parallel within the same file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/backend-alert-resilience.unit.test.ts` at line 76, Replace the
synchronous test declarations with concurrent ones: change the it(...) calls for
the tests whose titles include "retries retryable durable object responses for
empty-body upload creation requests" and the other new tests referenced at lines
110, 244, and 273 to use it.concurrent(...) instead of it(...), ensuring the
same test callback and mocks/state remain unchanged so they run safely in
parallel.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@supabase/functions/_backend/utils/utils.ts`:
- Around line 174-181: existInEnv currently uses the "in" operator which treats
keys with undefined values as present; change it to check the actual value so
undefined bindings are treated as absent (e.g., use getContextEnv(c)[key] !==
undefined or Object.hasOwn(contextEnv, key) && contextEnv[key] !== undefined).
Also update getEnv to consistently treat undefined the same way (read
contextEnv[key], return it if !== undefined, otherwise fall through to fallback)
so callers like the MAIN_SUPABASE_DB_URL branch don't see a
configured-but-undefined value and skip fallbacks; reference existInEnv, getEnv
and getContextEnv to locate the code to change.
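The `existInEnv` finding above comes down to how JavaScript's `in` operator treats a key whose value is `undefined`. The sketch below shows the difference; `EnvBindings`, `existsWithIn`, `existsWithValueCheck`, `readEnv`, and `sampleEnv` are hypothetical names used only for illustration and are not the functions in `utils.ts`.

```typescript
type EnvBindings = Record<string, string | undefined>

/** Presence check with `in`: a key bound to undefined still counts as present. */
function existsWithIn(env: EnvBindings, key: string): boolean {
  return key in env
}

/** Value check: a key bound to undefined is treated as absent. */
function existsWithValueCheck(env: EnvBindings, key: string): boolean {
  return env[key] !== undefined
}

/** Reads a value and falls through to the fallback when it is undefined. */
function readEnv(env: EnvBindings, key: string, fallback: string): string {
  const value = env[key]
  return value !== undefined ? value : fallback
}

// Illustrative bindings: the key exists but has no value.
const sampleEnv: EnvBindings = { MAIN_SUPABASE_DB_URL: undefined, ENVIRONMENT: 'prod' }

const seenByIn = existsWithIn(sampleEnv, 'MAIN_SUPABASE_DB_URL') // true
const seenByValue = existsWithValueCheck(sampleEnv, 'MAIN_SUPABASE_DB_URL') // false
```

With the value check, a configured-but-undefined binding no longer masks the fallback, which is the behavior the review comment asks for.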


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d57291c-c195-489d-bfb0-c13205abcb86

📥 Commits

Reviewing files that changed from the base of the PR and between c299d70 and 59d2d91.

📒 Files selected for processing (10)
  • supabase/functions/_backend/files/files.ts
  • supabase/functions/_backend/files/uploadHandler.ts
  • supabase/functions/_backend/files/util.ts
  • supabase/functions/_backend/public/statistics/index.ts
  • supabase/functions/_backend/triggers/cron_stat_app.ts
  • supabase/functions/_backend/triggers/cron_sync_sub.ts
  • supabase/functions/_backend/triggers/on_manifest_create.ts
  • supabase/functions/_backend/utils/retry.ts
  • supabase/functions/_backend/utils/utils.ts
  • tests/backend-alert-resilience.unit.test.ts

Comment thread supabase/functions/_backend/utils/utils.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f19bb7d368

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread supabase/functions/_backend/files/files.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76fc55e3d8

Comment thread supabase/functions/_backend/files/files.ts

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/backend-alert-resilience.unit.test.ts (1)

204-243: Consider asserting retry recovery uses HEAD on the second fetch.

This test validates the final 409 + Upload-Offset, which is great. Adding call-shape assertions would make the regression stricter and protect the PATCH→HEAD recovery contract.

Suggested test hardening
   const response = await filesTestUtils.fetchUploadHandlerWithRetry(
@@
   expect(response.status).toBe(409)
   expect(response.headers.get('Upload-Offset')).toBe('5242880')
   expect(handler.fetch).toHaveBeenCalledTimes(2)
+  const firstCallRequest = handler.fetch.mock.calls[0]?.[0] as Request
+  const secondCallRequest = handler.fetch.mock.calls[1]?.[0] as Request
+  expect(firstCallRequest.method).toBe('PATCH')
+  expect(secondCallRequest.method).toBe('HEAD')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/backend-alert-resilience.unit.test.ts` around lines 204 - 243, The test
should assert the retry recovery uses a HEAD request on the second fetch to
enforce the PATCH→HEAD recovery contract; update the test for 'recovers upload
offset after a retryable durable object patch response' to add a call-shape
assertion against handler.fetch (e.g., using toHaveBeenNthCalledWith or similar)
to verify the second call was made with method 'HEAD' (and include the expected
URL/Request shape via expect.objectContaining or equivalent), while keeping the
existing assertions on response.status, Upload-Offset and call count.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bfd1403a-3d36-4d59-83ba-d3f4f8c4ddf0

📥 Commits

Reviewing files that changed from the base of the PR and between 76fc55e and 62eb0f5.

📒 Files selected for processing (2)
  • supabase/functions/_backend/files/files.ts
  • tests/backend-alert-resilience.unit.test.ts
✅ Files skipped from review due to trivial changes (1)
  • supabase/functions/_backend/files/files.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62eb0f5821

Comment thread supabase/functions/_backend/files/files.ts Outdated
@riderx riderx merged commit 574c178 into main Apr 23, 2026
15 checks passed
@riderx riderx deleted the codex/fix-files-do-retries branch April 23, 2026 12:05