Skip to content

fix: suppress transient operational alerts#1939

Merged
riderx merged 8 commits into
mainfrom
codex/quiet-transient-alerts
Apr 23, 2026
Merged

fix: suppress transient operational alerts#1939
riderx merged 8 commits into
mainfrom
codex/quiet-transient-alerts

Conversation

@riderx
Copy link
Copy Markdown
Member

@riderx riderx commented Apr 23, 2026

Summary (AI generated)

  • suppress Discord crash alerts for expected operational 503 responses from builder and file storage paths
  • return a retryable upload response when a Durable Object moves during a streaming TUS upload instead of surfacing a crash alert
  • delay queue failure Discord alerts until the retry budget is exhausted and include clearer failure details when an alert is finally sent
  • add focused tests for the new alert suppression and retry behavior

Motivation (AI generated)

The current alerting path pages on transient infrastructure events and retryable queue failures, which creates noisy Discord messages that are not actionable and makes real crashes harder to spot.

Business Impact (AI generated)

This reduces alert fatigue for the team, makes production incidents easier to triage, and lowers the chance that a real outage is buried under transient builder, storage, or queue noise.

Test Plan (AI generated)

  • bun lint
  • bun lint:backend
  • bun typecheck
  • bunx vitest run tests/backend-alert-resilience.unit.test.ts tests/on-error-posthog.unit.test.ts tests/queue-consumer-message-shape.unit.test.ts tests/builder-payload.unit.test.ts tests/files-r2-error.test.ts
  • bun run test:backend
  • bun run test:all

Generated with AI

Summary by CodeRabbit

  • Bug Fixes

    • Streaming uploads now return a clear retryable 503 (with retry headers) when resumable uploads become temporarily unavailable.
    • Builder/API errors now surface richer status metadata and standardized 503 responses.
  • Refactor

    • Centralized error handling with optional alert suppression to avoid noisy alerts.
    • Queue failure reporting now includes structured, truncated failure details, sanitized response previews, and clearer retry/archive analysis.
  • Tests

    • Updated coverage for retry behavior, alert suppression, queue filtering, sanitization, timeouts, and concurrency.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Return a retryable 503 for non-replayable streaming uploads when a Durable Object fails, add alert-suppression options to core error helpers, centralize builder-unavailable errors, and enrich queue-consumer failure reporting with structured details and retry-budget filtering.

Changes

Cohort / File(s) Summary
Error handling core
supabase/functions/_backend/utils/hono.ts, supabase/functions/_backend/utils/on_error.ts
Add QuickErrorOptions; extend quickError(..., options) to carry suppressDiscordAlert on error causes; onError skips Discord 5xx alerts when suppression is present.
Upload handler retry logic
supabase/functions/_backend/files/files.ts
Refactor fetchUploadHandlerWithRetry to compute durable-object retryability once; return standardized 503 retryable_upload_unavailable for non-replayable streaming uploads; disable alerting on upstream 503s; export filesTestUtils.retryableUploadUnavailableResponse.
Builder request handling
supabase/functions/_backend/public/build/request.ts
Add local throwBuilderUnavailable helper to throw 503 service_unavailable with { alert: false }; include builder response status/statusText in thrown metadata and rethrow existing 503s.
Queue consumer & reporting
supabase/functions/_backend/triggers/queue_consumer.ts
Introduce structured FailureDetail, extractErrorDetails, getActionableQueueFailures honoring MAX_QUEUE_READS, and sanitizeDiscordResponseBody; update Discord embeds to include failure codes, truncated messages, sanitized bodies, and retry-budget analysis.
Tests & test utils
tests/backend-alert-resilience.unit.test.ts, tests/on-error-posthog.unit.test.ts, tests/queue-consumer-message-shape.unit.test.ts, tests/cli-new-encryption.test.ts, tests/cli.test.ts
Expect retryable 503 for streaming upload test; add test for suppressDiscordAlert behavior; add actionable-failure and sanitization tests; export new test helpers; increase some timeouts and convert some concurrent tests to non-concurrent.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Function
    participant DurableObject
    participant Upstream

    Client->>Function: PATCH streaming upload
    Function->>DurableObject: Proxy streaming request
    DurableObject-->>Function: Fetch error (object moved / DO fetch failure)
    alt Request is replayable OR DO error not retryable
        Function->>Function: determine not retryable
        Function-->>Client: Propagate underlying error (throw)
    else Request non-replayable && DO error considered retryable
        Function->>Function: determine retryable durable-object error
        Function-->>Client: Respond 503 (retryable_upload_unavailable) with Retry-After and Tus-Resumable
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Ari4ka

Poem

🐰 I hopped through streams where uploads fray,

When objects move mid-patch I pause and say,
"Wait, retry later" — no alarm to sound,
Quiet the bells, keep logs safe and sound,
I nibble code and bounce away.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description covers the summary, motivation, and business impact with comprehensive details; however, it does not follow the provided template structure (missing explicit 'Summary', 'Test plan', 'Screenshots', and 'Checklist' sections). Restructure the description to match the template: add a 'Summary' section, a 'Test plan' section with clear reproduction steps, note 'Screenshots' as N/A for backend changes, and complete the checklist with checkboxes.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: suppress transient operational alerts' directly and clearly summarizes the main objective of the PR—suppressing unwanted Discord alerts for transient infrastructure events.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/quiet-transient-alerts

Comment @coderabbitai help to get the list of available commands and usage tips.

@riderx riderx marked this pull request as ready for review April 23, 2026 14:40
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/queue-consumer-message-shape.unit.test.ts (1)

60-105: Good test coverage for actionable failure filtering.

The tests properly cover the three key scenarios: retries remaining, retry budget exhausted, and ignored error codes. Using it.concurrent() aligns with the coding guidelines.

Minor observation: the threshold value 5 is hardcoded rather than derived from MAX_QUEUE_READS. If the constant changes, these tests would need manual updates. Consider exporting MAX_QUEUE_READS from the test utils if you want tests to stay in sync automatically—though this is optional since the current approach clearly documents the expected threshold behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/queue-consumer-message-shape.unit.test.ts` around lines 60 - 105, The
tests hardcode the retry threshold (5) which can drift from the implementation;
update the tests to import and use the MAX_QUEUE_READS constant from the test
utilities instead of literal 5 so they remain in sync: export MAX_QUEUE_READS
from __queueConsumerTestUtils__ (or the module that defines it) and replace any
hardcoded 5 in the tests (references: getActionableQueueFailures and the failure
objects' read_count) with that imported MAX_QUEUE_READS.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 226-232: The Discord alert currently includes raw response_body
from actionableFailures (constructed via truncateDiscordField and mapping
detail.response_body), which may leak secrets; change the logic in the block
building the '🧾 Response Body' field to either omit the field entirely, only
include it when running in a non-production environment (check NODE_ENV or a
dedicated feature flag), or sanitize each detail.response_body by redacting
likely secrets before joining (e.g., replace email patterns, long
hex/base64/token-like strings, and API-key patterns with "[REDACTED]"); ensure
you still reference detail.function_name for context and keep
truncateDiscordField applied after redaction.

---

Nitpick comments:
In `@tests/queue-consumer-message-shape.unit.test.ts`:
- Around line 60-105: The tests hardcode the retry threshold (5) which can drift
from the implementation; update the tests to import and use the MAX_QUEUE_READS
constant from the test utilities instead of literal 5 so they remain in sync:
export MAX_QUEUE_READS from __queueConsumerTestUtils__ (or the module that
defines it) and replace any hardcoded 5 in the tests (references:
getActionableQueueFailures and the failure objects' read_count) with that
imported MAX_QUEUE_READS.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31970d5c-151f-4bfe-abd7-2c07ba47850e

📥 Commits

Reviewing files that changed from the base of the PR and between 8b70cfe and 9394ef4.

📒 Files selected for processing (8)
  • supabase/functions/_backend/files/files.ts
  • supabase/functions/_backend/public/build/request.ts
  • supabase/functions/_backend/triggers/queue_consumer.ts
  • supabase/functions/_backend/utils/hono.ts
  • supabase/functions/_backend/utils/on_error.ts
  • tests/backend-alert-resilience.unit.test.ts
  • tests/on-error-posthog.unit.test.ts
  • tests/queue-consumer-message-shape.unit.test.ts

Comment thread supabase/functions/_backend/triggers/queue_consumer.ts
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Apr 23, 2026

Merging this PR will not alter performance

✅ 28 untouched benchmarks


Comparing codex/quiet-transient-alerts (59a386e) with main (be0c55e)

Open in CodSpeed

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
supabase/functions/_backend/files/files.ts (1)

158-170: Add retryability header parity on the fallback 503 response.

This helper already signals retryability via body + Retry-After. Adding X-Upload-Handler-Retryable: '1' keeps signaling consistent and easier for clients/middleware to consume programmatically.

Suggested diff
 function retryableUploadUnavailableResponse(): Response {
   return new Response(JSON.stringify({
     error: 'upload_retryable',
     message: 'Upload worker moved during this request. Retry the upload request.',
   }), {
     status: 503,
     headers: {
       'Content-Type': 'application/json',
       'Retry-After': '1',
       'Tus-Resumable': TUS_VERSION,
+      [X_UPLOAD_HANDLER_RETRYABLE]: '1',
     },
   })
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/files/files.ts` around lines 158 - 170, The
fallback 503 response from retryableUploadUnavailableResponse() signals
retryability in body and Retry-After but lacks the X-Upload-Handler-Retryable
header; update the headers object returned by
retryableUploadUnavailableResponse() to include 'X-Upload-Handler-Retryable':
'1' alongside 'Retry-After' and 'Tus-Resumable' so clients/middleware can detect
retryability consistently and programmatically.
tests/cli.test.ts (1)

102-102: Extract timeout into a shared constant.

60000 is repeated in many tests; using one named constant improves readability and keeps future tuning centralized.

Suggested change
+const TEST_TIMEOUT_MS = 60_000
...
-  }, 60000)
+  }, TEST_TIMEOUT_MS)

Also applies to: 135-135, 200-200, 220-220, 231-231, 258-258, 291-291

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/cli.test.ts` at line 102, Extract the repeated numeric timeout (60000)
used in tests into a single named constant (e.g., DEFAULT_TEST_TIMEOUT) in
tests/cli.test.ts (or a shared test utils file) and replace all literal
occurrences (lines where 60000 is passed as the second argument to
test/it/timeouts at startLine 102 and also at 135, 200, 220, 231, 258, 291) with
that constant; ensure the constant is exported or imported if you place it in a
shared module and update any references to use the constant name instead of the
raw number so future timeout tuning is centralized and readable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/cli.test.ts`:
- Line 138: The tests under the suite describe('tests CLI upload options in
parallel', ...) were changed from concurrent to serial; restore concurrency by
replacing serial test declarations with concurrent ones (use it.concurrent(...)
or test.concurrent(...)) for the individual test cases inside this describe
block (and likewise for the other affected suites referenced at lines 185, 233,
260) so each isolated fixture test runs in parallel; locate test names inside
that describe and change their calls from it(...) to it.concurrent(...) ensuring
any shared-resources remain unaffected.

---

Nitpick comments:
In `@supabase/functions/_backend/files/files.ts`:
- Around line 158-170: The fallback 503 response from
retryableUploadUnavailableResponse() signals retryability in body and
Retry-After but lacks the X-Upload-Handler-Retryable header; update the headers
object returned by retryableUploadUnavailableResponse() to include
'X-Upload-Handler-Retryable': '1' alongside 'Retry-After' and 'Tus-Resumable' so
clients/middleware can detect retryability consistently and programmatically.

In `@tests/cli.test.ts`:
- Line 102: Extract the repeated numeric timeout (60000) used in tests into a
single named constant (e.g., DEFAULT_TEST_TIMEOUT) in tests/cli.test.ts (or a
shared test utils file) and replace all literal occurrences (lines where 60000
is passed as the second argument to test/it/timeouts at startLine 102 and also
at 135, 200, 220, 231, 258, 291) with that constant; ensure the constant is
exported or imported if you place it in a shared module and update any
references to use the constant name instead of the raw number so future timeout
tuning is centralized and readable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b9ba675c-6d2c-49b1-929f-e99a96e57438

📥 Commits

Reviewing files that changed from the base of the PR and between cc91b26 and 77ffe7f.

📒 Files selected for processing (3)
  • supabase/functions/_backend/files/files.ts
  • tests/backend-alert-resilience.unit.test.ts
  • tests/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/backend-alert-resilience.unit.test.ts

Comment thread tests/cli.test.ts
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
supabase/functions/_backend/triggers/queue_consumer.ts (1)

264-268: ⚠️ Potential issue | 🟠 Major

Redact error_message before sending it to Discord.

detail.error_message comes straight from downstream payloads, but unlike response_body it bypasses sanitizeDiscordResponseBody(). Tokens, emails, or stack fragments can still leak through the summary line. This is the same privacy class as the earlier raw-body issue.

Possible fix
-                  const errorInfo = detail.error_code ? ` | Error: ${detail.error_code}` : ''
-                  const messageInfo = detail.error_message ? ` | ${truncateDiscordField(detail.error_message.replace(/\s+/g, ' ').trim(), 180)}` : ''
+                  const errorInfo = detail.error_code ? ` | Error: ${detail.error_code}` : ''
+                  const sanitizedMessage = detail.error_message
+                    ? truncateDiscordField(sanitizeDiscordResponseBody(detail.error_message).replace(/\s+/g, ' ').trim(), 180)
+                    : ''
+                  const messageInfo = sanitizedMessage ? ` | ${sanitizedMessage}` : ''
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/triggers/queue_consumer.ts` around lines 264 -
268, The summary line in the Discord summary uses detail.error_message directly
(inside the actionableFailures.map() that builds the string returned by
truncateDiscordField), which can leak tokens/emails/stack fragments; fix it by
passing detail.error_message through the same sanitizer used for response bodies
(e.g., sanitizeDiscordResponseBody) or a redaction helper before trimming and
truncating — replace detail.error_message.replace(...).trim() with something
like sanitizeDiscordResponseBody(detail.error_message) (then trim and truncate
as now) so the map that builds the `**${detail.function_name}** ...
${messageInfo}` string never includes raw error_message.
🧹 Nitpick comments (1)
supabase/functions/_backend/triggers/queue_consumer.ts (1)

14-15: Use MAX_QUEUE_READS as the only retry-threshold source.

Line 14 introduces the shared constant, but Line 157 and Line 165 still hard-code 5. If this value changes later, alert suppression and actual archive behavior will silently diverge.

Suggested cleanup
-    acc[message.read_ct <= 5 ? 0 : 1].push(message)
+    acc[message.read_ct <= MAX_QUEUE_READS ? 0 : 1].push(message)
...
-  // Archive messages that have been read 5 or more times
+  // Archive messages that have been read more than MAX_QUEUE_READS times
   if (messagesToSkip.length > 0) {
-    cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that have been read 5 or more times.`)
+    cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that have been read more than ${MAX_QUEUE_READS} times.`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/functions/_backend/triggers/queue_consumer.ts` around lines 14 - 15,
Replace the two hard-coded retry thresholds of 5 with the shared constant
MAX_QUEUE_READS so the alert suppression and archive logic stay in sync; locate
usages around the retry/attempt checks (references to the alert suppression
check and the archive-on-max-retries branch in queue consumer logic) and change
the literal 5 to MAX_QUEUE_READS (ensure imports/visibility of MAX_QUEUE_READS
are correct where used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 364-379: In extractErrorDetails(), the code only returns
errorMessage when errorCode is a string, which drops message-only JSON errors
like { message: "..." }; update the function so it always captures and returns
errorMessage (from payload.message or payload.errorMessage) even when errorCode
is null/non-string, by returning an object with bodyPreview, errorCode (string
or null) and errorMessage (string or null) based on the payload variables
(payload, errorCode, errorMessage) rather than gating errorMessage on typeof
errorCode === 'string'.

---

Duplicate comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 264-268: The summary line in the Discord summary uses
detail.error_message directly (inside the actionableFailures.map() that builds
the string returned by truncateDiscordField), which can leak tokens/emails/stack
fragments; fix it by passing detail.error_message through the same sanitizer
used for response bodies (e.g., sanitizeDiscordResponseBody) or a redaction
helper before trimming and truncating — replace
detail.error_message.replace(...).trim() with something like
sanitizeDiscordResponseBody(detail.error_message) (then trim and truncate as
now) so the map that builds the `**${detail.function_name}** ... ${messageInfo}`
string never includes raw error_message.

---

Nitpick comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 14-15: Replace the two hard-coded retry thresholds of 5 with the
shared constant MAX_QUEUE_READS so the alert suppression and archive logic stay
in sync; locate usages around the retry/attempt checks (references to the alert
suppression check and the archive-on-max-retries branch in queue consumer logic)
and change the literal 5 to MAX_QUEUE_READS (ensure imports/visibility of
MAX_QUEUE_READS are correct where used).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a9fb2a9-40e3-4115-9a6a-573037222db2

📥 Commits

Reviewing files that changed from the base of the PR and between e242566 and f5d9efa.

📒 Files selected for processing (1)
  • supabase/functions/_backend/triggers/queue_consumer.ts

Comment thread supabase/functions/_backend/triggers/queue_consumer.ts
@sonarqubecloud
Copy link
Copy Markdown

@riderx riderx merged commit 59a24ff into main Apr 23, 2026
15 of 16 checks passed
@riderx riderx deleted the codex/quiet-transient-alerts branch April 23, 2026 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant