fix: suppress transient operational alerts#1939
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReturn a retryable 503 for non-replayable streaming uploads when a Durable Object fails, add alert-suppression options to core error helpers, centralize builder-unavailable errors, and enrich queue-consumer failure reporting with structured details and retry-budget filtering. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Function
participant DurableObject
participant Upstream
Client->>Function: PATCH streaming upload
Function->>DurableObject: Proxy streaming request
DurableObject-->>Function: Fetch error (object moved / DO fetch failure)
alt Request is replayable OR DO error not retryable
Function->>Function: determine not retryable
Function-->>Client: Propagate underlying error (throw)
else Request non-replayable && DO error considered retryable
Function->>Function: determine retryable durable-object error
Function-->>Client: Respond 503 (retryable_upload_unavailable) with Retry-After and Tus-Resumable
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/queue-consumer-message-shape.unit.test.ts (1)
60-105: Good test coverage for actionable failure filtering.The tests properly cover the three key scenarios: retries remaining, retry budget exhausted, and ignored error codes. Using
it.concurrent()aligns with the coding guidelines.Minor observation: the threshold value
5is hardcoded rather than derived fromMAX_QUEUE_READS. If the constant changes, these tests would need manual updates. Consider exportingMAX_QUEUE_READSfrom the test utils if you want tests to stay in sync automatically—though this is optional since the current approach clearly documents the expected threshold behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/queue-consumer-message-shape.unit.test.ts` around lines 60 - 105, The tests hardcode the retry threshold (5) which can drift from the implementation; update the tests to import and use the MAX_QUEUE_READS constant from the test utilities instead of literal 5 so they remain in sync: export MAX_QUEUE_READS from __queueConsumerTestUtils__ (or the module that defines it) and replace any hardcoded 5 in the tests (references: getActionableQueueFailures and the failure objects' read_count) with that imported MAX_QUEUE_READS.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 226-232: The Discord alert currently includes raw response_body
from actionableFailures (constructed via truncateDiscordField and mapping
detail.response_body), which may leak secrets; change the logic in the block
building the '🧾 Response Body' field to either omit the field entirely, only
include it when running in a non-production environment (check NODE_ENV or a
dedicated feature flag), or sanitize each detail.response_body by redacting
likely secrets before joining (e.g., replace email patterns, long
hex/base64/token-like strings, and API-key patterns with "[REDACTED]"); ensure
you still reference detail.function_name for context and keep
truncateDiscordField applied after redaction.
---
Nitpick comments:
In `@tests/queue-consumer-message-shape.unit.test.ts`:
- Around line 60-105: The tests hardcode the retry threshold (5) which can drift
from the implementation; update the tests to import and use the MAX_QUEUE_READS
constant from the test utilities instead of literal 5 so they remain in sync:
export MAX_QUEUE_READS from __queueConsumerTestUtils__ (or the module that
defines it) and replace any hardcoded 5 in the tests (references:
getActionableQueueFailures and the failure objects' read_count) with that
imported MAX_QUEUE_READS.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 31970d5c-151f-4bfe-abd7-2c07ba47850e
📒 Files selected for processing (8)
supabase/functions/_backend/files/files.tssupabase/functions/_backend/public/build/request.tssupabase/functions/_backend/triggers/queue_consumer.tssupabase/functions/_backend/utils/hono.tssupabase/functions/_backend/utils/on_error.tstests/backend-alert-resilience.unit.test.tstests/on-error-posthog.unit.test.tstests/queue-consumer-message-shape.unit.test.ts
…-alerts # Conflicts: # tests/backend-alert-resilience.unit.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
supabase/functions/_backend/files/files.ts (1)
158-170: Add retryability header parity on the fallback 503 response.This helper already signals retryability via body +
Retry-After. AddingX-Upload-Handler-Retryable: '1'keeps signaling consistent and easier for clients/middleware to consume programmatically.Suggested diff
function retryableUploadUnavailableResponse(): Response { return new Response(JSON.stringify({ error: 'upload_retryable', message: 'Upload worker moved during this request. Retry the upload request.', }), { status: 503, headers: { 'Content-Type': 'application/json', 'Retry-After': '1', 'Tus-Resumable': TUS_VERSION, + [X_UPLOAD_HANDLER_RETRYABLE]: '1', }, }) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/files/files.ts` around lines 158 - 170, The fallback 503 response from retryableUploadUnavailableResponse() signals retryability in body and Retry-After but lacks the X-Upload-Handler-Retryable header; update the headers object returned by retryableUploadUnavailableResponse() to include 'X-Upload-Handler-Retryable': '1' alongside 'Retry-After' and 'Tus-Resumable' so clients/middleware can detect retryability consistently and programmatically.tests/cli.test.ts (1)
102-102: Extract timeout into a shared constant.
60000is repeated in many tests; using one named constant improves readability and keeps future tuning centralized.Suggested change
+const TEST_TIMEOUT_MS = 60_000 ... - }, 60000) + }, TEST_TIMEOUT_MS)Also applies to: 135-135, 200-200, 220-220, 231-231, 258-258, 291-291
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/cli.test.ts` at line 102, Extract the repeated numeric timeout (60000) used in tests into a single named constant (e.g., DEFAULT_TEST_TIMEOUT) in tests/cli.test.ts (or a shared test utils file) and replace all literal occurrences (lines where 60000 is passed as the second argument to test/it/timeouts at startLine 102 and also at 135, 200, 220, 231, 258, 291) with that constant; ensure the constant is exported or imported if you place it in a shared module and update any references to use the constant name instead of the raw number so future timeout tuning is centralized and readable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/cli.test.ts`:
- Line 138: The tests under the suite describe('tests CLI upload options in
parallel', ...) were changed from concurrent to serial; restore concurrency by
replacing serial test declarations with concurrent ones (use it.concurrent(...)
or test.concurrent(...)) for the individual test cases inside this describe
block (and likewise for the other affected suites referenced at lines 185, 233,
260) so each isolated fixture test runs in parallel; locate test names inside
that describe and change their calls from it(...) to it.concurrent(...) ensuring
any shared-resources remain unaffected.
---
Nitpick comments:
In `@supabase/functions/_backend/files/files.ts`:
- Around line 158-170: The fallback 503 response from
retryableUploadUnavailableResponse() signals retryability in body and
Retry-After but lacks the X-Upload-Handler-Retryable header; update the headers
object returned by retryableUploadUnavailableResponse() to include
'X-Upload-Handler-Retryable': '1' alongside 'Retry-After' and 'Tus-Resumable' so
clients/middleware can detect retryability consistently and programmatically.
In `@tests/cli.test.ts`:
- Line 102: Extract the repeated numeric timeout (60000) used in tests into a
single named constant (e.g., DEFAULT_TEST_TIMEOUT) in tests/cli.test.ts (or a
shared test utils file) and replace all literal occurrences (lines where 60000
is passed as the second argument to test/it/timeouts at startLine 102 and also
at 135, 200, 220, 231, 258, 291) with that constant; ensure the constant is
exported or imported if you place it in a shared module and update any
references to use the constant name instead of the raw number so future timeout
tuning is centralized and readable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b9ba675c-6d2c-49b1-929f-e99a96e57438
📒 Files selected for processing (3)
supabase/functions/_backend/files/files.tstests/backend-alert-resilience.unit.test.tstests/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/backend-alert-resilience.unit.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
supabase/functions/_backend/triggers/queue_consumer.ts (1)
264-268:⚠️ Potential issue | 🟠 MajorRedact
error_messagebefore sending it to Discord.
detail.error_messagecomes straight from downstream payloads, but unlikeresponse_bodyit bypassessanitizeDiscordResponseBody(). Tokens, emails, or stack fragments can still leak through the summary line. This is the same privacy class as the earlier raw-body issue.Possible fix
- const errorInfo = detail.error_code ? ` | Error: ${detail.error_code}` : '' - const messageInfo = detail.error_message ? ` | ${truncateDiscordField(detail.error_message.replace(/\s+/g, ' ').trim(), 180)}` : '' + const errorInfo = detail.error_code ? ` | Error: ${detail.error_code}` : '' + const sanitizedMessage = detail.error_message + ? truncateDiscordField(sanitizeDiscordResponseBody(detail.error_message).replace(/\s+/g, ' ').trim(), 180) + : '' + const messageInfo = sanitizedMessage ? ` | ${sanitizedMessage}` : ''🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/triggers/queue_consumer.ts` around lines 264 - 268, The summary line in the Discord summary uses detail.error_message directly (inside the actionableFailures.map() that builds the string returned by truncateDiscordField), which can leak tokens/emails/stack fragments; fix it by passing detail.error_message through the same sanitizer used for response bodies (e.g., sanitizeDiscordResponseBody) or a redaction helper before trimming and truncating — replace detail.error_message.replace(...).trim() with something like sanitizeDiscordResponseBody(detail.error_message) (then trim and truncate as now) so the map that builds the `**${detail.function_name}** ... ${messageInfo}` string never includes raw error_message.
🧹 Nitpick comments (1)
supabase/functions/_backend/triggers/queue_consumer.ts (1)
14-15: UseMAX_QUEUE_READSas the only retry-threshold source.Line 14 introduces the shared constant, but Line 157 and Line 165 still hard-code
5. If this value changes later, alert suppression and actual archive behavior will silently diverge.Suggested cleanup
- acc[message.read_ct <= 5 ? 0 : 1].push(message) + acc[message.read_ct <= MAX_QUEUE_READS ? 0 : 1].push(message) ... - // Archive messages that have been read 5 or more times + // Archive messages that have been read more than MAX_QUEUE_READS times if (messagesToSkip.length > 0) { - cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that have been read 5 or more times.`) + cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that have been read more than ${MAX_QUEUE_READS} times.`)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/functions/_backend/triggers/queue_consumer.ts` around lines 14 - 15, Replace the two hard-coded retry thresholds of 5 with the shared constant MAX_QUEUE_READS so the alert suppression and archive logic stay in sync; locate usages around the retry/attempt checks (references to the alert suppression check and the archive-on-max-retries branch in queue consumer logic) and change the literal 5 to MAX_QUEUE_READS (ensure imports/visibility of MAX_QUEUE_READS are correct where used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 364-379: In extractErrorDetails(), the code only returns
errorMessage when errorCode is a string, which drops message-only JSON errors
like { message: "..." }; update the function so it always captures and returns
errorMessage (from payload.message or payload.errorMessage) even when errorCode
is null/non-string, by returning an object with bodyPreview, errorCode (string
or null) and errorMessage (string or null) based on the payload variables
(payload, errorCode, errorMessage) rather than gating errorMessage on typeof
errorCode === 'string'.
---
Duplicate comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 264-268: The summary line in the Discord summary uses
detail.error_message directly (inside the actionableFailures.map() that builds
the string returned by truncateDiscordField), which can leak tokens/emails/stack
fragments; fix it by passing detail.error_message through the same sanitizer
used for response bodies (e.g., sanitizeDiscordResponseBody) or a redaction
helper before trimming and truncating — replace
detail.error_message.replace(...).trim() with something like
sanitizeDiscordResponseBody(detail.error_message) (then trim and truncate as
now) so the map that builds the `**${detail.function_name}** ... ${messageInfo}`
string never includes raw error_message.
---
Nitpick comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 14-15: Replace the two hard-coded retry thresholds of 5 with the
shared constant MAX_QUEUE_READS so the alert suppression and archive logic stay
in sync; locate usages around the retry/attempt checks (references to the alert
suppression check and the archive-on-max-retries branch in queue consumer logic)
and change the literal 5 to MAX_QUEUE_READS (ensure imports/visibility of
MAX_QUEUE_READS are correct where used).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7a9fb2a9-40e3-4115-9a6a-573037222db2
📒 Files selected for processing (1)
supabase/functions/_backend/triggers/queue_consumer.ts
|



Summary (AI generated)
Motivation (AI generated)
The current alerting path pages on transient infrastructure events and retryable queue failures, which creates noisy Discord messages that are not actionable and makes real crashes harder to spot.
Business Impact (AI generated)
This reduces alert fatigue for the team, makes production incidents easier to triage, and lowers the chance that a real outage is buried under transient builder, storage, or queue noise.
Test Plan (AI generated)
Generated with AI
Summary by CodeRabbit
Bug Fixes
Refactor
Tests