feat(cloud): rate limiting + Sentry wiring + Stripe webhook hardening#33
Conversation
Three buckets, each configurable via env:
- public (60/min/IP) on /health and /ready
- authenticated (300/min/user) on /billing/{checkout,portal,subscription}
- sensitive (10/min/IP) on /billing/webhook
The authenticated limiter keys by hashed bearer-token prefix so each
user gets their own budget; it falls back to IP keying when no auth
header is present. 429 responses carry a uniform JSON shape plus
Retry-After when slowapi surfaces it.
Test env defaults to disabled so the existing 115 tests are unaffected;
set GRADATA_RATE_LIMIT_ENABLED=true to exercise the caps (see
tests/test_rate_limit.py for boundary + token-keying coverage).
Co-Authored-By: Gradata <noreply@gradata.ai>
- Pass profiles_sample_rate=0.1 into sentry_sdk.init (config value was defined but never forwarded). - Add tag_user(user_id, workspace_id) helper that writes to the current isolation scope so any later exception carries the owning user's UUID and workspace tag. Wired into get_current_user_id, get_current_brain, and require_operator. - Expand _SENSITIVE_KEYS to scrub email / email_address / user_email in Sentry event extras and contexts. Opaque user.id is fine under send_default_pii=False; the raw email never is. - No-op when DSN is unset (production behaviour unchanged). Co-Authored-By: Gradata <noreply@gradata.ai>
- New migration 005_processed_webhooks.sql: one row per successfully
handled event.id, PK conflict gives us idempotency for free.
- Reject events with `created` older than 5 minutes (replay window is
shorter than Stripe's 3-day retry schedule, so legitimate deliveries
always fall inside it).
- Reject events missing `created` entirely (can't date-check them, treat
as malformed — signature-verified live events always include it).
- Look up event.id in processed_webhooks before dispatching; return 200
with {"duplicate": true} when we've seen it before so the handler
doesn't double-apply state changes (subscription plan flips, etc).
- Insert event.id into processed_webhooks after successful handling.
PK conflicts from parallel deliveries are swallowed defensively.
- Signature verification (stripe.Webhook.construct_event) unchanged —
that was already correct.
Migration note added to RAILWAY-ENV.md. When the table doesn't exist
yet, idempotency degrades to always-process (logged warning), so the
handler keeps working during rollout.
Co-Authored-By: Gradata <noreply@gradata.ai>
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 4 minutes and 24 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughImplements Stripe webhook hardening (signature verification, age check, atomic claim-first idempotency), adds DB migration for processed webhooks, introduces app-wide rate limiting with three buckets and middleware integration, expands Sentry sampling/tagging/scrubbing, and adds tests and env docs. Changes
Sequence Diagram(s)sequenceDiagram
participant Stripe as Stripe
participant Client as HTTP Client
participant App as Gradata App
participant DB as Postgres (processed_webhooks)
participant Handler as Internal Handler
Client->>App: POST /api/v1/billing/webhook (signature header)
App->>App: stripe.Webhook.construct_event (verify signature)
alt signature invalid
App-->>Client: 400 Bad Request
else signature valid
App->>App: extract event.id, event.created
alt missing created OR created too old
App-->>Client: 400 Bad Request
else within age
App->>DB: INSERT event_id ... status='processing' (claim)
alt insert conflict OR existing status='processed'
App-->>Client: 200 {"duplicate": true}
else claim acquired
App->>Handler: process event (business logic)
Handler-->>App: processing complete
App->>DB: UPDATE processed_webhooks SET status='processed', processed_at=now() WHERE event_id=...
App-->>Client: 200 {"status":"ok","duplicate":false}
end
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/auth.py`:
- Around line 112-123: Centralize Sentry tagging by extracting the duplicated
tag_user logic into a single helper (e.g., tag_user_for_brain) and call it from
this function instead of inline calls, or alternatively add tagging inside the
common auth primitives (verify_api_key and verify_jwt) so every auth path
(including get_brain_for_request) applies the same Sentry context; update
get_brain_for_request to invoke the new helper or rely on the updated verify_*
functions, and ensure the helper accepts the brain/user_id and workspace_id
resolution logic used here (brain.get("user_id") or brain.get("id"),
workspace_id from brain or rows[0]) so tagging is consistent across codepaths.
In `@cloud/app/middleware.py`:
- Around line 22-37: The Retry-After header in _rate_limit_handler is being
truncated via int(retry_after) which floors values (0.8 -> 0) causing immediate
client retries; update the logic to compute the ceiling of retry_after and clamp
it to at least 1 when retry_after is present (e.g., use math.ceil(retry_after)
and max(1, ...)) and then set headers["Retry-After"] to the stringified integer
so the response always tells clients to wait at least one second.
In `@cloud/app/rate_limit.py`:
- Around line 54-66: Both limiter and auth_limiter are constructed with
headers_enabled=False so clients don't receive X-RateLimit-* headers; change
their constructors to use headers_enabled=True (i.e., update the Limiter(...)
calls that create limiter and auth_limiter) so the middleware will emit
X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers; keep
the existing enabled handling via refresh_enabled_flag() as-is since enabled is
mutable post-construction.
In `@cloud/app/routes/billing.py`:
- Around line 209-223: Replay-protection is too strict for potential clock skew:
adjust the age check in the handler that calls _event_created(event) to allow a
small grace window by comparing age against WEBHOOK_MAX_AGE_SECONDS plus a new
constant (e.g., WEBHOOK_MAX_AGE_GRACE_SECONDS or GRACE_SECONDS), update the
rejection log to include the effective threshold and use the same combined
threshold when raising the HTTPException; ensure the new constant is defined
near other webhook config so it's discoverable and documented.
In `@cloud/app/sentry_init.py`:
- Around line 37-42: The current scrubber only walks dicts and misses list/tuple
containers; update the scrub routine (the _scrub_dict/_scrub_value function) to
recursively traverse list and tuple values as well as dicts, replacing any dict
entries whose key lowercased matches entries in _SENSITIVE_KEYS with
"[Filtered]" and returning a new tuple for tuple inputs; ensure you handle
nested lists/tuples of dicts so keys like "email", "email_address", and
"user_email" are scrubbed even when embedded in lists.
In `@cloud/migrations/005_processed_webhooks.sql`:
- Around line 10-17: The schema currently records events only after successful
handling which allows a race where parallel deliveries both pass the lookup and
double-apply effects; change the processed_webhooks design to an atomic
claim-first flow: add columns like status (ENUM or TEXT), claimed_at TIMESTAMPTZ
NULL, processed_at TIMESTAMPTZ NULL (and optionally processor_id) to the
processed_webhooks table, then in cloud/app/routes/billing.py perform an atomic
INSERT ... ON CONFLICT DO NOTHING to "claim" the event_id before calling
_handle_event(event) and check the insert result (if 0 rows affected, skip
processing), and after successful handling UPDATE the row to status='processed'
and set processed_at (or wrap claim+handle+update in a transaction if desired)
so no two workers can both process the same event.
In `@cloud/tests/test_rate_limit.py`:
- Around line 78-96: The test test_rate_limit_disabled_in_test_env_by_default
resets rate_limit.limiter but not rate_limit.auth_limiter, which can leave auth
limiter state leaking between tests; after calling rate_limit.limiter.reset()
(and/or after rate_limit.refresh_enabled_flag()), also call
rate_limit.auth_limiter.reset() so both limiters are cleared before creating the
TestClient, ensuring consistent isolation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: baab2914-0d91-4156-b4d3-178fc99f309c
📒 Files selected for processing (12)
cloud/RAILWAY-ENV.mdcloud/app/auth.pycloud/app/config.pycloud/app/middleware.pycloud/app/rate_limit.pycloud/app/routes/billing.pycloud/app/routes/health.pycloud/app/sentry_init.pycloud/migrations/005_processed_webhooks.sqlcloud/tests/test_billing.pycloud/tests/test_rate_limit.pycloud/tests/test_sentry.py
📜 Review details
🔇 Additional comments (20)
cloud/tests/test_billing.py (5)
228-238: LGTM — Helper updated to support webhook hardening tests.The
setdefaultapproach cleanly injects requiredidandcreatedfields while allowing individual tests to override them. Usingid(event)for unique IDs is pragmatic for test isolation.
334-341: Intentional design:_post_withomits defaults for precise hardening tests.Unlike
TestWebhook._post_event, this helper correctly avoids injecting defaults so tests can exercise edge cases like missingcreatedor specificidvalues.
343-356: LGTM — Stale event rejection test.The 6-minute offset correctly exceeds the 5-minute
WEBHOOK_MAX_AGE_SECONDSthreshold, validating replay protection.
370-402: LGTM — Idempotency test validates duplicate detection.Good approach: seeding
processed_webhooksviamock_supabase.add_response, then asserting noupdatecalls occur and response includes"duplicate": true.
404-436: LGTM — New event recording test.Properly verifies that successful handling triggers an insert to
processed_webhookswith correctevent_idandevent_type.cloud/tests/test_rate_limit.py (4)
10-38: LGTM — Well-structured fixture with proper state isolation.The fixture correctly:
- Clears settings cache before setting env vars
- Calls
refresh_enabled_flag()to sync limiter state- Resets both limiters before and after test execution
This prevents cross-test state bleed as noted in the relevant code snippet about shared MemoryStorage.
40-59: LGTM — Public bucket boundary tests.First test validates under-limit success; second validates 429 on overflow with proper detail message and optional Retry-After header check.
62-75: LGTM — Sensitive bucket test approach is sound.Intentionally triggering signature/config failures (400/503) to prove the limiter runs first is a valid test strategy. The third request hitting 429 confirms the sensitive bucket's 2/min cap.
98-116: LGTM — Auth keying unit test.Good coverage of the
_auth_or_ip_keyfunction:
- Different tokens → different keys
- Same token → deterministic (same key)
- Missing auth → IP fallback with
ip:prefix- Present auth →
u:prefixcloud/app/rate_limit.py (4)
30-39: LGTM — Auth key derivation is well-designed.The blake2b hash with 6-byte digest (48 bits / 12 hex chars) provides sufficient collision resistance for rate-limit keying while keeping keys compact. The
u:andip:prefixes ensure namespace separation.
42-51: LGTM — Environment-aware toggle logic.Explicit env var takes precedence; otherwise defaults to enabled (non-test) or disabled (test). This design prevents accidental rate-limit failures in the existing test suite.
69-78: LGTM —refresh_enabled_flagaddresses import-time evaluation.This correctly allows test fixtures to override env vars after import, then sync both limiters to the new state.
96-101: LGTM — Decorator factories are clean.Using callables (
_public,_authenticated,_sensitive) for lazy limit resolution is the correct slowapi pattern.cloud/app/routes/billing.py (7)
56-61: LGTM — Rate limit decorator applied to checkout.The
@auth_limitdecorator correctly uses the authenticated bucket (300/min per user).
101-104: LGTM — Rate limit decorator applied to portal.Consistent use of
@auth_limitfor authenticated billing endpoints.
129-156: LGTM — Event accessor helpers are robust.The polymorphic accessors (
_event_id,_event_type,_event_created) correctly handle both dict events and Stripe Event objects, with appropriate null/type checks.
159-169: Defensive fallback on lookup failure is appropriate.Returning
False(not processed) when the table doesn't exist allows webhook handling to continue during migration rollout. The warning log provides visibility.
172-180: Insert exception handling is correct for PK race.Two concurrent deliveries of the same event could race to insert; the second will fail with a PK conflict. Logging at
infolevel and continuing is the right behavior since the event was already processed.
225-234: LGTM — Idempotency check and marker insertion are correctly sequenced.The flow is correct:
- Check if already processed → return early with
duplicate: True- Handle the event
- Mark as processed
This ensures events aren't marked until successfully handled.
322-326: LGTM — Rate limit decorator applied to subscription endpoint.Consistent use of
@auth_limitfor the subscription status endpoint.
| # IP-keyed limiter (public + sensitive buckets) | ||
| limiter = Limiter( | ||
| key_func=get_remote_address, | ||
| enabled=limits_enabled(), | ||
| headers_enabled=False, | ||
| ) | ||
|
|
||
| # Bearer-token-keyed limiter (authenticated API bucket) | ||
| auth_limiter = Limiter( | ||
| key_func=_auth_or_ip_key, | ||
| enabled=limits_enabled(), | ||
| headers_enabled=False, | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🌐 Web query:
slowapi Limiter constructor parameters headers_enabled
💡 Result:
The slowapi Limiter constructor includes a parameter headers_enabled of type bool with default value False. It controls whether to inject X-RateLimit-* response headers (such as X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) into responses when rate limits are applied. Full list of key constructor parameters from official API documentation and source analysis: - key_func: Callable[..., str] (required) - Function to extract client identifier (e.g., IP address). - default_limits: List[StrOrCallableStr] = [] - Global fallback limits for routes without specific limits. - application_limits: List[StrOrCallableStr] = [] - Limits applied to the entire application. - headers_enabled: bool = False - Enable rate limit response headers. - strategy: Optional[str] = None - Rate limiting strategy (e.g., "fixed-window"). - storage_uri: Optional[str] = None - Storage backend URI (e.g., "redis://localhost:6379"). - storage_options: Dict[str, str] = {} - Additional storage options. - auto_check: bool = True - Automatically check limits in decorator chain. - swallow_errors: bool = False - Suppress exceptions on storage failures. - in_memory_fallback: List[StrOrCallableStr] = [] - Fallback limits when storage is down. - Additional parameters: enabled (bool=True), retry_after, key_prefix, key_style, in_memory_fallback_enabled, config_filename. Usage example: from slowapi import Limiter from slowapi.util import get_remote_address limiter = Limiter( key_func=get_remote_address, headers_enabled=True, # Enables X-RateLimit-* headers default_limits=["100/hour"] )
Citations:
- 1: https://deepwiki.com/laurentS/slowapi/3.1-limiter-class
- 2: https://deepwiki.com/laurentS/slowapi/5-configuration
- 3: https://slowapi.readthedocs.io/en/latest/api/
🏁 Script executed:
rg -n "refresh_enabled_flag" cloud/app/Repository: Gradata/gradata
Length of output: 399
🏁 Script executed:
rg -n "enabled" cloud/app/rate_limit.py -A 2 -B 2Repository: Gradata/gradata
Length of output: 1062
🏁 Script executed:
cat -n cloud/app/rate_limit.pyRepository: Gradata/gradata
Length of output: 4067
Consider enabling headers_enabled=True to expose rate-limit headers to clients.
With headers_enabled=False, clients cannot see their remaining request budget. Enabling X-RateLimit-* headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) improves API usability for integrators. The enabled property is already mutable post-construction and correctly updated by refresh_enabled_flag() at lines 77–78.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cloud/app/rate_limit.py` around lines 54 - 66, Both limiter and auth_limiter
are constructed with headers_enabled=False so clients don't receive
X-RateLimit-* headers; change their constructors to use headers_enabled=True
(i.e., update the Limiter(...) calls that create limiter and auth_limiter) so
the middleware will emit X-RateLimit-Limit, X-RateLimit-Remaining, and
X-RateLimit-Reset headers; keep the existing enabled handling via
refresh_enabled_flag() as-is since enabled is mutable post-construction.
Security/correctness fixes from CodeRabbit CHANGES_REQUESTED review: - billing: atomic claim-first webhook idempotency. Insert row with status='processing' BEFORE handler runs; PK conflict = race loss, skip handler. Eliminates double-apply under concurrent delivery. - migrations/005: add status + claimed_at + processed_at columns to support claim-first design. Keep old created_at index for rollout compat. - middleware: ceil+clamp Retry-After so 0.8s doesn't floor to 0s and cause immediate client retries. - sentry_init: scrubber now recurses through list/tuple containers (prev only walked dicts — emails in nested arrays slipped through). - auth: centralise Sentry user tagging in _tag_brain() so every auth path (get_current_brain + get_brain_for_request) gets the same Sentry context. Previously get_brain_for_request was untagged. - billing: add WEBHOOK_MAX_AGE_GRACE_SECONDS=30 for NTP/edge clock skew. - tests: reset auth_limiter alongside limiter in disabled-env test. - tests: add claim-race-loss test asserting handler skips on PK conflict. - tests: add list/tuple scrubbing test for Sentry PII protection. Declined: headers_enabled=True on slowapi Limiter. slowapi's X-RateLimit-* injection requires route handlers to take a `response: Response` param and is incompatible with Pydantic/JSONResponse returns. Tracked for a follow-up route migration. Retry-After on 429 still emitted from the exception handler. All 130 cloud tests + 2070 SDK tests pass. No new ruff/pyright errors. Co-Authored-By: Gradata <noreply@gradata.ai>
Deploying gradata-dashboard with
|
| Latest commit: |
616cc99
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://496a4a70.gradata-dashboard.pages.dev |
| Branch Preview URL: | https://feat-api-security-hardening.gradata-dashboard.pages.dev |
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/auth.py`:
- Around line 128-134: After verify_jwt(cred) (and similarly immediately after
verify_api_key(...) on the API-key branch) call tag_user(user_id) to set the
Sentry user context before any authorization checks that can fail; then after
loading the brain record via db.select("brains", filters={"user_id": user_id})
and before raising or returning, call tag_workspace/workspace tagging (or reuse
_tag_brain(rows[0], user_id=user_id) to enrich Sentry with workspace_id) so the
403/404/operator-denied paths are tagged; apply the same change to the other
similar blocks referenced (around the verify_jwt/verify_api_key flows at the
other ranges) so tagging happens immediately after identity verification and is
then enriched after the brain DB read.
In `@cloud/app/config.py`:
- Around line 29-34: The Settings.rate_limit_enabled field is unused because
limits_enabled() in cloud/app/rate_limit.py reads os.environ directly; update
limits_enabled() to obtain the flag from get_settings().rate_limit_enabled
(import get_settings from cloud.app.config if needed) and use that boolean to
enable/disable the limiter instead of reading GRADATA_RATE_LIMIT_ENABLED from
os.environ (remove or ignore the current os.environ fallback so the Settings
value controls the kill-switch).
In `@cloud/migrations/005_processed_webhooks.sql`:
- Around line 21-29: The current webhook handler treats any existing row as a
duplicate, turning a live 'processing' claim into a permanent poison pill;
update the billing route logic in cloud/app/routes/billing.py (the handler
around the call to _handle_event()) to only short‑circuit when status =
'processed' (not merely any row), and/or implement reclaim logic that, when an
existing row has status = 'processing' and claimed_at is older than your retry
window (e.g., >10 minutes), either DELETE or reset that row so the handler can
re-claim and rerun _handle_event(); ensure you update checks and any
upsert/insert paths to set claimed_at, status, and processed_at consistently to
avoid stuck rows.
In `@cloud/tests/test_billing.py`:
- Around line 338-345: TestWebhookHardening._post_with currently mirrors
TestWebhook._post_event but intentionally omits the setdefault calls for
id/created to allow tests to control those fields; add a short inline comment
above the _post_with method explaining that omission (reference
TestWebhook._post_event and the setdefault behavior) so future readers
understand it's deliberate and not a bug.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c2ecd596-0821-4ba1-a2ca-81069441c60d
📒 Files selected for processing (12)
cloud/RAILWAY-ENV.mdcloud/app/auth.pycloud/app/config.pycloud/app/middleware.pycloud/app/rate_limit.pycloud/app/routes/billing.pycloud/app/routes/health.pycloud/app/sentry_init.pycloud/migrations/005_processed_webhooks.sqlcloud/tests/test_billing.pycloud/tests/test_rate_limit.pycloud/tests/test_sentry.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (24)
cloud/app/rate_limit.py (3)
30-39: LGTM — Token-keyed rate limiting is well implemented.The
_auth_or_ip_keyfunction correctly:
- Parses the
Authorizationheader case-insensitively- Hashes the bearer token using blake2b (48 bits is sufficient for rate-limit disambiguation)
- Falls back to IP for anonymous requests with clear prefixes (
u:vsip:)
42-74: LGTM — Environment-aware rate limiting toggle is correctly implemented.The design appropriately:
- Defaults to enabled in production, disabled in tests (prevents flaky test failures)
- Allows explicit override via
GRADATA_RATE_LIMIT_ENABLED- Documents the
headers_enabled=Falsedecision with clear rationale
77-109: LGTM — Refresh mechanism and decorator factories are well designed.The lazy callable approach for limit strings allows configuration changes without restarting, and
refresh_enabled_flag()correctly synchronizes both limiter instances after settings are cleared.cloud/tests/test_rate_limit.py (4)
10-37: LGTM — Test fixture provides proper isolation.The
limited_clientfixture correctly:
- Enables rate limiting explicitly via env var
- Clears settings cache before and after
- Resets both
limiterandauth_limiterto prevent state leakage between tests
40-76: LGTM — Rate limit boundary tests are comprehensive.Tests correctly verify:
- Requests under limit succeed (lines 40-44)
- Requests over limit return 429 with proper JSON detail (lines 47-59)
- Sensitive bucket enforces tighter caps independently (lines 62-75)
78-100: LGTM — Disabled-by-default test validates test environment behavior.The test properly verifies that rate limiting is a no-op in test environments unless explicitly enabled, preventing flaky test failures in the existing test suite.
102-120: LGTM — Auth keying unit tests are thorough.The mock
_Reqclass correctly simulates FastAPI's Request interface for testing_auth_or_ip_key, verifying:
- Different tokens produce different keys
- Same token produces deterministic keys
- Missing auth falls back to IP-prefixed keys
cloud/tests/test_billing.py (6)
227-238: LGTM — Helper updated to match hardening requirements.The
_post_eventhelper now injects defaultidandcreatedfields viasetdefault, ensuring existing tests continue to work with the new replay protection while allowing individual tests to override these values.
265-271: LGTM — Assertions correctly filter for workspace-specific writes.The filtering for
table == "workspaces"accounts for the newprocessed_webhooksupdates, keeping test assertions focused on the business logic being verified.
347-373: LGTM — Replay protection tests validate boundary conditions.Tests correctly verify:
- Events older than 5 minutes are rejected with 400 (line 359)
- Events missing
createdare rejected with 400 (line 372)
374-406: LGTM — Duplicate detection test properly validates idempotency.The test seeds
processed_webhookswith an existing event_id and verifies:
- Response is 200 with
duplicate: True- Handler does not run (no captured updates)
408-457: LGTM — Claim-first workflow test validates the two-phase insert/update pattern.The test correctly asserts:
- A claim insert occurs with
status='processing'before handling- A completion update flips status to
'processed'after success
459-499: LGTM — Race condition test validates claim-loss behavior.The test simulates a PK conflict on insert and correctly verifies:
- Response is 200 with
duplicate: True- Handler does NOT touch workspaces (preventing double-apply)
This is the core correctness guarantee of the claim-first design.
cloud/app/routes/billing.py (11)
56-62: LGTM — Rate limiting correctly applied to checkout endpoint.The
@auth_limitdecorator is properly placed, and the parameter reordering (request: Requestfirst) ensures slowapi can extract the request for key derivation.
101-106: LGTM — Portal endpoint correctly rate-limited.
129-137: LGTM — Replay protection constants include clock skew grace.The 30-second grace period addresses the previous review feedback about tight timing with potential clock drift between Stripe's edge and your servers.
140-161: LGTM — Event accessor helpers handle both dict and object forms.This defensive approach ensures compatibility across Stripe SDK versions where
construct_eventmay return different types.
164-183: LGTM —_already_processedprovides cheap duplicate short-circuit with graceful degradation.The try/except block correctly handles the migration rollout scenario where
processed_webhookstable may not exist yet, falling back to "not processed" rather than failing all webhooks.
222-239: LGTM —_mark_processedis correctly best-effort.Since the claim row already exists with
status='processing', a failed update still guarantees idempotency. Logging the failure is appropriate for observability.
242-252: LGTM — Webhook handler docstring accurately describes hardening measures.
268-283: LGTM — Replay protection correctly rejects stale events.The age check uses the combined threshold (
WEBHOOK_MAX_AGE_SECONDS + WEBHOOK_MAX_AGE_GRACE_SECONDS) and logs sufficient detail for debugging rejected events.
285-310: LGTM — Idempotency implementation follows the claim-first pattern correctly.The flow is:
- Cheap short-circuit via
_already_processed(avoids INSERT for known duplicates)- Atomic claim via
_claim_event(PK conflict = lose race = skip)- Handle event only if claim won
- Mark processed (best-effort, non-fatal)
Returning HTTP 200 for duplicates is correct webhook behavior — Stripe will not retry on 200.
396-401: LGTM — Subscription endpoint correctly rate-limited.
186-219: The broad exception catch is intentional and documented in the function's docstring.The code explicitly states that "conflict, network, table missing" are treated identically as "did not claim" because it's "safer to drop a duplicate than to double-apply a subscription update." This is a deliberate safety-first trade-off: any failure (PK conflict, network timeout, missing table) results in skipping processing to avoid double-applying webhook updates.
If distinguishing network/timeout errors for retry logic is desired, that would require reconsidering the error handling strategy throughout this webhook processing flow, not just this function.
Remove unused `_scrub_and_enrich` in sentry_init.py — it was never wired into `before_send` (which uses `_scrub_event` directly) and has no callers in cloud/ or tests/. Pure dead-code removal, no behaviour change. Declined (conservatism on security-sensitive code): - Webhook event extractors (_event_id/_type/_created) kept as separate named helpers — explicit is better than clever on the audit path. - `_mark_processed`'s unused event_type param kept for symmetry with `_claim_event` — trivial cost, cleaner signature. - `_scrub_dict` compat shim kept as documented one-liner. - Atomic claim-first logic and _scrub_value recursion untouched. Co-Authored-By: Gradata <noreply@gradata.ai>
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Polite re-ping: CR review is still showing CHANGES_REQUESTED but all line-level actionables from the last review were addressed in the most recent push. Could you re-review when convenient? @coderabbitai full review |
|
✅ Actions performedFull review triggered. |
- rate_limit.py: wire Settings.rate_limit_enabled through limits_enabled(). Env var override (GRADATA_RATE_LIMIT_ENABLED) still takes precedence for tests/ops, but .env and Settings() values are now honoured instead of ignored. CR flagged the config field as dead wiring. - billing.py: fix poison-pill in webhook idempotency. _already_processed now short-circuits only on status='processed' (was: any row). _claim_event reclaims 'processing' rows older than WEBHOOK_CLAIM_STALE_SECONDS (10min) using a claimed_at-scoped UPDATE so racing reclaims can't both win. Previously a worker crash between claim and completion would permanently block every Stripe retry for that event_id. - test_billing.py: seed duplicate-test with status='processed' to match new contract; add _post_with docstring explaining the intentional divergence from _post_event; add two regression tests (stale reclaim runs handler, active claim does not). Security constraints preserved: - Stripe signature verification unchanged (still via construct_event). - Rate limit enforcement unchanged (still on by default in non-test env). - Sentry PII scrubbing unchanged (emails/tokens still filtered).
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Addressed CodeRabbit round-3 feedback in 8bfc271:
Security invariants preserved:
Tests: 132 passed (130 existing + 2 new regression tests). Push SHA: |
|
@coderabbitai review — fixes pushed, please re-verify. |
|
✅ Actions performedFull review triggered. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai review — fixes pushed since last review, please re-verify |
|
✅ Actions performedFull review triggered. |
|
@coderabbitai review — CR Pro active, please re-verify. |
|
✅ Actions performedFull review triggered. |
|
@coderabbitai review — re-verify latest pushes |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cloud/app/auth.py (1)
128-133:⚠️ Potential issue | 🟡 MinorTag identity earlier on JWT/operator paths to cover denied/not-found branches.
In these flows, tagging happens after checks that can raise 403/404, so those failures can miss
user.idcontext. Tag immediately after identity verification (verify_jwt(...)/claims["sub"]), then enrich later with workspace when available.Suggested patch
async def get_current_brain( @@ - user_id = await verify_jwt(cred) + user_id = await verify_jwt(cred) + tag_user(user_id) db = get_db() @@ async def get_brain_for_request( @@ - user_id = await verify_jwt(cred) + user_id = await verify_jwt(cred) + tag_user(user_id) brain = await verify_brain_ownership(brain_id, user_id) _tag_brain(brain, user_id=user_id) @@ async def require_operator( @@ claims = await verify_jwt_claims(credentials.credentials) user_id = claims["sub"] + tag_user(user_id) @@ - tag_user(user_id) return user_idAlso applies to: 171-174, 205-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/app/auth.py` around lines 128 - 133, After verify_jwt(...) returns the user id, call the tagging helper immediately to record the identity context (e.g., invoke _tag_brain or the common tagging function with user_id) before performing DB lookups or raising HTTPException; then later, after get_db()/rows/workspace resolution, call the same tagging function again to enrich the tag with workspace/brain info. Apply this pattern around the verify_jwt -> db.select flow (current block using verify_jwt, get_db, rows, and _tag_brain) and similarly at the other mentioned sites (the flows around lines ~171-174 and ~205-217) so 403/404 branches include user.id context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/routes/billing.py`:
- Around line 197-245: _claim_event currently collapses "already processed",
"another worker holds claim", and "DB/table unavailable" into a single False
result which leads stripe_webhook to ack retries incorrectly; change
_claim_event (and the stripe_webhook call sites that use it) to return distinct
outcomes (e.g., an enum or tuple) indicating: PROCESSED (row exists with
status='processed'), LOCKED (status='processing' and claimed_at is fresh),
RECLAIMED/CLAIMED (we won the claim), and ERROR/UNAVAILABLE (DB errors or
missing table). Update stripe_webhook handling so only PROCESSED results return
200 {"duplicate": true}; LOCKED and ERROR/UNAVAILABLE must return a non-2xx (or
explicitly opt to bypass idempotency only when ERROR/UNAVAILABLE is
intentionally chosen), and CLAIMED proceeds to call _handle_event; reference
function names _claim_event, stripe_webhook, and _handle_event when making these
control-flow changes.
- Around line 257-266: The reclaim UPDATE currently only filters on event_id and
claimed_at_raw and can overwrite a row that another worker already set to
processed; add a condition that the existing row's status is "processing" to the
CAS filter so the db.update on the "processed_webhooks" table only succeeds if
the row is still processing. Modify the filters passed to the db.update call
(the one updating "processed_webhooks" with data {"status": "processing",
"claimed_at": "now()"}) to include status="processing" (reference the db.update
call and the _mark_processed path).
---
Duplicate comments:
In `@cloud/app/auth.py`:
- Around line 128-133: After verify_jwt(...) returns the user id, call the
tagging helper immediately to record the identity context (e.g., invoke
_tag_brain or the common tagging function with user_id) before performing DB
lookups or raising HTTPException; then later, after get_db()/rows/workspace
resolution, call the same tagging function again to enrich the tag with
workspace/brain info. Apply this pattern around the verify_jwt -> db.select flow
(current block using verify_jwt, get_db, rows, and _tag_brain) and similarly at
the other mentioned sites (the flows around lines ~171-174 and ~205-217) so
403/404 branches include user.id context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0f5bbbbc-bea6-4e7b-9c45-04a120e22515
📒 Files selected for processing (12)
cloud/RAILWAY-ENV.mdcloud/app/auth.pycloud/app/config.pycloud/app/middleware.pycloud/app/rate_limit.pycloud/app/routes/billing.pycloud/app/routes/health.pycloud/app/sentry_init.pycloud/migrations/005_processed_webhooks.sqlcloud/tests/test_billing.pycloud/tests/test_rate_limit.pycloud/tests/test_sentry.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (10)
cloud/app/auth.py (1)
101-110: Good centralization of auth Sentry context.
_tag_brain(...)is a clean consolidation point and keeps user/workspace tagging consistent across auth paths.cloud/app/middleware.py (2)
23-43: 429 response shaping and Retry-After handling look correct.The ceil+floor-to-1 behavior on
Retry-Afterand the consistent JSON body are solid.
48-55: Limiter wiring throughapp.stateis correctly integrated.Refreshing the enabled flag before registration and exposing both limiters on
app.stateis a clean setup for decorated routes.cloud/app/config.py (1)
26-34: Config additions are clear and scoped well.The new Sentry profiling and rate-limit bucket settings are straightforward and match the intended runtime controls.
cloud/tests/test_sentry.py (1)
116-201: Strong regression coverage added for the Sentry hardening paths.These tests close the key gaps (nested container scrubbing, profile sampling wiring, and
tag_userbehavior with/without active Sentry).cloud/app/routes/health.py (1)
43-52: Rate-limit integration on health endpoints looks correct.The decorator + request signature update is applied consistently to both probes.
cloud/RAILWAY-ENV.md (1)
35-68: Docs update is practical and rollout-friendly.The added migration callout and env var sections make the deployment expectations explicit.
cloud/migrations/005_processed_webhooks.sql (1)
16-43: Migration shape matches the hardened webhook claim model.The status + timestamp fields and claim-time index support the handler’s idempotency/reclaim flow well.
cloud/app/sentry_init.py (1)
69-102: Sentry hardening changes are well implemented.Recursive scrubbing, profiles sampling forwarding, and isolation-scope tagging are all aligned with the stated goals.
Also applies to: 134-153, 156-174
cloud/tests/test_billing.py (1)
510-612: Good regression coverage for stale vs. active claims.These two tests pin the intended reclaim boundary well and make the claim lifecycle much harder to regress accidentally.
- Tag Sentry user context immediately after JWT/API-key verification so 403 "Not your brain", 404 "No brain found", and operator-denied rejections carry identity in Sentry (CR minor on auth.py). - Remove redundant processed_webhooks INSERT from _handle_event. The outer _claim_event/_mark_processed wrappers own idempotency; the inner INSERT was a stale path from the pre-claim-flow code kept accidentally by the main-branch merge, causing duplicate-row test failures.
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Rebased on main (now up-to-date) |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
@coderabbitai review — rebased on latest main, all prior fixes applied |
|
✅ Actions performedFull review triggered. |
CRITICAL: _claim_event() collapsed all failure modes (missing migration, DB down, PK conflict) into "return False" which the webhook route acked as 200 duplicate. A crash in the claim store silently dropped Stripe events. Split into distinct outcomes via a new ClaimStoreUnavailable exception — the route now returns 503 on store failures so Stripe retries, and only returns 200 duplicate on a legitimate race loss. MAJOR: the stale-claim reclaim UPDATE filtered only on event_id + claimed_at, so if a worker flipped the row to status='processed' between our SELECT and UPDATE, we'd reset it back to 'processing' and re-run the handler. Added status='processing' to the CAS filter. Tests: added test_claim_store_unavailable_returns_503_not_duplicate_ack covering the CRITICAL regression. Updated the race-loss test to seed the winner's active claim row (since _already_processed lookup failure no longer silently returns False — the follow-up claim path handles it). All 143 cloud tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Addressed the latest CR blockers in 616cc99. CRITICAL (billing.py:245) — fixed. MAJOR (billing.py:266) — fixed. The stale-claim reclaim UPDATE now filters on Minor (auth.py:139) — already addressed in 988d947 via the centralised Tests: added Preserved: Stripe signature verification, rate limit defaults, Sentry PII scrubbing — no behavioural changes outside the webhook claim state machine. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cloud/app/auth.py (1)
151-164:⚠️ Potential issue | 🟡 MinorMissing Sentry tagging in
get_current_user_id_flexible.This auth dependency authenticates users but doesn't call
tag_user, unlikeget_current_user_id(line 147) andget_current_brain. If an endpoint using this dependency raises an exception, the Sentry event won't have user context.🛡️ Proposed fix to add Sentry tagging
async def get_current_user_id_flexible( credentials: HTTPAuthorizationCredentials = Security(_bearer), ) -> str: """Extract user_id from either a SDK API key or a Supabase JWT. Unlike :func:`get_current_brain`, this does NOT require that a brain already exists — it only identifies the caller. Use for endpoints that should work for brand-new users who haven't created a brain yet. """ cred = credentials.credentials if cred.startswith("gd_"): brain = await verify_api_key(cred) - return brain["user_id"] - return await verify_jwt(cred) + user_id = brain["user_id"] + tag_user(user_id) + return user_id + user_id = await verify_jwt(cred) + tag_user(user_id) + return user_id🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/app/auth.py` around lines 151 - 164, get_current_user_id_flexible is missing a call to tag_user so Sentry won't have user context for errors; update the function to call tag_user with the resolved user_id before returning in both branches (after verify_api_key(cred) and after verify_jwt(cred)) — e.g., when cred startswith "gd_" extract user_id from brain = await verify_api_key(cred), call tag_user(brain["user_id"]) then return brain["user_id"], and in the JWT path do id = await verify_jwt(cred); tag_user(id); return id. Ensure you import or reference the existing tag_user function used by get_current_user_id/get_current_brain.cloud/tests/test_billing.py (1)
250-250: 🧹 Nitpick | 🔵 TrivialRemove unnecessary
if True:statement.This appears to be a leftover artifact. The
if True:block doesn't affect execution but adds visual noise.🧹 Proposed cleanup
mock_supabase.update = _capture_update # type: ignore[method-assign] - if True: - event = { - "type": "customer.subscription.updated", + event = { + "type": "customer.subscription.updated",Also applies to lines 283 and 307.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/tests/test_billing.py` at line 250, Remove the redundant "if True:" blocks that were left as artifacts (they appear around the test assertions in the billing test file) — locate each standalone "if True:" and its indented block and simply unindent its body into the surrounding test scope (delete the "if True:" line and adjust indentation), ensuring test logic and fixtures remain identical; repeat this cleanup for the other occurrences of the same pattern so no "if True:" statements remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/routes/billing.py`:
- Around line 303-312: The updates to the processed_webhooks row currently pass
the literal string "now()" (in the db.update calls that set claimed_at and
processed_at), which results in the string 'now()' being stored; change these to
real ISO timestamps by importing datetime and timezone and setting claimed_at
and processed_at to datetime.now(timezone.utc).isoformat() when building the
data dicts for the db.update calls (refer to the db.update calls that set
data={"status":"processing","claimed_at":...} and the later update that sets
processed_at); ensure the datetime import is added at the top of the module.
---
Outside diff comments:
In `@cloud/app/auth.py`:
- Around line 151-164: get_current_user_id_flexible is missing a call to
tag_user so Sentry won't have user context for errors; update the function to
call tag_user with the resolved user_id before returning in both branches (after
verify_api_key(cred) and after verify_jwt(cred)) — e.g., when cred startswith
"gd_" extract user_id from brain = await verify_api_key(cred), call
tag_user(brain["user_id"]) then return brain["user_id"], and in the JWT path do
id = await verify_jwt(cred); tag_user(id); return id. Ensure you import or
reference the existing tag_user function used by
get_current_user_id/get_current_brain.
In `@cloud/tests/test_billing.py`:
- Line 250: Remove the redundant "if True:" blocks that were left as artifacts
(they appear around the test assertions in the billing test file) — locate each
standalone "if True:" and its indented block and simply unindent its body into
the surrounding test scope (delete the "if True:" line and adjust indentation),
ensuring test logic and fixtures remain identical; repeat this cleanup for the
other occurrences of the same pattern so no "if True:" statements remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 154d9612-8f2d-4d7b-a3ce-1237fbf829ba
📒 Files selected for processing (3)
cloud/app/auth.pycloud/app/routes/billing.pycloud/tests/test_billing.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (22)
cloud/app/auth.py (3)
101-111: LGTM! Well-designed centralized Sentry tagging helper.The
_tag_brainfunction correctly centralizes the tagging logic with appropriate fallbacks: explicituser_id(JWT path) takes precedence, thenbrain.user_id, thenbrain.id. Theworkspace_idresolution handles falsy values correctly by converting empty strings toNone.
124-139: LGTM! Early tagging ensures Sentry context is set before potential failures.Both authentication paths now tag immediately after verification succeeds:
- API-key path: tags via
_tag_brainat line 127- JWT path: tags user at line 133 before brain lookup, then enriches with workspace at line 138
This ensures 403/404 errors are properly attributed in Sentry.
231-232: LGTM! Operator path now tags immediately after JWT verification.Tagging happens before email resolution and domain checks, so operator-denied 403s will carry identity context in Sentry.
cloud/app/routes/billing.py (9)
56-61: LGTM! Rate limiting correctly applied to checkout endpoint.The
@auth_limitdecorator is properly placed below the route decorator, and the parameter reorder(request, body, ...)is required for slowapi to access the request object.
129-137: LGTM! Clock skew grace period added for replay protection.The 30-second grace period on top of the 5-minute max age addresses potential clock drift between Stripe's edge servers and your infrastructure.
140-161: LGTM! Event helpers handle both dict and Stripe.Event objects.Clean defensive implementation that works regardless of how the event is represented.
170-203: LGTM!_already_processedcorrectly short-circuits only on completed events.The function only returns
Trueforstatus='processed'rows, allowing staleprocessingclaims to be reclaimed by_claim_event. Lookup failures are treated as "not processed" so the real guard in_claim_eventcan surfaceClaimStoreUnavailableif needed.
206-213: LGTM!ClaimStoreUnavailableexception enables proper failure mode handling.This distinction between "store unavailable" and "legitimate duplicate" allows the webhook handler to return 503 for the former (triggering Stripe retries) rather than incorrectly acknowledging an unprocessed event.
327-348: LGTM! Defensive timestamp parsing with safe default.The
_claim_is_stalefunction handles various timestamp formats (ISO8601 string, datetime object) and correctly treats unparseable values as non-stale to avoid accidental double-processing.
371-412: LGTM! Webhook replay protection is well-implemented.The handler correctly:
- Rejects events missing
created(line 399-403)- Rejects events older than max_age + grace (line 407-412)
- Applies the
@sensitive_limitrate limiter
414-451: LGTM! Claim-first idempotency with proper failure mode handling.The flow correctly:
- Short-circuits on already-processed events (line 419-421)
- Returns 503 on
ClaimStoreUnavailableso Stripe retries (line 436-439)- Returns 200 duplicate on claim race loss (line 440-442)
- Runs handler only when claim is won
- Marks processed after successful handler (line 446-449)
454-462: LGTM! Clear documentation of idempotency responsibility.The docstring correctly documents that
_handle_eventtrusts its caller to have won the atomic claim and must NOT insert intoprocessed_webhooks.cloud/tests/test_billing.py (10)
227-238: LGTM! Test helper updated to inject required fields for hardening checks.The
setdefaultcalls ensure every test event hasidandcreatedfields, matching real Stripe event structure while allowing individual tests to override.
265-271: LGTM! Test correctly filters to workspace writes only.The filter
[c for c in captured if c["table"] == "workspaces"]correctly excludes theprocessed_webhooksstatus update that happens after the handler, focusing assertions on the actual business logic.
338-352: LGTM! Clear documentation of_post_withvs_post_eventdistinction.The comment explains why this helper intentionally omits the
setdefaultcalls, allowing hardening tests precise control overidandcreatedfields.
354-367: LGTM! Replay protection test validates stale event rejection.The test correctly creates an event 6 minutes old (exceeding the 5min + 30s threshold) and verifies a 400 response with "too old" message.
381-415: LGTM! Duplicate detection test validates short-circuit behavior.The test seeds a
status='processed'row and verifies the handler doesn't run for duplicate events.
417-466: LGTM! Claim-first flow test validates INSERT then UPDATE lifecycle.The test verifies:
- Exactly one claim INSERT with
status='processing'- One completion UPDATE flipping to
status='processed'
468-517: LGTM! Claim race loss test validates double-apply prevention.The test simulates losing the PK conflict race to another worker with an active (recent) claim and verifies the handler doesn't run.
519-573: LGTM! Stale claim reclaim test validates poison-pill recovery.The test creates a 20-minute-old
processingrow (exceeding the 10-minute stale window) and verifies:
- A reclaim UPDATE occurs with the prior
claimed_atin the filter- The handler runs after successful reclaim
575-621: LGTM! Active claim test validates no premature reclaim.The test creates a 30-second-old
processingrow (within the 10-minute window) and verifies neither reclaim nor handler execution occurs.
623-671: LGTM! Store unavailable test validates 503 failure mode.Critical regression test ensuring that when the
processed_webhooksstore is unreachable, the handler returns 503 (triggering Stripe retries) rather than incorrectly acknowledging the event.
| try: | ||
| updated = await db.update( | ||
| "processed_webhooks", | ||
| data={"status": "processing", "claimed_at": "now()"}, | ||
| filters={ | ||
| "event_id": event_id, | ||
| "claimed_at": claimed_at_raw, | ||
| "status": "processing", | ||
| }, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for how the db wrapper handles update data values
# to verify if "now()" is treated as a SQL expression
# Find the db.py file and check update method implementation
fd -t f 'db.py' --exec cat {}
# Also check if there are any tests that verify "now()" handling
rg -n "now\(\)" --type py -C 3Repository: Gradata/gradata
Length of output: 22208
🏁 Script executed:
#!/bin/bash
# Search for processed_webhooks table definition/schema
rg -n "processed_webhooks" --type py -A 5 -B 5
# Search for table creation SQL or migrations
fd -t f -e sql -e py | xargs grep -l "processed_webhooks" 2>/dev/null || echo "No SQL files found"
# Look for tests related to webhook processing
fd -t f 'test*.py' | xargs grep -l "processed_webhooks" 2>/dev/null || echo "No tests found"Repository: Gradata/gradata
Length of output: 22184
🏁 Script executed:
#!/bin/bash
# Read the migration file for processed_webhooks table
cat cloud/migrations/005_processed_webhooks.sqlRepository: Gradata/gradata
Length of output: 2276
🏁 Script executed:
#!/bin/bash
# Search for other timestamp updates in the codebase
rg -n "processed_at|claimed_at|updated_at|created_at" cloud/app/routes/billing.py -A 2 -B 2
# Look for any examples of timestamp handling in updates
rg -n "DEFAULT now()" cloud/migrations/ -A 3 -B 3
# Search for how timestamps are normally passed in data updates
rg -n "data=\{" cloud/app/routes/ -A 5 | head -60Repository: Gradata/gradata
Length of output: 12820
Replace "now()" string literals with properly formatted timestamps.
The reclaim UPDATE at line 306 and completion UPDATE at line 362 pass "now()" as a string literal to db.update(). Since SupabaseClient.update() sends JSON directly to PostgREST without special function call handling, PostgreSQL will receive the literal string 'now()' instead of the current timestamp, breaking the staleness window logic.
Use the pattern established elsewhere in the codebase (e.g., cloud/app/routes/sync.py:205):
from datetime import datetime, timezone
data={"status": "processing", "claimed_at": datetime.now(timezone.utc).isoformat()}Apply the same fix to line 362's processed_at update.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cloud/app/routes/billing.py` around lines 303 - 312, The updates to the
processed_webhooks row currently pass the literal string "now()" (in the
db.update calls that set claimed_at and processed_at), which results in the
string 'now()' being stored; change these to real ISO timestamps by importing
datetime and timezone and setting claimed_at and processed_at to
datetime.now(timezone.utc).isoformat() when building the data dicts for the
db.update calls (refer to the db.update calls that set
data={"status":"processing","claimed_at":...} and the later update that sets
processed_at); ensure the datetime import is added at the top of the module.
) * feat(cloud): GDPR endpoints (/me/export, /me/delete, /me/data-summary) Adds GDPR Article 15 (access) and Article 17 (erasure) endpoints plus a data summary endpoint for account-settings UIs. Soft-delete cascades via migration 005 — actual row purge will run as a nightly Railway cron after the 30-day grace period (TODO tracked in the SQL). - GET /me/export: returns inline JSON payload (<10MB), rate-limited to 1 call/24h via gdpr_export_requests ledger table - POST /me/delete: 202 Accepted, tombstones user + owned workspaces + owned brains, logs a confirmation email (ESP wire-up pending) - GET /me/data-summary: counts + oldest/newest timestamps - auth.py + users.py now filter soft-deleted rows so tombstoned users cannot authenticate and tombstoned workspaces disappear from profile - 13 new tests (UUID fixtures, no PII), 128 total cloud tests passing * feat(marketing): DPA, SLA, and subprocessors pages (draft) Adds three enterprise/GDPR compliance pages requested on Day 1 by enterprise procurement teams. All three are marked DRAFT — pending legal review at the top of the page so no one mistakes them for counsel-reviewed contracts. - /legal/dpa: GDPR Article 28 clauses (definitions, scope, security, sub-processors, data-subject rights, breach notification, SCCs) - /legal/sla: 99.9% uptime target, service-credit table, support response matrix by severity, incident communication policy - /legal/subprocessors: Railway, Supabase, Stripe, Cloudflare, Sentry with purpose, data categories, and location per subprocessor - Footer now links DPA/SLA/Subprocessors alongside Privacy + Terms * feat: security.txt (RFC 9116) + incident response runbook - marketing/public/.well-known/security.txt: contact security@gradata.ai, 90-day coordinated-disclosure window, scoped to gradata.ai apex + api + app + SDK repo, expires 2027-04-13 (12 months out). - cloud/RUNBOOK-INCIDENT.md: step-by-step for the top 5 incidents we expect — api.gradata.ai 500s, Railway outage, Supabase pool exhaustion, Stripe webhook backlog, GDPR endpoint misbehavior. Includes severity table, communication templates, and escalation contacts. * chore(cloud): rename migration 005_soft_delete -> 006_soft_delete PR #33 (API security hardening) introduced 005_processed_webhooks.sql. PR #33 has been merged and 005_processed_webhooks already applied to prod Supabase. Renaming this migration to 006 to resolve the collision. Co-Authored-By: Gradata <noreply@gradata.ai> * chore: update docstring reference to 006_soft_delete.sql Co-Authored-By: Gradata <noreply@gradata.ai> * chore: address CodeRabbit feedback on PR #36 - Draft DPA + subprocessors pages: noindex until legal sign-off (Major) - DPA: soften deletion language to match implementation — nightly purge cron is still out of scope, so we now describe the 30-day window as a target via the scheduled purge workflow (Major) - Runbook: remove destructive `DELETE FROM gdpr_export_requests` instruction — the DSR ledger is an audit trail and must stay immutable. Replaced with an operator-approved, time-boxed override procedure requiring approver + reason + follow-up ticket (Major) - GDPR export: shrink TOCTOU window by inserting the rate-limit ledger row BEFORE generating the payload, not after (Minor) - security.txt: "open source" -> "open-source" compound adjective (Minor) Declined (with reason): - N+1 queries in data-summary / export / delete: the current PostgREST `db.select` wrapper only supports `eq` filters, no `IN` and no transactions. Batching needs a db-layer change tracked separately. - Test fixture defensive defaults: MockSupabaseClient already uses defaultdict, so unregistered tables return [] — no fix needed. - Supabase admin email fallback in delete_me: no admin client wired yet; current defensive None-handling is sufficient and the confirmation email itself is still a stub. Co-Authored-By: Gradata <noreply@gradata.ai> * chore: address CodeRabbit round-2 feedback on PR #36 - gdpr.py: idempotency guard on /me/delete — repeat calls return existing ledger state instead of resetting the 30-day purge window or re-cascading. Reuses the pre-cascade users fetch for the email lookup (one fewer round-trip). - 006_soft_delete.sql: wrap policy creation in DROP POLICY IF EXISTS so the migration is re-runnable (Postgres has no CREATE POLICY IF NOT EXISTS). - subprocessors/page.tsx: hyphenate "open-source" as a compound adjective. - Add inline notes flagging the tracked follow-ups for the previously declined items (cascade N+1 needs db.select IN-support; admin email needs ESP wired before _send_deletion_confirmation can dispatch). - New test: test_delete_is_idempotent_when_already_soft_deleted. Co-Authored-By: Gradata <noreply@gradata.ai> * chore: address CodeRabbit round-3 feedback on PR #36 (rate-limit query) Flag the _enforce_export_rate_limit full-scan as a db-layer follow-up. SupabaseClient.select currently only accepts eq filters — no order_by / limit support — so narrowing the query requires extending the wrapper. Leaving a TODO to track rather than scoping the db-layer change into this PR. * chore: address CodeRabbit round-4 on PR #36; reorder cascade + best-effort rollback - Guard /me/data-summary and /me/export against soft-deleted accounts (JWT stays valid until TTL; add users.deleted_at IS NULL check, 404 on deleted). - Don't burn the 24h export quota on failed exports: move the gdpr_export_requests ledger insert to AFTER size check passes so 507 oversized-payload rejections don't block the user from retrying. - Strip raw user_id and email from GDPR info logs (log counts/has_email only) since these logs are long-lived and compliance-sensitive. - Reorder soft-delete cascade to tombstone subordinate resources (workspaces, brains) BEFORE the users row, so mid-cascade failures leave the account recoverable rather than orphaned-but-tombstoned. Wrap in try/except with best-effort rollback of partial subordinate updates and raise HTTP 500 so the client retries. Proper transactional support is tracked as a follow-up (requires db-wrapper changes). - Add tests for: soft-deleted-user read block, ledger-row-only-on-success, cascade subordinate-first ordering, mid-cascade rollback. Co-Authored-By: Gradata <noreply@gradata.ai> * refactor(cloud): simplify pass on gdpr-compliance branch - cloud/app/routes/gdpr.py: extract _collect_brain_children() to dedupe the four-table per-brain fetch (used by both /me/data-summary and /me/export), and parallelize each table x brain select via asyncio.gather. Merge _min_created_at / _max_created_at into a single _created_at_range() helper. Compliance-critical preserved untouched: _require_active_user 404 guard, ledger-insert-after-size-check, PII-scrubbed logs, cascade reorder + best-effort rollback. - marketing/src/components/LegalPageHeader.tsx (new): shared header + LegalDraftBanner for the three new legal pages (DPA, SLA, subprocessors) — drops ~60 lines of duplicated markup. - All 135 cloud tests pass. ruff/pyright clean vs pre-existing baseline. - Marketing `next build` green; all five /legal/* routes prerender. Co-Authored-By: Gradata <noreply@gradata.ai> * fix(review): address CodeRabbit feedback round 5 on GDPR Collapse GDPR soft-delete cascade from N+1 per-row PATCHes to 3 bulk PATCHes using PostgREST id=in.(...). This addresses the recurring CodeRabbit comment about the cascade being non-atomic: we still can't use a true transaction (SupabaseClient has no transactional API), but we shrink the mid-cascade failure window to three single round-trips and preserve the subordinates-first ordering + best-effort rollback. - SupabaseClient.update now accepts list/tuple/set filter values and emits PostgREST in.(...) syntax. Empty list filters are a no-op (return []) so we never fire an unfiltered PATCH that would rewrite every row in the table. - gdpr.delete_me cascades via 3 writes (workspaces, brains, users) instead of N+1 and rolls back with at most 2 bulk PATCHes on failure. - MockSupabaseClient.update mirrors the empty-list-filter no-op so tests reflect prod semantics. - test_delete_rolls_back_partial_cascade_on_mid_flight_failure updated to assert bulk-filter rollback shape. 135/135 cloud tests pass. * fix(legal): address round-6 CodeRabbit review - SLA page: noindex/nofollow while draft (robots metadata) - SLA tables: add <caption> + scope="col" for a11y - DPA: use full /api/v1/me/export and /api/v1/me/delete paths Co-Authored-By: Gradata <noreply@gradata.ai> * fix(db): refuse unfiltered update/delete (CR round-7) CodeRabbit flagged that db.update() could issue a full-table PATCH when filters=None or {}. Now raises HTTPException 500 with a scrubbed message before touching PostgREST. Applied the same guard to db.delete() for defense-in-depth. Mock client mirrors the new contract so tests exercise the guard path. Co-Authored-By: Gradata <noreply@gradata.ai> * fix(cloud/db): defense-in-depth guard + tests for unfiltered PATCH/DELETE Round-9 response to CodeRabbit critical finding on PR #36. The primary guard (`if not filters: raise`) was already present from earlier rounds; this commit adds: - Post-transform empty-params guard in delete() mirroring update() - Dedicated test_db_guards.py with 5 tests proving update() and delete() never fire an HTTP call when filters is None or {}, using a fake http client that asserts on any call --------- Co-authored-by: Oliver Le <oliver@gradata.com> Co-authored-by: Gradata <noreply@gradata.ai>
Summary
Closes three production gaps on the Gradata Cloud FastAPI backend. Three atomic commits, each independently revertible.
slowapi— three buckets (public / authenticated / sensitive) covering/health,/ready,/billing/*,/billing/webhook. Limits are env-configurable; defaults: 60/min public, 300/min per user, 10/min sensitive.profiles_sample_rate=0.1, tags events withuser.id+workspace_idviatag_user()called from every auth dependency, and scrubsemail/email_address/user_emailfrom event extras.createdolder than 5 minutes), idempotency (skip already-processedevent.ids via newprocessed_webhookstable + migration 005). Signature verification was already active.Test plan
cd cloud && pytest-> 128 passedmigrations/005_processed_webhooks.sqlin Supabase before enabling webhook hardening in prod (idempotency degrades gracefully without it — logged warning, no 5xx).Env vars added (all optional, all in RAILWAY-ENV.md)
GRADATA_RATE_LIMIT_ENABLED(default: on in prod, off in tests)GRADATA_RATE_LIMIT_PUBLIC/_AUTHENTICATED/_SENSITIVEGRADATA_SENTRY_PROFILES_SAMPLE_RATE(default0.1)Backwards-compat
processed_webhookstable during rollout (logs warning, processes event).Generated with Gradata