Skip to content

Implement Cloudflare R2 support for zero-egress shared repo downloads#438

Merged
jfrench9 merged 1 commit into
mainfrom
feature/r2-shared-repo-downloads
Mar 7, 2026
Merged

Implement Cloudflare R2 support for zero-egress shared repo downloads#438
jfrench9 merged 1 commit into
mainfrom
feature/r2-shared-repo-downloads

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Mar 7, 2026

Summary

Introduces Cloudflare R2 as the storage backend for shared repository downloads, replacing the previous S3-only backup pipeline. This migration enables zero-egress cost downloads by leveraging Cloudflare R2's free egress pricing model, significantly reducing infrastructure costs for serving shared repository data.

Key Changes

New R2 Publishing Pipeline

  • Added a new r2_publish module for publishing shared repository artifacts to Cloudflare R2
  • Extended the Dagster shared repository publish assets with R2 upload capabilities
  • Added R2-specific environment configuration variables (account ID, access keys, bucket settings)

Storage & Configuration

  • Introduced new storage graph configuration for R2 connectivity (config/storage/graph.py)
  • Extended environment configuration (config/env.py) with R2 credential and endpoint settings
  • Updated billing configuration to accommodate the new storage tier

Backup Pipeline Refactor

  • Removed the legacy backup.py module and its associated pipeline configs that handled the previous backup approach
  • Refactored pipeline jobs to align with the new R2-based publishing workflow
  • Updated the backup service, backup manager, and database models to work with R2-hosted artifacts
  • Simplified the pipeline __init__.py exports and module structure

API & Service Layer Updates

  • Modified the backup router and backup service to support R2 download URLs
  • Updated the database model to reflect the new storage location schema

Breaking Changes

  • Removed backup.py pipeline module: Any downstream code or jobs relying on robosystems.adapters.sec.pipeline.backup will break. The functionality has been replaced by the R2 publish pipeline.
  • Removed pipeline backup configs: The BackupConfig and related configuration classes have been removed from configs.py.
  • Download URL format changes: Backup/download endpoints now return R2-based URLs instead of S3 presigned URLs, which may affect clients consuming these endpoints.
  • New required environment variables: R2 credentials and endpoint configuration must be set for the system to function correctly (see .env.example).

Testing

  • Removed obsolete tests for the deleted backup module (test_backup.py) and associated config tests (test_configs.py)
  • Updated pipeline init and job tests to reflect the new module structure
  • Updated download rate limit middleware tests to account for the new storage backend behavior
  • All remaining tests have been adjusted to align with the refactored architecture

Infrastructure Considerations

  • New environment variables required: Cloudflare R2 account ID, access key ID, secret access key, and bucket name must be provisioned in all deployment environments
  • Cloudflare R2 bucket setup: A properly configured R2 bucket with appropriate CORS and access policies must be created prior to deployment
  • S3 compatibility: R2 uses an S3-compatible API, so existing S3 client libraries are reused with modified endpoint configuration
  • Cost impact: This change is expected to significantly reduce egress costs by leveraging R2's zero-egress pricing for download traffic
  • Migration path: Existing S3-hosted artifacts may need to be migrated or dual-served during the transition period

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/r2-shared-repo-downloads
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added configuration options for Cloudflare R2 in `.env.example` and `EnvConfig`.
- Introduced new asset `sec_lbug_r2_published` to publish SEC database to R2.
- Updated pipeline jobs to include `sec_lbug_r2_publish_job` for R2 publishing.
- Removed the legacy backup functionality and associated configurations.
- Enhanced backup service to support R2 uploads and updated relevant models and routers.
- Adjusted billing plans to reflect zero-egress download capabilities.
- Updated tests to remove references to the deprecated backup functionality.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Mar 7, 2026

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Mar 7, 2026

Claude finished @jfrench9's task in 3m 12s —— View job


PR Review: Implement Cloudflare R2 support for zero-egress shared repo downloads

Tasks:

  • Understand the request
  • Read changed files
  • Analyze for bugs, security issues, and best practices
  • Post comprehensive review feedback

Summary

This is a well-structured PR that adds Cloudflare R2 as a zero-egress download backend while keeping S3 for the replica fleet. The architecture is clean — thin asset wrapper → shared publish_to_r2() helper → on-instance OnInstanceBackupService — and the overall implementation follows the existing patterns. Below are issues ranging from stale docs to missing safety checks.


Issues

1. Stale field description in BackupRequest model

robosystems/graph_api/models/database.py:82 — the regex pattern was updated to include r2_download, but the description still lists only 4 types:

description="Backup type: 'standard' (ZIP to S3), 'replica' (raw .lbug to S3), 'shared_repository' (tar.gz to S3), 'duckdb_staging' (raw .duckdb to S3)",
pattern="^(standard|replica|shared_repository|duckdb_staging|r2_download)$",

r2_download is now accepted by the validator but invisible in the API docs. The s3_destination field description on line 87 also still says "required for replica/shared_repository types" — omitting r2_download. Fix this →


2. _upload_replica logs and returns s3:// URIs for R2 uploads

robosystems/graph_api/core/backup_service.py:253 — the function is reused for R2 via a client override, but still logs s3://:

logger.info(f"[Task {task_id}] Uploading {db_path.name} to s3://{bucket}/{key}")

And the return dict always has "s3_uri":

return {
    "status": "success",
    "s3_uri": f"s3://{bucket}/{key}",   # misleading for R2 uploads
    ...
}

When backup_type == "r2_download", these should use r2:// so operators don't misread the logs. Fix this →


3. No startup validation for partially configured R2 credentials

robosystems/config/validation.py has no R2 checks. If R2_ENDPOINT_URL is set but R2_ACCESS_KEY_ID or R2_SECRET_ACCESS_KEY is empty (or vice versa), the service starts fine and then fails at the first publish with an opaque boto3 auth error. A simple guard in validation.py would surface misconfiguration at startup:

# Pseudocode
if env.R2_ENDPOINT_URL and not (env.R2_ACCESS_KEY_ID and env.R2_SECRET_ACCESS_KEY):
    issues.append("R2_ENDPOINT_URL is set but R2 credentials are missing")

Fix this →


4. Missing type annotation on db_resource in publish_to_r2()

robosystems/dagster/assets/shared_repositories/publish.py:263 — the project uses basedpyright and the other publish functions (publish_to_s3, publish_duckdb_to_s3) don't take a DB resource, so this stands out:

def publish_to_r2(
    context: AssetExecutionContext,
    graph_id: str,
    db_resource,         # no type annotation
) -> MaterializeResult:

db_resource should be typed as DatabaseResource. Fix this →


5. Redundant head_object verification in publish_to_r2()

robosystems/dagster/assets/shared_repositories/publish.py:309-314 — after _run_r2_backup() returns, a second head_object is called to get file_size. However, OnInstanceBackupService._upload_replica() already calls head_object internally to verify the upload (backup_service.py:292). The second call could be avoided if backup_with_sse propagated uploaded_size_bytes from the task result (which _upload_replica does include in its return dict). Not a correctness bug, but doubles R2 API calls per publish for large files.


6. No new tests for the R2 publish path

test_backup.py was correctly removed (legacy module deleted), but no new tests were added for:

  • publish_to_r2() in shared_repositories/publish.py
  • _upsert_r2_backup_record()
  • get_r2_download_key() in config/storage/graph.py
  • OnInstanceBackupService R2 routing (backup_type == "r2_download" path)

The rate limit tests were well updated, but the core publish logic has no coverage. Given that this path runs from Dagster and directly mutates GraphBackup records, even a basic unit test with a mock session would add meaningful confidence.


Minor Notes

  • config/storage/graph.py docstring: The module-level docstring documents the S3 path structure but does not include the R2 downloads/{graph_id}/{graph_id}.lbug layout. Worth adding for consistency.
  • get_r2_config() returns {} when unconfigured: callers that skip the emptiness check (like the head_object call at publish.py:311) would silently target AWS S3 instead of R2 if R2_ENDPOINT_URL is unset but R2_BUCKET_NAME is set. The current code is safe because _run_r2_backup_get_r2_client() raises before reaching that line, but the dependency on execution order is fragile. An early guard in publish_to_r2() would be more explicit.
  • sec_lbug_r2_published asset has deps=["sec_graph_materialized"]: This means materialization will always attempt to trigger R2 publish via lineage. If R2 credentials aren't yet provisioned in staging, this will fail every materialization run. Consider gating the asset on env.R2_ENDPOINT_URL the same way the dev environment is skipped, or making R2 a separate opt-in job only.

Verdict

The overall approach is solid and the refactoring is clean. The critical-path code (R2 backup routing, credential injection, S3-compatible client) looks correct. The main things worth addressing before merge are the stale API model description (#1), the missing startup validation (#3), and at minimum one test for the new publish_to_r2() path (#6). The s3:// log mislabeling (#2) is a low-priority quality issue.
|

@jfrench9 jfrench9 merged commit 6fdf5b7 into main Mar 7, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/r2-shared-repo-downloads branch March 10, 2026 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant