Skip to content

Refactor S3 bucket configuration and centralize shared data management#129

Merged
jfrench9 merged 2 commits into
mainfrom
refactor/s3-buckets
Jan 4, 2026
Merged

Refactor S3 bucket configuration and centralize shared data management#129
jfrench9 merged 2 commits into
mainfrom
refactor/s3-buckets

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Jan 4, 2026

Summary

This PR refactors the S3 bucket configuration across the entire codebase and introduces a centralized shared data management system. The changes improve consistency, maintainability, and configuration management for storage resources.

Key Accomplishments

Infrastructure Changes

  • Redesigned S3 bucket structure: Significantly refactored CloudFormation templates with enhanced bucket configurations and improved resource organization
  • Centralized configuration management: Added new shared data configuration module to standardize data access patterns across services
  • Enhanced environment setup: Updated local development environment initialization with improved S3 bucket provisioning
  • Cross-service integration: Updated CloudFormation templates for API, CloudTrail, Dagster, and Graph Ladybug services to align with new bucket structure

Code Improvements

  • Unified storage interfaces: Refactored Dagster storage resources and DuckDB pool configurations for consistent S3 integration
  • Simplified service dependencies: Streamlined graph subgraph service implementation with cleaner data access patterns
  • Enhanced secrets management: Improved secrets manager functionality with better error handling and configuration validation
  • Comprehensive testing updates: Updated test suites to reflect new configuration patterns and shared data structures

Configuration Management

  • Environment variables: Updated example configuration with new required variables for improved local development setup
  • Modular architecture: Introduced centralized shared data module (robosystems/config/shared_data.py) for consistent data management across components

Breaking Changes

  • Environment variable structure has been updated - teams will need to update their local .env files based on the new .env.example
  • S3 bucket naming conventions and access patterns have changed, requiring infrastructure redeployment
  • Secrets manager configuration interface has been modified

Testing Notes

  • All existing tests have been updated to work with the new configuration structure
  • Test coverage maintained for secrets manager and graph operations
  • Local development environment setup process has been enhanced for better S3 integration

Infrastructure Considerations

  • Infrastructure deployment required due to S3 bucket configuration changes
  • Local development environments will need reinitialization to set up new bucket structure
  • Configuration updates needed for all deployment environments to align with new variable structure

The refactoring maintains backward compatibility where possible while providing a more robust and maintainable foundation for storage operations across the platform.


🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/s3-buckets
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Introduced a new S3 bucket structure for shared data, including shared raw and processed buckets, and user data buckets, to enhance organization and clarity.
- Updated environment configuration to compute bucket names based on the environment, removing reliance on secrets for bucket names.
- Deprecated old bucket aliases for backward compatibility during migration, ensuring a smooth transition to the new structure.
- Removed unused backup functionality from the SubgraphService, streamlining the deletion process for subgraph databases.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Jan 4, 2026

@claude please review this PR

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@claude
Copy link
Copy Markdown

claude Bot commented Jan 4, 2026

Claude finished @jfrench9's task in 2m 1s —— View job


Code Review Complete

Todo List

  • Review new shared data configuration module (robosystems/config/shared_data.py)
  • Examine CloudFormation S3 template changes (cloudformation/s3.yaml)
  • Review environment variable updates (robosystems/config/env.py and .env.example)
  • Check secrets manager refactoring (robosystems/config/secrets_manager.py)
  • Review service integration changes (Dagster, DuckDB, subgraph service)
  • Verify test updates are comprehensive
  • Check for security issues and best practices violations
  • Post comprehensive review feedback

Summary

This PR successfully refactors the S3 bucket configuration across the codebase and introduces a centralized shared data management system. The changes are well-structured and follow solid software engineering principles. I've identified several important findings below.

Key Strengths

  1. Excellent architecture - The new shared_data.py module provides a clean, type-safe interface for managing shared data sources
  2. Comprehensive CloudFormation changes - The S3 bucket restructure is well-documented with clear migration path
  3. Strong backward compatibility - Deprecated bucket aliases prevent breaking existing code during migration
  4. Thorough test coverage - Tests updated comprehensively for both secrets manager and subgraph service
  5. Good security practices - Bucket names computed from environment rather than stored in secrets

Critical Issues

1. Missing Shared Data Bucket Access in DuckDB Pool ⚠️

File: robosystems/graph_api/core/duckdb/pool.py:256-314

The DuckDB connection pool configuration accesses S3 but doesn't include logic to handle the new shared data buckets. The configuration should be updated to ensure it can access both:

  • SHARED_RAW_BUCKET (robosystems-shared-raw-{env})
  • SHARED_PROCESSED_BUCKET (robosystems-shared-processed-{env})

Current code only configures generic S3 access. Consider adding explicit bucket validation or configuration for the new shared buckets to prevent runtime access errors.

Recommendation: Add a test case to verify DuckDB can read from the new shared buckets, especially for SEC data ingestion workflows.

2. Inconsistent S3 Endpoint Configuration in DuckDB

File: robosystems/graph_api/core/duckdb/pool.py:293-301

The endpoint configuration strips http:// and https:// prefixes but doesn't handle edge cases:

endpoint = env.AWS_ENDPOINT_URL.replace("http://", "").replace("https://", "")

Issue: This will fail if AWS_ENDPOINT_URL is not set (trying to call .replace() on None).

Recommendation: Add a guard:

if env.AWS_ENDPOINT_URL:
    endpoint = env.AWS_ENDPOINT_URL.replace("http://", "").replace("https://", "")
    conn.execute("SET s3_endpoint=?", [endpoint])

Major Observations

3. CloudFormation Template Complexity

File: cloudformation/s3.yaml

The template maintains both new and deprecated buckets, increasing complexity:

  • Lines 59-189: Deprecated buckets (kept for migration)
  • Lines 190-320: New bucket structure

Observation: The PR description mentions this is "Phase 1" with a planned "Phase 2 Cleanup." This is good architectural thinking.

Recommendation: Create a GitHub issue to track Phase 2 cleanup with:

  • Timeline for deprecation
  • Data migration verification checklist
  • Deprecated resource removal plan

4. Environment Variable Naming Convention

File: robosystems/config/env.py:690-715

The new environment variables use different naming patterns:

  • SHARED_RAW_BUCKET (new pattern)
  • SEC_RAW_BUCKET (deprecated alias)
  • AWS_S3_BUCKET (deprecated alias)

Observation: Good use of aliases for backward compatibility, but the deprecation strategy should be documented.

Recommendation: Add deprecation warnings or comments in code where deprecated variables are used, with timeline for removal.

5. Secrets Manager Bucket Logic Change

File: robosystems/config/secrets_manager.py:172-197

Major Change: Bucket names are now computed from environment rather than fetched from secrets:

def get_s3_buckets(self) -> dict[str, str]:
    suffix = "" if self.environment == "dev" else f"-{self.environment}"
    buckets = {
        "shared_raw": f"robosystems-shared-raw{suffix}",
        "shared_processed": f"robosystems-shared-processed{suffix}",
        # ...
    }
    return buckets

Observation: This is an excellent architectural decision because:

  • ✅ Reduces secrets stored in AWS Secrets Manager
  • ✅ Makes bucket names predictable and consistent
  • ✅ Eliminates a potential point of failure (secret fetch)

Note: The comment at line 639 in s3.yaml confirms this design decision.

Minor Issues

6. LocalStack Initialization Script Bucket Duplication

File: bin/setup/localstack-init.sh:14-28

The new shared buckets are created, but there's also creation of legacy buckets (lines 37-73).

Observation: This is intentional for local dev compatibility, but could be confusing.

Recommendation: Add a comment explaining why both sets exist in local development.

7. Documentation Gaps in shared_data.py

File: robosystems/config/shared_data.py:1-205

The module has excellent docstrings, but missing:

  • Usage examples for the URI builder functions
  • Explanation of when to use get_raw_uri vs get_raw_key

Recommendation: Add a module-level docstring with usage examples:

"""
Usage Examples:
    # Get S3 URI for SEC raw data
    from robosystems.config import env
    from robosystems.config.shared_data import get_raw_uri, DataSourceType
    
    uri = get_raw_uri(
        bucket=env.SHARED_RAW_BUCKET,
        source=DataSourceType.SEC,
        "year=2024", "AAPL", "filing.xml"
    )
    # Returns: 's3://robosystems-shared-raw/sec/year=2024/AAPL/filing.xml'
"""

8. Test Coverage - Missing Integration Test

File: tests/config/test_secrets_manager.py

Tests cover the bucket computation logic well, but there's no integration test verifying:

  1. DuckDB can access new shared buckets
  2. Dagster storage resources work with new buckets
  3. Graph API can read from shared processed bucket

Recommendation: Add integration tests in tests/integration/ that verify end-to-end S3 access patterns with the new bucket structure.

Security Review

No security issues identified

The refactoring actually improves security by:

  1. Removing bucket names from secrets (computed from environment)
  2. Maintaining proper IAM policy structure in CloudFormation
  3. Keeping encryption enabled on all buckets
  4. Preserving public access blocks on private buckets

Breaking Changes Validation

The PR description correctly identifies breaking changes:

  • ✅ Environment variable structure updated
  • ✅ S3 bucket naming conventions changed
  • ✅ Secrets manager configuration interface modified

Recommendation: Ensure deployment documentation includes:

  1. .env file update instructions
  2. Infrastructure redeployment checklist
  3. Rollback procedure if issues arise

Code Quality Assessment

Aspect Rating Notes
Architecture ⭐⭐⭐⭐⭐ Excellent separation of concerns with shared_data.py
Testing ⭐⭐⭐⭐ Comprehensive unit tests, minor integration gaps
Documentation ⭐⭐⭐⭐ Good inline docs, could use more examples
Backward Compat ⭐⭐⭐⭐⭐ Excellent use of deprecated aliases
Security ⭐⭐⭐⭐⭐ Actually improves security posture
Maintainability ⭐⭐⭐⭐ Clear structure, but Phase 2 cleanup needed

Recommendations for Merge

Blocking (Fix Before Merge):

  1. ✅ Add guard for AWS_ENDPOINT_URL in DuckDB pool configuration (Issue Fix MCP tools query results to include units and improve dimensions #2)

Strongly Recommended (Fix Soon After):
2. Add integration tests for new bucket access patterns
3. Create GitHub issue for Phase 2 cleanup with timeline
4. Add usage examples to shared_data.py docstring

Nice to Have:
5. Add deprecation warnings for old environment variables
6. Document bucket migration procedure in docs/
7. Add comment to LocalStack script explaining dual bucket structure

Final Verdict

Recommendation: Approve with minor fixes

This is a well-executed refactoring that improves the codebase architecture. The main blocking issue (#2) is a simple guard addition. The architectural decisions are sound, and the backward compatibility approach is exemplary.

The code quality is high, tests are comprehensive, and the security posture is improved. Once the DuckDB endpoint guard is added, this PR is ready to merge.

Great work on this refactoring! The centralized shared data management system will make future data source integrations much cleaner.


@jfrench9 jfrench9 merged commit 9a1947a into main Jan 4, 2026
6 checks passed
@jfrench9 jfrench9 linked an issue Jan 4, 2026 that may be closed by this pull request
19 tasks
@jfrench9 jfrench9 deleted the refactor/s3-buckets branch January 4, 2026 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SEC Shared Repository Production Launch

1 participant