Refactor S3 bucket configuration and centralize shared data management by jfrench9 · Pull Request #129 · RoboFinSystems/robosystems

jfrench9 · 2026-01-04T04:01:14Z

Summary

This PR refactors the S3 bucket configuration across the entire codebase and introduces a centralized shared data management system. The changes improve consistency, maintainability, and configuration management for storage resources.

Key Accomplishments

Infrastructure Changes

Redesigned S3 bucket structure: Significantly refactored CloudFormation templates with enhanced bucket configurations and improved resource organization
Centralized configuration management: Added new shared data configuration module to standardize data access patterns across services
Enhanced environment setup: Updated local development environment initialization with improved S3 bucket provisioning
Cross-service integration: Updated CloudFormation templates for API, CloudTrail, Dagster, and Graph Ladybug services to align with new bucket structure

Code Improvements

Unified storage interfaces: Refactored Dagster storage resources and DuckDB pool configurations for consistent S3 integration
Simplified service dependencies: Streamlined graph subgraph service implementation with cleaner data access patterns
Enhanced secrets management: Improved secrets manager functionality with better error handling and configuration validation
Comprehensive testing updates: Updated test suites to reflect new configuration patterns and shared data structures

Configuration Management

Environment variables: Updated example configuration with new required variables for improved local development setup
Modular architecture: Introduced centralized shared data module (robosystems/config/shared_data.py) for consistent data management across components

Breaking Changes

Environment variable structure has been updated - teams will need to update their local .env files based on the new .env.example
S3 bucket naming conventions and access patterns have changed, requiring infrastructure redeployment
Secrets manager configuration interface has been modified

Testing Notes

All existing tests have been updated to work with the new configuration structure
Test coverage maintained for secrets manager and graph operations
Local development environment setup process has been enhanced for better S3 integration

Infrastructure Considerations

Infrastructure deployment required due to S3 bucket configuration changes
Local development environments will need reinitialization to set up new bucket structure
Configuration updates needed for all deployment environments to align with new variable structure

The refactoring maintains backward compatibility where possible while providing a more robust and maintainable foundation for storage operations across the platform.

🤖 Generated with Claude Code

Branch Info:

Source: refactor/s3-buckets
Target: main
Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Introduced a new S3 bucket structure for shared data, including shared raw and processed buckets, and user data buckets, to enhance organization and clarity. - Updated environment configuration to compute bucket names based on the environment, removing reliance on secrets for bucket names. - Deprecated old bucket aliases for backward compatibility during migration, ensuring a smooth transition to the new structure. - Removed unused backup functionality from the SubgraphService, streamlining the deletion process for subgraph databases.

…/s3-buckets

jfrench9 · 2026-01-04T04:01:15Z

@claude please review this PR

chatgpt-codex-connector · 2026-01-04T04:01:20Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

claude · 2026-01-04T04:01:32Z

Claude finished @jfrench9's task in 2m 1s —— View job

Code Review Complete

Todo List

Review new shared data configuration module (robosystems/config/shared_data.py)
Examine CloudFormation S3 template changes (cloudformation/s3.yaml)
Review environment variable updates (robosystems/config/env.py and .env.example)
Check secrets manager refactoring (robosystems/config/secrets_manager.py)
Review service integration changes (Dagster, DuckDB, subgraph service)
Verify test updates are comprehensive
Check for security issues and best practices violations
Post comprehensive review feedback

Summary

This PR successfully refactors the S3 bucket configuration across the codebase and introduces a centralized shared data management system. The changes are well-structured and follow solid software engineering principles. I've identified several important findings below.

Key Strengths

Excellent architecture - The new shared_data.py module provides a clean, type-safe interface for managing shared data sources
Comprehensive CloudFormation changes - The S3 bucket restructure is well-documented with clear migration path
Strong backward compatibility - Deprecated bucket aliases prevent breaking existing code during migration
Thorough test coverage - Tests updated comprehensively for both secrets manager and subgraph service
Good security practices - Bucket names computed from environment rather than stored in secrets

Critical Issues

1. Missing Shared Data Bucket Access in DuckDB Pool ⚠️

File: robosystems/graph_api/core/duckdb/pool.py:256-314

The DuckDB connection pool configuration accesses S3 but doesn't include logic to handle the new shared data buckets. The configuration should be updated to ensure it can access both:

SHARED_RAW_BUCKET (robosystems-shared-raw-{env})
SHARED_PROCESSED_BUCKET (robosystems-shared-processed-{env})

Current code only configures generic S3 access. Consider adding explicit bucket validation or configuration for the new shared buckets to prevent runtime access errors.

Recommendation: Add a test case to verify DuckDB can read from the new shared buckets, especially for SEC data ingestion workflows.

2. Inconsistent S3 Endpoint Configuration in DuckDB

File: robosystems/graph_api/core/duckdb/pool.py:293-301

The endpoint configuration strips http:// and https:// prefixes but doesn't handle edge cases:

endpoint = env.AWS_ENDPOINT_URL.replace("http://", "").replace("https://", "")

Issue: This will fail if AWS_ENDPOINT_URL is not set (trying to call .replace() on None).

Recommendation: Add a guard:

if env.AWS_ENDPOINT_URL:
    endpoint = env.AWS_ENDPOINT_URL.replace("http://", "").replace("https://", "")
    conn.execute("SET s3_endpoint=?", [endpoint])

Major Observations

3. CloudFormation Template Complexity

File: cloudformation/s3.yaml

The template maintains both new and deprecated buckets, increasing complexity:

Lines 59-189: Deprecated buckets (kept for migration)
Lines 190-320: New bucket structure

Observation: The PR description mentions this is "Phase 1" with a planned "Phase 2 Cleanup." This is good architectural thinking.

Recommendation: Create a GitHub issue to track Phase 2 cleanup with:

Timeline for deprecation
Data migration verification checklist
Deprecated resource removal plan

4. Environment Variable Naming Convention

File: robosystems/config/env.py:690-715

The new environment variables use different naming patterns:

SHARED_RAW_BUCKET (new pattern)
SEC_RAW_BUCKET (deprecated alias)
AWS_S3_BUCKET (deprecated alias)

Observation: Good use of aliases for backward compatibility, but the deprecation strategy should be documented.

Recommendation: Add deprecation warnings or comments in code where deprecated variables are used, with timeline for removal.

5. Secrets Manager Bucket Logic Change

File: robosystems/config/secrets_manager.py:172-197

Major Change: Bucket names are now computed from environment rather than fetched from secrets:

def get_s3_buckets(self) -> dict[str, str]:
    suffix = "" if self.environment == "dev" else f"-{self.environment}"
    buckets = {
        "shared_raw": f"robosystems-shared-raw{suffix}",
        "shared_processed": f"robosystems-shared-processed{suffix}",
        # ...
    }
    return buckets

Observation: This is an excellent architectural decision because:

✅ Reduces secrets stored in AWS Secrets Manager
✅ Makes bucket names predictable and consistent
✅ Eliminates a potential point of failure (secret fetch)

Note: The comment at line 639 in s3.yaml confirms this design decision.

Minor Issues

6. LocalStack Initialization Script Bucket Duplication

File: bin/setup/localstack-init.sh:14-28

The new shared buckets are created, but there's also creation of legacy buckets (lines 37-73).

Observation: This is intentional for local dev compatibility, but could be confusing.

Recommendation: Add a comment explaining why both sets exist in local development.

7. Documentation Gaps in shared_data.py

File: robosystems/config/shared_data.py:1-205

The module has excellent docstrings, but missing:

Usage examples for the URI builder functions
Explanation of when to use get_raw_uri vs get_raw_key

Recommendation: Add a module-level docstring with usage examples:

"""
Usage Examples:
    # Get S3 URI for SEC raw data
    from robosystems.config import env
    from robosystems.config.shared_data import get_raw_uri, DataSourceType
    
    uri = get_raw_uri(
        bucket=env.SHARED_RAW_BUCKET,
        source=DataSourceType.SEC,
        "year=2024", "AAPL", "filing.xml"
    )
    # Returns: 's3://robosystems-shared-raw/sec/year=2024/AAPL/filing.xml'
"""

8. Test Coverage - Missing Integration Test

File: tests/config/test_secrets_manager.py

Tests cover the bucket computation logic well, but there's no integration test verifying:

DuckDB can access new shared buckets
Dagster storage resources work with new buckets
Graph API can read from shared processed bucket

Recommendation: Add integration tests in tests/integration/ that verify end-to-end S3 access patterns with the new bucket structure.

Security Review

✅ No security issues identified

The refactoring actually improves security by:

Removing bucket names from secrets (computed from environment)
Maintaining proper IAM policy structure in CloudFormation
Keeping encryption enabled on all buckets
Preserving public access blocks on private buckets

Breaking Changes Validation

The PR description correctly identifies breaking changes:

✅ Environment variable structure updated
✅ S3 bucket naming conventions changed
✅ Secrets manager configuration interface modified

Recommendation: Ensure deployment documentation includes:

.env file update instructions
Infrastructure redeployment checklist
Rollback procedure if issues arise

Code Quality Assessment

Aspect	Rating	Notes
Architecture	⭐⭐⭐⭐⭐	Excellent separation of concerns with `shared_data.py`
Testing	⭐⭐⭐⭐	Comprehensive unit tests, minor integration gaps
Documentation	⭐⭐⭐⭐	Good inline docs, could use more examples
Backward Compat	⭐⭐⭐⭐⭐	Excellent use of deprecated aliases
Security	⭐⭐⭐⭐⭐	Actually improves security posture
Maintainability	⭐⭐⭐⭐	Clear structure, but Phase 2 cleanup needed

Recommendations for Merge

Blocking (Fix Before Merge):

✅ Add guard for AWS_ENDPOINT_URL in DuckDB pool configuration (Issue Fix MCP tools query results to include units and improve dimensions #2)

Strongly Recommended (Fix Soon After):
2. Add integration tests for new bucket access patterns
3. Create GitHub issue for Phase 2 cleanup with timeline
4. Add usage examples to shared_data.py docstring

Nice to Have:
5. Add deprecation warnings for old environment variables
6. Document bucket migration procedure in docs/
7. Add comment to LocalStack script explaining dual bucket structure

Final Verdict

Recommendation: Approve with minor fixes ✅

This is a well-executed refactoring that improves the codebase architecture. The main blocking issue (#2) is a simple guard addition. The architectural decisions are sound, and the backward compatibility approach is exemplary.

The code quality is high, tests are comprehensive, and the security posture is improved. Once the DuckDB endpoint guard is added, this PR is ready to merge.

Great work on this refactoring! The centralized shared data management system will make future data source integrations much cleaner.

jfrench9 added 2 commits January 3, 2026 22:00

Merge commit '2b2083ed5c593d1a622724d449739d3d78e3186b' into refactor…

367ad46

…/s3-buckets

jfrench9 merged commit 9a1947a into main Jan 4, 2026
6 checks passed

jfrench9 linked an issue Jan 4, 2026 that may be closed by this pull request

SEC Shared Repository Production Launch #117

Closed

19 tasks

jfrench9 deleted the refactor/s3-buckets branch January 4, 2026 09:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor S3 bucket configuration and centralize shared data management#129

Refactor S3 bucket configuration and centralize shared data management#129
jfrench9 merged 2 commits into
mainfrom
refactor/s3-buckets

jfrench9 commented Jan 4, 2026

Uh oh!

jfrench9 commented Jan 4, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jan 4, 2026

Uh oh!

claude Bot commented Jan 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Jan 4, 2026

Summary

Key Accomplishments

Infrastructure Changes

Code Improvements

Configuration Management

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Jan 4, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jan 4, 2026

Uh oh!

claude Bot commented Jan 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete

Todo List

Summary

Key Strengths

Critical Issues

1. Missing Shared Data Bucket Access in DuckDB Pool ⚠️

2. Inconsistent S3 Endpoint Configuration in DuckDB

Major Observations

3. CloudFormation Template Complexity

4. Environment Variable Naming Convention

5. Secrets Manager Bucket Logic Change

Minor Issues

6. LocalStack Initialization Script Bucket Duplication

7. Documentation Gaps in shared_data.py

8. Test Coverage - Missing Integration Test

Security Review

Breaking Changes Validation

Code Quality Assessment

Recommendations for Merge

Final Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jan 4, 2026 •

edited

Loading