
Add shared replicas infrastructure for graph deployment workflows#233

Merged
jfrench9 merged 2 commits into main from feature/shared-replicas-infra on Jan 23, 2026
Conversation

@jfrench9
Member

Summary

This PR introduces comprehensive shared replicas infrastructure to support distributed graph deployments across multiple environments. The implementation adds new deployment workflows, configuration management, and monitoring capabilities to enable horizontal scaling of graph processing workloads.

Key Accomplishments

  • New Deployment Infrastructure: Added dedicated workflow for managing graph replicas with automated deployment capabilities
  • Enhanced Configuration Management: Integrated shared replica settings into environment configuration with proper validation
  • Expanded Monitoring: Extended SEC sensors with additional monitoring capabilities for replica health and performance tracking
  • Cross-Environment Support: Updated staging and production workflows to support shared replica deployments
  • CloudFormation Integration: Enhanced API infrastructure templates to accommodate shared replica architecture

Infrastructure Considerations

  • New GitHub Actions workflow added specifically for graph replica management (373 lines)
  • Significant enhancements to existing deployment workflows for API and graph services
  • CloudFormation templates updated to support the new replica architecture
  • Dagster definitions expanded to handle replica-aware job scheduling and monitoring
  • Environment-specific configuration added for both staging and production deployments

Testing Notes

  • Verify that existing graph deployment workflows continue to function correctly
  • Test the new replica deployment workflow in staging environment before production rollout
  • Confirm that SEC sensors properly monitor both primary and replica instances
  • Validate that environment configurations are correctly applied across all deployment targets
  • Ensure CloudFormation stack updates deploy successfully without service interruption

Breaking Changes

None. This is an additive change that maintains backward compatibility with existing deployment processes.


🤖 Generated with Claude Code

Branch Info:

  • Source: feature/shared-replicas-infra
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Introduced shared replicas configuration in `deploy-api.yml` to support horizontal read scaling with an optional ALB endpoint URL.
- Created a new workflow `deploy-graph-replicas.yml` for deploying read-only replica fleets for shared repositories.
- Updated `deploy-graph.yml` to include shared replicas parameters and outputs.
- Enhanced `prod.yml` and `staging.yml` to incorporate shared replicas settings from environment variables.
- Added shared repository snapshot sensor in Dagster to trigger snapshots after SEC materialization.
- Updated setup script to initialize shared replicas configuration for production and staging environments.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Jan 23, 2026

Claude finished @jfrench9's task in 2m 33s


PR Review: Shared Replicas Infrastructure

I've completed a comprehensive review of the shared replicas infrastructure changes. This is a well-structured implementation that adds horizontal read scaling capabilities for shared repositories. Below are my findings organized by severity.

Summary

Overall Assessment: This is a solid infrastructure addition with good separation of concerns and proper integration with existing systems. The implementation follows AWS best practices for auto-scaling read replicas. However, there are several security, reliability, and operational concerns that should be addressed before merging.

Changed Files: 11 files, +608/-2 lines

  • New workflow: deploy-graph-replicas.yml (373 lines)
  • New CloudFormation: graph-ladybug-replicas.yaml (563 lines)
  • Modified: deployment workflows, Dagster jobs/sensors, configuration

🔴 Critical Issues

1. Missing CloudFormation Template in Repository

Location: .github/workflows/deploy-graph-replicas.yml:261

The workflow references cloudformation/graph-ladybug-replicas.yaml but this file was not committed to the repository (shows as "unknown SHA" in the PR). This will cause deployment failures.

Fix: Ensure the CloudFormation template is committed:

git add cloudformation/graph-ladybug-replicas.yaml
git commit --amend

2. Hardcoded DynamoDB Table Name

Locations:

  • cloudformation/graph-ladybug-replicas.yaml:214
  • cloudformation/graph-ladybug-replicas.yaml:372

The CloudFormation template hardcodes the DynamoDB table name as robosystems-graph-${Environment}-instances instead of looking it up from the graph infrastructure stack outputs.

Risk: If the table name changes or doesn't follow this exact pattern, deployments will fail.

Fix: Add a parameter for the instance registry table ARN/name from the graph infra stack, similar to how other resources are passed in.

3. No Rollback Strategy for Snapshot Updates

Location: .github/workflows/deploy-graph-replicas.yml:175-223

The snapshot resolution logic doesn't track previous snapshots. If a new snapshot is corrupted or causes issues, there's no automated rollback mechanism.

Recommendation:

  • Tag snapshots with version numbers or timestamps
  • Implement a rollback parameter to use previous snapshot
  • Consider blue/green deployment strategy for replica updates
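One way to make the rollback parameter concrete (a sketch only; the function name and the `rollback` flag are hypothetical, not from this PR): resolve either the newest or the second-newest completed snapshot from the tagged set, so a single parameter flips the fleet back to the prior known-good image.

```python
from datetime import datetime, timezone


def resolve_snapshot(snapshots, rollback=False):
    """Pick the newest completed snapshot, or the previous one when rolling back.

    `snapshots` mimics the shape of ec2.describe_snapshots()["Snapshots"]:
    dicts with "SnapshotId", "StartTime", and "State" keys.
    """
    completed = sorted(
        (s for s in snapshots if s["State"] == "completed"),
        key=lambda s: s["StartTime"],
        reverse=True,
    )
    index = 1 if rollback else 0
    if len(completed) <= index:
        raise RuntimeError("No snapshot available for the requested version")
    return completed[index]["SnapshotId"]


snaps = [
    {"SnapshotId": "snap-old", "State": "completed",
     "StartTime": datetime(2026, 1, 20, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "State": "completed",
     "StartTime": datetime(2026, 1, 22, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-pending", "State": "pending",
     "StartTime": datetime(2026, 1, 23, tzinfo=timezone.utc)},
]
print(resolve_snapshot(snaps))                 # snap-new
print(resolve_snapshot(snaps, rollback=True))  # snap-old
```

Pending snapshots are excluded so a half-finished snapshot can never be selected as either the current or the rollback target.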

4. Potential Race Condition in Sensor

Location: robosystems/dagster/sensors/sec.py:488-510

The sec_post_materialize_snapshot_sensor could trigger multiple snapshot jobs if SEC materialization runs complete in rapid succession, as the run_key includes the date but not sufficient uniqueness.

Fix: Add a check to prevent concurrent snapshot jobs:

from dagster import DagsterRunStatus, RunsFilter

# Check if a snapshot job is already running or queued before requesting a new one
active_runs = context.instance.get_runs(
    filters=RunsFilter(
        job_name="shared_repository_snapshot_job",
        statuses=[DagsterRunStatus.STARTED, DagsterRunStatus.QUEUED],
    )
)
if active_runs:
    context.log.info(f"Snapshot job already running ({len(active_runs)} active runs), skipping")
    return

🟡 High Priority Issues

5. Security: Overly Permissive DynamoDB Permissions

Location: cloudformation/graph-ladybug-replicas.yaml:206-214

Read-only replicas have full write permissions (PutItem, UpdateItem, DeleteItem) to DynamoDB. While they need to register themselves, they shouldn't have full write access.

Fix: Restrict to only the operations needed for instance registration:

Action:
  - dynamodb:PutItem     # for registration
  - dynamodb:UpdateItem  # for status updates
  - dynamodb:GetItem     # for health checks
Condition:
  StringEquals:
    dynamodb:LeadingKeys:
      # Note: IAM has no ${aws:ec2:InstanceId} policy variable. For EC2
      # instance roles, ${aws:userid} resolves to "role-id:instance-id",
      # which works here if the table's partition key follows that format.
      - "${aws:userid}"  # can only modify their own record

6. Missing Environment Variable in API

Location: cloudformation/api.yaml:679-680

The API receives SHARED_REPLICA_ALB_URL as an environment variable, but there's no corresponding configuration in robosystems/config/env.py to read and use this value.

Impact: The API won't be able to route read requests to replicas.

Fix: Add to env.py:

# Shared replicas configuration (optional horizontal read scaling)
SHARED_REPLICA_ALB_URL = get_str_env("SHARED_REPLICA_ALB_URL", "")

7. ALB Health Check May Be Too Aggressive

Location: cloudformation/graph-ladybug-replicas.yaml:465-470

Health check settings:

  • Interval: 30s
  • Timeout: 10s
  • Unhealthy threshold: 3

This means an instance can be marked unhealthy in just 90 seconds (3 × 30s). For graph database replicas that may have longer query times during heavy load, this could cause unnecessary terminations.

Recommendation: Increase unhealthy threshold to 5 (2.5 minutes) or increase interval to 60s.
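The arithmetic behind that recommendation can be sketched as a back-of-the-envelope helper (illustrative only, not code from the PR): an ALB marks a target unhealthy after `unhealthy_threshold` consecutive failed checks spaced `interval_s` apart.

```python
def time_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate seconds before an ALB marks a target unhealthy:
    the target must fail `unhealthy_threshold` consecutive health
    checks, which arrive every `interval_s` seconds."""
    return interval_s * unhealthy_threshold


print(time_to_unhealthy(30, 3))  # current settings: 90 seconds
print(time_to_unhealthy(30, 5))  # raised threshold: 150 seconds (2.5 minutes)
print(time_to_unhealthy(60, 3))  # longer interval: 180 seconds
```

Either change buys a loaded replica more time to answer before it is deregistered and terminated.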

8. No Encryption for Snapshot Data

Location: cloudformation/graph-ladybug-replicas.yaml:270-276

While the EBS volumes are encrypted, the snapshot parameter doesn't validate or require that the snapshot itself is encrypted. AWS allows creating unencrypted snapshots from encrypted volumes if not careful.

Fix: Add snapshot encryption validation in the workflow before using it, or document that snapshots MUST be encrypted.

9. Incomplete Error Handling in Workflow

Location: .github/workflows/deploy-graph-replicas.yml:217-223

When no snapshot is found, the workflow exits with error but doesn't provide actionable remediation beyond "Run Dagster job".

Improvement: Add a conditional step to automatically trigger the Dagster job if no snapshot exists (with confirmation), or at minimum provide the exact command to run.


🟢 Medium Priority Issues

10. Missing Documentation for New Inputs

Location: .github/workflows/prod.yml:429-434, .github/workflows/staging.yml:446-451

The new shared replica configuration variables are added but not documented in any README or configuration guide.

Fix: Add documentation for:

  • SHARED_REPLICAS_ENABLED_{PROD|STAGING}
  • SHARED_REPLICAS_MIN_INSTANCES_{PROD|STAGING}
  • SHARED_REPLICAS_MAX_INSTANCES_{PROD|STAGING}
  • SHARED_REPLICAS_DESIRED_CAPACITY_{PROD|STAGING}
  • SHARED_REPLICAS_INSTANCE_TYPE_{PROD|STAGING}

11. Hardcoded Snapshot Tags in Workflow

Location: .github/workflows/deploy-graph-replicas.yml:181-187

The snapshot lookup filters by tags GraphId=sec and Environment, but these tags are not documented as being set by the Dagster job that creates snapshots.

Fix: Add a comment in the workflow explaining the tagging convention, or better yet, reference a centralized tagging configuration.
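One shape a centralized convention could take (a sketch; the tag keys are taken from the workflow, the helper name is hypothetical): a single function that builds the `describe_snapshots`-style filters, kept next to the Dagster job that creates the snapshots so producer and consumer share one definition.

```python
def snapshot_tag_filters(graph_id: str, environment: str) -> list:
    """Build EC2 describe_snapshots-style filters for the replica
    snapshot lookup. "status" is a supported EC2 snapshot filter;
    restricting to "completed" skips in-progress snapshots."""
    return [
        {"Name": "tag:GraphId", "Values": [graph_id]},
        {"Name": "tag:Environment", "Values": [environment]},
        {"Name": "status", "Values": ["completed"]},
    ]


filters = snapshot_tag_filters("sec", "prod")
print(filters[0])  # {'Name': 'tag:GraphId', 'Values': ['sec']}
```

The same list can then be passed straight to `ec2.describe_snapshots(Filters=...)` in both the workflow's lookup script and any tooling that audits snapshot tagging.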

12. No Metrics for Replica Performance

Location: cloudformation/graph-ladybug-replicas.yaml:488-527

Only two CloudWatch alarms are configured (unhealthy hosts and latency). Missing important metrics:

  • Read query throughput
  • Connection pool saturation
  • Memory utilization
  • Network I/O

Recommendation: Add additional alarms for operational visibility.

13. Launch Template Version Pinning

Location: cloudformation/graph-ladybug-replicas.yaml:399-400

Using $Latest for launch template version can cause issues if the template is updated while instances are launching.

Fix: Consider using explicit version numbers and updating the ASG separately when changing launch template.

14. User Data Script Lacks Error Handling

Location: cloudformation/graph-ladybug-replicas.yaml:305-383

The UserData script uses set -e but doesn't have proper error handling for Docker operations. If Docker pull fails, the instance will signal failure but won't provide detailed logs.

Improvement: Add specific error handling and logging:

if ! docker pull ${ECRRepositoryUrl}:${ECRImageTag}; then
  echo "ERROR: Failed to pull Docker image" | tee -a /var/log/user-data-errors.log
  /opt/aws/bin/cfn-signal -e 1 --reason "Docker pull failed" --stack ${AWS::StackName} --resource ReplicaAutoScalingGroup --region ${AWS::Region}
  exit 1
fi

15. Sensor Run Key Could Cause Duplicates

Location: robosystems/dagster/sensors/sec.py:539-540

The run_key format includes date and first 8 chars of run_id, but if materialization jobs run multiple times per day (manual triggers, retries), this could create duplicates.

Fix: Use full run_id or add a counter suffix.
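A minimal illustration of the difference (the run IDs and key format here are made up for the example): truncated run IDs can collide when two runs share a prefix, while full run IDs always yield distinct keys, which matters because Dagster skips any RunRequest whose run_key it has already seen.

```python
def snapshot_run_key(run_id: str, date_str: str, use_full_id: bool = True) -> str:
    """Build a sensor run key. Dagster deduplicates on exact run_key
    matches, so the key must differ for every materialization we act on."""
    suffix = run_id if use_full_id else run_id[:8]
    return f"sec_snapshot_{date_str}_{suffix}"


run_a = "1f3a9c2d-1111-4000-8000-aaaaaaaaaaaa"
run_b = "1f3a9c2d-2222-4000-8000-bbbbbbbbbbbb"

# Full run IDs always produce distinct keys:
print(snapshot_run_key(run_a, "2026-01-23") != snapshot_run_key(run_b, "2026-01-23"))  # True

# Truncated IDs collide when two runs share an 8-character prefix:
print(snapshot_run_key(run_a, "2026-01-23", use_full_id=False)
      == snapshot_run_key(run_b, "2026-01-23", use_full_id=False))  # True
```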


🔵 Low Priority / Style Issues

16. Inconsistent Naming: "Shared" vs "Replica"

Throughout the codebase, sometimes "shared replicas", sometimes just "replicas", sometimes "shared repository". Pick one consistent term.

Suggestion: Use "shared repository replicas" consistently in documentation and comments.

17. Missing Job Run Link in Comment

The PR comment template at the end doesn't reference the new deploy-graph-replicas.yml workflow in its documentation links.

18. ALB Name Too Generic

Location: cloudformation/graph-ladybug-replicas.yaml:444

ALB name is robosystems-shared-${Environment} which doesn't indicate it's specifically for graph replicas. Could conflict with other "shared" services.

Suggestion: robosystems-graph-replicas-${Environment}

19. Verbose Logging in Production

Location: .github/workflows/deploy-graph-replicas.yml:153, 173, 195

Many emoji-filled log messages (📦, 🐳, 🔍, 📸, ❌, ✅) which may make log parsing harder.

Opinion: This is a minor style choice, but consider using structured logging for production workflows.


✅ What's Done Well

  1. Excellent separation of concerns - Replicas are in a separate workflow and CloudFormation stack
  2. Proper use of conditionals - HasMinInstances condition handles scale-to-zero scenarios
  3. Good health check configuration - Proper grace periods and rolling update strategy
  4. Security groups follow least privilege - Only necessary ports open
  5. Read-only mode enforcement - Environment variable LBUG_READ_ONLY=true prevents writes
  6. Auto-scaling properly configured - CPU-based with sensible defaults
  7. Integration with existing systems - Properly integrates with Dagster, API, and graph infrastructure
  8. Backward compatible - All changes are additive with default values

Testing Recommendations

Before merging:

  1. Test snapshot workflow - Manually verify Dagster snapshot job creates properly tagged snapshots
  2. Test replica deployment - Deploy to staging and verify replicas can serve read queries
  3. Test API integration - Verify API can route to replica ALB when configured
  4. Test auto-scaling - Generate load and verify replicas scale up/down correctly
  5. Test snapshot updates - Create new snapshot and verify launch template updates and rolling refresh works
  6. Test failure scenarios - What happens if snapshot doesn't exist, ALB is unhealthy, etc.

Recommendations Summary

Must Fix Before Merge: critical issues #1-#4 (missing CloudFormation template, hardcoded DynamoDB table name, no snapshot rollback strategy, sensor race condition).

Should Fix Soon: high-priority issues #5-#9 (overly permissive DynamoDB permissions, missing SHARED_REPLICA_ALB_URL config, ALB health check tuning, snapshot encryption, workflow error handling).

Can Address Later: medium- and low-priority issues #10-#19 (documentation, tagging conventions, additional metrics, naming consistency, logging style).


Conclusion

This is a well-architected feature that adds important horizontal scaling capabilities for read-heavy workloads. The implementation follows AWS best practices and integrates cleanly with the existing infrastructure. However, the critical issues around the missing CloudFormation file and some security/reliability concerns should be addressed before merging.

Recommendation: Request changes to address issues #1, #2, #4, and #6, then approve after fixes.


- Updated references from 'instances' to 'instance-registry' in CloudFormation and related scripts for consistency.
- Refactored shared repository job imports in Dagster definitions for clarity.
- Enhanced the SEC sensor to prevent concurrent executions of the shared repository snapshot job, improving reliability.
@jfrench9 jfrench9 merged commit 863a4b3 into main Jan 23, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/shared-replicas-infra branch January 23, 2026 17:44
