
Add shared replicas infrastructure for graph deployment workflows#233

Merged
jfrench9 merged 2 commits into main from feature/shared-replicas-infra on Jan 23, 2026
Conversation

@jfrench9
Member

Summary

This PR introduces comprehensive shared replicas infrastructure to support distributed graph deployments across multiple environments. The implementation adds new deployment workflows, configuration management, and monitoring capabilities to enable horizontal scaling of graph processing workloads.

Key Accomplishments

  • New Deployment Infrastructure: Added dedicated workflow for managing graph replicas with automated deployment capabilities
  • Enhanced Configuration Management: Integrated shared replica settings into environment configuration with proper validation
  • Expanded Monitoring: Extended SEC sensors with additional monitoring capabilities for replica health and performance tracking
  • Cross-Environment Support: Updated staging and production workflows to support shared replica deployments
  • CloudFormation Integration: Enhanced API infrastructure templates to accommodate shared replica architecture

Infrastructure Considerations

  • New GitHub Actions workflow added specifically for graph replica management (373 lines)
  • Significant enhancements to existing deployment workflows for API and graph services
  • CloudFormation templates updated to support the new replica architecture
  • Dagster definitions expanded to handle replica-aware job scheduling and monitoring
  • Environment-specific configuration added for both staging and production deployments

Testing Notes

  • Verify that existing graph deployment workflows continue to function correctly
  • Test the new replica deployment workflow in staging environment before production rollout
  • Confirm that SEC sensors properly monitor both primary and replica instances
  • Validate that environment configurations are correctly applied across all deployment targets
  • Ensure CloudFormation stack updates deploy successfully without service interruption

Breaking Changes

None. This is an additive change that maintains backward compatibility with existing deployment processes.


🤖 Generated with Claude Code

Branch Info:

  • Source: feature/shared-replicas-infra
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Introduced shared replicas configuration in `deploy-api.yml` to support horizontal read scaling with an optional ALB endpoint URL.
- Created a new workflow `deploy-graph-replicas.yml` for deploying read-only replica fleets for shared repositories.
- Updated `deploy-graph.yml` to include shared replicas parameters and outputs.
- Enhanced `prod.yml` and `staging.yml` to incorporate shared replicas settings from environment variables.
- Added shared repository snapshot sensor in Dagster to trigger snapshots after SEC materialization.
- Updated setup script to initialize shared replicas configuration for production and staging environments.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Jan 23, 2026

Claude finished @jfrench9's task in 2m 33s


PR Review: Shared Replicas Infrastructure

I've completed a comprehensive review of the shared replicas infrastructure changes. This is a well-structured implementation that adds horizontal read scaling capabilities for shared repositories. Below are my findings organized by severity.

Summary

Overall Assessment: This is a solid infrastructure addition with good separation of concerns and proper integration with existing systems. The implementation follows AWS best practices for auto-scaling read replicas. However, there are several security, reliability, and operational concerns that should be addressed before merging.

Changed Files: 11 files, +608/-2 lines

  • New workflow: deploy-graph-replicas.yml (373 lines)
  • New CloudFormation: graph-ladybug-replicas.yaml (563 lines)
  • Modified: deployment workflows, Dagster jobs/sensors, configuration

🔴 Critical Issues

1. Missing CloudFormation Template in Repository

Location: .github/workflows/deploy-graph-replicas.yml:261

The workflow references cloudformation/graph-ladybug-replicas.yaml but this file was not committed to the repository (shows as "unknown SHA" in the PR). This will cause deployment failures.

Fix: Ensure the CloudFormation template is committed:

git add cloudformation/graph-ladybug-replicas.yaml
git commit --amend

2. Hardcoded DynamoDB Table Name

Locations:

  • cloudformation/graph-ladybug-replicas.yaml:214
  • cloudformation/graph-ladybug-replicas.yaml:372

The CloudFormation template hardcodes the DynamoDB table name as robosystems-graph-${Environment}-instances instead of looking it up from the graph infrastructure stack outputs.

Risk: If the table name changes or doesn't follow this exact pattern, deployments will fail.

Fix: Add a parameter for the instance registry table ARN/name from the graph infra stack, similar to how other resources are passed in.

3. No Rollback Strategy for Snapshot Updates

Location: .github/workflows/deploy-graph-replicas.yml:175-223

The snapshot resolution logic doesn't track previous snapshots. If a new snapshot is corrupted or causes issues, there's no automated rollback mechanism.

Recommendation:

  • Tag snapshots with version numbers or timestamps
  • Implement a rollback parameter to use previous snapshot
  • Consider blue/green deployment strategy for replica updates
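One way to make the rollback parameter concrete (a sketch only; the function name and the `rollback` flag are hypothetical, not from this PR): resolve either the newest or the second-newest completed snapshot from the tagged set, so a single parameter flips the fleet back to the prior known-good image.

```python
from datetime import datetime, timezone


def resolve_snapshot(snapshots, rollback=False):
    """Pick the newest completed snapshot, or the previous one when rolling back.

    `snapshots` mimics the shape of ec2.describe_snapshots()["Snapshots"]:
    dicts with "SnapshotId", "StartTime", and "State" keys.
    """
    completed = sorted(
        (s for s in snapshots if s["State"] == "completed"),
        key=lambda s: s["StartTime"],
        reverse=True,
    )
    index = 1 if rollback else 0
    if len(completed) <= index:
        raise RuntimeError("No snapshot available for the requested version")
    return completed[index]["SnapshotId"]


snaps = [
    {"SnapshotId": "snap-old", "State": "completed",
     "StartTime": datetime(2026, 1, 20, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "State": "completed",
     "StartTime": datetime(2026, 1, 22, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-pending", "State": "pending",
     "StartTime": datetime(2026, 1, 23, tzinfo=timezone.utc)},
]
print(resolve_snapshot(snaps))                 # snap-new
print(resolve_snapshot(snaps, rollback=True))  # snap-old
```

Pending snapshots are excluded so a half-finished snapshot can never be selected as either the current or the rollback target.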

4. Potential Race Condition in Sensor

Location: robosystems/dagster/sensors/sec.py:488-510

The sec_post_materialize_snapshot_sensor could trigger multiple snapshot jobs if SEC materialization runs complete in rapid succession, as the run_key includes the date but not sufficient uniqueness.

Fix: Add a check to prevent concurrent snapshot jobs:

from dagster import DagsterRunStatus, RunsFilter

# Check if a snapshot job is already running or queued before requesting a new one
active_runs = context.instance.get_runs(
    filters=RunsFilter(
        job_name="shared_repository_snapshot_job",
        statuses=[DagsterRunStatus.STARTED, DagsterRunStatus.QUEUED],
    )
)
if active_runs:
    context.log.info(f"Snapshot job already running ({len(active_runs)} active runs), skipping")
    return

🟡 High Priority Issues

5. Security: Overly Permissive DynamoDB Permissions

Location: cloudformation/graph-ladybug-replicas.yaml:206-214

Read-only replicas have full write permissions (PutItem, UpdateItem, DeleteItem) to DynamoDB. While they need to register themselves, they shouldn't have full write access.

Fix: Restrict to only the operations needed for instance registration:

Action:
  - dynamodb:PutItem     # for registration
  - dynamodb:UpdateItem  # for status updates
  - dynamodb:GetItem     # for health checks
Condition:
  StringEquals:
    dynamodb:LeadingKeys:
      # Note: IAM has no ${aws:ec2:InstanceId} policy variable. For EC2
      # instance roles, ${aws:userid} resolves to "role-id:instance-id",
      # which works here if the table's partition key follows that format.
      - "${aws:userid}"  # can only modify their own record

6. Missing Environment Variable in API

Location: cloudformation/api.yaml:679-680

The API receives SHARED_REPLICA_ALB_URL as an environment variable, but there's no corresponding configuration in robosystems/config/env.py to read and use this value.

Impact: The API won't be able to route read requests to replicas.

Fix: Add to env.py:

# Shared replicas configuration (optional horizontal read scaling)
SHARED_REPLICA_ALB_URL = get_str_env("SHARED_REPLICA_ALB_URL", "")

7. ALB Health Check May Be Too Aggressive

Location: cloudformation/graph-ladybug-replicas.yaml:465-470

Health check settings:

  • Interval: 30s
  • Timeout: 10s
  • Unhealthy threshold: 3

This means an instance can be marked unhealthy in just 90 seconds (3 × 30s). For graph database replicas that may have longer query times during heavy load, this could cause unnecessary terminations.

Recommendation: Increase unhealthy threshold to 5 (2.5 minutes) or increase interval to 60s.
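The arithmetic behind that recommendation can be sketched as a back-of-the-envelope helper (illustrative only, not code from the PR): an ALB marks a target unhealthy after `unhealthy_threshold` consecutive failed checks spaced `interval_s` apart.

```python
def time_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate seconds before an ALB marks a target unhealthy:
    the target must fail `unhealthy_threshold` consecutive health
    checks, which arrive every `interval_s` seconds."""
    return interval_s * unhealthy_threshold


print(time_to_unhealthy(30, 3))  # current settings: 90 seconds
print(time_to_unhealthy(30, 5))  # raised threshold: 150 seconds (2.5 minutes)
print(time_to_unhealthy(60, 3))  # longer interval: 180 seconds
```

Either change buys a loaded replica more time to answer before it is deregistered and terminated.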

8. No Encryption for Snapshot Data

Location: cloudformation/graph-ladybug-replicas.yaml:270-276

While the EBS volumes are encrypted, the snapshot parameter doesn't validate or require that the snapshot itself is encrypted. AWS allows creating unencrypted snapshots from encrypted volumes if not careful.

Fix: Add snapshot encryption validation in the workflow before using it, or document that snapshots MUST be encrypted.

9. Incomplete Error Handling in Workflow

Location: .github/workflows/deploy-graph-replicas.yml:217-223

When no snapshot is found, the workflow exits with error but doesn't provide actionable remediation beyond "Run Dagster job".

Improvement: Add a conditional step to automatically trigger the Dagster job if no snapshot exists (with confirmation), or at minimum provide the exact command to run.


🟢 Medium Priority Issues

10. Missing Documentation for New Inputs

Location: .github/workflows/prod.yml:429-434, .github/workflows/staging.yml:446-451

The new shared replica configuration variables are added but not documented in any README or configuration guide.

Fix: Add documentation for:

  • SHARED_REPLICAS_ENABLED_{PROD|STAGING}
  • SHARED_REPLICAS_MIN_INSTANCES_{PROD|STAGING}
  • SHARED_REPLICAS_MAX_INSTANCES_{PROD|STAGING}
  • SHARED_REPLICAS_DESIRED_CAPACITY_{PROD|STAGING}
  • SHARED_REPLICAS_INSTANCE_TYPE_{PROD|STAGING}

11. Hardcoded Snapshot Tags in Workflow

Location: .github/workflows/deploy-graph-replicas.yml:181-187

The snapshot lookup filters by tags GraphId=sec and Environment, but these tags are not documented as being set by the Dagster job that creates snapshots.

Fix: Add a comment in the workflow explaining the tagging convention, or better yet, reference a centralized tagging configuration.
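One shape a centralized convention could take (a sketch; the tag keys are taken from the workflow, the helper name is hypothetical): a single function that builds the `describe_snapshots`-style filters, kept next to the Dagster job that creates the snapshots so producer and consumer share one definition.

```python
def snapshot_tag_filters(graph_id: str, environment: str) -> list:
    """Build EC2 describe_snapshots-style filters for the replica
    snapshot lookup. "status" is a supported EC2 snapshot filter;
    restricting to "completed" skips in-progress snapshots."""
    return [
        {"Name": "tag:GraphId", "Values": [graph_id]},
        {"Name": "tag:Environment", "Values": [environment]},
        {"Name": "status", "Values": ["completed"]},
    ]


filters = snapshot_tag_filters("sec", "prod")
print(filters[0])  # {'Name': 'tag:GraphId', 'Values': ['sec']}
```

The same list can then be passed straight to `ec2.describe_snapshots(Filters=...)` in both the workflow's lookup script and any tooling that audits snapshot tagging.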

12. No Metrics for Replica Performance

Location: cloudformation/graph-ladybug-replicas.yaml:488-527

Only two CloudWatch alarms are configured (unhealthy hosts and latency). Missing important metrics:

  • Read query throughput
  • Connection pool saturation
  • Memory utilization
  • Network I/O

Recommendation: Add additional alarms for operational visibility.

13. Launch Template Version Pinning

Location: cloudformation/graph-ladybug-replicas.yaml:399-400

Using $Latest for launch template version can cause issues if the template is updated while instances are launching.

Fix: Consider using explicit version numbers and updating the ASG separately when changing launch template.

14. User Data Script Lacks Error Handling

Location: cloudformation/graph-ladybug-replicas.yaml:305-383

The UserData script uses set -e but doesn't have proper error handling for Docker operations. If Docker pull fails, the instance will signal failure but won't provide detailed logs.

Improvement: Add specific error handling and logging:

if ! docker pull ${ECRRepositoryUrl}:${ECRImageTag}; then
  echo "ERROR: Failed to pull Docker image" | tee -a /var/log/user-data-errors.log
  /opt/aws/bin/cfn-signal -e 1 --reason "Docker pull failed" --stack ${AWS::StackName} --resource ReplicaAutoScalingGroup --region ${AWS::Region}
  exit 1
fi

15. Sensor Run Key Could Cause Duplicates

Location: robosystems/dagster/sensors/sec.py:539-540

The run_key format includes date and first 8 chars of run_id, but if materialization jobs run multiple times per day (manual triggers, retries), this could create duplicates.

Fix: Use full run_id or add a counter suffix.
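A minimal illustration of the difference (the run IDs and key format here are made up for the example): truncated run IDs can collide when two runs share a prefix, while full run IDs always yield distinct keys, which matters because Dagster skips any RunRequest whose run_key it has already seen.

```python
def snapshot_run_key(run_id: str, date_str: str, use_full_id: bool = True) -> str:
    """Build a sensor run key. Dagster deduplicates on exact run_key
    matches, so the key must differ for every materialization we act on."""
    suffix = run_id if use_full_id else run_id[:8]
    return f"sec_snapshot_{date_str}_{suffix}"


run_a = "1f3a9c2d-1111-4000-8000-aaaaaaaaaaaa"
run_b = "1f3a9c2d-2222-4000-8000-bbbbbbbbbbbb"

# Full run IDs always produce distinct keys:
print(snapshot_run_key(run_a, "2026-01-23") != snapshot_run_key(run_b, "2026-01-23"))  # True

# Truncated IDs collide when two runs share an 8-character prefix:
print(snapshot_run_key(run_a, "2026-01-23", use_full_id=False)
      == snapshot_run_key(run_b, "2026-01-23", use_full_id=False))  # True
```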


🔵 Low Priority / Style Issues

16. Inconsistent Naming: "Shared" vs "Replica"

Throughout the codebase, sometimes "shared replicas", sometimes just "replicas", sometimes "shared repository". Pick one consistent term.

Suggestion: Use "shared repository replicas" consistently in documentation and comments.

17. Missing Job Run Link in Comment

The PR comment template at the end doesn't reference the new deploy-graph-replicas.yml workflow in its documentation links.

18. ALB Name Too Generic

Location: cloudformation/graph-ladybug-replicas.yaml:444

ALB name is robosystems-shared-${Environment} which doesn't indicate it's specifically for graph replicas. Could conflict with other "shared" services.

Suggestion: robosystems-graph-replicas-${Environment}

19. Verbose Logging in Production

Location: .github/workflows/deploy-graph-replicas.yml:153, 173, 195

Many emoji-filled log messages (📦, 🐳, 🔍, 📸, ❌, ✅) which may make log parsing harder.

Opinion: This is a minor style choice, but consider using structured logging for production workflows.


✅ What's Done Well

  1. Excellent separation of concerns - Replicas are in a separate workflow and CloudFormation stack
  2. Proper use of conditionals - HasMinInstances condition handles scale-to-zero scenarios
  3. Good health check configuration - Proper grace periods and rolling update strategy
  4. Security groups follow least privilege - Only necessary ports open
  5. Read-only mode enforcement - Environment variable LBUG_READ_ONLY=true prevents writes
  6. Auto-scaling properly configured - CPU-based with sensible defaults
  7. Integration with existing systems - Properly integrates with Dagster, API, and graph infrastructure
  8. Backward compatible - All changes are additive with default values

Testing Recommendations

Before merging:

  1. Test snapshot workflow - Manually verify Dagster snapshot job creates properly tagged snapshots
  2. Test replica deployment - Deploy to staging and verify replicas can serve read queries
  3. Test API integration - Verify API can route to replica ALB when configured
  4. Test auto-scaling - Generate load and verify replicas scale up/down correctly
  5. Test snapshot updates - Create new snapshot and verify launch template updates and rolling refresh works
  6. Test failure scenarios - What happens if snapshot doesn't exist, ALB is unhealthy, etc.

Recommendations Summary

Must Fix Before Merge: critical issues #1-#4 (missing CloudFormation template, hardcoded DynamoDB table name, no snapshot rollback strategy, sensor race condition).

Should Fix Soon: high-priority issues #5-#9 (overly permissive DynamoDB permissions, missing SHARED_REPLICA_ALB_URL config, ALB health check tuning, snapshot encryption, workflow error handling).

Can Address Later: medium- and low-priority issues #10-#19 (documentation, tagging conventions, additional metrics, naming consistency, logging style).


Conclusion

This is a well-architected feature that adds important horizontal scaling capabilities for read-heavy workloads. The implementation follows AWS best practices and integrates cleanly with the existing infrastructure. However, the critical issues around the missing CloudFormation file and some security/reliability concerns should be addressed before merging.

Recommendation: Request changes to address issues #1, #2, #4, and #6, then approve after fixes.


- Updated references from 'instances' to 'instance-registry' in CloudFormation and related scripts for consistency.
- Refactored shared repository job imports in Dagster definitions for clarity.
- Enhanced the SEC sensor to prevent concurrent executions of the shared repository snapshot job, improving reliability.
@jfrench9 jfrench9 merged commit 863a4b3 into main Jan 23, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/shared-replicas-infra branch January 23, 2026 17:44
