Add shared replicas infrastructure for graph deployment workflows #233
Conversation
- Introduced shared replicas configuration in `deploy-api.yml` to support horizontal read scaling with an optional ALB endpoint URL.
- Created a new workflow `deploy-graph-replicas.yml` for deploying read-only replica fleets for shared repositories.
- Updated `deploy-graph.yml` to include shared replicas parameters and outputs.
- Enhanced `prod.yml` and `staging.yml` to incorporate shared replicas settings from environment variables.
- Added shared repository snapshot sensor in Dagster to trigger snapshots after SEC materialization.
- Updated setup script to initialize shared replicas configuration for production and staging environments.
@claude please review this PR
Claude finished @jfrench9's task in 2m 33s — View job

## PR Review: Shared Replicas Infrastructure

I've completed a comprehensive review of the shared replicas infrastructure changes. This is a well-structured implementation that adds horizontal read scaling capabilities for shared repositories. Below are my findings, organized by severity.

### Summary

Overall Assessment: This is a solid infrastructure addition with good separation of concerns and proper integration with existing systems. The implementation follows AWS best practices for auto-scaling read replicas. However, there are several security, reliability, and operational concerns that should be addressed before merging.

Changed Files: 11 files, +608/-2 lines
## 🔴 Critical Issues

### 1. Missing CloudFormation Template in Repository

The `deploy-graph-replicas.yml` workflow references `cloudformation/graph-ladybug-replicas.yaml`, but that file is not present in the repository.

Fix: Ensure the CloudFormation template is committed:

```bash
git add cloudformation/graph-ladybug-replicas.yaml
git commit --amend
```

### 2. Hardcoded DynamoDB Table Name

Locations: the CloudFormation template and related scripts

The CloudFormation template hardcodes the DynamoDB instance-registry table name rather than receiving it as a parameter.

Risk: If the table name changes or stops following that exact pattern, deployments will fail.

Fix: Add a parameter for the instance registry table ARN/name from the graph infra stack, similar to how other resources are passed in.

### 3. No Rollback Strategy for Snapshot Updates

The snapshot resolution logic doesn't track previous snapshots. If a new snapshot is corrupted or causes issues, there's no automated rollback mechanism.

Recommendation: Track the previously deployed snapshot so a bad rollout can be reverted; see the sketch below.
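A minimal sketch, assuming two hypothetical SSM parameters (`/graph/replicas/current-snapshot` and `/graph/replicas/previous-snapshot`; neither is defined in this PR): the workflow records the outgoing snapshot before promoting a new one, so a rollback is just redeploying with the returned ID.

```python
import boto3

ssm = boto3.client("ssm")

# Hypothetical parameter names, not defined anywhere in this PR.
CURRENT = "/graph/replicas/current-snapshot"
PREVIOUS = "/graph/replicas/previous-snapshot"

def promote_snapshot(new_snapshot_id: str) -> None:
    """Record the outgoing snapshot, then promote the new one."""
    try:
        current = ssm.get_parameter(Name=CURRENT)["Parameter"]["Value"]
        ssm.put_parameter(Name=PREVIOUS, Value=current, Type="String", Overwrite=True)
    except ssm.exceptions.ParameterNotFound:
        pass  # first deployment: nothing to roll back to yet
    ssm.put_parameter(Name=CURRENT, Value=new_snapshot_id, Type="String", Overwrite=True)

def rollback_snapshot() -> str:
    """Re-promote the previous snapshot and return its ID for redeployment."""
    previous = ssm.get_parameter(Name=PREVIOUS)["Parameter"]["Value"]
    ssm.put_parameter(Name=CURRENT, Value=previous, Type="String", Overwrite=True)
    return previous
```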
### 4. Potential Race Condition in Sensor

The sensor does not check for in-flight snapshot runs, so back-to-back materializations can trigger overlapping snapshot jobs.

Fix: Add a check to prevent concurrent snapshot jobs:

```python
from dagster import DagsterRunStatus, RunsFilter

# Check if a snapshot job is already running
active_runs = context.instance.get_runs(
    filters=RunsFilter(
        job_name="shared_repository_snapshot_job",
        statuses=[DagsterRunStatus.STARTED, DagsterRunStatus.QUEUED],
    )
)
if active_runs:
    context.log.info(f"Snapshot job already running ({len(active_runs)} active runs), skipping")
    return
```

## 🟡 High Priority Issues

### 5. Security: Overly Permissive DynamoDB Permissions

Read-only replicas have full write permissions (PutItem, UpdateItem, DeleteItem) to DynamoDB. While they need to register themselves, they shouldn't have full write access.

Fix: Restrict to only the operations needed for instance registration:

```yaml
Action:
  - dynamodb:PutItem     # For registration
  - dynamodb:UpdateItem  # For status updates
  - dynamodb:GetItem     # For health checks
Condition:
  StringEquals:
    dynamodb:LeadingKeys:
      - "${aws:ec2:InstanceId}"  # Can only modify their own record (NB: verify the policy variable; shown illustratively)
```

### 6. Missing Environment Variable in API

The deploy workflow passes `SHARED_REPLICA_ALB_URL` to the API environment, but the API configuration never reads it.

Impact: The API won't be able to route read requests to replicas.

Fix: Add to the API configuration:

```python
# Shared replicas configuration (optional horizontal read scaling)
SHARED_REPLICA_ALB_URL = get_str_env("SHARED_REPLICA_ALB_URL", "")
```

### 7. ALB Health Check May Be Too Aggressive

The target group health checks use a 30-second interval with an unhealthy threshold of 3.
This means an instance can be marked unhealthy in just 90 seconds (3 × 30s). For graph database replicas that may have longer query times during heavy load, this could cause unnecessary terminations.

Recommendation: Increase the unhealthy threshold to 5 (2.5 minutes) or increase the interval to 60s.

### 8. No Encryption for Snapshot Data

While the EBS volumes are encrypted, the snapshot parameter doesn't validate or require that the snapshot itself is encrypted. AWS allows creating unencrypted snapshots from encrypted volumes if you're not careful.

Fix: Add snapshot encryption validation in the workflow before using it (see the sketch after issue 10 below), or document that snapshots MUST be encrypted.

### 9. Incomplete Error Handling in Workflow

When no snapshot is found, the workflow exits with an error but doesn't provide actionable remediation beyond "Run Dagster job".

Improvement: Add a conditional step that automatically triggers the Dagster job when no snapshot exists (with confirmation), or at minimum print the exact command to run.

## 🟢 Medium Priority Issues

### 10. Missing Documentation for New Inputs

The new shared replica configuration variables are added but not documented in any README or configuration guide.

Fix: Add documentation for the new workflow inputs and environment variables, including their purpose and defaults.
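Circling back to issue 8 above: the workflow-side validation could be a single `describe_snapshots` call that refuses to proceed with an unencrypted snapshot. A minimal sketch, assuming the snapshot ID has already been resolved and the region is supplied by the workflow environment:

```python
import boto3

def assert_snapshot_encrypted(snapshot_id: str, region: str) -> None:
    """Fail fast if the snapshot we're about to deploy from is unencrypted."""
    ec2 = boto3.client("ec2", region_name=region)
    snapshot = ec2.describe_snapshots(SnapshotIds=[snapshot_id])["Snapshots"][0]
    if not snapshot["Encrypted"]:
        raise RuntimeError(f"Snapshot {snapshot_id} is not encrypted; refusing to deploy")
```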
### 11. Hardcoded Snapshot Tags in Workflow

The snapshot lookup filters on hardcoded tag values.

Fix: Add a comment in the workflow explaining the tagging convention, or better yet, reference a centralized tagging configuration.

### 12. No Metrics for Replica Performance

Only two CloudWatch alarms are configured (unhealthy hosts and latency); other operationally important metrics are not covered.

Recommendation: Add additional alarms for operational visibility, e.g. along the lines of the sketch below.
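As an illustration only (the ASG name, topic ARN, and metric choice are placeholders, not values from this PR), a CPU alarm on the replica fleet could sit alongside the existing two:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names: the real ASG name and SNS topic would come from stack outputs.
asg_name = "graph-ladybug-replicas-asg"
alarm_topic_arn = "arn:aws:sns:us-east-1:123456789012:graph-alerts"

cloudwatch.put_metric_alarm(
    AlarmName=f"{asg_name}-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
    Statistic="Average",
    Period=300,            # 5-minute windows
    EvaluationPeriods=2,   # sustained for 10 minutes before alarming
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[alarm_topic_arn],
)
```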
### 13. Launch Template Version Pinning

The Auto Scaling group tracks the latest launch template version rather than a pinned one, so template edits roll out implicitly.

Fix: Consider using explicit version numbers and updating the ASG separately when changing the launch template.

### 14. User Data Script Lacks Error Handling

The UserData script does not handle failures of individual steps, so a failed image pull can leave an instance half-configured without signaling CloudFormation.

Improvement: Add specific error handling and logging:

```bash
if ! docker pull ${ECRRepositoryUrl}:${ECRImageTag}; then
echo "ERROR: Failed to pull Docker image" | tee -a /var/log/user-data-errors.log
/opt/aws/bin/cfn-signal -e 1 --reason "Docker pull failed" --stack ${AWS::StackName} --resource ReplicaAutoScalingGroup --region ${AWS::Region}
exit 1
fi
```

### 15. Sensor Run Key Could Cause Duplicates

The run_key format includes the date and the first 8 chars of the run_id, but if materialization jobs run multiple times per day (manual triggers, retries), this could create duplicates.

Fix: Use the full run_id, or add a counter suffix.

## 🔵 Low Priority / Style Issues

### 16. Inconsistent Naming: "Shared" vs "Replica"

Throughout the codebase it is sometimes "shared replicas", sometimes just "replicas", sometimes "shared repository". Pick one consistent term.

Suggestion: Use "shared repository replicas" consistently in documentation and comments.

### 17. Missing Job Run Link in Comment

The comment template at the end doesn't include a reference to the deploy-graph-replicas.yml workflow in its documentation.

### 18. ALB Name Too Generic

The ALB name is generic and doesn't identify the environment or component at a glance.

Suggestion: Include the environment and component in the name.

### 19. Verbose Logging in Production

Many log messages are emoji-filled (📦, 🐳, 🔍, 📸, ❌, ✅), which may make log parsing harder.

Opinion: This is a minor style choice, but consider using structured logging for production workflows.
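For instance, emitting one JSON object per event keeps logs both human-readable and machine-parseable; a minimal sketch with illustrative event and field names:

```python
import json
import logging
import sys

logger = logging.getLogger("deploy")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

def log_event(event: str, **fields) -> None:
    """Emit a single machine-parseable JSON log line."""
    logger.info(json.dumps({"event": event, **fields}))

# Instead of: print("📸 Found snapshot snap-0123456789abcdef0")
log_event("snapshot_resolved", snapshot_id="snap-0123456789abcdef0", source="dagster")
```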
## ✅ What's Done Well

- Good separation of concerns and clean integration with existing systems
- Follows AWS best practices for auto-scaling read replicas

## Testing Recommendations

Before merging, exercise the new workflows end to end in staging and confirm that replicas come up healthy behind the ALB.
## Recommendations Summary

Must Fix Before Merge: critical issues #1, #2, #4, and #6.

Should Fix Soon: the remaining high-priority items (#5, #7, #8, #9).

Can Address Later: the medium- and low-priority items (#10–#19).
## Conclusion

This is a well-architected feature that adds important horizontal scaling capabilities for read-heavy workloads. The implementation follows AWS best practices and integrates cleanly with the existing infrastructure. However, the critical issues around the missing CloudFormation file and some security/reliability concerns should be addressed before merging.

Recommendation: Request changes to address critical issues #1, #2, #4, and #6, then approve after fixes.
- Updated references from 'instances' to 'instance-registry' in CloudFormation and related scripts for consistency.
- Refactored shared repository job imports in Dagster definitions for clarity.
- Enhanced the SEC sensor to prevent concurrent executions of the shared repository snapshot job, improving reliability.
## Summary
This PR introduces comprehensive shared replicas infrastructure to support distributed graph deployments across multiple environments. The implementation adds new deployment workflows, configuration management, and monitoring capabilities to enable horizontal scaling of graph processing workloads.
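For context on how the optional ALB endpoint is meant to be consumed, read traffic can fall back to the primary whenever the replica URL is unset. The helper below is a hypothetical sketch, not code from this PR (`PRIMARY_GRAPH_URL` is a stand-in name):

```python
import os

def graph_endpoint_for(request_is_read_only: bool) -> str:
    """Route read-only queries to the replica fleet when one is configured."""
    # SHARED_REPLICA_ALB_URL is the optional variable introduced in deploy-api.yml.
    replica_url = os.environ.get("SHARED_REPLICA_ALB_URL", "")
    if request_is_read_only and replica_url:
        return replica_url
    return os.environ["PRIMARY_GRAPH_URL"]  # hypothetical name for the primary endpoint
```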
## Key Accomplishments

## Infrastructure Considerations

## Testing Notes
## Breaking Changes
None. This is an additive change that maintains backward compatibility with existing deployment processes.
🤖 Generated with Claude Code
Branch Info: feature/shared-replicas-infra → main

Co-Authored-By: Claude <noreply@anthropic.com>