Fix LadybugDB shared replica deployment and lifecycle management #303

Merged
jfrench9 merged 3 commits into main from bugfix/shared-replica-deploy-issues
Feb 6, 2026

jfrench9 (Member) commented Feb 6, 2026

Summary

This PR resolves deployment issues with LadybugDB shared replicas by implementing proper initialization scripts and updating the CloudFormation infrastructure. The changes ensure reliable deployment and registration of shared replica instances.

Key Changes

  • New replica initialization: Added comprehensive userdata script for LadybugDB shared replica instances to handle proper startup, configuration, and service registration
  • Enhanced deployment workflow: Updated GitHub Actions workflow to support shared replica deployment with improved error handling and validation
  • CloudFormation template improvements: Refined the shared snapshot lifecycle policy configuration and updated resource descriptions for better clarity
  • Instance registration updates: Modified the common graph instance registration script to support shared replica requirements

Key Accomplishments

  • Resolves deployment failures for LadybugDB shared replicas
  • Implements proper instance lifecycle management for shared environments
  • Enhances monitoring and health check capabilities for replica instances
  • Improves infrastructure as code documentation and maintainability

Breaking Changes

None. All changes are additive or internal improvements that maintain backward compatibility.

Testing Notes

  • Verify shared replica instances deploy successfully and register properly
  • Confirm CloudFormation stack updates apply without errors
  • Test replica health checks and monitoring endpoints
  • Validate snapshot lifecycle policies are correctly applied

Infrastructure Considerations

  • New userdata script adds ~347 lines of instance initialization logic
  • CloudFormation template restructured for better resource organization
  • Deployment workflow now includes additional validation steps
  • Enhanced logging and monitoring capabilities for troubleshooting

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/shared-replica-deploy-issues
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…ployment workflow

- Added a new userdata script for shared replicas to handle instance setup, including environment validation, Docker installation, and ECR login.
- Updated the deployment workflow to upload userdata scripts to S3, ensuring the main script is downloaded and executed during instance bootstrapping.
- Enhanced CloudFormation template comments for clarity on DynamoDB registration processes for shared replicas.

jfrench9 (Member, Author) commented Feb 6, 2026

@claude please review this PR

claude (Bot) commented Feb 6, 2026

Claude finished @jfrench9's task in 1m 35s

PR Review Progress

  • Read and analyze all changed files
  • Review new userdata script (ladybug-replica.sh)
  • Review deployment workflow changes
  • Review CloudFormation template changes
  • Review registration script changes
  • Provide comprehensive review feedback

Code Review Summary

I've reviewed all 4 changed files in this PR. Overall, the implementation is solid and addresses the deployment issues for LadybugDB shared replicas. However, I've identified several issues ranging from critical bugs to suggestions for improvement.

🔴 Critical Issues

1. Hard failure on snapshot not found (deploy-graph-replicas.yml:218)

The workflow fails with exit 1 when no snapshot is found, but this breaks initial deployments. For first-time deployments, there may legitimately be no snapshot yet.

Impact: Initial stack deployments will fail.

Recommendation: Add a parameter to allow deploying with an empty data volume for initial setup, or provide clearer instructions on creating the first snapshot before deployment.
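
A minimal sketch of the opt-in approach, assuming a hypothetical `ALLOW_EMPTY_VOLUME` workflow input (not part of this PR) and a lookup result that is empty or `None` when no snapshot exists, as the AWS CLI prints for a null query with `--output text`:

```shell
# Hypothetical helper: decide how to proceed once the snapshot lookup has run.
# $1 = snapshot ID from the lookup (empty or "None" when nothing was found)
# $2 = "true" to allow a first-time deployment with an empty data volume
resolve_snapshot() {
  rs_snapshot_id="$1"
  rs_allow_empty="${2:-false}"

  if [ -n "$rs_snapshot_id" ] && [ "$rs_snapshot_id" != "None" ]; then
    echo "Using snapshot: $rs_snapshot_id" >&2
    printf '%s\n' "$rs_snapshot_id"
    return 0
  fi

  if [ "$rs_allow_empty" = "true" ]; then
    echo "No snapshot found; deploying with an empty data volume" >&2
    printf '\n'
    return 0
  fi

  echo "ERROR: no snapshot found (set ALLOW_EMPTY_VOLUME=true for initial setup)" >&2
  return 1
}
```

The function prints the snapshot ID (or an empty line) on stdout and keeps log messages on stderr, so a caller could write `SNAPSHOT_ID="$(resolve_snapshot "$FOUND_ID" "$ALLOW_EMPTY_VOLUME")"` and still fail the step when neither condition holds.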

2. Incorrect Docker profile for replicas (ladybug-replica.sh:245)

The script sets DOCKER_PROFILE="ladybug-shared-writer" for replicas, but replicas should use a read-only profile, not a writer profile.

# Line 245 - Should be read-only profile
export DOCKER_PROFILE="ladybug-shared-writer"  # ⚠️ Wrong for replicas

Impact: Replicas may have write permissions when they should be read-only, potentially causing data corruption or replication conflicts.

Recommendation: Create and use ladybug-shared-replica or ladybug-shared-readonly profile.
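
As a complementary safeguard, the script could refuse to start when a replica is handed a writer profile. This is only a sketch: `assert_replica_profile` is a hypothetical helper, and `ladybug-shared-replica` is the proposed profile name from the recommendation above, not one that exists in the PR.

```shell
# Hypothetical guard: refuse to start a replica with a missing or writer profile.
# $1 = value of DOCKER_PROFILE
assert_replica_profile() {
  case "$1" in
    "")
      echo "ERROR: DOCKER_PROFILE is not set" >&2
      return 1
      ;;
    *writer*)
      echo "ERROR: replica must not use a writer profile: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Example (profile name is an assumption):
#   export DOCKER_PROFILE="ladybug-shared-replica"
#   assert_replica_profile "$DOCKER_PROFILE" || exit 1
```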

3. Race condition in volume detection (ladybug-replica.sh:104-116)

The 2-minute timeout may be insufficient for volumes under heavy load or in congested availability zones. Additionally, the script checks only that the volume is attached, not that it is actually ready to mount.

Impact: Intermittent deployment failures in slower regions or during AWS service degradation.

Recommendation: Increase timeout to 300 seconds and add a check for volume "ready" state using lsblk or blkid.
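
One way to combine the longer timeout with a readiness check is a generic polling helper. The sketch below is an assumption about how the loop could look, not the script's current code; `wait_for` and `device_ready` are hypothetical names, and the device path in the example comes from the paths the script already checks.

```shell
# Poll a readiness command until it succeeds or the timeout elapses.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = command to run
wait_for() {
  wf_timeout="$1"
  wf_interval="$2"
  shift 2
  wf_elapsed=0
  until "$@"; do
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      echo "ERROR: timed out after ${wf_elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$wf_interval"
    wf_elapsed=$((wf_elapsed + wf_interval))
  done
}

# A device counts as ready only when the block node exists and lsblk can read it.
device_ready() {
  [ -b "$1" ] && lsblk "$1" >/dev/null 2>&1
}

# Example: wait up to 300 seconds, polling every 5 seconds
#   wait_for 300 5 device_ready /dev/nvme1n1 || exit 1
```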

🟡 High Priority Issues

4. No validation that snapshot contains expected data (ladybug-replica.sh:141-147)

The script only warns if /mnt/ladybug-data/databases doesn't exist but continues anyway, creating empty directories. This means a replica could start with no data and appear healthy.

Impact: Silent data loss scenarios where replicas appear healthy but serve no data.

Recommendation: Fail deployment if expected directories/files are missing from snapshot.
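
A minimal sketch of a stricter check, assuming that a valid snapshot always contains a non-empty `databases` directory (the review cannot confirm what the expected contents are, so the criterion is an assumption):

```shell
# Hypothetical check: fail when the expected data directory is missing or empty.
# $1 = directory that must exist and contain at least one entry
require_snapshot_data() {
  rsd_dir="$1"
  if [ ! -d "$rsd_dir" ]; then
    echo "ERROR: $rsd_dir is missing from the snapshot" >&2
    return 1
  fi
  if [ -z "$(ls -A "$rsd_dir" 2>/dev/null)" ]; then
    echo "ERROR: $rsd_dir exists but is empty" >&2
    return 1
  fi
  echo "Snapshot data found in $rsd_dir"
}

# Example:
#   require_snapshot_data /mnt/ladybug-data/databases || exit 1
```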

5. Missing error handling in S3 upload loop (deploy-graph-replicas.yml:237-243)

The loop uploads shared scripts but doesn't verify each upload succeeded. The set -e at line 222 should catch failures, but the loop doesn't provide visibility into which script failed.

Recommendation: Add explicit error checking and logging for each script upload in the loop.
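
The per-script checking could look roughly like this. It is a sketch under assumptions: `upload_one` and `upload_scripts` are hypothetical helpers, and the destination prefix in the example is a placeholder rather than the workflow's real bucket.

```shell
# Upload one file; isolated so the transfer command is easy to swap or stub.
upload_one() {
  aws s3 cp "$1" "$2" --only-show-errors
}

# Upload every script, log each result, and fail if any upload failed.
# $1 = destination prefix ending in "/", rest = local script paths
upload_scripts() {
  us_dest="$1"
  shift
  us_failed=""
  for us_script in "$@"; do
    if upload_one "$us_script" "${us_dest}$(basename "$us_script")"; then
      echo "Uploaded: $us_script"
    else
      echo "ERROR: upload failed: $us_script" >&2
      us_failed="$us_failed $us_script"
    fi
  done
  if [ -n "$us_failed" ]; then
    echo "ERROR: failed uploads:$us_failed" >&2
    return 1
  fi
}

# Example (placeholder bucket and paths):
#   upload_scripts "s3://example-bucket/userdata/" scripts/*.sh || exit 1
```

Because each upload runs inside an `if` condition, `set -e` does not abort on the first failure, so the log names every script that failed instead of only the first one.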

6. Inconsistent snapshot source logging (deploy-graph-replicas.yml:192, 208)

The workflow logs which source the snapshot came from but doesn't expose this as an output, making post-deployment debugging difficult.

Recommendation: Add snapshot_source to job outputs for observability.
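
A sketch of recording the source as a step output, using the `GITHUB_OUTPUT` file that GitHub Actions provides to each step. The helper name and the example label are assumptions:

```shell
# Log the snapshot source and expose it as a step output.
# $1 = label describing where the snapshot came from
record_snapshot_source() {
  rss_label="$1"
  echo "Snapshot source: $rss_label"
  echo "snapshot_source=$rss_label" >> "$GITHUB_OUTPUT"
}

# Example (label is illustrative):
#   record_snapshot_source "dlm-policy"
```

The job would still need an `outputs:` entry mapping the step output to a job output before other jobs can read it.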

🟢 Medium Priority Issues

7. Hardcoded device names may fail on newer instance types (ladybug-replica.sh:105-108)

The script checks for /dev/nvme1n1 and /dev/xvdf, but newer instance types or different configurations may use different device names.

Recommendation: Use AWS CLI or metadata service to discover the attached volume by tags or attachment info instead of hardcoding device names.
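
One possible alternative to hardcoded names, assuming a Nitro-based instance where udev exposes each EBS volume under `/dev/disk/by-id/` with the volume ID (minus its hyphen) in the link name. The helper below only builds that path; it should be verified on the target AMI before being relied on, and it does not apply to Xen-style `/dev/xvdf` devices.

```shell
# Map an EBS volume ID to the by-id symlink expected on Nitro instances.
# $1 = volume ID such as vol-0abc123
ebs_by_id_path() {
  ebp_compact="$(printf '%s' "$1" | sed 's/^vol-/vol/')"
  printf '/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_%s\n' "$ebp_compact"
}

# Example: resolve the symlink to the real device node
#   DEVICE="$(readlink -f "$(ebs_by_id_path "$VOLUME_ID")")"
```

The volume ID itself could come from `aws ec2 describe-volumes` filtered by `attachment.instance-id`.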

8. Health check runs before container is ready (ladybug-replica.sh:269-276)

The script marks the instance as "healthy" immediately after starting the container without verifying the container is actually serving traffic.

Impact: ALB may route traffic to instances before they're ready, causing 502/503 errors.

Recommendation: Add a retry loop that calls the /health endpoint before marking as healthy.
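
A minimal sketch of such a retry loop. The URL in the example, including the port, is a placeholder, and `probe_health` and `wait_for_health` are hypothetical helper names:

```shell
# Single probe: succeeds only on a 2xx/3xx response within 5 seconds.
probe_health() {
  curl -fsS --max-time 5 "$1" >/dev/null 2>&1
}

# Retry the probe until it passes or the attempt budget is spent.
# $1 = health URL, $2 = max attempts (default 30), $3 = delay in seconds (default 10)
wait_for_health() {
  wh_url="$1"
  wh_attempts="${2:-30}"
  wh_delay="${3:-10}"
  wh_attempt=1
  while :; do
    if probe_health "$wh_url"; then
      echo "Healthy after $wh_attempt attempt(s): $wh_url"
      return 0
    fi
    if [ "$wh_attempt" -ge "$wh_attempts" ]; then
      echo "ERROR: $wh_url not healthy after $wh_attempts attempts" >&2
      return 1
    fi
    wh_attempt=$((wh_attempt + 1))
    sleep "$wh_delay"
  done
}

# Example (placeholder port):
#   wait_for_health "http://localhost:8001/health" 30 10 || exit 1
```

Only after `wait_for_health` succeeds would the script mark the instance as healthy.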

9. Missing rollback handling for failed deployments (deploy-graph-replicas.yml:292-314)

When stack update fails, there's error checking but no rollback strategy. CloudFormation will auto-rollback, but the workflow doesn't detect or report this state.

Recommendation: Add CloudFormation rollback detection and reporting.
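
Rollback detection could be sketched as a status check after the update finishes. CloudFormation rollback states all contain `ROLLBACK` in the status name (for example `UPDATE_ROLLBACK_COMPLETE`); the helper names here are hypothetical.

```shell
# True when a CloudFormation stack status indicates a rollback.
is_rollback_status() {
  case "$1" in
    *ROLLBACK*) return 0 ;;
    *) return 1 ;;
  esac
}

# Query the stack status and fail loudly if the stack rolled back.
# $1 = stack name
report_stack_status() {
  rp_stack="$1"
  rp_status="$(aws cloudformation describe-stacks --stack-name "$rp_stack" \
    --query 'Stacks[0].StackStatus' --output text)" || return 2
  if is_rollback_status "$rp_status"; then
    echo "ERROR: stack $rp_stack rolled back (status: $rp_status)" >&2
    return 1
  fi
  echo "Stack $rp_stack status: $rp_status"
}
```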

10. Replica health verification timeout too short (deploy-graph-replicas.yml:356-371)

The health check only waits 5 minutes (30 iterations × 10 seconds) for replicas to become healthy. Given the userdata script setup time, this may not be enough.

Impact: Workflow may report success before replicas are actually healthy.

Recommendation: Increase to at least 10 minutes so the wait comfortably exceeds the 300-second health check grace period configured in CloudFormation.

💡 Suggestions & Best Practices

11. Consider adding snapshot age validation (deploy-graph-replicas.yml:176-194)

The workflow uses the latest snapshot but doesn't check how old it is. An old snapshot could contain stale data.

Suggestion: Add a warning if the snapshot is older than expected (e.g., > 7 days).
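
A sketch of the age check, working on Unix timestamps so the arithmetic stays portable. The helper names and the 7-day default are assumptions taken from the suggestion above:

```shell
# Whole days between two Unix timestamps (seconds).
# $1 = snapshot start time, $2 = current time
snapshot_age_days() {
  echo $(( ($2 - $1) / 86400 ))
}

# Warn (and return non-zero) when the snapshot exceeds the age threshold.
# $1 = snapshot start epoch, $2 = current epoch, $3 = max age in days (default 7)
warn_if_stale() {
  ws_age="$(snapshot_age_days "$1" "$2")"
  ws_max="${3:-7}"
  if [ "$ws_age" -gt "$ws_max" ]; then
    echo "WARNING: snapshot is ${ws_age} days old (threshold: ${ws_max})" >&2
    return 1
  fi
}

# Example (GNU date; START_TIME is the snapshot's StartTime field):
#   warn_if_stale "$(date -d "$START_TIME" +%s)" "$(date +%s)" 7 || true
```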

12. Add replica count to DynamoDB (register-graph-instance.sh:84-116)

Shared masters track database_count, but replicas don't have an equivalent metric for tracking load or capacity.

Suggestion: Add replica_capacity_pct or similar metric for replica load balancing.

13. Log rotation configuration is minimal (ladybug-replica.sh:314-324)

The logrotate config only keeps 7 days of logs with daily rotation, which may not be enough for debugging production issues.

Suggestion: Consider increasing retention to 14-30 days or add compression after 3 days.

14. No metrics or alarms for snapshot age (graph-ladybug-replicas.yaml:541-579)

The CloudFormation template has alarms for unhealthy hosts and latency but nothing for data freshness.

Suggestion: Add CloudWatch metrics/alarms for snapshot age to detect replication lag.

15. Cron health check lacks failure notifications (ladybug-replica.sh:311)

The health check cron job logs to a file but doesn't alert on failures.

Suggestion: Add SNS notification or CloudWatch Log metric filter for health check failures.

📝 Minor Issues & Code Quality

16. Inconsistent emoji usage (ladybug-replica.sh:142, 330)

The script uses checkmark emojis, which goes against the CLAUDE.md guidelines: "Only in interactive scripts (/examples/), never in production code or logs"

Note: These are production infrastructure scripts, not interactive examples.

17. Redundant environment variable exports (ladybug-replica.sh:250-263)

Variables are exported earlier in the script and then re-exported to /etc/environment. Consider consolidating.

18. Missing version/revision tracking (ladybug-replica.sh:1-5)

The userdata script doesn't include version information, making it hard to track which version is deployed.

Suggestion: Add version header or metadata.

19. CloudFormation description formatting (graph-ladybug-replicas.yaml:254-261)

The DLM policy has a multi-line comment block that's helpful but breaks from the single-line description pattern used elsewhere.

Minor consistency issue: Consider moving extended documentation to a separate README.

🎯 Security Considerations

20. IAM policy allows all ECR resources (graph-ladybug-replicas.yaml:193-199)

The ECR permissions use Resource: "*" which is broader than necessary.

Recommendation: Scope the repository-level ECR actions to specific repositories where possible. ecr:GetAuthorizationToken does not support resource-level permissions, so it still requires Resource: "*".

21. No encryption validation for snapshot (deploy-graph-replicas.yml:170-218)

The workflow doesn't verify that the snapshot is encrypted before using it.

Recommendation: Add encryption check to ensure compliance.
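
A sketch of the encryption check, reading the `Encrypted` field from `aws ec2 describe-snapshots`. The helper names are hypothetical, and the comparison is case-insensitive so it does not depend on how the CLI capitalizes booleans in text output:

```shell
# True when the given value reads as boolean true, in any letter case.
snapshot_is_encrypted() {
  sie_value="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  [ "$sie_value" = "true" ]
}

# Fail when the snapshot is not encrypted.
# $1 = snapshot ID
require_encrypted_snapshot() {
  res_id="$1"
  res_flag="$(aws ec2 describe-snapshots --snapshot-ids "$res_id" \
    --query 'Snapshots[0].Encrypted' --output text)" || return 2
  if ! snapshot_is_encrypted "$res_flag"; then
    echo "ERROR: snapshot $res_id is not encrypted" >&2
    return 1
  fi
  echo "Snapshot $res_id is encrypted"
}
```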

✅ Positive Highlights

  • Excellent use of shared scripts for consistency across writer and replica deployments
  • Good separation of concerns between bootstrap and main userdata scripts
  • Comprehensive health checking with cron-based monitoring
  • Well-documented CloudFormation parameters and resources
  • Proper ALB integration for replica fleet
  • Good error handling in most paths with detailed logging
  • Strong CloudWatch integration for monitoring and alerting

Summary

Overall Assessment: The PR makes significant progress on shared replica deployment, but two of the critical issues above should be fixed before merge:

  1. Incorrect Docker profile (replicas using writer profile), critical issue #2
  2. Hard failure on missing snapshot (breaks initial deployments), critical issue #1

The code is well-structured and thoroughly documented. After addressing the critical issues, this will be a solid foundation for LadybugDB shared replica infrastructure.

Recommendation: Request changes to fix critical issues #1 and #2, then re-review.


Branch: bugfix/shared-replica-deploy-issues

jfrench9 merged commit d19d171 into main on Feb 6, 2026
7 checks passed
jfrench9 deleted the bugfix/shared-replica-deploy-issues branch on February 7, 2026 04:59