Fix deployment infrastructure and improve health checks by jfrench9 · Pull Request #54 · RoboFinSystems/robosystems

jfrench9 · 2025-11-01T01:24:41Z

Summary

This PR addresses critical deployment issues and enhances the reliability of our infrastructure through improved health checks, input validation, and environment management across GitHub Actions workflows and deployment scripts.

Key Accomplishments

Infrastructure Improvements

Enhanced Health Checks: Improved health check mechanisms in GitHub Actions to better detect deployment failures and service readiness
Input Validation: Added robust input validation to deployment workflows to prevent configuration errors
Environment Management: Streamlined environment variable handling and container startup processes
Workflow Optimization: Removed unnecessary permissions from deployment workflows and improved error handling

Configuration Updates

Environment Consistency: Updated staging workflow to use environment-specific image tags instead of 'latest' for better version control
Secret Management: Updated API key retrieval to use new naming conventions for graph API integration
CloudFormation: Added missing environment variables to infrastructure templates for proper service configuration

Code Organization

Script Refactoring: Reorganized and simplified container startup scripts for better maintainability
Command Structure: Updated build and deployment command organization for improved developer experience

Breaking Changes

Secret naming convention has changed for graph API integration - existing deployments will need to update secret references
Container image tagging strategy updated for staging environment

Testing Notes

All existing tests have been updated to reflect new secret naming conventions
Health check improvements have been validated in staging environment
Container startup scripts tested across different environment configurations

Infrastructure Considerations

Deployment workflows now have more robust failure detection and recovery mechanisms
Environment-specific configuration is more strictly enforced
Container orchestration has been simplified while maintaining reliability
CloudFormation templates include additional environment context for better service discovery

🤖 Generated with Claude Code

Branch Info:

Source: bugfix/deployment-fixes
Target: main
Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…eploying Neo4j writers, streamlining the configuration.

…n template - Introduced AWS_STACK_NAME export in the setup script for Kuzu writers to enhance environment variable management. - Updated both writer configurations to ensure consistency in environment variable exports.

…raph API integration - Changed the default secret name from "kuzu" to "graph-api" to align with the new naming convention for the graph API. - Added a TODO comment to consider moving the secret management to a dedicated secrets manager module in the future.

…ect new naming convention for graph API

…'latest' for consistent environment configuration

- Updated the GitHub Actions workflow to use a universal startup script for container management. - Modified the Kuzu writer userdata script to persist ECR-related environment variables for health check restarts. - Enhanced the script to source persisted variables and retrieve EC2 metadata dynamically for improved configuration management.

- Moved demo scripts to a new section for better organization. - Updated the command structure for demo scripts to ensure consistency and clarity. - Renamed parameters in duckdb-query commands for improved readability.

…nd environment management - Updated the handle-deployment-failure action to retrieve failed steps directly from job details, streamlining error reporting. - Modified the refresh-ecs action to include an additional output key for cluster name retrieval, enhancing flexibility. - Expanded the kuzu-writer userdata script to persist more environment variables, ensuring comprehensive configuration for container restarts.

…s and input validation - Added health check port determination based on backend type in refresh-graph actions. - Implemented input validation for backend, node type, and environment in refresh-graph-containers action. - Updated container naming logic to be backend-agnostic, simplifying deployment configurations. - Persisted additional environment variables in the Neo4j writer userdata script for better container management.

jfrench9 · 2025-11-01T01:24:42Z

@claude please review this PR

claude · 2025-11-01T01:24:58Z

Claude finished @jfrench9's task —— View job

PR Review Complete

I've completed a thorough review of PR #54. Here's my comprehensive feedback:

Overall Assessment

This PR makes significant improvements to deployment infrastructure, health checks, and environment management. The changes are well-structured and address real operational concerns. However, I've identified several issues that should be addressed before merging.

Critical Issues

1. Missing Environment Variable in Container Refresh (`refresh-graph-containers/action.yml:107`)

The ECR_IMAGE variable is constructed from components in the SSM command but never exported:

export ECR_IMAGE="${ECR_URI}:${{ inputs.environment }}"

Problem: This variable is used by run-graph-container.sh but isn't defined in the environment passed to the script.

Fix: Add to the environment setup in lines 98-113:

export ECR_URI="${ECR_URI}"
export ECR_IMAGE_TAG="${{ inputs.environment }}"
export ECR_IMAGE="${ECR_URI}:${{ inputs.environment }}"

2. Inconsistent Health Check Port Logic (`refresh-graph-containers/action.yml:83-87`)

The health check port is determined in the action but the backend type isn't validated:

if [ "${{ inputs.backend }}" == "neo4j" ]; then
  HEALTH_PORT="7474"
else
  HEALTH_PORT="8001"
fi

Problem: If backend is neither "neo4j" nor "kuzu", it defaults to 8001 without warning.

Recommendation: Add validation after input validation (around line 62):

if [[ ! "${{ inputs.backend }}" =~ ^(kuzu|neo4j)$ ]]; then
  echo "❌ Backend must be validated: ${{ inputs.backend }}"
  exit 1
fi

3. Missing Secret Name Update in Test (`test_auth_middleware.py:278`)

The test hardcodes the old secret name pattern:

mock_client.get_secret_value.assert_called_once_with(
    SecretId="robosystems/prod/graph-api"  # ✅ This is correct
)

While this is actually correct for the new naming convention, the PR description mentions "Secret naming convention has changed" but doesn't clearly document what the old vs new patterns are.

Recommendation: Add a comment in auth.py documenting the secret naming convention:

# Secret naming: robosystems/{env}/graph-api (unified for all graph backends)
# Old pattern: robosystems/{env}/kuzu (deprecated)

High-Priority Issues

4. Deployment Workflow Permissions Removed Without Documentation (`.github/workflows/deploy-neo4j-writers.yml:1-7`)

The workflow removed permissions: section but the PR description just says "Removed unnecessary permissions."

Issue: No documentation of which permissions were removed or why they were unnecessary.

Recommendation: Add a comment explaining the permission inheritance:

# Inherits repository-level permissions from calling workflow
# No additional permissions needed for deployment operations

5. Unsafe Image Tag Usage in Staging (`.github/workflows/staging.yml:462`)

The staging workflow uses a hardcoded tag:

ecr_image_tag: staging

Issue: If the staging tag is updated during deployment, there's no version pinning for rollbacks.

Recommendation: Consider using commit SHA for staging deployments:

ecr_image_tag: ${{ github.sha }}

6. Environment Variable Persistence Issues (`kuzu-writer.sh:372-385`, `neo4j-writer.sh:376-392`)

Both scripts persist environment variables to /etc/environment, but some critical variables (like INSTANCE_ID) are instance-specific and shouldn't persist across reboots:

Problem: If an instance is reimaged or replaced, stale instance metadata could cause issues.

Recommendation: Separate static config from dynamic metadata. Only persist static variables to /etc/environment.

Medium-Priority Issues

7. Hardcoded Wait Times (`refresh-graph-asg/action.yml:366`)

sleep "${{ inputs.health-check-timeout }}"

The script waits the full timeout period regardless of health status. This could be optimized to poll for health and exit early on success.

8. Missing Error Context in Volume Manager Invocation (`kuzu-writer.sh:129`)

aws lambda invoke ... || echo "Failed to invoke Volume Manager"

Issue: Error is swallowed with generic message. No context about why it failed.

Recommendation: Capture and log the actual error:

if ! aws lambda invoke ... 2>&1 | tee /tmp/volume-manager-error.log; then
  echo "Failed to invoke Volume Manager. Error: $(cat /tmp/volume-manager-error.log)"
fi

9. Duplicate Code in Userdata Scripts (`kuzu-writer.sh:236-263`, `neo4j-writer.sh:236-265`)

Both scripts have identical S3 download logic for shared scripts. This could be extracted to a common function or script.

10. Missing Validation for SharedRepositories (`cloudformation/kuzu-writers.yaml:65-69`)

The parameter has a pattern constraint but doesn't validate that the repository names are from an allowed set:

AllowedPattern: "^$|^[a-z]+(,[a-z]+)*$"

Recommendation: Add to the description which repositories are valid (sec, industry, economic, etc.) or add a more specific pattern.

Low-Priority Issues / Suggestions

11. Inconsistent CloudWatch Namespace Naming (`kuzu-writer.sh:43-45`)

Uses RoboSystemsKuzu/${ENV_CAPITALIZED} but Neo4j uses RoboSystemsGraph/${ENV_CAPITALIZED}/Neo4j....

Suggestion: Standardize to RoboSystemsGraph/${ENV_CAPITALIZED}/Kuzu for consistency.

12. Magic Numbers in Health Checks (`run-graph-container.sh:176-177`)

for i in {1..12}; do  # 12 * 5 = 60 seconds

Suggestion: Use variables for clarity:

HEALTH_CHECK_ATTEMPTS=12
HEALTH_CHECK_INTERVAL=5
for i in $(seq 1 $HEALTH_CHECK_ATTEMPTS); do

13. Justfile Command Reorganization (`justfile:90-107`)

The demo commands were reorganized but the comment structure could be improved for better discoverability.

14. Missing Validation in Workflow (`staging.yml:462`)

The image tag is hardcoded without validation that the image actually exists in ECR.

Positive Aspects

Excellent Input Validation: The new validation in refresh-graph-asg (lines 69-84) is thorough
Improved Error Handling: Better failure detection in deployment workflows
Unified Health Checks: Consolidation of health check logic across backends is well done
Environment Consistency: The staging workflow image tag change (staging instead of latest) improves consistency
Code Reuse: The shared script pattern (run-graph-container.sh) reduces duplication effectively

Breaking Changes Validation

The PR correctly identifies breaking changes:

✅ Secret naming convention change is documented
✅ Image tagging strategy change is documented
⚠️ Migration path for existing deployments could be clearer

Recommendation: Add a migration section to the PR description:

## Migration Steps
1. Update secrets in AWS Secrets Manager to use new naming: `robosystems/{env}/graph-api`
2. Update staging environment to pull `staging` tag instead of `latest`
3. Verify health check endpoints respond on correct ports

Testing Recommendations

Test container refresh on both Kuzu and Neo4j backends
Verify health checks work with new port logic
Test volume attachment failure scenarios
Validate secret retrieval with new naming convention
Test ASG refresh with allocated databases (force-refresh flow)

Summary

Recommendation: Request changes to address critical issues #1-3 before merging. The other issues can be addressed in follow-up PRs if needed for velocity, but the missing environment variable (#1) could cause runtime failures.

The PR makes valuable improvements to deployment reliability and infrastructure management. Once the critical issues are addressed, this will be a solid enhancement to the deployment pipeline.

Tasks

Read all changed files (15 files)
Review GitHub Actions workflow changes
Review userdata script changes
Review CloudFormation template changes
Review application code changes
Compile comprehensive review feedback

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-11-01T01:27:53Z

            \"#!/bin/bash\",
            \"set -e\",
            \"echo \\\"🔄 Starting container refresh for $CONTAINER_NAME\\\"\",
+            \"echo \\\"Loading environment variables...\\\"\",
+            \"if [ -f /etc/environment ]; then set -a; source /etc/environment; set +a; fi\",
            \"echo \\\"Pulling latest image...\\\"\",
            \"aws ecr get-login-password --region ${{ inputs.aws-region }} | docker login --username AWS --password-stdin ${ECR_URI%/*}\",
-            \"docker pull ${ECR_URI}:latest\",
+            \"docker pull ${ECR_URI}:${{ inputs.environment }}\",
+            \"export ECR_IMAGE=\\\"${ECR_URI}:${{ inputs.environment }}\\\"\",
            \"echo \\\"Stopping existing container...\\\"\",
            \"docker stop $CONTAINER_NAME || true\",
            \"docker rm $CONTAINER_NAME || true\",


Stop legacy graph containers before renaming to graph-api

The refresh action and common runner now hard-code the container name to graph-api/graph-api-shared and only call docker stop/docker rm on those names. Existing instances launched before this change still have running containers named kuzu-writer or neo4j-writer, so the first refresh after deploying this commit will not stop the old container. When run-graph-container.sh tries to start a new container on the same ports, Docker will fail with a "port already allocated" error and the update never succeeds. The refresh command should also remove the previous backend-specific names (or perform a rename) so that the rollout does not fail on active hosts.

Useful? React with 👍 / 👎.

…I integration - Updated CloudWatch namespace in kuzu-writer and neo4j-writer scripts to a unified format for all graph backends. - Enhanced error handling in Volume Manager Lambda invocation for better debugging and logging. - Centralized API key retrieval in the auth middleware to use a unified secrets manager for Kuzu and Neo4j. - Improved health check logic in run-graph-container script for clearer output and configuration management.

jfrench9 added 9 commits October 29, 2025 23:27

Remove unnecessary permissions from the GitHub Actions workflow for d…

4d89c2a

…eploying Neo4j writers, streamlining the configuration.

Update secret ID in test for get_api_key_from_secrets_manager to refl…

39b087d

…ect new naming convention for graph API

Update ecr_image_tag in staging workflow to use 'staging' instead of …

8de264d

…'latest' for consistent environment configuration

Reorganize demo scripts and update command structure in justfile

a3a1645

- Moved demo scripts to a new section for better organization. - Updated the command structure for demo scripts to ensure consistency and clarity. - Renamed parameters in duckdb-query commands for improved readability.

chatgpt-codex-connector Bot reviewed Nov 1, 2025

View reviewed changes

jfrench9 merged commit 5026007 into main Nov 1, 2025
5 checks passed

jfrench9 deleted the bugfix/deployment-fixes branch November 12, 2025 02:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deployment infrastructure and improve health checks#54

Fix deployment infrastructure and improve health checks#54
jfrench9 merged 10 commits into
mainfrom
bugfix/deployment-fixes

jfrench9 commented Nov 1, 2025

Uh oh!

jfrench9 commented Nov 1, 2025

Uh oh!

claude Bot commented Nov 1, 2025 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Nov 1, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Nov 1, 2025

Summary

Key Accomplishments

Infrastructure Improvements

Configuration Updates

Code Organization

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Nov 1, 2025

Uh oh!

claude Bot commented Nov 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete

Overall Assessment

Critical Issues

1. Missing Environment Variable in Container Refresh (refresh-graph-containers/action.yml:107)

2. Inconsistent Health Check Port Logic (refresh-graph-containers/action.yml:83-87)

3. Missing Secret Name Update in Test (test_auth_middleware.py:278)

High-Priority Issues

4. Deployment Workflow Permissions Removed Without Documentation (.github/workflows/deploy-neo4j-writers.yml:1-7)

5. Unsafe Image Tag Usage in Staging (.github/workflows/staging.yml:462)

6. Environment Variable Persistence Issues (kuzu-writer.sh:372-385, neo4j-writer.sh:376-392)

Medium-Priority Issues

7. Hardcoded Wait Times (refresh-graph-asg/action.yml:366)

8. Missing Error Context in Volume Manager Invocation (kuzu-writer.sh:129)

9. Duplicate Code in Userdata Scripts (kuzu-writer.sh:236-263, neo4j-writer.sh:236-265)

10. Missing Validation for SharedRepositories (cloudformation/kuzu-writers.yaml:65-69)

Low-Priority Issues / Suggestions

11. Inconsistent CloudWatch Namespace Naming (kuzu-writer.sh:43-45)

12. Magic Numbers in Health Checks (run-graph-container.sh:176-177)

13. Justfile Command Reorganization (justfile:90-107)

14. Missing Validation in Workflow (staging.yml:462)

Positive Aspects

Breaking Changes Validation

Testing Recommendations

Summary

Tasks

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Nov 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Nov 1, 2025 •

edited

Loading

1. Missing Environment Variable in Container Refresh (`refresh-graph-containers/action.yml:107`)

2. Inconsistent Health Check Port Logic (`refresh-graph-containers/action.yml:83-87`)

3. Missing Secret Name Update in Test (`test_auth_middleware.py:278`)

4. Deployment Workflow Permissions Removed Without Documentation (`.github/workflows/deploy-neo4j-writers.yml:1-7`)

5. Unsafe Image Tag Usage in Staging (`.github/workflows/staging.yml:462`)

6. Environment Variable Persistence Issues (`kuzu-writer.sh:372-385`, `neo4j-writer.sh:376-392`)

7. Hardcoded Wait Times (`refresh-graph-asg/action.yml:366`)

8. Missing Error Context in Volume Manager Invocation (`kuzu-writer.sh:129`)

9. Duplicate Code in Userdata Scripts (`kuzu-writer.sh:236-263`, `neo4j-writer.sh:236-265`)

10. Missing Validation for SharedRepositories (`cloudformation/kuzu-writers.yaml:65-69`)

11. Inconsistent CloudWatch Namespace Naming (`kuzu-writer.sh:43-45`)

12. Magic Numbers in Health Checks (`run-graph-container.sh:176-177`)

13. Justfile Command Reorganization (`justfile:90-107`)

14. Missing Validation in Workflow (`staging.yml:462`)