Skip to content

Fix deployment infrastructure and improve health checks#54

Merged
jfrench9 merged 10 commits into
mainfrom
bugfix/deployment-fixes
Nov 1, 2025
Merged

Fix deployment infrastructure and improve health checks#54
jfrench9 merged 10 commits into
mainfrom
bugfix/deployment-fixes

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Nov 1, 2025

Summary

This PR addresses critical deployment issues and enhances the reliability of our infrastructure through improved health checks, input validation, and environment management across GitHub Actions workflows and deployment scripts.

Key Accomplishments

Infrastructure Improvements

  • Enhanced Health Checks: Improved health check mechanisms in GitHub Actions to better detect deployment failures and service readiness
  • Input Validation: Added robust input validation to deployment workflows to prevent configuration errors
  • Environment Management: Streamlined environment variable handling and container startup processes
  • Workflow Optimization: Removed unnecessary permissions from deployment workflows and improved error handling

Configuration Updates

  • Environment Consistency: Updated staging workflow to use environment-specific image tags instead of 'latest' for better version control
  • Secret Management: Updated API key retrieval to use new naming conventions for graph API integration
  • CloudFormation: Added missing environment variables to infrastructure templates for proper service configuration

Code Organization

  • Script Refactoring: Reorganized and simplified container startup scripts for better maintainability
  • Command Structure: Updated build and deployment command organization for improved developer experience

Breaking Changes

  • Secret naming convention has changed for graph API integration - existing deployments will need to update secret references
  • Container image tagging strategy updated for staging environment

Testing Notes

  • All existing tests have been updated to reflect new secret naming conventions
  • Health check improvements have been validated in staging environment
  • Container startup scripts tested across different environment configurations

Infrastructure Considerations

  • Deployment workflows now have more robust failure detection and recovery mechanisms
  • Environment-specific configuration is more strictly enforced
  • Container orchestration has been simplified while maintaining reliability
  • CloudFormation templates include additional environment context for better service discovery

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/deployment-fixes
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…eploying Neo4j writers, streamlining the configuration.
…n template

- Introduced AWS_STACK_NAME export in the setup script for Kuzu writers to enhance environment variable management.
- Updated both writer configurations to ensure consistency in environment variable exports.
…raph API integration

- Changed the default secret name from "kuzu" to "graph-api" to align with the new naming convention for the graph API.
- Added a TODO comment to consider moving the secret management to a dedicated secrets manager module in the future.
…'latest' for consistent environment configuration
- Updated the GitHub Actions workflow to use a universal startup script for container management.
- Modified the Kuzu writer userdata script to persist ECR-related environment variables for health check restarts.
- Enhanced the script to source persisted variables and retrieve EC2 metadata dynamically for improved configuration management.
- Moved demo scripts to a new section for better organization.
- Updated the command structure for demo scripts to ensure consistency and clarity.
- Renamed parameters in duckdb-query commands for improved readability.
…nd environment management

- Updated the handle-deployment-failure action to retrieve failed steps directly from job details, streamlining error reporting.
- Modified the refresh-ecs action to include an additional output key for cluster name retrieval, enhancing flexibility.
- Expanded the kuzu-writer userdata script to persist more environment variables, ensuring comprehensive configuration for container restarts.
…s and input validation

- Added health check port determination based on backend type in refresh-graph actions.
- Implemented input validation for backend, node type, and environment in refresh-graph-containers action.
- Updated container naming logic to be backend-agnostic, simplifying deployment configurations.
- Persisted additional environment variables in the Neo4j writer userdata script for better container management.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Nov 1, 2025

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Nov 1, 2025

Claude finished @jfrench9's task —— View job


PR Review Complete

I've completed a thorough review of PR #54. Here's my comprehensive feedback:


Overall Assessment

This PR makes significant improvements to deployment infrastructure, health checks, and environment management. The changes are well-structured and address real operational concerns. However, I've identified several issues that should be addressed before merging.


Critical Issues

1. Missing Environment Variable in Container Refresh (refresh-graph-containers/action.yml:107)

The ECR_IMAGE variable is constructed from components in the SSM command but never exported:

export ECR_IMAGE="${ECR_URI}:${{ inputs.environment }}"

Problem: This variable is used by run-graph-container.sh but isn't defined in the environment passed to the script.

Fix: Add to the environment setup in lines 98-113:

export ECR_URI="${ECR_URI}"
export ECR_IMAGE_TAG="${{ inputs.environment }}"
export ECR_IMAGE="${ECR_URI}:${{ inputs.environment }}"

2. Inconsistent Health Check Port Logic (refresh-graph-containers/action.yml:83-87)

The health check port is determined in the action but the backend type isn't validated:

if [ "${{ inputs.backend }}" == "neo4j" ]; then
  HEALTH_PORT="7474"
else
  HEALTH_PORT="8001"
fi

Problem: If backend is neither "neo4j" nor "kuzu", it defaults to 8001 without warning.

Recommendation: Add validation after input validation (around line 62):

if [[ ! "${{ inputs.backend }}" =~ ^(kuzu|neo4j)$ ]]; then
  echo "❌ Backend must be validated: ${{ inputs.backend }}"
  exit 1
fi

3. Missing Secret Name Update in Test (test_auth_middleware.py:278)

The test hardcodes the old secret name pattern:

mock_client.get_secret_value.assert_called_once_with(
    SecretId="robosystems/prod/graph-api"  # ✅ This is correct
)

While this is actually correct for the new naming convention, the PR description mentions "Secret naming convention has changed" but doesn't clearly document what the old vs new patterns are.

Recommendation: Add a comment in auth.py documenting the secret naming convention:

# Secret naming: robosystems/{env}/graph-api (unified for all graph backends)
# Old pattern: robosystems/{env}/kuzu (deprecated)

High-Priority Issues

4. Deployment Workflow Permissions Removed Without Documentation (.github/workflows/deploy-neo4j-writers.yml:1-7)

The workflow removed permissions: section but the PR description just says "Removed unnecessary permissions."

Issue: No documentation of which permissions were removed or why they were unnecessary.

Recommendation: Add a comment explaining the permission inheritance:

# Inherits repository-level permissions from calling workflow
# No additional permissions needed for deployment operations

5. Unsafe Image Tag Usage in Staging (.github/workflows/staging.yml:462)

The staging workflow uses a hardcoded tag:

ecr_image_tag: staging

Issue: If the staging tag is updated during deployment, there's no version pinning for rollbacks.

Recommendation: Consider using commit SHA for staging deployments:

ecr_image_tag: ${{ github.sha }}

6. Environment Variable Persistence Issues (kuzu-writer.sh:372-385, neo4j-writer.sh:376-392)

Both scripts persist environment variables to /etc/environment, but some critical variables (like INSTANCE_ID) are instance-specific and shouldn't persist across reboots:

Problem: If an instance is reimaged or replaced, stale instance metadata could cause issues.

Recommendation: Separate static config from dynamic metadata. Only persist static variables to /etc/environment.


Medium-Priority Issues

7. Hardcoded Wait Times (refresh-graph-asg/action.yml:366)

sleep "${{ inputs.health-check-timeout }}"

The script waits the full timeout period regardless of health status. This could be optimized to poll for health and exit early on success.

8. Missing Error Context in Volume Manager Invocation (kuzu-writer.sh:129)

aws lambda invoke ... || echo "Failed to invoke Volume Manager"

Issue: Error is swallowed with generic message. No context about why it failed.

Recommendation: Capture and log the actual error:

if ! aws lambda invoke ... 2>&1 | tee /tmp/volume-manager-error.log; then
  echo "Failed to invoke Volume Manager. Error: $(cat /tmp/volume-manager-error.log)"
fi

9. Duplicate Code in Userdata Scripts (kuzu-writer.sh:236-263, neo4j-writer.sh:236-265)

Both scripts have identical S3 download logic for shared scripts. This could be extracted to a common function or script.

10. Missing Validation for SharedRepositories (cloudformation/kuzu-writers.yaml:65-69)

The parameter has a pattern constraint but doesn't validate that the repository names are from an allowed set:

AllowedPattern: "^$|^[a-z]+(,[a-z]+)*$"

Recommendation: Add to the description which repositories are valid (sec, industry, economic, etc.) or add a more specific pattern.


Low-Priority Issues / Suggestions

11. Inconsistent CloudWatch Namespace Naming (kuzu-writer.sh:43-45)

Uses RoboSystemsKuzu/${ENV_CAPITALIZED} but Neo4j uses RoboSystemsGraph/${ENV_CAPITALIZED}/Neo4j....

Suggestion: Standardize to RoboSystemsGraph/${ENV_CAPITALIZED}/Kuzu for consistency.

12. Magic Numbers in Health Checks (run-graph-container.sh:176-177)

for i in {1..12}; do  # 12 * 5 = 60 seconds

Suggestion: Use variables for clarity:

HEALTH_CHECK_ATTEMPTS=12
HEALTH_CHECK_INTERVAL=5
for i in $(seq 1 $HEALTH_CHECK_ATTEMPTS); do

13. Justfile Command Reorganization (justfile:90-107)

The demo commands were reorganized but the comment structure could be improved for better discoverability.

14. Missing Validation in Workflow (staging.yml:462)

The image tag is hardcoded without validation that the image actually exists in ECR.


Positive Aspects

  1. Excellent Input Validation: The new validation in refresh-graph-asg (lines 69-84) is thorough
  2. Improved Error Handling: Better failure detection in deployment workflows
  3. Unified Health Checks: Consolidation of health check logic across backends is well done
  4. Environment Consistency: The staging workflow image tag change (staging instead of latest) improves consistency
  5. Code Reuse: The shared script pattern (run-graph-container.sh) reduces duplication effectively

Breaking Changes Validation

The PR correctly identifies breaking changes:

  • ✅ Secret naming convention change is documented
  • ✅ Image tagging strategy change is documented
  • ⚠️ Migration path for existing deployments could be clearer

Recommendation: Add a migration section to the PR description:

## Migration Steps
1. Update secrets in AWS Secrets Manager to use new naming: `robosystems/{env}/graph-api`
2. Update staging environment to pull `staging` tag instead of `latest`
3. Verify health check endpoints respond on correct ports

Testing Recommendations

  1. Test container refresh on both Kuzu and Neo4j backends
  2. Verify health checks work with new port logic
  3. Test volume attachment failure scenarios
  4. Validate secret retrieval with new naming convention
  5. Test ASG refresh with allocated databases (force-refresh flow)

Summary

Recommendation: Request changes to address critical issues #1-3 before merging. The other issues can be addressed in follow-up PRs if needed for velocity, but the missing environment variable (#1) could cause runtime failures.

The PR makes valuable improvements to deployment reliability and infrastructure management. Once the critical issues are addressed, this will be a solid enhancement to the deployment pipeline.


Tasks

  • Read all changed files (15 files)
  • Review GitHub Actions workflow changes
  • Review userdata script changes
  • Review CloudFormation template changes
  • Review application code changes
  • Compile comprehensive review feedback

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 99 to 110
\"#!/bin/bash\",
\"set -e\",
\"echo \\\"🔄 Starting container refresh for $CONTAINER_NAME\\\"\",
\"echo \\\"Loading environment variables...\\\"\",
\"if [ -f /etc/environment ]; then set -a; source /etc/environment; set +a; fi\",
\"echo \\\"Pulling latest image...\\\"\",
\"aws ecr get-login-password --region ${{ inputs.aws-region }} | docker login --username AWS --password-stdin ${ECR_URI%/*}\",
\"docker pull ${ECR_URI}:latest\",
\"docker pull ${ECR_URI}:${{ inputs.environment }}\",
\"export ECR_IMAGE=\\\"${ECR_URI}:${{ inputs.environment }}\\\"\",
\"echo \\\"Stopping existing container...\\\"\",
\"docker stop $CONTAINER_NAME || true\",
\"docker rm $CONTAINER_NAME || true\",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop legacy graph containers before renaming to graph-api

The refresh action and common runner now hard-code the container name to graph-api/graph-api-shared and only call docker stop/docker rm on those names. Existing instances launched before this change still have running containers named kuzu-writer or neo4j-writer, so the first refresh after deploying this commit will not stop the old container. When run-graph-container.sh tries to start a new container on the same ports, Docker will fail with a "port already allocated" error and the update never succeeds. The refresh command should also remove the previous backend-specific names (or perform a rename) so that the rollout does not fail on active hosts.

Useful? React with 👍 / 👎.

…I integration

- Updated CloudWatch namespace in kuzu-writer and neo4j-writer scripts to a unified format for all graph backends.
- Enhanced error handling in Volume Manager Lambda invocation for better debugging and logging.
- Centralized API key retrieval in the auth middleware to use a unified secrets manager for Kuzu and Neo4j.
- Improved health check logic in run-graph-container script for clearer output and configuration management.
@jfrench9 jfrench9 merged commit 5026007 into main Nov 1, 2025
5 checks passed
@jfrench9 jfrench9 deleted the bugfix/deployment-fixes branch November 12, 2025 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant