Skip to content

Refactor GitHub Actions workflows to streamline service refresh process#186

Merged
jfrench9 merged 2 commits into
mainfrom
feature/service-refresh-workflow
Jan 14, 2026
Merged

Refactor GitHub Actions workflows to streamline service refresh process#186
jfrench9 merged 2 commits into
mainfrom
feature/service-refresh-workflow

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

This PR consolidates and refactors our GitHub Actions workflow infrastructure to improve the service refresh process. The changes introduce a centralized service refresh workflow while simplifying existing deployment pipelines.

Key Accomplishments

  • Consolidated Service Refresh Logic: Created a comprehensive service-refresh.yml workflow that centralizes service refresh operations previously scattered across multiple workflow files
  • Simplified Deployment Workflows: Streamlined deploy-api.yml and deploy-dagster.yml by removing redundant service refresh logic (15 and 68 lines removed respectively)
  • Enhanced Environment Workflows: Updated both prod.yml and staging.yml workflows with improved service refresh integration
  • Cleanup: Removed the obsolete graph-container-refresh.yml workflow (294 lines) as its functionality has been consolidated into the new service refresh workflow
  • Dependency Updates: Updated project dependencies to support the enhanced workflow capabilities

Breaking Changes

  • The graph-container-refresh.yml workflow has been removed - any direct references or manual triggers to this workflow will no longer function
  • Deployment workflows now follow a different service refresh pattern - teams should verify their deployment processes align with the new workflow structure

Testing Notes

  • All existing deployment pipelines should be tested to ensure they properly integrate with the new service refresh workflow
  • Verify that service refresh operations complete successfully in both staging and production environments
  • Test workflow triggers and dependencies to ensure proper execution order

Infrastructure Considerations

  • The new centralized workflow reduces code duplication and provides better maintainability for service refresh operations
  • Deployment times may be affected due to changes in workflow orchestration - monitor pipeline performance
  • Ensure all necessary secrets and environment variables are properly configured for the new workflow structure
  • Review workflow permissions and access controls for the new service refresh workflow

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/service-refresh-workflow
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Removed the refresh ECS service input from API and Dagster workflows.
- Consolidated ECS service refresh logic into a new 'service-refresh' workflow, allowing for parallel execution of refresh tasks for graph containers, API, and Dagster services.
- Deleted the obsolete graph-container-refresh workflow, streamlining the deployment process.
- Updated production and staging workflows to utilize the new service-refresh workflow, improving deployment efficiency.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Jan 14, 2026

Claude finished @jfrench9's task in 1m 56s —— View job


PR Review: Workflow Refactoring

Tasks

  • Read and analyze new service-refresh.yml workflow
  • Review modified deployment workflows (deploy-api.yml, deploy-dagster.yml)
  • Review modified environment workflows (prod.yml, staging.yml)
  • Check deleted graph-container-refresh.yml to understand what was replaced
  • Review composite actions used by workflows
  • Analyze workflow security, triggers, and dependencies
  • Provide comprehensive review feedback

Review Summary

This PR successfully consolidates service refresh logic into a unified workflow. The refactoring is well-structured and improves maintainability, but there are several areas that need attention before merging.

Critical Issues

1. Incorrect Output Reference in service-refresh.yml (.github/workflows/service-refresh.yml:350-356)

The refresh-api job attempts to set an output in the wrong step:

- name: Set status output
  if: always()
  run: |
    if [ "${{ job.status }}" == "success" ]; then
      echo "status=success" >> $GITHUB_OUTPUT
    else
      echo "status=failed" >> $GITHUB_OUTPUT
    fi

Problem: This step doesn't have an id, but the job declares outputs.status: ${{ steps.refresh.outputs.status }}. The output is never actually used, but this is dead code that should be removed.

Fix: Remove the "Set status output" steps from refresh-api, refresh-dagster-daemon, and refresh-dagster-webserver jobs (lines 349-356, 394-401, 439-446) since the job-level outputs rely on job results, not step outputs.

2. Boolean vs String Type Inconsistency (.github/workflows/service-refresh.yml:13-27, 58-94)

The workflow has inconsistent input types between workflow_dispatch and workflow_call:

  • workflow_dispatch uses type: boolean for enable flags
  • workflow_call uses type: string with string defaults like "true"

Problem: This creates fragile conditional logic. Line 115 checks:

if: ${{ (github.event_name == 'workflow_dispatch' && inputs.graph_refresh_enabled) || (github.event_name != 'workflow_dispatch' && inputs.graph_refresh_enabled == 'true') }}

This works but is error-prone and makes the workflow harder to maintain.

Recommendation: Standardize on type: string for workflow_call inputs (required by GitHub Actions for reusable workflows) and always use string comparison in conditionals: inputs.graph_refresh_enabled == 'true'.

3. Hardcoded Stack Name Logic (.github/workflows/service-refresh.yml:322-338)

The workflow hardcodes CloudFormation stack naming conventions:

if [ "$ENV" == "prod" ]; then
  STACK_NAME="RoboSystemsAPIProd"
else
  STACK_NAME="RoboSystemsAPIStaging"
fi

Problem: This duplicates naming logic that already exists in stack-config and breaks if stack names change. The prod.yml and staging.yml workflows already have stack names from stack-config.outputs.api_stack.

Fix: Pass api_stack_name and dagster_stack_name as required inputs from the calling workflows (already done correctly in lines 589, 606, 592, 609).

Major Issues

4. Missing Job Dependencies in Summary (.github/workflows/service-refresh.yml:452)

The summary job declares dependencies on all refresh jobs:

summary:
  needs: [collect-graph-instances, refresh-graph, refresh-api, refresh-dagster-daemon, refresh-dagster-webserver]

Problem: If any of these jobs are skipped (e.g., graph_refresh_enabled: false), the dependency will be unmet and the summary job will be skipped even if other refreshes succeeded.

Fix: Add if: always() to the summary job to ensure it runs regardless of which jobs were skipped.

5. Race Condition in Parallel Refreshes (.github/workflows/service-refresh.yml:252-446)

The workflow refreshes API, Dagster daemon, and Dagster webserver in parallel without coordination.

Problem: If these services share dependencies (database migrations, cache state), parallel refreshes could cause race conditions or inconsistent state.

Recommendation:

  • Document whether parallel execution is safe
  • Consider adding a comment explaining the refresh order assumptions
  • If needed, serialize critical refreshes (e.g., daemon before webserver)

6. Incomplete Error Handling in Composite Actions

refresh-graph-containers/action.yml:182-190: Health check failures exit with code 1, but there's no mechanism to report partial failures to the summary job.

refresh-ecs/action.yml:274-286: Timeouts log a warning but proceed anyway:

echo "Proceeding despite timeout - service update was initiated successfully"

Problem: This can mask real deployment failures. If a service never stabilizes, downstream services might depend on an unhealthy instance.

Recommendation:

  • Consider making timeout behavior configurable (fail vs warn)
  • Add output to indicate timeout vs success
  • Update summary to distinguish between success, timeout, and failure

Minor Issues

7. Magic Numbers and Timeouts

The workflow uses various hardcoded timeouts:

  • timeout-minutes: 15 for graph refresh (line 256)
  • timeout-minutes: 10 for API refresh (line 302)
  • sleep 30 in health checks

Recommendation: Extract these to workflow inputs with defaults for better tuning in different environments.

8. Verbose Bash Scripts

The collect-graph-instances job has 120 lines of inline bash (lines 130-250). This is difficult to test and maintain.

Recommendation: Consider moving this to a composite action or bash script in the repository for better testability.

9. Output Variables Never Used

Several outputs are declared but never consumed:

  • service-refresh.yml outputs graph_instances_updated, api_refreshed, dagster_refreshed
  • These are passed up to calling workflows but not used by prod.yml or staging.yml

Decision needed: Either use these outputs for notifications/logging or remove them to reduce complexity.

10. Inconsistent Error Messages

Error messages vary in format:

  • Some use emojis: echo "❌ Health check failed"
  • Some don't: echo "Error: Could not find ClusterName"

Recommendation: Standardize on either plain text or emoji usage for consistency.

Security Considerations

✅ Good Security Practices Observed:

  1. Uses OIDC for AWS authentication (no long-lived credentials)
  2. Proper use of secrets.ACTIONS_TOKEN with fallback to github.token
  3. SSM command execution uses proper shell escaping
  4. No secrets in logs or outputs

⚠️ Potential Security Concerns:

11. SSM Command Injection Risk (.github/actions/refresh-graph-containers/action.yml:98-113)

The action builds SSM commands with variable interpolation:

\"docker pull ${ECR_URI}:${{ inputs.environment }}\"

Risk: If inputs.environment comes from untrusted input (e.g., workflow_dispatch), it could contain shell metacharacters.

Mitigation: The workflow restricts environment to choice: [staging, prod], which prevents injection. However, consider adding explicit validation in the action itself as defense-in-depth.

12. No Approval Gates for Production

The service-refresh.yml workflow can be triggered manually via workflow_dispatch for production without approval.

Recommendation: Consider adding an environment: production declaration to require approval for prod refreshes.

Testing & Validation

What Should Be Tested:

  1. ✅ Dry-run the workflow with graph_refresh_enabled: false to ensure skipped jobs don't break the summary
  2. ✅ Test with api_refresh_enabled: false and dagster_refresh_enabled: false independently
  3. ✅ Verify health checks fail correctly when services are unhealthy
  4. ✅ Test the new stack name defaulting logic works for both staging and prod
  5. ✅ Confirm parallel refresh doesn't cause race conditions in your environment

Suggested Test Plan:

# Test 1: Staging with all refreshes
gh workflow run service-refresh.yml -f environment=staging

# Test 2: Prod with only graph refresh
gh workflow run service-refresh.yml -f environment=prod -f graph_refresh_enabled=true -f api_refresh_enabled=false -f dagster_refresh_enabled=false

# Test 3: Health check failure simulation (requires manual intervention)

Positive Aspects

  1. Excellent consolidation: Reduces 294 lines of duplicated code
  2. Clear separation of concerns: Graph (EC2), API (ECS), Dagster (ECS)
  3. Reusable design: Can be called from deployment workflows or manually
  4. Good parameterization: Flexible node type filtering, environment targeting
  5. Matrix strategy for graph instances: Efficient parallel updates
  6. Comprehensive summary job: Provides clear visibility into refresh results
  7. Backend-agnostic design: Supports both LadybugDB and Neo4j

Recommendations Summary

Must Fix Before Merge:

Should Fix Before Merge:

Consider for Future PRs:

  • Issue #7: Extract magic numbers to configurable inputs
  • Issue #8: Refactor long bash scripts to separate files
  • Issue #9: Remove unused output variables
  • Issue #10: Standardize error message format
  • Issue #11: Add explicit input validation as defense-in-depth
  • Issue Refactor: Archive Kuzu for ongoing support #12: Add approval gates for production

Conclusion

This is a solid refactoring that significantly improves workflow maintainability. The architecture is sound, and the consolidation makes sense. However, there are a few critical issues (particularly #1, #3, #4) that should be addressed before merging to ensure the workflow functions correctly.

Once the critical issues are resolved, this PR will be ready to merge. Great work on the consolidation effort! 🚀


Comment thread .github/workflows/service-refresh.yml Fixed
…unnecessary outputs

- Added minimal top-level permissions for jobs in the service-refresh workflow.
- Removed redundant output steps related to service refresh status for API and Dagster services, simplifying the workflow.
@jfrench9 jfrench9 merged commit b160ac5 into main Jan 14, 2026
6 checks passed
@jfrench9 jfrench9 deleted the feature/service-refresh-workflow branch January 14, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants