Refactor GitHub Actions workflows to streamline service refresh process by jfrench9 · Pull Request #186 · RoboFinSystems/robosystems

jfrench9 · 2026-01-14T08:22:20Z

Summary

This PR consolidates and refactors our GitHub Actions workflow infrastructure to improve the service refresh process. The changes introduce a centralized service refresh workflow while simplifying existing deployment pipelines.

Key Accomplishments

Consolidated Service Refresh Logic: Created a comprehensive service-refresh.yml workflow that centralizes service refresh operations previously scattered across multiple workflow files
Simplified Deployment Workflows: Streamlined deploy-api.yml and deploy-dagster.yml by removing redundant service refresh logic (15 and 68 lines removed respectively)
Enhanced Environment Workflows: Updated both prod.yml and staging.yml workflows with improved service refresh integration
Cleanup: Removed the obsolete graph-container-refresh.yml workflow (294 lines) as its functionality has been consolidated into the new service refresh workflow
Dependency Updates: Updated project dependencies to support the enhanced workflow capabilities

Breaking Changes

The graph-container-refresh.yml workflow has been removed - any direct references or manual triggers to this workflow will no longer function
Deployment workflows now follow a different service refresh pattern - teams should verify their deployment processes align with the new workflow structure

Testing Notes

All existing deployment pipelines should be tested to ensure they properly integrate with the new service refresh workflow
Verify that service refresh operations complete successfully in both staging and production environments
Test workflow triggers and dependencies to ensure proper execution order

Infrastructure Considerations

The new centralized workflow reduces code duplication and provides better maintainability for service refresh operations
Deployment times may be affected due to changes in workflow orchestration - monitor pipeline performance
Ensure all necessary secrets and environment variables are properly configured for the new workflow structure
Review workflow permissions and access controls for the new service refresh workflow

🤖 Generated with Claude Code

Branch Info:

Source: feature/service-refresh-workflow
Target: main
Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Removed the refresh ECS service input from API and Dagster workflows. - Consolidated ECS service refresh logic into a new 'service-refresh' workflow, allowing for parallel execution of refresh tasks for graph containers, API, and Dagster services. - Deleted the obsolete graph-container-refresh workflow, streamlining the deployment process. - Updated production and staging workflows to utilize the new service-refresh workflow, improving deployment efficiency.

jfrench9 · 2026-01-14T08:22:21Z

@claude please review this PR

claude · 2026-01-14T08:22:34Z

Claude finished @jfrench9's task in 1m 56s —— View job

PR Review: Workflow Refactoring

Tasks

Read and analyze new service-refresh.yml workflow
Review modified deployment workflows (deploy-api.yml, deploy-dagster.yml)
Review modified environment workflows (prod.yml, staging.yml)
Check deleted graph-container-refresh.yml to understand what was replaced
Review composite actions used by workflows
Analyze workflow security, triggers, and dependencies
Provide comprehensive review feedback

Review Summary

This PR successfully consolidates service refresh logic into a unified workflow. The refactoring is well-structured and improves maintainability, but there are several areas that need attention before merging.

Critical Issues

1. Incorrect Output Reference in `service-refresh.yml` (`.github/workflows/service-refresh.yml:350-356`)

The refresh-api job attempts to set an output in the wrong step:

- name: Set status output
  if: always()
  run: |
    if [ "${{ job.status }}" == "success" ]; then
      echo "status=success" >> $GITHUB_OUTPUT
    else
      echo "status=failed" >> $GITHUB_OUTPUT
    fi

Problem: This step doesn't have an id, but the job declares outputs.status: ${{ steps.refresh.outputs.status }}. The output is never actually used, but this is dead code that should be removed.

Fix: Remove the "Set status output" steps from refresh-api, refresh-dagster-daemon, and refresh-dagster-webserver jobs (lines 349-356, 394-401, 439-446) since the job-level outputs rely on job results, not step outputs.

2. Boolean vs String Type Inconsistency (`.github/workflows/service-refresh.yml:13-27, 58-94`)

The workflow has inconsistent input types between workflow_dispatch and workflow_call:

workflow_dispatch uses type: boolean for enable flags
workflow_call uses type: string with string defaults like "true"

Problem: This creates fragile conditional logic. Line 115 checks:

if: ${{ (github.event_name == 'workflow_dispatch' && inputs.graph_refresh_enabled) || (github.event_name != 'workflow_dispatch' && inputs.graph_refresh_enabled == 'true') }}

This works but is error-prone and makes the workflow harder to maintain.

Recommendation: Standardize on type: string for workflow_call inputs (required by GitHub Actions for reusable workflows) and always use string comparison in conditionals: inputs.graph_refresh_enabled == 'true'.

3. Hardcoded Stack Name Logic (`.github/workflows/service-refresh.yml:322-338`)

The workflow hardcodes CloudFormation stack naming conventions:

if [ "$ENV" == "prod" ]; then
  STACK_NAME="RoboSystemsAPIProd"
else
  STACK_NAME="RoboSystemsAPIStaging"
fi

Problem: This duplicates naming logic that already exists in stack-config and breaks if stack names change. The prod.yml and staging.yml workflows already have stack names from stack-config.outputs.api_stack.

Fix: Pass api_stack_name and dagster_stack_name as required inputs from the calling workflows (already done correctly in lines 589, 606, 592, 609).

Major Issues

4. Missing Job Dependencies in Summary (`.github/workflows/service-refresh.yml:452`)

The summary job declares dependencies on all refresh jobs:

summary:
  needs: [collect-graph-instances, refresh-graph, refresh-api, refresh-dagster-daemon, refresh-dagster-webserver]

Problem: If any of these jobs are skipped (e.g., graph_refresh_enabled: false), the dependency will be unmet and the summary job will be skipped even if other refreshes succeeded.

Fix: Add if: always() to the summary job to ensure it runs regardless of which jobs were skipped.

5. Race Condition in Parallel Refreshes (`.github/workflows/service-refresh.yml:252-446`)

The workflow refreshes API, Dagster daemon, and Dagster webserver in parallel without coordination.

Problem: If these services share dependencies (database migrations, cache state), parallel refreshes could cause race conditions or inconsistent state.

Recommendation:

Document whether parallel execution is safe
Consider adding a comment explaining the refresh order assumptions
If needed, serialize critical refreshes (e.g., daemon before webserver)

6. Incomplete Error Handling in Composite Actions

refresh-graph-containers/action.yml:182-190: Health check failures exit with code 1, but there's no mechanism to report partial failures to the summary job.

refresh-ecs/action.yml:274-286: Timeouts log a warning but proceed anyway:

echo "Proceeding despite timeout - service update was initiated successfully"

Problem: This can mask real deployment failures. If a service never stabilizes, downstream services might depend on an unhealthy instance.

Recommendation:

Consider making timeout behavior configurable (fail vs warn)
Add output to indicate timeout vs success
Update summary to distinguish between success, timeout, and failure

Minor Issues

7. Magic Numbers and Timeouts

The workflow uses various hardcoded timeouts:

timeout-minutes: 15 for graph refresh (line 256)
timeout-minutes: 10 for API refresh (line 302)
sleep 30 in health checks

Recommendation: Extract these to workflow inputs with defaults for better tuning in different environments.

8. Verbose Bash Scripts

The collect-graph-instances job has 120 lines of inline bash (lines 130-250). This is difficult to test and maintain.

Recommendation: Consider moving this to a composite action or bash script in the repository for better testability.

9. Output Variables Never Used

Several outputs are declared but never consumed:

service-refresh.yml outputs graph_instances_updated, api_refreshed, dagster_refreshed
These are passed up to calling workflows but not used by prod.yml or staging.yml

Decision needed: Either use these outputs for notifications/logging or remove them to reduce complexity.

10. Inconsistent Error Messages

Error messages vary in format:

Some use emojis: echo "❌ Health check failed"
Some don't: echo "Error: Could not find ClusterName"

Recommendation: Standardize on either plain text or emoji usage for consistency.

Security Considerations

✅ Good Security Practices Observed:

Uses OIDC for AWS authentication (no long-lived credentials)
Proper use of secrets.ACTIONS_TOKEN with fallback to github.token
SSM command execution uses proper shell escaping
No secrets in logs or outputs

⚠️ Potential Security Concerns:

11. SSM Command Injection Risk (.github/actions/refresh-graph-containers/action.yml:98-113)

The action builds SSM commands with variable interpolation:

\"docker pull ${ECR_URI}:${{ inputs.environment }}\"

Risk: If inputs.environment comes from untrusted input (e.g., workflow_dispatch), it could contain shell metacharacters.

Mitigation: The workflow restricts environment to choice: [staging, prod], which prevents injection. However, consider adding explicit validation in the action itself as defense-in-depth.

12. No Approval Gates for Production

The service-refresh.yml workflow can be triggered manually via workflow_dispatch for production without approval.

Recommendation: Consider adding an environment: production declaration to require approval for prod refreshes.

Testing & Validation

What Should Be Tested:

✅ Dry-run the workflow with graph_refresh_enabled: false to ensure skipped jobs don't break the summary
✅ Test with api_refresh_enabled: false and dagster_refresh_enabled: false independently
✅ Verify health checks fail correctly when services are unhealthy
✅ Test the new stack name defaulting logic works for both staging and prod
✅ Confirm parallel refresh doesn't cause race conditions in your environment

Suggested Test Plan:

# Test 1: Staging with all refreshes
gh workflow run service-refresh.yml -f environment=staging

# Test 2: Prod with only graph refresh
gh workflow run service-refresh.yml -f environment=prod -f graph_refresh_enabled=true -f api_refresh_enabled=false -f dagster_refresh_enabled=false

# Test 3: Health check failure simulation (requires manual intervention)

Positive Aspects

✅ Excellent consolidation: Reduces 294 lines of duplicated code
✅ Clear separation of concerns: Graph (EC2), API (ECS), Dagster (ECS)
✅ Reusable design: Can be called from deployment workflows or manually
✅ Good parameterization: Flexible node type filtering, environment targeting
✅ Matrix strategy for graph instances: Efficient parallel updates
✅ Comprehensive summary job: Provides clear visibility into refresh results
✅ Backend-agnostic design: Supports both LadybugDB and Neo4j

Recommendations Summary

Must Fix Before Merge:

Issue Inital open source release of robosystems #1: Remove dead code from status output steps
Issue Enhance SEC pipeline security and add reconsolidation support #3: Remove hardcoded stack name fallback logic (rely on passed inputs)
Issue chore: improve documentation and suppress build output #4: Add if: always() to summary job

Should Fix Before Merge:

Issue Fix MCP tools query results to include units and improve dimensions #2: Standardize boolean/string handling in conditionals
Issue Remove unused Account and Disclosure transaction nodes from graph schema #5: Document or address race condition risk in parallel refreshes
Issue Fix worker monitor Valkey connection with Secrets Manager auth #6: Improve timeout handling in ECS refresh action

Consider for Future PRs:

Issue #7: Extract magic numbers to configurable inputs
Issue #8: Refactor long bash scripts to separate files
Issue #9: Remove unused output variables
Issue #10: Standardize error message format
Issue #11: Add explicit input validation as defense-in-depth
Issue Refactor: Archive Kuzu for ongoing support #12: Add approval gates for production

Conclusion

This is a solid refactoring that significantly improves workflow maintainability. The architecture is sound, and the consolidation makes sense. However, there are a few critical issues (particularly #1, #3, #4) that should be addressed before merging to ensure the workflow functions correctly.

Once the critical issues are resolved, this PR will be ready to merge. Great work on the consolidation effort! 🚀

…unnecessary outputs - Added minimal top-level permissions for jobs in the service-refresh workflow. - Removed redundant output steps related to service refresh status for API and Dagster services, simplifying the workflow.

github-advanced-security AI found potential problems Jan 14, 2026

View reviewed changes

Comment thread .github/workflows/service-refresh.yml Fixed

jfrench9 merged commit b160ac5 into main Jan 14, 2026
6 checks passed

jfrench9 deleted the feature/service-refresh-workflow branch January 14, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor GitHub Actions workflows to streamline service refresh process#186

Refactor GitHub Actions workflows to streamline service refresh process#186
jfrench9 merged 2 commits into
mainfrom
feature/service-refresh-workflow

jfrench9 commented Jan 14, 2026

Uh oh!

jfrench9 commented Jan 14, 2026

Uh oh!

claude Bot commented Jan 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jfrench9 commented Jan 14, 2026

Summary

Key Accomplishments

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Jan 14, 2026

Uh oh!

claude Bot commented Jan 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Workflow Refactoring

Tasks

Review Summary

Critical Issues

1. Incorrect Output Reference in service-refresh.yml (.github/workflows/service-refresh.yml:350-356)

2. Boolean vs String Type Inconsistency (.github/workflows/service-refresh.yml:13-27, 58-94)

3. Hardcoded Stack Name Logic (.github/workflows/service-refresh.yml:322-338)

Major Issues

4. Missing Job Dependencies in Summary (.github/workflows/service-refresh.yml:452)

5. Race Condition in Parallel Refreshes (.github/workflows/service-refresh.yml:252-446)

6. Incomplete Error Handling in Composite Actions

Minor Issues

7. Magic Numbers and Timeouts

8. Verbose Bash Scripts

9. Output Variables Never Used

10. Inconsistent Error Messages

Security Considerations

✅ Good Security Practices Observed:

⚠️ Potential Security Concerns:

Testing & Validation

What Should Be Tested:

Suggested Test Plan:

Positive Aspects

Recommendations Summary

Must Fix Before Merge:

Should Fix Before Merge:

Consider for Future PRs:

Conclusion

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jan 14, 2026 •

edited

Loading

1. Incorrect Output Reference in `service-refresh.yml` (`.github/workflows/service-refresh.yml:350-356`)

2. Boolean vs String Type Inconsistency (`.github/workflows/service-refresh.yml:13-27, 58-94`)

3. Hardcoded Stack Name Logic (`.github/workflows/service-refresh.yml:322-338`)

4. Missing Job Dependencies in Summary (`.github/workflows/service-refresh.yml:452`)

5. Race Condition in Parallel Refreshes (`.github/workflows/service-refresh.yml:252-446`)