
Add Spot Instance Support for Graph Replicas Infrastructure #378

Merged
jfrench9 merged 1 commit into main from feature/shared-replicas-spot-capacity
Feb 18, 2026
@jfrench9
Member

Summary

This PR implements Spot Instance configuration capabilities for graph replicas, enabling cost-effective infrastructure deployment while maintaining system reliability and performance.

Key Accomplishments

  • ✅ Enhanced CloudFormation template to support Spot Instance configuration for graph replica deployments
  • ✅ Updated GitHub Actions workflows to handle Spot Instance provisioning across deployment pipelines
  • ✅ Integrated Spot Instance parameters into production and staging environment configurations
  • ✅ Added deployment workflow specifically for graph replicas with Spot Instance support
  • ✅ Updated setup automation to accommodate new infrastructure configuration options

Infrastructure Considerations

  • Cost Optimization: Leverages AWS Spot Instances to significantly reduce compute costs for replica workloads
  • High Availability: Maintains service reliability through proper Spot Instance configuration and fallback mechanisms
  • Environment Parity: Ensures consistent Spot Instance behavior across staging and production environments
  • Scalability: Supports dynamic scaling of graph replicas using cost-effective Spot capacity

Testing Notes

  • Verify Spot Instance provisioning works correctly in staging environment before production deployment
  • Monitor Spot Instance interruption handling and ensure graceful failover mechanisms
  • Validate that graph replica performance meets requirements under Spot Instance constraints
  • Test deployment rollback scenarios in case of Spot capacity unavailability

Breaking Changes

None - This is an additive feature that maintains backward compatibility with existing infrastructure.


🤖 Generated with Claude Code

Branch Info:

  • Source: feature/shared-replicas-spot-capacity
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added parameters for enabling Spot instances, including base capacity, weight, allocation strategy, and proactive replacement settings in the deploy workflows and CloudFormation templates.
- Updated production and staging setup scripts to initialize new Spot configuration variables for shared replicas.
- Enhanced the CloudFormation template to support Spot instance management, improving cost efficiency for replica deployments.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Feb 18, 2026

Claude finished @jfrench9's task in 1m 4s


PR Review: Add Spot Instance Support for Graph Replicas Infrastructure

Tasks:

  • Gather context and understand request
  • Read changed files
  • Review CloudFormation template
  • Review GitHub Actions workflows
  • Review setup script
  • Post review feedback

Overview

This is a well-structured, additive feature that introduces Spot Instance support for the graph replica ASG. The implementation is conservative by default (Spot disabled, falling back to all on-demand) and the parameter flow is consistent across all layers. A few issues are worth addressing before merging.


Issues

1. MixedInstancesPolicy always applied, even when Spot is disabled

File: cloudformation/graph-ladybug-replicas.yaml, lines 516–525

MixedInstancesPolicy:
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: !Ref ReplicaLaunchTemplate
      Version: !GetAtt ReplicaLaunchTemplate.LatestVersionNumber
  InstancesDistribution:
    OnDemandBaseCapacity: !If [IsSpotEnabled, !Ref SpotOnDemandBaseCapacity, 0]
    OnDemandPercentageAboveBaseCapacity: !If [IsSpotEnabled, !Ref SpotOnDemandPercentageAboveBase, 100]
    SpotAllocationStrategy: !If [IsSpotEnabled, !Ref SpotAllocationStrategy, !Ref "AWS::NoValue"]

MixedInstancesPolicy is applied unconditionally. When SpotEnabled=false, OnDemandPercentageAboveBaseCapacity is set to 100 and SpotAllocationStrategy is omitted, which is functionally correct. The risk is for existing stacks that previously used the top-level LaunchTemplate property: if CloudFormation handles the switch from LaunchTemplate to MixedInstancesPolicy as a replacement of the ASG rather than an in-place update, the stack update would cause an outage. This should be confirmed before updating live stacks.

If this is a fresh template with no existing stacks using the old LaunchTemplate form, this is fine. If there are live stacks, consider whether a rolling migration is needed. Worth documenting in the PR.
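One way to check this before touching a live stack is to create a CloudFormation change set from the new template and inspect its Replacement field. The following is a minimal sketch, assuming the AWS CLI is available and a change set already exists; the function name and the stack and change-set names are placeholders, not names from this PR.

```shell
# Sketch: report how a pending change set would modify any Auto Scaling groups.
# Assumes the AWS CLI is installed and the change set has already been created.
# asg_replacement_in_change_set is a hypothetical helper name.
asg_replacement_in_change_set() {
  stack_name="$1"
  change_set_name="$2"
  # For resources being modified, CloudFormation reports Replacement as
  # True, False, or Conditional.
  aws cloudformation describe-change-set \
    --stack-name "$stack_name" \
    --change-set-name "$change_set_name" \
    --query "Changes[?ResourceChange.ResourceType=='AWS::AutoScaling::AutoScalingGroup'].ResourceChange.[LogicalResourceId,Action,Replacement]" \
    --output text
}
```

Calling `asg_replacement_in_change_set <stack> <change-set>` prints one tab-separated row per ASG in the change set. A Replacement value of True would mean the group would be replaced; False would mean it is updated in place.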



2. CapacityRebalance is always omitted when Spot is disabled

File: cloudformation/graph-ladybug-replicas.yaml, line 525

CapacityRebalance: !If [IsSpotCapacityRebalance, true, !Ref "AWS::NoValue"]

The IsSpotCapacityRebalance condition requires both SpotEnabled=true and SpotCapacityRebalance=true, so when Spot is disabled, CapacityRebalance is omitted (uses the AWS default of false). This is correct behavior. Since CapacityRebalance only has an effect with Spot, the IsSpotEnabled half of the condition is redundant and could be dropped. No bug, but it could be simplified.


3. spot_weight description checked against the conversion math (no bug)

File: .github/workflows/deploy-graph-replicas.yml, lines ~107–113

spot_weight:
  description: "Percentage of instances above base that are Spot (0 = all on-demand, 90 = 90% Spot)"
  default: "0"

And the conversion:

SPOT_ON_DEMAND_PCT=$((100 - ${{ inputs.spot_weight }}))

The description says spot_weight represents the Spot percentage (e.g., 90 = 90% Spot), and the math correctly converts it to the on-demand percentage that the ASG API expects. deploy-graph.yml passes spot_weight through with the description "Percentage above base that is Spot for shared replicas", which is consistent. The naming is fine. No bug.
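As a worked example of that conversion, here is a minimal sketch. The helper name spot_weight_to_on_demand_pct is hypothetical and is not part of the workflow; only the `100 - spot_weight` arithmetic comes from the PR.

```shell
# Sketch of the spot_weight -> on-demand percentage mapping.
# spot_weight_to_on_demand_pct is a hypothetical helper for illustration.
spot_weight_to_on_demand_pct() {
  # The Spot share and the on-demand share above base always sum to 100.
  echo $((100 - $1))
}

spot_weight_to_on_demand_pct 0    # prints 100: everything above base is on-demand
spot_weight_to_on_demand_pct 90   # prints 10: 90% Spot, 10% on-demand above base
```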


4. No input validation on spot_weight range

File: .github/workflows/deploy-graph-replicas.yml

SPOT_ON_DEMAND_PCT=$((100 - ${{ inputs.spot_weight }}))

If spot_weight is passed as 101, this arithmetic silently produces -1, which is then passed to CloudFormation's OnDemandPercentageAboveBaseCapacity (valid range 0–100). A non-integer value (e.g., from a misconfigured GHA variable) either fails the shell arithmetic or evaluates to something unintended. CloudFormation will reject an out-of-range value, but the error won't be obvious.

Consider adding a guard:

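SPOT_WEIGHT="${{ inputs.spot_weight }}"
# Require a plain integer first; -lt/-gt alone would mis-handle non-numeric input.
# The 10# prefix forces base 10 so values like "08" are not read as octal.
if ! [[ "$SPOT_WEIGHT" =~ ^[0-9]+$ ]] || (( 10#$SPOT_WEIGHT > 100 )); then
  echo "Error: spot_weight must be an integer 0–100, got '$SPOT_WEIGHT'" >&2
  exit 1
fi
SPOT_ON_DEMAND_PCT=$((100 - 10#$SPOT_WEIGHT))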



Observations

  • gha.sh staging variables are inside if $setup_staging — correct; prod variables are always set. Consistent with existing pattern.
  • AllowedValues on SpotAllocationStrategy is a nice guard that prevents invalid values from reaching EC2.
  • Default spot_rebalance: "true" is reasonable — proactive rebalancing is generally desirable for replica workloads where interruptions are tolerable.
  • spot_weight default of "0" with spot_enabled defaulting to "false" provides a double safety net. Even if spot_enabled were accidentally set to "true", spot_weight=0 means 0% Spot (all on-demand).
  • The GHA variable naming convention (SHARED_REPLICAS_SPOT_ENABLED_PROD) follows the COMPONENT_SETTING_ENVIRONMENT pattern from CLAUDE.md.
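The "double safety net" can be sketched by mirroring the template's !If expressions in shell. This is an illustration under stated assumptions: resolve_spot_settings and its argument order are hypothetical, and the base capacity values passed below are made up for the example because the PR does not state that default.

```shell
# Sketch of the values the template's !If expressions would resolve to.
# resolve_spot_settings is a hypothetical helper; argument order is an assumption.
resolve_spot_settings() {
  spot_enabled="$1"; spot_weight="$2"; base_capacity="$3"; spot_rebalance="$4"
  if [ "$spot_enabled" = "true" ]; then
    on_demand_base="$base_capacity"
    on_demand_pct=$((100 - spot_weight))
  else
    # Spot disabled: the template falls back to base 0 and 100% on-demand.
    on_demand_base=0
    on_demand_pct=100
  fi
  # CapacityRebalance is only emitted when both flags are true.
  if [ "$spot_enabled" = "true" ] && [ "$spot_rebalance" = "true" ]; then
    capacity_rebalance="true"
  else
    capacity_rebalance="omitted"
  fi
  echo "OnDemandBaseCapacity=$on_demand_base OnDemandPercentageAboveBaseCapacity=$on_demand_pct CapacityRebalance=$capacity_rebalance"
}

# Spot disabled, weight 0 (base capacity 0 is an assumed value).
resolve_spot_settings false 0 0 true
# Spot accidentally enabled, weight still 0: everything above base stays on-demand.
resolve_spot_settings true 0 0 true
# Spot enabled with 90% Spot above an assumed base of 1.
resolve_spot_settings true 90 1 true
```

The second call is the case the review describes: with spot_enabled set to "true" but spot_weight left at 0, the on-demand percentage above base is still 100.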

Summary

The implementation is sound and the defaults are safe. The main risk to investigate is whether migrating existing live stacks from LaunchTemplate to MixedInstancesPolicy causes ASG replacement (issue #1). If this is a net-new stack or the team is aware of the replacement, the rest is a solid PR. The input validation gap (issue #4) is a minor but worthwhile fix.

@jfrench9 jfrench9 merged commit 5435dbd into main Feb 18, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/shared-replicas-spot-capacity branch February 19, 2026 08:30