This hands-on lab teaches chaos engineering principles through practical experimentation with a real-world web application. You will deploy Coffee Chaos - a premium coffee bean e-commerce store - with a containerized microservice backend, intentionally inject failures using ToxiProxy, observe how the system degrades, and then implement resilience patterns to make it antifragile.
Coffee Chaos is a React-based single-page application featuring six specialty coffee varieties from around the world. Users can browse products with detailed tasting notes, add items to their cart, adjust quantities, and complete checkout. Orders are posted to a Go microservice and stored in DynamoDB. The entire application runs locally in Docker containers, making it easy to experiment with network failures in a controlled environment.
This lab is divided into three parts:
- Part 1: Deploy the system, run chaos experiments, observe failures
- Part 2: Implement a tactical robustness improvement: idempotency keys with automatic retries
- Part 3: Use AI to explore strategic re-architecture for resilience
By the end of this lab, you will understand:
- How distributed systems fail under stress
- The scientific method of chaos engineering
- Practical resilience patterns: retries, timeouts, circuit breakers, validation
- How to use AI assistants effectively for architectural design
- How to build systems that improve from failure
This lab is designed to run in GitHub Codespaces, which provides a complete development environment in your browser.
- Fork or clone this repository to your GitHub account
- Click the green "Code" button
- Select the "Codespaces" tab
- Click "Create codespace on main"
GitHub will automatically set up your environment with:
- Terraform pre-installed
- Go toolchain for microservice development
- Docker for running ToxiProxy
- AWS CLI pre-configured
- All dependencies ready to use
Your Codespace will be ready in 1-2 minutes. No local installation required!
graph LR
A[Web Browser] -->|HTTP| B[React Webapp<br/>Port 3000]
B -->|HTTP POST| C[ToxiProxy<br/>Port 8000]
C -->|Proxy| D[Go Microservice<br/>Port 8080]
D -->|AWS SDK| E[(DynamoDB Table)]
style A fill:#e1f5ff,stroke:#01579b
style B fill:#e8f5e9,stroke:#1b5e20
style C fill:#fff3e0,stroke:#e65100
style D fill:#f3e5f5,stroke:#4a148c
style E fill:#e8f5e9,stroke:#1b5e20
How it works:
- Users browse coffee products and add them to their shopping cart in the React webapp (port 3000)
- When users click checkout, orders are posted via HTTP to a Go microservice (a sketch of such a request appears after this list)
- ToxiProxy sits between the webapp and microservice to inject network failures (port 8000)
- The Go microservice processes orders and stores them in DynamoDB (port 8080)
- All services run in Docker containers orchestrated by Docker Compose
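To make the flow concrete, here is a hedged sketch of what a checkout request through the proxy might look like. The endpoint path, product id, and JSON field names are illustrative placeholders; the real shapes are defined in webapp/src/components/Cart.jsx and service/main.go.

```bash
# Illustrative only -- adjust the path and fields to match Cart.jsx and main.go.
curl -X POST http://localhost:8000/order \
  -H 'Content-Type: application/json' \
  -d '{"items": [{"id": "ethiopia-yirgacheffe", "quantity": 2}], "total": 34.00}'
```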
ToxiProxy was created by Shopify to simulate network conditions and test application resilience in distributed systems. It acts as a transparent TCP proxy that can inject various network failures (latency, timeouts, bandwidth limits) on demand through a simple HTTP API. Originally built for Shopify's microservices infrastructure, it's now widely used across the industry to test how applications behave under adverse network conditions. Learn more at the ToxiProxy GitHub repository.
The application uses three interconnected Docker containers that start with a single command:
- webapp (React + Vite on port 3000)
  - Serves the frontend application with hot-reload enabled
  - Configured to send requests to ToxiProxy on port 8000
- toxiproxy (ports 8000 and 8474)
  - Acts as a transparent proxy between webapp and Go service
  - Port 8000: Proxy endpoint (webapp → toxiproxy → go-service)
  - Port 8474: Control API for injecting network failures
- go-service (port 8080)
  - HTTP server that processes order requests
  - Connects to AWS DynamoDB using the SDK
  - Requires AWS credentials via environment variables
All containers communicate through a Docker bridge network (chaos-network).
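If you want to see this network and which containers are attached to it, you can inspect it from the terminal. The exact network name depends on the Docker Compose project prefix, so list the networks first; `<network-name>` below is a placeholder.

```bash
docker network ls                       # find the network docker-compose created
docker network inspect <network-name>   # shows the three containers attached to it
```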
Before starting the services, you need to set up environment variables.
Create a .env file in the root directory:
# AWS Credentials (required)
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_SESSION_TOKEN=your_session_token_here # If using temporary credentials
# AWS Configuration
AWS_REGION=eu-north-1
# Application Configuration
STUDENT_ID=your-unique-id # Replace with your name (lowercase, no spaces)
TABLE_NAME=chaos-coffee-$STUDENT_ID # Automatically uses your STUDENT_ID
# Claude Code API Key (for Part 2 and Part 3 AI exercises)
# Your instructor will provide this key
ANTHROPIC_API_KEY=sk-ant-xxxxx

Note: The ANTHROPIC_API_KEY is provided by your instructor for class use. This enables the Claude Code CLI to work in Part 2 and Part 3 without needing your own subscription.
After creating your .env file, install the Claude Code CLI tool globally:
npm install -g @anthropic-ai/claude-code

Then load your environment variables (required before using Claude Code):

set -a; source .env; set +a

You'll need to run this command each time you open a new terminal session. It exports all variables from .env into your shell.
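A quick sanity check that the variables are exported in the current shell (the first command prints only the names of the AWS variables, so no secrets are echoed to the terminal):

```bash
env | grep '^AWS_' | cut -d= -f1
echo "STUDENT_ID=$STUDENT_ID TABLE_NAME=$TABLE_NAME"
```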
This lab follows the chaos engineering cycle:
- Steady State: Understand how the system works normally
- Hypothesis: Predict how it will fail under stress
- Experiment: Inject real-world failures
- Observation: Measure the impact
- Improvement: Implement resilience patterns
- Validation: Verify the improvements
In Part 1, you will deploy a deliberately fragile system, run chaos experiments, and observe how it fails.
Explore the repository structure using the VS Code file explorer.
Open service/main.go in the VS Code editor.
Key observations:
- The HTTP handler accepts POST requests with JSON payload
- It stores data in DynamoDB with `student_id` as the partition key (this comes from the `STUDENT_ID` environment variable in your `.env` file)
- Simple error handling with basic logging
- Runs as a standalone HTTP server in Docker
The webapp is a React single-page application built with Vite. Open these key files in VS Code:
- `webapp/src/App.jsx` - Main application component
- `webapp/src/components/Cart.jsx` - Shopping cart with checkout logic
- `webapp/src/data/products.js` - Six specialty coffee products
Key observations:
- React with Vite build tool
- Six premium coffee products with detailed tasting notes
- Shopping cart with add/remove functionality and quantity controls
- Makes POST requests to microservice endpoint on checkout
- No retry logic - failures show immediately
- Basic error handling displays error messages but doesn't retry failed requests
Open the Terraform files in VS Code:
- `infra/main.tf` - Main infrastructure definition
- `infra/variables.tf` - Input variables
- `infra/outputs.tf` - Output values
Key observations:
- Terraform module that requires a `student_id` variable
- Creates a DynamoDB table with on-demand billing
- IAM role with DynamoDB permissions
Before starting the application, you need to create the DynamoDB table using Terraform.
From your Codespace terminal, create a deployment directory:
mkdir -p deployment
cd deployment

Create a main.tf file that uses the module:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-north-1"
}

module "coffee_chaos" {
  source     = "../infra"
  student_id = "YOUR_NAME_HERE" # Use the same value as STUDENT_ID in your .env file
}

output "dynamodb_table_name" {
  value       = module.coffee_chaos.dynamodb_table_name
  description = "DynamoDB table for storing orders"
}

Deploy the infrastructure:
terraform init
terraform plan
terraform apply

Note: Make sure you've already loaded your environment variables from the root directory (see the "Install Claude Code CLI" section above). The exported variables remain available when you change directories.
Save the output! Copy the dynamodb_table_name value - you'll need it in the next step.
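If you lose the value, you can print it again at any time from the deployment directory:

```bash
terraform output dynamodb_table_name
```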
Return to the project root directory and start all services with a single command:
cd /workspace/itevu4340 # Or your project root
docker-compose up -d

This command will:
- Build the Go microservice Docker image
- Start the microservice on port 8080
- Start ToxiProxy on port 8000 (proxy) and 8474 (control API)
- Start the React webapp on port 3000
- Create a shared Docker network for inter-container communication
Verify all containers are running:
docker-compose ps

You should see all three services (go-service, toxiproxy, webapp) with status "Up".
The webapp is now running on port 3000. In GitHub Codespaces:
- Click the "Ports" tab at the bottom of VS Code
- Find port 3000 (webapp)
- Click the globe icon to open in your browser
The webapp should be accessible at a URL like: https://[codespace-name]-3000.preview.app.github.dev
- Browse the six specialty coffee products
- Add items to your cart (observe the smooth animations)
- Adjust quantities using the + and - buttons
- Click the "Checkout" button
What should happen:
- The order is sent to the microservice via HTTP POST through ToxiProxy
- The microservice stores the order in DynamoDB
- You'll see a success message
- The cart clears automatically
If you see errors, work through these checks (commands for each are shown below):
- Docker containers are running (`docker-compose ps`)
- DynamoDB table name is configured correctly in `docker-compose.yml`
- AWS credentials are set in your environment
- DynamoDB table was created successfully
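A suggested set of commands covering each check, assuming the table name is passed through the TABLE_NAME variable from your .env file (adjust if your compose file uses a different variable name):

```bash
# Are all three containers up?
docker-compose ps

# Which table name are the containers configured with?
grep -n TABLE_NAME docker-compose.yml .env

# Are AWS credentials valid in this shell?
aws sts get-caller-identity

# Does the table exist and is it active?
aws dynamodb describe-table \
  --table-name chaos-coffee-$STUDENT_ID \
  --region eu-north-1 | jq '.Table.TableStatus'
```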
Verify your order was stored in DynamoDB:
# Scan your DynamoDB table to see all orders
aws dynamodb scan \
--table-name chaos-coffee-$STUDENT_ID \
--region eu-north-1
# For a cleaner view, use jq to format the output
aws dynamodb scan \
--table-name chaos-coffee-$STUDENT_ID \
--region eu-north-1 | jq '.Items'
# Count the number of orders in your table
aws dynamodb scan \
--table-name chaos-coffee-$STUDENT_ID \
--region eu-north-1 \
--select "COUNT" | jq '.Count'You should see your order data with the items you purchased, total price, and timestamp. The count command is useful for verifying how many orders were stored during experiments.
Before injecting chaos, make predictions using the scientific method.
Write down your hypothesis:
- What will happen when we add 2000ms latency?
  - How will the UI respond?
  - Will users click checkout multiple times? How does the UI protect against that already?
  - What will happen to DynamoDB (duplicate records)?
- How does the application respond to timeouts?
  - What happens when a request takes longer than the timeout?
  - Will the UI show an error message?
  - Does the backend continue processing even after the frontend times out?
- What will happen when we add random timeouts?
  - Will some requests succeed?
  - How will users know if their order worked?
Save your hypotheses - you'll compare them to actual results.
ToxiProxy is already configured and running from Step 2. It sits between your webapp and the Go microservice to inject network failures.
The docker-compose.yml file configures ToxiProxy to:
- Listen on port 8000 (proxy endpoint)
- Forward requests to the Go microservice on port 8080
- Expose port 8474 for the control API (to inject failures)
All services run in the same Docker network, so they can communicate using container names:
toxiproxy:
  image: ghcr.io/shopify/toxiproxy:latest
  ports:
    - "8000:8000" # Proxy endpoint
    - "8474:8474" # Control API
  networks:
    - app-network

The webapp is pre-configured in webapp/src/config.js to use ToxiProxy:

export const SERVICE_URL = 'http://localhost:8000'; // Through ToxiProxy

Use curl to add toxics via the ToxiProxy HTTP API:
# Add 2000ms latency
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{"name": "latency", "type": "latency", "attributes": {"latency": 2000}}'ToxiProxy provides an HTTP API on port 8474 for managing network failures.
List all active toxics:
curl http://localhost:8474/proxies/chaos-proxy/toxics | jq

Add a toxic:
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{
"name": "my-toxic-name",
"type": "latency",
"attributes": {"latency": 2000}
}'

Delete a specific toxic (the toxic name goes at the end of the URL):
curl -X DELETE http://localhost:8474/proxies/chaos-proxy/toxics/my-toxic-name

Check proxy status:
curl http://localhost:8474/proxies/chaos-proxy | jq

Troubleshooting: If you get a 404 error, the proxy might not be loaded. Restart ToxiProxy:
docker-compose restart toxiproxy
# Wait a few seconds, then verify
curl http://localhost:8474/proxies | jq

ToxiProxy supports many failure modes you can inject beyond the latency example we've seen:
Latency: Add delay to requests
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{"name": "slow-response", "type": "latency", "attributes": {"latency": 2000, "jitter": 500}}'latency: Base delay in millisecondsjitter: Random variation (±jitter ms)
Timeout: Close connection after delay
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{"name": "connection-timeout", "type": "timeout", "attributes": {"timeout": 3000}}'timeout: Milliseconds before closing connection
Bandwidth: Limit throughput
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{"name": "slow-network", "type": "bandwidth", "attributes": {"rate": 1024}}'rate: Bytes per second (1024 = 1KB/s)
Slicer: Slice data into small packets with delays
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{"name": "packet-loss", "type": "slicer", "attributes": {"average_size": 64, "size_variation": 32, "delay": 10}}'average_size: Average packet size in bytessize_variation: Variation in packet sizedelay: Delay between packets in milliseconds
All toxic types support an optional toxicity parameter (value between 0 and 1) that controls what percentage of requests are affected. By default, toxics apply to 100% of requests (toxicity: 1.0).
# Apply timeout to only 30% of requests
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{
"name": "intermittent-timeout",
"type": "timeout",
"attributes": {"timeout": 5000},
"toxicity": 0.3
}'

Toxicity values:

- `toxicity: 1.0` = 100% of requests affected (default if omitted)
- `toxicity: 0.3` = 30% of requests affected
- `toxicity: 0.5` = 50% of requests affected
- `toxicity: 0.1` = 10% of requests affected
You can add the toxicity parameter to any toxic type (latency, timeout, bandwidth, slicer) to simulate intermittent failures rather than complete outages.
Your enterprise architect just announced a new mandate:
"All API requests must complete within 5 seconds or they will be terminated by our infrastructure."
The reasoning:
- During peak hours (Black Friday, holiday sales), slow requests pile up
- They consume connection pools, memory, and database connections
- This causes cascading failures that crash upstream systems
- The architect's solution: "Kill anything over 5 seconds to protect the infrastructure"
Your challenge: Test if this 5-second timeout is safe for your coffee ordering system.
To comply with the enterprise architecture decision, you need to update the webapp to enforce a 5-second timeout.
Edit webapp/src/config.js:
Find this line:
export const REQUEST_TIMEOUT = 30000;

Change it to:
export const REQUEST_TIMEOUT = 5000;

Save the file. The webapp will automatically reload with hot module replacement. You'll see the timeout value update in the cart footer (it will show "Request timeout: 5s").
Your Go microservice calls three external services sequentially:
- Credit card processing: ~500-900ms (but can be 2-3x slower during peak)
- SAP inventory update: ~600-900ms (but can be 2-3x slower during peak)
- DHL shipping creation: ~500-700ms (but can be 2-3x slower during peak)
- Normal conditions: 2-3 seconds total
- Peak load conditions: could be 5-9 seconds total
You'll run three experiments to test the impact of the 5-second timeout under different load conditions.
Before you start: Write down your hypothesis for each experiment.
Hypothesis: What do you expect will happen on a normal day with no external service delays?
Setup: No toxics - test the service in its natural state
Steps:
- Make sure all toxics are cleared:
  curl http://localhost:8474/proxies/chaos-proxy/toxics | jq   # Should return: []
- Place 3-5 orders and observe:
  - Success rate
  - Response times (check browser Network tab)
  - Backend logs: docker logs -f chaos-coffee-service
Expected Result: All orders should succeed in 2-3 seconds.
Hypothesis: What happens when external services slow down during a busy hour? Will the 5-second timeout cause problems?
Scenario: Credit card processors and SAP are slower than usual (1.5x delay)
Setup:
# Add 1500ms latency (simulates external services under load)
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{
"name": "peak-load-latency",
"type": "latency",
"attributes": {"latency": 1500}
}'

Service processing time now: 2-3 seconds + 1.5 seconds = 3.5-4.5 seconds
Note: With the 5-second timeout you configured, requests should complete successfully since they're under the limit.
Steps:
- Place multiple orders
- Observe the results in the browser
- Check DynamoDB after:
aws dynamodb scan --table-name chaos-coffee-${STUDENT_ID} --query 'Count'
What to observe:
- Do all orders succeed?
- Are any orders close to the 5-second limit? (You can measure this directly with the timing command shown after this list.)
- What happens to orders that time out?
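You can also observe the injected delay from the terminal, independent of the webapp. The latency toxic applies to any connection through the proxy, so even a request to an arbitrary path (which the Go service will most likely reject) shows the added time:

```bash
# The ~1.5s toxic delay should show up in time_total; the HTTP status doesn't matter here,
# since the path is arbitrary and only the timing is of interest.
curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:8000/
```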
Clean up:
curl -X DELETE http://localhost:8474/proxies/chaos-proxy/toxics/peak-load-latency

Hypothesis: What happens during peak load (Black Friday) when external services are very slow? Will many orders fail?
Scenario: External services are under heavy load (3x normal delay)
Setup:
# Add 3000ms latency with 500ms jitter (simulates heavy external load)
curl -X POST http://localhost:8474/proxies/chaos-proxy/toxics \
-H 'Content-Type: application/json' \
-d '{
"name": "black-friday-latency",
"type": "latency",
"attributes": {"latency": 3000, "jitter": 500}
}'

Service processing time now: 2-3 seconds + 3 seconds (±500ms jitter) = roughly 4.5-6.5 seconds

Note: With the 5-second timeout and jitter, some requests will complete just under 5 seconds, but most will exceed it and time out.
Steps:
- Place multiple orders
- Observe the failure rate
- Check backend logs - did the service finish processing even when frontend timed out?
docker logs chaos-coffee-service --tail 30
- Check DynamoDB - how many orders were actually saved?
Investigation Questions:
When orders time out in the frontend, what actually happens on the backend?

- Check the backend logs:
  docker logs chaos-coffee-service --tail 50
  - Do you see "Order processed successfully" messages?
  - How many orders were processed vs how many the frontend reported as failed?
- Count orders in DynamoDB:
  aws dynamodb scan --table-name chaos-coffee-$STUDENT_ID --query 'Count'
  - Does this match the number of successful responses in the frontend?
  - Are there more orders than you expected?
- Critical question: If the frontend times out after 5 seconds, does the backend stop processing the order, or does it continue?
Think about:
- What is the data consistency problem here?
- Why might users submit duplicate orders?
- What happens if a user sees "timeout" and clicks checkout again?
Clean up:
curl -X DELETE http://localhost:8474/proxies/chaos-proxy/toxics/black-friday-latency

Take a few minutes to summarize what you discovered during the chaos experiments. Be prepared to share with the class:
- What surprised you most about how the system behaved under stress?
- What happened when requests timed out on the frontend? Did the backend stop processing?
- What problems would real users experience with this system?
- What would you fix first?
In Part 1, you discovered the "silent order" problem: frontend times out and shows an error, but the backend continues processing and saves the order to DynamoDB. Users think their order failed, so they might click checkout again, creating duplicate orders.
This is a classic problem in distributed systems. In this part, you'll implement the industry-standard solution used by Stripe, AWS, and all major payment APIs: idempotency keys with automatic retries.
What is an idempotency key?
An idempotency key is a unique identifier (usually a UUID) that you send with a request to ensure that performing the same operation multiple times has the same effect as performing it once.
Real-world example:
When you use Stripe's payment API, you send an Idempotency-Key header:
POST /v1/charges
Idempotency-Key: abc-123-def-456
Content-Type: application/json
{"amount": 1000, "currency": "usd", "source": "tok_visa"}If the request times out and you retry with the same key, Stripe returns the existing charge instead of creating a duplicate. This prevents charging customers twice.
Why are idempotency keys critical?
- Prevents duplicates: Safe to retry failed requests
- Industry standard: Used by Stripe, AWS, Square, and all major payment APIs
- Enables automatic retries: Frontend can retry without fear of duplicates
- Production-ready: This is how real systems handle distributed transactions
Idempotency Key: A unique identifier (UUID) generated once per user action. When included in requests, it allows the backend to detect and prevent duplicate operations.
Automatic Retries: The frontend automatically retries failed requests with exponential backoff (1s, 3s, 5s delays) instead of requiring users to click again.
How they work together: The frontend generates a UUID once per checkout click, then retries with the SAME UUID. The backend checks DynamoDB for existing orders with this UUID - if found, returns it instead of creating a duplicate.
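To make the protocol concrete, here is a conceptual shell sketch. The endpoint and payload are placeholders, not the lab's real API (the lab sends the key as an idempotency_key field in the JSON body, as described in the suggested prompt below). The important detail is that the UUID is generated once, before the retry loop, and every attempt reuses it.

```bash
# Conceptual sketch only -- placeholder endpoint and payload, not the lab's real API.
KEY=$(uuidgen)          # generated ONCE per user action (per checkout click)

for attempt in 1 2 3 4; do
  # Every retry sends the SAME key, so the server can recognise duplicates.
  if curl -sf -X POST https://api.example.com/orders \
       -H 'Content-Type: application/json' \
       -d "{\"idempotency_key\": \"$KEY\", \"item\": \"coffee\", \"qty\": 1}"; then
    break               # success -- stop retrying
  fi
  sleep $attempt        # back off before the next attempt
done
```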
You're encouraged to use Claude Code or GitHub Copilot to help implement this pattern.
Suggested Claude Code prompt:
I need to implement idempotency keys and automatic retries for my checkout system to prevent duplicate orders.
FRONTEND (webapp/src/components/Cart.jsx):
1. Generate a UUID once per checkout button click using crypto.randomUUID()
2. Include it in the order request as "idempotency_key"
3. Add retry logic with exponential backoff (delays: 0ms, 1000ms, 3000ms, 5000ms)
4. Retry on timeout (AbortError) or 5xx server errors
5. Reuse the SAME idempotency key for all retry attempts
6. Update status message to show "Retrying... (attempt X/4)" during retries
7. Stop retrying on success or 4xx client errors
BACKEND (service/main.go):
1. Add IdempotencyKey field to Order and OrderRecord structs
2. Before processing an order, scan DynamoDB for an existing order with the same idempotency_key
3. If found, return the existing order with 200 OK (prevents duplicate)
4. If not found, process the order normally and save it with the idempotency_key
5. Use DynamoDB Scan with FilterExpression (not GSI) to find existing orders
6. Log when returning existing orders: "Found existing order {order_id} with key {key}"
Important: The idempotency key should be generated ONCE when the user clicks checkout, not on every HTTP request attempt. Retries must reuse the same key.
Please implement both frontend and backend changes.
Use the chaos engineering techniques from Part 1 to test your implementation:
- Add latency toxics to trigger timeouts and force retries
- Place orders and watch the browser console for retry attempts
- Verify in DynamoDB that only one order was created - no duplicates (one way to check this from the terminal is shown after this list)
- Check the idempotency key - it should be the same across all retry attempts
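A hedged way to check for duplicates, assuming your implementation stores the key in an idempotency_key string attribute (as in the suggested prompt above); adjust the attribute name if yours differs:

```bash
# Lists idempotency keys that appear on more than one order -- an empty array means no duplicates.
aws dynamodb scan \
  --table-name chaos-coffee-$STUDENT_ID \
  --region eu-north-1 \
  | jq '[.Items[] | .idempotency_key.S // empty] | group_by(.) | map(select(length > 1))'
```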
Take a few minutes to think about what you implemented. Be prepared to share with the class:
- How does automatic retry improve the user experience compared to showing an error?
- What would happen without idempotency keys when retries occur?
- Why is it critical to generate the idempotency key ONCE per button click, not per HTTP request?
- How do production systems like Stripe use this pattern?
- What other scenarios (besides timeouts) would benefit from automatic retries?
In Parts 1 and 2, you worked with chaos engineering and tactical robustness improvements. Now you'll explore how AI can help you think bigger: re-architecting the entire solution for robustness, scalability, and maintainability.
This part focuses on working effectively with AI coding assistants like Claude Code to explore architectural alternatives, evaluate trade-offs, and implement more sophisticated solutions.
You have access to a full AWS account with all services available for your re-architecture:
- Existing infrastructure: DynamoDB table (deployed via the `infra/` folder)
- Available services: Lambda, SQS, SNS, EventBridge, Step Functions, API Gateway, S3, CloudWatch, and more
- Infrastructure as Code: Extend the `infra/` Terraform configuration to deploy new AWS resources
- Cloud-native architecture: Design solutions that leverage managed AWS services for robustness and scalability
When discussing architecture with Claude Code, think cloud-native: replace Docker containers with Lambda functions, use SQS for queues, leverage AWS managed services for reliability and resilience.
- Learn how to prompt AI assistants for architectural guidance
- Compare minimal context vs. detailed context prompting strategies
- Use plan mode to explore solutions before implementation
- Critically evaluate AI-generated architectural proposals
- Design cloud-native AWS architectures for production resilience
- Document the AI collaboration process
You'll perform three experiments with Claude Code, each using a different prompting strategy:
- Minimal Context Experiment: Give vague requirements ("make it robust")
- Detailed Context Experiment: Provide specific architectural goals
- Plan Mode Experiment: Use Claude Code's plan mode for collaborative design
After each experiment, you'll take notes on the AI's suggestions, your assessment of them, and what you learned.
Important: To avoid wasting tokens on irrelevant files, guide Claude Code to focus on the right folders:
Focus on these folders: webapp/src/, service/, infra/
Ignore: node_modules/, .git/, lambda/ (old deprecated code)
Start a conversation with Claude Code and say:
Make my application more robust.
That's it. Don't provide additional context unless Claude asks.
- What questions does Claude ask?
- What assumptions does Claude make?
- What solutions does Claude propose?
- How specific or generic are the suggestions?
- Does Claude explore the codebase before suggesting changes?
Think about and be ready to discuss:
- What questions did Claude ask?
- What solutions were proposed?
- Strengths, weaknesses, and surprises
- What did you learn about prompting AI?
Start a new conversation with detailed requirements:
Focus on: webapp/src/, service/, infra/
Ignore: node_modules/, .git/, lambda/
I want to re-architect my coffee shop application for better robustness and decoupling.
Current architecture:
- React frontend with client-side cart state
- Direct synchronous calls to Go microservice
- Microservice writes to DynamoDB immediately
- Running in Docker containers with ToxiProxy for chaos testing
Problems I want to solve:
1. Frontend is tightly coupled to microservice - failures affect user experience immediately
2. No way to retry failed orders automatically
3. Orders might be lost if microservice fails after accepting the request
4. No visibility into order status
5. Difficult to test and debug distributed failures
Goals for re-architecture:
- Decouple frontend from backend processing
- Make order submission asynchronous and reliable
- Add order status tracking
- Implement proper error recovery
- Maintain simplicity (this is a learning project, not production)
Technologies I'm already using:
- Go microservice in Docker (can be replaced with Lambda)
- DynamoDB for data storage (already deployed)
- React frontend (could use S3 + CloudFront)
- ToxiProxy for chaos testing (local only)
- Terraform for IaC (in infra/ folder)
AWS services available for re-architecture:
- Lambda, SQS, SNS, EventBridge, Step Functions, API Gateway, S3, CloudWatch, and more
- I can extend the infra/ Terraform configuration to deploy new AWS resources
Please suggest a cloud-native AWS architecture that achieves these goals. Explain the trade-offs and what AWS services I'd need to add.
- How does Claude's response differ from Experiment 1?
- Does Claude suggest specific AWS services or patterns?
- Does Claude explain trade-offs?
- Are the suggestions practical given your constraints?
Think about and be ready to discuss:
- What architecture was proposed?
- What components and trade-offs were suggested?
- What technologies did Claude recommend?
- How does this compare to Experiment 1?
- What did you learn?
- Start a new conversation with Claude Code
- IMPORTANT: Enter plan mode FIRST before asking your question
- Ask Claude to explore multiple architectural options
- Iterate on the proposals through back-and-forth discussion
- Refine until you have a clear plan
- Exit plan mode and approve the plan to execute
Focus on: webapp/src/, service/, infra/
Ignore: node_modules/, .git/, lambda/
I want to improve the robustness of my coffee shop application. Before we write any code, let's create a plan.
Current situation:
- Frontend makes direct synchronous calls to Go microservice in Docker
- Microservice writes to DynamoDB immediately
- No retry logic or error recovery
Available AWS resources:
- Full AWS account with Lambda, SQS, SNS, EventBridge, Step Functions, API Gateway, S3, CloudWatch
- Existing Terraform infrastructure in infra/ folder
- Can deploy new AWS resources via Terraform
I want to explore cloud-native AWS architectural options that:
- Reduce coupling between frontend and backend
- Allow the system to handle temporary failures gracefully
- Leverage managed AWS services for reliability
- Can be deployed via Terraform (extend infra/ folder)
Let's discuss a few approaches and their trade-offs before deciding on one.
Then iterate with questions like:
- "What if we wanted to keep the synchronous API but add retry logic?"
- "How does the SQS approach compare to using Step Functions?"
- "What's the minimal viable improvement we could ship this week?"
Think about and be ready to discuss:
- What questions did you ask in the iteration?
- What was the final plan?
- How did plan mode change the collaboration?
- Quality of the final plan - what worked well?
After completing all experiments, prepare to discuss:
- Which prompting strategy worked best?
- Did AI suggest anything you wouldn't have thought of?
- When would you use AI for architecture decisions in real work?
- What are the risks of following AI architectural suggestions?
- How do you validate AI-generated designs?
When you're done with the lab, clean up your resources:
From the project root directory:
docker-compose down

This will stop and remove all containers (webapp, toxiproxy, go-service).
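Verify that nothing is left running:

```bash
docker-compose ps   # should list no running services
```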
From the terminal in your Codespace:
cd deployment
terraform destroy

- Go to https://github.com/codespaces
- Find your Codespace for this repository
- Click the three dots menu
- Select "Delete"
This ensures you don't accumulate storage charges for the Codespace.
You've completed a full chaos engineering cycle:
- Built a distributed system
- Established steady state behavior
- Hypothesized about failure modes
- Injected real-world failures
- Observed degradation
- Implemented resilience patterns
- Validated improvements
Key Learnings:
- Distributed systems fail in complex ways
- Latency is a common failure mode that cascades
- User experience degrades without proper error handling
- Resilience patterns (retries, timeouts, circuit breakers) are essential
- Chaos engineering helps you find weaknesses before users do
- Antifragile systems improve from stress and failure
- Add CloudWatch Alarms - Alert when microservice errors exceed threshold
- Implement Request Deduplication - Use idempotency keys
- Add Caching - Store orders locally, sync periodically
- Multi-Region - Deploy to two regions for higher availability
- Chaos in Production - Gradually roll out chaos to real users (with safeguards!)