Real-world scenarios showing how to use the SRE Playbooks effectively.
Scenario: Your application pod keeps crashing and restarting.
Steps:
- Navigate to `K8s Playbooks/03-Pods/`
- Open `CrashLoopBackOff-pod.md`
- Follow the Playbook steps:

```bash
# Step 1: Get logs from the previous (crashed) container
kubectl logs <pod-name> -n <namespace> --previous

# Step 2: Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Step 3: Verify the container image
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
```
- Use the Diagnosis section to correlate with recent changes
- Apply the fix based on findings
Outcome: Identified a missing environment variable causing the application crash. Fixed by updating the deployment configuration.
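The root cause here was a missing environment variable, which is worth checking for systematically. A minimal sketch of that check (the `REQUIRED` list and the `POD_ENV` sample are hypothetical; in a live cluster `POD_ENV` would come from `kubectl exec <pod-name> -n <namespace> -- env`):

```shell
#!/bin/sh
# Variables the application needs (hypothetical list for this example)
REQUIRED="DATABASE_URL API_KEY LOG_LEVEL"

# Simulated output of `kubectl exec <pod-name> -- env`; replace with the real call
POD_ENV="DATABASE_URL=postgres://db:5432/app
LOG_LEVEL=info"

# Report any required variable the pod is missing
for var in $REQUIRED; do
  if ! printf '%s\n' "$POD_ENV" | grep -q "^${var}="; then
    echo "MISSING: $var"
  fi
done
```

With the sample data above, the script flags `API_KEY` as missing, which is exactly the kind of gap that surfaces as a CrashLoopBackOff.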
Scenario: You can't SSH into your EC2 instance.
Steps:
- Navigate to `AWS Playbooks/`
- Open `Connection-Timeout-SSH-Issues-EC2.md`
- Follow the Playbook steps:

```bash
# Step 1: Check instance state
aws ec2 describe-instances --instance-ids <instance-id>

# Step 2: Verify security group rules
aws ec2 describe-security-groups --group-ids <security-group-id>

# Step 3: Check public IP assignment
aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[0].Instances[0].PublicIpAddress'
```
- Use CloudTrail logs to check for recent security group changes
- Apply the fix (in this case, adding an SSH rule to the security group)
Outcome: The security group was missing an SSH (port 22) rule. Added the rule and the connection was restored.
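The missing port 22 rule can be caught with a quick scripted check. A sketch, assuming the `OPEN_PORTS` sample stands in for real output of `aws ec2 describe-security-groups --group-ids <security-group-id> --query 'SecurityGroups[0].IpPermissions[*].FromPort' --output text`:

```shell
#!/bin/sh
# Simulated list of ports the security group allows (illustrative values;
# run the aws CLI command above against your account for the real list)
OPEN_PORTS="443 80"

# Flag the missing SSH rule before diving deeper
if printf '%s\n' $OPEN_PORTS | grep -qx 22; then
  echo "SSH (22) is allowed by the security group"
else
  echo "SSH (22) rule is MISSING - likely cause of the timeout"
fi
```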
Scenario: Pods can't reach a Kubernetes service by name.
Steps:
- Navigate to `K8s Playbooks/05-Networking/`
- Open `ServiceNotResolvingDNS-dns.md`
- Follow the Playbook steps:

```bash
# Step 1: Check service exists
kubectl get service <service-name> -n <namespace>

# Step 2: Verify CoreDNS pods
kubectl get pods -n kube-system | grep coredns

# Step 3: Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
```
- Check CoreDNS logs for errors
- Found a CoreDNS pod was crashing and restarted it
Outcome: A CoreDNS pod was in CrashLoopBackOff. Restarted the pod and DNS resolution was restored.
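A scripted version of Step 2 can flag unhealthy CoreDNS pods immediately. The `PODS` sample below is illustrative; in practice it would be the live output of `kubectl get pods -n kube-system`:

```shell
#!/bin/sh
# Simulated `kubectl get pods -n kube-system` output (hypothetical pod names)
PODS='NAME                       READY   STATUS             RESTARTS   AGE
coredns-5d78c9869d-abcde   0/1     CrashLoopBackOff   14         3d
coredns-5d78c9869d-fghij   1/1     Running            0          3d'

# List CoreDNS pods that are not healthy - these break cluster DNS
printf '%s\n' "$PODS" | awk 'NR>1 && /coredns/ && $3!="Running" {print $1, $3}'
```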
Scenario: Lambda function can't connect to RDS database.
Steps:
- Navigate to `AWS Playbooks/`
- Open `Connection-Timeout-from-Lambda-RDS.md`
- Follow the Playbook steps:

```bash
# Step 1: Check RDS instance status
aws rds describe-db-instances --db-instance-identifier <rds-instance-id>

# Step 2: Verify security group rules
aws ec2 describe-security-groups --group-ids <rds-security-group-id>

# Step 3: Check Lambda VPC configuration
aws lambda get-function-configuration --function-name <function-name>
```
- Correlated with recent VPC configuration change
- Found the Lambda was in the wrong subnet and moved it to the correct one
Outcome: The Lambda function was in a subnet without a route to RDS. Updated the VPC configuration and the connection was restored.
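One way to pre-empt this class of problem is to compare the Lambda's subnets against the subnets known to route to RDS. A sketch with hypothetical subnet IDs; in practice `LAMBDA_SUBNETS` would come from `aws lambda get-function-configuration --function-name <function-name> --query 'VpcConfig.SubnetIds' --output text`, and the reachable list from your route tables:

```shell
#!/bin/sh
# Hypothetical subnet IDs for illustration
LAMBDA_SUBNETS="subnet-aaa111"
# Subnets that actually have a route to the RDS instance
RDS_REACHABLE_SUBNETS="subnet-bbb222 subnet-ccc333"

# Flag any Lambda subnet that cannot reach RDS
for s in $LAMBDA_SUBNETS; do
  if printf '%s\n' $RDS_REACHABLE_SUBNETS | grep -qx "$s"; then
    echo "$s: OK"
  else
    echo "$s: no route to RDS"
  fi
done
```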
Scenario: HPA should scale your deployment but it's not working.
Steps:
- Navigate to `K8s Playbooks/04-Workloads/`
- Open `HPAHorizontalPodAutoscalerNotScaling-workload.md`
- Follow the Playbook steps:

```bash
# Step 1: Check HPA status
kubectl get hpa -n <namespace>

# Step 2: Verify Metrics Server
kubectl get pods -n kube-system | grep metrics-server

# Step 3: Check resource metrics
kubectl top pods -n <namespace>
```
- Found Metrics Server was down
- Restarted Metrics Server pod
Outcome: The Metrics Server pod was down, preventing the HPA from fetching metrics. Restarted the pod and the HPA started scaling correctly.
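When the Metrics Server is down, `kubectl get hpa` typically shows `<unknown>` targets, which makes for an easy scripted check. The `HPA` sample below is illustrative stand-in output:

```shell
#!/bin/sh
# Simulated `kubectl get hpa -n <namespace>` output (hypothetical names);
# "<unknown>" targets are the classic symptom of a dead Metrics Server
HPA='NAME      REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS
web-hpa   Deployment/web-app   <unknown>/70%   2         10        2'

# Alert if any HPA has no metrics to act on
if printf '%s\n' "$HPA" | grep -q '<unknown>'; then
  echo "HPA has no metrics - check the Metrics Server"
fi
```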
When: You're on-call and get an alert.
- Identify the issue: Match alert to playbook title
- Navigate quickly: Use category folders for K8s issues
- Follow playbook: Execute steps in order
- Use Diagnosis: Correlate with recent changes
- Document: Note what you found and fixed
Time Saved: Reduced MTTR from 45 minutes to 15 minutes
When: Onboarding a new SRE team member.
- Review structure: Show them the repository organization
- Walk through example: Use a common playbook together
- Practice: Have them follow a playbook for a test issue
- Bookmark: Save frequently used playbooks
- Contribute: Encourage them to improve playbooks
Outcome: New team member productive in 2 days instead of 2 weeks
When: After resolving an incident.
- Review playbook used: Did it help? What was missing?
- Improve playbook: Add steps that would have helped
- Share learnings: Update playbook with new insights
- Contribute back: Submit improvements to the repository
Outcome: Playbooks continuously improve based on real incidents
Scenario: Application spans AWS and Kubernetes (EKS).
Approach:
- Use AWS playbooks for AWS-specific issues (EC2, RDS, etc.)
- Use K8s playbooks for Kubernetes issues (pods, services, etc.)
- Cross-reference when issues span both (e.g., EKS control plane)
Example: EC2 instance can't reach EKS cluster
- Check AWS playbooks for EC2 networking
- Check K8s playbooks for service accessibility
- Found security group misconfiguration affecting both
Scenario: Integrate playbooks into your incident response automation.
Approach:
- Parse playbook steps into automated checks
- Use playbook structure for runbook automation
- Correlate with monitoring data using Diagnosis section
Example: Automated health checks based on playbook steps
- Script checks pod status (from playbook Step 1)
- Script checks node resources (from playbook Step 2)
- Alerts when thresholds exceeded
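The automated checks above can be sketched as a small shell function. The threshold and sample readings are hypothetical; in a real setup the values would come from `kubectl top pods` or your monitoring pipeline:

```shell
#!/bin/sh
# Hypothetical alert threshold (percent CPU)
CPU_THRESHOLD=80

# Emit an alert when a pod's observed CPU exceeds the threshold
check_cpu() {
  # $1 = pod name, $2 = observed CPU percent
  if [ "$2" -gt "$CPU_THRESHOLD" ]; then
    echo "ALERT: $1 CPU at $2% (threshold ${CPU_THRESHOLD}%)"
  else
    echo "OK: $1 CPU at $2%"
  fi
}

# Sample readings for illustration
check_cpu web-app 92
check_cpu worker 35
```

The same pattern extends to the other playbook steps: one function per check, each mapping a playbook step to a pass/alert decision.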
Scenario: Standardize troubleshooting procedures across teams.
Approach:
- Use playbook structure as template
- Customize for organization-specific tools
- Maintain consistency across all runbooks
Example: All team runbooks now follow same structure
- Consistent format makes knowledge transfer easier
- New team members can follow any runbook
- Easier to maintain and update
Focus on playbooks for issues you encounter most:
- Pod crashes (CrashLoopBackOff)
- Service connectivity
- Resource quotas
- Network policies
Fork the repository and:
- Add organization-specific steps
- Include internal tool commands
- Add team-specific notes
Create your own collection:
- Bookmark frequently used playbooks
- Add custom playbooks for your infrastructure
- Share with your team
After incidents:
- Review which playbook was used
- Identify gaps
- Improve the playbook
- Contribute improvements back
Before: Average 45 minutes to resolve incidents
After: Average 18 minutes using playbooks
Key Factor: Systematic approach eliminated guesswork

Before: Junior engineers hesitant to handle incidents
After: All engineers confident following playbooks
Key Factor: Clear, step-by-step guidance

Before: Inconsistent troubleshooting notes
After: Standardized playbook format
Key Factor: Consistent structure across all runbooks
Have your own success story? Share it in GitHub Discussions!
Need help with a specific scenario? Ask the community!