Examples & Use Cases

Real-world scenarios showing how to use the SRE Playbooks effectively.

Quick Examples

Example 1: Pod Stuck in CrashLoopBackOff

Scenario: Your application pod keeps crashing and restarting.

Steps:

Navigate to K8s Playbooks/03-Pods/
Open CrashLoopBackOff-pod.md

Follow the Playbook steps:

# Step 1: Get pod logs
kubectl logs <pod-name> -n <namespace> --previous

# Step 2: Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Step 3: Verify container image
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

Use the Diagnosis section to correlate with recent changes
Apply the fix based on findings

Outcome: A missing environment variable was crashing the application. Updating the deployment configuration fixed it.
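
When `kubectl describe pod` reports a last terminated state, the exit code often narrows the cause before you read any logs. The sketch below is one possible way to decode it, not part of the playbook itself. The function name `explain_exit_code` is hypothetical. It relies on the common convention that codes above 128 mean the process was ended by signal `code - 128`.

```shell
# Hypothetical helper: turn a container exit code into a short hint.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    137) echo "SIGKILL (128+9); often the OOM killer, compare memory limits" ;;
    139) echo "SIGSEGV (128+11); the process crashed with a segmentation fault" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error; read the --previous logs"
      fi
      ;;
  esac
}

# Usage against a live cluster (not run here):
#   code=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

A missing environment variable, as in this example, usually surfaces as a small non-zero code such as 1, which points you back to the `--previous` logs.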

Example 2: EC2 Instance SSH Connection Timeout

Scenario: You can't SSH into your EC2 instance.

Steps:

Navigate to AWS Playbooks/
Open Connection-Timeout-SSH-Issues-EC2.md

Follow the Playbook steps:

# Step 1: Check instance state
aws ec2 describe-instances --instance-ids <instance-id>

# Step 2: Verify security group rules
aws ec2 describe-security-groups --group-ids <security-group-id>

# Step 3: Check public IP assignment
aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[0].Instances[0].PublicIpAddress'

Use CloudTrail logs to check for recent security group changes
Apply the fix (in this case, added SSH rule to security group)

Outcome: The security group was missing an inbound SSH (port 22) rule. Adding the rule restored the connection.
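
The security group check in Step 2 can be reduced to a yes/no answer. The following is a minimal sketch, assuming the rules are fetched with `--output text` so that each rule arrives as one `protocol from_port to_port` row. The function name `port_allowed` is hypothetical, and it deliberately ignores source CIDRs, so a match only shows that some inbound rule covers the port.

```shell
# Hypothetical helper: reads "protocol from_port to_port" rows on stdin and
# succeeds if any row covers the given TCP port. Source CIDRs are not checked.
port_allowed() {
  awk -v port="$1" '
    $1 == "-1" { found = 1 }    # protocol -1 means all traffic
    $1 == "tcp" && $2 + 0 <= port + 0 && port + 0 <= $3 + 0 { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage against a live account (not run here):
#   aws ec2 describe-security-groups --group-ids <security-group-id> \
#     --query 'SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort]' \
#     --output text | port_allowed 22 || echo "no inbound rule covers port 22"
```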

Example 3: Service Not Resolving DNS

Scenario: Pods can't reach a Kubernetes service by name.

Steps:

Navigate to K8s Playbooks/05-Networking/
Open ServiceNotResolvingDNS-dns.md

Follow the Playbook steps:

# Step 1: Check service exists
kubectl get service <service-name> -n <namespace>

# Step 2: Verify CoreDNS pods
kubectl get pods -n kube-system | grep coredns

# Step 3: Test DNS resolution
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local

Check CoreDNS logs for errors
Apply the fix (in this case, the CoreDNS pod was crashing and was restarted)

Outcome: The CoreDNS pod was in CrashLoopBackOff. Restarting the pod restored DNS resolution.
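
Reading the CoreDNS pod list by eye works, but a small filter makes the unhealthy pods stand out. This is a sketch under two assumptions: the input comes from `kubectl get pods --no-headers`, and the CoreDNS pods carry the commonly used `k8s-app=kube-dns` label. The function name `unhealthy_pods` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` rows on stdin
# (NAME READY STATUS ...) and prints pods that are not Running or not fully ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")    # READY looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | unhealthy_pods
```

Empty output means every listed pod is Running and fully ready.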

Example 4: RDS Database Connection Timeout

Scenario: Lambda function can't connect to RDS database.

Steps:

Navigate to AWS Playbooks/
Open Connection-Timeout-from-Lambda-RDS.md

Follow the Playbook steps:

# Step 1: Check RDS instance status
aws rds describe-db-instances --db-instance-identifier <rds-instance-id>

# Step 2: Verify security group rules
aws ec2 describe-security-groups --group-ids <rds-security-group-id>

# Step 3: Check Lambda VPC configuration
aws lambda get-function-configuration --function-name <function-name> --query 'VpcConfig'

Use the Diagnosis section to correlate with recent VPC configuration changes
Apply the fix (in this case, the Lambda function was in the wrong subnet and was moved to the correct one)

Outcome: The Lambda function was in a subnet without a route to RDS. Updating the VPC configuration restored the connection.
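
Before comparing subnets and route tables, it can help to confirm that the function and the database are in the same VPC at all. The sketch below is one way to do that first check. The function name `lambda_rds_same_vpc` is hypothetical, and it assumes the AWS CLI is configured for the right account and region. It does not prove a route exists; it only rules out the coarser mismatch.

```shell
# Hypothetical helper: succeed only if the function and the database report the same VPC ID.
# Arguments: <function-name> <rds-instance-id>
lambda_rds_same_vpc() {
  fn_vpc=$(aws lambda get-function-configuration --function-name "$1" \
    --query 'VpcConfig.VpcId' --output text)
  db_vpc=$(aws rds describe-db-instances --db-instance-identifier "$2" \
    --query 'DBInstances[0].DBSubnetGroup.VpcId' --output text)

  # Text output renders a missing value as "None"
  if [ -z "$fn_vpc" ] || [ "$fn_vpc" = "None" ]; then
    echo "function is not attached to a VPC"
    return 1
  fi
  if [ "$fn_vpc" != "$db_vpc" ]; then
    echo "VPC mismatch: function in $fn_vpc, database in $db_vpc"
    return 1
  fi
  echo "both in $fn_vpc; next compare subnets, route tables, and security groups"
}

# Usage (not run here):
#   lambda_rds_same_vpc <function-name> <rds-instance-id>
```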

Example 5: Deployment Not Scaling

Scenario: HPA should scale your deployment but it's not working.

Steps:

Navigate to K8s Playbooks/04-Workloads/
Open HPAHorizontalPodAutoscalerNotScaling-workload.md

Follow the Playbook steps:

# Step 1: Check HPA status
kubectl get hpa -n <namespace>

# Step 2: Verify Metrics Server
kubectl get pods -n kube-system | grep metrics-server

# Step 3: Check resource metrics
kubectl top pods -n <namespace>

Confirm the finding (in this case, Metrics Server was down)
Apply the fix (in this case, the Metrics Server pod was restarted)

Outcome: The Metrics Server pod was down, so the HPA could not read metrics. Restarting the pod let the HPA scale correctly.
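
When metrics are unavailable, `kubectl get hpa` typically shows `<unknown>` in the TARGETS column, which is a quick signal that points at Metrics Server rather than at the HPA definition. A minimal sketch of that check follows; the function name `hpa_missing_metrics` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get hpa --no-headers` rows on stdin and
# prints the name of every HPA whose row still contains <unknown>.
hpa_missing_metrics() {
  awk '/<unknown>/ { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get hpa -n <namespace> --no-headers | hpa_missing_metrics
```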

Common Workflows

Workflow 1: On-Call Incident Response

When: You're on-call and get an alert.

Identify the issue: Match alert to playbook title
Navigate quickly: Use category folders for K8s issues
Follow playbook: Execute steps in order
Use Diagnosis: Correlate with recent changes
Document: Note what you found and fixed

Time Saved: Reduced MTTR from 45 minutes to 15 minutes
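
Matching an alert to a playbook title is faster with a filename search than with folder browsing. The sketch below assumes the repository is checked out locally and that playbooks are `.md` files, as in the examples above. The function name `find_playbook` is hypothetical.

```shell
# Hypothetical helper: list playbook files under a directory whose file name
# contains the keyword, ignoring case. Directory names are not matched.
# Arguments: <playbook-root> <keyword>
find_playbook() {
  find "$1" -type f -name '*.md' |
    awk -F/ -v kw="$2" 'index(tolower($NF), tolower(kw))' |
    sort
}

# Usage from the repository root:
#   find_playbook "K8s Playbooks" crashloop
```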

Workflow 2: New Team Member Training

When: Onboarding a new SRE team member.

Review structure: Show them the repository organization
Walk through example: Use a common playbook together
Practice: Have them follow a playbook for a test issue
Bookmark: Save frequently used playbooks
Contribute: Encourage them to improve playbooks

Outcome: New team member productive in 2 days instead of 2 weeks

Workflow 3: Post-Incident Review

When: After resolving an incident.

Review playbook used: Did it help? What was missing?
Improve playbook: Add steps that would have helped
Share learnings: Update playbook with new insights
Contribute back: Submit improvements to the repository

Outcome: Playbooks continuously improve based on real incidents

Advanced Use Cases

Use Case 1: Multi-Cloud Troubleshooting

Scenario: Application spans AWS and Kubernetes (EKS).

Approach:

Use AWS playbooks for AWS-specific issues (EC2, RDS, etc.)
Use K8s playbooks for Kubernetes issues (pods, services, etc.)
Cross-reference when issues span both (e.g., EKS control plane)

Example: EC2 instance can't reach EKS cluster

Check AWS playbooks for EC2 networking
Check K8s playbooks for service accessibility
Found security group misconfiguration affecting both

Use Case 2: Automation Integration

Scenario: Integrate playbooks into your incident response automation.

Approach:

Parse playbook steps into automated checks
Use playbook structure for runbook automation
Correlate with monitoring data using Diagnosis section

Example: Automated health checks based on playbook steps

Script checks pod status (from playbook Step 1)
Script checks node resources (from playbook Step 2)
Script alerts when thresholds are exceeded
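
The node resource check described above could be sketched as a filter over `kubectl top nodes` output. This is an illustration under stated assumptions, not the repository's automation: the function name `nodes_over_threshold` is hypothetical, the input is assumed to come from `kubectl top nodes --no-headers`, and the threshold is a single percentage applied to both CPU and memory.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` rows on stdin
# (NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%) and prints nodes whose CPU or
# memory usage is at or above the given percentage.
nodes_over_threshold() {
  awk -v limit="$1" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
      print $1, "cpu=" cpu "%", "mem=" mem "%"
  }'
}

# Usage against a live cluster (not run here):
#   hot=$(kubectl top nodes --no-headers | nodes_over_threshold 85)
#   [ -n "$hot" ] && echo "ALERT: $hot"
```

Keeping the parsing separate from the `kubectl` call makes the check easy to test with sample rows before wiring it to an alerting tool.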

Use Case 3: Documentation Standardization

Scenario: Standardize troubleshooting procedures across teams.

Approach:

Use playbook structure as template
Customize for organization-specific tools
Maintain consistency across all runbooks

Example: All team runbooks now follow same structure

Consistent format makes knowledge transfer easier
New team members can follow any runbook
Easier to maintain and update
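
A consistent structure is easier to keep if it is checked automatically. The sketch below is one possible lint, assuming runbooks are Markdown files with `#` headings. The heading keywords `Playbook` and `Diagnosis` are assumptions based on the sections named in the examples above, and the function name `missing_sections` is hypothetical.

```shell
# Hypothetical lint: print each expected heading keyword that a runbook lacks.
# The keywords "Playbook" and "Diagnosis" are assumed section names.
# Argument: <runbook-file>
missing_sections() {
  for section in Playbook Diagnosis; do
    grep -q -i "^#.*$section" "$1" || echo "$section"
  done
}

# Usage from the repository root:
#   missing_sections "K8s Playbooks/03-Pods/CrashLoopBackOff-pod.md"
```

Empty output means both headings were found.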

Tips for Success

Tip 1: Start with Common Issues

Focus on playbooks for issues you encounter most:

Pod crashes (CrashLoopBackOff)
Service connectivity
Resource quotas
Network policies

Tip 2: Customize for Your Environment

Fork the repository and:

Add organization-specific steps
Include internal tool commands
Add team-specific notes

Tip 3: Build a Playbook Library

Create your own collection:

Bookmark frequently used playbooks
Add custom playbooks for your infrastructure
Share with your team

Tip 4: Use During Post-Mortems

After incidents:

Review which playbook was used
Identify gaps
Improve the playbook
Contribute improvements back

Success Stories

Story 1: Reduced MTTR by 60%

Before: Average 45 minutes to resolve incidents
After: Average 18 minutes using playbooks
Key Factor: Systematic approach eliminated guesswork

Story 2: Improved Team Confidence

Before: Junior engineers hesitant to handle incidents
After: All engineers confident following playbooks
Key Factor: Clear, step-by-step guidance

Story 3: Better Documentation

Before: Inconsistent troubleshooting notes
After: Standardized playbook format
Key Factor: Consistent structure across all runbooks

Have your own success story? Share it in GitHub Discussions!

Need help with a specific scenario? Ask the community!
