Scoutflo/Scoutflo-SRE-Playbooks

SRE Playbooks Repository

Comprehensive incident response playbooks for AWS, Kubernetes, and Sentry environments. They help SREs diagnose and resolve infrastructure issues faster with systematic, step-by-step troubleshooting guides.

Overview

This repository contains 414 comprehensive incident response playbooks designed to help Site Reliability Engineers (SREs) systematically diagnose and resolve common infrastructure and application issues in AWS, Kubernetes, and Sentry environments.

Why This Repository?

  • Systematic Approach: Each playbook follows a consistent structure with clear diagnostic steps
  • Time-Saving: Quickly identify root causes with correlation analysis frameworks
  • Community-Driven: Continuously improved by the open-source community
  • Production-Ready: Based on real-world incident response scenarios
  • Comprehensive Coverage: 232 Kubernetes playbooks + 157 AWS playbooks + 25 Sentry playbooks
  • Proactive Monitoring: 56 K8s + 65 AWS proactive playbooks for capacity planning and compliance

Diagnosis Improvements

All playbooks use an events-first approach for root cause analysis:

  • Diagnosis sections prioritize checking recent events and changes before diving into configuration details
  • Conditional logic patterns help narrow down causes based on observed symptoms
  • Time-based correlation analysis connects events to failures systematically
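
The time-based correlation idea can be sketched in shell. This is a minimal illustration, not part of the playbooks: the function name events_before_failure and the input format (one event per line, epoch seconds, a tab, then a description) are assumptions made for the example.

```shell
# events_before_failure FAILURE_EPOCH WINDOW_MINUTES
# Hypothetical helper: reads "EPOCH<TAB>DESCRIPTION" lines on stdin and prints
# the events that happened within WINDOW_MINUTES before (or at) the failure,
# newest first, so the most recent change is inspected first.
events_before_failure() {
  awk -F '\t' -v fail="$1" -v win="$2" '
    $1 <= fail && $1 >= fail - win * 60 { print }
  ' | sort -t "$(printf '\t')" -k1,1nr
}

# Example call (timestamps are made up):
#   printf '4000\tconfig change\n4500\tnode drain\n' | events_before_failure 4600 30
```

If no event falls inside the window, rerunning with a larger WINDOW_MINUTES mirrors the "extend time windows" advice given in the playbooks.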

Use Cases

  • During Incidents: Quick reference for troubleshooting common issues
  • On-Call Rotation: Essential runbook collection for on-call engineers
  • Knowledge Sharing: Standardize troubleshooting procedures across teams
  • Training: Learn systematic incident response methodologies
  • Documentation: Build your own runbook library

Repository Structure

scoutflo-SRE-Playbooks/
├── AWS Playbooks/                    # 157 AWS playbooks
│   ├── 01-Compute/                   # 27 playbooks (EC2, Lambda, ECS, EKS)
│   ├── 02-Database/                  # 8 playbooks (RDS, DynamoDB)
│   ├── 03-Storage/                   # 7 playbooks (S3)
│   ├── 04-Networking/                # 17 playbooks (VPC, ELB, Route53)
│   ├── 05-Security/                  # 16 playbooks (IAM, KMS, GuardDuty)
│   ├── 06-Monitoring/                # 8 playbooks (CloudTrail, CloudWatch)
│   ├── 07-CI-CD/                     # 9 playbooks (CodePipeline, CodeBuild)
│   ├── 08-Proactive/                 # 65 proactive monitoring playbooks
│   └── README.md
├── K8s Playbooks/                    # 232 Kubernetes playbooks
│   ├── 01-Control-Plane/             # 24 playbooks
│   ├── 02-Nodes/                     # 24 playbooks
│   ├── 03-Pods/                      # 41 playbooks
│   ├── 04-Workloads/                 # 25 playbooks
│   ├── 05-Networking/                # 27 playbooks
│   ├── 06-Storage/                   # 9 playbooks
│   ├── 07-RBAC/                      # 6 playbooks
│   ├── 08-Configuration/             # 6 playbooks
│   ├── 09-Resource-Management/       # 8 playbooks
│   ├── 10-Monitoring-Autoscaling/    # 3 playbooks
│   ├── 11-Installation-Setup/        # 1 playbook
│   ├── 12-Namespaces/                # 2 playbooks
│   ├── 13-Proactive/                 # 56 proactive monitoring playbooks
│   └── README.md
├── Sentry Playbooks/                 # 25 Sentry playbooks
│   ├── 01-Error-Tracking/            # 19 playbooks
│   ├── 02-Performance/               # 6 playbooks
│   ├── 03-Release-Health/            # Placeholder
│   └── README.md
├── CONTRIBUTING.md
└── README.md

Contents

AWS Playbooks (AWS Playbooks/)

157 playbooks covering 7 service categories + proactive monitoring:

  • Compute Services (27 playbooks): EC2, Lambda, ECS, EKS
  • Database (8 playbooks): RDS, DynamoDB
  • Storage (7 playbooks): S3
  • Networking (17 playbooks): VPC, ELB, Route 53, NAT Gateway
  • Security (16 playbooks): IAM, KMS, GuardDuty, CloudTrail
  • Monitoring (8 playbooks): CloudTrail, CloudWatch
  • CI/CD (9 playbooks): CodePipeline, CodeBuild
  • Proactive (65 playbooks): Capacity planning, compliance, cost optimization

Key Topics:

  • Connection timeouts and network issues
  • Access denied and permission problems
  • Resource unavailability and capacity issues
  • Security breaches and threat detection
  • Service integration failures
  • Proactive capacity and compliance monitoring

See AWS Playbooks/README.md for complete documentation and playbook list.

Kubernetes Playbooks (K8s Playbooks/)

232 playbooks organized into 13 categorized folders covering Kubernetes cluster and workload issues:

Folder Structure:

  • 01-Control-Plane/ (24 playbooks) - API Server, Scheduler, Controller Manager, etcd
  • 02-Nodes/ (24 playbooks) - Node readiness, kubelet issues, resource constraints
  • 03-Pods/ (41 playbooks) - Scheduling, lifecycle, health checks, resource limits
  • 04-Workloads/ (25 playbooks) - Deployments, StatefulSets, DaemonSets, Jobs, HPA
  • 05-Networking/ (27 playbooks) - Services, Ingress, DNS, Network Policies, kube-proxy
  • 06-Storage/ (9 playbooks) - PersistentVolumes, PersistentVolumeClaims, StorageClasses
  • 07-RBAC/ (6 playbooks) - ServiceAccounts, Roles, RoleBindings, authorization
  • 08-Configuration/ (6 playbooks) - ConfigMaps and Secrets access issues
  • 09-Resource-Management/ (8 playbooks) - Resource Quotas, overcommit, compute resources
  • 10-Monitoring-Autoscaling/ (3 playbooks) - Metrics Server, Cluster Autoscaler
  • 11-Installation-Setup/ (1 playbook) - Helm and installation issues
  • 12-Namespaces/ (2 playbooks) - Namespace management issues
  • 13-Proactive/ (56 playbooks) - Proactive monitoring, capacity planning, compliance

Key Topics:

  • Pod lifecycle issues (CrashLoopBackOff, Pending, Terminating)
  • Control plane component failures
  • Network connectivity and DNS resolution
  • Storage and volume mounting problems
  • RBAC and permission errors
  • Resource quota and capacity constraints
  • Proactive capacity and compliance monitoring

See K8s Playbooks/README.md for complete documentation and playbook list.

Sentry Playbooks (Sentry Playbooks/)

25 playbooks covering error tracking and performance monitoring:

Folder Structure:

  • 01-Error-Tracking/ (19 playbooks) - Error capture, grouping, alerting, and debugging
  • 02-Performance/ (6 playbooks) - Transaction monitoring, performance issues, tracing
  • 03-Release-Health/ - Release tracking and health monitoring (placeholder)

Key Topics:

  • Error capture and reporting issues
  • Issue grouping and deduplication
  • Alert configuration and routing
  • Performance transaction monitoring
  • SDK integration troubleshooting
  • Release health tracking

See Sentry Playbooks/README.md for complete documentation and playbook list.

Getting Started

Prerequisites

  • Basic knowledge of AWS services, Kubernetes, or Sentry
  • Access to AWS Console, Kubernetes cluster, or Sentry dashboard (for using playbooks)
  • Git (for cloning the repository)

Installation

Option 1: Clone the Repository

# Clone the repository
git clone https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git

# Navigate to the repository
cd scoutflo-SRE-Playbooks

# View available playbooks
ls AWS\ Playbooks/
ls K8s\ Playbooks/
ls Sentry\ Playbooks/

Option 2: Use as Git Submodule

Include playbooks in your own projects:

git submodule add https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git playbooks

Option 3: Download Specific Playbooks

Browse and download individual playbooks directly from the GitHub web interface.

Quick Start

  1. Identify Your Issue: Determine if it's an AWS, Kubernetes, or Sentry issue
  2. Navigate to Playbooks:
    • AWS issues -> AWS Playbooks/[category-folder]/
    • K8s issues -> K8s Playbooks/[category-folder]/
    • Sentry issues -> Sentry Playbooks/[category-folder]/
  3. Find the Playbook: Match your symptoms to a playbook title
  4. Follow the Steps: Execute diagnostic steps in order
  5. Use Diagnosis Section: Apply correlation analysis for root cause identification
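
Step 3 (matching symptoms to a playbook title) can be approximated from a local clone by searching file names. The sketch below assumes playbooks are Markdown files whose names contain the symptom keyword; find_playbook is a hypothetical helper name, not a script shipped with the repository.

```shell
# find_playbook ROOT KEYWORD
# Hypothetical helper: lists Markdown files under ROOT whose file name
# contains KEYWORD (case-insensitive), skipping the README.md index files.
find_playbook() {
  find "$1" -type f -iname "*$2*.md" ! -iname 'README.md' | sort
}

# Example call from the repository root:
#   find_playbook "AWS Playbooks" timeout
```

Quoting the first argument matters because the top-level folder names contain spaces.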

Example Usage

Scenario: EC2 instance SSH connection timeout

  1. Navigate to AWS Playbooks/
  2. Open Connection-Timeout-SSH-Issues-EC2.md
  3. Follow the Playbook steps, replacing <instance-id> with your actual instance ID
  4. Use the Diagnosis section to correlate events with failures
  5. Apply the identified fix

Usage

How Playbooks Work

Important: These playbooks are written for AI agents. Each step is a natural language instruction that an agent interprets and executes with the tools available to it (such as AWS MCP tools, Kubernetes MCP tools, or kubectl).

Example Playbook Step:

  • Natural Language: "Retrieve logs from pod <pod-name> in namespace <namespace> and analyze error messages"
  • AI Agent Action: Interprets this instruction and uses appropriate tools to fetch and analyze pod logs

For Manual Use:

  • While playbooks are optimized for AI agents, you can also use them manually
  • The README files in each category folder include equivalent kubectl/AWS CLI commands for manual verification
  • Replace placeholders with actual resource identifiers when following steps manually
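
For manual use, the example step above ("retrieve logs ... and analyze error messages") might translate into something like the following. The filter name error_lines and its keyword list are assumptions chosen for illustration; the category README files remain the reference for the actual equivalent commands.

```shell
# error_lines
# Hypothetical filter: keeps only the log lines on stdin that look like errors.
# The keyword list is an assumption; adjust it to your application's log format.
error_lines() {
  grep -iE 'error|fatal|panic|exception'
}

# Manual equivalent of the example step, with placeholders left for you to fill in:
#   kubectl logs <pod-name> --namespace <namespace> --tail=200 | error_lines
```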

Playbook Structure

All playbooks follow a consistent structure:

  1. Title - Clear, descriptive issue identification
  2. Meaning - What the issue means, triggers, symptoms, root causes
  3. Impact - Business and technical implications
  4. Playbook - 8-10 numbered diagnostic steps in natural language (ordered from common to specific)
  5. Diagnosis - Correlation analysis framework with time windows using events-first approach and conditional logic patterns
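
A quick structural check for a playbook file could look like the sketch below. It assumes each section is introduced by a Markdown heading such as "## Meaning", which this README does not confirm; missing_sections is a hypothetical helper name.

```shell
# missing_sections FILE
# Hypothetical helper: prints the name of every required section that has no
# matching heading in FILE. ASSUMPTION: sections are Markdown headings
# ("#", "##", ...) followed by the section name.
missing_sections() {
  for section in Meaning Impact Playbook Diagnosis; do
    grep -qE "^#+[[:space:]]+$section([[:space:]]|\$)" "$1" || echo "$section"
  done
}

# Example call: missing_sections "path/to/some-playbook.md"
```

Empty output means all four sections after the title were found.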

Best Practices

  • For AI Agents: Playbooks are optimized for AI interpretation - use natural language instructions
  • For Manual Use: See category README files for equivalent kubectl/AWS CLI commands
  • Replace Placeholders: All playbooks use placeholders (e.g., <instance-id>, <pod-name>) that must be replaced with actual values
  • Follow Order: Execute steps sequentially unless you have strong evidence pointing to a specific step
  • Correlate Timestamps: Use the Diagnosis section to correlate events with failures
  • Extend Windows: If initial correlations don't reveal causes, extend time windows as suggested

Placeholder Reference

AWS Playbooks:

  • <instance-id>, <bucket-name>, <region>, <function-name>, <role-name>, <user-name>, <security-group-id>, <vpc-id>, <rds-instance-id>, <load-balancer-name>

Kubernetes Playbooks:

  • <pod-name>, <namespace>, <deployment-name>, <node-name>, <service-name>, <ingress-name>, <pvc-name>, <configmap-name>, <secret-name>

Sentry Playbooks:

  • <project-slug>, <organization-slug>, <issue-id>, <transaction-name>, <release-version>, <environment>
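
Placeholder replacement can be scripted when following a playbook manually. The two helpers below are a sketch: fill_placeholder and unfilled_placeholders are hypothetical names, and the second assumes placeholders are lowercase words joined by hyphens, as in the lists above.

```shell
# fill_placeholder NAME VALUE
# Hypothetical helper: replaces every <NAME> on stdin with VALUE.
fill_placeholder() {
  # Escape characters that are special in a sed replacement (\, & and the | delimiter).
  escaped=$(printf '%s' "$2" | sed 's/[\\&|]/\\&/g')
  sed "s|<$1>|$escaped|g"
}

# unfilled_placeholders
# Hypothetical helper: lists the distinct <placeholder> tokens still present on stdin.
unfilled_placeholders() {
  grep -oE '<[a-z][a-z-]*>' | sort -u
}

# Example call:
#   fill_placeholder pod-name web-7d9f < step.txt | unfilled_placeholders
```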

Terminology & Glossary

Understanding the terms used in these playbooks will help you use them more effectively. The README files in the AWS and Kubernetes playbook folders contain detailed glossaries.

Quick Reference

SRE (Site Reliability Engineering)

  • A discipline combining software engineering and operations to build reliable systems.

Playbook / Runbook

  • A step-by-step guide for diagnosing and resolving specific issues.

Incident

  • An event that disrupts or degrades a service, requiring immediate attention.

On-Call

  • Engineers available to respond to incidents outside normal business hours.

MTTR (Mean Time To Recovery)

  • Average time to restore a service after an incident. Playbooks help reduce MTTR.

Correlation Analysis

  • Finding relationships between events (like configuration changes) and symptoms (like service failures) by comparing timestamps.

Root Cause

  • The underlying reason why an issue occurred, as opposed to just the symptoms.

Placeholder

  • A value in playbooks (like <instance-id>) that you replace with your actual resource identifier.

Diagnosis Section

  • Part of each playbook that helps you correlate events with failures using time-based analysis.

Common Abbreviations

  • K8s: Kubernetes (K + 8 letters + s)
  • SRE: Site Reliability Engineering
  • MTTR: Mean Time To Recovery
  • API: Application Programming Interface
  • DNS: Domain Name System
  • RBAC: Role-Based Access Control
  • PVC: PersistentVolumeClaim
  • HPA: Horizontal Pod Autoscaler

For detailed explanations of AWS and Kubernetes terms, see the respective README files above.

Quick Reference

Need a quick cheat sheet? Check out our Quick Reference Card for:

  • One-page overview
  • Common commands
  • Quick lookup tables
  • Essential links

Troubleshooting Guide

Not sure which playbook to use? Use our Troubleshooting Decision Tree to:

  • Quickly identify the right playbook
  • Navigate by issue type
  • Look up by error message or alert name

Examples & Use Cases

See real-world scenarios in EXAMPLES.md:

  • Step-by-step examples
  • Common workflows
  • Success stories
  • Best practices

FAQ

Have questions? Check our FAQ for answers to:

  • General questions
  • Usage questions
  • Technical questions
  • Contributing questions

Video Tutorials

Learn how to use these playbooks effectively:

  • YouTube Channel: @scoutflo6727 - Subscribe for tutorials and walkthroughs
  • AI SRE Demo: Watch Demo Video - See Scoutflo AI SRE in action
  • Tutorials: Step-by-step video guides on using playbooks
  • Best Practices: Learn SRE incident response best practices

Coming Soon: Video tutorials for:

  • How to use playbooks effectively
  • Common troubleshooting scenarios
  • Contributing to playbooks
  • Advanced correlation analysis

Roadmap

Check out our ROADMAP.md to see:

  • Planned features and new playbook categories
  • Short-term and long-term goals
  • How to suggest new features
  • Release history

Contributing

We welcome contributions from the community! Your contributions help make these playbooks better for everyone. Visit our Contributors page to see who has helped build this project.

First-time contributor? Start with our Getting Started Guide for a quick onboarding experience, then look for issues labeled good first issue.

How to Contribute

1. Reporting Issues

Found a bug, unclear instruction, or have a suggestion?

  1. Check Existing Issues: Search GitHub Issues first
  2. Create a New Issue:
    • Use a clear, descriptive title
    • Describe the problem or suggestion
    • Include relevant service/component, error messages, or examples
    • Tag with appropriate labels (aws-playbook, k8s-playbook, sentry-playbook, bug, enhancement, etc.)

2. Improving Existing Playbooks

To fix or enhance existing playbooks:

  1. Fork the Repository: Create your own fork
  2. Create a Branch:
    git checkout -b fix/playbook-name-improvement
  3. Make Your Changes:
    • Follow the established playbook structure
    • Maintain consistency with existing formatting
    • Update placeholders and examples as needed
  4. Test Your Changes: Verify the playbook is accurate and helpful
  5. Commit and Push:
    git add .
    git commit -m "Fix: Improve [playbook-name] with [description]"
    git push origin fix/playbook-name-improvement
  6. Create a Pull Request:
    • Provide a clear description of changes
    • Reference any related issues
    • Request review from maintainers

3. Adding New Playbooks

To add a new playbook for an uncovered issue:

  1. Check for Duplicates: Ensure a similar playbook doesn't already exist
  2. Follow the Structure: Use existing playbooks as templates
  3. Choose the Right Location:
    • AWS playbooks -> Appropriate category folder in AWS Playbooks/
    • K8s playbooks -> Appropriate category folder in K8s Playbooks/
    • Sentry playbooks -> Appropriate category folder in Sentry Playbooks/
  4. Follow Naming Conventions:
    • AWS: <IssueOrSymptom>-<Component>.md
    • K8s: <AlertName>-<Resource>.md
    • Sentry: <IssueType>-<Component>.md
  5. Include All Sections: Title, Meaning, Impact, Playbook (8-10 steps), Diagnosis (5 correlations)
  6. Update README: Add the new playbook to the appropriate README's playbook list
  7. Create Pull Request: Follow standard contribution process
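
The naming conventions in step 4 can be checked before opening a pull request. The pattern below is one possible reading of those conventions (hyphen-separated alphanumeric parts, at least two of them, ending in .md); it is an assumption, not a rule enforced by the repository, and valid_playbook_name is a hypothetical helper.

```shell
# valid_playbook_name FILENAME
# Hypothetical helper: succeeds when FILENAME looks like <Part>-<Part>[-<Part>...].md.
# ASSUMPTION: parts contain only letters and digits.
valid_playbook_name() {
  printf '%s\n' "$1" | grep -qE '^[A-Za-z0-9]+(-[A-Za-z0-9]+)+\.md$'
}

# Example call:
#   valid_playbook_name "Connection-Timeout-SSH-Issues-EC2.md" && echo "name looks fine"
```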

Contribution Guidelines

  • Follow the Structure: Maintain consistency with existing playbooks
  • Use Placeholders: Replace specific values with placeholders
  • Be Specific: Provide actionable, step-by-step instructions
  • Include Correlation: Add time-based correlation analysis in the Diagnosis section
  • Test Thoroughly: Ensure playbooks are accurate and helpful
  • Document Changes: Clearly describe what you changed and why

Review Process

  1. All contributions require review from maintainers
  2. Feedback will be provided within 2-3 business days
  3. Address any requested changes promptly
  4. Once approved, your contribution will be merged

See CONTRIBUTING.md for detailed contribution guidelines.

Connect with Us

We'd love to hear from you!

Feedback & Feature Requests

Have an idea for improvement or a new playbook topic?

  • GitHub Issues: Create a feature request
  • Slack: Share your ideas in our #playbooks channel

Bug Reports

Found a bug or error in a playbook?

  • GitHub Issues: Create a bug report
  • Slack: Report in our #playbooks channel for quick response

Support

Need help? Check out our Support Guide.

Related Resources

Kubernetes Resources

Tools & Utilities:

  • k9s - Terminal UI for Kubernetes
  • Lens - Kubernetes IDE
  • Helm - Package manager for Kubernetes
  • kubectx & kubens - Context and namespace switching

Statistics

  • Total Playbooks: 414
    • AWS: 157 playbooks (92 reactive + 65 proactive)
    • Kubernetes: 232 playbooks (176 reactive + 56 proactive)
    • Sentry: 25 playbooks
  • Coverage: Major AWS services, Kubernetes components, and Sentry monitoring
  • Format: Markdown with structured sections
  • Language: English
  • Community: Open source, community-driven
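
The totals above could be re-checked against a local clone with a sketch like this one. It assumes every Markdown file other than README.md counts as one playbook, which the repository may not guarantee; count_playbooks is a hypothetical helper.

```shell
# count_playbooks DIR
# Hypothetical helper: counts Markdown files under DIR, excluding README.md.
# ASSUMPTION: one Markdown file equals one playbook.
count_playbooks() {
  find "$1" -type f -name '*.md' ! -name 'README.md' | wc -l | tr -d '[:space:]'
}

# Example calls from the repository root:
#   count_playbooks "AWS Playbooks"
#   count_playbooks "K8s Playbooks"
#   count_playbooks "Sentry Playbooks"
```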

License

This project is licensed under the MIT License - see the LICENSE file for details.

Maintainers

For maintainer information, see MAINTAINERS.md.

Acknowledgments

  • Contributors: Thank you to all contributors who help improve these playbooks
  • Community: The SRE community for sharing knowledge and best practices
  • Organizations: Companies and teams using these playbooks in production

Made with love by the SRE community for the SRE community

If you find these playbooks helpful, please consider giving us a star on GitHub!
