🔧 Self-Healing Infrastructure Monitor

An intelligent infrastructure monitoring system that combines Model Context Protocol (MCP) and Google's Agent Development Kit (ADK) to automatically detect, diagnose, and remediate infrastructure issues with human oversight.

🌟 Key Features

🤖 AI-Powered Diagnosis: Multi-agent system for intelligent root cause analysis
🔄 Self-Healing: Automatic remediation with safety guardrails
👁️ Human-in-the-Loop: Approval workflows for critical operations
📊 Multi-Cloud Support: AWS and Kubernetes integration today; GCP is planned (see Integrations)
📝 Complete Audit Trail: Every action logged and reversible
🔌 Extensible Architecture: Plugin system for custom monitors
⚡ Real-time Monitoring: Continuous health checks and alerting

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                   Infrastructure Layer                   │
│              (AWS, GCP, Kubernetes, etc.)               │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓ metrics, logs, events
┌─────────────────────────────────────────────────────────┐
│                    MCP Server Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │   Metrics    │  │     Logs     │  │Infrastructure│ │
│  │  Resources   │  │  Resources   │  │   Resources  │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │ Diagnostic   │  │ Remediation  │  │   Rollback   │ │
│  │    Tools     │  │    Tools     │  │    Tools     │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓ MCP Protocol
┌─────────────────────────────────────────────────────────┐
│                   ADK Agent Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │ Diagnostic   │  │ Remediation  │  │   Analysis   │ │
│  │    Agent     │  │    Agent     │  │    Agent     │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────────────────┐
│              Orchestration Engine                        │
│  • Workflow State Machine                               │
│  • Human Approval System                                │
│  • Audit Logging & Rollback                            │
└─────────────────────────────────────────────────────────┘
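
The "Workflow State Machine" in the orchestration box can be pictured with a small sketch. This is an illustration only, not the project's actual implementation: the state names, event names, and the `Workflow` class below are all assumptions chosen to mirror the detect, diagnose, approve, remediate, and rollback flow described above.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical workflow states; the real project may use different ones."""
    MONITORING = auto()
    DIAGNOSING = auto()
    AWAITING_APPROVAL = auto()
    REMEDIATING = auto()
    ROLLING_BACK = auto()
    RESOLVED = auto()


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.MONITORING, "issue_detected"): State.DIAGNOSING,
    (State.DIAGNOSING, "plan_ready"): State.AWAITING_APPROVAL,
    (State.AWAITING_APPROVAL, "approved"): State.REMEDIATING,
    (State.AWAITING_APPROVAL, "rejected"): State.MONITORING,
    (State.REMEDIATING, "succeeded"): State.RESOLVED,
    (State.REMEDIATING, "failed"): State.ROLLING_BACK,
    (State.ROLLING_BACK, "rolled_back"): State.MONITORING,
    (State.RESOLVED, "reset"): State.MONITORING,
}


class Workflow:
    """Tracks the current state and records every transition for auditing."""

    def __init__(self):
        self.state = State.MONITORING
        self.audit_log = []

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            # Refuse anything not listed, e.g. remediating without approval.
            raise ValueError(
                f"event {event!r} not allowed in state {self.state.name}"
            )
        previous, self.state = self.state, TRANSITIONS[key]
        self.audit_log.append((previous.name, event, self.state.name))
        return self.state
```

Because remediation is only reachable through `AWAITING_APPROVAL`, a table like this makes the human-approval guardrail a structural property rather than a runtime check scattered across the code.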

⚠️ Note: This project is currently under active development. Some features and documentation are still being implemented. See the Roadmap section for current status.

🚀 Quick Start

Prerequisites

Python 3.10+
Docker (optional, for containerized deployment)
AWS/GCP credentials (for cloud integrations)
Google ADK API key

Installation

# Clone the repository
git clone https://github.com/rahulbakshee/Self-Healing-Infrastructure-Monitor.git
cd Self-Healing-Infrastructure-Monitor

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

# Or using poetry
poetry install

Configuration

# Copy example config
cp config/mcp_config.yaml.example config/mcp_config.yaml
cp config/adk_config.yaml.example config/adk_config.yaml

# Edit with your credentials
vim config/mcp_config.yaml

Running the Server

# Start MCP server
python -m src.mcp_server.server

# Or use the convenience script
./scripts/run_server.sh

Docker Deployment

# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

📖 Usage Examples

Basic Health Check Monitoring

from src.orchestrator.workflow import HealthMonitor
from src.integrations.aws import AWSIntegration

# Initialize monitor
monitor = HealthMonitor(
    integrations=[AWSIntegration()],
    check_interval=60
)

# Start monitoring
monitor.start()
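
To make the role of `check_interval` concrete, here is a minimal sketch of a polling loop. It does not show how `HealthMonitor` works internally; `run_checks`, `poll`, and the `max_cycles` and `sleep` parameters are hypothetical names introduced only for this illustration.

```python
import time
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check once and return the names of the failing ones."""
    failing = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated as unhealthy, not as a crash.
            healthy = False
        if not healthy:
            failing.append(name)
    return failing


def poll(checks, check_interval, on_failure, max_cycles=None, sleep=time.sleep):
    """Run all checks every `check_interval` seconds.

    `max_cycles=None` loops forever; a number bounds the loop, which is
    convenient in tests. `sleep` is injectable for the same reason.
    """
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        failing = run_checks(checks)
        if failing:
            on_failure(failing)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(check_interval)
```

For example, `poll({"api": lambda: True}, check_interval=60, on_failure=print, max_cycles=1)` runs a single cycle and returns without sleeping.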

Custom Remediation Action

from src.mcp_server.tools.remediation import RemediationTool
from src.models.remediation import RemediationAction

# Define custom remediation
action = RemediationAction(
    name="restart_service",
    description="Restart unhealthy service",
    command="systemctl restart myapp",
    requires_approval=True,
    rollback_command="systemctl stop myapp"
)

# Register with MCP server
remediation_tool = RemediationTool()
remediation_tool.register_action(action)
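
The interplay between `requires_approval` and `rollback_command` can be sketched as follows. This is an assumption about how such fields might be honored, not the behavior of the project's `RemediationTool`: the `Action` dataclass, the `execute` function, and the `run` and `approve` callables are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical stand-in for RemediationAction, reduced to four fields."""
    name: str
    command: str
    requires_approval: bool = True
    rollback_command: Optional[str] = None


def execute(
    action: Action,
    run: Callable[[str], bool],
    approve: Callable[[Action], bool],
    dry_run: bool = False,
) -> List[str]:
    """Apply an action behind an approval gate and return audit entries.

    `run` executes a command and returns True on success; `approve` asks a
    human (or a policy) whether the action may proceed.
    """
    audit = []
    if action.requires_approval and not approve(action):
        audit.append(f"rejected:{action.name}")
        return audit
    if dry_run:
        # Record what would have run, but do not touch the system.
        audit.append(f"dry_run:{action.command}")
        return audit
    if run(action.command):
        audit.append(f"ok:{action.command}")
    else:
        audit.append(f"failed:{action.command}")
        if action.rollback_command:
            rolled_back = run(action.rollback_command)
            prefix = "rollback_ok:" if rolled_back else "rollback_failed:"
            audit.append(prefix + action.rollback_command)
    return audit
```

Passing `run` and `approve` as callables keeps the gate logic testable without a real shell or a real approver; in production `run` could wrap `subprocess.run` and `approve` could block on a chat or ticket response.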

Kubernetes Pod Auto-Healing

from src.integrations.kubernetes import K8sIntegration
from src.orchestrator.workflow import Orchestrator

# Setup K8s integration
k8s = K8sIntegration(
    namespace="production",
    auto_remediate=True
)

# Create orchestrator
orchestrator = Orchestrator(
    integrations=[k8s],
    approval_required=True
)

# Monitor and heal
orchestrator.run()

🔧 Configuration

MCP Server Configuration (`config/mcp_config.yaml`)

server:
  name: "self-healing-infra-monitor"
  version: "1.0.0"
  transport: "stdio"

resources:
  metrics:
    enabled: true
    providers:
      - prometheus
      - cloudwatch
  
  logs:
    enabled: true
    retention_days: 30

tools:
  diagnostics:
    timeout: 30
  remediation:
    dry_run: false
    require_approval: true
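
A quick sanity check on this file before starting the server can catch risky combinations such as live remediation without approval. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example with a YAML library); `validate_mcp_config` and its rules are hypothetical and not part of the project.

```python
def validate_mcp_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means no issues."""
    problems = []

    # The server block needs a name, a version, and a transport.
    server = cfg.get("server", {})
    for key in ("name", "version", "transport"):
        if not server.get(key):
            problems.append(f"server.{key} is missing")

    tools = cfg.get("tools", {})

    # The diagnostics timeout must be a positive number (bools are rejected).
    timeout = tools.get("diagnostics", {}).get("timeout")
    if (
        not isinstance(timeout, (int, float))
        or isinstance(timeout, bool)
        or timeout <= 0
    ):
        problems.append("tools.diagnostics.timeout must be a positive number")

    # Live remediation with no approval step is the riskiest setting.
    remediation = tools.get("remediation", {})
    live = not remediation.get("dry_run", False)
    unapproved = not remediation.get("require_approval", False)
    if live and unapproved:
        problems.append("tools.remediation runs live without approval")

    return problems
```

With the values shown above (`dry_run: false`, `require_approval: true`), this check reports no problems.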

ADK Agent Configuration (`config/adk_config.yaml`)

agents:
  diagnostic:
    model: "gemini-2.0-flash-exp"
    temperature: 0.3
    max_tokens: 2000
    
  remediation:
    model: "gemini-2.0-flash-exp"
    temperature: 0.1
    max_tokens: 1500
    
  analysis:
    model: "gemini-2.0-flash-exp"
    temperature: 0.5
    max_tokens: 3000

api:
  key: "your-adk-api-key"
  timeout: 60
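
Rather than committing a real key to `config/adk_config.yaml`, one option is to let an environment variable override the `api.key` value. The sketch below is an assumption about how that could be wired up: the function `resolve_api_key` and the variable name `ADK_API_KEY` are hypothetical, so check `.env.example` for the names the project actually expects.

```python
import os

# The placeholder value shown in the example config above.
PLACEHOLDER = "your-adk-api-key"


def resolve_api_key(config: dict, env=None, env_var: str = "ADK_API_KEY") -> str:
    """Prefer the environment variable, then fall back to the config file.

    `env_var` is a hypothetical name used for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(env_var) or config.get("api", {}).get("key")
    if not key or key == PLACEHOLDER:
        raise ValueError(
            f"no API key found: set {env_var} or api.key in adk_config.yaml"
        )
    return key
```

Rejecting the placeholder explicitly turns a forgotten edit into an immediate, readable error instead of a failed API call later on.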

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific test suite
pytest tests/test_mcp_server.py

# Run integration tests
pytest tests/integration/

📚 Documentation

🔌 Integrations

Currently Supported

✅ AWS (EC2, ECS, Lambda, CloudWatch)
✅ Kubernetes (Pods, Deployments, Services)
✅ Prometheus (Metrics & Alerting)

Coming Soon

🚧 Google Cloud Platform
🚧 Azure
🚧 Datadog
🚧 PagerDuty

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

# Fork the repo
# Create feature branch
git checkout -b feature/amazing-feature

# Commit changes
git commit -m 'Add amazing feature'

# Push to branch
git push origin feature/amazing-feature

# Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Model Context Protocol by Anthropic
Google Agent Development Kit
Built with inspiration from modern SRE practices

📧 Contact

Author: Rahul Bakshee
LinkedIn: https://linkedin.com/in/rahulbakshee

🗺️ Roadmap

⭐ Star this repo if you find it useful!
