An intelligent infrastructure monitoring system that combines Model Context Protocol (MCP) and Google's Agent Development Kit (ADK) to automatically detect, diagnose, and remediate infrastructure issues with human oversight.
- 🤖 AI-Powered Diagnosis: Multi-agent system for intelligent root cause analysis
- 🔄 Self-Healing: Automatic remediation with safety guardrails
- 👁️ Human-in-the-Loop: Approval workflows for critical operations
- 📊 Multi-Cloud Support: AWS, GCP, Kubernetes integration
- 📝 Complete Audit Trail: Every action logged and reversible
- 🔌 Extensible Architecture: Plugin system for custom monitors
- ⚡ Real-time Monitoring: Continuous health checks and alerting
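The guardrail, approval, and audit-trail features listed above could fit together roughly as follows. This is a minimal standard-library sketch, not the project's implementation: `GuardedAction`, `AuditLog`, and `run_guarded` are hypothetical names, and the `approve` callback stands in for whatever approval workflow the project actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class GuardedAction:
    """Hypothetical remediation step paired with the step that undoes it."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]
    requires_approval: bool = True


@dataclass
class AuditLog:
    """Append-only, in-memory record of every attempted action."""
    entries: list = field(default_factory=list)

    def record(self, action: str, event: str) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "event": event,
        })


def run_guarded(action: GuardedAction, approve: Callable, log: AuditLog) -> bool:
    """Run the action only if approved; roll back and log if it fails."""
    if action.requires_approval and not approve(action):
        log.record(action.name, "rejected")
        return False
    log.record(action.name, "started")
    try:
        action.apply()
    except Exception as exc:
        # Undo the partial change and keep both facts in the audit trail.
        log.record(action.name, f"failed: {exc}")
        action.rollback()
        log.record(action.name, "rolled_back")
        return False
    log.record(action.name, "succeeded")
    return True
```

In this sketch a caller passes something like `approve=lambda action: ask_operator(action)`, where `ask_operator` is a placeholder for a real prompt or ticketing step. Actions with `requires_approval=False` skip the gate but are still logged.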
┌─────────────────────────────────────────────────────────┐
│                  Infrastructure Layer                   │
│              (AWS, GCP, Kubernetes, etc.)               │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓ metrics, logs, events
┌─────────────────────────────────────────────────────────┐
│                    MCP Server Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Metrics    │  │     Logs     │  │Infrastructure│   │
│  │  Resources   │  │  Resources   │  │  Resources   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  Diagnostic  │  │ Remediation  │  │   Rollback   │   │
│  │    Tools     │  │    Tools     │  │    Tools     │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓ MCP Protocol
┌─────────────────────────────────────────────────────────┐
│                     ADK Agent Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  Diagnostic  │  │ Remediation  │  │   Analysis   │   │
│  │    Agent     │  │    Agent     │  │    Agent     │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────────────────┐
│                  Orchestration Engine                   │
│  • Workflow State Machine                               │
│  • Human Approval System                                │
│  • Audit Logging & Rollback                             │
└─────────────────────────────────────────────────────────┘
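The "Workflow State Machine" in the orchestration engine is not specified further here. One plausible shape, sketched under the assumption that an incident moves through detection, diagnosis, approval, remediation, and verification, is a table-driven state machine. The state and event names below are illustrative and are not taken from the project's code.

```python
from enum import Enum, auto


class State(Enum):
    MONITORING = auto()
    DIAGNOSING = auto()
    AWAITING_APPROVAL = auto()
    REMEDIATING = auto()
    VERIFYING = auto()
    ROLLING_BACK = auto()


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.MONITORING, "issue_detected"): State.DIAGNOSING,
    (State.DIAGNOSING, "plan_ready"): State.AWAITING_APPROVAL,
    (State.DIAGNOSING, "no_action_needed"): State.MONITORING,
    (State.AWAITING_APPROVAL, "approved"): State.REMEDIATING,
    (State.AWAITING_APPROVAL, "rejected"): State.MONITORING,
    (State.REMEDIATING, "applied"): State.VERIFYING,
    (State.REMEDIATING, "failed"): State.ROLLING_BACK,
    (State.VERIFYING, "healthy"): State.MONITORING,
    (State.VERIFYING, "unhealthy"): State.ROLLING_BACK,
    (State.ROLLING_BACK, "rolled_back"): State.MONITORING,
}


class Workflow:
    """Tracks one incident; rejects events that are invalid in the current state."""

    def __init__(self) -> None:
        self.state = State.MONITORING
        self.history = [State.MONITORING]

    def handle(self, event: str) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"event {event!r} is not valid in state {self.state.name}"
            )
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Keeping the transitions in a single table means remediation cannot start without passing through `AWAITING_APPROVAL`, and `history` gives a simple per-incident trail that an audit log could persist.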
⚠️ Note: This project is currently under active development. Some features and documentation are still being implemented. See the Roadmap section for current status.
- Python 3.10+
- Docker (optional, for containerized deployment)
- AWS/GCP credentials (for cloud integrations)
- Google ADK API key
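A quick preflight check for the prerequisites above might look like the sketch below. The environment variable name `GOOGLE_API_KEY` is an assumption about how the ADK key is supplied, and `check_prerequisites` is a hypothetical helper that is not part of this repository.

```python
import os
import shutil
import sys

# Assumption: the ADK key is exposed through this environment variable.
API_KEY_VAR = "GOOGLE_API_KEY"


def check_prerequisites(version=None, env=None, which=shutil.which):
    """Return (errors, warnings) describing unmet prerequisites."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    env = os.environ if env is None else env
    errors, warnings = [], []

    if version < (3, 10):
        errors.append(
            f"Python 3.10+ is required, found {version[0]}.{version[1]}"
        )
    if not env.get(API_KEY_VAR):
        errors.append(f"{API_KEY_VAR} is not set")
    # Docker is optional, so its absence is only a warning.
    if which("docker") is None:
        warnings.append("docker not found; containerized deployment is unavailable")
    return errors, warnings
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter and environment. The parameters exist only so the check can be exercised with fixed inputs.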
# Clone the repository
git clone https://github.com/rahulbakshee/Self-Healing-Infrastructure-Monitor.git
cd Self-Healing-Infrastructure-Monitor
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e .
# Or using poetry
poetry install

# Copy example config
cp config/mcp_config.yaml.example config/mcp_config.yaml
cp config/adk_config.yaml.example config/adk_config.yaml
# Edit with your credentials
vim config/mcp_config.yaml

# Start MCP server
python -m src.mcp_server.server
# Or use the convenience script
./scripts/run_server.sh

# Build and run
docker-compose up -d
# View logs
docker-compose logs -f

from src.orchestrator.workflow import HealthMonitor
from src.integrations.aws import AWSIntegration

# Initialize monitor
monitor = HealthMonitor(
    integrations=[AWSIntegration()],
    check_interval=60
)

# Start monitoring
monitor.start()

from src.mcp_server.tools.remediation import RemediationTool
from src.models.remediation import RemediationAction
# Define custom remediation
action = RemediationAction(
    name="restart_service",
    description="Restart unhealthy service",
    command="systemctl restart myapp",
    requires_approval=True,
    rollback_command="systemctl stop myapp"
)

# Register with MCP server
remediation_tool = RemediationTool()
remediation_tool.register_action(action)

from src.integrations.kubernetes import K8sIntegration
from src.orchestrator.workflow import Orchestrator
# Setup K8s integration
k8s = K8sIntegration(
    namespace="production",
    auto_remediate=True
)

# Create orchestrator
orchestrator = Orchestrator(
    integrations=[k8s],
    approval_required=True
)

# Monitor and heal
orchestrator.run()

# config/mcp_config.yaml
server:
  name: "self-healing-infra-monitor"
  version: "1.0.0"
  transport: "stdio"

resources:
  metrics:
    enabled: true
    providers:
      - prometheus
      - cloudwatch
  logs:
    enabled: true
    retention_days: 30

tools:
  diagnostics:
    timeout: 30
  remediation:
    dry_run: false
    require_approval: true

# config/adk_config.yaml
agents:
  diagnostic:
    model: "gemini-2.0-flash-exp"
    temperature: 0.3
    max_tokens: 2000
  remediation:
    model: "gemini-2.0-flash-exp"
    temperature: 0.1
    max_tokens: 1500
  analysis:
    model: "gemini-2.0-flash-exp"
    temperature: 0.5
    max_tokens: 3000

api:
  key: "your-adk-api-key"
  timeout: 60

# Run all tests
pytest
# Run with coverage
pytest --cov=src tests/
# Run specific test suite
pytest tests/test_mcp_server.py
# Run integration tests
pytest tests/integration/

- Architecture Guide
- Getting Started
- MCP Protocol Details
- ADK Agent Guide
- Deployment Guide
- API Reference
- ✅ AWS (EC2, ECS, Lambda, CloudWatch)
- ✅ Kubernetes (Pods, Deployments, Services)
- ✅ Prometheus (Metrics & Alerting)
- 🚧 Google Cloud Platform
- 🚧 Azure
- 🚧 Datadog
- 🚧 PagerDuty
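The integrations above presumably share a common interface so that new providers can be plugged in. The sketch below shows one way such a contract could look. `Integration`, `HealthResult`, `StaticIntegration`, and `collect_unhealthy` are hypothetical names chosen for illustration, not the project's actual plugin API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class HealthResult:
    resource: str
    healthy: bool
    detail: str = ""


class Integration(Protocol):
    """Anything with a name and a check() method can act as an integration."""
    name: str

    def check(self) -> list[HealthResult]: ...


class StaticIntegration:
    """Toy integration that reports canned results; a real one would query an API."""

    def __init__(self, name: str, results: list[HealthResult]) -> None:
        self.name = name
        self._results = list(results)

    def check(self) -> list[HealthResult]:
        return list(self._results)


def collect_unhealthy(integrations: list[Integration]) -> dict[str, list[HealthResult]]:
    """Poll every integration once and group the failing results by integration name."""
    report: dict[str, list[HealthResult]] = {}
    for integration in integrations:
        failing = [r for r in integration.check() if not r.healthy]
        if failing:
            report[integration.name] = failing
    return report
```

Because `Integration` is a structural `Protocol`, a provider class only has to expose `name` and `check()` and does not need to inherit from a base class.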
Contributions are welcome! Please read our Contributing Guide for details.
# Fork the repo
# Create feature branch
git checkout -b feature/amazing-feature
# Commit changes
git commit -m 'Add amazing feature'
# Push to branch
git push origin feature/amazing-feature
# Open Pull Request

This project is licensed under the MIT License. See the LICENSE file for details.
- Model Context Protocol by Anthropic
- Google Agent Development Kit
- Built with inspiration from modern SRE practices
- Author: Rahul Bakshee
- LinkedIn: https://linkedin.com/in/rahulbakshee
- Core MCP server implementation
- ADK agent integration
- AWS integration
- Kubernetes integration
- GCP integration
- ML-based anomaly detection
- Predictive maintenance
- Cost optimization recommendations
- Slack/Teams notifications
- Web dashboard UI
⭐ Star this repo if you find it useful!