An open-source platform for building autonomous DevOps and SRE agents. Built with Google ADK and managed as a uv workspace.
Agents can monitor infrastructure, diagnose issues, and take action — with built-in safety guardrails that require human confirmation before any destructive operation. Interact via the ADK web UI, terminal, or directly from Slack.
- Multi-agent orchestration: a root agent delegates to specialist agents via `AgentTool` (LLM-routed) and deterministic sub-agent workflows (ADR-002)
- Structured workflows: `SequentialAgent` and `ParallelAgent` for deterministic multi-step pipelines (e.g., incident triage checks Kafka, K8s, Docker, and observability in parallel, then summarizes)
- Slack integration: chat with the agent from Slack, with interactive Approve/Deny buttons for guarded operations
- Role-based access control: three-role hierarchy (viewer/operator/admin) inferred from guardrail decorators; enforced via an `authorize()` callback (ADR-001)
- Input validation: all tool inputs validated at the boundary with reusable validators (string length, integer range, URL scheme, path traversal, regex patterns)
- Safety guardrails: destructive tools (`@destructive`) require explicit confirmation; mutating tools (`@confirm`) prompt before executing; confirmations are args-hashed and time-limited
- Structured logging: JSON-formatted logs to stdout, ready for Loki/ELK/Cloud Logging; every tool call is audited with timestamp, agent, arguments, and result
- Persistent sessions: SQLite-backed session state, user-scoped notes, and app-wide shared data that survive restarts
- Multi-provider LLM support: switch between Gemini, Claude, OpenAI, Ollama, or any LiteLLM-supported provider via environment variables
- Prometheus metrics: tool call counts, latency histograms, error rates, circuit breaker state, and LLM token tracking exposed via `/metrics` for Prometheus scraping
- Resilience: circuit breaker and retry with exponential backoff for transient failures
- Composable architecture: each agent is a standalone package that can run independently or plug into an orchestrator
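
To make "args-hashed and time-limited" confirmations concrete, here is a minimal standard-library sketch of the idea. It is not this project's implementation: `args_hash`, `PendingConfirmations`, the 120-second window, and the single-use behavior are all assumptions made for illustration.

```python
import hashlib
import json
import time

CONFIRMATION_TTL_SECONDS = 120  # assumed window; the project's real value may differ


def args_hash(tool_name: str, args: dict) -> str:
    """Stable digest of the tool name plus its arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PendingConfirmations:
    """In-memory store of approvals keyed by args hash (hypothetical helper)."""

    def __init__(self, ttl: float = CONFIRMATION_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._approved: dict[str, float] = {}

    def approve(self, tool_name: str, args: dict) -> None:
        # Record when this exact call signature was approved.
        self._approved[args_hash(tool_name, args)] = self._clock()

    def is_confirmed(self, tool_name: str, args: dict) -> bool:
        # Approvals are single-use here: pop the entry, then check its age.
        approved_at = self._approved.pop(args_hash(tool_name, args), None)
        if approved_at is None:
            return False
        return (self._clock() - approved_at) <= self._ttl
```

Because the hash covers the arguments, approving `delete_topic` for one topic does not authorize the same tool against a different topic, and an approval that is never used simply expires.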
graph TB
subgraph Frontends
WEB[ADK Web UI / CLI]
SLACK[Slack Bot]
end
subgraph "devops-assistant (orchestrator)"
ROOT[Root Agent]
TRIAGE[Incident Triage]
ROOT -.->|AgentTool| KAFKA[Kafka Agent]
ROOT -.->|AgentTool| K8S[K8s Agent]
ROOT -.->|AgentTool| OBS[Observability Agent]
ROOT -.->|AgentTool| DOCKER[Docker Agent]
ROOT -.->|AgentTool| JOURNAL[Ops Journal]
ROOT -->|sub-agent| TRIAGE
TRIAGE -->|parallel| KAFKA
TRIAGE -->|parallel| K8S
TRIAGE -->|parallel| DOCKER
TRIAGE -->|parallel| OBS
TRIAGE -->|then| SUMMARIZE[Triage Summary]
TRIAGE -->|then| SAVE[Save to Journal]
end
subgraph Safety
RBAC[RBAC · authorize]
GUARD[Guardrails · confirm / destructive]
AUDIT[Audit Logger]
METRICS[Prometheus Metrics]
end
subgraph Infrastructure
KF[Kafka]
KU[Kubernetes]
PR[Prometheus]
LO[Loki]
AM[Alertmanager]
DK[Docker]
end
WEB --> ROOT
SLACK --> ROOT
KAFKA --> KF
K8S --> KU
OBS --> PR
OBS --> LO
OBS --> AM
DOCKER --> DK
ROOT -.-> RBAC
ROOT -.-> GUARD
ROOT -.-> AUDIT
ROOT -.-> METRICS
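
The triage path in the diagram (four checks in parallel, then a summary, then a journal entry) can be sketched in plain Python. This sketch deliberately does not use ADK: `asyncio.gather` stands in for the parallel stage and ordinary sequential code stands in for the "then" steps. The check functions and their return values are placeholders, not the project's tools.

```python
import asyncio


async def check_kafka() -> dict:
    # Placeholder for a real Kafka health check.
    return {"source": "kafka", "healthy": True}


async def check_k8s() -> dict:
    # Placeholder result that simulates one failing workload.
    return {"source": "k8s", "healthy": False}


async def check_docker() -> dict:
    return {"source": "docker", "healthy": True}


async def check_observability() -> dict:
    return {"source": "observability", "healthy": True}


def summarize(findings: list) -> str:
    """Collapse per-system findings into a one-line triage summary."""
    unhealthy = [f["source"] for f in findings if not f["healthy"]]
    if not unhealthy:
        return "All systems healthy"
    return "Issues found in: " + ", ".join(unhealthy)


async def run_triage(journal: list) -> str:
    # Parallel stage: all four checks run concurrently; results keep call order.
    findings = await asyncio.gather(
        check_kafka(), check_k8s(), check_docker(), check_observability()
    )
    # Sequential stages: summarize first, then save the summary to the journal.
    summary = summarize(list(findings))
    journal.append(summary)
    return summary
```

Calling `asyncio.run(run_triage(journal))` with an empty list runs the whole pipeline once and leaves the summary as the list's only entry.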
| Package | Type | Description |
|---|---|---|
| core | Library | Agent factory, RBAC, guardrails, input validation, error handlers, structured logging, audit trail, activity tracking, Prometheus metrics, persistent runner, typed config |
| kafka-health-agent | Single agent | Kafka cluster health, topics, consumer groups, lag |
| k8s-health-agent | Single agent | Kubernetes cluster health, nodes, pods, deployments, logs, events |
| observability-agent | Single agent | Prometheus metrics/alerts, Loki log queries, Alertmanager silence management |
| devops-assistant | Multi-agent | Orchestrator using AgentTool for specialist agents and sub-agents for deterministic workflows |
| ops-journal | Memory/state | Notes, preferences, and session tracking with persistent storage |
| slack-bot | Integration | Slack bot with thread-based sessions and interactive confirmation buttons |
The only prerequisites are Docker and an API key from any supported provider.
# With Google AI Studio (default)
GOOGLE_API_KEY=your-key docker compose --profile demo up -d
# With Anthropic Claude
MODEL_PROVIDER=anthropic MODEL_NAME=anthropic/claude-sonnet-4-20250514 \
ANTHROPIC_API_KEY=sk-ant-... docker compose --profile demo up -d
# With OpenAI
MODEL_PROVIDER=openai MODEL_NAME=openai/gpt-4o \
OPENAI_API_KEY=sk-... docker compose --profile demo up -d
# Open the web UI
open http://localhost:8000

This starts Kafka, Zookeeper, Kafka UI, Prometheus, Loki, Alertmanager, and the devops-assistant agent with a chat interface. See the configuration reference for all supported providers and API key setup.
make install # install all workspace packages
make infra-up # start Kafka, Zookeeper, Prometheus, Loki, Alertmanager
make run-devops # launch the devops-assistant in ADK Dev UI

Run `make help` to see all available commands.
- Docker only for the quick start above
- For local development: uv, Docker, and a Google AI Studio API key or Vertex AI project
Chat with the agent directly from Slack — each thread is a separate conversation, with interactive Approve/Deny buttons for guarded operations.
→ Full setup guide (app manifest, env vars, run commands)
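
One plausible way to get "each thread is a separate conversation" is to key sessions on the thread's root timestamp. The sketch below assumes a Slack message event dict with the standard `channel`, `ts`, and optional `thread_ts` fields; the function name and the key format are hypothetical and may not match what the slack-bot package does.

```python
def session_id_for(event: dict) -> str:
    """Derive a per-thread session key from a Slack message event."""
    # Replies carry thread_ts (the parent message's ts). A top-level message
    # has no thread_ts, so its own ts becomes the root of a new thread.
    thread_root = event.get("thread_ts") or event["ts"]
    return f"{event['channel']}:{thread_root}"
```

With this scheme a parent message and all of its replies map to the same session, while a new top-level message in the same channel starts a fresh one.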
Every tool call across all agents is instrumented with Prometheus metrics — latency histograms, error rates, invocation counts, and circuit breaker state. The Slack bot exposes a /metrics endpoint on port 9100 for Prometheus scraping.
→ Metrics reference (available metrics, PromQL examples, integration guide)
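
As a rough picture of what per-tool instrumentation involves, the following sketch wraps a tool function and records invocation counts, error counts, and latencies. It uses plain in-memory dictionaries as stand-ins for `prometheus_client` `Counter` and `Histogram` objects so that it runs without dependencies; the decorator name, the metric stores, and the `list_topics` tool are illustrative assumptions, not the project's code.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-ins for Prometheus counters and a latency histogram (assumption).
CALLS = defaultdict(int)
ERRORS = defaultdict(int)
LATENCIES = defaultdict(list)


def instrumented(tool_name: str):
    """Record call count, error count, and latency for the wrapped tool."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            CALLS[tool_name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[tool_name] += 1
                raise  # instrumentation must not swallow the failure
            finally:
                # Latency is recorded for successes and failures alike.
                LATENCIES[tool_name].append(time.perf_counter() - start)

        return wrapper

    return decorator


@instrumented("list_topics")
def list_topics(fail: bool = False) -> list:
    # Hypothetical tool body used only to exercise the decorator.
    if fail:
        raise RuntimeError("broker unreachable")
    return ["orders", "payments"]
```

Keeping the measurement in a decorator means individual tools stay free of metrics code, which is one common way to instrument every call uniformly.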
Each agent loads typed settings from .env files via Pydantic. Shared variables (GCP project, model version) plus per-agent settings (broker addresses, API tokens, etc.) are documented in the configuration reference.
→ Configuration reference (env vars, infrastructure ports, Docker Compose profiles)
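
The shape of "typed settings loaded from the environment" can be illustrated without Pydantic. The sketch below swaps in a standard-library dataclass as a stand-in, so it shows the pattern rather than the project's actual settings classes. `MODEL_NAME` appears in the quick start above; `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_REQUEST_TIMEOUT_S`, and their defaults are hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaAgentSettings:
    """Hypothetical per-agent settings: one shared variable, two agent-specific."""

    model_name: str
    bootstrap_servers: str
    request_timeout_s: int

    @classmethod
    def from_env(cls, env=None) -> "KafkaAgentSettings":
        env = os.environ if env is None else env
        return cls(
            # Required shared variable: a missing key raises KeyError at startup.
            model_name=env["MODEL_NAME"],
            # Optional per-agent variables with assumed defaults.
            bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
            # int() raises ValueError for a non-numeric value.
            request_timeout_s=int(env.get("KAFKA_REQUEST_TIMEOUT_S", "10")),
        )
```

Failing at load time on a missing or malformed value is the main benefit of typed configuration: a bad `.env` file surfaces immediately rather than in the middle of a tool call.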
Run the full suite (395 tests):
make test

Run tests for a single package:

uv run pytest agents/kafka-health/tests/ -v

All external dependencies (Kafka, Kubernetes, Docker, Slack) are mocked, so no running infrastructure is needed.
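
For readers new to this style of testing, here is a small sketch of a mocked test using only `unittest.mock`. The tool function `total_consumer_lag` and the client method `get_partition_lag` are hypothetical and exist only for the example; the real tests in this repository may be organized differently.

```python
from unittest.mock import MagicMock


def total_consumer_lag(client, group: str) -> int:
    """Hypothetical tool: sum the per-partition lag reported by a Kafka client wrapper."""
    return sum(client.get_partition_lag(group).values())


def test_total_consumer_lag():
    # The mock replaces the Kafka client, so no broker is contacted.
    client = MagicMock()
    client.get_partition_lag.return_value = {0: 10, 1: 5}

    assert total_consumer_lag(client, "billing") == 15
    client.get_partition_lag.assert_called_once_with("billing")
```

Passing the client in as an argument keeps the tool easy to test: the test controls exactly what the "cluster" reports and can also assert how the client was called.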
→ Step-by-step guide with boilerplate, RBAC setup, and testing tips.
Contributions are welcome! See CONTRIBUTING.md for guidelines on adding new agents, improving existing ones, and submitting pull requests.
This project is licensed under the MIT License.
