AI Agents for DevOps & SRE

An open-source platform for building autonomous DevOps and SRE agents. Built with Google ADK and managed as a uv workspace.

Agents can monitor infrastructure, diagnose issues, and take action — with built-in safety guardrails that require human confirmation before any destructive operation. Interact via the ADK web UI, terminal, or directly from Slack.

Key Features

Multi-agent orchestration — a root agent delegates to specialist agents via AgentTool (LLM-routed) and deterministic sub-agent workflows (ADR-002)
Structured workflows — SequentialAgent and ParallelAgent for deterministic multi-step pipelines (e.g., incident triage checks Kafka, K8s, Docker, and observability in parallel, then summarizes)
Slack integration — chat with the agent from Slack, with interactive Approve/Deny buttons for guarded operations
Role-based access control — three-role hierarchy (viewer/operator/admin) inferred from guardrail decorators; enforced via authorize() callback (ADR-001)
Input validation — all tool inputs validated at the boundary with reusable validators (string length, integer range, URL scheme, path traversal, regex patterns)
Safety guardrails — destructive tools (@destructive) require explicit confirmation; mutating tools (@confirm) prompt before executing; confirmations are args-hashed and time-limited
Structured logging — JSON-formatted logs to stdout, ready for Loki/ELK/Cloud Logging; every tool call is audited with timestamp, agent, arguments, and result
Persistent sessions — SQLite-backed session state, user-scoped notes, and app-wide shared data that survive restarts
Multi-provider LLM support — switch between Gemini, Claude, OpenAI, Ollama, or any LiteLLM-supported provider via environment variables
Prometheus metrics — tool call counts, latency histograms, error rates, circuit breaker state, and LLM token tracking exposed via /metrics for Prometheus scraping
Resilience — circuit breaker and retry with exponential backoff for transient failures
Composable architecture — each agent is a standalone package that can run independently or plug into an orchestrator
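
The args-hashed, time-limited confirmations mentioned under Safety guardrails can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the PendingConfirmations class, the args_hash helper, the SHA-256 choice, the in-memory store, and the 300-second window are all assumptions made for the example.

```python
import hashlib
import json
import time

CONFIRMATION_TTL_SECONDS = 300  # assumed window; the real value is not stated here


def args_hash(tool_name: str, args: dict) -> str:
    """Hash the tool name and its arguments in a stable, key-order-independent way."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PendingConfirmations:
    """Hypothetical in-memory store of approvals, keyed by args hash."""

    def __init__(self, ttl: float = CONFIRMATION_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._approved = {}  # args hash -> expiry time

    def approve(self, tool_name: str, args: dict) -> None:
        """Record an approval for exactly this tool call."""
        self._approved[args_hash(tool_name, args)] = self._clock() + self._ttl

    def is_approved(self, tool_name: str, args: dict) -> bool:
        """True only if this exact call was approved and the approval has not expired."""
        key = args_hash(tool_name, args)
        expires_at = self._approved.get(key)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._approved[key]  # drop the stale approval
            return False
        return True
```

Because the hash covers the arguments, approving a call for one topic does not approve the same tool for another, and an approval left unused simply expires.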

Architecture

graph TB
    subgraph Frontends
        WEB[ADK Web UI / CLI]
        SLACK[Slack Bot]
    end

    subgraph "devops-assistant (orchestrator)"
        ROOT[Root Agent]
        TRIAGE[Incident Triage]

        ROOT -.->|AgentTool| KAFKA[Kafka Agent]
        ROOT -.->|AgentTool| K8S[K8s Agent]
        ROOT -.->|AgentTool| OBS[Observability Agent]
        ROOT -.->|AgentTool| DOCKER[Docker Agent]
        ROOT -.->|AgentTool| JOURNAL[Ops Journal]
        ROOT -->|sub-agent| TRIAGE

        TRIAGE -->|parallel| KAFKA
        TRIAGE -->|parallel| K8S
        TRIAGE -->|parallel| DOCKER
        TRIAGE -->|parallel| OBS
        TRIAGE -->|then| SUMMARIZE[Triage Summary]
        TRIAGE -->|then| SAVE[Save to Journal]
    end

    subgraph Safety
        RBAC[RBAC · authorize]
        GUARD[Guardrails · confirm / destructive]
        AUDIT[Audit Logger]
        METRICS[Prometheus Metrics]
    end

    subgraph Infrastructure
        KF[Kafka]
        KU[Kubernetes]
        PR[Prometheus]
        LO[Loki]
        AM[Alertmanager]
        DK[Docker]
    end

    WEB --> ROOT
    SLACK --> ROOT

    KAFKA --> KF
    K8S --> KU
    OBS --> PR
    OBS --> LO
    OBS --> AM
    DOCKER --> DK

    ROOT -.-> RBAC
    ROOT -.-> GUARD
    ROOT -.-> AUDIT
    ROOT -.-> METRICS
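
The triage flow in the diagram (run the specialist checks in parallel, then summarize) has the same shape as the plain asyncio sketch below. It is only a stand-in to show the control flow; the project uses ADK's ParallelAgent and SequentialAgent, and the check functions, their return strings, and run_triage are hypothetical.

```python
import asyncio


async def check_kafka() -> str:
    return "kafka: healthy"  # placeholder result


async def check_k8s() -> str:
    return "k8s: healthy"  # placeholder result


async def check_docker() -> str:
    raise RuntimeError("daemon unreachable")  # simulate one failing check


def summarize(findings) -> str:
    """Sequential step: runs only after every parallel check has finished."""
    return "\n".join(findings)


async def run_triage(checks) -> str:
    # Parallel step: start all checks at once; a failure in one check is
    # captured as a result instead of cancelling the others.
    results = await asyncio.gather(
        *(check() for check in checks), return_exceptions=True
    )
    findings = []
    for check, result in zip(checks, results):
        if isinstance(result, Exception):
            findings.append(f"{check.__name__}: FAILED ({result})")
        else:
            findings.append(result)
    return summarize(findings)


if __name__ == "__main__":
    print(asyncio.run(run_triage([check_kafka, check_k8s, check_docker])))
```

Collecting exceptions as findings keeps a single unreachable system from hiding the state of the others, which is usually what you want during an incident.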

Agents

| Agent | Type | Description |
| --- | --- | --- |
| core | Library | Agent factory, RBAC, guardrails, input validation, error handlers, structured logging, audit trail, activity tracking, Prometheus metrics, persistent runner, typed config |
| kafka-health-agent | Single agent | Kafka cluster health, topics, consumer groups, lag |
| k8s-health-agent | Single agent | Kubernetes cluster health, nodes, pods, deployments, logs, events |
| observability-agent | Single agent | Prometheus metrics/alerts, Loki log queries, Alertmanager silence management |
| devops-assistant | Multi-agent | Orchestrator using AgentTool for specialist agents and sub-agents for deterministic workflows |
| ops-journal | Memory/state | Notes, preferences, and session tracking with persistent storage |
| slack-bot | Integration | Slack bot with thread-based sessions and interactive confirmation buttons |
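
The RBAC and guardrails pieces in core fit together roughly like this. The sketch is illustrative only: the README says roles are inferred from the guardrail decorators and enforced via an authorize() callback, but the exact mapping (undecorated tools need viewer, @confirm needs operator, @destructive needs admin), the guard_level attribute, and the authorize signature shown here are assumptions.

```python
from enum import IntEnum


class Role(IntEnum):
    """Three-role hierarchy; a higher value includes the lower roles."""
    VIEWER = 1
    OPERATOR = 2
    ADMIN = 3


def confirm(func):
    """Mark a mutating tool (stand-in for the project's @confirm decorator)."""
    func.guard_level = "confirm"
    return func


def destructive(func):
    """Mark a destructive tool (stand-in for the project's @destructive decorator)."""
    func.guard_level = "destructive"
    return func


# Assumed mapping from guardrail level to the minimum role required.
REQUIRED_ROLE = {
    None: Role.VIEWER,
    "confirm": Role.OPERATOR,
    "destructive": Role.ADMIN,
}


def required_role(tool) -> Role:
    """Infer the minimum role from the decorator applied to the tool."""
    return REQUIRED_ROLE[getattr(tool, "guard_level", None)]


def authorize(user_role: Role, tool) -> bool:
    """Allow the call only if the user's role is at least the required one."""
    return user_role >= required_role(tool)


# Hypothetical tools used to exercise the sketch.
def list_topics():
    return ["orders"]


@confirm
def restart_deployment(name: str):
    return f"restarted {name}"


@destructive
def delete_topic(name: str):
    return f"deleted {name}"
```

Deriving the role from the decorator means a tool's safety level and its access level cannot drift apart, since both come from one annotation.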

Quick Start

Try it with Docker (no install required)

The only prerequisites are Docker and an API key from any supported provider.

# With Google AI Studio (default)
GOOGLE_API_KEY=your-key docker compose --profile demo up -d

# With Anthropic Claude
MODEL_PROVIDER=anthropic MODEL_NAME=anthropic/claude-sonnet-4-20250514 \
  ANTHROPIC_API_KEY=sk-ant-... docker compose --profile demo up -d

# With OpenAI
MODEL_PROVIDER=openai MODEL_NAME=openai/gpt-4o \
  OPENAI_API_KEY=sk-... docker compose --profile demo up -d

# Open the web UI (open is the macOS command; on Linux, use xdg-open or visit the URL in a browser)
open http://localhost:8000

This starts Kafka, Zookeeper, Kafka UI, Prometheus, Loki, Alertmanager, and the devops-assistant agent with a chat interface. See the configuration reference for all supported providers and API key setup.

Local development

make install      # install all workspace packages
make infra-up     # start Kafka, Zookeeper, Prometheus, Loki, Alertmanager
make run-devops   # launch the devops-assistant in ADK Dev UI

Run make help to see all available commands.

Prerequisites

For the quick start above: Docker and an API key from any supported provider
For local development: uv, Docker, and a Google AI Studio API key or Vertex AI project

Slack Bot

Chat with the agent directly from Slack — each thread is a separate conversation, with interactive Approve/Deny buttons for guarded operations.

→ Full setup guide (app manifest, env vars, run commands)
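
One way to get "each thread is a separate conversation" is to derive the session ID from the Slack channel and the thread's root timestamp. The sketch below assumes that approach; session_id_for and the ID format are hypothetical, while channel, ts, and thread_ts are the standard fields on Slack message events.

```python
def session_id_for(event: dict) -> str:
    """Map a Slack message event to a per-thread session ID.

    Replies carry thread_ts (the timestamp of the thread's root message).
    A top-level message has no thread_ts, so its own ts becomes the root
    of a new thread.
    """
    thread_root = event.get("thread_ts") or event["ts"]
    return f"{event['channel']}:{thread_root}"


# A top-level message and a reply in its thread share one session.
root = {"channel": "C123", "ts": "1700000000.000100"}
reply = {"channel": "C123", "ts": "1700000042.000200", "thread_ts": "1700000000.000100"}
assert session_id_for(root) == session_id_for(reply)
```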

Metrics

Every tool call across all agents is instrumented with Prometheus metrics — latency histograms, error rates, invocation counts, and circuit breaker state. The Slack bot exposes a /metrics endpoint on port 9100 for Prometheus scraping.

→ Metrics reference (available metrics, PromQL examples, integration guide)
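
To make the instrumentation idea concrete, here is a standard-library sketch of a decorator that records invocation counts, error counts, and latencies per tool. It is a stand-in only: the project exports real Prometheus metrics, and the instrumented decorator, the in-memory dictionaries, and the example tools are assumptions made for illustration.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a counter, an error counter, and a latency histogram.
CALLS = defaultdict(int)
ERRORS = defaultdict(int)
LATENCIES = defaultdict(list)


def instrumented(func):
    """Record call count, error count, and latency for every invocation."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        CALLS[func.__name__] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            ERRORS[func.__name__] += 1
            raise  # metrics must never swallow the tool's error
        finally:
            # Runs on success and on failure, so every call has a latency sample.
            LATENCIES[func.__name__].append(time.perf_counter() - start)

    return wrapper


@instrumented
def list_pods(namespace: str):
    return [f"{namespace}/api-0"]  # hypothetical tool


@instrumented
def broken_tool():
    raise RuntimeError("boom")  # hypothetical failing tool
```

Wrapping at the decorator level is what lets every tool be measured the same way without each agent adding its own timing code.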

Configuration

Each agent loads typed settings from .env files via Pydantic. Shared variables (GCP project, model version) plus per-agent settings (broker addresses, API tokens, etc.) are documented in the configuration reference.

→ Configuration reference (env vars, infrastructure ports, Docker Compose profiles)
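
The idea of typed settings loaded from the environment looks roughly like the sketch below. It uses a standard-library dataclass as a stand-in for the project's Pydantic models so that it runs without extra dependencies. MODEL_PROVIDER and MODEL_NAME are the variables used in the quick start; ModelSettings, load_settings, the REQUEST_TIMEOUT_SECONDS variable, and all default values are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelSettings:
    provider: str
    name: Optional[str]  # None means "use the provider's default model"
    request_timeout_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> ModelSettings:
    """Read settings from the environment and fail early on bad values."""
    raw_timeout = env.get("REQUEST_TIMEOUT_SECONDS", "30")  # hypothetical variable
    try:
        timeout = int(raw_timeout)
    except ValueError as exc:
        raise ValueError(
            f"REQUEST_TIMEOUT_SECONDS must be an integer, got {raw_timeout!r}"
        ) from exc
    return ModelSettings(
        provider=env.get("MODEL_PROVIDER", "google"),  # default value is assumed
        name=env.get("MODEL_NAME"),
        request_timeout_seconds=timeout,
    )
```

Validating at load time means a mistyped value stops the agent at startup with a clear message instead of surfacing later inside a tool call.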

Testing

Run the full suite (395 tests):

make test

Run tests for a single package:

uv run pytest agents/kafka-health/tests/ -v

All external dependencies (Kafka, Kubernetes, Docker, Slack) are mocked — no running infrastructure needed.
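
A test in this style might look like the following sketch, which replaces the Kafka client with a unittest.mock.MagicMock. The count_lagging_groups tool and the consumer_group_lags client method are hypothetical names invented for the example; they are not taken from the project's code.

```python
from unittest.mock import MagicMock


def count_lagging_groups(admin_client, threshold: int) -> int:
    """Hypothetical tool: count consumer groups whose lag exceeds the threshold."""
    lags = admin_client.consumer_group_lags()  # hypothetical client method
    return sum(1 for lag in lags.values() if lag > threshold)


def test_count_lagging_groups():
    # The mock stands in for a real Kafka client, so no broker is needed.
    client = MagicMock()
    client.consumer_group_lags.return_value = {"billing": 0, "orders": 1200}

    assert count_lagging_groups(client, threshold=100) == 1
    client.consumer_group_lags.assert_called_once_with()
```

Passing the client in as a parameter is what makes the substitution trivial; pytest collects the test function by its test_ prefix.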

Adding a New Agent

→ Step-by-step guide with boilerplate, RBAC setup, and testing tips.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines on adding new agents, improving existing ones, and submitting pull requests.

License

This project is licensed under the MIT License.
