BAHALLA/devops-agents

AI Agents for DevOps & SRE

An open-source platform for building autonomous DevOps and SRE agents. Built with Google ADK and managed as a uv workspace.

Agents can monitor infrastructure, diagnose issues, and take action — with built-in safety guardrails that require human confirmation before any destructive operation. Interact via the ADK web UI, terminal, or directly from Slack.

Slack Bot Demo

Key Features

  • Multi-agent orchestration — a root agent delegates to specialist agents via AgentTool (LLM-routed) and deterministic sub-agent workflows (ADR-002)
  • Structured workflows: SequentialAgent and ParallelAgent for deterministic multi-step pipelines (e.g., incident triage checks Kafka, K8s, Docker, and observability in parallel, then summarizes)
  • Slack integration — chat with the agent from Slack, with interactive Approve/Deny buttons for guarded operations
  • Role-based access control — three-role hierarchy (viewer/operator/admin) inferred from guardrail decorators; enforced via authorize() callback (ADR-001)
  • Input validation — all tool inputs validated at the boundary with reusable validators (string length, integer range, URL scheme, path traversal, regex patterns)
  • Safety guardrails — destructive tools (@destructive) require explicit confirmation; mutating tools (@confirm) prompt before executing; confirmations are args-hashed and time-limited
  • Structured logging — JSON-formatted logs to stdout, ready for Loki/ELK/Cloud Logging; every tool call is audited with timestamp, agent, arguments, and result
  • Persistent sessions — SQLite-backed session state, user-scoped notes, and app-wide shared data that survive restarts
  • Multi-provider LLM support — switch between Gemini, Claude, OpenAI, Ollama, or any LiteLLM-supported provider via environment variables
  • Prometheus metrics — tool call counts, latency histograms, error rates, circuit breaker state, and LLM token tracking exposed via /metrics for Prometheus scraping
  • Resilience — circuit breaker and retry with exponential backoff for transient failures
  • Composable architecture — each agent is a standalone package that can run independently or plug into an orchestrator
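The "args-hashed and time-limited" confirmations in the guardrails bullet can be pictured with a small standard-library sketch. This is not the project's implementation: `PendingConfirmations`, `args_hash`, and the 300-second lifetime are hypothetical names and values chosen for illustration. The idea is that an approval is bound to one exact tool call and stops being valid after a deadline.

```python
import hashlib
import json
import time

CONFIRM_TTL_SECONDS = 300  # assumed lifetime; the real value is not stated here


def args_hash(tool_name: str, args: dict) -> str:
    """Stable digest of the tool name plus its arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PendingConfirmations:
    """In-memory store of approvals keyed by args hash (hypothetical helper)."""

    def __init__(self, ttl=CONFIRM_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._approved = {}  # digest -> expiry timestamp

    def approve(self, tool_name: str, args: dict) -> None:
        """Record a human approval for exactly this tool call."""
        self._approved[args_hash(tool_name, args)] = self._clock() + self._ttl

    def is_confirmed(self, tool_name: str, args: dict) -> bool:
        """True only if this exact call was approved and has not expired."""
        digest = args_hash(tool_name, args)
        expiry = self._approved.get(digest)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._approved[digest]  # expired approvals are dropped
            return False
        return True
```

Because the digest covers the arguments, approving `delete_topic` for one topic would not authorize the same tool against a different topic, and a stale approval cannot be replayed after the lifetime has passed.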

Architecture

graph TB
    subgraph Frontends
        WEB[ADK Web UI / CLI]
        SLACK[Slack Bot]
    end

    subgraph "devops-assistant (orchestrator)"
        ROOT[Root Agent]
        TRIAGE[Incident Triage]

        ROOT -.->|AgentTool| KAFKA[Kafka Agent]
        ROOT -.->|AgentTool| K8S[K8s Agent]
        ROOT -.->|AgentTool| OBS[Observability Agent]
        ROOT -.->|AgentTool| DOCKER[Docker Agent]
        ROOT -.->|AgentTool| JOURNAL[Ops Journal]
        ROOT -->|sub-agent| TRIAGE

        TRIAGE -->|parallel| KAFKA
        TRIAGE -->|parallel| K8S
        TRIAGE -->|parallel| DOCKER
        TRIAGE -->|parallel| OBS
        TRIAGE -->|then| SUMMARIZE[Triage Summary]
        TRIAGE -->|then| SAVE[Save to Journal]
    end

    subgraph Safety
        RBAC[RBAC · authorize]
        GUARD[Guardrails · confirm / destructive]
        AUDIT[Audit Logger]
        METRICS[Prometheus Metrics]
    end

    subgraph Infrastructure
        KF[Kafka]
        KU[Kubernetes]
        PR[Prometheus]
        LO[Loki]
        AM[Alertmanager]
        DK[Docker]
    end

    WEB --> ROOT
    SLACK --> ROOT

    KAFKA --> KF
    K8S --> KU
    OBS --> PR
    OBS --> LO
    OBS --> AM
    DOCKER --> DK

    ROOT -.-> RBAC
    ROOT -.-> GUARD
    ROOT -.-> AUDIT
    ROOT -.-> METRICS

Agents

| Agent | Type | Description |
| --- | --- | --- |
| core | Library | Agent factory, RBAC, guardrails, input validation, error handlers, structured logging, audit trail, activity tracking, Prometheus metrics, persistent runner, typed config |
| kafka-health-agent | Single agent | Kafka cluster health, topics, consumer groups, lag |
| k8s-health-agent | Single agent | Kubernetes cluster health, nodes, pods, deployments, logs, events |
| observability-agent | Single agent | Prometheus metrics/alerts, Loki log queries, Alertmanager silence management |
| devops-assistant | Multi-agent | Orchestrator using AgentTool for specialist agents and sub-agents for deterministic workflows |
| ops-journal | Memory/state | Notes, preferences, and session tracking with persistent storage |
| slack-bot | Integration | Slack bot with thread-based sessions and interactive confirmation buttons |

Quick Start

Try it with Docker (no install required)

The only prerequisites are Docker and an API key from any supported provider.

# With Google AI Studio (default)
GOOGLE_API_KEY=your-key docker compose --profile demo up -d

# With Anthropic Claude
MODEL_PROVIDER=anthropic MODEL_NAME=anthropic/claude-sonnet-4-20250514 \
  ANTHROPIC_API_KEY=sk-ant-... docker compose --profile demo up -d

# With OpenAI
MODEL_PROVIDER=openai MODEL_NAME=openai/gpt-4o \
  OPENAI_API_KEY=sk-... docker compose --profile demo up -d

# Open the web UI
open http://localhost:8000

This starts Kafka, Zookeeper, Kafka UI, Prometheus, Loki, Alertmanager, and the devops-assistant agent with a chat interface. See the configuration reference for all supported providers and API key setup.

Local development

make install      # install all workspace packages
make infra-up     # start Kafka, Zookeeper, Prometheus, Loki, Alertmanager
make run-devops   # launch the devops-assistant in ADK Dev UI

Run make help to see all available commands.

Prerequisites

  • For the quick start above: Docker and an API key from any supported provider
  • For local development: uv, Docker, and a Google AI Studio API key or Vertex AI project

Slack Bot

Chat with the agent directly from Slack — each thread is a separate conversation, with interactive Approve/Deny buttons for guarded operations.

Full setup guide (app manifest, env vars, run commands)

Metrics

Every tool call across all agents is instrumented with Prometheus metrics — latency histograms, error rates, invocation counts, and circuit breaker state. The Slack bot exposes a /metrics endpoint on port 9100 for Prometheus scraping.
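To illustrate what per-tool instrumentation looks like, here is a rough sketch of a decorator that records invocation counts, error counts, and latency for each call. It uses plain dictionaries from the standard library as stand-ins for Prometheus counters and histograms, so it is not the project's metrics code. The names `instrumented`, `CALLS`, `ERRORS`, `LATENCIES`, and `list_topics` are hypothetical.

```python
import functools
import time
from collections import defaultdict

# Plain dictionaries stand in for Prometheus counters and histograms here.
CALLS = defaultdict(int)       # tool name -> invocation count
ERRORS = defaultdict(int)      # tool name -> failed invocation count
LATENCIES = defaultdict(list)  # tool name -> observed durations in seconds


def instrumented(func):
    """Record count, errors, and latency for every call to a tool."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        CALLS[name] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            ERRORS[name] += 1
            raise  # the caller still sees the original error
        finally:
            # Latency is recorded for successful and failed calls alike.
            LATENCIES[name].append(time.perf_counter() - start)

    return wrapper


@instrumented
def list_topics():
    """Hypothetical tool used only to demonstrate the decorator."""
    return ["orders", "payments"]
```

In a real deployment the dictionaries would be replaced by metric objects from a Prometheus client library, which is what makes the values available on the /metrics endpoint.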

Metrics reference (available metrics, PromQL examples, integration guide)

Configuration

Each agent loads typed settings from .env files via Pydantic. Shared variables (GCP project, model version) plus per-agent settings (broker addresses, API tokens, etc.) are documented in the configuration reference.
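The split between shared and per-agent settings can be sketched as follows. The project uses Pydantic for this; the example below deliberately uses only standard-library dataclasses to show the same idea of typed, validated settings. `MODEL_PROVIDER` and `MODEL_NAME` appear in the quick start, while `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_REQUEST_TIMEOUT_S`, their defaults, and all class and function names are assumptions made up for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedSettings:
    """Variables common to every agent."""
    model_provider: str
    model_name: str


@dataclass(frozen=True)
class KafkaAgentSettings(SharedSettings):
    """Per-agent settings layered on top of the shared ones (hypothetical)."""
    bootstrap_servers: str
    request_timeout_s: int


def _require(env, key: str) -> str:
    """Fetch a mandatory variable or fail with a readable message."""
    value = env.get(key)
    if not value:
        raise ValueError(f"missing required setting: {key}")
    return value


def load_kafka_settings(env=None) -> KafkaAgentSettings:
    """Build typed settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    raw_timeout = env.get("KAFKA_REQUEST_TIMEOUT_S", "10")  # assumed default
    try:
        timeout = int(raw_timeout)
    except ValueError as exc:
        raise ValueError("KAFKA_REQUEST_TIMEOUT_S must be an integer") from exc
    return KafkaAgentSettings(
        model_provider=_require(env, "MODEL_PROVIDER"),
        model_name=_require(env, "MODEL_NAME"),
        bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        request_timeout_s=timeout,
    )
```

Failing at load time on a missing or malformed value means a misconfigured agent stops at startup instead of partway through a tool call.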

Configuration reference (env vars, infrastructure ports, Docker Compose profiles)

Testing

Run the full suite (395 tests):

make test

Run tests for a single package:

uv run pytest agents/kafka-health/tests/ -v

All external dependencies (Kafka, Kubernetes, Docker, Slack) are mocked — no running infrastructure needed.
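As an illustration of testing a tool without running infrastructure, the sketch below replaces a client with `unittest.mock.MagicMock`. The tool `count_lagging_groups` and the client method `consumer_group_lags` are hypothetical and are not taken from this repository.

```python
from unittest.mock import MagicMock


def count_lagging_groups(admin_client, threshold: int) -> int:
    """Hypothetical tool: count consumer groups whose lag exceeds a threshold."""
    lags = admin_client.consumer_group_lags()  # hypothetical client method
    return sum(1 for lag in lags.values() if lag > threshold)


def test_count_lagging_groups():
    # The mock stands in for a real Kafka client, so no broker is needed.
    admin = MagicMock()
    admin.consumer_group_lags.return_value = {"billing": 0, "search": 1200}

    assert count_lagging_groups(admin, threshold=100) == 1
    admin.consumer_group_lags.assert_called_once_with()
```

Passing the client in as an argument is one way to keep tools easy to mock; patching the import site with `unittest.mock.patch` is a common alternative.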

Adding a New Agent

Step-by-step guide with boilerplate, RBAC setup, and testing tips.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines on adding new agents, improving existing ones, and submitting pull requests.

License

This project is licensed under the MIT License.
