HEPEx AnalysisOps Benchmark (Green Agent)

AgentBeats Green Agent for evaluating autonomous agents on high-energy physics (HEP) analysis workflows.

Overview

This benchmark evaluates an agent's ability to perform end-to-end physics analyses using ATLAS Open Data. It serves as the Green Agent (assessor) in the AgentBeats ecosystem.

Supported Tasks

| Task | Description |
|------|-------------|
| `zpeak_fit` | Extract the Z boson mass and width from muon pairs |
| `hyy` | Measure the Higgs mass using diphoton events |
| `hmumu` | Search for H→μμ using the VBF topology |
| `hbb` | Identify H→bb in the 0-lepton VH channel |
| `hzz` | Analyze the H→ZZ→4l "golden channel" |
| `ttbar` | Reconstruct the top quark mass |
| `wz3l` | Analyze WZ diboson production in the 3-lepton final state |
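
As an illustration of the kind of quantity these tasks revolve around, the sketch below computes the invariant mass of a muon pair, the observable behind `zpeak_fit`. It is not the benchmark's reference solution: the function name and argument names are assumptions, and the muon mass is neglected (massless approximation), which is standard at these energies.

```python
import math

def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two leptons in the massless approximation.

    Uses m^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)).
    The result has the same units as the transverse momenta.
    """
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m2, 0.0))

# Two back-to-back central muons with pT = 45.6 GeV each:
# the pair mass comes out close to 91.2 GeV, near the Z peak.
example_mass = dimuon_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
print(f"{example_mass:.2f} GeV")
```

A histogram of this quantity over selected events is what a fit for the Z mass and width would typically be run on.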

Quick Start

Docker Image

# Pull from GHCR
docker pull ghcr.io/hrzhao76/hepex-analysisops-benchmark:latest

# Or build locally
docker build -t hepex-green-agent:local .

# Run (listens on port 9009)
docker run -p 9009:9009 ghcr.io/hrzhao76/hepex-analysisops-benchmark:latest

Local Development

# Install dependencies
uv sync

# Run the agent
uv run src/server.py --host 0.0.0.0 --port 9009

Local Reproduction

Test the full benchmark locally with a Purple Agent:

# Set API keys
export GOOGLE_API_KEY="..."

# Run the reproduction script
uv run scripts/reproduce_locally.py --local

This generates a docker-compose.yml and runs both agents in isolated containers. Results are saved to ./output/.

AgentBeats Integration

Agent Card

Name: hepex-green-agent
Port: 9009 (A2A standard)
Protocol: A2A (Agent-to-Agent)

EvalRequest Format

{
  "participants": {
    "white_agent": "http://purple-agent:9009/"
  },
  "config": {
    "task_dirs": ["specs/zpeak_fit"],
    "data_dir": "/home/agent/output"
  }
}
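
The request above can be assembled programmatically. The following is a minimal sketch, assuming only the field names shown in the example; the helper name `build_eval_request` and its validation rules are hypothetical, and sending the request over A2A is not shown.

```python
import json

def build_eval_request(purple_url, task_dirs, data_dir):
    """Build an EvalRequest dict with the fields shown in the README example.

    Hypothetical helper: the checks below are illustrative and are not
    the Green Agent's own validation logic.
    """
    if not purple_url.startswith(("http://", "https://")):
        raise ValueError("participant URL must start with http:// or https://")
    if not task_dirs:
        raise ValueError("at least one task directory is required")
    return {
        # The key name "white_agent" is kept exactly as in the example.
        "participants": {"white_agent": purple_url},
        "config": {"task_dirs": list(task_dirs), "data_dir": data_dir},
    }

request = build_eval_request(
    "http://purple-agent:9009/",
    ["specs/zpeak_fit"],
    "/home/agent/output",
)
print(json.dumps(request, indent=2))
```

Listing several entries in `task_dirs` would presumably queue several tasks in one run, though the example only shows a single task.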

Output Artifacts

Each evaluation run produces:

output/
├── runs/<run_id>/<task_id>/
│   ├── meta.json              # Task metadata
│   ├── submission_trace.json  # Agent response
│   ├── judge_input.json       # Evaluator input
│   └── judge_output.json      # Scored result
└── <release>/<dataset>/<skim>/  # Cached data files
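
For inspecting results after a run, a small loader along these lines may be convenient. It is a sketch that relies only on the directory layout shown above; the function name is hypothetical, and because the schema of each JSON file is not documented here, the files are returned as plain parsed objects.

```python
import json
from pathlib import Path

ARTIFACT_NAMES = (
    "meta.json",
    "submission_trace.json",
    "judge_input.json",
    "judge_output.json",
)

def load_run_artifacts(output_dir):
    """Collect artifacts from <output_dir>/runs/<run_id>/<task_id>/.

    Returns a dict keyed by (run_id, task_id); each value maps an
    artifact file name to its parsed JSON content. Missing files are
    skipped rather than treated as errors.
    """
    results = {}
    runs_dir = Path(output_dir) / "runs"
    if not runs_dir.is_dir():
        return results
    for task_dir in sorted(runs_dir.glob("*/*")):
        if not task_dir.is_dir():
            continue
        found = {}
        for name in ARTIFACT_NAMES:
            path = task_dir / name
            if path.is_file():
                found[name] = json.loads(path.read_text(encoding="utf-8"))
        results[(task_dir.parent.name, task_dir.name)] = found
    return results

if __name__ == "__main__":
    for (run_id, task_id), files in load_run_artifacts("./output").items():
        print(run_id, task_id, sorted(files))
```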

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      AgentBeats Platform                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼ EvalRequest
┌─────────────────────────────────────────────────────────────┐
│                  Green Agent (This Repo)                    │
│  ┌──────────────┐  ┌──────────┐  ┌─────────────────────┐    │
│  │ Task Loader  │→ │ Data Mgr │→ │ Evaluation Engine   │    │
│  └──────────────┘  └──────────┘  └─────────────────────┘    │
│         │                               ▲                   │
│         ▼ A2A                           │ trace             │
│  ┌──────────────────────────────────────┴─────────┐         │
│  │              Purple Agent (External)           │         │
│  └────────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────┘

Reproducibility

Deterministic Scoring: Rule-based checks produce identical scores for identical traces
Artifact Persistence: All inputs and outputs saved as JSON for audit
Isolation: Each task runs in its own directory
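
The determinism property can be illustrated with a toy scorer. This is a sketch of the general idea, not the benchmark's evaluation engine: the trace field `z_mass_gev`, the two checks, and the tolerance are all hypothetical and exist only to show that a pure, rule-based function returns the same score for the same trace.

```python
import hashlib
import json

def trace_fingerprint(trace):
    """Hash a trace via canonical JSON so key order does not matter."""
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def score_trace(trace, checks):
    """Return the fraction of checks passed (0.0 if there are no checks).

    Each check is a pure function of the trace: no randomness, no clock,
    no network access, so repeated calls give identical scores.
    """
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(trace)) / len(checks)

# Hypothetical checks for a Z-peak style task.
checks = [
    lambda t: "z_mass_gev" in t,
    lambda t: abs(t.get("z_mass_gev", 0.0) - 91.19) < 1.0,
]

# Same content, different key order: same fingerprint, same score.
trace_a = {"z_mass_gev": 91.1, "status": "ok"}
trace_b = {"status": "ok", "z_mass_gev": 91.1}
same_fingerprint = trace_fingerprint(trace_a) == trace_fingerprint(trace_b)
same_score = score_trace(trace_a, checks) == score_trace(trace_b, checks)
print(same_fingerprint, same_score)
```

Storing such a fingerprint next to the score is one way an audit could confirm that two scored results came from the same trace.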

Attribution

This benchmark uses ATLAS Open Data released under the CERN Open Data policy.

License

See LICENSE for details.
