MASArena 🏟️

Layered Architecture • Stack • Swap • Built for Scale

🌟 Core Features

🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
🔧 Tool Support: Manage tool selection via pluggable wrappers.
🧩 Easy Extensions: Add agents via subclassing—no core changes.
📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.
🔍 Failure Attribution: Identify failure causes and responsible agents.

🎬 Demo

See MASArena in action! This demo showcases the framework's visualization capabilities:

visualization.mp4

🚀 Quick Start

1. Setup

We recommend using uv for dependency and virtual environment management.

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

2. Configure Environment Variables

Create a .env file in the project root and set the following:

OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1

3. Running Benchmarks

./run_benchmark.sh

Supported benchmarks:
- Math: math, aime
- Code: humaneval, mbpp
- Reasoning: drop, bbh, mmlu_pro, ifeval
Supported agent systems:
- Single Agent: single_agent
- Multi-Agent: supervisor_mas, swarm, agentverse, chateval, evoagent, jarvis, metagpt

🧪 Testing

MASArena includes a comprehensive test suite to ensure code reliability and facilitate development.

Running Tests

# Run all tests
pytest

# Run tests with coverage
pytest --cov=mas_arena --cov-report=html

# Run only unit tests
pytest -m "unit"

# Run only integration tests
pytest -m "integration"

# Run tests excluding slow tests
pytest -m "not slow"

# Run specific test file
pytest tests/test_agents.py

# Run with verbose output
pytest -v

Test Structure

tests/test_agents.py - Tests for agent systems
tests/test_evaluators.py - Tests for evaluation components
tests/test_tools.py - Tests for tool management
tests/test_benchmark_runner.py - Tests for benchmark execution
tests/test_integration.py - End-to-end integration tests
tests/conftest.py - Shared fixtures and configuration

Test Categories

Unit Tests: Fast, isolated tests for individual components
Integration Tests: Tests for component interactions
Slow Tests: Long-running tests (marked with @pytest.mark.slow)

📚 Documentation

For comprehensive guides, tutorials, and API references, visit our complete documentation.

✅ TODOs

Add asynchronous support for model calls
Implement failure detection in MAS workflows
Add more benchmarks emphasizing tool usage
Improve configuration for MAS and tool integration
Integrate multiple tools(e.g., Browser, Video, Audio, Docker) into the current evaluation framework
Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
Implement more benchmark evaluations(e.g., webArena, SweBench) that requires tool usage
Reimplementation of the Dynamic Architecture Paper Based on the Benchmark Framework

🙌 Contributing

We warmly welcome contributions from the community!

You can contribute in many ways:

🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.
📊 New Benchmark Datasets: Bring in domain-specific or task-specific datasets (e.g., reasoning, planning, tool-use, collaboration) to broaden the scope of evaluation.
🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool selection strategies.
⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.

Name		Name	Last commit message	Last commit date
Latest commit History 304 Commits
.github/workflows		.github/workflows
data		data
docs		docs
mas_arena		mas_arena
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
main.py		main.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
run_benchmark.sh		run_benchmark.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MASArena 🏟️

🌟 Core Features

🎬 Demo

🚀 Quick Start

1. Setup

2. Configure Environment Variables

3. Running Benchmarks

🧪 Testing

Running Tests

Test Structure

Test Categories

📚 Documentation

✅ TODOs

🙌 Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MASArena 🏟️

🌟 Core Features

🎬 Demo

🚀 Quick Start

1. Setup

2. Configure Environment Variables

3. Running Benchmarks

🧪 Testing

Running Tests

Test Structure

Test Categories

📚 Documentation

✅ TODOs

🙌 Contributing

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages