Layered Architecture • Stack • Swap • Built for Scale
- 🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
- 📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
- 📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
- 🔧 Tool Support: Manage tool selection via pluggable wrappers.
- 🧩 Easy Extensions: Add agents via subclassing, with no core changes (see the sketch after this list).
- 📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.
- 🔍 Failure Attribution: Identify failure causes and responsible agents.
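As an illustration of the subclassing extension point, here is a minimal sketch of what a custom agent system might look like. The base class, method names, and registration approach below are assumptions made for illustration, not MASArena's actual API; consult the documentation for the real extension interface.

```python
# Hypothetical sketch only: the base class and method names are assumptions,
# not MASArena's actual API. See the docs for the real extension interface.
from abc import ABC, abstractmethod


class AgentSystem(ABC):
    """Stand-in for the framework's agent-system base class (assumed name)."""

    @abstractmethod
    def run(self, problem: str) -> str:
        """Solve one benchmark problem and return the final answer."""


class EchoAgentSystem(AgentSystem):
    """A trivial custom agent added purely by subclassing, with no core changes."""

    def run(self, problem: str) -> str:
        # A real agent system would call an LLM and tools here; a multi-agent
        # system would additionally coordinate several such calls.
        return f"answer to: {problem}"


if __name__ == "__main__":
    print(EchoAgentSystem().run("2 + 2 = ?"))
```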
See MASArena in action! This demo showcases the framework's visualization capabilities:
visualization.mp4
We recommend using uv for dependency and virtual environment management.
```bash
# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```

Create a `.env` file in the project root and set the following:
```bash
OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1
```

Run a benchmark:

```bash
./run_benchmark.sh
```

- Supported benchmarks:
  - Math: `math`, `aime`
  - Code: `humaneval`, `mbpp`
  - Reasoning: `drop`, `bbh`, `mmlu_pro`, `ifeval`
- Supported agent systems:
  - Single Agent: `single_agent`
  - Multi-Agent: `supervisor_mas`, `swarm`, `agentverse`, `chateval`, `evoagent`, `jarvis`, `metagpt`
MASArena includes a comprehensive test suite to ensure code reliability and facilitate development.
```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=mas_arena --cov-report=html

# Run only unit tests
pytest -m "unit"

# Run only integration tests
pytest -m "integration"

# Run tests excluding slow tests
pytest -m "not slow"

# Run a specific test file
pytest tests/test_agents.py

# Run with verbose output
pytest -v
```

Test files:

- `tests/test_agents.py` - Tests for agent systems
- `tests/test_evaluators.py` - Tests for evaluation components
- `tests/test_tools.py` - Tests for tool management
- `tests/test_benchmark_runner.py` - Tests for benchmark execution
- `tests/test_integration.py` - End-to-end integration tests
- `tests/conftest.py` - Shared fixtures and configuration
- Unit Tests: Fast, isolated tests for individual components
- Integration Tests: Tests for component interactions
- Slow Tests: Long-running tests (marked with `@pytest.mark.slow`)
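For illustration, a slow integration test might look like the following; this is an example written for this guide, not a file from the repository, and the helper function is hypothetical.

```python
# Illustrative example only, not an actual test from the repository.
import pytest


def run_small_benchmark() -> str:
    # Hypothetical stand-in for a real end-to-end benchmark run.
    return "ok"


@pytest.mark.integration
@pytest.mark.slow
def test_full_benchmark_round_trip():
    # Selected by `pytest -m "integration"` and skipped by `pytest -m "not slow"`.
    assert run_small_benchmark() == "ok"
```

Custom markers like `slow`, `unit`, and `integration` are typically registered under `[tool.pytest.ini_options]` in `pyproject.toml` (or in `pytest.ini`) so pytest does not warn about unknown marks.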
For comprehensive guides, tutorials, and API references, visit our complete documentation.
- Add asynchronous support for model calls
- Implement failure detection in MAS workflows
- Add more benchmarks emphasizing tool usage
- Improve configuration for MAS and tool integration
- Integrate multiple tools (e.g., browser, video, audio, Docker) into the current evaluation framework
- Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
- Implement more benchmark evaluations that require tool usage (e.g., WebArena, SWE-bench)
- Reimplement the dynamic architecture paper on top of the benchmark framework
We warmly welcome contributions from the community!
You can contribute in many ways:
- 🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.
- 📊 New Benchmark Datasets: Bring in domain-specific or task-specific datasets (e.g., reasoning, planning, tool use, collaboration) to broaden the scope of evaluation.
- 🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool selection strategies.
- ⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.