<a id="readme-top"></a>

<!-- PROJECT -->
<br />
<div align="center">
  <h3 align="center">LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?</h3>

  <p align="center">
    Benchmarking agents on real-world tasks across a large-scale MCP toolset.
  </p>
</div>
<p align="center">
<a href="https://www.python.org/downloads/release/python-31113/"><img src="https://img.shields.io/badge/python-3.11-blue.svg" alt="Python 3.11"></a>
<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/badge/code%20style-ruff-000000.svg" alt="Code style: ruff"></a>
</p>

<p align="center">
  🌐 <a href="https://icip-cas.github.io/LiveMCPBench" target="_blank">Website</a> |
  <!-- 📄 <a href="" target="_blank">Paper</a> | -->
  🤗 <a href="https://huggingface.co/datasets/hysdhlx/LiveMCPBench" target="_blank">Dataset</a> |
  🏆 <a href="https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing" target="_blank">Leaderboard</a>
  <!-- | -->
  <!-- 🙏 <a href="#citation" target="_blank">Citation</a> -->
</p>

## News
* [8/3/2025] We release LiveMCPBench.

## Getting Started

### Prerequisites
We will release our Docker image soon. To run the code locally in the meantime, you will need the following tools installed:
* npm
* uv

### Installation
1. Sync the Python environment

   ```bash
   uv sync
   ```
2. Check the MCP tools

   ```bash
   bash ./tools/scripts/tool_check.sh
   ```
   After this command finishes, inspect `./tools/test/tools.json` to see the available tools.
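   As a quick sanity check, you can confirm that the generated file parses as JSON. This is only a sketch: the internal layout of `tools.json` is not documented here, so nothing about its structure is assumed, and the demo runs against a throwaway file so it works anywhere. Point `check_json` at `./tools/test/tools.json` after `tool_check.sh` has run.

   ```bash
   # Sketch: confirm a tool listing is valid JSON. After tool_check.sh,
   # point this at ./tools/test/tools.json instead of the demo file.
   check_json() { python3 -m json.tool "$1" > /dev/null && echo "valid JSON: $1"; }

   # Demo with a throwaway file standing in for tools.json:
   echo '[{"name": "example-tool"}]' > /tmp/tools_demo.json
   check_json /tmp/tools_demo.json   # prints "valid JSON: /tmp/tools_demo.json"
   ```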
3. Prepare the `.env` file

   ```bash
   cp .env_template .env
   ```
   Edit `.env` to set your own environment variables:
   ```bash
   # MCP Copilot Agent Configuration
   BASE_URL=
   OPENAI_API_KEY=
   MODEL=

   # Tool Retrieval Configuration
   EMBEDDING_MODEL=
   EMBEDDING_BASE_URL=
   EMBEDDING_API_KEY=
   EMBEDDING_DIMENSIONS=1024
   TOP_SERVERS=5
   TOP_TOOLS=3

   # Abstract API Configuration (optional)
   ABSTRACT_MODEL=
   ABSTRACT_API_KEY=
   ABSTRACT_BASE_URL=

   # Lark report (optional)
   LARK_WEBHOOK_URL=
   ```
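   A minimal sketch for catching empty keys before a run. The key names below come from `.env_template`; trim the list to whichever keys your setup actually needs. The demo writes a throwaway file so the sketch can run anywhere; swap in your real `.env`.

   ```bash
   # Sketch: report keys that are absent or empty in an env file.
   check_env() {
     local env_file="$1" key
     for key in OPENAI_API_KEY MODEL EMBEDDING_API_KEY; do
       # require "KEY=" followed by at least one character
       grep -Eq "^${key}=.+" "$env_file" || echo "missing: $key"
     done
   }

   # Demo with a throwaway file standing in for .env:
   printf 'OPENAI_API_KEY=sk-demo\nMODEL=demo\nEMBEDDING_API_KEY=\n' > /tmp/env_demo
   check_env /tmp/env_demo   # prints "missing: EMBEDDING_API_KEY"
   ```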

## Quick Start
### MCP Copilot Agent
#### Example Run
You can run the MCP Copilot Agent with the following command:

```bash
bash ./baseline/scripts/run_example.sh
```
This runs the agent on a simple example and saves the results in `./baseline/output/`.
#### Full Run
By default, the benchmark data is stored under the `/root` directory.

1. Move the code repo and create a symbolic link

   Move this code repo to `/LiveMCPBench/`, because the link script points `/LiveMCPBench/annotated_data` into `/root/`.

   ```bash
   bash scripts/link_path.sh
   ```

   This will create a symbolic link from `/LiveMCPBench/annotated_data/dirs` to `/root/annotated_data`.
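   If you want to see what the link step amounts to before running it, the sketch below replays it inside a sandbox directory so it is safe to run anywhere (`link_path.sh` remains the supported way; the real script targets `/LiveMCPBench` and `/root`):

   ```bash
   # Sketch of the link step, replayed in a sandbox instead of / and /root.
   sandbox=$(mktemp -d)
   mkdir -p "$sandbox/LiveMCPBench/annotated_data" "$sandbox/root"
   ln -s "$sandbox/LiveMCPBench/annotated_data" "$sandbox/root/annotated_data"
   readlink "$sandbox/root/annotated_data"   # prints the annotated_data path inside the sandbox
   ```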

2. Run the MCP Copilot Agent

   Make sure you have set the environment variables in the `.env` file.

   ```bash
   bash ./baseline/scripts/run_baselines.sh
   ```
3. Check the results

   After running the agent, you can check the trajectories in `./baseline/output`.

### Evaluation with LiveMCPEval
1. Modify `.env` to change the evaluation models

2. Run the evaluation script

   ```bash
   bash ./evaluator/scripts/run_baseline.sh
   ```

3. Check the results

   After running the evaluation, you can check the results in `./evaluator/output`.

4. Calculate the human agreement

   ```bash
   uv run ./evaluator/human_agreement.py
   ```

   This will calculate human agreement for the evaluation results and save it to `./evaluator/output/human_agreement.json`.

## Project Structure
```
LiveMCPBench/
├── annotated_data/      # Tasks and task files
├── baseline/            # MCP Copilot Agent
│   ├── scripts/         # Scripts for running the agent
│   ├── output/          # Output for the agent
│   └── mcp_copilot/     # Source code for the agent
├── evaluator/           # LiveMCPEval
│   ├── scripts/         # Scripts for evaluation
│   └── output/          # Output for evaluation
├── tools/               # LiveMCPTool
│   ├── LiveMCPTool/     # Tool data
│   └── scripts/         # Scripts for the tools
├── scripts/             # Path preparation scripts
├── utils/               # Utility functions
└── .env_template        # Template for environment variables
```
<!-- ## Citation

If you find this project helpful, please use the following to cite it:
```bibtex

``` -->