*[8/18/2025] We release [Docker images](https://hub.docker.com/r/hysdhlx/livemcpbench) and add evaluation results to the [leaderboard](https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing) for three new models: GLM 4.5, GPT-5-Mini, and Kimi-K2.
*[8/3/2025] We release LiveMCPBench.
## Getting Started
### Prerequisites
We recommend using our Docker image, but if you want to run the code locally, you will need to install the following tools:
3. Prepare the `.env` file
```bash
cp .env_template .env
```

Then edit `.env` and fill in the keys; the template includes (among others):

```bash
ABSTRACT_API_KEY=
ABSTRACT_BASE_URL=

# Proxy Configuration (optional)
http_proxy=
https_proxy=
no_proxy=127.0.0.1,localhost
HTTP_PROXY=
HTTPS_PROXY=
NO_PROXY=127.0.0.1,localhost

# lark report (optional)
LARK_WEBHOOK_URL=
```
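Before launching anything, it can help to confirm that the keys you rely on are actually filled in. Below is a minimal sketch; the key list only covers keys shown above, so extend it to whatever your run requires:

```python
# Sketch: flag keys in a .env-style file that are still empty.
# The required-key list here is illustrative, not exhaustive.
def missing_keys(env_text, required):
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip comments and lines without an assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return [key for key in required if not values.get(key)]

example = "ABSTRACT_API_KEY=sk-demo\nABSTRACT_BASE_URL=\n"
print(missing_keys(example, ["ABSTRACT_API_KEY", "ABSTRACT_BASE_URL"]))
# → ['ABSTRACT_BASE_URL']
```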
4. Enter the container & reset the environment

Since we have mounted the code repo at `/outside`, you can access it inside the container at `/outside/`.

```bash
docker exec -it LiveMCPBench_container bash
```

Because the agent may change the environment, we recommend resetting the environment before running the agent:

```bash
cd /LiveMCPBench/
bash scripts/env_reset.sh
```

This will copy the repo code from `/outside` to `/LiveMCPBench` and link `annotated_data` to `/root/`.
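As a rough illustration of what such a reset involves (a sketch only, not the actual `scripts/env_reset.sh`): copy the mounted repo into place, then re-create the data symlink.

```python
import os
import shutil

# Sketch of an environment reset: copy the mounted repo into place and
# re-link the data directory. Illustration only -- paths are parameters
# so the sketch stays generic (e.g. src="/outside", dst="/LiveMCPBench",
# root="/root" would mirror the layout described above).
def reset_env(src, dst, root, data="annotated_data"):
    shutil.copytree(src, dst, dirs_exist_ok=True)  # copy repo into place
    link = os.path.join(root, data)                # e.g. /root/annotated_data
    if os.path.lexists(link):
        os.remove(link)                            # drop a stale symlink
    os.symlink(os.path.join(dst, data), link)      # re-create the link
```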
5. Check the MCP tools

```bash
bash ./tools/scripts/tool_check.sh
```

After running this command, you can check `./tools/test/tools.json` to see the tools.

> You can run this script multiple times if you find that some tools are not working.
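To eyeball the result, you can summarize the dump programmatically. The schema assumed below (a JSON array of entries carrying a `name` field) is a guess for illustration; check the real `./tools/test/tools.json` and adapt the accessor:

```python
import json

# Sketch: count the tools in a JSON dump and list their names.
# The "name" field is an assumption about the schema.
def summarize_tools(raw_json):
    tools = json.loads(raw_json)
    names = sorted(tool.get("name", "<unnamed>") for tool in tools)
    return len(names), names

raw = '[{"name": "search_web"}, {"name": "read_file"}]'
count, names = summarize_tools(raw)
print(count, names)  # → 2 ['read_file', 'search_web']
```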
6. Index the servers

The MCP Copilot Agent requires the servers to be indexed before running. You can run the following command to warm up the agent:

```bash
uv run -m baseline.mcp_copilot.arg_generation
```
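Conceptually, indexing builds a lookup over server descriptions so that a task query can be routed to a plausible server. A toy sketch of the idea (illustration only; the real index is whatever `baseline.mcp_copilot.arg_generation` produces):

```python
# Toy sketch: keyword index over server descriptions, plus a router that
# picks the server with the largest word overlap with the query.
# Server names and descriptions below are made up for illustration.
def build_index(servers):
    return {name: set(desc.lower().split()) for name, desc in servers.items()}

def route(query, index):
    words = set(query.lower().split())
    return max(index, key=lambda name: len(words & index[name]))

idx = build_index({
    "filesystem": "read and write local files",
    "search": "search the public web",
})
print(route("search the web for news", idx))  # → 'search'
```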
## Quick Start
### MCP Copilot Agent
#### Example Run
```bash
bash ./baseline/scripts/run_example.sh
```

This will run the agent with a simple example and save the results in `./baseline/output/`.
#### Full Run
By default, we use the `/root` directory to store the data that the agent will access. If you want to run locally, make sure the files are in the right paths.
1. Run the MCP Copilot Agent
Be sure you have set the environment variables in the `.env` file.

```bash
bash ./baseline/scripts/run_baselines.sh
```

2. Check the results
After running the agent, you can check the trajectories in `./baseline/output`.
### Evaluation using LiveMCPEval
1. Modify the `MODEL` in `.env` to change the evaluation model
2. Run the evaluation script
After running the evaluation, you can check the results in `./evaluator/output`.
4. Calculate the success rate
```bash
uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
```
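For intuition, the success rate is simply the fraction of tasks the evaluator judged successful. A minimal sketch (the per-task verdict format is an assumption; use the script above for the real numbers):

```python
# Sketch: success rate as the fraction of True verdicts.
# The list-of-booleans input is an assumed, simplified result format.
def success_rate(verdicts):
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(success_rate([True, False, True, True]))  # → 0.75
```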