Commit a928e57

feat(docker): docker release
1 parent 7bf29af commit a928e57

6 files changed, with 129 additions and 79 deletions.

.env_template

Lines changed: 8 additions & 0 deletions
@@ -15,5 +15,13 @@ ABSTRACT_MODEL=qwen25_72b_int4_instruct
 ABSTRACT_API_KEY=
 ABSTRACT_BASE_URL=
 
+# Proxy Configuration (optional)
+http_proxy=
+https_proxy=
+no_proxy=127.0.0.1,localhost
+HTTP_PROXY=
+HTTPS_PROXY=
+NO_PROXY=127.0.0.1,localhost
+
 # lark report (optional)
 LARK_WEBHOOK_URL=
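
Reviewer note: the new proxy block ships with empty `http_proxy`/`https_proxy` values and a `no_proxy` list covering loopback hosts. The sketch below illustrates how such values are commonly interpreted, using Python's standard `urllib.request.proxy_bypass_environment`. It is not taken from the repository: the `.env` parsing helper, the sample proxy URL, and the variable names are assumptions made for illustration.

```python
import urllib.request

# Hypothetical stand-in for the proxy section of .env_template. The proxy URL
# is a made-up example; the template itself leaves these values empty.
ENV_TEXT = """\
# Proxy Configuration (optional)
http_proxy=http://proxy.example:3128
https_proxy=
no_proxy=127.0.0.1,localhost
"""


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def proxies_from_env(values):
    """Build a urllib-style proxies dict such as {'http': ..., 'no': ...}.

    Empty values are dropped, so an unset proxy means a direct connection.
    """
    proxies = {}
    for key, value in values.items():
        lowered = key.lower()
        if lowered.endswith("_proxy") and value:
            proxies[lowered[: -len("_proxy")]] = value
    return proxies


env = parse_env(ENV_TEXT)
proxies = proxies_from_env(env)

# Hosts listed in no_proxy bypass the proxy; other hosts do not.
local_bypassed = bool(urllib.request.proxy_bypass_environment("localhost", proxies))
remote_bypassed = bool(urllib.request.proxy_bypass_environment("example.com", proxies))
```

Passing the `proxies` dict explicitly keeps the check independent of whatever proxy variables happen to be set in the calling shell.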

README.md

Lines changed: 62 additions & 27 deletions
@@ -18,6 +18,7 @@
 🌐 <a href="https://icip-cas.github.io/LiveMCPBench" target="_blank">Website</a> &nbsp; | &nbsp;
 📄 <a href="https://arxiv.org/abs/2508.01780" target="_blank">Paper</a> &nbsp; | &nbsp;
 🤗 <a href="https://huggingface.co/datasets/ICIP/LiveMCPBench" target="_blank">Dataset</a> &nbsp; | &nbsp;
+🐳 <a href="https://hub.docker.com/r/hysdhlx/livemcpbench" target="_blank">Docker</a> &nbsp; | &nbsp;
 🏆 <a href="https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing" target="_blank">Leaderboard</a>
 &nbsp; | &nbsp;
 🙏 <a href="#citation" target="_blank">Citation</a>
@@ -26,27 +27,36 @@
 
 ![Overview](media/LiveMCPBench.png)
 ## News
+* [8/18/2025] We release [Docker images](https://hub.docker.com/r/hysdhlx/livemcpbench) and add evaluation results to the [leaderboard](https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing) for three new models: GLM 4.5, GPT-5-Mini, and Kimi-K2.
 * [8/3/2025] We release the LiveMCPBench.
 ## Getting Started
 
 ### Prerequisites
-We will release our docker image soon, but if you want to run the code locally, you will need to install the following tools:
+We recommend using our docker image, but if you want to run the code locally, you will need to install the following tools:
 * npm
 * uv
 ### Installation
-1. sync python env
+1. Pull the docker image
 
 ```bash
-uv sync
+docker pull hysdhlx/livemcpbench:latest
 ```
-2. check the MCP tools
+2. Clone the repo and run the docker image
 
 ```bash
-bash ./tools/scripts/tool_check.sh
+git clone https://github.com/icip-cas/LiveMCPBench.git
+cd LiveMCPBench
+
+docker run -itd \
+-v "$(pwd):/outside" \
+--gpus all \
+--ipc=host \
+--net=host \
+--name LiveMCPBench_container \
+hysdhlx/livemcpbench:latest \
+bash
 ```
-After running this command, you can check ./tools/test/tools.json to see the tools.
-
-3. prepare the .env file
+3. Prepare the .env file
 
 ```bash
 cp .env_template .env
@@ -70,46 +80,73 @@ We will release our docker image soon, but if you want to run the code locally,
 ABSTRACT_API_KEY=
 ABSTRACT_BASE_URL=
 
+# Proxy Configuration (optional)
+http_proxy=
+https_proxy=
+no_proxy=127.0.0.1,localhost
+HTTP_PROXY=
+HTTPS_PROXY=
+NO_PROXY=127.0.0.1,localhost
+
 # lark report (optional)
 LARK_WEBHOOK_URL=
 ```
+4. Enter the container and reset the environment
+
+As we have mounted the code repo to `/outside`, you can access the code repo in the container at `/outside/`.
+
+
+```bash
+docker exec -it LiveMCPBench_container bash
+```
+Because the agent may change the environment, we recommend resetting the environment before running the agent. To reset the environment, you can run the following command:
+
+```bash
+cd /LiveMCPBench/
+bash scripts/env_reset.sh
+```
+This will copy the repo code in `/outside` to `/LiveMCPBench` and link `annotated_data` to `/root/`.
+5. Check the MCP tools
+
+```bash
+bash ./tools/scripts/tool_check.sh
+```
+After running this command, you can check `./tools/test/tools.json` to see the tools.
+> You can run this script multiple times if you find that some tools are not working.
+
+6. Index the servers
+
+The MCP Copilot Agent requires the servers to be indexed before it runs. You can run the following command to warm up the agent:
+
+```bash
+uv run -m baseline.mcp_copilot.arg_generation
+```
 
 ## Quick Start
 ### MCP Copilot Agent
 #### Example Run
-You can run the MCP Copilot Agent with the following command:
 
 ```bash
 bash ./baseline/scripts/run_example.sh
 ```
 This will run the agent with a simple example and save the results in `./baseline/output/`.
 
 #### Full Run
-We default use /root dir to store our benchmark data.
+By default, we use the /root directory to store the data that the agent will access. If you want to run locally, make sure the files are in the right paths.
 
-1. Move the code repo and create a symbolic link
-
-You should mv this code repo to `/LiveMCPBench/`, because we will link `/LiveMCPBench/annotated_data` to `/root/`.
-
-```bash
-bash scripts/link_path.sh
-```
-
-This will create a symbolic link from `/LiveMCPBench/annotated_data/dirs` to `/root/annotated_data`.
-
-2. Run the MCP Copilot Agent
+1. Run the MCP Copilot Agent
 
 Be sure you have set the environment variables in the .env file.
 
 ````bash
 bash ./baseline/scripts/run_baselines.sh
 ````
-3. Check the results
+2. Check the results
 
 After running the agent, you can check the trajectories in `./baseline/output`.
 
 ### Evaluation using the LiveMCPEval
-1. Modify the .env to change evluation models
+1. Modify the `MODEL` in .env to change evaluation models
 
 2. Run the evaluation script
 
@@ -121,14 +158,12 @@ We default use /root dir to store our benchmark data.
 
 After running the evaluation, you can check the results in `./evaluator/output`.
 
-4. Calculate the human agreement
+4. Calculate the success rate
 
 ```bash
-uv run ./evaluator/human_agreement.py
+uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
 ```
 
-This will calculate the human agreement for the evaluation results and save it in `./evaluator/output/human_agreement.json`.
-
 ## Project Structure
 ```
 LiveMCPBench/

evaluator/llm_as_judge_baseline.py

Lines changed: 2 additions & 2 deletions
@@ -128,7 +128,7 @@ def format_tool_descriptions(tool_map, server_name, tool_name):
 
 def get_args():
     parser = argparse.ArgumentParser(description="LLM as Judge Baseline")
-    parser.add_argument("--tools_path", type=str, default="./tools/fillter/tools.json")
+    parser.add_argument("--tools_path", type=str, default="./tools/LiveMCPTool/tools.json")
     parser.add_argument(
         "--trajectory_path",
         type=str,
@@ -210,7 +210,7 @@ def get_args():
         ) or message.get("function_call", [])
         for tool_call in message_tool_calls:
             function = tool_call["function"]
-            if function["name"] == "execute-tool":
+            if function.get("name") == "execute-tool":
                 tool_calls.append(function["arguments"])
                 try:
                     tool_config = json.loads(function["arguments"])
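
Reviewer note: the switch from `function["name"]` to `function.get("name")` matters when a tool call arrives without a `name` field. The following is a minimal sketch of that behavior, assuming payloads shaped like the dicts in the hunk above. The sample payloads, the `route` tool name, and the helper `collect_execute_tool_args` are hypothetical and do not appear in the repository.

```python
import json

# Hypothetical tool-call payloads. Field names follow the diff ("function",
# "name", "arguments"); the values are made up for illustration.
tool_calls_in_message = [
    {"function": {"name": "execute-tool",
                  "arguments": '{"server_name": "demo", "tool_name": "echo"}'}},
    {"function": {"arguments": "{}"}},  # malformed entry: no "name" key
    {"function": {"name": "route", "arguments": "{}"}},
]


def collect_execute_tool_args(message_tool_calls):
    """Return the parsed arguments of every "execute-tool" call."""
    collected = []
    for tool_call in message_tool_calls:
        function = tool_call["function"]
        # .get() returns None for a missing key, so the comparison is simply
        # False. function["name"] would raise KeyError on the second entry.
        if function.get("name") == "execute-tool":
            collected.append(json.loads(function["arguments"]))
    return collected


parsed = collect_execute_tool_args(tool_calls_in_message)
```

With the old subscript form, the malformed second entry would abort the whole loop; with `.get`, it is skipped and only the first entry is collected.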

media/model_pareto_frontier.png

327 KB

scripts/env_reset.sh

Lines changed: 9 additions & 2 deletions
@@ -2,7 +2,14 @@
 SRC_DIR="/outside"
 DST_DIR="/LiveMCPBench"
 
-find "$SRC_DIR" -mindepth 1 -maxdepth 1 | while read -r item; do
+EXCLUDES=(".venv" "logs")
+
+exclude_args=()
+for ex in "${EXCLUDES[@]}"; do
+    exclude_args+=(! -name "$ex")
+done
+
+find "$SRC_DIR" -mindepth 1 -maxdepth 1 "${exclude_args[@]}" | while read -r item; do
     name=$(basename "$item")
     target="$DST_DIR/$name"
 
@@ -13,4 +20,4 @@ find "$SRC_DIR" -mindepth 1 -maxdepth 1 | while read -r item; do
     cp -r "$item" "$target"
 done
 
-echo "LiveMCPBench workspace has been updated from $SRC_DIR."
+echo "LiveMCPBench workspace has been updated from $SRC_DIR (excluding: ${EXCLUDES[*]})."
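
Reviewer note: the `! -name` filters apply only to the top-level entries that `find -mindepth 1 -maxdepth 1` returns, so a nested directory named `logs` is still copied along with its parent. The sketch below shows the same selection logic in Python on temporary directories. It is an illustration, not a port of the script: `sync_top_level` is a hypothetical helper, and because the hunk does not show what the script does with an existing target before `cp -r`, the demo copies into an empty destination.

```python
import shutil
import tempfile
from pathlib import Path

EXCLUDES = {".venv", "logs"}  # same names as the EXCLUDES array in the script


def sync_top_level(src_dir, dst_dir, excludes=EXCLUDES):
    """Copy each top-level entry of src_dir into dst_dir, skipping excluded names.

    Only the entry's own name is compared, mirroring
    `find -mindepth 1 -maxdepth 1 ! -name NAME`.
    Assumes dst_dir does not already contain the copied entries.
    """
    copied = []
    for item in sorted(Path(src_dir).iterdir()):
        if item.name in excludes:
            continue
        target = Path(dst_dir) / item.name
        if item.is_dir():
            shutil.copytree(item, target)
        else:
            shutil.copy2(item, target)
        copied.append(item.name)
    return copied


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / ".venv").mkdir()
    (Path(src) / "logs").mkdir()
    (Path(src) / "baseline" / "logs").mkdir(parents=True)
    (Path(src) / "README.md").write_text("demo\n")

    copied = sync_top_level(src, dst)
    nested_logs_copied = (Path(dst) / "baseline" / "logs").is_dir()
    venv_copied = (Path(dst) / ".venv").exists()
```

Here the top-level `.venv` and `logs` are skipped, while `baseline/logs` survives because it is never matched by name at depth one.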

uv.lock

Lines changed: 48 additions & 48 deletions
Some generated files are not rendered by default.
