*[8/18/2025] We release [Docker images](https://hub.docker.com/r/hysdhlx/livemcpbench) and add evaluation results to the [leaderboard](https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing) for three new models: GLM 4.5, GPT-5-Mini, and Kimi-K2.
*[8/3/2025] We release LiveMCPBench.
## Getting Started
### Prerequisites
We recommend using our Docker image, but if you want to run the code locally, you will need to install the following tools:
3. Prepare the `.env` file
```bash
cp .env_template .env
```

Then edit `.env` and fill in the keys; the template includes (among others):

```bash
ABSTRACT_API_KEY=
ABSTRACT_BASE_URL=

# Proxy Configuration (optional)
http_proxy=
https_proxy=
no_proxy=127.0.0.1,localhost
HTTP_PROXY=
HTTPS_PROXY=
NO_PROXY=127.0.0.1,localhost

# lark report (optional)
LARK_WEBHOOK_URL=
```
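Before launching anything, it can help to confirm that the keys you rely on are actually filled in. Below is a minimal sketch; the key list only covers keys shown above, so extend it to whatever your run requires:

```python
# Sketch: flag keys in a .env-style file that are still empty.
# The required-key list here is illustrative, not exhaustive.
def missing_keys(env_text, required):
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip comments and lines without an assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return [key for key in required if not values.get(key)]

example = "ABSTRACT_API_KEY=sk-demo\nABSTRACT_BASE_URL=\n"
print(missing_keys(example, ["ABSTRACT_API_KEY", "ABSTRACT_BASE_URL"]))
# → ['ABSTRACT_BASE_URL']
```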
4. Enter the container & reset the environment

Since we have mounted the code repo at `/outside`, you can access it inside the container at `/outside/`.

```bash
docker exec -it LiveMCPBench_container bash
```

Because the agent may change the environment, we recommend resetting the environment before running the agent:

```bash
cd /LiveMCPBench/
bash scripts/env_reset.sh
```

This will copy the repo code from `/outside` to `/LiveMCPBench` and link `annotated_data` to `/root/`.
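As a rough illustration of what such a reset involves (a sketch only, not the actual `scripts/env_reset.sh`): copy the mounted repo into place, then re-create the data symlink.

```python
import os
import shutil

# Sketch of an environment reset: copy the mounted repo into place and
# re-link the data directory. Illustration only -- paths are parameters
# so the sketch stays generic (e.g. src="/outside", dst="/LiveMCPBench",
# root="/root" would mirror the layout described above).
def reset_env(src, dst, root, data="annotated_data"):
    shutil.copytree(src, dst, dirs_exist_ok=True)  # copy repo into place
    link = os.path.join(root, data)                # e.g. /root/annotated_data
    if os.path.lexists(link):
        os.remove(link)                            # drop a stale symlink
    os.symlink(os.path.join(dst, data), link)      # re-create the link
```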
5. Check the MCP tools

```bash
bash ./tools/scripts/tool_check.sh
```

After running this command, you can check `./tools/test/tools.json` to see the tools.

> You can run this script multiple times if you find that some tools are not working.
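To eyeball the result, you can summarize the dump programmatically. The schema assumed below (a JSON array of entries carrying a `name` field) is a guess for illustration; check the real `./tools/test/tools.json` and adapt the accessor:

```python
import json

# Sketch: count the tools in a JSON dump and list their names.
# The "name" field is an assumption about the schema.
def summarize_tools(raw_json):
    tools = json.loads(raw_json)
    names = sorted(tool.get("name", "<unnamed>") for tool in tools)
    return len(names), names

raw = '[{"name": "search_web"}, {"name": "read_file"}]'
count, names = summarize_tools(raw)
print(count, names)  # → 2 ['read_file', 'search_web']
```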
6. Index the servers

The MCP Copilot Agent requires the servers to be indexed before running. You can run the following command to warm up the agent:

```bash
uv run -m baseline.mcp_copilot.arg_generation
```
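Conceptually, indexing builds a lookup over server descriptions so that a task query can be routed to a plausible server. A toy sketch of the idea (illustration only; the real index is whatever `baseline.mcp_copilot.arg_generation` produces):

```python
# Toy sketch: keyword index over server descriptions, plus a router that
# picks the server with the largest word overlap with the query.
# Server names and descriptions below are made up for illustration.
def build_index(servers):
    return {name: set(desc.lower().split()) for name, desc in servers.items()}

def route(query, index):
    words = set(query.lower().split())
    return max(index, key=lambda name: len(words & index[name]))

idx = build_index({
    "filesystem": "read and write local files",
    "search": "search the public web",
})
print(route("search the web for news", idx))  # → 'search'
```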
## Quick Start
### MCP Copilot Agent
#### Example Run
```bash
bash ./baseline/scripts/run_example.sh
```

This will run the agent with a simple example and save the results in `./baseline/output/`.
#### Full Run
By default, we use the `/root` directory to store the data that the agent will access. If you want to run locally, make sure the files are in the right paths.
1. Run the MCP Copilot Agent
Be sure you have set the environment variables in the `.env` file.

```bash
bash ./baseline/scripts/run_baselines.sh
```

2. Check the results
After running the agent, you can check the trajectories in `./baseline/output`.
### Evaluation using LiveMCPEval
1. Modify the `MODEL` in `.env` to change the evaluation model
2. Run the evaluation script
After running the evaluation, you can check the results in `./evaluator/output`.
4. Calculate the success rate
```bash
uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
```
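For intuition, the success rate is simply the fraction of tasks the evaluator judged successful. A minimal sketch (the per-task verdict format is an assumption; use the script above for the real numbers):

```python
# Sketch: success rate as the fraction of True verdicts.
# The list-of-booleans input is an assumed, simplified result format.
def success_rate(verdicts):
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(success_rate([True, False, True, True]))  # → 0.75
```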