[Benchmark] Add support for MathCanvas-Bench #1292

Merged
FangXinyu-0913 merged 3 commits into open-compass:main from shiwk24:feature/add-mathcanvas-bench-support
Oct 30, 2025
Conversation

@shiwk24 (Contributor) commented Oct 28, 2025

Add Support for MathCanvas-Bench 🔥

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

🌐 Project Page | 📖 Paper | 💻 Code | 📊 Leaderboard | 🤗 Model

Datasets: MathCanvas-Bench · MathCanvas-Instruct · MathCanvas-Edit · MathCanvas-Imagen

📖 Introduction

MathCanvas-Bench is a challenging new benchmark designed to evaluate the intrinsic Visual Chain-of-Thought (VCoT) capabilities of Large Multimodal Models (LMMs). It serves as the primary evaluation testbed for the MathCanvas framework.

While existing math benchmarks have advanced textual reasoning, they largely overlook a critical skill: the ability to generate and reason with visual aids as part of a solution. MathCanvas-Bench specifically targets this gap by requiring models to produce interleaved visual and textual solutions, mirroring how humans often solve complex problems in domains like geometry or function analysis.

Current state-of-the-art models, including both standard LMMs and Unified LMMs (ULMMs), often fail on problems that require strategic visual assistance. They may produce text-only solutions that miss the visual intuition or generate incorrect and unhelpful diagrams. MathCanvas-Bench is specifically designed to measure and drive progress on this critical capability.


[Figure: LMMs produce text-only solutions; ULMMs may generate incorrect and unhelpful visuals]

📈 Evaluation

Evaluation on MathCanvas-Bench requires an LLM-based judge to assess the correctness of the generated solutions.

Important: The official judge for this benchmark is gpt-4.1. Please ensure you have set the OPENAI_API_KEY and OPENAI_API_BASE environment variables before running the evaluation.
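For example, assuming a standard OpenAI-compatible endpoint (the values below are placeholders, not real credentials), the judge configuration can be exported as:

```shell
# Placeholder values -- substitute your own key and endpoint.
export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_API_BASE="https://api.openai.com/v1"
```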

For API-based models:

```shell
python run.py \
    --data MathCanvas-Bench \
    --model your_api_model \
    --work-dir your_save_path \
    --api-nproc 32 \
    --verbose
```

For local open-source models:

You can run inference directly using torchrun:

```shell
torchrun --nproc-per-node=2 run.py \
    --data MathCanvas-Bench \
    --model your_local_model \
    --work-dir your_save_path \
    --verbose
```

Alternatively, for faster inference, you can deploy the model with a serving framework such as LMDeploy or vLLM, then evaluate it with the API-based command above after setting the correct IP and port in vlmeval/config.py.
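As a rough sketch of that deployment path (the exact flags vary by LMDeploy/vLLM version, and the model name and ports here are arbitrary examples, not part of this PR):

```shell
# Option A: vLLM's OpenAI-compatible server (example model and port)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000

# Option B: LMDeploy's API server (example model and port)
lmdeploy serve api_server Qwen/Qwen2.5-VL-7B-Instruct --server-port 23333
```

Once the server is up, register its base URL (e.g. http://localhost:8000/v1) for your model entry in vlmeval/config.py and run the API-based evaluation command.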

@FangXinyu-0913 (Collaborator) commented

Hi @shiwk24, thank you for your contribution to our community. Could you confirm that you have run MathCanvas-Bench's inference and evaluation, and that the performance aligns on some well-known models (like Qwen2.5-VL, InternVL3.5, etc.)?

@shiwk24 (Contributor, Author) commented Oct 30, 2025

Hi @FangXinyu-0913, thanks for the follow-up!

Yes, I can confirm it works. This is the exact code we used for the official leaderboard in our paper, so the performance for models like Qwen2.5-VL and InternVL3.5 is aligned.

I have also re-verified the setup to ensure everything works as expected. Let me know if you need any more information. Thanks!

@FangXinyu-0913 FangXinyu-0913 merged commit 0a5b9d0 into open-compass:main Oct 30, 2025
8 checks passed
Koii2k3 pushed a commit to wjnwjn59/VLMEvalKit that referenced this pull request Nov 13, 2025
