[Benchmark] Add support for MathCanvas-Bench #1292

Merged
FangXinyu-0913 merged 3 commits into open-compass:main from shiwk24:feature/add-mathcanvas-bench-support
Oct 30, 2025
Conversation

@shiwk24 (Contributor) commented Oct 28, 2025

Add Support for MathCanvas-Bench 🔥

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

🌐 Project Page | 📖 Paper | 💻 Code | 📊 Leaderboard | 🤗 Model

Datasets: MathCanvas-Bench · MathCanvas-Instruct · MathCanvas-Edit · MathCanvas-Imagen

📖 Introduction

MathCanvas-Bench is a challenging new benchmark designed to evaluate the intrinsic Visual Chain-of-Thought (VCoT) capabilities of Large Multimodal Models (LMMs). It serves as the primary evaluation testbed for the MathCanvas framework.

While existing math benchmarks have advanced textual reasoning, they largely overlook a critical skill: the ability to generate and reason with visual aids as part of a solution. MathCanvas-Bench specifically targets this gap by requiring models to produce interleaved visual and textual solutions, mirroring how humans often solve complex problems in domains like geometry or function analysis.

Current state-of-the-art models, including both standard LMMs and Unified LMMs (ULMMs), often fail on problems that require strategic visual assistance. They may produce text-only solutions that miss the visual intuition or generate incorrect and unhelpful diagrams. MathCanvas-Bench is specifically designed to measure and drive progress on this critical capability.


[Figure: LMMs produce text-only solutions; ULMMs may generate incorrect and unhelpful visuals]

📈 Evaluation

Evaluation on MathCanvas-Bench requires an LLM-based judge to assess the correctness of the generated solutions.

Important: The official judge for this benchmark is gpt-4.1. Please ensure you have set the OPENAI_API_KEY and OPENAI_API_BASE environment variables before running the evaluation.
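For example, assuming a standard OpenAI-compatible endpoint (the values below are placeholders, not real credentials), the judge configuration can be exported as:

```shell
# Placeholder values -- substitute your own key and endpoint.
export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_API_BASE="https://api.openai.com/v1"
```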

For API-based models:

```shell
python run.py \
    --data MathCanvas-Bench \
    --model your_api_model \
    --work-dir your_save_path \
    --api-nproc 32 \
    --verbose
```

For local open-source models:

You can run inference directly using torchrun:

```shell
torchrun --nproc-per-node=2 run.py \
    --data MathCanvas-Bench \
    --model your_local_model \
    --work-dir your_save_path \
    --verbose
```

Alternatively, for faster inference, you can deploy the model with a serving framework such as LMDeploy or vLLM, then evaluate it with the API-based command above after setting the correct IP and port in vlmeval/config.py.
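As a rough sketch of that deployment path (the exact flags vary by LMDeploy/vLLM version, and the model name and ports here are arbitrary examples, not part of this PR):

```shell
# Option A: vLLM's OpenAI-compatible server (example model and port)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000

# Option B: LMDeploy's API server (example model and port)
lmdeploy serve api_server Qwen/Qwen2.5-VL-7B-Instruct --server-port 23333
```

Once the server is up, register its base URL (e.g. http://localhost:8000/v1) for your model entry in vlmeval/config.py and run the API-based evaluation command.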

@FangXinyu-0913 (Collaborator) commented

Hi @shiwk24, thank you for your contribution to our community. Could you confirm that you have run MathCanvas-Bench's inference and evaluation, and that the performance aligns on some well-known models (like Qwen2.5-VL, InternVL3.5, etc.)?

@shiwk24 (Contributor, Author) commented Oct 30, 2025

Hi @FangXinyu-0913, thanks for the follow-up!

Yes, I can confirm it works. This is the exact code we used for the official leaderboard in our paper, so the performance for models like Qwen2.5-VL and InternVL3.5 is aligned.

I have also re-verified the setup to ensure everything works as expected. Let me know if you need any more information. Thanks!

@FangXinyu-0913 FangXinyu-0913 merged commit 0a5b9d0 into open-compass:main Oct 30, 2025
8 checks passed
Koii2k3 pushed a commit to wjnwjn59/VLMEvalKit that referenced this pull request Nov 13, 2025
