[Benchmark] Add support for MathCanvas-Bench#1292
Merged
FangXinyu-0913 merged 3 commits into open-compass:main on Oct 30, 2025
Conversation
Collaborator

Hi @shiwk24, thank you for your contribution to our community. Could you confirm that you can run MathCanvas-Bench's inference and evaluation, and that the performance aligns on some well-known models (like Qwen2.5-VL, InternVL3.5, etc.)?
Contributor (Author)

Hi @FangXinyu-0913, thanks for the follow-up! Yes, I can confirm it works. This is the exact code we used for the official leaderboard in our paper, so the performance for models like Qwen2.5-VL and InternVL3.5 is aligned. I have also re-verified the setup to ensure everything works as expected. Let me know if you need any more information. Thanks!
Koii2k3 pushed a commit to wjnwjn59/VLMEvalKit that referenced this pull request on Nov 13, 2025
Add Support for MathCanvas-Bench 🔥
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
🌐 Project Page | 📖 Paper | 💻 Code | 📊 Leaderboard | 🤗 Model
Datasets: MathCanvas-Bench · MathCanvas-Instruct · MathCanvas-Edit · MathCanvas-Imagen
📖 Introduction
MathCanvas-Bench is a challenging new benchmark designed to evaluate the intrinsic Visual Chain-of-Thought (VCoT) capabilities of Large Multimodal Models (LMMs). It serves as the primary evaluation testbed for the MathCanvas framework.
While existing math benchmarks have advanced textual reasoning, they largely overlook a critical skill: the ability to generate and reason with visual aids as part of a solution. MathCanvas-Bench specifically targets this gap by requiring models to produce interleaved visual and textual solutions, mirroring how humans often solve complex problems in domains like geometry or function analysis.
Current state-of-the-art models, including both standard LMMs and Unified LMMs (ULMMs), often fail on problems that require strategic visual assistance. They may produce text-only solutions that miss the visual intuition or generate incorrect and unhelpful diagrams. MathCanvas-Bench is specifically designed to measure and drive progress on this critical capability.
📈 Evaluation
The evaluation on MathCanvas-Bench requires an LLM-based judge to assess the correctness of the generated solutions.
Important: The official judge for this benchmark is `gpt-4.1`. Please ensure you have set the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables before running the evaluation.

For API-based models:
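The command block itself did not survive the page export; below is a minimal sketch of a typical VLMEvalKit invocation for an API-based model. The dataset key `MathCanvas-Bench` and the model key `GPT4o` are assumptions for illustration — check the names actually registered in `vlmeval/config.py`.

```shell
# Credentials for the gpt-4.1 judge (and for the API model under test)
export OPENAI_API_KEY="sk-..."                      # your key
export OPENAI_API_BASE="https://api.openai.com/v1"  # or your proxy endpoint

# Run inference and evaluation in one pass (names assumed; see vlmeval/config.py)
python run.py --data MathCanvas-Bench --model GPT4o --verbose
```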
For local open-source models:
You can run inference directly using `torchrun`.

Alternatively, for faster inference, you can deploy the model using services like `lmdeploy` or `vllm`, and then evaluate it with the API-based model command by setting the correct IP and port in `vlmeval/config.py`.
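As a sketch of the two local-model workflows described above — assuming 2 GPUs and `Qwen2.5-VL-7B-Instruct` as an example model key (verify the exact registered names in `vlmeval/config.py`):

```shell
# Option 1: run inference directly with torchrun (one process per GPU)
torchrun --nproc-per-node=2 run.py --data MathCanvas-Bench \
    --model Qwen2.5-VL-7B-Instruct --verbose

# Option 2: serve the model first for faster inference, e.g. with lmdeploy,
# then point the matching API entry in vlmeval/config.py at the local endpoint
# (http://<ip>:23333/v1 here) and run the API-based evaluation command.
lmdeploy serve api_server Qwen/Qwen2.5-VL-7B-Instruct --server-port 23333
```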