<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description"
content="When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought">
<meta name="keywords" content="Thinking with images, Visual CoT">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:url" content="https://mira-benchmark.github.io/"/>
<title>When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.2/css/all.min.css">
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/explorer-index.js"></script>
<style>
.publication-authors {
display: flex;
flex-wrap: wrap;
justify-content: center;
align-items: center;
gap: 0.3em 1em;
text-align: center;
max-width: 900px;
margin: 0 auto;
}
.author-block {
display: inline-block;
white-space: nowrap;
}
.paper-block {
display: inline-block;
margin-top: 0.5em;
text-align: center;
}
</style>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title is-bold">
<span>When Visualizing is the First Step to Reasoning: <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span>, a Benchmark for Visual Chain-of-Thought</span>
</h1>
<div class="is-size-5 publication-authors">
<span class="author-block"><a href="https://yiyangzhou.github.io/">Yiyang Zhou</a><sup style="color:#1E40AF;">1</sup><sup style="color:#3B82F6;">2</sup>*, </span>
<span class="author-block"><a href="https://www.haqtu.me/">Haoqin Tu</a><sup style="color:#1E40AF;">1</sup><sup style="color:#FDB515;">3</sup>*, </span>
<span class="author-block"><a href="https://asillycat.github.io/">Zijun Wang</a><sup style="color:#FDB515;">3</sup>, </span>
<span class="author-block"><a href="https://zw615.github.io/">Zeyu Wang</a><sup style="color:#1E40AF;">1</sup><sup style="color:#FDB515;">3</sup></span>
<div style="flex-basis: 100%; height: 0;"></div>
<span class="author-block"><a href="https://muennighoff.github.io/">Niklas Muennighoff</a><sup style="color:#8C1515">4</sup>, </span>
<span class="author-block"><a href="https://fannie1208.github.io/">Fan Nie</a><sup style="color:#8C1515">4</sup>, </span>
<span class="author-block"><a href="https://yejinc.github.io/">Yejin Choi</a><sup style="color:#8C1515">4</sup>, </span>
<span class="author-block"><a href="https://www.james-zou.com/">James Zou</a><sup style="color:#8C1515">4</sup></span>
<div style="flex-basis: 100%; height: 0;"></div>
<span class="author-block"><a href="https://scholar.google.com/citations?user=k0TWfBoAAAAJ&hl=en">Chaorui Deng</a><sup style="color:#1E40AF">1</sup>, </span>
<span class="author-block"><a href="https://shenyann.github.io/">Shen Yan</a><sup style="color:#1E40AF">1</sup>, </span>
<span class="author-block"><a href="https://haoqifan.github.io/">Haoqi Fan</a><sup style="color:#1E40AF">1</sup>, </span>
<span class="author-block"><a href="https://cihangxie.github.io/">Cihang Xie</a><sup style="color:#FDB515">3</sup></span>
<div style="flex-basis: 100%; height: 0;"></div>
<span class="author-block"><a href="https://www.huaxiuyao.io/">Huaxiu Yao</a><sup style="color:#3B82F6">2</sup>†, </span>
<span class="author-block"><a href="https://scholar.google.com/citations?user=ZYOhaGwAAAAJ&hl=en">Qinghao Ye</a><sup style="color:#0A2472">1</sup>†</span>
</div>
<div class="is-size-5 publication-authors">
<div style="flex-basis: 100%; height: 0;"></div>
<span class="author-block"><sup style="color:#1E40AF;">1</sup>ByteDance Seed</span>
<span class="author-block"><sup style="color:#3B82F6">2</sup>UNC-Chapel Hill</span>
<span class="author-block"><sup style="color:#FDB515">3</sup>UCSC</span>
<span class="author-block"><sup style="color:#8C1515">4</sup>Stanford</span>
<div style="flex-basis: 100%; height: 0;"></div>
<span class="paper-block"><b>* Equal Contribution.</b></span>
<span class="paper-block"><b>† Corresponding Authors.</b></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2511.02779"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/aiming-lab/MIRA"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Dataset Link. -->
<span class="link-block">
<a href="https://huggingface.co/datasets/YiyangAiLab/MIRA" class="button is-normal is-rounded is-dark">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg"
alt="Hugging Face" style="height: 1em; vertical-align: middle; margin-right: 0.3em;">
Hugging Face
</a>
</span>
<!-- Twitter Link. -->
<span class="link-block">
<a href="https://x.com/AiYiyangZ"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa-brands fa-x-twitter"></i>
</span>
<span>Twitter</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container" style="margin-bottom: 2vh;">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Introduction</h2>
<div class="content has-text-justified">
<span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span> (<span style="color: #5d60a5;">M</span>ultimodal <span style="color: #5784ae;">I</span>magination for <span style="color: #5ba7c2;">R</span>easoning <span style="color: #5cc8cb;">A</span>ssessment) is a new benchmark that consists of carefully curated multimodal questions, each requiring the ability to generate or utilize intermediate visual images (i.e., Visual Chain-of-Thought) to successfully perform complex reasoning.
</div>
<p align="center">
<img src="assets/fig2.jpg" alt="Reference to Figure 1 from MIRA paper" width="80%"/> <br>
</p>
<strong>Fig. 1.</strong> <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span> Reveals MLLM Weaknesses in Visual Reasoning. Leading MLLMs such as GPT-5, o3, o4-mini, and Gemini 2.5 Pro perform well on benchmarks like MMMU, MMStar and RealWorldQA but drop below 20% accuracy on <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span>. This sharp decline highlights <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span>’s ability to expose core challenges in reasoning tasks that require generating intermediate visual imagery. The left example shows a dice-rolling task where humans, able to visualize motion, succeed, while MLLMs relying only on text reasoning fail.
<p align="center">
<img src="assets/fig1.jpg" alt="Reference to Figure 2 from MIRA paper" width="100%"/> <br>
</p>
<strong>Fig. 2.</strong> <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span> categorizes Visual-CoT reasoning tasks into two primary types: Static (Single-Step) and Dynamic (Multi-Step), with representative examples from each category illustrated in the figure. The dataset includes 20 types of tasks, 546 input images with manually designed questions, and 936 manually constructed single-step and multi-step intermediate images.
</div>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Results</h2>
<div class="content has-text-justified">
<p>
The <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span> benchmark is highly challenging for all Multimodal Large Language Models (MLLMs), with even the most advanced closed-source models achieving an overall accuracy of no more than 20% under <span style="color: #41b2ce;">direct input (D)</span>. <span style="color: #2597de;">Text-based Chain-of-Thought (T)</span> has limited effect, or even a negative impact on strong models like Gemini 2.5 Pro and o3. In contrast, providing human-annotated intermediate <span style="color: #175ba1;">visual cues (V)</span> significantly boosts model performance, yielding an average relative gain of 33.7%, highlighting the critical role of visual information in complex reasoning.
</p>
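The relative gain reported above is a simple ratio of accuracies. As a minimal sketch (the accuracy values in the example are illustrative placeholders, not actual MIRA scores):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of one accuracy over another, e.g.
    the Visual-CoT (V) setting versus direct input (D)."""
    return (improved - baseline) / baseline

# Illustrative only: a model at 15% direct accuracy reaching 20%
# with visual cues shows a ~33% relative gain.
print(round(relative_gain(0.15, 0.20), 3))  # → 0.333
```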
<div class="content has-text-centered">
<img src="assets/result1.png" alt="Main results of MLLMs on MIRA under direct, Text-CoT, and Visual-CoT settings" width="90%"/>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Analysis</h2>
<div class="content has-text-justified">
<p>
Expanding the model's decoding search space (e.g., using Pass@k) brings only limited performance gains on the <span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span> task, with improvements quickly saturating beyond Pass@4. For stronger models, the benefits of Pass@k or majority voting are minimal, indicating that their failures stem from a fundamental lack of capability rather than mere random reasoning errors.
</p>
<div class="content has-text-centered">
<img src="assets/result2.png" alt="Pass@k and majority-voting results on MIRA" width="90%"/>
</div>
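Pass@k figures like those above are commonly computed with the unbiased estimator of Chen et al. (2021). The sketch below implements that standard estimator as an assumption about the metric's definition; it is not MIRA's actual evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    is correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct answer out of 8 attempts, Pass@4 is already 0.5,
# so raising k further yields diminishing returns.
print(pass_at_k(8, 1, 4))  # → 0.5
```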
<p>
Replacing the generic Text-CoT prompt with task-specific prompts to better simulate the guidance provided by Visual-CoT yields consistent but relatively marginal performance improvements (an average gain of about 1.4% to 1.5%). This limited improvement, in contrast to the substantial gains brought by Visual-CoT, highlights the inherent limitations of purely textual guidance, which struggles to adequately capture the visual information required for certain reasoning steps.
</p>
<div class="content has-text-centered">
<img src="assets/result3.png" alt="Results with task-specific Text-CoT prompts on MIRA" width="90%"/>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Data Case</h2>
<div class="content has-text-justified">
<p>
We showcase several representative examples for each category below.
</p>
<div id="results-carousel" class="carousel results-carousel">
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case1_page-0001.jpg" alt="MIRA data case example 1" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case2_page-0001.jpg" alt="MIRA data case example 2" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case3_page-0001.jpg" alt="MIRA data case example 3" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case4_page-0001.jpg" alt="MIRA data case example 4" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case5_page-0001.jpg" alt="MIRA data case example 5" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case6_page-0001.jpg" alt="MIRA data case example 6" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case7_page-0001.jpg" alt="MIRA data case example 7" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case8_page-0001.jpg" alt="MIRA data case example 8" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case9_page-0001.jpg" alt="MIRA data case example 9" width="90%"/>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="assets/case10_page-0001.jpg" alt="MIRA data case example 10" width="90%"/>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title is-3 has-text-centered">BibTeX</h2>
<pre><code>@misc{zhou2025visualizingstepreasoningmira,
title={When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought},
author={Yiyang Zhou and Haoqin Tu and Zijun Wang and Zeyu Wang and Niklas Muennighoff and Fan Nie and Yejin Choi and James Zou and Chaorui Deng and Shen Yan and Haoqi Fan and Cihang Xie and Huaxiu Yao and Qinghao Ye},
year={2025},
eprint={2511.02779},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.02779},
}</code></pre>
</div>
</section>
<section>
<div class="section" id="org-banners" style="display:flex">
<a href="https://seed.bytedance.com/en/" target="_blank" rel="external">
<img class="center-block org-banner" src="assets/seed.png" style="max-width: 240px; height: auto;">
</a>
<a href="https://www.unc.edu/" target="_blank" rel="external">
<img class="center-block org-banner" src="assets/unc.png" style="max-width: 200px; height: auto;">
</a>
<a href="https://www.ucsc.edu/" target="_blank" rel="external" class="ext-link">
<img class="center-block org-banner" src="assets/ucsc.png" style="max-width: 190px; height: auto;">
</a>
<a href="https://www.stanford.edu/" target="_blank" rel="external">
<img class="center-block org-banner" src="assets/stanford.png" style="max-width: 180px; height: auto;">
</a>
</div>
</section>
<footer style="background-color: #1a1a1a; color: #ffffff; padding: 40px 30px; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">
<div style="max-width: 1200px; margin: 0 auto; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap;">
<div style="margin-bottom: 20px;">
<h2 style="color: #00bfff; font-size: 36px; font-weight: 800; margin: 0; text-transform: uppercase;"><span style="color: #5d60a5;">M</span><span style="color: #5784ae;">I</span><span style="color: #5ba7c2;">R</span><span style="color: #5cc8cb;">A</span></h2>
<p style="margin: 0; font-size: 16px;">A Benchmark for Visual Chain-of-Thought, where visualizing is the first step to reasoning.</p>
<div style="margin-top: 15px; display: flex; gap: 15px;">
<!-- GitHub -->
<a href="https://github.com/aiming-lab/MIRA" class="footer-icon" style="color: #ffffff; font-size: 24px; text-decoration: none;">
<i class="fab fa-github"></i>
</a>
<!-- Hugging Face -->
<a href="https://huggingface.co/datasets/YiyangAiLab/MIRA" class="footer-icon" style="color: #ffffff; font-size: 24px; text-decoration: none;">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face" style="height: 1em; vertical-align: middle;">
</a>
<!-- X / Twitter -->
<a href="https://x.com/AiYiyangZ/" class="footer-icon" style="color: #ffffff; font-size: 24px; text-decoration: none;">
<i class="fa-brands fa-x-twitter"></i>
</a>
<!-- LinkedIn -->
<a href="https://www.linkedin.com/in/yiyang-zhou-1bb05829a" class="footer-icon" style="color: #ffffff; font-size: 24px; text-decoration: none;">
<i class="fab fa-linkedin"></i>
</a>
</div>
</div>
<div style="font-size: 14px;">
<br>
<br>
<span style="margin-right: 20px;">
<a href="#" id="share-link" style="color: #ffffff; text-decoration: none;">
<i class="fa-solid fa-link" style="margin-right: 5px;"></i> Share Site
</a>
</span>
<br>
<br>
</div>
<div id="toast" style="
display: none;
position: fixed;
bottom: 20px;
left: 50%;
transform: translateX(-50%);
background: #333;
color: #fff;
padding: 10px 20px;
border-radius: 5px;
font-size: 14px;
z-index: 1000;
">Link copied to clipboard!</div>
<script>
const toast = document.getElementById('toast');
document.getElementById("share-link").addEventListener("click", function(e) {
e.preventDefault();
const url = "https://mira-benchmark.github.io/";
navigator.clipboard.writeText(url).then(() => {
toast.style.display = 'block';
setTimeout(() => { toast.style.display = 'none'; }, 2000); // hide after 2 seconds
});
});
</script>
</div>
</footer>
</body>
</html>