Commit 493a7e8

[Result] Update Evaluation Results (#60)
* update MME, SEEDBench
* update results
* update LLaVABench
* fix
* update AI2D accuracy
* update LLaVABench
* update README
* update teaser link
1 parent e992046 commit 493a7e8

12 files changed (+363 additions, −244 deletions)

README.md

Lines changed: 5 additions & 6 deletions
```diff
@@ -1,4 +1,4 @@
-![LOGO](https://github-production-user-asset-6210df.s3.amazonaws.com/34324155/295443340-a300f073-4995-48a5-af94-495141606cf7.jpg)
+![LOGO](http://opencompass.openxlab.space/utils/MMLB.jpg)
 <div align="center"><b>A Toolkit for Evaluating Large Vision-Language Models. </b></div>
 <div align="center"><br>
 <a href="https://opencompass.org.cn/leaderboard-multimodal">🏆 Leaderboard </a> •
@@ -9,16 +9,15 @@
 <a href="#%EF%B8%8F-citation">🖊️Citation </a>
 <br><br>
 </div>
-
 **VLMEvalKit** (the python package name is **vlmeval**) is an **open-source evaluation toolkit** of **large vision-language models (LVLMs)**. It enables **one-command evaluation** of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt **generation-based evaluation** for all LVLMs (obtain the answer via `generate` / `chat` interface), and provide the evaluation results obtained with both **exact matching** and **LLM(ChatGPT)-based answer extraction**.

 ## 🆕 News

-- **[2024-01-14]** We have supported [**LLaVABench (in-the-wild)**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild).
+- **[2024-01-21]** We have updated results for [**LLaVABench (in-the-wild)**](/results/LLaVABench.md) and [**AI2D**](/results/AI2D.md).
 - **[2024-01-14]** We have supported [**AI2D**](https://allenai.org/data/diagrams) and provided the [**script**](/scripts/AI2D_preproc.ipynb) for data pre-processing. 🔥🔥🔥
 - **[2024-01-13]** We have supported [**EMU2 / EMU2-Chat**](https://github.com/baaivision/Emu) and [**DocVQA**](https://www.docvqa.org). 🔥🔥🔥
 - **[2024-01-11]** We have supported [**Monkey**](https://github.com/Yuliang-Liu/Monkey). 🔥🔥🔥
-- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded as json files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also add a [notebook](scripts/visualize.ipynb) to visualize these results. 🔥🔥🔥
+- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded as json files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also added a [**notebook**](scripts/visualize.ipynb) to visualize these results. 🔥🔥🔥
 - **[2024-01-03]** We support **ScienceQA (Img)** (Dataset Name: ScienceQA_[VAL/TEST], [**eval results**](results/ScienceQA.md)), **HallusionBench** (Dataset Name: HallusionBench, [**eval results**](/results/HallusionBench.md)), and **MathVista** (Dataset Name: MathVista_MINI, [**eval results**](/results/MathVista.md)). 🔥🔥🔥
 - **[2023-12-31]** We release the [**preliminary results**](/results/VQA.md) of three VQA datasets (**OCRVQA**, **TextVQA**, **ChartQA**). The results are obtained by exact matching and may not faithfully reflect the real performance of VLMs on the corresponding task.
@@ -46,9 +45,9 @@
 | [**OCRVQA**](https://ocr-vqa.github.io) | OCRVQA_[TESTCORE/TEST] ||| [**VQA**](/results/VQA.md) |
 | [**TextVQA**](https://textvqa.org) | TextVQA_VAL ||| [**VQA**](/results/VQA.md) |
 | [**ChartQA**](https://github.com/vis-nlp/ChartQA) | ChartQA_VALTEST_HUMAN ||| [**VQA**](/results/VQA.md) |
+| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| [**AI2D**](/results/AI2D.md) |
+| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| [**LLaVABench**](/results/LLaVABench.md) |
 | [**DocVQA**](https://www.docvqa.org) | DocVQA_VAL ||| |
-| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| |
-| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| |
 | [**Core-MM**](https://github.com/core-mm/core-mm) | CORE_MM || | |

 **Supported API Models**
```
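The README's generation-based, exact-matching pipeline can be pictured with a tiny sketch. The model class, field names, and scoring loop below are hypothetical stand-ins for illustration, not the toolkit's actual API:

```python
# Minimal sketch of generation-based evaluation with exact matching.
# `EchoModel`, `evaluate`, and the item fields are hypothetical stand-ins.
def evaluate(model, dataset):
    """Query the model once per item and score by normalized exact matching."""
    hits = 0
    for item in dataset:
        pred = model.generate(item["question"])  # answer via a `generate` interface
        hits += pred.strip().lower() == item["answer"].strip().lower()
    return 100.0 * hits / len(dataset)

class EchoModel:
    # Toy stand-in: always answers "B" (a real LVLM also consumes the image).
    def generate(self, prompt):
        return "B"

dataset = [{"question": "Q1", "answer": "B"}, {"question": "Q2", "answer": "C"}]
print(evaluate(EchoModel(), dataset))  # 50.0
```

When exact matching fails on free-form replies, the toolkit falls back to LLM (ChatGPT)-based answer extraction instead.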

results/AI2D.md

Lines changed: 39 additions & 0 deletions
New file.

# AI2D Evaluation Results

> During evaluation, we use `GPT-3.5-Turbo-0613` as the choice extractor for all VLMs if the choice cannot be extracted via heuristic matching. **Zero-shot** inference is adopted.

## AI2D Accuracy

| Model                       | Overall |
|:----------------------------|--------:|
| Monkey-Chat                 |    72.6 |
| GPT-4v (detail: low)        |    71.3 |
| Qwen-VL-Chat                |    68.5 |
| Monkey                      |    67.6 |
| GeminiProVision             |    66.7 |
| QwenVLPlus                  |    63.7 |
| Qwen-VL                     |    63.4 |
| LLaVA-InternLM2-20B (QLoRA) |    61.4 |
| CogVLM-17B-Chat             |    60.3 |
| ShareGPT4V-13B              |    59.3 |
| TransCore-M                 |    59.2 |
| LLaVA-v1.5-13B (QLoRA)      |    59.0 |
| LLaVA-v1.5-13B              |    57.9 |
| ShareGPT4V-7B               |    56.7 |
| InternLM-XComposer-VL       |    56.1 |
| LLaVA-InternLM-7B (QLoRA)   |    56.0 |
| LLaVA-v1.5-7B (QLoRA)       |    55.2 |
| mPLUG-Owl2                  |    55.2 |
| SharedCaptioner             |    55.1 |
| IDEFICS-80B-Instruct        |    54.4 |
| LLaVA-v1.5-7B               |    54.1 |
| PandaGPT-13B                |    49.2 |
| LLaVA-v1-7B                 |    47.8 |
| IDEFICS-9B-Instruct         |    42.7 |
| InstructBLIP-7B             |    40.2 |
| VisualGLM                   |    40.2 |
| InstructBLIP-13B            |    38.6 |
| MiniGPT-4-v1-13B            |    33.4 |
| OpenFlamingo v2             |    30.7 |
| MiniGPT-4-v2                |    29.4 |
| MiniGPT-4-v1-7B             |    28.7 |
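The two-stage choice extraction described in the note above can be sketched as a heuristic regex pass that defers to the LLM extractor only when no unambiguous option is found. The function name and option handling are illustrative assumptions, not VLMEvalKit's actual code:

```python
import re

def extract_choice(answer, options=("A", "B", "C", "D")):
    """Heuristic matcher (illustrative sketch): look for a standalone option
    letter in the model's reply. Returns None when the match is missing or
    ambiguous, in which case an LLM extractor (e.g. GPT-3.5-Turbo-0613)
    would be consulted instead."""
    found = set(re.findall(r"\b([A-D])\b", answer.upper()))
    found &= set(options)
    if len(found) == 1:
        return found.pop()
    return None  # ambiguous or missing -> defer to the LLM extractor

print(extract_choice("The answer is (B)."))   # B
print(extract_choice("Either A or C fits."))  # None
```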

results/Caption.md

Lines changed: 30 additions & 28 deletions
### Evaluation Results

Relative to the previous table, this update adds EMU2-Chat and LLaVA-InternLM2-20B (QLoRA), revises the TransCore-M numbers, and shortens the "(QLoRA, XTuner)" suffix to "(QLoRA)".

| Model                       | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word_cnt mean. | Word_cnt std. |
|:----------------------------|-------:|-------:|--------:|------:|---------------:|--------------:|
| EMU2-Chat                   |   38.7 |   78.2 |    56.9 | 109.2 |            9.6 |           1.1 |
| Qwen-VL-Chat                |     34 |   75.8 |    54.9 |  98.9 |             10 |           1.7 |
| IDEFICS-80B-Instruct        |   32.5 |   76.1 |    54.1 |  94.9 |            9.7 |           3.2 |
| IDEFICS-9B-Instruct         |   29.4 |   72.7 |    53.4 |  90.4 |           10.5 |           4.4 |
| InstructBLIP-7B             |   20.9 |   56.8 |    39.9 |  58.1 |           11.6 |           5.9 |
| InstructBLIP-13B            |   16.9 |     50 |      37 |  52.4 |           11.8 |          12.8 |
| InternLM-XComposer-VL       |   12.4 |   38.3 |    37.9 |    41 |           26.3 |          22.2 |
| GeminiProVision             |    8.4 |   33.2 |    31.2 |   9.7 |           35.2 |          15.7 |
| LLaVA-v1.5-7B (QLoRA)       |    7.2 |     25 |    36.6 |  43.2 |           48.8 |          42.9 |
| mPLUG-Owl2                  |    7.1 |   25.8 |    33.6 |    35 |           45.8 |          32.1 |
| LLaVA-v1-7B                 |    6.7 |   27.3 |    26.7 |   6.1 |           40.9 |          16.1 |
| VisualGLM                   |    5.4 |   28.6 |    23.6 |   0.2 |           41.5 |          11.5 |
| LLaVA-v1.5-13B (QLoRA)      |    5.3 |   19.6 |    25.8 |  17.8 |           72.2 |          39.4 |
| LLaVA-v1.5-13B              |    5.1 |   20.7 |    21.2 |   0.3 |           70.6 |          22.3 |
| LLaVA-v1.5-7B               |    4.6 |   19.6 |    19.9 |   0.1 |           72.5 |          21.7 |
| PandaGPT-13B                |    4.6 |   19.9 |    19.3 |   0.1 |           65.4 |          16.6 |
| MiniGPT-4-v1-13B            |    4.4 |     20 |    19.8 |   1.3 |           64.4 |          30.5 |
| MiniGPT-4-v1-7B             |    4.3 |   19.6 |    17.5 |   0.8 |           61.9 |          30.6 |
| LLaVA-InternLM-7B (QLoRA)   |      4 |   17.3 |    17.2 |   0.1 |           82.3 |            21 |
| LLaVA-InternLM2-20B (QLoRA) |      4 |   17.9 |    17.3 |     0 |           83.2 |          20.4 |
| CogVLM-17B-Chat             |    3.6 |   21.3 |      20 |   0.1 |           56.2 |          13.7 |
| Qwen-VL                     |    3.5 |   11.6 |      30 |  41.1 |           46.6 |         105.2 |
| GPT-4v (detail: low)        |    3.3 |     18 |    18.1 |     0 |           77.8 |          20.4 |
| TransCore-M                 |    2.1 |   14.2 |    13.8 |   0.2 |             92 |           6.7 |
| ShareGPT4V-7B               |    1.4 |    9.7 |    10.6 |   0.1 |          147.9 |          45.4 |
| MiniGPT-4-v2                |    1.4 |   12.6 |    13.3 |   0.1 |             83 |          27.1 |
| OpenFlamingo v2             |    1.3 |    6.4 |    15.8 |  14.9 |             60 |          81.9 |
| SharedCaptioner             |      1 |    8.8 |     9.2 |     0 |          164.2 |          31.6 |

We noticed that VLMs generating long image descriptions tend to achieve inferior scores under the various caption metrics.
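The length effect noted above can be illustrated with a toy clipped-unigram-precision computation, the core of BLEU-1. This is a simplified sketch for intuition, not the metric implementation used to produce the table:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision (the heart of BLEU-1): matched tokens
    divided by candidate length, so padding a caption with extra words
    lowers the score even when every reference word is covered."""
    cand, ref = candidate.split(), Counter(reference.split())
    matched = sum(min(c, ref[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

ref = "a dog runs on the beach"
print(unigram_precision("a dog runs on the beach", ref))  # 1.0
# A longer, still-correct caption dilutes precision well below 1.0:
print(unigram_precision("a dog runs on the beach while waves crash and birds fly overhead", ref))
```

The same dilution applies to the higher-order n-gram precisions in BLEU-4, which is consistent with the very low BLEU scores of the long-caption models at the bottom of the table.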

results/HallusionBench.md

Lines changed: 33 additions & 28 deletions
> Models are sorted in descending order of qAcc.

Relative to the previous table, this update adds Monkey-Chat, Monkey, EMU2-Chat, ShareGPT4V-13B, and LLaVA-InternLM2-20B (QLoRA), revises the TransCore-M numbers, and shortens the "(QLoRA, XTuner)" suffix to "(QLoRA)".

| Model                       |   aAcc |   fAcc |   qAcc |
|:----------------------------|-------:|-------:|-------:|
| GPT-4v (detail: low)        |   65.8 |   38.4 |   35.2 |
| GeminiProVision             |   63.9 |   37.3 |   34.3 |
| Monkey-Chat                 |   58.4 |   30.6 |     29 |
| Qwen-VL-Chat                |   56.4 |   27.7 |   26.4 |
| MiniGPT-4-v1-7B             |   52.4 |   17.3 |   25.9 |
| Monkey                      |   55.1 |     24 |   25.5 |
| CogVLM-17B-Chat             |   55.1 |   26.3 |   24.8 |
| MiniGPT-4-v1-13B            |   51.3 |   16.2 |   24.6 |
| InternLM-XComposer-VL       |     57 |   26.3 |   24.6 |
| SharedCaptioner             |   55.6 |   22.8 |   24.2 |
| MiniGPT-4-v2                |   52.6 |   16.5 |   21.1 |
| InstructBLIP-7B             |   53.6 |   20.2 |   19.8 |
| Qwen-VL                     |   57.6 |   12.4 |   19.6 |
| OpenFlamingo v2             |   52.7 |   17.6 |     18 |
| EMU2-Chat                   |   49.4 |   22.3 |   16.9 |
| mPLUG-Owl2                  |   48.9 |   22.5 |   16.7 |
| ShareGPT4V-13B              |   49.8 |   21.7 |   16.7 |
| VisualGLM                   |   47.2 |   11.3 |   16.5 |
| TransCore-M                 |   49.7 |   21.4 |   15.8 |
| IDEFICS-9B-Instruct         |   50.1 |   16.2 |   15.6 |
| ShareGPT4V-7B               |   48.2 |   21.7 |   15.6 |
| LLaVA-InternLM-7B (QLoRA)   |   49.1 |   22.3 |   15.4 |
| InstructBLIP-13B            |   47.9 |   17.3 |   15.2 |
| LLaVA-InternLM2-20B (QLoRA) |   47.7 |   17.1 |   14.3 |
| LLaVA-v1.5-13B (QLoRA)      |   46.9 |   17.6 |   14.1 |
| LLaVA-v1.5-7B               |   48.3 |   19.9 |   14.1 |
| LLaVA-v1.5-7B (QLoRA)       |   46.2 |   16.2 |   13.2 |
| LLaVA-v1.5-13B              |   46.7 |   17.3 |     13 |
| IDEFICS-80B-Instruct        |   46.1 |   13.3 |     11 |
| LLaVA-v1-7B                 |   44.1 |   13.6 |    9.5 |
| PandaGPT-13B                |   43.1 |    9.2 |    7.7 |
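The three accuracies can be read as nested all-or-nothing aggregations. Here is a sketch under our reading of HallusionBench (the record fields and grouping keys are hypothetical, not the benchmark's actual schema): aAcc scores each answer independently, qAcc credits a question group only when every variant of it is answered correctly, and fAcc applies the same rule per figure:

```python
from collections import defaultdict

def hallusion_metrics(records):
    """Illustrative sketch of aAcc / fAcc / qAcc on a flat list of records,
    each {"question": ..., "figure": ..., "correct": bool}."""
    def grouped_acc(key):
        # All-or-nothing: a group counts only if every record in it is correct.
        groups = defaultdict(list)
        for r in records:
            groups[r[key]].append(r["correct"])
        return 100.0 * sum(all(v) for v in groups.values()) / len(groups)

    aAcc = 100.0 * sum(r["correct"] for r in records) / len(records)
    return aAcc, grouped_acc("figure"), grouped_acc("question")

records = [
    {"question": "q1", "figure": "f1", "correct": True},
    {"question": "q1", "figure": "f2", "correct": True},
    {"question": "q2", "figure": "f1", "correct": False},
    {"question": "q2", "figure": "f2", "correct": True},
]
print(hallusion_metrics(records))  # (75.0, 50.0, 50.0)
```

Because one wrong answer zeroes an entire group, qAcc and fAcc sit well below aAcc for every model in the table, which matches the large gaps between the columns.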
