Commit 507975d

[Dataset] Add AesBench VAL (#240)

* Add files via upload
* update aesbench
* update init
* update dataset config
* update md5
* update

Co-authored-by: kennymckormick <dhd@pku.edu.cn>

Parent: f6c9f5e

File tree: 7 files changed, +74 −9 lines


README.md — 2 additions, 2 deletions

@@ -26,6 +26,7 @@
 ## 🆕 News

 - **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
+- **[2024-06-27]** We have supported [**AesBench**](https://github.com/yipoh/AesBench), thanks to [**Yipo Huang**](https://github.com/yipoh) and [**Quan Yuan**](https://github.com/dylanqyuan) 🔥🔥🔥
 - **[2024-06-26]** We have supported the evaluation of [**CongRong**](https://mllm.cloudwalk.com/web), it ranked **3rd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
 - **[2024-06-26]** We firstly support a video understanding benchmark: [**MMBench-Video**](https://mmbench-video.github.io), Image LVLMs that accept multiple images as inputs can be evaluated on the video understanding benchmarks. Check [**QuickStart**](/docs/en/Quickstart.md) to learn how to perform the evaluation 🔥🔥🔥
 - **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), it ranked the **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥

@@ -34,7 +35,6 @@
 - **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing) 🔥🔥🔥
 - **[2024-06-12]** We have supported [**GLM-4v-9B**](https://huggingface.co/THUDM/glm-4v-9b) 🔥🔥🔥
 - **[2024-06-05]** We have supported [**WeMM**](https://github.com/scenarios/WeMM), thanks to [**scenarios**](https://github.com/scenarios) 🔥🔥🔥
-- **[2024-05-27]** We have supported [**Mini InternVL**](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥

 ## 📊 Datasets, Models, and Evaluation Results

@@ -59,7 +59,7 @@
 | [**InfoVQA**](https://www.docvqa.org/datasets/infographicvqa)+ | InfoVQA_[VAL/TEST] | VQA | [**OCRBench**](https://github.com/Yuliang-Liu/MultimodalOCR) | OCRBench | VQA |
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
 | [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
-| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | | | |
+| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_VAL | MCQ |

 **\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting

docs/ja/README_ja.md — 1 addition, 1 deletion

@@ -47,7 +47,7 @@
 | [**InfoVQA**](https://www.docvqa.org/datasets/infographicvqa)+ | InfoVQA_[VAL/TEST] | VQA | [**OCRBench**](https://github.com/Yuliang-Liu/MultimodalOCR) | OCRBench | VQA |
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
 | [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
-| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | | | |
+| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_VAL | MCQ |

 **\*** We only provide a subset of the evaluation results for VLMs that cannot produce reasonable results under the zero-shot setting
docs/zh-CN/README_zh-CN.md — 2 additions, 2 deletions

@@ -24,6 +24,7 @@
 ## 🆕 News

 - **[2024-06-27]** Supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
+- **[2024-06-27]** Supported [**AesBench**](https://github.com/yipoh/AesBench), thanks to [**Yipo Huang**](https://github.com/yipoh) and [**Quan Yuan**](https://github.com/dylanqyuan) 🔥🔥🔥
 - **[2024-06-26]** Supported the evaluation of [**CongRong**](https://mllm.cloudwalk.com/web), which ranked **3rd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
 - **[2024-06-26]** Supported our first video understanding benchmark, [**MMBench-Video**](https://mmbench-video.github.io), which can be used to test multimodal models that accept multiple image inputs. See [**QuickStart**](/docs/zh-CN/Quickstart.md) for how to launch an MMBench-Video evaluation 🔥🔥🔥
 - **[2024-06-24]** Supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥

@@ -32,7 +33,6 @@
 - **[2024-06-18]** Supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing) 🔥🔥🔥
 - **[2024-06-12]** Supported [**GLM-4v-9B**](https://huggingface.co/THUDM/glm-4v-9b) 🔥🔥🔥
 - **[2024-06-05]** Supported [**WeMM**](https://github.com/scenarios/WeMM), thanks to [**scenarios**](https://github.com/scenarios) 🔥🔥🔥
-- **[2024-05-27]** Supported [**Mini InternVL**](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥

 ## 📊 Evaluation Results, Supported Datasets and Models <a id="data-model-results"></a>
 ### Evaluation Results

@@ -56,7 +56,7 @@
 | [**InfoVQA**](https://www.docvqa.org/datasets/infographicvqa)+ | InfoVQA_[VAL/TEST] | VQA | [**OCRBench**](https://github.com/Yuliang-Liu/MultimodalOCR) | OCRBench | VQA |
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
 | [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
-| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | | | |
+| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_VAL | MCQ |

 **\*** We only provide test results for a subset of models; the remaining models cannot produce reasonable accuracy under the zero-shot setting
vlmeval/dataset/config.py — 3 additions, 3 deletions

@@ -50,8 +50,8 @@
     'MMT-Bench_ALL': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_ALL.tsv',
     'MMT-Bench_VAL_MI': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL_MI.tsv',
     'MMT-Bench_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL.tsv',
-    # MLLMGuard
     'MLLMGuard_DS': 'https://opencompass.openxlab.space/utils/VLMEval/MLLMGuard_DS.tsv',
+    'AesBench_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/AesBench_VAL.tsv',

     # Video Benchmarks
     'MMBench-Video': 'https://huggingface.co/datasets/nebulae09/MMBench-Video/raw/main/MMBench-Video.tsv',

@@ -107,8 +107,8 @@
     'MMT-Bench_ALL': 'b273a2f4c596fe4f2605de0494cd632f',
     'MMT-Bench_VAL_MI': 'c7d7b998eb5cd9aa36c7d4f721472462',
     'MMT-Bench_VAL': '8dd4b730f53dbf9c3aed90ca31c928e0',
-    # MLLMGuard
     'MLLMGuard_DS': '975fc0dd7119386e198c37d71e274b3f',
+    'AesBench_VAL': '3edb0c319e9187aa0b97fe7a11700a8c',

     # Video Benchmarks
     'MMBench-Video': '98f7df3eb1007fc375ea6fe88a98e2ff',
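Each registered TSV URL above is paired with an MD5 checksum, so a downloaded benchmark file can be validated before use. As an illustration only, a check against the registered hash could look like the sketch below; `verify_md5` is a hypothetical helper, not part of VLMEvalKit:

```python
import hashlib

def verify_md5(path, expected):
    """Compare a file's MD5 digest against the registered checksum."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        # Hash in 1 MiB chunks so large TSVs never need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest() == expected

# e.g. verify_md5('AesBench_VAL.tsv', '3edb0c319e9187aa0b97fe7a11700a8c')
```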
@@ -154,7 +154,7 @@ def DATASET_TYPE(dataset):
         return 'VideoQA'
     elif listinstr([
         'mmbench', 'seedbench', 'ccbench', 'mmmu', 'scienceqa', 'ai2d',
-        'mmstar', 'realworldqa', 'mmt-bench'
+        'mmstar', 'realworldqa', 'mmt-bench', 'aesbench'
     ], dataset):
         return 'multi-choice'
     elif listinstr(['mme', 'hallusion', 'pope'], dataset):
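The hunk above routes any dataset whose name contains `aesbench` to the multi-choice track. A minimal sketch of this keyword dispatch is shown below; `listinstr` is re-implemented here as a simplified stand-in for the helper in `vlmeval.smp`, and only a few of the real function's branches are kept:

```python
def listinstr(lst, s):
    """True if any keyword in lst occurs as a substring of s."""
    return any(item in s for item in lst)

def dataset_type(dataset):
    # Simplified: the real DATASET_TYPE checks video benchmarks first
    # and covers more branches than the three shown here.
    dataset = dataset.lower()
    if listinstr(['mmbench', 'mmstar', 'realworldqa', 'mmt-bench', 'aesbench'], dataset):
        return 'multi-choice'
    if listinstr(['mme', 'hallusion', 'pope'], dataset):
        return 'Y/N'
    return 'VQA'  # fallback for everything else in this sketch

print(dataset_type('AesBench_VAL'))  # -> multi-choice
```

Because the match is a substring test on the lowercased name, every AesBench split (e.g. `AesBench_VAL`) picks up the multi-choice evaluation path without further registration.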

vlmeval/evaluate/__init__.py — 1 addition, 1 deletion

@@ -6,5 +6,5 @@
 from .mathvista_eval import MathVista_eval
 from .llavabench import LLaVABench_eval
 from .misc import build_judge
-from .ocrbench import OCRBench_eval
+from .ocrbench_eval import OCRBench_eval
 from .mmbench_video import MMBenchVideo_eval

vlmeval/evaluate/ocrbench_eval.py — new file, 65 additions

from vlmeval.smp import *


def OCRBench_eval(eval_file):
    # Per-category hit counters for the ten OCRBench task categories
    OCRBench_score = {
        'Regular Text Recognition': 0,
        'Irregular Text Recognition': 0,
        'Artistic Text Recognition': 0,
        'Handwriting Recognition': 0,
        'Digit String Recognition': 0,
        'Non-Semantic Text Recognition': 0,
        'Scene Text-centric VQA': 0,
        'Doc-oriented VQA': 0,
        'Key Information Extraction': 0,
        'Handwritten Mathematical Expression Recognition': 0
    }

    logger = get_logger('Evaluation')

    data = load(eval_file)
    lt = len(data)
    lines = [data.iloc[i] for i in range(lt)]
    for i in tqdm(range(len(lines))):
        line = lines[i]
        predict = str(line['prediction'])
        answers = eval(line['answer'])  # the answer field stores a stringified list
        category = line['category']
        if category == 'Handwritten Mathematical Expression Recognition':
            # Handwritten math: compare with all whitespace removed
            for j in range(len(answers)):
                answer = answers[j].strip().replace('\n', ' ').replace(' ', '')
                predict = predict.strip().replace('\n', ' ').replace(' ', '')
                if answer in predict:
                    OCRBench_score[category] += 1
                    break
        else:
            # All other categories: case-insensitive substring match
            for j in range(len(answers)):
                answer = answers[j].lower().strip().replace('\n', ' ')
                predict = predict.lower().strip().replace('\n', ' ')
                if answer in predict:
                    OCRBench_score[category] += 1
                    break

    # Aggregate the six recognition sub-categories into one 'Text Recognition' score
    final_score_dict = {}
    final_score_dict['Text Recognition'] = (
        OCRBench_score['Regular Text Recognition'] + OCRBench_score['Irregular Text Recognition']
        + OCRBench_score['Artistic Text Recognition'] + OCRBench_score['Handwriting Recognition']
        + OCRBench_score['Digit String Recognition'] + OCRBench_score['Non-Semantic Text Recognition']
    )
    final_score_dict['Scene Text-centric VQA'] = OCRBench_score['Scene Text-centric VQA']
    final_score_dict['Doc-oriented VQA'] = OCRBench_score['Doc-oriented VQA']
    final_score_dict['Key Information Extraction'] = OCRBench_score['Key Information Extraction']
    final_score_dict['Handwritten Mathematical Expression Recognition'] = \
        OCRBench_score['Handwritten Mathematical Expression Recognition']
    final_score_dict['Final Score'] = (
        final_score_dict['Text Recognition'] + final_score_dict['Scene Text-centric VQA']
        + final_score_dict['Doc-oriented VQA'] + final_score_dict['Key Information Extraction']
        + final_score_dict['Handwritten Mathematical Expression Recognition']
    )
    final_score_dict['Final Score Norm'] = float(final_score_dict['Final Score']) / 10
    score_pth = eval_file.replace('.xlsx', '_score.json')
    dump(final_score_dict, score_pth)
    logger.info(f'OCRBench_eval successfully finished evaluating {eval_file}, results saved in {score_pth}')
    logger.info('Score: ')
    for key, value in final_score_dict.items():
        logger.info('{}:{}'.format(key, value))
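The scorer above counts a hit whenever any ground-truth answer appears as a substring of the prediction, with all whitespace stripped for handwritten math and case folding everywhere else. A self-contained sketch of just that matching rule, using made-up predictions rather than a real eval file:

```python
def ocr_match(prediction, answers, category):
    """Return True if any reference answer is contained in the prediction."""
    if category == 'Handwritten Mathematical Expression Recognition':
        # Handwritten math: whitespace-insensitive comparison
        pred = prediction.strip().replace('\n', ' ').replace(' ', '')
        return any(a.strip().replace('\n', ' ').replace(' ', '') in pred for a in answers)
    # All other categories: case-insensitive substring match
    pred = prediction.lower().strip().replace('\n', ' ')
    return any(a.lower().strip().replace('\n', ' ') in pred for a in answers)

print(ocr_match('The sign reads STOP', ['stop'], 'Scene Text-centric VQA'))  # True
print(ocr_match('x ^ 2 + 1', ['x^2+1'], 'Handwritten Mathematical Expression Recognition'))  # True
```

Substring matching is deliberately lenient: a verbose prediction that merely contains the answer still scores, which suits free-form VLM outputs but means unrelated surrounding text is never penalized.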
