This repository complements the paper Large Multimodal Models Evaluation: A Survey and organizes benchmarks and resources across understanding (general and specialized), generation, and community platforms. It serves as a hub for researchers to find key datasets, papers, and code.
We will continuously maintain and update this repo to ensure long-term value for the community.
Paper: [SCIS](https://www.sciengine.com/SCIS/doi/10.1007/s11432-025-4676-4) | Project Page: AIBench / LMM Evaluation Survey
We welcome pull requests (PRs)! If you contribute five or more valid benchmarks with the relevant details (name, paper link, and project page, matching the `| Benchmark | Paper | Project Page |` format of the tables below), your contribution will be acknowledged in the next update of the paper's Acknowledgment section.
Come and join us!

If you find our work useful, please give the repo a star, and cite our paper as:
```bibtex
@article{zhang2025large,
  author  = {Zhang, Zicheng and Wang, Junying and Wen, Farong and Guo, Yijin and Zhao, Xiangyu and Fang, Xinyu and Ding, Shengyuan and Jia, Ziheng and Xiao, Jiahao and Shen, Ye and Zheng, Yushuo and Zhu, Xiaorong and Wu, Yalun and Jiao, Ziheng and Sun, Wei and Chen, Zijian and Zhang, Kaiwei and Fu, Kang and Cao, Yuqin and Hu, Ming and Zhou, Yue and Zhou, Xuemei and Cao, Juntai and Zhou, Wei and Cao, Jinyu and Li, Ronghui and Zhou, Donghao and Tian, Yuan and Zhu, Xiangyang and Li, Chunyi and Wu, Haoning and Liu, Xiaohong and He, Junjun and Zhou, Yu and Liu, Hui and Zhang, Lin and Wang, Zesheng and Duan, Huiyu and Zhou, Yingjie and Min, Xiongkuo and Jia, Qi and Zhou, Dongzhan and Zhang, Wenlong and Cao, Jiezhang and Yang, Xue and Yu, Junzhi and Zhang, Songyang and Duan, Haodong and Zhai, Guangtao},
  title   = {Large Multimodal Models Evaluation: A Survey},
  journal = {SCIENCE CHINA Information Sciences},
  year    = {2025},
  url     = {https://www.sciengine.com/SCIS/doi/10.1007/s11432-025-4676-4},
  doi     = {10.1007/s11432-025-4676-4}
}
```
### Medical and Biomedical Benchmarks

| Benchmark | Paper | Project Page |
|---|---|---|
| VQA-RAD | A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images | Project Page |
| PathVQA | PathVQA: 30000+ Questions for Medical Visual Question Answering | GitHub |
| RP3D-DiagDS | RP3D-DiagDS: 3D Medical Diagnosis Dataset | Project Page |
| PubMedQA | PubMedQA: A Dataset for Biomedical Research Question Answering | Project Page |
| HealthBench | HealthBench: Evaluating Large Language Models Towards Improved Human Health | Project Page |
| GMAI-MMBench | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | Project Page |
| OpenMM-Medical | OpenMM-Medical: Open Medical Multimodal Model | GitHub |
| Genomics-Long-Range | Genomics-Long-Range: Long-Range Genomic Benchmark | Hugging Face |
| Genome-Bench | Genome-Bench: Comprehensive Genomics Benchmark | Hugging Face |
| MedAgentsBench | MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | GitHub |
| MedQ-Bench | MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs | GitHub |
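
Several of the entries above (e.g., Genomics-Long-Range, Genome-Bench) are hosted on Hugging Face and can typically be pulled with the `datasets` library. A minimal sketch; the repository ID below is a placeholder, not the benchmark's actual ID, so substitute the one from the benchmark's Hugging Face page:

```python
# Sketch: loading a Hugging Face-hosted benchmark with the `datasets` library.
# The repo ID is a placeholder -- take the real ID from the benchmark's HF page.
from datasets import load_dataset

ds = load_dataset("some-org/genome-bench")  # hypothetical repo ID
print(ds["train"][0])                       # inspect a single example
```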
### Multimodal Code Generation Benchmarks

| Benchmark | Paper | Project Page |
|---|---|---|
| Design2Code | Design2Code: How Far Are We From Automating Front-End Engineering? | Project Page |
| Web2Code | Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | Project Page |
| Plot2Code | Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots | Hugging Face |
| ChartMimic | ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | Project Page |
| HumanEval-V | HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models through Coding Tasks | Project Page |
| Code-Vision | Code-Vision: Visual Code Understanding | GitHub |
| SWE-bench Multimodal | SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | Project Page |
| MMCode | MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems | GitHub |
| M²Eval | M²Eval: Multimodal Code Evaluation | GitHub |
| BigDocs-Bench | BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks | Project Page |
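
Most of the code-generation benchmarks above share one evaluation recipe: show the model an image (a design mockup, chart, or webpage screenshot), collect the generated code, then execute or render it and compare against the reference. The sketch below shows only the skeleton of that loop with a naive executability check; the field names and model stub are assumptions, not any benchmark's real harness, and real benchmarks compare rendered output rather than mere executability:

```python
# Hypothetical skeleton of an image-to-code evaluation loop. Field names
# ("image") and the model call are illustrative, not a real benchmark schema.

def generate_code(image_path: str, prompt: str) -> str:
    # Stub standing in for a large-multimodal-model call.
    return "print('rendered chart placeholder')"

def run_benchmark(examples: list[dict]) -> float:
    passed = 0
    for ex in examples:
        candidate = generate_code(ex["image"], "Reproduce this figure as code.")
        try:
            # Naive check: the candidate must at least execute without error.
            exec(compile(candidate, "<candidate>", "exec"), {})
            passed += 1
        except Exception:
            pass  # code that fails to run counts as a miss
    return passed / max(len(examples), 1)

if __name__ == "__main__":
    toy = [{"image": "chart_001.png"}, {"image": "chart_002.png"}]
    print(f"executable rate: {run_benchmark(toy):.2f}")
```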
### Retrieval and Agent Memory Resources

| Resource | Paper | Project Page |
|---|---|---|
| Self-RAG | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | GitHub |
| A-MEM | A-MEM: Agentic Memory for LLM Agents | GitHub |
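
For orientation, both resources build on the retrieve-then-generate pattern: fetch the most relevant context, then condition the model on it (Self-RAG additionally learns when to retrieve and how to critique its own output; A-MEM manages a growing agent memory). A toy sketch of the generic retrieval step only; the embedding is deliberately trivial and everything here is illustrative, not either method's actual implementation:

```python
# Toy retrieve-then-generate skeleton (generic RAG pattern, not Self-RAG's
# actual method, which adds learned reflection tokens on top of this).
import math

def embed(text: str) -> list[float]:
    # Deliberately trivial bag-of-letters embedding; real systems use a
    # learned text encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query (vectors are unit-norm).
    q = embed(query)
    return sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, embed(d))))[:k]

docs = [
    "VQA-RAD targets visual question answering on radiology images.",
    "ChartMimic scores chart-to-code generation.",
]
context = "\n".join(retrieve("radiology question answering", docs))
prompt = f"Use the context to answer.\n{context}\nQ: Which benchmark covers radiology?"
print(prompt)  # hand `prompt` to any LLM; generation is out of scope here
```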
