Large Multimodal Models Evaluation: A Survey

This repository complements the paper Large Multimodal Models Evaluation: A Survey and organizes benchmarks and resources across understanding (general and specialized), generation, and community platforms. It serves as a hub for researchers to find key datasets, papers, and code.

We will continuously maintain and update this repo to ensure long-term value for the community.

Overview

Paper: SCIS
Project Page: AIBench / LMM Evaluation Survey


Contributions

We welcome pull requests (PRs)! If you contribute five or more valid benchmarks with relevant details, your contribution will be acknowledged in the next update of the paper's Acknowledgment section.
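For reference, a new benchmark entry follows the same three-column layout used in the tables below: benchmark name, paper title, and a link label pointing to the project page. A hypothetical example (the benchmark name and title here are placeholders, not a real entry):

Benchmark Paper Project Page
Foo-Bench Foo-Bench: A Placeholder Benchmark Illustrating the Entry Format Github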

Come and join us!

If you find our work useful, please give us a star. Thank you!


📖 Citation

If you find our work useful, please cite our paper as:

@article{zhang2025large,
  author    = {Zhang, Zicheng and Wang, Junying and Wen, Farong and Guo, Yijin and Zhao, Xiangyu and Fang, Xinyu and Ding, Shengyuan and Jia, Ziheng and Xiao, Jiahao and Shen, Ye and Zheng, Yushuo and Zhu, Xiaorong and Wu, Yalun and Jiao, Ziheng and Sun, Wei and Chen, Zijian and Zhang, Kaiwei and Fu, Kang and Cao, Yuqin and Hu, Ming and Zhou, Yue and Zhou, Xuemei and Cao, Juntai and Zhou, Wei and Cao, Jinyu and Li, Ronghui and Zhou, Donghao and Tian, Yuan and Zhu, Xiangyang and Li, Chunyi and Wu, Haoning and Liu, Xiaohong and He, Junjun and Zhou, Yu and Liu, Hui and Zhang, Lin and Wang, Zesheng and Duan, Huiyu and Zhou, Yingjie and Min, Xiongkuo and Jia, Qi and Zhou, Dongzhan and Zhang, Wenlong and Cao, Jiezhang and Yang, Xue and Yu, Junzhi and Zhang, Songyang and Duan, Haodong and Zhai, Guangtao},
  title     = {Large Multimodal Models Evaluation: A Survey},
  journal   = {SCIENCE CHINA Information Sciences},
  year      = {2025},
  url       = {https://www.sciengine.com/SCIS/doi/10.1007/s11432-025-4676-4},
  doi       = {10.1007/s11432-025-4676-4}
}

Table of Contents


Understanding Evaluation

General

Adaptability

Benchmark Paper Project Page
LLaVA-Bench Visual instruction tuning Github
MIA-Bench Mia-bench: Towards better instruction following evaluation of multimodal llms Github
MM-IFEval MM-IFEngine: Towards Multimodal Instruction Following Github
VisIT-Bench VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use Github
MMDU MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Github
ConvBench ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models Github
SIMMC 2.0 SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations Github
Mementos Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences Github
MUIRBENCH MUIRBENCH: A Comprehensive Benchmark for Robust Multi-image Understanding Github
MMIU MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Github
MIRB Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning Github
MIBench MIBench: Evaluating Multimodal Large Language Models over Multiple Images Hugging Face
II-Bench II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models Github
Mantis Mantis: Interleaved Multi-Image Instruction Tuning Github
MileBench MILEBENCH: Benchmarking MLLMs in Long Context Github
ReMI ReMI: A Dataset for Reasoning with Multiple Images Hugging Face
CODIS CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models Github
SPARKLES SPARKLES: UNLOCKING CHATS ACROSS MULTIPLE IMAGES FOR MULTIMODAL INSTRUCTION-FOLLOWING MODELS Github
MMIE MMIE: MASSIVE MULTIMODAL INTERLEAVED COMPREHENSION BENCHMARK FOR LARGE VISION-LANGUAGE MODELS Github
InterleavedBench Holistic Evaluation for Interleaved Text-and-Image Generation Hugging Face
OpenING OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Github
HumaniBench HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation Github
Herm-Bench HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding Github
UNIAA UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark Github
HumanBeauty HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment Github
SocialIQA SOCIAL IQA: Commonsense Reasoning about Social Interactions Hugging Face
EmpathicStories EmpathicStories++: A Multimodal Dataset for Empathy towards Personal Experiences Dataset Download
Chatbot Arena Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference Not available
OpenAssistant Conversations OpenAssistant Conversations - Democratizing Large Language Model Alignment Github
HCE Human-Centric Evaluation for Foundation Models Github

Basic Ability

Benchmark Paper Project Page
NWPU-MOC NWPU-MOC: A Benchmark for Fine-grained Multi-category Object Counting in Aerial Images Github
T2V-ComBench T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation Github
ConceptMix ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty Github
PICD Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding Github
TextVQA Towards VQA Models That Can Read Github
OCR-VQA OCR-VQA: Visual Question Answering by Reading Text in Images Dataset download
OCRBench OCRBENCH: ON THE HIDDEN MYSTERY OF OCR IN LARGE MULTIMODAL MODELS Github
OCRBench v2 OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning Github
ASCIIEval VISUAL PERCEPTION IN TEXT STRINGS Github
OCRReasoning OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning Github
M4-ViteVQA Towards Video Text Visual Question Answering: Benchmark and Baseline Github
SEED-Bench-2-Plus SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Github
MMDocBench MMDOCBENCH: BENCHMARKING LARGE VISION-LANGUAGE MODELS FOR FINE-GRAINED VISUAL DOCUMENT UNDERSTANDING Github
MMLongBench-Doc MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations Github
UDA UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis Github
VisualMRC VisualMRC: Machine Reading Comprehension on Document Images Github
DocVQA DocVQA: A Dataset for VQA on Document Images Hugging Face
DocGenome DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models Github
GDI-Bench GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling Dataset download
AitW Android in the Wild: A Large-Scale Dataset for Android Device Control Github
ScreenSpot SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents Github
VisualWebBench VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? Github
GUI-WORLD GUI-WORLD: A VIDEO BENCHMARK AND DATASET FOR MULTIMODAL GUI-ORIENTED UNDERSTANDING Github
WebUIBench WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code Github
ScreenQA ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots Github
ChartQA ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning Github
ChartQApro CHARTQAPRO: A More Diverse and Challenging Benchmark for Chart Question Answering Github
ComTQA TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy Github
TableVQA-Bench TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains Github
CharXiv CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Github
SciFIBench SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation Github
AI2D-RST AI2D-RST: A multimodal corpus of 1000 primary school science diagrams Github
InfoChartQA InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts Github
EvoChart-QA EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding Github
WikiMixQA WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts Github
ChartX ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning Github
Q-Bench Q-BENCH: A BENCHMARK FOR GENERAL-PURPOSE FOUNDATION MODELS ON LOW-LEVEL VISION Github
A-Bench A-BENCH: ARE LMMS MASTERS AT EVALUATING AI-GENERATED IMAGES? Github
MVP-Bench MVP-Bench: Can Large Vision–Language Models Conduct Multi-level Visual Perception Like Humans? Github
XLRS-Bench XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? Github
HR-Bench Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models Github
MME-RealWorld MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Github
V*Bench V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs Github
FaceBench FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs Github
MMAFFBen MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs Github
FABA-Bench Facial Affective Behavior Analysis with Instruction Tuning Github
MEMO-Bench MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis Github
EmoBench EmoBench: Evaluating the Emotional Intelligence of Large Language Models Github
EEmo-Bench EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment Github
AesBench AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception Github
UNIAA-Bench UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark Github
ImplicitAVE ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction Github
II-Bench II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models Hugging Face
CogBench A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models Github
A4Bench Affordance Benchmark for MLLMs Github
MM-SAP MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception Github
Cambrian-1 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Github
MMUBench Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models Not available
MMVP Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Github
MICBench Towards Open-ended Visual Quality Comparison Github
CulturalVQA Benchmarking Vision Language Models for Cultural Understanding Hugging Face
RefCOCO Family Modeling Context in Referring Expressions / Generation and Comprehension of Unambiguous Object Descriptions Github / Github
Ref-L4 Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models Github
MRES-32M Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation Github
UrBench UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios Github
COUNTS COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts Github
MTVQA MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering Github
GePBench GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models Not available
SpatialMQA Can Multimodal Large Language Models Understand Spatial Relations? Github
SpatialRGPT-Bench SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models Github
CoSpace CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models Github
MLLM-CompBench MLLM-COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs Github
SOK-Bench SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge Github
GSR-Bench GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs Not available
What's "up" What’s “up” with vision-language models? Investigating their struggle with spatial reasoning Github
Q-Spatial Bench Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models Github
AS-V2 The All-Seeing Project V2: Towards General Relation Comprehension of the Open World Github
Visual CoT Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning Github
LogicVista LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts Github
VisuLogic VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models Github
CoMT CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models Github
PUZZLES PUZZLES: A Benchmark for Neural Algorithmic Reasoning Github
LOVA3 LOVA3: Learning to Visual Question Answering, Asking and Assessment Github
VLKEB VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark Github
MMKE-Bench MMKE-BENCH: A MULTIMODAL EDITING BENCHMARK FOR DIVERSE VISUAL KNOWLEDGE Github
MC-MKE MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency / MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing Not available
NegVQA NegVQA: Can Vision Language Models Understand Negation? Github
LongBench LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding Github
OPOR-Bench OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation Not available
VRT-Bench Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark Github

Comprehensive Perception

Benchmark Paper Project Page
LVLM-eHub Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. GitHub
TinyLVLM-eHub Tinylvlm-ehub: Towards comprehensive and efficient evaluation for large vision-language models. GitHub
LAMM Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. GitHub
MME Mme: A comprehensive evaluation benchmark for multimodal large language models. Project Page
MMBench Mmbench: Is your multi-modal model an all-around player? GitHub
SEED-Bench series SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. GitHub
MMT-Bench Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. GitHub
LMMs-Eval Lmms-eval: Reality check on the evaluation of large multimodal models. GitHub
MMStar Are we on the right way for evaluating large vision-language models? GitHub
NaturalBench Naturalbench: Evaluating vision-language models on natural adversarial samples. Project Page
MM-Vet Mm-vet: Evaluating large multimodal models for integrated capabilities. GitHub
ChEF Chef: A comprehensive evaluation framework for standardized assessment of multimodal large language models. GitHub
Video-MME Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. GitHub
MMBench-Video Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. GitHub
MVBench Mvbench: A comprehensive multi-modal video understanding benchmark. Hugging Face
LongVideoBench Longvideobench: A benchmark for long-context interleaved video-language understanding. GitHub
LVBench Lvbench: An extreme long video understanding benchmark. GitHub
MotionBench Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. GitHub
AudioBench Audiobench: A universal benchmark for audio large language models. GitHub
AIR-Bench Air-bench: Benchmarking large audio-language models via generative comprehension. GitHub
Dynamic-SUPERB Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech. GitHub
M3DBench M3dbench: Let's instruct large models with multi-modal 3d prompts. GitHub
M3D M3d: Advancing 3d medical image analysis with multi-modal large language models. GitHub
Space3D-Bench Space3d-bench: Spatial 3d question answering benchmark. Project Page

General Knowledge

Benchmark Paper Project Page
ScienceQA Learn to explain: Multimodal reasoning via thought chains for science question answering. GitHub
CMMU Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. GitHub
Scibench Scibench: Evaluating college-level scientific problem-solving abilities of large language models. GitHub
EXAMS-V Exams-v: A multi-discipline multilingual multi-modal exam benchmark for evaluating vision language models. Hugging Face
MMMU Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. GitHub
MMMU-Pro Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. Hugging Face
HLE Humanity's last exam. Project Page
CURIE Curie: Evaluating llms on multitask scientific long context understanding and reasoning. GitHub
SFE Scientists' first exam: Probing cognitive abilities of mllm via perception, understanding, and reasoning. Hugging Face
MMIE Mmie: Massive multimodal interleaved comprehension benchmark for large vision-language models. GitHub
MDK12-Bench Mdk12-bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models. GitHub
EESE The ever-evolving science exam. Project Page
Q-Mirror Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs GitHub

Safety

Benchmark Paper Project Page
Unicorn How many unicorns are in this image? a safety evaluation benchmark for vision llms. GitHub
JailbreakV-28K Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. GitHub
MM-SafetyBench Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. GitHub
AVIBench Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions. GitHub
MMJ-Bench MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models. GitHub
USB Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models. GitHub
MLLMGuard Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. GitHub
SafeBench Safebench: A safety evaluation framework for multimodal large language models. GitHub
MemeSafetyBench Are vision-language models safe in the wild? a meme-based benchmark study. Hugging Face
UnsafeBench Unsafebench: Benchmarking image safety classifiers on real-world and ai-generated images. Not available
POPE Evaluating object hallucination in large vision-language models. GitHub
M-HalDetect Detecting and preventing hallucinations in large vision language models. Not available
Hal-Eval Hal-eval: A universal and fine-grained hallucination evaluation framework for large vision language models. GitHub
Hallu-pi Hallu-pi: Evaluating hallucination in multi-modal large language models within perturbed inputs. Not available
BEAF Beaf: Observing before-after changes to evaluate hallucination in vision-language models. GitHub
HallusionBench Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. Project Page
AutoHallusion Autohallusion: Automatic generation of hallucination benchmarks for vision-language models. GitHub
MultiTrust Benchmarking trustworthiness of multimodal large language models: A comprehensive study. GitHub
MMDT Mmdt: Decoding the trustworthiness and safety of multimodal foundation models. GitHub
Text2VLM Text2vlm: Adapting text-only datasets to evaluate alignment training in visual language models. Not available
MOSSBench Mossbench: Is your multimodal language model oversensitive to safe queries? GitHub
CulturalVQA Benchmarking vision language models for cultural understanding. Project Page
ModScan Modscan: Measuring stereotypical bias in large vision-language models from vision and language modalities. GitHub
FMBench Fmbench: Benchmarking fairness in multimodal large language models on medical tasks. Not available
FairMedFM Fairmedfm: Fairness benchmarking for medical imaging foundation models. GitHub
FairCLIP Fairclip: Harnessing fairness in vision-language learning. GitHub
DoxingBench Doxing via the lens: Revealing privacy leakage in image geolocation for agentic multi-modal large reasoning model. Hugging Face
PrivQA Can language models be instructed to protect personal information? Not available
SHIELD Shield: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models. GitHub
ExtremeAIGC Extremeaigc: Benchmarking lmm vulnerability to ai-generated extremist content. Not available

Specialized

Math

Benchmark Paper Project Page
MathVista Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. GitHub
PolyMATH Polymath: A challenging multi-modal mathematical reasoning benchmark. Project Page
MATH-Vision Measuring multimodal mathematical reasoning with math-vision dataset. Project Page
Olympiad-Bench Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. GitHub
PolyMath Polymath: Evaluating mathematical reasoning in multilingual contexts. GitHub
Math-Verse Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? GitHub
WE-MATH We-math: Does your large multimodal model achieve human-like mathematical reasoning? GitHub
MathScape Mathscape: Evaluating mllms in multimodal math scenarios through a hierarchical benchmark. Github
CMM-Math Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. Hugging Face
MV-MATH Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. GitHub

Physics

Benchmark Paper Project Page
ScienceQA Learn to explain: Multimodal reasoning via thought chains for science question answering. GitHub
TQA Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. Not available
AI2D A diagram is worth a dozen images. Project Page
MM-PhyQA Mm-phyqa: Multimodal physics question answering with multi-image cot prompting. Not available
PhysUniBench Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models. Project Page
PhysicsArena Physicsarena: The first multimodal physics reasoning benchmark exploring variable, process, and solution dimensions. Hugging Face
SeePhys Seephys: Does seeing help thinking? benchmarking vision-based physics reasoning. GitHub
PhysReason Physreason: A comprehensive benchmark towards physics-based reasoning. Hugging Face
OlympiadBench Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. GitHub
SceMQA Scemqa: A scientific college entrance level multimodal question answering benchmark. GitHub
PACS Pacs: A dataset for physical audiovisual commonsense reasoning. GitHub
GRASP GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. GitHub
CausalVQA Causalvqa: A physically grounded causal reasoning benchmark for video models. GitHub
LiveXiv Livexiv: A multi-modal live benchmark based on arxiv papers content. GitHub
VideoScience-Bench Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench. GitHub

Chemistry

Benchmark Paper Project Page
SMILES Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Project Page
ChEBI-20 Text2mol: Cross-modal molecule retrieval with natural language queries. GitHub
ChemBench Chemllm: A chemical large language model. GitHub
SELFIES Self-referencing embedded strings (selfies): A 100% robust molecular string representation. GitHub
InChI Inchi, the iupac international chemical identifier. Project Page
MolX Molx: Enhancing large language models for molecular learning with a multi-modal extension. Not available
GiT-Mol Git-mol: A multi-modal large language model for molecular science with graph, image, and text. GitHub
Instruct-Mol Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. GitHub
ChEBI-20-MM A quantitative analysis of knowledge-learning preferences in large language models in molecular science. GitHub
MMCR-Bench Chemvlm: Exploring the power of multimodal large language models in chemistry area. GitHub
MACBench Probing the limitations of multimodal language models for chemistry and materials research. GitHub
3D-MoLM Towards 3d molecule-text interpretation in language models. GitHub
M3-20M M3-20m: A large-scale multi-modal molecule dataset for ai-driven drug design and discovery. GitHub
MassSpecGym Massspecgym: A benchmark for the discovery and identification of molecules. GitHub
MolPuzzle Can llms solve molecule puzzles? a multimodal benchmark for molecular structure elucidation. GitHub

Finance

Benchmark Paper Project Page
FinMME FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation Github
FAMMA FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering Project Page
MME-Finance MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning Project Page
MultiFinBen MultiFinBen: A Comprehensive Multimodal Financial Benchmark Github
CFBenchmark-MM CFBenchmark-MM: A Comprehensive Multimodal Financial Benchmark Github
FinMMR FinMMR: Multimodal Financial Reasoning Benchmark Github
Fin-Fact Fin-Fact: Financial Fact Checking Dataset Github
FCMR FCMR: Financial Multimodal Reasoning
FinTral FinTral: Financial Translation and Analysis Github
Open-FinLLMs Open-FinLLMs: Open Financial Large Language Models Hugging Face
FinGAIA FinGAIA: Financial AI Assistant Github

Healthcare & Medical Science

Benchmark Paper Project Page
VQA-RAD VQA-RAD: Visual Question Answering Radiology Dataset Project Page
PathVQA PathVQA: Pathology Visual Question Answering Github
RP3D-DiagDS RP3D-DiagDS: 3D Medical Diagnosis Dataset Project Page
PubMedQA PubMedQA: Medical Question Answering Dataset Project Page
HealthBench HealthBench: Medical AI Benchmark Project Page
GMAI-MMBench GMAI-MMBench: General Medical AI Multimodal Benchmark Project Page
OpenMM-Medical OpenMM-Medical: Open Medical Multimodal Model Github
Genomics-Long-Range Genomics-Long-Range: Long-Range Genomic Benchmark Hugging Face
Genome-Bench Genome-Bench: Comprehensive Genomics Benchmark Hugging Face
MedAgentsBench MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning Github
MedQ-Bench MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs Github

Code

Benchmark Paper Project Page
Design2Code Design2Code: From Design Mockups to Code Project Page
Web2Code Web2Code: Web-to-Code Generation Project Page
Plot2Code Plot2Code: From Charts to Code Hugging Face
ChartMimic ChartMimic: Chart Understanding and Generation Project Page
HumanEval-V HumanEval-V: Visual Code Generation Benchmark Project Page
Code-Vision Code-Vision: Visual Code Understanding Github
SWE-bench Multi-modal SWE-bench Multi-modal: Software Engineering Benchmark Project Page
MMCode MMCode: Multimodal Code Generation Github
M²Eval M²Eval: Multimodal Code Evaluation Github
BigDocs-Bench BigDocs: An open dataset for training multimodal models on document and code tasks. Project Page / GitHub

Autonomous Driving

Benchmark Paper Project Page
Rank2Tell Rank2Tell: Ranking-based Visual Storytelling Project Page
DRAMA DRAMA: Dynamic Risk Assessment for Autonomous Vehicles Project Page
NuScenes-QA NuScenes-QA: Autonomous Driving Question Answering Github
LingoQA LingoQA: Driving Language Understanding Github
V2V-LLM V2V-LLM: Vehicle-to-Vehicle Communication Project Page
MAPLM-QA MAPLM: Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Github
SURDS SURDS: Autonomous Driving Dataset Github
AD2-Bench AD2-Bench: Autonomous Driving Benchmark
DriveAction DriveAction: Driving Action Recognition Hugging Face
DriveLMM-o1 DriveLMM-o1: Driving Language Model Github
DriveVLM DriveVLM: Vision-Language Model for Driving Project Page
RoboTron-Sim RoboTron-Sim: Robot Simulation Platform Project Page
IDKB IDKB: Intelligent Driving Knowledge Base Project Page
VLADBench VLADBench: Vision-Language-Action Driving Benchmark Github
DriVQA DriVQA: Driving Visual Question Answering
ADGV-Bench Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework Not available

Earth Science / Remote Sensing

Benchmark Paper Project Page
GEOBench-VLM GEOBench-VLM: Geospatial Vision-Language Model Benchmark Project Page
ClimaQA ClimaQA: Climate Question Answering Github
ClimateBERT ClimateBERT: Climate Language Model Project Page
WeatherQA WeatherQA: Weather Question Answering Github
OceanBench OceanBench: Ocean Data Analysis Benchmark Project Page
OmniEarth-Bench OmniEarth-Bench: Comprehensive Earth Observation Hugging Face
MSEarth MSEarth: Multi-Scale Earth Observation Github
EarthSE EarthSE: Earth System Evaluation Hugging Face
RSICD RSICD: Remote Sensing Image Captioning Dataset Github
NWPU-Captions NWPU-Captions: Remote Sensing Image Descriptions Github
RSVQA-HRBEN/LRBEN RSVQA: Remote Sensing Visual Question Answering Project Page
DIOR-RSVG DIOR-RSVG: Remote Sensing Visual Grounding Github
VRSBench VRSBench: Visual Remote Sensing Benchmark Project Page
LRS-VQA LRS-VQA: Large-scale Remote Sensing VQA Github
GeoChat-Bench GeoChat-Bench: Geospatial Conversation Benchmark Github
XLRS-Bench XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? Project Page
RSIEval RSIEval: Remote Sensing Image Evaluation Github
UrBench UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios Project Page
CHOICE CHOICE: Comprehensive Remote Sensing Benchmark Github
SARChat-Bench-2M SARChat-Bench-2M: SAR Image Understanding Github
LHRS-Bench LHRS-Bench: Large-scale High-Resolution Remote Sensing Github
FIT-RSFG FIT-RSFG: Remote Sensing Fine-Grained Recognition Github
VLEO-Bench VLEO-Bench: Very Low Earth Orbit Benchmark Project Page
NAIP-OSM NAIP-OSM: Aerial Imagery and Map Alignment Project Page

Embodied Tasks

Benchmark Paper Project Page
Embodied Question Answering (EQA) EXPRESS-Bench: Embodied Question Answering Project Page
R2R (Room-to-Room) R2R: Room-to-Room Navigation Github
Reverie Reverie: Remote Embodied Visual Referring Expression Project Page
Alfred Alfred: A Benchmark for Interpreting Grounded Instructions Github
Calvin Calvin: Long-Horizon Language-Conditioned Robot Learning Github
EPIC-KITCHENS EPIC-KITCHENS: Large Scale Dataset in First Person Vision Project Page
Ego4D Ego4D: Around the World in 3,000 Hours Project Page
EMQA EMQA: Ego-centric Multimodal Question Answering Github
SQA3D SQA3D: Situated Question Answering in 3D Scenes Project Page
Open-EQA Open-EQA: Open-Vocabulary Embodied Question Answering Project Page
HM-EQA HM-EQA: Hierarchical Multi-modal Embodied QA Project Page
MOTIF MOTIF: Multimodal Object-Text Interaction Framework Github
EgoTaskQA EgoTaskQA: Understanding Tasks in Egocentric Videos Project Page
EmbodiedScan EmbodiedScan: Holistic Multi-Modal 3D Perception Project Page
RH20T-P RH20T-P: Robotic Manipulation Dataset Project Page
EXPRESS-Bench EXPRESS-Bench: Embodied Question Answering Github
EmbodiedEval EmbodiedEval: Embodied AI Evaluation Project Page
Embodied Bench Embodied Bench: Comprehensive Embodied AI Evaluation Project Page
VLABench VLABench: Vision-Language-Action Benchmark Project Page
EWMBench EWMBench: Embodied World Model Benchmark Github
NeurIPS 2025 Embodied Agent Interface Challenge NeurIPS 2025 Embodied Agent Interface Challenge Project Page
SEER-Bench Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration Not available
ReMindView-Bench Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective Github

AI Agent

Benchmark Paper Project Page
Self-RAG Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection GitHub
A-MEM A-MEM: Agentic Memory for LLM Agents GitHub

Generation Evaluation

Image

Benchmark Paper Project Page
DiffusionDB DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models GitHub
HPD, HPD v2, HPS Human Preference Score: Better Aligning Text-to-Image Models with Human Preference GitHub
ImageReward, ImageReward/ReFL ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation GitHub
Pick-A-Pic, PickScore Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation GitHub
AGIQA-1K A Perceptual Quality Assessment Exploration for AIGC Images GitHub
AGIQA-3K AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment GitHub
AIGCIQA2023 AIGCIQA2023: A Large-scale Image Quality Assessment Database for AI Generated Images: from the Perspectives of Quality, Authenticity and Correspondence Hugging Face
AGIN, JOINT Exploring the Naturalness of AI-Generated Images GitHub
AIGIQA-20K AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment Hugging Face
AIGCOIQA2024 AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images GitHub
CMC-Bench CMC-Bench: Towards a New Paradigm of Visual Signal Compression GitHub
PKU-I2IQA, NR/FR-AIGCIQA PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images GitHub
SeeTRUE What You See is What You Read? Improving Text-Image Alignment Evaluation GitHub
AIGCIQA2023+, MINT-IQA Quality Assessment for AI Generated Images with Instruction Tuning GitHub
Q-Eval-100K, Q-Eval Score Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content GitHub
Measuring the Quality of Text-to-Video Model Outputs Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset Dataset Download
EvalCrafter EvalCrafter: Benchmarking and Evaluating Large Video Generation Models GitHub
FETV FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation GitHub
VBench VBench: Comprehensive Benchmark Suite for Video Generative Models GitHub
T2VQA-DB Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment GitHub
GAIA GAIA: Rethinking Action Quality Assessment for AI-Generated Videos GitHub
AIGVQA-DB, AIGV-Assessor AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM GitHub
AIGVE-60K, LOVE LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation GitHub
Human-AGVQA-DB, GHVQ Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric GitHub
TDVE-DB, TDVE-Assessor TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs GitHub
AGAVQA-3K, AGAV-Rater AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment GitHub
Qwen-ALLD Audio Large Language Models Can Be Descriptive Speech Quality Evaluators Hugging Face
BASE-TTS BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Audio samples of BASE-TTS
ATT Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese Hugging Face
TTSDS2 TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems Website
MATE-3D, LGVQ Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation GitHub
3DGCQA 3DGCQA: A Quality Assessment Database for 3D AI-Generated Contents GitHub
AIGC-T23DAQA Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model GitHub
SI23DCQA SI23DCQA: Perceptual Quality Assessment of Single Image-to-3D Content GitHub
3DGS-IEval-15K 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting GitHub
Inception Score (IS) Improved Techniques for Training GANs (see the formula note after this table) GitHub
FVD Towards Accurate Generative Models of Video: A New Metric & Challenges (see the formula note after this table) GitHub
VQAScore Evaluating Text-to-Visual Generation with Image-to-Text Generation GitHub
NTIRE 2024 AIGC QA NTIRE 2024 Quality Assessment of AI-Generated Content Challenge Website
Q-Bench Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision GitHub
Q-instruct Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models GitHub
Q-align Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels GitHub
Q-boost Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models Project Page
Co-Instruct Towards Open-ended Visual Quality Comparison Hugging Face
DepictQA Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models GitHub
M3-AGIQA M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment GitHub
Q-Refine Q-Refine: A Perceptual Quality Refiner for AI-Generated Image GitHub
AGIQA Large Multi-modality Model Assisted AI-Generated Image Quality Assessment GitHub
SF-IQA SF-IQA: Quality and Similarity Integration for AI Generated Image Quality Assessment GitHub
SC-AGIQA Text-Visual Semantic Constrained AI-Generated Image Quality Assessment GitHub
TSP-MGS AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity Not available
MoE-AGIQA MoE-AGIQA: Mixture-of-Experts Boosted Visual Perception-Driven and Semantic-Aware Quality Assessment for AI-Generated Images GitHub
AMFF-Net Adaptive Mixed-Scale Feature Fusion Network for Blind AI-Generated Image Quality Assessment GitHub
PSCR PSCR: Patches Sampling-based Contrastive Regression for AIGC Image Quality Assessment GitHub
TIER TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment GitHub
IPCE AIGC Image Quality Assessment via Image-Prompt Correspondence GitHub
RISEBench Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing GitHub
GoT GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing GitHub
SmartEdit SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models GitHub
WISE WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation GitHub
KRIS-Bench KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models GitHub
CoT-editing Enhancing Image Editing with Chain-of-Thought Reasoning and Multimodal Large Language Models Not available
GUIZoom-Bench Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding Github
MICo-Bench MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition Github
CS-Bench START: Spatial and Textual Learning for Chart Understanding Not available
IF-Bench IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting Github
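As a brief formula note for the two classical metrics listed above (this is standard background, not the survey's own formulation): the Inception Score rates generated images x ~ p_g by the KL divergence between a pretrained classifier's label posterior and its marginal label distribution,

IS = \exp\!\left( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\!\left( p(y \mid x) \,\|\, p(y) \right) \right),

while FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, with means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g:

\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right).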

Video

Benchmark Paper Project Page
LMM-VQA LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models GitHub
FineVQ FineVQ: Fine-Grained User Generated Content Video Quality Assessment GitHub
VQA^2 VQA^2: Visual Question Answering for Video Quality Assessment GitHub
Omni-VQA Scaling-up Perceptual Video Quality Assessment Not available
LMM-PVQA Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision GitHub
Compare2Score paradigm Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare GitHub
VQ-Insight VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning GitHub
Who is a Better Talker Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads GitHub
THQA THQA: A Perceptual Quality Assessment Database for Talking Heads GitHub
Who is a Better Imitator Who is a Better Imitator: Subjective and Objective Quality Assessment of Animated Humans GitHub
MI3S MI3S: A multimodal large language model assisted quality assessment framework for AI-generated talking heads Not available
An Implementation of Multimodal Fusion System An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation GitHub
RULER-Bench RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence GitHub
PAI-Bench PAI-Bench: A Comprehensive Benchmark For Physical AI Github
Tri-Bench Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference Github
OpenVE-Bench OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing Github
RVE-Bench ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning Github

Audio

Benchmark Paper Project Page
MOSNet MOSNet: Deep Learning-based Objective Assessment for Voice Conversion GitHub
MOSA-Net+ A Study on Incorporating Whisper for Robust Speech Assessment Hugging Face
MOSLight MOSLight: A Lightweight Data-Efficient System for Non-Intrusive Speech Quality Assessment Not available
MBNet MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network GitHub
DeePMOS DeePMOS: Deep Posterior Mean-Opinion-Score of Speech GitHub
LDNet LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech GitHub
ADTMOS ADTMOS – Synthesized Speech Quality Assessment Based on Audio Distortion Tokens GitHub
UAMOS Uncertainty-Aware Mean Opinion Score Prediction Not available
Audiobox Aesthetics Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound GitHub
HighRateMOS HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment Not available
ALLMs-as-Judges Audio-Aware Large Language Models as Judges for Speaking Styles Hugging Face
SALMONN SALMONN: Towards Generic Hearing Abilities for Large Language Models GitHub
Qwen-Audio Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models GitHub
Qwen2-Audio Qwen2-Audio Technical Report GitHub
natural language quality descriptions Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation GitHub
QualiSpeech QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions Hugging Face
DiscreteEval Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model GitHub
EmergentTTS-Eval EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge GitHub
InstructTTSEval INSTRUCTTTSEVAL: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems GitHub
MOS-Bench MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models Hugging Face
SH-Bench Protecting Bystander Privacy via Selective Hearing in LALMs Hugging Face
LISN-Bench LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating Github

3D

Benchmark Paper Project Page
NR-3DQA No-Reference Quality Assessment for 3D Colored Point Cloud and Mesh Models GitHub
MM-PCQA MM-PCQA: Multi-Modal Learning for No-reference Point Cloud Quality Assessment GitHub
GT23D-Bench GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark GitHub
NeRF-NQA NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF and Neural View Synthesis Methods GitHub
Explicit-NeRF-QA Explicit-NeRF-QA: A Quality Assessment Database for Explicit NeRF Model Compression GitHub
NeRF-QA NeRF-QA: Neural Radiance Fields Quality Assessment Database GitHub
NVS-QA NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics Evaluation GitHub
GS-QA GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis GitHub
GPT-4V Evaluator GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation GitHub
Eval3D Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation GitHub
3DGen-Bench 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models GitHub
LMM-PCQA LMM-PCQA: Assisting Point Cloud Quality Assessment with LMM GitHub

Leaderboards and Tools

Platform / Benchmark Paper Project Page
LMMs-Eval LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models GitHub
GenAI-Arena Genai arena: An open evaluation platform for generative models Hugging Face
OpenCompass GitHub
Epoch AI’s Benchmarking Hub Website
Artificial Analysis Website
Scale’s SEAL Leaderboards Website
FlagEval FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation Website
AGI-Eval AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models Website
ReLE Website
VLMEvalKit, OpenVLM Leaderboard VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models GitHub
HELM Holistic Evaluation of Language Models Project Page
LiveBench LiveBench: A Challenging, Contamination-Limited LLM Benchmark Project Page
SuperCLUE SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark Website
AIBench AIbench: Towards Trustworthy Evaluation Under the 45 Law Website
FutureX FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Github
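Most of the toolkits above are driven from the command line. A minimal sketch of typical invocations follows; the flags mirror the toolkits' public READMEs at the time of writing, and the model and task identifiers are illustrative placeholders, so consult each project's documentation for the current interface.

# LMMs-Eval: evaluate a LLaVA checkpoint on MME
python3 -m lmms_eval --model llava --model_args pretrained=liuhaotian/llava-v1.5-7b --tasks mme --batch_size 1 --output_path ./logs/

# VLMEvalKit: evaluate a model on the MMBench dev split
python run.py --data MMBench_DEV_EN --model llava_v1.5_7b --verbose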
