Official evaluation toolkit for MMMG: the Massive Multi-discipline Multi-tier Knowledge-Image Generation benchmark.
- ✨ Project Page
- 📄 Paper (arXiv 2506.10963)
- 💾 MMMG Dataset on HuggingFace
- 📷 Sampled Results
- 📂 Training Set
MMMG is a large-scale benchmark designed to assess text-to-image (T2I) models on their ability to generate faithful and visually readable images based on knowledge-intensive prompts, spanning multiple academic disciplines and educational levels.
MMMG-Score is computed as:
MMMG-Score = Knowledge Fidelity × Visual Readability

Where:
- Knowledge Fidelity = 1 − GED, with GED the Graph Edit Distance between the predicted and ground-truth concept graphs.
- Visual Readability: a readability score derived from SAM2.1 segmentation.
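As a concrete illustration, the score for a single image can be sketched as below. This is a minimal sketch, not the toolkit's actual API: the function name is hypothetical, and it assumes GED has already been normalized to [0, 1].

```python
def mmmg_score(ged: float, readability: float) -> float:
    """Combine knowledge fidelity and visual readability into one score.

    ged: Graph Edit Distance between the predicted and ground-truth
         concept graphs, assumed normalized to [0, 1] (0 = exact match).
    readability: SAM2.1-based visual readability score in [0, 1].
    """
    knowledge_fidelity = 1.0 - ged
    return knowledge_fidelity * readability

# Example: a perfect graph match with readability 0.8 scores 0.8.
print(mmmg_score(0.0, 0.8))
```

The multiplicative form means a model must do well on both axes: a perfectly faithful but unreadable image (or vice versa) still scores near zero.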
- 2025.11.29 We have benchmarked Nano Banana Pro, which is currently the top-performing model on MMMG.
- 2025.9.19 Our work has been accepted by NeurIPS 2025!
- 2025.6.10 The repository has been updated.
git clone https://github.com/MMMGBench/MMMG.git
cd MMMG
conda env create -f environment.yaml
conda activate mmmg

Place your generated images under the following structure:
/data/
├─ preschool/
├─ primaryschool/
├─ secondaryschool/
├─ highschool/
├─ undergraduate/
└─ PhD/
Each folder contains the model-generated images, named <prompt_key>.png.
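Before running the evaluation, it can help to sanity-check that all six tier folders exist and count the images in each. The snippet below is an illustrative helper, not part of the toolkit; the tier names follow the tree above and the root path is whatever directory you chose.

```python
import pathlib

# Education tiers expected by the evaluation layout.
TIERS = ["preschool", "primaryschool", "secondaryschool",
         "highschool", "undergraduate", "PhD"]

def check_layout(root: str) -> dict:
    """Return the number of .png images per tier folder under `root`,
    raising if an expected tier folder is missing."""
    root_path = pathlib.Path(root)
    counts = {}
    for tier in TIERS:
        tier_dir = root_path / tier
        if not tier_dir.is_dir():
            raise FileNotFoundError(f"missing tier folder: {tier_dir}")
        counts[tier] = len(list(tier_dir.glob("*.png")))
    return counts
```

Running `check_layout("./data/GPT-4o")` before evaluation catches misnamed or missing tier folders early, rather than partway through a long GPT-based scoring run.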
We use the Azure OpenAI service for knowledge-integrity evaluation. If you use a different API interface (e.g., the standard OpenAI API), please modify:
mmmg_eval/step1_knowledge_integrity.py

Insert your API keys into:
mmmg_eval/utils/gpt_api_pool.py

Then run:
python evaluate.py \
--img_dir ./data/GPT-4o \
--output_dir ./output \
--sam2_ckpt /YOUR/PATH/TO/sam2/checkpoints/sam2.1_hiera_large.pt \
--t2i_method GPT-4o \
--api_name o3 \
--hf_cache ./data/MMMG

Arguments:
- --img_dir: Path to generated images (organized by education tier).
- --output_dir: Where evaluation logs and scores will be saved.
- --sam2_ckpt: Path to the pretrained SAM2.1 checkpoint.
- --t2i_method: Name of the T2I model under evaluation.
- --api_name: LLM backend (e.g., gpt-4, gpt-4o, o3).
- --hf_cache: Path to the HuggingFace cache for loading ground-truth graphs.
If you find MMMG helpful in your research, please consider citing our paper:
@inproceedings{luo2025mmmg,
title={{MMMG}: A massive, multidisciplinary, multi-tier generation benchmark for text-to-image reasoning},
author={Luo, Yuxuan and Yuan, Yuhui and Chen, Junwen and Cai, Haonan and Yue, Ziyi and Yang, Yuwei and Daha, Fatima Zohra and Li, Ji and Lian, Zhouhui},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025}
}