
Evaluations

To evaluate the performance of a model on a benchmark:

  1. Prepare the evaluation environment.
  2. Prepare the benchmark dataset.
  3. Run the evaluation script.

Evaluation Environment

  1. Follow the instructions in the LLaVA repository to set up the evaluation environment.
  2. Install the required packages.
# Make sure you are currently in the evaluations/ directory
pip install -r ../requirements.txt

Text-rich Multi-Image Benchmarks

MP-DocVQA

  1. Download the image.tar.gz and question-answer.zip from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: Registration is required.)
  2. Extract image.tar.gz into the mpdocvqa/images folder.
  3. Unzip question-answer.zip and move val.json into the mpdocvqa/ folder.
  4. Run load_mpdocvqa.py to prepare the dataset.
cd mpdocvqa/ && python load_mpdocvqa.py
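The archive-handling steps (2 and 3) can be sketched as a small standard-library helper. This is a hypothetical illustration, not part of the repository; load_mpdocvqa.py remains the authoritative preparation step.

```python
import tarfile
import zipfile
from pathlib import Path

def prepare_mpdocvqa(tar_path: str, zip_path: str, out_dir: str = "mpdocvqa") -> Path:
    """Extract image.tar.gz into mpdocvqa/images and unpack val.json into mpdocvqa/.

    Hypothetical helper mirroring steps 2-3 of the instructions above.
    """
    out = Path(out_dir)
    images = out / "images"
    images.mkdir(parents=True, exist_ok=True)
    # Step 2: extract the page images into mpdocvqa/images.
    with tarfile.open(tar_path, "r:gz") as tf:
        tf.extractall(images)
    # Step 3: unpack the QA annotations; val.json lands in mpdocvqa/.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    val = out / "val.json"
    if not val.exists():
        raise FileNotFoundError("val.json not found in the question-answer archive")
    return val
```

After this, step 4 (`python load_mpdocvqa.py`) runs against the populated mpdocvqa/ folder as before.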

DUDE

  1. Run load_dude.py to prepare the dataset; the data will be downloaded automatically from Hugging Face Datasets.
cd dude/ && python load_dude.py

SlideVQA

  1. Follow the instructions in https://github.com/nttmdlab-nlp/SlideVQA to download the dataset.
  2. Run load_slidevqa.py to prepare the dataset.
cd slidevqa/ && python load_slidevqa.py

MultiChartQA

  1. Download the dataset (the data/ folder) from https://github.com/Zivenzhu/Multi-chart-QA/tree/main into the multichartqa/data/ folder.
  2. Run load_multichartqa.py to prepare the dataset.
cd multichartqa/ && python load_multichartqa.py

MultiHiertt

  1. Download dev.json from https://drive.google.com/drive/folders/1ituEWZ5F7G9T9AZ0kzZZLrHNhRigHCZJ into the multihiertt/ folder.
  2. Run load_multihiertt.py to prepare the dataset.
cd multihiertt/ && python load_multihiertt.py
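Since several benchmarks above ship their annotations as a single JSON file (val.json, dev.json), a quick stdlib sanity check before running the load script can catch truncated downloads early. This is a hypothetical helper, not part of the repository, and the actual record schema varies per benchmark.

```python
import json
from pathlib import Path

def peek_annotations(path: str, n: int = 3) -> list:
    """Parse a downloaded JSON annotation file and return its first n records.

    Hypothetical helper: raises if the file is truncated or not valid JSON,
    which is a cheap way to verify a download before running a load script.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Annotation files are usually either a list of records or a dict of them.
    records = data if isinstance(data, list) else list(data.items())
    return records[:n]
```

For example, `peek_annotations("multihiertt/dev.json")` should return a few records if the Google Drive download completed cleanly.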

Text-rich Single Image Benchmarks

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images from https://textvqa.org/dataset/.
  2. Unzip the images into textvqa/images/ folder.
  3. Run load_textvqa.py to prepare the dataset.
cd textvqa/ && python load_textvqa.py

DocVQA

  1. Download val_v1.0_withQT.json and the images from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: Registration is required.)
  2. Unzip the images into docvqa/images/ folder.
  3. Run load_docvqa.py to prepare the dataset.
cd docvqa/ && python load_docvqa.py

VisualWebBench

  1. Download the dataset files (*.parquet) from https://huggingface.co/datasets/visualwebbench/VisualWebBench.
  2. Run load_visualwebbench.py to prepare the dataset.
cd visualwebbench/ && python load_visualwebbench.py
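For the parquet-based benchmarks (VisualWebBench, MIRB, MMMU, ScienceQA), a one-liner can confirm that all shards were downloaded before the load script runs. The helper below is a hypothetical convenience, not part of the repository:

```python
from pathlib import Path

def list_parquet_shards(data_dir: str) -> list:
    """Return the sorted .parquet file names found under data_dir.

    Hypothetical helper: compare the result against the file listing on the
    Hugging Face dataset page to verify the download is complete.
    """
    return sorted(p.name for p in Path(data_dir).glob("*.parquet"))
```

If `list_parquet_shards("visualwebbench/")` is missing entries relative to the dataset page, re-download those shards before running load_visualwebbench.py.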

General Benchmarks

MIRB

  1. Download the dataset files (*.parquet) from https://huggingface.co/datasets/VLLMs/MIRB/tree/main.
  2. Run load_mirb.py to prepare the dataset.
cd mirb/ && python load_mirb.py

MIBench

TBD

MMMU

  1. Download the dataset files (*.parquet) from https://huggingface.co/datasets/MMMU/MMMU.
  2. Run load_mmmu.py to prepare the dataset.
cd mmmu/ && python load_mmmu.py

MathVista

  1. Download the testmini-00000-of-00001-725687bf7a18d64b.parquet file and images.zip from https://huggingface.co/datasets/AI4Math/MathVista.
  2. Unzip the images into the mathvista/images/ folder.
  3. Run load_mathvista.py to prepare the dataset.
cd mathvista/ && python load_mathvista.py

ScienceQA

  1. Download the dataset files (*.parquet) from https://huggingface.co/datasets/ScienceQA/ScienceQA.
  2. Run load_scienceqa.py to prepare the dataset.
cd scienceqa/ && python load_scienceqa.py

Evaluation Script

To evaluate the Leopard-LLaVA model:

# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_llava_siglip_multiimg.sh direct $MODEL_PATH

To evaluate the Leopard-Idefics model:

# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_idefics2_multiimg.sh direct $MODEL_PATH

The scripts will evaluate the model on all of the benchmark datasets above.