To evaluate the performance of a model on a benchmark:
- Prepare the evaluation environment.
- Prepare the benchmark dataset.
- Run the evaluation script.
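The dataset-preparation step repeats the same pattern for every benchmark: place the raw download in the benchmark's folder, then run its load_&lt;name&gt;.py from inside that folder. A minimal sketch of that loop, assuming all downloads are already in place (the folder list mirrors the sections below; `prepare_all` itself is a hypothetical helper, not part of the repo):

```python
import subprocess
import sys
from pathlib import Path

# Folder names match the per-benchmark sections in this document.
BENCHMARKS = [
    "mpdocvqa", "dude", "slidevqa", "multichartqa", "multihiertt",
    "textvqa", "docvqa", "visualwebbench", "mirb", "mmmu",
    "mathvista", "scienceqa",
]

def prepare_all(root="."):
    """Run each benchmark's load_<name>.py from inside its own folder.

    Benchmarks whose load script is missing (e.g. the raw files were
    never downloaded) are skipped rather than treated as errors.
    """
    for name in BENCHMARKS:
        folder = Path(root) / name
        script = f"load_{name}.py"
        if (folder / script).is_file():
            subprocess.run([sys.executable, script], cwd=folder, check=True)
```

Each load script still has the per-benchmark prerequisites described below; the loop only saves retyping the individual `cd <benchmark>/ && python load_<benchmark>.py` lines.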
### Prepare the evaluation environment

- Follow the instructions in the LLaVA repository to set up the evaluation environment.
- Install the required packages.

```shell
# Make sure you are currently in the evaluations/ directory
pip install -r ../requirements.txt
```

### Prepare the benchmark datasets

#### MP-DocVQA

- Download image.tar.gz and question-answer.zip from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: Registration is required.)
- Extract image.tar.gz into the mpdocvqa/images folder.
- Unzip question-answer.zip and move val.json into the mpdocvqa/ folder.
- Run load_mpdocvqa.py to prepare the dataset.

```shell
cd mpdocvqa/ && python load_mpdocvqa.py
```

#### DUDE

- Run load_dude.py to prepare the dataset. The data will be downloaded from Hugging Face Datasets.
```shell
cd dude/ && python load_dude.py
```

#### SlideVQA

- Follow the instructions at https://github.com/nttmdlab-nlp/SlideVQA to download the dataset.
- Run load_slidevqa.py to prepare the dataset.
```shell
cd slidevqa/ && python load_slidevqa.py
```

#### MultiChartQA

- Download the dataset (the data/ folder) from https://github.com/Zivenzhu/Multi-chart-QA/tree/main into the multichartqa/data/ folder.
- Run load_multichartqa.py to prepare the dataset.
```shell
cd multichartqa/ && python load_multichartqa.py
```

#### MultiHiertt

- Download dev.json from https://drive.google.com/drive/folders/1ituEWZ5F7G9T9AZ0kzZZLrHNhRigHCZJ into the multihiertt/ folder.
- Run load_multihiertt.py to prepare the dataset.
```shell
cd multihiertt/ && python load_multihiertt.py
```

#### TextVQA

- Download TextVQA_0.5.1_val.json and the images from https://textvqa.org/dataset/.
- Unzip the images into the textvqa/images/ folder.
- Run load_textvqa.py to prepare the dataset.
```shell
cd textvqa/ && python load_textvqa.py
```

#### DocVQA

- Download val_v1.0_withQT.json and the images from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: Registration is required.)
- Unzip the images into the docvqa/images/ folder.
- Run load_docvqa.py to prepare the dataset.
```shell
cd docvqa/ && python load_docvqa.py
```

#### VisualWebBench

- Download the dataset files (*.parquet) from https://huggingface.co/datasets/visualwebbench/VisualWebBench.
- Run load_visualwebbench.py to prepare the dataset.
```shell
cd visualwebbench/ && python load_visualwebbench.py
```

#### MIRB

- Download the dataset files (*.parquet) from https://huggingface.co/datasets/VLLMs/MIRB/tree/main.
- Run load_mirb.py to prepare the dataset.
```shell
cd mirb/ && python load_mirb.py
```

TBD
#### MMMU

- Download the dataset files (*.parquet) from https://huggingface.co/datasets/MMMU/MMMU.
- Run load_mmmu.py to prepare the dataset.
```shell
cd mmmu/ && python load_mmmu.py
```

#### MathVista

- Download the testmini-00000-of-00001-725687bf7a18d64b.parquet file and images.zip from https://huggingface.co/datasets/AI4Math/MathVista.
- Unzip the images into the mathvista/images/ folder.
- Run load_mathvista.py to prepare the dataset.
```shell
cd mathvista/ && python load_mathvista.py
```

#### ScienceQA

- Download the dataset files (*.parquet) from https://huggingface.co/datasets/ScienceQA/ScienceQA.
- Run load_scienceqa.py to prepare the dataset.
```shell
cd scienceqa/ && python load_scienceqa.py
```

### Run the evaluation script

To evaluate the Leopard-LLaVA model:
```shell
# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_llava_siglip_multiimg.sh direct $MODEL_PATH
```

To evaluate the Leopard-Idefics model:
```shell
# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_idefics2_multiimg.sh direct $MODEL_PATH
```

These scripts evaluate the model's performance on all of the benchmark datasets above.
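Both run scripts assume every benchmark folder has already been prepared, so a quick pre-flight check can save a failed run. A small sketch of such a check (the folder names come from the dataset sections above; the function itself is an assumption, not something the eval scripts provide):

```python
from pathlib import Path

# Folder names from the dataset-preparation sections above.
BENCHMARKS = [
    "mpdocvqa", "dude", "slidevqa", "multichartqa", "multihiertt",
    "textvqa", "docvqa", "visualwebbench", "mirb", "mmmu",
    "mathvista", "scienceqa",
]

def missing_benchmarks(root="."):
    """Return the benchmark folders that do not exist yet under root."""
    return [name for name in BENCHMARKS if not (Path(root) / name).is_dir()]

if __name__ == "__main__":
    for name in missing_benchmarks():
        print(f"not prepared: {name}")
```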