# Evaluations

To evaluate the performance of a model on a benchmark:

1. Prepare the evaluation environment.
2. Prepare the benchmark dataset.
3. Run the evaluation script.

---

## Evaluation Environment

1. Follow the instructions in the [LLaVA](https://github.com/haotian-liu/LLaVA) repository to set up the evaluation environment.
2. Install the required packages.

```bash
# Make sure you are currently in the evaluations/ directory
pip install -r ../requirements.txt
```

## Text-rich Multi-Image Benchmarks

### MP-DocVQA

1. Download `image.tar.gz` and `question-answer.zip` from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: registration is required.)
2. Extract `image.tar.gz` into the `mpdocvqa/images/` folder.
3. Unzip `question-answer.zip` and move `val.json` into the `mpdocvqa/` folder.
4. Run `load_mpdocvqa.py` to prepare the dataset.

```bash
cd mpdocvqa/ && python load_mpdocvqa.py
```

### DUDE

1. Run `load_dude.py` to prepare the dataset. The data will be downloaded from the Hugging Face Datasets hub.

```bash
cd dude/ && python load_dude.py
```

### SlideVQA

1. Follow the instructions at https://github.com/nttmdlab-nlp/SlideVQA to download the dataset.
2. Run `load_slidevqa.py` to prepare the dataset.

```bash
cd slidevqa/ && python load_slidevqa.py
```

### MultiChartQA

1. Download the dataset (the `data/` folder) from https://github.com/Zivenzhu/Multi-chart-QA/tree/main into the `multichartqa/data/` folder.
2. Run `load_multichartqa.py` to prepare the dataset.

```bash
cd multichartqa/ && python load_multichartqa.py
```

### MultiHiertt

1. Download `dev.json` from https://drive.google.com/drive/folders/1ituEWZ5F7G9T9AZ0kzZZLrHNhRigHCZJ into the `multihiertt/` folder.
2. Run `load_multihiertt.py` to prepare the dataset.

```bash
cd multihiertt/ && python load_multihiertt.py
```

---

## Text-rich Single-Image Benchmarks

### TextVQA

1. Download `TextVQA_0.5.1_val.json` and the images from https://textvqa.org/dataset/.
2. Unzip the images into the `textvqa/images/` folder.
3. Run `load_textvqa.py` to prepare the dataset.
```bash
cd textvqa/ && python load_textvqa.py
```

### DocVQA

1. Download `val_v1.0_withQT.json` and the images from https://rrc.cvc.uab.es/?ch=17&com=downloads. (Note: registration is required.)
2. Unzip the images into the `docvqa/images/` folder.
3. Run `load_docvqa.py` to prepare the dataset.

```bash
cd docvqa/ && python load_docvqa.py
```

### VisualWebBench

1. Download the dataset files (`*.parquet`) from https://huggingface.co/datasets/visualwebbench/VisualWebBench.
2. Run `load_visualwebbench.py` to prepare the dataset.

```bash
cd visualwebbench/ && python load_visualwebbench.py
```

---

## General Benchmarks

### MIRB

1. Download the dataset files (`*.parquet`) from https://huggingface.co/datasets/VLLMs/MIRB/tree/main.
2. Run `load_mirb.py` to prepare the dataset.

```bash
cd mirb/ && python load_mirb.py
```

### MIBench

TBD

### MMMU

1. Download the dataset files (`*.parquet`) from https://huggingface.co/datasets/MMMU/MMMU.
2. Run `load_mmmu.py` to prepare the dataset.

```bash
cd mmmu/ && python load_mmmu.py
```

### MathVista

1. Download `testmini-00000-of-00001-725687bf7a18d64b.parquet` and `images.zip` from https://huggingface.co/datasets/AI4Math/MathVista.
2. Unzip the images into the `mathvista/images/` folder.
3. Run `load_mathvista.py` to prepare the dataset.

```bash
cd mathvista/ && python load_mathvista.py
```

### ScienceQA

1. Download the dataset files (`*.parquet`) from https://huggingface.co/datasets/ScienceQA/ScienceQA.
2. Run `load_scienceqa.py` to prepare the dataset.

```bash
cd scienceqa/ && python load_scienceqa.py
```

---

## Evaluation Script

To evaluate the Leopard-LLaVA model:

```bash
# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_llava_siglip_multiimg.sh direct $MODEL_PATH
```

To evaluate the Leopard-Idefics model:

```bash
# Make sure you are currently in the evaluations/ directory
cd models/ && bash run_eval_idefics2_multiimg.sh direct $MODEL_PATH
```

These scripts evaluate the model on all of the benchmark datasets listed above.
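Each benchmark above follows the same pattern: download data into its folder, then run its `load_*.py` script. If you want to prepare every benchmark in one pass, a helper like the following can loop over the folders; this is a hypothetical convenience script (not part of the repository) and it assumes each loader is named `load_<folder>.py`, which matches the scripts listed above.

```shell
# Hypothetical helper: run every dataset preparation script in one pass.
# Run from the evaluations/ directory, after completing the manual
# download steps for each benchmark.
prepared=""
skipped=""
for dir in mpdocvqa dude slidevqa multichartqa multihiertt \
           textvqa docvqa visualwebbench mirb mmmu mathvista scienceqa; do
  if [ -f "$dir/load_$dir.py" ]; then
    echo "Preparing $dir ..."
    # Run each loader from inside its own folder, as the per-benchmark
    # instructions do, without changing the caller's working directory.
    (cd "$dir" && python "load_$dir.py")
    prepared="$prepared $dir"
  else
    skipped="$skipped $dir"
  fi
done
echo "Prepared:$prepared"
echo "Skipped (no loader found):$skipped"
```

Benchmarks whose loader script is missing (for example, MIBench, which is still TBD) are skipped rather than aborting the whole pass.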
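Before launching the evaluation scripts, it can save a failed run to confirm that the manually downloaded files are where the preparation steps expect them. The sketch below is a hypothetical preflight check; the file paths are taken from the download steps above, so adjust the list if your loaders rearrange the layout.

```shell
# Hypothetical preflight check: verify a few of the manually downloaded
# files referenced in the preparation steps above. Run from evaluations/.
missing=0
for f in mpdocvqa/val.json \
         multihiertt/dev.json \
         textvqa/TextVQA_0.5.1_val.json \
         docvqa/val_v1.0_withQT.json; do
  if [ ! -f "$f" ]; then
    echo "MISSING: $f"
    missing=1
  fi
done

if [ "$missing" -eq 0 ]; then
  echo "All checked files present."
else
  echo "Some files are missing; re-run the preparation steps above."
fi
```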