This is the code for the paper Same or Not? Enhancing Visual Perception in Vision-Language Models by Damiano Marsili, Aditya Mehta, Ryan Y. Lin, and Georgia Gkioxari.
Clone the repo:
git clone --recurse-submodules https://github.com/damianomarsili/TWIN.git

We use uv to manage all dependencies. If your system does not have uv, install it via:
curl -LsSf https://astral.sh/uv/install.sh | sh

Set up your environment:
cd TWIN
uv sync

For post-training with verl, you must also install verl's dependencies. You can do so by running the following script:
bash modules/verl/scripts/uv_install_vllm_sglang_mcore.sh
Make sure the installed torch and flash-attn versions are compatible with your system.
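As a quick sanity check before training, you can confirm that the heavy training dependencies are importable. This is a minimal, stdlib-only sketch; it assumes the import names `torch` and `flash_attn`, and `check_installed` is a hypothetical helper, not part of the repo:

```python
import importlib.util

def check_installed(packages=("torch", "flash_attn")):
    """Map each package name to whether it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

# Report which training dependencies are present, without importing them.
print(check_installed())
```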
The TWIN dataset and FGVQA benchmark suite are hosted on Huggingface 🤗.
The dataset and benchmark suite can be accessed with the following code:
from datasets import load_dataset
# TWIN Dataset
twin_dataset = load_dataset("glab-caltech/TWIN")
# FGVQA Benchmark Suite
fgvqa_benchmark = load_dataset("glab-caltech/FGVQA")

We use LMMs-eval for all evaluations. Model checkpoints post-trained on TWIN are hosted on Huggingface 🤗.
To evaluate a model on FGVQA, please run the following code:
bash evals/eval.sh
Inside evals/eval.sh, you can select the model checkpoint to evaluate by editing the MODEL_DIR and MODEL_ARGS variables. To evaluate on a subset of the benchmark suite (e.g. CUB), set the TASKS variable to fgvqa_{subset} (e.g. fgvqa_cub).
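For illustration, the subset-to-task naming above can be sketched as a small helper. This is an assumption-laden sketch: `fgvqa_task` is a hypothetical function (not part of the repo), and it assumes the full suite is exposed under the task name `fgvqa`; the authoritative task names live in the LMMs-eval task configs:

```python
def fgvqa_task(subset=None):
    """Build the TASKS value for evals/eval.sh.

    With no subset, evaluate the full FGVQA suite (assumed task name "fgvqa");
    otherwise evaluate one subset, e.g. "cub" -> "fgvqa_cub".
    """
    return "fgvqa" if subset is None else f"fgvqa_{subset}"

print(fgvqa_task("cub"))  # -> fgvqa_cub
```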
We use verl for RL post-training. Prior to training, you must download and preprocess the TWIN dataset. You can do so by running the following script:
bash training/prepare_training_data.sh
Then, you can launch training via the following command:
bash training/run_grpo.sh
The trained checkpoint will be saved to training/data/checkpoints/ by default. You can change this target directory at the top of the bash script training/run_grpo.sh.
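If you keep several runs around, a small stdlib sketch can locate the most recent checkpoint under the default directory. The path comes from the script above; `latest_checkpoint` is a hypothetical helper, not part of the repo, and it assumes each checkpoint is a subdirectory:

```python
from pathlib import Path

def latest_checkpoint(ckpt_root="training/data/checkpoints"):
    """Return the most recently modified checkpoint subdirectory, or None."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)

# Example: point your eval config at the newest run.
print(latest_checkpoint())
```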
If you use the TWIN dataset or FGVQA benchmark suite in your research, please consider citing our work:
@misc{marsili2025notenhancingvisualperception,
title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
year={2025},
eprint={2512.23592},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.23592},
}