This is the code for the paper Same or Not? Enhancing Visual Perception in Vision-Language Models by Damiano Marsili, Aditya Mehta, Ryan Y. Lin, and Georgia Gkioxari.
Clone the repo:
git clone --recurse-submodules https://github.com/damianomarsili/TWIN.git

We use uv to manage all dependencies. If your system does not have uv, install it via:
curl -LsSf https://astral.sh/uv/install.sh | sh

Set up your environment:
cd TWIN
uv sync

For post-training with verl, you must also install verl's dependencies. You can do so by running the following script:
bash modules/verl/scripts/uv_install_vllm_sglang_mcore.sh
Make sure the installed torch and flash-attn versions are compatible with your system.
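As a quick sanity check before training, you can confirm that the heavy training dependencies are importable. This is a minimal, stdlib-only sketch; it assumes the import names `torch` and `flash_attn`, and `check_installed` is a hypothetical helper, not part of the repo:

```python
import importlib.util

def check_installed(packages=("torch", "flash_attn")):
    """Map each package name to whether it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

# Report which training dependencies are present, without importing them.
print(check_installed())
```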
The TWIN dataset and FGVQA benchmark suite are hosted on Huggingface 🤗.
The dataset and benchmark suite can be accessed with the following code:
from datasets import load_dataset
# TWIN Dataset
twin_dataset = load_dataset("glab-caltech/TWIN")
# FGVQA Benchmark Suite
fgvqa_benchmark = load_dataset("glab-caltech/FGVQA")

We use LMMs-eval for all evaluations. Model checkpoints post-trained on TWIN are hosted on Huggingface 🤗.
To evaluate a model on FGVQA, please run the following code:
bash evals/eval.sh
Inside evals/eval.sh, you can select the model checkpoint to evaluate by editing the MODEL_DIR and MODEL_ARGS variables. To evaluate on a subset of the benchmark suite (e.g. CUB), set the TASKS variable to fgvqa_{subset} (e.g. fgvqa_cub).
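For illustration, the subset-to-task naming above can be sketched as a small helper. This is an assumption-laden sketch: `fgvqa_task` is a hypothetical function (not part of the repo), and it assumes the full suite is exposed under the task name `fgvqa`; the authoritative task names live in the LMMs-eval task configs:

```python
def fgvqa_task(subset=None):
    """Build the TASKS value for evals/eval.sh.

    With no subset, evaluate the full FGVQA suite (assumed task name "fgvqa");
    otherwise evaluate one subset, e.g. "cub" -> "fgvqa_cub".
    """
    return "fgvqa" if subset is None else f"fgvqa_{subset}"

print(fgvqa_task("cub"))  # -> fgvqa_cub
```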
We use verl for RL post-training. Prior to training, you must download and preprocess the TWIN dataset. You can do so by running the following script:
bash training/prepare_training_data.sh
Then, you can launch training via the following command:
bash training/run_grpo.sh
The trained checkpoint will be saved to training/data/checkpoints/ by default. You can change this target directory at the top of the bash script training/run_grpo.sh.
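If you keep several runs around, a small stdlib sketch can locate the most recent checkpoint under the default directory. The path comes from the script above; `latest_checkpoint` is a hypothetical helper, not part of the repo, and it assumes each checkpoint is a subdirectory:

```python
from pathlib import Path

def latest_checkpoint(ckpt_root="training/data/checkpoints"):
    """Return the most recently modified checkpoint subdirectory, or None."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)

# Example: point your eval config at the newest run.
print(latest_checkpoint())
```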
If you use the TWIN dataset or FGVQA benchmark suite in your research, please consider citing our work:
@misc{marsili2025notenhancingvisualperception,
title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
year={2025},
eprint={2512.23592},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.23592},
}