
⭐ TWIN - Same or Not? Enhancing Visual Perception in Vision-Language Models

This repository contains the code for the paper Same or Not? Enhancing Visual Perception in Vision-Language Models by Damiano Marsili, Aditya Mehta, Ryan Y. Lin, and Georgia Gkioxari.

arXiv | Project Page


🚀 Quickstart

Clone the repo:

git clone --recurse-submodules https://github.com/damianomarsili/TWIN.git

We use uv to manage all dependencies. If your system does not have uv, install it via:

curl -LsSf https://astral.sh/uv/install.sh | sh

Set up your environment:

cd TWIN
uv sync

For post-training with verl, you must also install verl's dependencies. You can do so by running the following script:

bash modules/verl/scripts/uv_install_vllm_sglang_mcore.sh

⚠️ Note: This setup assumes CUDA 12.2 and Python 3.10. If you are using a different CUDA version, you may need to install versions of torch and flash-attn compatible with your system.

🤗 The TWIN Dataset & FGVQA Benchmark Suite

The TWIN dataset and FGVQA benchmark suite are hosted on Huggingface 🤗.

Both can be loaded with the following code:

from datasets import load_dataset

# TWIN Dataset
twin_dataset = load_dataset("glab-caltech/TWIN")

# FGVQA Benchmark Suite
fgvqa_benchmark = load_dataset("glab-caltech/FGVQA")

📊 Evaluating on FGVQA

We use LMMs-eval for all evaluations. Model checkpoints post-trained on TWIN are hosted on Huggingface 🤗.

To evaluate a model on FGVQA, run the following script:

bash evals/eval.sh

Inside evals/eval.sh, select the model checkpoint to evaluate by changing the MODEL_DIR and MODEL_ARGS variables. To evaluate a subset of the benchmark suite (e.g., CUB), set the TASKS variable to fgvqa_{subset} (e.g., fgvqa_cub).

🧠 Post-training on TWIN

We use verl for RL post-training. Prior to training, you must download and preprocess the TWIN dataset. You can do so by running the following script:

bash training/prepare_training_data.sh

Then, you can launch training via the following command:

bash training/run_grpo.sh

By default, trained checkpoints are saved to training/data/checkpoints/. You can change this target directory at the top of training/run_grpo.sh.

📚 Citation

If you use the TWIN dataset or FGVQA benchmark suite in your research, please consider citing our work:

@misc{marsili2025notenhancingvisualperception,
      title={Same or Not? Enhancing Visual Perception in Vision-Language Models}, 
      author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
      year={2025},
      eprint={2512.23592},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.23592}, 
}

About

[CVPR 2026] Same or Not? Enhancing Visual Perception in Vision-Language Models
