Tencent/VersaViT


Installation

conda create -n versavit python=3.11.0
conda activate versavit

pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt

pip install ninja
pip install flash-attn==2.7.4.post1 --no-build-isolation

Quick Start

import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel


model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])

Data Preparation

We use the WebDataset format: training data is stored as .tar archives (shards). Each shard holds a sequence of samples (e.g. an image plus text or other modalities), with each sample's files stored as separate entries inside the tar; this enables efficient sequential I/O and scales well to distributed training.

For details on how we build these shards (downloading with img2dataset, consolidating metadata into .jsonl, and packing into WebDataset .tar files), see the data_pack folder and its README.
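The shard layout described above can be sketched with the standard library alone. This is a minimal illustration of the WebDataset convention (one sample = several files sharing a basename inside a plain .tar), not the repo's data_pack code; the function names and the demo shard name are hypothetical.

```python
import io
import json
import tarfile

def pack_shard(path, samples):
    """Write (key, {extension: bytes}) samples into a WebDataset-style tar.

    Each sample's files share a basename (the key), e.g. 000000.jpg + 000000.json.
    """
    with tarfile.open(path, "w") as tar:
        for key, files in samples:
            for ext, data in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def iter_shard(path):
    """Yield (key, {extension: bytes}) per sample, reading the tar sequentially."""
    current_key, current = None, {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current

if __name__ == "__main__":
    meta = json.dumps({"caption": "a logo"}).encode()
    pack_shard("demo-0000.tar", [("000000", {"jpg": b"<jpeg bytes>", "json": meta})])
    for key, sample in iter_shard("demo-0000.tar"):
        print(key, sorted(sample))  # 000000 ['jpg', 'json']
```

In practice the webdataset library handles this grouping (and decoding, shuffling, and sharded distribution) for you; the sketch only shows why sequential tar reads make the format cheap to stream.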

Training

Captioning warmup

sh scripts/train_multi_task_webloader_qwen2_warmup.sh exp/cap-only-warmup-qwen2.yaml

Multi-task collaborative training

sh scripts/train_multi_task_webloader_qwen2.sh exp/multi-task-post-train-qwen2.yaml

Evaluation

  • Segmentation & depth (linear probing)
    We provide scripts and configs for linear probing on segmentation and depth. See the evaluation folder: subfolders evaluation/segmentation and evaluation/monodepth contain the respective setup and run instructions.

  • VQA with LLM
    Our VQA training (connecting the vision encoder to an LLM) is done with an internal company framework, so we are unable to open-source that part of the code. We do release the trained model weights for this setup; you can find them here.
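The linear-probing idea behind the segmentation and depth evaluations can be sketched as follows: the backbone stays frozen and only a linear head is fit on its features. This is a hedged stand-in, not the repo's evaluation code; the features here are random placeholders for frozen ViT features, and the head is a closed-form ridge regression as one might use for per-pixel depth.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Ridge-regression head: solve (X^T X + l2*I) W = X^T Y in closed form."""
    d = features.shape[1]
    gram = features.T @ features + l2 * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def probe_rmse(features, targets, weights):
    """RMSE of the linear head's predictions against the targets."""
    pred = features @ weights
    return float(np.sqrt(np.mean((pred - targets) ** 2)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(512, 64))    # stand-in for frozen backbone features
true_w = rng.normal(size=(64, 1))
depth = feats @ true_w + 0.01 * rng.normal(size=(512, 1))  # synthetic depth targets

w = fit_linear_probe(feats, depth)
print(f"probe RMSE: {probe_rmse(feats, depth, w):.4f}")
```

The actual evaluation scripts in evaluation/segmentation and evaluation/monodepth train the head with gradient descent on real dense-prediction benchmarks; the point of the probe in both cases is the same, namely that head quality measures the frozen features, not the head.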

🫡 Acknowledgements

Many thanks to the codebases of InternVL and FiT3D.

Citation

If you use this code for your research or project, please cite:

@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
