```bash
conda create -n versavit python=3.11.0
conda activate versavit
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install ninja
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
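The exact output format of `forward_wt_merger` is model-specific; assuming it returns a `(num_tokens, hidden_dim)` feature tensor, a common post-processing step is to L2-normalize the features before computing similarities. A minimal NumPy sketch with random placeholder features (not the model's actual outputs):

```python
import numpy as np

# Placeholder for model outputs: 4 visual tokens with 8-dim features.
# In practice these would come from the model, e.g. via
# outputs.detach().cpu().float().numpy().
feats = np.random.default_rng(0).normal(size=(4, 8))

# L2-normalize each token's feature vector so that cosine similarity
# reduces to a plain dot product.
feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)

# Pairwise cosine similarities between tokens.
sim = feats @ feats.T

# The diagonal is 1.0 because every vector now has unit norm.
print(np.allclose(np.diag(sim), 1.0))  # True
```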
We use the WebDataset format: training data is stored as .tar archives (shards). Each shard contains a sequence of samples (e.g. image + text or other modalities), with one file per field inside the tar, which allows efficient sequential I/O and scales well for distributed training.
For details on how we build these shards—including downloading with img2dataset, consolidating metadata to .jsonl, and packing into WebDataset .tar files—see the data_pack folder and its README.
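To illustrate the shard layout (this is not the project's packing code, and the file naming here is an assumption based on the usual WebDataset convention), each sample is a group of files sharing a key, one file per field:

```python
import io
import json
import tarfile

# Write a tiny illustrative shard: each sample is a group of files
# sharing a key ("000000"), one file per field (.jpg, .txt, .json).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key in ("000000", "000001"):
        for ext, payload in (
            (".jpg", b"\xff\xd8fake-jpeg-bytes"),          # image bytes
            (".txt", b"a caption for sample " + key.encode()),
            (".json", json.dumps({"key": key}).encode()),  # metadata
        ):
            info = tarfile.TarInfo(name=key + ext)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Read it back sequentially, regrouping files by key -- this streaming
# access pattern is what makes the format I/O-efficient.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        key, ext = member.name.split(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(sorted(samples))            # ['000000', '000001']
print(sorted(samples["000000"]))  # ['jpg', 'json', 'txt']
```

The real pipeline (see data_pack) builds these shards from img2dataset downloads plus consolidated .jsonl metadata, but the grouping-by-key structure is the same.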
```bash
# Stage 1: caption-only warmup
sh scripts/train_multi_task_webloader_qwen2_warmup.sh exp/cap-only-warmup-qwen2.yaml
# Stage 2: multi-task post-training
sh scripts/train_multi_task_webloader_qwen2.sh exp/multi-task-post-train-qwen2.yaml
```
- Segmentation & depth (linear probing): We provide scripts and configs for linear probing on segmentation and depth. See the evaluation folder: the subfolders evaluation/segmentation and evaluation/monodepth contain the respective setup and run instructions.
- VQA with LLM: Our VQA training (connecting the vision encoder to an LLM) is done with an internal company framework, so we are unable to open-source that part of the code. We do release the trained model weights for this setup; you can find them here.
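For readers unfamiliar with linear probing: the backbone is frozen and only a linear head is fit on its features. A minimal closed-form sketch using ridge regression on synthetic features (NumPy stand-ins for frozen ViT features; this is not the project's actual probe code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen backbone features (N samples, D dims)
# and per-sample regression targets (e.g. depth values for monodepth).
N, D = 256, 32
X = rng.normal(size=(N, D))
true_w = rng.normal(size=(D, 1))
y = X @ true_w + 0.01 * rng.normal(size=(N, 1))

# Linear probing = fitting only a linear map on top of frozen features.
# Closed-form ridge regression: W = (X^T X + lam * I)^(-1) X^T y
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# The probe recovers the underlying linear relation; the residual
# error sits at the level of the injected label noise.
mse = float(np.mean((X @ W - y) ** 2))
print(mse < 1e-3)  # True
```

In practice the segmentation probe is a per-pixel classifier and the depth probe a per-pixel regressor trained with SGD, but the frozen-backbone-plus-linear-head principle is the same.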
Many thanks to the codebases of InternVL and FiT3D.
If you use this code for your research or project, please cite:
```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```