```bash
conda create -n versavit python=3.11.0
conda activate versavit
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install ninja
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
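The exact output format of `forward_wt_merger` is model-specific; assuming it returns a `(num_tokens, hidden_dim)` feature tensor, a common post-processing step is to L2-normalize the features before computing similarities. A minimal NumPy sketch with random placeholder features (not the model's actual outputs):

```python
import numpy as np

# Placeholder for model outputs: 4 visual tokens with 8-dim features.
# In practice these would come from the model, e.g. via
# outputs.detach().cpu().float().numpy().
feats = np.random.default_rng(0).normal(size=(4, 8))

# L2-normalize each token's feature vector so that cosine similarity
# reduces to a plain dot product.
feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)

# Pairwise cosine similarities between tokens.
sim = feats @ feats.T

# The diagonal is 1.0 because every vector now has unit norm.
print(np.allclose(np.diag(sim), 1.0))  # True
```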
We use the WebDataset format: training data is stored as .tar archives (shards). Each shard contains a sequence of samples (e.g. image + text or other modalities), with one file per field inside the tar, which allows efficient sequential I/O and scales well for distributed training.
For details on how we build these shards—including downloading with img2dataset, consolidating metadata to .jsonl, and packing into WebDataset .tar files—see the data_pack folder and its README.
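To illustrate the shard layout (this is not the project's packing code, and the file naming here is an assumption based on the usual WebDataset convention), each sample is a group of files sharing a key, one file per field:

```python
import io
import json
import tarfile

# Write a tiny illustrative shard: each sample is a group of files
# sharing a key ("000000"), one file per field (.jpg, .txt, .json).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key in ("000000", "000001"):
        for ext, payload in (
            (".jpg", b"\xff\xd8fake-jpeg-bytes"),          # image bytes
            (".txt", b"a caption for sample " + key.encode()),
            (".json", json.dumps({"key": key}).encode()),  # metadata
        ):
            info = tarfile.TarInfo(name=key + ext)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Read it back sequentially, regrouping files by key -- this streaming
# access pattern is what makes the format I/O-efficient.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        key, ext = member.name.split(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(sorted(samples))            # ['000000', '000001']
print(sorted(samples["000000"]))  # ['jpg', 'json', 'txt']
```

The real pipeline (see data_pack) builds these shards from img2dataset downloads plus consolidated .jsonl metadata, but the grouping-by-key structure is the same.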
```bash
# Stage 1: caption-only warmup
sh scripts/train_multi_task_webloader_qwen2_warmup.sh exp/cap-only-warmup-qwen2.yaml
# Stage 2: multi-task post-training
sh scripts/train_multi_task_webloader_qwen2.sh exp/multi-task-post-train-qwen2.yaml
```
- Segmentation & depth (linear probing): We provide scripts and configs for linear probing on segmentation and depth. See the evaluation folder: the subfolders evaluation/segmentation and evaluation/monodepth contain the respective setup and run instructions.
- VQA with LLM: Our VQA training (connecting the vision encoder to an LLM) is done with an internal company framework, so we are unable to open-source that part of the code. We do release the trained model weights for this setup; you can find them here.
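For readers unfamiliar with linear probing: the backbone is frozen and only a linear head is fit on its features. A minimal closed-form sketch using ridge regression on synthetic features (NumPy stand-ins for frozen ViT features; this is not the project's actual probe code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen backbone features (N samples, D dims)
# and per-sample regression targets (e.g. depth values for monodepth).
N, D = 256, 32
X = rng.normal(size=(N, D))
true_w = rng.normal(size=(D, 1))
y = X @ true_w + 0.01 * rng.normal(size=(N, 1))

# Linear probing = fitting only a linear map on top of frozen features.
# Closed-form ridge regression: W = (X^T X + lam * I)^(-1) X^T y
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# The probe recovers the underlying linear relation; the residual
# error sits at the level of the injected label noise.
mse = float(np.mean((X @ W - y) ** 2))
print(mse < 1e-3)  # True
```

In practice the segmentation probe is a per-pixel classifier and the depth probe a per-pixel regressor trained with SGD, but the frozen-backbone-plus-linear-head principle is the same.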
Many thanks to the codebases of InternVL and FiT3D.
If you use this code for your research or project, please cite:
```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```