This repo is the official implementation of the paper UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing, and its follow-ups. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.
UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng$^\dagger$, Liwei Wang$^\dagger$
- Primary contact: Hao Tang (tanghao@stu.pku.edu.cn)
- [2026-01-26] UniLIP is accepted by ICLR 2026!
- [2025-10-09] All checkpoints, training, and inference code are released.
- [2025-07-31] UniLIP is released on arXiv.
Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:
- Two-Stage Self-Distillation: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities.
- Dual-Condition Architecture: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
- State-of-the-Art Performance: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.
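The two key ideas above can be illustrated with a minimal sketch. Everything below is our own simplification, not UniLIP's actual training code: the function names, the MSE stand-ins for both losses, and the plain concatenation for dual conditioning are assumptions. The intent is only to show the shape of the approach: stage-2 training adds a self-distillation term that anchors the tuned CLIP to a frozen copy of itself, and the diffusion decoder is conditioned on MLLM context features together with learnable-query outputs.

```python
import numpy as np

# Hypothetical sketch of the UniLIP training ideas (names and loss choices are
# our assumptions, not the repo's API).

def reconstruction_loss(decoded_pixels, target_pixels):
    """Pixel-level reconstruction error (MSE stand-in)."""
    return float(np.mean((decoded_pixels - target_pixels) ** 2))

def self_distillation_loss(student_features, teacher_features):
    """Anchors the tuned CLIP's features to a frozen teacher copy,
    preventing semantic degradation while reconstruction improves."""
    return float(np.mean((student_features - teacher_features) ** 2))

def stage2_objective(decoded_pixels, target_pixels,
                     student_features, teacher_features, lam=1.0):
    """Stage-2 loss: reconstruction plus weighted self-distillation."""
    return (reconstruction_loss(decoded_pixels, target_pixels)
            + lam * self_distillation_loss(student_features, teacher_features))

def dual_condition(mllm_context, query_tokens):
    """Dual-condition sketch: combine rich multimodal context features with
    learnable-query outputs into a single conditioning sequence."""
    return np.concatenate([mllm_context, query_tokens], axis=0)
```

For example, with a perfectly reconstructed image and unchanged features, the stage-2 objective is zero; any drift of the student features away from the frozen teacher adds a penalty scaled by `lam`.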
| Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| VILA-U | 256 | 16 | 1.80 | - | - |
| Tokenflow | 256 | 16 | 1.37 | 21.41 | 0.687 |
| DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
| UniLIP | 256 | 32 | 0.79 | 22.99 | 0.747 |
| Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
| UniLIP | 448 | 32 | 0.31 | 24.62 | 0.788 |
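For reference, the PSNR column can be reproduced with the standard definition below (a minimal sketch; it assumes images normalized to [0, 1], which may differ from the repo's evaluation pipeline). Higher is better, matching the ↑ in the header.

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```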
| Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
|---|---|---|---|---|---|---|---|---|
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
| InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
| BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
| BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
| TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | 73.0 | - | - | - |
| UniLIP-1B | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
| UniLIP-3B | 2B | 1636 | 80.7 | 48.7 | 62.2 | 75.0 | 78.6 | 73.0 |
| Model | # Params | GenEval | WISE | ImgEdit |
|---|---|---|---|---|
| BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
| BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
| UniWorld-V1 | 7B+12B | - | - | 3.26 |
| UniLIP-1B | 1B+0.6B | 0.88 | 0.56 | 3.81 |
| UniLIP-3B | 2B+1.6B | 0.90 | 0.63 | 3.94 |
conda create -n UniLIP python=3.11
conda activate UniLIP
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e .
Please download the UniLIP-1B and UniLIP-3B checkpoints and save them under the root dir:
UniLIP
├── UniLIP-1B
├── UniLIP-3B
└── ...
Run demo on image generation:
python scripts/inference_gen.py ./UniLIP-3B
Run demo on image editing:
python scripts/inference_edit.py ./UniLIP-3B
For training and evaluation commands for generation and editing, please refer to TRAIN.md and EVAL.md.
For training, evaluation, and inference scripts for reconstruction, please refer here.
UniLIP does not require additional training for understanding; the evaluation script for understanding is here.
- TiTok: We implement reconstruction training following TiTok.
- BLIP3-o: Thanks to BLIP3-o for providing the generation data and training code.
- InternVL: We use InternVL3 as the pretrained MLLM.
- SANA: We use SANA as the pretrained DiT.
- DC-AE: We use the pixel decoder from DC-AE.
If you find our work helpful, please consider citing it as follows.
@article{tang2025unilip,
title={UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
journal={arXiv preprint arXiv:2507.23278},
year={2025}
}
