This repo is the official implementation of the paper UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing, and its follow-ups. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.
UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng$^\dagger$, Liwei Wang$^\dagger$
- Primary contact: Hao Tang (tanghao@stu.pku.edu.cn)
- [2026-01-26] UniLIP is accepted by ICLR 2026!
- [2025-10-09] All checkpoints, training, and inference code are released.
- [2025-07-31] UniLIP is released on arXiv.
Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. To overcome this, we propose UniLIP:
- Two-Stage Self-Distillation: A novel training scheme that teaches CLIP high-fidelity reconstruction without degrading its powerful comprehension abilities.
- Dual-Condition Architecture: Enhances reasoning and edit consistency by combining rich multimodal context with learnable queries that harness the power of MLLMs.
- State-of-the-Art Performance: Achieves top results on GenEval (0.88/0.90), WISE (0.56/0.63), and ImgEdit (3.81/3.94) with efficient 1B/3B models, demonstrating superior instruction following and edit fidelity.
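The two key ideas above can be illustrated with a minimal sketch. Everything below is our own simplification, not UniLIP's actual training code: the function names, the MSE stand-ins for both losses, and the plain concatenation for dual conditioning are assumptions. The intent is only to show the shape of the approach: stage-2 training adds a self-distillation term that anchors the tuned CLIP to a frozen copy of itself, and the diffusion decoder is conditioned on MLLM context features together with learnable-query outputs.

```python
import numpy as np

# Hypothetical sketch of the UniLIP training ideas (names and loss choices are
# our assumptions, not the repo's API).

def reconstruction_loss(decoded_pixels, target_pixels):
    """Pixel-level reconstruction error (MSE stand-in)."""
    return float(np.mean((decoded_pixels - target_pixels) ** 2))

def self_distillation_loss(student_features, teacher_features):
    """Anchors the tuned CLIP's features to a frozen teacher copy,
    preventing semantic degradation while reconstruction improves."""
    return float(np.mean((student_features - teacher_features) ** 2))

def stage2_objective(decoded_pixels, target_pixels,
                     student_features, teacher_features, lam=1.0):
    """Stage-2 loss: reconstruction plus weighted self-distillation."""
    return (reconstruction_loss(decoded_pixels, target_pixels)
            + lam * self_distillation_loss(student_features, teacher_features))

def dual_condition(mllm_context, query_tokens):
    """Dual-condition sketch: combine rich multimodal context features with
    learnable-query outputs into a single conditioning sequence."""
    return np.concatenate([mllm_context, query_tokens], axis=0)
```

For example, with a perfectly reconstructed image and unchanged features, the stage-2 objective is zero; any drift of the student features away from the frozen teacher adds a penalty scaled by `lam`.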
| Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| VILA-U | 256 | 16 | 1.80 | - | - |
| Tokenflow | 256 | 16 | 1.37 | 21.41 | 0.687 |
| DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
| UniLIP | 256 | 32 | 0.79 | 22.99 | 0.747 |
| Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
| UniLIP | 448 | 32 | 0.31 | 24.62 | 0.788 |
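For reference, the PSNR column can be reproduced with the standard definition below (a minimal sketch; it assumes images normalized to [0, 1], which may differ from the repo's evaluation pipeline). Higher is better, matching the ↑ in the header.

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```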
| Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
|---|---|---|---|---|---|---|---|---|
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
| InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
| BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
| BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
| TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | 73.0 | - | - | - |
| UniLIP-1B | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
| UniLIP-3B | 2B | 1636 | 80.7 | 48.7 | 62.2 | 75.0 | 78.6 | 73.0 |
| Model | # Params | GenEval | WISE | ImgEdit |
|---|---|---|---|---|
| BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
| BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
| UniWorld-V1 | 7B+12B | - | - | 3.26 |
| UniLIP-1B | 1B+0.6B | 0.88 | 0.56 | 3.81 |
| UniLIP-3B | 2B+1.6B | 0.90 | 0.63 | 3.94 |
conda create -n UniLIP python=3.11
conda activate UniLIP
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e .
Please download the UniLIP-1B and UniLIP-3B checkpoints and save them under the root dir:
UniLIP
├── UniLIP-1B
├── UniLIP-3B
└── ...
Run demo on image generation:
python scripts/inference_gen.py ./UniLIP-3B
Run demo on image editing:
python scripts/inference_edit.py ./UniLIP-3B
For training and evaluation commands for generation and editing, please refer to TRAIN.md and EVAL.md.
For training, evaluation, and inference scripts for reconstruction, please refer here.
UniLIP does not require additional training for understanding; the evaluation script for understanding is here.
- TiTok: We implement reconstruction training following TiTok.
- BLIP3-o: Thanks to BLIP3-o for providing the generation data and training code.
- InternVL: We use InternVL3 as the pretrained MLLM.
- SANA: We use SANA as the pretrained DiT.
- DC-AE: We use the pixel decoder from DC-AE.
If you find our work helpful, please consider citing it as follows.
@article{tang2025unilip,
title={UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
journal={arXiv preprint arXiv:2507.23278},
year={2025}
}
