- MMLab, CUHK
- Hong Kong, China
- https://caraj7.github.io/
Stars
Implement search image generation similar to Nano-banana-pro / Seedream / FLUX.
[CVPR 2026] The official implementation of the paper "Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation"
RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards.
Official Repository for Paper: DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
The first Interleaved framework for textual reasoning within the visual generation process
ULMEvalKit: One-Stop Eval ToolKit for Image Generation
Qwen-Image text-to-image LoRA trainer
A curated collection of fun and creative examples generated with Nano Banana & Nano Banana Pro🍌, Gemini-2.5-flash-image based model. We also release Nano-consistent-150K openly to support the commu…
Echo-4o: Harnessing Proprietary Models’ Synthetic Images for Improved Image Generation
A curated gallery and toolkit designed to provide inspiration for scientific illustrations, project sites, and visual storytelling in research.
CLIP+MLP Aesthetic Score Predictor
[CVPR 2025 (Oral)] Open implementation of "RandAR"
Official codebase for "Self Forcing: Bridging Training and Inference in Autoregressive Video Diffusion" (NeurIPS 2025 Spotlight)
[NeurIPS 2025] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
Jodi: Unification of Visual Generation and Understanding via Joint Modeling
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Witness the aha moment of VLM with less than $3.
Awesome Unified Multimodal Models
GPT-ImgEval: Evaluating GPT-4o’s state-of-the-art image generation capabilities
[ICLR 2025 Spotlight] The official implementation of the paper “LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models”
[ICCV 2025] Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
[NeurIPS 2025] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
[ACM Computing Surveys] The collection of awesome papers on alignment of diffusion models.
MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.


