
DAGroup-PKU/SpatialT2I


Zhenyu Tang1,2*, Chaoran Feng1*, Yufan Deng1,2, Jie Wu2, Xiaojie Li2,
Rui Wang2, Yunpeng Chen2, Daquan Zhou1

1Peking University    2ByteDance Seed
*Equal Contribution

Project Page · arXiv · Dataset · HF Space · License: CC BY-NC 4.0


📅 News

  • [2026.03] 📄 Paper is now available on arXiv: https://arxiv.org/abs/2602.24233
  • [TBD] 🚧 We plan to release the SpatialReward-Dataset and SpatialScore model weights. Please stay tuned!

📖 Abstract

Text-to-image models have made significant strides in visual fidelity, but they still struggle with complex spatial relationships, and existing reward models fail to capture these intricate spatial constraints.

In this work, we introduce a method to strengthen spatial understanding in image generation, built on three components:

  1. SpatialReward-Dataset: A curated dataset with over 80k preference pairs, featuring adversarial spatial perturbations verified by humans.
  2. SpatialScore: A VLM-based reward model (built on Qwen2.5-VL) that surpasses proprietary models (e.g., GPT-5, Gemini 2.5) in spatial evaluation accuracy.
  3. Spatial-RL: We demonstrate that SpatialScore effectively enables online Reinforcement Learning (specifically GRPO with Top-k filtering), yielding significant gains in spatial generation capabilities.
Figure 1: Existing reward models often assign high scores to spatially incorrect images; SpatialScore provides accurate feedback, enabling better alignment.

🔥 Highlights & Contributions

1. SpatialReward-Dataset

We constructed a large-scale dataset focusing on spatial logic. Each entry consists of a "Perfect Image" (aligned with the text) and a "Perturbed Image" (with subtle spatial violations), creating a hard negative sample for robust training.
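Preference pairs of this shape are typically used to train a reward model with a pairwise ranking objective. A minimal sketch using the standard Bradley-Terry loss (an assumption for illustration; this README does not state SpatialScore's exact training objective):

```python
import math

def preference_loss(score_perfect: float, score_perturbed: float) -> float:
    """Pairwise (Bradley-Terry) ranking loss: -log sigmoid(s_pos - s_neg).

    `score_perfect` / `score_perturbed` are hypothetical scalar reward-model
    outputs on the "Perfect" and "Perturbed" image of one dataset pair.
    Minimizing this loss pushes the model to rank the perfect image higher.
    """
    margin = score_perfect - score_perturbed
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

Because the perturbations are subtle (hard negatives), the score margin tends to be small, which is exactly where this loss provides the most gradient signal.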

2. SpatialScore: State-of-the-Art Reward Model

By fine-tuning Qwen2.5-VL, our SpatialScore achieves superior performance in evaluating spatial relationships, outperforming strong baselines including HPS v2/v3, ImageReward, and even proprietary VLM APIs on our benchmarks.

Model                  Overall Accuracy
HPS v2.1                        46.3%
ImageReward                     47.9%
GPT-5 (API)                     89.0%
Gemini-2.5 Pro                  95.1%
SpatialScore (Ours)             95.8%
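Reading "Overall Accuracy" as pairwise accuracy on the preference pairs (an assumption; the benchmark protocol is not detailed in this README), the metric reduces to:

```python
def pairwise_accuracy(scores_perfect, scores_perturbed):
    """Fraction of pairs where the reward model ranks the spatially correct
    ("Perfect") image above its "Perturbed" counterpart.

    Illustrative only; the benchmark's exact protocol (e.g. tie handling)
    may differ.
    """
    assert len(scores_perfect) == len(scores_perturbed)
    wins = sum(p > q for p, q in zip(scores_perfect, scores_perturbed))
    return wins / len(scores_perfect)
```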

3. Reinforcement Learning with Top-k Filtering

We apply GRPO (Group Relative Policy Optimization) using SpatialScore as the feedback signal. To handle reward noise and prompt difficulty variance, we introduce a Top-k filtering strategy, which significantly stabilizes training and improves convergence.

RL Pipeline
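The group-relative advantage computation with a Top-k filter can be sketched as follows. The filtering criterion used here (keep the k prompts whose reward groups show the largest spread) is an illustrative assumption; the paper's exact rule may differ:

```python
import statistics

def grpo_advantages_topk(reward_groups, k):
    """Group-relative advantages (GRPO-style) with a Top-k prompt filter.

    reward_groups: list of per-prompt reward lists, one group of sampled
    images per prompt, scored by a reward model such as SpatialScore.
    Returns per-sample advantages and a keep-mask over prompts.
    """
    advantages, spreads = [], []
    for rewards in reward_groups:
        mu = statistics.fmean(rewards)
        sigma = statistics.pstdev(rewards) + 1e-6  # avoid division by zero
        # Standardize rewards within the group (group-relative advantage).
        advantages.append([(r - mu) / sigma for r in rewards])
        spreads.append(statistics.pstdev(rewards))
    # Top-k filtering: keep the k prompts with the highest reward spread;
    # near-uniform groups carry little learning signal and add noise.
    keep = sorted(range(len(reward_groups)), key=lambda i: spreads[i])[-k:]
    mask = [i in keep for i in range(len(reward_groups))]
    return advantages, mask
```

Filtering out low-spread groups is one plausible way such a strategy could reduce reward noise and the variance introduced by prompts that are uniformly easy or uniformly hard.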

🖼️ Visual Results

Our method significantly improves the spatial layout capability of Flux.1-dev.

Visual Results


Comparison of generated images using complex spatial prompts.

โœ๏ธ Citation

If you find our work useful, please cite our paper:

@article{tang2026enhancing,
  title={Enhancing Spatial Understanding in Image Generation via Reward Modeling},
  author={Tang, Zhenyu and Feng, Chaoran and Deng, Yufan and Wu, Jie and Li, Xiaojie and Wang, Rui and Chen, Yunpeng and Zhou, Daquan},
  journal={arXiv preprint arXiv:2602.24233},
  year={2026}
}

⚖️ License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more details, please refer to the LICENSE file.

About

[CVPR 2026🔥] Enhancing Spatial Understanding in Image Generation via Reward Modeling
