Zhenyu Tang1,2*, Chaoran Feng1*, Yufan Deng1,2, Jie Wu2, Xiaojie Li2,
Rui Wang2, Yunpeng Chen2, Daquan Zhou1
1Peking University    2ByteDance Seed
*Equal Contribution
[2026.03] Paper is now available on arXiv: https://arxiv.org/abs/2602.24233
[TBD] We are planning to release the SpatialReward-Dataset and SpatialScore model weights. Please stay tuned!
Text-to-image models have made significant strides in visual fidelity, but they still struggle with complex spatial relationships, and existing reward models fail to capture these intricate spatial constraints.
In this work, we introduce a novel method to strengthen spatial understanding in image generation:
- SpatialReward-Dataset: A curated dataset with over 80k preference pairs, featuring adversarial spatial perturbations verified by humans.
- SpatialScore: A VLM-based reward model (built on Qwen2.5-VL) that surpasses proprietary models (e.g., GPT-5, Gemini 2.5) in spatial evaluation accuracy.
- Spatial-RL: We demonstrate that SpatialScore effectively enables online Reinforcement Learning (specifically GRPO with Top-k filtering), yielding significant gains in spatial generation capabilities.
Figure 1: Existing reward models often assign high scores to spatially incorrect images. SpatialScore provides accurate feedback, enabling better alignment.
We constructed a large-scale dataset focusing on spatial logic. Each entry consists of a "Perfect Image" (aligned with the text) and a "Perturbed Image" (with subtle spatial violations), creating a hard negative sample for robust training.
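The pairing described above can be sketched as a simple data schema. This is an illustrative layout only; the field names and file paths are assumptions, not the released dataset format.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """Hypothetical schema for one SpatialReward-Dataset entry."""
    prompt: str           # text with an explicit spatial constraint
    chosen_image: str     # path to the "Perfect Image" (constraint satisfied)
    rejected_image: str   # path to the "Perturbed Image" (subtle spatial violation)
    perturbation: str     # the adversarial edit, e.g. a left/right swap

# Example entry (illustrative values).
pair = PreferencePair(
    prompt="a red mug to the left of a blue book",
    chosen_image="perfect/0001.png",
    rejected_image="perturbed/0001.png",
    perturbation="left/right swap",
)
```

Because the two images differ only in the spatial relation, the rejected image acts as a hard negative: a reward model cannot tell the pair apart from appearance alone.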
By fine-tuning Qwen2.5-VL, our SpatialScore achieves superior performance in evaluating spatial relationships, outperforming strong baselines including HPS v2/v3, ImageReward, and even proprietary VLM APIs on our benchmarks.
| Model | Overall Accuracy |
|---|---|
| HPS v2.1 | 46.3% |
| ImageReward | 47.9% |
| GPT-5 (API) | 89.0% |
| Gemini-2.5 Pro | 95.1% |
| SpatialScore (Ours) | 95.8% |
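A minimal sketch of how an accuracy number like those above could be computed, assuming the benchmark measures the fraction of preference pairs where the reward model scores the perfect image above the perturbed one (the exact protocol and scores here are illustrative):

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the reward model ranks the perfect image higher.

    Each pair is (score_for_perfect_image, score_for_perturbed_image).
    """
    correct = sum(1 for perfect, perturbed in pairs if perfect > perturbed)
    return correct / len(pairs)

# Illustrative scores from some reward model on four preference pairs.
scores = [(0.92, 0.41), (0.88, 0.90), (0.75, 0.30), (0.81, 0.12)]
acc = pairwise_accuracy(scores)  # 3 of 4 pairs ranked correctly -> 0.75
```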
We apply GRPO (Group Relative Policy Optimization) using SpatialScore as the feedback signal. To handle reward noise and prompt difficulty variance, we introduce a Top-k filtering strategy, which significantly stabilizes training and improves convergence.
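The two ingredients above can be sketched in a few lines. The group-relative advantage is standard GRPO (standardize rewards within each prompt's rollout group); the `topk_filter` shown here is one plausible reading of Top-k filtering, namely keeping the k prompt groups with the largest within-group reward spread, since groups whose rollouts all score alike carry little learning signal. The exact filtering criterion used in the paper may differ.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

def topk_filter(groups, k):
    """Keep the k prompt groups with the largest within-group reward spread."""
    spread = lambda g: max(g["rewards"]) - min(g["rewards"])
    return sorted(groups, key=spread, reverse=True)[:k]

# Example: three prompts, four sampled images each, scored by the reward model.
groups = [
    {"prompt": "cat left of dog", "rewards": [0.9, 0.2, 0.8, 0.1]},
    {"prompt": "sun above sea",   "rewards": [0.5, 0.5, 0.5, 0.5]},  # no signal
    {"prompt": "cup on a shelf",  "rewards": [0.7, 0.3, 0.6, 0.4]},
]
kept = topk_filter(groups, k=2)  # drops the zero-spread group
advantages = {g["prompt"]: grpo_advantages(g["rewards"]) for g in kept}
```

Filtering out uninformative groups before computing advantages avoids dividing near-zero reward deviations by a near-zero standard deviation, which is one way noisy rewards destabilize training.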
Our method significantly improves the spatial layout capability of Flux.1-dev.
Comparison of generated images using complex spatial prompts.
If you find our work useful, please cite our paper:
@article{tang2026enhancing,
title={Enhancing Spatial Understanding in Image Generation via Reward Modeling},
author={Tang, Zhenyu and Feng, Chaoran and Deng, Yufan and Wu, Jie and Li, Xiaojie and Wang, Rui and Chen, Yunpeng and Zhou, Daquan},
journal={arXiv preprint arXiv:2602.24233},
year={2026}
}
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more details, please refer to the LICENSE file.

