Tsinghua University SIGS; Meituan
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan,
Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Though recent advances in vision–language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues; however, their limited representational capacity hinders performance on tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded in images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, training proceeds in two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely from outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning.
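As a rough illustration of the stage-1 objective (a minimal sketch, not the repo's implementation; latent layout and the exact loss are assumptions), aligning the VLM's 3D latent to a frozen VGGT latent can be as simple as a mean-squared error:

```python
# Minimal sketch of the stage-1 alignment idea: pull the VLM's 3D latent
# toward a frozen 3D foundation-model (e.g., VGGT) latent with an MSE loss.
# Plain-Python stand-in; real training uses tensors and backpropagation.

def alignment_loss(vlm_latent, vggt_latent):
    """Mean-squared error between two equal-length latent vectors."""
    assert len(vlm_latent) == len(vggt_latent)
    n = len(vlm_latent)
    return sum((a - b) ** 2 for a, b in zip(vlm_latent, vggt_latent)) / n

print(alignment_loss([1.0, 2.0], [0.0, 2.0]))  # 0.5
```

Minimizing this loss teaches the VLM to produce latents that a 3D foundation model would produce, without needing explicit 3D labels.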
[25/11/18] We release the code for 3DThinker, both stage 1 and stage 2. See Training for usage.
[25/11/25] We fix some bugs in the 3D latent assignment and release the evaluation results of 3DThinker-Qwen2.5-VL-3B.
[25/12/05] We replace the full data with an example case for rechecking.
[26/02/10] We release the data and model for training on MindCube_Train and testing on MindCube-Tiny. See One Case for details.
git clone https://github.com/zhangquanchen/3DThinker.git
cd 3DThinker
conda create -n 3DThinker-stage1 python=3.10 -y && conda activate 3DThinker-stage1
pip install -r envs/requirements_stage1.txt
conda create -n 3DThinker-stage2 python=3.10 -y && conda activate 3DThinker-stage2
bash 3dthinker/stage2/setup.sh
If the installed trl version conflicts with our repository, replace it with the local copy by running:
cp -rf 3dthinker/stage2/package/trl /home/tiger/anaconda3/envs/3DThinker-stage2/lib/python3.10/site-packages/
- Remark: You can refer to envs/requirements_stage2.txt to configure the environment.
Follow LLaMA-Factory for environment setup; it is used for SFT training and weight merging.
conda create -n SFT python=3.10 -y && conda activate SFT
cd SFT/env
pip install -e ".[torch,metrics]" --no-build-isolation
- Remark: You can refer to envs/requirements_sft.txt to configure the environment.
You should first download the data from here, which includes MindCube_train_raw_qa_qwen_sft.json and images. The idx.jsonl under the data folder is the data with idx. Then, following the two steps below, you will get data_output3d_begin_10k_resized.jsonl for training.
- VGGT feature extraction:
Download the VGGT-1B weight and place it under models.
python preprocessing/feature/extract_vggt_feature.py
After doing this, you will get data/feature_vggt (VGGT features) and data/resized_images (images resized for training).
- CoT data generation and filtering:
## produce chain-of-thought data
python preprocessing/produce_cot.py
## remove non-compliant data, e.g., w/o </output>
python preprocessing/clean.py
## filter useless data
python preprocessing/remove.py
## match VGGT indices
python preprocessing/jsonl_add_idx.py
After these steps, you will get data_output3d_begin_10k_resized.jsonl under the data folder.
- Remark:
example.jsonl is an example dataset for training.
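The clean/filter/index steps above can be sketched roughly as follows. This is a minimal sketch: the field name `response` and index-by-enumeration matching are assumptions, not the repo's actual schema or scripts.

```python
import json

# Rough sketch of the cleaning and index-matching preprocessing steps.
# ASSUMPTIONS: records carry a "response" field, and VGGT features are
# matched by enumeration order; preprocessing/clean.py and
# preprocessing/jsonl_add_idx.py may differ in detail.

def clean_and_index(jsonl_lines):
    kept = []
    for line in jsonl_lines:
        rec = json.loads(line)
        # drop non-compliant records, e.g., those missing </output>
        if "</output>" not in rec.get("response", ""):
            continue
        rec["idx"] = len(kept)  # index used to look up the VGGT feature
        kept.append(rec)
    return kept

raw = [
    json.dumps({"response": "<output>table</output>"}),
    json.dumps({"response": "<output>chair"}),  # missing </output>
    json.dumps({"response": "<output>lamp</output>"}),
]
cleaned = clean_and_index(raw)
print([r["idx"] for r in cleaned])  # [0, 1]
```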
Once you finish data preprocessing, you can start training 3DThinker!
- Supervised Training
Prepare your base model under models (e.g., Qwen2.5-VL-3B).
conda activate 3DThinker-stage1
cd 3dthinker/stage1 && sh train.sh
- Reinforced Training
conda activate 3DThinker-stage2
cd 3dthinker/stage2 && bash run_scripts/train.sh
## merge the weight
conda activate SFT && llamafactory-cli export merge.yaml
You can follow the official code of MindCube for evaluation. Specifically, you need to organize the benchmark according to its requirements, and then run the following scripts.
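The merge step consumes a LLaMA-Factory export config. The repo's actual merge.yaml is not shown here; a typical export config (field names follow LLaMA-Factory's example configs, all paths are placeholders) looks something like:

```yaml
### model (placeholder paths; adjust to your own checkpoints)
model_name_or_path: models/Qwen2.5-VL-3B
adapter_name_or_path: saves/3dthinker-stage2
template: qwen2_vl
finetuning_type: lora

### export
export_dir: models/3DThinker-merged
export_size: 5
export_legacy_format: false
```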
cd eval
sh eval_mindcube.sh
sh get_result.sh
For other base models, you can use eval_xxx.py for evaluation.
Other benchmarks include Ego3D-Bench, VSI-Bench, SPBench, CV-Bench, SPAR-Bench, ViewSpatial-Bench, and MMSI-Bench.
A case for training on MindCube_Train and testing on MindCube-Tiny is listed below:
- Training data, which is for MindCube training.
- Model, which is trained after stage 1 on Qwen2.5-VL-3B. Note that the model in Tab. 2 is trained on different training data.
This repo also benefits from Mirage, trl, transformers, VLM-R1, MindCube, and VGGT.
Thanks for their wonderful work.
If you find 3DThinker helpful for your work, please cite:
@article{chen2025think,
title={Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views},
author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and Luo, Xufang and Sun, Mingze and Pan, Zihao and Feng, Yan and Pei, Peng and Cai, Xunliang and Huang, Ruqi},
journal={arXiv preprint arXiv:2510.18632},
year={2025}
}
