Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

arXiv: 2510.18632

Tsinghua University SIGS; Meituan

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan,

Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang

Overview

(Overview figure)

Introduction

Though recent advances in vision–language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues, but their limited representational capacity hinders performance on tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning.
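As a rough illustration of the stage-1 objective described above (our reading of the description, not the repository's actual implementation), the VLM's intermediate latent can be projected into the 3D latent space and regressed against the frozen VGGT latent; all names, shapes, and the MSE choice here are assumptions:

```python
import numpy as np

def alignment_loss(vlm_latent, vggt_latent, proj):
    """MSE between the projected VLM latent and the target VGGT latent.

    vlm_latent: (n, d_vlm) hidden states emitted while reasoning (assumed shape)
    vggt_latent: (n, d_3d) features from the frozen 3D foundation model
    proj: (d_vlm, d_3d) learned projection matrix (hypothetical)
    """
    pred = vlm_latent @ proj  # map VLM space -> 3D latent space
    return float(np.mean((pred - vggt_latent) ** 2))

rng = np.random.default_rng(0)
loss = alignment_loss(rng.normal(size=(4, 8)),
                      rng.normal(size=(4, 6)),
                      rng.normal(size=(8, 6)))
```

Stage 2 then drops this dense supervision and optimizes the whole trajectory from outcome rewards alone.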

Changelog

[25/11/18] We release the code for 3DThinker, both stage 1 and stage 2. See Training for usage.

[25/11/25] We fix some bugs in the 3D latent assignment and release the evaluation results of 3DThinker-Qwen2.5-VL-3B.

[25/12/05] We replace the full data with an example case for rechecking.

[26/02/10] We release the data and model for training on MindCube_Train and testing on MindCube-Tiny. See One Case for details.

Env Setup

git clone https://github.com/zhangquanchen/3DThinker.git
cd 3DThinker

3DThinker-stage1

conda create -n 3DThinker-stage1 python=3.10 -y && conda activate 3DThinker-stage1
pip install -r envs/requirements_stage1.txt

3DThinker-stage2

conda create -n 3DThinker-stage2 python=3.10 -y && conda activate 3DThinker-stage2
bash 3dthinker/stage2/setup.sh

If the installed trl version conflicts with our repository, replace it with the local copy by running:

cp -rf 3dthinker/stage2/package/trl /home/tiger/anaconda3/envs/3DThinker-stage2/lib/python3.10/site-packages/
  • Remark: Adjust the conda prefix above to match your installation. You can refer to envs/requirements_stage2.txt to configure the environment.
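If your conda prefix differs from the path in the command above, you can locate the active environment's site-packages directory programmatically instead of hardcoding it (a small helper of our own, not part of the repo):

```python
import sysconfig

def site_packages_dir() -> str:
    """Return the active interpreter's site-packages directory."""
    return sysconfig.get_paths()["purelib"]

# e.g. copy the bundled trl into the active environment:
# import shutil
# shutil.copytree("3dthinker/stage2/package/trl",
#                 f"{site_packages_dir()}/trl", dirs_exist_ok=True)
print(site_packages_dir())
```

Run it inside the 3DThinker-stage2 environment so the reported path points at that env.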

SFT

Follow LLaMA-Factory for environment setup; it is used for SFT training and weight merging.

conda create -n SFT python=3.10 -y && conda activate SFT
cd SFT/env
pip install -e ".[torch,metrics]" --no-build-isolation
  • Remark: You can refer to envs/requirements_sft.txt to configure the environment.

Data Generation

You should first download the data from here, which includes MindCube_train_raw_qa_qwen_sft.json and images. The idx.jsonl under the data folder is the data with idx. Then, following the two steps below, you will get data_output3d_begin_10k_resized.jsonl for training.

  1. VGGT feature extraction: Download the VGGT-1B weight and place it under models.
python preprocessing/feature/extract_vggt_feature.py

After doing this, you will get data/feature_vggt (VGGT features) and data/resized_images (images resized for training).
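The extracted per-image features are presumably cached to disk so stage-1 training can load them by id; a minimal sketch of such a save/load pattern (the directory layout and file naming are assumptions, not the script's actual output format):

```python
import tempfile
from pathlib import Path
import numpy as np

def cache_feature(out_dir: Path, image_id: str, feat: np.ndarray) -> Path:
    """Save one per-image feature map as a .npy file (assumed layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{image_id}.npy"
    np.save(path, feat)
    return path

with tempfile.TemporaryDirectory() as d:
    feat = np.zeros((16, 32), dtype=np.float32)  # placeholder feature map
    p = cache_feature(Path(d) / "feature_vggt", "img_0001", feat)
    loaded = np.load(p)  # round-trips losslessly
```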

  2. CoT data generation and filtering:
## produce chain-of-thought data
python preprocessing/produce_cot.py
## remove non-compliant data, e.g., w/o </output>
python preprocessing/clean.py
## filter useless data
python preprocessing/remove.py
## match VGGT indices
python preprocessing/jsonl_add_idx.py

After doing these, you will get data_output3d_begin_10k_resized.jsonl under the data folder.

  • Remark: example.jsonl is an example dataset for training.
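The cleaning and index-matching steps above can be sketched roughly as follows; the field names (`text`, `idx`) are assumptions about the JSONL schema, not the actual behavior of clean.py or jsonl_add_idx.py:

```python
import json

def clean_and_index(lines):
    """Keep records whose CoT closes with </output>, then attach a running idx."""
    kept = []
    for raw in lines:
        rec = json.loads(raw)
        if "</output>" not in rec.get("text", ""):
            continue                 # drop non-compliant samples
        rec["idx"] = len(kept)       # index used to match VGGT features (assumed)
        kept.append(rec)
    return kept

sample = [
    json.dumps({"text": "<think>...</think><output>A</output>"}),
    json.dumps({"text": "truncated, no closing tag"}),
]
result = clean_and_index(sample)  # only the first record survives
```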

Training

Once you finish data preprocessing, you can start training 3DThinker!

  1. Supervised Training: Prepare your base model under models (e.g., Qwen2.5-VL-3B).
conda activate 3DThinker-stage1
cd 3dthinker/stage1 && sh train.sh
  2. Reinforced Training
conda activate 3DThinker-stage2
cd 3dthinker/stage2 && bash run_scripts/train.sh
## merge the weight
conda activate SFT && llamafactory-cli export merge.yaml

Evaluation

You can follow the official code of MindCube for evaluation. Specifically, you need to organize the benchmark according to the requirements, and then run the following script.

cd eval
sh eval_mindcube.sh
sh get_result.sh

For other base models, you can use eval_xxx.py for evaluation.

Other supported benchmarks include Ego3D-Bench, VSI-Bench, SPBench, CV-Bench, SPAR-Bench, ViewSpatial-Bench, and MMSI-Bench.
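For multiple-choice benchmarks like MindCube, the final metric reduces to exact-match accuracy over answer choices; a hypothetical scorer for illustration (the actual get_result.sh may compute this differently):

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted choice equals the ground truth."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / max(len(answers), 1)

# two of three choices match, case-insensitively
acc = accuracy(["A", "c", "B"], ["A", "C", "D"])
```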

One Case

A case for training on MindCube_Train and testing on MindCube-Tiny is listed below:

  • Training data, which is for MindCube training.
  • Model, which is trained after stage 1 on Qwen2.5-VL-3B. Note that the model in Tab. 2 was trained on different training data.

Acknowledgements

The repo also benefits from Mirage, trl, transformers, VLM-R1, MindCube, and VGGT.

Thanks for their wonderful works.

Bibtex

If you find 3DThinker helpful for your work, please cite:

@article{chen2025think,
  title={Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views},
  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and Luo, Xufang and Sun, Mingze and Pan, Zihao and Feng, Yan and Pei, Peng and Cai, Xunliang and Huang, Ruqi},
  journal={arXiv preprint arXiv:2510.18632},
  year={2025}
}
