🤗 Hugging Face | 📑 Paper | ⚙️ Github | 🖥️ Home Page
Peiwen Sun
The official repo for SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
[2025.10.10] Our SFT codebase is released for preview.
[2025.10.10] A preview 100K subset of SpaceVista-1M is now available.
[2025.10.10] Our initial paper is now accessible.
- Dataset: Preview 100K subset of SpaceVista-1M
- SFT training: SFT code for SpaceVista
- Release the full SpaceVista-1M dataset
- Release the GRPO codebase and checkpoints
- Release the SpaceVista-Bench benchmark
Spatial reasoning is the ability to perceive, interpret, and act across spatial scales, from millimeter-sized components to distant aerial scenes. All-scale spatial reasoning is fundamental to next-generation intelligent systems and supports diverse applications: mm-scale sensing for advanced manufacturing, cm- and m-scale perception for embodied agents, 10 m-scale operation for autonomous driving, and 100 m-scale sensing for drones. Despite progress, existing work shows clear limitations in both model design and dataset coverage. Current scene perception research mostly targets indoor scenes, narrow object classes, and limited spatial ranges, and lacks training paradigms engineered for end-to-end, cross-scale reasoning. SpaceVista addresses this gap by presenting the first systematic optimization across both data and model dimensions to enable robust, full-scene spatial reasoning.
Development for this repo is done in Python 3.10.18.
This codebase is adapted from LLaMA-Factory, R1-V, VG-LLM, and Easy-R1. Sincere thanks to their engineers for the great work.
We use the lightweight uv-managed venv for the Python environment. (Do not mix in other tools such as conda.)
```bash
git clone
cd SpaceVista

# pip install uv
uv venv -p python3.10.18
source .venv/bin/activate
UV_HTTP_TIMEOUT=600 uv pip install -r requirements_sft.txt --no-deps -i http://mirrors.aliyun.com/pypi/simple/

# For flash_attn
MAX_JOBS=64 uv pip install flash_attn==2.7.1.post4 --no-build-isolation -i http://mirrors.aliyun.com/pypi/simple/

ln -s "$(pwd)/dependency/transformers" ".venv/lib/python3.10/site-packages/transformers"
```
Please refer to the Dataset section.
We provide the dataset in ShareGPT format, along with up to 32 extracted frames per video.
You may download the original MP4 videos from the source.
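As a rough illustration of the ShareGPT layout mentioned above: a record pairs conversation turns with media paths. The field names below follow the common ShareGPT convention and are an assumption for illustration, not the exact schema of the released SpaceVista files.

```python
# Illustrative sketch of a ShareGPT-style record with extracted frames.
# Field names are assumptions based on the common ShareGPT layout, not
# the exact schema of the released SpaceVista-1M files.
import json

record = {
    "conversations": [
        {"from": "human", "value": "<video>How far is the chair from the table?"},
        {"from": "gpt", "value": "The chair is roughly 1.2 meters from the table."},
    ],
    # Up to 32 extracted frames per video, as noted above.
    "videos": [[f"frames/scene_0001/{i:02d}.jpg" for i in range(32)]],
}

def validate(rec):
    """Minimal sanity check: human turn first, paired turns, frame cap of 32."""
    roles = [t["from"] for t in rec["conversations"]]
    assert roles[0] == "human" and len(roles) % 2 == 0
    assert all(len(frames) <= 32 for frames in rec["videos"])
    return True

print(validate(record))           # True
print(json.dumps(record)[:40])    # serializes cleanly for JSON metadata files
```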
The model will be released soon after the sensitivity check.
| Model | 🤗 HF | Detail |
|---|---|---|
| To Be Updated | To Be Updated | To Be Updated |
Before everything, we sincerely apologize that parts of our code are still hard-coded. We are actively working to make this repo easier to use.
First, flatten the dataset metadata. Note: this step may be simplified in the future.
```bash
cd dataset
python flatten.py -i your_path/meta.json -o your_path/meta_flatten.json
```
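As a rough mental model of this step (an assumption for illustration, not the actual `flatten.py`): a nested `meta.json` keyed by scene is turned into a flat list with one record per QA pair.

```python
import json

# Hypothetical sketch of a metadata flattening step. The nested layout
# below is an assumption for illustration; the real meta.json schema may
# differ -- refer to dataset/flatten.py for the actual transformation.
nested = {
    "scene_0001": {"video": "scene_0001.mp4",
                   "qa": [{"q": "How wide is the desk?", "a": "about 1.4 m"}]},
    "scene_0002": {"video": "scene_0002.mp4",
                   "qa": [{"q": "Which object is closer?", "a": "the mug"}]},
}

def flatten(meta):
    """Emit one flat record per (scene, QA pair)."""
    flat = []
    for scene_id, entry in meta.items():
        for qa in entry["qa"]:
            flat.append({"id": scene_id, "video": entry["video"],
                         "question": qa["q"], "answer": qa["a"]})
    return flat

flat = flatten(nested)
print(len(flat))                      # 2
print(json.dumps(flat[0], indent=2))  # one self-contained QA sample
```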
To prepare the pretrained models:
- Download the pretrained Qwen2.5VL-7B-instruct model and DINOv3.
- (Optional) Download the pretrained VGGT-1B model.
- Change the `dinov3/dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth` and `vggt/ckpt` paths in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py` to your own paths.
```bash
# source the same environment
cd sft

# (Optional checking) `training_load = True` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`
sed -i 's/self\.training_load = False/self\.training_load = True/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

llamafactory-cli train examples/train_full/qwen2_5_vl_spatial_full_sft_video_dinov3.yaml
```
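The `sed` one-liner simply rewrites the hard-coded `training_load` flag in place. An equivalent, easier-to-audit flip in Python (shown on an in-memory string for illustration; point it at the real site-packages file to apply it for real):

```python
import re

# Toggle the hard-coded training_load flag, as the sed one-liner does.
# Operates on an in-memory snippet here for illustration only.
snippet = "        self.training_load = False\n"

def set_training_load(text, value):
    """Rewrite `self.training_load = ...` to the given boolean literal."""
    return re.sub(r"self\.training_load = (True|False)",
                  f"self.training_load = {value}", text)

patched = set_training_load(snippet, True)
print(patched.strip())  # self.training_load = True
```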
Preliminary: if you train the model with an additional adapter for DINOv3, use a roughly trained SFT model as the pretrained base; otherwise, PEFT will only save the LoRA weights.
- Train each expert on the SFT model.
- (Optional checking) Ensure `training_load = False` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`.
```bash
# source the same environment
cd sft

sed -i 's/self\.training_load = True/self\.training_load = False/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_outdoor.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_table.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_tabletop.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_indoor.yaml
```
- Change the path of each expert in `sft/src/llamafactory/model/adapter.py` to the checkpoint saved in the step above.
- (Optional checking) Ensure `training_load = False` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`.
```bash
# source the same environment
cd sft
llamafactory-cli train examples/train_lora/qwen2_5_vl_spatial_full_sft_video_expert.yaml
```
To be updated
# to be updated
- Please be sure to use the provided venv.
- Change the benchmark paths in `DATASET_CONFIGS` to your own paths:
```python
DATASET_CONFIGS = {
    "vsibench": {
        "dataset_path": "./vsi-bench/test-00000-of-00001.parquet",
        "video_dir": "./vsi-bench",
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "mmsibench": {
        "dataset_path": "./MMSI_Bench.parquet",
        "video_dir": "",  # Not needed, as images are in the parquet file
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "spacevista": {
        "dataset_path": "./unified_qa.jsonl",  # will be released soon
        "video_dir": "./frames/all",  # will be released soon
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "sparbench": {
        "dataset_path": ["./SPAR-Bench/data/test-00000-of-00004.parquet",
                         "./SPAR-Bench/data/test-00001-of-00004.parquet",
                         "./SPAR-Bench/data/test-00002-of-00004.parquet",
                         "./SPAR-Bench/data/test-00003-of-00004.parquet"],
        "video_dir": "",
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "stibench": {
        "dataset_path": "./sti-bench/qa.parquet",
        "video_dir": "",  # Not needed, as images are in the parquet file
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
}
```
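One wrinkle in this table: `dataset_path` is a single file for most benchmarks but a list of parquet shards for sparbench, so a loader should normalize both cases. The helper below is our own hedged sketch of that lookup, not code from the repo:

```python
# Sketch of resolving an entry from a DATASET_CONFIGS-style table.
# resolve_dataset() is a hypothetical helper, not part of the repo; it
# only normalizes the single-path vs. shard-list cases shown above.
DATASET_CONFIGS = {
    "vsibench": {"dataset_path": "./vsi-bench/test-00000-of-00001.parquet",
                 "video_dir": "./vsi-bench"},
    "sparbench": {"dataset_path": ["./SPAR-Bench/data/test-00000-of-00004.parquet",
                                   "./SPAR-Bench/data/test-00001-of-00004.parquet"],
                  "video_dir": ""},
}

def resolve_dataset(name):
    if name not in DATASET_CONFIGS:
        raise KeyError(f"unknown dataset {name!r}; choose from {sorted(DATASET_CONFIGS)}")
    cfg = DATASET_CONFIGS[name]
    paths = cfg["dataset_path"]
    # Normalize: a single path becomes a one-element shard list.
    shards = paths if isinstance(paths, list) else [paths]
    return shards, cfg["video_dir"]

shards, video_dir = resolve_dataset("sparbench")
print(len(shards))  # 2 shards in this trimmed example
```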
- Use `eval_multi_model_mp.py` to evaluate a model on a chosen dataset. Example:
```bash
cd eval

sed -i 's/self\.training_load = True/self\.training_load = False/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

# source the same environment
# vsibench
python eval_multi_model_mp.py --model_path /path/to/model --dataset vsibench --output_dir ./eval_results --gpu_ids 0,1 --num_processes 4 --num_frames 32 --batch_size 1
# spacevista
python eval_multi_model_mp.py --model_path /path/to/model --dataset spacevista --output_dir ./eval_results --gpu_ids 0,1 --num_processes 4 --num_frames 32 --batch_size 1
```
Required:
- `--model_path`: checkpoint or folder
- `--dataset`: one of vsibench, mmsibench, spacevista, sparbench, stibench

Optional:
- `--output_dir`: results directory (default `./eval_results`)
- `--gpu_ids`: comma-separated GPU IDs
- `--num_processes`: parallel workers
- `--num_frames`: frames per video
- `--batch_size`: inference batch size
- `--debug`: enable a quick run
- `--debug_size`: samples used when debug is on
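The flag list above maps onto an argument parser roughly like the following. This is a reconstruction from the description for readers skimming the CLI surface; the defaults shown are assumptions, and the real parser in `eval_multi_model_mp.py` may differ:

```python
import argparse

# Reconstructed from the flag description above; defaults here are
# assumptions, and the repo's actual parser may differ.
def build_parser():
    p = argparse.ArgumentParser(description="Multi-process benchmark evaluation")
    p.add_argument("--model_path", required=True, help="checkpoint or folder")
    p.add_argument("--dataset", required=True,
                   choices=["vsibench", "mmsibench", "spacevista", "sparbench", "stibench"])
    p.add_argument("--output_dir", default="./eval_results")
    p.add_argument("--gpu_ids", default="0", help="comma-separated GPU IDs")
    p.add_argument("--num_processes", type=int, default=4)
    p.add_argument("--num_frames", type=int, default=32)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--debug", action="store_true", help="enable a quick run")
    p.add_argument("--debug_size", type=int, default=16)
    return p

args = build_parser().parse_args(
    ["--model_path", "/path/to/model", "--dataset", "vsibench", "--gpu_ids", "0,1"])
print(args.dataset, args.num_frames)  # vsibench 32
```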
If you find this repo useful, please cite our paper:
```bibtex
@article{sun2025spacevista,
  title={SpaceVista: All-Scale Visual Spatial Reasoning from mm to km},
  author={Sun, Peiwen and Lang, Shiqiang and Wu, Dongming and Ding, Yi and Feng, Kaituo and Liu, Huadai and Ye, Zhen and Liu, Rui and Liu, Yun-Hui and Wang, Jianan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2510.09606},
  year={2025}
}
```

