🤗 Hugging Face | 📑 Paper | ⚙️ Github | 🖥️ Home Page
Peiwen Sun
The official repo for SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
[2025.10.10] Our SFT codebase is released for preview.
[2025.10.10] A preview 100K subset of SpaceVista-1M is now available.
[2025.10.10] Our initial paper is now accessible.
- Dataset: Preview 100K subset of SpaceVista-1M
- SFT training: SFT code for SpaceVista
- Release the full SpaceVista-1M dataset
- Release the GRPO codebase and checkpoints
- Release the SpaceVista-Bench benchmark
Spatial reasoning is the ability to perceive, interpret, and act across spatial scales, from millimeter-sized components to distant aerial scenes. All-scale spatial reasoning is fundamental to next-generation intelligent systems and supports diverse applications: mm-scale sensing for advanced manufacturing, cm- and m-scale perception for embodied agents, 10 m-scale operation for autonomous driving, and 100 m-scale sensing for drones. Despite progress, existing work shows clear limitations in both model design and dataset coverage. Current scene perception research mostly targets indoor scenes, narrow object classes, and limited spatial ranges, and lacks training paradigms engineered for end-to-end, cross-scale reasoning. SpaceVista addresses this gap by presenting the first systematic optimization across both data and model dimensions to enable robust, full-scene spatial reasoning.
Development for this repo is done in Python 3.10.18.
This codebase is adapted from LLaMA-Factory, R1-V, VG-LLM, and Easy-R1. Sincere thanks to their engineers for the great work.
We use the lightweight uv-managed venv for the Python environment. (Do not mix in other tools such as conda.)
```bash
git clone
cd SpaceVista

# pip install uv
uv venv -p python3.10.18
source .venv/bin/activate
UV_HTTP_TIMEOUT=600 uv pip install -r requirements_sft.txt --no-deps -i http://mirrors.aliyun.com/pypi/simple/

# For flash_attn
MAX_JOBS=64 uv pip install flash_attn==2.7.1.post4 --no-build-isolation -i http://mirrors.aliyun.com/pypi/simple/

ln -s "$(pwd)/dependency/transformers" ".venv/lib/python3.10/site-packages/transformers"
```
Please refer to the Dataset section.
We provide the dataset in ShareGPT format, along with up to 32 extracted frames per video.
You may download the original MP4 videos from the source.
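As a rough illustration of the ShareGPT layout mentioned above: a record pairs conversation turns with media paths. The field names below follow the common ShareGPT convention and are an assumption for illustration, not the exact schema of the released SpaceVista files.

```python
# Illustrative sketch of a ShareGPT-style record with extracted frames.
# Field names are assumptions based on the common ShareGPT layout, not
# the exact schema of the released SpaceVista-1M files.
import json

record = {
    "conversations": [
        {"from": "human", "value": "<video>How far is the chair from the table?"},
        {"from": "gpt", "value": "The chair is roughly 1.2 meters from the table."},
    ],
    # Up to 32 extracted frames per video, as noted above.
    "videos": [[f"frames/scene_0001/{i:02d}.jpg" for i in range(32)]],
}

def validate(rec):
    """Minimal sanity check: human turn first, paired turns, frame cap of 32."""
    roles = [t["from"] for t in rec["conversations"]]
    assert roles[0] == "human" and len(roles) % 2 == 0
    assert all(len(frames) <= 32 for frames in rec["videos"])
    return True

print(validate(record))           # True
print(json.dumps(record)[:40])    # serializes cleanly for JSON metadata files
```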
The model will be released soon after the sensitivity check.
| Model | 🤗 HF | Detail |
|---|---|---|
| To Be Updated | To Be Updated | To Be Updated |
Before everything, we sincerely apologize that parts of our code are still hard-coded. We are actively working to make this repo easier to use.
First, flatten the dataset metadata. Note: this step may be simplified in the future.
```bash
cd dataset
python flatten.py -i your_path/meta.json -o your_path/meta_flatten.json
```
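As a rough mental model of this step (an assumption for illustration, not the actual `flatten.py`): a nested `meta.json` keyed by scene is turned into a flat list with one record per QA pair.

```python
import json

# Hypothetical sketch of a metadata flattening step. The nested layout
# below is an assumption for illustration; the real meta.json schema may
# differ -- refer to dataset/flatten.py for the actual transformation.
nested = {
    "scene_0001": {"video": "scene_0001.mp4",
                   "qa": [{"q": "How wide is the desk?", "a": "about 1.4 m"}]},
    "scene_0002": {"video": "scene_0002.mp4",
                   "qa": [{"q": "Which object is closer?", "a": "the mug"}]},
}

def flatten(meta):
    """Emit one flat record per (scene, QA pair)."""
    flat = []
    for scene_id, entry in meta.items():
        for qa in entry["qa"]:
            flat.append({"id": scene_id, "video": entry["video"],
                         "question": qa["q"], "answer": qa["a"]})
    return flat

flat = flatten(nested)
print(len(flat))                      # 2
print(json.dumps(flat[0], indent=2))  # one self-contained QA sample
```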
To prepare the pretrained models:
- Download the pretrained Qwen2.5VL-7B-instruct model and DINOv3.
- (Optional) Download the pretrained VGGT-1B model.
- Change the `dinov3/dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth` and `vggt/ckpt` paths in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py` to your own paths.
```bash
# source the same environment
cd sft

# (Optional checking) `training_load = True` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`
sed -i 's/self\.training_load = False/self\.training_load = True/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

llamafactory-cli train examples/train_full/qwen2_5_vl_spatial_full_sft_video_dinov3.yaml
```
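The `sed` one-liner simply rewrites the hard-coded `training_load` flag in place. An equivalent, easier-to-audit flip in Python (shown on an in-memory string for illustration; point it at the real site-packages file to apply it for real):

```python
import re

# Toggle the hard-coded training_load flag, as the sed one-liner does.
# Operates on an in-memory snippet here for illustration only.
snippet = "        self.training_load = False\n"

def set_training_load(text, value):
    """Rewrite `self.training_load = ...` to the given boolean literal."""
    return re.sub(r"self\.training_load = (True|False)",
                  f"self.training_load = {value}", text)

patched = set_training_load(snippet, True)
print(patched.strip())  # self.training_load = True
```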
Preliminary: if you train the model with an additional adapter for DINOv3, use a roughly trained SFT model as the pretrained base; otherwise, PEFT will only save the LoRA weights.
- Train each expert on the SFT model.
- (Optional checking) Ensure `training_load = False` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`.
```bash
# source the same environment
cd sft

sed -i 's/self\.training_load = True/self\.training_load = False/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_outdoor.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_table.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_tabletop.yaml
llamafactory-cli train examples/train_lora/qwen2_5vl_lora_sft_spacevista_cross_indoor.yaml
```
- Change the path of each expert in `sft/src/llamafactory/model/adapter.py` to the checkpoint saved in the step above.
- (Optional checking) Ensure `training_load = False` in `../dependency/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`.
```bash
# source the same environment
cd sft
llamafactory-cli train examples/train_lora/qwen2_5_vl_spatial_full_sft_video_expert.yaml
```
To be updated
# to be updated
- Please be sure to use the provided venv.
- Change the benchmark paths in `DATASET_CONFIGS` to your own paths:
```python
DATASET_CONFIGS = {
    "vsibench": {
        "dataset_path": "./vsi-bench/test-00000-of-00001.parquet",
        "video_dir": "./vsi-bench",
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "mmsibench": {
        "dataset_path": "./MMSI_Bench.parquet",
        "video_dir": "",  # Not needed, as images are in the parquet file
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "spacevista": {
        "dataset_path": "./unified_qa.jsonl",  # will be released soon
        "video_dir": "./frames/all",  # will be released soon
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "sparbench": {
        "dataset_path": ["./SPAR-Bench/data/test-00000-of-00004.parquet",
                         "./SPAR-Bench/data/test-00001-of-00004.parquet",
                         "./SPAR-Bench/data/test-00002-of-00004.parquet",
                         "./SPAR-Bench/data/test-00003-of-00004.parquet"],
        "video_dir": "",
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
    "stibench": {
        "dataset_path": "./sti-bench/qa.parquet",
        "video_dir": "",  # Not needed, as images are in the parquet file
        "evaluation_fn": ...,
        "metric_fn": ...,
    },
}
```
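One wrinkle in this table: `dataset_path` is a single file for most benchmarks but a list of parquet shards for sparbench, so a loader should normalize both cases. The helper below is our own hedged sketch of that lookup, not code from the repo:

```python
# Sketch of resolving an entry from a DATASET_CONFIGS-style table.
# resolve_dataset() is a hypothetical helper, not part of the repo; it
# only normalizes the single-path vs. shard-list cases shown above.
DATASET_CONFIGS = {
    "vsibench": {"dataset_path": "./vsi-bench/test-00000-of-00001.parquet",
                 "video_dir": "./vsi-bench"},
    "sparbench": {"dataset_path": ["./SPAR-Bench/data/test-00000-of-00004.parquet",
                                   "./SPAR-Bench/data/test-00001-of-00004.parquet"],
                  "video_dir": ""},
}

def resolve_dataset(name):
    if name not in DATASET_CONFIGS:
        raise KeyError(f"unknown dataset {name!r}; choose from {sorted(DATASET_CONFIGS)}")
    cfg = DATASET_CONFIGS[name]
    paths = cfg["dataset_path"]
    # Normalize: a single path becomes a one-element shard list.
    shards = paths if isinstance(paths, list) else [paths]
    return shards, cfg["video_dir"]

shards, video_dir = resolve_dataset("sparbench")
print(len(shards))  # 2 shards in this trimmed example
```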
- Use `eval_multi_model_mp.py` to evaluate a model on a chosen dataset. Example:
```bash
cd eval

sed -i 's/self\.training_load = True/self\.training_load = False/g' \
    "../.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py"

# source the same environment
# vsibench
python eval_multi_model_mp.py --model_path /path/to/model --dataset vsibench --output_dir ./eval_results --gpu_ids 0,1 --num_processes 4 --num_frames 32 --batch_size 1
# spacevista
python eval_multi_model_mp.py --model_path /path/to/model --dataset spacevista --output_dir ./eval_results --gpu_ids 0,1 --num_processes 4 --num_frames 32 --batch_size 1
```
Required:
- `--model_path`: checkpoint or folder
- `--dataset`: one of vsibench, mmsibench, spacevista, sparbench, stibench

Optional:
- `--output_dir`: results directory (default `./eval_results`)
- `--gpu_ids`: comma-separated GPU IDs
- `--num_processes`: parallel workers
- `--num_frames`: frames per video
- `--batch_size`: inference batch size
- `--debug`: enable a quick run
- `--debug_size`: samples used when debug is on
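The flag list above maps onto an argument parser roughly like the following. This is a reconstruction from the description for readers skimming the CLI surface; the defaults shown are assumptions, and the real parser in `eval_multi_model_mp.py` may differ:

```python
import argparse

# Reconstructed from the flag description above; defaults here are
# assumptions, and the repo's actual parser may differ.
def build_parser():
    p = argparse.ArgumentParser(description="Multi-process benchmark evaluation")
    p.add_argument("--model_path", required=True, help="checkpoint or folder")
    p.add_argument("--dataset", required=True,
                   choices=["vsibench", "mmsibench", "spacevista", "sparbench", "stibench"])
    p.add_argument("--output_dir", default="./eval_results")
    p.add_argument("--gpu_ids", default="0", help="comma-separated GPU IDs")
    p.add_argument("--num_processes", type=int, default=4)
    p.add_argument("--num_frames", type=int, default=32)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--debug", action="store_true", help="enable a quick run")
    p.add_argument("--debug_size", type=int, default=16)
    return p

args = build_parser().parse_args(
    ["--model_path", "/path/to/model", "--dataset", "vsibench", "--gpu_ids", "0,1"])
print(args.dataset, args.num_frames)  # vsibench 32
```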
If you find this repo useful, please cite our paper:
```bibtex
@article{sun2025spacevista,
  title={SpaceVista: All-Scale Visual Spatial Reasoning from mm to km},
  author={Sun, Peiwen and Lang, Shiqiang and Wu, Dongming and Ding, Yi and Feng, Kaituo and Liu, Huadai and Ye, Zhen and Liu, Rui and Liu, Yun-Hui and Wang, Jianan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2510.09606},
  year={2025}
}
```

