Shuai Yuan,1
Yantai Yang,1, 2
Xiaotian Yang,1
Xupeng Zhang,1
Zhonghao Zhao,1
Lingming Zhang,
Zhipeng Zhang1 ✉
1AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University
2Anyverse Dynamics
✉ Corresponding Author
InfiniteVGGT achieves higher reconstruction quality and more accurate camera pose estimation from input streams of thousands of frames.
- [Jan 6, 2026] Paper release.
- [Jan 6, 2026] Code release.
- [Jan 19, 2026] Long3D dataset release.
- Welcome to check out our previous collaborative work FastVGGT.
We propose InfiniteVGGT, a causal visual geometry transformer that utilizes a training-free rolling memory mechanism to enable stable, infinite-horizon streaming, and introduce the Long3D benchmark to rigorously evaluate long-term continuous 3D geometry performance. Our main contributions are summarized as follows:
- InfiniteVGGT, an unbounded-memory architecture for continuous 3D geometry understanding, built on a novel, dynamic, and interpretable explicit memory system.
- State-of-the-art performance on long-sequence benchmarks and a unique capability for robust, infinite-horizon reconstruction without memory overflow.
- The Long3D benchmark, a new dataset for the rigorous evaluation of long-term performance, addressing a critical gap in the field.
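The training-free rolling memory can be pictured as a fixed-capacity token buffer that evicts the oldest entries as new frames arrive, so memory use stays bounded no matter how long the stream runs. The sketch below is purely illustrative: the capacity, token representation, and FIFO eviction policy here are our assumptions, not the paper's actual mechanism.

```python
from collections import deque

class RollingMemory:
    """Illustrative fixed-capacity memory: once capacity is reached,
    the oldest frame's tokens are evicted, so memory never grows
    unbounded over an endless stream (assumed FIFO policy)."""

    def __init__(self, capacity: int):
        # capacity = max number of frames whose tokens are retained
        self.buffer = deque(maxlen=capacity)

    def update(self, frame_tokens):
        # Appending beyond capacity silently drops the oldest entry
        self.buffer.append(frame_tokens)

    def context(self):
        # Tokens available to attend to when processing the next frame
        return list(self.buffer)

mem = RollingMemory(capacity=3)
for t in range(5):
    mem.update(f"tokens_frame_{t}")
print(mem.context())  # only the 3 most recent frames survive
```

The key property is that per-frame cost and memory stay constant with stream length, which is what makes infinite-horizon streaming feasible.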
- Clone InfiniteVGGT

```shell
git clone https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT.git
cd InfiniteVGGT
```

- Create conda environment

```shell
conda create -n infinitevggt python=3.11 cmake=3.14.0
conda activate infinitevggt
```

- Install requirements

```shell
pip install -r requirements.txt
conda install 'llvm-openmp<16'
```

- Download the StreamVGGT pretrained checkpoint and place it in the ./ckpt directory.
```shell
# Run on your own data
python run_inference.py --input_dir path/to/your/images_dir

# Run a long sequence and store per-frame results to a directory
python run_inference.py \
    --input_dir path/to/your/images_dir \
    --frame_cache_dir path/to/your/results_perframe_dir \
    --no_cache_results
```

We provide demo code based on the NRGBD dataset. You can run it using the following command:
```shell
python demo_viser.py \
    --seq_path path/to/nrgbd/image_sequence \
    --frame_interval 10 \
    --gt_path path/to/nrgbd/gt_camera  # optional
```

The Long3D Dataset is a benchmark designed for long-sequence 3D scene reconstruction. It provides 10 Hz image streams paired with dense ground-truth point clouds.
| File Name | Description |
|---|---|
| `image.7z` | Continuous image stream captured at a frequency of 10 Hz. |
| `dense_cloud_map.pcd` | Global ground-truth point cloud, acquired via a 3D spatial scanner. |
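Because the image stream is captured at a fixed 10 Hz, frame indices map directly to timestamps, which is convenient when aligning frames against the ground-truth map or reporting sequence lengths in seconds. A minimal helper (the function names are ours, not part of the dataset tooling):

```python
FRAME_RATE_HZ = 10.0  # Long3D image streams are captured at 10 Hz

def frame_timestamp(frame_index: int) -> float:
    """Timestamp in seconds of a frame within the stream (frame 0 = t0)."""
    return frame_index / FRAME_RATE_HZ

def sequence_duration(num_frames: int) -> float:
    """Wall-clock duration covered by a sequence of num_frames images."""
    return (num_frames - 1) / FRAME_RATE_HZ if num_frames > 0 else 0.0

print(frame_timestamp(150))     # frame 150 occurs 15.0 s into the stream
print(sequence_duration(6000))  # a 6000-frame sequence spans 599.9 s
```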
The most efficient way to download the dataset is with the huggingface-hub CLI. Ensure the library is installed (`pip install -U huggingface_hub`).
```shell
# export HF_ENDPOINT=https://hf-mirror.com
hf download --repo-type dataset \
    --resume-download AutoLab-SJTU/Long3D \
    --local-dir ./Long3D
```
Alternatively, you can browse and download files directly from the Long3D dataset.
- [x] Release the dataset.
We would like to acknowledge the following open-source projects that served as a foundation for our implementation:
DUSt3R CUT3R VGGT Point3R StreamVGGT FastVGGT TTT3R
Many thanks to these authors!
If you incorporate our work into your research, please cite:
@misc{yuan2026infinitevggt,
title={InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams},
author={Shuai Yuan and Yantai Yang and Xiaotian Yang and Xupeng Zhang and Zhonghao Zhao and Lingming Zhang and Zhipeng Zhang},
journal={arXiv preprint arXiv:2601.02281},
year={2026}
}


