
Thinking with Visual Primitives

English | 简体中文

📜 License | 📖 Citation

Important

This repository was originally obtained from a source repository previously associated with charlesCXK; that repository is currently unavailable.

The original upstream/fork relationship is no longer reliably preserved. This repository should be treated as a community mirror/archive rather than an authoritative source.

There is currently no known official replacement repository for this project. Please watch for future updates or any official re-release.

News

2026.04.30: We have released the technical report detailing our approach. In the near future, we plan to make our in-house benchmarks and a subset of our cold-start data publicly available. The model weights will be integrated into our foundation model and released at a later date.

1. Introduction

While recent Multimodal Large Language Models (MLLMs) have made strides in bridging the "Perception Gap" (e.g., through high-resolution cropping or thinking with images), they still struggle with complex structural reasoning. We identify this bottleneck as the Reference Gap: natural language is simply too ambiguous to point precisely to dense spatial layouts, which often leads to logical collapse and hallucinations in the thinking process.

This project introduces a paradigm shift. Instead of just "seeing more clearly", our model learns to "point while it reasons." By interleaving spatial markers (points and bounding boxes) directly into the reasoning trajectory as minimal units of thought, we anchor abstract linguistic concepts to concrete physical coordinates.


Demos: Grounded Task Reasoning | Topological Reasoning

Key Highlights

  • Point-to-Reason Synergy: Mimicking human cognitive behavior (like using a finger to count or trace a maze), our framework elevates visual primitives to minimal units of thought, effectively solving the Reference Gap in complex structural reasoning.
  • Extreme Visual Token Efficiency: Built upon the architecture of DeepSeek-V4-Flash, we compress the KV cache of every 4 visual tokens into a single entry (see the sketch after this list), drastically reducing image-token consumption while maintaining cognitive depth.
  • Frontier-Competitive Performance: Despite a compact model scale and a significantly lower image-token budget, our model matches frontier models like GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash across challenging counting and spatial reasoning benchmarks. (We note that the reported scores cover only a subset of evaluation dimensions that are directly relevant to the research focus of this paper, and are therefore not indicative of the models' overall capabilities.)

2. License

This code repository is licensed under the MIT License.

3. Citation

@article{lu2026think,
  title={Thinking with Visual Primitives},
  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and Hao, Zhewen and Gao, Xi and Nie, Shaoheng and Wei, Yixuan and Xie, Zhenda and Chen, Ting and Zeng, Gang},
  year={2026}
}

4. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
