
Thinking with Visual Primitives

English | 简体中文

📜 License | 📖 Citation

Important

This repository was originally obtained from a source repository previously associated with charlesCXK; that repository is currently unavailable.

The original upstream/fork relationship is no longer reliably preserved. This repository should be treated as a community mirror/archive rather than an authoritative source.

There is currently no known official replacement repository for this project. Please watch for future updates or any official re-release.

News

2026.04.30: We have released the technical report detailing our approach. In the near future, we plan to make our in-house benchmarks and a subset of our cold-start data publicly available. The model weights will be integrated into our foundation model and released at a later date.

1. Introduction

While recent Multimodal Large Language Models (MLLMs) have made strides in bridging the "Perception Gap" (e.g., through high-resolution cropping or thinking with images), they still struggle with complex structural reasoning. We identify this bottleneck as the Reference Gap: natural language is simply too ambiguous to point precisely to dense spatial layouts, which often leads to logical collapse and hallucinations in the thinking process.

This project introduces a paradigm shift. Instead of just "seeing more clearly", our model learns to "point while it reasons." By interleaving spatial markers (points and bounding boxes) directly into the reasoning trajectory as minimal units of thought, we anchor abstract linguistic concepts to concrete physical coordinates.


Demos: Grounded Task Reasoning | Topological Reasoning

Key Highlights

  • Point-to-Reason Synergy: Mimicking human cognitive behavior (like using a finger to count or trace a maze), our framework elevates visual primitives to minimal units of thought, effectively solving the Reference Gap in complex structural reasoning.
  • Extreme Visual Token Efficiency: Built upon the architecture of DeepSeek-V4-Flash, we compress the KV cache of every 4 visual tokens into a single entry (see the sketch after this list), drastically reducing image-token consumption while maintaining cognitive depth.
  • Frontier-Competitive Performance: Despite a compact model scale and a significantly lower image-token budget, our model matches frontier models like GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash across challenging counting and spatial reasoning benchmarks. (We note that the reported scores cover only a subset of evaluation dimensions that are directly relevant to the research focus of this paper, and are therefore not indicative of the models' overall capabilities.)

2. License

This code repository is licensed under the MIT License.

3. Citation

@article{lu2026think,
  title={Thinking with Visual Primitives},
  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and Hao, Zhewen and Gao, Xi and Nie, Shaoheng and Wei, Yixuan and Xie, Zhenda and Chen, Ting and Zeng, Gang},
  year={2026}
}

4. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
