DeepSeek released 'Thinking-with-Visual-Primitives' framework

· r/LocalLLaMA ·

DeepSeek, Peking University, and Tsinghua University release 'Thinking with Visual Primitives,' a multimodal reasoning framework that elevates spatial tokens—coordinates and bounding boxes—into minimal units of thought, enabling models to 'point' within images during chain-of-thought reasoning.

Categories: OSS & Tools, Research

Excerpt

DeepSeek, in collaboration with Peking University and Tsinghua University, has released the paper "Thinking with Visual Primitives" along with its open-source repository, introducing a new multimodal reasoning framework. The core approach is to elevate spatial tokens—specifically coordinate points and bounding boxes—into the "minimal units of thought" within the model's chain-of-thought. These tokens are interleaved directly into the reasoning process, enabling the model to "point" to specific locations within an image while it "thinks."

[https://github.com/deepseek-ai/Thinking-with-Visual-Primitives](https://github.com/deepseek-ai/Thinking-with-Visual-Primitives)

**notice: deepseek removed the repo**
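To make the interleaving idea concrete, here is a minimal sketch of how a consumer might pull spatial primitives back out of such a chain-of-thought. The post does not show the actual token markup, so the `<point>x,y</point>` and `<box>x1,y1,x2,y2</box>` formats below are hypothetical stand-ins for whatever encoding the framework uses:

```python
import re

# Hypothetical markup: the real DeepSeek token format is not shown in the post,
# so <point>x,y</point> and <box>x1,y1,x2,y2</box> are illustrative stand-ins.
POINT_RE = re.compile(r"<point>(\d+),(\d+)</point>")
BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_primitives(cot: str):
    """Pull coordinate points and bounding boxes out of an
    interleaved chain-of-thought string."""
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(cot)]
    boxes = [tuple(int(v) for v in m) for m in BOX_RE.findall(cot)]
    return points, boxes

# Example: spatial tokens woven between ordinary reasoning text.
cot = ("The key lies on the table <box>120,80,260,140</box>; "
       "its tip points toward <point>250,135</point>.")
points, boxes = extract_primitives(cot)
```

Here `points` holds `[(250, 135)]` and `boxes` holds `[(120, 80, 260, 140)]`, which a downstream renderer could overlay on the image to visualize where the model "pointed" at each reasoning step.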

Discussions