Commuting? Listen to our paper's key insights in just minutes!
We study how to leverage system-2 reasoning to solve perception tasks, the kind of tasks VLMs typically handle with fast, shallow (system-1) thinking. Here's what we found:
We introduce LongPerceptualThoughts, a synthetic dataset of 30K long chain-of-thought (CoT) traces for vision tasks. We synthesize these long CoTs with a novel three-stage data synthesis framework that incorporates key cognitive behaviors such as backtracking, verification, and subgoal setting.
Overall, fine-tuning on LongPerceptualThoughts improves Qwen2.5-VL-7B on 5 vision-centric benchmarks by an average of 3.4 points. In particular, V* Bench improves by 11.8 points. We even find that the fine-tuned Qwen2.5-VL improves on out-of-distribution MMLU-Pro by more than 2 points!
Despite the rapid progress in reasoning-focused models like OpenAI’s o1 and DeepSeek’s R1, most advances have focused on math and code domains. But what about perception -- tasks like object recognition, scene understanding, or spatial reasoning, where vision-language models (VLMs) typically rely on fast, shallow reasoning?
This work asks: Can we equip VLMs with system-2 reasoning to improve performance on vision-centric tasks? Our answer is yes, and with only 500 images!
Method | Avg | CV-Bench | V* Bench | MMVP | MMStar-V | MME-RW-V |
---|---|---|---|---|---|---|
Qwen2.5-VL-7B-Instruct | 58.47 | 74.74 | 48.51 | 73.67 | 63.73 | 31.68 |
+ CoT | 59.18 | 75.42 | 55.08 | 70.60 | 62.40 | 32.40 |
+ LongPerceptualThoughts (SFT) | 59.90 | 76.05 | 60.53 | 70.00 | 60.67 | 32.25 |
+ LongPerceptualThoughts (SFT + DPO) | 61.87 | 76.61 | 60.31 | 75.00 | 64.00 | 33.45 |
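
The headline gains can be re-derived from the table itself. The snippet below is a minimal sketch, not part of the released code: the scores are copied from the rows above, and the names `BENCHMARKS`, `SCORES`, `average`, and `gains` are hypothetical helpers introduced only for this illustration.

```python
# Per-benchmark scores copied from the results table (column order below).
# All names here are illustrative assumptions, not part of any released code.
BENCHMARKS = ["CV-Bench", "V* Bench", "MMVP", "MMStar-V", "MME-RW-V"]
SCORES = {
    "Qwen2.5-VL-7B-Instruct": [74.74, 48.51, 73.67, 63.73, 31.68],
    "+ CoT": [75.42, 55.08, 70.60, 62.40, 32.40],
    "+ LongPerceptualThoughts (SFT)": [76.05, 60.53, 70.00, 60.67, 32.25],
    "+ LongPerceptualThoughts (SFT + DPO)": [76.61, 60.31, 75.00, 64.00, 33.45],
}


def average(scores):
    """Unweighted mean over the five benchmarks."""
    return sum(scores) / len(scores)


def gains(baseline, method):
    """Per-benchmark difference (method minus baseline), rounded to 2 decimals."""
    return {
        name: round(m - b, 2)
        for name, b, m in zip(BENCHMARKS, baseline, method)
    }


base = SCORES["Qwen2.5-VL-7B-Instruct"]
best = SCORES["+ LongPerceptualThoughts (SFT + DPO)"]

avg_gain = round(average(best) - average(base), 2)
per_benchmark = gains(base, best)

print(f"Average gain: {avg_gain} points")
for name, delta in per_benchmark.items():
    print(f"{name}: {delta:+.2f}")
```

Running it should report an average gain of roughly 3.4 points and a V* Bench gain of 11.8 points for the SFT + DPO row over the base model, matching the numbers quoted earlier. Swapping `best` for another row shows where a method trails the baseline on individual benchmarks (for example, SFT alone on MMVP and MMStar-V).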
We propose a scalable, three-stage framework to synthesize long CoTs from images and dense captions:
Richer thinking leads to better seeing—without needing more data or bigger models.
@misc{liao2025longperceptualthoughtsdistillingsystem2reasoning,
title={LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception},
author={Yuan-Hong Liao and Sven Elflein and Liu He and Laura Leal-Taixé and Yejin Choi and Sanja Fidler and David Acuna},
year={2025},
eprint={2504.15362},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.15362},
}