LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

University of Toronto, Vector Institute · NVIDIA · Purdue University


🔍 Project summary:

We study how to leverage system-2 reasoning to solve perception tasks, the kind of tasks VLMs typically handle with fast, shallow (system-1) thinking. Here's what we found:

  • With LongPerceptualThoughts, VLMs can perform system-2 reasoning on perceptual tasks after simple supervised fine-tuning.
  • Surprisingly, despite being tuned only on visual tasks, these fine-tuned VLMs generalize well to challenging text-only reasoning benchmarks like MMLU-Pro!
🎯 So, what’s the secret sauce here?

We introduce LongPerceptualThoughts, a synthetic dataset of 30K long chain-of-thought (CoT) traces for vision tasks. We synthesize these long CoTs with a novel three-stage data synthesis framework that effectively incorporates key cognitive behaviors, such as backtracking, verification, and subgoal setting, into the long CoTs.

Overall, by fine-tuning on LongPerceptualThoughts, we improve Qwen2.5-VL-7B on 5 vision-centric benchmarks by an average of 3.4 points. In particular, we improve V* Bench by 11.8 points. We even find that the fine-tuned Qwen2.5-VL improves on the out-of-distribution MMLU-Pro benchmark by more than 2 points!

Introduction

Despite the rapid progress in reasoning-focused models like OpenAI’s o1 and DeepSeek’s R1, most advances have focused on math and code domains. But what about perception -- tasks like object recognition, scene understanding, or spatial reasoning, where vision-language models (VLMs) typically rely on fast, shallow reasoning?

This work asks: Can we equip VLMs with system-2 reasoning to improve performance on vision-centric tasks? Our answer is yes, and it takes only 500 images!

| Method | Avg | CV-Bench | V* Bench | MMVP | MMStar-V | MME-RW-V |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 58.47 | 74.74 | 48.51 | 73.67 | 63.73 | 31.68 |
| + CoT | 59.18 | 75.42 | 55.08 | 70.60 | 62.40 | 32.40 |
| + LongPerceptualThoughts (SFT) | 59.90 | 76.05 | 60.53 | 70.00 | 60.67 | 32.25 |
| + LongPerceptualThoughts (SFT + DPO) | 61.87 | 76.61 | 60.31 | 75.00 | 64.00 | 33.45 |

Comparison with other multimodal reasoning datasets.

| Method | Avg | CV-Bench | V* Bench | MMVP | MMStar-V | MME-RW-V |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 58.47 | 74.74 | 48.51 | 73.67 | 63.73 | 31.68 |
| + CoT | 59.18 | 75.42 | 55.08 | 70.60 | 62.40 | 32.40 |
| + VLAA-Thinking (Chen et al., 2025) | 42.32 | 68.50 | 53.53 | 66.67 | 0.53 | 22.38 |
| + Virgo (Du et al., 2025) | 50.87 | 67.22 | 44.14 | 57.67 | 57.60 | 27.71 |
| + LongPerceptualThoughts (SFT) | 59.90 | 76.05 | 60.53 | 70.00 | 60.67 | 32.25 |
| + LongPerceptualThoughts (SFT + DPO) | 61.87 | 76.61 | 60.31 | 75.00 | 64.00 | 33.45 |

Data Synthesis Framework: Ask, Think, and Think Harder

We propose a scalable, three-stage framework to synthesize long CoTs from images and dense captions (a minimal code sketch follows the stage descriptions below):

Figure: overview of the three-stage data synthesis pipeline.

  • Stage 1 – Ask using an LLM: We start by converting dense image descriptions into multiple-choice questions using GPT-4o-mini. These questions are designed to be grounded, diverse, and verifiable, ensuring the model has a clear visual task to reason about.
  • Stage 2 – Think like a VLM: Next, we use Qwen2.5-VL-7B-Instruct to generate a simple chain-of-thought (CoT). These are short, direct reasoning steps well within the VLM’s comfort zone, providing a natural foundation to build deeper thoughts from.
  • Stage 3 – Think harder like a Reasoning VLM: Finally, we hand over the image, question, and simple CoT to a powerful reasoning LLM (e.g., DeepSeek-R1-Distill). With a gentle nudge, like inserting "Wait," before the reasoning begins, we prompt the model to reflect, verify, and revise its thinking. The result? A rich, multi-step long CoT that mimics system-2 behavior: setting subgoals, checking assumptions, and even backtracking if needed.
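
To make the three stages concrete, here is a minimal Python sketch of the pipeline. The helpers call_llm, call_vlm, and call_reasoning_llm are hypothetical placeholders for whatever inference backend you use, and the prompt wording (including how the "Wait," nudge is appended) is our own simplification of the description above, not the paper's exact prompts.

```python
# Minimal sketch of the three-stage synthesis pipeline described above.
# call_llm / call_vlm / call_reasoning_llm are hypothetical placeholders for
# your inference backend; they are not APIs from the paper.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a text LLM (e.g., GPT-4o-mini) here")

def call_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in a VLM (e.g., Qwen2.5-VL-7B-Instruct) here")

def call_reasoning_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a reasoning LLM (e.g., an R1 distill) here")

def synthesize_long_cot(image_path: str, dense_caption: str) -> dict:
    # Stage 1 - Ask: turn the dense caption into a grounded, verifiable
    # multiple-choice question.
    question = call_llm(
        "Write a verifiable multiple-choice question (options + correct answer) "
        f"grounded in this image description:\n{dense_caption}"
    )

    # Stage 2 - Think like a VLM: short, direct CoT from the instruct VLM.
    simple_cot = call_vlm(image_path, f"{question}\nThink step by step, then answer.")

    # Stage 3 - Think harder: pass the context and simple CoT to a reasoning
    # model, nudging it with "Wait," so it reflects, verifies, and backtracks.
    # (Giving the reasoning model the caption instead of raw pixels is a
    # simplification of the description above.)
    long_cot = call_reasoning_llm(
        f"Image description:\n{dense_caption}\n\nQuestion:\n{question}\n\n"
        f"Initial reasoning:\n{simple_cot}\n\nWait,"
    )

    return {"question": question, "simple_cot": simple_cot, "long_cot": long_cot}
```

Sampling multiple questions per caption and multiple CoTs per question is one plausible way to reach the 30K-trace scale from only 500 images.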

Analysis: More Thoughtful Chains, Not Just Longer Ones

Figures: (left) token-length distribution of CoTs; (right) frequency of cognitive behaviors in CoTs.
  • CoT Lengths (left figure): CoTs in LongPerceptualThoughts are longer on average than the CoTs produced by popular VLMs.
  • Cognitive Behaviors (right figure): CoTs in LongPerceptualThoughts also exhibit more diverse structure. Unlike the flat responses of baseline VLMs, our CoTs contain intermediate steps and self-reflection: they show significantly higher rates of subgoal setting, backtracking, and verification than CoTs from VLMs such as Qwen2.5-VL, and are closer in character to those generated by Gemini Flash Thinking. These structured behaviors are key indicators of system-2 reasoning and are rarely observed in existing vision-language CoT datasets. (A toy behavior-tagging sketch follows this list.)
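
As a rough illustration of the kind of analysis behind the right-hand figure, the snippet below tallies how often a few reasoning markers show up in a set of CoTs. The marker lists and the keyword heuristic are our own simplification; the actual behavior classification may well rely on a stronger (e.g., LLM-based) judge.

```python
import re
from collections import Counter

# Toy keyword heuristic for spotting system-2 style behaviors in CoT text.
# The marker lists below are illustrative assumptions, not the paper's taxonomy.
BEHAVIOR_MARKERS = {
    "verification": [r"\blet me (check|verify)\b", r"\bdouble[- ]check\b"],
    "backtracking": [r"\bwait\b", r"\bactually\b", r"\bon second thought\b"],
    "subgoal_setting": [r"\bfirst\b", r"\bnext\b", r"\bstep \d\b"],
}

def count_behaviors(cots: list[str]) -> Counter:
    """Count how many CoTs exhibit each behavior at least once."""
    counts = Counter()
    for cot in cots:
        text = cot.lower()
        for behavior, patterns in BEHAVIOR_MARKERS.items():
            if any(re.search(p, text) for p in patterns):
                counts[behavior] += 1
    return counts

if __name__ == "__main__":
    demo = [
        "First, locate the sign. Wait, let me check the left side again.",
        "The object is red, so the answer is B.",
    ]
    print(count_behaviors(demo))  # the first CoT triggers all three behaviors
```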

Takeaways

  1. 🧠 Synthesizing long, rich reasoning data powers perception.
    With just 500 images, we generate 30K chain-of-thought traces that teach vision-language models to reflect, verify, and revise. These cognitive patterns—rare in standard datasets—boost visual reasoning across benchmarks.
  2. 🥷 Just SFT. Nothing fancy.
    No new architectures. No reinforcement learning from scratch. Just supervised fine-tuning on LongPerceptualThoughts leads to strong gains on vision-centric tasks. More surprisingly, it even improves performance on a text-only reasoning benchmark (MMLU-Pro). See more details in our paper! (A sketch of what such a training record might look like is shown below.)
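
For context on what "just SFT" means in practice, here is a minimal sketch of turning one synthesized example into a chat-style training record. The message schema and the <think>…</think> wrapping are illustrative assumptions, not necessarily the exact target format used in the paper.

```python
import json

def to_sft_record(example: dict) -> dict:
    """Chat-style SFT record: image + question in, long CoT + answer out.

    The message schema and <think> wrapping are illustrative assumptions,
    not necessarily the exact format used in the paper.
    """
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": example["image_path"]},
                {"type": "text", "text": example["question"]},
            ]},
            {"role": "assistant",
             "content": f"<think>{example['long_cot']}</think> {example['answer']}"},
        ]
    }

if __name__ == "__main__":
    example = {
        "image_path": "images/000001.jpg",  # hypothetical path
        "question": "What color is the traffic light? (A) red (B) green",
        "long_cot": "The light appears green. Wait, let me verify the top lamp "
                    "is off before answering... it is, so the light is green.",
        "answer": "(B)",
    }
    print(json.dumps(to_sft_record(example), indent=2))
```

Preference pairs for the DPO stage reported in the tables can be built from the same synthesized data; we leave the exact pairing scheme to the paper.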

Richer thinking leads to better seeing—without needing more data or bigger models.

BibTeX

@misc{liao2025longperceptualthoughtsdistillingsystem2reasoning,
      title={LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception}, 
      author={Yuan-Hong Liao and Sven Elflein and Liu He and Laura Leal-Taixé and Yejin Choi and Sanja Fidler and David Acuna},
      year={2025},
      eprint={2504.15362},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.15362}, 
}