LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

University of Toronto, Vector Institute · NVIDIA · Purdue University


🔍 Project summary:

We study how to leverage system-2 reasoning to solve perception tasks, the kind of tasks VLMs typically handle with fast, shallow (system-1) thinking. Here's what we found:

  • With LongPerceptualThoughts, VLMs can perform system-2 reasoning on perceptual tasks after simple supervised fine-tuning.
  • Surprisingly, despite being tuned only on visual tasks, these fine-tuned VLMs generalize well to challenging text-only reasoning benchmarks like MMLU-Pro!
🎯 So, what’s the secret sauce here?

We introduce LongPerceptualThoughts, a synthetic dataset of 30K long chain-of-thought (CoT) traces for vision tasks. We synthesize these long CoTs with a novel three-stage data synthesis framework that effectively incorporates key cognitive behaviors, such as backtracking, verification, and subgoal setting, into the long CoTs.

Overall, by fine-tuning on LongPerceptualThoughts, we improve Qwen2.5-VL-7B on 5 vision-centric benchmarks by an average of 3.4 points. In particular, we improve V* Bench by 11.8 points. We even find that the fine-tuned Qwen2.5-VL improves on the out-of-distribution MMLU-Pro benchmark by more than 2 points!

Introduction

Despite the rapid progress in reasoning-focused models like OpenAI’s o1 and DeepSeek’s R1, most advances have focused on math and code domains. But what about perception -- tasks like object recognition, scene understanding, or spatial reasoning, where vision-language models (VLMs) typically rely on fast, shallow reasoning?

This work asks: Can we equip VLMs with system-2 reasoning to improve performance on vision-centric tasks? Our answer is yes, and it takes only 500 images!

| Method | Avg | CV-Bench | V* Bench | MMVP | MMStar-V | MME-RW-V |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 58.47 | 74.74 | 48.51 | 73.67 | 63.73 | 31.68 |
| + CoT | 59.18 | 75.42 | 55.08 | 70.60 | 62.40 | 32.40 |
| + LongPerceptualThoughts (SFT) | 59.90 | 76.05 | 60.53 | 70.00 | 60.67 | 32.25 |
| + LongPerceptualThoughts (SFT + DPO) | 61.87 | 76.61 | 60.31 | 75.00 | 64.00 | 33.45 |

Comparison with other multimodal reasoning datasets.

| Method | Avg | CV-Bench | V* Bench | MMVP | MMStar-V | MME-RW-V |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 58.47 | 74.74 | 48.51 | 73.67 | 63.73 | 31.68 |
| + CoT | 59.18 | 75.42 | 55.08 | 70.60 | 62.40 | 32.40 |
| + VLAA-Thinking (Chen et al., 2025) | 42.32 | 68.50 | 53.53 | 66.67 | 0.53 | 22.38 |
| + Virgo (Du et al., 2025) | 50.87 | 67.22 | 44.14 | 57.67 | 57.60 | 27.71 |
| + LongPerceptualThoughts (SFT) | 59.90 | 76.05 | 60.53 | 70.00 | 60.67 | 32.25 |
| + LongPerceptualThoughts (SFT + DPO) | 61.87 | 76.61 | 60.31 | 75.00 | 64.00 | 33.45 |

Data Synthesis Framework: Ask, Think, and Think Harder

We propose a scalable, three-stage framework to synthesize long CoTs from images and dense captions (a minimal code sketch follows the stage descriptions below):

Figure: overview of the three-stage data synthesis pipeline.

  • Stage 1 – Ask using an LLM: We start by converting dense image descriptions into multiple-choice questions using GPT-4o-mini. These questions are designed to be grounded, diverse, and verifiable, ensuring the model has a clear visual task to reason about.
  • Stage 2 – Think like a VLM: Next, we use Qwen2.5-VL-7B-Instruct to generate a simple chain-of-thought (CoT). These are short, direct reasoning steps well within the VLM’s comfort zone, providing a natural foundation to build deeper thoughts from.
  • Stage 3 – Think harder like a Reasoning VLM: Finally, we hand over the image, question, and simple CoT to a powerful reasoning LLM (e.g., DeepSeek-R1-Distill). With a gentle nudge, like inserting "Wait," before the reasoning begins, we prompt the model to reflect, verify, and revise its thinking. The result? A rich, multi-step long CoT that mimics system-2 behavior: setting subgoals, checking assumptions, and even backtracking if needed.
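
To make the three stages concrete, here is a minimal Python sketch of the pipeline. The helpers call_llm, call_vlm, and call_reasoning_llm are hypothetical placeholders for whatever inference backend you use, and the prompt wording (including how the "Wait," nudge is appended) is our own simplification of the description above, not the paper's exact prompts.

```python
# Minimal sketch of the three-stage synthesis pipeline described above.
# call_llm / call_vlm / call_reasoning_llm are hypothetical placeholders for
# your inference backend; they are not APIs from the paper.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a text LLM (e.g., GPT-4o-mini) here")

def call_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in a VLM (e.g., Qwen2.5-VL-7B-Instruct) here")

def call_reasoning_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a reasoning LLM (e.g., an R1 distill) here")

def synthesize_long_cot(image_path: str, dense_caption: str) -> dict:
    # Stage 1 - Ask: turn the dense caption into a grounded, verifiable
    # multiple-choice question.
    question = call_llm(
        "Write a verifiable multiple-choice question (options + correct answer) "
        f"grounded in this image description:\n{dense_caption}"
    )

    # Stage 2 - Think like a VLM: short, direct CoT from the instruct VLM.
    simple_cot = call_vlm(image_path, f"{question}\nThink step by step, then answer.")

    # Stage 3 - Think harder: pass the context and simple CoT to a reasoning
    # model, nudging it with "Wait," so it reflects, verifies, and backtracks.
    # (Giving the reasoning model the caption instead of raw pixels is a
    # simplification of the description above.)
    long_cot = call_reasoning_llm(
        f"Image description:\n{dense_caption}\n\nQuestion:\n{question}\n\n"
        f"Initial reasoning:\n{simple_cot}\n\nWait,"
    )

    return {"question": question, "simple_cot": simple_cot, "long_cot": long_cot}
```

Sampling multiple questions per caption and multiple CoTs per question is one plausible way to reach the 30K-trace scale from only 500 images.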

Analysis: More Thoughtful Chains, Not Just Longer Ones

Figures: (left) token-length distribution of CoTs; (right) frequency of cognitive behaviors in CoTs.
  • CoT Lengths (left figure): CoTs in LongPerceptualThoughts are longer on average than the CoTs produced by popular VLMs.
  • Cognitive Behaviors (right figure): CoTs in LongPerceptualThoughts also exhibit more diverse structure. Unlike the flat responses of baseline VLMs, our CoTs contain intermediate steps and self-reflection: they show significantly higher rates of subgoal setting, backtracking, and verification than CoTs from VLMs such as Qwen2.5-VL, and are closer in character to those generated by Gemini Flash Thinking. These structured behaviors are key indicators of system-2 reasoning and are rarely observed in existing vision-language CoT datasets. (A toy behavior-tagging sketch follows this list.)
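
As a rough illustration of the kind of analysis behind the right-hand figure, the snippet below tallies how often a few reasoning markers show up in a set of CoTs. The marker lists and the keyword heuristic are our own simplification; the actual behavior classification may well rely on a stronger (e.g., LLM-based) judge.

```python
import re
from collections import Counter

# Toy keyword heuristic for spotting system-2 style behaviors in CoT text.
# The marker lists below are illustrative assumptions, not the paper's taxonomy.
BEHAVIOR_MARKERS = {
    "verification": [r"\blet me (check|verify)\b", r"\bdouble[- ]check\b"],
    "backtracking": [r"\bwait\b", r"\bactually\b", r"\bon second thought\b"],
    "subgoal_setting": [r"\bfirst\b", r"\bnext\b", r"\bstep \d\b"],
}

def count_behaviors(cots: list[str]) -> Counter:
    """Count how many CoTs exhibit each behavior at least once."""
    counts = Counter()
    for cot in cots:
        text = cot.lower()
        for behavior, patterns in BEHAVIOR_MARKERS.items():
            if any(re.search(p, text) for p in patterns):
                counts[behavior] += 1
    return counts

if __name__ == "__main__":
    demo = [
        "First, locate the sign. Wait, let me check the left side again.",
        "The object is red, so the answer is B.",
    ]
    print(count_behaviors(demo))  # the first CoT triggers all three behaviors
```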

Takeaways

  1. 🧠 Synthesizing long, rich reasoning data powers perception.
    With just 500 images, we generate 30K chain-of-thought traces that teach vision-language models to reflect, verify, and revise. These cognitive patterns—rare in standard datasets—boost visual reasoning across benchmarks.
  2. 🥷 Just SFT. Nothing fancy.
    No new architectures. No reinforcement learning from scratch. Just supervised fine-tuning on LongPerceptualThoughts leads to strong gains on vision-centric tasks. More surprisingly, it even improves performance on a text-only reasoning benchmark (MMLU-Pro). See more details in our paper! (A sketch of what such a training record might look like is shown below.)
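
For context on what "just SFT" means in practice, here is a minimal sketch of turning one synthesized example into a chat-style training record. The message schema and the <think>…</think> wrapping are illustrative assumptions, not necessarily the exact target format used in the paper.

```python
import json

def to_sft_record(example: dict) -> dict:
    """Chat-style SFT record: image + question in, long CoT + answer out.

    The message schema and <think> wrapping are illustrative assumptions,
    not necessarily the exact format used in the paper.
    """
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": example["image_path"]},
                {"type": "text", "text": example["question"]},
            ]},
            {"role": "assistant",
             "content": f"<think>{example['long_cot']}</think> {example['answer']}"},
        ]
    }

if __name__ == "__main__":
    example = {
        "image_path": "images/000001.jpg",  # hypothetical path
        "question": "What color is the traffic light? (A) red (B) green",
        "long_cot": "The light appears green. Wait, let me verify the top lamp "
                    "is off before answering... it is, so the light is green.",
        "answer": "(B)",
    }
    print(json.dumps(to_sft_record(example), indent=2))
```

Preference pairs for the DPO stage reported in the tables can be built from the same synthesized data; we leave the exact pairing scheme to the paper.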

Richer thinking leads to better seeing—without needing more data or bigger models.

BibTeX

@misc{liao2025longperceptualthoughtsdistillingsystem2reasoning,
      title={LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception}, 
      author={Yuan-Hong Liao and Sven Elflein and Liu He and Laura Leal-Taixé and Yejin Choi and Sanja Fidler and David Acuna},
      year={2025},
      eprint={2504.15362},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.15362}, 
}