Yuan Hong Liao recently graduated from NTHU, where he majored in Electrical Engineering, advised by Prof. Min Sun. After his training at NTHU, he went to USC as a visiting student, where he worked with Prof. Joseph J. Lim. His research spans reinforcement learning, computer vision, and natural language processing. Aside from the research experiences above, he also dedicates himself to open-source projects such as @openai/baselines and @andrewliao11/Deep-Reinforcement-Learning-Survey.
For more details on his research background, please refer to here (last updated Oct. 2017). For his curriculum vitae, please see here (last updated Feb. 2018). If you're interested in his research, see his Google Scholar profile.
We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner: a domain critic and a multi-modal critic. The domain critic assesses whether the generated sentences are indistinguishable from sentences in the target domain. The multi-modal critic assesses whether an image and its generated sentence form a valid pair. During training, the critics and the captioner act as adversaries -- the captioner aims to generate indistinguishable sentences, whereas the critics aim to distinguish them. During inference, we further propose a novel critic-based planning method to select high-quality sentences without additional supervision (e.g., tags).
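At a high level, the adversarial objective can be sketched as follows. This is a toy illustration rather than the paper's implementation: `p_domain` and `p_pair` are assumed stand-ins for the two critics' probability outputs, the captioner's loss rewards fooling both critics, and each critic is trained with a standard binary cross-entropy objective.

```python
import math

def critic_loss(p_real, p_fake):
    """Binary cross-entropy for a critic: push real examples
    toward probability 1 and generated (fake) ones toward 0."""
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def captioner_loss(p_domain, p_pair):
    """The captioner is rewarded when both critics are fooled.

    p_domain -- domain critic's probability that the generated
                sentence comes from the target domain
    p_pair   -- multi-modal critic's probability that the
                (image, sentence) pair is valid
    """
    return -(math.log(p_domain) + math.log(p_pair))

# A caption that fools both critics incurs a lower loss than one
# that fools neither.
good = captioner_loss(p_domain=0.9, p_pair=0.8)
bad = captioner_loss(p_domain=0.2, p_pair=0.1)
assert good < bad
```

The minimax structure mirrors a GAN: gradient steps alternate between lowering the critics' losses and lowering the captioner's loss.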
We introduce two tactics to attack agents trained by deep reinforcement learning algorithms using adversarial examples: (1) Strategically-timed attack: the adversary aims at minimizing the agent's reward by attacking the agent only at a small subset of time steps in an episode. Limiting the attack to this subset helps the attack evade detection. We propose a novel method to determine when an adversarial example should be crafted and applied. (2) Enchanting attack: the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: the generative model predicts future states, while the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to make the agent take the preferred sequence of actions.
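One plausible way to time such an attack, as a minimal sketch (the preference-gap heuristic and the threshold below are illustrative assumptions, not necessarily the paper's exact criterion): perturb only at time steps where the policy strongly prefers one action, since that is when a perturbation is most likely to change the agent's behavior.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def should_attack(action_logits, threshold=0.5):
    """Attack only when the policy strongly prefers one action:
    the gap between the most- and least-preferred action
    probabilities serves as a simple preference measure."""
    probs = softmax(action_logits)
    return max(probs) - min(probs) > threshold

# Near-uniform preferences: perturbing now barely matters.
assert not should_attack([0.1, 0.0, 0.1])
# Strong preference: a well-timed perturbation can flip the action.
assert should_attack([5.0, 0.0, 0.0])
```

Keeping the number of attacked steps small in this way is what makes the attack harder to detect than perturbing every frame.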
We propose a scalable approach to learning video-based question answering (QA): answering free-form natural language questions about the contents of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. A large number of candidate QA pairs are then automatically generated from the descriptions rather than manually annotated.
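To make the idea of harvesting QA pairs from descriptions concrete, here is a toy sketch. The fill-in-the-blank scheme and the stopword list are illustrative assumptions, not the actual question-generation pipeline:

```python
def generate_candidate_qa(description,
                          stopwords=frozenset({"a", "an", "the", "is", "on", "in"})):
    """Turn a video description into fill-in-the-blank QA candidates
    by blanking out each content word in turn."""
    words = description.split()
    pairs = []
    for i, word in enumerate(words):
        if word.lower() in stopwords:
            continue  # skip function words; blank only content words
        question = " ".join(words[:i] + ["_____"] + words[i + 1:])
        pairs.append((question, word))
    return pairs

pairs = generate_candidate_qa("a dog catches the frisbee")
# e.g. ("a _____ catches the frisbee", "dog")
```

Because the pairs come from freely harvested descriptions, the training set scales with the amount of video found online instead of with annotation effort.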