Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

1University of Toronto, Vector Institute, 2NVIDIA, 3University of Ottawa
EMNLP 2024

Abstract

Despite recent advances demonstrating vision-language models' (VLMs) abilities to describe complex relationships in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning and systematically investigate the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs significantly outperform others, with an over 40-point gap between the two best performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues. By instructing VLMs to use reference objects in their reasoning paths via SpatialPrompt, Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4V improve their success rates by over 40, 20, and 30 points, respectively. We emphasize that these significant improvements are obtained without needing more data, model architectural modifications, or fine-tuning.

Why Quantitative Spatial Reasoning?

Spatial reasoning is essential for humans to interact with the world: determining if there is enough room on a desk for a backpack, if there is enough space to navigate through a room without hitting any obstacles, or if an object is placed sufficiently high to be inaccessible to a child.

Our contributions and findings

  1. We introduce Q-Spatial Bench, a benchmark that evaluates quantitative spatial reasoning in large vision-language models.
  2. Perhaps surprisingly, we find that GPT-4o outperforms other closed-source VLMs by a large margin.
  3. We develop SpatialPrompt, a prompting technique that consistently improves quantitative spatial reasoning in VLMs.

Q-Spatial Bench

Motivation: Current benchmarks primarily assess whether VLMs understand qualitative concepts like "left" versus "right" or "near" versus "far" from monocular images. However, recent studies have revealed that state-of-the-art VLMs struggle with quantitative spatial tasks. While measuring sizes or distances from monocular images is ill-posed, humans are surprisingly adept at making such estimations by relying on contextual clues.

Setup: We explore quantitative spatial reasoning, where a VLM is tasked with recognizing quantitative spatial information about physical objects, such as distances or sizes, from a 2D image. In particular, we consider direct quantitative spatial reasoning, where a VLM predicts the quantitative spatial information without accessing any external tools or large models.
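To make this setup concrete, below is a minimal sketch of posing a direct quantitative spatial question to a VLM through the OpenAI chat completions API; the model name, image file, and question text are illustrative placeholders rather than the benchmark's exact protocol:

    # Minimal sketch: ask a VLM a quantitative spatial question directly,
    # with no external tools. Model, image, and question are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("scene.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How far apart are the chair and the sofa? "
                         "Answer with a number and a unit."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)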

This work: In this paper, we introduce a new question-answering corpus, Q-Spatial Bench, specifically designed to evaluate quantitative spatial reasoning in VLMs with high precision.

GPT-4o dominates in Q-Spatial Bench

Prior works (Chen et al., 2024; Cheng et al., 2024) suggest that state-of-the-art VLMs, e.g., GPT-4V, struggle with quantitative spatial reasoning. Perhaps surprisingly, we find that GPT-4o performs reasonably well on the proposed benchmark. Specifically, GPT-4o achieves a success rate of more than 60 points on both splits of Q-Spatial Bench, while the second-best VLM reaches no more than 30 points.
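For concreteness, here is a sketch of how such a success rate can be computed; the factor-of-two tolerance below is our illustrative assumption, not necessarily the benchmark's exact criterion:

    # Sketch of a success-rate metric: a prediction counts as a success
    # when it is within a fixed multiplicative factor of the ground truth.
    # The factor-of-2 threshold is an illustrative assumption.
    def is_success(pred: float, gt: float, factor: float = 2.0) -> bool:
        return max(pred / gt, gt / pred) <= factor

    def success_rate(preds, gts, factor=2.0):
        hits = [is_success(p, g, factor) for p, g in zip(preds, gts)]
        return 100.0 * sum(hits) / len(hits)  # reported in points

    # e.g., success_rate([1.2, 0.4, 3.0], [1.0, 1.0, 2.8]) -> 66.7 points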

Reference objects in reasoning paths predict success

Interestingly, we find that GPT-4o tends to use reference objects in its reasoning path and then estimate the answer at the end. We therefore hypothesize that using reference objects for quantitative spatial reasoning leads to strong performance. To validate this hypothesis, we collect the GPT-4o responses from both splits and analyze them with a logistic regression model. From each response, we derive four variables: (1) whether the response is a success (Y), (2) the source data split (X1), (3) the ground-truth distance (X2), and (4) whether the response uses any reference objects (X3). Fitting a logistic regression with the success of each response as the target, we find that using reference objects in the responses increases the odds of success by more than 250 percent, a statistically significant effect!
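A minimal sketch of this analysis using statsmodels follows; the CSV file and column names are hypothetical placeholders, but the model mirrors the regression described above (success Y on split X1, ground-truth distance X2, and reference-object use X3):

    # Sketch of the logistic-regression analysis described above.
    # The CSV file and column names are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("gpt4o_responses.csv")
    # success (Y): 1 if the response is a success, else 0
    # split (X1): which Q-Spatial Bench split the question comes from
    # gt_distance (X2): the ground-truth distance
    # uses_reference (X3): 1 if the response uses any reference objects
    model = smf.logit(
        "success ~ C(split) + gt_distance + uses_reference", data=df
    ).fit()
    print(model.summary())

    # The fitted coefficient is on the log-odds scale; exponentiating
    # gives the multiplicative change in the odds of success.
    print(np.exp(model.params["uses_reference"]))

An odds ratio above 3.5 from this regression corresponds to the reported increase of more than 250 percent in the odds of success.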

SpatialPrompt

Inspired by the above analysis, we develop a prompting technique that encourages VLMs to perform quantitative spatial reasoning with reference objects. We call this technique "SpatialPrompt".
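To illustrate the idea, here is a paraphrased, SpatialPrompt-style instruction; this is our sketch of the technique, not the exact prompt text from the paper:

    # Paraphrased sketch of a SpatialPrompt-style instruction; an
    # illustration of the idea, not the paper's exact prompt text.
    SPATIAL_PROMPT = (
        "Answer the question by following these steps:\n"
        "1. Identify reference objects in the image whose real-world "
        "sizes are well known (e.g., doors, chairs, people).\n"
        "2. Use those reference objects as visual rulers to estimate "
        "the size or distance in question.\n"
        "3. Explain your reasoning, then give a final numeric answer "
        "with a unit.\n"
    )

    question = "How tall is the bookshelf in the image?"
    full_prompt = SPATIAL_PROMPT + "\nQuestion: " + question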

SpatialPrompt consistently improves all VLMs tested on both splits of Q-Spatial Bench, improving Gemini 1.5 Pro by more than 40 points and GPT-4V by more than 30 points!

BibTeX

@inproceedings{
  liaos2024reasoning,
  title={Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models},
  author={Yuan-Hong Liao and Rafid Mahmood and Sanja Fidler and David Acuna},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
  url={https://arxiv.org/abs/2409.09788},
}