Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?

University of Toronto, Vector Institute, NVIDIA, University of Ottawa


Abstract

Enhancing semantic grounding abilities in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we venture into an orthogonal direction and explore whether VLMs can improve their semantic grounding by "receiving" feedback 💬, without requiring in-domain data, fine-tuning, or modifications to the network architectures. We systematically analyze this hypothesis using a feedback mechanism composed of a binary signal. We find that, if prompted appropriately, VLMs can utilize feedback both in a single step and iteratively, showcasing the potential of feedback as an alternative technique to improve grounding in internet-scale VLMs. Furthermore, VLMs, like LLMs, struggle to self-correct errors out of the box. However, we find that this issue can be mitigated via a binary verification mechanism. Finally, we explore the potential and limitations of amalgamating these findings and applying them iteratively to automatically enhance VLMs' grounding performance, showing that grounding accuracy consistently improves with automated feedback across all models in all settings investigated. Overall, our iterative framework improves semantic grounding in VLMs by more than 15 accuracy points under noise-free feedback and up to 5 accuracy points under a simple automated binary verification mechanism 🚀.

Use Prompt-based Feedback to Improve VLMs

Despite VLMs’ strong visual-language understanding abilities, fine-grained visual grounding remains a challenge. This work investigates how prompt-based feedback can improve semantic grounding in VLMs. This approach enjoys two main advantages:

  1. 🔥 Requiring no additional training
  2. 🔥 Enabling services that rely on API-based VLMs to improve their downstream applications

Feedback Dynamics in VLMs

We first study the fundamental questions: (1) Can VLMs receive prompt-based feedback? and (2) Can VLMs give prompt-based feedback? We study these questions in the context of semantic grounding and investigate three open-source VLMs (LLaVA-1.5, ViP-LLaVA, and CogVLM) and one proprietary VLM (GPT-4V & SoM). Here, we summarize our findings and contributions:

  1. 💬 VLMs can receive feedback to improve downstream semantic grounding: Our study shows that VLMs improve by 4 to 12 accuracy points 🚀 when receiving oracle feedback that indicates the correctness of the prediction. When prompted over multiple rounds, VLMs improve by over 15 accuracy points 🚀. This shows the potential of feedback as a means of improving grounding performance in VLMs.
  2. 🗣 VLMs can give binary feedback: In line with the prior literature on LLMs, VLMs struggle with self-correction (Kim et al., 2023) but succeed at self-evaluation (Kadavath et al., 2022). We show that VLMs can provide high-quality binary feedback through different visual prompts, i.e., isolating or marking objects (see the sketch below this list).
  3. 🤖 Combining 1 and 2, VLMs benefit from automatic iterative feedback: We formulate an iterative framework that improves semantic grounding in VLMs by up to nearly 5 accuracy points.
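
To make point 2 concrete, below is a minimal sketch of how a VLM could be prompted to verify its own grounding predictions via the two visual prompts mentioned above (isolating or marking an object). The `vlm.generate(image, prompt)` interface, the helper names, and the exact prompt wording are illustrative assumptions, not the paper's released code.

```python
from PIL import Image, ImageDraw

# Hypothetical interface: `vlm.generate(image, prompt) -> str` stands in for any
# chat-style VLM call (LLaVA-1.5, ViP-LLaVA, CogVLM, or GPT-4V behind an API).

def isolate_object(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Visual prompt #1: isolate the object by cropping to its bounding box."""
    return image.crop(box)

def mark_object(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Visual prompt #2: mark the object with a red rectangle on the full image."""
    marked = image.copy()
    ImageDraw.Draw(marked).rectangle(box, outline="red", width=4)
    return marked

def binary_verification(vlm, image, box, predicted_label: str, mode: str = "mark") -> bool:
    """Ask the same VLM a yes/no question about its own prediction."""
    visual = isolate_object(image, box) if mode == "isolate" else mark_object(image, box)
    prompt = (
        f"Is the highlighted object a {predicted_label}? "
        "Answer with a single word: yes or no."
    )
    answer = vlm.generate(visual, prompt)
    return answer.strip().lower().startswith("yes")
```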

Label Transfer

In this work, we propose the "Label Transfer" problem, where a label transfer model must adjust the source labels such that the transferred labels follow the annotation protocol of the target dataset. We evaluate effectiveness by the performance of the downstream detectors induced from the transferred labels.

Main challenge: there are no paired labels on the same image across the datasets.

Label-Guided Pseudo-Labeling (LGPL)

To mitigate annotation mismatches, we may use a model trained on the target data to generate pseudo-labels on the source images (Arazo et al., 2019; Lee, 2013), but this discards the existing source labels. On the other hand, statistical normalization (Wang et al., 2020) aligns box statistics but ignores the image content. Label-guided pseudo-labeling aims to fully leverage all available information for the label transfer problem.

LGPL is inspired by the proposal-refinement strategy used in two-stage object detectors. In short, we train the region proposal network (RPN) on the source dataset and apply the source-trained RPN to produce source-like proposals on the target images. Finally, we train the RoI head to map the source-like proposals on the target images to the target labels. All components can be trained end-to-end.
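
As a schematic illustration of the recipe above, here is a sketch of the LGPL pipeline assuming generic `train_rpn` and `train_roi_head` trainer callables; these names and interfaces are placeholders, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def label_guided_pseudo_labeling(
    train_rpn: Callable[[Sequence], Callable],  # trains an RPN on labeled source data
    train_roi_head: Callable[..., Callable],    # trains an RoI head on (proposals, images, labels)
    source_data: Sequence,                      # source images with source-protocol boxes
    target_images: Sequence,
    target_labels: Sequence[List[Box]],         # target-protocol boxes
):
    # 1. Learn "source-style" box proposals: train the RPN on the labeled source dataset.
    rpn = train_rpn(source_data)

    # 2. Apply the source-trained RPN to target images to obtain source-like proposals.
    proposals = [rpn(img) for img in target_images]

    # 3. Train the RoI head to map source-like proposals to target-protocol labels,
    #    i.e., to translate between the two annotation protocols.
    roi_head = train_roi_head(proposals, target_images, target_labels)

    # The RoI head now transfers source-style boxes on new images into target-style labels.
    return roi_head
```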

Iterative Feedback Improves Semantic Grounding in VLMs

We introduce an iterative loop of dialogue between a VLM agent and a verifier. At the first time step, the VLM agent obtains base predictions for every scene. We then prompt a verifier (the same VLM) to generate binary feedback for every prediction. At the next time step, we provide this binary feedback to the VLM agent and ask it to re-generate predictions. We repeat these steps until the verifier agrees with all predictions.
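
Below is a minimal sketch of this agent-verifier loop. The `vlm_predict` and `vlm_verify` callables stand in for the prompted VLM agent and the binary verifier described above; their signatures, the feedback wording, and the round limit are assumptions for illustration.

```python
def iterative_feedback_loop(vlm_predict, vlm_verify, image, objects, max_rounds: int = 3):
    """Iteratively re-prompt the VLM agent with binary feedback from the verifier."""
    # t = 0: base predictions, one label per queried object (e.g., per marked region).
    predictions = {obj_id: vlm_predict(image, obj_id, feedback=None) for obj_id in objects}

    for _ in range(max_rounds):
        # The same VLM acts as the verifier and emits binary feedback for every prediction.
        feedback = {obj_id: vlm_verify(image, obj_id, label)
                    for obj_id, label in predictions.items()}

        # Stop once the verifier agrees with all predictions.
        if all(feedback.values()):
            break

        # Otherwise, re-prompt the agent on rejected predictions, conditioning on the
        # binary feedback (e.g., "your previous answer '<label>' was marked incorrect").
        for obj_id, accepted in feedback.items():
            if not accepted:
                predictions[obj_id] = vlm_predict(image, obj_id, feedback=predictions[obj_id])

    return predictions
```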

🔢 Quantitative Results

We compare our approach, VLM binary verification, with intrinsic self-correction (Kim et al., 2023). We also compare with a noise-free version of our iterative approach, noise-free verification. The following table shows the results on ADE20k*:
*We use a subset of ADE20k validation images for evaluation.

We highlight the performance difference with respect to the base predictions; entries that fall below the base predictions are shown in red.

🎨 Qualitative Results

Here, we show an example using GPT-4V & SoM (Yang et al., 2023) as the VLM. GPT-4V is able to identify which objects are in the image, but struggles to map numeric object IDs to objects. With self-generated feedback (from center to left), GPT-4V successfully corrects its own predictions. For more qualitative results, please refer to our paper.

BibTeX

@misc{liao2024feedback,
title={Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?}, 
author={Yuan-Hong Liao and Rafid Mahmood and Sanja Fidler and David Acuna},
year={2024},
eprint={2404.06510},
archivePrefix={arXiv},
primaryClass={cs.CV}
}