Natural Language Object Retrieval

This article covers part of the project I did during my Umbo CV internship. I'll give a short survey of natural language object retrieval and a closely related topic, referring expressions. I also released a TensorFlow implementation of Natural Language Object Retrieval (paper link).

Big concept

We tend to use adjectives or nearby things to refer to objects. Nowadays we have capable object classification and object detection techniques. However, these all rely on human-defined categories. We need to move further, from pre-defined categories to free-form natural language, which we call a referring expression (not a general expression, but a specific one). In the future, we might ask our AI robot to fetch the remote controller for us, or to differentiate the man near the pole from the man in the red T-shirt. The AI robot must understand these concepts, that is, how humans describe objects.
Here's the illustration:


The blue box is the ground truth. Yellow boxes are correct retrievals (positive recall) and red boxes are incorrect ones (negative recall).
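The article does not say how a retrieved box is judged against the ground truth. A common choice in detection work is intersection over union (IoU) with a threshold, so here is a minimal sketch under that assumption. The function names and the 0.5 threshold are illustrative, not taken from the project code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(predicted_box, ground_truth_box, threshold=0.5):
    """Assumed criterion: a retrieval counts as correct if IoU >= threshold."""
    return iou(predicted_box, ground_truth_box) >= threshold
```

Under this criterion, a yellow box would be one where `is_hit` returns `True` and a red box one where it returns `False`.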


  1. Natural Language Object Retrieval
  2. Grounding with supervised training + multi-task loss
  3. Grounding + region proposal network

1. Natural Language Object Retrieval, inspired by Ronghang Hu's work

This work appeared at CVPR 2016. It extends earlier work, LRCN, and generates captions the same way. It then compares the predicted caption with the query and picks the most related box. I call the whole process "reconstruction", meaning that the model tries to reconstruct the query from the bbox feature.
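The retrieval step can be sketched as scoring every candidate box by how well it "reconstructs" the query, then taking the best one. The sketch below assumes the captioning model has already produced, for each box, the probability of each query word. The numbers are made up and the function names are hypothetical, so treat it as an illustration rather than the released implementation.

```python
import math


def query_log_likelihood(word_probs):
    """Sum of log-probabilities the captioner assigns to each query word."""
    return sum(math.log(p) for p in word_probs)


def retrieve(candidates):
    """Return (best_index, scores) over candidate boxes.

    `candidates[i]` holds the per-word probabilities of the query under a
    captioning model conditioned on box i (hypothetical values here).
    """
    scores = [query_log_likelihood(probs) for probs in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up probabilities for the query "man in red shirt" over three boxes.
candidates = [
    [0.20, 0.30, 0.05, 0.10],
    [0.60, 0.70, 0.80, 0.50],
    [0.10, 0.30, 0.02, 0.05],
]
best, scores = retrieve(candidates)
```

Working in log space avoids numerical underflow when the query is long, since the product of many small probabilities quickly approaches zero.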
Here's the model:
[Figure: model-nlor]

Note that the model generates the caption based not only on the bbox feature but also on the context feature, though this is a relatively naive way to use context information. The paper also reports the performance without the context feature.

2. Grounding with supervised training + multi-task loss, inspired by Anna Rohrbach's work

This work is from the same team at UC Berkeley. It extends the idea to three different models: unsupervised, semi-supervised, and fully supervised. Basically, it also uses the concept of reconstruction and combines it with a more direct approach: directly choosing the most related region. Furthermore, the unsupervised model uses an attention mechanism, which is used heavily in image captioning.

Here, I modified the fully supervised method by adding an extra reconstruction loss. This turns the whole model into a multi-task model, which improves performance by about 2%! The model looks like this:

[Figure: model2]
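The multi-task idea can be sketched as one loss with two terms: a grounding term that pushes the score of the annotated box up, and a reconstruction term that asks the decoder to regenerate the query. This is a minimal sketch, assuming softmax cross-entropy over boxes for grounding and negative log-likelihood for reconstruction. The `rec_weight` parameter and all numbers are hypothetical, not values from my experiments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grounding_loss(box_scores, gt_index):
    """Cross-entropy between the softmax over boxes and the annotated box."""
    return -math.log(softmax(box_scores)[gt_index])


def reconstruction_loss(word_probs):
    """Negative log-likelihood of the query words under the decoder."""
    return -sum(math.log(p) for p in word_probs)


def multi_task_loss(box_scores, gt_index, word_probs, rec_weight=1.0):
    """Grounding loss plus a weighted reconstruction loss (weight is assumed)."""
    return (grounding_loss(box_scores, gt_index)
            + rec_weight * reconstruction_loss(word_probs))


# Made-up scores for three candidate boxes and a three-word query.
box_scores = [2.0, 0.5, -1.0]
word_probs = [0.6, 0.7, 0.8]
loss = multi_task_loss(box_scores, 0, word_probs, rec_weight=0.5)
```

Setting `rec_weight=0.0` recovers the plain fully supervised objective, which makes it easy to compare the two settings.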


This work was done during my Umbo CV internship. Thanks to all the umbots who made my summer internship so fulfilling. Umbo CV really provides a comfortable environment for coders. I truly enjoyed my time there. :)


Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell, Natural Language Object Retrieval
Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele, Grounding of Textual Phrases in Images by Reconstruction
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, Tamara Berg, ReferItGame: Referring to Objects in Photographs of Natural Scenes