
Visual QA


Visual Question Answering (VQA) is the task of answering a natural-language question about an image. The model receives two inputs, an image and a question, and must decide how the image content relates to the question asked. Although VQA is often seen as a computer vision task, the problem can largely be addressed in the NLP domain, since both the question and the answer are expressed in text.

Main approaches

In general, we can outline the VQA procedure as follows:

  • Extract features from the question. The question tokens are passed through an embedding layer, and the question feature is the final hidden state of an LSTM or GRU run over those embeddings.
  • Extract features from the image. A CNN with its last classification layer removed produces the image feature.
  • Combine the features to generate an answer. The fusion block is a critical element of the entire VQA pipeline. Its inputs are the question and image features, and its output is a multi-modal feature. Element-wise multiplication and bilinear pooling are common fusion methods: element-wise multiplication requires the question and image features to have the same dimension, while bilinear pooling can produce a fused embedding from features of different sizes.

Bag-of-words or long short-term memory (LSTM) networks can be used for text features, while a CNN pre-trained on ImageNet is the most frequent choice for image features. Answer generation modules usually model the problem as a classification task.
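To make this pipeline concrete, here is a minimal sketch in PyTorch (assuming torch and torchvision ≥ 0.13 are installed). The SimpleVQA class, the ResNet-18 backbone, and the dimensions are illustrative choices rather than a reference implementation: the image feature comes from a pre-trained CNN with its head removed, the question feature is the final LSTM hidden state, the two are fused by element-wise multiplication, and a linear classifier scores the answer labels.

import torch
import torch.nn as nn
from torchvision import models


class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        # image encoder: pre-trained ResNet-18 with the classification layer removed
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(512, hidden_dim)
        # question encoder: embedding layer followed by an LSTM
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # classifier over the answer labels
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, images, question_tokens):
        img_feat = self.cnn(images).flatten(1)           # (batch, 512)
        img_feat = torch.tanh(self.img_proj(img_feat))   # (batch, hidden_dim)
        embedded = self.embed(question_tokens)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)
        q_feat = torch.tanh(h_n[-1])                     # final hidden state, (batch, hidden_dim)
        fused = img_feat * q_feat                        # element-wise multiplication
        return self.classifier(fused)                    # answer logits

Swapping the element-wise product for a bilinear pooling module would allow the image and question features to have different sizes.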

The classifier, the last block in this scheme, plays an important role, too: it maps the fused feature onto a set of answer labels. VQA turns open-ended questions with single-word answers into a classification problem by treating the answers as labels. In practice this is usually a multi-label problem, since a single open-ended question can have several correct answers (labels). For example, the question 'What is your car brand?' can have multiple correct answers: BMW, Toyota, Nissan, etc.
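Because several answers can be correct at once, the targets are often encoded as multi-hot vectors and trained with a multi-label loss. A small illustrative sketch follows; the answer2label vocabulary below is hypothetical and would normally be built from the training set.

import torch
import torch.nn as nn

# hypothetical answer vocabulary collected from the training set
answer2label = {"bmw": 0, "toyota": 1, "nissan": 2, "red": 3, "two": 4}

def answers_to_target(answers, num_answers=len(answer2label)):
    # turn the annotators' answers for one question into a multi-hot target vector
    target = torch.zeros(num_answers)
    for answer in answers:
        if answer in answer2label:
            target[answer2label[answer]] = 1.0   # several labels can be correct at once
    return target

target = answers_to_target(["bmw", "toyota", "nissan"])
criterion = nn.BCEWithLogitsLoss()               # multi-label loss over the answer logits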

The input picture shows two elephants somewhere in the wild; it's not obvious from the picture whether they are playing or fighting. Our purpose is to extract features from this picture. The common choices for image feature extraction include:

  • Output from the penultimate layer of a pre-trained CNN;
  • Local features from convolutional feature maps generated by a pre-trained CNN;
  • CNN features extracted from region proposals.

We also have a question, 'What animals are these?', which needs to be encoded as well. The most common ways to do this are:

  • Bag-of-words (BOW);
  • LSTM/GRU;
  • A natural language parser.

Then we need to choose an algorithm to combine the image and question features. The final step shown in the scheme is a classifier that works on the fused features. In the example, the candidate answer labels are Giraffe, Elephant, Cat, Playing, Yes, and 2, and the classifier chooses Elephant.

(source)

There are very few evaluation metrics for VQA, and some researchers still rely on manual evaluation. WUPS is one of the few automatic metrics. It is based on the Wu-Palmer (WUP) similarity and estimates how semantically close an answer is to the ground truth, on a scale from 0 to 1. Using WordNet, it measures how far each term in the answer is from the corresponding term in the ground truth within the semantic taxonomy tree.
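Below is a simplified sketch of a thresholded WUPS score using NLTK's WordNet interface (it assumes nltk is installed and the WordNet corpus has been downloaded). The 0.9 threshold and the 0.1 down-weighting follow the commonly reported WUPS@0.9 variant; the function names are illustrative.

from itertools import product
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def wup(word_a, word_b):
    # best Wu-Palmer similarity over all sense pairs of the two words
    pairs = product(wn.synsets(word_a), wn.synsets(word_b))
    return max((a.wup_similarity(b) or 0.0 for a, b in pairs), default=0.0)

def wups(answer, truth, threshold=0.9):
    # simplified WUPS: weak matches are down-weighted, then scores are aggregated
    def one_side(words_a, words_b):
        score = 1.0
        for a in words_a:
            best = max((wup(a, b) for b in words_b), default=0.0)
            score *= best if best >= threshold else 0.1 * best
        return score
    return min(one_side(answer, truth), one_side(truth, answer))

print(wups(["elephant"], ["elephants"]))   # words close in WordNet score near 1.0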

Datasets

  • VQA is a dataset with a massive collection of open-ended questions. To answer these questions, one must grasp vision, language, and common sense knowledge. The total number of images is 265,000, and each image has at least three questions;
  • TextVQA requires models to read and reason about the text in images to answer questions about them. To answer TextVQA questions, models need to incorporate a new modality, the text present in the images, and reason over it. The TextVQA dataset contains 45,336 questions over 28,408 images from the OpenImages dataset. The dataset uses the VQA accuracy metric for evaluation.
  • Tumblr GIF (TGIF) contains 100,000 animated GIFs and 120,000 sentences describing the visual content of the animated GIFs. The animated GIFs have been collected from Tumblr, from randomly selected posts published between May and June 2015.
  • Visual7W: Grounded Question Answering in Images is built on 47,300 COCO images. It has 327,939 QA pairs, 1,311,756 human-generated multiple choices, and 561,459 object groundings from 36,579 categories. Examples from the dataset (source):

 There are six images, each with a question posed. The dataset provides four possible answers, only one of which is correct.

Despite the precautions taken in the design of these datasets (for example, including popular answers as decoys makes it harder to infer the type of question from the set of answers), some issues remain. Perhaps the most striking one is that some questions are too subjective to have a single correct answer.

Implementation

VQA models are available on the Hugging Face Hub. One can use either the VisualBERT model or ViLT (Vision-and-Language Transformer).

We'll show you how to implement VQA with a fine-tuned ViLT model:

from transformers import ViltProcessor, ViltForQuestionAnswering


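# load the processor and the ViLT model fine-tuned on the VQA v2 dataset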
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

A processor in Transformers is similar to a tokenizer, but it covers both modalities: the text must be tokenized, and the image must be preprocessed as well.

We will try the model on After Walk, a 19th-century painting by Gustave Léonard de Jonghe:

 A late 19th-century painting of a young girl sitting in the centre. She holds a book in her hands, but she isn't reading it: she has fallen asleep. She must be tired after her walk.

from PIL import Image


image = Image.open('after_walk_jonghe.jpg')

The question is pretty ordinary:

text = "What is the girl doing?"

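# the processor tokenizes the question and converts the image into pixel values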
encoding = processor(image, text, return_tensors="pt")

Now, let's generate the answer:

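# ViLT treats VQA as classification: pick the answer label with the highest logit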
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])


##  Predicted answer: sitting

The model's answer is correct, though there is room for improvement. For example, a more sophisticated model might say taking a rest...

Conclusion

In this topic, you've learned about Visual QA, an exciting task at the intersection of NLP and CV. You've learned the main VQA approaches and datasets, and how to implement VQA in Python.
