Visual Question Answering (VQA) is the task of answering a natural-language question about an image. A VQA system takes two inputs, an image and a question, and must understand both the visual content and the text to produce an answer. This places the task at the intersection of computer vision (CV) and natural language processing (NLP).
Main approaches
In general, we can outline the VQA procedure as follows:
- Extract features from a question. We get the question feature by passing the question tokens through an embedding layer. The question feature representation is the final hidden state of an LSTM or GRU.
- Extract features from an image. A CNN with its final classification layer removed extracts the image feature.
- Combine the features to generate an answer. The fusion block is a critical element of the entire VQA pipeline. Its inputs are the question and image features, and its output is a multi-modal feature. Element-wise multiplication and bilinear pooling of the image and question features are common fusion methods: element-wise multiplication requires the question and image features to have the same dimension, while bilinear pooling can produce a fused embedding from the two features (see the sketch right after this list).
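Below is a minimal PyTorch sketch of this pipeline. The dimensions, vocabulary size, and number of answer labels are illustrative assumptions, not a reference implementation.

import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024,
                 image_dim=2048, num_answers=3000):
        super().__init__()
        # Question branch: embedding layer + LSTM; the final hidden state is the question feature
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project the image feature to the same dimension so that
        # element-wise multiplication is possible
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Classifier over a fixed set of answer labels
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, image_features):
        embedded = self.embedding(question_tokens)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)            # h_n: (1, batch, hidden_dim)
        q = h_n[-1]                                  # question feature
        v = self.image_proj(image_features)          # image feature
        fused = q * v                                # fusion by element-wise multiplication
        return self.classifier(fused)                # answer logits

# Toy usage with random inputs
vqa = SimpleVQA()
tokens = torch.randint(0, 10000, (2, 12))    # a batch of 2 tokenized questions
image_feats = torch.randn(2, 2048)           # e.g., pooled CNN features
answer_logits = vqa(tokens, image_feats)     # shape: (2, 3000)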
Bag-of-words or long short-term memory (LSTM) networks can be used for text features. For image features, a CNN pre-trained on ImageNet is the most frequent choice. Answer generation modules usually model the problem as a classification task.
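As an illustration, here is one common way to obtain such image features, assuming a recent version of torchvision and its ResNet-50 pre-trained on ImageNet:

import torch
import torchvision.models as models

cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()   # remove the final classification layer
cnn.eval()

with torch.no_grad():
    image_batch = torch.randn(1, 3, 224, 224)   # placeholder for a preprocessed image
    features = cnn(image_batch)                 # pooled image features, shape (1, 2048)
print(features.shape)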
The classifier module plays an important role, too. The classifier, the last block in this scheme, maps the fused feature to one of the given answer labels. VQA turns answering into a classification problem by converting the (typically single-word) answers into labels. The matter is usually a multi-label problem, since a single open-ended question can have multiple correct answers (labels). For example, for the question "What is your car brand?" there can be multiple correct answers: BMW, Toyota, Nissan, etc.
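Here is a hypothetical illustration of how answers become classification labels; real systems typically keep the few thousand most frequent training answers as the label set.

from collections import Counter

train_answers = ["bmw", "toyota", "bmw", "nissan", "bmw", "red", "two"]
# Keep the most frequent answers as the label set
top_answers = [a for a, _ in Counter(train_answers).most_common(3000)]
answer2label = {a: i for i, a in enumerate(top_answers)}
label2answer = {i: a for a, i in answer2label.items()}

print(answer2label["bmw"])   # 0, the most frequent answer
print(label2answer[0])       # bmw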
There are very few evaluation metrics for VQA, and some researchers still rely on manual evaluation. WUPS is one of the few automatic metrics. It is based on the WUP measure and estimates the semantic distance between an answer and the ground truth as a value between 0 and 1. Using WordNet, the similarity between each term in the answer and the ground truth is determined by their distance in the semantic tree.
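The WUP part of WUPS can be sketched with NLTK's WordNet interface. The synsets below are illustrative, and the full WUPS metric additionally applies a threshold (commonly 0.9) and aggregates over all answer words:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

predicted = wn.synset("dog.n.01")
ground_truth = wn.synset("cat.n.01")
print(predicted.wup_similarity(ground_truth))   # a similarity score between 0 and 1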
Datasets
- VQA is a dataset with a massive collection of open-ended questions about images. Answering them requires an understanding of vision, language, and commonsense knowledge. The dataset contains 265,000 images, and each image has at least three questions.
- TextVQA requires models to read and reason about the text present in images to answer questions about them. The dataset contains 45,336 questions over 28,408 images from the OpenImages dataset and uses the VQA accuracy metric for evaluation (a simplified version of this metric is sketched after this list).
- Tumblr GIF (TGIF) contains 100,000 animated GIFs and 120,000 sentences describing their visual content. The GIFs were collected from randomly selected Tumblr posts published between May and June 2015.
- Visual7W: Grounded Question Answering in Images is built on 47,300 COCO images. It has 327,939 QA pairs, 1,311,756 human-generated multiple choices, and 561,459 object groundings from 36,579 categories.
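As a side note, the VQA accuracy metric mentioned for TextVQA can be sketched in a simplified form: an answer counts as fully correct if at least three of the ten human annotators gave it (the official evaluation additionally normalizes answer strings and averages over annotator subsets).

def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

human = ["stop", "stop", "stop sign", "stop", "stop",
         "stop", "stop", "stop sign", "stop", "stop"]
print(vqa_accuracy("stop", human))       # 1.0
print(vqa_accuracy("stop sign", human))  # ~0.67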
Despite the precautions taken in the design of these datasets (for example, the inclusion of popular answers makes it harder to infer the type of question from the set of answers), some issues remain. Perhaps the most striking one is that some questions are too subjective to have a single correct answer.
Implementation
VQA models are available on Hugging Face. One can use, for example, the VisualBERT model or ViLT (Vision-and-Language Transformer).
We'll show you how to implement VQA with a fine-tuned ViLT model:
from transformers import ViltProcessor, ViltForQuestionAnswering
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
A processor in Transformers is similar to a tokenizer, but it handles both modalities: it tokenizes the text and preprocesses the image in a single call.
We will try our transformers model on the 19th-century painting After Walk by Gustave Léonard de Jonghe:
from PIL import Image
image = Image.open('after_walk_jonghe.jpg')
The question is pretty ordinary:
text = "What is the girl doing?"
encoding = processor(image, text, return_tensors="pt")
Now, let's generate the answer:
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
## Predicted answer: sitting
The model's answer is correct, though there is room for improvement. For example, a more sophisticated model could say taking a rest...
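Continuing from the snippet above (reusing its logits and model variables), one can also inspect the top candidate answers instead of taking only the argmax:

import torch

probs = torch.softmax(logits, dim=-1)
top_probs, top_ids = torch.topk(probs, k=5, dim=-1)
for p, i in zip(top_probs[0], top_ids[0]):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")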
Conclusion
In this topic, you've learned about Visual QA, an exciting task at the intersection of NLP and CV. You've learned about the main VQA approaches and datasets, and how to implement VQA in Python.