
Image Annotation


In this topic, you will learn what image annotation is, how it works, and how you can implement it in Python. Image annotation is a fascinating operation that is very close to object detection: the detection step is essentially the same, and the only difference is that in image annotation you also generate a short text as the output.

The picture shows a table with a plate of food and a blender. Detected objects are highlighted with coloured squares. Each square has its own annotation, which is given on the black sheet around the picture.

What is Image Annotation?

Image annotation is the process of assigning labels to an image or a set of images. A human operator reviews a collection of pictures, identifies the relevant objects in each image, and annotates the image by indicating the shape and label of each object. These annotations can then be used to build a training dataset for computer vision models, which learn from them to detect objects or label images on their own. Some well-known image annotation datasets include:

COCO – a dataset with more than 330k images and a total of 1.5 million object instances. Every image is annotated with 5 text descriptions.

Flickr30K – a dataset with 31,783 images collected from the Flickr website. Every image is accompanied by 5 human-annotated text descriptions.

Sometimes the task of image annotation is also called image-text matching or captioning. The basic concepts of image annotation build on technologies such as image retrieval systems, object detection, and image classification; image annotation is unthinkable without these operations.

The output can also be different. It could be a detailed description of the picture or just 3–4 words.

The most common types of image annotations are:

  • 2D Bounding Boxes (a small example follows this list);
  • Polygonal Segmentation;
  • Lines and Splines;
  • Semantic Segmentation;
  • 3D Cuboids.
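
As a small illustration of the first type, 2D bounding boxes, here is roughly what a single annotation looks like in a COCO-style record. The field names follow the COCO convention, while the concrete values and the "plate" category are made up for the example:

annotation = {
    "image_id": 42,                       # which image this annotation belongs to
    "category_id": 3,                     # index of a label such as "plate" defined elsewhere
    "bbox": [120.0, 85.0, 200.0, 150.0],  # [x, y, width, height] in pixels
    "area": 30000.0,                      # area covered by the object
    "iscrowd": 0,                         # 1 would mark a group of objects annotated together
}

A model trained on many thousands of such records learns to predict the boxes and labels on its own.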

To annotate an image, you first need to detect all visible objects (Object Detection task). The result of object detection is an image with marked areas (as in the pictures below).

There are multiple pictures with detected objects on them.

They could be marked with rectangles or squares (as above) or with color (as below).

There are two photos; both show a wooden birdhouse on a tree, but the left one is normal, while in the right one the birdhouse is coloured in red.
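
If you want to try the detection step yourself, one possible sketch uses the object-detection pipeline from Transformers. The model choice and the file name some_photo.jpg below are just assumptions for illustration (and this particular checkpoint may require the timm package to be installed):

from transformers import pipeline

# Load an off-the-shelf detector; DETR is one publicly available option.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Each detection is a dict with a 'label', a confidence 'score' and a 'box' of pixel coordinates.
print(detector("some_photo.jpg"))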

After this step, you can embed visual semantics and analyze them; you will read about this in the following section. Finally, you can generate a text based on these embeddings.

Visual Semantic Embeddings

When people describe what they see in a picture using natural language, they mention not only the objects they see in the picture, but also the interactions between the objects, their relative positions, as well as other high-level semantic concepts. Visual Semantic Embeddings are essential for such reasoning, since the computer can't work with concepts and their names like "chair" or "table"; it is able to process only embeddings. Visual Semantic Embeddings capture both objects and the relations between them.

The process of Visual Semantic Embedding (3 pictures): first, objects are detected using Bottom-Up Attention; then Visual Semantic Reasoning is applied (the picture shows how the detected objects interact); finally, the representation is produced.
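
Once the image and the text live in the same vector space, matching them usually boils down to comparing the two vectors, for example with cosine similarity. Below is a very rough sketch of that comparison; the embeddings are random stand-ins, whereas a real model would produce them with trained image and text encoders:

import numpy as np

# Hypothetical embeddings; in practice they come from an image encoder and a
# text encoder trained to map matching pairs close to each other.
image_embedding = np.array([0.20, 0.70, 0.10, 0.50])
caption_embedding = np.array([0.25, 0.65, 0.05, 0.45])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The higher the similarity, the better the caption matches the image.
print(cosine_similarity(image_embedding, caption_embedding))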

A related operation is Visual Semantic Role Labeling. In NLP, semantic role labeling refers to labeling the words in a sentence with their semantic roles with respect to the verb. Visual semantic role labeling is a new task that hasn't been studied thoroughly yet. NLP research on this and related areas has produced resources such as FrameNet and VerbNet. Researchers use datasets such as PASCAL-VOC, a popular dataset for static action classification, to train visual semantic role labelers. Other datasets for this task are MPII Human Pose, People Playing Musical Instruments (PPMI), and the Trumpets, Bikes and Hats dataset (TBH).

A picture is given with a guy on a red motorcycle. Above the picture, there is a question: is this person riding or driving something? The answer is "yes".

Attention in image annotation models

There is an issue with image annotation: when the model is trying to generate the next word of the caption, this word usually describes only a part of the image, and a single representation of the whole image cannot capture which part that is. Using the whole representation of the image h to condition the generation of each word can't efficiently produce different words for different parts of the image. This is exactly where an attention mechanism is helpful.

An attention model is a method that takes n arguments y1, …, yn (in the image annotation case, the yi would be the image features hi) and a context c. It returns a vector z, which is supposed to be a "summary" of the yi, focusing on the information linked to the context c. In other words, it returns a weighted mean of the yi, where the weights are chosen according to the relevance of each yi given the context c.
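
To make the "weighted mean" idea concrete, here is a minimal NumPy sketch, assuming a simple dot product as the relevance score (real models use various scoring functions):

import numpy as np

def attention(ys, c):
    """ys: an (n, d) matrix of the vectors y1..yn; c: a (d,) context vector."""
    scores = ys @ c                        # relevance of each yi to the context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: weights are positive and sum to 1
    z = weights @ ys                       # weighted mean of the yi
    return z, weights

ys = np.random.rand(5, 8)   # e.g. 5 image-region features of size 8
c = np.random.rand(8)       # e.g. the decoder state while generating a word
z, weights = attention(ys, c)
print(weights)              # how much each region contributes to z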

In Image Annotation, there are two types of models:

  • Global attention: attention is placed on all source positions (this is how Bahdanau's attention and Luong's global attention work).
  • Local attention: attention is placed only on a few source positions (introduced by Luong et al. as a cheaper alternative).

Most image-text matching models are based on the Stacked Cross Attention Network (SCAN). For example, R-SCAN, a model proposed by Microsoft, extends SCAN: it uses a two-stage attention mechanism to discover fine-grained correspondence between objects and words, additionally encodes visual relations, and employs a gating mechanism to select between region and relation features.

Image annotation is also possible without attention. Such an image captioning system encodes the image with a pre-trained Convolutional Neural Network that produces a hidden state h, and then decodes this hidden state with a Recurrent Neural Network (RNN), recursively generating each word of the caption.
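
A very condensed, hypothetical sketch of such an attention-free captioner is given below; the class, its sizes, and the greedy decoding loop are assumptions for illustration, not the architecture of any particular published system. The CNN passed in is assumed to output one hidden_size-dimensional vector per image:

import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    """Encode the image into a single hidden state h, then decode words with an RNN."""
    def __init__(self, cnn, vocab_size, hidden_size=512):
        super().__init__()
        self.cnn = cnn                                 # pre-trained CNN returning (batch, hidden_size)
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image, max_len=16, bos_id=0):
        h = self.cnn(image)                            # the whole-image representation h
        token = torch.full((image.size(0),), bos_id, dtype=torch.long)
        words = []
        for _ in range(max_len):
            h = self.rnn(self.embed(token), h)         # every word is conditioned on the same h
            token = self.out(h).argmax(dim=-1)         # greedy choice of the next word
            words.append(token)
        return torch.stack(words, dim=1)               # (batch, max_len) token ids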

Other related approaches include Image Captioning and Scene Graph Generation.

Image-text matching implementation

Image-text matching models are available in Hugging Face Transformers. These models are based on the Vision Encoder-Decoder architecture.

First, make sure you have the Transformers library installed:

!pip install git+https://github.com/huggingface/transformers

In this topic, let's use this image-captioning model:

from transformers import BlipProcessor, BlipForConditionalGeneration


processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

A processor in Transformers is somewhat similar to the tokenizer you may already know: the text needs to be tokenized, the image needs to be preprocessed, and the processor takes care of both.
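
If you are curious about what the processor actually returns, you can inspect its output. The blank test image below is just a stand-in; for BLIP, the result contains pixel_values for the image plus input_ids and attention_mask for the tokenized prompt (exact shapes depend on the checkpoint's configuration):

from PIL import Image

# A hypothetical blank image, just to look at the processor's output format.
test_image = Image.new("RGB", (640, 480), color="white")
inputs = processor(test_image, "A painting of", return_tensors="pt")

# pixel_values is the preprocessed image; input_ids/attention_mask encode the prompt.
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))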

Let's try out the Transformers model on the 19th-century painting "Mary Magdalene in the Grotto" by Jules-Joseph Lefebvre.

It's a late 19th-century work of art depicting a nude red-haired woman lying on a rock. In the left background there is something like a lake, and on the right we see a bigger rock. A climbing plant is curling around her legs, and her hands are on her face. The picture depicts the second part of Mary Magdalene's life: she went to the desert, where she indulged in the strictest austerity for 30 years, lamenting her sins. Her clothes had decayed, but her nakedness was covered by long hair, and her emaciated, aged body was lifted up to heaven by angels every night to heal it: "God feeds her with heavenly food, and angels lift her up to heaven every day, where she listens to the singing of heavenly choirs with 'bodily ears'."

Let's use the PIL library to load the image:

from PIL import Image


raw_image = Image.open('magdaline.jpg').convert('RGB')

Image annotation generation can be conditional or unconditional. Conditional generation means that you provide a phrase that should appear at the beginning of the annotation, and the model continues it from there. For example, you can state that the future annotation should start with "A painting of":

text = "A painting of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))


##  a painting of a nude - clad woman laying on a rock

Your wish is fulfilled!

The other way to generate an annotation is a simpler one: unconditional generation. This means that you don't constrain your model with a phrase like "A painting of". The model can generate anything it wishes!

inputs = processor(raw_image, return_tensors="pt")  # pay attention: there is no text in the input
 
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))


##  a close up of a naked woman laying on a rock

Some models (for example, the ViT series models) offer another way of annotation generation, and for that they need a so-called feature extractor:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer


model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

After that, you need a function to generate an annotation:

import torch

# The ViT-GPT2 model loaded above can be moved to a GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

def predict_step(image_paths):
    # Open every image and make sure it is in RGB mode.
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Preprocess the batch and move the tensors to the same device as the model.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate token ids with beam search and decode them back into text.
    output_ids = model.generate(pixel_values, **gen_kwargs)
    results = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in results]

Then, you can finally generate an annotation:

predict_step(['magdaline.jpg'])

##  ['a woman sitting on a rock with her legs crossed']
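
As a side note, recent versions of Transformers also offer an image-to-text pipeline that wraps the feature extractor, tokenizer, and generation behind one call; a minimal sketch is shown below, and the exact output format may vary between library versions:

from transformers import pipeline

# The pipeline loads the processor and the model and handles generation internally.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(captioner("magdaline.jpg"))  # e.g. [{'generated_text': '...'}]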

Conclusion

In this topic, you've continued to explore the fascinating intersections between Computer Vision and NLP. You've learned what Image Annotation (a.k.a. Image-Text Matching) is and read about some datasets and models for this task. You've also read about Visual Semantic Embeddings.
