Reading text from images is not purely an NLP task; it belongs more to the Computer Vision domain. It requires an understanding of how images should be processed and analyzed. In this topic, you will learn what Optical Character Recognition is. This is good for your general knowledge, and it may also come in handy when you need to extract text from an image (a possible task for an NLP Data Scientist).
This topic serves more as an introduction to what OCR is and how to implement it without going deep into neural network models. We will not delve into CNN or RNN models, nor into building your own OCR model from scratch. However, we will touch on the general OCR subprocesses (and their subprocesses) and how to use ready-made models.
Introduction to OCR
Optical Character Recognition (OCR) is a technology for automatically extracting text from images. OCR models are specialized in reading:
Typed text (for example, a PDF file or text typed on a typewriter);
Handwritten text;
Text in the background of an image.
OCR is a task at the intersection of NLP and Computer Vision (CV). As a CV task, OCR demands deep knowledge of image processing techniques. As an NLP task, it demands an understanding of tokenization, regular expressions, and so on.
OCR includes the following subtasks:
Image preprocessing;
Text localization;
Character segmentation;
Character recognition;
Post-processing.
The exact operations can differ, but these are the approximate steps required to get to automatic character recognition.
OCR has many applications. It may be used for license plate detection, document scanning, traffic sign recognition, passport recognition (in places like a police department or an airport), etc.
Apart from classic OCR models, there are also Scene Text Recognition models.
Scene text recognition
Scene Text Recognition (STR) consists of two stages: Text localization and Text recognition.
The scene text localization model identifies text areas by examining each character and the relationships between characters. This approach makes it possible to detect detailed scene text in general, as well as text that is oriented in an arbitrary manner (curved, distorted, or otherwise rotated). The output contains a region score and an affinity score, which are used both to localize individual characters in the picture and to aggregate them into a single text instance.
Text recognition aims to make the computer understand each symbol. This stage includes such substages as normalization, feature extraction, sequence modeling, and prediction.
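To see both stages together, here is a minimal sketch using the easyocr library (our choice for this illustration, not a tool introduced above); by default it combines a CRAFT-style detector for text localization with a CRNN-based recognizer:

```python
# A minimal STR sketch with easyocr: detection (localization) + recognition.
# The image path is hypothetical.
import easyocr

reader = easyocr.Reader(["en"])                 # load detection + recognition models
results = reader.readtext("street_sign.jpg")    # hypothetical image path

for bbox, text, confidence in results:
    # bbox is the localized text region, text is the recognized string
    print(f"{text!r} (confidence: {confidence:.2f}) at {bbox}")
```

Each result pairs a detected bounding box with the recognized string and a confidence score.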
Here is an example of the STR output (all characters are highlighted with green frames):
As can be seen from the picture above, STR is very similar to object recognition tasks.
Other STR models, like SVTR (which has a block for feature extraction and a sequence model for text recognition), have a different structure.
For us, classic STR is important because it includes text localization, one of the first subprocesses of OCR.
Character segmentation
After text localization, you need to segment your characters. Character segmentation is a process of breaking an image into parts to process them further. This process can be divided into the following operations:
Line-level segmentation;
Word-level segmentation;
Character-level segmentation.
These operations are conducted in this sequence. Here is an example of the character segmentation of the following text:
Character segmentation is based on the principle of a sliding window. For line-level segmentation, the model checks the sum of all black pixels in a particular sliding window. At this level, each sliding window is a single row of pixels. The model first checks the top row of pixels, then moves on to the second row, and so on. If a few (two to four) consecutive sliding windows contain no black pixels, or their number is insignificant, these pixel rows are taken as delimiters between lines.
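As a rough sketch of this idea (assuming a binarized NumPy image where 1 marks a black pixel and 0 a white one, with hypothetical parameter values), line delimiters can be found from the row-wise sums:

```python
# A sketch of line-level segmentation on a binarized image
# (1 = black ink, 0 = white background); parameter values are illustrative.
import numpy as np

def find_line_delimiters(binary_img: np.ndarray, min_gap: int = 3, noise: int = 2):
    """Return row indices that separate lines of text."""
    row_sums = binary_img.sum(axis=1)          # black pixels per pixel row
    empty = row_sums <= noise                  # rows with (almost) no ink
    delimiters, run_start = [], None
    for y, is_empty in enumerate(empty):
        if is_empty and run_start is None:
            run_start = y                      # an empty run begins
        elif not is_empty and run_start is not None:
            if y - run_start >= min_gap:       # enough empty rows in a row
                delimiters.append((run_start + y) // 2)
            run_start = None
    return delimiters
```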
Word-level segmentation is pretty much the same, but here the window slides one pixel column to the right along the horizontal axis. If a few (three to five) consecutive columns contain only white pixels, the model assumes that it is a space character, much like using sentence.split(' ') to tokenize your text. At this level, the model also segments punctuation marks; this is possible since most punctuation marks have pixels in specific areas.
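Word-level segmentation can reuse the sketch above on the transposed line image, since summing the rows of the transposed image is the same as summing the columns of the original (again, hypothetical parameter values):

```python
# Word-level segmentation as a sketch: the same projection idea rotated
# 90 degrees, applied to a single line image cut out earlier.
def find_word_delimiters(line_img: np.ndarray, min_gap: int = 4, noise: int = 1):
    return find_line_delimiters(line_img.T, min_gap=min_gap, noise=noise)
```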
There are many options for character-level segmentation. The most popular one is based on histogram processing. In this case, the histogram minima represent the potential delimiters between characters. Since many delimiters are considered potential ones, it is necessary to decide which of them are most likely to be the real delimiters. The decision-making logic used for determining the real delimiters is shown below:
After this, the model should choose the real delimiters. For example, the model may do this by calculating the actual average character width:
The word is traversed starting from the left border, and the delimiter located at the distance of the average character width is taken as a reference delimiter. The real delimiter is determined by finding the potential delimiter closest to the reference delimiter within a range controlled by a threshold value. If there are no potential delimiters in that range, the reference delimiter itself is taken as the real delimiter.
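A possible sketch of this decision step (with hypothetical names and a made-up threshold value) could look like this:

```python
# A sketch of choosing "real" character delimiters: potential delimiters are
# histogram minima; the average character width gives a reference position,
# and the closest potential delimiter within a threshold is kept.
def pick_real_delimiters(potential, word_width, avg_char_width, threshold=3):
    real, position = [], 0
    while position + avg_char_width < word_width:
        reference = position + avg_char_width            # expected boundary
        candidates = [d for d in potential
                      if d > position and abs(d - reference) <= threshold]
        # closest potential delimiter, or the reference position if none found
        boundary = min(candidates, key=lambda d: abs(d - reference)) if candidates else reference
        real.append(boundary)
        position = boundary
    return real
```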
Character recognition
An attention-based text recognizer is typically designed as an encoder-decoder framework. In the encoding stage, an image is transformed into a sequence of feature vectors by a CNN/LSTM, and each feature vector corresponds to a region of the input image. In the decoding stage, the attention network (AN) first computes alignment factors by referring to the history of target characters and the encoded feature vectors in order to generate the synthesis vectors (also called glimpse vectors), thus achieving the alignment between the attention regions and the corresponding ground-truth labels. Then, a recurrent neural network (RNN) generates the target characters based on the glimpse vectors and the history of the target characters.
However, even with deep learning, there are challenges because of the variety of characters caused by different fonts, languages, and, of course, poor handwriting. There are different deep learning approaches to handle this, including EAST and CRNN. These methods can be implemented with tools like TensorFlow and Keras. OCR is also possible with Transformer models. For this task, researchers invented the Vision Encoder-Decoder architecture. Encoder-decoder models, as we've already said many times, are one of the Transformer architectures with an encoder and a decoder module.
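As a minimal sketch of that approach, the transformers library provides Vision Encoder-Decoder models; the TrOCR checkpoint below is just one possible public choice, and the image path is hypothetical:

```python
# A minimal sketch of OCR with a Vision Encoder-Decoder (TrOCR) model.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_line.png").convert("RGB")     # hypothetical image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)            # the decoder generates character ids
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```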
Handwritten text recognition
Handwritten Text Recognition (HTR) is the process of character recognition from handwritten text. This is important for some very ancient manuscripts that aren't available in typed format: perhaps an extremely rare manuscript known only to a small circle of historians, or a secret manuscript in the Vatican library. Another motivation to learn handwritten OCR is that you can make an algorithm read personal letters that were written by hand and have never been converted into a computer format. Finally, the best reason to learn HTR is that it may be applied to reading examination papers, as most students write down their answers by hand.
In handwritten OCR, we skip the text localization procedure since we assume that the whole image is a text array. We can also skip character segmentation, as characters can be joined together in handwritten text. This omission is connected to the so-called Sayre's paradox, a dilemma encountered in the design of automated handwriting recognition systems. A standard statement of the paradox is that a cursively written word cannot be recognized without being segmented and cannot be segmented without being recognized. Please note that we don't skip line and word segmentation.
Traditional techniques rely on segmentation for recognition, while modern techniques aim to identify all the characters in a segmented line of text at once. Machine learning techniques are particularly effective here, as they can avoid traditional feature engineering and instead utilize models from state-of-the-art OCR systems, such as CNN models and Vision Transformers.
HTR is still highly challenging because the shape of a handwritten character can be unpredictable, and there are few datasets to train a model on. One of the most famous CV datasets with handwritten digits is MNIST. A good dataset for handwritten letters is the A-Z Kaggle dataset.
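For reference, MNIST can be loaded in a couple of lines with Keras (the A-Z dataset has to be downloaded from Kaggle separately):

```python
# A quick sketch of loading the MNIST handwritten-digit dataset with Keras.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)   # (60000, 28, 28) grayscale digit images and their labels
```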
In general, HTR can be separated into two tasks: Online HTR and Offline HTR.
When we do online handwriting recognition, the methods involve a digital pen, and through it we have access to the stroke information and pen location while the text is being written. The picture above illustrates this well and also shows that for such writing we usually have a good amount of information, which makes it a whole lot easier to classify characters with high accuracy.
Offline methods, on the other hand, involve recognizing text once it has already been written down, based on this information only. Thus, we have far fewer features to use for making predictions. Even worse, we might also have to deal with background noise coming from the paper.
Conclusion
In this topic, you've come to an understanding of what OCR is. You've learned the general OCR procedure (Image preprocessing, text localization, character segmentation & recognition, and post-processing) and OCR applications.
OCR is really helpful for NLP Data Scientists to extract text from images for further processing. Of course, one can just scan an image in Google Lens (which is much simpler than all these libraries and models), but OCR libraries allow you to complete the whole project in Python without using external software. The other reason it may be useful is that your customer or boss may want to know how exactly you extracted the text you are processing from the image.
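For instance, a minimal extraction script could rely on the pytesseract wrapper (assuming the Tesseract engine is installed on the machine, and with a hypothetical image path):

```python
# A minimal sketch of extracting text from an image purely in Python.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_page.png"))  # hypothetical path
print(text)
```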