
Tokenization


NLP includes a variety of procedures, and tokenization is one of them. Its main task is to split a sequence of characters into units called tokens. Tokens are usually words, numbers, or punctuation marks; sometimes they are sentences or morphemes (word parts). Tokenization is the first step in text preprocessing, and it is a very important one: before moving on to more sophisticated NLP procedures, we need to identify the units that help us interpret the meaning of the text.

Tokenization issues

The major issue is choosing the right token. Let's analyze the example below:

Tokenization sample

This example is trivial: we use whitespace to split the sentence into tokens. However, English often presents less obvious cases. What should we do with apostrophes or with combinations of numbers and letters?

Tokenization sample

What is the most suitable token here? Intuitively, we can say that the first option is what we should go for. Of course, the second option also makes sense as "we're" is the contraction for "we are". All other options are also theoretically possible.

What if we are dealing with a complex name, such as a city? Should "New York" be a single token, or two: ["New", "York"]?
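
If you want to experiment, here is a small sketch using NLTK's word_tokenize (one of many possible tokenizers; it assumes the nltk package and its "punkt" tokenizer data are installed):

# A minimal sketch comparing two simple tokenization strategies.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may need "punkt_tab"
from nltk.tokenize import word_tokenize

sentence = "We're flying to New York."

# Naive approach: split on whitespace only.
print(sentence.split())
# ["We're", 'flying', 'to', 'New', 'York.']  <- punctuation sticks to "York."

# NLTK's word tokenizer splits off contractions and punctuation.
print(word_tokenize(sentence))
# ['We', "'re", 'flying', 'to', 'New', 'York', '.']

Note that neither option treats "New York" as a single unit; that is a separate decision a tokenizer (or a later processing step) has to make.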

As you can see, choosing the right token may be tricky. During the tokenization process, the following aspects have to be considered:

  • Capitalization (see the sketch after this list)

  • The language of the text (this includes programming languages, such as Python, where whitespace is significant because it is used for indentation)

  • Numbers and digits

  • Special tokens (tokens that represent something other than the text itself). They might mark the beginning or end of a text, indicate a specific task (such as classification), or mask parts of the text during training, among other purposes we will consider below.
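
Here is a tiny illustration of how capitalization and numbers complicate things, using plain Python (no NLP libraries); the sentence is made up for demonstration:

# Why capitalization and numbers complicate tokenization.
import re

text = "Apple sold 3.5 million iPhones in Q4."

# Lowercasing everything merges "Apple" (the company) with "apple" (the fruit).
print(text.lower().split())
# ['apple', 'sold', '3.5', 'million', 'iphones', 'in', 'q4.']

# Splitting on every non-alphanumeric character breaks "3.5" into two tokens.
print(re.findall(r"\w+", text))
# ['Apple', 'sold', '3', '5', 'million', 'iPhones', 'in', 'Q4']

A tokenizer has to decide which of these distinctions are worth keeping.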

Tokenizer granularity

There are four main types of tokenization granularity: word, subword, character, and byte tokens.

Word tokenization was once popular but has faded over the years due to several issues. One issue is that the tokenizer cannot deal with tokens that were not present in the training set. Another is that the vocabulary becomes bloated with words that are very similar in meaning (e.g., classify, classification, classifiable, classifier). Subword tokenization solves this by using a root token ('class') and suffix tokens ('ify', 'ification', etc.), which keeps the vocabulary size manageable.

Vocabulary size in tokenization impacts the model's computational requirements, as each token needs its own embedding vector. A smaller vocabulary can still effectively represent language by using subword units, while avoiding the computational overhead of a large vocabulary of whole words.
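
To make this concrete, here is a back-of-the-envelope calculation; the vocabulary sizes and the embedding dimension below are illustrative rather than tied to a particular model:

# Rough cost of the embedding table: one vector per vocabulary entry.
embedding_dim = 768  # e.g., the hidden size used by BERT-base

for vocab_size in (30_000, 500_000):  # subword-sized vs. whole-word-sized vocabulary
    params = vocab_size * embedding_dim
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.1f}M embedding parameters")

#   30000 tokens -> 23.0M embedding parameters
#  500000 tokens -> 384.0M embedding parameters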

Subword tokenization breaks down vocabulary into both words and word fragments. This maintains a rich vocabulary while also handling unknown words by decomposing them into known subwords that already exist in the vocabulary. For example, if the word "uncharacteristically" isn't in the vocabulary, it can still be represented using subword pieces like "un" + "character" + "istic" + "ally".

Character tokenization can handle any new word by falling back to individual letters, but this makes the model's task harder: it must learn to compose letters like "p-l-a-y" instead of recognizing "play" as a single unit. Subword tokens are also more efficient in terms of context length: in a model with a 1,024-token limit, subword tokenization can typically fit about three times more text than character-level tokenization, since each subword token usually covers multiple characters.
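
As a rough check of that claim, here is a sketch comparing character-level and subword token counts; it assumes the Hugging Face transformers library is installed and uses the GPT-2 tokenizer (a subword tokenizer we discuss below):

# Character-level vs. subword token counts for the same text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # a BPE subword tokenizer

text = "Tokenization splits a sequence of characters into units called tokens."

char_tokens = list(text)              # character-level: one token per character
subword_ids = tokenizer.encode(text)  # subword-level: usually several characters per token

print(len(char_tokens), "character tokens")
print(len(subword_ids), "subword tokens")
# The same 1,024-token budget therefore covers several times more text
# when it is spent on subwords rather than on characters.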

Some models use byte-level tokenization, processing text as raw bytes (e.g., UTF-8) without a traditional tokenization step. While some subword tokenizers include bytes as fallback tokens for unknown characters, they aren't truly tokenization-free, since they only fall back to bytes for a subset of characters rather than representing the entire text as bytes.
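
For intuition, the raw bytes are easy to inspect in Python; a byte-level tokenizer's base vocabulary is essentially these 256 possible values (the snippet only shows the bytes, not a full tokenizer):

# Any text, in any script, maps to a sequence of bytes (values 0-255) via UTF-8.
text = "café 😊"
print(list(text.encode("utf-8")))
# [99, 97, 102, 195, 169, 32, 240, 159, 152, 138]
# 'c', 'a', 'f' are single bytes, 'é' takes two bytes, and the emoji takes four,
# so a vocabulary of just 256 byte values can cover every possible character.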

WordPiece

WordPiece (originally introduced in 2012) is a subword tokenizer that identifies the most common character sequences in a training corpus and combines them into a subword vocabulary that can represent the text while staying at a manageable size.

WordPiece works by first initializing the vocabulary with individual characters and then repeatedly merging the pair of adjacent tokens that most improves the likelihood of the training data (a score-based criterion rather than raw pair frequency). When encoding text, words are split greedily into the longest possible subwords from the vocabulary, with a special prefix (usually ##) marking subword pieces that don't start a word.
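
The encoding step (greedy, longest-match-first) can be sketched in a few lines of Python; the tiny vocabulary below is invented purely for illustration:

# A toy sketch of WordPiece-style encoding: repeatedly take the longest
# vocabulary entry that matches the start of the remaining word.
vocab = {"token", "##ization", "##izer", "##s", "play", "##ing"}

def wordpiece_encode(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:                    # shrink the window until a piece matches
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # non-initial pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                     # nothing matches -> the word is unknown
            return [unk]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("tokenizers", vocab))    # ['token', '##izer', '##s']
print(wordpiece_encode("quokka", vocab))        # ['[UNK]']

Real tokenizers, of course, learn vocabularies of tens of thousands of pieces from data rather than hand-picking them.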

Let's look at the uncased version of the BERT tokenizer applied to a regular English sentence and to Python code:

WordPiece tokenization example

WordPiece tokenization example based on the code input

Here, we see how 'uncharacteristically' is split into 6 tokens, 5 of which are subword tokens. [CLS] is a special token added at the beginning of the input; its representation is used for classification tasks. [SEP] is used to separate sentences.
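
If you want to reproduce an example like this yourself, here is a minimal sketch using the Hugging Face transformers library (assuming the bert-base-uncased checkpoint; the exact split depends on the model's vocabulary):

# Running the uncased BERT (WordPiece) tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "She behaved uncharacteristically."

# tokenize() shows the subword pieces; ## marks pieces that don't start a word.
print(tokenizer.tokenize(sentence))

# encode() additionally adds the special [CLS] and [SEP] tokens.
ids = tokenizer.encode(sentence)
print(tokenizer.convert_ids_to_tokens(ids))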

There are 3 more special tokens in WordPiece:

  • [PAD] - Padding token used to make all sequences the same length (a desirable property for training neural networks that expect fixed-length inputs)

  • [UNK] - Unknown token for words that are not in the model's vocabulary

  • [MASK] - Masking token that is used for masked language modeling. During training, random tokens in a sentence are replaced with [MASK], and the model learns to predict what word should go in that position.
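
To see [MASK] in action, here is a quick sketch with the transformers fill-mask pipeline (assuming the bert-base-uncased model; the exact predictions depend on the model):

# Masked language modeling: the model fills in the [MASK] position.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Tokenization is the first [MASK] in text preprocessing."):
    print(prediction["token_str"], round(prediction["score"], 3))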

Byte pair encoding (BPE)

At a high level, byte pair encoding works similarly to WordPiece but uses a different merging strategy: it simply merges the most frequent pair of adjacent tokens. BPE is more flexible with unknown words, while WordPiece maintains more semantic units and produces longer subwords.
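
The merge-learning loop of BPE can be shown on a toy corpus; the word counts below are made up, and real implementations (such as GPT-2's byte-level BPE) are far more optimized:

# A toy sketch of BPE training: repeatedly merge the most frequent adjacent pair.
from collections import Counter

# A made-up corpus: each word is a tuple of symbols with its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # learn three merges
    pair = most_frequent_pair(corpus)
    print("merging", pair)
    corpus = apply_merge(corpus, pair)
print(corpus)

Each learned merge becomes a new vocabulary entry, so frequent fragments like "est" end up as single tokens.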

Let's see how the GPT-2 BPE-based tokenizer works:

An example of sentence tokenization with BPE

An example of code tokenization with BPE

Here, we see that newlines, spaces, and tabs are represented by the tokenizer (which matters when LLMs work with code). There is one special token in this model, <|endoftext|>. The GPT-4 tokenizer works similarly to the GPT-2 tokenizer but is better at handling code and tends to use fewer tokens per word.
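
Here is a minimal sketch of inspecting this behavior with the Hugging Face transformers library (in this tokenizer's output, spaces appear as Ġ and newlines as Ċ):

# How the GPT-2 tokenizer represents code and whitespace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

code = "def add(a, b):\n    return a + b\n"

tokens = tokenizer.tokenize(code)
print(tokens)                # spaces show up as Ġ and newlines as Ċ
print(len(tokens), "tokens")

print(tokenizer.eos_token)   # the single special token, <|endoftext|>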

Conclusion

As a result, you are now familiar with the following:

  • Tokenization is a procedure that breaks text into meaningful units (tokens). The process must consider factors like capitalization, language specifics, and special tokens.

  • There are four main types of tokenization granularity: word, subword, character, and byte tokens. While word tokenization was once common, subword tokenization has become preferred.

  • Two major subword tokenization approaches are WordPiece (used by BERT) and Byte Pair Encoding (BPE, used by GPT models). WordPiece maintains semantic units and produces longer subwords, while BPE is more flexible with unknown words and is particularly good at handling code due to its preservation of whitespaces and special characters.
