Tokenization helps to identify meaningful units that contribute to understanding the text's content. In this topic, we'll explore various tokenization approaches using the Natural Language Toolkit (NLTK) library.
Tokenization in NLTK
The Natural Language Toolkit (NLTK) provides various tokenization tools through its tokenize module. This module contains several tokenizers, each designed for specific text processing needs. To use a tokenizer, import it directly from the module using:
from nltk.tokenize import <tokenizer>
# for example:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
The table below describes the main tokenizers available.
| Syntax | Description |
| --- | --- |
| word_tokenize() | Returns word and punctuation tokens. |
| WordPunctTokenizer() | Returns tokens from a string of alphabetic or non-alphabetic characters (like integers, $, @...). |
| regexp_tokenize() | Returns tokens using standard regular expressions. |
| TreebankWordTokenizer() | Returns the tokens as in the Penn Treebank using regular expressions. |
| sent_tokenize() | Returns tokenized sentences. |
Word tokenization
Let's take a look at an example. Imagine we have a string of three sentences:
text = "I have got a cat. My cat's name is C-3PO. He's golden."Now, let's have a look at each tokenization method from the table. Don't forget to import all of them in advance.
In the example below, we pass the text variable to the word_tokenize() function:
print(word_tokenize(text))
# ['I', 'have', 'got', 'a', 'cat', '.', 'My', 'cat', "'s", 'name', 'is', 'C-3PO', '.', 'He', "'s", 'golden', '.']
The result is a list of strings (tokens). The function splits the string into words and punctuation marks. Mind the possessives and contractions: the tokenizer turns every 's into a separate token, although cat's could also be recognized as one token.
The next code snippet introduces the WordPunctTokenizer(). This tokenizer is similar to the first one, but the result is a little bit different: all the punctuation marks, including hyphens and apostrophes, are separate tokens. Now, C-3PO, the cat's name, is split into three tokens. In this case, this behavior is not optimal.
wpt = WordPunctTokenizer()
print(wpt.tokenize(text))
# ['I', 'have', 'got', 'a', 'cat', '.', 'My', 'cat', "'", 's', 'name', 'is', 'C', '-', '3PO', '.', 'He', "'", 's', 'golden', '.']
The next example shows the results of the TreebankWordTokenizer().
tbw = TreebankWordTokenizer()
print(tbw.tokenize(text))
# ['I', 'have', 'got', 'a', 'cat.', 'My', 'cat', "'s", 'name', 'is', 'C-3PO.', 'He', "'s", 'golden', '.']
The TreebankWordTokenizer() works almost the same way as word_tokenize(). Mind the full stops – they form a token with the previous word, but the last full stop is a separate token. word_tokenize(), on the contrary, recognizes full stops as separate tokens in all cases. Moreover, the apostrophe and s are not separated as with the WordPunctTokenizer().
Let's now move on to the next method. The regexp_tokenize() function uses regular expressions and accepts two arguments: a string and a pattern for tokens.
# 1
print(regexp_tokenize(text, "[A-z]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', 'PO', 'He', 's', 'golden']
# 2
print(regexp_tokenize(text, "[0-9A-z]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', '3PO', 'He', 's', 'golden']
# 3
print(regexp_tokenize(text, "[0-9A-z']+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C', '3PO', "He's", 'golden']
# 4
print(regexp_tokenize(text, "[0-9A-z'\-]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C-3PO', "He's", 'golden']
The pattern [A-z]+ in the first example above finds all the words or letters, but it leaves aside integers and punctuation. Because of that, all the possessive forms and the cat's name are split. The second pattern improves the search for tokens, since integers are added: the cat's name is matched more fully, but the result is still not ideal. The third pattern, with an apostrophe, also allows the tokenizer to find possessive forms. The last pattern includes the hyphen, so the name of the cat is recognized without mistakes.
You can see that obtaining tokens with the help of regular expressions can be flexible. We change the pattern in each case; this allows us to get more precise results.
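If you want to reuse the same pattern many times, NLTK also provides the RegexpTokenizer class, which offers the same behavior in a reusable object. A minimal sketch with the last pattern from above:
from nltk.tokenize import RegexpTokenizer

# wrap the pattern in a reusable tokenizer object
rt = RegexpTokenizer(r"[0-9A-z'\-]+")
print(rt.tokenize(text))
# expected to match the output of regexp_tokenize(text, "[0-9A-z'\-]+") above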
Sentence tokenization
Finally, let's look at the sent_tokenize() function. It splits a string into sentences:
print(sent_tokenize(text))
# ['I have got a cat.', "My cat's name is C-3PO.", "He's golden."]
However, sentence tokenization is also a difficult task. A dot, for example, can mark an abbreviation or a contraction, not only the end of a sentence. Moreover, some dots can indicate both an abbreviation and the end of a sentence. Let's have a look at the examples.
text_2 = "Mrs. Beam lives in the U.S.A., it is her motherland. She lost about 9 kilos (20 lbs.) last year."
print(sent_tokenize(text_2))
# ['Mrs. Beam lives in the U.S.A., it is her motherland.', 'She lost about 9 kilos (20 lbs.)', 'last year.']
The sent_tokenize() function relies on a list of typical abbreviations and contractions with dots, so they are not recognized as the end of a sentence. Sometimes, it still provides confusing results. For example, after tokenizing text_2 above, .) was recognized as the end of a sentence. It is a mistake: the last item in the output is 'last year.', but it should belong to the previous sentence.
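One possible workaround – just a sketch, with an abbreviation set chosen only for this example – is to build a Punkt tokenizer with the troublesome abbreviations registered explicitly:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# register extra abbreviations (stored lowercase, without the trailing dot);
# note that this tokenizer only knows the abbreviations listed here,
# not the full pretrained list used by sent_tokenize()
punkt_params = PunktParameters()
punkt_params.abbrev_types = {'mrs', 'lbs', 'u.s.a'}
custom_tokenizer = PunktSentenceTokenizer(punkt_params)
print(custom_tokenizer.tokenize(text_2))
# the exact output depends on the NLTK version, but with 'lbs' registered
# the split after 'lbs.)' may be avoided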
If you deal with informal texts, such as comments, splitting them into sentences may be particularly problematic. For example, in text_3, there are lots of periods with no space after them, so two sentences are recognized as one.
text_3 = "The plot of the film is cool!!!!!!! but the characters leave much to be desired....i don't like them."
print(sent_tokenize(text_3))
# ['The plot of the film is cool!!!!!!!', "but the characters leave much to be desired....i don't like them."]
Conclusion
To sum up, tokenization is an important procedure for text preprocessing in NLP. In this topic, we have learned:
How to split a text into words with different NLTK tokenizers;
How to split a text into sentences with the sent_tokenize() function.
Of course, there are many other tools for tokenization: spaCy, keras, gensim, HuggingFace, Stanza, and others. It's always a good idea to take a look at their documentation.
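For a tiny taste of spaCy, for example, here is a sketch that uses a blank English pipeline, which ships with a rule-based tokenizer and requires no pretrained model download:
import spacy

# a blank English pipeline contains only the rule-based tokenizer
nlp = spacy.blank("en")
doc = nlp("I have got a cat. My cat's name is C-3PO. He's golden.")
print([token.text for token in doc])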
Now it's time to carry out your own tokenization experiments!