In NLP, an n-gram model is a type of statistical language model. The main idea behind it is that you can predict the next word based on the word, or the few words, that come before it. The same approach works for tokens, phonemes, actions, or letters. For instance, given the sentence 'Most NLP developers create machine learning,' the model is likely to predict 'models' as the next word, since 'models' is the word that most frequently follows the phrase 'machine learning'.
Examples of 1-, 2-, and 3-gram models
N-gram models analyze the previously seen samples, where n can be 1, 2, 3, or any other value. Let's take a dataset of three sentences and see how 1-, 2-, and 3-gram models work:
The cat sat on the mat.
The dog chased the cat.
The cat sat, and the dog is a friend.
With a 1-gram (unigram) model, each word is independent. The probability of a word appearing in a sentence is equal to its relative frequency. For instance, the probability of the word the is P(the) = 6/20 = 0.3, as it occurs 6 times among the 20 words of the dataset.
To generate a sentence, we will use each word only once. Based on the frequency of each word, we can generate the sentence The cat dog: the is the most frequent word, cat is the second most frequent, and dog comes after that. However, the sentence is not grammatically correct because each word was treated as independent.
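To make this concrete, here is a minimal sketch in plain Python (using collections.Counter) that computes these unigram relative frequencies; the corpus list, lowercasing, and whitespace tokenization are simplifying assumptions made for this example:
from collections import Counter

# Toy dataset from above, lowercased and with punctuation stripped
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat sat and the dog is a friend",
]
tokens = " ".join(corpus).split()

unigram_counts = Counter(tokens)
total = len(tokens)  # 20 words

# Relative frequency of each word, e.g. P(the) = 6/20 = 0.3
for word, count in unigram_counts.most_common(3):
    print(word, count / total)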
With a 2-gram (bigram) model, each word is dependent on the one previous word; this is also known as the Markov assumption. The probability of a pair appearing in the sentence is equal to the frequency of the pair divided by the frequency of the first word. For example, the probability of the pair the cat is P(cat | the) and equals the frequency of the cat divided by the frequency of the = 3/6 = 0.5.
Based on the frequencies of each pair, we can generate the sentence: The cat sat. Note that the is most often followed by cat, and cat is most often followed by sat. The probabilities are P(cat | the) = 3/6 = 0.5 and P(sat | cat) = 2/3 ≈ 0.67.
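Continuing the same sketch, the bigram counts and these two conditional probabilities can be computed as follows (sentence boundaries are ignored for simplicity, which does not change these particular values):
# Count adjacent word pairs in the tokens list from the unigram sketch
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P(cat | the) = count(the cat) / count(the) = 3/6
p_cat_given_the = bigram_counts[("the", "cat")] / unigram_counts["the"]
# P(sat | cat) = count(cat sat) / count(cat) = 2/3
p_sat_given_cat = bigram_counts[("cat", "sat")] / unigram_counts["cat"]
print(p_cat_given_the, p_sat_given_cat)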
With a 3-gram (trigram) model, we consider each word dependent on the two previous words. The probability of a trigram appearing in the sentence is equal to the frequency of the trigram divided by the frequency of its first two words. For example, the probability of the trigram the cat sat = P(sat | the cat) equals the frequency of (the cat sat) / the frequency of (the cat) = 2/3 ≈ 0.67.
The same sentence The cat sat can be generated from the training data, as the cat sat is the most frequent trigram starting with the two words the cat.
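In the running sketch, the trigram probability follows the same pattern:
# Count adjacent word triples in the same tokens list
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# P(sat | the cat) = count(the cat sat) / count(the cat) = 2/3
p_sat_given_the_cat = trigram_counts[("the", "cat", "sat")] / bigram_counts[("the", "cat")]
print(p_sat_given_the_cat)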
As you can see, in the 2-gram model we looked at one previous word, and in the 3-gram model we looked at two previous words. Therefore, in an n-gram model we look at the previous n-1 words. The frequencies we have calculated as the probabilities of a unigram, bigram, and trigram are also called relative frequencies.
N-gram formula
A more general version of these examples can be seen as the chain rule of probability. We'll represent a sequence of n words either as w1 ... wn or as w1:n. Then, the expression w1:n-1 will mean the string of all words in the sequence except the last one, word wn. The joint probability of the words will be written as P(w1, w2, ..., wn); note that this is the joint probability, not yet the chain rule of probability. The chain rule of probability will look like this:
P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1:2) ... P(wn | w1:n-1)
This equation implies that we can estimate the joint probability of an entire sequence of words by multiplying all the conditional probabilities. The idea of the n-gram model is that we don't need to condition each word on, let's say, 100 previous words; we can condition it on just one or two. The bigram model, in particular, computes only one conditional probability per word. In the example break the law, instead of conditioning law on the full history ending in break the, the bigram model approximates the probability as P(law | the).
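Putting the pieces together, here is a minimal sketch of the bigram approximation applied to a whole sentence. It reuses the tokens, unigram_counts, and bigram_counts variables from the earlier sketches, uses no smoothing (so unseen bigrams get probability 0), and the helper name bigram_sentence_prob is just an illustrative choice:
def bigram_sentence_prob(sentence):
    # P(w1 ... wn) is approximated as P(w2 | w1) * P(w3 | w2) * ... * P(wn | wn-1)
    words = sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

print(bigram_sentence_prob("the cat sat on the mat"))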
N-grams in NLTK
NLTK provides many tools for n-gram analysis. In this section, we will see how to find n-grams. The code below will help you detect all bigrams in a sentence; NLTK will give you a sequence of tuples in the output:
from nltk import ngrams
sentence = 'Roman victory in the Punic Wars and Macedonian Wars established Rome as a super power'
n = 2
n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)
Output:
('Roman', 'victory')
('victory', 'in')
('in', 'the')
('the', 'Punic')
('Punic', 'Wars')
('Wars', 'and')
('and', 'Macedonian')
('Macedonian', 'Wars')
('Wars', 'established')
('established', 'Rome')
('Rome', 'as')
('as', 'a')
('a', 'super')
('super', 'power')
If you change n to 3, you will get trigrams.
You can see that the output is just all possible bigrams in the sentence. We are yet to use a predictive n-gram model.
We can also get all bigrams together with their frequencies using the nltk.collocations module:
import nltk
nltk.download("punkt")
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.word_tokenize(sentence)
)
bigram_freq = bigramFinder.ngram_fd.items()
print(bigram_freq)
Output:
dict_items(
[
(("Roman", "victory"), 1),
(("victory", "in"), 1),
(("in", "the"), 1),
(("the", "Punic"), 1),
(("Punic", "Wars"), 1),
(("Wars", "and"), 1),
(("and", "Macedonian"), 1),
(("Macedonian", "Wars"), 1),
(("Wars", "established"), 1),
(("established", "Rome"), 1),
(("Rome", "as"), 1),
(("as", "a"), 1),
(("a", "super"), 1),
(("super", "power"), 1),
]
)
Here we have a list of tuples: each one contains an inner tuple with the n-gram itself and an integer indicating the number of times this n-gram occurs in the sentence.
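These frequency counts are already enough to drive a simple predictive bigram model. As a minimal sketch (not part of the snippet above), NLTK's ConditionalFreqDist can be built from the bigrams of the same sentence variable and queried for the most likely next word:
import nltk

tokens = nltk.word_tokenize(sentence)
# Condition each word on the one that precedes it
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

# The most likely word after 'the' in this tiny corpus
print(cfd["the"].max())         # Punic
# 'Wars' is followed by 'and' and 'established', each half of the time
print(cfd["Wars"].freq("and"))  # 0.5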
Advantages and disadvantages of the N-gram model
N-gram models were widespread for years, as they have a number of advantages:
Simple and easy to understand: the main idea behind n-gram models is similar to how humans generate sequences;
Speed: N-gram models can be trained quickly and efficiently on large datasets as they are based on statistics;
Scalability: N-gram models can scale well to handle large volumes of text data.
However, n-gram models are not suitable for every case, as they have some disadvantages:
Limited context: N-gram models consider only a small window of words when making predictions. This can lead to poor results when the relevant context extends beyond the previous n-1 words.
Data sparsity: As the size of the n-gram increases, the number of unique n-grams also increases, while each of them is seen less often. This can result in poor estimates for rare n-grams.
Lack of semantic understanding: N-gram models are based on statistical patterns and do not capture the semantic meaning of the words.
Conclusion
Overall, the n-gram model is a fundamental approach to language modeling. It can be a simple first solution for many NLP tasks, and n-gram models are a good choice when you have a small amount of data and limited resources.