N-gram and collocation measures

When we train an n-gram model, we obtain many different n-grams, but some of them are not very informative. For instance, the bigram "and then" is less meaningful than "heavy rain" or "machine learning". To evaluate the strength and significance of the association between the words in an n-gram, you can use collocation measures.

Reasons for measurement

There are many reasons to use collocation measures. The most important ones are the following:

  1. Identifying meaningful word combinations: collocation measures help identify combinations of words that carry a specific meaning in a given context. For instance, the words of the collocation "artificial intelligence" frequently appear together and carry a special meaning.

  2. Improving text processing: grouping collocations together improves many pipelines. For example, in machine translation, a collocation can be translated more accurately as a single unit rather than word by word.

  3. Identifying word associations: collocation measures reveal associations between words, which is useful for tasks such as topic modeling and information retrieval.

  4. Keyword extraction: strongly associated word combinations are good keyword candidates. For instance, in the message "I want to buy a cocktail dress in a shop Dress-up tonight" we can find the keyword "cocktail dress" and the shop name "Dress-up". Based on this, we can recommend dresses from this shop to the user.

Measures

There are many statistical measures for scoring the association strength of n-grams and collocations. We will use four popular ones: PMI, Student's t-test, the chi-squared test, and the likelihood-ratio test.

In this part, we will discuss finding the n best n-grams in a text with association measures. You can compute all these measures in NLTK with the nltk.collocations module. First, import the necessary libraries, download the 'punkt' tokenizer models, and create a sentence:

import nltk
from nltk.collocations import *
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

sentence = "The cat sat on the mat. The dog chased the cat. The cat sat and the dog is friend."

Then you need to create an instance of the BigramAssocMeasures class. It provides methods for calculating all the association measures we need.

bigrams = nltk.collocations.BigramAssocMeasures()

Next, you need to find all bigrams in the source text. Make sure you pass tokenized text as input; otherwise, NLTK may give you incorrect output.

finder = BigramCollocationFinder.from_words(word_tokenize(sentence))

For all the calculations, we are going to use the finder.score_ngrams function, which scores every n-gram found in the text. To score a single n-gram, use the finder.score_ngram function.
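
As a quick illustration, here is how to score a single bigram with NLTK's raw_freq measure, which is simply the bigram's relative frequency. In our sentence, ('cat', 'sat') occurs twice among 22 tokens:

# score one bigram by its raw frequency: 2 occurrences / 22 tokens
finder.score_ngram(bigrams.raw_freq, 'cat', 'sat')
# 0.09090909090909091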

Chi-Squared

The chi-squared test is a statistical method to determine whether there is a significant association between two words occurring together. It involves calculating a chi-squared statistic based on the frequencies of the observed and expected occurrences of the two words. The expected frequency is calculated based on the assumption that the two words are independent of each other.

To compute the chi-squared scores in NLTK, you simply score the n-grams with bigrams.chi_sq.

scored_chi = finder.score_ngrams(bigrams.chi_sq)

# print the top 3 scored bigrams
for bigram, score in scored_chi[:3]:
    print(bigram, score)

# the output is the bigram and its score
# ('is', 'friend') 22.0
# ('cat', 'sat') 13.933333333333334
# ('dog', 'chased') 10.476190476190476
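
To see where these numbers come from, here is a hand computation for the top pair. The statistic is built from a 2x2 contingency table of observed versus expected counts; the helper below is our own illustration, not part of NLTK, but it reproduces the score above:

# Pearson's chi-squared for one bigram, from the 2x2 contingency table
def chi_squared(n_ii, c1, c2, n):
    # observed counts: (w1 w2), (w1 ~w2), (~w1 w2), (~w1 ~w2)
    observed = [n_ii, c1 - n_ii, c2 - n_ii, n - c1 - c2 + n_ii]
    # expected counts if w1 and w2 were independent
    expected = [c1 * c2 / n, c1 * (n - c2) / n,
                (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 'is' and 'friend' each occur once, always together, among 22 tokens
chi_squared(1, 1, 1, 22)
# 22.0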

PMI

PMI (Pointwise Mutual Information) is a measure of the association between two words that takes into account their co-occurrence in a corpus relative to their individual frequencies. Higher values indicate a stronger association, and 0 indicates that the two words are independent.

# apply the PMI score as the n-gram collocation measure
scored_pmi = finder.score_ngrams(bigrams.pmi)

# print the top 3 scored bigrams
for bigram, score in scored_pmi[:3]:
    print(bigram, score)

# ('is', 'friend') 4.459431618637297
# ('dog', 'chased') 3.4594316186372973
# ('dog', 'is') 3.4594316186372973
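
These scores follow directly from the definition PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))). A quick hand check for the top pair ('is', 'friend'), where each word occurs once among 22 tokens and the two always appear together:

from math import log2

# P(w1, w2) = 1/22, P(w1) = 1/22, P(w2) = 1/22
log2((1 / 22) / ((1 / 22) * (1 / 22)))  # same as log2(22)
# 4.459431618637297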

Log-likelihood

Log-likelihood (the likelihood-ratio test) is a statistical measure used to quantify the strength of association between two terms in a text corpus. It estimates how likely the observed co-occurrence counts are under the assumption that the two words are independent, and compares this with how likely they are when the words are allowed to depend on each other. The higher the score, the stronger the evidence that the two words co-occur more often than chance would predict.
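
Concretely, the statistic is Dunning's G-squared (described in Manning and Schütze, section 5.3.4, which NLTK's likelihood_ratio is based on): G2 = 2 * sum(O * ln(O / E)) over the cells of the same 2x2 contingency table we used for the chi-squared test. Here is a minimal sketch of our own, for illustration only; NLTK's implementation may differ in small details such as smoothing:

from math import log

# Dunning's G-squared for one bigram
def g_squared(n_ii, c1, c2, n):
    # observed counts: (w1 w2), (w1 ~w2), (~w1 w2), (~w1 ~w2)
    observed = [n_ii, c1 - n_ii, c2 - n_ii, n - c1 - c2 + n_ii]
    # expected counts if w1 and w2 were independent
    expected = [c1 * c2 / n, c1 * (n - c2) / n,
                (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# ('cat', 'sat') from the toy sentence: 2 co-occurrences,
# 'cat' occurs 3 times, 'sat' twice, 22 tokens in total
g_squared(2, 3, 2, 22)
# ~9.58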

Let's take "Alice's Adventures in Wonderland" in the Gutenberg corpus.

If you forgot how to download a corpus in NLTK, here is a reminder:

nltk.download('gutenberg')

Now you need to build a TrigramCollocationFinder and filter out the stopwords. The following snippet shows the whole process of finding the ten best trigrams from scratch:

import nltk
from nltk.collocations import *
from nltk.corpus import gutenberg
from nltk import word_tokenize

nltk.download('stopwords')  # needed for the stopword list below

ignored_words = nltk.corpus.stopwords.words('english')  # we define nltk stopwords as ignored_words

threegram = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(gutenberg.raw('carroll-alice.txt')))

# drop trigrams containing short words or stopwords
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.nbest(threegram.likelihood_ratio, 10)

# [('Mock', 'Turtle', 'replied'),
#  ('Mock', 'Turtle', 'sighed'),
#  ('Mock', 'Turtle', 'Soup'),
#  ('Mock', 'Turtle', 'went'),
#  ('Mock', 'Turtle', 'persisted'),
#  ('Mock', 'Turtle', 'recovered'),
#  ('Mock', 'Turtle', 'sang'),
#  ('Mock', 'Turtle', 'yawned'),
#  ('miserable', 'Mock', 'Turtle'),
#  ('Mock', 'Turtle', 'drew')]

To score one particular trigram, we will use the score_ngram function.

finder.score_ngram(threegram.likelihood_ratio, 'miserable', 'Mock', 'Turtle')
# 791.6334542411805

Student's t-test

In collocation analysis, the t-test compares the observed frequency of a word pair with the frequency we would expect if the words occurred independently. If the t-value exceeds a critical value based on the degrees of freedom and the chosen significance level, we reject the null hypothesis and conclude that the words form a collocation rather than co-occurring by chance.
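
In the common approximation (as in Manning and Schütze), the t-score for a bigram reduces to (observed - expected) / sqrt(observed). A quick hand calculation on the toy sentence from earlier, as our own illustration:

from math import sqrt

# ('cat', 'sat'): observed = 2 co-occurrences; expected under independence
# is count('cat') * count('sat') / n_tokens = 3 * 2 / 22
observed, expected = 2, 3 * 2 / 22
(observed - expected) / sqrt(observed)
# ~1.22

NLTK scores trigrams in the same spirit via threegram.student_t: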

finder.nbest(threegram.student_t, 10)


# [('white', 'kid', 'gloves'),
#  ('little', 'golden', 'key'),
#  ('poor', 'little', 'thing'),
#  ('March', 'Hare', 'said'),
#  ('Mock', 'Turtle', 'said'),
#  ('cats', 'eat', 'bats'),
#  ('Mock', 'Turtle', 'replied'),
#  ('Mock', 'Turtle', 'went'),
#  ('said', 'Alice', 'indignantly'),
#  ('thought', 'poor', 'Alice')]

Conclusion

This topic focused on collocations. A collocation is a multi-word expression or phrase with a special meaning, while an n-gram is a mere sequence of words. Overall, collocation measures are useful tools for estimating the importance of n-grams and can be integrated into any NLP pipeline. To understand how strongly a few words are connected, you can use the measures discussed above: PMI, Student's t-test, the chi-squared test, and the likelihood-ratio test. However, no single measure is perfect for every corpus, so it is recommended to try different scores on each corpus.
