
Word-frequency distribution


The task of building a word-frequency distribution is similar to working with the N-gram model or finding the n best N-grams with association measures. A word-frequency distribution shows how often each word occurs in a text.

Zipf's law

Zipf's law is an empirical law that describes a particular behavior of the word-frequency distribution. It states that, given some text data, the frequency of any word is inversely proportional to its rank in the frequency table. The most frequent word has rank 1. The second most frequent word, with rank 2, occurs half as often as the first. The third most frequent word occurs one-third as often, and so on.

Take, for instance, the Brown Corpus of American English text. The word the is the most frequent word and accounts for nearly 7% of all word occurrences (69,971 occurrences per 1 million words). The second most frequent word, of, accounts for slightly over 3.5% of words (36,411 occurrences), followed by and (28,852 occurrences). However, some texts and languages may not follow Zipf's law.
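As a quick check, you can compare these Brown Corpus counts with the counts an ideal Zipf distribution would predict. Here is a minimal sketch that uses only the figures quoted above:

# Observed Brown Corpus counts vs. the Zipf prediction
# frequency(rank) = frequency(1) / rank
observed = {'the': 69971, 'of': 36411, 'and': 28852}

top_count = observed['the']
for rank, (word, count) in enumerate(observed.items(), start=1):
    print(word, count, round(top_count / rank))

# the 69971 69971
# of 36411 34986
# and 28852 23324

The observed counts stay reasonably close to the 1/rank prediction, although and occurs noticeably more often than the law suggests.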

Below you can see a plot of the population distribution of Italian cities, an ideal Zipf's law plot:

A plot of city population distribution in Italy

Word-frequency plot

A frequency distribution in NLTK is a Python dictionary with all the words of your text as keys and the number of occurrences of each word as values. It may or may not comply with Zipf's law. To check the frequency distribution of your text, import the FreqDist class and initialize it:

from nltk.probability import FreqDist

fdist = FreqDist()
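As a toy illustration (not part of the Reuters workflow below), you can see the dictionary-like behavior right away:

# FreqDist maps each word to its number of occurrences
toy = FreqDist(['the', 'cat', 'sat', 'on', 'the', 'mat'])

print(toy['the'])          # 2
print(toy.most_common(1))  # [('the', 2)]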

Now, let's work with the Reuters corpus:

import nltk

nltk.download('reuters')

Next, we retrieve all the tokenized words from the Reuters corpus. Then, we remove non-alphabetic tokens and stopwords, using list comprehensions to filter:

from nltk.corpus import reuters, stopwords

nltk.download('stopwords')

words_doc = nltk.Text(reuters.words())
stop_words = set(stopwords.words('english'))

words_doc = [word.lower() for word in words_doc if word.isalpha()]
words_doc = [word for word in words_doc if word not in stop_words]

Count the occurrences of each word and store them in the dictionary:

for word in words_doc:
    fdist[word.lower()] += 1
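Since FreqDist is a subclass of collections.Counter, the same distribution can also be built in a single call instead of the explicit loop:

fdist = FreqDist(words_doc)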

As a result, you have an extensive dictionary. We can check the ten most frequent words with the code below:

fdist.most_common(10)


# [('said', 25383),
#  ('mln', 18623),
#  ('vs', 14341),
#  ('dlrs', 12417),
#  ('pct', 9810),
#  ('lt', 8696),
#  ('cts', 8361),
#  ('year', 7529),
#  ('net', 6989),
#  ('u', 6392)]

As you can see, our frequency distribution does not comply with Zipf's law: the most common word occurs only about 1.36 times more frequently than the second, rather than twice as often as the law predicts.
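You can verify this ratio directly from the distribution:

(top_word, top_count), (second_word, second_count) = fdist.most_common(2)
print(round(top_count / second_count, 2))

# 1.36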

Now let's plot the frequency distribution for the top 30 words:

fdist.plot(30)

You will get the following output:

A word frequency plot

Let's compare the ideal plot and our result:

A plot comparing our word-frequency distribution with the Zipf's law curve
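If you want to reproduce such a comparison yourself, one common approach is to plot the observed counts against an ideal 1/rank curve on log-log axes, where Zipf's law shows up as a straight line. Below is a minimal sketch using matplotlib (which NLTK's plot() also relies on):

import matplotlib.pyplot as plt

counts = [count for _, count in fdist.most_common()]
ranks = range(1, len(counts) + 1)

# Ideal Zipf curve: the count at rank r is the top count divided by r
ideal = [counts[0] / r for r in ranks]

plt.loglog(ranks, counts, label='Reuters word frequencies')
plt.loglog(ranks, ideal, linestyle='--', label="Zipf's law (1/rank)")
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.legend()
plt.show()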

Conclusion

Word-frequency distribution is an easy first step of EDA when creating your model. It helps you analyze how words are distributed in the data, and it can even help you find new stopwords to omit for better performance. In addition, you can compare the distribution of your data with the Zipf's law curve and gain new insights from the data.
