
Bag-of-words


In NLP, we usually convert an input text, word, or symbol into a numeric format so that we can apply various mathematical operations to it (compare numbers, find patterns in numeric input, and so on). The most convenient way is to provide a vector for each text unit. By vector, we mean an ordered sequence of numbers. We can use vectors to describe any element, be it a single word or an entire text. What a vector encodes depends solely on our objective and approach.

Bag-of-words is one way of vectorizing a text or, more broadly, one type of text representation.

Transforming symbols into numbers

The process of transforming a symbol into a number or a sequence of numbers (a vector) is called word embedding. In one type of embedding, Word2Vec, vectors are formed according to the features of the input text. Word2Vec is a tool used in distributional semantics, while bag-of-words is more of a tool for classification. For example, the word actor can be described by features such as human, profession, masculine, singular, and art, all of them encoded in something like [0.3, 5.2, 1, 1, -3]. But if the word is actress, then the third feature changes to feminine, and the vector will probably look like this: [0.3, 5.2, 2, 1, -3]. This transformation is often represented in NLP as actor - man + woman = actress. The example with actress is more of an exception, though, because most words in English are unisex (take teacher as an example; we don't say teacheress). Still, words in English change their form according to number, so if the word is actors, we revise the fourth feature in the vector: actor - singular + plural = actors. We showed five features on which a vector may depend, but in practice there are hundreds of such features (like small-big, dark-bright), and then the vector consists of hundreds of numbers.
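
To make this arithmetic concrete, here is a minimal sketch with hand-made toy vectors (the feature values are invented for illustration; real Word2Vec dimensions are learned from data and are not directly interpretable):

import numpy as np

# Toy 5-dimensional vectors with invented feature values:
# [human, profession, gender, number, art]
actor = np.array([0.3, 5.2, 1, 1, -3])
man = np.array([0.0, 0.0, 1, 0, 0])    # carries only the "masculine" component
woman = np.array([0.0, 0.0, 2, 0, 0])  # carries only the "feminine" component

actress = actor - man + woman
print(actress)  # [ 0.3  5.2  2.   1.  -3. ], the vector we expect for "actress"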

As was said above, we can vectorize not only words but also letters, n-grams, phrases, sentences, or entire texts (articles, novels). As for the last case, we sometimes need to compare different texts: one text may be an adventure novel and the other a news article, and they will have different embeddings. Sometimes we also need to compare objects of different types: take, for example, the bigram ballet dancer and the unigram ballerina; we may need to compare them the same way we did with actor and actress.

With bag-of-words, we measure not semantics but frequency, so the number vectors are always bound to the number of times the word/n-gram/text/etc. appears in the input.

We can use word embeddings for different purposes:

  • measure similarity between objects (this task is well doable in Word2Vec; see the sketch after this list)

  • sentiment analysis (in Word2Vec)

  • text classification (in Bag-of-words)
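
Similarity between two vectors is usually measured with cosine similarity. Here is a minimal sketch using the same toy vectors as above (the values are invented, not taken from a trained model):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values close to 1 mean very similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

actor = np.array([0.3, 5.2, 1, 1, -3])
actress = np.array([0.3, 5.2, 2, 1, -3])
table = np.array([4.0, -1.3, 0, 1, 2])  # an unrelated word for contrast

print(cosine_similarity(actor, actress))  # ~0.99, the words are very similar
print(cosine_similarity(actor, table))    # much lower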

Bag-of-words model concept

Here we consider each word independently, without considering the surrounding context. We describe a text as a sequence of all words it contains, but we do not keep its original order and place (hence the name). The resulting vector encodes the text and stores information about occurrences of words in it. The model is frequently used in document classification and information retrieval but also has applications in many other NLP tasks.

Let's look at an example. We have three reviews, each consisting of one sentence:

Review I: easily the best album of the year.

Review II: the album is amazing.

Review III: loved the clean production!

First, we need to design the vocabulary, which is the list of all known words across the data. If we ignore punctuation marks, it can look as follows: easily, the, best, album, of, year, is, amazing, loved, clean, production. The vector length should equal the length of the vocabulary, so it will be 11.
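
A minimal sketch of this step in plain Python (using a naive tokenizer that lowercases the text, splits on whitespace, and strips trailing punctuation; real tokenizers are more careful):

reviews = [
    "easily the best album of the year.",
    "the album is amazing.",
    "loved the clean production!",
]

vocab = []
for review in reviews:
    for token in review.lower().split():
        word = token.strip(".!?,")      # drop trailing punctuation
        if word and word not in vocab:  # keep first-occurrence order
            vocab.append(word)

print(vocab)
# ['easily', 'the', 'best', 'album', 'of', 'year', 'is', 'amazing', 'loved', 'clean', 'production']
print(len(vocab))  # 11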

The next step is to count all occurrences of these words in each review. You can create a table whose columns represent the units from the vocabulary:

      easily  the  best  album  of  year  is  amazing  loved  clean  production
I        1     2     1     1    1    1    0      0       0      0        0
II       0     1     0     1    0    0    1      1       0      0        0
III      0     1     0     0    0    0    0      0       1      1        1

For example, the article the appears twice in the first review, so we insert 2 in the corresponding cell of that row. We obtain the following representations, in which each number in a vector represents the count of the related word:

Review I   = [1,2,1,1,1,1,0,0,0,0,0]
Review II  = [0,1,0,1,0,0,1,1,0,0,0]
Review III = [0,1,0,0,0,0,0,0,1,1,1]
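
The same vectors can be produced with a short continuation of the sketch above (it reuses the reviews list, the vocab built there, and the same naive tokenizer):

def bag_of_words(text, vocab):
    # Count how many times each vocabulary word occurs in the text.
    tokens = [t.strip(".!?,") for t in text.lower().split()]
    return [tokens.count(word) for word in vocab]

for review in reviews:
    print(bag_of_words(review, vocab))

# [1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# [0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
# [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]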

Other scoring methods

There are different ways of scoring. You can simply mark whether a word appears in a document or not. That leads to binary vectors, with 0 for each absent word and 1 for each present word. Now our representation will change a little:

Review I   = [1,1,1,1,1,1,0,0,0,0,0]
Review II  = [0,1,0,1,0,0,1,1,0,0,0]
Review III = [0,1,0,0,0,0,0,0,1,1,1]

We create binary vectors when we are more concerned about the presence of words rather than their raw counts. The most straightforward sentiment analysis is an example of a task to which we can apply this representation.
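
In the running sketch, the binary representation is just the count vector with every positive entry clipped to 1 (bag_of_words, reviews, and vocab are defined above):

def binary_bag_of_words(text, vocab):
    # 1 if the word is present in the text, 0 otherwise.
    return [1 if count > 0 else 0 for count in bag_of_words(text, vocab)]

print(binary_bag_of_words(reviews[0], vocab))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], "the" is no longer counted twice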

Another approach is to calculate frequencies: divide the number of occurrences of a particular word by the total number of words in the document. Let's illustrate it with the: it appears twice in the first review, which contains 7 words overall, so the result of the division is 2/7. If we convert the number to a decimal fraction and round it to two decimal places, we get 0.29. So, for all the reviews, we get these vectors:

Review I   = [0.14,0.29,0.14,0.14,0.14,0.14,0.00,0.00,0.00,0.00,0.00]
Review II  = [0.00,0.25,0.00,0.25,0.00,0.00,0.25,0.25,0.00,0.00,0.00]
Review III = [0.00,0.25,0.00,0.00,0.00,0.00,0.00,0.00,0.25,0.25,0.25]

Counting frequencies makes sense when we have several documents: it is a way to compare the share of a specific word across the data, for instance, when one document consists of 25 words and the other of 100. However, raw counts will be enough to determine the most common terms if you have only one text.
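
This scoring is a small change to the running sketch (again reusing bag_of_words, reviews, and vocab from above):

def frequency_bag_of_words(text, vocab):
    # Divide each count by the total number of tokens in the text.
    counts = bag_of_words(text, vocab)
    total = len(text.split())
    return [round(count / total, 2) for count in counts]

for review in reviews:
    print(frequency_bag_of_words(review, vocab))

# [0.14, 0.29, 0.14, 0.14, 0.14, 0.14, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.0]
# [0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25]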

Advantages and disadvantages of bag-of-words

Now, we can name some advantages of the model:

  • The main benefit lies in its simplicity. All values in a vector are easy to compute, and we can always tell what they stand for. In addition, despite the simplicity, it usually shows good performance in classification tasks.

  • We can encode an entire text right away. This approach is a good choice if we do not need to pay attention to any of the inner structures.

  • We do not need a large dataset to build a model (as opposed to word embeddings you will learn about below).

However, there are some weaknesses, too:

  • The model pays no attention to inner relations and neglects the word context, so the semantics is left out. Consider buy and purchase; in most cases, these words are synonyms and used in the same context, but we cannot access this information here.

  • It often produces sparse vectors with a massive number of dimensions (that is, a large vector length, which in bag-of-words equals the vocabulary length). Such vectors are both computation- and memory-consuming.

There is nothing we can do about the first disadvantage. As for the second one, standard preprocessing steps, such as text normalization and stopword removal, may help us. After the first procedure, various forms of one word (goes, going) will become the base form (go). This also applies to cases of Go and go: if we do not convert Go to lowercase, these words will be recognized as two different units in the dictionary. After the second procedure, some high-frequency but meaningless words (a, the, are, prepositions, etc.) will be deleted. As a result, the vocabulary will include fewer items, and the vector will be shorter.
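
As a rough sketch of the effect, here we simply lowercase the tokens and drop words from a tiny hand-made stopword list (real stopword lists are much longer, and lemmatization is left out for brevity):

# A tiny hand-made stopword list, just for illustration.
stop_words = {"a", "an", "the", "is", "are", "of", "in", "on", "to"}

reviews = [
    "easily the best album of the year.",
    "the album is amazing.",
    "loved the clean production!",
]

vocab = set()
for review in reviews:
    for token in review.lower().split():
        word = token.strip(".!?,")
        if word and word not in stop_words:
            vocab.add(word)

print(sorted(vocab))
# ['album', 'amazing', 'best', 'clean', 'easily', 'loved', 'production', 'year']
# 8 items instead of 11: "the", "of", and "is" are gone, so the vectors get shorter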

However, these steps are inefficient with extensive texts and dictionaries of thousands or millions of words. Later in this topic, we will look at another type of representation that allows us to solve this problem.

Bag-of-words in Python

Creating a bag-of-words model manually is a simple task in Python, and one of the application tasks will focus on it. Here we will show how to use a ready-made model from the scikit-learn library:

from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()  #  initializes the class

reviews = [
    "easily the best album of the year.",
    "the album is amazing.",
    "loved the clean production!",
]

X = vectorizer.fit_transform(reviews)

print(X.toarray())  # shows a matrix of all 3 reviews

##  [[1 0 1 0 1 0 0 1 0 2 1]
##   [1 1 0 0 0 1 0 0 0 1 0]
##   [0 0 0 1 0 0 1 0 1 1 0]]

You will get a matrix containing three arrays, each corresponding to its respective review.

You can also check the dictionary of the given bag-of-words output (note that scikit-learn sorts the vocabulary alphabetically, which is why the column order differs from our manual table):

print(vectorizer.get_feature_names_out())


##  ['album', 'amazing', 'best', 'clean', 'easily', 'is', 'loved', 'of', 'production', 'the', 'year']

You can also configure the CountVectorizer class to use a specific size of n-grams. Below, we specify that we need just bigrams, but if you want to see both unigrams and bigrams, you can change the setting to ngram_range=(1, 2).

vectorizer = CountVectorizer(ngram_range=(2, 2)) 

X = vectorizer.fit_transform(reviews) 

print(X.toarray())

print(vectorizer.get_feature_names_out())


##  [[0 1 1 0 1 0 0 1 0 1 0 1]
##   [1 0 0 0 0 1 0 0 1 0 0 0]
##   [0 0 0 1 0 0 1 0 0 0 1 0]]
##  ['album is', 'album of', 'best album', 'clean production', 'easily the', 'is amazing', 'loved the', 'of the', 'the album', 'the best', 'the clean', 'the year']

We can also specify the stopwords. For that purpose, we need to download our stopwords corpus first:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

sw = stopwords.words('english')

Then, we can apply the downloaded stopword list to the CountVectorizer class:

vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words=sw)

X = vectorizer.fit_transform(reviews) 

print(X.toarray())

print(vectorizer.get_feature_names_out())


##  [[0 1 1 0 1 0]
##   [1 0 0 0 0 0]
##   [0 0 0 1 0 1]]
##  ['album amazing', 'album year', 'best album', 'clean production', 'easily best', 'loved clean']

As you can see, the arrays are now much shorter, and there are only a few bigrams in our dictionary because tokens like the and is have been removed.

Conclusion

In this topic, we have deepened your knowledge of bag-of-words and shown how to implement it with a Python library. Finally, it's worth mentioning that this model has more elaborate versions: bag-of-n-grams or even (theoretically) bag-of-texts. For example, in the last code snippet, we showed how to get a bag of bigrams in scikit-learn.

It is always a good idea to check the library's documentation (scikit-learn). We have omitted many details in this topic; for example, we haven't discussed such CountVectorizer parameters as binary. You can read about them in the documentation.
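
As one small illustration, binary=True switches CountVectorizer to the presence/absence scoring discussed earlier; a short sketch on the same reviews:

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "easily the best album of the year.",
    "the album is amazing.",
    "loved the clean production!",
]

vectorizer = CountVectorizer(binary=True)  # every non-zero count becomes 1
X = vectorizer.fit_transform(reviews)
print(X.toarray())

##  [[1 0 1 0 1 0 0 1 0 1 1]
##   [1 1 0 0 0 1 0 0 0 1 0]
##   [0 0 0 1 0 0 1 0 1 1 0]]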
