scikit-learn, a well-known Python ML library, provides a lot of useful, ready-made methods, metrics, and algorithms. In this topic, you will take a look at the various ways of working with one of the most popular word representations, TF-IDF, in scikit-learn.
Class parameters and attributes
The most convenient way to get a TF-IDF matrix for your data with scikit-learn is to use the TfidfVectorizer class. Take a look at the official documentation if you're interested. First, you need to import it and create an instance:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
dataset = ["So no one told you life was gonna be this way",
"Your job's a joke, you're broke",
"Your love life's DOA",
"It's like you're always stuck in second gear",
"When it hasn't been your day, your week, your month",
"Or even your year, but",
"I'll be there for you"]
Each string in the dataset represents a separate document; there are seven documents in total. They form your document collection.
The TfidfVectorizer class has a lot of parameters. Let's have a look at some of them:

- input='content' is the default value. The program expects data as a sequence of strings or bytes, like your dataset above. Alternatively, with input='file', you can provide a sequence of file objects (as files are expected, they have to be opened first), and with input='filename', a sequence of filenames;
- the encoding parameter with the default value of utf-8 can be useful if your input data is a file object;
- a boolean use_idf, when set to False, tells the vectorizer to calculate only the TF;
- a boolean lowercase is True by default; if set to False, there is no conversion to lowercase;
- analyzer sets the level of processing: the character or the word level (analyzer='char' or 'word', correspondingly);
- the ngram_range tuple, for example ngram_range=(1, 5), sets the lower (the first value) and the upper (the second value) n-gram limits for extraction;
- stop_words can provide a list of words that have to be removed from the data before calculations;
- vocabulary allows you to calculate the scores of only the words you want;
- min_df and max_df (float for a percentage of documents or int for an absolute document count) set the thresholds for a term's document frequency.
An n-gram is a sequence consisting of items, words, or characters. Here are some examples of word bigrams: "my friend", "they will", "for a"; and character trigrams: "lin", "tyd", "mak".
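To see which n-grams a vectorizer actually extracts, you can inspect its vocabulary. Here is a minimal sketch with a hypothetical two-sentence corpus, limited to word bigrams:

from sklearn.feature_extraction.text import TfidfVectorizer

# extract word bigrams only (hypothetical two-sentence corpus)
bigram_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 2))
bigram_vectorizer.fit_transform(["my friend was here", "they will be here"])
print(bigram_vectorizer.get_feature_names_out())
# ['be here' 'friend was' 'my friend' 'they will' 'was here' 'will be']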
The following vectorizer, for example, takes a sequence of strings, converts them to lowercase, extracts word unigrams, and calculates TF-IDF scores. It uses no stop word list (stop_words) and no predefined vocabulary; terms that occur in more than 60% of the documents or in less than 1% of them will be ignored.
vectorizer = TfidfVectorizer(input='content', use_idf=True, lowercase=True,
                             analyzer='word', ngram_range=(1, 1),
                             stop_words=None, vocabulary=None,
                             min_df=0.01, max_df=0.60)
scikit-learn also contains the CountVectorizer class, which builds vectors of raw term counts. You can use it to represent words in a text.
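For instance, here is a quick sketch with the same dataset: the resulting matrix has the same shape as the TF-IDF one, but its cells hold integer counts instead of TF-IDF scores.

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(dataset)
print(count_matrix.shape)  # (7, 38): same documents and vocabulary, but raw counts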
fit_transform()
Once you've created a vectorizer instance, it's time to obtain a TF-IDF matrix. You can use the fit_transform() method and the shape attribute to print out the matrix dimensions:
tfidf_matrix = vectorizer.fit_transform(dataset)
print(f"Matrix dimension: {tfidf_matrix.shape}") # Matrix dimension: (7, 38)
Passing a file to TfidfVectorizer
If your dataset is a file object, you need to open it in advance. Moreover, fit_transform() still expects a sequence as input, so you should wrap the opened file in a list:
dataset = open('my_data.txt', 'r')
vectorizer = TfidfVectorizer(input='file')
tfidf_matrix = vectorizer.fit_transform([dataset]) # the argument must be a sequence
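For comparison, here is a sketch of the input='filename' variant (assuming my_data.txt exists in the working directory); in this case, the vectorizer opens the file for you:

vectorizer = TfidfVectorizer(input='filename')
tfidf_matrix = vectorizer.fit_transform(['my_data.txt'])  # a sequence of paths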
As you've probably guessed, the 7 rows in the matrix correspond to the number of documents in your dataset, and the number of columns reflects the number of distinct terms (the vocabulary size). If you print the matrix (tfidf_matrix), you will get something like this:
print(tfidf_matrix)
# (0, 32) 0.32013213618851233
# (0, 29) 0.32013213618851233
# ...
# (3, 0) 0.35903541343111484
# (3, 17) 0.35903541343111484
# ...
# (6, 1) 0.4115330003294659
# (6, 36) 0.3054049222662203
To access the term weights of a particular document, use indexing. You will get a ready-to-use representation of your document:
print(tfidf_matrix[6])
# (0, 8) 0.4957715949559137
# (0, 28) 0.4957715949559137
# (0, 18) 0.4957715949559137
# (0, 1) 0.4115330003294659
# (0, 36) 0.3054049222662203
On the left, you can see the location of a particular term in the matrix (a document number, a term index); on the right, you can see the scores. Since you are printing a single document, the document number is zero for every row.
tfidf_matrix is sparse and only outputs the non-zero values of each document's vector. You will see how to work with a more familiar representation later in the topic.
get_feature_names_out()
The numbers above don't tell us much about the scores of particular words, as we don't know how the vocabulary is built. To see the vocabulary, use the get_feature_names_out() method:
terms = vectorizer.get_feature_names_out()
print(terms)
Here's what it will output:
['always', 'be', 'been', 'broke', 'but', 'day', 'doa', 'even', 'for',
'gear', 'gonna', 'hasn', 'in', 'it', 'job', 'joke', 'life', 'like', 'll',
'love', 'month', 'no', 'one', 'or', 're', 'second', 'so', 'stuck', 'there',
'this', 'told', 'was', 'way', 'week', 'when', 'year', 'you', 'your']
You haven't preprocessed the documents, so some words in the vocabulary may look weird. It may be a good idea to handle apostrophes to prevent fragments like "hasn" from appearing; one possible approach is sketched below.
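A minimal sketch (one option among many) is a custom preprocessor that strips apostrophes before tokenization. Note that supplying preprocessor overrides the built-in preprocessing stage, including lowercasing, so the sketch lowercases explicitly:

# strip apostrophes so that "hasn't" yields the token "hasnt" rather than "hasn";
# a custom preprocessor replaces the built-in one, so we lowercase here explicitly
vectorizer = TfidfVectorizer(preprocessor=lambda doc: doc.lower().replace("'", ""))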
You can get a more tangible representation by using a data analysis library such as pandas; you can also inspect some of the results with standard Python tools.
As you know, word indexes in the returned list correspond to those in the vocabulary. If you see the line # (0, 8) 0.4957715949559137 in the TF-IDF matrix, it means that you can access the corresponding word using the following indexing:
print(terms[8]) # for
Bear in mind that real document collections can be much larger than this one. Because of this, printing the whole matrix (or even the part of it for a specific document) can take a long time and still won't be very informative. If you need a more convenient way to represent the results, you can get a list of terms sorted by their TF-IDF scores; you will see how to do that shortly.
Specifying vocabulary parameters
Now let's consider a few examples that illustrate in detail how the stop_words and vocabulary parameters work.
If you provide a list of stopwords, these words will be excluded from the vocabulary and, subsequently, from the matrix:
stopwords = ['so', 'or', 'be']
vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = vectorizer.fit_transform(dataset)
terms = vectorizer.get_feature_names_out()
print(terms) # compare the list with the one above:
# words 'so', 'or', and 'be' are not in the vocabulary
# ['always', 'been', 'broke', 'but', 'day', 'doa', 'even', 'for', 'gear',
# 'gonna', 'hasn', 'in', 'it', 'job', 'joke', 'life', 'like', 'll', 'love',
# 'month', 'no', 'one', 're', 'second', 'stuck', 'there', 'this', 'told',
# 'was', 'way', 'week', 'when', 'year', 'you', 'your']
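Besides a custom list, stop_words also accepts the string 'english', which makes the vectorizer use scikit-learn's built-in English stop word list:

vectorizer = TfidfVectorizer(stop_words='english')  # built-in English stop word list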
In case you only want to know the importance of particular words, list them in the vocabulary parameter; the final matrix will only contain their scores:
my_vocab = ['it', 'your']
vectorizer = TfidfVectorizer(vocabulary=my_vocab)
tfidf_matrix = vectorizer.fit_transform(dataset)
terms = vectorizer.get_feature_names_out()
print(terms) # ['it', 'your']
print(tfidf_matrix)
# (1, 1) 1.0
# (2, 1) 1.0
# (3, 0) 1.0
# (4, 1) 0.9122058069917823
# (4, 0) 0.40973230979564096
# (5, 1) 1.0
The first tuple value is the index of a document in your collection; the second is the index of a term in the vocabulary.
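To verify this, you can decode one of the lines above, say (4, 0) 0.40973230979564096:

print(dataset[4])  # When it hasn't been your day, your week, your month
print(terms[0])    # it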
toarray()
TfidfVectorizer.fit_transform() returns a sparse matrix. That matrix has a toarray() method, which converts it into a dense NumPy array (there is a similar method, .todense(), which returns a numpy matrix, but let's focus on arrays here). This comes in useful when you want to better understand what's going on with regular indexing or perform certain calculations. Let's see how it works in an example:
corpus = [
"The quick brown fox",
"Jumped over the lazy dog",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
Your vocabulary looks like this (check the vectorizer.vocabulary_ attribute):
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
So you can look at the output above and say that the term 'quick', for example, has the index 6.
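You can also query this mapping directly:

print(vectorizer.vocabulary_['quick'])  # 6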
Let's look at the first document in the collection:
print(tfidf_matrix[0])
# (0, 2) 0.534046329052269
# (0, 0) 0.534046329052269
# (0, 6) 0.534046329052269
# (0, 7) 0.37997836159100784
The words 'fox' (index 2), 'brown' (index 0), and 'quick' (index 6) all have a score of 0.534046329052269, while 'the' (index 7) has a score of 0.37997836159100784.
print(tfidf_matrix.toarray()[0])
# [0.53404633 0. 0.53404633 0. 0. 0.
# 0.53404633 0.37997836]
After calling .toarray(), you see the same scores laid out as a dense array, and you can access the score of a specific word in the first document like this:
tfidf_matrix.toarray()[0][vectorizer.vocabulary_['brown']]
# 0.534046329052269
Here is an example of how to get the terms of the first document sorted by their TF-IDF scores, given the document collection above:
first_doc = tfidf_matrix[0].toarray()  # a dense 1 x 8 array for the first document
terms = vectorizer.get_feature_names_out()
# pair each score with its term, then sort by score (and by term for ties) in descending order
scores = [(first_doc[j][k], terms[k]) for j in range(len(first_doc)) for k in range(len(first_doc[j]))]
scores = sorted(scores, reverse=True, key=lambda tup: (tup[0], tup[1]))
The scores output:
[(0.534046329052269, 'quick'),
(0.534046329052269, 'fox'),
(0.534046329052269, 'brown'),
(0.37997836159100784, 'the'),
(0.0, 'over'),
(0.0, 'lazy'),
(0.0, 'jumped'),
(0.0, 'dog')]
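As mentioned earlier, a data analysis library like pandas can give an even more tangible view of the same matrix. A minimal sketch (assuming pandas is installed):

import pandas as pd

# rows are documents, columns are vocabulary terms
df = pd.DataFrame(tfidf_matrix.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df.round(2))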
Summary
Now, let's summarize what you should take away from this topic:

- You've familiarized yourself with how TF-IDF is calculated in scikit-learn;
- You've learned how to use the TfidfVectorizer class to calculate TF-IDF scores;
- You took a look at TfidfVectorizer parameters;
- You used a couple of class methods;
- You learned to interpret and work with the output matrices and to use .toarray() to handle them more easily.