
TF-IDF


While processing numerical data may seem straightforward, the task becomes more complex when dealing with text data. This topic will explore a classic NLP technique, TF-IDF, used for word representation.

TF-IDF justification

TF-IDF stands for Term Frequency — Inverse Document Frequency. Its primary purpose is to assess a word's significance or create vectors for word representation. The process involves calculating a score for each word that reflects its importance in the document and the corpus.

TF-IDF can be helpful in situations such as Information Retrieval, where ranking results in order of relevance is essential. A query is broken down into terms, which are then searched for in a collection of documents, and how frequently a term occurs in a document is a first indicator of how relevant that document is. For this reason, term frequency is a natural starting point when considering relevance.

Calculating the weights of terms in a document can be done in various ways. The general rule is relatively straightforward: terms that appear most frequently have greater weights. However, there are some subtleties to consider, leading to different approaches. For instance, the Bag of Words (BoW) model counts how often each term occurs in a document. Although this is the most basic way to assign weights to terms, it has limitations. Let's use Information Retrieval (IR) to illustrate these limitations. In IR, we need to grasp the meaning of a query, sentence, document, etc. Yet, in the BoW model, all terms are equally significant, which poses some challenges.

Another point is that some terms are articles, prepositions, pronouns, and auxiliary verbs. These words are extremely common and crucial from a grammatical point of view, yet they contribute little to nothing to the meaning of the whole query, sentence, or document. In the BoW model, however, they are treated as if they were just as meaningful as any other term.

The virtue of TF-IDF is that it addresses these difficulties while striving to capture the most relevant words in a document.

TF-IDF overview

Now that we have glanced at the underlying idea, let's move on to the formula. How can you estimate the weight of a term t in a document d under the TF-IDF scheme?

As mentioned above, TF stands for term frequency, and IDF stands for inverse document frequency.

  • Term frequency shows how often a term t is used in a particular document d:

\text{TF}(t, d) = \frac{\text{number of occurrences of } t \text{ in } d}{\text{total number of terms in } d}
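As a quick illustration, here is a minimal sketch of this formula in Python; the function name and the naive whitespace tokenization are assumptions made for the example, not a prescribed implementation.

```python
from collections import Counter

def term_frequency(term: str, document: str) -> float:
    """Share of `term` among all tokens of `document`."""
    # Naive tokenization: lowercase the text and split on whitespace.
    tokens = document.lower().split()
    return Counter(tokens)[term] / len(tokens)

print(term_frequency("learning", "learning Chinese made me happy"))  # 0.2
```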

The Bag of Words (BoW) model only considers term frequency, but we previously criticized this model for a reason. The main difference between BoW and TF-IDF is that the latter takes into account the inverse document frequency (IDF). Common words like is, are, and the may have higher term frequency scores, but they can be considered stop words and removed during preprocessing. However, it's difficult to identify all of these words. IDF helps to lower the importance of these common terms, making the TF-IDF model more effective.

  • Inverse document frequency reflects how common or rare a term t is across the whole corpus D:

\text{IDF}(t, D) = \log_{10}\frac{\text{number of documents in corpus } D}{\text{number of documents in corpus } D \text{ containing } t}

It is worth noting that it is common practice to add 1 to the number of documents in the corpus D that contain the term t. This ensures that division by zero is avoided when no document in the corpus contains the term.

A term's IDF (inverse document frequency) is high if the term is rare and low if it is frequent, because dividing a relatively large number by a relatively small one yields a large ratio. For example, if there are 20 documents in the corpus and only 5 of them contain a specific term, the IDF equals the base-ten logarithm of 4. On the other hand, if a term appears in 18 of the 20 documents, the IDF equals the base-ten logarithm of about 1.1.
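The numbers from this hypothetical 20-document corpus can be checked directly, for instance with Python's math module:

```python
import math

n_docs = 20

# A relatively rare term: it appears in only 5 of the 20 documents.
print(math.log10(n_docs / 5))    # ~0.60

# A very common term: it appears in 18 of the 20 documents.
print(math.log10(n_docs / 18))   # ~0.046
```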

IDF is useful not only for auxiliary parts of speech but also in cases like the example below. Suppose there is a collection of documents on the sewing industry. In almost every document, you will likely find more than one occurrence of the term stitch. Now imagine you need keywords to describe the content of one document from this collection. Since terms like stitch, sew, and their derivatives appear almost everywhere, they say nothing about any particular document; their low IDF pushes them aside and forces us to look for terms that reflect the true meaning of each document.

Combining the TF and the IDF of a term yields its composite weight within a document and across the corpus:

\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
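Below is a minimal sketch of how the whole scheme could be put together in Python. The helper names and the representation of the corpus as a list of token lists are assumptions for this example; the optional +1 smoothing mentioned earlier is noted in a comment.

```python
import math
from collections import Counter

def tf(term, tokens):
    # Share of `term` among all tokens of one document.
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # `corpus` is a list of token lists.
    containing = sum(1 for tokens in corpus if term in tokens)
    # In practice, 1 is often added to `containing` to avoid division by zero.
    return math.log10(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

docs = [["learning", "English"], ["learning", "Chinese"]]
print(tf_idf("English", docs[0], docs))  # 0.5 * log10(2) ≈ 0.15
```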

How to interpret the results? Let's look at the figure below.

The meaning of TF-IDF values

Commonly used words such as articles and auxiliary verbs end up with low scores because of their low IDF. Terms found in many documents, or words overused in just a few documents, receive an average TF-IDF score. The highest scores go to words that are repeated several times within a small number of documents.

TF-IDF calculation walk-through

Let us now try to compute the TF-IDF values of terms from the example below. Suppose our sentences are documents, and together they constitute a corpus.

Document 1

term       term count
learning   4
English    2
made       1
me         1
happy      1

Document 2

term       term count
learning   2
Chinese    3
made       1
me         1
happy      1

Document 3

term       term count
learning   2
made       1
me         1
happy      1

The TF-IDF calculation for learning begins with its term frequency. According to the formula, divide the number of occurrences of learning by the total number of terms in each document. In Document 1, it occurs four times out of 9 terms in total; in Document 2, twice out of 8 terms; and in Document 3, twice out of 5 terms.

TF("learning", D1)=490.44TF("learning", D2)=28=0.25TF("learning", D3)=25=0.4\text{TF}(\text{"learning", D1}) = \frac{4}{9} \approx 0.44 \\ \text{TF}(\text{"learning", D2}) = \frac{2}{8} = 0.25\\ \text{TF}(\text{"learning", D3}) = \frac{2}{5} = 0.4

Following this logic, you can also compute TF values for other terms.

TF("English", D1)=290.22TF("made", D1)=TF("me", D1)=TF("happy", D1)=190.11\text{TF}(\text{"English", D1}) = \frac{2}{9} \approx 0.22 \\ \text{TF}(\text{"made", D1}) = \text{TF}(\text{"me", D1}) = \text{TF}(\text{"happy", D1})=\frac{1}{9} \approx 0.11

TF("Chinese", D2)=38=0.375TF("made", D2)=TF("me", D2)=TF("happy", D2)=18=0.125\text{TF}(\text{"Chinese", D2}) = \frac{3}{8} = 0.375 \\ \text{TF}(\text{"made", D2}) = \text{TF}(\text{"me", D2}) = \text{TF}(\text{"happy", D2})=\frac{1}{8} = 0.125

TF("made", D3)=TF("me", D3)=TF("happy", D3)=15=0.2\text{TF}(\text{"made", D3}) = \text{TF}(\text{"me", D3}) = \text{TF}(\text{"happy", D3})=\frac{1}{5} = 0.2

Now we can proceed with computing the IDF values. The total number of documents in our corpus is 3. Keeping this in mind, let us find the number of documents containing these terms. You come across the term learning in every document from the corpus. So the IDF of learning is the logarithm of 1. The same holds for the terms made, me, and happy.

IDF("learning")=log33=log1=0IDF("made")=IDF("me")=IDF("happy")=log33=log1=0\text{IDF}(\text{"learning"}) = \log{3\over 3}=\log1=0 \\ \text{IDF}(\text{"made"}) = \text{IDF}(\text{"me"}) = \text{IDF}(\text{"happy"})= \log{3\over 3}=\log1=0

IDF("English")=IDF("Chinese")=log31=0,48\text{IDF}(\text{"English"}) = \text{IDF}(\text{"Chinese"}) = \log{3\over 1}=\approx{0,48}

Now that we have computed the TF and the IDF values, we can combine the results:

TF-IDF("learning")=TF("learning", D3)×IDF("learning")=0,4×0=0\text{TF-IDF}(\text{"learning"}) = \text{TF}(\text{"learning", D3})\times\text{IDF}(\text{"learning"}) = 0,4 \times 0 = 0

TF-IDF("English")=TF("English", D1)×IDF("English")=0,22×0,480,11\text{TF-IDF}(\text{"English"}) = \text{TF}(\text{"English", D1})\times\text{IDF}(\text{"English"}) = 0,22 \times 0,48 \approx{0,11}

Pros and cons of TF-IDF

Pros:

  1. It is computationally inexpensive.
  2. It is easy to calculate and interpret.

Cons:

  1. TF-IDF computes document similarity directly in the word-count space, which can be slow for large vocabularies.
  2. TF-IDF does not consider the semantic meaning of words: it cannot understand the context of the words and derive their meaning. Moreover, it fails to define semantic links between words. That makes it impossible to recognize compound nouns such as "Queen of England" or "Red Square," not to mention set expressions.
  3. TF-IDF ignores word order. For example, take two sentences: "English is easier to learn than Chinese" and "Chinese is easier to learn than English." In terms of TF-IDF, these sentences are identical since they share the same terms with the same number of occurrences. Semantically, however, they describe very different states of affairs, and presumably very few people would find the second statement relatable. The sketch below demonstrates this limitation.
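The word-order limitation is easy to see in practice, for example with scikit-learn's TfidfVectorizer. Note that its default settings use a smoothed, natural-log IDF and L2 normalization, so the individual scores differ from the scheme described above, but the conclusion about word order is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "English is easier to learn than Chinese",
    "Chinese is easier to learn than English",
]

vectors = TfidfVectorizer().fit_transform(sentences).toarray()

# The two sentences contain exactly the same terms with the same counts,
# so their TF-IDF vectors are identical even though the meanings differ.
print((vectors[0] == vectors[1]).all())  # True
```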

Conclusion

Throughout this topic, we have delved into the concept of the TF-IDF metric and its significance in text mining. We have explored the formula used to calculate TF-IDF values and potential applications of the metric. Additionally, we briefly touched upon both the advantages and disadvantages of TF-IDF. Despite its potential drawbacks, TF-IDF remains a fundamental tool in the pursuit of automated comprehension of textual data. It is widely applied in various fields, including the development of search engines.
