
Text clustering


Sometimes you may need to divide a set of texts into groups, either to get a general understanding of what your corpus consists of or as part of a more sophisticated statistical analysis. In this topic, we'll discuss one way of grouping texts: text clustering. Although text clustering is a self-sufficient process, it is also a part of more elaborate NLP technologies, such as topic modeling.

Text clustering

Clustering is a type of unsupervised machine learning. Here, a supervisor is not a professor who knows everything, but the so-called "class" labels (often denoted as y in classification tasks). In other words, clustering is a classification problem where we don't know the classes in advance. Clustering is the task of dividing data into sets, where each cluster conventionally consists of similar elements.

Text clustering is a process of dividing a collection of texts into clusters. Since we can cluster texts on a document or sentence level, there are two subtypes of text clustering: document clustering and sentence clustering. Document clustering is much more common than sentence clustering, so it'll be at the center of our discussion.

Document clustering is applied in many fields, including spam filtering, information retrieval, and, of course, topic modeling. The amount of text data on the web grows exponentially, so organizations need a structure in place to mine profitable insights from the generated text. Hence, the need for document clustering grows at an incredible pace.

Feature extraction

Before clustering, you need to convert the texts into numerical vectors. This procedure is also known as text representation. The most common techniques are TF-IDF and bag-of-words.

Just a quick reminder:

  • TF-IDF (Term Frequency-Inverse Document Frequency) weighs a word's frequency in a specific document against its frequency in the entire text collection. Some words appear in almost every document (for example, do, say, have), so their counts can be high even for a single text. The idea of TF-IDF is to rescale the counts to eliminate such bias;

  • Bag-of-words is a simpler technique. In BOW, we create a dictionary that lists the unique words of the text corpus, similar to a vocabulary. Then we represent each word of a given sentence or document as 0 for absence or 1 for presence (a minimal code sketch follows the figure below). BOW works as follows:

Bag-of-words (BOW)
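As a minimal sketch of this idea (assuming a recent version of scikit-learn), CountVectorizer with binary=True builds exactly such a presence/absence matrix; the two-sentence corpus below is made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# binary=True records presence (1) / absence (0) instead of raw word counts
bow = CountVectorizer(binary=True)
X = bow.fit_transform(corpus)

print(bow.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                  # one 0/1 row per document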

Text representation is necessary to collect the features of the text. A textual feature could be just the presence or absence of a token in the text, as in the picture above. It could also be a semantic feature (buy vs. purchase). Part of speech and grammatical structure add to the textual features as well. For text clustering, TF-IDF is usually a better choice than BOW because it downweights words that appear across the entire corpus. There are more elaborate models, such as Word2Vec and Doc2Vec, that show better results; still, TF-IDF remains a classical choice.
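To see the difference on a toy example, here is a short sketch (the sentences are made up): a word that occurs in every document, like covid below, gets a lower TF-IDF weight than a word specific to one document, while plain BOW counts would treat them the same.

from sklearn.feature_extraction.text import TfidfVectorizer

# tiny made-up corpus: "covid" occurs in every document, "cases" only in the first one
corpus = [
    "covid cases rise in the city",
    "covid vaccine rollout continues",
    "covid lockdown affects local shops",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()

weights = dict(zip(tfidf.get_feature_names_out(), X[0]))
print(weights["covid"], weights["cases"])  # the corpus-wide word gets the lower weight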

Document similarity

After feature extraction, we need to measure how similar the documents are to each other. Here we can use common similarity measures. Words can be similar lexically or semantically.

Lexical similarity shows how close two samples are on the character, word, or n-gram level. To compute lexical similarity, we don't need to transform our samples: we compare their tokens, characters, or n-grams directly.

If we take two samples:

Anny rejected the invitation from Jessy 

Jessy rejected the invitation from Anny.

In terms of lexical similarity, these two samples are very close. We can use the Overlap coefficient, Levenshtein distance, Longest Common Substring, Jaro, Needleman-Wunsch, or the Jaccard similarity metric to measure lexical similarity.

The Jaccard similarity can be expressed as the number of common words divided by the total number of unique words in the two texts or documents.

\text{Jaccard}(doc_1, doc_2) = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|}

The two sentences above consist of the same six words, so all of them are shared, and their Jaccard similarity is \frac{6}{6} = 1: lexically, the samples are as close as possible, even though they mean different things.
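You can check this by hand with a few lines of plain Python; the sketch below uses naive whitespace tokenization with lowercasing and punctuation stripping:

def jaccard(doc1: str, doc2: str) -> float:
    # naive tokenization: lowercase, strip surrounding punctuation, split on spaces
    tokens1 = {w.strip(".,!?").lower() for w in doc1.split()}
    tokens2 = {w.strip(".,!?").lower() for w in doc2.split()}
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

print(jaccard("Anny rejected the invitation from Jessy",
              "Jessy rejected the invitation from Anny."))  # 1.0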

Semantic similarity, on the other hand, considers words similar if they have the same meaning. For document similarity measures, it's usually preferable to rely on lexical similarity.

Main algorithms in clustering

K-means is the simplest clustering algorithm. It is often called centroid-based because each cluster is built around a centroid, a point at its center.

K-means

You can watch the process of K-means clustering step by step at Visualizing K-Means Clustering by Naftali Harris.
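If you prefer code to animations, here is a rough NumPy sketch of the idea (not what scikit-learn does internally): assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until the centroids stop moving.

import numpy as np

def kmeans_step(X, centroids):
    """One pass of the K-means idea (empty clusters are not handled here)."""
    # 1. assign every point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # 2. move every centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids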

The main problems of K-means are:

  • It is necessary to know in advance the number of clusters;

  • The algorithm is very sensitive to the choice of initial cluster centers;

  • It can't work properly if an object belongs to several clusters equally or doesn't belong to any of them.

An alternative to K-means is C-means. The charm of C-means is that it resolves the ambiguity problem (the last one in the list above). Instead of giving an unambiguous answer to the question of which cluster an object belongs to, it determines the probability that the object belongs to each particular cluster.
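C-means itself isn't shipped with scikit-learn. As a rough stand-in to illustrate soft assignments, scikit-learn's GaussianMixture behaves similarly: it returns a membership probability for every cluster instead of a single hard label. The tiny 2D points below are made up; the middle point gets split between the two clusters.

import numpy as np
from sklearn.mixture import GaussianMixture

# toy 2D points: two loose groups plus one point right between them
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])

gm = GaussianMixture(n_components=2, random_state=42).fit(X)
print(gm.predict_proba(X).round(2))  # per-point membership probabilities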

A more advanced algorithm is DBSCAN. It is based on connecting points within certain distance thresholds. The principal difference between DBSCAN and K-means is shown in the picture below:

The difference between DBSCAN and K-means

In this topic, you will learn how to implement K-means and DBSCAN.

K-means code implementation

K-means is available in scikit-learn. To implement it, you'll need a corpus; let's use one of Kaggle's datasets. Kaggle is a data science platform, similar in spirit to GitHub and Hugging Face, where you can participate in competitions and use open datasets.

The dataset is available on the Kaggle page with a detailed description of each column. We advise you to use Google Colaboratory or Jupyter Notebook for this task. To download the dataset, use the following instructions:

# pip install opendatasets  (if it isn't installed yet)
import opendatasets as od

# URL of the dataset page on Kaggle
path = 'https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification'
od.download(path)

Make sure you have a Kaggle account. opendatasets will first ask for your Kaggle username and then for your token (the same way Hugging Face asks for it). To get the token, go to your profile page (click on the avatar in the top-right corner) --> Account --> API --> Create New API Token. Your browser will download a small kaggle.json file; copy the key from it to complete the download.

Then, open the dataset. The following code shows how to open it in Google Colaboratory. If you use a platform other than Google Colab, make sure to change the path: the /content/ prefix is specific to Colab.

import pandas as pd

# the CSV files are not plain UTF-8, so we specify the encoding explicitly
df_train = pd.read_csv('/content/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding='ISO-8859-1')
df_test = pd.read_csv('/content/covid-19-nlp-text-classification/Corona_NLP_test.csv', encoding='ISO-8859-1')

To train a model, we would normally need to split the data into train and test parts. This dataset is already split into train and test sets, so you don't need to use the train_test_split function here.

The dataset is designed for text classification tasks. As you already know, classification implies that there is a label; in this dataset, the label is stored in the Sentiment column. We won't use this column for clustering itself. The texts are stored in the OriginalTweet column.
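Before vectorizing, it may help to take a quick look at these two columns and at the sizes of both sets:

# a quick look at the columns we will (and won't) use
print(df_train[['OriginalTweet', 'Sentiment']].head())
print(df_train.shape, df_test.shape)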

Now let's apply TF-IDF vectorization to both parts of the corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.95,
    stop_words="english",
)

X_train = vectorizer.fit_transform(df_train['OriginalTweet'])
# use transform (not fit_transform) on the test set so both matrices share the same vocabulary
X_test = vectorizer.transform(df_test['OriginalTweet'])
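A quick sanity check: since we reuse the vectorizer fitted on the train set, the train and test matrices live in the same feature space.

# both matrices must have the same number of columns (the shared vocabulary size)
print(X_train.shape, X_test.shape)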

Decide how many clusters you want to see in the end. You can set any number you want, but since we are using a classification dataset, it is better to set it to the number of labels.

import numpy as np

true_k = len(np.unique(df_train['Sentiment']))

Now, let's implement the K-means algorithm itself. First, import and initialize the KMeans class. Specify the maximum number of iterations; otherwise, training may take too long to converge.

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=true_k,
    max_iter=100,
    n_init=10,
    random_state=42)

Let's train the model and check the results.

clusters = kmeans.fit(X_train)
cluster_ids, cluster_sizes = np.unique(clusters.labels_, return_counts=True)
print(f"Number of elements assigned to each cluster: {cluster_sizes}")

 
##  Number of elements assigned to each cluster: [ 7314  2713 20145  5321  5664]
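To get a feel for what landed in each cluster, a common trick is to print the highest-weighted terms of each cluster center. Here is a sketch that uses the vectorizer and model trained above:

terms = vectorizer.get_feature_names_out()
order = kmeans.cluster_centers_.argsort()[:, ::-1]  # term indices sorted by weight, per cluster

for cluster_id in range(true_k):
    top_terms = [terms[i] for i in order[cluster_id, :10]]
    print(f"Cluster {cluster_id}: {', '.join(top_terms)}")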

After training the model, one can use it on the test set.

y = kmeans.predict(X_test)
print(y[:20])  ## shows the cluster id of the first 20 texts

## [2 3 3 4 2 3 3 3 3 2 3 3 4 0 3 3 3 4 3 3]

Finally, you can check the cluster of each text in the test set:

pd.DataFrame({'X': df_test['OriginalTweet'], 'cluster_id': y})

K-means code implementation

You can see that the texts aren't filtered: punctuation marks and symbols like \n remain. Filtering should be done, but we omitted it here since you already know how to conduct text filtration.
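If you want to add that step, here is one minimal way to do it (a sketch, not the only option; the clean column name is just for illustration), applied before the TF-IDF step:

import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)        # drop punctuation, digits, and other symbols
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace (including \n)

df_train['clean'] = df_train['OriginalTweet'].apply(clean_text)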

DBSCAN implementation

The scikit-learn library provides DBSCAN, too. Import it the same way you imported KMeans a moment earlier.

from sklearn.cluster import DBSCAN

We assume you are using the same dataset, so we won't repeat how to load it.

As with KMeans, the first thing you need to do is fit the model to the vectorized dataset (we have already vectorized it with TF-IDF in the previous section).

# eps is the maximum distance between two samples for them to count as neighbors
dbscan_cluster1 = DBSCAN(eps=1.32)
db_clusters = dbscan_cluster1.fit(X_train)

cluster_ids, cluster_sizes = np.unique(db_clusters.labels_, return_counts=True)
print(f"Number of elements assigned to each cluster: {cluster_sizes}")

Now let's get cluster labels for the test set. Unlike KMeans, DBSCAN has no separate predict method in scikit-learn, so we call fit_predict, which refits the model on the test matrix:

y = dbscan_cluster1.fit_predict(X_test)
print(y[:20])

You can build a pandas dataframe the same way we did for the KMeans results.

Conclusion

In this topic, we have talked about well-known clustering algorithms and shown how they can be used in real life. We have discussed how to implement KMeans and DBSCAN clustering in Python. In practice, though, documents are often clustered with another NLP technology, topic modeling, which builds on text clustering. So, understanding text clustering will be useful when we talk about topic modeling.
