
Topic modeling is one of the main tasks in NLP. The main idea behind it is to find the topics present in a corpus based on the words used in its documents. A topic is a group of words that best represents a collection of documents, a document is a single text, and a word is a unique term used in the text.

For example, suppose we have a few news articles that can be grouped into topics: animals, sports, and technology. Each topic is represented by a set of words; for the topic animals, these words could include dog, animal, loyal, and evil. Note that some documents may cover multiple topics, and some words may appear in multiple topics.

Topics and documents

Let's talk about some use cases:

  1. Information retrieval: search a collection of documents for specific topics.
  2. Journalism: classify news articles into different categories.
  3. EDA step: understand data distribution and correlated topics.
  4. Social media analysis: identify a topic in a post and suggest appropriate hashtags for it.
  5. Customer feedback analysis: analyze all feedback in the form of product reviews and identify the most common issues.

Topic coherence

Topic coherence is a measure that evaluates the quality of a topic based on how well it is supported by a reference corpus. It's about checking whether a topic produced by LDA is good or bad: for example, 0.005*juice + 0.08*plane + 0.01*force is a bad topic, while 0.03*apple + 0.098*juice + 0.01*mix is a good one. The score is calculated from how often the topic's words appear together in the reference corpus. We aim to increase topic coherence, but the score is tied to a specific dataset, so it's hard to say in general whether a given coherence value is good or bad.

Here, 0.005*juice + 0.08*plane + 0.01*force means that the topic can be described by the words juice, plane, and force. Each word has a score that reflects its contribution to the meaning of the topic, written before the word.

To get the topic coherence score, use gensim:

from gensim.models import CoherenceModel

# `model` is any trained topic model, e.g., the LDA model we train below
cm = CoherenceModel(model=model, corpus=corpus, coherence='u_mass')
coherence = cm.get_coherence()
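
Since coherence values are tied to a dataset, a common way to use them is to compare several models trained with different numbers of topics and keep the best one. A minimal sketch, assuming the corpus and id2word dictionary that we build later in this topic:

from gensim.models import CoherenceModel, LdaModel

scores = {}
for num_topics in (3, 5, 10):
    lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=5)
    cm = CoherenceModel(model=lda, corpus=corpus, coherence='u_mass')
    scores[num_topics] = cm.get_coherence()

# higher (closer to zero) u_mass values indicate more coherent topics
print(scores)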

Approaches to topic modeling

There are many different approaches to topic modeling:

Distributional approaches. The main idea is that words that frequently co-occur are likely to belong to the same topic. Methods — Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF); a short NMF sketch follows this list.

Deterministic approaches use a set of pre-defined rules and algorithms to identify and extract topics. Methods — keyword extraction, clustering, and categorization.

Hierarchical approaches create a hierarchy of topics, with broader topics at the top and more specific topics at the bottom. Methods — Hierarchical Dirichlet Process (HDP) and Hierarchical Topic Models (HTM).

Temporal approaches identify topics that change over time and analyze how these topics are related to each other. Methods — Dynamic Topic Models (DTM) and Structural Topic Models (STM).
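
To illustrate the distributional idea before moving on, here is a minimal NMF sketch. It uses scikit-learn, which is not used elsewhere in this topic, and the tiny document list is made up for the example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# made-up mini-corpus for the illustration
docs = [
    "dogs are loyal animals",
    "the team won the football match",
    "new phones ship with faster chips",
    "cats and dogs are popular pets",
]

# term weights for each document
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# factorize into document-topic (W) and topic-word (H) matrices
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# top words of each topic
words = vectorizer.get_feature_names_out()
for topic_id, topic in enumerate(H):
    print(topic_id, [words[i] for i in topic.argsort()[::-1][:3]])

Each row of H is a topic, and its largest values point to the most representative words — the same kind of output LDA and LSA give us below.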

We will cover the most popular approaches in detail.

LDA

Latent Dirichlet Allocation (LDA) is based on the idea that each document can be described by a distribution of topics and each topic can be described by a distribution of words. The word Latent in the name refers to the hidden topics that generate the observed data; Dirichlet refers to the probability distribution of topics in documents and of words in topics; and Allocation refers to the assignment of words in a document to one or more topics based on their probabilities. Furthermore, LDA assumes that the initial topic assignments for each word in the corpus are randomly generated based on a probability distribution. This random initialization allows the model to explore different topic assignments and avoid getting stuck in local optima during the optimization process.

For LDA we have two sets of parameters:

  1. Topic distribution for each document, which represents the proportion of each topic that is present in the document.
    For example, an article about a football match in Argentina: 50% topic sport, 30% topic champions, 20% topic countries;
  2. Word distribution for each topic, which represents the probability of each word in the vocabulary given the topic.
    For example, the sports topic: 0.1*football + 0.3*sportsmen + 0.001*jump + 0.002*run

The algorithm starts with some initial values for these parameters, which can be set randomly or based on some prior knowledge, and then iteratively refines them until convergence is achieved. The goal of the algorithm is to find the optimal values for these parameters that maximize the likelihood of the observed data.

Practice LDA model training

Now, let's train a topic model using the gensim library. We will use a dataset of hotel reviews from the TripAdvisor website. First, you need to download the dataset. Read the dataset and transform the Review column into a list of tokenized reviews:

import pandas as pd

reviews = pd.read_csv('./tripadvisor_hotel_reviews.csv')

# the Review column as a list of strings
reviews = reviews['Review'].to_list()

# split each review into a list of words
reviews = [review.split() for review in reviews]
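
A plain split() keeps punctuation and capitalization, which will later show up in the topics. One optional alternative is gensim's simple_preprocess, which lowercases the text and strips punctuation; a quick illustration on a made-up sentence:

from gensim.utils import simple_preprocess

# tokenize, lowercase, and drop punctuation in one call
print(simple_preprocess("The room was great, staff friendly!"))
# ['the', 'room', 'was', 'great', 'staff', 'friendly']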

Import the gensim library, the gensim.corpora module to prepare the word frequencies, and pprint for readable output:

import gensim
import gensim.corpora as corpora

from pprint import pprint

Next, we need to create an id2word dictionary with pairs of word IDs and words. Then we create a corpus: for each review, a list of the words it contains and their frequencies. For example, the corpus entry of the first review is [..., (62, 3), ...], which means that the word with the ID 62 (room) appears three times in this review. This is important for understanding which topic is the most common in this review:

# dictionary - {word_id: word}
id2word = corpora.Dictionary(reviews)


# one review - [(word_id, frequency),...]
corpus = [id2word.doc2bow(text) for text in reviews]
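
To check what this representation looks like, you can print a few pairs from the first review and map an ID back to its word; the exact IDs and counts depend on the dataset, so treat the output as illustrative:

# first few (word_id, frequency) pairs of the first review
print(corpus[0][:5])

# map the first word ID of that review back to the word itself
word_id = corpus[0][0][0]
print(word_id, id2word[word_id])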

Then we specify parameters such as num_topics, the number of topics; chunksize, the number of reviews used in each training chunk; passes, the number of passes through the corpus during training; and per_word_topics, whether to also compute the most likely topics for each word. We pass our corpus and the id2word mapping to the model and print the results.

Each topic is represented as a list of words and their weights, which show how much each word contributes to the topic.

# create and train the model
lda_model = gensim.models.ldamodel.LdaModel(
                                           corpus=corpus,
                                           id2word=id2word,
                                           num_topics=3,
                                           chunksize=50,
                                           passes=5,
                                           per_word_topics=False
                                            )

pprint(lda_model.print_topics())

# [(0,
#   '0.041*"hotel" + 0.018*"room" + 0.016*"great" + 0.010*"staff" + ...'),
# ....
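
After training, you can also inspect the topic distribution of a single review, which corresponds to the first set of LDA parameters described above; the proportions will vary between runs:

# topic proportions of the first review, e.g. [(0, 0.71), (2, 0.27)]
print(lda_model.get_document_topics(corpus[0]))

# top words of topic 0 with their weights
print(lda_model.show_topic(0, topn=5))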

LSA

Latent Semantic Analysis (LSA) is based on the idea of finding hidden relationships between words and documents. First, we create a term-document matrix, where each row represents a word and each column represents a document. Then we apply SVD, a dimensionality reduction operation, and get three matrices: word-topic, topic importance, and topic-document. We then use these matrices to characterize each topic and each document. The topic-document matrix is the one we are looking for: it shows which topics each document contains.
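
As a rough illustration of this decomposition (separate from the gensim workflow below), here is what SVD looks like on a small made-up term-document matrix with NumPy:

import numpy as np

rng = np.random.default_rng(0)

# toy term-document matrix: 6 words x 4 documents with made-up counts
A = rng.integers(0, 5, size=(6, 4)).astype(float)

# U: word-topic, s: topic importance, Vt: topic-document
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep the 2 strongest topics; each column shows a document's topic weights
k = 2
print(Vt[:k])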

Let's train an LSA model on the same dataset. Luckily, the gensim library offers functionality for LSA training. We do everything the same way but also create a TF-IDF representation of the corpus:

from gensim.models import TfidfModel, LsiModel

# create TF-IDF frequencies of the corpus 
tfidf_gensim = TfidfModel(corpus)
corpus_tfidf = tfidf_gensim[corpus]

# create and train the model
lsa_model_gensim = LsiModel(corpus=corpus_tfidf, id2word=id2word, num_topics=5)

for index in range(5):
    # Print the first 10 most representative terms for topics
    print(f'Topic {index}: {lsa_model_gensim.print_topic(index, 10)}')


# Topic 0: 0.129*"n't" + 0.122*"not" + 0.115*"great" + 0.108*"did" + 0.106*"good" + 0.103*"nice" + 0.101*"room"
# ...

You can see that some topics include uninformative words, so it's recommended to clean the whole corpus first (remove stop words, punctuation, and so on) to get better topics.
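
Separately, to see which topics a particular review contains, you can project its TF-IDF vector with the trained model; the weights below are just an illustration:

# topic weights of the first review, e.g. [(0, 0.41), (1, -0.08), ...]
print(lsa_model_gensim[tfidf_gensim[corpus[0]]])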

BERT

BERT (Bidirectional Encoder Representations from Transformers) can also be used for topic modeling. Here, its main role is to provide embeddings. The procedure is as follows:

  1. Get sentence embeddings, or text embeddings if your texts are short.
  2. Apply a dimensionality reduction algorithm to the embeddings (e.g., UMAP).
  3. Apply a clustering algorithm (e.g., HDBSCAN, K-means).
  4. Create topics based on TF-IDF: apply a class-based TF-IDF to each cluster to get the importance of each word in the topic.

This algorithm is implemented in the BERTopic library. During clustering, documents that don't fit any cluster are treated as outliers and assigned the topic -1. To train a BERTopic model, initialize and fit it:

from bertopic import BERTopic

reviews = pd.read_csv('./tripadvisor_hotel_reviews.csv')

topic_model = BERTopic()

topics, probs = topic_model.fit_transform(reviews['Review'].to_list())

BERTopic provides a variety of methods to explore the results after training, including:

get_topic(number_of_topic) to get a topic's representation as words and their weights;
reduce_topics(docs, nr_topics=number_to_reduce) to reduce the number of topics to nr_topics;
find_topics(word) to find the topics most related to a given word.
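
A short usage sketch of these methods (get_topic_info, used below for an overview table, is another method the library provides; the search word and printed IDs are just examples):

# overview table with topic IDs, sizes, and top words
print(topic_model.get_topic_info().head())

# words and weights of topic 0
print(topic_model.get_topic(0))

# topics most related to a search word (the word here is an arbitrary example)
similar_topics, similarity = topic_model.find_topics("pool", top_n=3)
print(similar_topics)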

Difficulties with topic modeling

  1. Sensitivity to data quality: mistakes in the data can lead to irrelevant topics.
  2. Fixed number of topics: sometimes you don't know the exact number of topics in the data and need many experiments to find the best value.
  3. Human interpretation: it's hard to come up with a human-readable name for a topic based only on its most important words.

Conclusion

Topic modeling is a primary method for gaining a better understanding of text data. It can be used as a separate task or as an EDA step to understand the data distribution. There are many methods: some are simple to implement, while others are more complex. You should try different approaches to determine the best distribution of topics in your data.
