In this topic, we will take a look at a historical approach to automatic text summarization: extractive summarization, which selects a subset of sentences from the original document based on importance metrics. These importance scores determine whether a particular sentence will be included in the final summary.
A note on the intermediate text representation
Extractive summarization creates an intermediate representation of the text to find the important content. There are two types of representation, topic representation and indicator representation:
The topic-words technique identifies the words that describe the document's topic. One of the earliest works on summarization by H.P. Luhn used frequency thresholds to find the words that are representative of the topic. Latent semantic analysis (LSA) is an unsupervised method for extracting a representation of text based on observed words. With frequency-driven approaches, the word's weights (binary or continuous) determine the word's correlation to the topic. Bayesian topic models are probabilistic models (e.g., HIERSUM) which use Latent Dirichlet Allocation (LDA). The main idea behind LDA is that documents are represented as a mixture of topics, where each topic is a probability distribution over words.
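To make the topic-words idea more concrete, here is a minimal sketch of frequency thresholding; the tokenization, the tiny stop-word list, and the threshold value are arbitrary choices for illustration, not Luhn's original settings:

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}  # tiny illustrative list

def topic_words(text: str, min_count: int = 2) -> set:
    # Lowercase, strip basic punctuation, and drop stop words
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(w for w in tokens if w and w not in STOP_WORDS)
    # Keep only the words whose frequency passes the threshold
    return {w for w, c in counts.items() if c >= min_count}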
In the indicator representation, a sentence is represented as a list of features, such as sentence length or the relative position of the sentence in the document. Graph methods (e.g., TextRank) represent the documents as a connected graph, with sentences as the vertices of the graph and edges between the sentences as indicators of sentence similarity. ML approaches (e.g., Naive Bayes) classify the sentences as summary or non-summary based on their features, given a training set of documents and their extractive summaries.
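As a rough illustration of the indicator representation, each sentence can be mapped to a small feature vector; the feature set below is limited to the two features mentioned above (length and relative position), chosen only for this sketch:

def sentence_features(sentences: list) -> list:
    # For each sentence: its length in words and its relative position in the document
    n = len(sentences)
    return [
        {"length": len(s.split()), "relative_position": i / max(n - 1, 1)}
        for i, s in enumerate(sentences)
    ]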
The preliminary steps
There are roughly three stages in the process of extractive summarization:
Generating an intermediate text representation;
Assigning an importance score to every sentence;
Selecting the most important sentences to produce the summary.
We will track the described stages through three solutions: Luhn's summarizer, a technique based on word frequency; TextRank, a graph-based method; and an application of LSA.
We will use the CNN/Daily Mail news article dataset from Hugging Face.
Here are the required packages:
pip install spacy numpy nltk sumy rouge datasets
We will use the Sumy package, a well-maintained library that implements a number of extractive summarization algorithms.
Import the necessary packages:
import datasets
import numpy as np
from rouge import Rouge
Load the CNN dataset (it may take some time):
dataset = datasets.load_dataset("cnn_dailymail", '3.0.0')
Let's take a look at the structure of the first dictionary in the train section of the dataset:
first_entry = dataset['train'][0]
The articles are stored under the article key, and the provided summaries under the highlights key.
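If you want to inspect the entry yourself, a quick check could look like this (an id field is also present in this dataset version, alongside article and highlights):

print(first_entry.keys())             # expected: dict_keys(['article', 'highlights', 'id'])
print(first_entry['article'][:300])   # the beginning of the news article
print(first_entry['highlights'])      # the provided (golden) summary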
Here is a small utility function to obtain the ROUGE scores; it will be reused later:
def evaluate_summary(prediction: str, reference: str) -> str:
    rouge = Rouge()
    scores = rouge.get_scores(prediction, reference)[0]
    out = {k: round(v["f"], 2) for k, v in scores.items()}
    avg = round(np.mean(list(out.values())), 2)
    return (
        f"ROUGE1: {out['rouge-1']} | ROUGE2: {out['rouge-2']} "
        f"| ROUGE_L: {out['rouge-l']} | AVG: {avg}"
    )
Luhn's summarizer
Luhn's summarizer was one of the first attempts in the field of text summarization. In the 1958 paper "The Automatic Creation of Literature Abstracts", Luhn proposes that word frequency determines a word's significance. The diagram below gives a conceptual overview of Luhn's approach:
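Before turning to the library implementation, here is a minimal sketch of the core scoring idea; it is a simplification, since Luhn's original method scores clusters of significant words inside a sentence, while here the whole sentence is treated as a single cluster:

from collections import Counter

def luhn_like_scores(sentences: list, stop_words: set, min_freq: int = 2) -> list:
    # Count word frequencies over the whole document, ignoring stop words
    words = [w.lower() for s in sentences for w in s.split() if w.lower() not in stop_words]
    significant = {w for w, c in Counter(words).items() if c >= min_freq}

    scores = []
    for s in sentences:
        tokens = [w.lower() for w in s.split()]
        hits = [w for w in tokens if w in significant]
        # Luhn-style score: (number of significant words)^2 / number of words considered
        scores.append(len(hits) ** 2 / len(tokens) if tokens else 0.0)
    return scores

Sentences with the highest scores would then be picked for the summary.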
Let's see how Luhn's summarizer is implemented in Sumy. Import the required packages:
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.parsers.plaintext import PlaintextParser
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
import nltk
nltk.download('punkt')
Usage of LuhnSummarizer is pretty straightforward:
def summarize_luhn(article: str, sentence_count: int) -> str:
    ''' Utility function to perform Luhn's summarization.
    By default, LuhnSummarizer will select 100% of non-stop post-processed words as
    significant, but you can overwrite the significant_percentage attribute as a
    fraction: summarizerLuhn.significant_percentage = 1/3
    '''
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    summarizerLuhn = LuhnSummarizer(Stemmer('english'))
    summarizerLuhn.stop_words = get_stop_words('english')
    luhn_summary = summarizerLuhn(parser.document, sentences_count=sentence_count)
    return ' '.join([str(sentence) for sentence in luhn_summary])

luhn_prediction = summarize_luhn(first_entry['article'], sentence_count=2)
evaluate_summary(luhn_prediction, first_entry['highlights'])
We get the ROUGE scores:
ROUGE1: 0.43 | ROUGE2: 0.32 | ROUGE_L: 0.43 | AVG: 0.39
This means that 43% of unigrams (ROUGE-1) and 32% of bigrams (ROUGE-2) match between the produced and the golden summary, while the longest common subsequences (ROUGE-L) match by 43%, giving an average of 39%.
Here is the visualization of the matching words of both summaries:
Luhn's algorithm is intuitive to implement and unsupervised, but it is known to have limited accuracy, lacks synonym detection, and is prone to redundancy.
TextRank
TextRank represents the document as a graph, with sentences as nodes and edges between sentences weighted by the overlap of their content. Node scores are initialized with arbitrary values and are then iteratively recomputed using a similarity measure over the connected sentences.
Given two sentences $S_i$ and $S_j$, where $S_i$ is the set of words in sentence $S_i$, $S_j$ is the set of words in sentence $S_j$, and $|S_i|$, $|S_j|$ are the lengths of the respective sentences, the similarity function for the sentences can be introduced as follows:

$$\mathrm{Sim}(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \text{ and } w_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$

The node's rank (weight) can be calculated as:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

Where
$d$ — a damping factor, the probability of jumping from a node to a connected node in the graph; both TextRank and PageRank use $d = 0.85$, with $1 - d$ being the probability of jumping to a random node;
$V_i$, $V_j$ — two connected nodes;
$w_{ji}$ — the previously calculated sentence similarity score for sentences $S_j$ and $S_i$;
$In(V_i)$ — the set of nodes that point to $V_i$;
$Out(V_j)$ — the set of nodes that $V_j$ points to.
This formula is applied recursively until a stable score is assigned to each node. The score becomes stable when the error rate for the node's score is smaller than a specified threshold, where the error rate is $|WS^{k+1}(V_i) - WS^{k}(V_i)|$, with $WS^{k}(V_i)$ and $WS^{k+1}(V_i)$ being the scores obtained at iterations $k$ and $k+1$ respectively.
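To tie the formulas together, here is a compact sketch of the score iteration; whitespace tokenization and the convergence threshold are simplifications made for the sketch:

import math

def sentence_similarity(s1: str, s2: str) -> float:
    # Content overlap normalized by sentence lengths, as in the formula above
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank_scores(sentences: list, d: float = 0.85, eps: float = 1e-4) -> list:
    n = len(sentences)
    # Edge weights w[j][i] come from the similarity function
    w = [[sentence_similarity(sentences[j], sentences[i]) if i != j else 0.0
          for i in range(n)] for j in range(n)]
    scores = [1.0] * n
    while True:
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_sum = sum(w[j])  # total weight of edges leaving node j
                if w[j][i] > 0 and out_sum > 0:
                    rank += w[j][i] / out_sum * scores[j]
            new_scores.append((1 - d) + d * rank)
        # Stop when every node's score change is below the threshold
        if max(abs(a - b) for a, b in zip(new_scores, scores)) < eps:
            return new_scores
        scores = new_scores

The highest-ranked sentences are then taken as the summary.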
After getting a grasp on TextRank, let's review the code implementation using the sumy library.
# ... previously imported packages
from sumy.summarizers.text_rank import TextRankSummarizer
def summarize_textrank(article: str, sentence_count: int) -> str:
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    summarizerTR = TextRankSummarizer(Stemmer('english'))
    summarizerTR.stop_words = get_stop_words('english')
    textrank_summary = summarizerTR(parser.document, sentences_count=sentence_count)
    return ' '.join([str(sentence) for sentence in textrank_summary])

textrank_prediction = summarize_textrank(first_entry['article'], sentence_count=2)
evaluate_summary(textrank_prediction, first_entry['highlights'])
With the following ROUGE scores:
ROUGE1: 0.43 | ROUGE2: 0.32 | ROUGE_L: 0.43 | AVG: 0.39
Once again, observe the highlighted words from both summaries:
The main advantage of the TextRank algorithm is that it is unsupervised and does not require a training corpus; you can also use it with texts in any language. The drawback lies in the lack of semantic similarity detection: certain words may be important in context but won't be detected, because TextRank doesn't differentiate synonyms. Furthermore, the results depend on the choice of the sentence similarity function.
Latent semantic analysis application to auto-summarization
The third algorithm utilizes latent semantic analysis, a multipurpose natural language processing technique. A document is represented as a matrix, and singular value decomposition is performed to extract the underlying relations between the document and the terms it contains.
At first, we select unique words from the entire document. Then, the algorithm constructs the matrix $A$ of size $m \times n$, where $m$ is the number of unique words, $n$ is the number of sentences, and the element $a_{ij}$ shows the importance of word $i$ in sentence $j$. Each column of $A$ is a sentence, and the rows represent the importance measure for each unique word in relation to the sentence. The original algorithm uses the weighted term frequency as the word's importance measure. $A$ will be sparse, since no single sentence contains every word of the document.
In the next step, the singular value decomposition is performed on $A$:

$$A = U \Sigma V^T$$

$U$ contains the row vectors $u_i = [u_{i1}, u_{i2}, \dots, u_{ir}]$, where $u_i$ is a row in the matrix $U$ and $r$ is the singular vector space dimensionality; $\Sigma$ is a diagonal matrix with the singular values sorted in descending order; and $V^T$ is the matrix where each sentence is represented by the column vector $v_j = [v_{1j}, v_{2j}, \dots, v_{rj}]$.
SVD creates a mapping between the words-by-sentences matrix and the extracted concepts in the singular vector space. A concept can be thought of as a pattern of words or phrases that appear in similar contexts; for example, the words 'car' and 'bus' will fall under the concept of vehicle. $V^T$ is row-sorted in descending order by the importance of the extracted concepts and used in the sentence selection, with its cells representing the relation between the concepts and the sentences. In the original implementation, the sentences are chosen from the most significant extracted concepts until the resulting summary contains a pre-defined number of sentences.
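A bare-bones version of this pipeline with numpy; note that it uses binary term weights instead of the weighted term frequency and ranks sentences only by the single most significant concept, both simplifications made for illustration:

import numpy as np

def lsa_sentence_ranks(sentences: list) -> list:
    # Build the term-by-sentence matrix A: rows are unique words, columns are sentences
    vocab = sorted({w.lower() for s in sentences for w in s.split()})
    word_index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.split():
            A[word_index[w.lower()], j] = 1.0  # binary weight as a simplification

    # SVD: rows of Vt correspond to concepts, columns to sentences
    _, _, Vt = np.linalg.svd(A, full_matrices=False)

    # Weight of each sentence in the most significant concept (first row of Vt)
    return np.abs(Vt[0]).tolist()

Sentences with the largest values in the returned list are the first candidates for the summary.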
Let's dive into the implementation:
# ... previously imported packages
from sumy.summarizers.lsa import LsaSummarizer
def summarize_lsa(article: str, sentence_count: int) -> str:
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    summarizerLSA = LsaSummarizer(Stemmer('english'))
    summarizerLSA.stop_words = get_stop_words('english')
    lsa_summary = summarizerLSA(parser.document, sentences_count=sentence_count)
    return ' '.join([str(sentence) for sentence in lsa_summary])

lsa_prediction = summarize_lsa(first_entry['article'], sentence_count=2)
evaluate_summary(lsa_prediction, first_entry['highlights'])
The evaluation results:
ROUGE1: 0.12 | ROUGE2: 0.0 | ROUGE_L: 0.12 | AVG: 0.08
Looking at the summary word match, we can see why such low scores are obtained:
LSA's application comes with several limitations. For large input documents, performance degrades because of the cost of the SVD computation. In addition, the results are difficult to interpret. This approach, however limited it may be, is able to detect synonyms and polysemous words (words with multiple meanings).
Conclusion
In this topic, we've reviewed a few fundamental solutions in extractive text summarization. All the described approaches have the advantage of being language-independent and unsupervised, but each method has its limitations discussed in later research. Once again:
Luhn's algorithm is simple to implement, but it has limited accuracy, lacks synonym detection, and is prone to redundancy;
The TextRank algorithm does not require a training corpus; you can use it with texts in any language. However, it lacks semantic similarity detection;
LSA may be slow due to the SVD calculation, and the results are difficult to interpret. The big plus is that it can detect synonyms and polysemous words.
You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.