
Vast amounts of textual data are generated every day, which creates the need to automate the extraction of the main points of a text as a way to navigate through the information. Summarization refers to shortening the original document to no more than half of its original size while preserving the key points in the text. The history of this process dates back to the 1950s and simple techniques such as term frequency, while the most recent developments rely mainly on deep learning approaches.

Types of text summarization

We can consider three branches of summarization by type:

  • Domain-independent (generic) summarization that condenses the original document to the most important points;
  • Query-based summarization aims to extract the answer to a question from the document — for example, this is a feature of many search engines that preview the page's contents that match the keywords of the search;
  • Domain-specific summarization takes field knowledge, such as medical or legal terminology, into account to follow the conventions of a particular domain.

Dividing by the input type, single-document summarization deals with a single text unit, while multi-document summarization processes multiple text sources on the same topic. The latter poses challenges since it requires the comprehension of multiple, at times conflicting, viewpoints.

There are two approaches to the task of text summarization: extractive and abstractive summarization. Extractive summarization (ES) builds the output from the most important sentences present in the original document. The abstractive approach (ABS) is closer to how humans summarize: it captures the underlying ideas of the text, and instead of being tied to the vocabulary of the original document, it paraphrases and generates a shortened version of its interpretation.

On automatic summary evaluation

As with any machine learning task, a question arises — how do we evaluate the predictions? For summarization, there are human and automatic evaluations. Automatic evaluations are preferable since they are faster and cheaper than human evaluation, making it possible to quickly estimate the model quality.

The criteria considered by human experts during evaluation include the following:

  • The lack of redundancy
  • Grammaticality
  • The preservation of the most important points
  • Structure and coherence
  • Factual correctness

Text summarization is one of the text generation (TG) tasks, so it shares the automatic TG evaluation metrics, most notably BLEU and ROUGE. However, because BLEU and ROUGE only consider n-gram matches, BERTScore, which compares contextual embeddings instead of surface tokens, might be a more informative fit for the ABS models.
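To make the contrast concrete, here is a minimal sketch using the HuggingFace evaluate library (introduced later in this topic). It assumes the evaluate, rouge_score, and bert_score packages are installed; the example strings are invented for illustration.

```python
import evaluate

# A paraphrase that shares few exact n-grams with the reference.
predictions = ["the weather will stay sunny all week"]
references = ["sunshine is expected for the whole week"]

# ROUGE only rewards exact n-gram overlap, so the score stays low here.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# BERTScore compares contextual embeddings, so it can credit a paraphrase
# even when the surface vocabulary differs from the reference.
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```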

Extractive summarization

ES methods fall into four main families: graph-based, topic-based, statistical, and machine learning approaches.

  • Graph-based algorithms such as LexRank or TextRank represent the text as a graph and then rank the importance of the sentences based on their connections within this graph structure (see the sketch after this list);
  • Machine learning approaches turn ES into a binary classification problem and use algorithms like SVMs or neural networks to extract meaningful sentences based on patterns learned from the training data, which consists of document and extractive summary pairs;
  • Frequency-driven (statistical) approaches choose significant sentences depending on their statistical properties, such as frequency or location;
  • Topic-based methods, like Latent Semantic Analysis, identify central topics and extract sentences based on how strongly each sentence relates to a certain topic.
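Below is a simplified sketch in the spirit of TextRank and LexRank, not a faithful reimplementation of either: sentences are embedded as TF-IDF vectors, connected by cosine similarity, and ranked with PageRank. It assumes scikit-learn and networkx are installed and splits sentences naively on periods.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summary(text: str, n_sentences: int = 2) -> str:
    # Naive sentence splitting to keep the sketch dependency-free.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    # Represent each sentence as a TF-IDF vector.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Build a fully connected graph weighted by pairwise cosine similarity.
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    # PageRank scores each sentence by its centrality in the graph.
    scores = nx.pagerank(graph)
    ranked = sorted(range(len(sentences)), key=scores.get, reverse=True)
    # Keep the top-ranked sentences in their original order.
    return ". ".join(sentences[i] for i in sorted(ranked[:n_sentences])) + "."
```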

While ES is capable of producing a compressed version of the source, it suffers from several drawbacks. Because the sentences and phrases are extracted verbatim from the source document, the resulting summary may lack coherence and readability. ES may also miss nuanced or implicit meaning, and it struggles to produce a good summary when the relevant information is spread across multiple sections of the input text.

Abstractive summarization: an overview

Some of the early (pre-neural) ABS techniques include sentence compression, where a shorter grammatical version of a sentence is created, and sentence revision, which synthesizes information across sentences and generates new ones. Another pre-neural method is template-based summarization. It relies on the observation that human summaries of documents of a given type share a common sentence structure. These structures are encoded as templates learned from a training set of summary pairs, and a summary is generated by filling the gaps in the template for a particular document type.

Neural methods provide an end-to-end approach to ABS, where the entire pipeline of the classical methods is replaced by a single network. The majority of the latest ABS models are sequence-to-sequence (seq2seq): they contain an encoder, where sentences are encoded as a list of fixed-length vector representations that capture the words together with their contexts, and a decoder that outputs a summary based on the encoded vectors. The model is then trained on document-summary pairs to maximize the probability of generating the correct summary.
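In equation form (assuming the standard token-level cross-entropy objective; the exact loss varies between models), for a document x and a reference summary y, training maximizes the log-likelihood of each summary token given the document and the preceding tokens:

\[
\max_{\theta} \sum_{(x,\, y)} \sum_{t=1}^{|y|} \log p_{\theta}\left(y_t \mid y_{<t},\ x\right)
\]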

State-of-the-art solutions

Specific improvements to the encoder-decoder architecture have been made, for example, the addition of the attention mechanism, one of the key ideas behind the transformer models, which have been extended to solve ABS tasks. Transformers fine-tuned on the ABS task, namely PEGASUS, BART, and T5, among others, have been shown to produce high-quality summaries.

The main idea of T5 (Text-to-Text Transfer Transformer) is treating all NLP tasks as text2text problems, where a text is taken as input and a text prediction is returned. This allows the same model, hyperparameters, and loss function to be used on a variety of NLP tasks. The most important takeaway about T5 is that the model is pre-trained on a huge amount of unlabeled data and later fine-tuned on labeled data for a specific task (such as summarization, classification, or question answering). In the text2text approach, both the encoder and the decoder networks are present.
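As a sketch of the text2text idea in practice, the snippet below feeds a document to the publicly available t5-small checkpoint with the task prefix "summarize: ", which tells T5 which task to perform. It assumes the transformers library (with PyTorch) is installed; the document string is a placeholder.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

document = "Your long document goes here..."  # placeholder text

# T5 selects the task through a text prefix: "summarize: " triggers summarization.
inputs = tokenizer("summarize: " + document,
                   return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(**inputs, max_length=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```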

Human evaluation shows how closely a model approaches human performance. Since 2019, the year PEGASUS, T5, and BART were introduced, there have been further improvements, but the majority of them use those models as building blocks. At this point, automatic abstractive summarization systems can generate outputs close to human summaries, with the PEGASUS paper mentioning that the expert evaluators did not consistently prefer human-generated summaries over machine-generated ones.

Challenges of the automatic summary generation

Advances in deep learning have boosted the development of automatic text summarization and introduced models that can produce a summary indistinguishable from a human-written one. However, several issues are currently present:

  • Multi-document summarization — the majority of the research in the area focuses on single-document summarization, and there are only a few systems for multi-document summarization;
  • Factual errors — the system generates a summary with entities that weren't present in the original document (a problem known as entity hallucination); alternatively, the entities are present in the source, but the relations stated in the summary are not;
  • Automatic evaluation — the currently widespread evaluation metrics suffer from certain shortcomings, such as indifference towards proper ordering or a lack of paraphrase detection, and are known to have limited correlation with human judgment.

Moreover, a large amount of work focuses on the news summarization domain — news articles have a predictable structure, and news datasets happen to be the largest available category for training the models. This might hinder the ability to extend the system to other types of documents, such as conversation transcripts.

Implementation overview

There are multiple Python packages available for the summarization tasks. For extractive summarization, the Sumy package covers some of the most popular classical approaches, has a built-in parser for both plain text and web pages, and supports numerous natural languages.
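For instance, a LexRank summary with Sumy takes only a few lines. The sketch below assumes the sumy package and the NLTK tokenizer data it relies on are installed; the input string is a placeholder.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "Your document goes here..."  # placeholder text

# Parse plain text; Sumy also provides HtmlParser for web pages.
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

# Print the two most salient sentences.
for sentence in summarizer(parser.document, sentences_count=2):
    print(sentence)
```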

HuggingFace is an open-source AI platform that contains pre-trained models for a variety of tasks, including text summarization. Since a great number of abstractive methods revolve around transformers, it's possible to perform ABS in a couple of lines of code by using a pre-trained model from the transformers library; you can access such models via the 'Summarization' tag on the HF hub. HuggingFace also has a module for automatic evaluation, evaluate, with implementations of the metrics suitable for assessing a summarization model's performance. The datasets library allows loading and using existing datasets for many problems, including text summarization; you can find out more about the available datasets on the dedicated HuggingFace page.
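As an illustration of that "couple of lines" workflow, here is a sketch that assumes the transformers and datasets libraries are installed and uses the public facebook/bart-large-cnn checkpoint and the cnn_dailymail dataset.

```python
from datasets import load_dataset
from transformers import pipeline

# Load a public summarization dataset and a pre-trained abstractive model.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = dataset[0]["article"]
# truncation=True guards against articles longer than the model's input limit.
result = summarizer(article, max_length=130, min_length=30,
                    do_sample=False, truncation=True)
print(result[0]["summary_text"])
```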

Besides the HuggingFace evaluation module, there are other packages for automatic performance evaluation — for example, torchmetrics (which has fewer metrics than the evaluate package) or a standalone ROUGE score package — and the choice can be based on convenience or metric availability in a particular package.
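As an example, the standalone rouge-score package can be used as follows; the strings are invented for illustration.

```python
from rouge_score import rouge_scorer

# Score a candidate summary against a reference with ROUGE-1 and ROUGE-L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat was found under the bed",  # reference
                      "the cat was under the bed")        # candidate
print(scores)  # precision, recall, and F-measure for each ROUGE variant
```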

Conclusion

In this topic, we reviewed possible ways to classify extractive and abstractive text summarization systems, along with some algorithms from each category. We introduced the human-defined criteria for summary quality and the main automatic evaluation metrics, looked briefly at three state-of-the-art models, and discussed some current challenges and the packages available for the automatic summarization task.
