
Transformer models for text summarization


Text summarization is one of the most popular tasks in NLP. If you have long lectures or lengthy books to read, you can use text summarization to produce short texts that capture the key information. Today, we will learn the differences between three state-of-the-art models for text summarization: T5, BART, and PEGASUS. We will also cover how to do text summarization by prompting an LLM. To compare the results of these models, we will use a text about the history of coffee. The text can be found at the link.

BART

BART (Bidirectional and Auto-Regressive Transformers) is one of the state-of-the-art language models. Let's cover its most important components:

  1. Architecture: BART is based on the Transformer architecture, combining bidirectional and autoregressive components.

  2. Pretraining Objective: BART uses denoising autoencoding as its pretraining objective: it learns to reconstruct text corrupted by transformations such as token masking, token deletion, text infilling, sentence permutation, and document rotation (see the small illustration after this list).

  3. Bidirectional: BART's encoder reads the input bidirectionally, like BERT, while its decoder generates text autoregressively, like GPT. This lets it solve tasks that require both text understanding and text generation.

  4. Encoder-Decoder: BART separates the encoder and decoder, making it well-suited for tasks where the input and output may have different formats or lengths.
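
To build intuition for the denoising objective, here is a tiny conceptual illustration (not the actual pretraining code): the encoder receives a corrupted sentence, and the decoder learns to restore the original. The sentence and the <mask> placeholder are purely illustrative.

# Conceptual illustration of BART's denoising pretraining, not real training code.
# The encoder sees a corrupted sentence; the decoder learns to reconstruct the original.
original = "European coffee houses turned into vibrant hubs for societal interaction."
corrupted = "European coffee houses turned into <mask> for societal interaction."  # text infilling
print("encoder input :", corrupted)
print("decoder target:", original)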

Let's summarize our text using the BART model introduced in the original paper and pre-trained on English data. To do this, we will use the Transformers library. First, we need to install it with the !pip install transformers command. Then, we will create a pipeline for text summarization, passing the model identifier from the Hugging Face Hub.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
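
The snippets below pass a variable named ARTICLE that holds the coffee text. As a minimal sketch, you could load it from a local file (the file name here is only an assumption):

# Load the coffee text into a variable; the file name is an assumption,
# replace it with wherever you saved the article.
with open("coffee_history.txt", encoding="utf-8") as f:
    ARTICLE = f.read()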

To apply text summarization, we call the created summarizer and set the maximum and minimum length of the summary with the max_length and min_length parameters.

print(summarizer(ARTICLE, max_length=80, min_length=30, do_sample=False))

# Coffee's discovery can be attributed to Kaldi, an Ethiopian goat herder in the 9th century. By the 15th century, coffee had firmly established itself within Arabian culture. European coffee houses, characterized by intellectual discourse, turned into vibrant hubs for societal interaction.

T5

T5 (Text-To-Text Transfer Transformer) is a transformer-based model built on the idea that every NLP task can be cast as a text-to-text task.

  1. Pretraining Objective: The pretraining process involves two types of training: supervised and self-supervised. During supervised training, downstream tasks from the GLUE and SuperGLUE benchmarks are used and converted into text-to-text tasks, as explained before. On the other hand, self-supervised training is done using corrupted tokens. This is achieved by randomly removing 15% of the tokens and replacing them with individual sentinel tokens. The encoder takes the corrupted sentence as input, while the decoder takes the original sentence as input. The target is then the dropped-out tokens, delimited by their sentinel tokens.

  2. Task Agnostic: T5 is task-agnostic, meaning it can be fine-tuned for various NLP tasks, including text summarization, by simply framing the task as a text-to-text problem (see the sketch after this list).

  3. Encoder-Decoder: T5 uses a single encoder-decoder model for all tasks and can handle different sequence lengths for the input and output.
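
As a minimal sketch of the text-to-text framing, the original T5 checkpoints expect a task prefix such as "summarize: " in front of the input; the t5-small checkpoint below is used purely for illustration.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Minimal sketch of T5's text-to-text framing; "t5-small" is only an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is expressed as a text prefix in front of the input.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=80, min_length=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))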

We will use the pszemraj/long-t5-tglobal-base-16384-book-summary model, a long-input T5 variant fine-tuned on a book summarization dataset, to summarize our example.

from transformers import pipeline

summarizer = pipeline("summarization", model="pszemraj/long-t5-tglobal-base-16384-book-summary")
print(summarizer(ARTICLE, do_sample=False))

#'The narrator introduces us to the history of coffee, explaining how it came to be so popular in the United States and how it became the drink we now enjoy. He also gives us a detailed description of some of the most popular coffee-styles, such as the drip coffee, the Frenchpress, and the Turkish coffee.'

Notice that the T5 model describes the text from an outside perspective, referring to "the narrator", while the BART model summarizes the content directly.

PEGASUS

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is a model trained explicitly for abstractive text summarization.

  1. Architecture: PEGASUS is also based on the Transformer architecture but with some modifications for abstractive text summarization.

  2. Pretraining Objective: PEGASUS uses a gap-sentence generation task: whole sentences are removed from the encoder input and replaced by a special mask token, and the decoder must generate the missing sentences. The decoder uses a causal mask to hide future words, like a regular auto-regressive transformer decoder (see the illustration after this list).

  3. Abstractive Summarization Focus: PEGASUS is designed explicitly for abstractive text summarization, where it generates summaries that may not be present verbatim in the source text.

  4. Sentence-Level Masking: PEGASUS uses sentence-level masking during pretraining, which helps it focus on sentence-level information when generating summaries.
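
Here is a tiny conceptual illustration of gap-sentence generation (not the actual PEGASUS pretraining code); the sentences, the mask token, and the choice of which sentence to mask are all illustrative.

# Conceptual illustration of gap-sentence generation, not real pretraining code.
document = [
    "Coffee originated in East Africa.",
    "By the 15th century it had spread to Arabia.",
    "European coffee houses became social hubs.",
]
masked_idx = {1}  # sentence(s) selected for masking, e.g. by importance
encoder_input = " ".join(
    "<mask_1>" if i in masked_idx else s for i, s in enumerate(document)
)
decoder_target = " ".join(s for i, s in enumerate(document) if i in masked_idx)
print("encoder input :", encoder_input)
print("decoder target:", decoder_target)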

To use PEGASUS, we must also install the sentencepiece library with the !pip install sentencepiece command, as the model relies on SentencePiece tokenization. We will use the google/pegasus-cnn_dailymail model as follows.

from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-cnn_dailymail")
print(summarizer(ARTICLE))

#Coffee boasts origins dating back centuries .<n>The coffee plant, scientifically known as Coffea, originally thrived in East Africa .<n>Today, the coffee landscape is as diverse as its history .

Notice that this summary is noticeably shorter than those produced by the other models.
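
The <n> tokens in the output are the sentence separators this checkpoint uses in place of newlines; a small post-processing step can restore them (a minimal sketch, reusing the summarizer defined above):

# Replace the checkpoint's "<n>" sentence separators with real newlines.
result = summarizer(ARTICLE)
print(result[0]["summary_text"].replace("<n>", "\n"))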

Prompting

Nowadays, hundreds of large language models can be steered with prompting, and you can use any of them for text summarization. Let's create a prompt for ChatGPT to perform the summarization. We will add the requirement that the summary should be concise. In addition, we can ask it to list all the types of coffee mentioned in the text, something we could not specify with the previous models.

prompt = f"""Your task is to generate a short summary of a written story about coffee. Please
make the summary consice, informative and short. Do not use less than 30 words and do not write more than 80 words. In your summary list all the types of coffee from the initial text.
"""

You can use either the OpenAI API or the ChatGPT interface. The output for this prompt will be: "Coffee, with origins dating to Ethiopian goat herder Kaldi, which has a diverse history. It evolved from Arabian qahveh khaneh to European intellectual hubs. Today, coffee offers many types: espresso, cappuccino, latte, drip coffee, French press, cold brew, Turkish coffee, mocha, and macchiato. Each caters to different tastes and preferences."
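
If you prefer the API route, a minimal sketch of the call could look like this (it assumes the openai Python package is installed, an API key is configured in the environment, and the model name is only an example):

from openai import OpenAI

# Minimal sketch: assumes OPENAI_API_KEY is set in the environment;
# the model name is only an example.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)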

Conclusion

In this topic, we have learned about several state-of-the-art transformer-based models for text summarization. Although these models share the transformer architecture, they differ in their pre-training strategies and in the summaries they produce. It is essential to choose the most suitable model for your specific case, considering the model's size and training dataset.
