
Text simplification allows a wider range of people to read professional and scientific texts. Some texts are written in a specific style with many domain terms, so most people cannot understand them. Think of legal texts: generally, only lawyers can read them with ease, so text simplification is a useful tool for everyone else. It is also relevant for learners of foreign languages. If your level of English is A2, but you want to read a Dickens novel in the original, you can simplify it to the level you need. The list of fields where Text Simplification is useful is immense.

What is text simplification?

Text Simplification is an NLP task of transforming a rather complicated text (written for a narrow audience) into a text that anyone (or almost anyone) can read. Usually, this means turning a text intended for a limited audience into plain, common language. For example, you can simplify a scientific article on biomedicine so that more people can learn about the latest innovations in medicine and pharmacy.

Another example is legal text simplification. Legal documents are usually written in a specific style and contain many terms understandable only to a small group of people. For the majority of readers, legal texts are incomprehensible. Usually, you would go to a lawyer to better understand a legal article. With legal text simplification, you won't need to: you can just read a simplified version. However, keep in mind that a simplified version may lack some crucial details, so read simplified texts with caution.

Text Summarization is another NLP task that is often confused with simplification. Its goal is to produce a brief summary of the main ideas of a text, while text simplification aims to reduce the linguistic complexity of a text and retain the original meaning. For example, the Hugging Face model search filters offer a Text Summarization tag, but not a Text Simplification one; many Text Simplification models carry the Text2Text Generation tag instead.

Text Simplification improves the readability of sentences through several rewriting transformations, such as paraphrasing, deletion, and splitting. Paraphrasing is a task of changing a complicated token or phrase into a simpler one; it is a subtask of text simplification.

You also need to distinguish Sentence Simplification from Text Simplification, as well as Sentence Summarization from Text Summarization. Sentence Simplification means you simplify a text sentence by sentence. Thus, if you have 600 sentences in the input, you will get nearly the same number of sentences in the output (though some very short sentences may be skipped). Text Simplification means that you simplify the whole text at once: with 600 sentences in the input, the model may compress it into a 30-sentence text, where some sentences are deleted, some are combined, and some are reduced to clauses, phrases, or single words.

In the following sections of this topic, you will become familiar with both Text Simplification and Sentence Simplification, since they rely on the same techniques.

Main approaches to text/sentence simplification

The main approaches to text/sentence simplification include:

  • Seq2Seq modeling,

  • Edit-based Text/Sentence Simplification,

  • Lexical Simplification (Paraphrase),

  • Structural sentence simplification: a rule-based method that builds on two main techniques: (1) Split and Rephrase and (2) deleting complicated linguistic structures. This method can be used only for sentence simplification!

  • Back-translation: first translating a text from language A into language B, and then translating the result back into language A (see the sketch after this list).
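As a toy illustration of back-translation, here is a minimal sketch using the publicly available Helsinki-NLP MarianMT translation models on Hugging Face (the model choice is just an example; the round trip often yields a simpler paraphrase, but this is not guaranteed):

from transformers import pipeline

# English -> German -> English round trip as a crude paraphraser.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

text = "Soldiers garrison the city."
german = to_de(text)[0]["translation_text"]
print(to_en(german)[0]["translation_text"])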

Many researchers also combine different methods, using hybrid approaches.

Evaluation metrics for Text/Sentence Simplification include ROUGE (the most common metric for simplification and summarization tasks), SARI (the best fit for edit-based simplification), and SAMSA. The level of text simplicity can also be checked with a readability index. Readability indexes are mathematical formulas that assess text comprehensibility; a small code sketch computing one of them follows the list below. Such indexes include:

  • Flesch-Kincaid Readability Test: its principle is that the fewer words per sentence and the shorter the words, the simpler the text. You can calculate this index with the following formula: $0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59$.

  • Coleman-Liau Readability Test: the calculation takes into account the average number of letters per word and the average number of words per sentence: $\text{CLI} = 0.0588L - 0.296S - 15.8$, where $L$ is the average number of letters per 100 words, and $S$ is the average number of sentences per 100 words.

  • SMOG grade: the idea behind it is that the complexity of a text is driven mostly by complex words, defined here as words with many syllables. The more syllables, the more complex the word. This index has the following formula: $\text{grade} = 1.0430 \sqrt{\frac{\text{number of polysyllables} \times 30}{\text{number of sentences}}} + 3.1291$.

  • Among other indexes are Automated Readability Index, Fry Readability formula, Gunning Fog index, FORCAST, etc.
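As a quick illustration, here is a minimal sketch of the Flesch-Kincaid grade formula with a naive vowel-group syllable counter (libraries such as textstat implement these indexes more carefully):

import re

def count_syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade("Soldiers watch over the city. The text is simple."), 2))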

Readability indexes are not used only to evaluate simplified texts; they are applied to human-written texts too. For example, the US Social Security Administration has issued a special report on compliance with the requirements for plain, intelligible language. In particular, its employees use special software, StyleWriter, to help evaluate and simplify texts. Another example is the Virginia State Code, which requires a mandatory level of readability for all life and accident insurance contracts, as well as verification of their readability with the Flesch-Kincaid formula (Code of Virginia, Title 38.2).

Please note that typical readability-index values depend on the language itself. For example, Hungarian is grammatically more complicated than German, so readability indexes for Hungarian texts will always be somewhat higher than for German ones.

Lexical simplification

Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning. Popular LS systems still predominantly use a set of rules for substituting complex words with their frequent synonyms from carefully handcrafted databases (for example, WordNet). Alternatively, these LS systems may automatically induce synonyms for complex words from comparable corpora or paraphrase databases.
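As a rough sketch of the database-lookup step, here is how you could collect substitution candidates from WordNet with NLTK (a real LS system would then filter and rank the candidates by simplicity, for example by word frequency):

from nltk.corpus import wordnet  # requires nltk.download('wordnet') once

def wordnet_candidates(word):
    # Collect synonyms of the word across all of its WordNet synsets.
    candidates = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                candidates.add(name)
    return sorted(candidates)

print(wordnet_candidates("compose"))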

LSBert is one of the systems for lexical simplification. LSBert finds complex words and generates substitutions for them. It's a relatively simple model, since some stages of classical lexical simplification are omitted (for example, morphological transformations). However, it's very effective, because it is able to simplify a text in one pass. In addition to selecting a complex word and replacing it, this model uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The former allows predicting a masked token by analyzing the context to the right and left of the word. The latter does something similar, but only with the context to the right of the word.
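To get a feel for the MLM-based substitution idea, here is a minimal sketch with the generic bert-base-uncased fill-mask pipeline (this is not the full LSBert system, which adds candidate filtering and ranking on top):

from transformers import pipeline

# Generic BERT masked-language model; LSBert builds on this idea.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "John composed these verses."
complex_word = "composed"
masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token)

# Top substitution candidates predicted from the surrounding context.
for candidate in fill_mask(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))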

Some other models include Glavas, REC-LS (a Recursive Context-Aware Lexical Simplification model), and Paetzold-NE. However, none of them are available as Python libraries or HF models.

Here is a short comparison of how LSBert, Glavas, and REC-LS work. Given the sentence "John composed these verses", LSBert will substitute "composed" and "verses" like this: "John wrote these poems". Glavas, on the other hand, will simplify the same text as "John comprised these psalm", and REC-LS will produce "John framed these writings". This example illustrates that LSBert works better than Glavas and REC-LS.

Edit-based text simplification

Edit-based text simplification is an approach that simplifies a text by editing tokens or n-grams. This technique is very similar to lexical simplification; in fact, it is a more elaborate version of it. It can be applied to both Sentence and Text Simplification tasks.

The EditNTS model is an example of an edit-based sentence simplification model. The model suggests one of the following actions for each n-gram:

  • ADD (i.e., add another token so that the meaning is more comprehensible),

  • KEEP (i.e., leave the n-gram as it is, with no corrections),

  • DELETE (i.e., delete the n-gram),

  • STOP (i.e., the end of the sentence).

Let's see how it works on an example sentence:

Soldiers garrison the city.

EditNTS will suggest KEEP for all unigrams except garrison. For garrison, it will suggest three actions: DELETE garrison, ADD watch, ADD over. In the end, you will get a modified sentence:

Soldiers watch over the city.
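Here is a toy sketch of applying such an edit program to a token sequence (this is just an illustration of the output side; EditNTS itself learns to predict these programs with a neural model):

def apply_edit_program(tokens, program):
    # program: list of (ACTION, argument) pairs; KEEP and DELETE consume
    # one input token, ADD inserts a new token, STOP ends the sentence.
    output, i = [], 0
    for action, arg in program:
        if action == "KEEP":
            output.append(tokens[i])
            i += 1
        elif action == "DELETE":
            i += 1
        elif action == "ADD":
            output.append(arg)
        elif action == "STOP":
            break
    return " ".join(output)

tokens = ["Soldiers", "garrison", "the", "city", "."]
program = [("KEEP", None), ("DELETE", None), ("ADD", "watch"), ("ADD", "over"),
           ("KEEP", None), ("KEEP", None), ("KEEP", None), ("STOP", None)]
print(apply_edit_program(tokens, program))  # Soldiers watch over the city .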

LaserTagger is a similar text simplification model. It supports four different edit operations:

  • KEEP,

  • DELETE,

  • ADD,

  • SWAP (if there are two tokens).

The model supports two dataset formats, WikiSplit and DiscoFuse, so you need to transform your dataset into one of them or use the default WikiSplit dataset.

Another edit-based model is FELIX. FELIX supports two operations: tagging and insertion. FELIX shows much better results than LaserTagger. Among similar models, you can also look at GECToR and TST.

Unfortunately, none of them are available as Python libraries or HF models. To train them, you would need to work with the code in their GitHub repositories.

Seq2seq

The most popular way of simplifying texts or sentences is Seq2Seq (Sequence to Sequence). The main advantage of these models is that many of them are available on the Hugging Face Hub. Beyond HF, Seq2Seq implementations are also available in their original GitHub repositories and in the PyTorch library.

Seq2Seq transforms one sequence into another with the help of a recurrent neural network (RNN) or, more often, an LSTM or GRU to avoid the vanishing gradient problem. The main components of such models are an encoder, a decoder, and an attention layer. BART and T5 can be classified as Seq2Seq models. The basic architecture is illustrated below:

The Seq2Seq architecture overview

Each rectangle in the picture above represents an RNN cell, usually a GRU (gated recurrent unit) cell or an LSTM (long short-term memory) cell. Encoders and decoders can share weights or, more often, use different sets of parameters. Most Seq2Seq models have multilayer cells.
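Below is a minimal encoder-decoder sketch with GRU cells in PyTorch, just to show the two components wired together (no attention layer, toy vocabulary sizes; real simplification models such as BART or T5 are Transformer-based and far larger):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        _, hidden = self.rnn(self.embed(src))
        return hidden  # final hidden state summarizes the complex input

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden  # logits over the simplified vocabulary

encoder, decoder = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
src = torch.randint(0, 1000, (1, 12))  # a "complex" sentence as token ids
tgt = torch.randint(0, 1000, (1, 8))   # a shorter "simple" target sentence
logits, _ = decoder(tgt, encoder(src))
print(logits.shape)  # torch.Size([1, 8, 1000])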

Some of the Seq2Seq models for simplification outside of Hugging Face and PyTorch are:

  • Tensor2Tensor,

  • FairSeq, which provides BART Seq2Seq models for Text Simplification and Machine Translation.

Seq2Seq models are widely represented for both Text and Sentence Simplification tasks. Beyond plain Seq2Seq, there are also ML+RL Seq2Seq models, which combine maximum-likelihood training with reinforcement learning.

You can learn more about the Seq2Seq models in this chapter on Hugging Face.

English corpora for simplification tasks

Naturally, different text simplification methods require different datasets. Seq2Seq models assume that you have a parallel corpus, i.e., a dataset with two essential columns: complicated text and simplified text. Among the datasets available for text simplification are, for example, XSum and D-Wikipedia.
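For illustration, a parallel corpus can be as simple as a two-column table; the column names below are hypothetical, and real datasets contain thousands of pairs:

import pandas as pd

# A tiny, hand-made parallel corpus with one "complex" and one "simple" column.
corpus = pd.DataFrame({
    "complex": ["Soldiers garrison the city.",
                "The legislation was promulgated in 1998."],
    "simple":  ["Soldiers watch over the city.",
                "The law was published in 1998."],
})
print(corpus)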

If you are doing Sentence Simplification, then you need to have aligned corpora, which means that sentences from both columns should be aligned by their meaning. Such corpora include:

  • Newsela: this dataset contains news articles, each with four simplified versions produced manually by professional editors. Simplification is available at the corpus level, so it has to be processed further for sentence-level simplification. Newsela contains parallel simple-complex news articles across 11 grade levels.

  • WikiLarge.

  • Biendata.

  • Simple English Wikipedia dataset: it includes simplified versions of articles from English Wikipedia, and datasets are available with aligning "equivalent" sentences from the two sources to allow Seq2Seq model training.

  • TurkCorpus (crowdsourced via Amazon Mechanical Turk).

You can also align sentences on your own. The most popular sentence aligners are the Neural CRF Model and Fast-Champollion. Another way to implement sentence alignment is to use Natural Language Inference (see Logical constructions).

Lexical Simplification, as you could guess, needs a different type of dataset. Here you should use a semantically annotated corpus. The most useful LS datasets are those annotated manually, with each instance containing a sentence, a target complex word, and a set of suitable substitutions provided and ranked by humans with respect to their simplicity. Sometimes, using WordNet is also helpful. Here is a list of "classical" datasets for lexical simplification:

  • SemEval 2012 contains 2010 instances of simplicity rankings, and is considered a classic dataset for LS.

  • LexMTurk consists of 430 instances of simplicity rankings produced by 46 Amazon Mechanical "turkers" and 9 PhD students.

  • LSeval contains 500 instances of sentences from Wikipedia with target complex words and simpler substitutions suggested by 50 English-speaking turkers each.

  • BenchLS contains 929 instances and is a compilation of the LSeval and LexMTurk datasets, automatically corrected for spelling and inflection errors.

Implementation in Hugging Face

In Hugging Face (read this topic on how to use HF), you can simplify a text in just a few lines of code. You will be using a model for scientific text simplification. This model is a Seq2Seq Transformers model, so you will need the transformers library:

from transformers import pipeline

# The pipeline task is inferred automatically from the model on the Hub
simplifier = pipeline(model='haining/sas_baseline')

To check the model, you can use a slice from an academic article on NLP:

text = ['''Substantial evidence now shows that the differences between individuals in their empathy for one another are influenced by multiple genetic and environmental factors.\n''']

Now you can simplify it like this:

simplifier(text)


#  [{'generated_text': 'The question of whether empathy is genetic or environmental has long been a contentious one in psychology'}]

Conclusion

In this topic, you've learned what Text Simplification is and how it differs from Text Summarization. The topic has covered the most popular methods of Text and Sentence Simplification: Lexical Simplification, Edit-based Simplification, and, of course, Seq2Seq. Then, you have gone through some of the corpora that may be useful for this task. And finally, you have read how to use Seq2Seq models for Text Simplification in Hugging Face.
