Text simplification is an NLP task that consists of modifying an input text so that the output is more comprehensible and easier to read. SARI was developed as a text simplification metric that compares System output Against References and against the Input sentence, and measures the quality of words that are added, deleted, and kept by the system. It scores the output by measuring its overlap with the reference texts, and also tracks the changes the system makes relative to the input. Since text simplification is performed within a single language, the output can be compared directly against the input.
The addition operation
Addition operations, where an n-gram of the system output is not in the input but occurs in any of the human references, i.e. $O \cap \bar{I} \cap R$, are rewarded. N-gram precision and recall for addition are as follows:

$$p_{add}(n) = \frac{\sum_{g \in O} \min\big(\#_g(O \cap \bar{I}),\ \#_g(R)\big)}{\sum_{g \in O} \#_g(O \cap \bar{I})}$$

$$r_{add}(n) = \frac{\sum_{g \in O} \min\big(\#_g(O \cap \bar{I}),\ \#_g(R)\big)}{\sum_{g \in R} \#_g(R \cap \bar{I})}$$

where $\#_g(\cdot)$ is a binary (later fractional) indicator of the occurrence of the n-gram $g$ in a given set, and

$$\#_g(O \cap \bar{I}) = \max\big(\#_g(O) - \#_g(I),\ 0\big), \qquad \#_g(R \cap \bar{I}) = \max\big(\#_g(R) - \#_g(I),\ 0\big).$$

The binary indicator simply returns $1$ if a certain n-gram occurs in a set and $0$ if it doesn't. The addition scores capture whether important new information is added during simplification.
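To make the definitions concrete, here is a minimal Python sketch of the addition scores for unigrams ($n = 1$) with binary indicators. It is a sketch under simplifying assumptions (whitespace tokenization, unigrams only), and the name `add_scores` is illustrative rather than taken from any official implementation:

```python
def add_scores(input_sent: str, output_sent: str, references: list[str]) -> tuple[float, float]:
    """Unigram addition precision/recall with binary indicators (illustrative)."""
    I = set(input_sent.split())                             # unigrams of the input
    O = set(output_sent.split())                            # unigrams of the system output
    R = set(w for ref in references for w in ref.split())   # union over all references

    added_by_system = O - I          # unigrams with #_g(O ∩ ¬I) = 1
    added_by_refs = R - I            # unigrams with #_g(R ∩ ¬I) = 1
    confirmed = added_by_system & R  # additions that occur in some reference

    p_add = len(confirmed) / len(added_by_system) if added_by_system else 0.0
    r_add = len(confirmed) / len(added_by_refs) if added_by_refs else 0.0
    return p_add, r_add
```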
The keep operation
Words that are kept in both $O$ and $R$ are rewarded. If multiple references are used, the count of references in which an n-gram was retained is considered, since some words are considered simple and don't need to be simplified. $R'$ denotes the n-gram counts over $R$ with fractions: e.g. if a unigram occurs in 2 out of $r$ total references, then its count is weighted by $2/r$ in the computation of $p_{keep}(n)$ and $r_{keep}(n)$:

$$p_{keep}(n) = \frac{\sum_{g \in I} \min\big(\#_g(I \cap O),\ \#_g(I \cap R')\big)}{\sum_{g \in I} \#_g(I \cap O)}$$

$$r_{keep}(n) = \frac{\sum_{g \in I} \min\big(\#_g(I \cap O),\ \#_g(I \cap R')\big)}{\sum_{g \in I} \#_g(I \cap R')}$$

with

$$\#_g(I \cap O) = \min\big(\#_g(I),\ \#_g(O)\big), \qquad \#_g(I \cap R') = \min\big(\#_g(I),\ \#_g(R)/r\big).$$

The keep operation measures the proportion of words in $I$ that appear both in $O$ and in $R$.
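The keep scores can be sketched the same way. This illustrative `keep_scores` uses whitespace tokenization and unigrams only, with the fractional reference count computed as $\#_g(R)/r$:

```python
from collections import Counter

def keep_scores(input_sent: str, output_sent: str, references: list[str]) -> tuple[float, float]:
    """Unigram keep precision/recall with fractional reference counts (illustrative)."""
    r = len(references)
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    R = Counter(w for ref in references for w in ref.split())

    num = den_p = den_r = 0.0
    for g in I:                               # the sums run over the input n-grams
        kept_in_output = min(I[g], O[g])      # #_g(I ∩ O)
        kept_in_refs = min(I[g], R[g] / r)    # #_g(I ∩ R'), fractional count
        num += min(kept_in_output, kept_in_refs)
        den_p += kept_in_output
        den_r += kept_in_refs

    p_keep = num / den_p if den_p else 0.0
    r_keep = num / den_r if den_r else 0.0
    return p_keep, r_keep
```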
The deletion operation
Overdeleting hurts readability much more than not deleting enough, so only precision is used for deleting:
$$p_{del}(n) = \frac{\sum_{g \in I} \min\big(\#_g(I \cap \bar{O}),\ \#_g(I \cap \bar{R'})\big)}{\sum_{g \in I} \#_g(I \cap \bar{O})}$$

with

$$\#_g(I \cap \bar{O}) = \max\big(\#_g(I) - \#_g(O),\ 0\big), \qquad \#_g(I \cap \bar{R'}) = \max\big(\#_g(I) - \#_g(R)/r,\ 0\big).$$

The precision of what is kept also reflects the sufficiency of deletions. The definitions of both $R'$ and $r$ are the same as in the previous section. The n-gram counts in $R'$ are weighted to compensate for n-grams that, according to the human editors, don't require simplification. Deletion tells how much information is lost in the simplification. The usage of precision only stems from the notion that it's important to keep the essential information: losing the important fragments while keeping miscellaneous details is not desirable.
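A matching sketch for the deletion precision, under the same illustrative assumptions (whitespace tokenization, unigrams only, hypothetical function name):

```python
from collections import Counter

def del_precision(input_sent: str, output_sent: str, references: list[str]) -> float:
    """Unigram deletion precision with fractional reference counts (illustrative)."""
    r = len(references)
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    R = Counter(w for ref in references for w in ref.split())

    num = den = 0.0
    for g in I:
        deleted_by_system = max(I[g] - O[g], 0)    # #_g(I ∩ ¬O)
        deleted_by_refs = max(I[g] - R[g] / r, 0)  # #_g(I ∩ ¬R')
        num += min(deleted_by_system, deleted_by_refs)
        den += deleted_by_system

    return num / den if den else 0.0
```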
The SARI score
$P_{operation}$ and $R_{operation}$ are obtained by averaging over the n-gram orders up to the highest order $k$ (in the original implementation, $k = 4$), that is:

$$P_{operation} = \frac{1}{k} \sum_{n=1}^{k} p_{operation}(n), \qquad R_{operation} = \frac{1}{k} \sum_{n=1}^{k} r_{operation}(n)$$

with $operation \in \{add,\ keep,\ del\}$. As the formulas above show, the summation inside each operation score runs over the n-grams in which the output, the references, and the input differ. The F-score is calculated as usual:

$$F_{operation} = \frac{2 \times P_{operation} \times R_{operation}}{P_{operation} + R_{operation}}$$
Then, the SARI score is introduced as:

$$SARI = d_1 F_{add} + d_2 F_{keep} + d_3 P_{del}$$

with $d_1 = d_2 = d_3 = \frac{1}{3}$.
SARI scores lie in the $[0, 100]$ interval, with higher scores indicating better model performance. SARI has a high correlation with human judgement of simplicity; however, BLEU correlates better with human judgement of meaning preservation and grammaticality. This might be explained by BLEU's design as a bilingual machine translation evaluation metric: BLEU doesn't use recall, it assigns higher scores to outputs with fewer changes and lengths that mirror their reference lengths, and it doesn't account for the differences between the inputs and the references.
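Putting the pieces together, here is a minimal sketch of the final combination, assuming the per-order scores have already been computed (for instance with functions like the sketches above, extended to higher n-gram orders). The function names are illustrative, and the $\times 100$ scaling matches implementations that report scores out of 100:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

def sari_from_orders(p_add, r_add, p_keep, r_keep, p_del, k: int = 4) -> float:
    """Average per-order scores over n-gram orders 1..k, then combine them.

    Each argument is a list of k per-order scores, e.g. p_add[0] = p_add(1).
    """
    P_add, R_add = sum(p_add) / k, sum(r_add) / k
    P_keep, R_keep = sum(p_keep) / k, sum(r_keep) / k
    P_del = sum(p_del) / k
    # Equal weights d1 = d2 = d3 = 1/3, scaled to the [0, 100] range
    return 100 * (f1(P_add, R_add) + f1(P_keep, R_keep) + P_del) / 3
```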
Notation demystification
Let's perform the calculations for the addition precision and recall of the unigram 'now' on a synthetic example from the original SARI paper:
| | Sentence | Unigrams |
|---|---|---|
| INPUT ($I$) | About 95 species are currently accepted | 'About', '95', 'species', 'are', 'currently', 'accepted' |
| REF-1 ($R_1$) | About 95 species are currently known | 'About', '95', 'species', 'are', 'currently', 'known' |
| REF-2 ($R_2$) | About 95 species are now accepted | 'About', '95', 'species', 'are', 'now', 'accepted' |
| REF-3 ($R_3$) | 95 species are now accepted | '95', 'species', 'are', 'now', 'accepted' |
| OUTPUT ($O$) | About 95 species are now agreed | 'About', '95', 'species', 'are', 'now', 'agreed' |
Taken together, the unigrams of all three references form $R$ = {'95', 'About', 'accepted', 'are', 'currently', 'known', 'now', 'species'}. The union of all three references is used here because the indicator simply checks if the unigram occurs in any of the references ($1$ if it does, $0$ if it doesn't).
Let's start by calculating the indicators of the occurrence of the unigram 'now' in each set. 'now' doesn't appear in $I$ at all, and there is an occurrence of it in both $O$ and $R$:

$$\#_{now}(I) = 0, \qquad \#_{now}(O) = 1, \qquad \#_{now}(R) = 1$$

so that $\#_{now}(O \cap \bar{I}) = \max(1 - 0,\ 0) = 1$ and $\#_{now}(R \cap \bar{I}) = \max(1 - 0,\ 0) = 1$.
Then let's proceed to calculating the precision and recall for addition, restricted to the 'now' terms of the sums:

$$p_{add}(1) = \frac{\min\big(\#_{now}(O \cap \bar{I}),\ \#_{now}(R)\big)}{\#_{now}(O \cap \bar{I})} = \frac{\min(1, 1)}{1} = 1$$

$$r_{add}(1) = \frac{\min\big(\#_{now}(O \cap \bar{I}),\ \#_{now}(R)\big)}{\#_{now}(R \cap \bar{I})} = \frac{\min(1, 1)}{1} = 1$$
Now let's move on to calculating the keeping of the word 'About' in the output. Since in the keep operation the count of the kept n-grams matters, the number of occurrences of the specific n-gram in the references has to be counted. The word 'About' is present in REF-1 and REF-2, so $\#_{About}(R) = 2$ and the fractional count is $\#_{About}(R)/r = 2/3$:

$$\#_{About}(I \cap O) = \min\big(\#_{About}(I),\ \#_{About}(O)\big) = \min(1, 1) = 1$$

$$\#_{About}(I \cap R') = \min\big(\#_{About}(I),\ \#_{About}(R)/r\big) = \min(1,\ 2/3) = 2/3$$

Continuing the precision and recall calculation with the 'About' terms:

$$p_{keep}(1) = \frac{\min(1,\ 2/3)}{1} = \frac{2}{3}, \qquad r_{keep}(1) = \frac{\min(1,\ 2/3)}{2/3} = 1$$
The deletion operation is computed in a similar manner. Note that any of these computations can end up dividing by zero, e.g. when the output adds no n-grams at all; the TensorFlow and Hugging Face implementations handle this case explicitly by falling back to a score of 0.
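As a sanity check, the worked numbers above can be reproduced in a few lines of plain Python (variable names are illustrative):

```python
from collections import Counter

inp = "About 95 species are currently accepted"
out = "About 95 species are now agreed"
refs = ["About 95 species are currently known",
        "About 95 species are now accepted",
        "95 species are now accepted"]

I, O = Counter(inp.split()), Counter(out.split())
R = Counter(w for ref in refs for w in ref.split())
r = len(refs)

# Addition terms for 'now' (binary indicators)
added = min(max(O["now"] - I["now"], 0), 1)   # #_now(O ∩ ¬I) = 1
in_refs = min(R["now"], 1)                    # #_now(R) = 1
print(min(added, in_refs) / added)            # 'now' term of p_add(1): 1.0

# Keep terms for 'About' (fractional reference count 2/3)
kept_out = min(I["About"], O["About"])        # #_About(I ∩ O) = 1
kept_refs = min(I["About"], R["About"] / r)   # #_About(I ∩ R') = 2/3
print(min(kept_out, kept_refs) / kept_out)    # 'About' term of p_keep(1): 0.666...
print(min(kept_out, kept_refs) / kept_refs)   # 'About' term of r_keep(1): 1.0
```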
SARI usage with the evaluate module
Hugging Face provides the metric package evaluate, which offers a universal interface for evaluating model results. You can install it as follows:

```bash
pip install evaluate
```
SARI in evaluate requires three arguments to be passed: the inputs (under the sources key), the system outputs (under predictions), and the references (under references). A SARI score of 100 signifies a total match.
```python
import evaluate

# Load the SARI metric
sari = evaluate.load("sari")

sources = ["Some people really enjoy windowshopping."]
predictions = ["Some birds like the windows."]
references = [["Some people enjoy shopping.",
               "People like to browse the stores."]]

sari_score = sari.compute(sources=sources,
                          predictions=predictions,
                          references=references)
```
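At the time of writing, compute returns a dictionary with the score stored under the "sari" key, so the result above can be read as sari_score["sari"].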
Conclusion

In this topic, you learned about SARI, the main metric for text simplification evaluation today. Other metrics for the simplification task include BLEU and SAMSA, among others. SARI rewards systems that simplify their outputs and correlates highly with human judgement of simplicity.