
Text simplification is an NLP task that consists of modifying an input text so that the output is more comprehensible and easier to read. SARI was developed as a text simplification metric that compares the System output Against References and against the Input sentence, and measures the quality of words that are added, deleted, and kept by the system. It scores the output by measuring its overlap with the reference texts, and also tracks the edits made to the input. Since text simplification is performed within a single language, the output can be compared directly against the input.

The addition operation

The operations of addition, where an n-gram of the system output $O$ is not in the input $I$ but occurs in any of the human references $R$, i.e. $O \cap \overline{I} \cap R$, are rewarded. N-gram precision and recall for addition are as follows:

$$p_{add}(n) = \frac{\sum_{g \in O}\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\sum_{g \in O}\#_g(O \cap \overline{I})}$$

$$r_{add}(n) = \frac{\sum_{g \in O}\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\sum_{g \in O}\#_g(R \cap \overline{I})}$$

where $\#_{g}(\cdot)$ is a binary (later fractional) indicator of the occurrence of the n-gram $g$ in a given set, and

$$\#_g(O \cap \overline{I}) = \max(\#_g(O) - \#_g(I),\ 0), \quad \#_g(R \cap \overline{I}) = \max(\#_g(R) - \#_g(I),\ 0)$$

The binary indicator simply returns $1$ if a certain n-gram occurs in a set, and $0$ if it doesn't. The addition scores reward the insertion of important information during simplification.
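To make the formulas concrete, here is a minimal Python sketch of the unigram addition scores. The function and variable names are illustrative, not part of any library. Following the binary indicator above, counts are treated as sets; for the recall denominator, the sketch counts the n-grams added by the references, which is how common implementations read $\#_g(R \cap \overline{I})$:

def unigram_add_scores(input_sent, output_sent, references):
    # Binary unigram indicators: a word either occurs in a set or it doesn't
    I = set(input_sent.split())
    O = set(output_sent.split())
    R = set(g for ref in references for g in ref.split())

    added_by_system = O - I        # g with #g(O ∩ Ī) = 1
    added_by_refs = R - I          # g with #g(R ∩ Ī) = 1

    rewarded = len(added_by_system & R)   # additions that occur in some reference
    p_add = rewarded / len(added_by_system) if added_by_system else 0.0
    r_add = rewarded / len(added_by_refs) if added_by_refs else 0.0
    return p_add, r_add

# The sentences from the original SARI paper (see the table below):
p_add, r_add = unigram_add_scores(
    "About 95 species are currently accepted",
    "About 95 species are now agreed",
    ["About 95 species are currently known",
     "About 95 species are now accepted",
     "95 species are now accepted"],
)
print(p_add, r_add)   # 0.5 0.5: of the system additions {'now', 'agreed'} only
                      # 'now' is rewarded; the references added {'known', 'now'}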

The keep operation

Words that are kept in both $O$ and $R$ are rewarded. If multiple references are used, the count of references in which an n-gram was retained is considered, since some words are already simple and don't need to be simplified. $R'$ is the n-gram count over $R$ with fractions, e.g. if a unigram occurs in 2 out of the total $r$ references, then its count is weighted by $2/r$ in the computation of $p$ and $r$:

$$p_{keep}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\sum_{g \in I}\#_g(I \cap O)}$$

$$r_{keep}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\sum_{g \in I}\#_g(I \cap R')}$$

with

$$\#_g(I \cap O) = \min(\#_g(I),\ \#_g(O)), \quad \#_g(I \cap R') = \min(\#_g(I),\ \#_g(R)/r)$$

The keep operation measures the proportion of the input n-grams that are retained in both $O$ and $R$.
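A matching sketch for the unigram keep scores with the fractional $R'$ weighting; again a simplified illustration with made-up names, not the official implementation:

from collections import Counter

def unigram_keep_scores(input_sent, output_sent, references):
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    r = len(references)
    R = Counter(g for ref in references for g in ref.split())  # counts summed over all references

    num = den_p = den_r = 0.0
    for g in I:
        kept_by_system = min(I[g], O[g])      # #g(I ∩ O)
        kept_by_refs = min(I[g], R[g] / r)    # #g(I ∩ R'), weighted by the reference count
        num += min(kept_by_system, kept_by_refs)
        den_p += kept_by_system
        den_r += kept_by_refs
    p_keep = num / den_p if den_p else 0.0
    r_keep = num / den_r if den_r else 0.0
    return p_keep, r_keep

Applied to the worked example in the notation section below, this sketch yields $p_{keep} = 11/12 \approx 0.92$ and $r_{keep} = 11/14 \approx 0.79$ for unigrams.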

The deletion operation

Overdeleting hurts readability much more than not deleting enough, so only precision is used for deletion:

$$p_{del}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap \overline{O}),\ \#_g(I \cap \overline{R'}))}{\sum_{g \in I}\#_g(I \cap \overline{O})}$$

with

$$\#_g(I \cap \overline{O}) = \max(\#_g(I) - \#_g(O),\ 0), \quad \#_g(I \cap \overline{R'}) = \max(\#_g(I) - \#_g(R)/r,\ 0)$$

The precision of what is kept also reflects the sufficiency of deletions. The definitions of both $R'$ and $r$ are the same as in the previous section: the n-gram counts in $R'$ are weighted to compensate for n-grams that, according to the human editors, don't require simplification. Deletion tells how much information is lost in the simplification. Using precision only stems from the notion that it's important to keep the essential information: losing the important fragments while keeping the miscellaneous details is not desirable. A sketch of the deletion precision follows below.
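A corresponding sketch for the unigram deletion precision, including the zero-division guard that real implementations also need (names are illustrative):

from collections import Counter

def unigram_del_precision(input_sent, output_sent, references):
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    r = len(references)
    R = Counter(g for ref in references for g in ref.split())

    num = den = 0.0
    for g in I:
        deleted_by_system = max(I[g] - O[g], 0)    # #g(I ∩ Ō)
        deleted_by_refs = max(I[g] - R[g] / r, 0)  # #g(I ∩ R̄')
        num += min(deleted_by_system, deleted_by_refs)
        den += deleted_by_system
    return num / den if den else 0.0   # nothing deleted means no precision to compute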

The SARI score

$P$ and $R$ are averaged over the n-gram orders up to the highest order $k$ (in the original implementation $k = 4$), that is

$$P_{\text{operation}} = \frac{1}{k}\sum_{n=1}^{k} p_{\text{operation}}(n), \quad R_{\text{operation}} = \frac{1}{k}\sum_{n=1}^{k} r_{\text{operation}}(n)$$

with $\text{operation} \in \{\text{keep}, \text{add}, \text{delete}\}$. The summation of the operation scores is usually done over the n-gram difference between the references and the input. The F score is calculated as usual.
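For the add and keep operations, this is the standard harmonic mean of the averaged precision and recall:

$$F_{\text{operation}} = \frac{2 \times P_{\text{operation}} \times R_{\text{operation}}}{P_{\text{operation}} + R_{\text{operation}}}$$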

Then, the SARI score is introduced as:

$$\text{SARI} = \frac{F_{add} + F_{keep} + P_{del}}{3}$$

SARI scores lie in the $[0, 100]$ interval, with higher scores indicating better model performance. SARI correlates highly with human judgment of simplicity; BLEU, however, correlates better with human judgment of meaning preservation and grammaticality. This might be explained by BLEU's design as a bilingual machine translation evaluation metric: BLEU doesn't use recall, it assigns higher scores to outputs with fewer changes and with lengths that mirror the reference lengths, and it doesn't account for the differences between the inputs and the references.

Notation demystification

Let's perform the calculations of the addition precision and recall for the unigram 'now', using the synthetic example from the original SARI paper:

INPUT ($I$): About 95 species are currently accepted
unigrams: ['About', '95', 'species', 'are', 'currently', 'accepted']

REF-1 ($R1$): About 95 species are currently known
unigrams: ['About', '95', 'species', 'are', 'currently', 'known']

REF-2 ($R2$): About 95 species are now accepted
unigrams: ['About', '95', 'species', 'are', 'now', 'accepted']

REF-3 ($R3$): 95 species are now accepted
unigrams: ['95', 'species', 'are', 'now', 'accepted']

OUTPUT ($O$): About 95 species are now agreed
unigrams: ['About', '95', 'species', 'are', 'now', 'agreed']

$R = R1 \cup R2 \cup R3 =$ ['95', 'About', 'accepted', 'are', 'currently', 'known', 'now', 'species']. The set union of all three references is used here because the indicator simply checks whether the unigram is in any of the references ($1$ if it is, $0$ if it's not).

Let's start by calculating the indicator of the occurrence of the unigram $g$ ('now') in a given set:

$$\#_g(O \cap \overline{I}) = \max(\#_g(O) - \#_g(I),\ 0) = \max(1 - 0,\ 0) = 1$$

'now' doesn't appear in $I$ at all, while there is an occurrence of it in both $O$ and $R$.

$$\#_g(R \cap \overline{I}) = \max(\#_g(R) - \#_g(I),\ 0) = \max(1 - 0,\ 0) = 1$$

Then let's proceed to calculating the precision and recall for addition:

$$p_{add} = \frac{\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\#_g(O \cap \overline{I})} = \frac{\min(1,\ 1)}{1} = 1$$

$$r_{add} = \frac{\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\#_g(R \cap \overline{I})} = \frac{\min(1,\ 1)}{1} = 1$$

Now let's move on to calculating the keep scores for the word 'About' in the output.

Since in the keep operation the count of the kept n-grams matters, the number of occurrences of the specific n-gram across the references has to be calculated. The word 'About' is present in REF-1 and REF-2, so $\#_g(R) = 2$, while the total number of references is $r = 3$:

$$\#_g(I \cap O) = \min(\#_g(I),\ \#_g(O)) = \min(1,\ 1) = 1$$

$$\#_g(I \cap R') = \min(\#_g(I),\ \#_g(R)/r) = \min(1,\ 2/3) = 2/3$$

Continuing the precision and recall calculation:

$$p_{keep} = \frac{\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\#_g(I \cap O)} = \frac{\min(1,\ 2/3)}{1} = 2/3$$

$$r_{keep} = \frac{\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\#_g(I \cap R')} = \frac{\min(1,\ 2/3)}{2/3} = 1$$

The deletion operation is computed in a similar manner; a worked case follows below. Note that any of these computations can run into division by zero (for instance, when the system adds nothing). This is addressed in the TensorFlow and Hugging Face implementations.
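For instance, for the unigram 'currently', which the system deleted and which only REF-1 out of the $r = 3$ references retains:

$$\#_g(I \cap \overline{O}) = \max(1 - 0,\ 0) = 1, \quad \#_g(I \cap \overline{R'}) = \max(1 - 1/3,\ 0) = 2/3$$

$$p_{del} = \frac{\min(1,\ 2/3)}{1} = 2/3$$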

SARI usage with the evaluate module

Hugging Face provides a metric package called evaluate, which offers a universal interface for evaluating model results. You can install it as follows:

pip install evaluate

The SARI metric in evaluate requires 3 arguments to be passed: the inputs (under the sources key), the system output (under predictions), and the references. A SARI score of 100 signifies a total match.

import evaluate

# Load the SARI metric from the Hugging Face Hub
sari = evaluate.load("sari")

sources = ["Some people really enjoy windowshopping."]
predictions = ["Some birds like the windows."]

# Each source sentence gets its own list of reference simplifications
references = [["Some people enjoy shopping.",
               "People like to browse the stores."]]

sari_score = sari.compute(sources=sources,
                          predictions=predictions,
                          references=references)
print(sari_score)  # a dictionary with a single 'sari' key

Conclusion

In this topic, you learned about SARI, the main metric for text simplification evaluation today. Other metrics for the simplification task include BLEU and SAMSA, among others. SARI rewards systems that simplify their outputs and correlates highly with human judgment of simplicity.
