
Text simplification is an NLP task that consists of modifying an input text so that the output is more comprehensible and easier to read. SARI was developed as a text simplification metric that compares the System output Against References and against the Input sentence, and measures the quality of words that are added, deleted, and kept by the system. It scores the output by measuring its overlap with the reference texts, and also tracks the edits made to the input. Since text simplification is performed within a single language, the output can be compared directly against the input.

The addition operation

The operations of addition, where an n-gram of the system output $O$ is not in the input $I$ but occurs in any of the human references $R$, i.e. $O \cap \overline{I} \cap R$, are rewarded. N-gram precision and recall for addition are as follows:

$$p_{add}(n) = \frac{\sum_{g \in O}\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\sum_{g \in O}\#_g(O \cap \overline{I})}$$

$$r_{add}(n) = \frac{\sum_{g \in O}\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\sum_{g \in O}\#_g(R \cap \overline{I})}$$

where $\#_{g}(\cdot)$ is a binary (later fractional) indicator of the occurrence of the n-gram $g$ in a given set, and

$$\#_g(O \cap \overline{I}) = \max(\#_g(O) - \#_g(I),\ 0), \quad \#_g(R \cap \overline{I}) = \max(\#_g(R) - \#_g(I),\ 0)$$

The binary indicator simply returns $1$ if a certain n-gram occurs in a set, and $0$ if it doesn't. The addition scores reward the insertion of important information during simplification.
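To make the formulas concrete, here is a minimal Python sketch of the unigram addition scores. The function and variable names are illustrative, not part of any library. Following the binary indicator above, counts are treated as sets; for the recall denominator, the sketch counts the n-grams added by the references, which is how common implementations read $\#_g(R \cap \overline{I})$:

def unigram_add_scores(input_sent, output_sent, references):
    # Binary unigram indicators: a word either occurs in a set or it doesn't
    I = set(input_sent.split())
    O = set(output_sent.split())
    R = set(g for ref in references for g in ref.split())

    added_by_system = O - I        # g with #g(O ∩ Ī) = 1
    added_by_refs = R - I          # g with #g(R ∩ Ī) = 1

    rewarded = len(added_by_system & R)   # additions that occur in some reference
    p_add = rewarded / len(added_by_system) if added_by_system else 0.0
    r_add = rewarded / len(added_by_refs) if added_by_refs else 0.0
    return p_add, r_add

# The sentences from the original SARI paper (see the table below):
p_add, r_add = unigram_add_scores(
    "About 95 species are currently accepted",
    "About 95 species are now agreed",
    ["About 95 species are currently known",
     "About 95 species are now accepted",
     "95 species are now accepted"],
)
print(p_add, r_add)   # 0.5 0.5: of the system additions {'now', 'agreed'} only
                      # 'now' is rewarded; the references added {'known', 'now'}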

The keep operation

Words that are kept in both $O$ and $R$ are rewarded. If multiple references are used, the count of references in which an n-gram was retained is considered, since some words are already simple and don't need to be simplified. $R'$ is the n-gram count over $R$ with fractions, e.g. if a unigram occurs in 2 out of the total $r$ references, then its count is weighted by $2/r$ in the computation of $p$ and $r$:

$$p_{keep}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\sum_{g \in I}\#_g(I \cap O)}$$

$$r_{keep}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\sum_{g \in I}\#_g(I \cap R')}$$

with

$$\#_g(I \cap O) = \min(\#_g(I),\ \#_g(O)), \quad \#_g(I \cap R') = \min(\#_g(I),\ \#_g(R)/r)$$

The keep operation measures the proportion of the input n-grams that are retained in both $O$ and $R$.
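A matching sketch for the unigram keep scores with the fractional $R'$ weighting; again a simplified illustration with made-up names, not the official implementation:

from collections import Counter

def unigram_keep_scores(input_sent, output_sent, references):
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    r = len(references)
    R = Counter(g for ref in references for g in ref.split())  # counts summed over all references

    num = den_p = den_r = 0.0
    for g in I:
        kept_by_system = min(I[g], O[g])      # #g(I ∩ O)
        kept_by_refs = min(I[g], R[g] / r)    # #g(I ∩ R'), weighted by the reference count
        num += min(kept_by_system, kept_by_refs)
        den_p += kept_by_system
        den_r += kept_by_refs
    p_keep = num / den_p if den_p else 0.0
    r_keep = num / den_r if den_r else 0.0
    return p_keep, r_keep

Applied to the worked example in the notation section below, this sketch yields $p_{keep} = 11/12 \approx 0.92$ and $r_{keep} = 11/14 \approx 0.79$ for unigrams.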

The deletion operation

Overdeleting hurts readability much more than not deleting enough, so only precision is used for deletion:

$$p_{del}(n) = \frac{\sum_{g \in I}\min(\#_{g}(I \cap \overline{O}),\ \#_g(I \cap \overline{R'}))}{\sum_{g \in I}\#_g(I \cap \overline{O})}$$

with

$$\#_g(I \cap \overline{O}) = \max(\#_g(I) - \#_g(O),\ 0), \quad \#_g(I \cap \overline{R'}) = \max(\#_g(I) - \#_g(R)/r,\ 0)$$

The precision of what is kept also reflects the sufficiency of deletions. The definitions of both $R'$ and $r$ are the same as in the previous section: the n-gram counts in $R'$ are weighted to compensate for n-grams that, according to the human editors, don't require simplification. Deletion tells how much information is lost in the simplification. Using precision only stems from the notion that it's important to keep the essential information: losing the important fragments while keeping the miscellaneous details is not desirable. A sketch of the deletion precision follows below.
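A corresponding sketch for the unigram deletion precision, including the zero-division guard that real implementations also need (names are illustrative):

from collections import Counter

def unigram_del_precision(input_sent, output_sent, references):
    I = Counter(input_sent.split())
    O = Counter(output_sent.split())
    r = len(references)
    R = Counter(g for ref in references for g in ref.split())

    num = den = 0.0
    for g in I:
        deleted_by_system = max(I[g] - O[g], 0)    # #g(I ∩ Ō)
        deleted_by_refs = max(I[g] - R[g] / r, 0)  # #g(I ∩ R̄')
        num += min(deleted_by_system, deleted_by_refs)
        den += deleted_by_system
    return num / den if den else 0.0   # nothing deleted means no precision to compute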

The SARI score

$P$ and $R$ are averaged over the n-gram orders up to the highest order $k$ (in the original implementation $k = 4$), that is

$$P_{\text{operation}} = \frac{1}{k}\sum_{n=1}^{k} p_{\text{operation}}(n), \quad R_{\text{operation}} = \frac{1}{k}\sum_{n=1}^{k} r_{\text{operation}}(n)$$

with $\text{operation} \in \{\text{keep}, \text{add}, \text{delete}\}$. The summation of the operation scores is usually done over the n-gram difference between the references and the input. The F score is calculated as usual.
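For the add and keep operations, this is the standard harmonic mean of the averaged precision and recall:

$$F_{\text{operation}} = \frac{2 \times P_{\text{operation}} \times R_{\text{operation}}}{P_{\text{operation}} + R_{\text{operation}}}$$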

Then, the SARI score is introduced as:

$$\text{SARI} = \frac{F_{add} + F_{keep} + P_{del}}{3}$$

SARI scores lie in the $[0, 100]$ interval, with higher scores indicating better model performance. SARI correlates highly with human judgment of simplicity; BLEU, however, correlates better with human judgment of meaning preservation and grammaticality. This might be explained by BLEU's design as a bilingual machine translation evaluation metric: BLEU doesn't use recall, it assigns higher scores to outputs with fewer changes and with lengths that mirror the reference lengths, and it doesn't account for the differences between the inputs and the references.

Notation demystification

Let's perform the calculations of the addition precision and recall for the unigram 'now', using the synthetic example from the original SARI paper:

INPUT ($I$): About 95 species are currently accepted
unigrams: ['About', '95', 'species', 'are', 'currently', 'accepted']

REF-1 ($R1$): About 95 species are currently known
unigrams: ['About', '95', 'species', 'are', 'currently', 'known']

REF-2 ($R2$): About 95 species are now accepted
unigrams: ['About', '95', 'species', 'are', 'now', 'accepted']

REF-3 ($R3$): 95 species are now accepted
unigrams: ['95', 'species', 'are', 'now', 'accepted']

OUTPUT ($O$): About 95 species are now agreed
unigrams: ['About', '95', 'species', 'are', 'now', 'agreed']

$R = R1 \cup R2 \cup R3 =$ ['95', 'About', 'accepted', 'are', 'currently', 'known', 'now', 'species']. The set union of all three references is used here because the indicator simply checks whether the unigram is in any of the references ($1$ if it is, $0$ if it's not).

Let's start by calculating the indicator of the occurrence of the unigram $g$ ('now') in a given set:

$$\#_g(O \cap \overline{I}) = \max(\#_g(O) - \#_g(I),\ 0) = \max(1 - 0,\ 0) = 1$$

'now' doesn't appear in $I$ at all, while there is an occurrence of it in both $O$ and $R$.

$$\#_g(R \cap \overline{I}) = \max(\#_g(R) - \#_g(I),\ 0) = \max(1 - 0,\ 0) = 1$$

Then let's proceed to calculating the precision and recall for addition:

$$p_{add} = \frac{\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\#_g(O \cap \overline{I})} = \frac{\min(1,\ 1)}{1} = 1$$

$$r_{add} = \frac{\min(\#_{g}(O \cap \overline{I}),\ \#_g(R))}{\#_g(R \cap \overline{I})} = \frac{\min(1,\ 1)}{1} = 1$$

Now let's move on to calculating the keep scores for the word 'About' in the output.

Since in the keep operation the count of the kept n-grams matters, the number of occurrences of the specific n-gram across the references has to be calculated. The word 'About' is present in REF-1 and REF-2, so $\#_g(R) = 2$, while the total number of references is $r = 3$:

$$\#_g(I \cap O) = \min(\#_g(I),\ \#_g(O)) = \min(1,\ 1) = 1$$

$$\#_g(I \cap R') = \min(\#_g(I),\ \#_g(R)/r) = \min(1,\ 2/3) = 2/3$$

Continuing the precision and recall calculation:

$$p_{keep} = \frac{\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\#_g(I \cap O)} = \frac{\min(1,\ 2/3)}{1} = 2/3$$

$$r_{keep} = \frac{\min(\#_{g}(I \cap O),\ \#_g(I \cap R'))}{\#_g(I \cap R')} = \frac{\min(1,\ 2/3)}{2/3} = 1$$

The deletion operation is computed in a similar manner; a worked case follows below. Note that any of these computations can run into division by zero (for instance, when the system adds nothing). This is addressed in the TensorFlow and Hugging Face implementations.
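For instance, for the unigram 'currently', which the system deleted and which only REF-1 out of the $r = 3$ references retains:

$$\#_g(I \cap \overline{O}) = \max(1 - 0,\ 0) = 1, \quad \#_g(I \cap \overline{R'}) = \max(1 - 1/3,\ 0) = 2/3$$

$$p_{del} = \frac{\min(1,\ 2/3)}{1} = 2/3$$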

SARI usage with the evaluate module

Hugging Face provides a metric package called evaluate, which offers a universal interface for evaluating model results. You can install it as follows:

pip install evaluate

The SARI metric in evaluate requires 3 arguments to be passed: the inputs (under the sources key), the system output (under predictions), and the references. A SARI score of 100 signifies a total match.

import evaluate

# Load the SARI metric from the Hugging Face Hub
sari = evaluate.load("sari")

sources = ["Some people really enjoy windowshopping."]
predictions = ["Some birds like the windows."]

# Each source sentence gets its own list of reference simplifications
references = [["Some people enjoy shopping.",
               "People like to browse the stores."]]

sari_score = sari.compute(sources=sources,
                          predictions=predictions,
                          references=references)
print(sari_score)  # a dictionary with a single 'sari' key

Conclusion

In this topic, you learned about SARI, the main metric for text simplification evaluation today. Other metrics for the simplification task include BLEU and SAMSA, among others. SARI rewards systems that simplify their outputs and correlates highly with human judgment of simplicity.
