
ROUGE


ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating text generation models, such as summarization or machine translation systems. It is based on measuring the overlap between the model prediction and a human-produced reference. ROUGE has several variants, each corresponding to a particular n-gram overlap: unigrams for ROUGE-1, bigrams for ROUGE-2, and so on. A third widely used variant, ROUGE-L, relies on the longest common subsequence (LCS): the longest sequence of shared, ordered, but not necessarily consecutive words between two sentences.

Definition

ROUGE-N refers to the direct n-gram overlap between the candidate (prediction) and the reference. The ROUGE-N score is the F1 score computed from the n-gram precision and recall. Precision, in the context of ROUGE, is the fraction of the n-grams in the prediction that also appear in the reference, and recall is the fraction of reference n-grams that also appear in the model prediction.

You can calculate the ROUGE-1 score like this:

\text{ROUGE-1}_{\text{recall}} = \frac{|\text{unigrams cand.} \cap \text{unigrams ref.}|}{|\text{unigrams ref.}|}
\text{ROUGE-1}_{\text{precision}} = \frac{|\text{unigrams cand.} \cap \text{unigrams ref.}|}{|\text{unigrams cand.}|}
\text{ROUGE-1}_{\text{F1}} = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}

For ROUGE-2, the core formulas are the same as for ROUGE-1, the only difference being that bigrams are used instead of unigrams, so we omit them here.
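
To make the formulas concrete, here is a minimal from-scratch sketch of the ROUGE-N computation in Python. The ngrams and rouge_n helpers are illustrative names rather than part of any library, and tokenization is a simple whitespace split:

from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams of a token list as a Counter."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N recall, precision, and F1 for two whitespace-tokenized strings."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1

print(rouge_n("I really loved reading the Hunger Games.",
              "I loved reading the Hunger Games.", n=1))  # (1.0, 0.857..., 0.923...)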

For ROUGE-L, you will calculate the longest common subsequence between the candidate and the reference:

\text{ROUGE-L}_{\text{recall}} = \frac{\text{LCS}(\text{cand.}, \text{ref.})}{\#\text{ words in ref.}}
\text{ROUGE-L}_{\text{precision}} = \frac{\text{LCS}(\text{cand.}, \text{ref.})}{\#\text{ words in cand.}}
\text{ROUGE-L}_{\text{F1}} = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}
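
A similar sketch works for ROUGE-L, using the standard dynamic-programming recurrence for the LCS length; again, lcs_length and rouge_l are illustrative helper names, and non-empty inputs are assumed:

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Compute ROUGE-L recall, precision, and F1 for two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1

print(rouge_l("I really loved reading the Hunger Games.",
              "I loved reading the Hunger Games."))  # (1.0, 0.857..., 0.923...)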

ROUGE scores lie in the [0, 1] range, with a score of 1 indicating a perfect match between the reference and the prediction. ROUGE is easy to interpret, and it is straightforward to see why a certain score was achieved: the more direct overlaps there are between the reference and the generated output, the higher the score.

A small note on the ROUGE variants:

  • ROUGE-N: evaluates the overlap between the n-grams of the prediction and the reference.

  • ROUGE-L: based on the longest common subsequence; pays more attention to word order than ROUGE-N.

  • ROUGE-W: a weighted LCS in which consecutive matches are preferred.

  • ROUGE-S: based on skip-bigrams.

  • ROUGE-SU: based on skip-bigrams together with unigram-based co-occurrence.

In practice, ROUGE-1, ROUGE-2, and ROUGE-L are the most commonly used variants.

Example

Let's see how to calculate the ROUGE-1 and the ROUGE-L scores on a small example:

Candidate: I really loved reading the Hunger Games.
Reference: I loved reading the Hunger Games.

Candidate unigrams: [('I',), ('really',), ('loved',), ('reading',), ('the',), ('Hunger',), ('Games.',)]
Reference unigrams: [('I',), ('loved',), ('reading',), ('the',), ('Hunger',), ('Games.',)]

Candidate bigrams: [('I', 'really'), ('really', 'loved'), ('loved', 'reading'), ('reading', 'the'), ('the', 'Hunger'), ('Hunger', 'Games.')]
Reference bigrams: [('I', 'loved'), ('loved', 'reading'), ('reading', 'the'), ('the', 'Hunger'), ('Hunger', 'Games.')]

LCS: I loved reading the Hunger Games.

The ROUGE-1 calculation is performed as

\text{ROUGE-1}_{\text{recall}} = \frac{|\text{unigrams cand.} \cap \text{unigrams ref.}|}{|\text{unigrams ref.}|} = \frac{6}{6} = 1
\text{ROUGE-1}_{\text{precision}} = \frac{|\text{unigrams cand.} \cap \text{unigrams ref.}|}{|\text{unigrams cand.}|} = \frac{6}{7}
\text{ROUGE-1}_{\text{F1}} = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} = \frac{12}{13} \approx 0.923

Similarly, the ROUGE-L is calculated as

\text{ROUGE-L}_{\text{recall}} = \frac{\text{LCS}(\text{cand.}, \text{ref.})}{\#\text{ words in ref.}} = \frac{6}{6} = 1
\text{ROUGE-L}_{\text{precision}} = \frac{\text{LCS}(\text{cand.}, \text{ref.})}{\#\text{ words in cand.}} = \frac{6}{7}
\text{ROUGE-L}_{\text{F1}} = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} = \frac{12}{13} \approx 0.923
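
If you want to sanity-check the hand calculation, the rouge_score package (installed in the next section) exposes a RougeScorer class that scores one candidate against one reference; the snippet below assumes its score(target, prediction) interface:

from rouge_score import rouge_scorer

# Score a single candidate against a single reference (no stemming)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
scores = scorer.score(
    "I loved reading the Hunger Games.",           # target (reference)
    "I really loved reading the Hunger Games.",    # prediction (candidate)
)
print(scores)
# Expected to match the manual result for both rouge1 and rougeL:
# recall = 1.0, precision ≈ 0.857, fmeasure ≈ 0.923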

ROUGE in practice

ROUGE is available in multiple Python packages; for demonstration purposes, let's take a look at the Hugging Face evaluate library.

The preliminary installation step:

pip install evaluate rouge_score

The use of evaluate is pretty straightforward:

import evaluate

# Load the ROUGE metric from the Hugging Face Hub
rouge = evaluate.load("rouge")

# Each prediction may be paired with one or several references
references = [
    [
        "A bold, flavorful coffee with a slightly bitter aftertaste.",
        "A rich, full-bodied coffee with a smooth finish.",
    ]
]
predictions = ["A bold, full-flavored coffee with a slightly bitter aftertaste."]

# Compute all default ROUGE variants at once
results = rouge.compute(predictions=predictions, references=references)

results will contain the rouge1, rouge2, rougeL, and rougeLsum values (rougeLsum is a summary-level variant of ROUGE-L that splits the text on newlines):

{'rouge1': 0.8421052631578948,
 'rouge2': 0.7058823529411765,
 'rougeL': 0.8421052631578948,
 'rougeLsum': 0.8421052631578948}

rouge.compute() accepts the following parameters:

  • predictions (list): list of model predictions.

  • references (list or list[list]): list of (possibly multiple) references for each prediction.

  • rouge_types (list, default: ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']): list of the ROUGE types to compute.

  • use_aggregator (boolean, default: True): if True, returns aggregated scores; if False, returns a list of scores, one per prediction.

  • use_stemmer (boolean, default: False): if True, uses the Porter stemmer to strip word suffixes before matching.
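
As a quick sketch of these options, the call below requests only ROUGE-1 and ROUGE-L and disables aggregation, which should return one score per prediction rather than a single aggregate (the exact output format may vary between evaluate versions):

# Restrict the output to ROUGE-1 and ROUGE-L and return per-prediction scores
per_example = rouge.compute(
    predictions=["I really loved reading the Hunger Games."],
    references=["I loved reading the Hunger Games."],
    rouge_types=["rouge1", "rougeL"],
    use_aggregator=False,
)
print(per_example)  # e.g. {'rouge1': [0.923...], 'rougeL': [0.923...]}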

Limitations

ROUGE only operates on surface overlaps. A score of 1 can only be obtained if both summaries share exactly the same n-grams, which makes it hard to judge the model's quality from the computed scores alone. By extension, optimizing a model by maximizing the ROUGE score is tricky, since a higher score might not correspond to how a human would perceive the quality of the generated text.

ROUGE is better suited to models that do not paraphrase and do not generate new text units that are absent from the references (e.g., extractive summarization, which only reuses the most important text units from the original text).

Also, ROUGE does not have a proper mechanism for penalizing inappropriate prediction lengths (a summary that is too brief, or one that includes unnecessary details). It also has a limited capacity for capturing word order, which is especially evident when shorter n-grams are considered.
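
To illustrate the paraphrasing issue, here is a small sketch using the same evaluate setup as above; the sentences are made up for illustration, and the scores come out low even though the prediction conveys roughly the same meaning as the reference:

# A faithful paraphrase that shares almost no tokens with the reference
paraphrase_results = rouge.compute(
    predictions=["The film was enjoyable from start to finish."],
    references=["I liked the movie a lot."],
)
print(paraphrase_results)  # all ROUGE values come out low despite the similar meaning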

Conclusion

To sum up, you now know the following:

  • The theory behind the ROUGE metric;

  • How the ROUGE scores can be manually calculated on a small example;

  • The limitations of ROUGE and the models that could be evaluated with this metric;

  • How to use the evaluate module to get the ROUGE scores.
