
BLEU


BLEU (Bilingual Evaluation Understudy) was first introduced in 2002 as a metric for machine translation. The main idea is that the closer the model's output matches the reference, the better the model is. BLEU is one of the most popular benchmark metrics in tasks related to text generation, such as machine translation or text summarization.

Modified precision

BLEU relies on n-grams. The table below shows the unigrams and bigrams obtained from a candidate and a reference sentence, along with their LCS (longest common subsequence), which consists of the words the two sentences share.

| | Candidate | Reference |
| --- | --- | --- |
| Text | I really loved reading the Hunger Games. | I loved reading the Hunger Games. |
| Unigrams | [('I',), ('really',), ('loved',), ('reading',), ('the',), ('Hunger',), ('Games.',)] | [('I',), ('loved',), ('reading',), ('the',), ('Hunger',), ('Games.',)] |
| Bigrams | [('I', 'really'), ('really', 'loved'), ('loved', 'reading'), ('reading', 'the'), ('the', 'Hunger'), ('Hunger', 'Games.')] | [('I', 'loved'), ('loved', 'reading'), ('reading', 'the'), ('the', 'Hunger'), ('Hunger', 'Games.')] |

LCS (shared by both sentences): I loved reading the Hunger Games.
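As a quick illustration, here is one way to produce the n-gram tuples above in plain Python (a minimal sketch with a hypothetical get_ngrams helper and a naive whitespace split, which is why 'Games.' keeps its period):

def get_ngrams(tokens, n):
    # Slide a window of size n over the token list and collect the tuples
    return list(zip(*(tokens[i:] for i in range(n))))

candidate = "I really loved reading the Hunger Games.".split()
reference = "I loved reading the Hunger Games.".split()

print(get_ngrams(candidate, 1))  # unigrams of the candidate
print(get_ngrams(candidate, 2))  # bigrams of the candidate
print(get_ngrams(reference, 2))  # bigrams of the reference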

BLEU includes multiple n-gram orders to calculate the precision, most commonly $N = 4$, using unigrams, bigrams, trigrams, and 4-grams to get the scores. A model might repeat a certain word several times, which hurts the output's quality while inflating plain precision. To take that into account, BLEU modifies the precision by clipping each n-gram count to the maximum number of times the n-gram occurs in the reference(s).

Here is a popular example that demonstrates the clipped count calculation on bigrams:

Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Output: The cat the cat on the mat

The Count column shows how many times each bigram appears in the output. For the clipped count, we take each output bigram and limit its count to the maximum number of times it appears in any single reference. For example, ‘cat the’ occurs in neither Ref. 1 nor Ref. 2, so its clipped count is 0, while ‘the mat’ appears in both references, so its clipped count is 1.

| Output bigram | Count | Clipped count |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |

You can calculate the modified precision for a given n-gram order as follows:

$$p_n = \frac{\sum\limits_{O \in \text{Outputs}} \ \sum\limits_{\text{n-gram} \in O} \text{count}_{\text{clip}}(\text{n-gram})}{\sum\limits_{O \in \text{Outputs}} \ \sum\limits_{\text{n-gram} \in O} \text{count}(\text{n-gram})}$$

Here, $O$ corresponds to a single sentence in the predictions (outputs).

For the feline example above, the modified precision for bigrams looks like this:

$$p_2 = \frac{\sum(\text{clipped count})}{\sum(\text{count})} = \frac{1 + 0 + 1 + 1 + 1}{2 + 1 + 1 + 1 + 1} = \frac{4}{6}$$
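To make the clipping concrete, here is a minimal sketch in plain Python (with a hypothetical bigrams helper that lowercases and strips punctuation) that reproduces the table and the value of $p_2$ above:

import re
from collections import Counter

def bigrams(text):
    # Lowercase and keep only alphabetic tokens so that 'mat.' matches 'mat'
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))

references = ["The cat is on the mat.", "There is a cat on the mat."]
output = "The cat the cat on the mat"

output_counts = Counter(bigrams(output))

# For each bigram, find the maximum number of times it occurs in any single reference
max_ref_counts = Counter()
for ref in references:
    for gram, count in Counter(bigrams(ref)).items():
        max_ref_counts[gram] = max(max_ref_counts[gram], count)

# Clip every output count to that maximum
clipped = {gram: min(count, max_ref_counts[gram]) for gram, count in output_counts.items()}

p_2 = sum(clipped.values()) / sum(output_counts.values())
print(p_2)  # 0.666..., i.e. 4/6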

Brevity penalty and the BLEU score

Modified precision penalizes outputs that contain spurious or repeated n-grams, but it has no control over the model generating outputs that are too short. Recall can't be used in BLEU, since the calculation might be performed over multiple references. This leads to the brevity penalty (BP), a factor that ensures high scores can't be achieved if the candidate sentence is too brief:

$$\text{BP} = \begin{cases} 1, & \text{if model length} > \text{reference length}, \\ e^{1 - \text{reference length}/\text{model length}}, & \text{if model length} \le \text{reference length} \end{cases}$$

Model length here refers to the total length of the prediction corpus, and the reference length is the total length of the reference corpus.
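The piecewise definition translates directly into a small function; here is a sketch, with the corpus-level token counts passed in as plain integers:

import math

def brevity_penalty(model_length, reference_length):
    # Predictions longer than the references are not penalized
    if model_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / model_length)

print(brevity_penalty(11, 10))  # 1.0: the output is longer than the reference
print(brevity_penalty(8, 10))   # ~0.78: a short output is penalized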

Thus, the combined BLEU score is calculated like this:

$$\text{BLEU} = \text{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big)$$

In this formula, $p_n$ is the modified precision for each n-gram order (unigrams, bigrams, etc.), and $w_n$ determines how much each order contributes to the score; by default, $w_n = \frac{1}{N} = \frac{1}{4} = 0.25$. The weights sum up to 1 and can be customized to emphasize certain n-gram orders (or if $N \neq 4$).
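Combining the modified precisions with the brevity penalty takes only a few lines; the sketch below assumes every $p_n$ is non-zero, since $\log 0$ is undefined (real implementations fall back to smoothing or return a score of 0):

import math

def combine_bleu(precisions, bp, weights=None):
    # Default to uniform weights w_n = 1/N
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    # Weighted geometric mean of the modified precisions, scaled by BP
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Precisions and BP taken from the evaluate output shown later in this topic
print(combine_bleu([0.9090909090909091, 0.8, 0.6666666666666666, 0.5], bp=1.0))  # ~0.7017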

BLEU scores are in the $[0, 1]$ range, with higher scores indicating better model quality. Generally speaking, a score of 0.7 is considered good, but the acceptable threshold value will depend on the task at hand.

The major advantages of BLEU are its simplicity and its correlation with human judgment, which makes it a good metric for fast and inexpensive surface-level evaluation. However, BLEU is often not enough as the sole evaluator of performance. It has no mechanism for paraphrase detection and considers only exact string matches, which produces lower scores for evidently correct outputs that use synonyms. Another issue is its limited ability to capture word order, especially when long phrases are present. Due to these issues, BLEU scores might not be that informative about the true model quality.

There are various BLEU implementations available (such as nltk.translate.bleu_score), but in this topic the Hugging Face evaluate module will be used for demonstration purposes.
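For comparison, a minimal NLTK call looks roughly like this; sentence_bleu expects pre-tokenized input and a list of references per candidate, and its defaults may differ from the evaluate version:

from nltk.translate.bleu_score import sentence_bleu

reference = "I loved reading the Hunger Games .".split()
candidate = "I really loved reading the Hunger Games .".split()

# One candidate scored against a list of tokenized references
score = sentence_bleu([reference], candidate)
print(score)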

BLEU usage with the evaluate package

BLEU is available in the Hugging Face evaluate module, which can be installed as follows:

pip install evaluate

bleu.compute() requires two keyword arguments to be passed: a list of model predictions, under the predictions key, and the reference list, under the references key:

import evaluate

bleu = evaluate.load("bleu")

references = [
    [
        "A bold, flavorful coffee with a slightly bitter aftertaste.",
        "A rich, full-bodied coffee with a smooth finish.",
    ]
]
predictions = ["A bold, full-flavored coffee with a slightly bitter aftertaste."]

bleu_score = bleu.compute(predictions=predictions, references=references)

bleu.compute() also accepts three optional arguments to customize the BLEU calculation for a specific use case, as shown in the sketch after this list:

  • tokenizer (Tokenizer13a by default) — can be changed to any other tokenizer that accepts a string and returns a list of tokens,
  • max_order (4 by default) — maximum n-gram order used,
  • smooth (False by default) — determines whether a specific smoothing technique is applied.
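For example, assuming the bleu, predictions, and references objects from the snippet above, a bigram-only score with smoothing can be requested like this:

# Restrict the calculation to unigrams and bigrams and apply smoothing
bigram_score = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=2,
    smooth=True,
)
print(bigram_score["bleu"])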

With the default settings, bleu.compute() returns a dictionary in the following format (the values below correspond to the coffee example above):

{
   "bleu":0.7016879391277371,
   "precisions":[
      0.9090909090909091,
      0.8,
      0.6666666666666666,
      0.5
   ],
   "brevity_penalty":1.0,
   "length_ratio":1.1,
   "translation_length":11,
   "reference_length":10
}
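Since the result is a plain dictionary, individual fields can be read directly, for example:

print(f"BLEU: {bleu_score['bleu']:.3f}")                    # 0.702
print(f"Brevity penalty: {bleu_score['brevity_penalty']}")  # 1.0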

Conclusion

In this topic, you've learned:

  • the main ideas behind the BLEU metric, namely, the modified precision, a measure that penalizes models for generating spurious or repeated n-grams, and the brevity penalty, a measure that ensures models don't score highly by producing overly short predictions;
  • how to use the evaluate module to calculate BLEU scores on a set of predictions and references.