
METEOR


METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric originally introduced for machine translation and later extended to other text generation tasks. It calculates the similarity between the system output and the references using unigram matching. METEOR was developed as an improvement upon BLEU by incorporating recall into the calculation. This topic covers the core aspects and the usage of the METEOR metric.

F-score

METEOR uses a weighted F-score (with recall being weighted higher than precision) based on mapped unigrams and a penalty function (fragmentation measure) for incorrect word order. This section deals with some preliminary steps and the weighted F-score calculation.

First, you search for the largest subset of mappings (called an alignment) between the candidate and the reference: exact matches are considered first, followed by matches after Porter stemming, and then WordNet synonyms. An alignment is a mapping between unigrams such that every unigram in each string maps to zero or one unigram in the other string (and to no unigrams in the same string). An alignment could be illustrated as follows:

The unigram alignment illustration on two synthetic sentences

After the largest alignment is found, let m be the number of mapped unigrams between the reference and the candidate. Then, the precision and the recall are equal to

P = \frac{m}{c}, \quad R = \frac{m}{r}

where c and r are the unigram counts in the candidate and in the reference, respectively. Then, you can calculate the weighted F-score (with α controlling the relative weights of precision and recall):

F_{\text{mean}} = \frac{P R}{\alpha P + (1 - \alpha) R}
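As a quick illustration of the formulas above, here is a minimal Python sketch; the function name and signature are made up for this example and are not part of any library:

def weighted_f_mean(m: int, c: int, r: int, alpha: float = 0.9) -> float:
    """Weighted F-score from m mapped unigrams, candidate length c, and reference length r."""
    if m == 0:
        return 0.0
    precision = m / c
    recall = m / r
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)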

Penalty function

The penalty function is responsible for the correct word order in the candidate and is defined as follows:

\text{Penalty} = \gamma \Big( \frac{c_m}{m} \Big)^\beta,

where

  • c_m — the number of matching chunks (a chunk is a set of mapped unigrams that are adjacent to each other in both the reference and the candidate),

  • m — the number of mapped unigrams between the reference and the candidate,

  • β — a parameter responsible for the penalty's shape, and

  • γ — the relative weight for the fragmentation penalty, γ ∈ [0, 1].

The penalty decreases as the number of chunks decreases: longer contiguous matches are rewarded, while more fragmented matches are penalized.

Finally, METEOR combines the weighted F-score with the fragmentation penalty and is calculated as

\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})

METEOR lies in the [0, 1] range, with higher values indicating better scores.
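Putting the formulas together, here is a minimal sketch of the full score that reuses the hypothetical weighted_f_mean from above; fragmentation_penalty and meteor_from_counts are also made-up names, and the sketch assumes the alignment and chunk counting have already been done:

def fragmentation_penalty(chunks: int, m: int, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Penalty = gamma * (chunks / m) ** beta; zero when there are no matches."""
    if m == 0:
        return 0.0
    return gamma * (chunks / m) ** beta


def meteor_from_counts(m, c, r, chunks, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR = F_mean * (1 - Penalty)."""
    f_mean = weighted_f_mean(m, c, r, alpha)
    return f_mean * (1 - fragmentation_penalty(chunks, m, beta, gamma))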

Synonym matches with WordNet

Synonym matching is done with WordNet, a lexical database that provides information about lexical relationships between words (including synonyms).

To perform synonym matching with WordNet, the first step is to identify the potential synonyms in the reference and the candidate. Each word in the reference is matched to a word in the candidate if any synonym of the candidate word is an exact match for the reference word. Consider an example:

Candidate

The aim was to advance the scheme

Reference

The objective was to upgrade the system

In WordNet, 'aim' is a synonym for 'objective', 'advance' is a synonym for 'upgrade', and 'scheme' is a synonym for 'system', so they all will be matched.

Once the potential synonyms are identified, METEOR calculates a matching score based on the presence of these synonyms. A higher score is assigned when synonyms are found in both the candidate and the reference, indicating semantic similarity.
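To get a feel for the lookup behind this, here is a small sketch of such a synonym check using the WordNet interface from nltk (used later in this topic). The is_wordnet_synonym helper is not part of nltk; it only mirrors the idea described above:

# import nltk
# nltk.download('wordnet')
from nltk.corpus import wordnet


def is_wordnet_synonym(candidate_word: str, reference_word: str) -> bool:
    """True if the reference word appears among the lemmas of any synset of the candidate word."""
    lemmas = {
        name.lower()
        for synset in wordnet.synsets(candidate_word)
        for name in synset.lemma_names()
    }
    return reference_word.lower() in lemmas


print(is_wordnet_synonym("aim", "objective"))   # expected: True
print(is_wordnet_synonym("scheme", "system"))   # expected: True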

METEOR walk-through

In order to better understand the introduced concepts, let's manually calculate the METEOR score for the following reference—prediction pair:

Reference

The quick brown fox jumps over the lazy dog.

Prediction

The brown fox jumps over the dog.

The unigram alignment:

The unigram alignment illustration for the reference and prediction

m, the number of mapped unigrams between the reference and the prediction, is 7. The unigram count in the prediction is 7 (c), and the unigram count in the reference is 9 (r). We set α = 0.9 (the value from the original METEOR paper, which weights recall 9 times more than precision). Now, let's calculate the precision and recall:

P = \frac{m}{c} = \frac{7}{7} = 1, \qquad R = \frac{m}{r} = \frac{7}{9}

The weighted F-score:

F_{\text{mean}} = \frac{P R}{\alpha P + (1 - \alpha) R} = \frac{\frac{7}{9}}{0.9 \cdot 1 + 0.1 \cdot \frac{7}{9}} \approx 0.795

The matched unigrams form three chunks: {The}, {brown fox jumps over the}, and {dog}, because "quick" and "lazy" break the adjacency in the reference. So the number of matching chunks (c_m) is 3:

A chunk illustration for the reference and prediction pair

β (a parameter responsible for the penalty's shape) is set to 3, and the relative weight for the fragmentation penalty, γ, is 0.5 (both values are taken from the original METEOR paper). Thus, the penalty becomes:

\text{Penalty} = \gamma \Big( \frac{c_m}{m} \Big)^\beta = 0.5 \cdot \Big( \frac{3}{7} \Big)^3 \approx 0.039

The final METEOR score for this reference—prediction pair:

F_{\text{mean}} \cdot (1 - \text{Penalty}) = 0.795 \cdot (1 - 0.039) \approx 0.764
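As a sanity check, you can reproduce the same arithmetic in Python (plain formulas rather than nltk; the numbers mirror the walk-through above):

m, c, r = 7, 7, 9            # mapped unigrams, prediction length, reference length
chunks = 3                   # {The}, {brown fox jumps over the}, {dog}
alpha, beta, gamma = 0.9, 3, 0.5

precision = m / c
recall = m / r
f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
penalty = gamma * (chunks / m) ** beta
print(round(f_mean, 3), round(penalty, 3), round(f_mean * (1 - penalty), 3))
# 0.795 0.039 0.764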

METEOR usage with nltk

In this section, you'll go through the METEOR calculation with nltk's meteor_score module.

meteor_score has two mandatory arguments (a list of pre-tokenized references and a pre-tokenized prediction), and it also accepts optional arguments:

  • alpha, beta, and gamma (corresponding to the α, β, and γ parameters described above),

  • preprocess (a Callable, set to str.lower by default),

  • stemmer (defaults to the Porter stemmer), and

  • wordnet (the WordNet corpus). The default values of alpha, beta, and gamma (taken from the original METEOR paper) are passed explicitly in the snippet below.

Computing the METEOR score for a string is straightforward:

# import nltk
# nltk.download('punkt')
# nltk.download('wordnet')

from nltk.translate.meteor_score import meteor_score
from nltk import word_tokenize

prediction = "I am a large language model."
reference = "I am a large language model, also known as a conversational AI or chatbot trained to be informative and comprehensive."

results = meteor_score([word_tokenize(reference)], word_tokenize(prediction), alpha=0.9, beta=3, gamma=0.5)
print(results)  # 0.3096067695370831

Note that the order of the reference and the prediction arguments is important. meteor_score can also be used with multiple references, in which case the first argument is a list of tokenized references, as shown in the sketch below. The module also contains the component matching functions (such as wordnetsyn_match).
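For example, here is a minimal sketch of scoring a prediction against two references; the sentences are made up for illustration, and the snippet assumes the punkt and wordnet data are downloaded (see above):

from nltk import word_tokenize
from nltk.translate.meteor_score import meteor_score

prediction = "The cat sat on the mat."
references = [
    "A cat was sitting on the mat.",
    "There is a cat on the mat.",
]

# The first argument is a list of tokenized references;
# meteor_score returns the best (maximum) score over them.
print(meteor_score([word_tokenize(r) for r in references], word_tokenize(prediction)))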

Below is the NLTK code snippet that returns the WordNet matches [(6, 6), (5, 5), (4, 4), (3, 3), (2, 2), (1, 1), (0, 0)]. So, in this case, every word has been matched either in full or to its synonym. The words were chosen to be WordNet synonyms from the start — try changing the hypothesis and the reference strings.

# import nltk
# nltk.download('punkt')
# nltk.download('wordnet')

from nltk import word_tokenize
from nltk.translate.meteor_score import wordnetsyn_match

candidate = word_tokenize('The aim was to advance the scheme')
reference = word_tokenize('The objective was to upgrade the system')

wordnetsyn_match(candidate, reference)

You could also compute the full prediction-reference alignment, which is the result of sequentially applying the exact match, the stemmed match, and the WordNet-based match:

# previous imports
from nltk.translate.meteor_score import align_words

align_words(candidate, reference)

Here is what it outputs:

([(0, 0), (1, 1), (2, 2), (3, 3), (5, 5), (6, 6)],
 [(4, 'advanc')],
 [(4, 'upgrad')])

Here, the first list stores the alignment, the second list contains the unmatched stems from the candidate, and the third list contains the unmatched stems from the reference. We can see that after stemming, the "advance"/"upgrade" synonym match was lost because WordNet doesn't consider "advanc" to be a synonym for "upgrad".
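You can verify this directly: WordNet has no entries for these stemmed forms at all, so the synonym lookup at the last matching stage cannot succeed. A quick check (assuming the wordnet corpus is downloaded):

from nltk.corpus import wordnet

print(wordnet.synsets("advanc"))  # expected: [] — the stem has no WordNet entry
print(wordnet.synsets("upgrad"))  # expected: []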

Some considerations for METEOR

METEOR was designed to address BLEU's weaknesses, namely:

  • the lack of recall;

  • indirect measures of how well-formed the candidate text is (higher order n-grams are used as substitutes);

  • the lack of explicit word-matching between the candidate and the reference;

  • the geometric average equals 0 if one of the component n-gram scores is zero. Thus, BLEU scores at the sentence (or segment) level can be meaningless.

METEOR is capable of paraphrase matching, which requires external resources; thus, unlike BLEU, METEOR is a language-dependent metric (currently, METEOR supports 5 languages besides English).

However, METEOR has its limitations. Here are some of them:

  • METEOR has a higher correlation with human judgment than BLEU, but the maximum correlation at the sentence level was 0.403, indicating that METEOR is still limited in capturing nuances;

  • METEOR depends on a lexical database such as WordNet. If certain words or phrases are not present in the database, they might not be properly considered during evaluation, leading to potential inaccuracies.

  • α, β, and γ affect the produced scores and might need tuning to fit specific requirements.

Conclusion

By finishing this topic, you are now familiar with the main components of the METEOR metric (namely, the weighted F-score and the penalty function), as well as with certain motivations and limitations of the metric. You also learned how to calculate METEOR on a prediction-reference pair, and how to use METEOR with the nltk module.
