
BERTScore


BERTScore is a metric for evaluating the quality of text generation models, such as machine translation or summarization. It utilizes pre-trained BERT contextual embeddings for both the generated and reference texts, and then calculates the cosine similarity between these embeddings. This topic covers the core concepts behind the BERTScore metric.

The motivation behind BERTScore

BERTScore was designed to directly improve upon n-gram-based text generation metrics, such as BLEU or METEOR, by addressing two primary limitations:

  1. Inability to detect paraphrases: for example, with the reference text "people like foreign cars," n-gram-based metrics would assign a higher score to "people like visiting places abroad" than to "consumers prefer imported cars," even though the latter is the correct paraphrase. This leads to underestimated performance when semantically correct phrases deviate from the reference wording. In BERTScore, sentence similarity is computed by matching contextual token embeddings via cosine similarity, which gives it the capability to detect paraphrases.

  2. Failure to capture distant dependencies and to penalize semantically critical changes in word order: for instance, the BLEU score is not severely affected when "A because B" is switched to "B because A", especially when A and B are long phrases. BERTScore's contextual embeddings are trained to capture word order and distant dependencies in the text.

The architecture

At a high level, the BERTScore metric computation flow involves the following steps:

  • Token representation: The first step in calculating BERTScore is to represent both the reference and candidate sentences using contextual embeddings. One can use different BERT variants (such as RoBERTa or DistilBERT) to obtain the contextual embeddings.

  • Cosine similarity: In the second step, the pairwise cosine similarity between each token in the reference sentence and each token in the candidate sentence is computed.

  • BERTScore calculation: In this step, each token in the reference sentence is matched to the most similar token in the candidate sentence, and vice versa, to calculate recall and precision. These are then combined into the F1 score.

  • Importance weighting: This optional step assumes that rare words can be highly indicative of sentence similarity; thus, IDF (Inverse Document Frequency) is incorporated into the calculation.

  • Baseline rescaling: The final step aims to make the score more human-readable and transform it into the 0-1 range.

We will explore each of these steps in greater detail below.

The BERTScore architecture is outlined below:

Figure: overview of the BERTScore architecture, with the reference and candidate sentences as input.

Given a tokenized reference sentence x = \langle x_1, ..., x_k \rangle, the embedding model generates a sequence of vectors \langle \mathrm{x}_1, ..., \mathrm{x}_k \rangle. Similarly, the tokenized candidate sentence \hat{x} = \langle \hat{x}_1, ..., \hat{x}_m \rangle is mapped to \langle \hat{\mathrm{x}}_1, ..., \hat{\mathrm{x}}_m \rangle (the two sentences do not have to be of the same length).

BERT or its variants are responsible for tokenizing the input text into a sequence of word pieces. Unknown words are broken down into commonly observed sequences of characters. The representation of each word piece is then computed using a transformer encoder that applies self-attention and nonlinear transformations repeatedly.
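As a brief illustration of this step, the sketch below obtains contextual token embeddings with the Hugging Face transformers library. Using transformers directly is an assumption of this sketch; the evaluate wrapper shown later handles the embedding step internally. The model name mirrors the example at the end of this topic:

import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT variant can be used; distilbert-base-uncased matches the later example
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentence = "people like foreign cars"
inputs = tokenizer(sentence, return_tensors="pt")  # word-piece tokenization
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per word piece (including the [CLS] and [SEP] special tokens)
token_embeddings = outputs.last_hidden_state[0]  # shape: (num_tokens, hidden_size)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
print(token_embeddings.shape)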

The vector representation allows for a soft similarity measure. The cosine similarity of a reference token x_i and a candidate token \hat{x}_j is \frac{\mathrm{x}_i^\top \hat{\mathrm{x}}_j}{\lVert \mathrm{x}_i \rVert \, \lVert \hat{\mathrm{x}}_j \rVert}, and since pre-normalized vectors are used, the similarity reduces to the dot product:

\mathrm{x}_i^\top \hat{\mathrm{x}}_j
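To make the normalization point concrete, here is a small sketch with toy tensors standing in for real BERT outputs: once the token vectors are L2-normalized, the full matrix of pairwise cosine similarities is a single matrix product.

import torch
import torch.nn.functional as F

# Toy embeddings standing in for BERT outputs: 5 reference and 7 candidate tokens
ref_emb = torch.randn(5, 768)   # x_1, ..., x_k
cand_emb = torch.randn(7, 768)  # x_hat_1, ..., x_hat_m

# After L2 normalization, cosine similarity reduces to a dot product,
# so the whole k x m similarity matrix is one matrix multiplication
ref_norm = F.normalize(ref_emb, dim=-1)
cand_norm = F.normalize(cand_emb, dim=-1)
similarity = ref_norm @ cand_norm.T  # similarity[i, j] = cosine(x_i, x_hat_j)
print(similarity.shape)              # torch.Size([5, 7])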

Embeddings are context-dependent, changing based on the sentence's context. This context-awareness enables BERTScore to evaluate semantically similar sentences even if their phrasing differs.

For recall calculation, each token in xx is matched to its most similar counterpart in x^\hat{x}, and vice versa for precision. Greedy matching is then used to maximize the similarity score.

The recall, precision, and F1 score for a reference x and a candidate \hat{x} are calculated as follows:

R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathrm{x}_i^\top \hat{\mathrm{x}}_j, \quad P_{\text{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathrm{x}_i^\top \hat{\mathrm{x}}_j,
F_{\text{BERT}} = 2 \cdot \frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}
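Continuing the toy sketch from above (the similarity matrix is assumed to come from that snippet), greedy matching and the resulting scores reduce to row-wise and column-wise maxima:

# Greedy matching on the toy similarity matrix from the earlier sketch:
# recall averages the best match for every reference token (row-wise max),
# precision averages the best match for every candidate token (column-wise max)
recall = similarity.max(dim=1).values.mean()     # R_BERT
precision = similarity.max(dim=0).values.mean()  # P_BERT
f1 = 2 * precision * recall / (precision + recall)
print(precision.item(), recall.item(), f1.item())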

There is a step known as importance weighting that may be performed. This step is based on the assumption that rare words could be more indicative of sentence similarity—and therefore more important—than common words. Inverse Document Frequency (IDF) can be incorporated into the existing BERTScore equations, although the effectiveness of this step can depend on both data availability and the specific domain of the text.
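For reference, in the original BERTScore paper the IDF-weighted recall takes the following form (precision is weighted analogously), with \mathrm{idf} estimated from a corpus of M reference sentences x^{(1)}, ..., x^{(M)} and \mathbb{I}[\cdot] denoting the indicator function:

R_{\text{BERT}} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i) \max_{\hat{x}_j \in \hat{x}} \mathrm{x}_i^\top \hat{\mathrm{x}}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}, \quad \mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right]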

Baseline rescaling is carried out to make the score more human-readable. While cosine similarity values theoretically lie within the [-1, 1] interval, they usually occupy a much narrower range in practice. The rescaled recall \hat{R}_{\text{BERT}} is the following (P_{\text{BERT}} and F_{\text{BERT}} are rescaled similarly):

\hat{R}_{\text{BERT}} = \frac{R_{\text{BERT}} - b}{1 - b}
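For instance, with purely illustrative numbers, if R_{\text{BERT}} = 0.92 and the baseline is b = 0.83, the rescaled score is \hat{R}_{\text{BERT}} = \frac{0.92 - 0.83}{1 - 0.83} \approx 0.53, which spreads typical scores over a wider, easier-to-interpret range.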

After rescaling, \hat{R}_{\text{BERT}} generally falls between 0 and 1. The constant b is computed for each language and contextual embedding model using data from the Common Crawl project: sentences from the corpus are randomly paired to form candidate-reference pairs, which therefore tend to have very low lexical and semantic overlap. BERTScores are calculated for each of these pairs, and b is set to the average of these scores.

The implementation overview

Hugging Face provides the evaluate package, which offers a universal interface for assessing model performance.

You can install the required dependencies using the following command:

pip install bert_score evaluate

To use BERTScore within evaluate, three arguments are required: predictions, references, and either lang or model_type. The model_type argument specifies which BERT variant (e.g., RoBERTa, DistilBERT) should be used, while lang lets the library choose a default model for the given language.

The result is a dictionary containing values for precision, recall, and the F1 score, along with a library hash code. Note that baseline rescaling is not applied unless you request it explicitly (an example follows the output below).

Here's a code example to demonstrate how BERTScore is used with evaluate:

import evaluate

# Load the BERTScore metric from the evaluate library
bertscore = evaluate.load("bertscore")

predictions = [
    "The sea, which encompasses more than 70% of the Earth's surface, is a large body of salty water"
]
references = [
    "The sea is a vast expanse of saline water that covers more than 70% of the Earth's surface"
]

# Compute BERTScore using a lightweight DistilBERT model as the embedding backbone
results = bertscore.compute(
    predictions=predictions, references=references, model_type="distilbert-base-uncased"
)

print(results)

Output:

{
   "precision":[
      0.9478850960731506
   ],
   "recall":[
      0.9566641449928284
   ],
   "f1":[
      0.9522543549537659
   ],
   "hashcode":"distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.33.1)"
}
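If you want the rescaled scores described earlier, the underlying bert_score library exposes a rescale_with_baseline flag that evaluate passes through; a lang argument is then needed so the matching baseline values can be selected. A sketch continuing the example above (the resulting numbers will be lower than the raw scores shown here):

# Continuing the previous example: request baseline-rescaled scores
results_rescaled = bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="distilbert-base-uncased",
    lang="en",                   # needed to select the matching baseline values
    rescale_with_baseline=True,  # applies the (score - b) / (1 - b) transformation
)
print(results_rescaled)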

BERTScore correlates with human judgments better than traditional word- or character-based metrics, primarily because it utilizes contextual embeddings that address their limitations. However, this reliance on contextual embeddings also makes the metric dependent on the language coverage and the quality of the pre-trained model. Additionally, using BERTScore requires downloading a BERT model: the default model takes up about 1.4 GB of storage space, although lighter variants are available in the evaluate implementation.

Conclusion

In conclusion, you should now be familiar with:

  • The underlying motivation behind the metric;

  • The key components of BERTScore;

  • How to use evaluate in conjunction with BERTScore.
