Virtual assistants such as ChatGPT and Gemini rely on Generative Large Language Models (LLMs), but their responses can sometimes be off-topic or factually incorrect, making evaluation essential. This topic explores key benchmarks and metrics for assessing LLM performance and introduces advanced evaluation methods, including human and model-based approaches.
Nuances of Evaluating Generative Models
Evaluating the quality of responses from generative models, such as those used in text generation, translation, or summarization, is not straightforward. Unlike classification tasks, where the answer is either right or wrong, generative tasks involve open-ended responses that can vary widely while still being acceptable. This introduces a set of challenges in evaluation.
The evaluation of LLMs is categorized into three main groups:
Knowledge and capability evaluation assesses how well models perform tasks and understand data. Examples include question-answering benchmarks, knowledge completion using LAMA (LAnguage Model Analysis) or KoLA (Knowledge-oriented LLM Assessment), tests of reasoning skills, and tool integration tasks.
Alignment evaluation ensures the model aligns with user intent and ethical standards. It tests instruction following, identifies bias, and evaluates cultural sensitivity in diverse contexts.
Safety evaluation focuses on preventing harmful or inaccurate outputs. It includes factual accuracy checks and monitoring for toxicity and harmful content.
To assess the quality of generative outputs, we rely on quantitative metrics such as perplexity, BLEU, and ROUGE for numerical evaluation, qualitative metrics such as user feedback for subjective assessment, and specialized metrics that address concerns including toxicity, fairness, and factual consistency.
One common metric is perplexity, which evaluates how well a model predicts the next word in a sequence. A lower perplexity score suggests the model is better at predicting the text, but it does not account for how creative or relevant the output is.
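To make this concrete, here is a minimal sketch of how perplexity follows from token probabilities; the log-probability values below are made-up numbers for illustration, not outputs of a real model.

import math

# Hypothetical log-probabilities a model assigned to each token of a short
# sentence (made-up values; a real model would provide these per token).
token_log_probs = [-1.2, -0.8, -2.1, -0.5, -1.0, -1.7]

# Perplexity is the exponential of the average negative log-likelihood.
avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
perplexity = math.exp(avg_neg_log_likelihood)

print(round(perplexity, 2))  # lower values mean the model found the text less surprising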
Another metric, accuracy, is often used in classification and question-answering (QA) tasks to measure correctness as the proportion of answers the model gets right.
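As a quick illustration, accuracy can be computed as the share of exact matches; the predictions and references below are invented for the example.

# Invented model answers and reference answers for a small QA set
predictions = ["Paris", "1879", "Nile", "Oxygen"]
references = ["Paris", "1879", "Amazon", "Oxygen"]

# Accuracy is the fraction of predictions that match the reference exactly
correct = sum(p == r for p, r in zip(predictions, references))
accuracy = correct / len(references)

print(accuracy)  # 0.75, since three of the four answers match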
For tasks such as machine translation or summarization, BLEU and ROUGE scores are often used. These metrics compare n-grams (sequences of n words) in the generated text to a reference text. While they help measure how similar the output is to the reference, they do not fully capture more subjective qualities like creativity or fluency.
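The sketch below shows the n-gram overlap idea behind these metrics; it is a simplified, precision-style calculation, not the full BLEU formula, which also combines several n-gram orders, clips counts, and applies a brevity penalty.

# Simplified bigram overlap between a candidate and a reference sentence
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

ref_bigrams = ngrams(reference, 2)
cand_bigrams = ngrams(candidate, 2)

# Fraction of candidate bigrams that also appear in the reference
overlap = sum(bg in ref_bigrams for bg in cand_bigrams) / len(cand_bigrams)
print(round(overlap, 2))  # 0.6, since three of the five bigrams match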
Evaluation metrics such as BLEU, ROUGE, accuracy, and perplexity are useful but have limitations. They often prioritize outputs that closely match the reference, which means they can miss other valid answers and struggle to assess whether the text fits the context. That is why a mix of metrics, benchmarks, model evaluations, and human assessments is necessary to effectively evaluate the performance of LLMs.
Benchmarks
Benchmarks are standardized datasets and tasks that allow researchers and developers to measure and compare the performance and abilities of different language models across domains and tasks. They help track progress in the field of NLP and enable objective evaluation. Benchmarks also come with predefined evaluation metrics, making the evaluation process more consistent.
There are over 100 LLM benchmarks and evaluation datasets that can be used to assess LLM performance. Some of the key benchmarks include:
GLUE (General Language Understanding Evaluation) tests general language understanding through tasks like sentiment analysis and natural language inference, providing information about how well a model understands and processes different text-based inputs. For example, a model can be assessed with the sentence “I love this movie!” to determine if it expresses positive or negative sentiment.
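As a rough sketch, GLUE tasks can be loaded with the Hugging Face datasets library (an assumption here, since the text does not name a specific toolkit); SST-2 is the sentiment-analysis task inside GLUE.

from datasets import load_dataset

# Load the SST-2 sentiment task from the GLUE benchmark
sst2 = load_dataset("glue", "sst2", split="validation")

example = sst2[0]
print(example["sentence"])  # a short movie-review snippet
print(example["label"])     # 1 means positive sentiment, 0 means negative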
SuperGLUE extends GLUE by including more challenging tasks, such as commonsense reasoning, which require deeper language comprehension and the ability for models to apply logical thinking. For example, given the sentence, "The ground is wet because it rained last night," a task could ask, "Why is the ground wet?" The model needs to recognize that rain caused the ground to be wet.
SQuAD (Stanford Question Answering Dataset), meanwhile, focuses on reading comprehension and question answering, assessing a model’s ability to extract relevant information from a passage of text. As an example, given the passage, "Albert Einstein was born in 1879 in Ulm, Germany," a question could be, "Where was Albert Einstein born?" The model should accurately extract "Ulm, Germany" as the answer.
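A minimal sketch of the exact-match check used for this kind of extractive QA is shown below; the official SQuAD script also computes a token-level F1 score and applies fuller answer normalization, so this only conveys the core idea.

import string

def normalize(text):
    # Lowercase and strip punctuation so minor formatting differences do not count as errors
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

prediction = "Ulm, Germany"
gold_answers = ["Ulm, Germany", "Ulm"]

# Exact match: the prediction equals at least one acceptable gold answer after normalization
exact_match = any(normalize(prediction) == normalize(g) for g in gold_answers)
print(exact_match)  # True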
However, these benchmarks have disadvantages. They focus on domain-specific tasks and often do not assess more open-ended, generative capabilities. Benchmarks such as GLUE and SuperGLUE are useful for evaluating tasks with fixed answers, but they may not fully capture the nuances of creative tasks. Additionally, there is a risk that models become overfitted to these benchmarks, which can hurt their ability to generalize to real-world situations. Models that excel at benchmark tasks may struggle when faced with complex problems not covered in the datasets.
Human-as-a-judge
Human evaluation can be used for assessing the quality of generative models, particularly when it comes to aspects such as creativity, coherence, and emotional impact. One way human evaluation is conducted is through user feedback, where input from real users helps assess how well the model meets their expectations and performs the given task. For tasks like storytelling, humor generation, or creative writing, human judgment is irreplaceable. After all, who better to judge creativity than a human?
Unlike automated metrics, human evaluators can assess qualities that are harder to measure but just as important. For example, when evaluating humor, automated metrics might miss problems of timing or delivery, while a human can easily determine if something is funny. However, human evaluation can be costly and time-consuming, especially when dealing with large datasets. Also, different evaluators may have varying opinions, leading to inconsistencies in ratings. Finally, bias can sometimes skew the results.
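One common way to quantify disagreement between evaluators is an agreement statistic such as Cohen's kappa; the sketch below uses invented ratings from two hypothetical evaluators.

from collections import Counter

# Invented quality judgments from two human evaluators on ten generated responses
rater_a = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
rater_b = ["good", "bad", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]

n = len(rater_a)

# Observed agreement: fraction of items both raters labeled the same way
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected agreement by chance, based on each rater's label frequencies
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in counts_a)

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.6; values near 1 mean strong agreement, near 0 mean chance-level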
Model-as-a-judge
The concept of using models as evaluators is popular in natural language processing (NLP), particularly for large-scale tasks. In this approach, a more advanced (and typically more expensive) state-of-the-art LLM, such as GPT-4, Claude, or Llama, automatically scores or ranks generated outputs; the approach offers scalability, speed, and reproducibility.
In model-as-a-judge evaluation, models assess outputs through comparison tasks or predefined scoring systems. For example, a model might compare generated text to a reference and rate its quality based on coherence or fluency. Alternatively, it could rank multiple outputs and select the best one. Model-based evaluation is faster and more cost-effective than human evaluation and produces reproducible results. However, the judge model may introduce biases from its training data, and careful calibration is needed to align its scores with human judgment.
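A rough sketch of such a setup is shown below, assuming the openai Python client and a GPT-4-class model as the judge; the prompt wording, model name, and scoring scale are illustrative choices, not a standard recipe.

from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the RESPONSE to the QUESTION for coherence "
    "and fluency on a scale from 1 to 5. Reply with the number only.\n\n"
    "QUESTION: {question}\n\nRESPONSE: {response}"
)

def judge(question, response, model="gpt-4o"):
    # Ask the judge model for a single numeric score
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return completion.choices[0].message.content.strip()

print(judge("Why is the ground wet?", "Because it rained last night."))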
Conclusion
Evaluating generative language models requires a mix of approaches. Metrics like perplexity, BLEU, and ROUGE measure quality but fail to capture creativity or emotional depth. Benchmarks such as GLUE and SQuAD assess performance on specific tasks but may not reflect real-world challenges. Human evaluation is ideal for judging creativity and emotional impact but can be costly and biased. Models can evaluate outputs efficiently but need careful tuning to align with human judgment. Combining these methods therefore gives the most effective evaluation.